Top 16 apache-arrow Open-Source Projects

pixie

19 5,305 9.4 C++

Instant Kubernetes-Native Application Observability

Project mention: Grafana Beyla: OSS eBPF auto-instrumentation for application observability | news.ycombinator.com | 2023-09-13

AWS Data Wrangler

9 3,811 9.4 Python

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

Project mention: Read files from s3 using Pandas/s3fs or AWS Data Wrangler? | /r/dataengineering | 2023-12-06

I had no problem with awswrangler (https://github.com/aws/aws-sdk-pandas) and it supports reading and writing partitions which was really helpful and a few other optimizations that made it a great tool

lance

10 3,296 9.8 Rust

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, and PyArrow, with more integrations coming.

Project mention: The Nimble File Format by Meta | news.ycombinator.com | 2024-04-25

frostdb

5 1,216 9.5 Go

❄️ Coolest database around 🧊 Embeddable column database written in Go.

Project mention: Polar Signals Cloud Is Generally Available | news.ycombinator.com | 2023-10-10

> In addition to that we built a custom columnar database
I did some digging in your blog history and it seems that is referencing https://www.polarsignals.com/blog/posts/2022/07/22/frostdb-i... and digging into the "but why?" section <https://github.com/polarsignals/frostdb#why-you-should-use-f...> seems to imply you favored the embedded feature over having something standalone. I would enjoy hearing (or reading a blog post!) about why you felt it was a better use of your engineering to make your own columnar DB versus using one of the existing columnar DBs that I have seen referenced a ton in other Show HN announcements around both logging and metrics services.

functime

5 914 9.5 Python

Time-series machine learning at scale. Built with Polars for embarrassingly parallel feature extraction and forecasts on panel data.

Project mention: functime: NEW Data - star count:616.0 | /r/algoprojects | 2023-11-08

awkward

4 796 9.6 Python

Manipulate JSON-like data with NumPy-like idioms.

Project mention: Efficient Jagged Arrays | news.ycombinator.com | 2023-07-03

There's a whole ecosystem in Python originally developed for high-energy physics data processing: https://github.com/scikit-hep/awkward all because NumPy demands square N-dimensional arrays.
Same technique used everywhere, here's a simple Julia pkg for the same thing: https://github.com/JuliaArrays/ArraysOfArrays.jl/blob/3a6f5b...
But Julia at least has the decency to just support ragged Vector{Vector} out of the box, and it's not that slow
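
The comment does not spell out the technique. Assuming it means the common flat-buffer-plus-offsets layout for ragged data (the layout Apache Arrow uses for variable-size lists), a minimal pure-Python sketch might look like this. The function names are illustrative and are not part of awkward or ArraysOfArrays.jl.

```python
from typing import List, Sequence, Tuple


def to_offsets(rows: Sequence[Sequence[int]]) -> Tuple[List[int], List[int]]:
    """Flatten ragged rows into one content buffer plus an offsets list.

    Row i occupies content[offsets[i]:offsets[i + 1]].
    """
    content: List[int] = []
    offsets: List[int] = [0]
    for row in rows:
        content.extend(row)
        offsets.append(len(content))
    return content, offsets


def get_row(content: List[int], offsets: List[int], i: int) -> List[int]:
    """Recover row i by slicing the flat buffer."""
    return content[offsets[i]:offsets[i + 1]]


def row_lengths(offsets: List[int]) -> List[int]:
    """Row lengths are the differences between consecutive offsets."""
    return [end - start for start, end in zip(offsets, offsets[1:])]


# Three rows of different lengths, including an empty one.
content, offsets = to_offsets([[1, 2, 3], [], [4, 5]])
```

Storing one contiguous buffer instead of a list of lists keeps the values adjacent in memory, and per-row operations such as lengths reduce to arithmetic on the offsets.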

ustore

15 489 9.6 C++

Multi-Modal Database replacing MongoDB, Neo4J, and Elastic with 1 faster ACID solution, with NetworkX and Pandas interfaces, and bindings for C 99, C++ 17, Python 3, Java, GoLang 🗄️
geopolars

3 493 6.3 Rust

Geospatial extensions for Polars
parquet-wasm

6 466 9.0 Rust

Rust-based WebAssembly bindings to read and write Apache Parquet data

Project mention: FLaNK AI Weekly for 29 April 2024 | dev.to | 2024-04-29

lonboard

1 416 9.5 Python

A Python library for fast, interactive geospatial vector data visualization in Jupyter.

Project mention: Parquet-WASM: Rust-based WebAssembly bindings to read and write Parquet data | news.ycombinator.com | 2024-04-22

I'll let Kyle chime in but I tested it a few months ago with millions of polygons on an M2 16GB of RAM laptop and it worked very well.
There is a library by the same author called lonboard that provides the JS bits inside JupyterLab. https://github.com/developmentseed/lonboard
I think it is based on the Kepler.gl / Deck.gl data loaders that go straight to GPU from network.

arrow-julia

4 277 6.2 Julia

Official Julia implementation of Apache Arrow
space

1 136 8.9 Python

Unified storage framework for the entire machine learning lifecycle (by google)

Project mention: Unified storage framework for the entire machine learning lifecycle | news.ycombinator.com | 2024-02-28

arrow-js-ffi

1 91 8.1 TypeScript

Zero-copy reading of Arrow data from WebAssembly

Project mention: Parquet-WASM: Rust-based WebAssembly bindings to read and write Parquet data | news.ycombinator.com | 2024-04-22

Arrow JS is just ArrayBuffers underneath. You do want to amortize some operations to avoid unnecessary conversions. I.e. Arrow JS stores strings as UTF-8, but native JS strings are UTF-16 I believe.
Arrow is especially powerful across the WASM <--> JS boundary! In fact, I wrote a library to interpret Arrow from Wasm memory into JS without any copies [0]. (Motivating blog post [1])
[0]: https://github.com/kylebarron/arrow-js-ffi
[1]: https://observablehq.com/@kylebarron/zero-copy-apache-arrow-...
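
As a rough illustration of the amortization point, here is a Python sketch (not Arrow JS code) of a string column stored as one UTF-8 byte buffer plus offsets, where each value is decoded at most once. The class `Utf8Column` and its attributes are hypothetical, and the UTF-16 size is computed only to show why the two encodings are not interchangeable without a conversion.

```python
from typing import Dict, List


class Utf8Column:
    """Hypothetical string column: one UTF-8 byte buffer plus offsets."""

    def __init__(self, values: List[str]) -> None:
        chunks = [v.encode("utf-8") for v in values]
        self.offsets: List[int] = [0]
        for chunk in chunks:
            self.offsets.append(self.offsets[-1] + len(chunk))
        self.data: bytes = b"".join(chunks)
        self._cache: Dict[int, str] = {}
        self.decode_count = 0  # how many real decodes have happened

    def get(self, i: int) -> str:
        """Decode value i on first access, then reuse the cached string."""
        if i not in self._cache:
            start, end = self.offsets[i], self.offsets[i + 1]
            self._cache[i] = self.data[start:end].decode("utf-8")
            self.decode_count += 1
        return self._cache[i]


col = Utf8Column(["héllo", "arrow"])
first = col.get(0)
again = col.get(0)  # served from the cache, no second decode

# The same text takes a different number of bytes in each encoding.
utf8_size = len("héllo".encode("utf-8"))
utf16_size = len("héllo".encode("utf-16-le"))
```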

red_amber

1 62 8.5 Ruby

A dataframe library for Rubyists.
awesome-pandas-alternatives

1 29 10.0

Awesome list of alternative dataframe libraries in Python.
udsb

1 8 4.4 Jupyter Notebook

Unlimited Data-Science Benchmarks for Numeric, Tabular and Graph Workloads

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

apache-arrow related posts

Parquet-WASM: Rust-based WebAssembly bindings to read and write Parquet data
5 projects | news.ycombinator.com | 22 Apr 2024

Polar Signals Cloud Is Generally Available
1 project | news.ycombinator.com | 10 Oct 2023

I agree that Arrow Tables are great, but we decided to keep the library focused on the Pandas interface. [wont implement]
1 project | /r/programmingcirclejerk | 21 Sep 2022

Benchmarking Pandas, CuDF, Modin, Apache Arrow and Spark on a Billion Taxi Rides dataset
2 projects | /r/Python | 21 Sep 2022

Rust 1.63.0
14 projects | news.ycombinator.com | 11 Aug 2022

arcticDB: embedded columnar database written in Go
2 projects | /r/golang | 4 May 2022

How to adapt Arrow.Table columns (naturally per record batch basis) into CuArrays for GPU processing?
1 project | /r/Julia | 2 Mar 2022

Index

What are some of the best open-source apache-arrow projects? This list will help you:

#	Project	Stars
1	pixie	5,305
2	AWS Data Wrangler	3,811
3	lance	3,296
4	frostdb	1,216
5	functime	914
6	awkward	796
7	ustore	489
8	geopolars	493
9	parquet-wasm	466
10	lonboard	416
11	arrow-julia	277
12	space	136
13	arrow-js-ffi	91
14	red_amber	62
15	awesome-pandas-alternatives	29
16	udsb	8
