Top 23 Arrow Open-Source Projects
- Apache Arrow: a multi-language toolbox for accelerated data interchange and in-memory processing.
- LakeSoul: an end-to-end, real-time, cloud-native Lakehouse framework with fast data ingestion, concurrent updates, and incremental data analytics on cloud storage for both BI and AI applications.
- vscode-data-preview: Data Preview extension for importing, viewing, slicing, dicing, charting, and exporting large JSON array/config, YAML, Apache Arrow, Avro, Parquet, and Excel data files.
- ustore: a multi-modal database replacing MongoDB, Neo4j, and Elastic with one faster ACID solution, with NetworkX and Pandas interfaces and bindings for C99, C++17, Python 3, Java, and Go.
- ordered-arrowverse: a listing of all shows in the Arrowverse in watch order, to ensure continuity and sensible ordering for crossover episodes.
- vinum: a SQL processor for Python, designed for data analysis workflows and in-memory analytics.
This is because 0.1 is actually the floating-point value 0.1000000000000000055511151231257827021181583404541015625, and thus 1 divided by it is ever so slightly smaller than 10. Nevertheless, fpround(1 / fpround(1 / 10)) = 10 exactly.
I found out about this recently because in Polars I defined a // b for floats as (a / b).floor(), which does return 10 for this computation. Since Python's correctly-rounded floor division is rather expensive, I chose to stick with this (more context: https://github.com/pola-rs/polars/issues/14596#issuecomment-...).
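The effect described above can be reproduced in plain Python: `math.floor(1 / x)` plays the role of the `(a / b).floor()` definition, while Python's own `//` floors the exact quotient instead.

```python
import math
from decimal import Decimal

x = 0.1
# The double nearest to 0.1 is slightly larger than 1/10:
print(Decimal(x))  # 0.1000000000000000055511151231257827021181583404541015625

# The true quotient 1/x is slightly below 10, but IEEE-754 division
# rounds the result to the nearest double, which is exactly 10.0:
assert 1 / x == 10.0
assert math.floor(1 / x) == 10  # fpround-then-floor gives 10

# Python's float // floors the *exact* quotient, yielding 9:
assert 1 // x == 9.0
```

So the two definitions genuinely disagree on this input, and neither is "wrong"; they round at different points in the computation.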
Project mention: How moving from Pandas to Polars made me write better code without writing better code | dev.to | 2024-03-05
In comes Polars: a brand-new dataframe library, or, as the author Ritchie Vink describes it, a query engine with a dataframe frontend. Polars is built on top of the Arrow memory format and is written in Rust, a modern, performant, memory-safe systems programming language in the same space as C/C++.
The interesting thing about Polars is that it does not try to be a drop-in replacement for pandas, as Dask, cuDF, or Modin do; instead it has its own expressive API. Despite being a young project, it quickly became popular thanks to its easy installation and its "lightning fast" performance.
Yeah, it has nice functional capabilities and libraries (like Arrow[0]).
[0]: https://arrow-kt.io
Python's Substrait seems like the biggest/most-used quasi-competitor out there. I'd love some compare-and-contrast; my sense is that Substrait has a smaller ambition and wants more to be a language for talking about execution than a full-on execution engine. https://github.com/substrait-io/substrait
We can also see from the DataFusion discussion that they, too, see themselves as something of a Velox competitor. https://github.com/apache/arrow-datafusion/discussions/6441
Project mention: Full-fledged APIs for slowly moving datasets without writing code | news.ycombinator.com | 2023-10-25
Not entirely on topic, because this is all immature and not yet integrated with one another, but there is a scaled-out Rust dataframes-on-Arrow implementation called Ballista that could perhaps form the backend of a Polars scale-out approach: https://github.com/apache/arrow-ballista
Project mention: Apache Arrow DataFusion Comet Spark Accelerator | news.ycombinator.com | 2024-03-07
[3] https://github.com/sutoiku/puffin
One thing worth looking into is whether this dataset is partitioned too finely. My understanding is that the recommended size for individual Parquet files is 512 MB to 1 GB, whereas here they are 50 MB. It would be interesting to see the impact of the partitioning strategy on these benchmarks.
[4] https://parquet.apache.org/docs/file-format/configurations/
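A quick way to check for this kind of over-partitioning is simply to measure the file sizes on disk. This hypothetical helper (the 512 MB threshold is the lower end of the guideline quoted above) walks a dataset directory and flags Parquet files that fall below it:

```python
import os

# Lower end of the commonly recommended 512 MB - 1 GB file-size range:
RECOMMENDED_MIN = 512 * 1024 * 1024


def partition_report(dataset_dir: str) -> list[tuple[str, int, bool]]:
    """Return (path, size_in_bytes, is_too_small) for each Parquet file."""
    report = []
    for root, _dirs, files in os.walk(dataset_dir):
        for name in sorted(files):
            if name.endswith(".parquet"):
                path = os.path.join(root, name)
                size = os.path.getsize(path)
                report.append((path, size, size < RECOMMENDED_MIN))
    return report
```

On the dataset described above, every 50 MB file would be flagged, suggesting the partitions could be coalesced by roughly a factor of ten.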
nodejs-polars is Node-specific and uses native FFI. Polars can be compiled to Wasm but doesn't yet ship a JS API out of the box.
As for the fastest way to serialize Pandas data to the browser: use Parquet. It's the fastest to write on the Python side and to read on the JS side, while also being compressed. See https://github.com/kylebarron/parquet-wasm (full disclosure: I wrote this).
Arrow related posts
- Velox: Meta's Unified Execution Engine [pdf]
- Apache Arrow DataFusion Comet Spark Accelerator
- How moving from Pandas to Polars made me write better code without writing better code
- Why Python's Integer Division Floors (2010)
- Transforming Postgres into a Fast OLAP Database
- Polars R Package
- Polars
Index
What are some of the best open-source Arrow projects? This list will help you:
# | Project | Stars
---|---|---
1 | polars | 25,837 |
2 | Apache Arrow | 13,442 |
3 | arrow | 8,546 |
4 | cudf | 7,257 |
5 | Kategory | 5,954 |
6 | arrow-datafusion | 4,924 |
7 | roapi | 3,069 |
8 | LakeSoul | 2,294 |
9 | arrow-ballista | 1,259 |
10 | react-archer | 1,063 |
11 | vscode-data-preview | 522 |
12 | ustore | 485 |
13 | r-polars | 385 |
14 | Arrow 🏹 | 384 |
15 | arrow-datafusion-comet | 365 |
16 | duckdb-rs | 357 |
17 | puffin | 277 |
18 | pqrs | 245 |
19 | parquet-wasm | 223 |
20 | spark-clickhouse-connector | 167 |
21 | s2protocol-rs | 102 |
22 | ordered-arrowverse | 96 |
23 | vinum | 65 |