Top 23 Arrow Open-Source Projects
- Apache Arrow: a multi-language toolbox for accelerated data interchange and in-memory processing.
- LakeSoul: an end-to-end, real-time, cloud-native Lakehouse framework with fast data ingestion, concurrent updates, and incremental data analytics on cloud storage for both BI and AI applications.
- vscode-data-preview: Data Preview extension for importing, viewing, slicing, dicing, charting, and exporting large JSON array/config, YAML, Apache Arrow, Avro, Parquet, and Excel data files.
- ustore: a multi-modal database replacing MongoDB, Neo4j, and Elastic with one faster ACID solution, with NetworkX and Pandas interfaces and bindings for C99, C++17, Python 3, Java, and Go.
- ordered-arrowverse: a listing of all shows in the Arrowverse in watch order, to ensure continuity and sensible ordering for crossover episodes.
- vinum: a SQL processor for Python, designed for data analysis workflows and in-memory analytics.
This is because 0.1 is actually the floating-point value 0.1000000000000000055511151231257827021181583404541015625, and thus 1 divided by it is ever so slightly smaller than 10. Nevertheless, fpround(1 / fpround(1 / 10)) = 10 exactly.
I found out about this recently because in Polars I defined a // b for floats as (a / b).floor(), which does return 10 for this computation. Since Python's correctly-rounded floor division is rather expensive, I chose to stick with this (more context: https://github.com/pola-rs/polars/issues/14596#issuecomment-...).
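The effect described above can be reproduced in plain Python: `math.floor(1 / x)` plays the role of the `(a / b).floor()` definition, while Python's own `//` floors the exact quotient instead.

```python
import math
from decimal import Decimal

x = 0.1
# The double nearest to 0.1 is slightly larger than 1/10:
print(Decimal(x))  # 0.1000000000000000055511151231257827021181583404541015625

# The true quotient 1/x is slightly below 10, but IEEE-754 division
# rounds the result to the nearest double, which is exactly 10.0:
assert 1 / x == 10.0
assert math.floor(1 / x) == 10  # fpround-then-floor gives 10

# Python's float // floors the *exact* quotient, yielding 9:
assert 1 // x == 9.0
```

So the two definitions genuinely disagree on this input, and neither is "wrong"; they round at different points in the computation.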
Project mention: How moving from Pandas to Polars made me write better code without writing better code | dev.to | 2024-03-05
In comes Polars: a brand-new dataframe library, or, as the author Ritchie Vink describes it, a query engine with a dataframe frontend. Polars is built on top of the Arrow memory format and is written in Rust, a modern, performant, memory-safe systems programming language in the same space as C/C++.
The interesting thing about Polars is that it does not try to be a drop-in replacement for pandas, as Dask, cuDF, or Modin do; instead it has its own expressive API. Despite being a young project, it quickly became popular thanks to its easy installation and its "lightning fast" performance.
Yeah, it has nice functional capabilities and libraries (like Arrow[0]).
[0]: https://arrow-kt.io
Python's Substrait seems like the biggest/most-used quasi-competitor out there. I'd love some compare-and-contrast; my sense is that Substrait has a smaller ambition and wants more to be a language for talking about execution than a full-on execution engine. https://github.com/substrait-io/substrait
We can also see from the DataFusion discussion that they, too, see themselves as something of a Velox competitor. https://github.com/apache/arrow-datafusion/discussions/6441
Project mention: Full-fledged APIs for slowly moving datasets without writing code | news.ycombinator.com | 2023-10-25
Not entirely on topic, because this is all immature and not yet integrated with one another, but there is a scaled-out Rust dataframes-on-Arrow implementation called Ballista that could perhaps form the backend of a Polars scale-out approach: https://github.com/apache/arrow-ballista
Project mention: Apache Arrow DataFusion Comet Spark Accelerator | news.ycombinator.com | 2024-03-07
[3] https://github.com/sutoiku/puffin
One thing worth looking into is whether this dataset is partitioned too finely. My understanding is that the recommended size for individual Parquet files is 512 MB to 1 GB, whereas here they are 50 MB. It would be interesting to see the impact of the partitioning strategy on these benchmarks.
[4] https://parquet.apache.org/docs/file-format/configurations/
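A quick way to check for this kind of over-partitioning is simply to measure the file sizes on disk. This hypothetical helper (the 512 MB threshold is the lower end of the guideline quoted above) walks a dataset directory and flags Parquet files that fall below it:

```python
import os

# Lower end of the commonly recommended 512 MB - 1 GB file-size range:
RECOMMENDED_MIN = 512 * 1024 * 1024


def partition_report(dataset_dir: str) -> list[tuple[str, int, bool]]:
    """Return (path, size_in_bytes, is_too_small) for each Parquet file."""
    report = []
    for root, _dirs, files in os.walk(dataset_dir):
        for name in sorted(files):
            if name.endswith(".parquet"):
                path = os.path.join(root, name)
                size = os.path.getsize(path)
                report.append((path, size, size < RECOMMENDED_MIN))
    return report
```

On the dataset described above, every 50 MB file would be flagged, suggesting the partitions could be coalesced by roughly a factor of ten.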
nodejs-polars is Node-specific and uses native FFI. Polars can be compiled to Wasm but doesn't yet ship a JS API out of the box.
As for the fastest way to serialize Pandas data to the browser: use Parquet. It's the fastest to write on the Python side and to read on the JS side, while also being compressed. See https://github.com/kylebarron/parquet-wasm (full disclosure: I wrote this).
Arrow related posts
- Velox: Meta's Unified Execution Engine [pdf]
- Apache Arrow DataFusion Comet Spark Accelerator
- How moving from Pandas to Polars made me write better code without writing better code
- Why Python's Integer Division Floors (2010)
- Transforming Postgres into a Fast OLAP Database
- Polars R Package
- Polars
Index
What are some of the best open-source Arrow projects? This list will help you:
# | Project | Stars
---|---|---
1 | polars | 25,837 |
2 | Apache Arrow | 13,442 |
3 | arrow | 8,546 |
4 | cudf | 7,257 |
5 | Kategory | 5,954 |
6 | arrow-datafusion | 4,924 |
7 | roapi | 3,069 |
8 | LakeSoul | 2,294 |
9 | arrow-ballista | 1,259 |
10 | react-archer | 1,063 |
11 | vscode-data-preview | 522 |
12 | ustore | 485 |
13 | r-polars | 385 |
14 | Arrow 🏹 | 384 |
15 | arrow-datafusion-comet | 365 |
16 | duckdb-rs | 357 |
17 | puffin | 277 |
18 | pqrs | 245 |
19 | parquet-wasm | 223 |
20 | spark-clickhouse-connector | 167 |
21 | s2protocol-rs | 102 |
22 | ordered-arrowverse | 96 |
23 | vinum | 65 |