polars
arrow-ballista
Metric | polars | arrow-ballista
---|---|---
Mentions | 144 | 12
Stars | 26,043 | 1,259
Growth | 6.1% | 6.0%
Activity | 10.0 | 8.4
Last commit | 3 days ago | 7 days ago
Language | Rust | Rust
License | MIT License | Apache License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
polars
-
Why Python's Integer Division Floors (2010)
This is because 0.1 is actually the floating-point value 0.1000000000000000055511151231257827021181583404541015625, and thus 1 divided by it is ever so slightly smaller than 10. Nevertheless, fpround(1 / fpround(1 / 10)) = 10 exactly.
I found out about this recently because in Polars I defined a // b for floats to be (a / b).floor(), which does return 10 for this computation. Since Python's correctly-rounded division is rather expensive, I chose to stick to this (more context: https://github.com/pola-rs/polars/issues/14596#issuecomment-...).
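The difference described in this comment can be reproduced in plain Python. This is a minimal sketch, not Polars code: `floor_div_exact` and `floor_div_fast` are hypothetical helper names, and the second one only mimics the `(a / b).floor()` definition mentioned above.

```python
import math
from decimal import Decimal

# Decimal(float) converts without rounding, so this is the exact value
# of the double that the literal 0.1 is stored as.
exact_tenth = Decimal(0.1)


def floor_div_exact(a: float, b: float) -> float:
    """Python's built-in float floor division."""
    return a // b


def floor_div_fast(a: float, b: float) -> int:
    """Hypothetical helper: round the quotient to a float first, then floor it."""
    return math.floor(a / b)


print(exact_tenth)               # 0.1000000000000000055511151231257827021181583404541015625
print(floor_div_exact(1, 0.1))   # 9.0
print(floor_div_fast(1, 0.1))    # 10
```

The two results differ because `1 / 0.1` rounds to exactly `10.0` before the floor is taken, while `//` floors the true quotient, which lies just below 10.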
-
Polars
https://github.com/pola-rs/polars/releases/tag/py-0.19.0
-
Stuff I Learned during Hanukkah of Data 2023
That turned out to be related to pola-rs/polars#11912, and this linked comment provided a deceptively simple solution - use PARSE_DECLTYPES when creating the connection:
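The original snippet is not part of this excerpt. As a rough illustration of what `PARSE_DECLTYPES` does in Python's standard `sqlite3` module, here is a minimal sketch; the `orders` table, the `ordered` column, and the `load_date` helper are hypothetical, and an explicit converter is registered rather than relying on the module's default one.

```python
import datetime
import sqlite3

# Register a converter for columns declared as DATE. sqlite3 passes the raw
# stored bytes to the converter when type detection is enabled.
sqlite3.register_converter(
    "date", lambda raw: datetime.date.fromisoformat(raw.decode())
)


def load_date(detect_types: int = 0):
    """Insert one ISO date string and read it back (hypothetical helper)."""
    conn = sqlite3.connect(":memory:", detect_types=detect_types)
    conn.execute("CREATE TABLE orders (ordered DATE)")
    conn.execute("INSERT INTO orders VALUES ('2023-12-07')")
    value = conn.execute("SELECT ordered FROM orders").fetchone()[0]
    conn.close()
    return value


# Without type detection the value comes back as a plain string.
print(type(load_date()).__name__)                         # str
# With PARSE_DECLTYPES the declared column type selects the converter.
print(type(load_date(sqlite3.PARSE_DECLTYPES)).__name__)  # date
```

Whether this matches the exact fix from the linked comment is an assumption; the sketch only shows how the flag changes the Python types a query returns.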
- Polars 0.20 Released
- Second language
- Polars: Dataframes powered by a multithreaded query engine, written in Rust
- Summing columns in remote Parquet files using DuckDB
- Polars 0.34 is released. (A query engine focussing on DataFrame front ends)
arrow-ballista
-
Polars
Not strictly on topic, because this is all immature and not yet integrated, but there is a scaled-out Rust DataFrames-on-Arrow implementation called Ballista that could perhaps form the backend of a Polars scale-out approach: https://github.com/apache/arrow-ballista
-
Rust vs. Go in 2023
> Is Rust's compile-time GC about something other than performance somehow?
AFAIK, memory safety and language features such as RAII are also available in C++, for instance. As for the reasons for slow compilation, take a look at https://www.reddit.com/r/rust/comments/xna9mb/why_are_rust_p...
Not having a GC is also about not having a runtime, as you mention (e.g. nice for creating Python extensions and for embedded systems programming), and about more deterministic runtime performance. On that, if I'm not mistaken, that was the reason for Discord switching to Rust; see also, e.g.: "the choice of Rust as the main execution language avoids the overhead of GC pauses and results in deterministic processing times" https://github.com/apache/arrow-ballista/blob/main/README.md
- Ballista (Rust) vs Apache Spark. A Tale of Woe.
-
Evolution and Trends of Data Engineering 2022/23
Ballista (Arrow-Rust) is largely inspired by Apache Spark, but there are some interesting differences.
-
Data Engineering with Rust
https://github.com/jorgecarleitao/arrow2 https://github.com/apache/arrow-datafusion https://github.com/apache/arrow-ballista https://github.com/pola-rs/polars https://github.com/duckdb/duckdb
- Any job processing framework like Spark but in Rust?
-
Is Apache Arrow DataFusion and Ballista the future of big data engineering/science?
Source: https://github.com/apache/arrow-ballista
-
Pure Python Distributed SQL Engine
Can you explain how this might differ from something like https://github.com/apache/arrow-ballista?
I've seen several variants of "next-gen" spark, but nowhere have I really seen the different tradeoffs/advantages/disadvantages between them.
- Scala or Rust? which one will rule in future?
-
Welcome to Comprehensive Rust
Rust has amazing integration with Python through PyO3 [1], so see it as a safe alternative for high-performance calculations. The ecosystem itself is starting to come together with exciting projects like Polars [2] (Pandas alternative), nalgebra [3], DataFusion [4] and Ballista [5].
[1] https://github.com/PyO3/pyo3
[2] https://github.com/pola-rs/polars/
[3] https://docs.rs/nalgebra/latest/nalgebra/
[4] https://github.com/apache/arrow-datafusion
[5] https://github.com/apache/arrow-ballista
What are some alternatives?
vaex - Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
duckdb - DuckDB is an in-process SQL OLAP Database Management System
modin - Modin: Scale your Pandas workflows by changing a single line of code
lance - Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, with more integrations coming.
arrow-datafusion - Apache DataFusion SQL Query Engine
seafowl - Analytical database for data-driven Web applications 🪶
DataFrames.jl - In-memory tabular data in Julia
connector-x - Fastest library to load data from DB to DataFrames in Rust and Python
datatable - A Python package for manipulating 2-dimensional tabular data structures
opteryx - 🦖 A SQL-on-everything Query Engine you can execute over multiple databases and file formats. Query your data, where it lives.
Apache Arrow - Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
sqlglot - Python SQL Parser and Transpiler