polars
Daft
| | polars | Daft |
|---|---|---|
| Mentions | 144 | 7 |
| Stars | 26,218 | 1,684 |
| Growth | 2.9% | 3.7% |
| Activity | 10.0 | 9.8 |
| Last commit | 4 days ago | 5 days ago |
| Language | Rust | Rust |
| License | MIT License | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
polars
-
Why Python's Integer Division Floors (2010)
This is because 0.1 is in actuality the floating point value 0.1000000000000000055511151231257827021181583404541015625, and thus 1 divided by it is ever so slightly smaller than 10. Nevertheless, fpround(1 / fpround(1 / 10)) = 10 exactly.
I found out about this recently because in Polars I defined a // b for floats to be (a / b).floor(), which does return 10 for this computation. Since Python's correctly-rounded division is rather expensive, I chose to stick to this (more context: https://github.com/pola-rs/polars/issues/14596#issuecomment-...).
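The difference described in this comment can be reproduced in plain Python. The sketch below is not Polars code: it only mimics the `(a / b).floor()` definition with `math.floor(a / b)` and compares it with Python's built-in `//` on floats. The variable names are illustrative.

```python
import math
from decimal import Decimal

# The double closest to 0.1 is slightly larger than one tenth.
exact_value_of_tenth = Decimal(0.1)
print(exact_value_of_tenth)

a, b = 1.0, 0.1

# Python's // floors the exact quotient of the two doubles,
# which is just below 10.
python_floor_div = a // b

# Dividing first rounds the quotient to the nearest double (10.0),
# and only then applies floor. This mirrors the (a / b).floor() definition.
floor_of_rounded = math.floor(a / b)

print(python_floor_div)   # 9.0
print(floor_of_rounded)   # 10
```

The two results differ only when the rounded quotient lands exactly on an integer that the exact quotient does not reach, as it does here.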
-
Polars
https://github.com/pola-rs/polars/releases/tag/py-0.19.0
-
Stuff I Learned during Hanukkah of Data 2023
That turned out to be related to pola-rs/polars#11912, and this linked comment provided a deceptively simple solution - use PARSE_DECLTYPES when creating the connection:
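The excerpt does not include the code that followed. As a minimal sketch, assuming the connection is a standard-library `sqlite3` connection, passing `detect_types=sqlite3.PARSE_DECLTYPES` makes the driver convert values based on each column's declared type. The `orders` table and the explicit `date` converter below are hypothetical and exist only to make the example self-contained.

```python
import datetime
import sqlite3

# Explicit converter for columns declared as DATE (name lookup is
# case-insensitive). Registered here so the example does not rely on
# sqlite3's built-in default converters.
sqlite3.register_converter(
    "date", lambda raw: datetime.date.fromisoformat(raw.decode())
)

def read_first_date(detect_types):
    """Store one ISO date string and read it back with the given setting."""
    conn = sqlite3.connect(":memory:", detect_types=detect_types)
    conn.execute("CREATE TABLE orders (ordered DATE)")  # hypothetical table
    conn.execute("INSERT INTO orders VALUES ('2023-12-01')")
    value = conn.execute("SELECT ordered FROM orders").fetchone()[0]
    conn.close()
    return value

plain = read_first_date(0)                          # no type detection
parsed = read_first_date(sqlite3.PARSE_DECLTYPES)   # use declared column types

print(type(plain).__name__, plain)
print(type(parsed).__name__, parsed)
```

Without the flag the value comes back as a plain string; with it, the value arrives as a `datetime.date`, which downstream dataframe code can treat as a date column.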
- Polars 0.20 Released
- Second language
- Polars: Dataframes powered by a multithreaded query engine, written in Rust
- Summing columns in remote Parquet files using DuckDB
- Polars 0.34 is released. (A query engine focussing on DataFrame front ends)
Daft
-
Daft: Distributed DataFrame for Python
There are benchmarks here - https://github.com/Eventual-Inc/Daft?tab=readme-ov-file#benc.... Seems to outperform Dask by a fair bit.
-
Daft: A High-Performance Distributed Dataframe Library for Multimodal Data
Hi (one of the maintainers here), that is a good suggestion! I wasn't aware of that project. I went ahead and made an issue to add `export DO_NOT_TRACK=1` as one of the variables we track! https://github.com/Eventual-Inc/Daft/issues/1015
-
Daft: The Distributed Python Dataframe
We are looking at supporting other distributed backends as well - please drop by our discussion forums (https://github.com/Eventual-Inc/Daft/discussions) and drop us a message if you have any suggestions! We’d love to hear from you :)
What are some alternatives?
vaex - Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
xvc - A robust (🐢) and fast (🐇) MLOps tool for managing data and pipelines in Rust (🦀)
modin - Modin: Scale your Pandas workflows by changing a single line of code
hamilton - A scalable general purpose micro-framework for defining dataflows. THIS REPOSITORY HAS BEEN MOVED TO www.github.com/dagworks-inc/hamilton
datafusion - Apache DataFusion SQL Query Engine
deeplake - Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
DataFrames.jl - In-memory tabular data in Julia
quokka - Making data lake work for time series
datatable - A Python package for manipulating 2-dimensional tabular data structures
lightflus - A Lightweight, Cloud-Native Stateful Distributed Dataflow Engine
Apache Arrow - Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
hamilton - Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage and metadata. Runs and scales everywhere python does.