polars
arrow-datafusion
 | polars | arrow-datafusion
---|---|---
Mentions | 144 | 55
Stars | 25,298 | 4,789
Growth | 5.7% | 5.7%
Activity | 10.0 | 9.9
Last commit | 5 days ago | 7 days ago
Language | Rust | Rust
License | MIT License | Apache License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
polars
-
Polars
- handling of categoricals in polars seemed a little underbaked, though my main complaint, that categories cannot be pre-defined, seems to have been recently addressed: https://github.com/pola-rs/polars/issues/10705
-
Stuff I Learned during Hanukkah of Data 2023
That turned out to be related to pola-rs/polars#11912, and this linked comment provided a deceptively simple solution - use PARSE_DECLTYPES when creating the connection:
- Second language
-
Summing columns in remote Parquet files using DuckDB
Looks like somebody requested it after reading your TIL. https://github.com/pola-rs/polars/issues/12493#issuecomment-...
It will be in the next release. (later today?)
-
What are you rewriting in rust?
I am a maintainer for a dataframe interface called polars
-
[Crowdsourcing] Is there any code you really wished used named function arguments?
For example with polars, the python library extensively uses named arguments, but in rust we have to use either a builder pattern or macros. The builder pattern tends to be much more verbose than the named argument equivalent. There is currently a draft PR implementing python style named arguments for some of the most common functions.
- Polars cookbook (Jupyter)
-
Working with Rust
Seeing a lot of great libraries coming out with Python bindings in the data world, e.g. delta-rs and Polars. I see Rust growing in this space as a C++ alternative.
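The "Stuff I Learned during Hanukkah of Data 2023" excerpt above breaks off just before its snippet. As a minimal sketch of what `PARSE_DECLTYPES` does in Python's standard `sqlite3` module, the example below reads the same row with and without the flag. The table, the column names, and the explicitly registered `date` converter are illustrative assumptions, not the post's original code or the fix discussed in the linked Polars issue.

```python
import datetime
import sqlite3

# Assumption for illustration: register our own converter for columns whose
# declared type is DATE, rather than relying on sqlite3's built-in defaults.
sqlite3.register_converter(
    "date", lambda raw: datetime.date.fromisoformat(raw.decode())
)


def read_orders(detect_types: int = 0):
    """Create a throwaway in-memory table and return its rows."""
    conn = sqlite3.connect(":memory:", detect_types=detect_types)
    try:
        # Hypothetical schema: the declared type DATE is what the flag inspects.
        conn.execute("CREATE TABLE orders (id INTEGER, ordered DATE)")
        conn.execute("INSERT INTO orders VALUES (1, '2023-12-07')")
        return conn.execute("SELECT id, ordered FROM orders").fetchall()
    finally:
        conn.close()


# Without the flag, the DATE column comes back as the stored text.
plain = read_orders()

# With PARSE_DECLTYPES, sqlite3 looks up a converter by the declared column type.
typed = read_orders(sqlite3.PARSE_DECLTYPES)

print(plain)
print(typed)
```

The flag only changes how values are returned to Python. The data stored in the database file is the same in both cases.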
arrow-datafusion
-
Velox: Meta's Unified Execution Engine [pdf]
Python's Substrait seems like the biggest/most-used competitor-ish out there. I'd love some compare & contrast; my sense is that Substrait has a smaller ambition, and more wants to be a language for talking about execution rather than a full on execution engine. https://github.com/substrait-io/substrait
We can also see from the DataFusion discussion that they too see themselves as a bit of a Velox competitor. https://github.com/apache/arrow-datafusion/discussions/6441
-
What I Talk About When I Talk About Query Optimizer (Part 1): IR Design
Agree, substrait is a really cool project! Related: if you like substrait you might want to check out datafusion too. The project is a query execution engine built on top of Apache Arrow (with SQL parser, query planner & optimizer, execution engine, extensible user defined functions, among others) and it implements a substrait provider and consumer: https://github.com/apache/arrow-datafusion/tree/main/datafus...
BTW you can see a version of what an industrial strength query optimizer / execution engine looks like in Rust https://arrow.apache.org/datafusion/
(can also use it in your own projects)
It is quite similar to what is described in this post
-
DuckDB performance improvements with the latest release
The draft contains some preliminary benchmark results, comparing it to DuckDB.
Would be curious how the performance compares to [DataFusion](https://github.com/apache/arrow-datafusion) as one of the top contenders to DuckDB on this area (albeit they being different in a lot of parts, I find it one of the closest compared to all others).
ClickBench (from ClickHouse) has some benchmarks[1] where it can be compared, but am not super sure how up to date it is. At least a while back, they were majorly out of date and haven't looked too closely on whether they are keeping it fair for everyone else :)
[1]: benchmark.clickhouse.com
-
GlareDB: An open source SQL database to query and analyze distributed data
Apache Arrow is a pretty common memory structure these days. Datafusion is an open query engine built in Rust started by Andy Grove.
-
DuckDB 0.8.0
DuckDB is a great piece of software if you are
If you are looking for a query engine implemented in a safe language (Rust) I definitely suggest checking out DataFusion. It is comparable to DuckDB in performance, has all the standard built in SQL functionality, and is extensible in pretty much all areas (query language, data formats, catalogs, user defined functions, etc)
https://arrow.apache.org/datafusion/
Disclaimer: I am a maintainer of DataFusion
-
Data Engineering with Rust
https://github.com/jorgecarleitao/arrow2 https://github.com/apache/arrow-datafusion https://github.com/apache/arrow-ballista https://github.com/pola-rs/polars https://github.com/duckdb/duckdb
-
Bridging Async and Sync Rust Code - A lesson learned while working with Tokio
The problem comes when you want to do this inside an async context, since we can't block an async task. https://users.rust-lang.org/t/sync-function-invoking-async/43364/6 You might need to do it in another runtime/thread. It is not recommended to do this, but sometimes it is unavoidable while implementing a third-party trait. https://github.com/apache/arrow-datafusion/issues/3777 However, I believe this isn't a problem particular to tokio, or any specific runtime.
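The "another runtime/thread" workaround in the last comment can be sketched with Python's `asyncio` as a rough analogue. This is not Tokio code, and the function names are hypothetical. A synchronous function called from inside a running event loop cannot start a second loop on the same thread, so it hands the coroutine to a helper thread that owns its own loop and waits for the result.

```python
import asyncio
import threading


async def fetch_value() -> int:
    """Stand-in for some async operation (hypothetical)."""
    await asyncio.sleep(0.01)
    return 42


def blocking_call() -> int:
    """Synchronous API that must return a result produced by async code."""
    result = {}

    def runner() -> None:
        # asyncio.run() creates a fresh event loop owned by this helper thread.
        result["value"] = asyncio.run(fetch_value())

    worker = threading.Thread(target=runner)
    worker.start()
    # join() blocks the calling thread, and therefore the caller's event loop,
    # until the helper finishes. This is why the pattern is discouraged.
    worker.join()
    return result["value"]


async def main() -> int:
    # Calling asyncio.run() directly here would raise RuntimeError, because
    # this thread already has a running event loop.
    return blocking_call()


outcome = asyncio.run(main())
print(outcome)
```

While `blocking_call` waits, no other task on the outer loop can make progress, which matches the comment's caveat that this should be a last resort.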
What are some alternatives?
vaex - Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second
modin - Modin: Scale your Pandas workflows by changing a single line of code
DataFrames.jl - In-memory tabular data in Julia
datatable - A Python package for manipulating 2-dimensional tabular data structures
ClickHouse - ClickHouse® is a free analytics DBMS for big data
Apache Arrow - Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
db-benchmark - reproducible benchmark of database-like ops
databend - Data, Analytics & AI. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
rust-numpy - PyO3-based Rust bindings of the NumPy C-API
hdf5-rust - HDF5 for Rust
tidypolars - Tidy interface to polars