bacon
arrow-datafusion
Our great sponsors
bacon | arrow-datafusion | |
---|---|---|
2 | 48 | |
175 | 4,016 | |
- | 3.6% | |
0.0 | 9.5 | |
21 days ago | 2 days ago | |
Rust | Rust | |
MIT License | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
bacon
- Any role that Rust could have in the Data world (Big Data, Data Science, Machine learning, etc.)?
-
Scientific Computing in Rust
See the github repo here https://github.com/aftix/bacon
arrow-datafusion
-
GlareDB: An open source SQL database to query and analyze distributed data
Apache Arrow is a pretty common memory structure these days. Datafusion is an open query engine built in Rust started by Andy Grove.
-
DuckDB 0.8.0
DuckDB is a great piece of software if you are
If you are looking for a query engine implemented in a safe language (Rust) I definitely suggest checking out DataFusion. It is comparable to DuckDB in performance, has all the standard built in SQL functionality, and is extensible in pretty much all areas (query language, data formats, catalogs, user defined functions, etc)
https://arrow.apache.org/datafusion/
Disclaimer I am a maintainer of DataFusion
-
Data Engineering with Rust
https://github.com/jorgecarleitao/arrow2 https://github.com/apache/arrow-datafusion https://github.com/apache/arrow-ballista https://github.com/pola-rs/polars https://github.com/duckdb/duckdb
-
Bridging Async and Sync Rust Code - A lesson learned while working with Tokio
Problem comes when you want to do this inside an async context since we couldn't block an async task. https://users.rust-lang.org/t/sync-function-invoking-async/43364/6 You might need to do it in another runtime/thread. It is not recommended to do this, but sometimes it is unavoidable while implementing a third-party trait. https://github.com/apache/arrow-datafusion/issues/3777 However, I believe this isn't a problem particular to tokio, or any specific runtime.
- Using Rust to write a Data Pipeline. Thoughts. Musings.
-
Demystifying Apache Arrow
You can see some of the benchmarks in DataFusion (part of the Arrow project and built with Arrow as the underlying in-memory format) https://github.com/apache/arrow-datafusion/blob/master/bench...
Disclaimer: I'm a committer on the Arrow project and contributor to DataFusion.
-
Scala or Rust? which one will rule in future?
polars and datafusion seem very promising
-
Welcome to Comprehensive Rust
Rust has amazing integration with Python through PyO3 [1] so see it like a safe alternative for high performance calculations. The ecosystem itself is starting to come together exciting projects like Polars [2] (Pandas alternative), nalgebra [3], Datafusion [4] and Ballista [5]
[1] https://github.com/PyO3/pyo3
[2] https://github.com/pola-rs/polars/
[3] https://docs.rs/nalgebra/latest/nalgebra/
-
Command-line data analytics made easy
It could be the NDJSON parser (DF source: [0]) or could be a variety of other factors. Looking at the ROAPI release archive [1], it doesn't ship with the definitive `columnq` binary from your comment, so it could also have something to do with compilation-time flags.
FWIW, we use the Parquet format with DataFusion and get very good speeds similar to DuckDB [2], e.g. 1.5s to run a more complex aggregation query `SELECT date_trunc('month', tpep_pickup_datetime) AS month, COUNT(*) AS total_trips, SUM(total_amount) FROM tripdata GROUP BY 1 ORDER BY 1 ASC)` on a 55M row subset of NY Taxi trip data.
[0]: https://github.com/apache/arrow-datafusion/blob/master/dataf...
[1]: https://github.com/roapi/roapi/releases/tag/roapi-v0.8.0
SPyQL is really cool and its design is very smart, with it being able to leverage normal Python functions!
As far as similar tools go, I recommend taking a look at DataFusion[0], dsq[1], and OctoSQL[2].
DataFusion is a very (very very) fast command-line SQL engine but with limited support for data formats.
dsq is based on SQLite which means it has to load data into SQLite first, but then gives you the whole breath of SQLite, it also supports many data formats, but is slower at the same time.
OctoSQL is faster, extensible through plugins, and supports incremental query execution, so you can i.e. calculate a running group by + count while tailing a log file. It also supports normal databases, not just file formats, so you can i.e. join with a Postgres table.
[0]: https://github.com/apache/arrow-datafusion
[1]: https://github.com/multiprocessio/dsq
[2]: https://github.com/cube2222/octosql
Disclaimer: Author of OctoSQL
What are some alternatives?
polars - Fast multi-threaded, hybrid-out-of-core query engine focussing on DataFrame front-ends
ClickHouse - ClickHouse® is a free analytics DBMS for big data
db-benchmark - reproducible benchmark of database-like ops
databend - A modern cloud data warehouse focusing on reducing cost and complexity for your massive-scale analytics needs. Open source alternative to Snowflake. Also available in the cloud: https://app.databend.com 🧠
tikv - Distributed transactional key-value database, originally created to complement TiDB
nushell - A new type of shell
sea-query - 🔱 A dynamic SQL query builder for MySQL, Postgres and SQLite
datafuse - An elastic and reliable Cloud Warehouse, offers Blazing Fast Query and combines Elasticity, Simplicity, Low cost of the Cloud, built to make the Data Cloud easy [Moved to: https://github.com/datafuselabs/databend]
Apache Arrow - Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
duckdb - DuckDB is an in-process SQL OLAP Database Management System
awesome-rewrite-it-in-rust - A curated list of replacements for existing software written in Rust [Moved to: https://github.com/TaKO8Ki/awesome-alternatives-in-rust]
Apache Spark - Apache Spark - A unified analytics engine for large-scale data processing