awesome-rewrite-it-in-rust
DISCONTINUED
arrow-datafusion
Metric | awesome-rewrite-it-in-rust | arrow-datafusion
---|---|---
Mentions | 6 | 47
Stars | 1,556 | 3,678
Growth | - | 7.1%
Activity | 8.5 | 9.9
Last commit | almost 2 years ago | 3 days ago
Language | Rust | Rust
License | MIT License | Apache License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
awesome-rewrite-it-in-rust
- Replacements for existing software written in Rust
- Awesome Rewrite It In Rust - A curated list of replacements for existing software written in Rust
[For all contributors] Do you think I should change the repository name? https://github.com/TaKO8Ki/awesome-rewrite-it-in-rust/issues/29
arrow-datafusion
- DuckDB 0.8.0
DuckDB is a great piece of software if you are
If you are looking for a query engine implemented in a safe language (Rust), I definitely suggest checking out DataFusion. It is comparable to DuckDB in performance, has all the standard built-in SQL functionality, and is extensible in pretty much all areas (query language, data formats, catalogs, user-defined functions, etc.).
https://arrow.apache.org/datafusion/
Disclaimer: I am a maintainer of DataFusion.
- Data Engineering with Rust
https://github.com/jorgecarleitao/arrow2 https://github.com/apache/arrow-datafusion https://github.com/apache/arrow-ballista https://github.com/pola-rs/polars https://github.com/duckdb/duckdb
- Bridging Async and Sync Rust Code - A lesson learned while working with Tokio
The problem comes when you want to do this inside an async context, since we can't block an async task. https://users.rust-lang.org/t/sync-function-invoking-async/43364/6 You might need to do it in another runtime or thread. This is not recommended, but it is sometimes unavoidable when implementing a third-party trait. https://github.com/apache/arrow-datafusion/issues/3777 However, I believe this isn't a problem particular to Tokio or to any specific runtime.
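Since the comment above suggests the issue is not specific to Tokio, here is a minimal sketch of the same situation in Python's `asyncio` rather than Rust. It is an analogy, not the approach from the linked threads. The names `fetch_value`, `sync_api`, and `caller` are hypothetical. A synchronous function that must call async code starts its own event loop when none is running, and otherwise hands the work to a separate thread with its own loop.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fetch_value(x: int) -> int:
    """Hypothetical async operation standing in for real I/O."""
    await asyncio.sleep(0)
    return x * 2


def sync_api(x: int) -> int:
    """Synchronous entry point (e.g. one required by a third-party interface)."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No event loop is running in this thread, so starting one is safe.
        return asyncio.run(fetch_value(x))

    # A loop is already running here; asyncio.run() would raise RuntimeError.
    # Run the coroutine on a separate thread that owns its own event loop.
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(lambda: asyncio.run(fetch_value(x)))
        # .result() blocks the calling event loop until the worker finishes,
        # which is why this pattern is discouraged unless unavoidable.
        return future.result()


async def caller() -> int:
    # Calls the synchronous API from inside an async context.
    return sync_api(21)


plain_result = sync_api(1)             # called with no loop running
nested_result = asyncio.run(caller())  # called from inside a running loop
```

The second path stalls every other task on the calling loop while it waits, so it only makes sense when the synchronous signature is imposed from outside.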
- Using Rust to write a Data Pipeline. Thoughts. Musings.
- Demystifying Apache Arrow
You can see some of the benchmarks in DataFusion (part of the Arrow project and built with Arrow as the underlying in-memory format) https://github.com/apache/arrow-datafusion/blob/master/bench...
Disclaimer: I'm a committer on the Arrow project and contributor to DataFusion.
- Scala or Rust? which one will rule in future?
polars and datafusion seem very promising
- Welcome to Comprehensive Rust
Rust has amazing integration with Python through PyO3 [1], so see it as a safe alternative for high-performance calculations. The ecosystem itself is starting to come together, with exciting projects like Polars [2] (a Pandas alternative), nalgebra [3], DataFusion [4] and Ballista [5].
[1] https://github.com/PyO3/pyo3
[2] https://github.com/pola-rs/polars/
[3] https://docs.rs/nalgebra/latest/nalgebra/
- Command-line data analytics made easy
It could be the NDJSON parser (DF source: [0]) or could be a variety of other factors. Looking at the ROAPI release archive [1], it doesn't ship with the definitive `columnq` binary from your comment, so it could also have something to do with compilation-time flags.
FWIW, we use the Parquet format with DataFusion and get very good speeds, similar to DuckDB [2], e.g. 1.5s to run a more complex aggregation query `SELECT date_trunc('month', tpep_pickup_datetime) AS month, COUNT(*) AS total_trips, SUM(total_amount) FROM tripdata GROUP BY 1 ORDER BY 1 ASC` on a 55M row subset of NY Taxi trip data.
[0]: https://github.com/apache/arrow-datafusion/blob/master/dataf...
[1]: https://github.com/roapi/roapi/releases/tag/roapi-v0.8.0
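To make the shape of the aggregation query above concrete, here is a small sketch that runs an equivalent query against an in-memory SQLite database from Python. This is a stand-in, not DataFusion: SQLite has no `date_trunc`, so `strftime('%Y-%m-01', ...)` is used to truncate to the month, and the three rows are made up for illustration.

```python
import sqlite3

# In-memory database with a tiny, made-up stand-in for the trip data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tripdata (tpep_pickup_datetime TEXT, total_amount REAL)"
)
conn.executemany(
    "INSERT INTO tripdata VALUES (?, ?)",
    [
        ("2022-01-03 08:15:00", 12.5),
        ("2022-01-20 17:40:00", 7.5),
        ("2022-02-01 09:00:00", 20.0),
    ],
)

# Same structure as the quoted query: truncate to month, then count trips
# and sum fares per month, ordered by month.
monthly = conn.execute(
    """
    SELECT strftime('%Y-%m-01', tpep_pickup_datetime) AS month,
           COUNT(*) AS total_trips,
           SUM(total_amount)
    FROM tripdata
    GROUP BY 1
    ORDER BY 1 ASC
    """
).fetchall()

for month, total_trips, total_amount in monthly:
    print(month, total_trips, total_amount)
```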
SPyQL is really cool and its design is very smart, with it being able to leverage normal Python functions!
As far as similar tools go, I recommend taking a look at DataFusion[0], dsq[1], and OctoSQL[2].
DataFusion is a very (very very) fast command-line SQL engine but with limited support for data formats.
dsq is based on SQLite, which means it has to load data into SQLite first, but then gives you the whole breadth of SQLite. It also supports many data formats, but is slower.
OctoSQL is faster, extensible through plugins, and supports incremental query execution, so you can e.g. calculate a running group by + count while tailing a log file. It also supports normal databases, not just file formats, so you can e.g. join with a Postgres table.
[0]: https://github.com/apache/arrow-datafusion
[1]: https://github.com/multiprocessio/dsq
[2]: https://github.com/cube2222/octosql
Disclaimer: Author of OctoSQL
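The "running group by + count while tailing a log file" idea mentioned above can be sketched in a few lines of Python. This only illustrates incremental aggregation in general and says nothing about how OctoSQL implements it. The function name `running_counts` and the assumption that the first token of each line is the grouping key (a log level) are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def running_counts(lines: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Yield an updated per-key count after every input line.

    Assumption: the first whitespace-separated token of each line is the
    grouping key (for example a log level such as INFO or ERROR).
    """
    counts: Counter = Counter()
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        counts[tokens[0]] += 1
        # Emit a snapshot so the consumer sees results as data arrives,
        # instead of waiting for the end of the input.
        yield dict(counts)


# A list stands in for a tailed file; any iterator of lines works the same.
log_lines = [
    "INFO service started",
    "ERROR connection refused",
    "INFO retrying",
    "",
    "ERROR connection refused",
]
snapshots = list(running_counts(log_lines))
```

Because the function consumes any iterable lazily, the same code would work on an open file handle or a generator that follows a growing log.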
- Welcome to InfluxDB IOx: InfluxData’s New Storage Engine
Just wanted to give a shout out to Apache DataFusion[0] that IOx relies on a lot (and contributes to as well!).
It's a framework for writing query engines in Rust that takes care of a lot of heavy lifting around parsing SQL, type casting, constructing and transforming query plans and optimizing them. It's pluggable, making it easy to write custom data sources, optimizer rules, query nodes etc.
It has very good single-node performance (there's even a way to compile it with SIMD support), and Ballista [1] extends it into a distributed query engine.
Plenty of other projects use it besides IOx, including VegaFusion, ROAPI, Cube.js's preaggregation store. We're heavily using it to build Seafowl [2], an analytical database that's optimized for running SQL queries directly from the user's browser (caching, CDNs, low latency, some WASM support, all that fun stuff).
[0] https://github.com/apache/arrow-datafusion
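As a rough illustration of what "constructing and transforming query plans" with pluggable optimizer rules can look like, here is a toy sketch in Python. It does not use or mirror DataFusion's actual API. The `Scan` and `Filter` node types, the `merge_filters` rule, and the `optimize` driver are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(frozen=True)
class Scan:
    """Leaf node: read every row of a table."""
    table: str


@dataclass(frozen=True)
class Filter:
    """Keep only rows of `input` that satisfy `predicate` (kept as text)."""
    predicate: str
    input: "Plan"


Plan = Union[Scan, Filter]
Rule = Callable[[Plan], Plan]


def merge_filters(plan: Plan) -> Plan:
    """Optimizer rule: collapse directly nested filters into a single one."""
    if isinstance(plan, Filter):
        child = merge_filters(plan.input)  # rewrite bottom-up
        if isinstance(child, Filter):
            combined = f"({plan.predicate}) AND ({child.predicate})"
            return Filter(combined, child.input)
        return Filter(plan.predicate, child)
    return plan


def optimize(plan: Plan, rules: List[Rule]) -> Plan:
    """Apply each rule in order; callers can plug in their own rules."""
    for rule in rules:
        plan = rule(plan)
    return plan


original = Filter("a > 1", Filter("b < 2", Scan("t")))
optimized = optimize(original, [merge_filters])
```

The point of the design is that a rule is just a function from plan to plan, so adding a custom rule means appending it to the list passed to `optimize`.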
What are some alternatives?
polars - Fast multi-threaded, hybrid-out-of-core DataFrame library in Rust | Python | Node.js
ClickHouse - ClickHouse® is a free analytics DBMS for big data
db-benchmark - reproducible benchmark of database-like ops
databend - A modern cloud data warehouse focusing on reducing cost and complexity for your massive-scale analytics needs. Open source alternative to Snowflake. Also available in the cloud: https://app.databend.com 🧠
tikv - Distributed transactional key-value database, originally created to complement TiDB
nushell - A new type of shell
sea-query - 🔱 A dynamic SQL query builder for MySQL, Postgres and SQLite
ripgrep - ripgrep recursively searches directories for a regex pattern while respecting your gitignore
datafuse - An elastic and reliable Cloud Warehouse, offers Blazing Fast Query and combines Elasticity, Simplicity, Low cost of the Cloud, built to make the Data Cloud easy [Moved to: https://github.com/datafuselabs/databend]
Apache Arrow - Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
volta - Volta: JS Toolchains as Code. ⚡
aws-sdk-rust - AWS SDK for the Rust Programming Language