arrow-datafusion vs db-benchmark
| | arrow-datafusion | db-benchmark |
|---|---|---|
| Mentions | 26 | 50 |
| Stars | 1,974 | 217 |
| Growth | 6.4% | 3.7% |
| Activity | 9.9 | 2.9 |
| Latest commit | 5 days ago | about 2 months ago |
| Language | Rust | R |
| License | Apache License 2.0 | Mozilla Public License 2.0 |
Stars - the number of stars a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
arrow-datafusion
- Is anyone around here playing with Rust (the language)?
- Future of Databricks
Ballista is one possible Spark replacement; it runs on top of Apache DataFusion: https://arrow.apache.org/datafusion/
- Ask HN: What can I do with 48GB of RAM?
Have fun with in-memory columnar databases or SQL engines and see how fast they are (the ones that use CPU-efficient data structures and data element access, plus SIMD processing). For example, Apache Arrow DataFusion [1]
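To make the "CPU-efficient data structures plus SIMD" point concrete, here is a minimal Rust sketch using the arrow crate's aggregate kernel. The crate and the toy data are our own illustration, not part of the original comment:

```rust
use arrow::array::Int64Array;
use arrow::compute::sum;

fn main() {
    // A columnar Int64 array: one contiguous buffer, which lets the
    // compute kernel run a tight, vectorization-friendly loop.
    let prices = Int64Array::from((0..1_000_000).collect::<Vec<i64>>());
    println!("sum = {:?}", sum(&prices));
}
```

The same layout is what DataFusion queries operate on internally, which is where much of the speed comes from.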
- DataFusion: Embeddable query engine for Apache Arrow
- How to use multiple Parquet files with Datafusion dataframe?
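The thread itself isn't quoted here, but the usual approach is to register a directory so DataFusion treats every Parquet file inside it as one table. A minimal sketch, assuming a recent datafusion crate with the tokio runtime; the table name and path are made up for illustration:

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Pointing at a directory registers all Parquet files inside it
    // as partitions of a single logical table.
    ctx.register_parquet("trades", "data/trades/", ParquetReadOptions::default())
        .await?;
    let df = ctx.sql("SELECT count(*) AS n FROM trades").await?;
    df.show().await?;
    Ok(())
}
```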
- Distributed systems you'd like to see in Rust?
This project looks cool: https://github.com/apache/arrow-datafusion
- Any role that Rust could have in the Data world (Big Data, Data Science, Machine learning, etc.)?
- Show HN: Box – Data Transformation Pipelines in Rust DataFusion
A while ago I posted a link to [Arc](https://news.ycombinator.com/item?id=26573930), a declarative method for defining repeatable data pipelines that execute against [Apache Spark](https://spark.apache.org/).
Today I would like to present a proof-of-concept implementation of the [Arc declarative ETL framework](https://arc.tripl.ai) against [Apache DataFusion](https://arrow.apache.org/datafusion/), an ANSI SQL (Postgres-style) execution engine built in Rust on top of Apache Arrow.
The idea of providing a declarative 'configuration' language for defining data pipelines was planned from the beginning of the Arc project, so that execution engines could be changed without rewriting the base business logic (the part that is valuable to your business). Instead, by defining an abstraction layer, we can swap the execution engine and run the same logic with different execution characteristics (see the sketch after this post).
The benefit of DataFusion over Apache Spark is a significant increase in speed and a reduction in execution resource requirements. Even through a Docker-for-Mac inefficiency layer, the same job completes in ~4 seconds with DataFusion vs ~24 seconds with Apache Spark (including JVM startup time). Without the Docker-for-Mac layer, an end-to-end execution time of 0.5 seconds for the same example job (TPC-H) is possible. (The aim is not to start a benchmarking flamewar but to provide some indicative data.)
The purpose of this post is to gather feedback from the community: whether you would use a tool like this, what features would be required for you to use it (MVP), and whether you would be interested in contributing to the project. I would also like to highlight the excellent work being done by the DataFusion/Arrow (and Apache) community in providing such amazing tools to us all as open-source projects.
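As a sketch of the abstraction-layer idea described above: this is not Arc's actual API (the trait and type names are invented for illustration), just a minimal Rust example of how a declaratively defined pipeline can be bound to interchangeable execution engines:

```rust
// Hypothetical sketch; not Arc's real API.
// A pipeline is declared as plain data, and an engine trait executes it.

/// One declaratively defined stage of a pipeline.
struct SqlStage {
    name: String,
    sql: String,
}

/// Anything that can run a stage: DataFusion, Spark, ...
trait ExecutionEngine {
    fn run(&self, stage: &SqlStage) -> Result<(), String>;
}

struct DataFusionEngine;

impl ExecutionEngine for DataFusionEngine {
    fn run(&self, stage: &SqlStage) -> Result<(), String> {
        // A real implementation would hand stage.sql to a DataFusion
        // SessionContext; here we only show the shape of the call.
        println!("[datafusion] stage '{}': {}", stage.name, stage.sql);
        Ok(())
    }
}

fn main() -> Result<(), String> {
    let pipeline = vec![SqlStage {
        name: "count_orders".into(),
        sql: "SELECT count(*) FROM orders".into(),
    }];
    // Swapping engines means swapping this one value, not the pipeline.
    let engine = DataFusionEngine;
    for stage in &pipeline {
        engine.run(stage)?;
    }
    Ok(())
}
```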
- Rust and what it needs to gain space in computation-oriented applications
You should check out Polars, DataFusion, InfluxDB IOx, and Databend, all written in native Rust and powered by the Apache Arrow format. Polars in particular is pretty damn fast and has bindings for Python.
- How to pass dataframes between Rust and Python?
A solution for either Polars or DataFusion (or something else?) would be fine. Python packages containing the bindings exist for both libraries: https://github.com/pola-rs/polars/tree/master/py-polars and https://github.com/apache/arrow-datafusion/tree/master/python
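One common answer is to exchange Arrow data rather than copying values row by row, since both libraries speak Arrow natively. A minimal sketch of the Rust side, assuming the arrow crate; the file name and column are invented for illustration. On the Python side, pyarrow.ipc.open_file can read the result, and Polars or DataFusion can consume the resulting Arrow table:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::ipc::writer::FileWriter;
use arrow::record_batch::RecordBatch;

fn main() -> arrow::error::Result<()> {
    // A single-column batch; real code would build this from your data.
    let schema = Arc::new(Schema::new(vec![Field::new(
        "price",
        DataType::Int64,
        false,
    )]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![100, 250, 975]))],
    )?;

    // Arrow IPC file: readable from pyarrow without per-row conversion.
    let file = File::create("prices.arrow")?;
    let mut writer = FileWriter::try_new(file, &schema)?;
    writer.write(&batch)?;
    writer.finish()?;
    Ok(())
}
```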
db-benchmark
- Friendlier SQL with DuckDB
Hi, good to hear that you guys care about testing. One thing, apart from the GitHub issues, that led me to believe it might not be super stable yet was the benchmark results on https://h2oai.github.io/db-benchmark/, which make it look like it couldn't handle the 50 GB case due to an out-of-memory error. I see that the benchmark and the versions used are about a year old, so maybe things have changed a lot since then. Can you chime in on the current story of running bigger DBs, like 1 TB, on a machine with just 32 GB or so of RAM? Especially regarding data mutations and DDL queries. Thanks!
- I used a new dataframe library (polars) to wrangle 300M prices and discover some of the most expensive hospitals in America. Code/notebook in article
Per these benchmarks, it appears Polars is an order of magnitude more performant; it's also lazy, and Rust is just kinda sexy.
- Benchmarking for loops vs apply and others
This is a much more comprehensive set of benchmarks: https://h2oai.github.io/db-benchmark/
- Why is R viewed badly over here? Also, as a college student, should I prioritize Python instead?
It's not like pandas is faster than tidyverse on all the benchmarks either, and data.table is faster than both: https://h2oai.github.io/db-benchmark/
- Resources for data cleaning
Language isn't really important here; what's important is tooling, and R definitely has the tooling. I would look at this benchmark reference for database-like operations, and you'll see that data.table (a very fast and memory-efficient R package) consistently ranks as one of the fastest tools out there that can also support a wide range of memory loads.
- The fastest tool for querying large JSON files is written in Python (benchmark)
- Polars 0.20.0 release
- How Easy It Is to Re-use Old Pandas Code in Spark 3.2
It seems to me that the Spark model is much more sensible in terms of performance. In Spark, individual tasks are ultimately compiled into optimized Java code. As I understand how Dask works, a separate Python process is run for each data subset, so because of this architecture Dask is unlikely to ever reach Spark's performance. By the way, both systems build and optimize the operation graph. This is confirmed by benchmarks: https://h2oai.github.io/db-benchmark/
- Count frequency of breed by animal ID (details in comments)
You can compare dplyr and data.table across different data sizes and operations in this benchmark: https://h2oai.github.io/db-benchmark/
- Massive R analysis of Data Science Language and Job Trends 2022
I use both R and Python, and agree with 1, 3 and 4, but for most data wrangling, data.table in R beats most of the competition, including pandas in Python: https://h2oai.github.io/db-benchmark/
What are some alternatives?
polars - Fast multi-threaded DataFrame library in Rust | Python | Node.js
ClickHouse - ClickHouse® is a free analytics DBMS for big data
tikv - Distributed transactional key-value database, originally created to complement TiDB
nushell - A new type of shell
datafuse - An elastic and reliable cloud warehouse offering blazing-fast queries, combining elasticity, simplicity, and the low cost of the cloud; built to make the Data Cloud easy [Moved to: https://github.com/datafuselabs/databend]
databend - A modern cloud data warehouse focused on elasticity and performance; activate your object storage for real-time analytics.
sktime - A unified framework for machine learning with time series
awesome-rewrite-it-in-rust - A curated list of replacements for existing software written in Rust [Moved to: https://github.com/TaKO8Ki/awesome-alternatives-in-rust]
Apache Arrow - Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
sea-query - 🔱 A dynamic SQL query builder for MySQL, Postgres and SQLite