db-benchmark vs Apache Arrow
| | db-benchmark | Apache Arrow |
|---|---|---|
| Mentions | 54 | 42 |
| Stars | 219 | 9,537 |
| Growth | 4.6% | 2.7% |
| Activity | 2.7 | 10.0 |
| Latest commit | about 2 months ago | 6 days ago |
| Language | R | C++ |
| License | Mozilla Public License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
db-benchmark
- Will Rust-based data frame library Polars dethrone Pandas? We evaluate on 1M+ Stack Overflow questions
We didn't conduct our own benchmarks for this post, but in this comparison from ~1 year ago, Polars emerged as the fastest: https://h2oai.github.io/db-benchmark/
- Amazon Redshift Re-Invented
No major issues, but the JavaScript bindings (which are different from their WASM bindings) that I use leave a lot to be desired. To DuckDB's credit, they seem to have top-notch C++ and Python bindings that even support the memory-mapped Arrow format, which is super-efficient in cross-language / cross-process scenarios in addition to being a top-notch in-memory representation of Pandas-like dataframes.
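For anyone curious what that Arrow integration looks like from Python, here's a minimal sketch (the table name and columns are made up for illustration): DuckDB can query a registered pyarrow Table in place and hand results back as Arrow.

```python
# Minimal sketch of DuckDB's Arrow interop in Python (table/columns hypothetical).
import duckdb
import pyarrow as pa

prices = pa.table({"hospital": ["A", "A", "B"], "price": [100.0, 250.0, 90.0]})

con = duckdb.connect()
con.register("prices", prices)  # expose the Arrow table to SQL

# Query the Arrow data with SQL and fetch the result back as an Arrow table
result = con.execute(
    "SELECT hospital, avg(price) AS avg_price FROM prices GROUP BY hospital"
).arrow()
print(result)
```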
DuckDB is in constant development but doesn't yet have a native cross-version export/import feature (its developers say DuckDB hasn't matured enough to stabilise its on-disk format just yet).
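For what it's worth, the workaround DuckDB's docs suggest for moving data across versions is to export to a version-independent format such as Parquet; a hedged sketch (file and directory names hypothetical):

```python
import duckdb

# Step 1: with the *old* DuckDB version installed, dump schema + data as Parquet
con = duckdb.connect("old.duckdb")
con.execute("EXPORT DATABASE 'backup_dir' (FORMAT PARQUET)")
con.close()

# Step 2: with the *new* DuckDB version installed, rebuild the database
con = duckdb.connect("new.duckdb")
con.execute("IMPORT DATABASE 'backup_dir'")
```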
I also keep an eye on https://h2oai.github.io/db-benchmark/ Pola.rs and DataFusion sound the most exciting.
It also remains to be seen how Databricks' delta.io develops (it might come in handy for much, much larger datasets).
- C++ for data analysis
- Fast Lane to Learning R
I strongly recommend data.table in R. Tidyverse is an improvement on base R, no question. data.table has less intuitive syntax and can be harder to learn, but it is lightning fast and memory-efficient. If you're working with more than 1M rows, you should be using data.table.
Here are some benchmarks: https://h2oai.github.io/db-benchmark/
- Friendlier SQL with DuckDB
Hi, good to hear that you guys care about testing. One thing, apart from the GitHub issues, that led me to believe it might not be super stable yet was the benchmark results on https://h2oai.github.io/db-benchmark/ which make it look like it couldn't handle the 50GB case due to an out-of-memory error. I see that the benchmark and the versions used are about a year old, so maybe things have changed a lot since then. Can you chime in regarding the current story of running bigger DBs, like 1TB, on a machine with just 32GB or so of RAM? Especially regarding data mutations and DDL queries. Thanks!
- I used a new dataframe library (polars) to wrangle 300M prices and discover some of the most expensive hospitals in America. Code/notebook in article
Per these benchmarks, it appears Polars is an order of magnitude more performant, and it's lazy, and Rust is just kinda sexy.
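To make the "lazy" part concrete, a minimal sketch of Polars' lazy API (file and column names hypothetical): scan_csv records a query plan that only executes, with optimizations applied, at collect().

```python
import polars as pl

plan = (
    pl.scan_csv("prices.csv")          # lazy: nothing is read yet
    .filter(pl.col("price") > 0)       # pushed down into the scan by the optimizer
    .group_by("hospital")              # called groupby() in older Polars releases
    .agg(pl.col("price").mean().alias("avg_price"))
)

df = plan.collect()  # the whole pipeline runs here, multi-threaded
print(df)
```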
- Benchmarking for loops vs apply and others
This is a much more comprehensive set of benchmarks: https://h2oai.github.io/db-benchmark/
- Why is R viewed badly over here? Also, as a college student, should I prioritize Python instead?
It's not like pandas is faster than tidyverse on all the benchmarks either, and data.table is faster than both. https://h2oai.github.io/db-benchmark/
- Resources for data cleaning
Language isn't really important here; what's important is tooling, and R definitely has the tooling. I would look at this benchmark reference for database-like operations, and you'll see that data.table (a very fast and memory-efficient R package) consistently ranks as one of the fastest tools out there that can also support a wide range of memory loads.
- The fastest tool for querying large JSON files is written in Python (benchmark)
Apache Arrow
- Any project ideas that need a BIG Optimization effort
- How to use Spark and Pandas to prepare big data
Pandas user-defined functions (UDFs) are built on top of Apache Arrow. Pandas UDFs improve performance by letting developers scale their workloads and leverage Pandas APIs inside Apache Spark. A Pandas UDF works with Pandas APIs inside the function body and uses Apache Arrow to exchange data with Spark.
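As a rough illustration, here's a minimal vectorized (Pandas) UDF in PySpark (column names hypothetical); Spark uses Arrow to ship column batches to the Python worker as pandas Series.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

@pandas_udf("double")
def fahrenheit_to_celsius(f: pd.Series) -> pd.Series:
    # Runs on a whole Arrow-delivered batch at once, not row by row
    return (f - 32) * 5.0 / 9.0

df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])
df.select(fahrenheit_to_celsius("temp_f").alias("temp_c")).show()
```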
- Spice.ai v0.6-alpha is now available!
Building upon the Apache Arrow support in v0.6-alpha, Spice.ai now includes new Apache Arrow data processor and Apache Arrow Flight data connector components! Together, these create a high-performance bulk-data transport directly into the Spice.ai ML engine. Coupled with big data systems from the Apache Arrow ecosystem like Hive, Drill, Spark, Snowflake, and BigQuery, it's now easier than ever to combine big data with Spice.ai.
- Arrowdantic 0.1.0 released
Arrowdantic is a small Python library backed by a mature Rust implementation of Apache Arrow that can interoperate with Parquet, Apache Arrow, and ODBC (databases).
- Introducing Spice.xyz - Data and AI infrastructure for web3
🔥 Some cool things for eth/finance. We have per-block pool reserve data for Uniswap and Sushiswap, and a Python SDK which lets you get data into Pandas or NumPy in 4 lines of code, so you can use the whole Python ecosystem of finance libraries you're used to. It uses Apache Arrow as the transport, so it's much faster than JSON. Here's an example Kaggle notebook: https://www.kaggle.com/code/spiceluke/spice-xyz-ethereum-blocks
- C++ Jobs - Q2 2022
Technologies: Apache Arrow, Flatbuffers, C++ Actor Framework, Linux, Docker, Kubernetes, Wireguard
- What are the differences between feather and parquet?
Both are columnar (on-disk) storage formats for use in data analysis systems. Both are integrated within Apache Arrow (the pyarrow package for Python) and are designed to correspond with Arrow as a columnar in-memory analytics layer.
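A minimal sketch of the two formats side by side through pyarrow (file names hypothetical): Feather is essentially the Arrow IPC format written to disk, while Parquet applies heavier encoding and compression.

```python
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

feather.write_feather(table, "data.feather")  # fast, close to the in-memory layout
pq.write_table(table, "data.parquet")         # smaller on disk, more CPU to decode

assert feather.read_table("data.feather").equals(table)
assert pq.read_table("data.parquet").equals(table)
```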
- Intro to Apache Arrow
More information about Apache Arrow can be found at https://arrow.apache.org/. Leave a comment if you have any questions or feedback.
- Apache Arrow Feature Parity Timeline?
Apache Arrow Feature Matrix
- Apache Arrow Flight SQL: Accelerating Database Access
I'm not tuned into Arrow all that much. I've read some of the about pages and such, but the code examples (to my eye) look really complex and complicated. [1]
Could someone point me to a more glossy "arrow flight sql for dummies" examples? What I'm gleaning from this (or am I wrong?) is you could use a JDBC driver + arrow jdbc client and write... SQL? Or is it something a lot different?
Is this the sort of thing where you could just add a plugin to postgres and be arrowified or something?
[1] https://github.com/apache/arrow/blob/release-7.0.0/java/flig...
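Not a definitive answer, but the plain Arrow Flight client side is fairly compact in Python; a hedged sketch (the endpoint and query are hypothetical, and Flight SQL layers standardized command messages on top of this):

```python
import pyarrow.flight as flight

client = flight.connect("grpc://localhost:8815")  # hypothetical Flight server

# Ask the server how to fetch the result of a command; plain Flight treats
# the command as opaque bytes, while Flight SQL defines protobuf messages for SQL
descriptor = flight.FlightDescriptor.for_command(b"SELECT 1")
info = client.get_flight_info(descriptor)

# Stream the Arrow record batches back and materialize them as a Table
reader = client.do_get(info.endpoints[0].ticket)
table = reader.read_all()
print(table)
```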
What are some alternatives?
h5py - HDF5 for Python -- The h5py package is a Pythonic interface to the HDF5 binary data format.
Apache Spark - Apache Spark - A unified analytics engine for large-scale data processing
polars - Fast multi-threaded DataFrame library in Rust | Python | Node.js
arrow-datafusion - Apache Arrow DataFusion SQL Query Engine
duckdb_and_r - My thoughts and examples on DuckDB and R
Airflow - Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
ta-lib - Python wrapper for TA-Lib (http://ta-lib.org/).
arquero - Query processing and transformation of array-backed data tables.
Apache HBase - Apache HBase
ClickHouse - ClickHouse® is a free analytics DBMS for big data
spark-rapids - Spark RAPIDS plugin - accelerate Apache Spark with GPUs