Apache Arrow Datafusion 5.0.0 release

datafusion

55 5,020 9.9 Rust

Apache DataFusion SQL Query Engine

Disclosure: I am a contributor to DataFusion.
I have done a lot of work in the ETL space with Apache Spark, building Arc (https://arc.tripl.ai/), and have ported much of Arc's basic functionality to DataFusion as a proof of concept. For me, the appeal of both the Apache Spark and DataFusion engines is the ability to a) separate compute from storage and b) express transformation logic in SQL.
Performance: In those early experiments DataFusion would frequently finish processing an entire job _before_ the SparkContext could be started, even on a local Spark instance. Obviously this is at smaller data sizes, but in my experience a lot of ETL is about repeatable processes, not necessarily huge datasets.
Compatibility: Those experiments were done a few months ago, and the SQL compatibility of the DataFusion engine has improved extremely rapidly since then (WINDOW functions were recently added). Some SQL functionality is still missing (for example, what is needed to run all the TPC-H queries: https://github.com/apache/arrow-datafusion/tree/master/bench...), but it is moving quickly.

db-benchmark

91 319 0.0 R

reproducible benchmark of database-like ops

There is a PR from me for db-benchmark. For the group-by benchmarks, on my machine, DataFusion is currently somewhat slower than the fastest entry (Polars).
https://github.com/h2oai/db-benchmark/pull/182
We also support running the TPC-H benchmarks. The queries we can run already finish faster than they do on Spark. We plan to do more benchmarking and optimization in the future.

arc

14 166 5.3 Scala

Arc is an opinionated framework for defining data pipelines which are predictable, repeatable and manageable. (by tripl-ai)


NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.


This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com. Post date: 24 Aug 2021.
