Apache Arrow Datafusion 5.0.0 release

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • WorkOS - The modern identity platform for B2B SaaS
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • SaaSHub - Software Alternatives and Reviews
  • datafusion

    Apache DataFusion SQL Query Engine

  • Disclosure: I am a contributor to Datafusion.

    I have done a lot of work in the ETL space in Apache Spark to build Arc (https://arc.tripl.ai/) and have ported a lot of the basic functionality of Arc to Datafusion as a proof-of-concept. The appeal to me of the Apache Spark and Datafusion engines is the ability to a) seperate compute and storage b) express transformation logic in SQL.

    Performance: From those early experiments Datafusion would frequently finish processing an entire job _before_ the SparkContext could be started - even on a local Spark instance. Obviously this is at smaller data sizes but in my experience a lot of ETL is about repeatable processes not necessarily huge datasets.

    Compatibility: Those experiments were done a few months ago and the SQL compatibility of the Datafusion engine has improved extremely rapidly (WINDOW functions were recently added). There is still some missing SQL functionality (for example to run all the TPC-H queries https://github.com/apache/arrow-datafusion/tree/master/bench...) but it is moving quickly.

  • db-benchmark

    reproducible benchmark of database-like ops

  • There is a PR from me with for db-benchmark. For the group by benchmarks, on my machine, it is currently somewhat slower than the fastest (Polars).

    https://github.com/h2oai/db-benchmark/pull/182

    Also we do support running TPC-H benchmarks. For the queries we can run, those are already finishing faster than Spark. We are planning to do more benchmarking and optimizations in the future.

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
  • arc

    Arc is an opinionated framework for defining data pipelines which are predictable, repeatable and manageable. (by tripl-ai)

  • Disclosure: I am a contributor to Datafusion.

    I have done a lot of work in the ETL space in Apache Spark to build Arc (https://arc.tripl.ai/) and have ported a lot of the basic functionality of Arc to Datafusion as a proof-of-concept. The appeal to me of the Apache Spark and Datafusion engines is the ability to a) seperate compute and storage b) express transformation logic in SQL.

    Performance: From those early experiments Datafusion would frequently finish processing an entire job _before_ the SparkContext could be started - even on a local Spark instance. Obviously this is at smaller data sizes but in my experience a lot of ETL is about repeatable processes not necessarily huge datasets.

    Compatibility: Those experiments were done a few months ago and the SQL compatibility of the Datafusion engine has improved extremely rapidly (WINDOW functions were recently added). There is still some missing SQL functionality (for example to run all the TPC-H queries https://github.com/apache/arrow-datafusion/tree/master/bench...) but it is moving quickly.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts