Apache Arrow Datafusion 5.0.0 release

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • GitHub repo arrow-datafusion

    Apache Arrow DataFusion and Ballista query engines

    Disclosure: I am a contributor to DataFusion.

    I have done a lot of work in the ETL space in Apache Spark to build Arc (https://arc.tripl.ai/) and have ported a lot of the basic functionality of Arc to DataFusion as a proof of concept. For me, the appeal of both the Apache Spark and DataFusion engines is the ability to a) separate compute and storage and b) express transformation logic in SQL.

    Performance: In those early experiments, DataFusion would frequently finish processing an entire job _before_ the SparkContext could be started, even on a local Spark instance. Obviously this is at smaller data sizes, but in my experience a lot of ETL is about repeatable processes, not necessarily huge datasets.

    Compatibility: Those experiments were done a few months ago, and the SQL compatibility of the DataFusion engine has improved extremely rapidly since then (WINDOW functions were recently added). Some SQL functionality is still missing (for example, what is needed to run all the TPC-H queries: https://github.com/apache/arrow-datafusion/tree/master/bench...), but it is moving quickly.
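
As a rough illustration of "transformation logic in SQL", here is a minimal sketch of an ETL step whose logic lives entirely in a SQL string, including a window function. It uses Python's built-in sqlite3 module purely as a stand-in SQL engine; it is not DataFusion or Spark, and the table and column names (raw_orders, customer, amount) are hypothetical.

```python
import sqlite3

# The transformation is plain SQL, so in principle it can be moved between
# engines that support the same SQL features. Table and column names here
# are made up for illustration.
TRANSFORM_SQL = """
SELECT
    customer,
    order_id,
    amount,
    SUM(amount) OVER (PARTITION BY customer ORDER BY order_id) AS running_total
FROM raw_orders
ORDER BY customer, order_id
"""


def run_transform(rows):
    """Load rows into a scratch table, then apply the SQL-defined transformation."""
    # Stand-in engine: in-memory SQLite (window functions need SQLite 3.25+).
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)
        return conn.execute(TRANSFORM_SQL).fetchall()
    finally:
        conn.close()


rows = [(1, "alice", 10.0), (2, "bob", 5.0), (3, "alice", 2.5)]
result = run_transform(rows)
for row in result:
    print(row)
```

The point of the sketch is only that the pipeline code stays a thin wrapper while the transformation itself is a SQL statement; whether a given statement runs unchanged on another engine depends on that engine's SQL coverage.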

  • GitHub repo db-benchmark

    reproducible benchmark of database-like ops

    There is a PR from me for db-benchmark. For the group-by benchmarks, on my machine, DataFusion is currently somewhat slower than the fastest engine (Polars).

    https://github.com/h2oai/db-benchmark/pull/182

    We also support running the TPC-H benchmarks. The queries we can run already finish faster than they do on Spark. We are planning to do more benchmarking and optimization in the future.

  • GitHub repo arc

    Arc is an opinionated framework for defining data pipelines which are predictable, repeatable and manageable. (by tripl-ai)
