arc vs arrow-datafusion

arc

Arc is an opinionated framework for defining data pipelines which are predictable, repeatable and manageable. (by tripl-ai)

arc.tripl.ai

arrow-datafusion

Apache DataFusion SQL Query Engine (by apache)

Arrow Big Data Dataframe datafusion Olap Python query-engine Rust SQL

arrow.apache.org

Project	arc	arrow-datafusion
Mentions	14	55
Stars	166	4,924
Growth	1.8%	4.9%
Activity	5.3	9.9
Latest Commit	2 months ago	7 days ago
Language	Scala	Rust
License	MIT License	Apache License 2.0

The number of mentions indicates the total number of mentions that we've tracked plus the number of user suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

arc

Posts with mentions or reviews of arc. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2021-11-30.

Show HN: Box – Data Transformation Pipelines in Rust DataFusion
4 projects | news.ycombinator.com | 30 Nov 2021

A while ago I posted a link to [Arc](https://news.ycombinator.com/item?id=26573930), a declarative method for defining repeatable data pipelines which execute against [Apache Spark](https://spark.apache.org/).
Today I would like to present a proof-of-concept implementation of the [Arc declarative ETL framework](https://arc.tripl.ai) against [Apache Datafusion](https://arrow.apache.org/datafusion/), which is an ANSI SQL (Postgres) execution engine based upon Apache Arrow and built with Rust.
The idea of providing a declarative 'configuration' language for defining data pipelines was planned from the beginning of the Arc project to allow changing execution engines without having to rewrite the base business logic (the part that is valuable to your business). Instead, by defining an abstraction layer, we can change the execution engine and run the same logic with different execution characteristics.
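The abstraction-layer idea can be sketched roughly as follows. This is a minimal illustration and not Arc's actual configuration format or API: the `PIPELINE` structure, the `Engine` protocol, and `SqliteEngine` are hypothetical names, and Python's built-in `sqlite3` stands in for an execution engine such as Spark or DataFusion. The point is only that the stages are plain data, so the same definition could in principle be handed to a different engine class.

```python
import sqlite3
from typing import Protocol

# Hypothetical declarative pipeline: every stage is plain data, not engine code.
PIPELINE = [
    {"type": "load", "name": "orders",
     "columns": ["id", "amount"],
     "rows": [(1, 10.0), (2, 25.5), (3, 4.5)]},
    {"type": "sql", "name": "large_orders",
     "query": "SELECT id, amount FROM orders WHERE amount > 5"},
]


class Engine(Protocol):
    """Assumed minimal interface an execution engine would have to provide."""
    def load(self, name, columns, rows): ...
    def sql(self, name, query): ...
    def fetch(self, name): ...


class SqliteEngine:
    """Stand-in engine backed by an in-memory SQLite database."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")

    def load(self, name, columns, rows):
        # Identifiers come from trusted pipeline config in this sketch.
        cols = ", ".join(columns)
        marks = ", ".join("?" for _ in columns)
        self.conn.execute(f"CREATE TABLE {name} ({cols})")
        self.conn.executemany(f"INSERT INTO {name} VALUES ({marks})", rows)

    def sql(self, name, query):
        # Register the transformation result under the stage name.
        self.conn.execute(f"CREATE VIEW {name} AS {query}")

    def fetch(self, name):
        return self.conn.execute(f"SELECT * FROM {name}").fetchall()


def run_pipeline(pipeline, engine):
    """Run each stage in order and return the rows of the final stage."""
    last = None
    for stage in pipeline:
        if stage["type"] == "load":
            engine.load(stage["name"], stage["columns"], stage["rows"])
        elif stage["type"] == "sql":
            engine.sql(stage["name"], stage["query"])
        else:
            raise ValueError(f"unknown stage type: {stage['type']}")
        last = stage["name"]
    return engine.fetch(last)


print(run_pipeline(PIPELINE, SqliteEngine()))
```

Swapping engines in this sketch would mean writing another class with the same three methods and passing it to `run_pipeline`; the `PIPELINE` definition itself would not change.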
The benefit of DataFusion over Apache Spark is a significant increase in speed and a reduction in execution resource requirements. Even through a Docker-for-Mac inefficiency layer the same job completes in ~4 seconds with DataFusion vs ~24 seconds with Apache Spark (including JVM startup time). Without the Docker-for-Mac layer, end-to-end execution times of 0.5 seconds for the same example job (TPC-H) are possible. * the aim is not to start a benchmarking flamewar but to provide some indicative data *.
The purpose of this post is to gather feedback from the community on whether you would use a tool like this, what features would be required for you to use it (MVP), or whether you would be interested in contributing to the project. I would also like to highlight the excellent work being done by the DataFusion/Arrow (and Apache) community for providing such amazing tools to us all as open source projects.
Apache Arrow Datafusion 5.0.0 release
6 projects | news.ycombinator.com | 24 Aug 2021

Disclosure: I am a contributor to Datafusion.
I have done a lot of work in the ETL space in Apache Spark to build Arc (https://arc.tripl.ai/) and have ported a lot of the basic functionality of Arc to Datafusion as a proof-of-concept. The appeal to me of the Apache Spark and Datafusion engines is the ability to a) separate compute and storage and b) express transformation logic in SQL.
Performance: From those early experiments Datafusion would frequently finish processing an entire job _before_ the SparkContext could be started - even on a local Spark instance. Obviously this is at smaller data sizes but in my experience a lot of ETL is about repeatable processes not necessarily huge datasets.
Compatibility: Those experiments were done a few months ago and the SQL compatibility of the Datafusion engine has improved extremely rapidly (WINDOW functions were recently added). There is still some missing SQL functionality (for example to run all the TPC-H queries https://github.com/apache/arrow-datafusion/tree/master/bench...) but it is moving quickly.
Arc - an opinionated framework for defining data pipelines which are predictable, repeatable and manageable.
1 project | /r/bigdata | 25 Mar 2021

1 project | /r/coding | 25 Mar 2021

1 project | /r/programming | 25 Mar 2021

2 projects | /r/functionalprogramming | 25 Mar 2021

1 project | /r/dataengineering | 25 Mar 2021

1 project | /r/scala | 25 Mar 2021

1 project | /r/coolgithubprojects | 25 Mar 2021

1 project | /r/opensource | 25 Mar 2021

arrow-datafusion

Posts with mentions or reviews of arrow-datafusion. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2024-03-25.

Velox: Meta's Unified Execution Engine [pdf]
2 projects | news.ycombinator.com | 25 Mar 2024

Python's Substrait seems like the biggest/most-used competitor-ish out there. I'd love some compare & contrast; my sense is that Substrait has a smaller ambition, and more wants to be a language for talking about execution rather than a full on execution engine. https://github.com/substrait-io/substrait
We can also see from the DataFusion discussion that they too see themselves as a bit of a Velox competitor. https://github.com/apache/arrow-datafusion/discussions/6441
What I Talk About When I Talk About Query Optimizer (Part 1): IR Design
7 projects | news.ycombinator.com | 29 Jan 2024

Agree, substrait is a really cool project! Related: if you like substrait you might want to check out datafusion too. The project is a query execution engine built on top of Apache Arrow (with SQL parser, query planner & optimizer, execution engine, extensible user defined functions, among others) and it implements a substrait provider and consumer: https://github.com/apache/arrow-datafusion/tree/main/datafus...
DuckDB performance improvements with the latest release
8 projects | news.ycombinator.com | 6 Nov 2023

The draft contains some preliminary benchmark results, comparing it to DuckDB.
https://github.com/apache/arrow-datafusion/issues/6782
Apache Arrow DataFusion
1 project | news.ycombinator.com | 1 Oct 2023
GlareDB: An open source SQL database to query and analyze distributed data
4 projects | /r/dataengineering | 8 Jun 2023

Apache Arrow is a pretty common memory structure these days. Datafusion is an open query engine built in Rust, started by Andy Grove.
DuckDB 0.8.0
5 projects | news.ycombinator.com | 17 May 2023

DuckDB is a great piece of software if you are
If you are looking for a query engine implemented in a safe language (Rust) I definitely suggest checking out DataFusion. It is comparable to DuckDB in performance, has all the standard built in SQL functionality, and is extensible in pretty much all areas (query language, data formats, catalogs, user defined functions, etc)
https://arrow.apache.org/datafusion/
Disclaimer: I am a maintainer of DataFusion.
Data Engineering with Rust
5 projects | /r/rust | 9 May 2023

https://github.com/jorgecarleitao/arrow2
https://github.com/apache/arrow-datafusion
https://github.com/apache/arrow-ballista
https://github.com/pola-rs/polars
https://github.com/duckdb/duckdb
Polars: Computing a new column from multiple columns - there must be a better way
1 project | /r/rust | 4 May 2023
Bridging Async and Sync Rust Code - A lesson learned while working with Tokio
3 projects | /r/rust | 10 Mar 2023

The problem comes when you want to do this inside an async context, since we can't block an async task. https://users.rust-lang.org/t/sync-function-invoking-async/43364/6 You might need to do it in another runtime/thread. It is not recommended to do this, but sometimes it is unavoidable while implementing a third-party trait. https://github.com/apache/arrow-datafusion/issues/3777 However, I believe this isn't a problem particular to tokio, or any specific runtime.
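As the post notes that the issue is not specific to Tokio, the same "run it in another runtime/thread" workaround can be illustrated with Python's `asyncio` as an analog. This is only a sketch, not the Rust code discussed in the post: `fetch_row_count` and `row_count_sync` are hypothetical names, and the sleep stands in for real async I/O.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fetch_row_count() -> int:
    # Stand-in for real async I/O (hypothetical helper).
    await asyncio.sleep(0.01)
    return 42


def row_count_sync() -> int:
    """Synchronous entry point that has to call async code,
    e.g. because a third-party interface requires a sync signature."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No event loop is running in this thread: drive the coroutine directly.
        return asyncio.run(fetch_row_count())
    # A loop is already running here, so asyncio.run() would raise.
    # Run the coroutine on a fresh loop in a separate thread and wait for it.
    # Caveat: this blocks the calling event loop until the result is ready,
    # which is why the pattern is discouraged unless unavoidable.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, fetch_row_count()).result()


async def main() -> int:
    # Calling the sync wrapper from inside an async context.
    return row_count_sync()


print(asyncio.run(main()))
```

The wrapper works from both sync and async callers, but in the async case the outer loop makes no progress while it waits, so it is best kept to short calls at interface boundaries.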
Using Rust to write a Data Pipeline. Thoughts. Musings.
5 projects | /r/rust | 14 Jan 2023
