Top 23 ETL Open-Source Projects

Airflow

169 34,485 10.0 Python

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

Project mention: Building in Public: Leveraging Tublian's AI Copilot for My Open Source Contributions | dev.to | 2024-02-12

Contributing to Apache Airflow's open-source project immersed me in collaborative coding. Experienced maintainers rigorously reviewed my contributions, providing constructive feedback. This ongoing dialogue refined the codebase and honed my understanding of best practices.

airbyte

139 13,923 10.0 Python

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

Project mention: Launch HN: Bracket (YC W22) – Two-Way Sync Between Salesforce and Postgres | news.ycombinator.com | 2023-12-12

I'll also give a shout-out to Airbyte (https://airbyte.com/), with which I've had some limited success integrating Salesforce to a local database. The particular pull for Airbyte is that we can self-host the open source version, rather than pay Fivetran a significant sum to do this for us.
It's an immature tool, so I don't yet know that I can claim we've spent _less_ than Fivetran on the additional engineering and ops time, but it feels like it has potential to do so once stabilized.

doris

42 11,314 10.0 Java

Apache Doris is an easy-to-use, high performance and unified analytics database.

Project mention: Variant in Apache Doris 2.1.0: a new data type 8 times faster than JSON for semi-structured data analysis | dev.to | 2024-03-27

As an open-source real-time data warehouse, Apache Doris provides semi-structured data processing capabilities, and the newly released version 2.1.0 makes a stride in this direction. Before V2.1, Apache Doris stored semi-structured data as JSON files. However, during query execution, the real-time parsing of JSON data leads to high CPU and I/O consumption in addition to high query latency, especially when the dataset is huge and complicated. Moreover, the lack of a pre-defined schema means there is no handle for query optimization.
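To make the cost described above concrete, here is a minimal Python sketch. It is not Doris code: the sample rows and function names are invented for illustration. It contrasts re-parsing JSON text on every query with extracting a field into a plain column once at load time, which is the general idea behind storing semi-structured data in a typed form.

```python
import json

# Hypothetical sample rows, stored as raw JSON text per row.
raw_rows = [
    '{"user": "a", "latency_ms": 120}',
    '{"user": "b", "latency_ms": 80}',
    '{"user": "c", "latency_ms": 100}',
]


def avg_latency_parse_per_query(rows):
    """Parse every row's JSON text at query time (the cost repeats per query)."""
    values = [json.loads(row)["latency_ms"] for row in rows]
    return sum(values) / len(values)


def extract_column(rows, key):
    """Parse once at load time and keep the field as a plain list (a 'column')."""
    return [json.loads(row)[key] for row in rows]


def avg_latency_from_column(column):
    """Query the pre-extracted column; no JSON parsing is needed here."""
    return sum(column) / len(column)


# Done once at ingest; later queries reuse the column.
latency_column = extract_column(raw_rows, "latency_ms")
```

Both query functions return the same answer; the difference is that the second one pays the parsing cost once rather than on every query, and a known column type gives a query planner something to optimize against.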

dagster

46 10,173 10.0 Python

An orchestration platform for the development, production, and observation of data assets.

Project mention: Experience with Dagster.io? | news.ycombinator.com | 2023-07-25

Benthos

76 7,559 9.6 Go

Fancy stream processing made operationally mundane

Project mention: Ask HN: Who is hiring? (December 2023) | news.ycombinator.com | 2023-12-01

Mage

77 7,001 9.9 Python

🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

Project mention: FLaNK AI-April 22, 2024 | dev.to | 2024-04-22

steampipe

146 6,379 9.7 Go

Zero-ETL, infinite possibilities. Live query APIs, code & more with SQL. No DB required.

Project mention: Steampipe: Dynamically query APIs, code and more with SQL | news.ycombinator.com | 2024-04-04

kestra

32 6,340 9.9 Java

Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.

Project mention: A High-Performance, Java-Based Orchestration Platform | /r/java | 2023-10-11

Kestra's communication is asynchronous and based on a queuing mechanism. It leverages the Micronaut framework and offers two runners: one that uses a database (JDBC) for both the message queue and resource storage, and another that uses Kafka as the message queue and Elasticsearch as the resource storage. The platform is fully extensible and plugin-based, providing a rich set of plugins for various workflow tasks, triggers, and data storage options. For those interested, the GitHub repository is available here: https://github.com/kestra-io/kestra
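The queue-based, asynchronous style described above can be sketched in a few lines of Python. This is a toy illustration only, not Kestra's implementation (Kestra is written in Java): the in-memory `queue.Queue` stands in for the message queue, a dict stands in for resource storage, and all names here are hypothetical.

```python
import queue
import threading


def worker(task_queue, result_store, lock):
    """Consume tasks until a None sentinel arrives, storing each result."""
    while True:
        task = task_queue.get()
        if task is None:  # sentinel: no more work
            task_queue.task_done()
            break
        task_id, func, arg = task
        with lock:
            result_store[task_id] = func(arg)
        task_queue.task_done()


def run_tasks(tasks):
    """Submit tasks without waiting on each one, then wait for the queue to drain."""
    task_queue = queue.Queue()  # stands in for the message queue
    result_store = {}           # stands in for resource storage
    lock = threading.Lock()
    thread = threading.Thread(target=worker, args=(task_queue, result_store, lock))
    thread.start()
    for task in tasks:
        task_queue.put(task)    # the producer returns immediately
    task_queue.put(None)        # tell the worker to stop
    task_queue.join()           # block until every item has been processed
    thread.join()
    return result_store


results = run_tasks([("t1", str.upper, "extract"), ("t2", len, "load")])
```

The producer never calls the worker directly; it only writes to the queue. That decoupling is what lets the queue and the storage be swapped for different backends without changing the code that submits work.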

cloudquery

102 5,581 10.0 Go

The open source high performance ELT framework powered by Apache Arrow

Project mention: We might want to regularly keep track of how important each server is | news.ycombinator.com | 2024-02-06

Check out CloudQuery - https://github.com/cloudquery/cloudquery for an easy cloud asset inventory.

orchest

44 4,020 4.5 TypeScript

Build data pipelines, the easy way 🛠️

Rudderstack

83 3,926 9.8 Go

Privacy and Security focused Segment-alternative, in Golang and React

Project mention: Rudderstack Switches to Elastic License | news.ycombinator.com | 2023-09-08

AWS Data Wrangler

9 3,802 9.4 Python

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

Project mention: Read files from s3 using Pandas/s3fs or AWS Data Wrangler? | /r/dataengineering | 2023-12-06

I had no problem with awswrangler (https://github.com/aws/aws-sdk-pandas). It supports reading and writing partitions, which was really helpful, along with a few other optimizations that made it a great tool.

ethereum-etl

3 2,819 5.8 Python

Python scripts for ETL (extract, transform and load) jobs for Ethereum blocks, transactions, ERC20 / ERC721 tokens, transfers, receipts, logs, contracts, internal transactions. Data is available in Google BigQuery https://goo.gl/oY5BCQ

Project mention: Blockchain transactions decoding: making wallet activity understandable | dev.to | 2023-10-27

An event is a log entity which EVM smart contracts can emit during transaction execution. Events are very good at signalling that some action has taken place on-chain. Applications can subscribe and listen to events to trigger some off-chain logic, or they can index, transform and store events in some off-chain storage (look at The Graph protocol or Ethereum ETL).
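As a rough sketch of the "index, transform and store" path mentioned above, the Python below decodes an ERC20 Transfer log and loads it into SQLite. It is an illustration under stated assumptions, not code from Ethereum ETL or The Graph: the log is assumed to be a dict shaped like a JSON-RPC log entry, and the sample log, table name, and function names are hypothetical.

```python
import sqlite3

# keccak256("Transfer(address,address,uint256)"), the standard Transfer event topic.
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"


def decode_transfer(log):
    """Turn a raw ERC20 Transfer log into a flat dict, or None for other events.

    ERC20 transfers carry three topics (signature, from, to); logs with a
    different topic count or signature are skipped.
    """
    topics = log["topics"]
    if len(topics) != 3 or topics[0] != TRANSFER_TOPIC:
        return None
    return {
        "token": log["address"],
        "sender": "0x" + topics[1][-40:],     # addresses are left-padded to 32 bytes
        "recipient": "0x" + topics[2][-40:],
        "value": str(int(log["data"], 16)),   # uint256 kept as text; it can exceed SQLite's integer range
    }


def load_transfers(conn, logs):
    """Decode the logs, insert the transfers, and return how many were stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transfers "
        "(token TEXT, sender TEXT, recipient TEXT, value TEXT)"
    )
    rows = [row for row in map(decode_transfer, logs) if row is not None]
    conn.executemany(
        "INSERT INTO transfers VALUES (:token, :sender, :recipient, :value)", rows
    )
    return len(rows)


# Hypothetical sample log with made-up addresses, for illustration only.
sample_log = {
    "address": "0x" + "11" * 20,
    "topics": [TRANSFER_TOPIC, "0x" + "00" * 12 + "aa" * 20, "0x" + "00" * 12 + "bb" * 20],
    "data": "0x3e8",
}
```

Once events sit in a table like this, ordinary SQL can answer questions such as "which addresses received this token", without re-reading the chain.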

quadratic

9 2,711 10.0 TypeScript

Quadratic | Data Science Spreadsheet with Python & SQL

Project mention: Quadratic – Open-Source Spreadsheet Is Now Multiplayer | news.ycombinator.com | 2024-02-01

https://github.com/quadratichq/quadratic/issues

incubator-devlake

10 2,424 9.9 Go

Apache DevLake is an open-source dev data platform to ingest, analyze, and visualize the fragmented data from DevOps tools, extracting insights for engineering excellence, developer experience, and community growth.

mara-pipelines

3 2,054 6.0 Python

A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

awesome-business-intelligence

1 1,952 0.0

Actively curated list of awesome BI tools. PRs welcome!

awesome-node-based-uis

6 1,899 6.5

A curated list with resources about node-based UIs

go-streams

9 1,753 6.6 Go

A lightweight stream processing library for Go

Kiba

7 1,722 0.0 Ruby

Data processing & ETL framework for Ruby

Project mention: Ask HN: What side projects landed you a job? | news.ycombinator.com | 2023-12-03

I started https://github.com/thbar/kiba#kiba-etl to scratch my own itch & be able to write properly structured ETL jobs in Ruby. It was a blank-slate rewrite of something larger (activewarehouse-etl) which I could not maintain anymore.
This landed me not strictly a job, but long term consulting gigs with a number of companies in EU, UK & US.
The job was directly related to the project: companies wanted the expertise of data engineering & ETL, often with Kiba directly, but also in general.
This "side project" was totally worth it :-)

peerdb

5 1,615 9.9 Go

Fast, Simple and a cost effective tool to replicate data from Postgres to Data Warehouses, Queues and Storage

Project mention: Pgwire: a Rust library for PostgreSQL compatible application | news.ycombinator.com | 2024-03-20

We at PeerDB (https://github.com/PeerDB-io/peerdb) were early adopters of Pgwire to implement our Postgres-compatible SQL Layer to do ETL. Very easy to work with. Saved us multiple months of effort to build it from scratch.

dozer

20 1,446 9.7 Rust

Dozer is a real-time data movement tool that leverages CDC from various sources and moves data into various sinks. (by getdozer)

Project mention: Show HN: Find simple open source bounties to solve and get paid | news.ycombinator.com | 2023-08-19

https://github.com/getdozer/dozer/issues/1631#issuecomment-1...
and then something has gone off the rails about the accounting process since

pyspark-example-project

1 1,370 0.0 Python

Implementing best practices for PySpark ETL jobs and applications.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

ETL related posts

Flow PHP: the first and most advanced PHP ETL framework
1 project | news.ycombinator.com | 16 Apr 2024
Pgwire: a Rust library for PostgreSQL compatible application
2 projects | news.ycombinator.com | 20 Mar 2024
Show HN: Open-source x64 and Arm GitHub runners. Reduces GitHub Actions bill 10x
7 projects | news.ycombinator.com | 30 Jan 2024
Show HN: Open-source Rule-based PDF parser for RAG
9 projects | news.ycombinator.com | 23 Jan 2024
Show HN: Save Prometheus SLO data to Kafka or fvector for long term storage
1 project | news.ycombinator.com | 4 Jan 2024
Why do companies still build data ingestion tooling instead of using a third-party tool like Airbyte?
1 project | /r/dataengineering | 6 Dec 2023
Reviving My Open Source FME Clone
1 project | /r/gis | 6 Dec 2023

Index

What are some of the best open-source ETL projects? This list will help you:

#	Project	Stars
1	Airflow	34,485
2	airbyte	13,923
3	doris	11,314
4	dagster	10,173
5	Benthos	7,559
6	Mage	7,001
7	steampipe	6,379
8	kestra	6,340
9	cloudquery	5,581
10	orchest	4,020
11	Rudderstack	3,926
12	AWS Data Wrangler	3,802
13	ethereum-etl	2,819
14	quadratic	2,711
15	incubator-devlake	2,424
16	mara-pipelines	2,054
17	awesome-business-intelligence	1,952
18	awesome-node-based-uis	1,899
19	go-streams	1,753
20	Kiba	1,722
21	peerdb	1,615
22	dozer	1,446
23	pyspark-example-project	1,370