Top 23 data-integration Open-Source Projects

Airflow

169 34,397 10.0 Python

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

Project mention: Building in Public: Leveraging Tublian's AI Copilot for My Open Source Contributions | dev.to | 2024-02-12

Contributing to Apache Airflow's open-source project immersed me in collaborative coding. Experienced maintainers rigorously reviewed my contributions, providing constructive feedback. This ongoing dialogue refined the codebase and honed my understanding of best practices.
airbyte

139 13,821 10.0 Python

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

Project mention: Launch HN: Bracket (YC W22) – Two-Way Sync Between Salesforce and Postgres | news.ycombinator.com | 2023-12-12

I'll also give a shout-out to Airbyte (https://airbyte.com/), with which I've had some limited success integrating Salesforce with a local database. The particular pull for Airbyte is that we can self-host the open source version, rather than pay Fivetran a significant sum to do this for us.
It's an immature tool, so I don't yet know that I can claim we've spent _less_ than Fivetran on the additional engineering and ops time, but it feels like it has potential to do so once stabilized.
dagster

46 10,114 10.0 Python

An orchestration platform for the development, production, and observation of data assets.

Project mention: Experience with Dagster.io? | news.ycombinator.com | 2023-07-25
seatunnel

29 7,204 9.8 Java

SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.

Project mention: FLaNK Weekly 31 December 2023 | dev.to | 2023-12-31
Mage

76 6,953 9.9 Python

🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

Project mention: A mage on the Hero’s Journey: a fantasy epic on how a startup rose from the ashes | dev.to | 2023-06-12

In the coming years, Mage will create a cooperative experience so that developers can build data pipelines with their team and level up together. After that journey, Mage will go on an epic quest to create the 1st open world community experience in the data universe.
kestra

32 6,260 9.9 Java

Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.

Project mention: A High-Performance, Java-Based Orchestration Platform | /r/java | 2023-10-11

Kestra's communication is asynchronous and based on a queuing mechanism. It leverages the Micronaut framework and offers two runners: one that uses a database (JDBC) for both the message queue and resource storage, and another that uses Kafka as the message queue and Elasticsearch as the resource storage. The platform is fully extensible and plugin-based, providing a rich set of plugins for various workflow tasks, triggers, and data storage options. For those interested, the GitHub repository is available here: https://github.com/kestra-io/kestra
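
The two-runner design described above can be pictured as a single worker loop written against a swappable queue interface. What follows is a minimal Python sketch of that idea only, not Kestra's code (Kestra itself is Java on Micronaut). `MessageQueue`, `InMemoryQueue`, and `run_worker` are hypothetical names, and an in-memory deque stands in for the JDBC or Kafka backend.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Callable, Dict, List, Optional


class MessageQueue(ABC):
    """Hypothetical queue interface; a real runner might back it with JDBC or Kafka."""

    @abstractmethod
    def publish(self, message: dict) -> None: ...

    @abstractmethod
    def poll(self) -> Optional[dict]: ...


class InMemoryQueue(MessageQueue):
    """Stand-in backend used for illustration only."""

    def __init__(self) -> None:
        self._items: deque = deque()

    def publish(self, message: dict) -> None:
        self._items.append(message)

    def poll(self) -> Optional[dict]:
        # Return the oldest message, or None when the queue is empty.
        return self._items.popleft() if self._items else None


def run_worker(queue: MessageQueue, handlers: Dict[str, Callable]) -> List:
    """Drain the queue, dispatching each message to the handler for its task type."""
    results = []
    while (message := queue.poll()) is not None:
        handler = handlers[message["type"]]
        results.append(handler(message["payload"]))
    return results


queue = InMemoryQueue()
queue.publish({"type": "upper", "payload": "hello"})
queue.publish({"type": "double", "payload": 21})
results = run_worker(queue, {"upper": str.upper, "double": lambda n: n * 2})
```

Because the worker only depends on `publish` and `poll`, swapping the backend would not change the dispatch logic, which is the property the plugin-based description points at.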
cloudquery

102 5,565 10.0 Go

The open source high performance ELT framework powered by Apache Arrow

Project mention: We might want to regularly keep track of how important each server is | news.ycombinator.com | 2024-02-06

Check out CloudQuery - https://github.com/cloudquery/cloudquery for an easy cloud asset inventory.
hudi

20 5,038 9.9 Java

Upserts, Deletes And Incremental Processing on Big Data.

Project mention: Getting Started with Flink SQL, Apache Iceberg and DynamoDB Catalog | dev.to | 2023-12-18

Apache Iceberg is one of the three major lakehouse table formats; the other two are Apache Hudi and Delta Lake.
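
To make "upserts, deletes and incremental processing" concrete, here is a toy Python sketch of the semantics, not Hudi's API or storage layout. `VersionedTable` and its methods are hypothetical, and a plain integer counter is assumed in place of real commit timestamps.

```python
from typing import Dict, List, Tuple


class VersionedTable:
    """Toy table that remembers the commit number of each row's last write."""

    def __init__(self) -> None:
        self.commit = 0
        self.rows: Dict[str, Tuple[int, dict]] = {}

    def upsert(self, records: List[dict]) -> int:
        """Insert new rows or overwrite existing ones, keyed by 'id'."""
        self.commit += 1
        for record in records:
            self.rows[record["id"]] = (self.commit, record)
        return self.commit

    def delete(self, ids: List[str]) -> int:
        """Remove rows by key; missing keys are ignored."""
        self.commit += 1
        for record_id in ids:
            self.rows.pop(record_id, None)
        return self.commit

    def changed_since(self, commit: int) -> List[dict]:
        """Incremental read: only rows written after the given commit."""
        return [record for written, record in self.rows.values() if written > commit]


table = VersionedTable()
first = table.upsert([{"id": "a", "v": 1}, {"id": "b", "v": 1}])
table.upsert([{"id": "a", "v": 2}])
table.delete(["b"])
changed = table.changed_since(first)
```

A downstream job that remembers the last commit it processed can call `changed_since` and handle only the new writes instead of rescanning the whole table.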
Rudderstack

83 3,919 9.8 Go

Privacy and Security focused Segment-alternative, in Golang and React

Project mention: Rudderstack Switches to Elastic License | news.ycombinator.com | 2023-09-08
jitsu

13 3,823 9.8 TypeScript

Jitsu is an open-source Segment alternative. Fully-scriptable data ingestion engine for modern data teams. Set up a real-time data pipeline in minutes, not days.
paradedb

16 3,756 9.8 Rust

Postgres for Search and Analytics

Project mention: Using ClickHouse to scale an events engine | news.ycombinator.com | 2024-04-11
awesome-single-cell

3 2,898 5.5

Community-curated list of software packages and data resources for single-cell, including RNA-seq, ATAC-seq, etc.
fluvio

26 2,624 9.6 Rust

Lean and mean distributed stream processing system written in Rust and WebAssembly.

Project mention: Ask HN: WebSocket Relay? | news.ycombinator.com | 2024-02-27
incubator-devlake

10 2,420 9.9 Go

Apache DevLake is an open-source dev data platform to ingest, analyze, and visualize the fragmented data from DevOps tools, extracting insights for engineering excellence, developer experience, and community growth.
ingestr

4 2,289 8.9 Python

ingestr is a CLI tool to copy data between any databases with a single command seamlessly.

Project mention: FLaNK 04 March 2024 | dev.to | 2024-03-04
mara-pipelines

3 2,054 6.0 Python

A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
bitsail

1 1,575 6.6 Java

BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.
kuwala

33 755 0.0 JavaScript

Kuwala is the no-code data platform for BI analysts and engineers, enabling you to build powerful analytics workflows. We set out to bring the state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations, together in one intuitive interface built with React Flow. In addition, we provide third-party data into data science models and products with a focus on geospatial data. Currently, the following data connectors are available worldwide: a) High-resolution demograp

Project mention: Show HN: GeoSage – A ETL Webtool for Geo and Demographics Data from the Open Web | news.ycombinator.com | 2023-10-05

--> Google Trends Data for Regions (Coming Soon)
The tool goes beyond our previously published CLI tool (https://github.com/kuwala-io/kuwala/tree/master/kuwala) by providing a hostable solution with a user-friendly interface. We have not open-sourced it yet but a demo is available here: https://geosage.kuwala.io/.
Urban planners can utilize movement data to analyze foot traffic in different city zones. Marketers can leverage demographic data to tailor campaigns more effectively. Developers can build their apps on top of it.
To round it up .... GeoSage brings...
Unified Data Management: Access data from OSM, Facebook, and soon Google, all in one place.
transfer

7 525 9.4 Go

Database replication platform that leverages change data capture. Stream production data from databases to your data warehouse (Snowflake, BigQuery, Redshift) in real-time.

Project mention: Migrate mongodb Datawarehouse to snowflake | /r/snowflake | 2023-12-04
conduit

7 339 9.4 Go

Conduit streams data between data stores. Kafka Connect replacement. No JVM required. (by ConduitIO)

Project mention: Pulling CDC data from Postgres | /r/dataengineering | 2023-04-30

I'd like to mention Conduit + its Postgres connector. The Pg connector comes built-in, so all that is needed is a single Conduit binary to get started. It relies on WAL, but the connector creates the replication slot itself (if needed).
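
For readers new to change data capture, the sketch below shows what a downstream consumer typically does with a stream of change events once a connector has read them from the WAL. It is a minimal Python illustration, not Conduit's API: the event shape (`op`, `key`, `row`) and `apply_changes` are hypothetical, and update events are assumed to carry only the changed columns.

```python
from typing import Dict, Iterable


def apply_changes(replica: Dict[int, dict], events: Iterable[dict]) -> Dict[int, dict]:
    """Apply change events, in order, to a replica keyed by primary key."""
    for event in events:
        key = event["key"]
        if event["op"] in ("insert", "update"):
            # Merge the incoming columns over whatever the replica already holds.
            replica[key] = {**replica.get(key, {}), **event["row"]}
        elif event["op"] == "delete":
            replica.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {event['op']}")
    return replica


events = [
    {"op": "insert", "key": 1, "row": {"name": "ada", "plan": "free"}},
    {"op": "insert", "key": 2, "row": {"name": "bob", "plan": "free"}},
    {"op": "update", "key": 1, "row": {"plan": "pro"}},
    {"op": "delete", "key": 2},
]
replica = apply_changes({}, events)
```

Order matters here: replaying the same events in a different sequence could leave the replica in a different state, which is why log-based CDC preserves the commit order of the source database.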
scarches

1 309 7.5 Jupyter Notebook

Reference mapping for single-cell genomics
recap

2 306 8.9 Python

Work with your web service, database, and streaming schemas in a single format.

Project mention: Recap: A python library for describing database tables and serialization formats with minimal type coercion. | /r/dataengineering | 2023-07-12

The GitHub repo: https://github.com/recap-build/recap
cuelake

2 284 0.0 JavaScript

Use SQL to build ELT pipelines on a data lakehouse.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-04-11.

data-integration related posts

Ask HN: Best way to mirror a Postgres database to parquet?
1 project | news.ycombinator.com | 10 Apr 2024
Ingestr: CLI tool to copy data between any databases with a single command
1 project | news.ycombinator.com | 27 Feb 2024
We might want to regularly keep track of how important each server is
1 project | news.ycombinator.com | 6 Feb 2024
Show HN: Pg_analytics – Speed Up Postgres Analytical Queries by 94x
1 project | news.ycombinator.com | 31 Jan 2024
ParadeDB – PostgreSQL for Search
1 project | news.ycombinator.com | 2 Jan 2024
Postgresql index
1 project | /r/SQL | 11 Dec 2023
Show HN: GeoSage – A ETL Webtool for Geo and Demographics Data from the Open Web
1 project | news.ycombinator.com | 5 Oct 2023

Index

What are some of the best open-source data-integration projects? This list will help you:

#	Project	Stars
1	Airflow	34,397
2	airbyte	13,821
3	dagster	10,114
4	seatunnel	7,204
5	Mage	6,953
6	kestra	6,260
7	cloudquery	5,565
8	hudi	5,038
9	Rudderstack	3,919
10	jitsu	3,823
11	paradedb	3,756
12	awesome-single-cell	2,898
13	fluvio	2,624
14	incubator-devlake	2,420
15	ingestr	2,289
16	mara-pipelines	2,054
17	bitsail	1,575
18	kuwala	755
19	transfer	525
20	conduit	339
21	scarches	309
22	recap	306
23	cuelake	284