Top 11 data-ingestion Open-Source Projects

seatunnel

Mentions: 5 | Stars: 7,182 | Activity: 9.8 | Language: Java

SeaTunnel is a next-generation, high-performance, distributed tool for massive data integration.

Project mention: FLaNK Weekly 31 December 2023 | dev.to | 2023-12-31
paradedb

Mentions: 16 | Stars: 3,756 | Activity: 9.8 | Language: Rust

Postgres for Search and Analytics

Project mention: Using ClickHouse to scale an events engine | news.ycombinator.com | 2024-04-11
ingestr

Mentions: 4 | Stars: 2,289 | Activity: 8.9 | Language: Python

ingestr is a CLI tool that copies data between databases with a single command.

Project mention: FLaNK 04 March 2024 | dev.to | 2024-03-04
broadway

Mentions: 4 | Stars: 2,280 | Activity: 6.0 | Language: Elixir

Concurrent and multi-stage data ingestion and data processing with Elixir

Project mention: Switching to Elixir | news.ycombinator.com | 2023-11-09

You can actually have "background jobs" in very different ways in Elixir.
> I want background work to live on different compute capacity than http requests, both because they have very different resources usage
In Elixir, because of the way the BEAM works (its unit of parallelism is much cheaper and consumes little memory), "incoming http requests" and the related "workers" are far less expensive than in other stacks (for instance Ruby and Python), where it is quite critical to release "http workers" and not hold the connection (which is what led to the creation of background job tools like Resque, DelayedJob, Sidekiq, Celery...).
This means that you can actually hold incoming HTTP connections a lot longer without trouble.
A consequence of this is that implementing "reverse proxies", or anything calling third party servers _right in the middle_ of your own HTTP call, is usually perfectly acceptable (something I've done more than a couple of times, the latest one powering the reverse proxy behind https://transport.data.gouv.fr - code available at https://github.com/etalab/transport-site/tree/master/apps/un...).
As a consequence, what would be a bad pattern in Python or Ruby (holding the incoming HTTP connection) is not a problem with Elixir.
> because I want to have state or queues in front of background work so there's a well-defined process for retry, error handling, and back-pressure.
Unless you deal with immediate stuff like reverse proxying or cheap "one off async tasks" (like recording a metric), there are also solutions for more "stateful" background work in Elixir.
A popular background job queue is https://github.com/sorentwo/oban (roughly similar to Sidekiq et al.), which uses Postgres.
It handles retries, errors etc.
But it's not the only solution, as you have other tools dedicated to processing, such as Broadway (https://github.com/dashbitco/broadway), which handles back-pressure, fault-tolerance, batching, etc. natively.
You also have simpler options, such as flow (https://github.com/dashbitco/flow), gen_stage (https://github.com/elixir-lang/gen_stage), Task.async_stream (https://hexdocs.pm/elixir/1.12/Task.html#async_stream/5), etc.
This makes it quite easy to use the "right tool for the job".
It is also interesting to note that there is no need to "go evented" if you need to fetch data from multiple HTTP servers: it can happen in the exact same process (even in a background task attached to your HTTP server), as done here https://transport.data.gouv.fr/explore (if you zoom in you will see vehicles moving in real time; ~80 data sources are polled every 10 seconds and broadcast to visitors via pubsub and websockets).
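
The back-pressure and batching that the comment above attributes to Broadway can be illustrated generically. The following is a minimal Python sketch, not Broadway's API and not Elixir: the function name `run_pipeline` and its parameters `batch_size` and `max_pending` are hypothetical, chosen only for this example. A bounded queue makes the producer block whenever the consumer falls behind, and the consumer groups items into fixed-size batches.

```python
import queue
import threading


def run_pipeline(items, batch_size=3, max_pending=5):
    """Push items through a bounded queue and collect them in batches.

    Hypothetical illustration only: one producer, one consumer.
    """
    # Bounded queue: put() blocks once max_pending items are waiting,
    # which slows the producer down to the consumer's pace (back-pressure).
    pending = queue.Queue(maxsize=max_pending)
    batches = []
    done = object()  # sentinel marking the end of the stream

    def producer():
        for item in items:
            pending.put(item)  # blocks while the queue is full
        pending.put(done)

    def consumer():
        batch = []
        while True:
            item = pending.get()
            if item is done:
                break
            batch.append(item)
            if len(batch) == batch_size:
                batches.append(batch)  # a full batch is ready for processing
                batch = []
        if batch:
            batches.append(batch)  # flush the final partial batch

    threads = [
        threading.Thread(target=producer),
        threading.Thread(target=consumer),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return batches


if __name__ == "__main__":
    print(run_pipeline(range(7), batch_size=3, max_pending=2))
```

With a single producer and a single consumer the queue preserves order, so seven items with a batch size of three yield two full batches and one partial batch. Retries and fault tolerance, which the comment also mentions, are deliberately left out of this sketch.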
Pravega

Mentions: 0 | Stars: 1,965 | Activity: 8.5 | Language: Java

Pravega - Streaming as a new software-defined storage primitive
paimon

Mentions: 1 | Stars: 1,792 | Activity: 9.9 | Language: Java

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.

Project mention: Open Source Advent Fun Wraps Up! | dev.to | 2024-01-05

18. Apache Paimon | Github | tutorial
multiwoven

Mentions: 15 | Stars: 613 | Activity: 9.9 | Language: Ruby

🔥 Open Source Reverse ETL and Customer Data Platform (CDP). An open-source alternative to tools like Hightouch, Census, and RudderStack.

Project mention: Temporal.io: It Just Works | news.ycombinator.com | 2024-03-14
cuelake

Mentions: 0 | Stars: 284 | Activity: 0.0 | Language: JavaScript

Use SQL to build ELT pipelines on a data lakehouse.
squirrel-core

Mentions: 0 | Stars: 277 | Activity: 5.9 | Language: Python

A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way :chestnut:
squirrel-datasets-core

Mentions: 0 | Stars: 43 | Activity: 2.3 | Language: Python

Squirrel dataset hub
Shift

Mentions: 1 | Stars: 8 | Activity: 8.1 | Language: Go

Shift is a high-performance alternative to Airbyte, Singer, and Meltano (by piyushsingariya)

Project mention: Alternative to Airbyte, Singer and Meltano | /r/dataengineering | 2023-08-11

As a side hobby, I started working on this personal project: https://github.com/piyushsingariya/Kaku

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-04-11.

data-ingestion related posts

Ask HN: Best way to mirror a Postgres database to parquet?
1 project | news.ycombinator.com | 10 Apr 2024
Temporal.io: It Just Works
1 project | news.ycombinator.com | 14 Mar 2024
The lightweight Open CDP and Reverse ETL for your data warehouse
1 project | news.ycombinator.com | 13 Mar 2024
Why an open source Salesforce CDP alternative is needed
1 project | news.ycombinator.com | 11 Mar 2024
Why companies need a open source Customer Data Platform (CDP)?
1 project | news.ycombinator.com | 8 Mar 2024
Ubicloud wants to build open-source alternative to AWS in Ruby
1 project | news.ycombinator.com | 6 Mar 2024
Show HN: ReverseETL – The open-source alternative to Hightouch and Census
1 project | news.ycombinator.com | 2 Mar 2024

Index

What are some of the best open-source data-ingestion projects? This list will help you:

#	Project	Stars
1	seatunnel	7,182
2	paradedb	3,756
3	ingestr	2,289
4	broadway	2,280
5	Pravega	1,965
6	paimon	1,792
7	multiwoven	613
8	cuelake	284
9	squirrel-core	277
10	squirrel-datasets-core	43
11	Shift	8