Top 23 data-pipeline Open-Source Projects
-
incubator-dolphinscheduler
Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
-
unstructured
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
-
ragflow
RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
-
elementary
The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
-
meltano
Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
-
odd-platform
First open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.
-
data-engineering-wiki
The best place to learn data engineering. Built and maintained by the data engineering community.
-
optimus
Optimus is an easy-to-use, reliable, and performant workflow orchestrator for data transformation, data modeling, pipelines, and data quality management. (by raystack)
-
transfer
Database replication platform that leverages change data capture. Stream production data from databases to your data warehouse (Snowflake, BigQuery, Redshift) in real-time.
-
dbt-data-reliability
dbt package that is part of Elementary, the dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
-
Dataplane
Dataplane is a data platform that makes it easy to construct a data mesh with automated data pipelines and workflows.
-
awesome-kubeflow
A curated list of awesome projects and resources related to Kubeflow (a CNCF incubating project)
-
core
An Open Source PHP Reporting Framework that helps you to write perfect data reports or to construct awesome dashboards in PHP. Working great with all PHP versions from 5.6 to latest 8.0. Fully compatible with all kinds of MVC frameworks like Laravel, CodeIgniter, Symfony. (by koolreport)
Project mention: Building in Public: Leveraging Tublian's AI Copilot for My Open Source Contributions | dev.to | 2024-02-12
Contributing to Apache Airflow's open-source project immersed me in collaborative coding. Experienced maintainers rigorously reviewed my contributions, providing constructive feedback. This ongoing dialogue refined the codebase and honed my understanding of best practices.
Be careful with unstructured:
https://github.com/Unstructured-IO/unstructured/blob/d11c70c...
from: https://github.com/open-webui/open-webui/issues/687
Project mention: RAGFlow is an open-source RAG engine based on deep document understanding | news.ycombinator.com | 2024-04-01
Just link them to https://github.com/infiniflow/ragflow/blob/main/rag/llm/chat... :)
Project mention: meltano VS cloudquery - a user suggested alternative | libhunt.com/r/meltano | 2023-06-02
Project mention: OpenDataDiscovery 0.15 with Data Deprecation and Metadata Stale | news.ycombinator.com | 2023-08-04
You can check the odpf GitHub; they created some DataOps tools in Go. One example is optimus (https://github.com/odpf/optimus), which is a data pipeline orchestrator.
Here's the project: https://github.com/vmware/versatile-data-kit
Project mention: Recap: A python library for describing database tables and serialization formats with minimal type coercion. | /r/dataengineering | 2023-07-12
The Github Repo: https://github.com/recap-build/recap
Project mention: Ask HN: How do your ML teams version datasets and models? | news.ycombinator.com | 2023-09-28
I've used DVC in the past and generally liked its approach. That said, I wholeheartedly agree that it's clunky. It does a lot of things implicitly, which can make it hard to reason about. It was also extremely slow for medium-sized dataset (low 10s of GBs).
In response, I created a command-line tool that addresses these issues[0]. To reduce the comparison to an analogy: Dud : DVC :: Flask : Django.
[0]: https://github.com/kevin-hanselman/dud
data-pipelines related posts
-
OpenDataDiscovery 0.15 with Data Deprecation and Metadata Stale
-
Experience with Dagster.io?
-
Dagster tutorials
-
The Dagster Master Plan
-
A mage on the Hero’s Journey: a fantasy epic on how a startup rose from the ashes
-
ODD Platform - An open-source data discovery and observability service - v0.12 release
Index
What are some of the best open-source data-pipeline projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | Airflow | 34,570 |
2 | incubator-dolphinscheduler | 12,025 |
3 | dagster | 10,274 |
4 | Mage | 7,050 |
5 | unstructured | 6,515 |
6 | ragflow | 6,117 |
7 | orchest | 4,022 |
8 | fluvio | 2,663 |
9 | elementary | 1,740 |
10 | meltano | 1,597 |
11 | mleap | 1,494 |
12 | odd-platform | 1,115 |
13 | data-engineering-wiki | 1,036 |
14 | dataform | 791 |
15 | optimus | 737 |
16 | transfer | 533 |
17 | versatile-data-kit | 410 |
18 | dbt-data-reliability | 343 |
19 | recap | 306 |
20 | Dataplane | 184 |
21 | awesome-kubeflow | 181 |
22 | dud | 166 |
23 | core | 152 |