data-pipelines

Open-source projects categorized as data-pipelines

Top 23 data-pipeline Open-Source Projects

  • Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

  • Project mention: Building in Public: Leveraging Tublian's AI Copilot for My Open Source Contributions | dev.to | 2024-02-12

    Contributing to Apache Airflow's open-source project immersed me in collaborative coding. Experienced maintainers rigorously reviewed my contributions, providing constructive feedback. This ongoing dialogue refined the codebase and honed my understanding of best practices.

  • incubator-dolphinscheduler

    Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code

  • dagster

    An orchestration platform for the development, production, and observation of data assets.

  • Project mention: Experience with Dagster.io? | news.ycombinator.com | 2023-07-25
  • Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

  • Project mention: FLaNK AI-April 22, 2024 | dev.to | 2024-04-22
  • unstructured

    Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

  • Project mention: LlamaCloud and LlamaParse | news.ycombinator.com | 2024-02-20

    Be careful with unstructured:

    https://github.com/Unstructured-IO/unstructured/blob/d11c70c...

    from: https://github.com/open-webui/open-webui/issues/687

  • ragflow

    RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.

  • Project mention: RAGFlow is an open-source RAG engine based on deep document understanding | news.ycombinator.com | 2024-04-01

    Just link them to https://github.com/infiniflow/ragflow/blob/main/rag/llm/chat... :)

  • orchest

    Build data pipelines, the easy way 🛠️

  • fluvio

    Lean and mean distributed stream processing system written in Rust and WebAssembly.

  • Project mention: Ask HN: WebSocket Relay? | news.ycombinator.com | 2024-02-27
  • elementary

    The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.

  • meltano

    Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

  • Project mention: meltano VS cloudquery - a user suggested alternative | libhunt.com/r/meltano | 2023-06-02
  • mleap

    MLeap: Deploy ML Pipelines to Production

  • odd-platform

    First open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.

  • Project mention: OpenDataDiscovery 0.15 with Data Deprecation and Metadata Stale | news.ycombinator.com | 2023-08-04
  • data-engineering-wiki

    The best place to learn data engineering. Built and maintained by the data engineering community.

  • Project mention: Data Engineering Glossary | news.ycombinator.com | 2023-07-17
  • dataform

    Dataform is a framework for managing SQL based data operations in BigQuery

  • optimus

    Optimus is an easy-to-use, reliable, and performant workflow orchestrator for data transformation, data modeling, pipelines, and data quality management. (by raystack)

  • Project mention: Data Engineering Tools in Go | /r/dataengineering | 2023-05-18

    You can check odpf's GitHub; they created some DataOps tools using Go. One example is optimus (https://github.com/odpf/optimus), which is a data pipeline orchestrator.

  • transfer

    Database replication platform that leverages change data capture. Stream production data from databases to your data warehouse (Snowflake, BigQuery, Redshift) in real-time.

  • Project mention: Migrate mongodb Datawarehouse to snowflake | /r/snowflake | 2023-12-04
  • versatile-data-kit

    One framework to develop, deploy and operate data workflows with Python and SQL.

  • Project mention: Looking for a data blogger | /r/opensource | 2023-05-19

    Here's the project: https://github.com/vmware/versatile-data-kit

  • dbt-data-reliability

    dbt package that is part of Elementary, the dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.

  • recap

    Work with your web service, database, and streaming schemas in a single format.

  • Project mention: Recap: A python library for describing database tables and serialization formats with minimal type coercion. | /r/dataengineering | 2023-07-12

    The Github Repo: https://github.com/recap-build/recap

  • Dataplane

    Dataplane is a data platform that makes it easy to construct a data mesh with automated data pipelines and workflows.

  • awesome-kubeflow

    A curated list of awesome projects and resources related to Kubeflow (a CNCF incubating project)

  • dud

    A lightweight CLI tool for versioning data alongside source code and building data pipelines.

  • Project mention: Ask HN: How do your ML teams version datasets and models? | news.ycombinator.com | 2023-09-28

    I've used DVC in the past and generally liked its approach. That said, I wholeheartedly agree that it's clunky. It does a lot of things implicitly, which can make it hard to reason about. It was also extremely slow for medium-sized datasets (low 10s of GBs).

    In response, I created a command-line tool that addresses these issues[0]. To reduce the comparison to an analogy: Dud : DVC :: Flask : Django.

    [0]: https://github.com/kevin-hanselman/dud

  • core

    An Open Source PHP Reporting Framework that helps you to write perfect data reports or to construct awesome dashboards in PHP. Working great with all PHP versions from 5.6 to latest 8.0. Fully compatible with all kinds of MVC frameworks like Laravel, CodeIgniter, Symfony. (by koolreport)

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

data-pipelines related posts

  • OpenDataDiscovery 0.15 with Data Deprecation and Metadata Stale

    1 project | news.ycombinator.com | 4 Aug 2023
  • Experience with Dagster.io?

    1 project | news.ycombinator.com | 25 Jul 2023
  • Dagster tutorials

    1 project | /r/dataengineering | 26 Jun 2023
  • The Dagster Master Plan

    2 projects | /r/dataengineering | 16 Jun 2023
  • A mage on the Hero’s Journey: a fantasy epic on how a startup rose from the ashes

    2 projects | dev.to | 12 Jun 2023
  • ODD Platform - An open-source data discovery and observability service - v0.12 release

    1 project | /r/aipromptprogramming | 27 May 2023
  • ODD Platform - An open-source data discovery and observability service - v0.12 release

    1 project | /r/artificial | 26 May 2023

Index

What are some of the best open-source data-pipeline projects? This list will help you:

Rank  Project                      Stars
1     Airflow                      34,570
2     incubator-dolphinscheduler   12,025
3     dagster                      10,274
4     Mage                          7,050
5     unstructured                  6,515
6     ragflow                       6,117
7     orchest                       4,022
8     fluvio                        2,663
9     elementary                    1,740
10    meltano                       1,597
11    mleap                         1,494
12    odd-platform                  1,115
13    data-engineering-wiki         1,036
14    dataform                        791
15    optimus                         737
16    transfer                        533
17    versatile-data-kit              410
18    dbt-data-reliability            343
19    recap                           306
20    Dataplane                       184
21    awesome-kubeflow                181
22    dud                             166
23    core                            152
