Python data-pipelines

Open-source Python projects categorized as data-pipelines

Top 17 Python data-pipeline Projects

  1. Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

    Project mention: MCP Specification – 2025-06-18 | news.ycombinator.com | 2025-06-19
  3. pathway

    Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.

    Project mention: Fullstack Open Source Projects That Will Help You Become AI Devs (Python, JavaScript, AI) | dev.to | 2025-05-27

    Give Pathway a try: https://github.com/pathwaycom/pathway

  4. dagster

    An orchestration platform for the development, production, and observation of data assets.

    Project mention: Dagster | news.ycombinator.com | 2025-06-23
  5. Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

    Project mention: Wk 3 Orchestration: MLOPs with DataTalks | dev.to | 2025-02-22

    Here, we use the free Mage AI orchestration tool.

  6. preswald

    Preswald is a WASM packager for Python-based interactive data apps: bundle full, complex data workflows, particularly visualizations, into single files that run completely in-browser, using Pyodide, DuckDB, Pandas, Plotly, Matplotlib, etc. Build dashboards, reports, and notebooks that run offline, load fast, and share like a document.

    Project mention: Revolutionizing Data Apps: Build Interactive Dashboards with Just Python! | dev.to | 2025-03-19

  7. docetl

    A system for agentic LLM-powered data processing and ETL

    Project mention: DocETL – open-source framework for complex document processing pipelines | news.ycombinator.com | 2024-10-21
  8. meltano

    Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

  10. pyper

    Concurrent Python made simple

    Project mention: This Week In Python | dev.to | 2025-01-17

  11. versatile-data-kit

    One framework to develop, deploy and operate data workflows with Python and SQL.

  12. dbt-data-reliability

    dbt package that is part of Elementary, the dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.

  13. recap

    Work with your web service, database, and streaming schemas in a single format.

  14. patterns-devkit

    Data pipelines from re-usable components

  15. conductor-python

    Conductor OSS SDK for Python programming language

    Project mention: Show HN: Durable Python Workflows | news.ycombinator.com | 2025-04-22

    take a look at https://github.com/conductor-sdk/conductor-python which is easier and will not force you to write with specific framework.

  16. kenobi

    Easiest way to monitor asynchronous data pipelines

    Project mention: Ask HN: Who is hiring? (August 2024) | news.ycombinator.com | 2024-08-01

    Doctor Droid (https://drdroid.io) | ML Engineer (15-20LPA INR) | Bangalore, INDIA | ONSITE | Full-time

    We are a DevTool. We help engineers debug production issues faster. We do this through our Open Source automation tooling for on-call and SRE engineers. We work with tech enterprises to help optimise their DevEx / On-Call routines.

    Requirement:

  17. dagster-odp

    A configuration-driven framework for building Dagster pipelines that enables teams to create and manage data workflows using YAML/JSON instead of code

    Project mention: Declarative Data Pipelines: Moving from Code to Configuration | dev.to | 2025-02-04

    To demonstrate how dagster-odp brings these concepts together, we'll implement the same S3 to BigQuery pipeline we discussed earlier, but using a declarative approach. The complete implementation consists of three main components: resource configuration, task definition, and workflow configuration.

  18. SmartPipeline

    A framework for rapid development of robust data pipelines following a simple design pattern

  19. analytics_data_where_house

    An analytics engineering sandbox focusing on real estate prices in Cook County, IL

    Project mention: Show HN: OpenTimes – Free travel times between U.S. Census geographies | news.ycombinator.com | 2025-03-17

    Thank you for this excellent post! I've been developing [my own platform](https://github.com/MattTriano/analytics_data_where_house) that curates a data warehouse mostly of census and socrata datasets but I haven't really had a good way to share the products with anyone as it's a bit too heavyweight. I've been trying to find alternate solutions to that issue (I'm currently building out a much smaller [platform](https://github.com/MattTriano/fbi_cde_data) to process the FBI's NIBRS datasets), and your post has given me a few great implementations to study and experiment with.

    Thanks!

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python data-pipelines related posts

  • Dagster

    1 project | news.ycombinator.com | 23 Jun 2025
  • Personal Picks: Data Product News (March 19, 2025)

    1 project | dev.to | 22 Mar 2025
  • Wk 3 Orchestration: MLOPs with DataTalks

    2 projects | dev.to | 22 Feb 2025
  • Show HN: Pyper – Concurrent Python Made Simple

    3 projects | news.ycombinator.com | 12 Jan 2025
  • Monolith to Microservices: Should I Migrate and How?

    6 projects | dev.to | 3 Oct 2024
  • AI Strategy Guide: How to Scale AI Across Your Business

    4 projects | dev.to | 11 May 2024
  • Experience with Dagster.io?

    1 project | news.ycombinator.com | 25 Jul 2023

Index

What are some of the best open-source data-pipeline projects in Python? This list will help you:

 #  Project                      Stars
 1  Airflow                     40,629
 2  pathway                     27,601
 3  dagster                     13,411
 4  Mage                         8,382
 5  preswald                     4,149
 6  docetl                       2,292
 7  meltano                      2,114
 8  pyper                        1,433
 9  versatile-data-kit             450
10  dbt-data-reliability           444
11  recap                          343
12  patterns-devkit                108
13  conductor-python                75
14  kenobi                          59
15  dagster-odp                     33
16  SmartPipeline                   27
17  analytics_data_where_house       9
