Python data-engineering

Open-source Python projects categorized as data-engineering

Top 23 Python data-engineering Projects

  1. Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

    Project mention: Airflow AI SDK to build simple LLM workflows | news.ycombinator.com | 2025-03-26

    Hi HN,

    We've built an SDK for building DAGs / data pipelines with LLMs in Apache Airflow [1] using Pydantic AI [2] under the hood. I've seen success across the board with Airflow users building simple LLM workflows before moving on to "AI agents". In my experience, the noise around building agents means that people forget that there are other ways to get more immediate value out of LLMs.

    Coupling Airflow for orchestration and Pydantic AI for LLM interactions has turned out to be a very pragmatic approach to building these workflows (and agents). Neither tool "gets in the way" of what you're trying to do. Airflow's been around for 10+ years and has a very well-built orchestration engine rich with everything you need to write production grade data pipelines, and Pydantic AI's been a refreshing take on working with LLMs.

    Would love some feedback from this community!

    [1] https://github.com/apache/airflow
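To make the "DAG / data pipeline" idea in the post above concrete, here is a minimal sketch that uses only Python's standard library. It is not Airflow's API and not the SDK the post describes. The `TASKS` mapping and the `run_dag` helper are hypothetical names chosen for illustration. Each task declares its upstream tasks, and a topological sort yields an order in which every task runs after its dependencies.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: task name -> (callable, list of upstream task names).
# Each callable receives the results of its upstream tasks as positional arguments.
TASKS = {
    "extract": (lambda: [1, 2, 3], []),
    "transform": (lambda rows: [r * 10 for r in rows], ["extract"]),
    "load": (lambda rows: sum(rows), ["transform"]),
}


def run_dag(tasks):
    """Run every task after its upstream tasks; return (run order, results)."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {name: set(upstream) for name, (_, upstream) in tasks.items()}
    results = {}
    order = []
    for name in TopologicalSorter(graph).static_order():
        func, upstream = tasks[name]
        results[name] = func(*(results[u] for u in upstream))
        order.append(name)
    return order, results


if __name__ == "__main__":
    order, results = run_dag(TASKS)
    print(order)
    print(results["load"])
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering step; the sketch shows only the dependency resolution.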

  3. Prefect

    The easiest way to build, run, and monitor data pipelines at scale.

    Project mention: Show HN: Flow – A Dynamic Task Engine for AI Agents Without DAG | news.ycombinator.com | 2024-12-02

    - https://github.com/PrefectHQ/prefect

  4. Taipy

    Turns Data and AI algorithms into production-ready web applications in no time.

    Project mention: Top 40 Open-source Developer Tools with the Most GitHub Stars | dev.to | 2025-04-20

    GitHub: https://github.com/Avaiga/taipy

  5. airbyte

    The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

    Project mention: Personal Picks: Data Product News (April 16, 2025) | dev.to | 2025-04-15
  6. Cookbook

    The Data Engineering Cookbook

  7. dagster

    An orchestration platform for the development, production, and observation of data assets.

    Project mention: Personal Picks: Data Product News (March 19, 2025) | dev.to | 2025-03-22
  8. great_expectations

    Always know what to expect from your data.

  10. xonsh

    :shell: Python-powered shell. Full-featured and cross-platform.

    Project mention: Advanced Shell Scripting with Bash (2006) [pdf] | news.ycombinator.com | 2025-04-17

    (Not sure about the equivalent of shlex.quote, but in the worst case, you can just use "from shlex import quote as q" or something).

    So yes, there are good alternatives to bash - even Python based.

    [0] https://xon.sh/
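The `shlex.quote` function mentioned in the comment above is part of Python's standard library. As a minimal sketch of what it is for, the hypothetical helper `build_command` below (not part of xonsh or `shlex`) joins a program and its arguments into a single string that a POSIX shell will split back into the same arguments, even when they contain spaces or quotes.

```python
import shlex


def build_command(program, *args):
    """Join a program and its arguments into one shell-safe command string."""
    # shlex.quote wraps any part containing unsafe characters in single quotes.
    return " ".join(shlex.quote(part) for part in (program, *args))


if __name__ == "__main__":
    cmd = build_command("grep", "-r", "it's here", "my dir/")
    print(cmd)
    # Splitting the string with shell rules recovers the original arguments.
    print(shlex.split(cmd))
```

When no shell is needed at all, passing a list of arguments to `subprocess.run` avoids quoting altogether; quoting matters when the command has to travel as a single string.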

  11. Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

    Project mention: Wk 3 Orchestration: MLOPs with DataTalks | dev.to | 2025-02-22

    Here, we use the free Mage AI orchestration tool.

  12. feast

    The Open Source Feature Store for AI/ML

    Project mention: Transforming Your PDFs for RAG with Open Source Using Docling, Milvus, and Feast | news.ycombinator.com | 2025-04-22

    Hey folks!

    I recently gave a talk with the Milvus Community showing a demo of how to transform PDFs with Feast using Docling for RAG.

    The tutorial is available here: https://github.com/feast-dev/feast/tree/master/examples/rag-...

    And the video is available here: https://www.youtube.com/watch?v=DPPtr9Q6_qE

    The goal with having a feature store transform and retrieve your data for RAG is that (1) we make it easy to configure vector retrieval with just a boolean in the code declaration and (2) you can use existing tooling that data scientists / ml engineers are already familiar with.

    I'd love any feedback or ideas on how we could make things better or easier. The Feast maintainers have quite a lot in the pipeline (batch transformations, support for Ray, computer vision and more).

    Thanks a ton!

  13. AWS Data Wrangler

    pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

  14. ploomber

    The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

  15. dlt

    data load tool (dlt) is an open source Python library that makes data loading easy 🛠️

    Project mention: Data Loading Tool | news.ycombinator.com | 2024-12-14
  16. soda-core

    :zap: Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io

  17. meltano

    Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

  18. pyspark-example-project

    Implementing best practices for PySpark ETL jobs and applications.

  19. bytewax

    Python Stream Processing

    Project mention: Roast my new Python library for stream processing | news.ycombinator.com | 2025-04-03

    Interesting! How do you see this comparing with Bytewax - https://github.com/bytewax/bytewax

  20. Udacity-Data-Engineering-Projects

    A few projects related to data engineering, including data modeling, cloud infrastructure setup, data warehousing, and data lake development.

  21. mlrun

    MLRun is an open source MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications.

  22. pyjanitor

    Clean APIs for data cleaning. Python implementation of R package Janitor

  23. quix-streams

    A Python library for building containerized ML and Generative AI applications with Apache Kafka.

    Project mention: Show HN: Denormalized – Embeddable Stream Processing in Rust and DataFusion | news.ycombinator.com | 2024-08-15

    Congratulations on launching your project! We spoke back in March at a Kafka Summit London social meetup and talked all things Python and Kafka (I work on https://github.com/quixio/quix-streams). Always great to see a new stream processing project tackle a new segment

  24. DataEngineeringProject

    Example end to end data engineering project.

  25. pyper

    Concurrent Python made simple

    Project mention: This Week In Python | dev.to | 2025-01-17

    pyper – Concurrent Python made simple

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).


Python data-engineering related posts

  • Personal Picks: Data Product News (April 16, 2025)

    1 project | dev.to | 15 Apr 2025
  • Roast my new Python library for stream processing

    1 project | news.ycombinator.com | 3 Apr 2025
  • airbyte VS cocoindex - a user suggested alternative

    2 projects | 1 Apr 2025
  • The DOJ Still Wants Google to Sell Off Chrome

    4 projects | news.ycombinator.com | 8 Mar 2025
  • Start contributing to a Popular Open Source Project

    2 projects | dev.to | 28 Jan 2025
  • Data Orchestration Tool Analysis: Airflow, Dagster, Flyte

    3 projects | dev.to | 23 Jan 2025
  • Can AI finally generate best practice code? I think so.

    2 projects | dev.to | 19 Dec 2024

Index

What are some of the best open-source data-engineering projects in Python? This list will help you:

#    Project                             Stars
1    Airflow                             39,794
2    Prefect                             19,124
3    Taipy                               17,996
4    airbyte                             17,989
5    Cookbook                            14,243
6    dagster                             13,031
7    great_expectations                  10,342
8    xonsh                                8,747
9    Mage                                 8,264
10   feast                                6,002
11   AWS Data Wrangler                    4,007
12   ploomber                             3,557
13   dlt                                  3,516
14   soda-core                            2,069
15   meltano                              2,034
16   pyspark-example-project              1,860
17   bytewax                              1,713
18   Udacity-Data-Engineering-Projects    1,566
19   mlrun                                1,518
20   pyjanitor                            1,414
21   quix-streams                         1,358
22   DataEngineeringProject               1,237
23   pyper                                1,189
