Python data-engineering

Open-source Python projects categorized as data-engineering

Top 23 Python data-engineering Projects

  • Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

    Project mention: Building in Public: Leveraging Tublian's AI Copilot for My Open Source Contributions | | 2024-02-12

    Contributing to Apache Airflow's open-source project immersed me in collaborative coding. Experienced maintainers rigorously reviewed my contributions, providing constructive feedback. This ongoing dialogue refined the codebase and honed my understanding of best practices.

  • Prefect

    The easiest way to build, run, and monitor data pipelines at scale.

  • airbyte

    The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

    Project mention: Launch HN: Bracket (YC W22) – Two-Way Sync Between Salesforce and Postgres | | 2023-12-12

    I'll also give a shout-out to Airbyte, with which I've had some limited success integrating Salesforce with a local database. The particular pull for Airbyte is that we can self-host the open source version, rather than pay Fivetran a significant sum to do this for us.

    It's an immature tool, so I don't yet know that I can claim we've spent _less_ than Fivetran on the additional engineering and ops time, but it feels like it has potential to do so once stabilized.

  • dagster

    An orchestration platform for the development, production, and observation of data assets.

    Project mention: Experience with | | 2023-07-25
  • great_expectations

    Always know what to expect from your data.

    Project mention: Data Quality at Scale with Great Expectations, Spark, and Airflow on EMR | | 2023-04-24

    Great Expectations (GE) is an open-source data validation tool that helps ensure data quality.

  • Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data.

    Project mention: A mage on the Hero’s Journey: a fantasy epic on how a startup rose from the ashes | | 2023-06-12

    In the coming years, Mage will create a cooperative experience so that developers can build data pipelines with their team and level up together. After that journey, Mage will go on an epic quest to create the 1st open world community experience in the data universe.

  • taipy

    Turns Data and AI algorithms into production-ready web applications in no time.

    Project mention: Show HN: Building data and AI apps, an alternative to Streamlit | | 2024-02-12

  • feast

    Feature Store for Machine Learning

    Project mention: What's Happening with Feast? | | 2023-12-07
  • AWS Data Wrangler

    pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

    Project mention: Read files from s3 using Pandas/s3fs or AWS Data Wrangler? | /r/dataengineering | 2023-12-06

    I had no problem with awswrangler, and it supports reading and writing partitions, which was really helpful, plus a few other optimizations that made it a great tool.

  • ploomber

    The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

    Project mention: Show HN: JupySQL – a SQL client for Jupyter (ipython-SQL successor) | | 2023-12-06

    - One-click sharing powered by Ploomber Cloud:


    Note that JupySQL is a fork of ipython-sql, which is no longer actively developed. Catherine, ipython-sql's creator, was kind enough to pass the project to us (check out ipython-sql's README).

    We'd love to learn what you think and what features we can ship for JupySQL to be the best SQL client! Please let us know in the comments!

  • data-diff

    Compare tables within or across databases

    Project mention: How to Check 2 SQL Tables Are the Same | | 2023-07-26

    If the issue happens a lot, there is also:

    That is a nice tool for doing it across databases as well.

    I think it's based on a checksum method.

  • phidata

    Build AI Assistants using function calling

    Project mention: Show HN: Use function calling to build AI Assistants | | 2024-02-27
  • soda-core

    :zap: Data quality testing for the modern data stack (SQL, Spark, and Pandas)

    Project mention: Looking for Unit Testing framework in Database Migration Process | /r/dataengineering | 2023-03-23
  • meltano

    Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

    Project mention: meltano VS cloudquery - a user suggested alternative | | 2023-06-02
  • dlt

    data load tool (dlt) is an open source Python library that makes data loading easy 🛠️

    Project mention: Ask HN: Freelancer? Seeking freelancer? (December 2023) | | 2023-12-03


    dltHub is looking for freelance help in the following repos:


  • pyspark-example-project

    Implementing best practices for PySpark ETL jobs and applications.

  • Udacity-Data-Engineering-Projects

    A few projects related to data engineering, including data modeling, infrastructure setup on cloud, data warehousing, and data lake development.

    Project mention: A question about data engineering? | /r/programiranje | 2023-06-30
  • pyjanitor

    Clean APIs for data cleaning. Python implementation of R package Janitor

    Project mention: Sub library with useful code | /r/learnpython | 2023-05-19
  • mlrun

    Machine Learning automation and tracking

  • bytewax

    Python Stream Processing

    Project mention: Near Real Time Ingestion to DB using Python | /r/dataengineering | 2023-12-06

    You can probably use Python to solve your problem; there are many ways you can speed up your deserialization/flattening. I work on Bytewax, and I wouldn't mention it if it weren't a good fit, but I think it's worth looking at here. It is a stream processor that makes it easy to scale, maintain order, and track progress, and you just write native Python.

  • DataEngineeringProject

    Example end to end data engineering project.

  • NeumAI

    Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.

    Project mention: Show HN: Neum AI – Open-source large-scale RAG framework | | 2023-11-21

    Interesting to see that the semantic chunking in the tools library is a wrapper around GPT-4. It asks GPT for the Python code and executes it.

  • vectorflow

    VectorFlow is a high volume vector embedding pipeline that ingests raw data, transforms it into vectors and writes it to a vector DB of your choice. (by dgarnitz)

    Project mention: FLaNK Weekly 08 Jan 2024 | | 2024-01-08

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-02-27.
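
One of the comments quoted under data-diff above guesses that the tool is "based on a checksum method". The sketch below only illustrates that general idea and is not data-diff's actual implementation: it hashes the rows of two tables in key order and compares the digests. The helper names (table_checksum, tables_match, demo_db) and the sample users table are hypothetical, and SQLite is used only so the example is self-contained.

```python
import hashlib
import sqlite3


def table_checksum(conn, table, key):
    """Return (row_count, sha256 hex digest) of a table, reading rows in key order."""
    # Identifiers cannot be bound as SQL parameters, so this sketch assumes
    # the table and key names come from trusted code, not from user input.
    digest = hashlib.sha256()
    count = 0
    for row in conn.execute(f'SELECT * FROM "{table}" ORDER BY "{key}"'):
        digest.update(repr(row).encode("utf-8"))
        count += 1
    return count, digest.hexdigest()


def tables_match(conn_a, table_a, conn_b, table_b, key="id"):
    """True when both tables have the same row count and the same checksum."""
    return table_checksum(conn_a, table_a, key) == table_checksum(conn_b, table_b, key)


def demo_db(rows):
    """Build a throwaway in-memory database holding a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
    return conn


source = demo_db([(1, "ada"), (2, "grace")])
replica = demo_db([(2, "grace"), (1, "ada")])  # same rows, different insert order
drifted = demo_db([(1, "ada"), (2, "Grace")])  # one value differs

print(tables_match(source, "users", replica, "users"))  # True
print(tables_match(source, "users", drifted, "users"))  # False
```

Sorting by the key before hashing makes the result independent of insertion order. A whole-table digest like this only says whether the tables differ, not where; locating the differing rows would need checksums over smaller key ranges.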

What are some of the best open-source data-engineering projects in Python? This list will help you:

Rank  Project                            Stars
1     Airflow                            33,606
2     Prefect                            14,114
3     airbyte                            13,265
4     dagster                            9,682
5     great_expectations                 9,276
6     Mage                               6,609
7     taipy                              5,824
8     feast                              5,156
9     AWS Data Wrangler                  3,745
10    ploomber                           3,335
11    data-diff                          2,755
12    phidata                            2,431
13    soda-core                          1,685
14    meltano                            1,511
15    dlt                                1,352
16    pyspark-example-project            1,312
17    Udacity-Data-Engineering-Projects  1,295
18    pyjanitor                          1,254
19    mlrun                              1,216
20    bytewax                            1,050
21    DataEngineeringProject             922
22    NeumAI                             729
23    vectorflow                         613