Top 23 Python data-engineering Projects
-
Hi HN,
We've built an SDK for building DAGs / data pipelines with LLMs in Apache Airflow [1] using Pydantic AI [2] under the hood. I've seen success across the board with Airflow users building simple LLM workflows before moving on to "AI agents". In my experience, the noise around building agents means that people forget that there are other ways to get more immediate value out of LLMs.
Coupling Airflow for orchestration and Pydantic AI for LLM interactions has turned out to be a very pragmatic approach to building these workflows (and agents). Neither tool "gets in the way" of what you're trying to do. Airflow's been around for 10+ years and has a very well-built orchestration engine with everything you need to write production-grade data pipelines, and Pydantic AI's been a refreshing take on working with LLMs.
Would love some feedback from this community!
[1] https://github.com/apache/airflow
-
Project mention: Show HN: Flow – A Dynamic Task Engine for AI Agents Without DAG | news.ycombinator.com | 2024-12-02
- https://github.com/PrefectHQ/prefect
-
Project mention: Top 40 Open-source Developer Tools with the Most GitHub Stars | dev.to | 2025-04-20
GitHub: https://github.com/Avaiga/taipy
-
airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
-
Project mention: Advanced Shell Scripting with Bash (2006) [pdf] | news.ycombinator.com | 2025-04-17
(Not sure about the equivalent of shlex.quote, but in the worst case, you can just use "from shlex import quote as q" or something).
So yes, there are good alternatives to bash - even Python based.
[0] https://xon.sh/
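For readers unfamiliar with the `shlex.quote` import mentioned above, here is a minimal sketch using only the Python standard library. The helper name `build_command` and the sample arguments are hypothetical, chosen for illustration; nothing here is xonsh-specific.

```python
from shlex import quote as q, split


def build_command(program: str, *args: str) -> str:
    """Join a program and its arguments into one shell-safe command string."""
    # quote() wraps any part containing shell-special characters in single quotes
    # and leaves plain words untouched.
    return " ".join(q(part) for part in (program, *args))


# Hypothetical arguments that contain spaces.
cmd = build_command("grep", "-n", "hello world", "my file.txt")
print(cmd)

# shlex.split() parses the string back into the original argument list.
print(split(cmd))
```

Quoting each part separately keeps arguments with spaces or metacharacters from being re-split or interpreted when the string is handed to a POSIX shell.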
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
Here, we use the free Mage AI orchestration tool.
-
Project mention: Transforming Your PDFs for RAG with Open Source Using Docling, Milvus, and Feast | news.ycombinator.com | 2025-04-22
Hey folks!
I recently gave a talk with the Milvus Community showing a demo of how to transform PDFs with Feast using Docling for RAG.
The tutorial is available here: https://github.com/feast-dev/feast/tree/master/examples/rag-...
And the video is available here: https://www.youtube.com/watch?v=DPPtr9Q6_qE
The goal with having a feature store transform and retrieve your data for RAG is that (1) we make it easy to configure vector retrieval with just a boolean in the code declaration and (2) you can use existing tooling that data scientists / ML engineers are already familiar with.
I'd love any feedback or ideas on how we could make things better or easier. The Feast maintainers have quite a lot in the pipeline (batch transformations, support for Ray, computer vision and more).
Thanks a ton!
-
AWS Data Wrangler
pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
-
soda-core
⚡ Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
-
meltano
Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
-
Project mention: Roast my new Python library for stream processing | news.ycombinator.com | 2025-04-03
Interesting! How do you see this comparing with Bytewax - https://github.com/bytewax/bytewax
-
Udacity-Data-Engineering-Projects
A few projects related to Data Engineering, including Data Modeling, infrastructure setup on cloud, Data Warehousing, and Data Lake development.
-
mlrun
MLRun is an open source MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications.
-
quix-streams
A Python library for building containerized ML and Generative AI applications with Apache Kafka.
Project mention: Show HN: Denormalized – Embeddable Stream Processing in Rust and DataFusion | news.ycombinator.com | 2024-08-15
Congratulations on launching your project! We spoke back in March at a Kafka Summit London social meetup and talked all things Python and Kafka (I work on https://github.com/quixio/quix-streams). Always great to see a new stream processing project tackle a new segment
-
pyper – Concurrent Python made simple
Python data-engineering related posts
-
Personal Picks: Data Product News (April 16, 2025)
-
Roast my new Python library for stream processing
-
airbyte VS cocoindex - a user suggested alternative
2 projects | 1 Apr 2025
-
The DOJ Still Wants Google to Sell Off Chrome
-
Start contributing to a Popular Open Source Project
-
Data Orchestration Tool Analysis: Airflow, Dagster, Flyte
-
Can AI finally generate best practice code? I think so.
Index
What are some of the best open-source data-engineering projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | Airflow | 39,794 |
2 | Prefect | 19,124 |
3 | Taipy | 17,996 |
4 | airbyte | 17,989 |
5 | Cookbook | 14,243 |
6 | dagster | 13,031 |
7 | great_expectations | 10,342 |
8 | xonsh | 8,747 |
9 | Mage | 8,264 |
10 | feast | 6,002 |
11 | AWS Data Wrangler | 4,007 |
12 | ploomber | 3,557 |
13 | dlt | 3,516 |
14 | soda-core | 2,069 |
15 | meltano | 2,034 |
16 | pyspark-example-project | 1,860 |
17 | bytewax | 1,713 |
18 | Udacity-Data-Engineering-Projects | 1,566 |
19 | mlrun | 1,518 |
20 | pyjanitor | 1,414 |
21 | quix-streams | 1,358 |
22 | DataEngineeringProject | 1,237 |
23 | pyper | 1,189 |