Top 23 Python ETL Projects
-
Hi HN,
We've built an SDK for building DAGs / data pipelines with LLMs in Apache Airflow [1] using Pydantic AI [2] under the hood. I've seen success across the board with Airflow users building simple LLM workflows before moving on to "AI agents". In my experience, the noise around building agents means that people forget that there are other ways to get more immediate value out of LLMs.
Coupling Airflow for orchestration and Pydantic AI for LLM interactions has turned out to be a very pragmatic approach to building these workflows (and agents). Neither tool "gets in the way" of what you're trying to do. Airflow's been around for 10+ years and has a very well-built orchestration engine rich with everything you need to write production-grade data pipelines, and Pydantic AI's been a refreshing take on working with LLMs.
Would love some feedback from this community!
[1] https://github.com/apache/airflow
-
Project mention: pathway VS cocoindex - a user suggested alternative | libhunt.com/r/pathway | 2025-04-01
-
airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
Here, we use the free Mage AI orchestration tool.
-
AWS Data Wrangler
pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
-
ethereum-etl
Python scripts for ETL (extract, transform and load) jobs for Ethereum blocks, transactions, ERC20 / ERC721 tokens, transfers, receipts, logs, contracts, internal transactions. Data is available in Google BigQuery https://goo.gl/oY5BCQ
-
mara-pipelines
A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
-
Project mention: DocETL - open-source framework for complex document processing pipelines | news.ycombinator.com | 2024-10-21
-
NeumAI
Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.
-
koheesio
Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
koheesio - framework for building efficient data pipelines
-
baby-names-analysis
Data ETL & Analysis on the dataset 'Baby Names from Social Security Card Applications - National Data'.
-
pudl
The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
-
sycamore
Sycamore is an LLM-powered search and analytics platform for unstructured data. (by aryn-ai)
-
We're adding this as we speak. Ollama support is already there, and here's vLLM inference: https://github.com/vlm-run/vlmrun-hub/pull/120
-
bitcoin-etl
ETL scripts for Bitcoin, Litecoin, Dash, Zcash, Doge, Bitcoin Cash. Available in Google BigQuery https://goo.gl/oY5BCQ
-
ethereum-etl-airflow
Airflow DAGs for exporting, loading, and parsing the Ethereum blockchain data. How to get any Ethereum smart contract into BigQuery https://towardsdatascience.com/how-to-get-any-ethereum-smart-contract-into-bigquery-in-8-mins-bab5db1fdeee
Python ETL discussion
Python ETL related posts
-
Streaming data with FastAPI & Vue made easy
-
airbyte VS cocoindex - a user suggested alternative
2 projects | 1 Apr 2025
-
Automate structured data extraction from PDF / Word by OpenAI and CocoIndex
-
Replace OCR with Vision Language Models
-
Show HN: I built an open-source data pipeline tool in Go
-
Data Engineering with DLT and REST
-
Show HN: Open-source Rule-based PDF parser for RAG
-
Index
What are some of the best open-source ETL projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | Airflow | 40,060 |
| 2 | pathway | 24,669 |
| 3 | airbyte | 18,103 |
| 4 | dagster | 13,107 |
| 5 | Mage | 8,312 |
| 6 | AWS Data Wrangler | 4,016 |
| 7 | ethereum-etl | 3,017 |
| 8 | sqlmesh | 2,304 |
| 9 | mara-pipelines | 2,082 |
| 10 | docetl | 1,947 |
| 11 | pyspark-example-project | 1,860 |
| 12 | pgsync | 1,277 |
| 13 | NeumAI | 854 |
| 14 | eland | 676 |
| 15 | koheesio | 637 |
| 16 | baby-names-analysis | 564 |
| 17 | redun | 544 |
| 18 | pudl | 538 |
| 19 | sycamore | 515 |
| 20 | vlmrun-hub | 506 |
| 21 | versatile-data-kit | 449 |
| 22 | bitcoin-etl | 421 |
| 23 | ethereum-etl-airflow | 416 |