Python ETL

Open-source Python projects categorized as ETL

Top 23 Python ETL Projects

  1. Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

    Project mention: Airflow AI SDK to build simple LLM workflows | news.ycombinator.com | 2025-03-26

    Hi HN,

    We've built an SDK for building DAGs / data pipelines with LLMs in Apache Airflow [1] using Pydantic AI [2] under the hood. I've seen success across the board with Airflow users building simple LLM workflows before moving on to "AI agents". In my experience, the noise around building agents means that people forget that there are other ways to get more immediate value out of LLMs.

    Coupling Airflow for orchestration and Pydantic AI for LLM interactions has turned out to be a very pragmatic approach to building these workflows (and agents). Neither tool "gets in the way" of what you're trying to do. Airflow's been around for 10+ years and has a very well-built orchestration engine rich with everything you need to write production grade data pipelines, and Pydantic AI's been a refreshing take on working with LLMs.

    Would love some feedback from this community!

    [1] https://github.com/apache/airflow

  3. pathway

    Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.

    Project mention: pathway VS cocoindex - a user suggested alternative | libhunt.com/r/pathway | 2025-04-01
  4. airbyte

    The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

    Project mention: Personal Picks: Data Product News (April 16, 2025) | dev.to | 2025-04-15
  5. dagster

    An orchestration platform for the development, production, and observation of data assets.

    Project mention: Personal Picks: Data Product News (March 19, 2025) | dev.to | 2025-03-22
  6. Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

    Project mention: Wk 3 Orchestration: MLOPs with DataTalks | dev.to | 2025-02-22

    Here, we use the free Mage AI orchestration tool.

  7. AWS Data Wrangler

    pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

  8. ethereum-etl

    Python scripts for ETL (extract, transform and load) jobs for Ethereum blocks, transactions, ERC20 / ERC721 tokens, transfers, receipts, logs, contracts, internal transactions. Data is available in Google BigQuery https://goo.gl/oY5BCQ

  10. sqlmesh

    Scalable and efficient data transformation framework - backwards compatible with dbt.

  11. mara-pipelines

    A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

  12. docetl

    A system for agentic LLM-powered data processing and ETL

    Project mention: DocETL – open-source framework for complex document processing pipelines | news.ycombinator.com | 2024-10-21
  13. pyspark-example-project

    Implementing best practices for PySpark ETL jobs and applications.

  14. pgsync

    Postgres to Elasticsearch/OpenSearch sync (by toluaina)

  15. NeumAI

    Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.

  16. eland

    Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch

  17. koheesio

    Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.

    Project mention: This Week in Python | dev.to | 2024-06-07

    koheesio – framework for building efficient data pipelines

  18. baby-names-analysis

    Data ETL & Analysis on the dataset 'Baby Names from Social Security Card Applications - National Data'.

  19. redun

    Yet another redundant workflow engine

  20. pudl

    The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.

  21. sycamore

    🍁 Sycamore is an LLM-powered search and analytics platform for unstructured data. (by aryn-ai)

  22. vlmrun-hub

    A hub for various industry-specific schemas to be used with VLMs.

    Project mention: Replace OCR with Vision Language Models | news.ycombinator.com | 2025-02-26

    We’re adding this as we speak. Ollama support is already there, and here’s vLLM inference: https://github.com/vlm-run/vlmrun-hub/pull/120

  23. versatile-data-kit

    One framework to develop, deploy and operate data workflows with Python and SQL.

  24. bitcoin-etl

    ETL scripts for Bitcoin, Litecoin, Dash, Zcash, Doge, Bitcoin Cash. Available in Google BigQuery https://goo.gl/oY5BCQ

  25. ethereum-etl-airflow

    Airflow DAGs for exporting, loading, and parsing the Ethereum blockchain data. How to get any Ethereum smart contract into BigQuery https://towardsdatascience.com/how-to-get-any-ethereum-smart-contract-into-bigquery-in-8-mins-bab5db1fdeee

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions counts repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python ETL related posts

  • Streaming data with FastAPI & Vue made easy

    2 projects | dev.to | 25 Apr 2025
  • airbyte VS cocoindex - a user suggested alternative

    2 projects | 1 Apr 2025
  • Automate structured data extraction from PDF / Word by OpenAI and CocoIndex

    4 projects | dev.to | 28 Mar 2025
  • Replace OCR with Vision Language Models

    7 projects | news.ycombinator.com | 26 Feb 2025
  • Show HN: I built an open-source data pipeline tool in Go

    6 projects | news.ycombinator.com | 17 Dec 2024
  • Data Engineering with DLT and REST

    2 projects | dev.to | 28 Nov 2024
  • Show HN: Open-source Rule-based PDF parser for RAG

    9 projects | news.ycombinator.com | 23 Jan 2024

Index

What are some of the best open-source ETL projects in Python? This list will help you:

#   Project                   Stars
1   Airflow                   40,060
2   pathway                   24,669
3   airbyte                   18,103
4   dagster                   13,107
5   Mage                       8,312
6   AWS Data Wrangler          4,016
7   ethereum-etl               3,017
8   sqlmesh                    2,304
9   mara-pipelines             2,082
10  docetl                     1,947
11  pyspark-example-project    1,860
12  pgsync                     1,277
13  NeumAI                       854
14  eland                        676
15  koheesio                     637
16  baby-names-analysis          564
17  redun                        544
18  pudl                         538
19  sycamore                     515
20  vlmrun-hub                   506
21  versatile-data-kit           449
22  bitcoin-etl                  421
23  ethereum-etl-airflow         416
