Python ETL

Open-source Python projects categorized as ETL

Top 23 Python ETL Projects

  • Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

  • Project mention: Building in Public: Leveraging Tublian's AI Copilot for My Open Source Contributions | dev.to | 2024-02-12

    Contributing to Apache Airflow's open-source project immersed me in collaborative coding. Experienced maintainers rigorously reviewed my contributions, providing constructive feedback. This ongoing dialogue refined the codebase and honed my understanding of best practices.

  • airbyte

    The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

  • Project mention: Launch HN: Bracket (YC W22) – Two-Way Sync Between Salesforce and Postgres | news.ycombinator.com | 2023-12-12

    I'll also give a shout-out to Airbyte (https://airbyte.com/), with which I've had some limited success integrating Salesforce with a local database. The particular pull for Airbyte is that we can self-host the open source version, rather than pay Fivetran a significant sum to do this for us.

    It's an immature tool, so I don't yet know that I can claim we've spent _less_ than Fivetran on the additional engineering and ops time, but it feels like it has potential to do so once stabilized.

  • dagster

    An orchestration platform for the development, production, and observation of data assets.

  • Project mention: Experience with Dagster.io? | news.ycombinator.com | 2023-07-25
  • Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

  • Project mention: A mage on the Hero’s Journey: a fantasy epic on how a startup rose from the ashes | dev.to | 2023-06-12

    In the coming years, Mage will create a cooperative experience so that developers can build data pipelines with their team and level up together. After that journey, Mage will go on an epic quest to create the 1st open world community experience in the data universe.

  • AWS Data Wrangler

    pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

  • Project mention: Read files from s3 using Pandas/s3fs or AWS Data Wrangler? | /r/dataengineering | 2023-12-06

    I had no problem with awswrangler (https://github.com/aws/aws-sdk-pandas). It supports reading and writing partitions, which was really helpful, plus a few other optimizations that made it a great tool.

  • ethereum-etl

    Python scripts for ETL (extract, transform and load) jobs for Ethereum blocks, transactions, ERC20 / ERC721 tokens, transfers, receipts, logs, contracts, internal transactions. Data is available in Google BigQuery https://goo.gl/oY5BCQ

  • Project mention: Blockchain transactions decoding: making wallet activity understandable | dev.to | 2023-10-27

    An event is a log entity which EVM smart contracts can emit during transaction execution. Events are very good at signalling that some action has taken place on-chain. Applications can subscribe and listen to events to trigger some off-chain logic, or they can index, transform and store events in some off-chain storage (look at The Graph protocol or Ethereum ETL).

  • mara-pipelines

    A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

  • pyspark-example-project

    Implementing best practices for PySpark ETL jobs and applications.

  • sqlmesh

    Efficient data transformation and modeling framework that is backwards compatible with dbt.

  • Project mention: Launch HN: Serra (YC S23) – Open-source, Python-based dbt alternative | news.ycombinator.com | 2023-08-14

    There is also sqlmesh (https://sqlmesh.com/). Pretty new as well. It introduces some interesting concepts. For smaller dbt projects it could be a drop-in replacement as it allows importing dbt projects.

  • pgsync

    Postgres to Elasticsearch/OpenSearch sync (by toluaina)

  • NeumAI

    Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.

  • Project mention: Show HN: Neum AI – Open-source large-scale RAG framework | news.ycombinator.com | 2023-11-21

    Interesting to see that the semantic chunking in the tools library is a wrapper around GPT-4. It asks GPT for the Python code and executes it: https://github.com/NeumTry/NeumAI/blob/main/neumai-tools/neu...

  • eland

    Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch

  • Project mention: I'm getting elasticsearch.BadRequestError: BadRequestError(400, 'illegal_argument_exception', "specified fields can't be null or empty") using Eland library | /r/elasticsearch | 2023-05-02

    We have a fix for this issue reported here merged and pending a release. Hopefully that release will happen in the next few days, then you can upgrade and the default experience for everyone won't be as confusing :)

  • baby-names-analysis

    Data ETL & Analysis on the dataset 'Baby Names from Social Security Card Applications - National Data'.

  • redun

    Yet another redundant workflow engine

  • Project mention: Redun: Yet another redundant workflow engine | news.ycombinator.com | 2023-08-11
  • versatile-data-kit

    One framework to develop, deploy and operate data workflows with Python and SQL.

  • Project mention: Looking for a data blogger | /r/opensource | 2023-05-19

    Here's the project: https://github.com/vmware/versatile-data-kit

  • ethereum-etl-airflow

    Airflow DAGs for exporting, loading, and parsing the Ethereum blockchain data. How to get any Ethereum smart contract into BigQuery https://towardsdatascience.com/how-to-get-any-ethereum-smart-contract-into-bigquery-in-8-mins-bab5db1fdeee

  • Project mention: ethereum-etl-airflow: NEW Data - star count:358.0 | /r/algoprojects | 2023-07-10
  • bitcoin-etl

    ETL scripts for Bitcoin, Litecoin, Dash, Zcash, Doge, Bitcoin Cash. Available in Google BigQuery https://goo.gl/oY5BCQ

  • astro-sdk

    Astro SDK allows rapid and clean development of {Extract, Load, Transform} workflows using Python and SQL, powered by Apache Airflow.

  • Project mention: Orchestration: Thoughts on Dagster, Airflow and Prefect? | /r/dataengineering | 2023-06-01

    Have you tried the Astro SDK? https://github.com/astronomer/astro-sdk

  • paperetl

    📄 ⚙️ ETL processes for medical and scientific papers

  • Project mention: Show HN: Open-source Rule-based PDF parser for RAG | news.ycombinator.com | 2024-01-23
  • recap

    Work with your web service, database, and streaming schemas in a single format.

  • Project mention: Recap: A python library for describing database tables and serialization formats with minimal type coercion. | /r/dataengineering | 2023-07-12

    The Github Repo: https://github.com/recap-build/recap

  • usaspending-api

    Server application to serve U.S. federal spending data via a RESTful API

  • trex

    Enforce structured output from LLMs 100% of the time (by automorphic-ai)

  • Project mention: Show HN: Generate JSON mock data for testing/initial app development | news.ycombinator.com | 2023-10-03

    A friend of mine built a tool called Trex that you might find helpful, check it out here: https://github.com/automorphic-ai/trex

    It's very consistent at generating templated data.

  • reddit-detective

    Play detective on Reddit: Discover political disinformation campaigns, secret influencers and more

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-02-12.

Index

What are some of the best open-source ETL projects in Python? This list will help you:

Rank Project Stars
1 Airflow 34,397
2 airbyte 13,821
3 dagster 10,173
4 Mage 6,953
5 AWS Data Wrangler 3,797
6 ethereum-etl 2,819
7 mara-pipelines 2,054
8 pyspark-example-project 1,370
9 sqlmesh 1,231
10 pgsync 1,045
11 NeumAI 772
12 eland 608
13 baby-names-analysis 564
14 redun 484
15 versatile-data-kit 409
16 ethereum-etl-airflow 387
17 bitcoin-etl 385
18 astro-sdk 315
19 paperetl 316
20 recap 306
21 usaspending-api 283
22 trex 237
23 reddit-detective 206
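
For readers new to the term, the extract-transform-load pattern that the projects above automate can be sketched with the Python standard library alone. This is a minimal illustration, not how any listed project works: the sample data, the `payments` table, and the function names are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical input: a tiny CSV held in memory so the sketch needs no files.
RAW_CSV = """name,amount
alice,10
bob,not-a-number
carol,32
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: title-case names and drop rows whose amount is not an integer."""
    cleaned = []
    for row in rows:
        try:
            amount = int(row["amount"])
        except ValueError:
            continue  # skip malformed records
        cleaned.append((row["name"].title(), amount))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a SQLite table (hypothetical schema)."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", records)
    conn.commit()


def run_pipeline(text, conn):
    """Run the three stages in order and return what ended up in the table."""
    load(transform(extract(text)), conn)
    return conn.execute("SELECT name, amount FROM payments ORDER BY name").fetchall()


if __name__ == "__main__":
    print(run_pipeline(RAW_CSV, sqlite3.connect(":memory:")))
```

Orchestrators such as those in the list add scheduling, retries, and monitoring around stages like these; the sketch only shows the data flow.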
