Top 23 Python ETL Projects
-
airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
-
AWS Data Wrangler
pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
-
ethereum-etl
Python scripts for ETL (extract, transform and load) jobs for Ethereum blocks, transactions, ERC20 / ERC721 tokens, transfers, receipts, logs, contracts, internal transactions. Data is available in Google BigQuery https://goo.gl/oY5BCQ
-
mara-pipelines
A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
-
NeumAI
Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.
-
baby-names-analysis
Data ETL & Analysis on the dataset 'Baby Names from Social Security Card Applications - National Data'.
-
ethereum-etl-airflow
Airflow DAGs for exporting, loading, and parsing the Ethereum blockchain data. How to get any Ethereum smart contract into BigQuery https://towardsdatascience.com/how-to-get-any-ethereum-smart-contract-into-bigquery-in-8-mins-bab5db1fdeee
-
bitcoin-etl
ETL scripts for Bitcoin, Litecoin, Dash, Zcash, Doge, Bitcoin Cash. Available in Google BigQuery https://goo.gl/oY5BCQ
-
astro-sdk
Astro SDK allows rapid and clean development of {Extract, Load, Transform} workflows using Python and SQL, powered by Apache Airflow.
Project mention: Building in Public: Leveraging Tublian's AI Copilot for My Open Source Contributions | dev.to | 2024-02-12
Contributing to Apache Airflow's open-source project immersed me in collaborative coding. Experienced maintainers rigorously reviewed my contributions, providing constructive feedback. This ongoing dialogue refined the codebase and honed my understanding of best practices.
Project mention: Launch HN: Bracket (YC W22) – Two-Way Sync Between Salesforce and Postgres | news.ycombinator.com | 2023-12-12
I'll also give a shout-out to Airbyte (https://airbyte.com/), with which I've had some limited success integrating Salesforce with a local database. The particular pull for Airbyte is that we can self-host the open source version, rather than pay Fivetran a significant sum to do this for us.
It's an immature tool, so I can't yet claim we've spent _less_ than Fivetran would cost once the additional engineering and ops time is counted, but it feels like it has the potential to get there once stabilized.
Project mention: Read files from s3 using Pandas/s3fs or AWS Data Wrangler? | /r/dataengineering | 2023-12-06
I had no problem with awswrangler (https://github.com/aws/aws-sdk-pandas). It supports reading and writing partitions, which was really helpful, along with a few other optimizations that made it a great tool.
Project mention: Blockchain transactions decoding: making wallet activity understandable | dev.to | 2023-10-27
An event is a log entity that EVM smart contracts can emit during transaction execution. Events are very good at signalling that some action has taken place on-chain. Applications can subscribe and listen to events to trigger off-chain logic, or they can index, transform and store events in off-chain storage (see The Graph protocol or Ethereum ETL).
Project mention: Launch HN: Serra (YC S23) – Open-source, Python-based dbt alternative | news.ycombinator.com | 2023-08-14
There is also sqlmesh (https://sqlmesh.com/). Pretty new as well. It introduces some interesting concepts. For smaller dbt projects it could be a drop-in replacement as it allows importing dbt projects.
Project mention: Show HN: Neum AI – Open-source large-scale RAG framework | news.ycombinator.com | 2023-11-21
Interesting to see that the semantic chunking in the tools library is a wrapper around GPT-4. Asks GPT for the python code and executes it: https://github.com/NeumTry/NeumAI/blob/main/neumai-tools/neu...
Project mention: I'm getting elasticsearch.BadRequestError: BadRequestError(400, 'illegal_argument_exception', "specified fields can't be null or empty") using Eland library | /r/elasticsearch | 2023-05-02
We have a fix for this issue reported here merged and pending a release. Hopefully that release will happen in the next few days, then you can upgrade and the default experience for everyone won't be as confusing :)
Here's the project: https://github.com/vmware/versatile-data-kit
Project mention: Orchestration: Thoughts on Dagster, Airflow and Prefect? | /r/dataengineering | 2023-06-01
Have you tried the Astro SDK? https://github.com/astronomer/astro-sdk
Project mention: Show HN: Open-source Rule-based PDF parser for RAG | news.ycombinator.com | 2024-01-23
Project mention: Recap: A python library for describing database tables and serialization formats with minimal type coercion. | /r/dataengineering | 2023-07-12
The Github Repo: https://github.com/recap-build/recap
Project mention: Show HN: Generate JSON mock data for testing/initial app development | news.ycombinator.com | 2023-10-03
A friend of mine built a tool called Trex that you might find helpful, check it out here: https://github.com/automorphic-ai/trex
It's very consistent at generating templated data.
Project mention: Is there something wrong with me, I hate dbt, what am I missing ? | /r/dataengineering | 2023-05-15
This just feels like you aren’t using the plentiful tools that make those “mind-numbingly slow” dev steps faster. For example, dbt-coves can generate the staging models, with type casting, in a couple of clicks. And pulling directly from Fivetran tables is just poor practice; the additional steps needed to do it “right” are inconsequential at best.
Python ETL related posts
- Show HN: Open-source Rule-based PDF parser for RAG
- Prism: the easiest way to create robust data workflows. Accessible via CLI
- Show HN: Prism – a framework for creating robust data science workflows
- Show HN: Prism – Data Orchestration in Python
- Introducing Prism: A Novel, Open-Source Data Orchestration Software. Feedback needed!
- Prism - a lightweight, yet powerful data orchestration platform in Python. Accessible via CLI
- Intelligently transform unstructured to structured output (JSON, Regex, CFG)
Index
What are some of the best open-source ETL projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | Airflow | 34,397 |
2 | airbyte | 13,923 |
3 | dagster | 10,173 |
4 | Mage | 7,001 |
5 | AWS Data Wrangler | 3,797 |
6 | ethereum-etl | 2,819 |
7 | mara-pipelines | 2,054 |
8 | pyspark-example-project | 1,370 |
9 | sqlmesh | 1,249 |
10 | pgsync | 1,053 |
11 | NeumAI | 774 |
12 | eland | 608 |
13 | baby-names-analysis | 564 |
14 | redun | 484 |
15 | versatile-data-kit | 410 |
16 | ethereum-etl-airflow | 387 |
17 | bitcoin-etl | 386 |
18 | astro-sdk | 317 |
19 | paperetl | 316 |
20 | recap | 306 |
21 | usaspending-api | 283 |
22 | trex | 238 |
23 | dbt-coves | 208 |