Top 23 Python ETL Projects
-
airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
-
AWS Data Wrangler
pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
-
ethereum-etl
Python scripts for ETL (extract, transform and load) jobs for Ethereum blocks, transactions, ERC20 / ERC721 tokens, transfers, receipts, logs, contracts, internal transactions. Data is available in Google BigQuery https://goo.gl/oY5BCQ
-
mara-pipelines
A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
-
NeumAI
Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.
-
baby-names-analysis
Data ETL & Analysis on the dataset 'Baby Names from Social Security Card Applications - National Data'.
-
bitcoin-etl
ETL scripts for Bitcoin, Litecoin, Dash, Zcash, Doge, Bitcoin Cash. Available in Google BigQuery https://goo.gl/oY5BCQ
-
ethereum-etl-airflow
Airflow DAGs for exporting, loading, and parsing the Ethereum blockchain data. How to get any Ethereum smart contract into BigQuery https://towardsdatascience.com/how-to-get-any-ethereum-smart-contract-into-bigquery-in-8-mins-bab5db1fdeee
-
astro-sdk
Astro SDK allows rapid and clean development of {Extract, Load, Transform} workflows using Python and SQL, powered by Apache Airflow.
Acquiring data and transforming it into the required format is an integral part of an ML project. This involves creating ETL (extract, transform, load) pipelines and running them periodically. Airflow is an open source platform that helps engineers create and manage complex data pipelines. Furthermore, its support for the Python programming language makes it easy for ML teams to adopt Airflow.
Project mention: How to Build a Chat App with Your Postgres Data using Agent Cloud | dev.to | 2024-05-13
AgentCloud uses Airbyte to build data pipelines, which allow us to split, chunk, and embed data from over 300 data sources, including Postgres.
Level 1 of MLOps is when you've put each lifecycle stage and its interfaces in an automated pipeline. The pipeline could be a Python or bash script, or it could be a directed acyclic graph run by some orchestration framework like Airflow, dagster or one of the cloud-provider offerings. AI- or data-specific platforms like MLflow, ClearML and dvc also feature pipeline capabilities.
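To make the "directed acyclic graph" idea concrete, here is a minimal sketch that uses only the Python standard library (`graphlib`, Python 3.9+). It is not how Airflow or dagster work internally. The task names, the sample rows, and the `run_pipeline` helper are all hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter


def extract():
    # Hypothetical source data; a real pipeline would read from an API, file, or database.
    return [{"name": "ada", "score": "91"}, {"name": "bob", "score": "78"}]


def transform(rows):
    # Normalize names and cast scores from strings to integers.
    return [{"name": r["name"].title(), "score": int(r["score"])} for r in rows]


def load(rows, sink):
    # Append the transformed rows to an in-memory "warehouse" and report the row count.
    sink.extend(rows)
    return len(rows)


def run_pipeline():
    sink = []
    results = {}
    # Each task maps to the set of tasks it depends on.
    dag = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
    tasks = {
        "extract": lambda: extract(),
        "transform": lambda: transform(results["extract"]),
        "load": lambda: load(results["transform"], sink),
    }
    # static_order() yields tasks so that every dependency runs before its dependents.
    order = list(TopologicalSorter(dag).static_order())
    for name in order:
        results[name] = tasks[name]()
    return order, sink


if __name__ == "__main__":
    order, sink = run_pipeline()
    print(order)
    print(sink)
```

An orchestration framework adds scheduling, retries, and monitoring on top of this ordering step, which is the main reason to adopt one once a script like this grows.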
Project mention: Read files from s3 using Pandas/s3fs or AWS Data Wrangler? | /r/dataengineering | 2023-12-06
I had no problem with awswrangler (https://github.com/aws/aws-sdk-pandas). It supports reading and writing partitions, which was really helpful, plus a few other optimizations that made it a great tool.
Project mention: Blockchain transactions decoding: making wallet activity understandable | dev.to | 2023-10-27
An event is a log entity that EVM smart contracts can emit during transaction execution. Events are very good at signalling that some action has taken place on-chain. Applications can subscribe and listen to events to trigger some off-chain logic, or they can index, transform and store events in some off-chain storage (look at The Graph protocol or Ethereum ETL).
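As a rough illustration of the "transform" step for events, the sketch below decodes an ERC-20 `Transfer` log into a flat record using only the standard library. It is not taken from Ethereum ETL. The `decode_transfer` helper and the sample log are hypothetical, and the log shape (`address`, `topics`, `data` as hex strings) is an assumption modelled on typical JSON-RPC receipts.

```python
# keccak256("Transfer(address,address,uint256)"), the first topic of every ERC-20 Transfer log.
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"


def topic_to_address(topic):
    # Indexed addresses are left-padded to 32 bytes; keep the last 20 bytes (40 hex characters).
    return "0x" + topic[-40:]


def decode_transfer(log):
    """Return a flat dict for an ERC-20 Transfer log, or None for any other log."""
    topics = log["topics"]
    # ERC-20 Transfer has exactly three topics: signature, from, to.
    # ERC-721 Transfer also indexes the token id, so it carries four topics and is skipped here.
    if len(topics) != 3 or topics[0].lower() != TRANSFER_TOPIC:
        return None
    return {
        "token": log["address"],
        "from": topic_to_address(topics[1]),
        "to": topic_to_address(topics[2]),
        "value": int(log["data"], 16),  # the non-indexed uint256 amount
    }


# Hypothetical sample log with made-up addresses, for illustration only.
sample_log = {
    "address": "0x" + "aa" * 20,
    "topics": [
        TRANSFER_TOPIC,
        "0x" + "00" * 12 + "11" * 20,
        "0x" + "00" * 12 + "22" * 20,
    ],
    "data": "0x" + format(1000, "064x"),
}

if __name__ == "__main__":
    print(decode_transfer(sample_log))
```

Records in this flat shape are straightforward to load into a table in off-chain storage, which is the "store" half of the workflow the excerpt describes.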
Project mention: Launch HN: Serra (YC S23) – Open-source, Python-based dbt alternative | news.ycombinator.com | 2023-08-14
There is also sqlmesh (https://sqlmesh.com/). Pretty new as well. It introduces some interesting concepts. For smaller dbt projects it could be a drop-in replacement as it allows importing dbt projects.
Project mention: Show HN: Neum AI – Open-source large-scale RAG framework | news.ycombinator.com | 2023-11-21
Interesting to see that the semantic chunking in the tools library is a wrapper around GPT-4. Asks GPT for the Python code and executes it: https://github.com/NeumTry/NeumAI/blob/main/neumai-tools/neu...
Project mention: Show HN: Open-source Rule-based PDF parser for RAG | news.ycombinator.com | 2024-01-23
Project mention: Recap: A python library for describing database tables and serialization formats with minimal type coercion. | /r/dataengineering | 2023-07-12
The GitHub repo: https://github.com/recap-build/recap
Project mention: Show HN: Generate JSON mock data for testing/initial app development | news.ycombinator.com | 2023-10-03
A friend of mine built a tool called Trex that you might find helpful, check it out here: https://github.com/automorphic-ai/trex
It's very consistent at generating templated data.
Python ETL discussion
Python ETL related posts
-
Show HN: Open-source Rule-based PDF parser for RAG
-
Prism: the easiest way to create robust data workflows. Accessible via CLI
-
Show HN: Prism – a framework for creating robust data science workflows
-
Show HN: Prism – Data Orchestration in Python
-
Introducing Prism: A Novel, Open-Source Data Orchestration Software. Feedback needed!
-
Prism - a lightweight, yet powerful data orchestration platform in Python. Accessible via CLI
-
Intelligently transform unstructured to structured output (JSON, Regex, CFG)
-
Index
What are some of the best open-source ETL projects in Python? This list will help you:
Rank | Project | Stars |
---|---|---|
1 | Airflow | 35,036 |
2 | airbyte | 14,584 |
3 | dagster | 10,552 |
4 | Mage | 7,284 |
5 | AWS Data Wrangler | 3,835 |
6 | ethereum-etl | 2,851 |
7 | mara-pipelines | 2,057 |
8 | sqlmesh | 1,424 |
9 | pyspark-example-project | 1,370 |
10 | pgsync | 1,090 |
11 | NeumAI | 795 |
12 | eland | 621 |
13 | baby-names-analysis | 563 |
14 | redun | 489 |
15 | versatile-data-kit | 414 |
16 | bitcoin-etl | 390 |
17 | ethereum-etl-airflow | 389 |
18 | astro-sdk | 326 |
19 | paperetl | 321 |
20 | recap | 309 |
21 | usaspending-api | 287 |
22 | trex | 239 |
23 | dbt-coves | 223 |