Top 9 Python Data Engineering Projects
-
pyspark-on-aws-emr
The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly. Just focus on writing pyspark code.
-
metadata-guardian
Provides an easy way to protect your data sources with Python by searching their metadata. 🛡️
-
ticker_selection_BI_dashboard
Data engineering project: extracts stock data for 4 shares and uploads it to MySQL for use in a BI project.
If the issue happens a lot, there is also: https://github.com/datafold/data-diff
It is a nice tool for doing this across databases as well.
I think it's based on a checksum method.
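The checksum idea mentioned in the comment can be illustrated with a small, generic sketch: checksum a key range in both tables, and only descend into the halves whose checksums disagree. This is not data-diff's actual implementation or API. The function names `segment_checksum` and `diff_tables`, and the in-memory dicts standing in for database tables, are assumptions made purely for illustration.

```python
import hashlib


def segment_checksum(rows, lo, hi):
    """Checksum all (key, value) pairs with lo <= key < hi (hypothetical helper)."""
    digest = hashlib.md5()
    for key in sorted(k for k in rows if lo <= k < hi):
        digest.update(f"{key}:{rows[key]}\n".encode())
    return digest.hexdigest()


def diff_tables(a, b, lo, hi, min_size=2):
    """Return the keys in [lo, hi) whose rows differ between a and b."""
    # Matching checksums: treat the whole segment as identical and skip it.
    if segment_checksum(a, lo, hi) == segment_checksum(b, lo, hi):
        return []
    # Small segment: compare row by row.
    if hi - lo <= min_size:
        return [k for k in range(lo, hi) if a.get(k) != b.get(k)]
    # Otherwise bisect and recurse into each half.
    mid = (lo + hi) // 2
    return diff_tables(a, b, lo, mid, min_size) + diff_tables(a, b, mid, hi, min_size)


# Dicts keyed by an integer primary key stand in for two database tables.
source = {i: f"row-{i}" for i in range(16)}
target = dict(source)
target[5] = "changed"   # modified row
del target[11]          # missing row

print(diff_tables(source, target, 0, 16))
```

In a real cross-database setting each checksum would be computed by the database itself, so only small hashes travel over the network until a mismatching segment is narrowed down.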
Project mention: Launch HN: Serra (YC S23) – Open-source, Python-based dbt alternative | news.ycombinator.com | 2023-08-14
There is also sqlmesh (https://sqlmesh.com/). Pretty new as well. It introduces some interesting concepts. For smaller dbt projects it could be a drop-in replacement as it allows importing dbt projects.
Project mention: Launch HN: Grai (YC S22) – Open-Source Data Observability Platform | news.ycombinator.com | 2023-07-17
Elastic v2 if one is interested in such things: https://github.com/grai-io/grai-core/blob/v0.1.33/LICENSE
Index
What are some of the best open-source data engineering projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | data-diff | 2,842 |
2 | sqlmesh | 1,249 |
3 | grai-core | 269 |
4 | data-engineer-challenge | 25 |
5 | pyDag | 24 |
6 | pyspark-on-aws-emr | 24 |
7 | metadata-guardian | 18 |
8 | livyc | 3 |
9 | ticker_selection_BI_dashboard | 2 |