Top 10 Python dataengineering Projects
pyspark-on-aws-emr
The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly. Just focus on writing pyspark code.
-
metadata-guardian
Provides an easy way to protect your data sources with Python by searching their metadata. 🛡️
-
Project mention: AthenaSQL: SQL query builder for AWS Athena, inspired by pySpark SQL | news.ycombinator.com | 2024-11-04
Hi Everyone,
I work in adtech, where we handle massive log-level data. To cut costs and improve performance for ML and optimization, my team and I chose a lakehouse approach using AWS (S3 + OTFs / partitioned Parquet + Athena + Glue).
One challenge we faced with this data stack was managing Athena queries in our ETL jobs. Since Athena handles much of our data-heavy processing, we ended up storing hundreds of lines of query code as strings in Python scripts, which quickly became a nightmare to maintain.
We needed something similar to PySpark SQL that could output SQL strings compatible with Athena, so we built athenaSQL. It mimics the PySpark SQL API, providing a familiar interface and outputting SQL queries directly.
It is far from complete at the moment, but it has most of the basic query statements. I would love it if you could test it out and share any feedback! I hope someone is in need of such a tool. If it lacks the functionality you are seeking, let's build it together! And feel free to critique it as much as you like. :)
github: https://github.com/nabilseid/athenaSQL
docs: github.com/nabilseid/athenaSQL
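To illustrate the general idea described in the post, here is a minimal sketch of a chainable query builder that emits a SQL string instead of executing anything. This is not athenaSQL's API: the `Query` class, its methods, and the `impressions` table are hypothetical names chosen for illustration, and conditions are passed as raw strings without any escaping or validation.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Query:
    """Hypothetical immutable builder; each method returns a new Query."""

    table: str
    columns: Tuple[str, ...] = ("*",)
    conditions: Tuple[str, ...] = ()
    row_limit: Optional[int] = None

    def select(self, *columns: str) -> "Query":
        # Fall back to "*" when no columns are given.
        return replace(self, columns=columns or ("*",))

    def filter(self, condition: str) -> "Query":
        # Conditions accumulate and are joined with AND in to_sql().
        return replace(self, conditions=self.conditions + (condition,))

    def limit(self, n: int) -> "Query":
        return replace(self, row_limit=n)

    def to_sql(self) -> str:
        # Assemble the clauses in standard SQL order.
        parts = [f"SELECT {', '.join(self.columns)}", f"FROM {self.table}"]
        if self.conditions:
            parts.append("WHERE " + " AND ".join(f"({c})" for c in self.conditions))
        if self.row_limit is not None:
            parts.append(f"LIMIT {self.row_limit}")
        return "\n".join(parts)


query = (
    Query("impressions")
    .select("campaign_id", "cost")
    .filter("dt = '2024-11-01'")
    .filter("cost > 0")
    .limit(100)
)
sql = query.to_sql()
print(sql)
```

Because each call returns a new object, partial queries can be stored and reused across ETL jobs rather than copied around as long string literals, which is the maintenance problem the post describes.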
-
ticker_selection_BI_dashboard
Data engineering project: extracts stock data for 4 shares and uploads it to MySQL for use in a BI project.
-
Index
What are some of the best open-source dataengineering projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | sqlmesh | 2,256 |
2 | grai-core | 303 |
3 | beginner_de_project_stream | 100 |
4 | pyspark-on-aws-emr | 26 |
5 | data-engineer-challenge | 25 |
6 | pyDag | 24 |
7 | metadata-guardian | 16 |
8 | athenaSQL | 6 |
9 | ticker_selection_BI_dashboard | 4 |
10 | livyc | 3 |