Python dataengineering

Open-source Python projects categorized as dataengineering

Top 10 Python dataengineering Projects

  1. sqlmesh

    Scalable and efficient data transformation framework - backwards compatible with dbt.

  2. grai-core

  3. beginner_de_project_stream

    Simple stream processing pipeline

  4. pyspark-on-aws-emr

    The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly, so you can focus on writing PySpark code.

  5. data-engineer-challenge

    Challenge Data Engineer

  6. pyDag

    Scheduling Big Data Workloads and Data Pipelines in the Cloud with pyDag

  7. metadata-guardian

    Provide an easy way with Python to protect your data sources by searching its metadata. 🛡️

  8. athenaSQL

    SQL builder for AWS Athena, inspired by sparkSQL

    Project mention: AthenaSQL: SQL query builder for AWS Athena, inspired by pySpark SQL | news.ycombinator.com | 2024-11-04

    Hi Everyone,

    I work in adtech, where we handle massive log-level data. To cut costs and improve performance for ML and optimization, my team and I chose a lakehouse approach using AWS (S3 + OTFs / partitioned Parquet + Athena + Glue).

    One challenge we faced with this data stack was managing Athena queries in our ETL jobs. Since Athena handles much of our data-heavy processing, we ended up storing hundreds of lines of query code as strings in Python scripts, which quickly became a nightmare to maintain.

    We needed something similar to PySpark SQL that could output SQL strings compatible with Athena. So we built athenaSQL. It mimics the PySpark SQL API, providing a familiar interface and outputting SQL queries directly.

    It is far from complete at the moment, but it has most of the basic query statements. I would love it if you could test it out and share any feedback! I hope someone is in need of such a tool. If it lacks the functionality you are seeking, let's build it together! And feel free to critique it as much as you like. :)

    github: https://github.com/nabilseid/athenaSQL

    docs: github.com/nabilseid/athenaSQL

  9. ticker_selection_BI_dashboard

    Data engineering project: extracts stock data for 4 shares and uploads it to MySQL for use in a BI project

  10. livyc

    Apache Spark as a Service with Apache Livy Client

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
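
The athenaSQL post above describes replacing long SQL strings embedded in Python scripts with a PySpark-style builder that emits Athena-compatible SQL. As a rough illustration of that general idea only, here is a minimal sketch of a chainable builder. The `Query` class, its methods, and the `ad_logs` table are all hypothetical and are not athenaSQL's actual API; see the project's repository for the real interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Query:
    """Hypothetical chainable SQL builder; NOT athenaSQL's actual API."""

    table: str
    columns: List[str] = field(default_factory=list)
    conditions: List[str] = field(default_factory=list)
    row_limit: Optional[int] = None

    def select(self, *columns: str) -> "Query":
        # Accumulate column names; returning self allows method chaining.
        self.columns.extend(columns)
        return self

    def where(self, condition: str) -> "Query":
        # Conditions are raw SQL fragments, combined with AND in to_sql().
        self.conditions.append(condition)
        return self

    def limit(self, n: int) -> "Query":
        self.row_limit = n
        return self

    def to_sql(self) -> str:
        # Render the accumulated parts as a single SQL string.
        cols = ", ".join(self.columns) if self.columns else "*"
        sql = f"SELECT {cols} FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(f"({c})" for c in self.conditions)
        if self.row_limit is not None:
            sql += f" LIMIT {self.row_limit}"
        return sql


# Hypothetical table and column names, for illustration only.
query = (
    Query("ad_logs")
    .select("campaign_id", "impressions")
    .where("dt = '2024-11-01'")
    .where("impressions > 0")
    .limit(100)
)
print(query.to_sql())
```

This sketch passes conditions through as raw strings with no escaping or validation, so it only shows the shape of the approach: the query lives in composable Python calls rather than in one large string literal.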

Index

What are some of the best open-source dataengineering projects in Python? This list will help you:

#    Project                          Stars
1    sqlmesh                          2,256
2    grai-core                          303
3    beginner_de_project_stream         100
4    pyspark-on-aws-emr                  26
5    data-engineer-challenge             25
6    pyDag                               24
7    metadata-guardian                   16
8    athenaSQL                            6
9    ticker_selection_BI_dashboard        4
10   livyc                                3
