Top 23 Python data-engineering Projects
The easiest way to build, run, and monitor data pipelines at scale.
Project mention: Prefect - The easiest way to automate your data | reddit.com/r/github | 2022-05-21
Always know what to expect from your data.
Project mention: Soda Core (OSS) is now GA! So, why should you add checks to your data pipelines? | reddit.com/r/dataengineering | 2022-06-28
Great Expectations (GE) is arguably the best-known OSS alternative to Soda Core. The third option is deequ, originally developed and open-sourced by AWS. Our community has told us that Soda Core is different because it's easy to get going and embed into data pipelines. It also allows some of the check-authoring work to be moved to other members of the data team. I'm sure there are also scenarios where Soda Core is not the best option, for example when you only use Pandas dataframes or develop in Scala.
Feature Store for Machine Learning
Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
Project mention: Automate some wrangling and data visualization in Python | reddit.com/r/aws | 2022-01-03
The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️
Project mention: How do you deal with parallelising parts of an ML pipeline especially on Python? | reddit.com/r/mlops | 2022-08-12
I also recommend checking out Ploomber. This open-source project can help you build code as templates, parallelize it, and parameterize it. There are also some reporting and debugging tools in there!
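To make "parameterize it and parallelize it" concrete, here is a minimal sketch using only the Python standard library. It is not Ploomber's API; `train_and_score`, `run_grid`, and the placeholder score are hypothetical names invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_score(n_estimators: int, max_depth: int) -> dict:
    """Hypothetical pipeline step; a real one would fit and evaluate a model."""
    # Placeholder arithmetic standing in for a real evaluation metric.
    score = n_estimators * 0.01 + max_depth * 0.1
    return {"n_estimators": n_estimators, "max_depth": max_depth, "score": score}


def run_grid(grid: dict, max_workers: int = 4) -> list:
    """Run the step once per parameter combination, using a pool of workers."""
    keys = list(grid)
    combos = [dict(zip(keys, values)) for values in product(*grid.values())]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with combos.
        return list(pool.map(lambda params: train_and_score(**params), combos))


results = run_grid({"n_estimators": [10, 50], "max_depth": [3, 5]})
best = max(results, key=lambda r: r["score"])
```

Threads suit I/O-bound steps. For CPU-bound work a process pool is the usual choice, in which case the step has to be passed as a picklable top-level function rather than a lambda.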
Efficiently diff rows across two different databases.
Project mention: data-diff | reddit.com/r/devopspro | 2022-07-11
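As a rough illustration of what diffing rows across two databases involves, the sketch below compares per-row hashes keyed by primary key, using two in-memory SQLite databases. It is not the data-diff API, and the `users` table, `make_db`, and `diff_tables` are made up for the example.

```python
import hashlib
import sqlite3


def row_checksums(conn, table, key):
    """Map each primary-key value to a hash of the full row."""
    # table and key are trusted identifiers in this sketch, not user input.
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY {key}")
    columns = [col[0] for col in cursor.description]
    key_index = columns.index(key)
    return {
        row[key_index]: hashlib.sha256(repr(row).encode()).hexdigest()
        for row in cursor
    }


def diff_tables(source, target, table, key="id"):
    """Return keys missing from target, extra in target, and changed."""
    src = row_checksums(source, table, key)
    dst = row_checksums(target, table, key)
    missing = sorted(src.keys() - dst.keys())
    extra = sorted(dst.keys() - src.keys())
    changed = sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k])
    return missing, extra, changed


def make_db(rows):
    """Build a throwaway in-memory database holding the example table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
    return conn


source = make_db([(1, "ann"), (2, "bob"), (3, "cy")])
target = make_db([(1, "ann"), (2, "bobby"), (4, "dee")])
missing, extra, changed = diff_tables(source, target, "users")
```

This version pulls every row from both sides, which is fine for small tables only. Doing it efficiently at scale means avoiding that full transfer, which this sketch does not attempt.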
Clean APIs for data cleaning. Python implementation of R package Janitor
Project mention: how important are learning the data manipulation libraries? | reddit.com/r/learndatascience | 2022-03-25
Data reliability tools for SQL- and Spark-accessible data
Project mention: Show HN: Soda Core is now GA – Test data like you would test your code | news.ycombinator.com | 2022-06-28
Machine Learning automation and tracking
Project mention: Discussion on Need of Feature Stores | reddit.com/r/mlops | 2022-07-17
Example end to end data engineering project.
A full data warehouse infrastructure with ETL pipelines running inside Docker on Apache Airflow for data orchestration, AWS Redshift as the cloud data warehouse, and Metabase for data visualization such as analytical dashboards.
Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).
Project mention: Average reply times from some of my Facebook friends over the last few years [OC], full article here: https://medium.com/@timsugaipov/taking-your-facebook-messenger-data-further-f9da079b1409?source=friends_link&sk=3bd04bb35ad9a4b6f586300e52f96e4f | reddit.com/r/dataisbeautiful | 2021-11-01
Data Processing: SAYN
Data pipelines from re-usable components
Create and manage data pipes with Meerschaum.
Project mention: Python ETL - Jupyter/Pandas/Postgresql(DW) - Project Structure and Scripting | reddit.com/r/ETL | 2022-06-23
I'm the author of the ETL framework Meerschaum which is meant for this exact purpose. You can build an ETL pipeline in a few lines of Python, e.g. here's a quick video. Check out the Getting Started guide and the docs on writing your first plugin to get your data flowing!
(project & tutorial) dag pipeline tests + ci/cd setup
Snowflake Snowpark Python API
Project mention: [GitHub] snowflakedb/snowpark-python: Snowflake Snowpark Python API (open source!) | reddit.com/r/snowflake | 2022-06-16
Convert monolithic Jupyter notebooks 📙 into maintainable Ploomber pipelines. 📊
Project mention: Tips and Tricks to Use Jupyter Notebooks Effectively | dev.to | 2022-08-01
If you're looking to improve your Jupyter workflow, check out Ploomber's open-source projects: Ploomber for developing modular data pipelines, Soorgeon for refactoring and cleaning, or nbsnapshot for notebook testing.
Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes
Project mention: How do you test your pipelines? | reddit.com/r/dataengineering | 2022-01-23
Since you already have Spark set up, perhaps it would be easier to build a DataFrame by loading data from the different tables and validate it in one go? You can give soda-spark a try (disclosure: I'm one of the developers); it lets you specify your checks declaratively in YAML and run the validations in Spark jobs.
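The idea of declaring checks as data and running them in one pass can be sketched in plain Python. This is an illustration only: the check types, the `CHECKS` structure, and `run_checks` are hypothetical and are not soda-spark's YAML schema or API, and a list of dicts stands in for a Spark DataFrame.

```python
# Checks declared as data; in practice this could be loaded from a YAML file.
# The check types and keys below are hypothetical, not soda-spark's schema.
CHECKS = [
    {"type": "row_count_min", "min": 1},
    {"type": "no_nulls", "column": "email"},
    {"type": "unique", "column": "id"},
]


def run_checks(rows, checks):
    """Evaluate each declared check against a list of row dicts."""
    results = {}
    for check in checks:
        kind = check["type"]
        column = check.get("column")
        if kind == "row_count_min":
            passed = len(rows) >= check["min"]
        elif kind == "no_nulls":
            passed = all(row.get(column) is not None for row in rows)
        elif kind == "unique":
            values = [row.get(column) for row in rows]
            passed = len(values) == len(set(values))
        else:
            raise ValueError(f"unknown check type: {kind}")
        label = kind if column is None else f"{kind}({column})"
        results[label] = passed
    return results


rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "c@example.com"},
]
report = run_checks(rows, CHECKS)
```

Keeping the checks in a data structure rather than in code is what lets someone who does not write the pipeline add or adjust a check.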
ByteHub: making feature stores simple
Official repo for the Materialize + Redpanda + dbt Hack Day 2022, including a sample project to get everyone started!
Project mention: Using Redpanda with Materialize and dbt for a faster, safer Kappa architecture | dev.to | 2022-03-16
This is state-of-the-art Kappa Architecture: Redpanda as a fast, durable log; Materialize for SQL-based streaming; and dbt for DataOps. This stack combines speed, ease of use, developer productivity, and governance. Best of all, you do not need to invest in setting up a large infrastructure: this entire stack can be packaged to run as a single Docker Compose project on your own laptop or workstation. You can try it out for yourself using this sample project.
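The core idea of a Kappa architecture, one append-only log feeding views that are updated incrementally and can be rebuilt by replaying the log, can be sketched in a few lines of Python. This is a toy model under stated assumptions: `Log` and `TotalsView` are hypothetical classes, not Redpanda, Materialize, or dbt APIs.

```python
from collections import defaultdict


class Log:
    """Append-only event log; stands in for the durable log layer."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the appended event

    def read_from(self, offset):
        return self._events[offset:]


class TotalsView:
    """Per-user running total; stands in for an incrementally maintained view."""

    def __init__(self, log):
        self.log = log
        self.offset = 0  # next log position this view has not consumed yet
        self.totals = defaultdict(int)

    def refresh(self):
        # Only events after the saved offset are processed.
        for event in self.log.read_from(self.offset):
            self.totals[event["user"]] += event["amount"]
            self.offset += 1
        return dict(self.totals)


log = Log()
log.append({"user": "ann", "amount": 5})
log.append({"user": "bob", "amount": 3})
view = TotalsView(log)
first = view.refresh()

log.append({"user": "ann", "amount": 2})
second = view.refresh()

# Reprocessing is a replay of the same log into a fresh view.
replayed = TotalsView(log).refresh()
```

Because there is a single stream path, a changed transformation is rolled out by replaying the log into a new view rather than by maintaining a separate batch pipeline.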
Data profiling, testing, and monitoring for SQL accessible data.
Project mention: Data Quality - Great Expectations for Data Engineers | reddit.com/r/dataengineering | 2022-03-18
I might be a bit biased, but that was my opinion even before I started contributing to Soda SQL.
Template Dagster repo using poetry and a single Docker container; works well with CICD
Project mention: Developing in Dagster | dev.to | 2022-03-25
The associated code repo can be found here
Sample project showing reliable data ingestion application using FastAPI and dramatiq
Project mention: Create and deploy a reliable data ingestion service with FastAPI, SQLModel and Dramatiq | reddit.com/r/FastAPI | 2021-09-02
Here is the GitHub repository with the source code of the app: https://github.com/frankie567/fastapi-dramatiq-data-ingestion
Python data-engineering related posts
Ploomber Convert: A free online tool to convert Jupyter notebooks to PDF
1 project | reddit.com/r/IPython | 9 Aug 2022
Analyze and plot 5.5M records in 20s with BigQuery and Ploomber
2 projects | dev.to | 8 Aug 2022
Tips and Tricks to Use Jupyter Notebooks Effectively
3 projects | dev.to | 1 Aug 2022
Discussion on Need of Feature Stores
1 project | reddit.com/r/mlops | 17 Jul 2022
[D] Your Preferred Feature Stores?
8 projects | reddit.com/r/datascience | 3 Jul 2022
ELT of my own Strava data using the Strava API, MySQL, Python, S3, Redshift, and Airflow
2 projects | reddit.com/r/dataengineering | 24 Jun 2022
Python ETL - Jupyter/Pandas/Postgresql(DW) - Project Structure and Scripting
1 project | reddit.com/r/ETL | 23 Jun 2022
What are some of the best open-source data-engineering projects in Python? This list will help you:
| 4 | AWS Data Wrangler | 2,999 |