Top 23 Python data-engineering Projects
-
-
Project mention: Soda Core (OSS) is now GA! So, why should you add checks to your data pipelines? | reddit.com/r/dataengineering | 2022-06-28
GE is arguably the best-known OSS alternative to Soda Core. The third option is deequ, originally developed and released as OSS by AWS. Our community has told us that Soda Core is different because it’s easy to get going and embed into data pipelines. It also allows some of the check authoring work to be moved to other members of the data team. I'm sure there are also scenarios where Soda Core is not the best option, for example when you only use Pandas dataframes or develop in Scala.
-
-
AWS Data Wrangler
Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
Project mention: Automate some wrangling and data visualization in Python | reddit.com/r/aws | 2022-01-03
Project mention: How do you deal with parallelising parts of an ML pipeline especially on Python? | reddit.com/r/mlops | 2022-08-12
I also recommend checking out ploomber. This open-source tool can help you build code as templates, parallelize it, and parameterize it. There are also some reporting and debugging tools in there!
-
-
Project mention: how important are learning the data manipulation libraries? | reddit.com/r/learndatascience | 2022-03-25
-
Project mention: Show HN: Soda Core is now GA – Test data like you would test your code | news.ycombinator.com | 2022-06-28
-
-
-
Skytrax-Data-Warehouse
A full data warehouse infrastructure with ETL pipelines running inside docker on Apache Airflow for data orchestration, AWS Redshift for cloud data warehouse and Metabase to serve the needs of data visualizations such as analytical dashboards.
-
sayn
Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).
Project mention: Average reply times from some of my Facebook friends over the last few years [OC], full article here: https://medium.com/@timsugaipov/taking-your-facebook-messenger-data-further-f9da079b1409?source=friends_link&sk=3bd04bb35ad9a4b6f586300e52f96e4f | reddit.com/r/dataisbeautiful | 2021-11-01
Data Processing: SAYN
-
-
Project mention: Python ETL - Jupyter/Pandas/Postgresql(DW) - Project Structure and Scripting | reddit.com/r/ETL | 2022-06-23
I'm the author of the ETL framework Meerschaum which is meant for this exact purpose. You can build an ETL pipeline in a few lines of Python, e.g. here's a quick video. Check out the Getting Started guide and the docs on writing your first plugin to get your data flowing!
-
-
Project mention: [GitHub] snowflakedb/snowpark-python: Snowflake Snowpark Python API (open source!) | reddit.com/r/snowflake | 2022-06-16
-
If you're looking to improve your Jupyter workflow, check out Ploomber's open-source projects: Ploomber for developing modular data pipelines, Soorgeon for refactoring and cleaning, or nbsnapshot for notebook testing.
-
soda-spark
Soda Spark is a PySpark library that helps you with testing your data in Spark DataFrames.
Since you already have Spark set up, perhaps it would be easier to build a DataFrame by loading data from different tables and validate it in one go? You can give soda-spark a try (disclosure: I'm one of the developers). With it you can specify your checks declaratively in YAML and run the validations in Spark jobs.
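The idea of declarative checks described above can be illustrated with a small standard-library sketch. This is not soda-spark's API or its YAML syntax: the check names (`row_count_min`, `no_missing`, `unique`), the `run_checks` helper, and the sample rows are all hypothetical, and plain Python dictionaries stand in for both the parsed YAML file and the Spark DataFrame.

```python
from typing import Any, Callable

Row = dict[str, Any]

# Hypothetical check vocabulary for illustration only; not Soda's check syntax.
CHECKS: dict[str, Callable[[list[Row], dict], bool]] = {
    "row_count_min": lambda rows, p: len(rows) >= p["value"],
    "no_missing": lambda rows, p: all(r.get(p["column"]) is not None for r in rows),
    "unique": lambda rows, p: len({r.get(p["column"]) for r in rows}) == len(rows),
}


def run_checks(rows: list[Row], spec: list[dict]) -> dict[str, bool]:
    """Evaluate every check in the spec and return {label: passed}."""
    results = {}
    for check in spec:
        kind = check["check"]
        # Label each result with the check name and its column (or threshold).
        label = f"{kind}({check.get('column', check.get('value'))})"
        results[label] = CHECKS[kind](rows, check)
    return results


# The spec is plain data, so it could be loaded from a YAML or JSON file
# and edited by someone who does not write pipeline code.
spec = [
    {"check": "row_count_min", "value": 2},
    {"check": "no_missing", "column": "email"},
    {"check": "unique", "column": "id"},
]

# Sample rows standing in for a DataFrame (assumed data, for illustration).
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "c@example.com"},
]

results = run_checks(rows, spec)
for label, passed in results.items():
    print(f"{label}: {'PASS' if passed else 'FAIL'}")
```

Keeping the checks as data rather than code is the design choice that matters here: the validation logic is written once, and the list of checks can grow without touching the pipeline.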
-
-
mz-hack-day-2022
Official repo for the Materialize + Redpanda + dbt Hack Day 2022, including a sample project to get everyone started!
Project mention: Using Redpanda with Materialize and dbt for a faster, safer Kappa architecture | dev.to | 2022-03-16
This is state-of-the-art Kappa Architecture: Redpanda as a fast, durable log; Materialize for SQL-based streaming; and dbt for DataOps. This stack combines speed, ease of use, developer productivity, and governance. Best of all, you do not need to invest in setting up a large infrastructure: this entire stack can be packaged to run as a single Docker Compose project on your own laptop or workstation. You can try it out for yourself using this sample project.
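To make the Kappa idea concrete, here is a minimal standard-library sketch: one append-only log is the source of truth, and a view is kept up to date by processing only the events it has not yet seen. It does not use Redpanda, Materialize, or dbt; the `EventLog` and `TotalsView` classes and the sample events are hypothetical stand-ins for a durable log and an incrementally maintained view.

```python
from collections import defaultdict


class EventLog:
    """Append-only in-memory log standing in for a durable streaming log."""

    def __init__(self):
        self._events = []

    def append(self, event: dict) -> int:
        """Store an event and return its offset in the log."""
        self._events.append(event)
        return len(self._events) - 1

    def read_from(self, offset: int) -> list:
        """Return all events at or after the given offset."""
        return self._events[offset:]


class TotalsView:
    """Incrementally maintained view: total amount per user."""

    def __init__(self, log: EventLog):
        self.log = log
        self.offset = 0  # next log offset this view has not processed yet
        self.totals = defaultdict(int)

    def refresh(self) -> dict:
        """Apply only the unseen events, then return the current totals."""
        for event in self.log.read_from(self.offset):
            self.totals[event["user"]] += event["amount"]
            self.offset += 1
        return dict(self.totals)


log = EventLog()
log.append({"user": "ana", "amount": 5})
log.append({"user": "bo", "amount": 3})

view = TotalsView(log)
first = view.refresh()

log.append({"user": "ana", "amount": 2})
second = view.refresh()  # processes only the one new event

# Reprocessing is just replaying the same log from offset 0 into a fresh view.
rebuilt = TotalsView(log).refresh()
print(first, second, rebuilt)
```

Because there is a single processing path, rebuilding a view after a logic change needs no separate batch job: a new view replays the log from the start and reaches the same state.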
-
Project mention: Data Quality - Great Expectations for Data Engineers | reddit.com/r/dataengineering | 2022-03-18
I might be a bit biased, but that was my opinion even before I started contributing to Soda SQL.
-
dagster-example-pipeline
Template Dagster repo using poetry and a single Docker container; works well with CI/CD
The associated code repo can be found here
-
fastapi-dramatiq-data-ingestion
Sample project showing reliable data ingestion application using FastAPI and dramatiq
Project mention: Create and deploy a reliable data ingestion service with FastAPI, SQLModel and Dramatiq | reddit.com/r/FastAPI | 2021-09-02
Here is the GitHub repository with the source code of the app: https://github.com/frankie567/fastapi-dramatiq-data-ingestion
Python data-engineering related posts
- Ploomber Convert: A free online tool to convert Jupyter notebooks to PDF
- Analyze and plot 5.5M records in 20s with BigQuery and Ploomber
- Tips and Tricks to Use Jupyter Notebooks Effectively
- Discussion on Need of Feature Stores
- [D] Your 🫵 Preferred Feature Stores?
- ELT of my own Strava data using the Strava API, MySQL, Python, S3, Redshift, and Airflow
- Python ETL - Jupyter/Pandas/Postgresql(DW) - Project Structure and Scripting
Index
What are some of the best open-source data-engineering projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | Prefect | 9,778 |
2 | great_expectations | 6,989 |
3 | feast | 3,466 |
4 | AWS Data Wrangler | 2,999 |
5 | ploomber | 2,614 |
6 | data-diff | 1,515 |
7 | pyjanitor | 956 |
8 | soda-core | 947 |
9 | mlrun | 778 |
10 | DataEngineeringProject | 546 |
11 | Skytrax-Data-Warehouse | 106 |
12 | sayn | 106 |
13 | patterns-devkit | 78 |
14 | Meerschaum | 69 |
15 | airflow-testing-ci-workflow | 66 |
16 | snowpark-python | 59 |
17 | soorgeon | 55 |
18 | soda-spark | 55 |
19 | bytehub | 49 |
20 | mz-hack-day-2022 | 48 |
21 | soda-sql | 38 |
22 | dagster-example-pipeline | 34 |
23 | fastapi-dramatiq-data-ingestion | 28 |