Python data-engineering

Open-source Python projects categorized as data-engineering

Top 23 Python data-engineering Projects

  • Prefect

    The easiest way to build, run, and monitor data pipelines at scale.

    Project mention: Prefect - The easiest way to automate your data | reddit.com/r/github | 2022-05-21
  • great_expectations

    Always know what to expect from your data.

    Project mention: Soda Core (OSS) is now GA! So, why should you add checks to your data pipelines? | reddit.com/r/dataengineering | 2022-06-28

    GE is arguably the most well known OSS alternative to Soda Core. The third option is deequ, originally developed and released in OSS by AWS. Our community has told us that Soda Core is different because it’s easy to get going and embed into data pipelines. And it also allows some of the check authoring work to be moved to other members of the data team. I'm sure there are also scenarios where Soda Core is not the best option. For example, when you only use Pandas dataframes or develop in Scala.

  • feast

    Feature Store for Machine Learning

    Project mention: [D] Your 🫵 Preferred Feature Stores? | reddit.com/r/datascience | 2022-07-03
  • AWS Data Wrangler

    Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

    Project mention: Automate some wrangling and data visualization in Python | reddit.com/r/aws | 2022-01-03
  • ploomber

    The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

    Project mention: How do you deal with parallelising parts of an ML pipeline especially on Python? | reddit.com/r/mlops | 2022-08-12

    I also recommend checking out ploomber. This open-source tool can help you build your code as templates, then parallelize and parameterize it. It also includes some reporting and debugging tools!

  • data-diff

    Efficiently diff rows across two different databases.

    Project mention: data-diff | reddit.com/r/devopspro | 2022-07-11
  • pyjanitor

    Clean APIs for data cleaning. Python implementation of R package Janitor

    Project mention: how important are learning the data manipulation libraries? | reddit.com/r/learndatascience | 2022-03-25

  • soda-core

    Data reliability tools for SQL- and Spark-accessible data

    Project mention: Show HN: Soda Core is now GA – Test data like you would test your code | news.ycombinator.com | 2022-06-28
  • mlrun

    Machine Learning automation and tracking

    Project mention: Discussion on Need of Feature Stores | reddit.com/r/mlops | 2022-07-17
  • DataEngineeringProject

    Example end to end data engineering project.

  • Skytrax-Data-Warehouse

    A full data warehouse infrastructure with ETL pipelines running inside docker on Apache Airflow for data orchestration, AWS Redshift for cloud data warehouse and Metabase to serve the needs of data visualizations such as analytical dashboards.

  • sayn

    Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).

    Project mention: Average reply times from some of my Facebook friends over the last few years [OC], full article here: https://medium.com/@timsugaipov/taking-your-facebook-messenger-data-further-f9da079b1409?source=friends_link&sk=3bd04bb35ad9a4b6f586300e52f96e4f | reddit.com/r/dataisbeautiful | 2021-11-01

    Data Processing: SAYN

  • patterns-devkit

    Data pipelines from re-usable components

  • Meerschaum

    Create and manage data pipes with Meerschaum.

    Project mention: Python ETL - Jupyter/Pandas/Postgresql(DW) - Project Structure and Scripting | reddit.com/r/ETL | 2022-06-23

    I'm the author of the ETL framework Meerschaum which is meant for this exact purpose. You can build an ETL pipeline in a few lines of Python, e.g. here's a quick video. Check out the Getting Started guide and the docs on writing your first plugin to get your data flowing!

  • airflow-testing-ci-workflow

    (project & tutorial) dag pipeline tests + ci/cd setup

  • snowpark-python

    Snowflake Snowpark Python API

    Project mention: [GitHub] snowflakedb/snowpark-python: Snowflake Snowpark Python API (open source!) | reddit.com/r/snowflake | 2022-06-16
  • soorgeon

    Convert monolithic Jupyter notebooks 📙 into maintainable Ploomber pipelines. 📊

    Project mention: Tips and Tricks to Use Jupyter Notebooks Effectively | dev.to | 2022-08-01

    If you're looking to improve your Jupyter workflow, check out Ploomber's open-source projects: Ploomber for developing modular data pipelines, Soorgeon for refactoring and cleaning, or nbsnapshot for notebook testing.

  • soda-spark

    Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes

    Project mention: How do you test your pipelines? | reddit.com/r/dataengineering | 2022-01-23

    Since you already have Spark set up, perhaps it would be easier to build a DataFrame by loading data from the different tables and validate it in one go? You can give soda-spark a try (disclosure: I'm one of the developers): you specify your checks declaratively in YAML and run the validations in Spark jobs.

  • bytehub

    ByteHub: making feature stores simple

    Project mention: [D] Your 🫵 Preferred Feature Stores? | reddit.com/r/datascience | 2022-07-03
  • mz-hack-day-2022

    Official repo for the Materialize + Redpanda + dbt Hack Day 2022, including a sample project to get everyone started!

    Project mention: Using Redpanda with Materialize and dbt for a faster, safer Kappa architecture | dev.to | 2022-03-16

    This is state-of-the-art Kappa Architecture: Redpanda as a fast, durable log; Materialize for SQL-based streaming; and dbt for DataOps. This stack combines speed, ease of use, developer productivity, and governance. Best of all, you do not need to invest in setting up a large infrastructure: the entire stack can be packaged to run as a single Docker Compose project on your own laptop or workstation. You can try it out for yourself using this sample project.

  • soda-sql

    Data profiling, testing, and monitoring for SQL accessible data.

    Project mention: Data Quality - Great Expectations for Data Engineers | reddit.com/r/dataengineering | 2022-03-18

    I might be a bit biased, but that was my opinion even before I started contributing to Soda SQL.

  • dagster-example-pipeline

    Template Dagster repo using poetry and a single Docker container; works well with CICD

    Project mention: Developing in Dagster | dev.to | 2022-03-25

    The associated code repo can be found here

  • fastapi-dramatiq-data-ingestion

    Sample project showing reliable data ingestion application using FastAPI and dramatiq

    Project mention: Create and deploy a reliable data ingestion service with FastAPI, SQLModel and Dramatiq | reddit.com/r/FastAPI | 2021-09-02

    Here is the GitHub repository with the source code of the app: https://github.com/frankie567/fastapi-dramatiq-data-ingestion
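The soda-spark entry in the list above describes specifying data checks declaratively and running them as a validation step. As a rough illustration of that idea only, here is a minimal sketch in plain Python. It does not use soda-spark or its real configuration format: the `CHECKS` dict stands in for a YAML file, rows are plain dicts rather than a Spark DataFrame, and the rule names (`row_count_min`, `not_null`, `unique`, `min`) are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical check definitions; a plain dict stands in for a YAML file.
CHECKS: Dict[str, Any] = {
    "row_count_min": 1,
    "columns": {
        "id": {"not_null": True, "unique": True},
        "amount": {"not_null": True, "min": 0},
    },
}


def run_checks(rows: List[Dict[str, Any]], checks: Dict[str, Any]) -> List[str]:
    """Apply the declared checks and return failure messages (empty if all pass)."""
    failures: List[str] = []

    # Table-level rule: minimum number of rows.
    minimum = checks.get("row_count_min", 0)
    if len(rows) < minimum:
        failures.append(f"row_count {len(rows)} < {minimum}")

    # Column-level rules, evaluated in the order they are declared.
    for column, rules in checks.get("columns", {}).items():
        values = [row.get(column) for row in rows]
        present = [v for v in values if v is not None]
        if rules.get("not_null") and len(present) != len(values):
            failures.append(f"{column}: {len(values) - len(present)} null value(s)")
        if rules.get("unique") and len(set(present)) != len(present):
            failures.append(f"{column}: duplicate values")
        if "min" in rules and any(v < rules["min"] for v in present):
            failures.append(f"{column}: value below {rules['min']}")
    return failures


if __name__ == "__main__":
    sample = [{"id": 1, "amount": 10.0}, {"id": 1, "amount": None}]
    for message in run_checks(sample, CHECKS):
        print("FAILED:", message)
```

Keeping the rules in data rather than code is what lets other members of a data team author checks without touching the pipeline itself; a real tool would load them from a file and run them against the actual tables.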

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-08-12.

Index

What are some of the best open-source data-engineering projects in Python? This list will help you:

Rank  Project                          Stars
   1  Prefect                          9,778
   2  great_expectations               6,989
   3  feast                            3,466
   4  AWS Data Wrangler                2,999
   5  ploomber                         2,614
   6  data-diff                        1,515
   7  pyjanitor                          956
   8  soda-core                          947
   9  mlrun                              778
  10  DataEngineeringProject             546
  11  Skytrax-Data-Warehouse             106
  12  sayn                               106
  13  patterns-devkit                     78
  14  Meerschaum                          69
  15  airflow-testing-ci-workflow         66
  16  snowpark-python                     59
  17  soorgeon                            55
  18  soda-spark                          55
  19  bytehub                             49
  20  mz-hack-day-2022                    48
  21  soda-sql                            38
  22  dagster-example-pipeline            34
  23  fastapi-dramatiq-data-ingestion     28