Top 14 Python data-quality Projects

ydata-profiling

Mentions 43 | Stars 12,022 | Activity 8.5 | Python

1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.

Project mention: FLaNK 25 December 2023 | dev.to | 2023-12-26
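
As a rough illustration of what a data-quality profile contains, the sketch below computes a few per-column statistics using only the Python standard library. It is not ydata-profiling's API or implementation; `profile_rows` and `SAMPLE_CSV` are hypothetical names made up for this example, and real profilers report far more (types, distributions, correlations).

```python
import csv
import io
from collections import Counter


def profile_rows(rows):
    """Return per-column counts of rows, missing values, and distinct values."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            stats = columns.setdefault(
                name, {"count": 0, "missing": 0, "values": Counter()}
            )
            stats["count"] += 1
            # Treat None and empty strings as missing (an assumption of this sketch).
            if value is None or value == "":
                stats["missing"] += 1
            else:
                stats["values"][value] += 1
    return {
        name: {
            "count": s["count"],
            "missing": s["missing"],
            "distinct": len(s["values"]),
            "most_common": s["values"].most_common(1)[0][0] if s["values"] else None,
        }
        for name, s in columns.items()
    }


# Hypothetical sample data: one missing temperature, one repeated city.
SAMPLE_CSV = "city,temp\nOslo,3\nOslo,\nLima,19\n"
report = profile_rows(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(report)
```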

great_expectations

Mentions 15 | Stars 9,440 | Activity 9.9 | Python

Always know what to expect from your data.
cleanlab

Mentions 69 | Stars 8,592 | Activity 9.4 | Python

The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

Project mention: [Research] Detecting Annotation Errors in Semantic Segmentation Data | /r/MachineLearning | 2023-11-05

We have freely open-sourced our new method for improving segmentation data, published a paper on the research behind it, and released a 5-min code tutorial. You can also read more in the blog if you'd like.

feast

Mentions 8 | Stars 5,246 | Activity 9.3 | Python

Feature Store for Machine Learning

Project mention: What's Happening with Feast? | news.ycombinator.com | 2023-12-07

data-diff

Mentions 20 | Stars 2,830 | Activity 9.6 | Python

Compare tables within or across databases

Project mention: How to Check 2 SQL Tables Are the Same | news.ycombinator.com | 2023-07-26

If the issue happens a lot, there is also: https://github.com/datafold/data-diff
It is a nice tool for doing this across databases as well.
I think it's based on a checksum method.
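
The checksum idea mentioned in the comment above can be sketched with the Python standard library. This is a minimal illustration of comparing two tables by hashing their rows, not data-diff's actual algorithm or API; `table_checksum` and the `src`/`dst` tables are hypothetical names, and hashing `repr(row)` is a simplification that only makes sense when both tables live in the same engine.

```python
import hashlib
import sqlite3


def table_checksum(conn, table, key):
    """Hash every row of `table`, ordered by `key`, into one hex digest."""
    digest = hashlib.sha256()
    # Table and column names are interpolated into the SQL string,
    # so they must come from trusted code, never from user input.
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY {key}"):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dst (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO src VALUES (1, 'a'), (2, 'b');
    INSERT INTO dst VALUES (1, 'a'), (2, 'b');
""")

# Identical contents give identical digests.
same_before = table_checksum(conn, "src", "id") == table_checksum(conn, "dst", "id")

# Changing a single value changes the digest.
conn.execute("UPDATE dst SET name = 'changed' WHERE id = 2")
same_after = table_checksum(conn, "src", "id") == table_checksum(conn, "dst", "id")

print(same_before, same_after)
```

A single whole-table digest only says whether the tables differ. Locating the differing rows would need extra work, for example hashing key ranges separately and narrowing down the ranges whose digests disagree.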

soda-core

Mentions 5 | Stars 1,751 | Activity 9.0 | Python

Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
cleanvision

Mentions 4 | Stars 921 | Activity 7.3 | Python

Automatically find issues in image datasets and practice data-centric computer vision.
piperider

Mentions 6 | Stars 466 | Activity 9.5 | Python

Code review for data in dbt

Project mention: Show HN: PipeRider – open-source Data Impact Analysis for dbt changes | news.ycombinator.com | 2023-09-06

Encord Active

Mentions 6 | Stars 420 | Activity 9.1 | Python

Open source active learning toolkit to find failure modes in your computer vision models, prioritize data to label next, and drive data curation to improve model performance.

Project mention: Launch HN: Encord (YC W21) – Unit testing for computer vision models | news.ycombinator.com | 2024-01-31

We base our pricing on your user and consumption scale and would be happy to discuss this with you directly. Please feel free to explore the OS version of Active at https://github.com/encord-team/encord-active. Note that some features, such as natural language search using GPU accelerated APIs, are not included in the cloud version.

cuallee

Mentions 5 | Stars 105 | Activity 9.1 | Python

Possibly the fastest DataFrame-agnostic quality check library in town.

Project mention: Show HN: Snowflake Data Quality Checks in Python | news.ycombinator.com | 2024-02-11

swiple

Mentions 1 | Stars 77 | Activity 2.7 | Python

Swiple enables you to easily observe, understand, validate and improve the quality of your data
soda-spark

Mentions 1 | Stars 60 | Activity 0.0 | Python

Soda Spark is a PySpark library that helps you test your data in Spark DataFrames.
panda_patrol

Mentions 2 | Stars 21 | Activity 9.2 | Python

Project mention: Show HN: Data monitoring and profiling with 1 function call | news.ycombinator.com | 2023-12-13

data_check

Mentions 1 | Stars 4 | Activity 8.3 | Python

data and pipeline testing with and for SQL

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions is the count of repo mentions in the last 12 months, or since we started tracking (Dec 2020).

Python data-quality related posts

Show HN: Snowflake Data Quality Checks in Python
1 project | news.ycombinator.com | 11 Feb 2024
Show HN: Data monitoring and profiling with 1 function call
1 project | news.ycombinator.com | 13 Dec 2023
[Research] Detecting Annotation Errors in Semantic Segmentation Data
1 project | /r/MachineLearning | 5 Nov 2023
[R] Automated Quality Assurance for Object Detection Datasets
1 project | /r/computervision | 28 Sep 2023
Show HN: PipeRider – open-source Data Impact Analysis for dbt changes
3 projects | news.ycombinator.com | 6 Sep 2023
[D] Is accurately estimating image quality even possible?
3 projects | /r/MachineLearning | 22 Apr 2023
Looking for Unit Testing framework in Database Migration Process
3 projects | /r/dataengineering | 23 Mar 2023

Index

What are some of the best open-source data-quality projects in Python? This list will help you:

#	Project	Stars
1	ydata-profiling	12,022
2	great_expectations	9,440
3	cleanlab	8,592
4	feast	5,246
5	data-diff	2,830
6	soda-core	1,751
7	cleanvision	921
8	piperider	466
9	Encord Active	420
10	cuallee	105
11	swiple	77
12	soda-spark	60
13	panda_patrol	21
14	data_check	4