Top 14 Python data-quality Projects
- ydata-profiling: 1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
- cleanlab: The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
- soda-core: Data quality testing for the modern data stack (SQL, Spark, and Pandas). https://www.soda.io
- Encord Active: Open source active learning toolkit to find failure modes in your computer vision models, prioritize data to label next, and drive data curation to improve model performance.
- swiple: Swiple enables you to easily observe, understand, validate and improve the quality of your data.
- soda-spark: Soda Spark is a PySpark library that helps you test your data in Spark DataFrames.
Project mention: [Research] Detecting Annotation Errors in Semantic Segmentation Data | /r/MachineLearning | 2023-11-05
We have freely open-sourced our new method for improving segmentation data, published a paper on the research behind it, and released a 5-min code tutorial. You can also read more in the blog if you'd like.
If the issue happens a lot, there is also: https://github.com/datafold/data-diff
It is a nice tool for doing this across databases as well.
I think it's based on a checksum method.
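To illustrate what a checksum-based comparison can look like, here is a minimal sketch in plain Python. It is not data-diff's code or API: the function names (`row_checksum`, `segment_checksum`, `diff_keys`) and the in-memory `{key: row}` table representation are assumptions made for the example. The idea is to hash whole segments of rows, skip segments whose checksums match, and split mismatching segments until the differing keys are isolated.

```python
import hashlib


def row_checksum(row):
    """Hash one row (or None for a missing row) into a hex digest."""
    return hashlib.md5(repr(row).encode("utf-8")).hexdigest()


def segment_checksum(rows):
    """Combine per-row checksums into a single digest for a segment."""
    digest = hashlib.md5()
    for row in rows:
        digest.update(row_checksum(row).encode("ascii"))
    return digest.hexdigest()


def diff_keys(table_a, table_b, keys=None, min_segment=2):
    """Return the keys whose rows differ between two {key: row} tables.

    Hypothetical helper: compares segment checksums first and only
    inspects individual rows once a mismatching segment is small enough.
    """
    if keys is None:
        keys = sorted(set(table_a) | set(table_b))
    if not keys:
        return []

    rows_a = [table_a.get(key) for key in keys]
    rows_b = [table_b.get(key) for key in keys]
    if segment_checksum(rows_a) == segment_checksum(rows_b):
        return []  # identical segment, nothing to inspect

    if len(keys) <= max(1, min_segment):
        # Small enough: compare rows directly.
        return [key for key in keys if table_a.get(key) != table_b.get(key)]

    # Otherwise split the key range in half and recurse into each half.
    mid = len(keys) // 2
    return (diff_keys(table_a, table_b, keys[:mid], min_segment)
            + diff_keys(table_a, table_b, keys[mid:], min_segment))


source = {1: ("a", 10), 2: ("b", 20), 3: ("c", 30), 4: ("d", 40)}
target = {1: ("a", 10), 2: ("b", 21), 3: ("c", 30), 5: ("e", 50)}
print(diff_keys(source, target))
```

In a real cross-database setting the segment checksums would be computed inside each database, so that only digests, not rows, cross the network; this sketch keeps everything in memory to stay self-contained.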
Project mention: Show HN: PipeRider – open-source Data Impact Analysis for dbt changes | news.ycombinator.com | 2023-09-06
Project mention: Launch HN: Encord (YC W21) – Unit testing for computer vision models | news.ycombinator.com | 2024-01-31
We base our pricing on your user and consumption scale and would be happy to discuss this with you directly. Please feel free to explore the open-source version of Active at https://github.com/encord-team/encord-active. Note that some features, such as natural language search using GPU accelerated APIs, are not included in the cloud version.
Project mention: Show HN: Snowflake Data Quality Checks in Python | news.ycombinator.com | 2024-02-11
Project mention: Show HN: Data monitoring and profiling with 1 function call | news.ycombinator.com | 2023-12-13
Python data-quality related posts
- Show HN: Snowflake Data Quality Checks in Python
- Show HN: Data monitoring and profiling with 1 function call
- [Research] Detecting Annotation Errors in Semantic Segmentation Data
- [R] Automated Quality Assurance for Object Detection Datasets
- Show HN: PipeRider – open-source Data Impact Analysis for dbt changes
- [D] Is accurately estimating image quality even possible?
- Looking for Unit Testing framework in Database Migration Process
Index
What are some of the best open-source data-quality projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | ydata-profiling | 12,022 |
2 | great_expectations | 9,440 |
3 | cleanlab | 8,592 |
4 | feast | 5,246 |
5 | data-diff | 2,830 |
6 | soda-core | 1,751 |
7 | cleanvision | 921 |
8 | piperider | 466 |
9 | Encord Active | 420 |
10 | cuallee | 105 |
11 | swiple | 77 |
12 | soda-spark | 60 |
13 | panda_patrol | 21 |
14 | data_check | 4 |