Top 16 Python data-quality Projects
-
ydata-profiling
1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
WhatTheDuck does SQL with duckdb-wasm IIRC
Pygwalker does open-source descriptive statistics and charts from pandas dataframes: https://github.com/Kanaries/pygwalker
ydata-profiling does Exploratory Data Analysis (EDA) with Pandas and Spark DataFrames and integrates with various apps: https://github.com/ydataai/ydata-profiling
-
cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Project mention: Ask HN: Not a webdev, why are these sites so good? | news.ycombinator.com | 2024-06-18
https://cleanlab.ai/
-
-
Project mention: Launch HN: Enhanced Radar (YC W25) – A safety net for air traffic control | news.ycombinator.com | 2025-03-04
Are there already "bird / not a bird" datasets?
Procedures for creating "bird on multispectral plane radar and video" dataset(s):
Tag birds on the dashcam video with timecoded sensor data and a segmentation and annotation tool.
Pinch to zoom, auto-edge detect, classification probability, sensor status
voxel51/fiftyone does segmentation and annotation with video and possibly multispectral data: https://github.com/voxel51/fiftyone
-
Project mention: Transforming Your PDFs for RAG with Open Source Using Docling, Milvus, and Feast | news.ycombinator.com | 2025-04-22
Hey folks!
I recently gave a talk with the Milvus Community showing a demo of how to transform PDFs with Feast using Docling for RAG.
The tutorial is available here: https://github.com/feast-dev/feast/tree/master/examples/rag-...
And the video is available here: https://www.youtube.com/watch?v=DPPtr9Q6_qE
The goal with having a feature store transform and retrieve your data for RAG is that (1) we make it easy to configure vector retrieval with just a boolean in the code declaration and (2) you can use existing tooling that data scientists / ml engineers are already familiar with.
I'd love any feedback or ideas on how we could make things better or easier. The Feast maintainers have quite a lot in the pipeline (batch transformations, support for Ray, computer vision and more).
Thanks a ton!
-
soda-core
Data quality testing for the modern data stack (SQL, Spark, and Pandas): https://www.soda.io
-
-
-
Encord Active
Open source active learning toolkit to find failure modes in your computer vision models, prioritize data to label next, and drive data curation to improve model performance.
-
-
-
data-observability-installer
Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.
-
swiple
Swiple enables you to easily observe, understand, validate and improve the quality of your data
-
soda-spark
Soda Spark is a PySpark library that helps you test your data in Spark DataFrames.
-
Python data-quality discussion
Python data-quality related posts
-
Ask HN: Not a webdev, why are these sites so good?
-
Show HN: Snowflake Data Quality Checks in Python
-
Show HN: Data monitoring and profiling with 1 function call
-
[Research] Detecting Annotation Errors in Semantic Segmentation Data
-
[R] Automated Quality Assurance for Object Detection Datasets
-
Show HN: PipeRider – open-source Data Impact Analysis for dbt changes
-
[D] Is accurately estimating image quality even possible?
-
A note from our sponsor - InfluxDB
www.influxdata.com | 21 May 2025
Index
What are some of the best open-source data-quality projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | ydata-profiling | 12,911 |
| 2 | cleanlab | 10,526 |
| 3 | great_expectations | 10,386 |
| 4 | fiftyone | 9,488 |
| 5 | feast | 6,060 |
| 6 | soda-core | 2,085 |
| 7 | cleanvision | 1,079 |
| 8 | piperider | 487 |
| 9 | Encord Active | 449 |
| 10 | feathub | 329 |
| 11 | cuallee | 188 |
| 12 | data-observability-installer | 117 |
| 13 | swiple | 83 |
| 14 | soda-spark | 63 |
| 15 | panda_patrol | 21 |
| 16 | data_check | 4 |