Top 12 Python data-cleaning Projects
-
cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
-
fiftyone
Project mention: Launch HN: Enhanced Radar (YC W25) – A safety net for air traffic control | news.ycombinator.com | 2025-03-04
Are there already "bird / not a bird" datasets?
Procedures for creating "bird on multispectral plane radar and video" dataset(s):
Tag birds on the dashcam video with timecoded sensor data, using a segmentation and annotation tool.
Pinch to zoom, auto-edge detect, classification probability, sensor status
voxel51/fiftyone does segmentation and annotation with video and possibly multispectral data: https://github.com/voxel51/fiftyone
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
Here, we use the free Mage AI orchestration tool.
-
-
Optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)
-
-
Encord Active
Open source active learning toolkit to find failure modes in your computer vision models, prioritize data to label next, and drive data curation to improve model performance.
-
sliceguard
A library for detecting problematic data segments in structured and unstructured data with few lines of code.
-
From our own experience building high-performing visual AI systems, we know well that AI/ML specialists struggle with the challenges of curating high-quality datasets. That's why we've invested in tools and plugins such as the data quality plugin for FiftyOne, which helps you find problematic images in your dataset, such as blurry images, images that are too bright or too dark, and potentially noisy images. And this deduplication plugin for FiftyOne helps you find near and exact duplicates in your dataset.
-
quantclean
🧹 Quantclean is a program that reformats financial datasets to the US Equity TradeBar format (QuantConnect format)
-
Python data-cleaning discussion
Python data-cleaning related posts
-
Data Quality: The Hidden Driver of AI Success
-
Ask HN: Not a webdev, why are these sites so good?
-
[Research] Detecting Annotation Errors in Semantic Segmentation Data
-
[R] Automated Quality Assurance for Object Detection Datasets
-
Need Suggestion for transformation tool
-
Updating dbt Cloud pricing to support long-term community growth 50$->100$
-
Airflow for Data Ingestion
-
A note from our sponsor - InfluxDB
www.influxdata.com | 23 Jun 2025
Index
What are some of the best open-source data-cleaning projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | cleanlab | 10,635 |
| 2 | fiftyone | 9,617 |
| 3 | Mage | 8,375 |
| 4 | pandera | 3,861 |
| 5 | Optimus | 1,513 |
| 6 | skrub | 1,412 |
| 7 | Encord Active | 450 |
| 8 | sliceguard | 64 |
| 9 | image-quality-issues | 32 |
| 10 | image-deduplication-plugin | 18 |
| 11 | quantclean | 18 |
| 12 | scikit-clean | 16 |