Python data-cleaning

Open-source Python projects categorized as data-cleaning

Top 12 Python data-cleaning Projects

  1. cleanlab

    The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

  3. fiftyone

    Refine high-quality datasets and visual AI models

    Project mention: Launch HN: Enhanced Radar (YC W25) – A safety net for air traffic control | news.ycombinator.com | 2025-03-04

    Are there already "bird / not a bird" datasets?

    Procedures for creating "bird on Multispectral plane radar and video" dataset(s):

    Tag birds on the dashcam video with timecoded sensor data and a segmentation and annotation tool.

    Pinch to zoom, auto-edge detect, classification probability, sensor status

    voxel51/fiftyone does segmentation and annotation with video and possibly Multispectral data: https://github.com/voxel51/fiftyone

  4. Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

    Project mention: Wk 3 Orchestration: MLOPs with DataTalks | dev.to | 2025-02-22

    Here, we use the free Mage AI orchestration tool.

  5. pandera

    A light-weight, flexible, and expressive statistical data testing library

  6. Optimus

    🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)

  7. skrub

    Machine learning with dataframes

  8. Encord Active

    Open source active learning toolkit to find failure modes in your computer vision models, prioritize data to label next, and drive data curation to improve model performance.

  10. sliceguard

    A library for detecting problematic data segments in structured and unstructured data with a few lines of code.

  11. image-quality-issues

    FiftyOne Plugin for finding common image quality issues

    Project mention: Data Quality: The Hidden Driver of AI Success | dev.to | 2024-11-12

    From our own experiences building high-performing visual AI systems, we know well that AI/ML specialists struggle with the challenges of curating high-quality datasets. That's why we’ve invested in tools and plugins such as the data quality plugin for FiftyOne, which helps you find problematic images in your dataset such as blurry images, too bright or too dark images, and potentially noisy images. And this deduplication plugin for FiftyOne helps you find near and exact duplicates in your dataset.

  12. image-deduplication-plugin

    Remove exact and approximate duplicates from your dataset in FiftyOne!

    Project mention: Data Quality: The Hidden Driver of AI Success | dev.to | 2024-11-12


  13. quantclean

    🧹 Quantclean is a program that reformats financial datasets to the US Equity TradeBar format used by QuantConnect

  14. scikit-clean

    A collection of algorithms for detecting and handling label noise

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python data-cleaning related posts

  • Data Quality: The Hidden Driver of AI Success

    4 projects | dev.to | 12 Nov 2024
  • Ask HN: Not a webdev, why are these sites so good?

    1 project | news.ycombinator.com | 18 Jun 2024
  • [Research] Detecting Annotation Errors in Semantic Segmentation Data

    1 project | /r/MachineLearning | 5 Nov 2023
  • [R] Automated Quality Assurance for Object Detection Datasets

    1 project | /r/computervision | 28 Sep 2023
  • Need Suggestion for transformation tool

    1 project | /r/dataengineering | 21 Dec 2022
  • Updating dbt Cloud pricing to support long-term community growth 50$->100$

    1 project | /r/dataengineering | 16 Dec 2022
  • Airflow for Data Ingestion

    1 project | /r/dataengineering | 29 Nov 2022

Index

What are some of the best open-source data-cleaning projects in Python? This list will help you:

#   Project                      Stars
1   cleanlab                     10,635
2   fiftyone                      9,617
3   Mage                          8,375
4   pandera                       3,861
5   Optimus                       1,513
6   skrub                         1,412
7   Encord Active                   450
8   sliceguard                       64
9   image-quality-issues             32
10  image-deduplication-plugin       18
11  quantclean                       18
12  scikit-clean                     16

