Python data-cleaning

Open-source Python projects categorized as data-cleaning

Top 12 Python data-cleaning Projects

  • cleanlab

    The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

    Project mention: Ask HN: Not a webdev, why are these sites so good? | | 2024-06-18

  • Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data.

    Project mention: 25 Open Source AI Tools to Cut Your Development Time in Half | | 2024-07-11

    Mage AI is a data transformation and integration framework that allows data scientists and ML engineers to build and automate data pipelines without extensive coding. Data scientists can easily connect to their data sources, ingest data, and build production-ready data pipelines within Mage notebooks.

  • pandera

    A light-weight, flexible, and expressive statistical data testing library

  • Optimus

    🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)

  • skrub

    Prepping tables for machine learning

    Project mention: Are there any Python libraries for Data Cleansing ? | /r/dataengineering | 2023-12-08
  • Encord Active

    Open source active learning toolkit to find failure modes in your computer vision models, prioritize data to label next, and drive data curation to improve model performance.

    Project mention: Launch HN: Encord (YC W21) – Unit testing for computer vision models | | 2024-01-31

    We base our pricing on your user and consumption scale and would be happy to discuss this with you directly. Please feel free to explore the open-source version of Active. Note that some features, such as natural language search using GPU accelerated APIs, are not included in the cloud version.

  • FuzzTypes

    Pydantic extension for annotating autocorrecting fields.

    Project mention: FuzzTypes: Pydantic Library for Auto-Correcting Annotation Types | | 2024-03-15
  • sliceguard

    A library for detecting problematic data segments in structured and unstructured data with a few lines of code.

    Project mention: [P] Python Library for Quickly Detecting Problematic Data Segments | /r/MachineLearning | 2023-08-25

    I'm building a library for quickly detecting problematic data slices (clusters) when developing machine learning models.

  • image-quality-issues

    FiftyOne Plugin for finding common image quality issues

    Project mention: Plugin for Building and Managing Plugins! | | 2024-02-09

    Week 0: 🌩️ Image Quality Issues & 📈 Concept Interpolation

  • quantclean

    🧹 Quantclean is a program that reformats a financial dataset to US Equity TradeBar (QuantConnect format)

  • scikit-clean

    A collection of algorithms for detecting and handling label noise

    Project mention: Ask HN: What side projects landed you a job? | | 2023-12-03

    Among all these feel-good stories, how about one with a bit different ending?

    During my master's, I created an ML library that dealt with noise in datasets. I implemented a bunch of papers, but unlike your usual research code, I spent a long time obsessing about its API and performance, and created documentation, CI, the whole shebang [1]. But then, like average research code, I moved on and promptly forgot about it.

    One day about a year ago, the cofounder of a very new, small startup working on something similar messaged me about the project on LinkedIn. We chatted for a bit, but as a guy who thinks he's too cool for LinkedIn, I only logged in again, and saw his last message about wanting to collaborate, about 3-4 months after the fact.

    Well, they raised $25 million a few months ago :(


  • image-deduplication-plugin

    Remove exact and approximate duplicates from your dataset in FiftyOne!

    Project mention: Plugin for Building and Managing Plugins! | | 2024-02-09

    Week 4: 🪞Image Deduplication

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

What are some of the best open-source data-cleaning projects in Python? This list will help you:

Rank  Project                     Stars
1     cleanlab                    9,113
2     Mage                        7,469
3     pandera                     3,147
4     Optimus                     1,462
5     skrub                       1,055
6     Encord Active               428
7     FuzzTypes                   203
8     sliceguard                  59
9     image-quality-issues        24
10    quantclean                  17
11    scikit-clean                14
12    image-deduplication-plugin  13
