data-cleaning

Top 19 data-cleaning Open-Source Projects

  • cleanlab

    The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

    Project mention: [Research] Detecting Annotation Errors in Semantic Segmentation Data | /r/MachineLearning | 2023-11-05

    We have freely open-sourced our new method for improving segmentation data, published a paper on the research behind it, and released a 5-min code tutorial. You can also read more in the blog if you'd like.

  • miller

    Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON

    Project mention: Qsv: Efficient CSV CLI Toolkit | news.ycombinator.com | 2023-12-22

  • Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

    Project mention: A mage on the Hero’s Journey: a fantasy epic on how a startup rose from the ashes | dev.to | 2023-06-12

    In the coming years, Mage will create a cooperative experience so that developers can build data pipelines with their team and level up together. After that journey, Mage will go on an epic quest to create the 1st open world community experience in the data universe.

  • pandera

    A light-weight, flexible, and expressive statistical data testing library

  • Optimus

    🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)

  • janitor

    Simple tools for data cleaning in R

  • skrub

    Prepping tables for machine learning

    Project mention: Are there any Python libraries for Data Cleansing ? | /r/dataengineering | 2023-12-08

  • schema-inspector

    Schema-Inspector is a simple JavaScript object sanitization and validation module.

  • Encord Active

    Open source active learning toolkit to find failure modes in your computer vision models, prioritize data to label next, and drive data curation to improve model performance.

    Project mention: Launch HN: Encord (YC W21) – Unit testing for computer vision models | news.ycombinator.com | 2024-01-31

    We base our pricing on your user and consumption scale and would be happy to discuss this with you directly. Please feel free to explore the OS version of Active at https://github.com/encord-team/encord-active. Note that some features, such as natural language search using GPU accelerated APIs, are not included in the cloud version.

  • validate

    Professional data validation for the R environment (by data-cleaning)

  • feature-engineering-tutorials

    Data Science Feature Engineering and Selection Tutorials

  • FuzzTypes

    Pydantic extension for annotating autocorrecting fields.

    Project mention: FuzzTypes: Pydantic Library for Auto-Correcting Annotation Types | news.ycombinator.com | 2024-03-15

  • akvo-lumen

    Make sense of your data

  • desbordante-core

    Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It also lets you run data-cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.

    Project mention: Show HN: Desbordante 1.0.0 Released | news.ycombinator.com | 2023-12-11

  • image-quality-issues

    FiftyOne Plugin for finding common image quality issues

    Project mention: Plugin for Building and Managing Plugins! | dev.to | 2024-02-09

    Week 0: 🌩️ Image Quality Issues & 📈 Concept Interpolation

  • quantclean

    🧹 Quantclean is a program that reformats financial datasets to the US Equity TradeBar format (QuantConnect format)

  • scikit-clean

    A collection of algorithms for detecting and handling label noise

    Project mention: Ask HN: What side projects landed you a job? | news.ycombinator.com | 2023-12-03

    Among all these feel-good stories, how about one with a bit different ending?

    During my master's, I created an ML library that dealt with noise in datasets. I implemented a bunch of papers, but unlike your usual research code, I spent a long time obsessing about its API and performance, and created documentation, CI, the whole shebang [1]. But then, like the average research code, I moved on and promptly forgot about it.

    One day about a year ago, the cofounder of a very new, small startup working on something similar texted me about the project on LinkedIn. We chatted for a bit, but as a guy who thinks he's too cool for LinkedIn, I only logged in again and saw his last message, about wanting to collaborate, some 3-4 months after the fact.

    Well, they raised $25 million a few months ago :(

    [1] https://github.com/Shihab-Shahriar/scikit-clean

  • image-deduplication-plugin

    Remove exact and approximate duplicates from your dataset in FiftyOne!

    Project mention: Plugin for Building and Managing Plugins! | dev.to | 2024-02-09

    Week 4: 🪞Image Deduplication

  • csv_log_cleaner

    Clean CSV files to conform to a type schema by streaming them through small memory buffers using multiple threads and logging data loss.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-03-15.
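
To make the csv_log_cleaner description above more concrete, here is a minimal sketch of the general idea: stream a CSV row by row, coerce each cell to the type a schema expects, and keep a count of the values that had to be dropped. This is not the project's implementation. It is single-threaded, uses only the Python standard library, and the names `SCHEMA` and `clean_csv` are hypothetical, chosen for illustration.

```python
import csv
import io

# Hypothetical schema for illustration: column name -> converter function.
SCHEMA = {"id": int, "price": float, "name": str}


def clean_csv(src, dst, schema):
    """Stream rows from src to dst, blanking cells that fail type conversion.

    Returns a dict counting, per column, the non-empty values that were
    blanked (a simple "data loss" log).
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=list(schema), lineterminator="\n")
    writer.writeheader()
    lost = {col: 0 for col in schema}
    for row in reader:  # only one row is held in memory at a time
        out = {}
        for col, convert in schema.items():
            raw = (row.get(col) or "").strip()
            try:
                out[col] = convert(raw)
            except ValueError:
                out[col] = ""  # keep the row, drop the offending value
                if raw:
                    lost[col] += 1  # count only values that actually existed
        writer.writerow(out)
    return lost
```

Calling `clean_csv(open("in.csv", newline=""), open("out.csv", "w", newline=""), SCHEMA)` would write the cleaned file and return the per-column loss counts. A real tool would likely add chunked buffering and parallelism on top of this loop.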

Index

What are some of the best open-source data-cleaning projects? This list will help you:

Rank Project Stars
1 cleanlab 8,592
2 miller 8,542
3 Mage 6,953
4 pandera 2,976
5 Optimus 1,441
6 janitor 1,337
7 skrub 1,006
8 schema-inspector 504
9 Encord Active 420
10 validate 400
11 feature-engineering-tutorials 263
12 FuzzTypes 183
13 akvo-lumen 63
14 desbordante-core 61
15 image-quality-issues 20
16 quantclean 16
17 scikit-clean 13
18 image-deduplication-plugin 8
19 csv_log_cleaner 2