Open-source projects categorized as data-cleaning

Top 7 data-cleaning Open-Source Projects

  • GitHub repo miller

    Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON

    Project mention: Querying an XML, JSON, CSV, or RDF database | reddit.com/r/ItalyInformatica | 2021-03-31

  • GitHub repo cleanlab

    The standard package for machine learning with noisy labels and finding mislabeled data. Works with most datasets and models.

    Project mention: [R] Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks | reddit.com/r/MachineLearning | 2021-03-29

    👍 An easy first step to find label errors in datasets is cleanlab: https://github.com/cgnorthcutt/cleanlab

  • GitHub repo Optimus

    Agile Data Preparation Workflows made easy with pandas, dask, cudf, dask_cudf and pyspark (by ironmussa)

  • GitHub repo janitor

    Simple tools for data cleaning in R

  • GitHub repo validate

    Professional data validation for the R environment (by data-cleaning)

    Project mention: How to verify your data? | reddit.com/r/Rlanguage | 2021-01-21

    To me it sounds as if you want to test your data in between steps or at the end. Two tools come to mind: https://docs.ropensci.org/assertr/ and https://github.com/data-cleaning/validate

  • GitHub repo akvo-lumen

    Make sense of your data

  • GitHub repo Skytrax-Data-Warehouse

    A full data warehouse infrastructure with ETL pipelines running inside Docker on Apache Airflow for data orchestration, AWS Redshift as the cloud data warehouse, and Metabase for data visualizations such as analytical dashboards.

    Project mention: Open source contributions for a Data Engineer? | reddit.com/r/dataengineering | 2021-04-16

    I'm always open to contributions to my project (Skytrax Data Warehouse). If you are into data stuff, support my work on YouTube as well (One Developer Pirate); I mostly make data-oriented videos. These days I'm making a SQL course from a data analysis perspective that is expected to release next week.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-04-16.


What are some of the best open-source data-cleaning projects? This list will help you:

Rank  Project                 Stars
1     miller                  2,710
2     cleanlab                1,822
3     Optimus                   997
4     janitor                   996
5     validate                  275
6     akvo-lumen                 58
7     Skytrax-Data-Warehouse     55