Python data-quality

Open-source Python projects categorized as data-quality

Top 16 Python data-quality Projects

data-quality
  1. ydata-profiling

    1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.

    Project mention: The DuckDB Local UI | news.ycombinator.com | 2025-03-12

    WhatTheDuck does SQL with duckdb-wasm IIRC

    Pygwalker does open-source descriptive statistics and charts from pandas dataframes: https://github.com/Kanaries/pygwalker

    ydata-profiling does Exploratory Data Analysis (EDA) with Pandas and Spark DataFrames and integrates with various apps: https://github.com/ydataai/ydata-profiling

  3. cleanlab

    The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

    Project mention: Ask HN: Not a webdev, why are these sites so good? | news.ycombinator.com | 2024-06-18

    https://cleanlab.ai/

  4. great_expectations

    Always know what to expect from your data.

  5. fiftyone

    Refine high-quality datasets and visual AI models

    Project mention: Launch HN: Enhanced Radar (YC W25) – A safety net for air traffic control | news.ycombinator.com | 2025-03-04

    Are there already "bird / not a bird" datasets?

    Procedures for creating "bird on Multispectral plane radar and video" dataset(s):

    Tag birds on the dashcam video with timecoded sensor data and a segmentation and annotation tool.

    Pinch to zoom, auto-edge detect, classification probability, sensor status

    voxel51/fiftyone does segmentation and annotation with video and possibly Multispectral data: https://github.com/voxel51/fiftyone

  6. feast

    The Open Source Feature Store for AI/ML

    Project mention: Transforming Your PDFs for RAG with Open Source Using Docling, Milvus, and Feast | news.ycombinator.com | 2025-04-22

    Hey folks!

    I recently gave a talk with the Milvus Community showing a demo of how to transform PDFs with Feast using Docling for RAG.

    The tutorial is available here: https://github.com/feast-dev/feast/tree/master/examples/rag-...

    And the video is available here: https://www.youtube.com/watch?v=DPPtr9Q6_qE

    The goal with having a feature store transform and retrieve your data for RAG is that (1) we make it easy to configure vector retrieval with just a boolean in the code declaration and (2) you can use existing tooling that data scientists / ml engineers are already familiar with.

    I'd love any feedback or ideas on how we could make things better or easier. The Feast maintainers have quite a lot in the pipeline (batch transformations, support for Ray, computer vision and more).

    Thanks a ton!

  7. soda-core

    Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io

  8. cleanvision

    Automatically find issues in image datasets and practice data-centric computer vision.

  10. piperider

    Code review for data in dbt

  11. Encord Active

    Open source active learning toolkit to find failure modes in your computer vision models, prioritize data to label next, and drive data curation to improve model performance.

  12. feathub

    FeatHub - A stream-batch unified feature store for real-time machine learning

  13. cuallee

    Possibly the fastest DataFrame-agnostic quality check library in town.

  14. data-observability-installer

    Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.

    Project mention: New: Open Source Data Observability | dev.to | 2024-05-22

    DataKitchen Data Observability

  15. swiple

    Swiple enables you to easily observe, understand, validate and improve the quality of your data

  16. soda-spark

    Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes

  17. panda_patrol

  18. data_check

    data and pipeline testing with and for SQL

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).


Python data-quality related posts

  • Ask HN: Not a webdev, why are these sites so good?

    1 project | news.ycombinator.com | 18 Jun 2024
  • Show HN: Snowflake Data Quality Checks in Python

    1 project | news.ycombinator.com | 11 Feb 2024
  • Show HN: Data monitoring and profiling with 1 function call

    1 project | news.ycombinator.com | 13 Dec 2023
  • [Research] Detecting Annotation Errors in Semantic Segmentation Data

    1 project | /r/MachineLearning | 5 Nov 2023
  • [R] Automated Quality Assurance for Object Detection Datasets

    1 project | /r/computervision | 28 Sep 2023
  • Show HN: PipeRider – open-source Data Impact Analysis for dbt changes

    3 projects | news.ycombinator.com | 6 Sep 2023
  • [D] Is accurately estimating image quality even possible?

    3 projects | /r/MachineLearning | 22 Apr 2023

Index

What are some of the best open-source data-quality projects in Python? This list will help you:

#   Project                       Stars
1   ydata-profiling               12,911
2   cleanlab                      10,526
3   great_expectations            10,386
4   fiftyone                       9,488
5   feast                          6,060
6   soda-core                      2,085
7   cleanvision                    1,079
8   piperider                        487
9   Encord Active                    449
10  feathub                          329
11  cuallee                          188
12  data-observability-installer     117
13  swiple                            83
14  soda-spark                        63
15  panda_patrol                      21
16  data_check                         4

