Top 11 Python data-cleaning Projects
-
cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
-
Optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)
-
Encord Active
Open source active learning toolkit to find failure modes in your computer vision models, prioritize data to label next, and drive data curation to improve model performance.
-
quantclean
🧹 Quantclean is a program that reformats financial dataset to US Equity TradeBar (Quantconnect format)
Project mention: [Research] Detecting Annotation Errors in Semantic Segmentation Data | /r/MachineLearning | 2023-11-05
We have freely open-sourced our new method for improving segmentation data, published a paper on the research behind it, and released a 5-min code tutorial. You can also read more in the blog if you'd like.
Project mention: Are there any Python libraries for Data Cleansing ? | /r/dataengineering | 2023-12-08
Project mention: Launch HN: Encord (YC W21) – Unit testing for computer vision models | news.ycombinator.com | 2024-01-31
We base our pricing on your user and consumption scale and would be happy to discuss this with you directly. Please feel free to explore the OS version of Active at https://github.com/encord-team/encord-active. Note that some features, such as natural language search using GPU accelerated APIs, are not included in the cloud version.
Project mention: FuzzTypes: Pydantic Library for Auto-Correcting Annotation Types | news.ycombinator.com | 2024-03-15
Week 0: 🌩️ Image Quality Issues & 📈 Concept Interpolation
Among all these feel-good stories, how about one with a slightly different ending?
During my master's, I created an ML library that dealt with noise in datasets. I implemented a bunch of papers, but unlike your usual research code, I spent a long time obsessing about its API and performance, and created documentation, CI, the whole shebang [1]. But then, like average research code, I moved on and promptly forgot about it.
One day about a year ago, the cofounder of a very new, small startup working on something similar messaged me about the project on LinkedIn. We chatted for a bit, but as a guy who thinks he's too cool for LinkedIn, I next logged in and saw his last message about wanting to collaborate about 3-4 months after the fact.
Well, they raised $25 million a few months ago :(
[1] https://github.com/Shihab-Shahriar/scikit-clean
Week 4: 🪞Image Deduplication
Python data-cleaning related posts
- [Research] Detecting Annotation Errors in Semantic Segmentation Data
- [R] Automated Quality Assurance for Object Detection Datasets
- Need Suggestion for transformation tool
- Updating dbt Cloud pricing to support long-term community growth 50$->100$
- Airflow for Data Ingestion
- ETL tool
- Show HN: Mage, Fivetran alternative for ELT and data integrations
Index
What are some of the best open-source data-cleaning projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | cleanlab | 8,592 |
2 | Mage | 7,001 |
3 | pandera | 2,994 |
4 | Optimus | 1,441 |
5 | skrub | 1,009 |
6 | Encord Active | 420 |
7 | FuzzTypes | 188 |
8 | image-quality-issues | 20 |
9 | quantclean | 16 |
10 | scikit-clean | 13 |
11 | image-deduplication-plugin | 8 |