-
cleanlab
Discontinued. The standard package for machine learning with noisy labels and finding mislabeled data. Works with most datasets and models. [Moved to: https://github.com/cleanlab/cleanlab] (by cgnorthcutt)
Once you have your pipeline (model included) with all the transformers defined and parametrized, you could use an optimization approach like the one in the examples of the nestedcvtraining library (https://github.com/JaimeArboleda/nestedcvtraining). Do you think that would be a good idea, or am I oversimplifying?
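For readers unfamiliar with the idea, here is a minimal sketch of nested cross-validation over a whole pipeline. It uses plain scikit-learn rather than the nestedcvtraining API, and the synthetic dataset, the pipeline steps, the parameter grid, and the helper name `nested_cv_scores` are all assumptions made for illustration, not something taken from the comment above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def nested_cv_scores():
    """Return one outer-fold accuracy per outer split (hypothetical helper)."""
    # Synthetic data, purely illustrative.
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    # The pipeline bundles the transformers and the model, so every step is
    # refit on each training fold and nothing leaks from held-out data.
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    # Assumed search space: only the regularization strength of the model.
    param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

    # Inner loop picks hyperparameters; outer loop estimates generalization.
    inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
    outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

    search = GridSearchCV(pipeline, param_grid, cv=inner_cv)
    return cross_val_score(search, X, y, cv=outer_cv)


if __name__ == "__main__":
    scores = nested_cv_scores()
    print(f"nested CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Because the hyperparameter search runs inside each outer training fold, the outer scores are not biased by the tuning step. A smarter optimizer could replace the grid search without changing the structure.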
I am an author on this, so I am biased. Around half a decade ago, we began developing a field at MIT called confident learning (paper, blog, reddit post) that takes a data-centric approach: instead of improving the model, it improves the quality of the data labels. It's used by Google and Facebook, and it is open-sourced in Python as the cleanlab package.
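To make the data-centric idea concrete, here is a simplified sketch of the thresholding step behind confident learning. It is not the cleanlab implementation or API. The function name `find_label_issue_candidates` and the toy data are assumptions for illustration, and the predicted probabilities are assumed to be out-of-sample (for example, from cross-validation). Each class gets a threshold equal to the model's average confidence on examples carrying that label. An example is flagged when the class it confidently belongs to differs from its given label.

```python
def find_label_issue_candidates(labels, pred_probs):
    """Return indices whose given label disagrees with a confident prediction.

    labels: list of int class ids (every class must appear at least once).
    pred_probs: list of per-class probability lists, ideally out-of-sample.
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: mean predicted probability of class j among the
    # examples that are labeled j.
    thresholds = []
    for j in range(n_classes):
        probs_j = [p[j] for p, y in zip(pred_probs, labels) if y == j]
        thresholds.append(sum(probs_j) / len(probs_j))

    issues = []
    for i, (p, y) in enumerate(zip(pred_probs, labels)):
        # Classes whose probability clears that class's own threshold.
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if not confident:
            continue  # the model is not confident about any class here
        best = max(confident, key=lambda j: p[j])
        if best != y:
            issues.append(i)
    return issues


# Toy data (hypothetical): examples 2 and 5 look mislabeled.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.20, 0.80],
    [0.10, 0.90],
    [0.30, 0.70],
    [0.85, 0.15],
]
print(find_label_issue_candidates(labels, pred_probs))  # → [2, 5]
```

Per-class thresholds make the rule robust to a model that is systematically more confident on some classes than others. The actual package goes further, for instance by estimating the joint distribution of given and true labels and ranking the flagged examples.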
Related posts
- Show HN: Simple (but clever) algorithms can find label issues in datasets
- [D] In which ML field can I make significant contribution without significant compute?
- [D] A simple trick to quickly verify data
- [P] Cleanlab Vizzy — learn how to automatically find label errors and out-of-distribution data
- Show HN: Cleanlab Vizzy – automatically find label errors and bad data