Metric | examples | deep-active-learning
---|---|---
Mentions | 12 | 1
Stars | 99 | 758
Growth | - | -
Activity | 7.8 | 10.0
Last commit | 2 months ago | over 1 year ago
Language | Jupyter Notebook | Python
License | GNU Affero General Public License v3.0 | MIT License
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
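The description above can be illustrated with a small sketch. The site's actual formula is not given, so everything here is an assumption made for illustration: the exponential decay, the 30-day half-life, and the function names `recency_weighted_activity` and `activity_rating` are all hypothetical. The sketch only shows how recent commits can count for more than old ones, and how a percentile rank can be mapped to a 0-10 number so that 9.0 corresponds to the top 10%.

```python
from bisect import bisect_left


def recency_weighted_activity(commit_ages_days, half_life_days=30.0):
    """Raw score: each commit's weight halves every `half_life_days` (assumed decay)."""
    return sum(0.5 ** (age / half_life_days) for age in commit_ages_days)


def activity_rating(raw_score, all_raw_scores):
    """Map a raw score to a 0-10 rating from its percentile among tracked projects."""
    ranked = sorted(all_raw_scores)
    below = bisect_left(ranked, raw_score)  # number of projects with a strictly lower score
    return round(10.0 * below / len(ranked), 1)


# Hypothetical data: the same number of commits, made recently vs. long ago.
recent = recency_weighted_activity([1, 3, 7])
stale = recency_weighted_activity([400, 420, 450])
assert recent > stale

# Ten hypothetical tracked projects with raw scores 1..10.
tracked = [float(s) for s in range(1, 11)]
top_rating = activity_rating(10.0, tracked)  # best of ten projects, i.e. the top 10%
```

With ten tracked projects, the highest-scoring one has nine projects below it, which this mapping turns into a rating of 9.0.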
examples
- FLaNK AI - 15 April 2024
- [R] Detecting Dataset Drift and Non-IID Sampling: A k-Nearest Neighbors approach that works for Image/Text/Audio/Numeric Data
  I just published a paper detailing this non-IID check and open-sourced its code in the cleanlab package: just one line of code will check for this and many other types of issues in your dataset.
- Datalab: A Linter for ML Datasets
  I recently published a blog introducing Datalab and an open-source Python implementation that is easy to use for all data types (image, text, tabular, audio, etc). For data scientists, I’ve made a quick Jupyter tutorial to run Datalab on your own data.
- Finetuning Large Language Models -- An introduction to the core ideas and approaches
  Cool read! I just finished up a notebook where I show how noisy labels can drastically impact the performance of OpenAI LLMs. I first fine-tune the well-known Davinci model (the backbone of ChatGPT) on the original data and report an accuracy of 63%. I then use the open-source package cleanlab to find examples that are incorrectly labeled and drop them from the training data. This step increases the fine-tuning accuracy to 66% (better accuracy with less data). Finally, I correct the mislabeled examples and fine-tuning accuracy jumps to 77%!
- What are some active research areas in Machine Learning Systems?
  The entire field of data-centric AI is an active field that is pretty new. It focuses on the data side of ML as opposed to just model optimization. Our company is building an open-source package cleanlab that is becoming the DCAI standard.
- [Research] ActiveLab: Active Learning with Data Re-Labeling
  I recently published a paper introducing this novel method and an open-source Python implementation that is easy to use for all data types (image, text, tabular, audio, etc). For data scientists, I’ve made a quick Jupyter tutorial to run ActiveLab on your own data. For ML researchers, I’ve made all of our benchmarking code available for reproducibility so you can see for yourself how effective ActiveLab is in practice.
- cleanlab open-source --- expanded support for Active Learning and other data-centric AI tasks
  suggest which data is most informative to (re)label next (active learning)
- Strategies for selecting what data to annotate?
- [D] Can someone point to research on determining usefulness of samples/datasets for training ML models?
- cleanlab: an open-source Python framework for data-centric AI
  In one line of Python, cleanlab can automatically: 1) find mislabeled data + train robust models 2) detect outliers 3) estimate consensus + annotator-quality for datasets labeled by multiple annotators 4) suggest which data is best to label or re-label next (active learning)
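Several of the posts above describe finding mislabeled examples, then dropping or correcting them. The sketch below illustrates that workflow in plain Python. It is not cleanlab's API or its actual algorithm: it is a simplified per-class confidence-threshold heuristic, and the function name `find_likely_label_issues` and the toy data are hypothetical. It assumes you already have out-of-sample predicted class probabilities from some trained classifier.

```python
def find_likely_label_issues(labels, pred_probs):
    """Return indices of examples whose given label looks wrong, least trusted first."""
    n_classes = len(pred_probs[0])

    # Per-class threshold: mean predicted probability of class k over examples labeled k.
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else 1.0)

    flagged = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes the model is confident about for this example.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident and y not in confident:
            flagged.append((p[y], i))  # p[y] is the model's confidence in the given label
    return [i for _, i in sorted(flagged)]


# Toy data (hypothetical): example 2 is labeled 0 but the model strongly predicts class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]

issues = find_likely_label_issues(labels, pred_probs)

# Option 1: drop the flagged examples before training.
dropped = [y for i, y in enumerate(labels) if i not in set(issues)]

# Option 2: replace each flagged label with the model's most likely class.
corrected = list(labels)
for i in issues:
    corrected[i] = max(range(len(pred_probs[i])), key=pred_probs[i].__getitem__)
```

In practice, automatically corrected labels are usually worth a manual review, since the model's prediction can also be wrong.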
deep-active-learning
What are some alternatives?
token-label-error-benchmarks - Benchmarking methods for label error detection in token classification tasks
lightly - A Python library for self-supervised learning on images.
awesome-active-learning - A curated list of awesome Active Learning
argilla - Argilla is a collaboration platform for AI engineers and domain experts that require high-quality outputs, full data ownership, and overall efficiency.
notebooks - Repo for various jupyter notebooks.
cleanlab - The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
multiannotator-benchmarks - Benchmarking algorithms for assessing quality of data labeled by multiple annotators
modAL - A modular active learning framework for Python
adaptive - Adaptive: parallel active learning of mathematical functions
refinery - The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.
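
Many of the projects listed above deal with active learning, that is, choosing which unlabeled examples to annotate next. As a rough illustration of the idea, here is a minimal sketch of entropy-based uncertainty sampling, one common baseline strategy. It is not taken from any of the listed libraries; the function names `entropy` and `select_for_labeling` and the toy pool are hypothetical, and the sketch assumes a classifier has already produced class probabilities for each unlabeled example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(pred_probs, batch_size):
    """Indices of the `batch_size` examples the model is most uncertain about."""
    ranked = sorted(
        range(len(pred_probs)),
        key=lambda i: entropy(pred_probs[i]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:batch_size]


# Hypothetical unlabeled pool: predicted probabilities for a two-class problem.
pool_probs = [[0.98, 0.02], [0.55, 0.45], [0.70, 0.30], [0.50, 0.50]]
to_label = select_for_labeling(pool_probs, batch_size=2)
```

Here the examples with near-even probabilities are picked first, since a label for them is expected to be the most informative to the model.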