Top 23 Jupyter Notebook Dataset Projects
-
covid-chestxray-dataset
We are building an open database of COVID-19 cases with chest X-ray or CT images.
-
whylogs
An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈
-
datasets
🎁 5,400,000+ Unsplash images made available for research and machine learning (by unsplash)
-
clusterdata
cluster data collected from production clusters in Alibaba for cluster management research
-
raccoon_dataset
The dataset is used to train my own raccoon detector and I blogged about it on Medium
-
ThoughtSource
A central, open resource for data and tools related to chain-of-thought reasoning in large language models. Developed @ Samwald research group: https://samwald.info/
-
torchxrayvision
TorchXRayVision: A library of chest X-ray datasets and models. Classifiers, segmentation, and autoencoders.
-
hate-speech-and-offensive-language
Repository for the paper "Automated Hate Speech Detection and the Problem of Offensive Language", ICWSM 2017
-
SKAB
SKAB - Skoltech Anomaly Benchmark. Time-series data for evaluating Anomaly Detection algorithms.
-
Awesome_Satellite_Benchmark_Datasets
Supplementary material for the paper "There Is No Data Like More Data".
-
roboflow-100-benchmark
Code for replicating Roboflow 100 benchmark results and programmatically downloading benchmark datasets
-
mnist1d
A 1D analogue of the MNIST dataset for measuring spatial biases and answering Science of Deep Learning questions.
Project mention: Simple Implementation of OpenAI Clip (Tutorial) | news.ycombinator.com | 2024-02-21
Here's a live demo with a simple React frontend. It's searching against an S3 bucket containing Unsplash's open source dataset of 25,000 images, plus a few of my own.
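A demo like the one above presumably works by embedding the query text and each image into a shared vector space (e.g. with CLIP) and ranking images by cosine similarity. Here is a minimal sketch of just that ranking step, assuming the embeddings have already been computed; the vectors and labels below are made up for illustration, not real CLIP outputs:

```python
import numpy as np

def rank_by_similarity(query_emb, image_embs, top_k=3):
    """Rank images by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = imgs @ q                       # cosine similarity per image
    order = np.argsort(-sims)[:top_k]     # best matches first
    return order, sims[order]

# Toy 4-dimensional "embeddings" standing in for real CLIP vectors.
images = np.array([
    [1.0, 0.0, 0.0, 0.0],   # e.g. "beach"
    [0.0, 1.0, 0.0, 0.0],   # e.g. "forest"
    [0.9, 0.1, 0.0, 0.0],   # e.g. "coastline"
])
query = np.array([1.0, 0.05, 0.0, 0.0])  # text embedding for "beach"

idx, scores = rank_by_similarity(query, images, top_k=2)
print(idx)  # indices of nearest images, best first
```

In a production setup the image embeddings would be precomputed once and stored (e.g. alongside the S3 objects), so only the query needs to be embedded at search time.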
> Their SKILL tool involves a set of algorithms that make the process go much faster, they said, because the agents learn at the same time in parallel. Their research showed if 102 agents each learn one task and then share, the amount of time needed is reduced by a factor of 101.5 after accounting for the necessary communications and knowledge consolidation among agents.
This is a really interesting idea. It's like the reverse of knowledge distillation (which I've been thinking about a lot[1]) where you have one giant model that knows a lot about a lot & you use that model to train smaller, faster models that know a lot about a little.
Instead, if you could train a lot of models that each know a lot about a little (which is much less computationally intensive because the problem space is so confined) and combine them into a generalized model, that would be hugely beneficial.
Unfortunately, after a bit of digging into the paper & Github repo[2], this doesn't seem to be what's happening at all.
> The code will learn 102 small and separate heads (either a linear head or a linear head with a task bias) for each task in order. This step can be parallelized on multiple GPUs with one task per GPU. The heads will be saved in the weight folder. After that, the code will learn a task mapper (either GMMC or Mahalanobis) to distinguish images task-wise. Then, all images will be evaluated at the same time without a task label.
So the knowledge isn't being combined (and the agents aren't learning from each other) into a generalized model. They're just training a bunch of independent models for specific tasks & adding a model-selection step that maps an image to the most relevant "expert". My guess is you could do the same thing using CLIP vectors as the routing method to supervised models trained on specific datasets (we found that datasets largely live in distinct regions of CLIP-space[3]).
[1] https://github.com/autodistill/autodistill
[2] https://github.com/gyhandy/Shared-Knowledge-Lifelong-Learnin...
[3] https://www.rf100.org
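The routing scheme described above can be sketched concretely: train one small "expert" per task, then at inference time pick an expert by mapping the input to the nearest task centroid in feature space. This is a deliberate simplification of the paper's GMMC/Mahalanobis task mappers (a nearest-centroid stand-in), and all names and data here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy "tasks": feature vectors drawn around different centroids.
task_data = {
    "task_a": rng.normal(loc=0.0, scale=0.5, size=(50, 4)),
    "task_b": rng.normal(loc=5.0, scale=0.5, size=(50, 4)),
}

# In the paper each task gets an independently trained linear head; a dict of
# per-task models plays that role here.
experts = {name: feats.mean(axis=0) for name, feats in task_data.items()}

# Task mapper: nearest centroid in feature space (stand-in for GMMC or a
# Mahalanobis-distance classifier).
centroids = {name: feats.mean(axis=0) for name, feats in task_data.items()}

def route(x):
    """Pick the expert whose task centroid is closest to x."""
    return min(centroids, key=lambda n: np.linalg.norm(x - centroids[n]))

print(route(np.full(4, 0.1)))   # falls near task_a's centroid
print(route(np.full(4, 4.8)))   # falls near task_b's centroid
```

This makes the comment's point visible: the experts never exchange knowledge; the only "combination" is a selection step in front of otherwise independent models, which is why CLIP vectors could plausibly serve as the routing signal instead.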
Jupyter Notebook Dataset related posts
- Simple Implementation of OpenAI Clip (Tutorial)
- SKAB: NEW Data - star count: 238
- Update from Waymo spokesperson on the dog that was killed by a Waymo ADV
- [P] Fine-tuning LLaMA on TheVault by AI4Code
Index
What are some of the best open-source Dataset projects in Jupyter Notebook? This list will help you:
# | Project | Stars
---|---|---
1 | covid-chestxray-dataset | 2,958 |
2 | whylogs | 2,543 |
3 | datasets | 2,299 |
4 | fma | 2,108 |
5 | clusterdata | 1,477 |
6 | raccoon_dataset | 1,266 |
7 | COVID-CT | 1,062 |
8 | ThoughtSource | 832 |
9 | torchxrayvision | 828 |
10 | hate-speech-and-offensive-language | 750 |
11 | TACO | 540 |
12 | OpenAI-CLIP | 509 |
13 | SKAB | 292 |
14 | Awesome_Satellite_Benchmark_Datasets | 282 |
15 | covid19za | 255 |
16 | alis | 227 |
17 | roboflow-100-benchmark | 227 |
18 | goodreads | 228 |
19 | ImageNetV2 | 223 |
20 | openbrewerydb | 173 |
21 | clip-italian | 170 |
22 | mnist1d | 138 |
23 | cpi | 127 |