Top 23 Jupyter Notebook Dataset Projects

covid-chestxray-dataset

1 2,958 0.0 Jupyter Notebook

We are building an open database of COVID-19 cases with chest X-ray or CT images.
whylogs

6 2,543 9.1 Jupyter Notebook

An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈
datasets

4 2,299 4.7 Jupyter Notebook

🎁 5,400,000+ Unsplash images made available for research and machine learning (by unsplash)

Project mention: AI-Powered Image Search with CLIP, pgvector, and Fast API | dev.to | 2024-02-12

Here's a live demo with a simple React frontend. It's searching against an S3 bucket containing Unsplash's open source dataset of 25,000 images, plus a few of my own.

fma

1 2,108 0.0 Jupyter Notebook

FMA: A Dataset For Music Analysis
clusterdata

1 1,477 4.5 Jupyter Notebook

cluster data collected from production clusters in Alibaba for cluster management research
raccoon_dataset

1 1,266 0.0 Jupyter Notebook

The dataset is used to train my own raccoon detector and I blogged about it on Medium
COVID-CT

1 1,062 0.0 Jupyter Notebook

COVID-CT-Dataset: A CT Scan Dataset about COVID-19
ThoughtSource

1 832 8.4 Jupyter Notebook

A central, open resource for data and tools related to chain-of-thought reasoning in large language models. Developed @ Samwald research group: https://samwald.info/
torchxrayvision

1 828 6.4 Jupyter Notebook

TorchXRayVision: A library of chest X-ray datasets and models. Classifiers, segmentation, and autoencoders.
hate-speech-and-offensive-language

2 750 1.9 Jupyter Notebook

Repository for the paper "Automated Hate Speech Detection and the Problem of Offensive Language", ICWSM 2017
TACO

3 540 0.0 Jupyter Notebook

🌮 Trash Annotations in Context Dataset Toolkit (by pedropro)
OpenAI-CLIP

4 509 3.1 Jupyter Notebook

Simple implementation of OpenAI CLIP model in PyTorch.

Project mention: Simple Implementation of OpenAI Clip (Tutorial) | news.ycombinator.com | 2024-02-21

SKAB

9 292 4.8 Jupyter Notebook

SKAB - Skoltech Anomaly Benchmark. Time-series data for evaluating Anomaly Detection algorithms.

Project mention: SKAB: NEW Data - star count:238.0 | /r/algoprojects | 2023-09-25

Awesome_Satellite_Benchmark_Datasets

1 282 2.8 Jupyter Notebook

Supplementary material for our paper "THERE IS NO DATA LIKE MORE DATA" is provided.
covid19za

2 255 3.6 Jupyter Notebook

Coronavirus COVID-19 (2019-nCoV) Data Repository and Dashboard for South Africa
alis

2 227 2.6 Jupyter Notebook

[ICCV 2021] Aligning Latent and Image Spaces to Connect the Unconnectable (by universome)
roboflow-100-benchmark

8 227 0.6 Jupyter Notebook

Code for replicating Roboflow 100 benchmark results and programmatically downloading benchmark datasets

Project mention: AI That Teaches Other AI | news.ycombinator.com | 2023-07-20

> Their SKILL tool involves a set of algorithms that make the process go much faster, they said, because the agents learn at the same time in parallel. Their research showed if 102 agents each learn one task and then share, the amount of time needed is reduced by a factor of 101.5 after accounting for the necessary communications and knowledge consolidation among agents.
This is a really interesting idea. It's like the reverse of knowledge distillation (which I've been thinking about a lot[1]) where you have one giant model that knows a lot about a lot & you use that model to train smaller, faster models that know a lot about a little.
Instead, if you could train a lot of models that know a lot about a little (which is a lot less computationally intensive because the problem space is so confined) and combine them into a generalized model, that'd be hugely beneficial.
Unfortunately, after a bit of digging into the paper & Github repo[2], this doesn't seem to be what's happening at all.
> The code will learn 102 small and separte heads(either a linear head or a linear head with a task bias) for each tasks respectively in order. This step can be parallized on multiple GPUS with one task per GPU. The heads will be saved in the weight folder. After that, the code will learn a task mapper(Either using GMMC or Mahalanobis) to distinguish image task-wisely. Then, all images will be evaluated in the same time without a task label.
So the knowledge isn't being combined (and the agents aren't learning from each other) into a generalized model. They're just training a bunch of independent models for specific tasks & adding a model-selection step that maps an image to the most relevant "expert". My guess is you could do the same thing using CLIP vectors as the routing method to supervised models trained on specific datasets (we found that datasets largely live in distinct regions of CLIP-space[3]).
[1] https://github.com/autodistill/autodistill
[2] https://github.com/gyhandy/Shared-Knowledge-Lifelong-Learnin...
[3] https://www.rf100.org
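
The routing idea in the comment above (independent per-task heads plus a selection step that picks the most relevant "expert") can be sketched as follows. This is a minimal illustration under stated assumptions, not the SKILL code or any Roboflow implementation: it uses cosine similarity to per-task centroids instead of the GMMC or Mahalanobis mappers named in the quote, the three-dimensional vectors are synthetic stand-ins for CLIP embeddings, and the task names and "expert" functions are hypothetical placeholders.

```python
import math
import random


def centroid(vectors):
    """Mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fit_router(embeddings_by_task):
    """Store one centroid per task; this dict is the whole 'task mapper'."""
    return {task: centroid(vecs) for task, vecs in embeddings_by_task.items()}


def route(router, embedding):
    """Return the task whose centroid is most similar to the embedding."""
    return max(router, key=lambda task: cosine(router[task], embedding))


def predict(router, experts, embedding):
    """Pick an expert without a task label, then run only that expert."""
    task = route(router, embedding)
    return task, experts[task](embedding)


def make_cluster(center, n, rng, noise=0.05):
    """Synthetic embeddings scattered around a center (assumption: real
    datasets would be embedded with an image encoder such as CLIP)."""
    return [[c + rng.gauss(0, noise) for c in center] for _ in range(n)]


rng = random.Random(0)

# Hypothetical tasks, each occupying its own region of embedding space.
embeddings_by_task = {
    "aerial": make_cluster([1.0, 0.0, 0.0], 20, rng),
    "microscopy": make_cluster([0.0, 1.0, 0.0], 20, rng),
    "documents": make_cluster([0.0, 0.0, 1.0], 20, rng),
}

# Placeholder experts; in practice each would be a model trained on one task.
experts = {
    task: (lambda emb, t=task: f"{t}-head prediction")
    for task in embeddings_by_task
}

router = fit_router(embeddings_by_task)
task, output = predict(router, experts, [0.1, 0.9, 0.05])
print(task, output)
```

Nothing is shared between the experts here; the router only decides which one runs, which is the distinction the comment draws between model selection and a genuinely combined model.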

goodreads

3 228 5.1 Jupyter Notebook

code samples for the goodreads datasets (by MengtingWan)
ImageNetV2

1 223 2.1 Jupyter Notebook

A new test set for ImageNet
openbrewerydb

1 173 7.6 Jupyter Notebook

🍻 An open-source dataset of breweries, cideries, brewpubs, and bottleshops.
clip-italian

1 170 2.0 Jupyter Notebook

CLIP (Contrastive Language–Image Pre-training) for Italian
mnist1d

1 138 6.3 Jupyter Notebook

A 1D analogue of the MNIST dataset for measuring spatial biases and answering Science of Deep Learning questions.
cpi

1 127 8.0 Jupyter Notebook

Quickly adjust U.S. dollars for inflation using the Consumer Price Index (CPI)

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Jupyter Notebook Dataset related posts

Simple Implementation of OpenAI Clip (Tutorial)
1 project | news.ycombinator.com | 21 Feb 2024
SKAB: NEW Data - star count:238.0
1 project | /r/algoprojects | 25 Sep 2023
SKAB: NEW Data - star count:238.0
1 project | /r/algoprojects | 24 Sep 2023
SKAB: NEW Data - star count:238.0
1 project | /r/algoprojects | 23 Sep 2023
SKAB: NEW Data - star count:238.0
1 project | /r/algoprojects | 19 Sep 2023
Update from Waymo spokesperson on the dog that was killed by a Waymo ADV
1 project | /r/SelfDrivingCars | 13 Jun 2023
[P] Fine-tuning LLaMA on TheVault by AI4Code
2 projects | /r/LocalLLaMA | 30 May 2023

Index

What are some of the best open-source Dataset projects in Jupyter Notebook? This list will help you:

#	Project	Stars
1	covid-chestxray-dataset	2,958
2	whylogs	2,543
3	datasets	2,299
4	fma	2,108
5	clusterdata	1,477
6	raccoon_dataset	1,266
7	COVID-CT	1,062
8	ThoughtSource	832
9	torchxrayvision	828
10	hate-speech-and-offensive-language	750
11	TACO	540
12	OpenAI-CLIP	509
13	SKAB	292
14	Awesome_Satellite_Benchmark_Datasets	282
15	covid19za	255
16	alis	227
17	roboflow-100-benchmark	227
18	goodreads	228
19	ImageNetV2	223
20	openbrewerydb	173
21	clip-italian	170
22	mnist1d	138
23	cpi	127