Jupyter Notebook Dataset

Open-source Jupyter Notebook projects categorized as Dataset

Top 23 Jupyter Notebook Dataset Projects

  • covid-chestxray-dataset

    We are building an open database of COVID-19 cases with chest X-ray or CT images.

  • whylogs

    An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈

  • waymo-open-dataset

    Waymo Open Dataset

    Project mention: Update from Waymo spokesperson on the dog that was killed by a Waymo ADV | /r/SelfDrivingCars | 2023-06-13

    Interesting point about the Waymo dataset, though this reply suggests they have higher framerates and just don't release them.

  • datasets

    🎁 5,400,000+ Unsplash images made available for research and machine learning (by unsplash)

    Project mention: AI-Powered Image Search with CLIP, pgvector, and Fast API | dev.to | 2024-02-12

    Here's a live demo with a simple React frontend. It's searching against an S3 bucket containing Unsplash's open source dataset of 25,000 images, plus a few of my own.

  • fma

    FMA: A Dataset For Music Analysis

  • clusterdata

    cluster data collected from production clusters in Alibaba for cluster management research

  • raccoon_dataset

    The dataset is used to train my own raccoon detector and I blogged about it on Medium

  • COVID-CT

    COVID-CT-Dataset: A CT Scan Dataset about COVID-19

  • ThoughtSource

    A central, open resource for data and tools related to chain-of-thought reasoning in large language models. Developed @ Samwald research group: https://samwald.info/

  • torchxrayvision

    TorchXRayVision: A library of chest X-ray datasets and models. Classifiers, segmentation, and autoencoders.

  • hate-speech-and-offensive-language

    Repository for the paper "Automated Hate Speech Detection and the Problem of Offensive Language", ICWSM 2017

  • TACO

    🌮 Trash Annotations in Context Dataset Toolkit (by pedropro)

  • OpenAI-CLIP

    Simple implementation of OpenAI CLIP model in PyTorch.

    Project mention: Simple Implementation of OpenAI Clip (Tutorial) | news.ycombinator.com | 2024-02-21

  • SKAB

    SKAB - Skoltech Anomaly Benchmark. Time-series data for evaluating Anomaly Detection algorithms.

    Project mention: SKAB: NEW Data - star count:238.0 | /r/algoprojects | 2023-09-25

  • Awesome_Satellite_Benchmark_Datasets

    Supplementary material for our paper "THERE IS NO DATA LIKE MORE DATA" is provided.

    Project mention: GIS data for a project. I apologize for the banality of my request and for my English. | /r/datasets | 2023-03-15

  • covid19za

    Coronavirus COVID-19 (2019-nCoV) Data Repository and Dashboard for South Africa

  • alis

    [ICCV 2021] Aligning Latent and Image Spaces to Connect the Unconnectable (by universome)

  • ImageNetV2

    A new test set for ImageNet

  • goodreads

    code samples for the goodreads datasets (by MengtingWan)

  • roboflow-100-benchmark

    Code for replicating Roboflow 100 benchmark results and programmatically downloading benchmark datasets

    Project mention: AI That Teaches Other AI | news.ycombinator.com | 2023-07-20

    > Their SKILL tool involves a set of algorithms that make the process go much faster, they said, because the agents learn at the same time in parallel. Their research showed if 102 agents each learn one task and then share, the amount of time needed is reduced by a factor of 101.5 after accounting for the necessary communications and knowledge consolidation among agents.

    This is a really interesting idea. It's like the reverse of knowledge distillation (which I've been thinking about a lot[1]) where you have one giant model that knows a lot about a lot & you use that model to train smaller, faster models that know a lot about a little.

    Instead, if you could train a lot of models that know a lot about a little (which is a lot less computationally intensive because the problem space is so confined) and combine them into a generalized model, that'd be hugely beneficial.

    Unfortunately, after a bit of digging into the paper & GitHub repo[2], this doesn't seem to be what's happening at all.

    > The code will learn 102 small and separte heads(either a linear head or a linear head with a task bias) for each tasks respectively in order. This step can be parallized on multiple GPUS with one task per GPU. The heads will be saved in the weight folder. After that, the code will learn a task mapper(Either using GMMC or Mahalanobis) to distinguish image task-wisely. Then, all images will be evaluated in the same time without a task label.

    So the knowledge isn't being combined (and the agents aren't learning from each other) into a generalized model. They're just training a bunch of independent models for specific tasks & adding a model-selection step that maps an image to the most relevant "expert". My guess is you could do the same thing using CLIP vectors as the routing method to supervised models trained on specific datasets (we found that datasets largely live in distinct regions of CLIP-space[3]).

    [1] https://github.com/autodistill/autodistill

    [2] https://github.com/gyhandy/Shared-Knowledge-Lifelong-Learnin...

    [3] https://www.rf100.org

  • openbrewerydb

    🍻 An open-source dataset of breweries, cideries, brewpubs, and bottleshops.

    Project mention: API to get all breweries from a city? | /r/Untappd | 2023-04-19

    More or less, but the list of countries leaves room for improvement. Still, thank you.

  • clip-italian

    CLIP (Contrastive Language–Image Pre-training) for Italian

  • mnist1d

    A 1D analogue of the MNIST dataset for measuring spatial biases and answering Science of Deep Learning questions.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-02-21.

What are some of the best open-source Dataset projects in Jupyter Notebook? This list will help you:

#   Project                                Stars
1   covid-chestxray-dataset                2,958
2   whylogs                                2,498
3   waymo-open-dataset                     2,460
4   datasets                               2,274
5   fma                                    2,053
6   clusterdata                            1,439
7   raccoon_dataset                        1,262
8   COVID-CT                               1,062
9   ThoughtSource                            813
10  torchxrayvision                          802
11  hate-speech-and-offensive-language       740
12  TACO                                     540
13  OpenAI-CLIP                              454
14  SKAB                                     279
15  Awesome_Satellite_Benchmark_Datasets     260
16  covid19za                                256
17  alis                                     227
18  ImageNetV2                               220
19  goodreads                                214
20  roboflow-100-benchmark                   211
21  openbrewerydb                            170
22  clip-italian                             167
23  mnist1d                                  127