Jupyter Notebook Dataset

Open-source Jupyter Notebook projects categorized as Dataset

Top 23 Jupyter Notebook Dataset Projects

  • covid-chestxray-dataset

    We are building an open database of COVID-19 cases with chest X-ray or CT images.

  • whylogs

    An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈

  • datasets

    🎁 5,400,000+ Unsplash images made available for research and machine learning (by unsplash)

  • Project mention: AI-Powered Image Search with CLIP, pgvector, and Fast API | dev.to | 2024-02-12

    Here's a live demo with a simple React frontend. It's searching against an S3 bucket containing Unsplash's open source dataset of 25,000 images, plus a few of my own.

  • fma

    FMA: A Dataset For Music Analysis

  • clusterdata

    cluster data collected from production clusters in Alibaba for cluster management research

  • raccoon_dataset

    The dataset is used to train my own raccoon detector and I blogged about it on Medium

  • COVID-CT

    COVID-CT-Dataset: A CT Scan Dataset about COVID-19

  • ThoughtSource

    A central, open resource for data and tools related to chain-of-thought reasoning in large language models. Developed @ Samwald research group: https://samwald.info/

  • torchxrayvision

    TorchXRayVision: A library of chest X-ray datasets and models. Classifiers, segmentation, and autoencoders.

  • hate-speech-and-offensive-language

    Repository for the paper "Automated Hate Speech Detection and the Problem of Offensive Language", ICWSM 2017

  • TACO

    🌮 Trash Annotations in Context Dataset Toolkit (by pedropro)

  • OpenAI-CLIP

    Simple implementation of OpenAI CLIP model in PyTorch.

  • Project mention: Simple Implementation of OpenAI Clip (Tutorial) | news.ycombinator.com | 2024-02-21
  • SKAB

    SKAB - Skoltech Anomaly Benchmark. Time-series data for evaluating Anomaly Detection algorithms.

  • Project mention: SKAB: NEW Data - star count:238.0 | /r/algoprojects | 2023-09-25
  • Awesome_Satellite_Benchmark_Datasets

    Supplementary material for our paper "THERE IS NO DATA LIKE MORE DATA" is provided.

  • covid19za

    Coronavirus COVID-19 (2019-nCoV) Data Repository and Dashboard for South Africa

  • alis

    [ICCV 2021] Aligning Latent and Image Spaces to Connect the Unconnectable (by universome)

  • roboflow-100-benchmark

    Code for replicating Roboflow 100 benchmark results and programmatically downloading benchmark datasets

  • Project mention: AI That Teaches Other AI | news.ycombinator.com | 2023-07-20

    > Their SKILL tool involves a set of algorithms that make the process go much faster, they said, because the agents learn at the same time in parallel. Their research showed if 102 agents each learn one task and then share, the amount of time needed is reduced by a factor of 101.5 after accounting for the necessary communications and knowledge consolidation among agents.

    This is a really interesting idea. It's like the reverse of knowledge distillation (which I've been thinking about a lot[1]) where you have one giant model that knows a lot about a lot & you use that model to train smaller, faster models that know a lot about a little.

    Instead, if you could train a lot of models that know a lot about a little (which is a lot less computationally intensive because the problem space is so confined) and combine them into a generalized model, that'd be hugely beneficial.

    Unfortunately, after a bit of digging into the paper & Github repo[2], this doesn't seem to be what's happening at all.

    > The code will learn 102 small and separte heads(either a linear head or a linear head with a task bias) for each tasks respectively in order. This step can be parallized on multiple GPUS with one task per GPU. The heads will be saved in the weight folder. After that, the code will learn a task mapper(Either using GMMC or Mahalanobis) to distinguish image task-wisely. Then, all images will be evaluated in the same time without a task label.

    So the knowledge isn't being combined (and the agents aren't learning from each other) into a generalized model. They're just training a bunch of independent models for specific tasks & adding a model-selection step that maps an image to the most relevant "expert". My guess is you could do the same thing using CLIP vectors as the routing method to supervised models trained on specific datasets (we found that datasets largely live in distinct regions of CLIP-space[3]).

    [1] https://github.com/autodistill/autodistill

    [2] https://github.com/gyhandy/Shared-Knowledge-Lifelong-Learnin...

    [3] https://www.rf100.org
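
The routing idea proposed at the end of this comment (pick the most relevant per-dataset "expert" by where an image falls in embedding space) can be sketched roughly as below. This is a minimal illustration, not the SKILL authors' method and not code from the roboflow-100-benchmark repository. The three-dimensional vectors are made-up stand-ins for real CLIP embeddings, and the names `centroid`, `route`, and `dataset_embeddings` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Mean of the unit-normalized embeddings of one dataset, re-normalized."""
    unit = [normalize(v) for v in vectors]
    dim = len(unit[0])
    mean = [sum(u[i] for u in unit) / len(unit) for i in range(dim)]
    return normalize(mean)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def route(embedding: Vector, centroids: Dict[str, List[float]]) -> str:
    """Return the name of the dataset whose centroid is closest to the embedding."""
    return max(centroids, key=lambda name: cosine(embedding, centroids[name]))


# Toy stand-ins for image embeddings; a real setup would use CLIP image vectors.
dataset_embeddings = {
    "xray": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "trash": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.2]],
}
centroids = {name: centroid(vecs) for name, vecs in dataset_embeddings.items()}

# The chosen name would then select the supervised model trained on that dataset.
chosen = route([1.0, 0.0, 0.0], centroids)
print(chosen)
```

Nearest-centroid routing is only one possible choice here; the README quoted above mentions GMMC and Mahalanobis-based task mappers instead.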

  • goodreads

    code samples for the goodreads datasets (by MengtingWan)

  • ImageNetV2

    A new test set for ImageNet

  • openbrewerydb

    🍻 An open-source dataset of breweries, cideries, brewpubs, and bottleshops.

  • clip-italian

    CLIP (Contrastive Language–Image Pre-training) for Italian

  • mnist1d

    A 1D analogue of the MNIST dataset for measuring spatial biases and answering Science of Deep Learning questions.

  • cpi

    Quickly adjust U.S. dollars for inflation using the Consumer Price Index (CPI)

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source Dataset projects in Jupyter Notebook? This list will help you:

Rank  Project                               Stars
1     covid-chestxray-dataset               2,958
2     whylogs                               2,543
3     datasets                              2,299
4     fma                                   2,108
5     clusterdata                           1,477
6     raccoon_dataset                       1,266
7     COVID-CT                              1,062
8     ThoughtSource                           832
9     torchxrayvision                         828
10    hate-speech-and-offensive-language      750
11    TACO                                    540
12    OpenAI-CLIP                             509
13    SKAB                                    292
14    Awesome_Satellite_Benchmark_Datasets    282
15    covid19za                               255
16    alis                                    227
17    roboflow-100-benchmark                  227
18    goodreads                               228
19    ImageNetV2                              223
20    openbrewerydb                           173
21    clip-italian                            170
22    mnist1d                                 138
23    cpi                                     127
