Python Datasets

Open-source Python projects categorized as Datasets

Top 23 Python Dataset Projects

  • datasets

    🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

    Project mention: FauxPilot – an open-source GitHub Copilot server | 2022-08-02

    And then pass that my_code.json as the dataset name.


  • label-studio

    Label Studio is a multi-type data labeling and annotation tool with standardized output format

    Project mention: [D] Are there any tools to quickly label training data manually? | 2022-07-29

  • doccano

    Open source annotation tool for machine learning practitioners.

    Project mention: Ask HN: Any open source text editors with word tagging? | 2022-08-04

    I worked at a place where we developed a system for doing this kind of tagging but it was for making training sets and there was no expectation that you could export the document from the system for normal use.

    Quite a few NLP annotation systems are out there

  • datasette

    An open source multi-tool for exploring and publishing data

    Project mention: Ask HN: What's the best way to create a database for legal document clauses? | 2022-08-10

    I would recommend SQLite + a nice usable interface tool like


    Or SQLitebrowser
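
    As a rough illustration of the SQLite suggestion above, the sketch below stores clauses in a single table using Python's standard-library sqlite3 module. The table layout, the column names and the helper functions (build_clause_db, add_clause, find_clauses) are hypothetical and chosen only for this example; they do not come from the original thread.

```python
import sqlite3


def build_clause_db(path=":memory:"):
    """Open (or create) a SQLite database with a hypothetical clauses table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS clauses (
            id          INTEGER PRIMARY KEY,
            document    TEXT NOT NULL,
            clause_type TEXT NOT NULL,
            body        TEXT NOT NULL
        )
        """
    )
    return conn


def add_clause(conn, document, clause_type, body):
    """Insert one clause and return its row id."""
    cur = conn.execute(
        "INSERT INTO clauses (document, clause_type, body) VALUES (?, ?, ?)",
        (document, clause_type, body),
    )
    conn.commit()
    return cur.lastrowid


def find_clauses(conn, clause_type):
    """Return (document, body) pairs for a clause type, in insertion order."""
    rows = conn.execute(
        "SELECT document, body FROM clauses WHERE clause_type = ? ORDER BY id",
        (clause_type,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = build_clause_db()  # in-memory; pass a file path to persist
    add_clause(conn, "lease.pdf", "indemnity", "Tenant shall indemnify the landlord.")
    add_clause(conn, "nda.pdf", "confidentiality", "Each party keeps the terms secret.")
    print(find_clauses(conn, "indemnity"))
```

    If the database is written to a file instead of memory (for example clauses.db, a made-up name), a browsing tool such as Datasette can then be pointed at that file.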

  • akshare

    AKShare is an elegant and simple financial data interface library for Python, built for human beings! Open-source financial data interface library (by akfamily)

  • Activeloop Hub

    Dataset format for AI. Build, manage, query & visualize datasets for deep learning. Stream data real-time to PyTorch/TensorFlow & version-control it. (by activeloopai)

    Project mention: [Q] where to host 50GB dataset (for free?) | 2022-06-25

    Hey u/platoTheSloth, as u/gopietz mentioned (thanks a lot for the shout-out!!!), you can share them with the general public by uploading them to Activeloop Platform (for researchers, we offer special terms, but even as a member of the general public you get up to 300 GB of free storage!). Thanks to Hub, our open source dataset format for AI, anyone can load the dataset in under 3 seconds with one line of code, and stream it while training in PyTorch/TensorFlow.

  • datasets

    TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)

  • Colour

    Colour Science for Python

    Project mention: The Color of Infinite Temperature | 2022-01-16

    I haven’t seen the math for the conversion, but the conversions from CCT to xy/uv are each given for a particular domain. One of the conversions with the largest domain, i.e. Ohno m, covers the domain [1000K, 100000K]:

    Infinity is very much in extrapolation territory.

  • ogb

    Benchmark datasets, data loaders, and evaluators for graph machine learning

    Project mention: [D] Best way to handle encoding disconnected graphs at the graph level. | 2022-04-10

    Example code:

  • projects

    🪐 End-to-end NLP workflows from prototype to production (by explosion)

    Project mention: Using pre-trained BERT embeddings for multi-class text classification | 2022-01-10

    spaCy has an example project that uses BERT that you could use as a reference. It's multilabel but it should be easy to tweak the config to be just multiclass instead.

  • datumaro

    Dataset Management Framework, a Python library and a CLI tool to build, analyze and manage Computer Vision datasets.

    Project mention: Does anyone use CVAT for image annotation? | 2022-04-18

    1) CVAT has internal inference for models. If you upload a model there in the correct format, it will be able to generate the detection boxes itself. 2) Yes, you can upload your predictions, but the last time I did it there were some problems and it took me several hours. It seems to me that you just need to load the markup in one of the formats supported by CVAT. If your format is not supported, you will need to convert it. For example like this -

  • squirrel-core

    A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way :chestnut:

    Project mention: [P] Squirrel: A new OS library for fast & flexible large-scale data loading | 2022-04-11

    Today we open-sourced Squirrel, a data infrastructure library that my colleagues and I have been working on over the past 1.5 years:

  • DoppelGANger

    [IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions

    Project mention: DoppelGANger: NEW Data - star count:168.0 | 2022-06-11

  • Data Flow Facilitator for Machine Learning (dffml)

    The easiest way to use Machine Learning. Mix and match underlying ML libraries and data set sources. Generate new datasets or modify existing ones with ease.

  • zozo-shift15m

    SHIFT15M: multiobjective large-scale fashion dataset with distributional shifts

    Project mention: SHIFT15M: Multiobjective Large-scale Fashion Dataset with Distributional Shifts | 2021-09-09

  • torchSR

    Super Resolution datasets and models in PyTorch

  • scrapeOP

    A Python package for scraping

    Project mention: Help with web scraping | 2022-06-25

    For my first web scraping project, I wanted to use an existing program from GitHub:

  • multimodal

    A collection of multimodal datasets and visual features for VQA and captioning in PyTorch. Just run "pip install multimodal" (by cdancette)

  • podium

    Podium: a framework agnostic Python NLP library for data loading and preprocessing

    Project mention: Show HN: Podium: framework agnostic NLP library for data loading and preprocess | 2021-12-09

  • AREkit

    Document level Attitude and Relation Extraction toolkit (AREkit) for sampling mass-media news into datasets for your ML-model training and evaluation

    Project mention: Show HN: ARElight – A Mass-Media Processing Application for Relation Extraction | 2022-06-18

  • exorl

    ExORL: Exploratory Data for Offline Reinforcement Learning

    Project mention: "Don't Change the Algorithm, Change the Data: Exploratory Data for Offline Reinforcement Learning (ExoRL)", Yarats et al 2022 | 2022-02-13

  • squirrel-datasets-core

    Squirrel dataset hub

    Project mention: [P] Squirrel: A new OS library for fast & flexible large-scale data loading | 2022-04-11

    Have a look at this tutorial to learn how to convert to MessagePack using Spark.

  • clean-discord

    Cleaning Discord data for NLP

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-08-10.

What are some of the best open-source Dataset projects in Python? This list will help you:

Rank Project Stars
1 datasets 13,902
2 label-studio 9,987
3 doccano 6,571
4 datasette 6,327
5 akshare 5,270
6 Activeloop Hub 4,741
7 datasets 3,351
8 Colour 1,506
9 ogb 1,424
10 projects 908
11 datumaro 269
12 squirrel-core 227
13 DoppelGANger 183
14 Data Flow Facilitator for Machine Learning (dffml) 177
15 zozo-shift15m 130
16 torchSR 84
17 scrapeOP 66
18 multimodal 56
19 podium 55
20 AREkit 41
21 exorl 40
22 squirrel-datasets-core 30
23 clean-discord 7