Python Datasets

Open-source Python projects categorized as Datasets

Top 23 Python Dataset Projects

  • datasets

    🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

    Project mention: 🐍🐍 23 issues to grow yourself as an exceptional open-source Python expert 🧑‍💻 🥇 | 2023-10-19

  • doccano

    Open source annotation tool for machine learning practitioners.

    Project mention: You Can't Have a Free Software AI Stack | 2023-07-13

    I wrote my own system in Python for classifying a stream of texts. I might open-source it one of these days, but first I have to make it modular enough that I can customize it to do the particular things I want without subjecting people to my whims... I use it every day, and I'm not afraid to demo it because it is rock solid.

    My understanding is that my system would not be hard to adapt to work on images for certain kinds of tasks.

    PyTorch is open source, Hugging Face is open source. CUDA isn't. This is

    and for annotating text spans there are so many open source tools

    I worked for a company a few years back that built annotation tools for projects we sold to customers but never quite got to a polished general purpose annotator. Today there are an overwhelming number of companies in this space and products I never heard of, many of which are cloud based or paid. Looks like a gold rush to me.

  • datasette

    An open source multi-tool for exploring and publishing data

    Project mention: Little Data: How do we query personal data? (2013) | 2024-03-01

    I'm a fan of simonw's datasette/dogsheep ecosystem.

  • akshare

    AKShare is an elegant and simple financial data interface library for Python, built for human beings! Open-source financial data interface library (by akfamily)

  • deeplake

    Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow.

    Project mention: Qdrant, the Vector Search Database, raised $28M in a Series A round | 2024-01-23

    I think Activeloop(YC) is too:

  • datasets

    TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)

  • torchgeo

    TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data

    Project mention: FLaNK Stack Weekly for 20 Nov 2023 | 2023-11-20

  • Colour

    Colour Science for Python

    Project mention: Tailwind Color Palette Generator | 2024-02-02

    Colour Science is one of the more serious projects I know of, and more or less lets you get as advanced as you want. Used by film professionals among others.

    How would you define what the perfect color tool is? I would guess, like most tools, that it depends entirely on the job at hand, and that maybe no one perfect tool can exist. Colour Science might be great at serious color management, perceptual measurements, and conversions between standardized color spaces, but not the right tool for a web developer looking for a quick & easy way to make an HSV palette generation widget (and not because Colour Science is Python, but because it’s too big and heavy of a hammer).

  • ogb

    Benchmark datasets, data loaders, and evaluators for graph machine learning

  • diffgram

    The AI Datastore for Schemas, BLOBs, and Predictions. Use with your apps or integrate built-in Human Supervision, Data Workflow, and UI Catalog to get the most value out of your AI Data.

  • Open3D-ML

    An extension of Open3D to address 3D Machine Learning tasks

    Project mention: Looking for Point Cloud deep learning, training sources | /r/deeplearning | 2023-07-13

    I already have a basic understanding of Open3D-ML and managed to get the training examples to work. However, my knowledge is not sufficient to transfer this to my own data or to model deployment.

  • entity-recognition-datasets

    A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.

    Project mention: Recent English newswire NER datasets? | /r/LanguageTechnology | 2023-08-27

    There is of course the list at, but all of the recent English datasets cover other domains of English, such as the music NER, space NER, etc. All interesting things, but not 2020s English newswire.

  • projects

    🪐 End-to-end NLP workflows from prototype to production (by explosion)

    Project mention: Identify custom labels as well as existing labels with Spacy v3 | /r/LanguageTechnology | 2023-03-12

    When I was doing the same task, I used their `spacy project` command-line interface and extended their `ner_drugs` project, made things pretty easy.

  • safe-rlhf

    Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback

    Project mention: [R] Meet Beaver-7B: a Constrained Value-Aligned LLM via Safe RLHF Technique | /r/MachineLearning | 2023-05-16

  • DB-GPT-Hub

    A repository that contains models, datasets, and fine-tuning techniques for DB-GPT, with the purpose of enhancing model performance in Text-to-SQL

    Project mention: Show HN: Improve Text-to-SQL Accuracy with LLM | 2023-07-10

  • datasets-server

    Lightweight web API for visualizing and exploring all types of datasets - computer vision, speech, text, and tabular - stored on the Hugging Face Hub

  • datumaro

    Dataset Management Framework, a Python library and a CLI tool to build, analyze and manage Computer Vision datasets.

  • CelebV-HQ

    [ECCV 2022] CelebV-HQ: A Large-Scale Video Facial Attributes Dataset

  • DoppelGANger

    [IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions

  • squirrel-core

    A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way 🌰

  • Data Flow Facilitator for Machine Learning (dffml)

    The easiest way to use Machine Learning. Mix and match underlying ML libraries and data set sources. Generate new datasets or modify existing ones with ease.

  • Minari

    A standard format for offline reinforcement learning datasets, with popular reference datasets and related utilities

    Project mention: Announcing Minari (Gym for offline RL, by the Farama Foundation) is going into public beta | /r/reinforcementlearning | 2023-05-18

    You can also read the full release notes here:

  • scrapeOP

    A Python package for scraping

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-03-01.

What are some of the best open-source Dataset projects in Python? This list will help you:

Rank | Project | Stars
1 | datasets | 18,096
2 | doccano | 8,771
3 | datasette | 8,733
4 | akshare | 8,018
5 | deeplake | 7,510
6 | datasets (by tensorflow) | 4,121
7 | torchgeo | 2,142
8 | Colour | 1,911
9 | ogb | 1,844
10 | diffgram | 1,772
11 | Open3D-ML | 1,608
12 | entity-recognition-datasets | 1,425
13 | projects | 1,222
14 | safe-rlhf | 1,070
15 | DB-GPT-Hub | 852
16 | datasets-server | 591
17 | datumaro | 461
18 | CelebV-HQ | 306
19 | DoppelGANger | 273
20 | squirrel-core | 272
21 | Data Flow Facilitator for Machine Learning (dffml) | 240
22 | Minari | 194
23 | scrapeOP | 189