Python Datasets

Open-source Python projects categorized as Datasets

Top 23 Python Dataset Projects

  • datasets

    🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

    Project mention: 🐍🐍 23 issues to grow yourself as an exceptional open-source Python expert 🧑‍💻 🥇 | dev.to | 2023-10-19
  • doccano

    Open source annotation tool for machine learning practitioners.

    Project mention: You Can't Have a Free Software AI Stack | news.ycombinator.com | 2023-07-13

    Huh?

    I wrote my own system for classifying a stream of texts in Python, I might Open Source it one of these days but I have to get it to the point where it is modular enough that I can customize it to do the particular things I want without subjecting people to my whims... I use it every day and I'm not afraid to demo it because it is rock solid.

    My understanding is that my system would not be hard to adapt to work on images for certain kinds of tasks.

    Pytorch is open source, Huggingface is open source. CUDA isn't. This is

    https://labelstud.io/

    and for annotating text spans there are so many open source tools

    https://github.com/doccano/doccano

    I worked for a company a few years back that built annotation tools for projects we sold to customers but never quite got to a polished general purpose annotator. Today there are an overwhelming number of companies in this space and products I never heard of, many of which are cloud based or paid. Looks like a gold rush to me.

  • datasette

    An open source multi-tool for exploring and publishing data

    Project mention: Little Data: How do we query personal data? (2013) | news.ycombinator.com | 2024-03-01

    I'm a fan of simonw's datasette/dogsheep ecosystem https://datasette.io/

  • akshare

    AKShare is an elegant and simple financial data interface library for Python, built for human beings! An open-source financial data interface library (by akfamily)

  • deeplake

    Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai

    Project mention: FLaNK AI Weekly 25 March 2025 | dev.to | 2024-03-25
  • datasets

    TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)

  • torchgeo

    TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data

    Project mention: FLaNK Stack Weekly for 20 Nov 2023 | dev.to | 2023-11-20
  • Colour

    Colour Science for Python

    Project mention: Tailwind Color Palette Generator | news.ycombinator.com | 2024-02-02

    Colour Science is one of the more serious projects I know of, and more or less lets you get as advanced as you want. Used by film professionals among others. https://www.colour-science.org/

    How would you define what the perfect color tool is? I would guess like most tools that it depends entirely on the job at hand, and that maybe no one perfect tool can exist. Colour Science might be great at serious color management and perceptual measurements and conversions between standardized color spaces, but not the right tool for a web developer looking for quick & easy way to make an HSV palette generation widget (and not because Colour Science is Python, but because it’s too big and heavy of a hammer).

  • ogb

    Benchmark datasets, data loaders, and evaluators for graph machine learning

  • diffgram

    The AI Datastore for Schemas, BLOBs, and Predictions. Use with your apps or integrate built-in Human Supervision, Data Workflow, and UI Catalog to get the most value out of your AI Data.

  • Open3D-ML

    An extension of Open3D to address 3D Machine Learning tasks

    Project mention: Looking for Point Cloud deep learning, training sources | /r/deeplearning | 2023-07-13

    I already have a basic understanding of Open3D-ML and managed to get the training examples to work. However, my knowledge is not sufficient to transfer this to my own data or model deployment.

  • entity-recognition-datasets

    A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.

    Project mention: Recent English newswire NER datasets? | /r/LanguageTechnology | 2023-08-27

    There is of course the list at https://github.com/juand-r/entity-recognition-datasets, but all of the recent English datasets cover other domains of English, such as the music NER, space NER, etc. All interesting things, but not 2020s English newswire.

  • projects

    🪐 End-to-end NLP workflows from prototype to production (by explosion)

  • safe-rlhf

    Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback

    Project mention: [R] Meet Beaver-7B: a Constrained Value-Aligned LLM via Safe RLHF Technique | /r/MachineLearning | 2023-05-16
  • DB-GPT-Hub

    A repository that contains models, datasets, and fine-tuning techniques for DB-GPT, with the purpose of enhancing model performance in Text-to-SQL

    Project mention: Show HN: Improve Text-to-SQL Accuracy with LLM | news.ycombinator.com | 2023-07-10
  • datasets-server

    Lightweight web API for visualizing and exploring all types of datasets - computer vision, speech, text, and tabular - stored on the Hugging Face Hub

  • datumaro

    Dataset Management Framework, a Python library and a CLI tool to build, analyze and manage Computer Vision datasets.

  • CelebV-HQ

    [ECCV 2022] CelebV-HQ: A Large-Scale Video Facial Attributes Dataset

  • squirrel-core

    A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way 🌰

  • DoppelGANger

    [IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions

  • Data Flow Facilitator for Machine Learning (dffml)

    The easiest way to use Machine Learning. Mix and match underlying ML libraries and data set sources. Generate new datasets or modify existing ones with ease.

  • Minari

    A standard format for offline reinforcement learning datasets, with popular reference datasets and related utilities

    Project mention: Announcing Minari (Gym for offline RL, by the Farama Foundation) is going into public beta | /r/reinforcementlearning | 2023-05-18

    You can also read the full release notes here: https://github.com/Farama-Foundation/Minari/releases/tag/v0.3.0

  • scrapeOP

    A Python package for scraping oddsportal.com

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-03-25.
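
The ordering rule in the note can be checked mechanically. The following is a minimal sketch, assuming index rows of the form "rank name stars" as used on this page; the helper names `parse_row` and `is_sorted_by_stars` are hypothetical and the sample rows are copied from this page's index.

```python
import re

# A few rows copied from this page's index. The "rank name stars" layout is
# an assumption about how each row is formatted.
ROWS = [
    "1 datasets 18,228",
    "2 doccano 8,871",
    "21 Data Flow Facilitator for Machine Learning (dffml) 241",
    "23 scrapeOP 189",
]

# Leading rank, a lazily matched project name (it may contain spaces), and a
# trailing star count that may use thousands separators.
ROW_RE = re.compile(r"^(\d+)\s+(.+?)\s+([\d,]+)$")


def parse_row(row):
    """Split one index row into (rank, project name, star count)."""
    match = ROW_RE.match(row.strip())
    if match is None:
        raise ValueError(f"unrecognized row: {row!r}")
    rank, name, stars = match.groups()
    return int(rank), name, int(stars.replace(",", ""))


def is_sorted_by_stars(rows):
    """Return True when star counts never increase from one row to the next."""
    stars = [parse_row(row)[2] for row in rows]
    return all(a >= b for a, b in zip(stars, stars[1:]))


parsed = [parse_row(row) for row in ROWS]
print(parsed[2])
print(is_sorted_by_stars(ROWS))
```

Matching the name lazily and anchoring the star count at the end of the line keeps multi-word project names intact without needing a delimiter between columns.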

Index

What are some of the best open-source Dataset projects in Python? This list will help you:

 #   Stars  Project
 1  18,228  datasets
 2   8,871  doccano
 3   8,791  datasette
 4   8,151  akshare
 5   7,603  deeplake
 6   4,141  datasets
 7   2,176  torchgeo
 8   1,925  Colour
 9   1,852  ogb
10   1,781  diffgram
11   1,634  Open3D-ML
12   1,426  entity-recognition-datasets
13   1,232  projects
14   1,108  safe-rlhf
15     949  DB-GPT-Hub
16     597  datasets-server
17     474  datumaro
18     306  CelebV-HQ
19     277  squirrel-core
20     275  DoppelGANger
21     241  Data Flow Facilitator for Machine Learning (dffml)
22     203  Minari
23     189  scrapeOP