Python Datasets

Open-source Python projects categorized as Datasets

Top 23 Python Dataset Projects

  1. datasets

    🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools

    Project mention: Training with Big Data on Any Cloud | dev.to | 2025-06-20

    Hugging Face Datasets -- the library that lets you download and manage datasets from the Hugging Face Hub, as well as being a convenient vendor-neutral interface for your own datasets.

  3. akshare

    AKShare is an elegant and simple financial data interface library for Python, built for human beings! 开源财经数据接口库 (by akfamily)

  4. cleanlab

    Cleanlab's open-source library is the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

  5. datasette

    An open source multi-tool for exploring and publishing data

    Project mention: The current state of LLM-driven development | news.ycombinator.com | 2025-08-10

    I've been using LLM-assistance for my larger open source projects - https://github.com/simonw/datasette https://github.com/simonw/llm and https://github.com/simonw/sqlite-utils - for a couple of years now.

    Also literally hundreds of smaller plugins and libraries and CLI tools, see https://github.com/simonw?tab=repositories (now at 880 repos) and https://pypi.org/user/simonw/ (340 published packages).

    Unlike my tools.simonwillison.net stuff the vast majority of those products are covered by automated tests and usually have comprehensive documentation too.

  6. doccano

    Open source annotation tool for machine learning practitioners.

  7. deeplake

    Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai

    Project mention: What I Learned Comparing Zilliz Cloud and Deep Lake for Scalable Vector Search | dev.to | 2025-06-09

    As I scaled up a semantic search engine for multi-modal content, I found myself at a fork in the road. Should I lean into a purpose-built vector database like Zilliz Cloud, or embrace a more flexible data lake approach with Deep Lake? These tools promise vector search at scale—but they come from fundamentally different architectural philosophies.

  8. datasets

    TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)

  10. torchgeo

    TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data

    Project mention: My First Open Source Contribution @microsoft | dev.to | 2024-11-03

    Issue Worked On: Add Consistent Bands Metadata to Vision Transformer and ResNet Weights #2376 This week, I worked on a GitHub issue to add consistent band metadata across Vision Transformer (ViT) and ResNet weight classes in the torchgeo library. The goal was to ensure uniform metadata across different weight classes, specifically supporting various satellite datasets like Landsat and Sentinel.

  11. Colour

    Colour Science for Python

    Project mention: What Is a Color Space? | news.ycombinator.com | 2025-08-25

    Nice article, I came across very cool Python library recently too re. colour science - https://www.colour-science.org/

    Just started playing with it with my spectrometer based on one of the examples they have, to convert spectral data to a single RGB value.

  12. Open3D-ML

    An extension of Open3D to address 3D Machine Learning tasks

  13. ogb

    Benchmark datasets, data loaders, and evaluators for graph machine learning

  14. DB-GPT-Hub

    A repository that contains models, datasets, and fine-tuning techniques for DB-GPT, with the purpose of enhancing model performance in Text-to-SQL

  15. diffgram

    The AI Datastore for Schemas, BLOBs, and Predictions. Use with your apps or integrate built-in Human Supervision, Data Workflow, and UI Catalog to get the most value out of your AI Data.

  16. entity-recognition-datasets

    A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.

  17. safe-rlhf

    Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback

  18. projects

    🪐 End-to-end NLP workflows from prototype to production (by explosion)

  19. semhash

    Fast Semantic Text Deduplication & Filtering

    Project mention: Show HN: SemHash – Semantic Text Deduplication, Outlier Filtering and Sampling | news.ycombinator.com | 2025-04-27
  20. dataset-viewer

    Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.

  21. datumaro

    Dataset Management Framework, a Python library and a CLI tool to build, analyze and manage Computer Vision datasets.

  22. pudl

    The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.

  23. CelebV-HQ

    [ECCV 2022] CelebV-HQ: A Large-Scale Video Facial Attributes Dataset

  24. Minari

    A standard format for offline reinforcement learning datasets, with popular reference datasets and related utilities

  25. DoppelGANger

    [IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Datasets related posts

  • What I Learned Comparing Zilliz Cloud and Deep Lake for Scalable Vector Search

    1 project | dev.to | 9 Jun 2025
  • Show HN: SemHash – Semantic Text Deduplication, Outlier Filtering and Sampling

    1 project | news.ycombinator.com | 27 Apr 2025
  • Sell Yourself Sell Your Work

    4 projects | news.ycombinator.com | 25 Mar 2025
  • Exploring the Paramilitary Leaks

    1 project | news.ycombinator.com | 6 Mar 2025
  • Show HN: SemHash – Fast Semantic Text Deduplication for Cleaner Datasets

    1 project | news.ycombinator.com | 19 Jan 2025
  • I Track My Health Data in Markdown: Lessons in Digital Longevity

    1 project | news.ycombinator.com | 15 Dec 2024
  • My First Open Source Contribution @microsoft

    1 project | dev.to | 3 Nov 2024

Index

What are some of the best open-source Dataset projects in Python? This list will help you:

# Project Stars
1 datasets 20,575
2 akshare 13,239
3 cleanlab 10,853
4 datasette 10,296
5 doccano 10,251
6 deeplake 8,792
7 datasets 4,466
8 torchgeo 3,616
9 Colour 2,343
10 Open3D-ML 2,113
11 ogb 2,027
12 DB-GPT-Hub 1,883
13 diffgram 1,881
14 entity-recognition-datasets 1,548
15 safe-rlhf 1,487
16 projects 1,396
17 semhash 798
18 dataset-viewer 778
19 datumaro 641
20 pudl 553
21 CelebV-HQ 443
22 Minari 425
23 DoppelGANger 306
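
The list is stated to be ordered by number of GitHub stars. As a minimal sketch, that ordering could be checked with a few lines of Python, assuming each index row has the form "rank project stars". The helper name `parse_index_row` is hypothetical, and the rows are a hand-copied subset of the index above.

```python
def parse_index_row(row: str) -> tuple[int, str, int]:
    """Split a row such as '1 datasets 20,575' into (rank, project, stars)."""
    rank, project, stars = row.split()
    # Star counts use a comma as the thousands separator.
    return int(rank), project, int(stars.replace(",", ""))


# A subset of rows copied from the index above.
rows = [
    "1 datasets 20,575",
    "2 akshare 13,239",
    "3 cleanlab 10,853",
    "7 datasets 4,466",
    "23 DoppelGANger 306",
]

parsed = [parse_index_row(row) for row in rows]
star_counts = [stars for _, _, stars in parsed]

# The list is consistent with the stated rule if stars never increase down the list.
is_ordered = star_counts == sorted(star_counts, reverse=True)
print(is_ordered)
```

The same check extends to the full index by adding the remaining rows to `rows`.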
