Python Datasets

Open-source Python projects categorized as Datasets

Top 23 Python Dataset Projects

  1. datasets

    🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

    Project mention: 20 Open Source Tools I Recommend to Build, Share, and Run AI Projects | dev.to | 2024-11-13

    Datasets library repository for accessing and sharing datasets with the community.

  3. akshare

    AKShare is an elegant and simple financial data interface library for Python, built for human beings! An open-source financial data interface library (by akfamily)

  4. cleanlab

    The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

    Project mention: Ask HN: Not a webdev, why are these sites so good? | news.ycombinator.com | 2024-06-18

    https://cleanlab.ai/

  5. datasette

    An open source multi-tool for exploring and publishing data

    Project mention: Exploring LLMs: A Blind Trial for Code Completions | dev.to | 2025-03-09

    SQLite is used because it's lightweight, requires no server setup, and provides a self-contained database solution ideal for this type of data collection. Additionally, Datasette can be used to easily query, visualize, and publish the data for later analysis.

  6. doccano

    Open source annotation tool for machine learning practitioners.

  7. deeplake

    Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai

    Project mention: Creation of the ApostropheCMS Documentation Chatbot | dev.to | 2024-08-29

    Finally, we stored these vectors in our chosen database: the activeloop DeepLake database. This database is open source, something near and dear to our own open-source hearts. We will cover some additional details in a further section, but it is specifically designed to handle vector data and perform efficient similarity searches, which is crucial for quick and accurate retrieval during the RAG process.

  8. datasets

    TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)

  10. torchgeo

    TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data

    Project mention: My First Open Source Contribution @microsoft | dev.to | 2024-11-03

    Issue worked on: "Add Consistent Bands Metadata to Vision Transformer and ResNet Weights" (#2376). This week, I worked on a GitHub issue to add consistent band metadata across Vision Transformer (ViT) and ResNet weight classes in the torchgeo library. The goal was to ensure uniform metadata across different weight classes, specifically supporting various satellite datasets like Landsat and Sentinel.

  11. Colour

    Colour Science for Python

  12. ogb

    Benchmark datasets, data loaders, and evaluators for graph machine learning

  13. Open3D-ML

    An extension of Open3D to address 3D Machine Learning tasks

  14. diffgram

    The AI Datastore for Schemas, BLOBs, and Predictions. Use with your apps or integrate built-in Human Supervision, Data Workflow, and UI Catalog to get the most value out of your AI Data.

  15. DB-GPT-Hub

    A repository that contains models, datasets, and fine-tuning techniques for DB-GPT, with the purpose of enhancing model performance in Text-to-SQL

  16. entity-recognition-datasets

    A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.

  17. safe-rlhf

    Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback

  18. projects

    🪐 End-to-end NLP workflows from prototype to production (by explosion)

  19. dataset-viewer

    Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.

  20. datumaro

    Dataset Management Framework, a Python library and a CLI tool to build, analyze and manage Computer Vision datasets.

  21. semhash

    Fast Semantic Text Deduplication

    Project mention: Show HN: SemHash – Fast Semantic Text Deduplication for Cleaner Datasets | news.ycombinator.com | 2025-01-19
  22. pudl

    The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.

  23. CelebV-HQ

    [ECCV 2022] CelebV-HQ: A Large-Scale Video Facial Attributes Dataset

  24. Minari

    A standard format for offline reinforcement learning datasets, with popular reference datasets and related utilities

  25. DoppelGANger

    [IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
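
The datasette entry in the list above quotes a post that chose SQLite because it needs no server setup. As a minimal sketch of that idea, the snippet below uses only Python's built-in sqlite3 module; the table name, columns, and rows are hypothetical and invented purely for illustration.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
# ":memory:" keeps everything in RAM; pass a file path such as
# "trial.db" (an assumed name) to persist the data instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE completions ("
    "id INTEGER PRIMARY KEY, model TEXT NOT NULL, rating INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO completions (model, rating) VALUES (?, ?)",
    [("model-a", 4), ("model-b", 2), ("model-a", 5)],
)
conn.commit()

# Aggregate the collected ratings per model with plain SQL.
rows = conn.execute(
    "SELECT model, AVG(rating) FROM completions GROUP BY model ORDER BY model"
).fetchall()
print(rows)
conn.close()
```

If the data were written to a file rather than to memory, running the Datasette command-line tool against that file (for example `datasette trial.db`, with the file name again an assumption) should serve it for browsing and querying.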

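The deeplake entry in the list above quotes a post about storing vectors and running similarity searches for retrieval. The sketch below shows the underlying idea as a brute-force cosine-similarity lookup in plain Python. It does not use the DeepLake API, and the document names and three-dimensional vectors are hypothetical; a real vector database would rely on much larger embeddings and an index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, store, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in store.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical document ids and toy embeddings, for illustration only.
store = {
    "install-guide": [1.0, 0.0, 0.0],
    "api-reference": [0.0, 1.0, 0.0],
    "upgrade-notes": [0.9, 0.1, 0.0],
}
result = top_k([1.0, 0.05, 0.0], store, k=2)
print(result)
```

The scan compares the query against every stored vector, which is fine for a handful of documents but is exactly the cost that dedicated vector stores are built to avoid.
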
Python Datasets related posts

  • Exploring the Paramilitary Leaks

    1 project | news.ycombinator.com | 6 Mar 2025
  • Show HN: SemHash – Fast Semantic Text Deduplication for Cleaner Datasets

    1 project | news.ycombinator.com | 19 Jan 2025
  • I Track My Health Data in Markdown: Lessons in Digital Longevity

    1 project | news.ycombinator.com | 15 Dec 2024
  • My First Open Source Contribution @microsoft

    1 project | dev.to | 3 Nov 2024
  • Creation of the ApostropheCMS Documentation Chatbot

    2 projects | dev.to | 29 Aug 2024
  • TorchGeo: How to Download the NWPU VHR-10 Dataset

    2 projects | dev.to | 23 Aug 2024
  • CLI tool and Python library for manipulating SQLite databases

    1 project | news.ycombinator.com | 8 Jul 2024

Index

What are some of the best open-source Dataset projects in Python? This list will help you:

#   Project                      Stars
1   datasets                     19,851
2   akshare                      11,021
3   cleanlab                     10,241
4   datasette                     9,884
5   doccano                       9,849
6   deeplake                      8,485
7   datasets (by tensorflow)      4,375
8   torchgeo                      3,258
9   Colour                        2,214
10  ogb                           1,987
11  Open3D-ML                     1,981
12  diffgram                      1,860
13  DB-GPT-Hub                    1,668
14  entity-recognition-datasets   1,531
15  safe-rlhf                     1,427
16  projects                      1,361
17  dataset-viewer                  733
18  datumaro                        582
19  semhash                         573
20  pudl                            524
21  CelebV-HQ                       404
22  Minari                          362
23  DoppelGANger                    303
