Python Dataset

Open-source Python projects categorized as Dataset

Top 23 Python Dataset Projects

  1. public-apis

    A collective list of free APIs

    Project mention: public-apis: what 438k stars actually buy you, and what they don't | dev.to | 2026-05-31

    Repository: public-apis/public-apis

  3. faker

    Faker is a Python package that generates fake data for you. (by joke2k)

    Project mention: Your Test Data Is Type-Correct and Still Invalid: 6 Postgres Schema Features Generators Skip | dev.to | 2026-06-01

    Free and DIY. Faker, ORM seeders, and hand-written scripts generate values per column. Relationships, table-level constraints, and the features above stay your job, in your code, kept in sync by hand.

  4. LaTeX-OCR

    pix2tex: Using a ViT to convert images of equations into LaTeX code.

  5. fashion-mnist

    A MNIST-like fashion product database. Benchmark 👇

  6. doccano

    Open source annotation tool for machine learning practitioners.

  7. awesome-pretrained-chinese-nlp-models

    Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models, large models, multimodal models, and large language models

  8. transformer-pytorch

    Transformer: PyTorch Implementation of "Attention Is All You Need"

  9. datasets

    TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)

  10. img2dataset

    Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.

    Project mention: Anthropic reverses privacy stance, will train on Claude chats | news.ycombinator.com | 2025-08-29

    > By default, you are opted in. Perfectly clear.

    That's called opt-out. You're doing exactly what I described: gaslighting people into believing that opt-in and opt-out are synonyms, which makes the entire concept meaningless. The audacity of you calling me "political" while resorting to such manipulation is astounding.

    These are examples of what "opt-in by default" actually means. It means having the user manually consent to something every time, the polar opposite of your definition.

    - https://arstechnica.com/gadgets/2024/06/report-new-apple-int...

    - https://github.com/rom1504/img2dataset/issues/293

    It's also just pure laziness to label me as "hysterical" when PR departments of companies like Google have, like you, misused the terms opt-out and opt-in in deceptive ways.

    https://news.ycombinator.com/item?id=37314981

  11. TextRecognitionDataGenerator

    A synthetic data generator for text recognition

  12. waymo-open-dataset

    Waymo Open Dataset

  13. pandas-datareader

    Extract data from a wide range of Internet sources into a pandas DataFrame.

  14. Colour

    Colour Science for Python

    Project mention: Rendering the Visible Spectrum | news.ycombinator.com | 2026-02-18

    If you are interested in this topic, we have a fully featured colour science Python package that can of course render the visible spectrum: https://github.com/colour-science/colour?tab=readme-ov-file#...

  15. beir

    A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.

    Project mention: Gemini Embedding: Powering RAG and context engineering | news.ycombinator.com | 2025-07-31

    It's always worth checking out the MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard

    There are some good open models there that have longer context limits and fewer dimensions.

    The benchmarks are just a guide. It's best to build a test dataset with your own data. This is a good example of that: https://github.com/beir-cellar/beir/wiki/Load-your-custom-da...

    Another benefit of having your own test dataset is that it can grow as your data grows. And you can quickly test new models to see how they perform with YOUR data.

  16. fastdup

    fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data operation costs, all with unmatched scalability.

  17. linusrants

    Dataset of Linus Torvalds' rants classified by negativity using sentiment analysis

    Project mention: All Linus rants from 2012 to 2015 | news.ycombinator.com | 2026-03-16
  18. ESC-50

    ESC-50: Dataset for Environmental Sound Classification

  19. VBench

    [CVPR2024 Highlight] VBench - We Evaluate Video Generation

  20. DataProfiler

    What's in your data? Extract schema, statistics and entities from datasets

  21. streaming

    A Data Streaming Library for Efficient Neural Network Training (by mosaicml)

  22. chatgpt-comparison-detection

    Human ChatGPT Comparison Corpus (HC3), Detectors, and more! 🔥

  23. RecSysDatasets

    This is a repository of public data sources for Recommender Systems (RS).

  24. covid-19

    Novel Coronavirus 2019 time series data on cases (by datasets)
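
The beir mention above recommends building a test dataset from your own data and re-running it whenever you try a new model. A minimal sketch of that idea follows, using only the standard library. The toy corpus, queries, relevance judgments, and the `token_overlap` scorer are all hypothetical placeholders invented for illustration; they are not part of beir's API, and a real setup would swap in an actual embedding model as the scorer.

```python
from typing import Callable

# Hypothetical toy data: in practice these would come from your own documents.
corpus = {
    "d1": "reset your password from the account settings page",
    "d2": "invoices are emailed on the first day of each month",
    "d3": "export your data as csv from the reports tab",
}
queries = {
    "q1": "how do I reset my password",
    "q2": "when are invoices emailed",
}
# Relevance judgments: query id -> set of relevant document ids.
qrels = {"q1": {"d1"}, "q2": {"d2"}}


def token_overlap(query: str, doc: str) -> float:
    """Stand-in scorer: number of shared lowercase tokens (not a real embedding model)."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def recall_at_k(score: Callable[[str, str], float], k: int) -> float:
    """Mean fraction of each query's relevant documents found in its top-k results."""
    total = 0.0
    for qid, text in queries.items():
        # Rank every document for this query, best score first.
        ranked = sorted(corpus, key=lambda did: score(text, corpus[did]), reverse=True)
        hits = len(set(ranked[:k]) & qrels[qid])
        total += hits / len(qrels[qid])
    return total / len(queries)


print(recall_at_k(token_overlap, k=1))
```

Because the scorer is passed in as a function, comparing a new model only means calling `recall_at_k` again with a different scorer, and the dataset can grow by adding entries to the three dictionaries.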

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Dataset related posts

  • Show HN: CRED-1 – Open domain credibility dataset for on-device pre-bunking

    1 project | news.ycombinator.com | 24 May 2026
  • All Linus rants from 2012 to 2015

    1 project | news.ycombinator.com | 16 Mar 2026
  • Building a Cat Enrichment Assessment Tool in Python

    1 project | dev.to | 10 Mar 2026
  • Stop Creating 50 Users When You Only Need 5: Solving Django's Relationship Inflation Problem

    1 project | dev.to | 1 Jan 2026
  • McBroken

    1 project | news.ycombinator.com | 26 Aug 2025
  • McDonald's Gives Its Restaurants an AI Makeover

    1 project | news.ycombinator.com | 7 Mar 2025
  • Chain of Draft: Thinking Faster by Writing Less

    1 project | dev.to | 28 Feb 2025

Index

What are some of the best open-source Dataset projects in Python? This list will help you:

# Project Stars
1 public-apis 439,647
2 faker 19,258
3 LaTeX-OCR 16,324
4 fashion-mnist 12,741
5 doccano 10,667
6 awesome-pretrained-chinese-nlp-models 5,568
7 transformer-pytorch 4,585
8 datasets 4,566
9 img2dataset 4,424
10 TextRecognitionDataGenerator 3,660
11 waymo-open-dataset 3,334
12 pandas-datareader 3,181
13 Colour 2,593
14 beir 2,209
15 fastdup 1,855
16 linusrants 1,763
17 ESC-50 1,762
18 VBench 1,645
19 DataProfiler 1,557
20 streaming 1,514
21 chatgpt-comparison-detection 1,355
22 RecSysDatasets 1,232
23 covid-19 1,157

Did you know that Python is the 1st most popular programming language based on number of references?