Top 23 Python Dataset Projects
-
Project mention: public-apis: what 438k stars actually buy you, and what they don't | dev.to | 2026-05-31
Repository: public-apis/public-apis
-
Project mention: Your Test Data Is Type-Correct and Still Invalid: 6 Postgres Schema Features Generators Skip | dev.to | 2026-06-01
Free and DIY. Faker, ORM seeders, and hand-written scripts generate values per column. Relationships, table-level constraints, and the features above stay your job, in your code, kept in sync by hand.
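To make the per-column point concrete, here is a minimal sketch that uses only the Python standard library rather than Faker itself. The `bookings`-style rows, the `check_in` and `check_out` columns, and the `check_out >= check_in` rule are all hypothetical, chosen for illustration. Each column is drawn independently, so every value has the right type, but the cross-column rule holds only if you enforce it yourself.

```python
import random
from datetime import date, timedelta


def random_date(rng: random.Random) -> date:
    """Return a type-correct date in 2025, drawn independently of any other column."""
    return date(2025, 1, 1) + timedelta(days=rng.randrange(365))


def generate_rows(n: int, seed: int = 0) -> list[dict]:
    """Per-column generation: each value is valid alone, rows are not validated."""
    rng = random.Random(seed)
    return [
        {"id": i, "check_in": random_date(rng), "check_out": random_date(rng)}
        for i in range(1, n + 1)
    ]


def violates_check(row: dict) -> bool:
    """Hypothetical table-level rule: CHECK (check_out >= check_in)."""
    return row["check_out"] < row["check_in"]


def repair(row: dict) -> dict:
    """One hand-written way to satisfy the rule: put the two dates in order."""
    first, last = sorted((row["check_in"], row["check_out"]))
    return {**row, "check_in": first, "check_out": last}


if __name__ == "__main__":
    rows = generate_rows(1000)
    bad = sum(violates_check(r) for r in rows)
    print(f"{bad} of {len(rows)} generated rows would fail the CHECK constraint")
    fixed = [repair(r) for r in rows]
    print(f"{sum(violates_check(r) for r in fixed)} fail after repair")
```

The `repair` step is exactly the kind of code that has to be kept in sync with the schema by hand: if the constraint changes, the generator does not notice.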
-
awesome-pretrained-chinese-nlp-models
Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models, large models, multimodal models, and large language models
-
img2dataset
Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
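As a rough sanity check on that figure, the short calculation below converts "100M urls in 20h" into the sustained rate it implies. It uses only the two numbers quoted above; `required_rate` is a hypothetical helper name, not part of img2dataset.

```python
def required_rate(urls: int, hours: float) -> float:
    """Average URLs per second needed to finish `urls` downloads in `hours`."""
    return urls / (hours * 3600)


if __name__ == "__main__":
    rate = required_rate(100_000_000, 20)
    print(f"about {rate:.0f} URLs per second, sustained for 20 hours")
```

That works out to roughly 1,400 URLs per second on average, including resizing and packaging.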
Project mention: Anthropic reverses privacy stance, will train on Claude chats | news.ycombinator.com | 2025-08-29
> By default, you are opted in. Perfectly clear.
That's called opt-out. You're doing exactly what I described: gaslighting people into believing that opt-in and opt-out are synonyms, which makes the entire concept meaningless. The audacity of you calling me "political" while resorting to such manipulation is astounding.
These are examples of what "opt-in by default" actually means. It means having the user manually consent to something every time, the polar opposite of your definition.
- https://arstechnica.com/gadgets/2024/06/report-new-apple-int...
- https://github.com/rom1504/img2dataset/issues/293
It's also just pure laziness to label me as "hysterical" when PR departments of companies like Google have, like you, misused the terms opt-out and opt-in in deceptive ways.
https://news.ycombinator.com/item?id=37314981
-
If you are interested in this topic, we have a fully featured colour science Python package that can of course render the visible spectrum: https://github.com/colour-science/colour?tab=readme-ov-file#...
-
beir
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
Project mention: Gemini Embedding: Powering RAG and context engineering | news.ycombinator.com | 2025-07-31
It's always worth checking out the MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard
There are some good open models there that have longer context limits and fewer dimensions.
The benchmarks are just a guide. It's best to build a test dataset with your own data. This is a good example of that: https://github.com/beir-cellar/beir/wiki/Load-your-custom-da...
Another benefit of having your own test dataset is that it can grow as your data grows, and you can quickly test new models to see how they perform with YOUR data.
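A minimal sketch of what such a home-grown test set could look like, in plain Python. The `corpus`, `queries`, and `qrels` dictionaries are loosely modeled on the corpus/queries/relevance-judgment layout common in retrieval benchmarks; the data is invented, and none of this is BEIR's API. The word-overlap `overlap_retriever` is only a stand-in for whichever embedding model you want to compare.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase whitespace tokenization, enough for a toy example."""
    return set(text.lower().split())


def overlap_retriever(query: str, corpus: dict[str, str], k: int) -> list[str]:
    """Toy stand-in for a real model: rank documents by words shared with the query."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda d: (-len(q & tokenize(corpus[d])), d))
    return ranked[:k]


def recall_at_k(retriever, corpus, queries, qrels, k: int) -> float:
    """Mean fraction of each query's relevant documents found in the top k.

    Assumes every query has at least one relevant document in qrels.
    """
    total = 0.0
    for qid, text in queries.items():
        relevant = {d for d, score in qrels[qid].items() if score > 0}
        hits = relevant & set(retriever(text, corpus, k))
        total += len(hits) / len(relevant)
    return total / len(queries)


# Hypothetical test set built from your own documents and real user questions.
corpus = {
    "d1": "reset your account password from the login page",
    "d2": "export invoices as csv from the billing tab",
    "d3": "invite teammates and manage roles",
}
queries = {"q1": "how do I reset my password", "q2": "download invoices as csv"}
qrels = {"q1": {"d1": 1}, "q2": {"d2": 1}}

if __name__ == "__main__":
    score = recall_at_k(overlap_retriever, corpus, queries, qrels, k=1)
    print(f"recall@1 = {score:.2f}")
```

To compare models, swap `overlap_retriever` for a function with the same signature that wraps each candidate, and add new queries and judgments as your data grows.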
-
fastdup
fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data operation costs, all with unmatched scalability.
Python Dataset related posts
-
Show HN: CRED-1 – Open domain credibility dataset for on-device pre-bunking
-
All Linus rants from 2012 to 2015
-
Building a Cat Enrichment Assessment Tool in Python
-
Stop Creating 50 Users When You Only Need 5: Solving Django's Relationship Inflation Problem
-
McBroken
-
McDonald's Gives Its Restaurants an AI Makeover
-
Chain of Draft: Thinking Faster by Writing Less
Index
What are some of the best open-source Dataset projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | public-apis | 439,647 |
| 2 | faker | 19,258 |
| 3 | LaTeX-OCR | 16,324 |
| 4 | fashion-mnist | 12,741 |
| 5 | doccano | 10,667 |
| 6 | awesome-pretrained-chinese-nlp-models | 5,568 |
| 7 | transformer-pytorch | 4,585 |
| 8 | datasets | 4,566 |
| 9 | img2dataset | 4,424 |
| 10 | TextRecognitionDataGenerator | 3,660 |
| 11 | waymo-open-dataset | 3,334 |
| 12 | pandas-datareader | 3,181 |
| 13 | Colour | 2,593 |
| 14 | beir | 2,209 |
| 15 | fastdup | 1,855 |
| 16 | linusrants | 1,763 |
| 17 | ESC-50 | 1,762 |
| 18 | VBench | 1,645 |
| 19 | DataProfiler | 1,557 |
| 20 | streaming | 1,514 |
| 21 | chatgpt-comparison-detection | 1,355 |
| 22 | RecSysDatasets | 1,232 |
| 23 | covid-19 | 1,157 |