Top 23 Python Dataset Projects
-
datasets
🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools
Hugging Face Datasets is the library that lets you download and manage datasets from the Hugging Face Hub. It also serves as a convenient vendor-neutral interface for your own datasets.
-
cleanlab
Cleanlab's open-source library is the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
-
I've been using LLM assistance for my larger open source projects (https://github.com/simonw/datasette, https://github.com/simonw/llm, and https://github.com/simonw/sqlite-utils) for a couple of years now.
Also literally hundreds of smaller plugins, libraries, and CLI tools; see https://github.com/simonw?tab=repositories (now at 880 repos) and https://pypi.org/user/simonw/ (340 published packages).
Unlike my tools.simonwillison.net stuff, the vast majority of those products are covered by automated tests and usually have comprehensive documentation too.
-
deeplake
Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
Project mention: What I Learned Comparing Zilliz Cloud and Deep Lake for Scalable Vector Search | dev.to | 2025-06-09
As I scaled up a semantic search engine for multi-modal content, I found myself at a fork in the road. Should I lean into a purpose-built vector database like Zilliz Cloud, or embrace a more flexible data lake approach with Deep Lake? These tools promise vector search at scale, but they come from fundamentally different architectural philosophies.
-
Issue Worked On: Add Consistent Bands Metadata to Vision Transformer and ResNet Weights #2376
This week, I worked on a GitHub issue to add consistent band metadata across Vision Transformer (ViT) and ResNet weight classes in the torchgeo library. The goal was to ensure uniform metadata across different weight classes, specifically supporting various satellite datasets like Landsat and Sentinel.
-
Nice article. I recently came across a very cool Python library for colour science too: https://www.colour-science.org/
I just started playing with it and my spectrometer, working from one of their examples to convert spectral data to a single RGB value.
-
DB-GPT-Hub
A repository that contains models, datasets, and fine-tuning techniques for DB-GPT, with the purpose of enhancing model performance in Text-to-SQL
-
diffgram
The AI Datastore for Schemas, BLOBs, and Predictions. Use with your apps or integrate built-in Human Supervision, Data Workflow, and UI Catalog to get the most value out of your AI Data.
-
entity-recognition-datasets
A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
-
safe-rlhf
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
-
Project mention: Show HN: SemHash – Semantic Text Deduplication, Outlier Filtering and Sampling | news.ycombinator.com | 2025-04-27
-
dataset-viewer
Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
-
datumaro
Dataset Management Framework, a Python library and a CLI tool to build, analyze and manage Computer Vision datasets.
-
pudl
The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
-
Minari
A standard format for offline reinforcement learning datasets, with popular reference datasets and related utilities
-
DoppelGANger
[IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions
Python Datasets related posts
-
What I Learned Comparing Zilliz Cloud and Deep Lake for Scalable Vector Search
-
Show HN: SemHash – Semantic Text Deduplication, Outlier Filtering and Sampling
-
Sell Yourself Sell Your Work
-
Exploring the Paramilitary Leaks
-
Show HN: SemHash – Fast Semantic Text Deduplication for Cleaner Datasets
-
I Track My Health Data in Markdown: Lessons in Digital Longevity
-
My First Open Source Contribution @microsoft
-
Index
What are some of the best open-source Dataset projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | datasets | 20,575 |
| 2 | akshare | 13,239 |
| 3 | cleanlab | 10,853 |
| 4 | datasette | 10,296 |
| 5 | doccano | 10,251 |
| 6 | deeplake | 8,792 |
| 7 | datasets | 4,466 |
| 8 | torchgeo | 3,616 |
| 9 | Colour | 2,343 |
| 10 | Open3D-ML | 2,113 |
| 11 | ogb | 2,027 |
| 12 | DB-GPT-Hub | 1,883 |
| 13 | diffgram | 1,881 |
| 14 | entity-recognition-datasets | 1,548 |
| 15 | safe-rlhf | 1,487 |
| 16 | projects | 1,396 |
| 17 | semhash | 798 |
| 18 | dataset-viewer | 778 |
| 19 | datumaro | 641 |
| 20 | pudl | 553 |
| 21 | CelebV-HQ | 443 |
| 22 | Minari | 425 |
| 23 | DoppelGANger | 306 |