Top 23 Python Dataset Projects
-
datasets
🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
Project mention: 20 Open Source Tools I Recommend to Build, Share, and Run AI Projects | dev.to | 2024-11-13
Datasets library repository for accessing and sharing datasets with the community.
-
cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Project mention: Ask HN: Not a webdev, why are these sites so good? | news.ycombinator.com | 2024-06-18
https://cleanlab.ai/
-
SQLite is used because it's lightweight, requires no server setup, and provides a self-contained database solution ideal for this type of data collection. Additionally, Datasette can be used to easily query, visualize, and publish the data for later analysis.
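As a rough illustration of the "no server setup" point, the sketch below uses only Python's standard-library `sqlite3` module. The table name `readings`, its columns, and the helper `collect_readings` are hypothetical, chosen for this example; they are not taken from the project mentioned above.

```python
import sqlite3


def collect_readings(rows, db_path=":memory:"):
    """Store (label, value) pairs in SQLite and return the open connection."""
    # Opening a connection creates the database; no server process is involved.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(label TEXT NOT NULL, value REAL NOT NULL)"
    )
    # Parameterized inserts avoid building SQL strings by hand.
    conn.executemany("INSERT INTO readings (label, value) VALUES (?, ?)", rows)
    conn.commit()
    return conn


# Hypothetical sample data, for illustration only.
conn = collect_readings([("temp", 21.5), ("temp", 22.5), ("humidity", 40.0)])

# Query the collected data back with plain SQL.
averages = dict(
    conn.execute("SELECT label, AVG(value) FROM readings GROUP BY label").fetchall()
)
```

Passing a file path instead of `":memory:"` would persist the data to a single file on disk, which is the kind of file a tool such as Datasette is designed to open.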
-
deeplake
Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
Finally, we stored these vectors in our chosen database: the activeloop DeepLake database. This database is open source, something near and dear to our own open-source hearts. We will cover some additional details in a further section, but it is specifically designed to handle vector data and perform efficient similarity searches, which is crucial for quick and accurate retrieval during the RAG process.
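To make the idea of similarity search during retrieval more concrete, here is a minimal brute-force sketch in plain Python. It is not the DeepLake API: the in-memory `store` dictionary, the function names, and the three-dimensional vectors are all assumptions made for illustration, and a real vector database would typically rely on an index rather than scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query, store, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in store.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings; real ones would come from an embedding model.
store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.9, 0.1, 0.0],
}
results = top_k([1.0, 0.0, 0.0], store, k=2)
```

In a RAG pipeline, the documents behind the top-scoring ids would then be passed to the language model as context.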
-
Issue Worked On: Add Consistent Bands Metadata to Vision Transformer and ResNet Weights #2376 This week, I worked on a GitHub issue to add consistent band metadata across Vision Transformer (ViT) and ResNet weight classes in the torchgeo library. The goal was to ensure uniform metadata across different weight classes, specifically supporting various satellite datasets like Landsat and Sentinel.
-
diffgram
The AI Datastore for Schemas, BLOBs, and Predictions. Use with your apps or integrate built-in Human Supervision, Data Workflow, and UI Catalog to get the most value out of your AI Data.
-
DB-GPT-Hub
A repository that contains models, datasets, and fine-tuning techniques for DB-GPT, with the purpose of enhancing model performance in Text-to-SQL
-
entity-recognition-datasets
A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
-
safe-rlhf
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
-
dataset-viewer
Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
-
datumaro
Dataset Management Framework, a Python library and a CLI tool to build, analyze and manage Computer Vision datasets.
-
Project mention: Show HN: SemHash – Fast Semantic Text Deduplication for Cleaner Datasets | news.ycombinator.com | 2025-01-19
-
pudl
The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
-
Minari
A standard format for offline reinforcement learning datasets, with popular reference datasets and related utilities
-
DoppelGANger
[IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions
Python Datasets related posts
-
Exploring the Paramilitary Leaks
-
Show HN: SemHash – Fast Semantic Text Deduplication for Cleaner Datasets
-
I Track My Health Data in Markdown: Lessons in Digital Longevity
-
My First Open Source Contribution @microsoft
-
Creation of the ApostropheCMS Documentation Chatbot
-
TorchGeo: How to Download the NWPU VHR-10 Dataset
-
CLI tool and Python library for manipulating SQLite databases
-
A note from our sponsor - CodeRabbit
coderabbit.ai | 23 Mar 2025
Index
What are some of the best open-source Dataset projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | datasets | 19,851 |
2 | akshare | 11,021 |
3 | cleanlab | 10,241 |
4 | datasette | 9,884 |
5 | doccano | 9,849 |
6 | deeplake | 8,485 |
7 | datasets | 4,375 |
8 | torchgeo | 3,258 |
9 | Colour | 2,214 |
10 | ogb | 1,987 |
11 | Open3D-ML | 1,981 |
12 | diffgram | 1,860 |
13 | DB-GPT-Hub | 1,668 |
14 | entity-recognition-datasets | 1,531 |
15 | safe-rlhf | 1,427 |
16 | projects | 1,361 |
17 | dataset-viewer | 733 |
18 | datumaro | 582 |
19 | semhash | 573 |
20 | pudl | 524 |
21 | CelebV-HQ | 404 |
22 | Minari | 362 |
23 | DoppelGANger | 303 |