Top 23 Python Dataset Projects
-
datasets
🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
Project mention: 🐍🐍 23 issues to grow yourself as an exceptional open-source Python expert 🧑💻 🥇 | dev.to | 2023-10-19
-
Huh?
I wrote my own system for classifying a stream of texts in Python. I might open-source it one of these days, but I have to get it to the point where it is modular enough that I can customize it to do the particular things I want without subjecting people to my whims... I use it every day and I'm not afraid to demo it because it is rock solid.
My understanding is that my system would not be hard to adapt to work on images for certain kinds of tasks.
PyTorch is open source, Hugging Face is open source. CUDA isn't. This is
and for annotating text spans there are so many open source tools
https://github.com/doccano/doccano
I worked for a company a few years back that built annotation tools for projects we sold to customers but never quite got to a polished general purpose annotator. Today there are an overwhelming number of companies in this space and products I never heard of, many of which are cloud based or paid. Looks like a gold rush to me.
-
Project mention: Little Data: How do we query personal data? (2013) | news.ycombinator.com | 2024-03-01
I'm a fan of simonw's datasette/dogsheep ecosystem: https://datasette.io/
-
deeplake
Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
-
-
-
Colour Science is one of the more serious projects I know of, and more or less lets you get as advanced as you want. Used by film professionals among others. https://www.colour-science.org/
How would you define what the perfect color tool is? I would guess, like most tools, that it depends entirely on the job at hand, and that maybe no one perfect tool can exist. Colour Science might be great at serious color management, perceptual measurements, and conversions between standardized color spaces, but not the right tool for a web developer looking for a quick & easy way to make an HSV palette generation widget (and not because Colour Science is Python, but because it’s too big and heavy of a hammer).
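For the quick palette case described in that comment, Python's standard-library `colorsys` module may already be enough. The following is only a minimal sketch, assuming evenly spaced hues at a fixed saturation and value; `hsv_palette` and its default arguments are hypothetical names chosen for illustration and are not part of Colour Science.

```python
import colorsys


def hsv_palette(n, saturation=0.65, value=0.9):
    """Return n hex colours with evenly spaced hues (hypothetical helper)."""
    if n <= 0:
        return []
    colours = []
    for i in range(n):
        hue = i / n  # evenly spaced around the colour wheel, in [0, 1)
        r, g, b = colorsys.hsv_to_rgb(hue, saturation, value)
        # Scale each channel from [0, 1] to [0, 255] and format as hex.
        colours.append(
            "#{:02x}{:02x}{:02x}".format(
                round(r * 255), round(g * 255), round(b * 255)
            )
        )
    return colours


if __name__ == "__main__":
    print(hsv_palette(6))
```

This does no colour management or perceptual correction, so it is not a substitute for a library like Colour Science when accuracy matters.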
-
-
diffgram
The AI Datastore for Schemas, BLOBs, and Predictions. Use with your apps or integrate built-in Human Supervision, Data Workflow, and UI Catalog to get the most value out of your AI Data.
-
Project mention: Looking for Point Cloud deep learning, training sources | /r/deeplearning | 2023-07-13
I already have a basic understanding of Open3D-ML and managed to get the training examples to work. However, my knowledge is not sufficient to transfer this to my own data or model deployment.
-
entity-recognition-datasets
A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
There is of course the list at https://github.com/juand-r/entity-recognition-datasets, but all of the recent English datasets cover other domains of English, such as the music NER, space NER, etc. All interesting things, but not 2020s English newswire.
-
-
safe-rlhf
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
Project mention: [R] Meet Beaver-7B: a Constrained Value-Aligned LLM via Safe RLHF Technique | /r/MachineLearning | 2023-05-16
-
DB-GPT-Hub
A repository that contains models, datasets, and fine-tuning techniques for DB-GPT, with the purpose of enhancing model performance in Text-to-SQL
-
datasets-server
Lightweight web API for visualizing and exploring all types of datasets - computer vision, speech, text, and tabular - stored on the Hugging Face Hub
-
datumaro
Dataset Management Framework, a Python library and a CLI tool to build, analyze and manage Computer Vision datasets.
-
-
squirrel-core
A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way 🌰
-
DoppelGANger
[IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions
-
Data Flow Facilitator for Machine Learning (dffml)
The easiest way to use Machine Learning. Mix and match underlying ML libraries and data set sources. Generate new datasets or modify existing ones with ease.
-
Minari
A standard format for offline reinforcement learning datasets, with popular reference datasets and related utilities
Project mention: Announcing Minari (Gym for offline RL, by the Farama Foundation) is going into public beta | /r/reinforcementlearning | 2023-05-18
You can also read the full release notes here: https://github.com/Farama-Foundation/Minari/releases/tag/v0.3.0
-
Python Datasets related posts
- FLaNK AI Weekly 25 March 2025
- Little Data: How do we query personal data? (2013)
- Daily Price Tracking for Trader Joes
- Ask HN: What two software products should have a kid?
- What We Watched: A Netflix Engagement Report – About Netflix
- Effective GPT-4 Programming
- What is Glamorous Toolkit v1.0?
-
A note from our sponsor - WorkOS
workos.com | 28 Mar 2024
Index
What are some of the best open-source Dataset projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | datasets | 18,228 |
2 | doccano | 8,871 |
3 | datasette | 8,791 |
4 | akshare | 8,151 |
5 | deeplake | 7,603 |
6 | datasets | 4,141 |
7 | torchgeo | 2,176 |
8 | Colour | 1,925 |
9 | ogb | 1,852 |
10 | diffgram | 1,781 |
11 | Open3D-ML | 1,634 |
12 | entity-recognition-datasets | 1,426 |
13 | projects | 1,232 |
14 | safe-rlhf | 1,108 |
15 | DB-GPT-Hub | 949 |
16 | datasets-server | 597 |
17 | datumaro | 474 |
18 | CelebV-HQ | 306 |
19 | squirrel-core | 277 |
20 | DoppelGANger | 275 |
21 | Data Flow Facilitator for Machine Learning (dffml) | 241 |
22 | Minari | 203 |
23 | scrapeOP | 189 |