Top 23 Dataset Open-Source Projects
-
datasets
🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
-
label-studio
Label Studio is a multi-type data labeling and annotation tool with standardized output format
-
deeplake
Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
-
fl_chart
FL Chart is a highly customizable Flutter chart library that supports Line Chart, Bar Chart, Pie Chart, Scatter Chart, and Radar Chart.
-
coco-annotator
Web-based image segmentation tool for object detection, localization, and keypoints
-
diffgram
The AI Datastore for Schemas, BLOBs, and Predictions. Use with your apps or integrate built-in Human Supervision, Data Workflow, and UI Catalog to get the most value out of your AI Data.
-
voice_datasets
A comprehensive list of open-source datasets for voice and sound computing (95+ datasets).
-
entity-recognition-datasets
A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.
Merry Christmas buddy.
You'll find a ton of public datasets on GitHub [1].
Maven Analytics offers a monthly data analytics challenge [2] that you can enter for free. See their past competitions for some interesting datasets.
As I'm based in Ireland I'll also recommend the Irish Data Portal [3].
[1] https://github.com/awesomedata/awesome-public-datasets
Project mention: 23 issues to grow yourself as an exceptional open-source Python expert | dev.to | 2023-10-19
14. LabelStudio by Human Signal | Github | tutorial
Huh?
I wrote my own system for classifying a stream of texts in Python. I might open-source it one of these days, but I have to get it to the point where it is modular enough that I can customize it to do the particular things I want without subjecting people to my whims... I use it every day and I'm not afraid to demo it because it is rock solid.
My understanding is that my system would not be hard to adapt to work on images for certain kinds of tasks.
PyTorch is open source, Hugging Face is open source. CUDA isn't. This is
https://labelstud.io/
and for annotating text spans there are so many open source tools
https://github.com/doccano/doccano
I worked for a company a few years back that built annotation tools for projects we sold to customers but never quite got to a polished general purpose annotator. Today there are an overwhelming number of companies in this space and products I never heard of, many of which are cloud based or paid. Looks like a gold rush to me.
Project mention: Ask HN: High quality Python scripts or small libraries to learn from | news.ycombinator.com | 2024-04-19
Simon Willison's GitHub would be a great place to get started imo:
https://github.com/simonw/datasette
Project mention: Full-fledged APIs for slowly moving datasets without writing code | news.ycombinator.com | 2023-10-25
Project mention: Exploring Open-Source Alternatives to Landing AI for Robust MLOps | dev.to | 2023-12-13
For instance, the COCO Annotator is a web-based image annotation tool tailored for the COCO dataset format, allowing collaborative labeling with features like attribute tagging and automatic segmentation. Similarly, Label Studio offers an easy-to-use interface for bounding box object labeling in images.
Colour Science is one of the more serious projects I know of, and more or less lets you get as advanced as you want. Used by film professionals among others. https://www.colour-science.org/
How would you define what the perfect color tool is? I would guess, like most tools, that it depends entirely on the job at hand, and that maybe no one perfect tool can exist. Colour Science might be great at serious color management, perceptual measurements, and conversions between standardized color spaces, but not the right tool for a web developer looking for a quick and easy way to make an HSV palette generation widget (and not because Colour Science is Python, but because it's too big and heavy of a hammer).
Project mention: Looking for Point Cloud deep learning, training sources | /r/deeplearning | 2023-07-13
I already have a basic understanding of Open3D-ML and managed to get the training examples to work. However, my knowledge is not sufficient to transfer this to my own data or model deployment.
There is of course the list at https://github.com/juand-r/entity-recognition-datasets, but all of the recent English datasets cover other domains of English, such as the music NER, space NER, etc. All interesting things, but not 2020s English newswire.
Datasets related posts
- Streamlining AI/ML Deployment with ModelKits: Innovations and Future Directions
- Introducing the New GitHub Action for using Kit CLI on MLOps pipelines
- Say hello to Kit - An open source solution to MLOps complexity
- FLaNK AI Weekly 25 March 2025
- Little Data: How do we query personal data? (2013)
- Daily Price Tracking for Trader Joes
- Ask HN: What two software products should have a kid?
-
A note from our sponsor - InfluxDB
www.influxdata.com | 24 Apr 2024
Index
What are some of the best open-source Dataset projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | awesome-public-datasets | 58,391 |
2 | datasets | 18,376 |
3 | label-studio | 16,469 |
4 | doccano | 8,966 |
5 | datasette | 8,881 |
6 | akshare | 8,321 |
7 | techniques | 7,739 |
8 | deeplake | 7,690 |
9 | fl_chart | 6,376 |
10 | datasets | 4,162 |
11 | awesome-json-datasets | 3,183 |
12 | roapi | 3,070 |
13 | torchgeo | 2,218 |
14 | coco-annotator | 2,008 |
15 | Colour | 1,974 |
16 | ogb | 1,864 |
17 | diffgram | 1,795 |
18 | DataFrames.jl | 1,690 |
19 | Open3D-ML | 1,660 |
20 | voice_datasets | 1,525 |
21 | loghub | 1,518 |
22 | entity-recognition-datasets | 1,431 |
23 | projects | 1,246 |