Datasets

Open-source projects categorized as Datasets

Top 23 Dataset Open-Source Projects

  • awesome-public-datasets

    A topic-centric list of HQ open datasets.

  • Project mention: How to practice data analytics skills | news.ycombinator.com | 2023-12-25

    Merry Christmas buddy.

    You'll find a ton of public datasets on GitHub [1].

    Maven Analytics offers a monthly data analytics challenge [2] that you can enter for free. See their past competitions for some interesting datasets.

    As I'm based in Ireland I'll also recommend the Irish Data Portal [3].

    [1] https://github.com/awesomedata/awesome-public-datasets

  • datasets

    🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

  • Project mention: 23 issues to grow yourself as an exceptional open-source Python expert 🧑‍💻 🥇 | dev.to | 2023-10-19
  • label-studio

    Label Studio is a multi-type data labeling and annotation tool with standardized output format

  • Project mention: First 15 Open Source Advent projects | dev.to | 2023-12-15

    14. LabelStudio by Human Signal | Github | tutorial

  • doccano

    Open source annotation tool for machine learning practitioners.

  • Project mention: You Can't Have a Free Software AI Stack | news.ycombinator.com | 2023-07-13

    Huh?

    I wrote my own system for classifying a stream of texts in Python, I might Open Source it one of these days but I have to get it to the point where it is modular enough that I can customize it to do the particular things I want without subjecting people to my whims... I use it every day and I'm not afraid to demo it because it is rock solid.

    My understanding is that my system would not be hard to adapt to work on images for certain kinds of tasks.

    Pytorch is open source, Huggingface is open source. CUDA isn't. This is

    https://labelstud.io/

    and for annotating text spans there are so many open source tools

    https://github.com/doccano/doccano

    I worked for a company a few years back that built annotation tools for projects we sold to customers but never quite got to a polished general purpose annotator. Today there are an overwhelming number of companies in this space and products I never heard of, many of which are cloud based or paid. Looks like a gold rush to me.

  • datasette

    An open source multi-tool for exploring and publishing data

  • Project mention: Ask HN: High quality Python scripts or small libraries to learn from | news.ycombinator.com | 2024-04-19

    Simon Willison's github would be a great place to get started imo -

    https://github.com/simonw/datasette

  • akshare

    AKShare is an elegant and simple open-source financial data interface library for Python, built for human beings! (by akfamily)

  • techniques

    Techniques for deep learning with satellite & aerial imagery

  • Project mention: What satellite image analytics are in demand now? | /r/gis | 2023-06-26
  • deeplake

    Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai

  • Project mention: FLaNK AI Weekly 25 March 2025 | dev.to | 2024-03-25
  • fl_chart

    FL Chart is a highly customizable Flutter chart library that supports Line Chart, Bar Chart, Pie Chart, Scatter Chart, and Radar Chart.

  • datasets

    TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)

  • awesome-json-datasets

    A curated list of awesome JSON datasets that don't require authentication.

  • Project mention: JSON Datasets | news.ycombinator.com | 2023-05-24
  • roapi

    Create full-fledged APIs for slowly moving datasets without writing a single line of code.

  • Project mention: Full-fledged APIs for slowly moving datasets without writing code | news.ycombinator.com | 2023-10-25
  • torchgeo

    TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data

  • Project mention: FLaNK Stack Weekly for 20 Nov 2023 | dev.to | 2023-11-20
  • coco-annotator

    ✏️ Web-based image segmentation tool for object detection, localization, and keypoints

  • Project mention: Exploring Open-Source Alternatives to Landing AI for Robust MLOps | dev.to | 2023-12-13

    For instance, the COCO Annotator is a web-based image annotation tool tailored for the COCO dataset format, allowing collaborative labeling with features like attribute tagging and automatic segmentation. Similarly, Label Studio offers an easy-to-use interface for bounding box object labeling in images.

  • Colour

    Colour Science for Python

  • Project mention: Tailwind Color Palette Generator | news.ycombinator.com | 2024-02-02

    Colour Science is one of the more serious projects I know of, and more or less lets you get as advanced as you want. Used by film professionals among others. https://www.colour-science.org/

    How would you define what the perfect color tool is? I would guess like most tools that it depends entirely on the job at hand, and that maybe no one perfect tool can exist. Colour Science might be great at serious color management and perceptual measurements and conversions between standardized color spaces, but not the right tool for a web developer looking for quick & easy way to make an HSV palette generation widget (and not because Colour Science is Python, but because it's too big and heavy of a hammer).

  • ogb

    Benchmark datasets, data loaders, and evaluators for graph machine learning

  • diffgram

    The AI Datastore for Schemas, BLOBs, and Predictions. Use with your apps or integrate built-in Human Supervision, Data Workflow, and UI Catalog to get the most value out of your AI Data.

  • DataFrames.jl

    In-memory tabular data in Julia

  • Open3D-ML

    An extension of Open3D to address 3D Machine Learning tasks

  • Project mention: Looking for Point Cloud deep learning, training sources | /r/deeplearning | 2023-07-13

    I already have a basic understanding with Open3D-ML and manage to get examples for training to work. However, my knowledge is not sufficient to transfer this to my own data or model deployment.

  • voice_datasets

    🔊 A comprehensive list of open-source datasets for voice and sound computing (95+ datasets).

  • loghub

    A large collection of system log datasets for AI-driven log analytics [ISSRE'23]

  • entity-recognition-datasets

    A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.

  • Project mention: Recent English newswire NER datasets? | /r/LanguageTechnology | 2023-08-27

    There is of course the list at https://github.com/juand-r/entity-recognition-datasets, but all of the recent English datasets cover other domains of English, such as the music NER, space NER, etc. All interesting things, but not 2020s English newswire.

  • projects

    🪐 End-to-end NLP workflows from prototype to production (by explosion)

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source Dataset projects? This list will help you:

 #  Project                       Stars
 1  awesome-public-datasets      58,391
 2  datasets                     18,376
 3  label-studio                 16,469
 4  doccano                       8,966
 5  datasette                     8,881
 6  akshare                       8,321
 7  techniques                    7,739
 8  deeplake                      7,690
 9  fl_chart                      6,376
10  datasets                      4,162
11  awesome-json-datasets         3,183
12  roapi                         3,070
13  torchgeo                      2,218
14  coco-annotator                2,008
15  Colour                        1,974
16  ogb                           1,864
17  diffgram                      1,795
18  DataFrames.jl                 1,690
19  Open3D-ML                     1,660
20  voice_datasets                1,525
21  loghub                        1,518
22  entity-recognition-datasets   1,431
23  projects                      1,246