Dataset

Top 23 Dataset Open-Source Projects

  • public-apis

    A collective list of free APIs

  • Project mention: 10 GitHub repositories that every developer must follow | dev.to | 2024-02-21

    ✅ public-apis/public-apis : https://github.com/public-apis/public-apis

  • faker

    Faker is a Python package that generates fake data for you. (by joke2k)

  • Project mention: Leveling up your custom fake data with Faker.js | dev.to | 2024-01-27

    Faker was originally written in Perl and is also available as a library for Ruby, Java, and Python.

  • label-studio

    Label Studio is a multi-type data labeling and annotation tool with standardized output format

  • Project mention: First 15 Open Source Advent projects | dev.to | 2023-12-15

    14. LabelStudio by Human Signal | Github | tutorial

  • fashion-mnist

    A MNIST-like fashion product database. Benchmark 👇

  • Project mention: Logistic Regression for Image Classification Using OpenCV | news.ycombinator.com | 2023-12-31

    In this case there's no advantage to using logistic regression on an image other than the novelty. Logistic regression is excellent for feature explainability, but you can't explain anything from an image.

    Traditional (non-deep-learning) classification algorithms such as SVMs and Random Forest perform a lot better on MNIST: up to 97% accuracy, compared to the 88% from logistic regression in this post. Check the original MNIST benchmarks here: http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/#

  • LaTeX-OCR

    pix2tex: Using a ViT to convert images of equations into LaTeX code.

  • Project mention: Detexify LaTeX Handwriting Symbol Recognition | news.ycombinator.com | 2023-11-14
  • doccano

    Open source annotation tool for machine learning practitioners.

  • Project mention: You Can't Have a Free Software AI Stack | news.ycombinator.com | 2023-07-13

    Huh?

    I wrote my own system for classifying a stream of texts in Python, I might Open Source it one of these days but I have to get it to the point where it is modular enough that I can customize it to do the particular things I want without subjecting people to my whims... I use it every day and I'm not afraid to demo it because it is rock solid.

    My understanding is that my system would not be hard to adapt to work on images for certain kinds of tasks.

    Pytorch is open source, Huggingface is open source. CUDA isn't. This is

    https://labelstud.io/

    and for annotating text spans there are so many open source tools

    https://github.com/doccano/doccano

    I worked for a company a few years back that built annotation tools for projects we sold to customers but never quite got to a polished general purpose annotator. Today there are an overwhelming number of companies in this space and products I never heard of, many of which are cloud based or paid. Looks like a gold rush to me.

  • cleanlab

    The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

  • Project mention: [Research] Detecting Annotation Errors in Semantic Segmentation Data | /r/MachineLearning | 2023-11-05

    We have freely open-sourced our new method for improving segmentation data, published a paper on the research behind it, and released a 5-min code tutorial. You can also read more in the blog if you'd like.

  • techniques

    Techniques for deep learning with satellite & aerial imagery

  • Project mention: What satellite image analytics are in demand now? | /r/gis | 2023-06-26
  • awesome-project-ideas

    Curated list of Machine Learning, NLP, Vision, Recommender Systems Project Ideas

  • quickdraw-dataset

    Documentation on how to access and use the Quick, Draw! Dataset.

  • browser-compat-data

    This repository contains compatibility data for Web technologies as displayed on MDN

  • Project mention: Here are the 10 projects I am contributing to over the next 6 months. Share yours | dev.to | 2024-04-13
  • esProc

    esProc SPL is a scripting language for data processing, with a rich, well-designed function library and powerful syntax. It can be executed inside a Java program through a JDBC interface or run independently.

  • Project mention: Computing Engine on Web | news.ycombinator.com | 2024-04-22
  • awesome-pretrained-chinese-nlp-models

    Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models, large models, multimodal models, and large language models

  • datasets

    TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)

  • sql-translator

    SQL Translator is a tool for converting natural language queries into SQL code using artificial intelligence. This project is 100% free and open source.

  • Project mention: Storybook GPT | dev.to | 2023-05-08

    I started to see more and more applications that use the OpenAI API and I wanted to try it out. One of these apps is this one made by Kate.

  • Chinese-Names-Corpus

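    Chinese names corpus and name generator. Covers Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, and English names. Can be used for Chinese word segmentation and person-name entity recognition.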

  • text

    Models, data loaders and abstractions for language processing, powered by PyTorch

  • img2dataset

    Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.

  • Project mention: OpenAI sued for web scraping from millions of internet users in order to train ChatGPT | /r/ArtistHate | 2023-06-30

    Lmao, no it doesn't. As we can see, their downloader uses very obscure "no ai" headers (which can be disabled, so it's useless). They only claim it respects "robots.txt" because the Google crawler respects it; if a site changes its robots.txt rules, they don't remove it from their dataset, and that is not "respecting". https://github.com/rom1504/img2dataset

  • awesome-json-datasets

    A curated list of awesome JSON datasets that don't require authentication.

  • Project mention: JSON Datasets | news.ycombinator.com | 2023-05-24
  • TextRecognitionDataGenerator

    A synthetic data generator for text recognition

  • covid-chestxray-dataset

    We are building an open database of COVID-19 cases with chest X-ray or CT images.

  • pandas-datareader

    Extract data from a wide range of Internet sources into a pandas DataFrame.

  • Project mention: Seeking recommendations for forex economic data API | /r/algotrading | 2023-05-03

    I've looked at https://github.com/pydata/pandas-datareader and it looks good, does anyone have experience?

  • whylogs

    An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
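
The img2dataset entry in the list above quotes "100M urls in 20h on one machine". As a rough back-of-envelope check, the sketch below converts that claim into an average sustained rate. The two figures come from the project description; the function names are hypothetical helpers for illustration only and are not part of img2dataset.

```python
# Back-of-envelope arithmetic for the quoted "100M urls in 20h" figure.
# Both constants are taken from the img2dataset description in the list;
# the helper functions are illustrative and not part of any library.

URLS = 100_000_000  # total URLs quoted in the description
HOURS = 20          # wall-clock time quoted in the description


def urls_per_second(total_urls: int, hours: float) -> float:
    """Average rate needed to process total_urls within the given hours."""
    return total_urls / (hours * 3600)


def hours_needed(total_urls: int, rate_per_second: float) -> float:
    """Time in hours to process total_urls at a constant rate."""
    return total_urls / rate_per_second / 3600


rate = urls_per_second(URLS, HOURS)
print(f"average rate: {rate:.0f} urls/s")

# At the same average rate, a 1M-URL job is 1% of the work.
print(f"1M urls: {hours_needed(1_000_000, rate) * 60:.0f} minutes")
```

This only gives the average implied by the quoted numbers; real throughput would depend on bandwidth, DNS, and the responsiveness of the hosts being downloaded from.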

Index

What are some of the best open-source Dataset projects? This list will help you:

 #  Project                                  Stars
 1  public-apis                            292,037
 2  faker                                   17,080
 3  label-studio                            16,469
 4  fashion-mnist                           11,439
 5  LaTeX-OCR                               10,711
 6  doccano                                  8,966
 7  cleanlab                                 8,592
 8  techniques                               7,739
 9  awesome-project-ideas                    7,404
10  quickdraw-dataset                        5,934
11  browser-compat-data                      4,777
12  esProc                                   4,425
13  awesome-pretrained-chinese-nlp-models    4,193
14  datasets                                 4,162
15  sql-translator                           3,966
16  Chinese-Names-Corpus                     3,814
17  text                                     3,441
18  img2dataset                              3,242
19  awesome-json-datasets                    3,183
20  TextRecognitionDataGenerator             3,038
21  covid-chestxray-dataset                  2,958
22  pandas-datareader                        2,819
23  whylogs                                  2,543
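
The index is stated to be ordered by GitHub stars. A minimal sketch of how one might parse rows of this shape and check that ordering is shown below. It assumes each row is whitespace-separated as rank, project name, star count, and that project names contain no spaces (true for the rows listed here); the `Entry` type and `parse_row` helper are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    rank: int
    project: str
    stars: int


def parse_row(row: str) -> Entry:
    """Parse one 'rank project stars' row; stars may contain thousands commas."""
    rank, project, stars = row.split()
    return Entry(int(rank), project, int(stars.replace(",", "")))


# A few rows copied from the index above.
ROWS = [
    "1 public-apis 292,037",
    "2 faker 17,080",
    "3 label-studio 16,469",
]

entries = [parse_row(r) for r in ROWS]

# The list is ordered by stars if no entry has more stars than the one before it.
is_sorted = all(a.stars >= b.stars for a, b in zip(entries, entries[1:]))
print(entries[0].project, entries[0].stars, is_sorted)
```

Because `str.split()` with no arguments collapses runs of whitespace, the same helper would also accept rows padded into aligned columns.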
