Python Dataset

Open-source Python projects categorized as Dataset

Top 23 Python Dataset Projects

  • public-apis

    A collective list of free APIs

    Project mention: 10 GitHub repositories that every developer must follow | dev.to | 2024-02-21

    ✅ public-apis/public-apis : https://github.com/public-apis/public-apis

  • faker

    Faker is a Python package that generates fake data for you. (by joke2k)

    Project mention: Leveling up your custom fake data with Faker.js | dev.to | 2024-01-27

    Faker was originally written in Perl and is also available as a library for Ruby, Java, and Python.

  • fashion-mnist

    A MNIST-like fashion product database. Benchmark 👇

    Project mention: Logistic Regression for Image Classification Using OpenCV | news.ycombinator.com | 2023-12-31

    In this case there's no advantage to using logistic regression on an image other than the novelty. Logistic regression is excellent for feature explainability, but you can't explain anything from an image.

    Traditional classification algorithms other than deep learning, such as SVMs and Random Forest, perform a lot better on MNIST, with up to 97% accuracy compared to the 88% from logistic regression in this post. Check the original MNIST benchmarks here: http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/#

  • LaTeX-OCR

    pix2tex: Using a ViT to convert images of equations into LaTeX code.

    Project mention: Detexify LaTeX Handwriting Symbol Recognition | news.ycombinator.com | 2023-11-14

  • doccano

    Open source annotation tool for machine learning practitioners.

    Project mention: You Can't Have a Free Software AI Stack | news.ycombinator.com | 2023-07-13

    Huh?

    I wrote my own system for classifying a stream of texts in Python, I might Open Source it one of these days but I have to get it to the point where it is modular enough that I can customize it to do the particular things I want without subjecting people to my whims... I use it every day and I'm not afraid to demo it because it is rock solid.

    My understanding is that my system would not be hard to adapt to work on images for certain kinds of tasks.

    Pytorch is open source, Huggingface is open source. CUDA isn't. This is

    https://labelstud.io/

    and for annotating text spans there are so many open source tools

    https://github.com/doccano/doccano

    I worked for a company a few years back that built annotation tools for projects we sold to customers but never quite got to a polished general purpose annotator. Today there are an overwhelming number of companies in this space and products I never heard of, many of which are cloud based or paid. Looks like a gold rush to me.

  • cleanlab

    The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

    Project mention: [Research] Detecting Annotation Errors in Semantic Segmentation Data | /r/MachineLearning | 2023-11-05

    We have freely open-sourced our new method for improving segmentation data, published a paper on the research behind it, and released a 5-min code tutorial. You can also read more in the blog if you'd like.

  • datasets

    TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)

  • awesome-pretrained-chinese-nlp-models

    Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models, large models, multimodal models, and large language models

  • text

    Models, data loaders and abstractions for language processing, powered by PyTorch

  • img2dataset

    Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.

    Project mention: OpenAI sued for web scraping from millions of internet users in order to train ChatGPT | /r/ArtistHate | 2023-06-30

    Lmao, no it doesn't. As we can see, their downloader uses very obscure "no ai" headers (which can be disabled, so it's useless). They only claim it respects "robots.txt" because the Google crawler respects it; if a site changes its robots.txt rules, they don't remove it from their dataset, and that is not "respecting". https://github.com/rom1504/img2dataset

  • TextRecognitionDataGenerator

    A synthetic data generator for text recognition

  • pandas-datareader

    Extract data from a wide range of Internet sources into a pandas DataFrame.

    Project mention: Seeking recommendations for forex economic data API | /r/algotrading | 2023-05-03

    I've looked at https://github.com/pydata/pandas-datareader and it looks good. Does anyone have experience with it?

  • waymo-open-dataset

    Waymo Open Dataset

    Project mention: Update from Waymo spokesperson on the dog that was killed by a Waymo ADV | /r/SelfDrivingCars | 2023-06-13

    Interesting point about the Waymo dataset, though this reply suggests they have higher framerates and just don't release them.

  • transformer-pytorch

    Transformer: PyTorch Implementation of "Attention Is All You Need"

  • Colour

    Colour Science for Python

    Project mention: Tailwind Color Palette Generator | news.ycombinator.com | 2024-02-02

    Colour Science is one of the more serious projects I know of, and more or less lets you get as advanced as you want. Used by film professionals among others. https://www.colour-science.org/

    How would you define what the perfect color tool is? I would guess like most tools that it depends entirely on the job at hand, and that maybe no one perfect tool can exist. Colour Science might be great at serious color management and perceptual measurements and conversions between standardized color spaces, but not the right tool for a web developer looking for a quick & easy way to make an HSV palette generation widget (and not because Colour Science is Python, but because it’s too big and heavy of a hammer).

  • fastdup

    fastdup is a powerful free tool designed to rapidly extract valuable insights from your image & video datasets. It helps you improve the quality of your dataset's images and labels and reduce your data operations costs at scale.

    Project mention: Visualize your dataset using DINOv2 embedding | news.ycombinator.com | 2023-05-02

    Visualizing your dataset (especially large ones) in a low-dimensional embedding space can tell you a lot about the patterns and clusters in your dataset.

    We recently released a notebook showing how you can visualize your dataset using DINOv2 models by running it on your CPU.

    Yes! No GPUs needed.

    We used it to find clusters of similar images, duplicates, and outliers in a subset of the LAION dataset.

    Try it on your own dataset:

    Colab notebook - https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/dinov2_notebook.ipynb

    GitHub repo - https://github.com/visual-layer/fastdup
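
The idea of finding duplicates and outliers in an embedding space can be sketched without any GPU or external library. The example below is not fastdup's API: it is a plain-Python illustration that assumes each image has already been turned into an embedding vector, and the vectors, file names, thresholds, and function names are all made up for the example.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_near_duplicates(embeddings, threshold=0.98):
    """Return (name_a, name_b, similarity) for pairs at or above threshold."""
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        similarity = cosine_similarity(vec_a, vec_b)
        if similarity >= threshold:
            pairs.append((name_a, name_b, similarity))
    return pairs


def find_outliers(embeddings, threshold=0.5):
    """Return names whose closest neighbour is still below threshold."""
    outliers = []
    for name, vec in embeddings.items():
        best = max(
            cosine_similarity(vec, other_vec)
            for other_name, other_vec in embeddings.items()
            if other_name != name
        )
        if best < threshold:
            outliers.append(name)
    return outliers


# Toy 3-dimensional "embeddings"; real model embeddings have hundreds of dimensions.
toy_embeddings = {
    "cat_1": [0.90, 0.10, 0.00],
    "cat_1_copy": [0.91, 0.09, 0.00],
    "dog_1": [0.10, 0.90, 0.10],
    "dog_2": [0.30, 0.80, 0.10],
    "blank_frame": [0.00, 0.00, 1.00],
}

duplicates = find_near_duplicates(toy_embeddings)
outliers = find_outliers(toy_embeddings)
print("near-duplicates:", [(a, b) for a, b, _ in duplicates])
print("outliers:", outliers)
```

The pairwise loop is quadratic in the number of images, so it only suits small sets; large datasets need an approximate nearest-neighbour index instead.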

  • DataProfiler

    What's in your data? Extract schema, statistics and entities from datasets

    Project mention: LongRoPE: Extending LLM Context Window Beyond 2M Tokens | news.ycombinator.com | 2024-02-22

    It's been possible to skip tokenization for a long time, my team and I did it here - https://github.com/capitalone/DataProfiler

    For what it's worth, we were actually working with LSTMs with nearly a billion params back around 2016-2017. Transformers made it far more effective to train and execute, but ultimately LSTMs are able to achieve similar results, though they are slower and require more training data.

  • beir

    A Heterogeneous Benchmark for Information Retrieval. Easy to use: evaluate your models across 15+ diverse IR datasets.

    Project mention: On building a semantic search engine | news.ycombinator.com | 2024-01-06

    The BEIR project might be what you're looking for: https://github.com/beir-cellar/beir/wiki/Leaderboard

  • ESC-50

    ESC-50: Dataset for Environmental Sound Classification

  • chatgpt-comparison-detection

    Human ChatGPT Comparison Corpus (HC3), Detectors, and more! 🔥

  • covid-19

    Novel Coronavirus 2019 time series data on cases (by datasets)

  • linusrants

    Dataset of Linus Torvalds' rants classified by negativity using sentiment analysis

    Project mention: veryEducational | /r/ProgrammerHumor | 2023-12-05

    Very inspiring quotes from Linus Torvalds

  • synthetic-computer-vision

    A list of synthetic datasets and tools for computer vision

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-02-22.

Index

What are some of the best open-source Dataset projects in Python? This list will help you:

Rank  Project                                  Stars
1     public-apis                            287,492
2     faker                                   17,010
3     fashion-mnist                           11,439
4     LaTeX-OCR                               10,425
5     doccano                                  8,871
6     cleanlab                                 8,153
7     datasets                                 4,141
8     awesome-pretrained-chinese-nlp-models    4,050
9     text                                     3,429
10    img2dataset                              3,165
11    TextRecognitionDataGenerator             3,002
12    pandas-datareader                        2,801
13    waymo-open-dataset                       2,496
14    transformer-pytorch                      1,992
15    Colour                                   1,925
16    fastdup                                  1,389
17    DataProfiler                             1,349
18    beir                                     1,333
19    ESC-50                                   1,236
20    chatgpt-comparison-detection             1,182
21    covid-19                                 1,155
22    linusrants                               1,040
23    synthetic-computer-vision                  989
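
The statement that the list is ordered by star count can be checked mechanically against the index. The following is a minimal sketch, assuming the rows have been copied as plain text in `rank name stars` form; the helper name `parse_index` and the regular expression are illustrative choices, not part of the site.

```python
import re

# Rows copied from the index above (rank, project, stars).
INDEX_TEXT = """\
1 public-apis 287,492
2 faker 17,010
3 fashion-mnist 11,439
4 LaTeX-OCR 10,425
5 doccano 8,871
6 cleanlab 8,153
7 datasets 4,141
8 awesome-pretrained-chinese-nlp-models 4,050
9 text 3,429
10 img2dataset 3,165
11 TextRecognitionDataGenerator 3,002
12 pandas-datareader 2,801
13 waymo-open-dataset 2,496
14 transformer-pytorch 1,992
15 Colour 1,925
16 fastdup 1,389
17 DataProfiler 1,349
18 beir 1,333
19 ESC-50 1,236
20 chatgpt-comparison-detection 1,182
21 covid-19 1,155
22 linusrants 1,040
23 synthetic-computer-vision 989
"""

# One row: integer rank, project name without spaces, comma-grouped star count.
ROW = re.compile(r"^(\d+)\s+(\S+)\s+([\d,]+)$")


def parse_index(text):
    """Parse index rows into (rank, name, stars) tuples, skipping other lines."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rank, name, stars = match.groups()
            rows.append((int(rank), name, int(stars.replace(",", ""))))
    return rows


rows = parse_index(INDEX_TEXT)
stars = [star_count for _, _, star_count in rows]

# The list is correctly ordered if no project has more stars than the one before it.
is_sorted = all(a >= b for a, b in zip(stars, stars[1:]))
print(f"{len(rows)} projects, ordered by stars: {is_sorted}")
```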