Top 23 Python Dataset Projects
🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
Label Studio is a multi-type data labeling and annotation tool with standardized output format
Project mention: [D] Are there any tools to quickly label training data manually? | reddit.com/r/MachineLearning | 2022-07-29
Open source annotation tool for machine learning practitioners.
Project mention: Ask HN: Any open source text editors with word tagging? | news.ycombinator.com | 2022-08-04
I worked at a place where we developed a system for doing this kind of tagging but it was for making training sets and there was no expectation that you could export the document from the system for normal use.
Quite a few NLP annotation systems are out there
An open source multi-tool for exploring and publishing data
Dataset format for AI. Build, manage, query & visualize datasets for deep learning. Stream data real-time to PyTorch/TensorFlow & version-control it. https://activeloop.ai (by activeloopai)
Project mention: [Q] where to host 50GB dataset (for free?) | reddit.com/r/datasets | 2022-06-25
Hey u/platoTheSloth, as u/gopietz mentioned (thanks a lot for the shout-out!!!), you can share them with the general public by uploading them to Activeloop Platform (for researchers, we offer special terms, but even as a member of the general public you get up to 300 GB of free storage!). Thanks to Hub, our open source dataset format for AI, anyone can load the dataset in under 3 seconds with one line of code, and stream it while training in PyTorch/TensorFlow.
TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)
Colour Science for Python
Project mention: The Color of Infinite Temperature | news.ycombinator.com | 2022-01-16
I haven’t seen the math for the conversion, but the conversions from CCT to xy/uv are each given for a particular domain. One of the conversions with the largest domain, i.e. Ohno m, covers the domain [1000K, 100000K]: https://github.com/colour-science/colour/blob/develop/colour...
Infinity is very much in extrapolation territory.
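To make the point about limited domains concrete, here is a minimal sketch of a CCT to CIE 1931 xy conversion that refuses to extrapolate. It assumes the widely quoted Kang et al. (2002) cubic approximation of the Planckian locus, valid for roughly 1667 K to 25000 K; it is not the Ohno method mentioned above and not the colour library's API, and the function name is hypothetical.

```python
def cct_to_xy_kang2002(cct):
    """Approximate CIE 1931 xy chromaticity for a correlated colour temperature.

    Assumption: coefficients are those of the Kang et al. (2002) cubic
    approximation, which is only fitted for 1667 K <= cct <= 25000 K.
    """
    # Reject anything outside the fitted domain (this also rejects inf and NaN).
    if not (1667.0 <= cct <= 25000.0):
        raise ValueError(f"CCT {cct} K is outside the fitted domain [1667, 25000] K")

    t = float(cct)

    # x is a cubic in 1/T, with separate coefficients below and above 4000 K.
    if t <= 4000.0:
        x = -0.2661239e9 / t**3 - 0.2343589e6 / t**2 + 0.8776956e3 / t + 0.179910
    else:
        x = -3.0258469e9 / t**3 + 2.1070379e6 / t**2 + 0.2226347e3 / t + 0.240390

    # y is a cubic in x, with three temperature segments.
    if t <= 2222.0:
        y = -1.1063814 * x**3 - 1.34811020 * x**2 + 2.18555832 * x - 0.20219683
    elif t <= 4000.0:
        y = -0.9549476 * x**3 - 1.37418593 * x**2 + 2.09137015 * x - 0.16748867
    else:
        y = 3.0817580 * x**3 - 5.87338670 * x**2 + 3.75112997 * x - 0.37001483

    return x, y


if __name__ == "__main__":
    # A daylight-like temperature is inside the domain.
    print(cct_to_xy_kang2002(6500.0))
    # An infinite temperature is not, so the function raises instead of guessing.
    try:
        cct_to_xy_kang2002(float("inf"))
    except ValueError as err:
        print(err)
```

Raising on out-of-range input, rather than silently evaluating the polynomial, keeps the extrapolation problem visible to the caller.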
Benchmark datasets, data loaders, and evaluators for graph machine learning
Project mention: [D] Best way to handle encoding disconnected graphs at the graph level. | reddit.com/r/MachineLearning | 2022-04-10
Example code: https://github.com/snap-stanford/ogb/tree/master/examples/graphproppred/mol
🪐 End-to-end NLP workflows from prototype to production (by explosion)
Project mention: Using pre-trained BERT embeddings for multi-class text classification | reddit.com/r/LanguageTechnology | 2022-01-10
spaCy has an example project that uses BERT that you could use as a reference. It's multilabel but it should be easy to tweak the config to be just multiclass instead.
Dataset Management Framework, a Python library and a CLI tool to build, analyze and manage Computer Vision datasets.
Project mention: Does anyone use CVAT for image annotation? | reddit.com/r/computervision | 2022-04-18
1) CVAT has built-in inference for models. If you upload a model there in the correct format, it will be able to generate the detection boxes itself - https://onepanel.medium.com/train-an-object-detection-model-from-scratch-and-run-inference-on-it-in-10-minutes-16147ef656aa 2) Yes, you can upload your predictions. But the last time I did it there were some problems and it took me several hours. It seems to me that you just need to load the markup in one of the formats supported by CVAT. If your format is not supported, you will need to convert it, for example with https://github.com/openvinotoolkit/datumaro
A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way 🌰
Project mention: [P] Squirrel: A new OS library for fast & flexible large-scale data loading | reddit.com/r/MachineLearning | 2022-04-11
Today we open-sourced Squirrel, a data infrastructure library that my colleagues and I have been working on over the past 1.5 years: https://github.com/merantix-momentum/squirrel-core
[IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions
Project mention: DoppelGANger: NEW Data - star count:168.0 | reddit.com/r/algoprojects | 2022-06-11
The easiest way to use Machine Learning. Mix and match underlying ML libraries and data set sources. Generate new datasets or modify existing ones with ease.
SHIFT15M: multiobjective large-scale fashion dataset with distributional shifts
Project mention: SHIFT15M: Multiobjective Large-scale Fashion Dataset with Distributional Shifts | dev.to | 2021-09-09
Super Resolution datasets and models in PyTorch
A Python package for scraping oddsportal.com
Project mention: Help with web scraping oddsportal.com | reddit.com/r/learnpython | 2022-06-25
For my first web scraping project, I wanted to use an existing program from github: https://github.com/Seb943/scrapeOP
A collection of multimodal datasets and visual features for VQA and captioning in PyTorch. Just run "pip install multimodal" (by cdancette)
Podium: a framework agnostic Python NLP library for data loading and preprocessing
Project mention: Show HN: Podium: framework agnostic NLP library for data loading and preprocess | news.ycombinator.com | 2021-12-09
Document level Attitude and Relation Extraction toolkit (AREkit) for sampling mass-media news into datasets for your ML-model training and evaluation
Project mention: Show HN: ARElight – A Mass-Media Processing Application for Relation Extraction | news.ycombinator.com | 2022-06-18
ExORL: Exploratory Data for Offline Reinforcement Learning
Project mention: "Don't Change the Algorithm, Change the Data: Exploratory Data for Offline Reinforcement Learning (ExoRL)", Yarats et al 2022 | reddit.com/r/ResearchML | 2022-02-13
Squirrel dataset hub
Project mention: [P] Squirrel: A new OS library for fast & flexible large-scale data loading | reddit.com/r/MachineLearning | 2022-04-11
Have a look at this tutorial to learn how to convert to messagepack by using Spark.
Cleaning discord data for NLP
Python Datasets related posts
Ask HN: What's the best way to create a database for legal document clauses?
2 projects | news.ycombinator.com | 10 Aug 2022
Does anyone here use sqlite just to do quick queries, because it is easier than loading to another rdb?
3 projects | reddit.com/r/Python | 5 Aug 2022
Can you recommend an, ideally, open-source system that allows building and rendering reports from a database?
2 projects | reddit.com/r/Database | 2 Aug 2022
2 projects | reddit.com/r/django | 31 Jul 2022
Best SQL Software for Dealing With Local Files
1 project | reddit.com/r/SQL | 9 Jul 2022
What do you guys think is the bare minimum for any of you to feel inclined to use graphql? And what would you say is the easiest way to implement it these days?
3 projects | reddit.com/r/graphql | 7 Jul 2022
Need a database that can hold 16 million records and export any 2000 non-sequential records to Excel within 10 seconds.
2 projects | reddit.com/r/Database | 25 Jun 2022
What are some of the best open-source Dataset projects in Python? This list will help you:
|14||Data Flow Facilitator for Machine Learning (dffml)||177|