Top 23 Python Dataset Projects
-
cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
-
awesome-pretrained-chinese-nlp-models
Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models, large models, multimodal models, and large language models.
-
img2dataset
Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.
-
fastdup
fastdup is a powerful free tool designed to rapidly extract valuable insights from your image & video datasets. It helps you improve the quality of your dataset's images and labels and reduce your data operations costs at an unparalleled scale.
-
beir
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
Project mention: Building a Basic Forex Rate Assistant Using Agents for Amazon Bedrock | dev.to | 2024-04-29
For inspiration on what type of agents I should build, I turned to the Public APIs GitHub repository, which has a curated list of free APIs. I narrowed my search to an API that does not require sign-up or an API key and returns useful information. I ultimately decided to use the Free Currency Exchange Rates API, which seemed promising upon some basic testing.
Faker was originally written in Perl and is also available as a library for Ruby, Java, and Python.
Project mention: Logistic Regression for Image Classification Using OpenCV | news.ycombinator.com | 2023-12-31
In this case there's no advantage to using logistic regression on an image other than the novelty. Logistic regression is excellent for feature explainability, but you can't explain anything from an image.
Traditional (non-deep-learning) classification algorithms such as SVMs and Random Forest perform a lot better on MNIST, reaching up to 97% accuracy compared to the 88% from logistic regression in this post. Check the original MNIST benchmarks here: http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/#
Project mention: Show HN: Synthesize TikZ Graphics Programs for Scientific Figures and Sketches | news.ycombinator.com | 2024-06-06
already claim to (at least partially) support this.
[1] https://github.com/lukas-blecher/LaTeX-OCR
Huh?
I wrote my own system for classifying a stream of texts in Python. I might open-source it one of these days, but I have to get it to the point where it is modular enough that I can customize it to do the particular things I want without subjecting people to my whims. I use it every day and I'm not afraid to demo it because it is rock solid.
My understanding is that my system would not be hard to adapt to work on images for certain kinds of tasks.
Pytorch is open source, Huggingface is open source. CUDA isn't. This is
https://labelstud.io/
and for annotating text spans there are so many open source tools
https://github.com/doccano/doccano
I worked for a company a few years back that built annotation tools for projects we sold to customers but never quite got to a polished general purpose annotator. Today there are an overwhelming number of companies in this space and products I never heard of, many of which are cloud based or paid. Looks like a gold rush to me.
Project mention: [Research] Detecting Annotation Errors in Semantic Segmentation Data | /r/MachineLearning | 2023-11-05
We have freely open-sourced our new method for improving segmentation data, published a paper on the research behind it, and released a 5-min code tutorial. You can also read more in the blog if you'd like.
Project mention: OpenAI sued for web scraping from millions of internet users in order to train ChatGPT | /r/ArtistHate | 2023-06-30
Lmao, no it doesn't. As we can see, their downloader uses very obscure "no ai" headers (which can be disabled, so it's useless). They only claim it respects "robots.txt" because the Google crawler respects it. If a site changes its robots.txt rules, they don't remove it from their dataset; that is not "respecting". https://github.com/rom1504/img2dataset
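To make the two mechanisms in that comment concrete, here is a minimal sketch of what a download-time check against robots.txt rules and opt-out response headers could look like. It uses only the Python standard library and is not taken from img2dataset; the function names, the `example-bot` user agent, and the set of opt-out directives are all assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical opt-out directives; which ones a real crawler honors is an assumption.
OPT_OUT_DIRECTIVES = {"noai", "noimageai"}


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def opted_out_by_header(headers):
    """Return True if an X-Robots-Tag response header carries an opt-out directive."""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            directives = {part.strip().lower() for part in value.split(",")}
            if directives & OPT_OUT_DIRECTIVES:
                return True
    return False


def may_download(robots_txt, user_agent, url, headers):
    """Combine both checks: robots.txt must allow the URL and no opt-out header is set."""
    return allowed_by_robots(robots_txt, user_agent, url) and not opted_out_by_header(headers)


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    print(may_download(robots, "example-bot", "https://example.com/public/cat.jpg", {}))
    print(may_download(robots, "example-bot", "https://example.com/private/cat.jpg", {}))
```

A check like this only applies at the moment of download. It says nothing about data already collected if a site later changes its rules, which is the objection the commenter raises.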
Colour Science is one of the more serious projects I know of, and more or less lets you get as advanced as you want. Used by film professionals among others. https://www.colour-science.org/
How would you define what the perfect color tool is? I would guess like most tools that it depends entirely on the job at hand, and that maybe no one perfect tool can exist. Colour Science might be great at serious color management and perceptual measurements and conversions between standardized color spaces, but not the right tool for a web developer looking for quick & easy way to make an HSV palette generation widget (and not because Colour Science is Python, but because it’s too big and heavy of a hammer).
The BEIR project might be what you're looking for: https://github.com/beir-cellar/beir/wiki/Leaderboard
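Retrieval leaderboards of this kind commonly report nDCG@10. As a rough illustration of what that number measures, here is a minimal standard-library sketch for a single query. It is not BEIR's own evaluation code; the function names are hypothetical, and it assumes linear gain (the raw relevance grade) with a log2 rank discount.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevance scores."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document ids in the order the system returned them.
    qrels: dict mapping document id to graded relevance; missing ids count as 0.
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents are known for this query
    return dcg_at_k(gains, k) / ideal_dcg


if __name__ == "__main__":
    qrels = {"d1": 2, "d2": 1}
    print(ndcg_at_k(["d1", "d2", "d3"], qrels))
    print(ndcg_at_k(["d3", "d2", "d1"], qrels))
```

A benchmark score would then be the mean of this value over all queries in a dataset.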
Project mention: LongRoPE: Extending LLM Context Window Beyond 2M Tokens | news.ycombinator.com | 2024-02-22
It's been possible to skip tokenization for a long time; my team and I did it here: https://github.com/capitalone/DataProfiler
For what it's worth, we were actually working with LSTMs with nearly a billion params back around 2016-2017. Transformers made it far more effective to train and execute, but ultimately LSTMs are able to achieve similar results, though they are slower and require more training data.
Project mention: Linus Torvalds' rants classified by negativity using sentiment analysis | news.ycombinator.com | 2024-04-04
Python Dataset related posts
-
An Open Source Tool for Multimodal Fact Verification
-
Linus Torvalds' rants classified by negativity using sentiment analysis
-
Show HN: Mapping almost every law, regulation and case in Australia
-
AI-Powered Image Search with CLIP, pgvector, and Fast API
-
Logistic Regression for Image Classification Using OpenCV
-
veryEducational
-
You Can't Have a Free Software AI Stack
Index
What are some of the best open-source Dataset projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | public-apis | 296,066 |
2 | faker | 17,246 |
3 | fashion-mnist | 11,619 |
4 | LaTeX-OCR | 11,215 |
5 | doccano | 9,115 |
6 | cleanlab | 8,913 |
7 | awesome-pretrained-chinese-nlp-models | 4,386 |
8 | datasets | 4,217 |
9 | text | 3,467 |
10 | img2dataset | 3,375 |
11 | TextRecognitionDataGenerator | 3,106 |
12 | pandas-datareader | 2,839 |
13 | waymo-open-dataset | 2,579 |
14 | transformer-pytorch | 2,363 |
15 | Colour | 2,018 |
16 | fastdup | 1,443 |
17 | beir | 1,438 |
18 | DataProfiler | 1,373 |
19 | ESC-50 | 1,284 |
20 | chatgpt-comparison-detection | 1,214 |
21 | covid-19 | 1,155 |
22 | linusrants | 1,099 |
23 | synthetic-computer-vision | 991 |