-
fastdup
fastdup is a powerful free tool designed to rapidly extract valuable insights from your image & video datasets. It helps you improve the quality of your dataset's images and labels and reduce your data operations costs at an unparalleled scale.
The positive solution is to scrape Wikimedia Commons for everything in "Category: PD-Art-old-100" and train from scratch on that data. Wikimedia Commons is well-moderated, the image data is public domain[0], and the labels can be filtered down to CC-BY or CC-BY-SA subsets[1]. Your resulting model will be CC-BY-SA licensed and the output completely copyright-free.
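A minimal sketch of the first step, listing the files in a Commons category through the standard MediaWiki `categorymembers` query. The category title is taken verbatim from the comment above and may not match the exact name on Commons; the helper names `build_category_query` and `parse_category_page` are made up for illustration, and the actual HTTP fetch and the license filtering of labels are left out.

```python
from urllib.parse import urlencode

# Public MediaWiki API endpoint for Wikimedia Commons.
API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def build_category_query(category, continue_token=None, limit=500):
    """Build the URL for one page of files in a Commons category.

    `category` is the title without the "Category:" prefix. The title used
    in the example below comes from the comment and is an assumption.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "file",        # only files, not subcategories or pages
        "cmlimit": str(limit),   # 500 is the usual cap for anonymous clients
        "format": "json",
    }
    if continue_token is not None:
        # Resume where the previous page of results stopped.
        params["cmcontinue"] = continue_token
    return f"{API_ENDPOINT}?{urlencode(params)}"


def parse_category_page(response):
    """Return (file titles, continuation token or None) from a decoded JSON reply."""
    members = response.get("query", {}).get("categorymembers", [])
    titles = [member["title"] for member in members]
    next_token = response.get("continue", {}).get("cmcontinue")
    return titles, next_token


if __name__ == "__main__":
    print(build_category_query("PD-Art-old-100"))
```

To walk the whole category, fetch the URL, pass the decoded JSON to `parse_category_page`, and repeat with the returned token until it comes back as `None`.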
For the record, that's what I've been trying to do[2]; my stumbling blocks have been training time and a bug where my resulting pipeline seems to do the opposite of what I ask[3]. I'm assuming it's because my wikitext parser was broken and CLIP didn't have enough text data to train on; I'll have the answer tomorrow when I have a fully-trained U-Net to play with.
If I can ever get this working, I want to also build a CLIP pipeline that can attribute generated images against the training set. This would make it possible to safely use CC-BY and CC-BY-SA datasets: after generating
[0] At least in the US. Other jurisdictions think that scanning an image recopyrights it; see https://en.wikipedia.org/wiki/National_Portrait_Gallery_and_...
[1] Watch out for anything tagged with https://commons.wikimedia.org/wiki/Template:Royal_Museums_Gr... as that will taint your model.
[2] https://github.com/kmeisthax/PD-Diffusion
[3] https://pooper.fantranslation.org/@kmeisthax/109486435508334...
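The attribution idea in the comment is cut off, so the author's intended method is unknown. One way it could look, purely as an assumption, is a nearest-neighbour lookup: embed the generated image, compare it against stored embeddings of the training images by cosine similarity, and credit any training image above a threshold. In the sketch below the embeddings are tiny placeholder vectors rather than real CLIP outputs, and `attribute`, `top_k`, and `threshold` are hypothetical names and values.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def attribute(generated_embedding, training_embeddings, top_k=3, threshold=0.9):
    """Return up to top_k (title, score) pairs scoring at least threshold.

    training_embeddings maps a training image title to its embedding.
    The threshold value is an arbitrary assumption, not a tested cutoff.
    """
    scored = [
        (cosine_similarity(generated_embedding, embedding), title)
        for title, embedding in training_embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(title, score) for score, title in scored[:top_k] if score >= threshold]


if __name__ == "__main__":
    # Placeholder 2-D vectors standing in for real image embeddings.
    training = {
        "File:A.jpg": [1.0, 0.0],
        "File:B.jpg": [0.0, 1.0],
        "File:C.jpg": [1.0, 0.1],
    }
    for title, score in attribute([1.0, 0.0], training):
        print(title, round(score, 3))
```

A linear scan like this only works for small sets; a training set scraped from Commons would need an approximate nearest-neighbour index instead.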
The creators of fastdup, the data quality tool for computer vision, continue to improve their free release: https://github.com/visual-layer/fastdup
Here's a short video of some recent results for LAION 400M: https://www.youtube.com/watch?v=dlRCm29Upu4
Related posts
-
Visualize your dataset using DINOv2 embedding
-
[R][P] How to extract feature vectors of large datasets using DINOv2 on CPU
-
Computer Vision pre-trained model for finding how similar two photos of a room are
-
Find image duplicates and outliers – A free, scalable, efficient tool