fastdup vs PD-Diffusion

| | fastdup | PD-Diffusion |
|---|---|---|
| Mentions | 18 | 2 |
| Stars | 1,408 | 1 |
| Growth | 1.0% | - |
| Activity | 9.4 | 6.5 |
| Latest commit | 28 days ago | 10 months ago |
| Language | Python | Python |
| License | GNU General Public License v3.0 or later | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
fastdup
-
Visualize your dataset using DINOv2 embedding
Visualizing your dataset (especially large ones) in a low-dimensional embedding space can tell you a lot about the patterns and clusters in your dataset.
We recently released a notebook showing how you can visualize your dataset using DINOv2 models by running it on your CPU.
Yes! No GPUs needed.
We used it to find clusters of similar images, duplicates, and outliers in a subset of the LAION dataset.
Try it on your own dataset:
Colab notebook - https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/dinov2_notebook.ipynb
GitHub repo - https://github.com/visual-layer/fastdup
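The idea behind the notebook above is simple: embed every image, then project the embeddings down to two dimensions to reveal clusters and outliers. As a minimal, self-contained sketch of the projection step (assuming you already have an array of embeddings, e.g. 384-dimensional DINOv2 ViT-S features; this is not fastdup's internal code), a PCA via SVD needs only NumPy:

```python
import numpy as np

def project_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project high-dimensional embeddings to 2-D with PCA (via SVD)."""
    centered = embeddings - embeddings.mean(axis=0)
    # The top two right-singular vectors of the centered matrix span
    # the directions of greatest variance in the dataset.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Toy data: 100 "images" with 384-dim embeddings (DINOv2 ViT-S size).
rng = np.random.default_rng(0)
points = project_2d(rng.normal(size=(100, 384)))
print(points.shape)  # (100, 2)
```

The resulting 2-D points can be scattered with any plotting library; nearby points correspond to visually similar images.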
-
[R][P] How to extract feature vectors of large datasets using DINOv2 on CPU
Run 1M images from the LAION dataset through the DINOv2 model and cluster the images using a free tool - fastdup.
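Once feature vectors are extracted, duplicates fall out of pairwise cosine similarity: near-identical images produce near-identical embeddings. A hedged sketch of that core idea (brute-force over toy vectors; fastdup itself uses an approximate nearest-neighbor index to scale to millions of images):

```python
import numpy as np

def find_duplicates(emb: np.ndarray, threshold: float = 0.98):
    """Return index pairs whose cosine similarity exceeds the threshold."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = normed @ normed.T
    i, j = np.triu_indices(len(emb), k=1)  # upper triangle, skip self-pairs
    mask = sim[i, j] > threshold
    return list(zip(i[mask].tolist(), j[mask].tolist()))

# Toy check: vectors 0 and 2 are near-identical, so they pair up.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.999, 0.01]])
print(find_duplicates(emb))  # [(0, 2)]
```

The O(n²) similarity matrix here is fine for small sets; at the 1M-image scale mentioned above you would replace it with an approximate index.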
-
Computer Vision pre-trained model for finding how similar two photos of a room are
Another option could be fastdup (https://github.com/visual-layer/fastdup) which is probably most helpful for analysis type objectives.
-
Find image duplicates and outliers – A free, scalable, efficient tool
I recently stumbled upon fastdup, a tool that lets you gain insights from a large image/video collection.
-
How can we match images in our database?
There is this fastdup framework which supposedly allows you to find duplicates and similar images. I haven't used it, though.
-
Measure Images Similarity
I came across fastdup recently https://github.com/visual-layer/fastdup
- Dedup-ing LAION (60M duplicates) and ImageNet (1.2M duplicates) with fastdup
PD-Diffusion
-
What Does Copyright Say about Generative Models?
>But how much of a song or a painting can you reproduce?
The reason why fair use is vague is specifically to confuse people who ask these kinds of questions. The Supreme Court needed a tool that artists could use to legally smack down people who republish fragments of other people's work, but didn't want to abolish the 1st Amendment in the process. So basically judges have the final say as to whether or not something is novel creativity or in debt to the original. Any hard-and-fast rule beyond "binding precedent applies" is effectively copyright abolition by degrees.
>We lost most of Elizabethan theater because there was no copyright. [..] Without some kind of protection, authors had no interest in publishing at all, let alone publishing accurate texts.
This is a dated example, if only because creative works leave a lot more evidence now than they used to. People today will act to preserve art against the artist's own wishes and at great personal risk.
>and it’s easy to suspect that the actual payments will be similar to the royalties musicians get from streaming services: microcents per use
Given the amount of data these systems need (read: more than humanity can provide) I'd say microcents is arguably too high. Remember that you can't actually derive a clear chain of value between one particular training set entry and one particular execution of the model. It's all chucked into a blender that runs on almost-linear algebra and calculus. At best you can detect if parts of the image resemble specific training set examples[0] and pay people slightly more if the model regurgitates training set data.
Let's also keep in mind that a good chunk of the licensing system is based on being able to say no to specific users, or write very tailor-made licensing agreements for specific works or conditions. That's still going to be threatened, even if we can pay sub-Spotify-tier royalties every time a model trains itself on your work.
>It is easy to imagine an AI system that has been trained on the (many) Open Source and Creative Commons licenses.
Working on it: https://github.com/kmeisthax/PD-Diffusion
The thing is, we already have a good database of reusable, public-domain, no-attribution-necessary images; it's called Wikimedia Commons. I really can't fathom why OpenAI didn't start there, other than just an assumption that they were entitled to larger datasets or a feeling that they could get established before anyone sued.
Even then, OpenAI already tried this with computer code and they're getting sued for it anyway, because they never bothered with attribution in the case of training set regurgitation.
[0] This is possible because part of the prompt guidance process involves a thing called CLIP which can do both image and text classification in the same coordinate system.
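The footnote's point is that CLIP embeds images and text into one shared coordinate system, so a single dot product can compare a caption to an image. A minimal sketch with made-up unit-length vectors (the vectors and captions here are hypothetical, not real CLIP output):

```python
import numpy as np

# Hypothetical embeddings: in a CLIP-style model, images and captions
# land in the same space, so one dot product compares them directly.
def best_match(text_vec, image_vecs):
    """Index of the image embedding most similar to the text embedding."""
    t = text_vec / np.linalg.norm(text_vec)
    imgs = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    return int(np.argmax(imgs @ t))

text = np.array([0.9, 0.1, 0.0])           # e.g. "a painting of a dog"
images = np.array([[0.1, 0.9, 0.0],        # cat photo
                   [0.8, 0.2, 0.1],        # dog painting
                   [0.0, 0.1, 0.9]])       # landscape
print(best_match(text, images))  # 1
```

The same comparison run against training-set embeddings is what would let a pipeline flag generated images that resemble specific training examples.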
-
Laion-5B: A New Era of Open Large-Scale Multi-Modal Datasets
The positive solution is to scrape Wikimedia Commons for everything in "Category: PD-Art-old-100" and train from scratch on that data. Wikimedia Commons is well-moderated, the image data is public domain[0], and the labels can be filtered down to CC-BY or CC-BY-SA subsets[1]. Your resulting model will be CC-BY-SA licensed and the output completely copyright-free.
For the record, that's what I've been trying to do[2]; my stumbling blocks have been training time and a bug where my resulting pipeline seems to do the opposite of what I ask[3]. I'm assuming it's because my wikitext parser was broken and CLIP didn't have enough text data to train on; I'll have the answer tomorrow when I have a fully-trained U-Net to play with.
If I can ever get this working, I want to also build a CLIP pipeline that can attribute generated images against the training set. This would make it possible to safely use CC-BY and CC-BY-SA datasets: after generating
[0] At least in the US. Other jurisdictions think that scanning an image recopyrights it, see https://en.wikipedia.org/wiki/National_Portrait_Gallery_and_...
[1] Watch out for anything tagged with https://commons.wikimedia.org/wiki/Template:Royal_Museums_Gr... as that will taint your model.
[2] https://github.com/kmeisthax/PD-Diffusion
[3] https://pooper.fantranslation.org/@kmeisthax/109486435508334...
What are some alternatives?
sahi - Framework agnostic sliced/tiled inference + interactive ui + error analysis plots
computervision-recipes - Best Practices, code samples, and documentation for Computer Vision.
pyod - A Comprehensive and Scalable Python Library for Outlier Detection (Anomaly Detection)
dhash - Python library to calculate the difference hash (perceptual hash) for a given image, useful for detecting duplicates
CVPR2024-Papers-with-Code - A collection of CVPR 2024 papers and open-source projects
albumentations - Fast image augmentation library and an easy-to-use wrapper around other libraries. Documentation: https://albumentations.ai/docs/ Paper about the library: https://www.mdpi.com/2078-2489/11/2/125
plakakia - Python image tiling library for image processing, object detection, etc.
visionner - Visionner turns raw image data into numpy arrays, making it more suitable for deep learning tasks
flockfysh - A simple data vending machine that pops more out than what comes in. Use flockfysh to seamlessly pool existing datasets with quality web-scraped data to get top-notch datasets.
CLIP - CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
omni3d - Code release for "Omni3D A Large Benchmark and Model for 3D Object Detection in the Wild"