fastdup vs PD-Diffusion

| | fastdup | PD-Diffusion |
|---|---|---|
| Mentions | 18 | 2 |
| Stars | 1,408 | 1 |
| Growth | 1.0% | - |
| Activity | 9.4 | 6.5 |
| Latest commit | 28 days ago | 10 months ago |
| Language | Python | Python |
| License | GNU General Public License v3.0 or later | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
fastdup
-
Visualize your dataset using DINOv2 embedding
Visualizing your dataset (especially large ones) in a low-dimensional embedding space can tell you a lot about the patterns and clusters in your dataset.
We recently released a notebook showing how you can visualize your dataset using DINOv2 models by running it on your CPU.
Yes! No GPUs needed.
We used it to find clusters of similar images, duplicates, and outliers in a subset of the LAION dataset.
Try it on your own dataset:
Colab notebook - https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/dinov2_notebook.ipynb
GitHub repo - https://github.com/visual-layer/fastdup
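The idea behind the notebook above is simple: embed every image, then project the embeddings down to two dimensions to reveal clusters and outliers. As a minimal, self-contained sketch of the projection step (assuming you already have an array of embeddings, e.g. 384-dimensional DINOv2 ViT-S features; this is not fastdup's internal code), a PCA via SVD needs only NumPy:

```python
import numpy as np

def project_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project high-dimensional embeddings to 2-D with PCA (via SVD)."""
    centered = embeddings - embeddings.mean(axis=0)
    # The top two right-singular vectors of the centered matrix span
    # the directions of greatest variance in the dataset.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Toy data: 100 "images" with 384-dim embeddings (DINOv2 ViT-S size).
rng = np.random.default_rng(0)
points = project_2d(rng.normal(size=(100, 384)))
print(points.shape)  # (100, 2)
```

The resulting 2-D points can be scattered with any plotting library; nearby points correspond to visually similar images.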
-
[R][P] How to extract feature vectors of large datasets using DINOv2 on CPU
Run 1M images from the LAION dataset through the DINOv2 model and cluster the images using a free tool - fastdup.
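Once feature vectors are extracted, duplicates fall out of pairwise cosine similarity: near-identical images produce near-identical embeddings. A hedged sketch of that core idea (brute-force over toy vectors; fastdup itself uses an approximate nearest-neighbor index to scale to millions of images):

```python
import numpy as np

def find_duplicates(emb: np.ndarray, threshold: float = 0.98):
    """Return index pairs whose cosine similarity exceeds the threshold."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = normed @ normed.T
    i, j = np.triu_indices(len(emb), k=1)  # upper triangle, skip self-pairs
    mask = sim[i, j] > threshold
    return list(zip(i[mask].tolist(), j[mask].tolist()))

# Toy check: vectors 0 and 2 are near-identical, so they pair up.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.999, 0.01]])
print(find_duplicates(emb))  # [(0, 2)]
```

The O(n²) similarity matrix here is fine for small sets; at the 1M-image scale mentioned above you would replace it with an approximate index.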
-
Computer Vision pre-trained model for finding how similar two photos of a room are
Another option could be fastdup (https://github.com/visual-layer/fastdup) which is probably most helpful for analysis type objectives.
-
Find image duplicates and outliers – A free, scalable, efficient tool
I recently stumbled upon fastdup, a tool that lets you gain insights from a large image/video collection.
-
How can we match images in our database?
There is this fastdup framework which supposedly allows you to find duplicates and similar images. I haven't used it, though.
-
Measure Images Similarity
I came across fastdup recently https://github.com/visual-layer/fastdup
- Dedup-ing LAION (60M duplicates) and ImageNet (1.2M duplicates) with fastdup
PD-Diffusion
-
What Does Copyright Say about Generative Models?
>But how much of a song or a painting can you reproduce?
The reason why fair use is vague is specifically to confuse people who ask these kinds of questions. The Supreme Court needed a tool that artists could use to legally smack down people who republish fragments of other people's work, but didn't want to abolish the 1st Amendment in the process. So basically judges have the final say as to whether or not something is novel creativity or in debt to the original. Any hard-and-fast rule beyond "binding precedent applies" is effectively copyright abolition by degrees.
>We lost most of Elizabethan theater because there was no copyright. [..] Without some kind of protection, authors had no interest in publishing at all, let alone publishing accurate texts.
This is a dated example, if only because creative works leave a lot more evidence now than they used to. People today will act to preserve art against the artist's own wishes and at great personal risk.
>and it’s easy to suspect that the actual payments will be similar to the royalties musicians get from streaming services: microcents per use
Given the amount of data these systems need (read: more than humanity can provide) I'd say microcents is arguably too high. Remember that you can't actually derive a clear chain of value between one particular training set entry and one particular execution of the model. It's all chucked into a blender that runs on almost-linear algebra and calculus. At best you can detect if parts of the image resemble specific training set examples[0] and pay people slightly more if the model regurgitates training set data.
Let's also keep in mind that a good chunk of the licensing system is based on being able to say no to specific users, or write very tailor-made licensing agreements for specific works or conditions. That's still going to be threatened, even if we can pay sub-Spotify-tier royalties every time a model trains itself on your work.
>It is easy to imagine an AI system that has been trained on the (many) Open Source and Creative Commons licenses.
Working on it: https://github.com/kmeisthax/PD-Diffusion
The thing is, we already have a good database of reusable, public-domain, no-attribution-necessary images; it's called Wikimedia Commons. I really can't fathom why OpenAI didn't start there, other than just an assumption that they were entitled to larger datasets or a feeling that they could get established before anyone sued.
Even then, OpenAI already tried this with computer code and they're getting sued for it anyway, because they never bothered with attribution in the case of training set regurgitation.
[0] This is possible because part of the prompt guidance process involves a thing called CLIP which can do both image and text classification in the same coordinate system.
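The footnote's point is that CLIP embeds images and text into one shared coordinate system, so a single dot product can compare a caption to an image. A minimal sketch with made-up unit-length vectors (the vectors and captions here are hypothetical, not real CLIP output):

```python
import numpy as np

# Hypothetical embeddings: in a CLIP-style model, images and captions
# land in the same space, so one dot product compares them directly.
def best_match(text_vec, image_vecs):
    """Index of the image embedding most similar to the text embedding."""
    t = text_vec / np.linalg.norm(text_vec)
    imgs = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    return int(np.argmax(imgs @ t))

text = np.array([0.9, 0.1, 0.0])           # e.g. "a painting of a dog"
images = np.array([[0.1, 0.9, 0.0],        # cat photo
                   [0.8, 0.2, 0.1],        # dog painting
                   [0.0, 0.1, 0.9]])       # landscape
print(best_match(text, images))  # 1
```

The same comparison run against training-set embeddings is what would let a pipeline flag generated images that resemble specific training examples.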
-
Laion-5B: A New Era of Open Large-Scale Multi-Modal Datasets
The positive solution is to scrape Wikimedia Commons for everything in "Category: PD-Art-old-100" and train from scratch on that data. Wikimedia Commons is well-moderated, the image data is public domain[0], and the labels can be filtered down to CC-BY or CC-BY-SA subsets[1]. Your resulting model will be CC-BY-SA licensed and the output completely copyright-free.
For the record, that's what I've been trying to do[2]; my stumbling blocks have been training time and a bug where my resulting pipeline seems to do the opposite of what I ask[3]. I'm assuming it's because my wikitext parser was broken and CLIP didn't have enough text data to train on; I'll have the answer tomorrow when I have a fully-trained U-Net to play with.
If I can ever get this working, I want to also build a CLIP pipeline that can attribute generated images against the training set. This would make it possible to safely use CC-BY and CC-BY-SA datasets: after generating
[0] At least in the US. Other jurisdictions think that scanning an image recopyrights it, see https://en.wikipedia.org/wiki/National_Portrait_Gallery_and_...
[1] Watch out for anything tagged with https://commons.wikimedia.org/wiki/Template:Royal_Museums_Gr... as that will taint your model.
[2] https://github.com/kmeisthax/PD-Diffusion
[3] https://pooper.fantranslation.org/@kmeisthax/109486435508334...
What are some alternatives?
sahi - Framework agnostic sliced/tiled inference + interactive ui + error analysis plots
computervision-recipes - Best Practices, code samples, and documentation for Computer Vision.
pyod - A Comprehensive and Scalable Python Library for Outlier Detection (Anomaly Detection)
dhash - Python library to calculate the difference hash (perceptual hash) for a given image, useful for detecting duplicates
CVPR2024-Papers-with-Code - A collection of CVPR 2024 papers and open-source projects
albumentations - Fast image augmentation library and an easy-to-use wrapper around other libraries. Documentation: https://albumentations.ai/docs/ Paper about the library: https://www.mdpi.com/2078-2489/11/2/125
plakakia - Python image tiling library for image processing, object detection, etc.
visionner - Visionner turns raw image data into numpy arrays, making it more suitable for deep learning tasks
flockfysh - A simple data vending machine that pops more out than what comes in. Use flockfysh to seamlessly pool existing datasets with quality web-scraped data to get top-notch datasets.
CLIP - CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
omni3d - Code release for "Omni3D A Large Benchmark and Model for 3D Object Detection in the Wild"