Top 12 duplicate-detection Open-Source Projects

duplicut

1 777 0.0 C

Remove duplicates from MASSIVE wordlist, without sorting it (for dictionary-based password cracking)
depp

7 266 0.0 Go

⚡ Check your npm modules for unused and duplicate dependencies fast
videohash

4 257 0.0 Python

Near Duplicate Video Detection (Perceptual Video Hashing) - Get a 64-bit comparable hash-value for any video.

Project mention: videohash / video fingerprinting Question : Detecting if a small clip is part of a longer movie | /r/learnprogramming | 2023-06-27

Hi all, I'm trying to write a program that detects whether a short video scene (normally 10 seconds or less) appears in a longer video file. The idea is that if I find a scene on YouTube, I want to know which episode of a particular TV show it comes from (assuming I know the show but not the episode).

Current solution:
[a] Extract frames from the reference clip using ffmpeg (fps = 1).
[b] Extract frames from the longer video file using ffmpeg (fps around 0.1 or 0.5).

For each frame from [a], I compute an imagehash for it and for the frames from [b], compare the Hamming distances, keep the lowest distance from this round of comparison, and move on to the next frame from [a]. Eventually I get an average score and can tell whether this TV episode contains the scene I was looking for. However, this is slow and inefficient.

I found the videohash library (https://github.com/akamhy/videohash), but it says "Videohash cannot be used to verify whether one video is a part of another (video fingerprinting)." Does anybody know why? Is it because it computes a videohash for the whole video? If so, what if I use the videohash library to create a hash for my reference clip (say, about 10 seconds), split the longer video into multiple 10-second segments, generate a videohash for each segment, and compare those with the hash of my reference clip? Would that work? (Yes, I understand that for a 60-minute movie that would mean about 360 video hashes to calculate.) Do you think this is better? Thanks.
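The comparison loop described in the post can be sketched in plain Python. This is a minimal illustration, not code from videohash or from the poster: it assumes the per-frame perceptual hashes have already been computed and converted to integers, and the function names (`hamming`, `mean_min_distance`, `best_window`) are hypothetical. `mean_min_distance` follows the poster's description (lowest distance per reference frame, then the average), while `best_window` is an alternative that also requires the frames to line up in order, assuming both sequences were sampled at the same frame rate.

```python
from typing import List, Tuple


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


def mean_min_distance(clip: List[int], movie: List[int]) -> float:
    """For each clip frame, keep the lowest distance to any movie frame,
    then average those minima (the approach described in the post)."""
    if not clip or not movie:
        raise ValueError("both hash sequences must be non-empty")
    minima = [min(hamming(c, m) for m in movie) for c in clip]
    return sum(minima) / len(minima)


def best_window(clip: List[int], movie: List[int]) -> Tuple[int, float]:
    """Slide the clip over the movie and return (start index, mean distance)
    of the best-aligned window. Assumes equal sampling rates."""
    if not clip or len(clip) > len(movie):
        raise ValueError("clip must be non-empty and no longer than movie")
    best_start, best_score = -1, float("inf")
    for start in range(len(movie) - len(clip) + 1):
        total = sum(hamming(c, movie[start + i]) for i, c in enumerate(clip))
        score = total / len(clip)
        if score < best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Toy 4-bit "hashes" standing in for real per-frame hashes.
movie_hashes = [0b0000, 0b1111, 0b1010, 0b0101, 0b0011]
clip_hashes = [0b1010, 0b0111]
score = mean_min_distance(clip_hashes, movie_hashes)
start, aligned_score = best_window(clip_hashes, movie_hashes)
```

Both functions cost on the order of len(clip) × len(movie) hash comparisons, which is consistent with the poster's remark that the approach is slow for long videos.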

deduplicator

3 254 5.8 Rust

Filter, Sort & Delete Duplicate Files Recursively
Panako

2 174 4.0 Java

The Panako acoustic fingerprinting system.

Project mention: Show HN: Pyzam, Shazam for DJs and Mixtapes in Python | news.ycombinator.com | 2024-04-24

Hello, really glad to see projects like this popping up. I have a few questions, as I was working on something similar a few years ago:
1. I did some development myself on a "Track Discovery for DJs"[1] project in this space of DJ music recognition, and I am wondering how you handle mixtapes and DJ mixes when a significant amount of sound manipulation or distortion is applied, such as pitch/tempo changes plus various effects. In my tests this totally confused the algorithms, which were not designed to handle such cases.
2. Can you share which algorithm you implemented for this project? I read most of the research papers in this space, and my preferred solution was to build upon https://github.com/JorenSix/Panako, which I did.
In genres like minimal microhouse techno, where tracks often share similar rhythm patterns or are even built from the same sample packs, getting reliable results proved difficult.
I was also investigating how Spotify and other market leaders do track recognition: they train ML models on the same track with 100+ different effects applied...
Curious to hear your thoughts...
[1] - https://rominimal.club

bughub

1 109 10.0

A collection of free-text bug reports for duplicate issue identification

Project mention: need help in implementing siamese network with triplet loss for predicting duplicate tickets . | /r/learnmachinelearning | 2023-05-26

Source datasets: https://github.com/logpai/bugrepo

removedupes

4 78 2.4 JavaScript

Remove Duplicate Messages
cbird

5 72 8.2 C++

Command-line program for managing a media collection, with focus on Content-Based Image Retrieval (Computer Vision) methods for finding duplicates.

Project mention: Similar photo finder for screenshots | /r/DataHoarder | 2023-07-04

Try cbird (my program). In addition to playing with thresholds, there are a couple of options that might help. Also if you can share the screenshots I'd be interested in testing.

dude

1 54 8.7 Python

Duplicates Detector is a cross-platform GUI utility for finding duplicate files, allowing you to delete or link them to save space. Duplicate files are displayed and processed on two synchronized panels for efficient and convenient operation. (by PJDude)

Project mention: fdupes: Identify or Delete Duplicate Files | news.ycombinator.com | 2023-11-02

Hi. I recommend my little program. The bottleneck is the Tkinter GUI, but maybe it will be useful to someone:
https://github.com/PJDude/dude

photodedupe

1 16 6.1 Rust

A utility for locating near duplicate photos irrespective of image resolution, compression settings or file format.

Project mention: Tips for cleaning duplicate files out of directories | /r/linux | 2023-06-05

That focuses on exact duplicates through hashing. I found https://github.com/InexplicableMagic/photodedupe to be helpful for finding near duplicates in images through LSH.
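The distinction drawn above, exact matching by hash versus near-duplicate lookup with locality-sensitive hashing (LSH), can be illustrated with a generic banding sketch. This is not photodedupe's implementation; it assumes each image has already been reduced to a 64-bit perceptual hash stored as an integer, and the function name `candidate_pairs` and the file names are hypothetical. The hash is split into equal bands, and two images become candidates if any band matches exactly, so hashes that differ in fewer bits than there are bands are guaranteed to share a bucket.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def candidate_pairs(
    hashes: Dict[str, int], bands: int = 4, bits: int = 64
) -> Set[Tuple[str, str]]:
    """Return pairs of names whose hashes agree exactly on at least one band."""
    width = bits // bands
    mask = (1 << width) - 1
    buckets: Dict[Tuple[int, int], List[str]] = defaultdict(list)
    for name, h in hashes.items():
        for band in range(bands):
            # Key each bucket by band position and that band's bit pattern.
            key = (band, (h >> (band * width)) & mask)
            buckets[key].append(name)
    pairs: Set[Tuple[str, str]] = set()
    for names in buckets.values():
        for a, b in combinations(sorted(names), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical hashes: a.jpg and b.jpg differ by one bit, c.jpg is unrelated.
example = {
    "a.jpg": 0xFFFF0000FFFF0000,
    "b.jpg": 0xFFFF0000FFFF0001,
    "c.jpg": 0x123456789ABCDEF0,
}
pairs = candidate_pairs(example)
```

Candidate pairs would still need a full Hamming-distance check afterwards; the banding step only avoids comparing every image against every other one.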

samanlainen

1 6 3.0 Rust

Delete duplicate files
DupCatch

1 0 10.0 Python

This tool is built to find duplicates in Anki cards that are not identified by the built-in Anki 'find duplicates' function

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

duplicate-detection related posts

Show HN: Pyzam, Shazam for DJs and Mixtapes in Python

2 projects | news.ycombinator.com | 24 Apr 2024
Similar photo finder for screenshots

1 project | /r/DataHoarder | 4 Jul 2023
videohash / video fingerprinting Question : Detecting if a small clip is part of a longer movie

1 project | /r/learnprogramming | 27 Jun 2023
Video File Deduplication and Indexing/Sorting Software?

1 project | /r/DataHoarder | 22 Sep 2022
cbird Visual Deduplicator v0.6 Update

1 project | /r/DataHoarder | 7 Aug 2022

Index

What are some of the best open-source duplicate-detection projects? This list will help you:

#	Project	Stars
1	duplicut	777
2	depp	266
3	videohash	257
4	deduplicator	254
5	Panako	174
6	bughub	109
7	removedupes	78
8	cbird	72
9	dude	54
10	photodedupe	16
11	samanlainen	6
12	DupCatch	0
