duplicate-detection

Open-source projects categorized as duplicate-detection

Top 12 duplicate-detection Open-Source Projects

  • duplicut

    Remove duplicates from MASSIVE wordlist, without sorting it (for dictionary-based password cracking)

  • depp

    ⚡ Check your npm modules for unused and duplicate dependencies fast

  • videohash

    Near Duplicate Video Detection (Perceptual Video Hashing) - Get a 64-bit comparable hash-value for any video.

  • Project mention: videohash / video fingerprinting Question : Detecting if a small clip is part of a longer movie | /r/learnprogramming | 2023-06-27

    Hi all, I am trying to write a program that detects whether a certain video scene (normally no longer than 10 seconds) appears in a longer video file. The idea is that if I find a scene on YouTube, I want to know which episode of a particular TV show it comes from (assuming I know the show, but not the episode).

    Current solution:

    [a] Extract frames from the reference clip using ffmpeg (fps = 1).
    [b] Extract frames from the longer video file using ffmpeg (fps around 0.1 or 0.5).

    For each frame from [a], I compute an imagehash for it and for the frames from [b], compare the Hamming distances, keep the lowest distance from that round of comparison, and move on to the next frame from [a]. Eventually I get an average score, which tells me whether this TV episode contains the scene I am looking for. However, this is slow and inefficient.

    I found that there is a videohash library, https://github.com/akamhy/videohash, but it says "Videohash cannot be used to verify whether one video is a part of another (video fingerprinting)." Does anybody know why? Is it because it computes one hash for the whole video? If so, what if I use the videohash library to create a hash for my reference clip (say, about 10 seconds long), then split the longer video into multiple 10-second segments, generate a videohash for each, and compare those with the hash of my reference clip? Would that work? (Yes, I understand that for a 60-minute movie that would mean calculating about 360 video hashes.) Do you think this is better? Thanks.
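
The frame-matching scoring described in this post can be sketched as follows. This is only an illustration, assuming each frame has already been reduced to a perceptual hash (for example by an image hashing library) and is represented here as a plain integer. The function names `hamming` and `clip_score` and the sample hash values are hypothetical and do not come from the post or from videohash.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def clip_score(clip_hashes, episode_hashes):
    """Average, over clip frames, of the lowest Hamming distance
    to any frame of the episode. Lower means a better match."""
    if not clip_hashes or not episode_hashes:
        raise ValueError("both hash lists must be non-empty")
    best = [min(hamming(c, e) for e in episode_hashes) for c in clip_hashes]
    return sum(best) / len(best)


# Made-up 8-bit values standing in for real frame hashes.
clip = [0b11110000, 0b10101010]
episode_a = [0b00000000, 0b11110001, 0b10101000]  # contains close matches
episode_b = [0b00001111, 0b01010101]              # unrelated frames

scores = {
    "episode_a": clip_score(clip, episode_a),
    "episode_b": clip_score(clip, episode_b),
}
best_episode = min(scores, key=scores.get)
```

Every clip frame is compared against every episode frame, so the cost grows with the product of the two frame counts, which is consistent with the slowness the poster reports.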

  • deduplicator

    Filter, Sort & Delete Duplicate Files Recursively

  • Panako

    The Panako acoustic fingerprinting system.

  • Project mention: Show HN: Pyzam, Shazam for DJs and Mixtapes in Python | news.ycombinator.com | 2024-04-24

    Hello, really glad to see projects like this popping up. I have a few questions, as I was working on something similar a few years ago:

    1. I did some development myself on a "Track Discovery for DJs"[1] project in this space of "DJ music recognition", and I am wondering how you are able to handle mixtapes and DJ mixes when a significant amount of sound manipulation or distortion is applied, such as pitch/tempo changes plus various effects. In my tests this totally confused the algorithms, which were not designed to handle such cases.

    2. Can you share which algorithm you have implemented for this project? I read most of the research papers in this space, and my preferred solution was to build upon https://github.com/JorenSix/Panako, which I did.

    In "minimal microhouse techno" type genres, where tracks often share similar rhythm patterns or are even built from the same sample packs, it proved difficult to get reliable results.

    I was investigating how Spotify and other market leaders do track recognition, and they train ML models on versions of the same track with 100+ different effects applied...

    Curious to hear your thoughts...

    [1] - https://rominimal.club

  • bughub

    A collection of free-text bug reports for duplicate issue identification

  • Project mention: need help in implementing siamese network with triplet loss for predicting duplicate tickets . | /r/learnmachinelearning | 2023-05-26

    Source datasets: https://github.com/logpai/bugrepo

  • removedupes

    Remove Duplicate Messages

  • cbird

    Command-line program for managing a media collection, with focus on Content-Based Image Retrieval (Computer Vision) methods for finding duplicates.

  • Project mention: Similar photo finder for screenshots | /r/DataHoarder | 2023-07-04

    Try cbird (my program). In addition to playing with thresholds, there are a couple of options that might help. Also if you can share the screenshots I'd be interested in testing.

  • dude

    Duplicates Detector is a cross-platform GUI utility for finding duplicate files, allowing you to delete or link them to save space. Duplicate files are displayed and processed on two synchronized panels for efficient and convenient operation. (by PJDude)

  • Project mention: fdupes: Identify or Delete Duplicate Files | news.ycombinator.com | 2023-11-02

    Hi. I recommend my little program. The bottleneck is the GUI in tkinter, but maybe it will be useful to someone:

    https://github.com/PJDude/dude

  • photodedupe

    A utility for locating near duplicate photos irrespective of image resolution, compression settings or file format.

  • Project mention: Tips for cleaning duplicate files out of directories | /r/linux | 2023-06-05

    That focuses on exact duplicates through hashing. I found https://github.com/InexplicableMagic/photodedupe to be helpful for finding near duplicates in images through LSH.
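
The "exact duplicates through hashing" approach mentioned in this comment can be sketched as below. This is a minimal standard-library illustration, not the interface of any tool listed here; `file_digest` and `find_exact_duplicates` are hypothetical names. Near-duplicate detection, as in photodedupe, needs a similarity-preserving hash instead and is not covered by this sketch.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(root):
    """Return groups of paths under `root` whose contents are byte-identical."""
    # Group by size first: files of different sizes cannot be identical,
    # so only same-size candidates need to be hashed.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            by_digest[file_digest(path)].append(path)

    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

Calling `find_exact_duplicates("some/directory")` returns a list of path groups; two re-encoded copies of the same photo would not be grouped, which is the gap the comment points out.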

  • samanlainen

    Delete duplicate files

  • DupCatch

    This tool is built to find duplicates in Anki cards that are not identified by the built-in Anki 'find duplicates' function.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

duplicate-detection related posts

  • Show HN: Pyzam, Shazam for DJs and Mixtapes in Python

    2 projects | news.ycombinator.com | 24 Apr 2024
  • Similar photo finder for screenshots

    1 project | /r/DataHoarder | 4 Jul 2023
  • videohash / video fingerprinting Question : Detecting if a small clip is part of a longer movie

    1 project | /r/learnprogramming | 27 Jun 2023
  • Video File Deduplication and Indexing/Sorting Software?

    1 project | /r/DataHoarder | 22 Sep 2022
  • cbird Visual Deduplicator v0.6 Update

    1 project | /r/DataHoarder | 7 Aug 2022

Index

What are some of the best open-source duplicate-detection projects? This list will help you:

#   Project       Stars
1   duplicut      777
2   depp          266
3   videohash     257
4   deduplicator  254
5   Panako        174
6   bughub        109
7   removedupes   78
8   cbird         72
9   dude          54
10  photodedupe   16
11  samanlainen   6
12  DupCatch      0
