Go Find Duplicates: blazingly-fast simple-to-use tool to find duplicate files

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  1. rdfind

    find duplicate files utility

    As far as I know, the standard tool for this is rdfind. This new tool claims to be "blazingly fast", so it should provide something to back that up: ideally a comparison with rdfind, but even a basic benchmark would make the claim less dubious. https://github.com/pauldreik/rdfind

    But the main problem is not the suspicious performance claim, it's the lack of explanation. The tool is supposed to "find duplicate files (photos, videos, music, documents)". Does that mean it is restricted to certain file types? Does it consider identical photos with different metadata to be duplicates? Compare this with rdfind, which clearly describes what it does, provides a summary of its algorithm, and even mentions alternatives.

    Overall, it may be a fine toy/hobby project (only 3 commits, 3 months ago); I didn't read the code (except to find the command-line options). I don't get why it got so much attention.
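
    For a rough sense of what such tools do internally, the brute-force baseline can be sketched as a shell one-liner (GNU coreutils assumed): checksum every file and print the groups with identical digests. rdfind and the other mature tools avoid most of this work by first ruling out files with unique sizes and differing first/last bytes before hashing anything.

    ```
    # Brute-force sketch: hash every file, then print groups whose MD5 digests match.
    # Mature tools (rdfind, fclones, ...) hash far fewer files by pre-filtering on size.
    find . -type f -exec md5sum {} + | sort | uniq -w 32 --all-repeated=separate
    ```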

  2. rmlint

    Extremely fast tool to remove duplicates and other lint from your filesystem

    I use and test assorted duplicate finders regularly.

    fdupes is the classic (going way way back) but it's really very slow, not worth using anymore.

    The four I know are worth trying these days (depending on data set, hardware, file arrangement and other factors, any one of these might be fastest for a specific use case) are https://github.com/jbruchon/jdupes , https://github.com/pauldreik/rdfind , https://github.com/jvirkki/dupd , https://github.com/sahib/rmlint

    Had not encountered fclones before, will give it a try.

  3. go-find-duplicates

    Find duplicate files (photos, videos, music, documents) on your computer, portable hard drives etc.

  4. duphard

    A simple utility to detect duplicate files and replace them with hard links.

    For example, I maintain a tar file and a docker image with Kafka connectors that share many jar files; using duphard I can save hundreds of megabytes, or even more than a gigabyte! For a documentation website with many copies of the same image (let's just say some static generators favor this practice for maintaining multiple versions), I can reduce the site size by 60%+, which makes ssh copies, docker pulls, etc. much faster and speeds up deployments.

    https://github.com/andmarios/duphard
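
    The mechanism behind those savings is ordinary hard links: two directory entries pointing at the same inode, so the data is stored only once. A hand-rolled illustration of the effect duphard automates (the file names here are made up, and both paths must be on the same filesystem):

    ```
    # Hypothetical paths, for illustration only.
    ls -i lib-a/common.jar lib-b/common.jar   # before: two different inode numbers
    ln -f lib-a/common.jar lib-b/common.jar   # replace the copy with a hard link to the original
    ls -i lib-a/common.jar lib-b/common.jar   # after: both names share one inode
    ```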

  5. fclones

    Efficient Duplicate File Finder

    See also fclones (focuses on performance, has benchmarks https://github.com/pkolaczk/fclones). I didn't know about rdfind but thought the standard was fdupes https://github.com/adrianlopezroche/fdupes, which is as fast (or slow) as rdfind according to fclones (and fclones is much faster).

  6. fdupes

    FDUPES is a program for identifying or deleting duplicate files residing within specified directories.

    See also fclones (focuses on performance, has benchmarks https://github.com/pkolaczk/fclones). I didn't know about rdfind but thought the standard was fdupes https://github.com/adrianlopezroche/fdupes, which is as fast (or slow) as rdfind according to fclones (and fclones is much faster).

  7. mpifileutils

    File utilities designed for scalability and performance.

    If you want something that scales horizontally, dcmp from https://github.com/hpc/mpifileutils is an option.
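
    If memory serves, dcmp is launched like any other MPI program, so the comparison work is spread across however many ranks (and nodes) you give it; roughly:

    ```
    # Assumed invocation; the two directory paths are placeholders.
    mpirun -np 8 dcmp /srv/data/original /srv/data/mirror
    ```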

  8. czkawka

    Multi functional app to find duplicates, empty folders, similar images etc.

    RESF checking in

    The first one I found, and still use since it became obvious that fslint is EOL, is czkawka [0] (meaning "hiccup" in Polish). Its speed is an order of magnitude higher than fslint's, and its memory use is 20%-75% of fslint's.

    <;)> Satisfied customer, would buy it again.

    [0] https://github.com/qarmin/czkawka

  9. fd

    A simple, fast and user-friendly alternative to 'find'

    ```find some/location -type d -wholename '*/January/Photos'```

    https://github.com/sharkdp/fd
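
    With fd itself, the rough equivalent (untested; the pattern is a regex matched against the full path) would be:

    ```fd --type d --full-path 'January/Photos$' some/location```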

  10. jdupes

    (Discontinued) A powerful duplicate file finder and an enhanced fork of 'fdupes'.

    I use and test assorted duplicate finders regularly.

    fdupes is the classic (going way way back) but it's really very slow, not worth using anymore.

    The four I know are worth trying these days (depending on data set, hardware, file arrangement and other factors, any one of these might be fastest for a specific use case) are https://github.com/jbruchon/jdupes , https://github.com/pauldreik/rdfind , https://github.com/jvirkki/dupd , https://github.com/sahib/rmlint

    Had not encountered fclones before, will give it a try.

  11. dupd

    CLI utility to find duplicate files

    I use and test assorted duplicate finders regularly.

    fdupes is the classic (going way way back) but it's really very slow, not worth using anymore.

    The four I know are worth trying these days (depending on data set, hardware, file arrangement and other factors, any one of these might be fastest for a specific use case) are https://github.com/jbruchon/jdupes , https://github.com/pauldreik/rdfind , https://github.com/jvirkki/dupd , https://github.com/sahib/rmlint

    Had not encountered fclones before, will give it a try.

  12. kindfs

    Index a filesystem into a database, then easily make queries, e.g. to find duplicate files/dirs, or mount the index with FUSE.

    FWIW if people are interested, I wrote https://github.com/karteum/kindfs for the purpose of indexing the hard drive, with the following goals

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.


Related posts

  • I'm amazed how I find anything & why I have so many dupes!

    4 projects | /r/DataHoarder | 8 Jul 2023
  • Johnny Decimal

    4 projects | news.ycombinator.com | 13 Jun 2023
  • Any good duplicate file finder for windows?

    3 projects | /r/sysadmin | 22 Apr 2023
  • ISO: Binary File Comparison Tool for Duplicate File Checks

    5 projects | /r/DataHoarder | 25 Jan 2022
  • File Servers... how are you handling duplicates

    1 project | /r/sysadmin | 8 Dec 2023