Go Find Duplicates: blazingly-fast simple-to-use tool to find duplicate files

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • rdfind

    find duplicate files utility

    As far as I know, the standard tool for this is rdfind. This new tool claims to be "blazingly fast", so it should provide something to show it. Ideally a comparison with rdfind, but even a basic benchmark would make it less dubious. https://github.com/pauldreik/rdfind

    But the main problem is not the suspicious performance, it's the lack of explanation. The tool is supposed to "find duplicate files (photos, videos, music, documents)". Does it mean it is restricted to some file types? Does it find identical photos with different metadata to be duplicates? Compare this with rdfind which clearly describes what it does, provides a summary of its algorithm, and even mentions alternatives.

    Overall, it may be a fine toy/hobby project (only 3 commits, 3 months ago). I didn't read the code, except to find the command-line options. I don't get why it got so much attention.
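
The question raised above (what exactly counts as a duplicate?) is easier to discuss against a concrete baseline. The following is a minimal sketch of one common approach, assuming "duplicate" means byte-identical content: group files by size first, then hash only the files that share a size. It is not the algorithm of rdfind, go-find-duplicates, or any other tool listed here, and the names `file_digest` and `find_duplicates` are made up for illustration.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return a sorted list of groups; each group is a sorted list of
    paths under `root` whose contents hash to the same digest."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)
```

Under this definition, two photos with the same pixels but different metadata are not duplicates, because their bytes differ. The sketch also works on any file type, since it never looks at extensions.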

  • rmlint

    Extremely fast tool to remove duplicates and other lint from your filesystem

    I use and test assorted duplicate finders regularly.

    fdupes is the classic (going way way back) but it's really very slow, not worth using anymore.

    The four I know are worth trying these days (depending on data set, hardware, file arrangement and other factors, any one of these might be fastest for a specific use case) are https://github.com/jbruchon/jdupes , https://github.com/pauldreik/rdfind , https://github.com/jvirkki/dupd , https://github.com/sahib/rmlint

    Had not encountered fclones before, will give it a try.

  • go-find-duplicates

    Find duplicate files (photos, videos, music, documents) on your computer, portable hard drives etc.

  • duphard

    A simple utility to detect duplicate files and replace them with hard links.

    For example, I maintain a tar file and a Docker image with Kafka connectors which share many jar files. Using duphard I can save hundreds of megabytes, or even more than a gigabyte! For a documentation website with many copies of the same image (let's just say some static generators favor this practice for maintaining multiple versions), I can reduce the website size by 60%+, which makes ssh copies, docker pulls, etc. way faster, speeding up deployment times.

    https://github.com/andmarios/duphard
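
To illustrate the idea of replacing a duplicate with a hard link, here is a minimal sketch. It is not duphard's implementation; the function name `replace_with_hardlink` is hypothetical, and the sketch assumes the caller has already verified that both files have identical contents and live on the same filesystem (hard links cannot cross filesystems, so `os.link` raises `OSError` otherwise).

```python
import os
import tempfile


def replace_with_hardlink(original, duplicate):
    """Replace `duplicate` with a hard link to `original`.

    Returns True if a link was created, or False if both paths
    already refer to the same file (nothing to do).
    Assumes the caller has checked that the contents are identical.
    """
    if os.path.samefile(original, duplicate):
        return False
    directory = os.path.dirname(os.path.abspath(duplicate))
    # Reserve a unique temporary name next to the duplicate, then free it
    # so os.link can create the new link under that name.
    fd, tmp = tempfile.mkstemp(dir=directory)
    os.close(fd)
    os.unlink(tmp)
    os.link(original, tmp)
    # Rename over the duplicate so the path never disappears.
    os.replace(tmp, duplicate)
    return True
```

Linking under a temporary name and renaming means the duplicate's path always points at a complete file, even if the process is interrupted. After the call both paths share one inode, so the data is stored only once.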

  • fclones

    Efficient Duplicate File Finder

    See also fclones (focuses on performance, has benchmarks https://github.com/pkolaczk/fclones). I didn't know about rdfind but thought the standard was fdupes https://github.com/adrianlopezroche/fdupes, which is as fast (or slow) as rdfind according to fclones (and fclones is much faster).

  • fdupes

    FDUPES is a program for identifying or deleting duplicate files residing within specified directories.

  • mpifileutils

    File utilities designed for scalability and performance.

    If you want something that scales horizontally, dcmp from https://github.com/hpc/mpifileutils is an option.

  • czkawka

    Multi functional app to find duplicates, empty folders, similar images etc.

    RESF checking in

    The first one I found, and still use, once it became obvious that fslint is EOL is czkawka [0] (meaning "hiccup" in Polish). Its speed is an order of magnitude higher than fslint's, and its memory use is 20%-75%.

    <;)> Satisfied customer, would buy it again.

    [0] https://github.com/qarmin/czkawka

  • fd

    A simple, fast and user-friendly alternative to 'find'

    ```find some/location -type d -wholename '*/January/Photos'```

    https://github.com/sharkdp/fd

  • jdupes

    Discontinued. A powerful duplicate file finder and an enhanced fork of 'fdupes'.

  • dupd

    CLI utility to find duplicate files

  • kindfs

    Index filesystem into a database, then easily make queries e.g. to find duplicates files/dirs, or mount the index with FUSE.

    FWIW if people are interested, I wrote https://github.com/karteum/kindfs for the purpose of indexing the hard drive, with the following goals
