fdupes: Identify or Delete Duplicate Files

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • jdupes

    A powerful duplicate file finder and an enhanced fork of 'fdupes'.

    200 lines of Nim [1] seems to run about 9X faster than the 8000 lines of C in fdupes on a little test dir I have. If you need C, I think jdupes [2] is faster as @TacticalCoder points out a couple of times here. In my testing, `dups` is usually faster than `jdupes`, though.

    [1] https://github.com/c-blake/bu/blob/main/dups.nim

    [2] https://github.com/jbruchon/jdupes

  • fdupes

    FDUPES is a program for identifying or deleting duplicate files residing within specified directories.

  • fclones

    Efficient Duplicate File Finder

  • duperemove

    Tools for deduping file systems

    Very useful for identifying files that may need to be deduplicated or that can be removed entirely. Unfortunately, I don't think this will also find identical directories.

    If deleting files isn't what you want, I'd suggest looking into deduplicating tools.

    ZFS has its own deduplicator built in, which is nice. It should just deduplicate files and individual extents of files by itself once you enable it. Probably not a good idea on very write-heavy disks, but it's an option.

    Other file systems with extent-level deduplication can use https://github.com/markfasheh/duperemove to not only deduplicate files, but also deduplicate individual extents. This can be very useful for file systems that store a lot of duplicate content, like different WINE prefixes. For file systems without extent deduplication, duperemove should try hard linking files to make them take up practically no disk space.
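
The hard-linking idea mentioned above can be sketched independently of any particular tool. This is a minimal illustration in Python, not how duperemove works internally; the function name `replace_with_hardlink` and the `.hardlink-tmp` suffix are hypothetical. It assumes both files live on the same file system and that losing the duplicate's separate metadata (owner, mode, timestamps) is acceptable.

```python
import filecmp
import os


def replace_with_hardlink(original, duplicate):
    """Replace `duplicate` with a hard link to `original` if contents match.

    Returns True if the link was made, False if the pair was skipped.
    """
    so, sd = os.stat(original), os.stat(duplicate)
    if so.st_dev != sd.st_dev:
        return False  # hard links cannot cross file systems
    if so.st_ino == sd.st_ino:
        return False  # already the same underlying file
    if not filecmp.cmp(original, duplicate, shallow=False):
        return False  # byte-for-byte comparison failed; not a duplicate

    # Link under a temporary name first, then rename over the duplicate,
    # so the duplicate's path never disappears if something fails midway.
    tmp = duplicate + ".hardlink-tmp"  # hypothetical temporary name
    os.link(original, tmp)
    os.replace(tmp, duplicate)
    return True
```

After a successful call both paths refer to the same data, so editing one changes the other; that is the main behavioral difference from extent-level deduplication, where the file system keeps the files independent.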

  • Git

    Git Source Code Mirror - This is a publish-only repository but pull requests can be turned into patches to the mailing list via GitGitGadget (https://gitgitgadget.github.io/). Please follow Documentation/SubmittingPatches procedure for any of your improvements.

    You know another project with much of its source files in the top-level directory? https://github.com/git/git

  • czkawka

    Multi functional app to find duplicates, empty folders, similar images etc.

    I've used Czkawka (https://github.com/qarmin/czkawka) because it does Lanczos-based image duplicate detection, which makes it more practical for me.

  • duff

    Command-line utility for finding duplicate files

  • rmlint

    Extremely fast tool to remove duplicates and other lint from your filesystem

    My preferred solution is rmlint [https://github.com/sahib/rmlint] mostly because it also looks at duplicate directories. It produces a bash script instead of deleting anything itself, so you can examine it before running the script it made.
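
The "emit a script, delete nothing" approach described above can be illustrated with a short Python sketch. It is not rmlint's actual output format; the function name `removal_script` and the keep-the-first-path policy are assumptions made for the example.

```python
import shlex


def removal_script(groups):
    """Build a shell script that keeps the first path of each duplicate
    group and removes the rest. Nothing is deleted by this function."""
    lines = ["#!/bin/sh", "# Review before running: nothing has been deleted yet."]
    for group in groups:
        if len(group) < 2:
            continue  # nothing to remove in a group of one
        keep, *extras = group
        # repr() keeps odd characters (such as newlines) inside the comment line
        lines.append(f"# keep: {keep!r}")
        for path in extras:
            # shlex.quote makes spaces and quotes in file names shell-safe
            lines.append(f"rm -- {shlex.quote(path)}")
    return "\n".join(lines) + "\n"
```

Writing the result to a file and reading it before running it gives the same review step the comment values: the decision to delete stays with the person, not the scanner.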

  • lsdup

    List duplicate files and directories, then optionally take action on them, all from a commandline.

    Writing a program like this is one of the first exercises I give myself when learning a new programming language, because it touches a little bit of everything (reading files, output, CLI, using libraries, hashmaps, functions, loops, conditionals, etc) and isn't too onerous to implement.

    My latest (it's a few years old at this point) is lsdup (Rust version), using BLAKE3 for hashing the content: https://github.com/redsaz/lsdup/

    All it does is list the groups of duplicate files, grouped by hash, groups ordered by size. I'll usually pipe the output to a file, then do whatever I want to the list, and run a different script to process the resulting list. It works fine enough.
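
As a rough illustration of the exercise described above, here is a minimal Python sketch that lists duplicate groups, grouped by content hash and ordered by size. It is not lsdup's code: the function names are hypothetical, and BLAKE2b from the standard library stands in for BLAKE3, which Python does not ship. Grouping by size first is an assumed optimization so that files with a unique size are never read.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Hash a file's content in chunks so large files are not loaded whole."""
    h = hashlib.blake2b()  # stand-in for BLAKE3, which is not in the stdlib
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return (size, digest, paths) groups of identical files, largest first."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        for digest, same in by_hash.items():
            if len(same) > 1:
                groups.append((size, digest, sorted(same)))

    groups.sort(key=lambda g: g[0], reverse=True)
    return groups
```

Calling `find_duplicates(".")` and printing each group to a file gives the kind of list that a separate script can then process.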

  • bu

    B)asic|But-For U)tility Code/Programs (in Nim & Often Unix/POSIX/Linux Context)

  • kindfs

    Index filesystem into a database, then easily make queries e.g. to find duplicates files/dirs, or mount the index with FUSE.

    fdupes is really nice and fast, but (as far as I remember) it lacked two features that I needed for my use case: (1) listing duplicate directories (without listing all of their duplicate sub-contents), and (2) identifying that all the contents of one directory are already present in another part of the file system (regardless of file/directory structure), which is particularly useful when you have a bigmess/ directory that you are progressively sorting out into a clean/ directory. Put differently: fdupes helps to regain space, but it was not able to help me much with cleaning up a messy drive...

    This is why I wrote https://github.com/karteum/kindfs (which indexes the fs into an sqlite DB and then enables various ways to process it).
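
The index-then-query idea can be sketched with Python's built-in `sqlite3` module. This is not kindfs's schema or code; the table layout and the function names (`build_index`, `duplicate_groups`, `missing_from`) are assumptions for illustration. The last function covers the bigmess/ versus clean/ case: it reports files under one directory whose content appears nowhere under the other, regardless of names or layout.

```python
import hashlib
import os
import sqlite3


def build_index(root, db_path=":memory:"):
    """Walk root and record (path, size, digest) for every regular file."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(path TEXT PRIMARY KEY, size INTEGER, digest TEXT)"
    )
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as f:
                # Reads the whole file at once; acceptable for a sketch only.
                digest = hashlib.sha256(f.read()).hexdigest()
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                (path, os.path.getsize(path), digest),
            )
    conn.commit()
    return conn


def duplicate_groups(conn):
    """Return lists of paths that share the same content digest."""
    rows = conn.execute(
        "SELECT digest, path FROM files WHERE digest IN "
        "(SELECT digest FROM files GROUP BY digest HAVING COUNT(*) > 1) "
        "ORDER BY digest, path"
    ).fetchall()
    groups = {}
    for digest, path in rows:
        groups.setdefault(digest, []).append(path)
    return list(groups.values())


def missing_from(conn, messy_dir, clean_dir):
    """Paths under messy_dir whose content appears nowhere under clean_dir."""
    messy = os.path.join(messy_dir, "")  # adds a trailing separator
    clean = os.path.join(clean_dir, "")
    rows = conn.execute(
        "SELECT m.path FROM files m "
        "WHERE substr(m.path, 1, ?) = ? AND NOT EXISTS ("
        "  SELECT 1 FROM files c "
        "  WHERE substr(c.path, 1, ?) = ? AND c.digest = m.digest) "
        "ORDER BY m.path",
        (len(messy), messy, len(clean), clean),
    ).fetchall()
    return [r[0] for r in rows]
```

If `missing_from` returns an empty list, everything in the messy directory already exists somewhere in the clean one. The directory arguments must use the same path form that was passed to `build_index`, since the query matches on stored path prefixes.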

  • dude

    Duplicates Detector is a cross-platform GUI utility for finding duplicate files, allowing you to delete or link them to save space. Duplicate files are displayed and processed on two synchronized panels for efficient and convenient operation. (by PJDude)

    Hi. I recommend my little program. The bottleneck is the Tkinter GUI, but maybe it will be useful to someone:

    https://github.com/PJDude/dude

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
