-
As far as I know, the standard tool for this is rdfind. This new tool claims to be "blazingly fast", so it should provide something to back that up. Ideally a comparison with rdfind, but even a basic benchmark would make the claim less dubious. https://github.com/pauldreik/rdfind
But the main problem is not the suspicious performance claim, it's the lack of explanation. The tool is supposed to "find duplicate files (photos, videos, music, documents)". Does that mean it is restricted to certain file types? Does it consider identical photos with different metadata to be duplicates? Compare this with rdfind, which clearly describes what it does, provides a summary of its algorithm, and even mentions alternatives.
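For reference, rdfind's interface is a single pass over one or more directories, and a dry run first is the usual precaution. The flags below are from memory of the man page, so double-check them before deleting anything:

```
# Report what would be done; findings go to results.txt
rdfind -dryrun true ~/Photos

# Replace duplicates with hard links instead of deleting them
rdfind -makehardlinks true ~/Photos
```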
Overall, it may be a fine toy/hobby project (only 3 commits, 3 months ago); I didn't read the code (except to find the command-line options). I don't get why it got so much attention.
-
I use and test assorted duplicate finders regularly.
fdupes is the classic (going way, way back), but it's really very slow and not worth using anymore.
The four I think are worth trying these days (depending on data set, hardware, file arrangement, and other factors, any one of them might be fastest for a specific use case) are https://github.com/jbruchon/jdupes , https://github.com/pauldreik/rdfind , https://github.com/jvirkki/dupd , and https://github.com/sahib/rmlint
Had not encountered fclones before, will give it a try.
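For anyone who hasn't used them, the basic invocations are all quite similar; these are from memory, so check the respective man pages before acting on the results:

```
# jdupes: recurse into a directory and print groups of identical files
jdupes -r ~/data

# rmlint: scan and emit a removal script (rmlint.sh) you can review before running
rmlint ~/data
```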
-
go-find-duplicates
Find duplicate files (photos, videos, music, documents) on your computer, portable hard drives etc.
-
For example, I maintain a tar file and a docker image with Kafka connectors which share many jar files. Using duphard I can save hundreds of megabytes, or even more than a gigabyte! For a documentation website with many copies of the same image (let's just say some static generators favor this practice for maintaining multiple versions), I can reduce the website size by 60%+, which then makes ssh copies, docker pulls, etc. much faster, speeding up deployment times.
https://github.com/andmarios/duphard
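The underlying trick is just replacing byte-identical copies with hard links; a rough illustration with plain coreutils and made-up file names (not duphard's actual interface, just the idea it automates):

```
# Two identical copies occupy the space twice
cp big-connector.jar copy-of-connector.jar
du -sh .

# Replace the copy with a hard link to the original; the bytes are now stored once
ln -f big-connector.jar copy-of-connector.jar
du -sh .
```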
-
See also fclones (focuses on performance, has benchmarks https://github.com/pkolaczk/fclones). I didn't know about rdfind but thought the standard was fdupes https://github.com/adrianlopezroche/fdupes, which is as fast (or slow) as rdfind according to fclones (and fclones is much faster).
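fclones splits the work into a grouping step and a separate action step, which makes it easy to review before touching anything. Roughly, as I recall its README (verify there before removing files):

```
# Step 1: find duplicate groups and save the report
fclones group ~/data > dupes.txt

# Step 2: act on the report, e.g. replace duplicates with hard links
fclones link < dupes.txt
```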
-
fdupes
FDUPES is a program for identifying or deleting duplicate files residing within specified directories.
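For completeness, the classic fdupes invocations, recursive listing and interactive deletion:

```
fdupes -r ~/Photos     # list sets of duplicate files
fdupes -rd ~/Photos    # prompt for which copy in each set to keep
```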
-
If you want something that scales horizontally, dcmp from https://github.com/hpc/mpifileutils is an option.
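A rough sketch of running it under MPI; the rank count and paths are made up, and the exact flags should be checked against the mpifileutils docs:

```
# Compare two directory trees in parallel across 64 MPI ranks
mpirun -np 64 dcmp /scratch/dataset-a /scratch/dataset-b
```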
-
RESF checking in
The first one I found, and still use now that it's obvious fslint is EOL, is czkawka [0] (meaning "hiccup" in Polish). Its speed is an order of magnitude higher than fslint's, and its memory use is 20%-75% of it.
<;)> Satisfied customer, would buy it again.
[0] https://github.com/qarmin/czkawka
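Besides the GUI there is also czkawka_cli; the duplicate scan looks roughly like this (from memory, so check czkawka_cli --help for the current flags):

```
czkawka_cli dup --directories /home/me/Pictures
```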
-
```find some/location -type d -wholename '*/January/Photos'```
https://github.com/sharkdp/fd
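The fd equivalent, assuming the goal is matching directories by their full path (fd patterns are regexes and match only the file name unless --full-path is given):

```
fd --type d --full-path 'January/Photos' some/location
```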
-
kindfs
Index filesystem into a database, then easily make queries, e.g. to find duplicate files/dirs, or mount the index with FUSE.
FWIW if people are interested, I wrote https://github.com/karteum/kindfs for the purpose of indexing the hard drive, with the following goals