-
I've been using `fclones` [1] to do this, with `dedupe`, which uses reflink/clonefile.
[1] https://github.com/pkolaczk/fclones
-
See the comments on https://news.ycombinator.com/item?id=38113396 for a list of alternatives. I used https://github.com/sahib/rmlint in the past and can't complain.
-
Thank you for creating and sharing this utility.
I ran it over my Postgres development directories that have almost identical files. It saved me about 1.7GB.
The project doesn't have a license associated with it. If you don't mind, could you please add a license of your choice?
As a gesture of thanks, I have attempted to improve the installation step slightly and have created this pull request: https://github.com/ttkb-oss/dedup/pull/6
-
I wrote a similar (but simpler) script which would replace a file by a hardlink if it has the same content.
My main motivation was for the packages of Python virtual envs, where I often have similar packages installed, and even if versions are different, many files would still match.
https://github.com/albertz/system-tools/blob/master/bin/merg...
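For anyone curious what the hardlink approach looks like, here is a minimal sketch. It is not the linked script (the URL above is truncated, so I haven't reproduced it), and the function name `replace_with_hardlink` and the `.hardlink-tmp` suffix are made up for illustration. It assumes both files live on the same filesystem, since hard links can't cross filesystems.

```python
import filecmp
import os


def replace_with_hardlink(original: str, duplicate: str) -> bool:
    """Replace `duplicate` with a hard link to `original` if contents match.

    Returns True if a link was created, False if nothing was changed.
    """
    # Leave symlinks alone; following them could relink an unexpected file.
    if os.path.islink(original) or os.path.islink(duplicate):
        return False
    # Already the same inode: nothing to save.
    if os.path.samefile(original, duplicate):
        return False
    # Byte-for-byte comparison (shallow=False ignores the stat shortcut).
    if not filecmp.cmp(original, duplicate, shallow=False):
        return False

    # Create the link under a temporary name next to the duplicate, then
    # rename it over the duplicate so the path never disappears.
    tmp = duplicate + ".hardlink-tmp"  # hypothetical temp name
    os.link(original, tmp)
    try:
        os.replace(tmp, duplicate)
    except OSError:
        os.unlink(tmp)
        raise
    return True
```

Keep in mind that, unlike a reflink/clone, a hard link means both paths share one file: editing one in place changes the other, and the duplicate's own permissions and timestamps are lost.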
-
Yes, Linux has a system call to do this for any filesystem with reflink support (and it is safe and atomic). You need a "driver" program to identify duplicates, but there are a handful out there. I've used https://github.com/markfasheh/duperemove and was very pleased with how it worked.
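To give a rough idea of what the "driver" part involves, here is a simplified sketch that only finds whole-file duplicates: group by size first, then hash the candidates. This is an illustration, not how duperemove is implemented, and the names `file_digest` and `find_duplicates` are made up. The actual reflink step is left out.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return a list of groups (sorted lists of paths) with identical content."""
    # Pass 1: bucket regular files by size; files of different sizes
    # can never be duplicates, so most files are never hashed.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_size[os.path.getsize(path)].append(path)

    # Pass 2: within each size bucket, group by content hash.
    groups = []
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue  # empty files and unique sizes save nothing
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Each returned group is a set of paths you could then hand to whatever does the actual cloning. A careful tool would still compare the bytes (or let the kernel do it) before sharing extents rather than trusting the hash alone.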
-
I think this is somewhat funny.
His comment is pretty understandable if you've done frontend work in JavaScript.
`node_modules` is so ripe for duplicate content that some tools explicitly call out that they're disk efficient (it's literally in the tagline for pnpm: "Fast, disk space efficient package manager", https://github.com/pnpm/pnpm).
So he got ok results (~13% savings) on possibly the best target content available in a user's home directory.
Then he got results so bad it's utterly not worth doing on the rest (0.10% - not 10%, literally 1/10 of a single percent).
---
Deduplication isn't super simple, isn't always obviously better, and can require other system resources in unexpected ways (e.g. lots of CPU and RAM). It's a cool tech to fiddle with on a NAS, and I'm generally a fan of modern CoW filesystems (including APFS).
But I want to be really clear: these are "picking spare change out of the couch" savings. Penny wise, pound foolish. The only people who are likely to actually save anything by buying this app probably already know it, and have a large set of real options available. Everyone else is falling into the "download more RAM" trap.