Top 23 Deduplication Open-Source Projects
-
Project mention: Ask HN: What is your approach for managing personal digital assets? | news.ycombinator.com | 2024-03-24
I religiously use Google Contacts. It's the simplest way to keep people's contact details up to date on Android.
I archive all important documents in specific folders by subject and date. This is backed up to Backblaze with restic. https://restic.net/
I use https://ente.io for pictures. I convinced my wife to use it, and she agreed to auto share her photos so I don't nag her for copies. It had simple import from Facebook and Google.
I also keep extensive journals, which really helps to tie it all together. I can basically grep for hangouts, conversations, etc.
I also separate work journal from personal, and have essentially a journal for each project. https://jodavaho.io/tags/bullet-journal.html for how.
I religiously use Google Calendar for all plans; you can easily search it for past events to get dates.
I also use monicahq for some notes about things I should remember about people but the habit never stuck.
-
kopia
Cross-platform backup tool for Windows, macOS & Linux with fast, incremental backups, client-side end-to-end encryption, compression and data deduplication. CLI and GUI included.
I've been happy with: https://kopia.io/
Fairly easy to configure, does snapshots to S3 and has an icon in my tray I can watch :)
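As a rough illustration of what "data deduplication" means for a backup tool, here is a minimal sketch of a content-addressed chunk store. It is not kopia's implementation: the `ChunkStore` class and the tiny fixed chunk size are assumptions made for readability, and real backup tools typically add smarter chunk boundaries, compression, and encryption on top of this idea.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store: identical chunks are kept only once."""

    def __init__(self, chunk_size=4):
        # Hypothetical, deliberately tiny chunk size so the demo stays readable.
        self.chunk_size = chunk_size
        self.chunks = {}  # SHA-256 hex digest -> chunk bytes

    def put(self, data: bytes) -> list:
        """Store data and return its recipe (the ordered list of chunk digests)."""
        recipe = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # setdefault leaves an already-stored chunk untouched.
            self.chunks.setdefault(digest, chunk)
            recipe.append(digest)
        return recipe

    def get(self, recipe: list) -> bytes:
        """Rebuild the original bytes from a recipe."""
        return b"".join(self.chunks[digest] for digest in recipe)


store = ChunkStore(chunk_size=4)
first = store.put(b"AAAABBBBCCCC")
second = store.put(b"AAAABBBBDDDD")  # shares its first two chunks with `first`
```

Both inputs are three chunks long, but the store ends up holding only four distinct chunks, because the shared `AAAA` and `BBBB` chunks are stored once and referenced from both recipes.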
-
Dupeguru
-
libpostal
A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
@echo off

REM Check if MSYS2 and MinGW are installed
where msys2 2>nul >nul
if %errorlevel% equ 0 (
    echo MSYS2 is already installed. Use --force to reinstall.
) else (
    REM Install MSYS2 and MinGW
    choco install msys2
    refreshenv
)

REM Check if MSYS2 packages are updated
pacman -Qu 2>nul >nul
if %errorlevel% equ 0 (
    echo MSYS2 packages are already updated. Use --force to reinstall.
) else (
    REM Update MSYS2 packages
    pacman -Syu
)

REM Check if build dependencies are installed
pacman -Q autoconf automake curl git make libtool gcc mingw-w64-x86_64-gcc 2>nul >nul
if %errorlevel% equ 0 (
    echo Build dependencies are already installed. Use --force to reinstall.
) else (
    REM Install build dependencies
    pacman -S autoconf automake curl git make libtool gcc mingw-w64-x86_64-gcc
)

REM Check if libpostal is cloned
if exist libpostal (
    echo libpostal repository is already cloned. Use --force to reinstall.
) else (
    REM Clone libpostal repository
    git clone https://github.com/openvenues/libpostal
)
cd libpostal

REM Check if libpostal is built and installed (path quoted because it contains a space)
if exist "C:\Program Files\libpostal\bin\libpostal.dll" (
    echo libpostal is already built and installed. Use --force to reinstall.
) else (
    REM Build and install libpostal
    cp -rf windows/* ./
    ./bootstrap.sh
    ./configure --datadir=C:/libpostal
    make -j4
    make install
)

REM Check if libpostal is added to PATH environment variable
setx /m PATH "%PATH%;C:\Program Files\libpostal\bin" 2>nul >nul
if %errorlevel% equ 0 (
    echo libpostal is already added to PATH environment variable. Use --force to reinstall.
) else (
    REM Add libpostal to PATH environment variable
    setx PATH "%PATH%;C:\Program Files\libpostal\bin"
)

REM Test libpostal installation
libpostal "100 S Broad St, Philadelphia, PA"
pause
-
My preferred solution is rmlint [https://github.com/sahib/rmlint] mostly because it also looks at duplicate directories. It produces a bash script instead of deleting anything itself, so you can examine it before running the script it made.
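The "produce a script instead of deleting" workflow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not rmlint's algorithm or output format: the function names are hypothetical, it compares whole-file SHA-256 digests, and it does not detect duplicate directories.

```python
import hashlib
import os
import shlex
from collections import defaultdict


def file_digest(path, block_size=1 << 16):
    """Return the SHA-256 hex digest of a file, read in blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()


def find_duplicates(root):
    """Group regular files under root by content; return groups with 2+ files."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_digest[file_digest(path)].append(path)
    return sorted(sorted(paths) for paths in by_digest.values() if len(paths) > 1)


def removal_script(groups):
    """Build a shell script that would remove all but the first file per group.

    Nothing is deleted here; the caller is expected to review the script first.
    """
    lines = ["#!/bin/sh", "# Review before running; nothing has been deleted yet."]
    for paths in groups:
        keep, *extras = paths
        lines.append(f"# keeping {keep}")
        lines.extend(f"rm -- {shlex.quote(path)}" for path in extras)
    return "\n".join(lines) + "\n"
```

Calling `removal_script(find_duplicates(some_directory))` returns the script as a string, which can be written to a file and inspected before anything is run.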
-
- for important files, a separate box where I have borgmatic [1] in deduplication mode installed; this is updated once in a while
Just curious: Do you have any reason to believe that such a data corruption bug is likely in ZFS? It seems like saying that ext4 could have a bug and you should also store stuff on NTFS, just in case (which I don't think makes sense).
-
I'm a huge fan of restic as well. My only complaint is performance and memory usage. I'm looking forward to being able to use Rustic: https://rustic.cli.rs/
-
Project mention: Help! Does anyone know how to install johncena141 games on linux? | /r/LinuxCrackSupport | 2023-07-01
on a fresh install all you need is dwarfs https://github.com/mhx/dwarfs and libopenal1
-
splink
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
Project mention: Splink: Fast, accurate, scalable probabilistic data linkage | news.ycombinator.com | 2024-03-13
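To give a feel for what "probabilistic data linkage" involves, here is a minimal sketch of Fellegi-Sunter style match weights, the model family Splink builds on. It does not use Splink's API. The field names, the m/u probabilities, and the prior are hypothetical values chosen for illustration; a real tool estimates them from the data.

```python
import math

# Hypothetical parameters, for illustration only:
#   m = P(field agrees | records are the same entity)
#   u = P(field agrees | records are different entities)
PARAMS = {
    "first_name": {"m": 0.9, "u": 0.01},
    "surname": {"m": 0.95, "u": 0.005},
    "city": {"m": 0.8, "u": 0.1},
}


def match_weight(record_a, record_b, params=PARAMS):
    """Sum per-field log2 Bayes factors for agreement or disagreement."""
    total = 0.0
    for field, p in params.items():
        if record_a.get(field) == record_b.get(field):
            total += math.log2(p["m"] / p["u"])
        else:
            total += math.log2((1 - p["m"]) / (1 - p["u"]))
    return total


def match_probability(weight, prior=0.001):
    """Turn a match weight into a probability, given an assumed prior match rate."""
    prior_odds = prior / (1 - prior)
    odds = prior_odds * 2 ** weight
    return odds / (1 + odds)


alice_1 = {"first_name": "alice", "surname": "ng", "city": "leeds"}
alice_2 = {"first_name": "alice", "surname": "ng", "city": "leeds"}
bob = {"first_name": "bob", "surname": "smith", "city": "york"}

same_weight = match_weight(alice_1, alice_2)
diff_weight = match_weight(alice_1, bob)
```

Agreement on a rare value (low u) pushes the weight up sharply, while disagreement pushes it down, so `same_weight` is strongly positive and `diff_weight` is negative. Exact equality is used here for simplicity; fuzzy comparison levels are a common extension.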
I really like restic, and am personally happy to use it via the command line. It's very fast and efficient! However, I do wish there was better tooling / wrappers around it. For example, Pika Backup is a popular UI for Borg, and no equivalent exists for restic. I'd love to be able to set something simple up on my partner's MacBook.
For my own purposes, I've been using a script I found on GitHub[0] for a while, but it only really supports Backblaze B2 AFAIK.[1]
I've been meaning to try autorestic[2] and resticprofile[3] as they are potentially more flexible than the script I'm currently using, and prestic[4] looks intriguing for my partner's use, but seems to have very few users. And the fact that there are so many competing tools makes it difficult to land on one.
[0] https://github.com/erikw/restic-automatic-backup-scheduler
[1] https://github.com/erikw/restic-automatic-backup-scheduler/i...
[2] https://github.com/cupcakearmy/autorestic
-
-
Project mention: Announcing rustic - fast, encrypted, deduplicated backups powered by Rust | /r/rust | 2023-04-24
I'm not really doing much with it anymore, but I have a somewhat similar project: https://github.com/dpc/rdedup
-
LSH
Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents
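The general MinHash plus banding technique named in this description can be sketched in pure Python as follows. This is not the LSH project's code or API: the function names, the 3-word shingles, and the choice of 64 hashes split into 16 bands are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Split text into a set of overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _hash(seed, item):
    """Deterministic 64-bit hash of an item, varied by seed."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all items."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures, bands=16):
    """Band the signatures; documents sharing any whole band become candidates."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river bend",
    "c": "probabilistic record linkage estimates match weights for candidate pairs",
}
signatures = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
```

Documents `a` and `b` differ only in their last word, so their signatures agree in most positions, while `c` shares no shingles with either. Banding avoids comparing every pair of documents: only those that collide in at least one band are checked further.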
-
-
cargo-limit
Productivity improvements for Rust ecosystem: warnings are skipped until errors are fixed, LSP-independent Neovim integration, etc.
-
-
zpaqfranz
Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
Now, on to file backups: if you value your data, don't make just one backup copy; make two or three. I'd also recommend software that takes snapshots, so you can restore whichever version you need. I have been using zpaqfranz for a few years now. It is command-line software, but you can write a batch file and update the archive when needed. It adds only new and changed files, so only the first backup takes long.
-
-
-
entity-embed
PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
-
benji
Benji Backup: A block based deduplicating backup software for Ceph RBD images, iSCSI targets, image files and block devices
Deduplication related posts
- Splink: Fast, accurate, scalable probabilistic data linkage
- I Backup
- Duplicity
- Restic – Simple Backups
- The Drive Stats of Backblaze Storage Pods
- Rustic – fast, encrypted, and deduplicated backups
- How to use onedrive for culling photos
-
A note from our sponsor - WorkOS
workos.com | 28 Mar 2024
Index
What are some of the best open-source Deduplication projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | restic | 23,429 |
2 | BorgBackup | 10,422 |
3 | alertmanager | 6,233 |
4 | kopia | 6,079 |
5 | dupeguru | 4,692 |
6 | libpostal | 3,935 |
7 | rmlint | 1,757 |
8 | borgmatic | 1,619 |
9 | rustic | 1,442 |
10 | dwarfs | 1,244 |
11 | splink | 1,060 |
12 | autorestic | 1,055 |
13 | zingg | 868 |
14 | rdedup | 818 |
15 | LSH | 271 |
16 | deduplicator | 253 |
17 | cargo-limit | 237 |
18 | kvdo | 236 |
19 | zpaqfranz | 213 |
20 | vdo | 186 |
21 | dduper | 162 |
22 | entity-embed | 138 |
23 | benji | 136 |