Top 15 Python Deduplication Projects
-
Project mention: Ask HN: How do you manage files and backups as an individual? | news.ycombinator.com | 2024-07-14
I started using Nextcloud for file/contact/calendar syncing a few years ago, and have gradually moved most of my digital life into it. Documents and photos, but also scripts to automatically set some things up for me when I do a fresh install of Linux (I've been playing around with a few different distros lately). The only things that don't live in Nextcloud are some old DVD rips, and that's mostly due to "haven't gotten around to it yet". Besides those, if it's not in Nextcloud, it's not something I care too much about losing if a disk were to fail.
My Nextcloud instance is then backed up to another drive on the same machine plus two off-site locations - an old server I still run at my parents' house, and an external HDD my friend let me plug in to his server. The on- and off-site backups are done using `borg` (https://www.borgbackup.org/), which does deduplication and encryption (with the keys backed up in 1Password).
I've been meaning to set up an automated restore on one of the offsite servers - a script to automatically unpack the latest backup, set up a fresh DB, and make things read-only - firstly to verify that the backups are functional and complete, but also as a backup in case my home server or home internet goes down. I know in _theory_ I've got everything I need in the backups to do a full restore, but I can't recall the last time I actually tried it out...
-
Dupeguru
-
Project mention: Syncthing – A decentralized continuous file synchronization program | news.ycombinator.com | 2024-08-18
You could use Syncthing just to empty the incoming files from your phone (ingest) and then move the photos via cron to a second folder (also Syncthing) which is just shared with the replicas.
Another approach would be to push the files from Syncthing to borg (borgmatic can do replicas) https://torsion.org/borgmatic/
-
splink
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
Project mention: Finding near-duplicates with Jaccard similarity and MinHash | news.ycombinator.com | 2024-07-04
For deduplicating people, the Fellegi-Sunter model is also a powerful approach. Splink[0] is a Python library that implements this for big datasets. Probably you could combine parts of both approaches as well. Full disclosure: I am the lead author.
[0] https://github.com/moj-analytical-services/splink
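As a rough illustration of the Fellegi-Sunter idea mentioned above, the sketch below scores a pair of records by summing log2 Bayes factors per compared field. It is not Splink's API; the field names, the m/u probabilities, and the prior are made-up values chosen only for illustration.

```python
import math

# Hypothetical parameters (illustrative values, not estimated from data).
# m = P(field agrees | the two records are a true match)
# u = P(field agrees | the two records are not a match)
FIELD_PARAMS = {
    "first_name": {"m": 0.90, "u": 0.010},
    "surname": {"m": 0.95, "u": 0.005},
    "birth_year": {"m": 0.98, "u": 0.020},
}


def match_weight(agreements, params=FIELD_PARAMS):
    """Sum the log2 Bayes factors over fields; a positive total favors a match."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = params[field]["m"], params[field]["u"]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def match_probability(weight, prior=0.001):
    """Turn a total match weight into a posterior probability under an assumed prior."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** weight
    return posterior_odds / (1 + posterior_odds)


if __name__ == "__main__":
    pair = {"first_name": True, "surname": True, "birth_year": False}
    w = match_weight(pair)
    print(f"weight={w:.2f}, probability={match_probability(w):.4f}")
```

Agreement on a rare value (low u) contributes a large positive weight, while disagreement on a reliable field (high m) contributes a large negative one, which is what lets the model trade off conflicting evidence across fields.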
-
LSH
Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents
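A minimal sketch of the MinHash idea behind near-duplicate text detection follows. It uses only the standard library and is not the LSH project's API; the shingle length, the number of hash functions, and the use of salted BLAKE2b as the hash family are assumptions made for illustration.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of k-character shingles of lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def _hash(item, seed):
    """64-bit hash of a shingle; the seed selects one member of the hash family."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash value over the set."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
    b = minhash_signature(shingles("the quick brown fox jumped over the lazy dog"))
    print(estimate_jaccard(a, b))
```

In a full LSH pipeline the signatures would additionally be split into bands and bucketed so that only documents sharing a bucket are compared; that step is omitted here.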
-
-
npbackup
A secure and efficient file backup solution that fits both system administrators (CLI) and end users (GUI)
-
benji
Benji Backup: A block based deduplicating backup software for Ceph RBD images, iSCSI targets, image files and block devices
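To make "block based deduplicating" concrete, here is a small sketch of the general technique: data is split into blocks, each block is stored once under its content hash, and an image is recorded as a list of block hashes. This is not Benji's implementation; the `BlockStore` class, the fixed block size, and the in-memory dictionary are hypothetical simplifications.

```python
import hashlib


class BlockStore:
    """Toy content-addressed store: identical blocks are kept only once."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # SHA-256 hex digest -> block bytes

    def put(self, data):
        """Split data into fixed-size blocks and return the list of block digests."""
        recipe = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store only if not seen before
            recipe.append(digest)
        return recipe

    def get(self, recipe):
        """Reassemble the original data from its list of block digests."""
        return b"".join(self.blocks[digest] for digest in recipe)


if __name__ == "__main__":
    store = BlockStore(block_size=4)
    recipe = store.put(b"AAAABBBBAAAACC")
    print(len(recipe), "blocks referenced,", len(store.blocks), "blocks stored")
```

A second backup of a mostly unchanged image would add only the blocks whose content changed, since every unchanged block already exists under the same digest.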
-
unisim
UniSim is a package for efficient similarity computation, fuzzy matching, and clustering of data.
Project mention: Finding near-duplicates with Jaccard similarity and MinHash | news.ycombinator.com | 2024-07-04
Hashing or tiny neural nets combined with a vector search engine using Tanimoto/Jaccard similarity is a very common deduplication strategy for large datasets. It might be wiser than using linear-complexity MapReduce operations.
There is a nice Google project using the 0.5M-parameter RETSim model and the USearch engine for that: https://github.com/google/unisim
-
dude
Duplicates Detector is a cross-platform GUI utility for finding duplicate files, allowing you to delete or link them to save space. Duplicate files are displayed and processed on two synchronized panels for efficient and convenient operation. (by PJDude)
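A common way to find exact duplicate files, sketched below, is to group files by size first and hash only the files that share a size. This is a generic illustration and not how dude itself is implemented; the function names and the choice of SHA-256 are assumptions.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return groups (sorted lists of paths) of files with identical content."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            by_digest[file_digest(path)].append(path)

    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]


if __name__ == "__main__":
    for group in find_duplicates("."):
        print(group)
```

The sketch only reports duplicates; deleting or hard-linking them, as GUI tools offer, is deliberately left out.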
-
Neural-Scam-Artist
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
-
Project mention: Aug 7, 2024 - Developing Data-Centric Visual AI Apps Workshop | dev.to | 2024-08-07
From concept interpolation to image deduplication, optical character recognition, and even curating your own AI art gallery by adding generated images directly into a dataset, your imagination is the only limit. Join us to discover how you can unleash your creativity and interact with data like never before.
-
-
-
Deduper
The goal of this project is to make a deduper program that anybody can run on their computer to save storage space. (by ThatOneShortGuy)
Python Deduplication related posts
-
Finding near-duplicates with Jaccard similarity and MinHash
-
Splink: Fast, accurate, scalable probabilistic data linkage
-
I Backup
-
Duplicity
-
How to use onedrive for culling photos
-
Does anyone know any freeware duplicate file checkers without an upsell similar to awesome duplicate photo finder?
-
Kopia: Open-Source, Fast and Secure Open-Source Backup Software
-
Index
What are some of the best open-source Deduplication projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | BorgBackup | 11,050 |
2 | dupeguru | 5,303 |
3 | borgmatic | 1,735 |
4 | splink | 1,293 |
5 | LSH | 278 |
6 | dduper | 168 |
7 | npbackup | 163 |
8 | benji | 136 |
9 | unisim | 109 |
10 | dude | 108 |
11 | Neural-Scam-Artist | 23 |
12 | image-deduplication-plugin | 14 |
13 | dedup | 11 |
14 | chunkdup | 2 |
15 | Deduper | 0 |