Python Deduplication

Open-source Python projects categorized as Deduplication

Top 15 Python Deduplication Projects

  • BorgBackup

    Deduplicating archiver with compression and authenticated encryption.

    Project mention: Ask HN: How do you manage files and backups as an individual? | news.ycombinator.com | 2024-07-14

    I started using Nextcloud for file/contact/calendar syncing a few years ago, and have gradually moved most of my digital life into it. Documents and photos, but also scripts to automatically set some things up for me when I do a fresh install of Linux (I've been playing around with a few different distros lately). The only things that don't live in Nextcloud are some old DVD rips, and that's mostly due to "haven't gotten around to it yet". Besides those, if it's not in Nextcloud, it's not something I care too much about losing if a disk were to fail.

    My Nextcloud instance is then backed up to another drive on the same machine plus two off-site locations - an old server I still run at my parents' house, and an external HDD my friend let me plug in to his server. The on- and off-site backups are done using `borg` (https://www.borgbackup.org/), which does deduplication and encryption (with the keys backed up in 1Password).

    I've been meaning to set up an automated restore on one of the offsite servers - a script to automatically unpack the latest backup, set up a fresh DB, and make things read-only - firstly to verify that the backups are functional and complete, but also as a backup in case my home server or home internet goes down. I know in _theory_ I've got everything I need in the backups to do a full restore, but I can't recall the last time I actually tried it out...

  • dupeguru

    Find duplicate files

    Project mention: How to use onedrive for culling photos | /r/onedrive | 2023-12-11

    Dupeguru

  • borgmatic

    Simple, configuration-driven backup software for servers and workstations

    Project mention: Syncthing – A decentralized continuous file synchronization program | news.ycombinator.com | 2024-08-18

    You could use Syncthing just to empty the incoming files from your phone (ingest) and then move the photos via cron to a second folder (also Syncthing) which is just shared with the replicas.

    Another approach would be to push the files from Syncthing to borg (borgmatic can do replicas) https://torsion.org/borgmatic/

  • LSH

    Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents

  • dduper

    Fast block-level out-of-band BTRFS deduplication tool.

  • npbackup

    A secure and efficient file backup solution that fits both system administrators (CLI) and end users (GUI)

    Project mention: Duplicity | news.ycombinator.com | 2024-01-24

  • benji

    Benji Backup: A block based deduplicating backup software for Ceph RBD images, iSCSI targets, image files and block devices

  • unisim

    UniSim is a package for efficient similarity computation, fuzzy matching, and clustering of data.

    Project mention: Finding near-duplicates with Jaccard similarity and MinHash | news.ycombinator.com | 2024-07-04

    Hashing or tiny neural nets combined with a Vector Search engine with Tanimoto/Jaccard is a very common deduplication strategy for large datasets. It might be wiser than using linear-complexity MapReduce operations.

    There is a nice Google project using 0.5 M parameter RETSim model and the USearch engine for that: https://github.com/google/unisim

  • dude

    Duplicates Detector is a cross-platform GUI utility for finding duplicate files, allowing you to delete or link them to save space. Duplicate files are displayed and processed on two synchronized panels for efficient and convenient operation. (by PJDude)

    Project mention: Dude v2 Released | news.ycombinator.com | 2024-05-24

  • Neural-Scam-Artist

    Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.

  • image-deduplication-plugin

    Remove exact and approximate duplicates from your dataset in FiftyOne!

    Project mention: Aug 7, 2024 - Developing Data-Centric Visual AI Apps Workshop | dev.to | 2024-08-07

    From concept interpolation to image deduplication, optical character recognition, and even curating your own AI art gallery by adding generated images directly into a dataset, your imagination is the only limit. Join us to discover how you can unleash your creativity and interact with data like never before.

  • dedup

    Find duplicate text files.

  • chunkdup

    Find (partial content) duplicate files.

  • Deduper

    The goal of this project is to make a deduper program that anybody can run on their computer to save storage space. (by ThatOneShortGuy)
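
Several entries above (dupeguru, dude, dedup, Deduper) are described as finding duplicate files. As a rough illustration of the general idea only, and not of how any of those tools is actually implemented, the sketch below groups files by size first and then by a SHA-256 content hash. The function names `file_digest` and `find_duplicates` are hypothetical and chosen for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def find_duplicates(root):
    """Return groups (sorted lists of paths) of files with identical content."""
    # Cheap first pass: files of different sizes cannot be identical.
    by_size = defaultdict(list)
    for p in Path(root).rglob("*"):
        if p.is_file() and not p.is_symlink():
            by_size[p.stat().st_size].append(p)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        # Second pass: hash only the files that share a size.
        by_hash = defaultdict(list)
        for p in paths:
            by_hash[file_digest(p)].append(p)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Calling `find_duplicates("some/directory")` returns a list of groups, each holding two or more paths with the same content. Hashing only same-sized files avoids reading most of the tree; real tools typically add further shortcuts (partial hashes, hard-link detection) that are left out here.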

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
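
The LSH entry and the unisim mention above both refer to near-duplicate detection with Jaccard similarity and MinHash. The following is a minimal standard-library sketch of that technique, assuming word-level shingles and salted BLAKE2b as the family of hash functions. It is not the implementation used by LSH or unisim (the mention describes unisim as using the RETSim model and USearch), and all function names here are hypothetical.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a whitespace-tokenized text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items, num_perm=128):
    """For each of num_perm salted hash functions, keep the minimum hash value."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy cat near the river bank"
exact = jaccard(shingles(doc1), shingles(doc2))
approx = estimate_jaccard(minhash_signature(shingles(doc1)),
                          minhash_signature(shingles(doc2)))
print(f"exact={exact:.3f} estimated={approx:.3f}")
```

The signature has a fixed length regardless of document size, so comparing two documents costs `num_perm` integer comparisons instead of a set intersection. An LSH index would additionally band the signatures so that only likely-similar pairs are compared; that step is omitted from this sketch.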

Index

What are some of the best open-source Deduplication projects in Python? This list will help you:

 #  Project                      Stars
 1  BorgBackup                  11,050
 2  dupeguru                     5,303
 3  borgmatic                    1,735
 4  splink                       1,293
 5  LSH                            278
 6  dduper                         168
 7  npbackup                       163
 8  benji                          136
 9  unisim                         109
10  dude                           108
11  Neural-Scam-Artist              23
12  image-deduplication-plugin      14
13  dedup                           11
14  chunkdup                         2
15  Deduper                          0
