Top 20 Python Deduplication Projects
-
Project mention: Ask HN: What Are You Working On? (February 2026) | news.ycombinator.com | 2026-02-08
Hey, sorry it was hacked.
You mention rsync, which can be fine. But there are plenty of other solutions, many of which have snapshot features. I use borg backup, for instance. https://www.borgbackup.org/
Also, look into scripting your server setup with tools like Ansible or PyInfra. There is always the risk that bad things happen to servers, and when you need a new server it's great to be able to spin things up in a matter of minutes. Tools like these are professional best practice these days.
In fact, if you have a scripted server setup and a server that doesn't collect data itself, you may not even need backups. What is there to back up? Just spin up a new server with your scripts and carry on.
-
Project mention: DupeGuru – a tool to find duplicate files on your computer | news.ycombinator.com | 2026-02-12
-
Project mention: Borgmatic: The Simple, Configuration-Driven Guardian for Your Server Backups | dev.to | 2025-10-29
-
splink
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
-
Project mention: Curator: Scalable data pre processing and curation toolkit for LLMs | news.ycombinator.com | 2025-08-20
-
Project mention: Show HN: Pyversity – Fast Result Diversification for Retrieval and RAG | news.ycombinator.com | 2025-10-19
True, I think that's also a great use case! These algorithms likely won't scale to very large datasets (e.g. millions of samples), but for smaller datasets, like fine-tuning sets, I think this would work very well. I've worked on something similar in the past that works for larger datasets (semantic deduplication: https://github.com/MinishLab/semhash).
-
npbackup
A secure and efficient file backup solution that fits both system administrators (CLI) and end users (GUI)
-
dude
Duplicates Detector is a cross-platform GUI utility for finding duplicate files, allowing you to delete or link them to save space. Duplicate files are displayed and processed on two synchronized panels for efficient and convenient operation. (by PJDude)
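For readers unfamiliar with how duplicate-file finders generally work, here is a minimal sketch of the common approach: group files by size first, then compare SHA-256 digests only within groups of equal size. This is an illustration, not the implementation used by dude or any other project listed here; the function names `file_digest` and `find_duplicates` are hypothetical.

```python
import hashlib
import os
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root: str) -> list[list[Path]]:
    """Group regular files under root that have identical content."""
    # Pass 1: bucket files by size, which is cheap and needs no reads.
    by_size: dict[int, list[Path]] = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            p = Path(dirpath) / name
            if p.is_file() and not p.is_symlink():
                by_size[p.stat().st_size].append(p)

    # Pass 2: hash only the files that share a size with another file.
    groups: list[list[Path]] = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size implies unique content
        by_hash: dict[str, list[Path]] = defaultdict(list)
        for p in paths:
            by_hash[file_digest(p)].append(p)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Calling `find_duplicates("/some/dir")` returns a list of groups, each holding the paths of files with identical content. Deleting or hard-linking the extras is left out on purpose, since that step is destructive and tool-specific.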
-
benji
Benji Backup: A block based deduplicating backup software for Ceph RBD images, iSCSI targets, image files and block devices
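As a rough illustration of what "block based deduplicating" means in general, the sketch below splits data into fixed-size blocks, stores each distinct block once under its SHA-256 digest, and records a backup as an ordered list of digests. It is a toy in-memory model under those assumptions, not Benji's actual design; the `BlockStore` class and its methods are hypothetical names.

```python
import hashlib


class BlockStore:
    """Toy in-memory store that keeps one copy of each distinct block."""

    def __init__(self, block_size: int = 4096) -> None:
        self.block_size = block_size
        self.blocks: dict[str, bytes] = {}

    def backup(self, data: bytes) -> list[str]:
        """Store data and return its manifest (ordered block digests)."""
        manifest = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            # Only blocks not seen before take up new space.
            self.blocks.setdefault(digest, block)
            manifest.append(digest)
        return manifest

    def restore(self, manifest: list[str]) -> bytes:
        """Rebuild the original data from a manifest."""
        return b"".join(self.blocks[d] for d in manifest)
```

Backing up two versions of an image that differ in a single block adds only that one block to the store, while both manifests remain fully restorable.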
-
Project mention: Show HN: FileHunter, Self-hosted file manager that remembers disconnected drives | news.ycombinator.com | 2026-03-03
-
Neural-Scam-Artist
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
-
fast-copy
Fast file copier with SSH streaming, deduplication & block-order I/O. 3-5x faster than scp/rsync/SFTP for remote transfers. Copies folders at maximum sequential disk speed. Tar pipe streaming, content-aware dedup, hard-linking. Cross-platform (Linux, macOS, Windows). Python CLI.
Project mention: Show HN: Fast_copy new release with new features | news.ycombinator.com | 2026-05-18
-
BitwardenDeduplicator
A Python-based tool to identify and remove duplicate entries from Bitwarden export files. This project simplifies the management of your Bitwarden vault by allowing you to filter, deduplicate, and save results in a structured format.
-
Project mention: I built Flow-Guard, an open-source finite-state workflow simulator for AI coding agents | dev.to | 2026-04-29
Repo: https://github.com/liuyingxuvka/FlowGuard
-
Deduper
The goal of this project is to make a deduper program that anybody can run on their computer to save storage space. (by ThatOneShortGuy)
-
fsxn-lakehouse-integrations
Data Lake and Lakehouse platform integrations with Amazon FSx for NetApp ONTAP via S3 Access Points
Project mention: AI Enrichment Pipeline: From Sample Classification to 100K-File Metadata Search with Bedrock and OpenSearch NextGen | dev.to | 2026-06-08
For sensitive workloads, use VPC interface endpoints for Bedrock Runtime and S3 VPC endpoints for batch input/output. See genai/bedrock-private-connectivity.md.
Python Deduplication related posts
-
Show HN: Fast_copy new release with new features
-
I launched an open source CLI tool with zero audience — here's what happened in 10 days
-
I built a faster alternative to cp and rsync — here's how it works
-
DupeGuru – a tool to find duplicate files on your computer
-
Show HN: Pyversity – Fast Result Diversification for Retrieval and RAG
-
AWS Restored My Account: The Human Who Made the Difference
-
Show HN: SemHash – Semantic Text Deduplication, Outlier Filtering and Sampling
-
Index
What are some of the best open-source Deduplication projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | BorgBackup | 13,397 |
| 2 | dupeguru | 7,615 |
| 3 | borgmatic | 2,273 |
| 4 | splink | 2,192 |
| 5 | Curator | 1,602 |
| 6 | semhash | 936 |
| 7 | npbackup | 336 |
| 8 | dude | 186 |
| 9 | benji | 150 |
| 10 | file-hunter | 60 |
| 11 | dedupe_it | 28 |
| 12 | Neural-Scam-Artist | 28 |
| 13 | image-deduplication-plugin | 18 |
| 14 | dedup | 14 |
| 15 | fast-copy | 7 |
| 16 | BitwardenDeduplicator | 5 |
| 17 | chunkdup | 3 |
| 18 | FlowGuard | 1 |
| 19 | Deduper | 0 |
| 20 | fsxn-lakehouse-integrations | 0 |