Python Deduplication

Open-source Python projects categorized as Deduplication

Top 20 Python Deduplication Projects

  1. BorgBackup

    Deduplicating archiver with compression and authenticated encryption.

    Project mention: Ask HN: What Are You Working On? (February 2026) | news.ycombinator.com | 2026-02-08

    Hey, sorry it was hacked.

    You mention rsync which can be fine. But there are tons of other solutions, many that will have snapshot features. I use borg backup, for instance. https://www.borgbackup.org/

    Also, look into scripting your server setup with tools like Ansible or PyInfra. There is always the risk that bad things happen to servers, and when you want a new server it's great to be able to spin things up in a matter of minutes. Tools like these are professional best practice these days.

    In fact, if you have a scripted server setup and a server that doesn't collect data itself, you may not even need backups. What is there to back up? Just spin up a new server with your scripts and carry on.

  2. dupeguru

    Find duplicate files

    Project mention: DupeGuru – a tool to find duplicate files on your computer | news.ycombinator.com | 2026-02-12

  3. borgmatic

    Simple, configuration-driven backup software for servers and workstations

    Project mention: Borgmatic: The Simple, Configuration-Driven Guardian for Your Server Backups | dev.to | 2025-10-29

  5. Curator

    Scalable data pre processing and curation toolkit for LLMs (by NVIDIA-NeMo)

    Project mention: Curator: Scalable data pre processing and curation toolkit for LLMs | news.ycombinator.com | 2025-08-20

  6. semhash

    Fast Multimodal Semantic Deduplication & Filtering

    Project mention: Show HN: Pyversity – Fast Result Diversification for Retrieval and RAG | news.ycombinator.com | 2025-10-19

    True, I think that's also a great use case! These algorithms likely won't scale to very large datasets (e.g. millions of samples), but for smaller datasets, like fine-tuning sets, I think this would work very well. I've worked on something similar in the past that works for larger datasets (semantic deduplication: https://github.com/MinishLab/semhash).

  7. npbackup

    A secure and efficient file backup solution that fits both system administrators (CLI) and end users (GUI)

  8. dude

    Duplicates Detector is a cross-platform GUI utility for finding duplicate files, allowing you to delete or link them to save space. Duplicate files are displayed and processed on two synchronized panels for efficient and convenient operation. (by PJDude)

  9. benji

    Benji Backup: A block based deduplicating backup software for Ceph RBD images, iSCSI targets, image files and block devices

  10. file-hunter

    File Hunter — catalog, deduplicate, and consolidate your archive storage

    Project mention: Show HN: FileHunter, Self-hosted file manager that remembers disconnected drives | news.ycombinator.com | 2026-03-03

  11. dedupe_it

    Simple fuzzy deduplication

  12. Neural-Scam-Artist

    Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.

  13. image-deduplication-plugin

    Remove exact and approximate duplicates from your dataset in FiftyOne!

  14. dedup

    Find duplicate text files.

  15. fast-copy

    Fast file copier with SSH streaming, deduplication & block-order I/O. 3-5x faster than scp/rsync/SFTP for remote transfers. Copies folders at maximum sequential disk speed. Tar pipe streaming, content-aware dedup, hard-linking. Cross-platform (Linux, macOS, Windows). Python CLI.

    Project mention: Show HN: Fast_copy new release with new features | news.ycombinator.com | 2026-05-18

  16. BitwardenDeduplicator

    A Python-based tool to identify and remove duplicate entries from Bitwarden export files. This project simplifies the management of your Bitwarden vault by allowing you to filter, deduplicate, and save results in a structured format.

  17. chunkdup

    Find (partial content) duplicate files.

  18. FlowGuard

    An architecture simulator for checking software workflows before AI agents write code.

    Project mention: I built Flow-Guard, an open-source finite-state workflow simulator for AI coding agents | dev.to | 2026-04-29

    Repo: https://github.com/liuyingxuvka/FlowGuard

  19. Deduper

    The goal of this project is to make a deduper program that anybody can run on their computer to save storage space. (by ThatOneShortGuy)

  20. fsxn-lakehouse-integrations

    Data Lake and Lakehouse platform integrations with Amazon FSx for NetApp ONTAP via S3 Access Points

    Project mention: AI Enrichment Pipeline: From Sample Classification to 100K-File Metadata Search with Bedrock and OpenSearch NextGen | dev.to | 2026-06-08

    For sensitive workloads, use VPC interface endpoints for Bedrock Runtime and S3 VPC endpoints for batch input/output. See genai/bedrock-private-connectivity.md.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
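
Several projects on this list describe themselves as finding duplicate files. As a rough illustration of the general idea only, the sketch below groups byte-identical files by size and then by SHA-256 digest. It is not taken from any listed project; the function names `file_digest` and `find_duplicates` are hypothetical, and real tools may use quite different strategies (fuzzy matching, chunk-level comparison, and so on).

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large files are never loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group regular files under root whose contents are byte-for-byte identical."""
    # Files of different sizes cannot be identical, so bucket by size first
    # and only hash files that share a size with at least one other file.
    by_size: dict[int, list[Path]] = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file() and not path.is_symlink():
            by_size[path.stat().st_size].append(path)

    groups: list[list[Path]] = []
    for candidates in by_size.values():
        if len(candidates) < 2:
            continue
        by_hash: dict[str, list[Path]] = defaultdict(list)
        for path in candidates:
            by_hash[file_digest(path)].append(path)
        # Keep only digests shared by two or more files.
        groups.extend(sorted(paths) for paths in by_hash.values() if len(paths) > 1)
    return groups
```

Calling `find_duplicates(Path("some/directory"))` returns a list of groups, each holding the paths of files with identical content. The sketch only reports duplicates; deleting or hard-linking them is left to the caller.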

Python Deduplication related posts

  • Show HN: Fast_copy new release with new features

    1 project | news.ycombinator.com | 18 May 2026
  • I launched an open source CLI tool with zero audience — here's what happened in 10 days

    1 project | dev.to | 11 Apr 2026
  • I built a faster alternative to cp and rsync — here's how it works

    1 project | dev.to | 5 Apr 2026
  • DupeGuru – a tool to find duplicate files on your computer

    1 project | news.ycombinator.com | 12 Feb 2026
  • Show HN: Pyversity – Fast Result Diversification for Retrieval and RAG

    2 projects | news.ycombinator.com | 19 Oct 2025
  • AWS Restored My Account: The Human Who Made the Difference

    1 project | news.ycombinator.com | 7 Aug 2025
  • Show HN: SemHash – Semantic Text Deduplication, Outlier Filtering and Sampling

    1 project | news.ycombinator.com | 27 Apr 2025

Index

What are some of the best open-source Deduplication projects in Python? This list will help you:

#   Project                        Stars
1   BorgBackup                    13,397
2   dupeguru                       7,615
3   borgmatic                      2,273
4   splink                         2,192
5   Curator                        1,602
6   semhash                          936
7   npbackup                         336
8   dude                             186
9   benji                            150
10  file-hunter                       60
11  dedupe_it                         28
12  Neural-Scam-Artist                28
13  image-deduplication-plugin        18
14  dedup                             14
15  fast-copy                          7
16  BitwardenDeduplicator              5
17  chunkdup                           3
18  FlowGuard                          1
19  Deduper                            0
20  fsxn-lakehouse-integrations        0
