Dedupe_it Alternatives
Similar projects and alternatives to dedupe_it
-
LSH
Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents
-
splink
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
-
npbackup
A secure and efficient file backup solution that fits both system administrators (CLI) and end users (GUI)
-
string-embed
String embedding for fast edit distance computation; code for the paper "Convolutional Embedding for Edit Distance" (SIGIR '20).
-
dedupe_it discussion
dedupe_it reviews and mentions
-
Show HN: Fuzzy deduplicate any CSV using vector embeddings
- Does last-mile comparison and merges duplicates using Claude
Demo video: https://youtu.be/7mZ0kdwXBwM
GitHub repo (Apache 2.0 licensed): https://github.com/SnowPilotOrg/dedupe_it
Background story: My company has a table for tracking leads, which includes website visitors, demo form submissions, app signups, and manual entries. It’s full of duplicates. And writing formulas to merge those dupes has been a massive PITA.
I figured that an LLM could handle any data shape and give me a way to deal with tricky custom rules like “treat international subsidiaries as distinct from their parent company”.
The challenging thing was avoiding an NxN comparison matrix. The solution I came up with was first narrowing down our search space using vector embeddings + semantic similarity search, and then using a generative LLM only to compare a few nearest neighbors and merge.
Some cool attributes of this approach:
- Can work incrementally (no reprocessing the entire dataset)
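The two-stage idea described above (narrow the candidates by vector similarity, then let a judge decide only among a few nearest neighbors) can be sketched as follows. This is a minimal illustration, not the dedupe_it implementation: `embed` is a toy character-trigram stand-in for a real embedding model, the brute-force neighbor scan stands in for a vector index, and `naive_judge` is a hypothetical placeholder for the LLM call. The names `Deduper`, `k`, and `min_sim` are likewise assumptions made for the example.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy stand-in for an embedding model: character n-gram counts."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[key] for key, v in a.items() if key in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def naive_judge(a, b):
    """Hypothetical placeholder for the LLM comparison step."""
    def norm(s):
        cleaned = "".join(c for c in s.lower() if c.isalnum() or c.isspace())
        return [w for w in cleaned.split() if w not in {"inc", "llc", "corp"}]
    return norm(a) == norm(b)


class Deduper:
    """Incremental deduper: each new record is checked only against its
    k most similar existing records, never against the whole dataset."""

    def __init__(self, judge, k=3, min_sim=0.3):
        self.judge = judge        # callable(new_text, old_text) -> bool
        self.k = k                # how many neighbors the judge may see
        self.min_sim = min_sim    # skip neighbors that are clearly unrelated
        self.records = []         # (text, vector) per record
        self.groups = []          # duplicate-group id per record
        self.next_group = 0
        self.judge_calls = 0

    def add(self, text):
        vec = embed(text)
        # Brute-force scan; a real system would query a vector index here.
        scored = sorted(
            ((cosine(vec, v), i) for i, (_, v) in enumerate(self.records)),
            reverse=True,
        )
        group = None
        for sim, i in scored[: self.k]:
            if sim < self.min_sim:
                break
            self.judge_calls += 1
            if self.judge(text, self.records[i][0]):
                group = self.groups[i]
                break
        if group is None:
            group = self.next_group
            self.next_group += 1
        self.records.append((text, vec))
        self.groups.append(group)
        return group


deduper = Deduper(judge=naive_judge)
names = ["Acme Inc.", "acme inc", "Globex Corp", "ACME, Inc", "Initech LLC"]
groups = [deduper.add(name) for name in names]
print(groups, deduper.judge_calls)
```

Because `add` only consults records already stored, new rows can be appended later without reprocessing earlier ones, and the expensive judge runs at most `k` times per row instead of once per pair.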
Stats
SnowPilotOrg/dedupe_it is an open source project licensed under the Apache License 2.0, which is an OSI-approved license.
The primary programming language of dedupe_it is Python.