dedupe_it

Simple fuzzy deduplication (by SnowPilotOrg)

dedupe_it reviews and mentions

Posts with mentions or reviews of dedupe_it. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2024-11-04.
  • Show HN: Fuzzy deduplicate any CSV using vector embeddings
    2 projects | news.ycombinator.com | 4 Nov 2024
    - Does last-mile comparison and merges duplicates using Claude

    Demo video: https://youtu.be/7mZ0kdwXBwM

    Github repo (Apache 2.0 licensed): https://github.com/SnowPilotOrg/dedupe_it

    Background story: My company has a table for tracking leads, which includes website visitors, demo form submissions, app signups, and manual entries. It’s full of duplicates. And writing formulas to merge those dupes has been a massive PITA.

    I figured that an LLM could handle any data shape and give me a way to deal with tricky custom rules like “treat international subsidiaries as distinct from their parent company”.

    The challenging thing was avoiding an NxN comparison matrix. The solution I came up with was first narrowing down our search space using vector embeddings + semantic similarity search, and then using a generative LLM only to compare a few nearest neighbors and merge.

    Some cool attributes of this approach:

    - Can work incrementally (no reprocessing the entire dataset)
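
The two-stage idea described in the post (narrow the candidates with a similarity search, then run the expensive comparison on only a few nearest neighbors) could be sketched roughly as follows. This is not the project's code: `embed`, `same_entity`, and `IncrementalDeduper` are hypothetical names, the character-trigram vector is a toy stand-in for a real embedding model, and the threshold check is a placeholder for the LLM comparison.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy stand-in for an embedding model: character n-gram counts."""
    s = f" {text.lower().strip()} "
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def same_entity(a, b, threshold=0.5):
    """Placeholder for the LLM comparison step (assumption: a real
    system would prompt a model here, including any custom rules)."""
    return cosine(embed(a), embed(b)) >= threshold


class IncrementalDeduper:
    """Adds records one at a time, checking only the k nearest neighbors."""

    def __init__(self, k=3):
        self.k = k
        self.clusters = []  # each cluster is a list of duplicate records
        self.index = []     # (vector, cluster_id) for every stored record

    def add(self, record):
        vec = embed(record)
        # Linear scan for clarity; a real system would query a vector index.
        neighbors = sorted(
            self.index, key=lambda entry: cosine(vec, entry[0]), reverse=True
        )[: self.k]
        for _, cluster_id in neighbors:
            # Expensive check runs only against a handful of candidates.
            if same_entity(record, self.clusters[cluster_id][0]):
                self.clusters[cluster_id].append(record)
                self.index.append((vec, cluster_id))
                return cluster_id
        # No neighbor matched: the record starts a new cluster.
        self.clusters.append([record])
        self.index.append((vec, len(self.clusters) - 1))
        return len(self.clusters) - 1
```

Because each new record is compared against at most `k` stored records, the expensive check runs a bounded number of times per insert instead of once per pair, and new rows can be added without reprocessing the existing ones.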

Stats

Basic dedupe_it repo stats
Mentions: 1
Stars: 23
Activity: 7.6
Last commit: about 1 month ago

SnowPilotOrg/dedupe_it is an open source project licensed under the Apache License 2.0, which is an OSI-approved license.

The primary programming language of dedupe_it is Python.

