LSH
intertext
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
LSH
D Efficient Way To Cluster Millions Of Face
I'm looking into this. After reading the Wikipedia entry on it, it sounds promising! I already found a Python lib for this, https://github.com/mattilyra/LSH, and I will get back to you once I have tested it!
intertext
Ask HN: Tool to find text reuse, similar paragraphs, fuzzy/near dupes in folder?
Do you know of any tool that I can use to search my own vault of notes and documents for copied paragraphs or nearly identical phrases? Normal diffing/hashing wouldn't work, as we're talking about the contents of slightly modified documents, and about comparing each file against all the others.
I found the following tools that seem related yet not quite there, maybe I'm missing a particular term of art?
https://github.com/YaleDHLab/intertext
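For readers wondering what this kind of fuzzy comparison looks like in practice, here is a minimal sketch, not the method used by intertext or any other project listed on this page. It assumes word-level shingles and a MinHash estimate of Jaccard similarity, and it compares every document against every other one without an LSH index. All function names (`shingles`, `minhash_signature`, `estimated_jaccard`, `near_duplicates`) and the default parameters are hypothetical, chosen only for illustration.

```python
import hashlib
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle; empty texts yield no shingles.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, shingle):
    """64-bit hash of a shingle; each seed acts as a different hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_perm=128):
    """Keep the minimum hash per seed; return None for an empty shingle set."""
    if not shingle_set:
        return None
    return [min(_seeded_hash(seed, s) for s in shingle_set)
            for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree (estimates Jaccard similarity)."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def near_duplicates(docs, threshold=0.5, num_perm=128):
    """docs maps a name to its text; returns (name_a, name_b, estimate) tuples."""
    sigs = {name: minhash_signature(shingles(text), num_perm)
            for name, text in docs.items()}
    pairs = []
    # All-pairs comparison: quadratic in the number of documents.
    for a, b in combinations(sorted(sigs), 2):
        if sigs[a] is None or sigs[b] is None:
            continue
        estimate = estimated_jaccard(sigs[a], sigs[b])
        if estimate >= threshold:
            pairs.append((a, b, estimate))
    return pairs
```

The all-pairs loop is fine for a small folder of notes. For a large vault, an LSH index over the signatures (the approach the MinHash/LSH libraries listed under the alternatives are built around) would avoid comparing every pair.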
What are some alternatives?
datasketch - MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
sourmash - Quickly search, compare, and analyze genomic and metagenomic data sets.
image-ndd-lsh - Near-duplicate image detection using Locality Sensitive Hashing
neardup - Near-duplicate detection
Neural-Scam-Artist - Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
dash-tools - DashTools - Plotly Dash Command Line Tools - Create, Run and Deploy Templated Python Apps from Terminal
Cython - The most widely used Python to C compiler
dedup - Find duplicate text files.
dash - Data Apps & Dashboards for Python. No JavaScript Required.