Top 5 Python Minhash Projects
-
datasketch
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
-
LSH
Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents
-
Neural-Scam-Artist
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
Project mention: Ask HN: Tool to find text reuse, similar paragraphs, fuzzy/near dupes in folder? | news.ycombinator.com | 2023-06-25
Do you know of any tool that I can use to compare my own notes and documents vault in search of copied paragraphs or almost similar phrases? Normal diffing/hashing wouldn't work, as we're talking about the contents of slightly modified documents and the comparison of each file against all others.
I found the following tools that seem related yet not quite there, maybe I'm missing a particular term of art?
https://github.com/YaleDHLab/intertext
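For readers with the same question, here is a minimal sketch of the MinHash idea that the projects on this list build on, using only the Python standard library. It is not the API of datasketch, LSH, or intertext: the function names (`shingles`, `minhash_signature`, `estimate_jaccard`), the 3-word shingle size, the 128 hash functions, and the salted BLAKE2b hash are all assumptions chosen for illustration.

```python
import hashlib
import re


def shingles(text, k=3):
    """Split text into a set of overlapping k-word shingles (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one item.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(item, seed):
    """64-bit hash of item under a seed (stands in for a random permutation)."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_perm=128):
    """Keep the minimum hash value per seed; requires a non-empty set."""
    if not items:
        raise ValueError("cannot build a MinHash signature of an empty set")
    return [min(_hash(item, seed) for item in items) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    # Hypothetical sample paragraphs that differ by a single word.
    original = "the quick brown fox jumps over the lazy dog near the quiet river bank"
    edited = "the quick brown fox jumps over the lazy dog near the noisy river bank"
    sig_a = minhash_signature(shingles(original))
    sig_b = minhash_signature(shingles(edited))
    print(f"estimated similarity: {estimate_jaccard(sig_a, sig_b):.2f}")
```

Comparing every signature against every other one still costs quadratic time in the number of documents. For a large vault, an LSH index of the kind datasketch and LSH advertise is the usual way to avoid the all-pairs comparison.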
Index
What are some of the best open-source Minhash projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | datasketch | 2,348 |
2 | sourmash | 432 |
3 | LSH | 273 |
4 | intertext | 110 |
5 | Neural-Scam-Artist | 22 |