intertext
sourmash
Metric | intertext | sourmash
---|---|---
Mentions | 1 | 1
Stars | 110 | 435
Growth | 0.0% | 2.1%
Activity | 0.0 | 9.4
Last commit | 12 months ago | 4 days ago
Language | Python | Python
License | - | GNU General Public License v3.0 or later
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
intertext
Ask HN: Tool to find text reuse, similar paragraphs, fuzzy/near dupes in folder?
Do you know of any tool that I can use to search my own vault of notes and documents for copied paragraphs or nearly identical phrases? Normal diffing/hashing wouldn't work, as we're talking about the contents of slightly modified documents and the comparison of each file against all others.
I found the following tools that seem related yet not quite there, maybe I'm missing a particular term of art?
https://github.com/YaleDHLab/intertext
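To make the problem in the question concrete, here is a minimal sketch of one common approach to near-duplicate text: split each paragraph into overlapping word shingles and compare the shingle sets by Jaccard similarity. It is not how intertext or any other tool on this page is implemented; the function names, the shingle size `k=3`, and the `0.5` threshold are all illustrative assumptions.

```python
import itertools
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicates(paragraphs, threshold=0.5, k=3):
    """Yield (i, j, score) for each pair of paragraphs scoring >= threshold."""
    sets = [shingles(p, k) for p in paragraphs]
    # Compare every paragraph against every other one (quadratic in the input size).
    for i, j in itertools.combinations(range(len(sets)), 2):
        score = jaccard(sets[i], sets[j])
        if score >= threshold:
            yield i, j, score


# Hypothetical sample data: two lightly edited paragraphs and one unrelated one.
paragraphs = [
    "The quick brown fox jumps over the lazy dog near the river bank.",
    "The quick brown fox jumps over the lazy dog near the old river bank.",
    "Completely unrelated text about genome sketching and k-mers.",
]
matches = list(near_duplicates(paragraphs))
for i, j, score in matches:
    print(f"paragraphs {i} and {j} look similar (Jaccard {score:.2f})")
```

The all-pairs loop is fine for a small vault but grows quadratically with the number of paragraphs; MinHash with locality-sensitive hashing is the usual way to scale the same idea to larger collections.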
sourmash
Any good meta-transcriptomics pipelines
Have you seen https://www.nature.com/articles/s41587-019-0209-9?ref=https://githubhelp.com and https://github.com/sourmash-bio/sourmash ?
What are some alternatives?
neardup - Near-duplicate detection
biobakery_workflows - bioBakery workflows is a collection of workflows and tasks for executing common microbial community analyses using standardized, validated tools and parameters.
dash-tools - DashTools - Plotly Dash Command Line Tools - Create, Run and Deploy Templated Python Apps from Terminal
GEMAP_NCLDV - Genome Mapping Analysis Pipeline for giant viruses
LSH - Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents
rustplus - Rust+ API Wrapper Written in Python for the Game: Rust
Neural-Scam-Artist - Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
sequence_align - Efficient implementations of Needleman-Wunsch and other sequence alignment algorithms written in Rust with Python bindings via PyO3.
datasketch - MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
mosec - A high-performance ML model serving framework, offers dynamic batching and CPU/GPU pipelines to fully exploit your compute machine
dash - Data Apps & Dashboards for Python. No JavaScript Required.