Seeking advice on improving NLP search results

txtai

354 6,910 9.3 Python

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

finetuner

36 1,423 5.5 Python

🎯 Task-oriented embedding tuning for BERT, CLIP, etc.

A while back I came across some info about a self-supervised sentence embedding system that reportedly outperforms the Sentence Transformers NLI models, but I forgot where it was. You could use Jina’s Finetuner. It lets you boost your pre-trained models' performance and get them ready for production without spending a lot of time on labeling or buying expensive hardware.

sentence-transformers

45 13,661 9.1 Python

Multilingual Sentence & Image Embeddings with BERT

Not sure what kind of texts you have, but these models have a max sequence length of 512 tokens (roughly 350 words). If your texts are longer than that, consider splitting them into chunks, or creating a summary and taking an embedding of that. A clustering algorithm may be the way to go here. Here's a bunch of examples. I use agglomerative clustering for my use case.
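For the chunking idea, something like the following might be a starting point. It is only a rough sketch: it splits on whitespace and counts words rather than model tokens, and the function name `chunk_words` and the defaults of 300 words per chunk with a 50-word overlap are my own assumptions, chosen to stay under the roughly 350-word budget mentioned above.

```python
def chunk_words(text, max_words=300, overlap=50):
    """Split text into overlapping windows of at most max_words words.

    Word counts are only a proxy for the model's token limit, so
    max_words is kept below the ~350-word estimate on purpose.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    if not words:
        return []

    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk can then be embedded on its own, and the chunk embeddings searched or clustered like any other vectors. The overlap is there so that a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks.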

hnswlib

12 3,984 6.6 C++

Header-only C++/python library for fast approximate nearest neighbors

3000 texts doesn't sound like too many, so maybe a brute-force cosine similarity calculation to find the most similar vector would work. If that's taking too much time, maybe look at KNN or ANN modules to speed up finding the most similar vector. I use hnswlib in knn mode for this. It sorts through about 350,000 vectors in about 30-50 msec.
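To give an idea of what the brute-force approach looks like, here is a minimal sketch in plain Python. The names `normalize` and `most_similar` are made up for illustration, and the vectors are assumed to be plain lists of floats that you already got from whatever embedding model you use. With NumPy the same thing collapses into a single matrix-vector product, which is what you would want in practice.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def most_similar(query, vectors, top_k=3):
    """Return the top_k (index, cosine similarity) pairs, best match first."""
    q = normalize(query)
    scored = []
    for idx, vec in enumerate(vectors):
        v = normalize(vec)
        score = sum(a * b for a, b in zip(q, v))
        scored.append((idx, score))
    # Compare the query against every stored vector, then sort by similarity.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

For a fixed collection it is worth normalizing the stored vectors once up front instead of on every query. The cost grows linearly with the number of vectors, which is why an ANN index starts to pay off as the collection gets larger.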

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
