-
A while back I came across some info about a self-supervised sentence embedding system that reportedly surpasses the Sentence Transformers NLI models, but I forgot where it was. You could use Jina’s Finetuner. It lets you boost your pre-trained models' performance and get them ready for production without spending a lot of time on labeling or buying expensive hardware.
-
Not sure what kind of texts you have, but these models have a max sequence length of 512 tokens (approx 350 words or so). If your texts are longer than that, consider splitting them up into chunks, or creating a summary and taking an embedding of that. Some clustering algorithm may be the way to go here. Here's a bunch of examples. I use agglomerative clustering for my use case.
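To make the chunking idea concrete, here is a minimal sketch in plain Python. The function name `chunk_words`, the 300-word window and the 50-word overlap are my own assumptions, not something from a particular library. It also counts words rather than tokens, so it only approximates the 512-token limit; you would leave some headroom or count with the model's own tokenizer.

```python
def chunk_words(text, max_words=300, overlap=50):
    """Split text into overlapping windows of at most max_words words.

    Hypothetical helper: word counts only approximate the model's
    token limit, so max_words is kept below the ~350-word estimate.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk can then be embedded separately, and the chunk embeddings either kept individually or averaged into one vector per text, depending on what works for your data.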
-
3000 texts doesn't sound like too many, so maybe a brute-force cosine similarity calculation to find the most similar vector would work. If that's taking too much time, maybe look at KNN or ANN modules to speed up finding the most similar vector. I use hnswlib in knn mode for this. It sorts through about 350,000 vectors in about 30-50 msec.
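For reference, the brute-force approach could look roughly like this. It is a sketch using only the standard library, and the names `cosine_similarity` and `most_similar` are made up for illustration. It assumes the embeddings are plain lists of floats of equal length; in practice you would probably do the same thing with a vectorized matrix product.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def most_similar(query, vectors, top_k=1):
    """Score the query against every vector and return the best top_k.

    Returns a list of (index, score) pairs, highest score first.
    """
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

This is linear in the number of vectors per query, which should be fine at a few thousand texts. An ANN index only starts to pay off when the collection gets much larger.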