Seeking advice on improving NLP search results

This page summarizes the projects mentioned and recommended in the original post on /r/LanguageTechnology

  • txtai

    💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

  • finetuner

    🎯 Task-oriented embedding tuning for BERT, CLIP, etc.

  • A while back I came across a self-supervised sentence embedding system that reportedly outperforms the Sentence Transformers NLI models, but I can't remember where I saw it. You could use Jina’s Finetuner. It improves the performance of your pre-trained models and gets them ready for production without a lot of time spent labeling data or money spent on expensive hardware.

  • sentence-transformers

    Multilingual Sentence & Image Embeddings with BERT

  • Not sure what kind of texts you have, but these models have a maximum sequence length of 512 tokens (roughly 350 words). If your texts are longer than that, consider splitting them into chunks, or creating a summary and embedding that instead. A clustering algorithm may be the way to go here. Here's a bunch of examples. I use agglomerative clustering for my use case.
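
    The chunking idea in the comment above could be sketched roughly as follows. This is only an illustration: the function name `chunk_words` and the `overlap` parameter are made up for this example, and splitting on whitespace is a crude stand-in for the model's real tokenizer, so the 350-word default (taken from the estimate in the comment) is a rough guide and not a guarantee of staying under the limit.

```python
def chunk_words(text, max_words=350, overlap=50):
    """Split text into windows of at most max_words words.

    Consecutive windows share `overlap` words so that a sentence cut at a
    boundary still appears whole in one of the chunks. Both defaults are
    assumptions for illustration, not values from any library.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    if not words:
        return []

    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


# Example: an 800-word text becomes three overlapping chunks.
long_text = " ".join(f"w{i}" for i in range(800))
chunks = chunk_words(long_text)
```

    Each chunk could then be embedded separately, for example by passing the list of chunks to the `encode` method of a sentence-transformers model.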

  • hnswlib

    Header-only C++/python library for fast approximate nearest neighbors

  • 3000 texts doesn't sound like too many, so a brute-force cosine similarity calculation to find the most similar vector might work. If that takes too long, look at KNN or ANN modules to speed up the search. I use hnswlib in KNN mode for this; it sorts through about 350,000 vectors in about 30-50 ms.
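
    The brute-force option from the last comment might look something like this minimal NumPy sketch. The helper names (`normalize_rows`, `top_k_cosine`), the 384-dimensional vectors, and the random data are all assumptions for illustration; in practice the corpus matrix would hold the embeddings of your own texts.

```python
import numpy as np


def normalize_rows(vectors):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    vectors = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def top_k_cosine(query, corpus_unit, k=5):
    """Return (indices, scores) of the k rows most similar to the query.

    corpus_unit must already be row-normalized (see normalize_rows).
    """
    query = np.asarray(query, dtype=np.float32)
    query = query / max(float(np.linalg.norm(query)), 1e-12)
    scores = corpus_unit @ query          # one cosine score per corpus row
    k = min(k, len(scores))
    top = np.argsort(-scores)[:k]         # highest similarity first
    return top, scores[top]


# Synthetic stand-in for 3000 text embeddings (random, for illustration only).
rng = np.random.default_rng(0)
corpus = normalize_rows(rng.normal(size=(3000, 384)))

# A query that is a slightly perturbed copy of row 42.
query = corpus[42] + 0.01 * rng.normal(size=384)
ids, scores = top_k_cosine(query, corpus, k=3)
```

    At this corpus size the search is a single matrix-vector product, which is why brute force is usually adequate; an approximate index such as hnswlib becomes worthwhile once the collection grows to hundreds of thousands of vectors, as in the comment above.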

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
