Seeking advice on improving NLP search results

This page summarizes the projects mentioned and recommended in the original post on /r/LanguageTechnology

  • txtai

    💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

  • finetuner

    🎯 Task-oriented embedding tuning for BERT, CLIP, etc.

    A while back I came across a self-supervised sentence embedding system that reportedly outperforms the Sentence Transformers NLI models, but I can't recall where I saw it. You could also try Jina’s Finetuner. It tunes a pre-trained model for your task, making it ready for production without a large labeling effort or expensive hardware.

  • sentence-transformers

    Multilingual Sentence & Image Embeddings with BERT

    Not sure what kind of texts you have, but these models have a maximum sequence length of 512 tokens (roughly 350 words). If your texts are longer than that, consider splitting them into chunks, or creating a summary and embedding that instead. A clustering algorithm may be the way to go here. Here's a bunch of examples. I use agglomerative clustering for my use case.

  • hnswlib

    Header-only C++/python library for fast approximate nearest neighbors

    3000 texts doesn't sound like too many, so a brute-force cosine similarity calculation to find the most similar vector may well work. If that takes too much time, look at KNN or ANN modules to speed up the search. I use hnswlib in knn mode for this. It sorts through about 350,000 vectors in about 30-50 msec.
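
Two of the suggestions above, splitting long texts into chunks before embedding and running a brute-force cosine search over a few thousand vectors, can be sketched roughly as follows. This is a minimal sketch using only the Python standard library, not anyone's actual pipeline: `chunk_words`, `cosine`, and `top_k` are hypothetical helper names, the 350-word window is the rough figure quoted above rather than an exact tokenizer limit, the 50-word overlap is an arbitrary assumption, and the toy vectors stand in for real embeddings from a model such as one from sentence-transformers.

```python
import math


def chunk_words(text, max_words=350, overlap=50):
    """Split text into windows of at most max_words words, overlapping by `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=3):
    """Brute-force search: score every vector, return the k best (index, score) pairs."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # An 800-word dummy text is split into overlapping windows.
    long_text = " ".join(f"w{i}" for i in range(800))
    print([len(c.split()) for c in chunk_words(long_text)])

    # Toy 3-dimensional "embeddings" (placeholders for real model output).
    corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
    print(top_k([1.0, 0.0, 0.0], corpus, k=2))
```

In practice each chunk would be embedded separately and the per-chunk vectors searched. If the exhaustive loop in `top_k` becomes too slow on a larger collection, that is the point where a KNN/ANN index such as hnswlib would take its place.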

