The BEIR project might be what you're looking for: https://github.com/beir-cellar/beir/wiki/Leaderboard
One issue I always run into when implementing these approaches is the embedding model's context window being too small to represent what I need.
For example, on this project, looking at the generation of training data [1], what's actually being generated are embeddings of a string concatenated from each review, title, description, etc. [2]. With max_seq_length set to 200, wouldn't books with long reviews end up with their description text never being encoded? Wouldn't that cause queries to miss potentially similar descriptions whenever the reviews are topically dissimilar (e.g., discussing the author's style, the book's flow, etc., instead of the plot)?
[1] https://github.com/veekaybee/viberary/blob/main/src/model/ge...
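A minimal sketch of the truncation concern, assuming the fields are concatenated with reviews before the description (the field order, the whitespace "tokenizer", and the sample data are all illustrative assumptions, not the project's actual code):

```python
MAX_SEQ_LENGTH = 200  # the value mentioned above

def build_input(title, description, reviews):
    # assumed concatenation order: title, reviews, then description
    return " ".join([title] + reviews + [description])

def truncate_tokens(text, max_len=MAX_SEQ_LENGTH):
    # crude whitespace split standing in for the real subword tokenizer
    return text.split()[:max_len]

title = "Example Book"
description = "A mystery set in a small coastal town."
reviews = ["great pacing and style " * 60]  # one ~240-word review

tokens = truncate_tokens(build_input(title, description, reviews))
# every description word falls past the 200-token cutoff, so none survive
assert not any(w in tokens for w in ("mystery", "coastal"))
```

With a real subword tokenizer the cutoff lands even earlier, since 200 subword tokens is typically fewer than 200 words.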
e5-mistral is essentially a distillation from GPT-4 into a smaller model. You can see here https://github.com/microsoft/unilm/blob/16da2f193b9c1dab0a69... that they actually have custom prompts for each dataset being tested.
The question is: if you haven't seen the task before, what's a good prompt to prepend for your task?
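For context, e5-mistral formats queries by prepending a task instruction (this helper mirrors the one in the model's README; the task wording and query here are illustrative assumptions):

```python
def get_detailed_instruct(task_description: str, query: str) -> str:
    # queries get an instruction prefix; documents are embedded as-is
    return f"Instruct: {task_description}\nQuery: {query}"

task = "Given a web search query, retrieve relevant passages that answer the query"
query = "how do transformers handle long documents"

formatted = get_detailed_instruct(task, query)
print(formatted)
```

For an unseen task you have to author that task_description yourself, and retrieval quality can hinge on its wording, which is exactly the open question above.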
IMO e5-mistral is overfit to MTEB