The BEIR project might be what you're looking for: https://github.com/beir-cellar/beir/wiki/Leaderboard
One issue I always run into when implementing these approaches is the embedding model's context window being too small to represent what I need.
For example, on this project, looking at the generation of training data [1], what's actually being generated are embeddings of a string concatenated from each review, title, description, etc. [2]. With max_seq_length set to 200, wouldn't books with long reviews end up with their description text never being encoded? Wouldn't that cause queries to miss potentially similar descriptions whenever the reviews are topically dissimilar (e.g., discussing the author's style, the book's flow, etc., instead of the plot)?
[1] https://github.com/veekaybee/viberary/blob/main/src/model/ge...
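A minimal sketch of the truncation concern, assuming the fields are concatenated with reviews before the description (the field order, the whitespace "tokenizer", and the sample data are all illustrative assumptions, not the project's actual code):

```python
MAX_SEQ_LENGTH = 200  # the value mentioned above

def build_input(title, description, reviews):
    # assumed concatenation order: title, reviews, then description
    return " ".join([title] + reviews + [description])

def truncate_tokens(text, max_len=MAX_SEQ_LENGTH):
    # crude whitespace split standing in for the real subword tokenizer
    return text.split()[:max_len]

title = "Example Book"
description = "A mystery set in a small coastal town."
reviews = ["great pacing and style " * 60]  # one ~240-word review

tokens = truncate_tokens(build_input(title, description, reviews))
# every description word falls past the 200-token cutoff, so none survive
assert not any(w in tokens for w in ("mystery", "coastal"))
```

With a real subword tokenizer the cutoff lands even earlier, since 200 subword tokens is typically fewer than 200 words.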
e5-mistral is essentially a distillation from GPT-4 into a smaller model. You can see here https://github.com/microsoft/unilm/blob/16da2f193b9c1dab0a69... that they actually have custom prompts for each dataset being tested.
The question is: if you haven't seen the task before, what's a good prompt to prepend for your task?
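For context, e5-mistral formats queries by prepending a task instruction (this helper mirrors the one in the model's README; the task wording and query here are illustrative assumptions):

```python
def get_detailed_instruct(task_description: str, query: str) -> str:
    # queries get an instruction prefix; documents are embedded as-is
    return f"Instruct: {task_description}\nQuery: {query}"

task = "Given a web search query, retrieve relevant passages that answer the query"
query = "how do transformers handle long documents"

formatted = get_detailed_instruct(task, query)
print(formatted)
```

For an unseen task you have to author that task_description yourself, and retrieval quality can hinge on its wording, which is exactly the open question above.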
IMO e5-mistral is overfit to MTEB