On building a semantic search engine

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • beir

    A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.

  • The BEIR project might be what you're looking for: https://github.com/beir-cellar/beir/wiki/Leaderboard

  • viberary

    Good books, good vibes

  • One issue I always run into when implementing these approaches is the embedding model's context window being too small to represent what I need.

    For example, on this project, looking at the generation of training data [1], it seems that the embeddings are generated from a single string concatenated from each book's review, title, description, etc. [2]. With max_seq_length set to 200, wouldn't books with long reviews end up with the description text never being encoded? And wouldn't that cause queries to miss potentially similar descriptions whenever the reviews are topically dissimilar (e.g., discussing the author's style or the book's flow instead of the plot)?

    [1] https://github.com/veekaybee/viberary/blob/main/src/model/ge...
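
The truncation concern above can be illustrated with a minimal sketch. It is not the project's actual code: the whitespace split stands in for the model's real subword tokenizer, and the field order (review, then title, then description) and the helper names are assumptions made for illustration only.

```python
MAX_SEQ_LENGTH = 200  # the limit mentioned in the comment above


def truncate_tokens(text: str, max_seq_length: int) -> list[str]:
    """Keep only the first max_seq_length tokens.

    Whitespace splitting is a stand-in for a real subword tokenizer.
    """
    return text.split()[:max_seq_length]


def build_document(review: str, title: str, description: str) -> str:
    # Assumed field order: review first, so a long review fills the window.
    return " ".join([review, title, description])


def build_document_budgeted(
    review: str, title: str, description: str, review_budget: int
) -> str:
    # One possible mitigation: cap the review so the other fields still fit.
    capped_review = " ".join(review.split()[:review_budget])
    return " ".join([capped_review, title, description])


# A 250-token review about writing style, followed by a plot description.
review = " ".join(["style"] * 250)
title = "Dune"
description = "desert planet spice politics"

naive = truncate_tokens(build_document(review, title, description), MAX_SEQ_LENGTH)
budgeted = truncate_tokens(
    build_document_budgeted(review, title, description, review_budget=150),
    MAX_SEQ_LENGTH,
)

print("spice" in naive)     # description was cut off by the long review
print("spice" in budgeted)  # description survives once the review is capped
```

Capping each field is only one option; embedding the review and the description separately, or chunking long text, would address the same problem at the cost of more vectors per book.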

  • unilm

    Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

  • e5-mistral is essentially a distillation from GPT-4 to a smaller model. You can see here that they actually have custom prompts for each dataset being tested: https://github.com/microsoft/unilm/blob/16da2f193b9c1dab0a69...

    The question is: if you haven't seen the task before, what is a good prompt to prepend for your task?

    IMO e5-mistral is overfit to MTEB.
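
To make the "prompt to prepend" point concrete, here is a small sketch of how an instruction is attached to a query. The "Instruct: ... / Query: ..." template follows my understanding of the e5-mistral model card and should be checked against it; the task description for an unseen task and the function name are purely hypothetical.

```python
def format_query(task_description: str, query: str) -> str:
    """Prepend a one-sentence task instruction to a query.

    The template is assumed from the e5-mistral model card, not verified here.
    """
    return f"Instruct: {task_description}\nQuery: {query}"


# Hypothetical instruction for a task the model was not evaluated on.
book_task = "Given a mood or vibe, retrieve books whose descriptions match it"

formatted = format_query(book_task, "cozy mystery set in a small village")
print(formatted)
```

Because the instruction is free text, the choice of wording for a new task is left to the user, which is exactly the open question raised in the comment.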

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
