You probably shouldn't use OpenAI's embeddings

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • semantic-search-tweets

    Run semantic queries over your twitter history

  • It's in the repo:

You first create embeddings. What does that mean? Your tweets are 'embedded' in an n-dimensional vector space: each word becomes an n-dimensional vector. The vectorization is supposed to preserve 'semantic distance' — if two words are close in meaning or related (by, say, frequently appearing next to each other in the corpus), they should also be 'close' along some of those n dimensions. The end result is the '.bin' file, the 'semantic model' of your corpus.

    https://github.com/dbasch/semantic-search-tweets/blob/main/e...
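To make the "co-occurring words end up with nearby vectors" idea concrete, here is a toy sketch using a co-occurrence matrix plus SVD. This is only an illustration of the concept — real models (word2vec/fastText, the kind that produce a '.bin' file) learn embeddings with neural training instead, and the corpus here is made up.

```python
import numpy as np

# Tiny made-up corpus: two "animal" sentences, two "finance" sentences.
corpus = [
    "cats chase mice",
    "dogs chase cats",
    "stocks rise fast",
    "stocks fall fast",
]
vocab = sorted({w for line in corpus for w in line.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Count how often each pair of words appears in the same sentence.
cooc = np.zeros((len(vocab), len(vocab)))
for line in corpus:
    words = line.split()
    for a in words:
        for b in words:
            if a != b:
                cooc[idx[a], idx[b]] += 1

# Factor the co-occurrence matrix; the top singular vectors give each
# word a low-dimensional embedding that reflects its contexts.
U, S, _ = np.linalg.svd(cooc)
embeddings = U[:, :2] * S[:2]  # one 2-dimensional vector per word
```

Words that share contexts (here, "cats" and "dogs" both co-occur with "chase") get similar rows in `embeddings`; unrelated words do not.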

For semantic search, you run the same embedding algorithm on the query, then compare the resulting vector against the corpus via matrix operations (similarity search). This yields a result set with similarity scores. The results point back to the original sources — here, the tweets — and you print the tweet(s) you select from that set.

    https://github.com/dbasch/semantic-search-tweets/blob/main/s...
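The query path described above can be sketched in a few lines of numpy. This is not the repo's code — the matrix of precomputed tweet embeddings is stood in by random vectors, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
tweet_vecs = rng.normal(size=(1000, 64))   # stand-in for precomputed embeddings
tweets = [f"tweet #{i}" for i in range(1000)]

def top_k(query_vec: np.ndarray, k: int = 3) -> list[str]:
    # Cosine similarity as one matrix op: normalize rows, then dot product.
    a = tweet_vecs / np.linalg.norm(tweet_vecs, axis=1, keepdims=True)
    b = query_vec / np.linalg.norm(query_vec)
    scores = a @ b
    # Highest-scoring rows point back to the original tweets.
    best = np.argsort(scores)[::-1][:k]
    return [tweets[i] for i in best]

results = top_k(rng.normal(size=64))
```

In the real pipeline, `query_vec` would come from running the same embedding model over the query text.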

Experts can chime in here, but there are knobs such as batch size and the similarity function used for indexing (cosine similarity was used here).
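One reason the choice of similarity function is a real knob: cosine similarity ignores vector length, while a raw dot product does not. A minimal illustration:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product of the two vectors, normalized by their lengths.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([10.0, 0.0])  # same direction, 10x the length

print(cosine(a, b))  # 1.0 — identical direction, length ignored
print(a @ b)         # 10.0 — dot product scales with length
```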

So the performance profile of the process should also be clear. There is a fixed, one-time cost of embedding your data. Then there is a per-query cost: embedding the query and running the similarity algorithm to find the result set.

  • gpt4-pdf-chatbot-langchain

    GPT4 & LangChain Chatbot for large PDF docs

  • Fast_Sentence_Embeddings

    Compute Sentence Embeddings Fast!

  • You can find some comparisons and evaluation datasets/tasks here: https://www.sbert.net/docs/pretrained_models.html

    Generally MiniLM is a good baseline. For faster models you want this library:

    https://github.com/oborchers/Fast_Sentence_Embeddings

For higher quality, just take the bigger/slower models in the SentenceTransformers library.

  • marqo

    Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
