80 million sentence embeddings

This page summarizes the projects mentioned and recommended in the original post on /r/LanguageTechnology

  • faiss

    A library for efficient similarity search and clustering of dense vectors.

  • You can take advantage of the fact that many of these sentences aren't even in the same neighbourhood: techniques like locality-sensitive hashing, or a library like FAISS, reduce the number of comparisons that have to be done up front.

  • google-research

    Google Research

  • Neither nearest neighbour search nor building the index is O(N²). If you had a machine with enough RAM, I would recommend ScaNN, as it works well and is incredibly fast. I'm not sure whether it works with an on-disk file format, though that's what you would want.

  • haystack

    LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.

  • jina

    ☁️ Build multimodal AI applications with cloud-native stack

  • Yeah, you can use Elasticsearch to do some of the heavy lifting when indexing vectors, although I've not personally used it for this exact use case. I think tools like jina and haystack make this super easy for you.
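
To make the locality-sensitive hashing suggestion from the faiss entry above more concrete, here is a minimal sketch in plain Python. It uses random-hyperplane hashing, which is one common LSH scheme for cosine similarity; it is not what FAISS or ScaNN do internally, and all function names (make_hyperplanes, lsh_signature, build_buckets, query) are hypothetical, chosen for illustration only. The idea is that each embedding is reduced to a short bit signature, and a query is only compared against embeddings that share its signature instead of against all 80 million.

```python
import math
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (Gaussian normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(v * h for v, h in zip(vec, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def build_buckets(vectors, hyperplanes):
    """Group vector indices by signature. This is a single pass over the data."""
    buckets = {}
    for idx, vec in enumerate(vectors):
        buckets.setdefault(lsh_signature(vec, hyperplanes), []).append(idx)
    return buckets


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query(vec, vectors, buckets, hyperplanes, k=5):
    """Rank only the candidates that share the query's bucket (approximate search)."""
    candidates = buckets.get(lsh_signature(vec, hyperplanes), [])
    ranked = sorted(candidates, key=lambda i: cosine(vec, vectors[i]), reverse=True)
    return ranked[:k]
```

With n_bits hyperplanes there are up to 2**n_bits buckets, so more bits means fewer candidates per query but a higher chance that a true neighbour lands in a different bucket. Practical setups usually build several independent tables to recover those misses, and at the scale discussed here a dedicated library is the more realistic choice; the sketch only shows why the comparison count drops.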
