Embeddings: What they are and why they matter

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • telekinesis

    Control Objects and Functions Remotely

  • If you want to see which Show HN posts, Product Hunt startups, YC companies, and GitHub repos are related to LLM embeddings, you can quickly find them on the LLM-Embeddings-Based Search Engine (MVP) I just launched:

    https://payperrun.com/%3E/search?displayParams={%22q%22:%22L...

  • supabase

    The open source Firebase alternative.

  • I built my own note-taking iOS app a little while back, and adding embeddings to my existing full-text search functionality was 1) surprisingly easy and 2) way more powerful than I initially expected.

    I knew it would work for cases like searching for "dog" and also getting notes that say "canine", but I didn't realize until I played around with it that a search like "pets I might like" returns all kinds of notes I've taken about various animals with positive sentiment.

    It was the first big aha moment for me.

    At the time I found Supabase's PR for their DocsGPT really helpful for example code: https://github.com/supabase/supabase/pull/12056
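
The "dog" → "canine" behavior described above boils down to ranking notes by cosine similarity between embedding vectors. A minimal sketch with made-up 3-dimensional vectors standing in for real model outputs (real embeddings have hundreds of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy "embeddings": similar meanings get nearby vectors.
notes = {
    "My dog loves the park":        [0.9, 0.1, 0.0],
    "Adopted a friendly canine":    [0.8, 0.2, 0.1],
    "Quarterly budget spreadsheet": [0.0, 0.1, 0.9],
}

def search(query_vec, top_n=2):
    """Return the top_n notes ranked by similarity to the query vector."""
    ranked = sorted(notes, key=lambda t: cosine(query_vec, notes[t]), reverse=True)
    return ranked[:top_n]

# A query vector for "pets I might like" would land near the animal notes.
print(search([0.85, 0.15, 0.05]))
```

Both dog notes rank far above the spreadsheet note, even though neither contains the query words; that is the whole trick behind semantic search.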

  • vectordb

    A minimal Python package for storing and retrieving text using chunking, embeddings, and vector search. (by kagisearch)

  • If you are looking for a lightweight, low-latency, fully local, end-to-end solution (chunking, embedding, storage, and vector search), try vectordb [1]

    I just spent a day updating it with the latest benchmarks for text embedding models.

    [1] https://github.com/kagisearch/vectordb
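
To give a sense of what such an end-to-end pipeline involves, here is a toy version of the same chunk → embed → store → search flow in plain Python. This is not vectordb's API, and the hashed bag-of-words "embedding" is only a stand-in for a real model:

```python
import hashlib
import math

def chunk(text, size=40):
    """Naive fixed-size chunking on word boundaries."""
    words, chunks, cur = text.split(), [], ""
    for w in words:
        if cur and len(cur) + len(w) + 1 > size:
            chunks.append(cur)
            cur = w
        else:
            cur = (cur + " " + w).strip()
    if cur:
        chunks.append(cur)
    return chunks

def embed(text, dim=64):
    """Toy hashed bag-of-words embedding, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

store = []  # the "vector database": (chunk, vector) pairs

def save(text):
    for c in chunk(text):
        store.append((c, embed(c)))

def search(query, top_n=1):
    q = embed(query)
    scored = sorted(store,
                    key=lambda cv: sum(a * b for a, b in zip(q, cv[1])),
                    reverse=True)
    return [c for c, _ in scored[:top_n]]

save("The quick brown fox jumps over the lazy dog. "
     "Cats nap in the warm sun all afternoon.")
print(search("brown fox"))
```

The point of a library like vectordb is that it bundles these four steps (with real embedding models and a proper index) behind one interface.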

  • marqo

    Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai

  • Try this https://github.com/marqo-ai/marqo which handles all the chunking for you (and is configurable). Also handles chunking of images in an analogous way. This enables highlighting in longer docs and also for images in a single retrieval step.

  • DBoW2

    Enhanced hierarchical bag-of-word library for C++

  • Not quite the same application, but in computer vision and visual SLAM algorithms (which construct a map of your surroundings using a camera), embeddings have become the de facto approach to place recognition! And it's very similar to this article. It's called "bag-of-words place recognition", and it really has become the standard, used by essentially every open-source library nowadays.

    The core idea is that each image is passed through a feature-extractor/descriptor pipeline and 'embedded' into a vector containing the top N features. While the camera moves, a database of images (called keyframes) is built, with each image stored as a much lower-dimensional vector. As the camera keeps moving, new images are used to query the database, and something like cosine similarity is used to retrieve the best match. If a match is found, stereo constraints can be computed between the query image and the match, and the software can update the map.

    [1] is the original paper and here's the most famous implementation: https://github.com/dorian3d/DBoW2

    [1]: https://www.google.com/search?client=firefox-b-d&q=Bags+of+B...
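
The retrieval step described above can be sketched in a few lines. The visual-word IDs here are hypothetical; a real system like DBoW2 extracts thousands of descriptors per image and quantizes them against a large learned vocabulary:

```python
import math

def cosine(a, b):
    """Cosine similarity between two histograms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

VOCAB_SIZE = 8  # a real vocabulary has thousands of "visual words"

def bow_vector(word_ids):
    """Histogram of visual-word occurrences for one image."""
    vec = [0.0] * VOCAB_SIZE
    for w in word_ids:
        vec[w] += 1.0
    return vec

# Keyframe database: each past image reduced to its bag-of-words vector.
keyframes = {
    "kf_hallway": bow_vector([0, 0, 1, 3, 3]),
    "kf_kitchen": bow_vector([4, 5, 5, 6, 7]),
    "kf_doorway": bow_vector([1, 2, 2, 3, 7]),
}

def recognize(word_ids):
    """Query the keyframe database with a new image's visual words."""
    q = bow_vector(word_ids)
    return max(keyframes, key=lambda k: cosine(q, keyframes[k]))

# A new image seeing mostly words 4-7 should match the kitchen keyframe.
print(recognize([4, 5, 6, 6, 7]))
```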

  • llm-cluster

    LLM plugin for clustering embeddings

  • I'm trying to understand the clustering code but not doing too well.

    https://github.com/simonw/llm-cluster/blob/main/llm_cluster....

    So does this take each row from the DB, convert it to a numpy array, then use an existing model called MiniBatchKMeans to go over that array and generate a bunch of labels, then add them to a dictionary and print it to the console?
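
That is essentially the flow. As an illustration, here is a stripped-down, pure-Python k-means over toy 2-d embeddings, standing in for scikit-learn's MiniBatchKMeans (centroids are initialized from the first k vectors to keep the demo deterministic; the real estimator initializes them randomly and works on mini-batches):

```python
def kmeans(vectors, k, iters=20):
    """Bare-bones k-means: returns a cluster label (0..k-1) per vector."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment: nearest centroid by squared Euclidean distance.
        for i, v in enumerate(vectors):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(v, centroids[c])))
        # Update: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Rows from the DB, already converted to embedding vectors (toy 2-d here).
rows = {"a": [0.0, 0.0], "b": [0.2, 0.1], "c": [5.0, 5.0], "d": [5.2, 5.1]}
labels = kmeans(list(rows.values()), k=2)

clusters = {}  # cluster label -> row ids, like the dictionary llm-cluster builds
for row_id, label in zip(rows, labels):
    clusters.setdefault(label, []).append(row_id)
print(clusters)
```

Rows whose embeddings sit near each other end up with the same label, which is the whole mechanism behind clustering embeddings.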

  • bert

    TensorFlow code and pre-trained models for BERT

  • The general idea is that you have a particular task & dataset, and you optimize these vectors to perform well on that task. So the properties of these vectors - what information is retained and what is left out during the 'compression' - are effectively determined by that task.

    In general, the core task for the various "LLM tools" is predicting a hidden word, trained on very large quantities of real text - so the vectors also mirror whatever structure (linguistic, syntactic, semantic, factual, social bias, etc.) exists there.

    If you want to see how the sausage is made and look at the actual algorithms, the two key approaches to read up on are probably Mikolov's word2vec (https://arxiv.org/abs/1301.3781), with the CBOW (Continuous Bag of Words) and Continuous Skip-Gram models, which are based on relatively simple math optimization, and then BERT (https://arxiv.org/abs/1810.04805), which does a conceptually similar thing but with a large neural network that can learn more from the same data. For both, you can either read the original papers or look up blog posts or videos that explain them; different people have different preferences for how readable academic papers are.
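
To make the skip-gram setup concrete: the model is trained to predict the words surrounding each center word, and the learned input weights become the word vectors. A sketch of just the (center, context) pair extraction, with no training:

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs a skip-gram model is trained to predict."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox".split()
for center, context in skipgram_pairs(sentence):
    print(center, "->", context)
```

Words that occur in similar contexts produce similar prediction targets, which is why their learned vectors end up close together.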

  • roadmap

    This is the public roadmap for Salesforce Heroku services. (by heroku)

  • In case anyone is interested, Heroku finally released pgvector support for Postgres yesterday: https://github.com/heroku/roadmap/issues/156

    pgvector is an excellent way to experiment with embeddings in a lightweight way, without adding a bunch of extra infrastructure dependencies.

