Embeddings: What they are and why they matter

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • telekinesis

    Control Objects and Functions Remotely

  • If you want to see which Show HN posts, Product Hunt startups, YC companies, and GitHub repos are related to LLM embeddings, you can quickly find them on the LLM-Embeddings-Based Search Engine (MVP) I just launched:

    https://payperrun.com/%3E/search?displayParams={%22q%22:%22L...

  • supabase

    The open source Firebase alternative.

  • I built my own note-taking iOS app a little while back, and adding embeddings to my existing full-text search functionality was 1) surprisingly easy and 2) way more powerful than I initially expected.

    I knew it would work for cases like searching for "dog" and also getting notes that say "canine", but I didn't realize until I played around with it that a search like "pets I might like" returns all kinds of notes I've taken about various animals with positive sentiment.

    It was the first big aha moment for me.

    At the time I found Supabase's PR for their DocsGPT really helpful for example code: https://github.com/supabase/supabase/pull/12056
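
The "dog" → "canine" behavior described above boils down to ranking notes by cosine similarity between embedding vectors. A minimal sketch with made-up 3-dimensional vectors standing in for real model outputs (real embeddings have hundreds of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy "embeddings": similar meanings get nearby vectors.
notes = {
    "My dog loves the park":        [0.9, 0.1, 0.0],
    "Adopted a friendly canine":    [0.8, 0.2, 0.1],
    "Quarterly budget spreadsheet": [0.0, 0.1, 0.9],
}

def search(query_vec, top_n=2):
    """Return the top_n notes ranked by similarity to the query vector."""
    ranked = sorted(notes, key=lambda t: cosine(query_vec, notes[t]), reverse=True)
    return ranked[:top_n]

# A query vector for "pets I might like" would land near the animal notes.
print(search([0.85, 0.15, 0.05]))
```

Both dog notes rank far above the spreadsheet note, even though neither contains the query words; that is the whole trick behind semantic search.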

  • vectordb

    A minimal Python package for storing and retrieving text using chunking, embeddings, and vector search. (by kagisearch)

  • If you are looking for a lightweight, low-latency, fully local, end-to-end solution (chunking, embedding, storage, and vector search), try vectordb [1]

    I just spent a day updating it with the latest benchmarks for text embedding models.

    [1] https://github.com/kagisearch/vectordb
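
To give a sense of what such an end-to-end pipeline involves, here is a toy version of the same chunk → embed → store → search flow in plain Python. This is not vectordb's API, and the hashed bag-of-words "embedding" is only a stand-in for a real model:

```python
import hashlib
import math

def chunk(text, size=40):
    """Naive fixed-size chunking on word boundaries."""
    words, chunks, cur = text.split(), [], ""
    for w in words:
        if cur and len(cur) + len(w) + 1 > size:
            chunks.append(cur)
            cur = w
        else:
            cur = (cur + " " + w).strip()
    if cur:
        chunks.append(cur)
    return chunks

def embed(text, dim=64):
    """Toy hashed bag-of-words embedding, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

store = []  # the "vector database": (chunk, vector) pairs

def save(text):
    for c in chunk(text):
        store.append((c, embed(c)))

def search(query, top_n=1):
    q = embed(query)
    scored = sorted(store,
                    key=lambda cv: sum(a * b for a, b in zip(q, cv[1])),
                    reverse=True)
    return [c for c, _ in scored[:top_n]]

save("The quick brown fox jumps over the lazy dog. "
     "Cats nap in the warm sun all afternoon.")
print(search("brown fox"))
```

The point of a library like vectordb is that it bundles these four steps (with real embedding models and a proper index) behind one interface.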

  • marqo

    Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai

  • Try this https://github.com/marqo-ai/marqo which handles all the chunking for you (and is configurable). Also handles chunking of images in an analogous way. This enables highlighting in longer docs and also for images in a single retrieval step.

  • DBoW2

    Enhanced hierarchical bag-of-word library for C++

  • Not quite the same application, but in computer vision and visual SLAM algorithms (which construct a map of your surroundings using a camera), embeddings have become the de facto approach to place recognition! And it's very similar to this article. It's called "bag-of-words place recognition", and it really has become the standard, used by essentially every open-source library nowadays.

    The core idea is that each image is passed through a feature-extractor/descriptor pipeline and 'embedded' into a vector containing the top N features. While the camera moves, a database of images (called keyframes) is built, with each image stored as a much lower-dimensional vector. As the camera keeps moving, new images are used to query the database, and something like cosine similarity is used to retrieve the best match. If a match is found, stereo constraints can be computed between the query image and the match, and the software can update the map.

    [1] is the original paper and here's the most famous implementation: https://github.com/dorian3d/DBoW2

    [1]: https://www.google.com/search?client=firefox-b-d&q=Bags+of+B...
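
The retrieval step described above can be sketched in a few lines. The visual-word IDs here are hypothetical; a real system like DBoW2 extracts thousands of descriptors per image and quantizes them against a large learned vocabulary:

```python
import math

def cosine(a, b):
    """Cosine similarity between two histograms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

VOCAB_SIZE = 8  # a real vocabulary has thousands of "visual words"

def bow_vector(word_ids):
    """Histogram of visual-word occurrences for one image."""
    vec = [0.0] * VOCAB_SIZE
    for w in word_ids:
        vec[w] += 1.0
    return vec

# Keyframe database: each past image reduced to its bag-of-words vector.
keyframes = {
    "kf_hallway": bow_vector([0, 0, 1, 3, 3]),
    "kf_kitchen": bow_vector([4, 5, 5, 6, 7]),
    "kf_doorway": bow_vector([1, 2, 2, 3, 7]),
}

def recognize(word_ids):
    """Query the keyframe database with a new image's visual words."""
    q = bow_vector(word_ids)
    return max(keyframes, key=lambda k: cosine(q, keyframes[k]))

# A new image seeing mostly words 4-7 should match the kitchen keyframe.
print(recognize([4, 5, 6, 6, 7]))
```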

  • llm-cluster

    LLM plugin for clustering embeddings

  • I'm trying to understand the clustering code but not doing too well.

    https://github.com/simonw/llm-cluster/blob/main/llm_cluster....

    So does this take each row from the DB, convert it to a numpy array, then use an existing model called MiniBatchKMeans to go over that array and generate a bunch of labels, then add them to a dictionary and print it to the console?
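
That is essentially the flow. As an illustration, here is a stripped-down, pure-Python k-means over toy 2-d embeddings, standing in for scikit-learn's MiniBatchKMeans (centroids are initialized from the first k vectors to keep the demo deterministic; the real estimator initializes them randomly and works on mini-batches):

```python
def kmeans(vectors, k, iters=20):
    """Bare-bones k-means: returns a cluster label (0..k-1) per vector."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment: nearest centroid by squared Euclidean distance.
        for i, v in enumerate(vectors):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(v, centroids[c])))
        # Update: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Rows from the DB, already converted to embedding vectors (toy 2-d here).
rows = {"a": [0.0, 0.0], "b": [0.2, 0.1], "c": [5.0, 5.0], "d": [5.2, 5.1]}
labels = kmeans(list(rows.values()), k=2)

clusters = {}  # cluster label -> row ids, like the dictionary llm-cluster builds
for row_id, label in zip(rows, labels):
    clusters.setdefault(label, []).append(row_id)
print(clusters)
```

Rows whose embeddings sit near each other end up with the same label, which is the whole mechanism behind clustering embeddings.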

  • bert

    TensorFlow code and pre-trained models for BERT

  • The general idea is that you have a particular task & dataset, and you optimize these vectors to perform well on that task. So the properties of these vectors - what information is retained and what is left out during the 'compression' - are effectively determined by that task.

    In general, the core task for the various "LLM tools" is predicting a hidden word, trained on very large quantities of real text - so the vectors also mirror whatever structure (linguistic, syntactic, semantic, factual, social bias, etc.) exists there.

    If you want to see how the sausage is made and look at the actual algorithms, the two key approaches to read up on are probably Mikolov's word2vec (https://arxiv.org/abs/1301.3781), with the CBOW (Continuous Bag of Words) and Continuous Skip-Gram models, which are based on relatively simple math optimization, and then BERT (https://arxiv.org/abs/1810.04805), which does a conceptually similar thing but with a large neural network that can learn more from the same data. For both, you can either read the original papers or look up blog posts or videos that explain them; different people have different preferences for how readable academic papers are.
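
To make the skip-gram setup concrete: the model is trained to predict the words surrounding each center word, and the learned input weights become the word vectors. A sketch of just the (center, context) pair extraction, with no training:

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs a skip-gram model is trained to predict."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox".split()
for center, context in skipgram_pairs(sentence):
    print(center, "->", context)
```

Words that occur in similar contexts produce similar prediction targets, which is why their learned vectors end up close together.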

  • roadmap

    This is the public roadmap for Salesforce Heroku services. (by heroku)

  • In case anyone is interested, Heroku finally released pgvector support for Postgres yesterday: https://github.com/heroku/roadmap/issues/156

    pgvector is an excellent way to experiment with embeddings in a lightweight way, without adding a bunch of extra infrastructure dependencies.

