The Illustrated Word2Vec

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • what_are_embeddings

    A deep dive into embeddings starting from fundamentals

  • That is essentially correct. You take an object and "embed" it in a high-dimensional vector space to represent it.

    For a deep dive, I highly recommend Vicki Boykis's free materials:

    https://vickiboykis.com/what_are_embeddings/
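
    A minimal sketch of that idea, assuming the sentence-transformers library (the model name is an illustrative choice, not something from the thread):

        from sentence_transformers import SentenceTransformer, util

        # Any sentence-transformers model works the same way; this one maps
        # each string to a point in a 384-dimensional vector space.
        model = SentenceTransformer("all-MiniLM-L6-v2")

        vecs = model.encode(["a cat sat on the mat",
                             "a kitten rested on the rug",
                             "quarterly earnings beat expectations"])

        # Objects "embedded" near each other mean similar things;
        # cosine similarity measures that closeness.
        print(util.cos_sim(vecs[0], vecs[1]))  # high: related sentences
        print(util.cos_sim(vecs[0], vecs[2]))  # low: unrelated sentences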

  • Fast_Sentence_Embeddings

    Compute Sentence Embeddings Fast!

  • This is a great guide.

    Also - despite the fact that language model embeddings [1] are currently all the rage, good old embedding models are more than good enough for most tasks.

    With just a bit of tuning, they're generally as good at many sentence embedding tasks [2], and with good libraries [3] you're getting something like 400k sentences/sec on a laptop CPU versus ~4k-15k sentences/sec on a V100 for LM embeddings (see the sketch contrasting the two after the footnotes below).

    When you should use language model embeddings:

    - Multilingual tasks. While some classic embedding models are multilingually aligned (e.g. MUSE [4]), you still need to route each sentence to the correct per-language embedding file (you need something like langdetect; see the sketch after this list). It's also cumbersome, with one ~400MB file per language.

    Many LM embedding models, by contrast, are multilingually aligned out of the box.

    - Tasks that are very context-specific or require fine-tuning. For instance, if you're building a RAG system for medical documents, the embedding space works best when it puts larger distances between seemingly related medical terms.

    That calls for models with more embedding dimensions, and heavily favors LM models over classic embedding models.
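
    A rough sketch of that routing step, assuming langdetect and gensim; the per-language .vec paths follow MUSE's naming convention, but treat the whole thing as illustrative:

        import numpy as np
        from langdetect import detect
        from gensim.models import KeyedVectors

        # One aligned ~400MB vector file per language (MUSE-style names).
        VEC_FILES = {"en": "wiki.multi.en.vec", "fr": "wiki.multi.fr.vec"}
        _cache = {}

        def embed(sentence):
            lang = detect(sentence)            # e.g. "en" or "fr"
            if lang not in _cache:             # loading is slow, so cache it
                _cache[lang] = KeyedVectors.load_word2vec_format(VEC_FILES[lang])
            kv = _cache[lang]
            words = [w for w in sentence.lower().split() if w in kv]
            # Average the aligned word vectors to get a sentence vector.
            return np.mean([kv[w] for w in words], axis=0)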

    1. sbert.net

    2. https://collaborate.princeton.edu/en/publications/a-simple-b...

    3. https://github.com/oborchers/Fast_Sentence_Embeddings

    4. https://github.com/facebookresearch/MUSE
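
    For a concrete feel of the classic-vs-LM contrast above, here's a hedged sketch: the weighted averaging follows the "tough-to-beat baseline" idea in [2] (the full method also removes a common principal component, omitted here), while the LM side uses sbert [1]. The vector path, frequency table, and model name are stand-ins:

        import numpy as np
        from gensim.models import KeyedVectors
        from sentence_transformers import SentenceTransformer

        # Classic: SIF-weighted average of static word vectors [2].
        kv = KeyedVectors.load_word2vec_format("vectors.vec")  # stand-in path
        word_freq = {"the": 0.05, "insulin": 1e-6}   # compute from your corpus
        a = 1e-3                                     # smoothing constant from [2]

        def sif_embed(sentence):
            words = [w for w in sentence.lower().split() if w in kv]
            weights = [a / (a + word_freq.get(w, 1e-5)) for w in words]
            return np.average([kv[w] for w in words], axis=0, weights=weights)

        # LM embeddings: one transformer forward pass per batch [1].
        lm = SentenceTransformer("all-MiniLM-L6-v2")
        lm_vecs = lm.encode(["the insulin dose was adjusted"])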

  • MUSE

    (Discontinued) A library for Multilingual Unsupervised or Supervised word Embeddings

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

Related posts

  • [D] Unsupervised document similarity state of the art

    2 projects | /r/MachineLearning | 9 Apr 2021
  • Experimenting with LLM-Based Chunk Enhancement for Better RAG Results

    1 project | news.ycombinator.com | 1 Dec 2023
  • Multi-Modal Vector Embeddings at Scale

    1 project | /r/LangChain | 29 Sep 2023
  • Good RAG implementation

    1 project | /r/LlamaIndex | 29 Sep 2023
  • Challenges with Image Embeddings at Scale

    1 project | /r/computervision | 29 Sep 2023