Top 15 sentence-transformer Open-Source Projects

Top2Vec

13 2,847 7.0 Python

Top2Vec learns jointly embedded topic, document and word vectors.

Project mention: [D] Is it better to create a different set of Doc2Vec embeddings for each group in my dataset, rather than generating embeddings for the entire dataset? | /r/MachineLearning | 2023-10-28

I'm using Top2Vec with Doc2Vec embeddings to find topics in a dataset of ~4000 social media posts. This dataset has three groups:

setfit

13 2,001 9.2 Jupyter Notebook

Efficient few-shot learning with Sentence Transformers

Project mention: FLaNK Stack 05 Feb 2024 | dev.to | 2024-02-05

InfluxDB

www.influxdata.com featured

Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
mteb

2 1,421 9.8 Python

MTEB: Massive Text Embedding Benchmark

Project mention: AI for AWS Documentation | news.ycombinator.com | 2023-07-06

RAG is very difficult to do right. I am experimenting with various RAG projects from [1]. The main problems are:
- Chunking can interfer with context boundaries
- Content vectors can differ vastly from question vectors, for this you have to use hypothetical embeddings (they generate artificial questions and store them)
- Instead of saving just one embedding per text-chuck you should store various (text chunk, hypothetical embedding questions, meta data)
- RAG will miserably fail with requests like "summarize the whole document"
- to my knowledge, openAI embeddings aren't performing well, use a embedding that is optimized for question answering or information retrieval and supports multi language. Also look into instructor embeddings: https://github.com/embeddings-benchmark/mteb
1 https://github.com/underlines/awesome-marketing-datascience/...

beir

8 1,393 4.2 Python

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.

Project mention: On building a semantic search engine | news.ycombinator.com | 2024-01-06

The BEIR project might be what you're looking for: https://github.com/beir-cellar/beir/wiki/Leaderboard

StoryToolkitAI

26 584 9.4 Python

An editing tool that uses AI to transcribe, understand content and search for anything in your footage, integrated with ChatGPT and other AI models

Project mention: Creating a workflow with translated subtitles | /r/davinciresolve | 2023-08-17

There's StoryToolkitAI, it's free (But requires Davinci Resolve Studio). It can transcribe and generate subtitles quite accurately. It also has a feature to translate subtitles to English. I haven't tried the translate feature yet, but I've been using this tool for my work a lot. It also supports more languages than resolve's built in transcription and auto subtitle tool.

DiffCSE

3 286 0.0 Python

Code for the NAACL 2022 long paper "DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings"
elastic_transformers

0 159 0.0 Jupyter Notebook

Making BERT stretchy. Semantic Elasticsearch with Sentence Transformers
SaaSHub

www.saashub.com featured

SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
nalcos

1 53 5.1 Python

Search Git commits in natural language
Python-Schema-Matching

1 24 3.5 Python

A python tool using XGboost and sentence-transformers to perform schema matching task on tables.
Llama-2-GGML-CSV-Chatbot

2 8 7.8 Python

The Llama-2-GGML-CSV-Chatbot is a conversational tool leveraging the powerful Llama-2 7B language model. It facilitates multi-turn interactions based on uploaded CSV data, allowing users to engage in seamless conversations.

Project mention: Do you Know! Llama ? | dev.to | 2024-04-11

huggingface.co/Llama-2-GGML-CSV-Chatbot

emoji_search

2 8 7.4 Python

Semantically Search Emojis From the Command Line!

Project mention: How to Build a Semantic Search Engine for Emojis | dev.to | 2024-02-07

pip install git+https://github.com/jacobmarks/emoji_search.git

emoji-search-plugin

1 5 6.3 Python

Semantic Emoji Search Plugin for FiftyOne

Project mention: How to Build a Semantic Search Engine for Emojis | dev.to | 2024-01-10

By “this”, I mean an open-source semantic emoji search engine, with both UI-centric and CLI versions. The Python CLI library can be found here, and the UI-centric version can be found here. You can also play around with a hosted (also free) version of the UI emoji search engine online here.

balena

1 5 5.6 Python

BALanced Execution through Natural Activation : a human-computer interaction methodology for code running.

Project mention: BALanced Execution Through Natural Activation: A HCI Methodology | news.ycombinator.com | 2023-12-31

tsdae

1 3 5.1 Python

Tranformer-based Denoising AutoEncoder for Sentence Transformers Unsupervised pre-training.

Project mention: Tranformer-based Denoising AutoEncoder for ST Unsupervised pre-training | news.ycombinator.com | 2024-02-04

A new PyPI package for training sentence embedding models in just 2 lines.
The acquisition of sentence embeddings often necessitates a substantial volume of labeled data. However, in many cases and fields, labeled data is rarely accessible, and the procurement of such data is costly. In this project, we employ an unsupervised process grounded in pre-trained Transformers-based Sequential Denoising Auto-Encoder (TSDAE), introduced by the Ubiquitous Knowledge Processing Lab of Darmstadt, which can realize a performance level reaching 93.1% of in-domain supervised methodologies.
The TSDAE schema comprises two components: an encoder and a decoder. Throughout the training process, TSDAE translates tainted sentences into uniform-sized vectors, necessitating the decoder to reconstruct the original sentences utilizing this sentence embedding. For good reconstruction quality, the semantics must be captured well in the sentence embeddings from the encoder. Subsequently, during inference, the encoder is solely utilized to form sentence embeddings.
GitHub : https://github.com/louisbrulenaudet/tsdae
Installation :

search-engine

3 1 10.0 Jupyter Notebook

Full stack search-engine created from youtube videos obtained using "web-scraping"
SaaSHub

www.saashub.com featured

SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

NOTE: The open source projects on this list are ordered by number of github stars. The number of mentions indicates repo mentiontions in the last 12 Months or since we started tracking (Dec 2020).

sentence-transformers related posts

Do you Know! Llama ?

1 project | dev.to | 11 Apr 2024
BEIR: A Heterogeneous Benchmark for Information Retrieval

1 project | news.ycombinator.com | 2 Jan 2024
Smarter Summaries with Finetuning GPT-3.5 and Chain of Density

1 project | news.ycombinator.com | 14 Nov 2023
Benefits of hybrid search

1 project | dev.to | 18 Aug 2023
[Discussion] Convince me that this training set contamination is fine (or not)

1 project | /r/MachineLearning | 2 Jun 2023
Ask HN: What's the best framework for text classification (few-shot learning)?

1 project | news.ycombinator.com | 9 Apr 2023
Is it worth using LLMs like GPT-3 for text classification?

1 project | /r/datascience | 8 Mar 2023
A note from our sponsor - SaaSHub
www.saashub.com | 9 May 2024

SaaSHub helps you find the best software and product alternatives Learn more →

Index

What are some of the best open-source sentence-transformer projects? This list will help you:

	Project	Stars
1	Top2Vec	2,847
2	setfit	2,001
3	mteb	1,421
4	beir	1,393
5	StoryToolkitAI	584
6	DiffCSE	286
7	elastic_transformers	159
8	nalcos	53
9	Python-Schema-Matching	24
10	Llama-2-GGML-CSV-Chatbot	8
11	emoji_search	8
12	emoji-search-plugin	5
13	balena	5
14	tsdae	3
15	search-engine	1