Top 23 BERT Open-Source Projects

transformers

175 125,021 10.0 Python

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.

Project mention: Maxtext: A simple, performant and scalable Jax LLM | news.ycombinator.com | 2024-04-23

Is t5x an encoder/decoder architecture?
Some more general options:
The Flax ecosystem (https://github.com/google/flax?tab=readme-ov-file) or dm-haiku (https://github.com/google-deepmind/dm-haiku) were some of the best-developed communities in the JAX AI field.
Perhaps the "trax" repo? https://github.com/google/trax
Some HF examples: https://github.com/huggingface/transformers/tree/main/exampl...
Sadly, it seems much of the work is proprietary these days, but one example could be Grok-1, if you customize the details: https://github.com/xai-org/grok-1/blob/main/run.py

nlp-tutorial

1 13,691 0.0 Jupyter Notebook

Natural Language Processing Tutorial for Deep Learning Researchers
haystack

54 13,633 9.9 Python

🔍 LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.

Project mention: Release Radar • March 2024 Edition | dev.to | 2024-04-07

clip-as-service

15 12,181 5.2 Python

🏄 Scalable embedding, reasoning, ranking for images and sentences with CLIP

Project mention: Search for anything ==> Immich fails to download textual.onnx | /r/immich | 2023-09-15

PaddleNLP

2 11,423 9.8 Python

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
tokenizers

8 8,395 8.5 Rust

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Project mention: HF Transfer: Speed up file transfers | /r/rust | 2023-07-07

Hugging Face seems to like Rust. They also wrote Tokenizers in Rust.

Transformers-Tutorials

7 7,510 8.4 Jupyter Notebook

This repository contains demos I made with the Transformers library by HuggingFace.

Project mention: AI enthusiasm #6 - Finetune any LLM you want💡 | dev.to | 2024-04-16

Most of this tutorial is based on the Hugging Face course on Transformers and on Niels Rogge's Transformers tutorials: make sure to check out their work and give them a star on GitHub, if you please ❤️

bertviz

15 6,377 3.9 Python

BertViz: Visualize Attention in NLP Models (BERT, GPT2, BART, etc.)

Project mention: StreamingLLM: tiny tweak to KV LRU improves long conversations | news.ycombinator.com | 2024-02-13

This seems to work only because large GPTs have redundant, undercomplex attention. See this issue in BertViz about attention in Llama: https://github.com/jessevig/bertviz/issues/128

ERNIE

4 6,165 2.7 Python

Official implementations for various pre-training models of ERNIE-family, covering topics of Language Understanding & Generation, Multimodal Understanding & Generation, and beyond.
BERT-pytorch

1 5,988 0.0 Python

Google AI 2018 BERT pytorch implementation
BERTopic

22 5,543 8.2 Python

Leveraging BERT and c-TF-IDF to create easily interpretable topics.

Project mention: how can a top2vec output be improved | /r/learnmachinelearning | 2023-07-04

Try experimenting with different hyperparameters, clustering algorithms and embedding representations. Try https://github.com/MaartenGr/BERTopic/tree/master/bertopic
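
The c-TF-IDF weighting named in the BERTopic description above can be illustrated with a small standard-library sketch. It assumes the commonly cited class-based formulation (term frequency within a class, multiplied by log(1 + average words per class / total frequency of the term)); it is not BERTopic's actual code, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_class):
    """Score words per class with a class-based TF-IDF (illustrative sketch).

    docs_per_class maps a class label to a list of tokenized documents.
    Assumption: tf is L1-normalized per class and the idf term is
    log(1 + avg_words_per_class / total_frequency_of_word).
    """
    # Treat all documents of a class as one bag of words.
    class_counts = {
        label: Counter(tok for doc in docs for tok in doc)
        for label, docs in docs_per_class.items()
    }

    # Frequency of each word across all classes.
    total = Counter()
    for counts in class_counts.values():
        total.update(counts)

    # Average number of words per class.
    avg_words = sum(total.values()) / len(class_counts)

    scores = {}
    for label, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[label] = {
            word: (count / n_words) * math.log(1 + avg_words / total[word])
            for word, count in counts.items()
        }
    return scores


# Hypothetical toy corpus: two "topics", each with two tokenized documents.
toy = {
    "sports": [["ball", "goal", "team"], ["goal", "win"]],
    "cooking": [["salt", "pan", "team"], ["pan", "heat"]],
}
scores = c_tf_idf(toy)
top_sports = max(scores["sports"], key=scores["sports"].get)
top_cooking = max(scores["cooking"], key=scores["cooking"].get)
print(top_sports, top_cooking)
```

In this toy corpus, a word that appears in both classes ("team") scores lower than a class-specific word with the same count ("ball"), which is the property that makes the resulting topic words easier to interpret.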

FasterTransformer

7 5,456 4.3 C++

Transformer related optimization, including BERT, GPT

Project mention: Train Your AI Model Once and Deploy on Any Cloud | news.ycombinator.com | 2023-07-08

https://docs.nvidia.com/ai-enterprise/overview/0.1.0/platfor...
RIVA: NVIDIA® Riva, a premium edition of NVIDIA AI Enterprise software, is a GPU-accelerated speech and translation AI SDK
FasterTransformer: https://github.com/NVIDIA/FasterTransformer an

pytorch-sentiment-analysis

2 4,218 4.0 Jupyter Notebook

Tutorials on getting started with PyTorch and TorchText for sentiment analysis.
awesome-pretrained-chinese-nlp-models

1 4,193 8.9 Python

Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pre-trained models, large models, multimodal models, and large language models
PromptPapers

1 3,918 0.7

Must-read papers on prompt-based tuning for pre-trained language models.
spark-nlp

87 3,682 9.4 Scala

State of the Art Natural Language Processing

Project mention: Spark NLP 5.1.0: Introducing state-of-the-art OpenAI Whisper speech-to-text, OpenAI Embeddings and Completion transformers, MPNet text embeddings, ONNX support for E5 text embeddings, new multi-lingual BART Zero-Shot text classification, and much more! | /r/Python | 2023-09-06

KeyBERT

5 3,213 6.1 Python

Minimal keyword extraction with BERT

Project mention: I want to extract important keywords from large documents... | /r/LangChain | 2023-12-07

Use something else like KeyBERT or BERTopic: https://github.com/MaartenGr/KeyBERT It's much faster.

lightseq

1 3,088 3.7 C++

LightSeq: A High Performance Library for Sequence Processing and Generation
llmware

9 3,086 9.8 Python

Providing enterprise-grade LLM-based development framework, tools, and fine-tuned models.

Project mention: More Agents Is All You Need: LLMs performance scales with the number of agents | news.ycombinator.com | 2024-04-06

I couldn't agree more. You should check out LLMWare's SLIM agents (https://github.com/llmware-ai/llmware/tree/main/examples/SLI...). It's focusing on pretty much exactly this and chaining multiple local LLMs together.
A really good topic that ties in with this is the need for deterministic sampling (I may have the terminology a bit incorrect) depending on what the model is intended for. The LLMWare team did a good 2 part video on this here as well (https://www.youtube.com/watch?v=7oMTGhSKuNY)
I think dedicated miniature LLMs are the way forward.
Disclaimer - Not affiliated with them in any way, just think it's a really cool project.
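
One possible reading of the "deterministic sampling" mentioned in the comment above is choosing the next token in a repeatable way, either greedily or with a fixed random seed. The sketch below illustrates that reading with the standard library only; the logits and function names are hypothetical and nothing here comes from LLMWare.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities, scaled by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Deterministic choice: always take the highest-scoring index."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sampled_pick(logits, temperature=1.0, rng=None):
    """Stochastic choice: draw an index according to softmax probabilities."""
    rng = rng or random
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical next-token scores for a three-token vocabulary.
logits = [2.0, 1.0, 0.1]

# Greedy decoding gives the same answer on every call.
greedy_results = {greedy_pick(logits) for _ in range(5)}

# Sampling becomes repeatable only when the random seed is fixed.
run_a = [sampled_pick(logits, rng=random.Random(0)) for _ in range(5)]
run_b = [sampled_pick(logits, rng=random.Random(0)) for _ in range(5)]
```

Greedy selection suits tasks such as classification or function calling where the same input should always yield the same output, while seeded sampling keeps some variety but stays reproducible across runs.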

machine-learning-articles

5 3,073 4.1

🧠💬 Articles I wrote about machine learning, archived from MachineCurve.com.
DeepKE

2 2,929 9.4 Python

[EMNLP 2022] An Open Toolkit for Knowledge Graph Extraction and Construction

Project mention: Would this method work to increase the memory of the model? Saving summaries generated by a 2nd model and injecting them depending on the current topic. | /r/LocalLLaMA | 2023-06-09

Top2Vec

13 2,839 7.0 Python

Top2Vec learns jointly embedded topic, document and word vectors.

Project mention: [D] Is it better to create a different set of Doc2Vec embeddings for each group in my dataset, rather than generating embeddings for the entire dataset? | /r/MachineLearning | 2023-10-28

I'm using Top2Vec with Doc2Vec embeddings to find topics in a dataset of ~4000 social media posts. This dataset has three groups:

rust-bert

7 2,415 6.8 Rust

Rust native ready-to-use NLP pipelines and transformer-based models (BERT, DistilBERT, GPT2,...)

Project mention: How to leverage the state-of-the-art NLP models in Rust | /r/infinilabs | 2023-06-07

brew install libtorch
brew link libtorch
brew ls --verbose libtorch | grep dylib
export LIBTORCH=$(brew --cellar pytorch)/$(brew info --json pytorch | jq -r '.[0].installed[0].version')
export LD_LIBRARY_PATH=${LIBTORCH}/lib:$LD_LIBRARY_PATH
git clone https://github.com/guillaume-be/rust-bert.git
cd rust-bert
ORT_STRATEGY=system cargo run --example sentence_embeddings

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

BERT-related posts

AI enthusiasm #6 - Finetune any LLM you want💡
2 projects | dev.to | 16 Apr 2024
More Agents Is All You Need: LLMs performance scales with the number of agents
2 projects | news.ycombinator.com | 6 Apr 2024
Splade: Sparse Neural Search
1 project | news.ycombinator.com | 11 Mar 2024
Show HN: LLMWare – Small Specialized Function Calling 1B LLMs for Multi-Step RAG
2 projects | news.ycombinator.com | 11 Feb 2024
Better Call GPT, Comparing Large Language Models Against Lawyers (pdf)
1 project | news.ycombinator.com | 6 Feb 2024
Show HN: LLMWare – Integrated Solution for RAG in Finance and Legal
1 project | news.ycombinator.com | 21 Jan 2024
On building a semantic search engine
3 projects | news.ycombinator.com | 6 Jan 2024
Index

What are some of the best open-source BERT projects? This list will help you:

#	Project	Stars
1	transformers	125,021
2	nlp-tutorial	13,691
3	haystack	13,633
4	clip-as-service	12,181
5	PaddleNLP	11,423
6	tokenizers	8,395
7	Transformers-Tutorials	7,510
8	bertviz	6,377
9	ERNIE	6,165
10	BERT-pytorch	5,988
11	BERTopic	5,543
12	FasterTransformer	5,456
13	pytorch-sentiment-analysis	4,218
14	awesome-pretrained-chinese-nlp-models	4,193
15	PromptPapers	3,918
16	spark-nlp	3,682
17	KeyBERT	3,213
18	lightseq	3,088
19	llmware	3,086
20	machine-learning-articles	3,073
21	DeepKE	2,929
22	Top2Vec	2,839
23	rust-bert	2,415