Stream helps developers build engaging apps that scale to millions with performant and flexible Chat, Feeds, Moderation, and Video APIs and SDKs powered by a global edge network and enterprise-grade infrastructure. Learn more →
Sentencepiece Alternatives
Similar projects and alternatives to sentencepiece
-
-
Stream
Stream - Scalable APIs for Chat, Feeds, Moderation, & Video. Stream helps developers build engaging apps that scale to millions with performant and flexible Chat, Feeds, Moderation, and Video APIs and SDKs powered by a global edge network and enterprise-grade infrastructure.
-
-
stylegan2-pytorch
Simplest working implementation of Stylegan2, state of the art generative adversarial network, in Pytorch. Enabling everyone to experience disentanglement
-
-
-
-
-
InfluxDB
InfluxDB – Built for High-Performance Time Series Workloads. InfluxDB 3 OSS is now GA. Transform, enrich, and act on time series data directly in the database. Automate critical tasks and eliminate the need to move data externally. Download now.
-
-
tinygrad
Discontinued You like pytorch? You like micrograd? You love tinygrad! ❤️ [Moved to: https://github.com/tinygrad/tinygrad] (by geohot)
-
-
-
-
-
-
-
-
transformers
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
-
-
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
sentencepiece discussion
sentencepiece reviews and mentions
-
Show HN: TokenDagger – A tokenizer 2-4x faster than OpenAI's Tiktoken
Gemini uses SentencePiece [1], and the proprietary Gemini models share the same tokenizer vocabulary as Gemma [2, 3, 4].
[1] https://github.com/google/sentencepiece
[2] https://github.com/googleapis/python-aiplatform/blob/main/ve...
[3] https://storage.googleapis.com/deepmind-media/gemma/gemma-2-...: "We inherit from the large Gemini vocabulary (256k entries)."
[4] https://storage.googleapis.com/deepmind-media/gemma/Gemma3Re...: "We use the same tokenizer as Gemini 2.0."
- Tokenization for language modeling: BPE vs. Unigram Language Modeling (2020)
- sentencepiece
-
LLM.int8(): 8-Bit Matrix Multiplication for Transformers at Scale
you need to train the model on 1 trillion tokens (https://platform.openai.com/tokenizer https://github.com/google/sentencepiece) anyways for it to get reasoning capacities, which it feels very unlikely that your data is that much.
I'm highly skeptical that you have enough data to pretrain if you don't have enough data to fine tune.
fine tuning + vector search + prompting of as much stuff as you can, on a LLM like palm2 or gpt4 is what I would do. otherwise you can use falcon 40B ofc.
maybe I should charge for this ahah
-
[P] TokenMonster Ungreedy ~ 35% faster inference and 35% increased context-length for large language models (compared to tiktoken). Benchmarks included.
a) Comparison with SentencePiece tokenizer with comparable settings (It can also ignore word-boundaries and create phrase tokens)
-
LLaMA tokenizer: is a JavaScript implementation available anywhere?
LLaMA uses the sentencepiece tokenizer: https://github.com/google/sentencepiece
-
[P] New tokenization method improves LLM performance & context-length by 25%+
Besides, are you familiar with SentencePiece? What you are doing looks very similar (generate a large vocab, prune worst token until vocab size is reached), only the token selection criterion is different. It's also purely data driven in the sense that there are no assumption specific to language (and it can optionally segment across whitespace, as you are doing).
-
Code runs without definition of function (automatically calls a different function instead)
Hi, I'm studying the implementation of encode and decode functions for Google's SentencePiece tokenizer.
-
How to handle multiple languages in a sentence?
I think many LMs nowadays use unicode tokenizers, that are not tied to specific languages. E.g. sentencepiece is the most popular one: https://github.com/google/sentencepiece
- Large language models are having their Stable Diffusion moment
-
A note from our sponsor - Stream
getstream.io | 14 Jul 2025
Stats
google/sentencepiece is an open source project licensed under Apache License 2.0 which is an OSI approved license.
The primary programming language of sentencepiece is C++.
Popular Comparisons
- sentencepiece VS CTranslate2
- sentencepiece VS transformers
- sentencepiece VS OpenNMT-Tutorial
- sentencepiece VS gpt-2
- sentencepiece VS llama
- sentencepiece VS text-generation-webui
- sentencepiece VS lm-human-preferences
- sentencepiece VS bevy_retro
- sentencepiece VS dalle-mini
- sentencepiece VS tokenmonster