Understanding GPT Tokenizers

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • Constrained-Text-Generation-Studio

    Code repo for "Most Language Models can be Poets too: An AI Writing Assistant and Constrained Text Generation Studio", presented at the CAI2 workshop, held jointly with COLING 2022

  • I agree with you, and I'm SHOCKED at how little work there actually is in phonetics within the NLP community. Consider that most of the phonetic tools that I am using to enforce rhyming or similar syntactic constraints in Constrained Text Generation Studio (https://github.com/Hellisotherpeople/Constrained-Text-Genera...) were built circa 2014, such as the CMU rhyming dictionary. In most cases, I could not find better modern implementations of these tools.

    I did learn an awful lot about phonetic representations and matching algorithms. Things like "soundex" and "double metaphone" now make sense to me and are fascinating to read about.
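To make the reference concrete, here is a simplified Soundex sketch (ASCII names only, and the function name is mine, not from any of the tools mentioned) showing the kind of phonetic matching the comment describes: words that sound alike hash to the same four-character code.

```go
// Simplified Soundex: keep the first letter, encode following
// consonants as digits, drop vowels, collapse repeats, pad to 4.
package main

import (
	"fmt"
	"strings"
)

var soundexCodes = map[byte]byte{
	'b': '1', 'f': '1', 'p': '1', 'v': '1',
	'c': '2', 'g': '2', 'j': '2', 'k': '2', 'q': '2', 's': '2', 'x': '2', 'z': '2',
	'd': '3', 't': '3', 'l': '4', 'm': '5', 'n': '5', 'r': '6',
}

func Soundex(name string) string {
	if name == "" {
		return ""
	}
	s := strings.ToLower(name)
	out := []byte(strings.ToUpper(s[:1]))
	last := soundexCodes[s[0]]
	for i := 1; i < len(s) && len(out) < 4; i++ {
		c := s[i]
		code, coded := soundexCodes[c]
		switch {
		case coded && code != last:
			out = append(out, code)
			last = code
		case !coded && c != 'h' && c != 'w':
			last = 0 // vowels separate repeated codes; 'h' and 'w' do not
		}
	}
	for len(out) < 4 {
		out = append(out, '0')
	}
	return string(out)
}

func main() {
	// "Robert" and "Rupert" sound alike and both encode to R163
	fmt.Println(Soundex("Robert"), Soundex("Rupert"), Soundex("Ashcraft"))
}
```

Double Metaphone follows the same idea but with far more elaborate rules for English spelling, which is why it tends to match better in practice.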

  • fastbpe

    Java library implementing Byte-Pair Encoding Tokenization (by deepanprabhu)

  • Tokenization is very important, and I implemented fastbpe in Java to understand things - https://github.com/deepanprabhu/fastbpe
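A minimal illustration of the byte-pair-encoding idea behind implementations like fastbpe (the function names and API here are illustrative, not the library's): repeatedly replace the most frequent adjacent pair of symbols with a merged symbol, and the accumulated merges become the tokenizer's vocabulary.

```go
// One step of BPE training: find the most frequent adjacent pair
// and merge every occurrence of it into a single new symbol.
package main

import "fmt"

func mostFrequentPair(tokens []string) ([2]string, int) {
	counts := map[[2]string]int{}
	for i := 0; i+1 < len(tokens); i++ {
		counts[[2]string{tokens[i], tokens[i+1]}]++
	}
	var best [2]string
	bestN := 0
	for p, n := range counts {
		if n > bestN {
			best, bestN = p, n
		}
	}
	return best, bestN
}

func mergePair(tokens []string, a, b string) []string {
	out := make([]string, 0, len(tokens))
	for i := 0; i < len(tokens); i++ {
		if i+1 < len(tokens) && tokens[i] == a && tokens[i+1] == b {
			out = append(out, a+b) // merge the pair into one symbol
			i++
		} else {
			out = append(out, tokens[i])
		}
	}
	return out
}

func main() {
	tokens := []string{"a", "a", "a", "b", "d", "a", "a", "a", "b", "a", "c"}
	pair, n := mostFrequentPair(tokens) // ("a","a") occurs 4 times here
	fmt.Println(pair, n, mergePair(tokens, pair[0], pair[1]))
}
```

Real tokenizers like tiktoken work on bytes rather than characters and apply the learned merges in priority order at encode time, but the core loop is this one.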

  • tokenizer

    Pure Go implementation of OpenAI's tiktoken tokenizer

  • How I wish this post had appeared a few days earlier... I am writing my own library for some agent experiments (in Go, to make my life more interesting, I guess), and knowing the number of tokens is important for implementing a token buffer memory (as you approach the model's context window size, you prune enough messages from the beginning of the conversation that the whole thing stays within some given size, in tokens). While there's a nice native Go library for OpenAI models (https://github.com/tiktoken-go/tokenizer), the only library I found for Hugging Face models (and Claude; they published their tokenizer spec in the same JSON format) calls into HF's Rust implementation, which makes it challenging as a dependency in Go. What's more, any tokenizer needs to keep some representation of its vocabulary in memory. So, in the end I removed the true tokenizers and ended up using an approximate version: just split on spaces and multiply by a factor I determined experimentally for the models I use with the real tokenizer, plus a little extra for safety. If it turns out someone needs the real thing, they can always provide their own token counter. I was actually rather happy with this result: I have fewer dependencies and use less memory. But to get there I needed to do a deep dive to understand BPE tokenizers :)

    (The library, if anyone is interested: https://github.com/ryszard/agency.)
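The approach described above can be sketched in a few lines (the 1.3 factor is a placeholder; the commenter calibrated theirs per model against the real tokenizer): estimate tokens from a whitespace split, then prune the oldest messages until the conversation fits the budget.

```go
// Approximate token counting plus a token-buffer-memory pruning loop.
package main

import (
	"fmt"
	"strings"
)

func approxTokens(text string) int {
	// empirical fudge factor plus a small safety margin,
	// instead of running a real BPE tokenizer
	return int(float64(len(strings.Fields(text)))*1.3) + 1
}

func pruneToFit(messages []string, budget int) []string {
	total := 0
	for _, m := range messages {
		total += approxTokens(m)
	}
	// drop from the beginning of the conversation until we fit
	for len(messages) > 0 && total > budget {
		total -= approxTokens(messages[0])
		messages = messages[1:]
	}
	return messages
}

func main() {
	history := []string{"hi there", "how are you", "fine thanks", "tell me a story"}
	fmt.Println(pruneToFit(history, 10))
}
```

The trade-off is exactly the one the comment names: no tokenizer dependency and no vocabulary held in memory, at the cost of an estimate that must be padded for safety.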

  • agency

    Agency: Robust LLM Agent Management with Go (by ryszard)

  • nn-zero-to-hero

    Neural Networks: Zero to Hero

  • Andrej covers this in https://github.com/karpathy/nn-zero-to-hero. He explains things in multiple ways, both the matrix multiplications and the "programmer's" way of thinking about it, i.e. the lookups. The downside is that it takes a while to get through those lectures. I would say for each hour of video you need another ten to look stuff up and practice, unless you are fresh out of calculus and linear algebra classes.
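The "matrix multiplication vs lookup" equivalence mentioned above fits in a few lines: multiplying a one-hot row vector by an embedding matrix W returns row i of W, so an embedding layer is mathematically a matmul but computationally just a table lookup (function names here are illustrative).

```go
// Demonstrates that one-hot × W selects a row of W.
package main

import "fmt"

func oneHotMatmul(i int, W [][]float64) []float64 {
	vocab, dim := len(W), len(W[0])
	x := make([]float64, vocab)
	x[i] = 1 // one-hot encoding of token index i
	out := make([]float64, dim)
	for r := 0; r < vocab; r++ {
		for c := 0; c < dim; c++ {
			out[c] += x[r] * W[r][c]
		}
	}
	return out
}

func main() {
	W := [][]float64{{1, 2}, {3, 4}, {5, 6}} // 3-word vocab, 2-dim embeddings
	fmt.Println(oneHotMatmul(1, W))          // identical to just reading W[1]
}
```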

  • llama.go

    llama.go is like llama.cpp in pure Golang!

  • You can reuse a simple LLaMA tokenizer right in your Go code; see:

    https://github.com/gotzmann/llama.go/blob/8cc54ca81e6bfbce25...

  • llama-tokenizer-js

    JS tokenizer for LLaMA and LLaMA 2

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives; a higher number means a more popular project.

Related posts

  • chatgpt alternative

    3 projects | /r/selfhosted | 8 Dec 2023
  • Best way to use AMD CPU and GPU

    5 projects | /r/LocalLLaMA | 17 Jun 2023
  • Chinese-Alpaca-Plus-13B-GPTQ

    1 project | /r/LocalLLaMA | 30 May 2023
  • How to train a new language that is not in base model?

    1 project | /r/LocalLLaMA | 28 May 2023
  • Gotzmann LLM Score

    1 project | /r/LocalLLaMA | 26 May 2023