Understanding GPT Tokenizers

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • Constrained-Text-Generation-Studio

    Code repo for "Most Language Models can be Poets too: An AI Writing Assistant and Constrained Text Generation Studio", presented at the CAI2 workshop, held jointly with COLING 2022

  • I agree with you, and I'm SHOCKED at how little work there actually is in phonetics within the NLP community. Consider that most of the phonetic tools that I am using to enforce rhyming or similar syntactic constraints in Constrained Text Generation Studio (https://github.com/Hellisotherpeople/Constrained-Text-Genera...) were built circa 2014, such as the CMU rhyming dictionary. In most cases, I could not find better modern implementations of these tools.

    I did learn an awful lot about phonetic representations and matching algorithms. Things like "soundex" and "double metaphone" now make sense to me and are fascinating to read about.
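To make the reference concrete, here is a simplified Soundex sketch (ASCII names only, and the function name is mine, not from any of the tools mentioned) showing the kind of phonetic matching the comment describes: words that sound alike hash to the same four-character code.

```go
// Simplified Soundex: keep the first letter, encode following
// consonants as digits, drop vowels, collapse repeats, pad to 4.
package main

import (
	"fmt"
	"strings"
)

var soundexCodes = map[byte]byte{
	'b': '1', 'f': '1', 'p': '1', 'v': '1',
	'c': '2', 'g': '2', 'j': '2', 'k': '2', 'q': '2', 's': '2', 'x': '2', 'z': '2',
	'd': '3', 't': '3', 'l': '4', 'm': '5', 'n': '5', 'r': '6',
}

func Soundex(name string) string {
	if name == "" {
		return ""
	}
	s := strings.ToLower(name)
	out := []byte(strings.ToUpper(s[:1]))
	last := soundexCodes[s[0]]
	for i := 1; i < len(s) && len(out) < 4; i++ {
		c := s[i]
		code, coded := soundexCodes[c]
		switch {
		case coded && code != last:
			out = append(out, code)
			last = code
		case !coded && c != 'h' && c != 'w':
			last = 0 // vowels separate repeated codes; 'h' and 'w' do not
		}
	}
	for len(out) < 4 {
		out = append(out, '0')
	}
	return string(out)
}

func main() {
	// "Robert" and "Rupert" sound alike and both encode to R163
	fmt.Println(Soundex("Robert"), Soundex("Rupert"), Soundex("Ashcraft"))
}
```

Double Metaphone follows the same idea but with far more elaborate rules for English spelling, which is why it tends to match better in practice.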

  • fastbpe

    Java library implementing Byte-Pair Encoding Tokenization (by deepanprabhu)

  • Tokenization is very important, and I implemented fastbpe in Java to understand things - https://github.com/deepanprabhu/fastbpe
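A minimal illustration of the byte-pair-encoding idea behind implementations like fastbpe (the function names and API here are illustrative, not the library's): repeatedly replace the most frequent adjacent pair of symbols with a merged symbol, and the accumulated merges become the tokenizer's vocabulary.

```go
// One step of BPE training: find the most frequent adjacent pair
// and merge every occurrence of it into a single new symbol.
package main

import "fmt"

func mostFrequentPair(tokens []string) ([2]string, int) {
	counts := map[[2]string]int{}
	for i := 0; i+1 < len(tokens); i++ {
		counts[[2]string{tokens[i], tokens[i+1]}]++
	}
	var best [2]string
	bestN := 0
	for p, n := range counts {
		if n > bestN {
			best, bestN = p, n
		}
	}
	return best, bestN
}

func mergePair(tokens []string, a, b string) []string {
	out := make([]string, 0, len(tokens))
	for i := 0; i < len(tokens); i++ {
		if i+1 < len(tokens) && tokens[i] == a && tokens[i+1] == b {
			out = append(out, a+b) // merge the pair into one symbol
			i++
		} else {
			out = append(out, tokens[i])
		}
	}
	return out
}

func main() {
	tokens := []string{"a", "a", "a", "b", "d", "a", "a", "a", "b", "a", "c"}
	pair, n := mostFrequentPair(tokens) // ("a","a") occurs 4 times here
	fmt.Println(pair, n, mergePair(tokens, pair[0], pair[1]))
}
```

Real tokenizers like tiktoken work on bytes rather than characters and apply the learned merges in priority order at encode time, but the core loop is this one.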

  • tokenizer

    Pure Go implementation of OpenAI's tiktoken tokenizer

  • How I wish this post had appeared a few days earlier... I am writing my own library for some agent experiments (in Go, to make my life more interesting, I guess), and knowing the number of tokens is important for implementing a token buffer memory (as you approach the model's context window size, you prune enough messages from the beginning of the conversation that the whole thing stays within some given size, in tokens). While there's a nice native Go library for OpenAI models (https://github.com/tiktoken-go/tokenizer), the only library I found for Hugging Face models (and Claude; they published their tokenizer spec in the same JSON format) calls into HF's Rust implementation, which makes it challenging as a dependency in Go. What's more, any tokenizer needs to keep some representation of its vocabulary in memory. So, in the end I removed the true tokenizers and ended up using an approximate version: just split on spaces and multiply by a factor I determined experimentally for the models I use with the real tokenizer, plus a little extra for safety. If it turns out someone needs the real thing, they can always provide their own token counter. I was actually rather happy with this result: I have fewer dependencies and use less memory. But to get there I needed to do a deep dive to understand BPE tokenizers :)

    (The library, if anyone is interested: https://github.com/ryszard/agency.)
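The approach described above can be sketched in a few lines (the 1.3 factor is a placeholder; the commenter calibrated theirs per model against the real tokenizer): estimate tokens from a whitespace split, then prune the oldest messages until the conversation fits the budget.

```go
// Approximate token counting plus a token-buffer-memory pruning loop.
package main

import (
	"fmt"
	"strings"
)

func approxTokens(text string) int {
	// empirical fudge factor plus a small safety margin,
	// instead of running a real BPE tokenizer
	return int(float64(len(strings.Fields(text)))*1.3) + 1
}

func pruneToFit(messages []string, budget int) []string {
	total := 0
	for _, m := range messages {
		total += approxTokens(m)
	}
	// drop from the beginning of the conversation until we fit
	for len(messages) > 0 && total > budget {
		total -= approxTokens(messages[0])
		messages = messages[1:]
	}
	return messages
}

func main() {
	history := []string{"hi there", "how are you", "fine thanks", "tell me a story"}
	fmt.Println(pruneToFit(history, 10))
}
```

The trade-off is exactly the one the comment names: no tokenizer dependency and no vocabulary held in memory, at the cost of an estimate that must be padded for safety.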

  • agency

    Agency: Robust LLM Agent Management with Go (by ryszard)

  • nn-zero-to-hero

    Neural Networks: Zero to Hero

  • Andrej covers this in https://github.com/karpathy/nn-zero-to-hero. He explains things in multiple ways, both the matrix multiplications and the "programmer's" way of thinking about it, i.e. the lookups. The downside is that it takes a while to get through those lectures. I would say for each hour of video you need another ten to look stuff up and practice, unless you are fresh out of calculus and linear algebra classes.
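The "matrix multiplication vs lookup" equivalence mentioned above fits in a few lines: multiplying a one-hot row vector by an embedding matrix W returns row i of W, so an embedding layer is mathematically a matmul but computationally just a table lookup (function names here are illustrative).

```go
// Demonstrates that one-hot × W selects a row of W.
package main

import "fmt"

func oneHotMatmul(i int, W [][]float64) []float64 {
	vocab, dim := len(W), len(W[0])
	x := make([]float64, vocab)
	x[i] = 1 // one-hot encoding of token index i
	out := make([]float64, dim)
	for r := 0; r < vocab; r++ {
		for c := 0; c < dim; c++ {
			out[c] += x[r] * W[r][c]
		}
	}
	return out
}

func main() {
	W := [][]float64{{1, 2}, {3, 4}, {5, 6}} // 3-word vocab, 2-dim embeddings
	fmt.Println(oneHotMatmul(1, W))          // identical to just reading W[1]
}
```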

  • llama.go

    llama.go is like llama.cpp in pure Golang!

  • You can reuse a simple LLaMA tokenizer right in your Go code; see:

    https://github.com/gotzmann/llama.go/blob/8cc54ca81e6bfbce25...

  • llama-tokenizer-js

    JS tokenizer for LLaMA and LLaMA 2

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives; a higher number means a more popular project.

Related posts

  • chatgpt alternative

    3 projects | /r/selfhosted | 8 Dec 2023
  • Best way to use AMD CPU and GPU

    5 projects | /r/LocalLLaMA | 17 Jun 2023
  • Chinese-Alpaca-Plus-13B-GPTQ

    1 project | /r/LocalLLaMA | 30 May 2023
  • How to train a new language that is not in base model?

    1 project | /r/LocalLLaMA | 28 May 2023
  • Gotzmann LLM Score

    1 project | /r/LocalLLaMA | 26 May 2023