tensor2tensor
Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research. (Discontinued.)
The visualisation here may be helpful.
https://github.com/tensorflow/tensor2tensor/issues/1591
OpenAI have made their tokenizers public. [1]
As someone has pointed out, with BPE you specify the vocab size, not the token size. It's a relatively simple algorithm, and this Hugging Face course does a nice job of explaining it. [2] Plus, the original paper has a very readable Python example.
[1] https://github.com/openai/tiktoken
[2] https://huggingface.co/course/chapter6/5?fw=pt
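To make the "you specify the vocab size" point concrete, here is a minimal toy sketch of BPE training in plain Python. It is not the tiktoken or Hugging Face implementation, and it is not the code from the paper; the function names (`get_pair_counts`, `merge_pair`, `train_bpe`) and the tiny word-frequency table are made up for illustration. The loop keeps merging the most frequent adjacent pair until the vocabulary reaches the requested size, so token lengths fall out of the data rather than being set directly.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_freqs, vocab_size):
    """Learn merges until the vocabulary holds `vocab_size` tokens (or no pairs remain)."""
    # Start from single characters: each word is a tuple of symbols.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    tokens = sorted({ch for word in word_freqs for ch in word})
    merges = []
    while len(tokens) < vocab_size:
        pairs = get_pair_counts(vocab)
        if not pairs:
            break  # every word is already a single token
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
        new_token = best[0] + best[1]
        if new_token not in tokens:
            tokens.append(new_token)
    return tokens, merges


# Hypothetical toy corpus: word -> frequency.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
tokens, merges = train_bpe(word_freqs, vocab_size=14)
print(len(tokens))
print(merges)
```

The only knob is `vocab_size`: the corpus has 10 distinct characters, so asking for 14 tokens means four merges are learned. Real tokenizers typically work on bytes and apply a pre-tokenization step first, which this sketch skips.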