One of the public Colabs using CLIP uses Fourier transforms for image generation, and it really is very fast. https://github.com/eps696/aphantasia
I don't think this is valid in the context of this article. The input tokens are not one-hot encodings of the input characters; they are learned embeddings over a 32K SentencePiece vocabulary (section 4.1.1). Since "STOP" and "SPOT" are probably fairly common words in the training data, it's safe to assume each word is assigned its own single token and embedding vector, rather than being represented by the four "subword units" of its character decomposition.
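The distinction can be shown with a toy sketch. This is not the real SentencePiece model or its API; the vocabularies and token IDs below are invented for illustration. The point is that a subword tokenizer assigns common whole words their own IDs and only falls back to smaller units for words it has never seen:

```python
# Hypothetical vocabularies for illustration only (IDs are made up).
# A real 32K SentencePiece vocab would contain both words as single tokens.
word_vocab = {"STOP": 101, "SPOT": 102}
char_vocab = {"S": 1, "T": 2, "O": 3, "P": 4}

def tokenize(word: str) -> list[int]:
    """One token for an in-vocabulary word; per-character fallback otherwise."""
    if word in word_vocab:
        return [word_vocab[word]]
    return [char_vocab[c] for c in word]

print(tokenize("STOP"))  # [101] -- a single token, distinct from "SPOT"
print(tokenize("SPOT"))  # [102]
print(tokenize("OPST"))  # [3, 4, 1, 2] -- rare string, character fallback
```

So even though "STOP" and "SPOT" share the same four characters, the model receives two unrelated embedding vectors, not two permutations of the same character inputs.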