Show HN: LLaMA tokenizer that runs in browser

chat-with-gpt

39 2,261 5.3 TypeScript

An open-source ChatGPT app with a voice

There is this for OpenAI: https://github.com/cogentapps/chat-with-gpt/blob/main/app/sr....
I'm not completely sure, but I think it will likely work as-is for LLaMA.

llama-tokenizer-js

5 299 7.1 JavaScript

JS tokenizer for LLaMA and LLaMA 2

tiktoken

30 9,884 6.7 Python

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

https://platform.openai.com/tokenizer, or the official Python library tiktoken https://github.com/openai/tiktoken, or this JS port of tiktoken https://github.com/dqbd/tiktoken

tiktoken

1 592 7.6 Python

JS port and JS/WASM bindings for openai/tiktoken (by dqbd)

agency

5 43 7.0 Go

Agency: Robust LLM Agent Management with Go (by ryszard)

Tokenizers seem to be a massive pain in the neck if you are just calling into an API to use your model. The algorithm itself is non-trivial, and they need pretty sizable data to function: the vocabulary and the merges, which just sit there, using memory. I'm writing https://github.com/ryszard/agency in Go, and while there's a good library for the OpenAI tokenization, if you want a tokenizer for the HF models the best I found was a library calling HF's Rust implementation, which makes it horrible for distribution.
However, at some point I realized that what I really needed was not the tokens but the token count, as my most important use was implementing a Token Buffer Memory (trim messages from the beginning so that you never exceed the context size in tokens). To do that I don't need the count to be exactly right, just mostly right, as long as I am OK with slightly suboptimal efficiency (keeping slightly fewer tokens than the model supports). So I took files from Project Gutenberg and compared the number of tokens I get from a proper tokenizer with the number of pieces I get from just calling `strings.Split`. The ratio seems to be remarkably stable for a given model and language (multiply the length of the result of splitting on spaces by 1.55 for OpenAI and 1.7 for Claude, which leaves a tiny safety margin).
I'm not throwing shade at this project; just being able to call the tokenizer would've saved me a lot of time. But I hope that if I'm wrong about the estimates being good enough, some good person will point out the error of my ways :)
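
The estimate-and-trim approach described in this comment could look roughly like the sketch below. It is not code from the agency repository: the function names `estimateTokens` and `trimToBudget` are hypothetical, messages are modeled as plain strings, and the only values taken from the comment are the 1.55 and 1.7 ratios and the idea of splitting on spaces with `strings.Split`.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// Ratios quoted in the comment: estimated tokens per space-separated piece.
const (
	openAIRatio = 1.55
	claudeRatio = 1.7
)

// estimateTokens approximates a token count by splitting on single spaces
// and multiplying the number of pieces by ratio, rounding up.
// (Hypothetical helper; not part of any library.)
func estimateTokens(text string, ratio float64) int {
	if text == "" {
		return 0 // strings.Split("", " ") would otherwise yield one empty piece
	}
	pieces := len(strings.Split(text, " "))
	return int(math.Ceil(float64(pieces) * ratio))
}

// trimToBudget drops messages from the beginning until the estimated total
// fits within maxTokens, and returns the remaining tail of the slice.
func trimToBudget(messages []string, maxTokens int, ratio float64) []string {
	total := 0
	for _, m := range messages {
		total += estimateTokens(m, ratio)
	}
	start := 0
	for start < len(messages) && total > maxTokens {
		total -= estimateTokens(messages[start], ratio)
		start++
	}
	return messages[start:]
}

func main() {
	history := []string{
		"first message with six words here",
		"second message",
		"third and final message",
	}
	kept := trimToBudget(history, 12, openAIRatio)
	fmt.Println(len(kept), "messages kept:", kept)
}
```

Rounding up on every message makes the estimate err on the side of keeping fewer tokens, which matches the safety margin the comment asks for. The ratios would presumably need to be measured again for other models or languages.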

gpt4-tokenizer-visualizer

3 20 4.2 TypeScript

GPT4 Tokenizer Visualizer

Yes, this one does
https://github.com/functorism/gpt4-tokenizer-visualizer

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
gpt4 tiktoken Tokenizer Visualization
Post date: 13 Jun 2023