Show HN: LLaMA tokenizer that runs in browser

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • chat-with-gpt

    An open-source ChatGPT app with a voice

  • There is this for OpenAI: https://github.com/cogentapps/chat-with-gpt/blob/main/app/sr....

    I'm not completely sure, but I think it will likely work as-is for LLaMA.

  • llama-tokenizer-js

    JS tokenizer for LLaMA and LLaMA 2

  • tiktoken

    tiktoken is a fast BPE tokeniser for use with OpenAI's models.

  • https://platform.openai.com/tokenizer, or the official Python library tiktoken https://github.com/openai/tiktoken, or this JS port of tiktoken https://github.com/dqbd/tiktoken

  • tiktoken

    JS port and JS/WASM bindings for openai/tiktoken (by dqbd)

  • agency

    Agency: Robust LLM Agent Management with Go (by ryszard)

  • Tokenizers seem to be a massive pain in the neck if you are just calling into an API to use your model. The algorithm itself is non-trivial, and it needs pretty sizable data to function: the vocabulary and the merges, which just sit there, using memory. I'm writing https://github.com/ryszard/agency in Go, and while there's a good library for OpenAI tokenization, the best I found for the HF models was a library that calls HF's Rust implementation, which makes it horrible for distribution.

    However, at some point I realized that I didn't really need the tokens, just the token count, as my most important use was implementing a Token Buffer Memory (trim messages from the beginning so that you never exceed the context size in tokens). To do that, the count doesn't need to be exactly right, just mostly right, if I am OK with slightly suboptimal efficiency (keeping slightly fewer tokens than the model supports). So I took files from Project Gutenberg and compared the number of tokens I get from a proper tokenizer with the number of pieces I get from just calling `strings.Split`. The ratio seems to be remarkably stable for a given model and language: multiply the length of the result of splitting on spaces by 1.55 for OpenAI and 1.7 for Claude, which leaves a tiny safety margin.

    I'm not throwing shade at this project: just being able to call the tokenizer would've saved me a lot of time. But I hope that if I'm wrong about the estimates being good enough, some good person will point out the error of my ways :)
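
The estimate described in this comment could be sketched in Go roughly as follows. This is a minimal illustration under stated assumptions, not code from the agency repository: the names `estimateTokens` and `trimToBudget` are hypothetical, messages are plain strings, and the ratios are the ones quoted in the comment, which the commenter measured on Project Gutenberg text and which may not hold for other models or languages.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// Ratios quoted in the comment; treat them as rough, model-specific estimates.
const (
	openAIRatio = 1.55
	claudeRatio = 1.7
)

// estimateTokens approximates a token count by splitting on single spaces
// and scaling the number of pieces by ratio, rounding up.
// (Hypothetical helper, not part of any library.)
func estimateTokens(text string, ratio float64) int {
	if text == "" {
		return 0
	}
	pieces := strings.Split(text, " ")
	return int(math.Ceil(float64(len(pieces)) * ratio))
}

// trimToBudget drops messages from the beginning until the estimated total
// fits within maxTokens, and returns the messages that remain.
// (Hypothetical helper illustrating the "Token Buffer Memory" idea.)
func trimToBudget(messages []string, ratio float64, maxTokens int) []string {
	total := 0
	for _, m := range messages {
		total += estimateTokens(m, ratio)
	}
	start := 0
	for start < len(messages) && total > maxTokens {
		total -= estimateTokens(messages[start], ratio)
		start++
	}
	return messages[start:]
}

func main() {
	history := []string{
		"You are a helpful assistant.",
		"What is a tokenizer?",
		"It splits text into the units a model consumes.",
	}
	kept := trimToBudget(history, openAIRatio, 20)
	fmt.Println(len(kept), "messages kept")
}
```

Rounding up and trimming whole messages both err on the side of keeping fewer tokens than the model allows, which matches the trade-off the comment accepts. `strings.Split` on a single space is used because that is what the comment names; runs of spaces then produce empty pieces that inflate the estimate slightly.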

  • gpt4-tokenizer-visualizer

    GPT4 Tokenizer Visualizer

  • Yes, this one does

    https://github.com/functorism/gpt4-tokenizer-visualizer
