Tokenizer Alternatives

Similar projects and alternatives to tokenizer

tiktoken

Mentions: 30 | Stars: 9,577 | Activity: 7.0 | Language: Python

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
Constrained-Text-Generation-Studio

Mentions: 25 | Stars: 194 | Activity: 4.1 | Language: Python

Code repo for "Most Language Models can be Poets too: An AI Writing Assistant and Constrained Text Generation Studio" at the (CAI2) workshop, jointly held at (COLING 2022)
llama.go

Mentions: 12 | Stars: 1,154 | Activity: 8.2 | Language: Go

llama.go is like llama.cpp in pure Golang!
Constrained-Text-Genera

Mentions: 11 | Stars: - | Activity: -
agency

Mentions: 5 | Stars: 39 | Activity: 7.0 | Language: Go

Agency: Robust LLM Agent Management with Go (by ryszard)
nn-zero-to-hero

Mentions: 10 | Stars: 10,293 | Activity: 2.6 | Language: Jupyter Notebook

Neural Networks: Zero to Hero
sentences

Mentions: 0 | Stars: 417 | Activity: 4.5 | Language: Go

A multilingual command line sentence tokenizer in Golang
llama-tokenizer-js

Mentions: 5 | Stars: 292 | Activity: 7.1 | Language: JavaScript

JS tokenizer for LLaMA
tokenizer-go

Mentions: 1 | Stars: 116 | Activity: 4.0 | Language: Go

A Go wrapper for GPT-3 token encode/decode. https://platform.openai.com/tokenizer
tiktoken-go

Mentions: 1 | Stars: 532 | Activity: 4.6 | Language: Go

go version of tiktoken
fastbpe

Mentions: 1 | Stars: 2 | Activity: 5.5 | Language: Java

Java library implementing Byte-Pair Encoding Tokenization (by deepanprabhu)
spaGO

Mentions: 11 | Stars: 1,693 | Activity: 0.0 | Language: Go

Self-contained Machine Learning and Natural Language Processing library in Go (discontinued)

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a better tokenizer alternative or higher similarity.

tokenizer reviews and mentions

Posts with mentions or reviews of tokenizer. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-06-08.

Understanding GPT Tokenizers
10 projects | news.ycombinator.com | 8 Jun 2023

How I wish this post had appeared a few days earlier... I am working on my own library for some agent experiments (in Go, to make my life more interesting I guess), and knowing the number of tokens is important for implementing a token buffer memory (as you approach the model's context window size, you prune enough messages from the beginning of the conversation that the whole thing stays within some given size, in tokens). While there's a nice native library in Go for OpenAI models (https://github.com/tiktoken-go/tokenizer), the only library I found for Hugging Face models (and Claude, they published their tokenizer spec in the same JSON format) calls into HF's Rust implementation, which makes it challenging as a dependency in Go. What is more, any tokenizer needs to keep some representation of its vocabulary in memory. So, in the end I removed the true tokenizers and ended up using an approximate version (just split on spaces and multiply by a factor I determined experimentally, using the real tokenizer, for the models I use, with a little extra for safety). If it turns out someone needs the real thing, they can always provide their own token counter. I was actually rather happy with this result: I have fewer dependencies and use less memory. But to get there I needed to do a deep dive to understand BPE tokenizers :)
(The library, if anyone is interested: https://github.com/ryszard/agency.)
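
The approach described in the comment above (an approximate, whitespace-based token counter feeding a token buffer memory that prunes the oldest messages) could look roughly like the following. This is a minimal sketch, not code from the agency library: the names `TokenCounter`, `ApproxCounter`, and `Prune`, and the factor `1.5`, are hypothetical and chosen only for illustration. A real factor would have to be measured against the actual tokenizer for the model in use.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// TokenCounter estimates how many tokens a piece of text uses.
// Hypothetical type: callers who need exact counts could plug in
// a real tokenizer behind the same signature.
type TokenCounter func(text string) int

// ApproxCounter returns a counter that splits the text on whitespace
// and multiplies the word count by factor, rounding up.
// The factor is an assumption and must be calibrated per model.
func ApproxCounter(factor float64) TokenCounter {
	return func(text string) int {
		words := len(strings.Fields(text))
		return int(math.Ceil(float64(words) * factor))
	}
}

// Prune drops messages from the beginning of the conversation until
// the estimated total fits within maxTokens. It returns the kept tail.
func Prune(messages []string, maxTokens int, count TokenCounter) []string {
	total := 0
	for _, m := range messages {
		total += count(m)
	}
	start := 0
	for start < len(messages) && total > maxTokens {
		total -= count(messages[start])
		start++
	}
	return messages[start:]
}

func main() {
	count := ApproxCounter(1.5) // illustrative factor, not a measured value
	history := []string{"hello there", "how are you today", "fine thanks"}
	for _, m := range Prune(history, 9, count) {
		fmt.Println(m)
	}
}
```

Passing the counter in as a function value keeps the pruning logic independent of any particular tokenizer, which matches the comment's point that users who need the real thing can supply their own token counter.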
Pure Go implementation of OpenAI's tokenizer
4 projects | /r/golang | 7 Apr 2023

Stats

Basic tokenizer repo stats

Mentions

Stars: 228

Activity: 4.3

Last Commit: almost 1 year ago

tiktoken-go/tokenizer is an open source project licensed under the MIT License, which is an OSI-approved license.

The primary programming language of tokenizer is Go.
