Top 23 Tokenizer Open-Source Projects

Chevrotain

3 2,397 6.7 TypeScript

Parser Building Toolkit for JavaScript

Project mention: Ohm: A library and language for building parsers, interpreters, compilers, etc. | news.ycombinator.com | 2023-10-31

How does this compare with Chevrotain[1]?
More specifically, can I build lexers with Ohm? Can it generate a syntax diagram from a grammar?
[1]: https://github.com/chevrotain/chevrotain

moo

1 802 2.4 JavaScript

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo. (by no-context)
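
The "/y" in the description above refers to the JavaScript sticky regex flag, which forces a match to begin exactly at `lastIndex` rather than scanning ahead. The following is a minimal sketch of that mechanism only; it is not moo's API or implementation, and the `Token` type, the `rules` table, and the `tokenize` function are hypothetical names chosen for illustration.

```typescript
type Token = { type: string; value: string; offset: number };

// Hypothetical rule table for illustration; every pattern carries the sticky (y) flag.
const rules: Array<[string, RegExp]> = [
  ["ws", /[ \t]+/y],
  ["number", /\d+/y],
  ["ident", /[A-Za-z_]\w*/y],
  ["op", /[+\-*\/=()]/y],
];

function tokenize(input: string): Token[] {
  const tokens: Token[] = [];
  let pos = 0;
  while (pos < input.length) {
    let matched = false;
    for (const [type, re] of rules) {
      // With the sticky flag, exec() only succeeds if the match starts at lastIndex,
      // so no substring copies or leading ^ anchors are needed.
      re.lastIndex = pos;
      const m = re.exec(input);
      if (m !== null) {
        // Whitespace is consumed but not emitted as a token.
        if (type !== "ws") tokens.push({ type, value: m[0], offset: pos });
        pos = re.lastIndex; // exec() advanced lastIndex to the end of the match
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error(`Unexpected character at offset ${pos}`);
  }
  return tokens;
}
```

Calling `tokenize("x = 42")` under these assumed rules yields an `ident`, an `op`, and a `number` token. Trying one regex per rule keeps the sketch short; a production lexer would more likely combine the rules into a single pattern.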
kagome

1 789 6.4 Go

Self-contained Japanese Morphological Analyzer written in pure Go
JFlex

1 574 4.3 Java

The fast scanner generator for Java™ with full Unicode support
php-parser

0 514 3.3 JavaScript

🌿 NodeJS PHP Parser - extract AST or tokens (by glayzzle)
simple

1 488 5.6 C++

A SQLite3 fts5 full-text search extension and tokenizer which supports Chinese and PinYin (by wangfenjin)

Project mention: Postgres Full Text Search is better than | news.ycombinator.com | 2023-04-27

There's a C++ library for tokenising Chinese for sqlite FTS: https://github.com/wangfenjin/simple

js-tokens

2 477 6.0 JavaScript

Tiny JavaScript tokenizer.
friso

0 474 2.6 C

High-performance Chinese tokenizer with both GBK and UTF-8 charset support, based on the MMSEG algorithm and written in ANSI C. Fully modular and easy to embed in other programs such as MySQL, PostgreSQL, and PHP.
CogCompNLP

0 469 0.0 Java

CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
sentences

0 419 4.5 Go

A multilingual command line sentence tokenizer in Golang
gpt-tokenizer

1 377 4.6 TypeScript

JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT-2 / GPT-3 / GPT-4. Port of OpenAI's tiktoken with additional features.

Project mention: I wrote a tokenizer for LLaMA that runs inside the browser | /r/LocalLLaMA | 2023-06-13

There are more differences between the GPT2 tokenizer and the LLaMA tokenizer than only the vocab and merge data. It would take me some time to implement a GPT2 tokenizer, and there are already good alternatives for those, so it wouldn't make sense to put time into making another one. For example, this library contains a GPT2 tokenizer: https://github.com/niieani/gpt-tokenizer

fugashi

1 366 5.4 C++

A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
jumanpp

1 365 1.6 C++

Juman++ (a Morphological Analyzer Toolkit)
lindera

1 352 8.4 Rust

A multilingual morphological analysis library.
vscode-blockman

8 341 3.9 TypeScript

VSCode extension to highlight nested code blocks
bitextor

2 278 5.9 Python

Bitextor generates translation memories from multilingual websites
sentence-splitter

1 203 0.0 Python

Text to sentence splitter using heuristic algorithm by Philipp Koehn and Josh Schroeder.
tiktoken-rs

1 198 7.6 Rust

Ready-made tokenizer library for working with GPT and tiktoken
html5gum

3 145 6.8 Rust

A WHATWG-compliant HTML5 tokenizer and tag soup parser
tokenizer

1 136 6.1 Go

NLP tokenizers written in Go language (by sugarme)
simplemma

0 125 5.2 Python

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Cledev.OpenAI

0 103 5.1 C#

.NET 7 SDK for OpenAI with a Blazor Server playground
spacy-experimental

5 93 4.2 Python

🧪 Cutting-edge experimental spaCy components and features

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Tokenizer related posts

Show HN: LLaMA 3 tokenizer runs in the browser
2 projects | news.ycombinator.com | 21 Apr 2024
Show HN: LLaMA tokenizer that runs in browser
7 projects | news.ycombinator.com | 13 Jun 2023
Intro video for my VS Code extension "Blockman"
3 projects | /r/vscode | 13 Jan 2023
Build package for NPM & Deno
5 projects | /r/Deno | 5 Jan 2023
spaCy just got an experimental feature to detect co-references
1 project | /r/learnmachinelearning | 7 Oct 2022
Edit code from browser
2 projects | /r/reactjs | 5 Jul 2022
SpanFinder is a new experimental spaCy component that identifies span boundaries
1 project | news.ycombinator.com | 21 Jun 2022

Index

What are some of the best open-source Tokenizer projects? This list will help you:

#	Project	Stars
1	Chevrotain	2,397
2	moo	802
3	kagome	789
4	JFlex	574
5	php-parser	514
6	simple	488
7	js-tokens	477
8	friso	474
9	CogCompNLP	469
10	sentences	419
11	gpt-tokenizer	377
12	fugashi	366
13	jumanpp	365
14	lindera	352
15	vscode-blockman	341
16	bitextor	278
17	sentence-splitter	203
18	tiktoken-rs	198
19	html5gum	145
20	tokenizer	136
21	simplemma	125
22	Cledev.OpenAI	103
23	spacy-experimental	93