Top 23 Tokenizer Open-Source Projects
-
simple
A SQLite3 fts5 tokenizer which supports Chinese and PinYin (by wangfenjin)
-
friso
High-performance Chinese tokenizer with both GBK and UTF-8 charset support, based on the MMSEG algorithm and written in ANSI C. Its fully modular implementation can be easily embedded in other programs such as MySQL, PostgreSQL, and PHP.
-
CogCompNLP
CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
-
gpt-tokenizer
JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT-2 / GPT-3 / GPT-4. Port of OpenAI's tiktoken with additional features.
-
sentence-splitter
Text-to-sentence splitter using a heuristic algorithm by Philipp Koehn and Josh Schroeder.
Project mention: Ohm: A library and language for building parsers, interpreters, compilers, etc. | news.ycombinator.com | 2023-10-31
How does this compare with Chevrotain[1]?
More specifically, can I build lexers with Ohm? Can it generate a syntax diagram from a grammar?
[1]: https://github.com/chevrotain/chevrotain
There's a C++ library for tokenising Chinese for sqlite FTS: https://github.com/wangfenjin/simple
Project mention: I wrote a tokenizer for LLaMA that runs inside the browser | /r/LocalLLaMA | 2023-06-13
There are more differences between the GPT2 tokenizer and the LLaMA tokenizer than only the vocab and merge data. It would take me some time to implement a GPT2 tokenizer, and there are already good alternatives for those, so it wouldn't make sense to put time into making another one. For example, this library contains a GPT2 tokenizer: https://github.com/niieani/gpt-tokenizer
Tokenizer related posts
- Show HN: LLaMA 3 tokenizer runs in the browser
- Show HN: LLaMA tokenizer that runs in browser
- Intro video for my VS Code extension "Blockman"
- Build package for NPM & Deno
- spaCy just got an experimental feature to detect co-references
- Edit code from browser
- SpanFinder is a new experimental spaCy component that identifies span boundaries
Index
What are some of the best open-source Tokenizer projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | Chevrotain | 2,397 |
2 | moo | 802 |
3 | kagome | 789 |
4 | JFlex | 574 |
5 | php-parser | 514 |
6 | simple | 488 |
7 | js-tokens | 477 |
8 | friso | 474 |
9 | CogCompNLP | 469 |
10 | sentences | 419 |
11 | gpt-tokenizer | 377 |
12 | fugashi | 366 |
13 | jumanpp | 365 |
14 | lindera | 352 |
15 | vscode-blockman | 341 |
16 | bitextor | 278 |
17 | sentence-splitter | 203 |
18 | tiktoken-rs | 198 |
19 | html5gum | 145 |
20 | tokenizer | 136 |
21 | simplemma | 125 |
22 | Cledev.OpenAI | 103 |
23 | spacy-experimental | 93 |