Top 23 Tokenizer Open-Source Projects
-
simple
A SQLite3 fts5 tokenizer which supports Chinese and PinYin (by wangfenjin)
-
friso
High-performance Chinese tokenizer with GBK and UTF-8 charset support, based on the MMSEG algorithm and written in ANSI C. It is fully modular and can be easily embedded in other programs such as MySQL, PostgreSQL, and PHP.
-
CogCompNLP
CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
-
gpt-tokenizer
JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT-2 / GPT-3 / GPT-4. Port of OpenAI's tiktoken with additional features.
-
sentence-splitter
Text-to-sentence splitter using a heuristic algorithm by Philipp Koehn and Josh Schroeder.
Project mention: Ohm: A library and language for building parsers, interpreters, compilers, etc. | news.ycombinator.com | 2023-10-31
How does this compare with Chevrotain[1]?
More specifically, can I build lexers with Ohm? Can it generate a syntax diagram from a grammar?
[1]: https://github.com/chevrotain/chevrotain
Project mention: I wrote a tokenizer for LLaMA that runs inside the browser | /r/LocalLLaMA | 2023-06-13
There are more differences between the GPT2 tokenizer and the LLaMA tokenizer than only the vocab and merge data. It would take me some time to implement a GPT2 tokenizer, and there are already good alternatives for those, so it wouldn't make sense to put time into making another one. For example, this library contains a GPT2 tokenizer: https://github.com/niieani/gpt-tokenizer
Project mention: tiktoken_ruby VS ruby-openai - a user suggested alternative | libhunt.com/r/tiktoken_ruby | 2024-05-03
Tokenizer related posts
-
Show HN: LLaMA 3 tokenizer runs in the browser
-
Show HN: LLaMA tokenizer that runs in browser
-
Intro video for my VS Code extension "Blockman"
-
Build package for NPM & Deno
-
spaCy just got an experimental feature to detect co-references
-
Edit code from browser
-
SpanFinder is a new experimental spaCy component that identifies span boundaries
Index
What are some of the best open-source Tokenizer projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | Chevrotain | 2,399 |
2 | moo | 807 |
3 | kagome | 789 |
4 | JFlex | 575 |
5 | php-parser | 515 |
6 | simple | 489 |
7 | js-tokens | 479 |
8 | friso | 474 |
9 | CogCompNLP | 469 |
10 | sentences | 421 |
11 | gpt-tokenizer | 380 |
12 | jumanpp | 366 |
13 | fugashi | 366 |
14 | lindera | 351 |
15 | vscode-blockman | 341 |
16 | bitextor | 279 |
17 | sentence-splitter | 203 |
18 | tiktoken-rs | 200 |
19 | html5gum | 146 |
20 | tokenizer | 140 |
21 | simplemma | 125 |
22 | tiktoken_ruby | 106 |
23 | Cledev.OpenAI | 103 |