Tokenizer

Top 23 Tokenizer Open-Source Projects

  • Chevrotain

    Parser Building Toolkit for JavaScript

  • Project mention: Ohm: A library and language for building parsers, interpreters, compilers, etc. | news.ycombinator.com | 2023-10-31

    How does this compare with Chevrotain[1]?

    More specifically, can I build lexers with Ohm? Can it generate a syntax diagram from a grammar?

    [1]: https://github.com/chevrotain/chevrotain

  • moo

    Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo. (by no-context)

  • kagome

    Self-contained Japanese Morphological Analyzer written in pure Go

  • JFlex

    The fast scanner generator for Java™ with full Unicode support

  • php-parser

    :herb: NodeJS PHP Parser - extract AST or tokens (by glayzzle)

  • simple

    A SQLite3 fts5 full-text search extension (tokenizer) that supports Chinese and Pinyin (by wangfenjin)

  • js-tokens

    Tiny JavaScript tokenizer.

  • friso

    High-performance Chinese tokenizer with both GBK and UTF-8 charset support, based on the MMSEG algorithm and written in ANSI C. It is fully modular and can be easily embedded in other programs such as MySQL, PostgreSQL, and PHP.

  • CogCompNLP

    CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.

  • sentences

    A multilingual command line sentence tokenizer in Golang

  • gpt-tokenizer

    JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT-2 / GPT-3 / GPT-4. Port of OpenAI's tiktoken with additional features.

  • Project mention: I wrote a tokenizer for LLaMA that runs inside the browser | /r/LocalLLaMA | 2023-06-13

    There are more differences between the GPT2 tokenizer and the LLaMA tokenizer than just the vocab and merge data. It would take me some time to implement a GPT2 tokenizer, and there are already good alternatives for those, so it wouldn't make sense to put time into making another one. For example, this library contains a GPT2 tokenizer: https://github.com/niieani/gpt-tokenizer

  • jumanpp

    Juman++ (a Morphological Analyzer Toolkit)

  • fugashi

    A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.

  • lindera

    A multilingual morphological analysis library.

  • vscode-blockman

    VSCode extension to highlight nested code blocks

  • bitextor

    Bitextor generates translation memories from multilingual websites

  • sentence-splitter

    Text to sentence splitter using heuristic algorithm by Philipp Koehn and Josh Schroeder.

  • tiktoken-rs

    Ready-made tokenizer library for working with GPT and tiktoken

  • html5gum

    A WHATWG-compliant HTML5 tokenizer and tag soup parser

  • tokenizer

    NLP tokenizers written in Go language (by sugarme)

  • simplemma

    Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

  • tiktoken_ruby

    Unofficial Ruby binding for tiktoken by way of Rust

  • Project mention: tiktoken_ruby VS ruby-openai - a user suggested alternative | libhunt.com/r/tiktoken_ruby | 2024-05-03
  • Cledev.OpenAI

    .NET 7 SDK for OpenAI with a Blazor Server playground

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Tokenizer related posts

  • Show HN: LLaMA 3 tokenizer runs in the browser

    2 projects | news.ycombinator.com | 21 Apr 2024
  • Show HN: LLaMA tokenizer that runs in browser

    7 projects | news.ycombinator.com | 13 Jun 2023
  • Intro video for my VS Code extension "Blockman"

    3 projects | /r/vscode | 13 Jan 2023
  • Build package for NPM & Deno

    5 projects | /r/Deno | 5 Jan 2023
  • spaCy just got an experimental feature to detect co-references

    1 project | /r/learnmachinelearning | 7 Oct 2022
  • Edit code from browser

    2 projects | /r/reactjs | 5 Jul 2022
  • SpanFinder is a new experimental spaCy component that identifies span boundaries

    1 project | news.ycombinator.com | 21 Jun 2022

Index

What are some of the best open-source Tokenizer projects? This list will help you:

Rank Project Stars
1 Chevrotain 2,399
2 moo 807
3 kagome 789
4 JFlex 575
5 php-parser 515
6 simple 489
7 js-tokens 479
8 friso 474
9 CogCompNLP 469
10 sentences 421
11 gpt-tokenizer 380
12 jumanpp 366
13 fugashi 366
14 lindera 351
15 vscode-blockman 341
16 bitextor 279
17 sentence-splitter 203
18 tiktoken-rs 200
19 html5gum 146
20 tokenizer 140
21 simplemma 125
22 tiktoken_ruby 106
23 Cledev.OpenAI 103
