Tokenizer

Top 23 Tokenizer Open-Source Projects

  • Chevrotain

    Parser Building Toolkit for JavaScript

  • Project mention: Ohm: A library and language for building parsers, interpreters, compilers, etc. | news.ycombinator.com | 2023-10-31

    How does this compare with Chevrotain[1]?

    More specifically, can I build lexers with Ohm? Can it generate a syntax diagram from a grammar?

    [1]: https://github.com/chevrotain/chevrotain

  • moo

    Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo. (by no-context)

  • kagome

    Self-contained Japanese Morphological Analyzer written in pure Go

  • JFlex

    The fast scanner generator for Java™ with full Unicode support

  • php-parser

    🌿 NodeJS PHP Parser - extract AST or tokens (by glayzzle)

  • simple

    A SQLite fts5 full-text search extension supporting Chinese and Pinyin | A SQLite3 fts5 tokenizer which supports Chinese and PinYin (by wangfenjin)

  • Project mention: Postgres Full Text Search is better than | news.ycombinator.com | 2023-04-27

    There's a C++ library for tokenising Chinese for sqlite FTS: https://github.com/wangfenjin/simple

  • js-tokens

    Tiny JavaScript tokenizer.

  • friso

    High-performance Chinese tokenizer with GBK and UTF-8 charset support, based on the MMSEG algorithm and written in ANSI C. Its fully modular implementation can be easily embedded in other programs such as MySQL, PostgreSQL, and PHP.

  • CogCompNLP

    CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.

  • sentences

    A multilingual command line sentence tokenizer in Golang

  • gpt-tokenizer

    JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT-2 / GPT-3 / GPT-4. Port of OpenAI's tiktoken with additional features.

  • Project mention: I wrote a tokenizer for LLaMA that runs inside the browser | /r/LocalLLaMA | 2023-06-13

    There are more differences between GPT2 tokenizer and LLaMA tokenizer than only the vocab and merge data. It would take me some time to implement a GPT2 tokenizer, and there are already good alternatives for those, so it wouldn't make sense to put time into making another one. For example, this library contains a GPT2 tokenizer: https://github.com/niieani/gpt-tokenizer

  • fugashi

    A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.

  • jumanpp

    Juman++ (a Morphological Analyzer Toolkit)

  • lindera

    A multilingual morphological analysis library.

  • vscode-blockman

    VSCode extension to highlight nested code blocks

  • bitextor

    Bitextor generates translation memories from multilingual websites

  • sentence-splitter

    Text to sentence splitter using heuristic algorithm by Philipp Koehn and Josh Schroeder.

  • tiktoken-rs

    Ready-made tokenizer library for working with GPT and tiktoken

  • html5gum

    A WHATWG-compliant HTML5 tokenizer and tag soup parser

  • tokenizer

    NLP tokenizers written in Go language (by sugarme)

  • simplemma

    Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

  • Cledev.OpenAI

    .NET 7 SDK for OpenAI with a Blazor Server playground

  • spacy-experimental

    🧪 Cutting-edge experimental spaCy components and features

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source Tokenizer projects? This list will help you:

 #  Project             Stars
 1  Chevrotain          2,397
 2  moo                   802
 3  kagome                789
 4  JFlex                 574
 5  php-parser            514
 6  simple                488
 7  js-tokens             477
 8  friso                 474
 9  CogCompNLP            469
10  sentences             419
11  gpt-tokenizer         377
12  fugashi               366
13  jumanpp               365
14  lindera               352
15  vscode-blockman       341
16  bitextor              278
17  sentence-splitter     203
18  tiktoken-rs           198
19  html5gum              145
20  tokenizer             136
21  simplemma             125
22  Cledev.OpenAI         103
23  spacy-experimental     93
