tokenisation

Open-source projects categorized as tokenisation

Top 3 tokenisation Open-Source Projects

  • tokenmonster

    Ungreedy subword tokenizer and vocabulary trainer for Python, Go & JavaScript

    Project mention: Tokenizer benchmark comparing 16 language models pre-trained from scratch | news.ycombinator.com | 2023-09-05

    The actual analysis: https://github.com/alasdairforsythe/tokenmonster/blob/main/b...

    > Summary of Findings:

    > - Comparable (50256-strict-nocapcode) TokenMonster vocabularies perform better than both GPT-2 Tokenizer and tiktoken p50k_base on all metrics.

    > - Optimal vocabulary size is 32,000.

    > - Simpler vocabularies converge faster but do not necessarily produce better results when converged.

    > - Higher compression (more chr/tok) does not negatively affect model quality alone.

    > - Vocabularies with multiple words per token have a 5% negative impact on SMLQA (Ground Truth) benchmark, but a 13% better chr/tok compression.

    > - Capcode takes longer to learn, but once the model has converged, does not appear to affect SMLQA (Ground Truth) or SQuAD (Data Extraction) benchmarks significantly in either direction.

    > - Validation loss and F1 score are both meaningless metrics when comparing different tokenizers.

    > - Flaws and complications in the tokenizer affect the model's ability to learn facts more than they affect its linguistic capability.

  • TokenScript

    TokenScript schema, specs and paper

  • FramesIos

    Frames iOS: making native card payments simple

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source tokenisation projects? This list will help you:

#  Project       Stars
1  tokenmonster  516
2  TokenScript   239
3  FramesIos     74
