tokenizing

Open-source projects categorized as tokenizing

  • tokenmonster

    Ungreedy subword tokenizer and vocabulary trainer for Python, Go & JavaScript

  • Project mention: Tokenizer benchmark comparing 16 language models pre-trained from scratch | news.ycombinator.com | 2023-09-05

    The actual analysis: https://github.com/alasdairforsythe/tokenmonster/blob/main/b...

    > Summary of Findings:

    > - Comparable (50256-strict-nocapcode) TokenMonster vocabularies perform better than both GPT-2 Tokenizer and tiktoken p50k_base on all metrics.

    > - Optimal vocabulary size is 32,000.

    > - Simpler vocabularies converge faster but do not necessarily produce better results when converged.

    > - Higher compression (more chr/tok) does not negatively affect model quality alone.

    > - Vocabularies with multiple words per token have a 5% negative impact on SMLQA (Ground Truth) benchmark, but a 13% better chr/tok compression.

    > - Capcode takes longer to learn, but once the model has converged, does not appear to affect SMLQA (Ground Truth) or SQuAD (Data Extraction) benchmarks significantly in either direction.

    > - Validation loss and F1 score are both meaningless metrics when comparing different tokenizers.

    > - Flaws and complications in the tokenizer affect the model's ability to learn facts more than they affect its linguistic capability.
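The findings above compare vocabularies by "chr/tok", which is the average number of characters covered by each token (higher means more compression). A minimal sketch of how that ratio could be measured is shown here. The two tokenizers are hypothetical stand-ins written for illustration only; they are not TokenMonster, GPT-2, or tiktoken, and the function names are assumptions rather than anything from the benchmark.

```python
import re


def chars_per_token(texts, tokenize):
    """Average characters per token (chr/tok) over an iterable of strings."""
    total_chars = 0
    total_tokens = 0
    for text in texts:
        total_chars += len(text)
        total_tokens += len(tokenize(text))
    if total_tokens == 0:
        raise ValueError("tokenizer produced no tokens")
    return total_chars / total_tokens


def word_tokenize(text):
    # Hypothetical stand-in: a word with an optional leading space,
    # or a single whitespace character, or a single punctuation character.
    return re.findall(r" ?\w+|\s|[^\w\s]", text)


def char_tokenize(text):
    # Hypothetical baseline: one token per character, so chr/tok is 1.0.
    return list(text)


sample = ["hello, world!"]
word_ratio = chars_per_token(sample, word_tokenize)
char_ratio = chars_per_token(sample, char_tokenize)
print(f"word-level: {word_ratio:.2f} chr/tok, char-level: {char_ratio:.2f} chr/tok")
```

To compare real vocabularies, the same `chars_per_token` helper could be given each tokenizer's own encode function and a shared evaluation corpus. As the findings note, a higher ratio alone says nothing about model quality.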

  • tokenizer

    Tokenizer (lexer) for golang (by bzick)

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

#  Project       Stars
1  tokenmonster  514
2  tokenizer     87
