Tokenmonster Alternatives
Similar projects and alternatives to tokenmonster
-
sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
tokenmonster reviews and mentions
-
Tokenizer benchmark comparing 16 language models pre-trained from scratch
The actual analysis: https://github.com/alasdairforsythe/tokenmonster/blob/main/b...
> Summary of Findings:
> - Comparable (50256-strict-nocapcode) TokenMonster vocabularies perform better than both GPT-2 Tokenizer and tiktoken p50k_base on all metrics.
> - Optimal vocabulary size is 32,000.
> - Simpler vocabularies converge faster but do not necessarily produce better results when converged.
> - Higher compression (more chr/tok) does not, by itself, negatively affect model quality.
> - Vocabularies with multiple words per token have a 5% negative impact on SMLQA (Ground Truth) benchmark, but a 13% better chr/tok compression.
> - Capcode takes longer to learn, but once the model has converged, does not appear to affect SMLQA (Ground Truth) or SQuAD (Data Extraction) benchmarks significantly in either direction.
> - Validation loss and F1 score are both meaningless metrics when comparing different tokenizers.
> - Flaws and complications in the tokenizer affect the model's ability to learn facts more than they affect its linguistic capability.
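The chr/tok (characters per token) compression metric referenced in the findings above is straightforward to compute. As a minimal sketch (the text and tokenizations below are hypothetical, not from the linked analysis), a vocabulary with multiple words per token yields a higher chr/tok on the same input:

```python
def chars_per_token(text: str, tokens: list[str]) -> float:
    """Average number of characters represented by each token."""
    return len(text) / len(tokens)

text = "the quick brown fox"

# Hypothetical tokenizations of the same text by two vocabularies:
single_word = ["the", " quick", " brown", " fox"]  # one word per token
multi_word = ["the quick", " brown fox"]           # multiple words per token

print(chars_per_token(text, single_word))  # 4.75
print(chars_per_token(text, multi_word))   # 9.5
```

This is the trade-off the findings describe: the multi-word vocabulary compresses better (higher chr/tok) but, per the benchmark, costs a few points on the SMLQA ground-truth metric.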
-
How best to benchmark the accuracy of a model for comparing different tokenizers? [D]
I need to benchmark the performance of my tokenizer against standard tokenizers. It would be best for reproducibility if I benchmark against an existing model on a standard benchmark, swapping out the existing tokenizer for my tokenizer.
-
Benchmark a vocabulary by training a small model -- Any plug & play solutions?
Having just released my ungreedy subword tokenizer (TokenMonster), I keep being asked to provide benchmarks on how it performs when actually used to train a model, versus other tokenizers.
-
TokenMonster Ungreedy Subword Tokenizer V4: Enables Models to be 4x Smaller Whilst Achieving Higher Chr/Token (With Evidence) [P]
This is all I've been doing 16 hours per day, 7 days per week for the past couple of months. If you like it please ☆ star the GitHub so people will find it. If you have any questions feel free to ask on here or on the GitHub Discussions tab. Thank you.
- Tokenmonster: Determine tokens to optimally represent a dataset
- TokenMonster: Ungreedy tokenizer, outperforming tiktoken by 35%
-
TokenMonster Ungreedy ~ 35% faster inference and 35% increased context-length for large language models (compared to tiktoken). Benchmarks included
TokenMonster is an ungreedy tokenizer and vocabulary builder, outperforming tiktoken by 35%. In fact, TokenMonster's smallest 24000 vocabulary consistently uses fewer tokens than tiktoken's largest 100256 vocabulary to tokenize the same text. Save the tokens! See benchmark.
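The intuition behind "ungreedy" tokenization can be seen with a toy example. This is a simplified sketch, not TokenMonster's actual algorithm: it contrasts greedy longest-match tokenization with a dynamic-programming search that minimizes total token count. The vocabulary and input string are made up for illustration:

```python
def greedy(text: str, vocab: set[str]) -> list[str]:
    """Greedy tokenization: always take the longest match at each position."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            if text[i:i + length] in vocab:
                tokens.append(text[i:i + length])
                i += length
                break
    return tokens

def minimal(text: str, vocab: set[str]) -> list[str]:
    """Token-count-minimizing tokenization via dynamic programming.
    best[i] holds the fewest-token segmentation of text[:i]."""
    n = len(text)
    best: list[list[str] | None] = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in vocab:
                candidate = best[j] + [text[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]

vocab = {"abc", "ab", "cde", "c", "d", "e"}
print(greedy("abcde", vocab))   # ['abc', 'd', 'e'] -- greedy takes 3 tokens
print(minimal("abcde", vocab))  # ['ab', 'cde']     -- an ungreedy choice takes 2
```

Greedy matching commits to "abc" and is then forced into two single-character tokens, while looking past the longest match finds a two-token segmentation. TokenMonster's actual approach (per its README) is a bounded lookahead rather than full DP, but the effect is the same: fewer tokens for the same text.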
-
[P] TokenMonster Ungreedy ~ 35% faster inference and 35% increased context-length for large language models (compared to tiktoken). Benchmarks included.
From the GitHub:
-
[P] New tokenization method improves LLM performance & context-length by 25%+
Code at GitHub.
-
Stats
alasdairforsythe/tokenmonster is an open source project licensed under the MIT License, an OSI-approved license.
The primary programming language of tokenmonster is Go.