Tokenmonster Alternatives
Similar projects and alternatives to tokenmonster
-
sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
tokenmonster reviews and mentions
-
Tokenizer benchmark comparing 16 language models pre-trained from scratch
The actual analysis: https://github.com/alasdairforsythe/tokenmonster/blob/main/b...
> Summary of Findings:
> - Comparable (50256-strict-nocapcode) TokenMonster vocabularies perform better than both GPT-2 Tokenizer and tiktoken p50k_base on all metrics.
> - Optimal vocabulary size is 32,000.
> - Simpler vocabularies converge faster but do not necessarily produce better results when converged.
> - Higher compression (more chr/tok) does not, by itself, negatively affect model quality.
> - Vocabularies with multiple words per token have a 5% negative impact on SMLQA (Ground Truth) benchmark, but a 13% better chr/tok compression.
> - Capcode takes longer to learn, but once the model has converged, does not appear to affect SMLQA (Ground Truth) or SQuAD (Data Extraction) benchmarks significantly in either direction.
> - Validation loss and F1 score are both meaningless metrics when comparing different tokenizers.
> - Flaws and complications in the tokenizer affect the model's ability to learn facts more than they affect its linguistic capability.
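The chr/tok (characters per token) compression metric referenced in the findings above is straightforward to compute. As a minimal sketch (the text and tokenizations below are hypothetical, not from the linked analysis), a vocabulary with multiple words per token yields a higher chr/tok on the same input:

```python
def chars_per_token(text: str, tokens: list[str]) -> float:
    """Average number of characters represented by each token."""
    return len(text) / len(tokens)

text = "the quick brown fox"

# Hypothetical tokenizations of the same text by two vocabularies:
single_word = ["the", " quick", " brown", " fox"]  # one word per token
multi_word = ["the quick", " brown fox"]           # multiple words per token

print(chars_per_token(text, single_word))  # 4.75
print(chars_per_token(text, multi_word))   # 9.5
```

This is the trade-off the findings describe: the multi-word vocabulary compresses better (higher chr/tok) but, per the benchmark, costs a few points on the SMLQA ground-truth metric.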
-
How best to benchmark the accuracy of a model for comparing different tokenizers? [D]
I need to benchmark the performance of my tokenizer against standard tokenizers. It would be best for reproducibility if I benchmark against an existing model on a standard benchmark, swapping out the existing tokenizer for my tokenizer.
-
Benchmark a vocabulary by training a small model -- Any plug & play solutions?
Having just released my ungreedy subword tokenizer (TokenMonster), I keep being asked to provide benchmarks on how it performs when actually used to train a model, versus other tokenizers.
-
TokenMonster Ungreedy Subword Tokenizer V4: Enables Models to be 4x Smaller Whilst Achieving Higher Chr/Token (With Evidence) [P]
This is all I've been doing 16 hours per day, 7 days per week for the past couple of months. If you like it please ☆ star the GitHub so people will find it. If you have any questions feel free to ask on here or on the GitHub Discussions tab. Thank you.
- Tokenmonster: Determine tokens to optimally represent a dataset
- TokenMonster: Ungreedy tokenizer, outperforming tiktoken by 35%
-
TokenMonster Ungreedy ~ 35% faster inference and 35% increased context-length for large language models (compared to tiktoken). Benchmarks included
TokenMonster is an ungreedy tokenizer and vocabulary builder, outperforming tiktoken by 35%. In fact, TokenMonster's smallest 24000 vocabulary consistently uses fewer tokens than tiktoken's largest 100256 vocabulary to tokenize the same text. Save the tokens! See benchmark.
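The intuition behind "ungreedy" tokenization can be seen with a toy example. This is a simplified sketch, not TokenMonster's actual algorithm: it contrasts greedy longest-match tokenization with a dynamic-programming search that minimizes total token count. The vocabulary and input string are made up for illustration:

```python
def greedy(text: str, vocab: set[str]) -> list[str]:
    """Greedy tokenization: always take the longest match at each position."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            if text[i:i + length] in vocab:
                tokens.append(text[i:i + length])
                i += length
                break
    return tokens

def minimal(text: str, vocab: set[str]) -> list[str]:
    """Token-count-minimizing tokenization via dynamic programming.
    best[i] holds the fewest-token segmentation of text[:i]."""
    n = len(text)
    best: list[list[str] | None] = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in vocab:
                candidate = best[j] + [text[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]

vocab = {"abc", "ab", "cde", "c", "d", "e"}
print(greedy("abcde", vocab))   # ['abc', 'd', 'e'] -- greedy takes 3 tokens
print(minimal("abcde", vocab))  # ['ab', 'cde']     -- an ungreedy choice takes 2
```

Greedy matching commits to "abc" and is then forced into two single-character tokens, while looking past the longest match finds a two-token segmentation. TokenMonster's actual approach (per its README) is a bounded lookahead rather than full DP, but the effect is the same: fewer tokens for the same text.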
-
[P] TokenMonster Ungreedy ~ 35% faster inference and 35% increased context-length for large language models (compared to tiktoken). Benchmarks included.
From the GitHub:
-
[P] New tokenization method improves LLM performance & context-length by 25%+
Code at GitHub.
-
Stats
alasdairforsythe/tokenmonster is an open source project licensed under the MIT License, an OSI-approved license.
The primary programming language of tokenmonster is Go.