Tokenizing Open-Source Projects
Project mention: Tokenizer benchmark comparing 16 language models pre-trained from scratch | news.ycombinator.com | 2023-09-05

The actual analysis: https://github.com/alasdairforsythe/tokenmonster/blob/main/b...
> Summary of Findings:
> - Comparable (50256-strict-nocapcode) TokenMonster vocabularies perform better than both GPT-2 Tokenizer and tiktoken p50k_base on all metrics.
> - Optimal vocabulary size is 32,000.
> - Simpler vocabularies converge faster but do not necessarily produce better results when converged.
> - Higher compression (more chr/tok) does not, by itself, negatively affect model quality.
> - Vocabularies with multiple words per token have a 5% negative impact on SMLQA (Ground Truth) benchmark, but a 13% better chr/tok compression.
> - Capcode takes longer to learn, but once the model has converged, does not appear to affect SMLQA (Ground Truth) or SQuAD (Data Extraction) benchmarks significantly in either direction.
> - Validation loss and F1 score are both meaningless metrics when comparing different tokenizers.
> - Flaws and complications in the tokenizer affect the model's ability to learn facts more than they affect its linguistic capability.
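The chr/tok compression metric cited throughout the findings is simply the average number of characters each token covers: more characters per token means the text compresses into fewer tokens. A minimal sketch of the metric, using two toy tokenizers rather than any real vocabulary (the function names and sample text here are illustrative, not from the benchmark):

```python
def chars_per_token(text: str, tokenize) -> float:
    """chr/tok: average characters covered by each token.

    Higher values mean the tokenizer compresses the text into
    fewer tokens, as in the benchmark's compression metric.
    """
    tokens = tokenize(text)
    return len(text) / len(tokens)

text = "the quick brown fox jumps over the lazy dog"

# Toy tokenizers for comparison (not real trained vocabularies):
word_level = lambda s: s.split()  # several characters per token
char_level = lambda s: list(s)    # exactly one character per token

print(chars_per_token(text, word_level))  # multiple chr/tok
print(chars_per_token(text, char_level))  # 1.0 chr/tok
```

A real measurement would apply the same ratio to a trained vocabulary (e.g. a TokenMonster or tiktoken encoding) over a large corpus; the toy word-level tokenizer stands in only to show why multi-character tokens raise the ratio.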
Index

| # | Project | Stars |
|---|---------|---|
| 1 | tokenmonster | 514 |
| 2 | tokenizer | 87 |