Show HN: Next-token prediction in JavaScript – build fast LLMs from scratch

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • next-token-prediction

    Next-token prediction in JavaScript — build fast language and diffusion models.

  • People on here will be happy to say that I do a similar thing, but my sequence length is dynamic because I also use a second data structure - to use pretentious academic speak: a simple bigram LM (2-gram) for single next-word likelihood, plus a separate trie that models all words and phrases (so, an n-gram). I'm not sure how many total nodes there are because sentence lengths vary in the training data, but there are about 200,000 entry points (keys), so probably about 2-10 million total nodes in the default setup.

    "Constructing 7-gram LM": They likely started with bigrams (what I use) which only tells you the next word based on 1 word given, and thought to increase accuracy by modeling out more words in a sequence, and eventually let the user (developer) pass in any amount they want to model (https://github.com/google-research/google-research/blob/5c87...). I thought of this too at first, but I actually got more accuracy (and speed) out of just keeping them as bigrams and making a totally separate structure that models out an n-gram of all phrases (e.g. could be a 24-token long sequence or 100+ tokens etc. I model it all) and if that phrase is found, then I just get the bigram assumption of the last token of the phrase. This works better when the training data is more diverse (for a very generic model), but theirs would probably outperform mine on accuracy when the training data has a lot of nearly identical sentences that only change wildly toward the end - I don't find this pattern in typical data though, maybe for certain coding and other tasks there are those patterns though. But because it's not dynamic and they make you provide that number, even a low number (any phrase longer than 2 words) - theirs will always have to do more lookup work than with simple bigrams and they're also limited by that fixed number as far as accuracy. I wonder how scalable that is - if I need to train on occasional ~100-word long sentences but also (and mostly) just ~3-word long sentences, I guess I set this to 100 and have a mostly "undefined" trie.

    I also thought of the name "LMJS" (theirs is "jslm" :)), but I went with simply "next-token-prediction" because that's what it ultimately does as a library. I don't know what theirs is really designed for other than proving a concept - most of their code files are actually comments and hypothetical scenarios.

    I recently added a browser example showing simple autocomplete using my library: https://github.com/bennyschmidt/next-token-prediction/tree/m... (video)

    Next, I'm implementing 8-dimensional embeddings, converted to normalized vectors between 0 and 1, to see if doing math on them does anything useful beyond similarity. Right now they look like this:

      [nextFrequency, prevalence, specificity, length, firstLetter, lastLetter, firstVowel, lastVowel]
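
    To make the two-structure idea concrete, here is a rough sketch of one possible reading of it - not the library's actual code, and the data shapes and function names are made up for illustration. A phrase trie is walked along the input (any length), and the bigram counts of the last input token are used to score the continuation:

      // Illustrative only. Assumed shapes:
      //   bigrams = { token: { nextToken: count } }
      //   trie    = nested objects keyed by token, e.g. { the: { dog: { was: {} } } }

      function bigramNext(bigrams, token, candidates) {
        const counts = bigrams[token] || {};
        const pool = candidates && candidates.length ? candidates : Object.keys(counts);
        // Rank candidates by how often they follow `token` in the training data.
        return pool.sort((a, b) => (counts[b] || 0) - (counts[a] || 0))[0] || null;
      }

      function predictNext(bigrams, trie, tokens) {
        // Walk the trie along the input phrase (the phrase can be any length).
        let node = trie;
        for (const token of tokens) {
          if (!node[token]) { node = null; break; }
          node = node[token];
        }
        const lastToken = tokens[tokens.length - 1];
        // If the whole phrase is known, restrict candidates to its continuations in the
        // trie and score them with the bigram of the phrase's last token; otherwise
        // fall back to a plain bigram lookup on the last token.
        const candidates = node ? Object.keys(node) : [];
        return bigramNext(bigrams, lastToken, candidates);
      }

      // e.g. predictNext(bigrams, trie, ['the', 'dog']) might return 'was'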

  • llimo

    Large language and image models in pure JavaScript.

  • This system predicts "was" as the next word because it usually is the next word after "dog" (in the source data). This library was built to ultimately provide completions, not have a conversation, so no doubt OpenAI's approach works better for chat.

    I am, however, already making a chat model. Here's my approach if anyone cares: the completer already gives great completions, and fast, but some of them make no sense for what was asked. The chat model I'm working on here (https://github.com/bennyschmidt/llimo/pull/1) can just get all completions and use parts-of-speech codes to match a completion to the cursor (a rough sketch follows the example below). I don't have this fully implemented yet, but you can get the idea in this PR. This is like an NLP layer specific to chat - it has nothing to do with next-token prediction in general, and there are no NLP libraries in `next-token-prediction` (the npm). The example I've been using to explain this is:

      User: "Where is Paris?"

  • google-research

    Google Research

  • horsey-books

    Generate phrases using Markov chains

  • [2] https://github.com/longears/horsey-books

  • wink-nlp

    Developer friendly Natural Language Processing ✨

  • This is awesome, thanks. I've been messing with wink's NLP library (https://winkjs.org/wink-nlp/) to transform user queries and format responses so I can make a proper chat bot - will see what I can learn from these!
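
    For reference, a minimal wink-nlp part-of-speech example (assuming the `wink-eng-lite-web-model` package; check the winkjs docs for the exact setup in your environment):

      // Tokenize a query and read universal POS tags with wink-nlp.
      const winkNLP = require('wink-nlp');
      const model = require('wink-eng-lite-web-model'); // English lite model

      const nlp = winkNLP(model);
      const its = nlp.its;

      const doc = nlp.readDoc('Where is Paris?');
      console.log(doc.tokens().out());         // [ 'Where', 'is', 'Paris', '?' ]
      console.log(doc.tokens().out(its.pos));  // e.g. [ 'ADV', 'AUX', 'PROPN', 'PUNCT' ]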

  • dasher-web

    Dasher text entry in HTML, CSS, JavaScript, and SVG

  • Their library was actually made for Dasher (http://www.inference.org.uk/dasher/) - there was a web version being made (https://github.com/dasher-project/dasher-web), but we hit a bottleneck with driving the graphics (note that in Dasher pretty much the entire tree is in dynamic view). That may help to explain the use case. Dasher is for people with disabilities who can't speak. It needs a personalised LM that trains on the fly and keeps track of new words/sentences. But in truth, utterances are usually small.

    Don't get too knocked back by comments. A) If it works - it works. B) Your learning is as valuable as the outcome.

    Oh, have a look at https://imagineville.org/software/ for some other things that may be of interest.
