Our great sponsors
-
WorkOS
The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
-
sparsegpt
Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
-
instruct-eval
This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.
-
lm-evaluation-harness
A framework for few-shot evaluation of autoregressive language models. (by bigscience-workshop)
-
geov
The GeoV model is a large langauge model designed by Georges Harik and uses Rotary Positional Embeddings with Relative distances (RoPER). We have shared a pre-trained 9B parameter model.
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
It's unclear which models will be trained to 1.5T tokens. The details of how many tokens each model saw in training are on Github - https://github.com/stability-AI/stableLM/ . But only for the ones that have been released.
Looks like my edit window closed, but my results ended up being very low so there must be something wrong (I've reached out to StabilityAI just in case). It does however seem to roughly match another user's 3B testing: https://twitter.com/abacaj/status/1648881680835387392
The current scores I have place it between gpt2_774M_q8 and pythia_deduped_410M (yikes!). Based on training and specs you'd expect it to outperform Pythia 6.9B at least... this is running on a HEAD checkout of https://github.com/EleutherAI/lm-evaluation-harness (releases don't support hf-casual) for those looking to replicate/debug.
Note, another LLM currently being trained, GeoV 9B, already far outperforms this model at just 80B tokens trained: https://github.com/geov-ai/geov/blob/master/results.080B.md
That's not going to happen. But it's likely that StableLM 175B will rival GPT-4.
Also, you can finetune Base StableLM yourself on any consumer GPU with 8GB of VRAM in a couple of hours and it will be commercial licensed. (using https://github.com/johnsmith0031/alpaca_lora_4bit)
You can even use the exact same dataset StabilityAI used.
Great to see the continued release of open models. The only disappointing thing is that models keep building on CC-BY-NC licensed datasets, which severely limits their use.
Hopefully, people consider txtinstruct (https://github.com/neuml/txtinstruct) and other approaches to generate instruction-tuning datasets without the baggage.
I think "sweet spot" is going to depend on your task, but here's a good recent paper that may give you some more context on thinking about training and model sizes: https://www.harmdevries.com/post/model-size-vs-compute-overh...
There have also been quite a few developments on sparsity lately. Here's a technique SparseGPT which suggests that you can prune 50% of parameters with almost no loss in performance for example: https://arxiv.org/abs/2301.00774
I've been diving in lately and while it's not efficient, the only way to do manage is to create a new conda/mamba environment, or a custom Docker image for all the conflicting packages.
For safety and speed, you should prefer the safetensor format: https://huggingface.co/docs/safetensors/speed
If you know what you are doing you can do your own conversions: https://github.com/huggingface/safetensors or for safety, https://huggingface.co/spaces/diffusers/convert
I really dislike this approach of announcing new models that some companies have taken, they don't mention evaluation results or performance of the model, but instead talk about how "transparent", "accessible" and "supportive" these models are.
Anyway, I have benchmarked stablelm-base-alpha-3b (the open-source version, not the fine-tuned one which is under a NC license) using the MMLU benchmark and the results are rather underwhelming compared to other open source models:
* stablelm-base-alpha-3b (3B params): 25.6% average accuracy
* flan-t5-xl (3B params): 49.3% average accuracy
* flan-t5-small (80M params): 29.4% average accuracy
MMLU is just one benchmark, but based on the blog post, I don't think it will yield much better results in others. I'll leave links to the MMLU results of other proprietary[0] and open-access[1] models (results may vary by ±2% depending on the parameters used during inference).
[0]: https://paperswithcode.com/sota/multi-task-language-understa...
[1]: https://github.com/declare-lab/flan-eval/blob/main/mmlu.py#L...
Yeah, although looks like it currently has some issues with coqa: https://github.com/EleutherAI/lm-evaluation-harness/issues/2...
There's also the bigscience fork, but I ran into even more problems (although I didn't try too hard) https://github.com/bigscience-workshop/lm-evaluation-harness
And there's https://github.com/EleutherAI/lm-eval2/ (not sure if it's just starting over w/ a new repo or what?) but it has limited tests available
Yeah, although looks like it currently has some issues with coqa: https://github.com/EleutherAI/lm-evaluation-harness/issues/2...
There's also the bigscience fork, but I ran into even more problems (although I didn't try too hard) https://github.com/bigscience-workshop/lm-evaluation-harness
And there's https://github.com/EleutherAI/lm-eval2/ (not sure if it's just starting over w/ a new repo or what?) but it has limited tests available
https://github.com/HazyResearch/flash-attention#memory
"standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length."
llama.cpp has preliminary support already. https://github.com/ggerganov/llama.cpp/issues/1063#issuecomm...
That dataset is licensed under CC BY NC 4.0, which is not open. It also has a bunch of garbage in it; see https://github.com/gururise/AlpacaDataCleaned
Looks like my edit window closed, but my results ended up being very low so there must be something wrong (I've reached out to StabilityAI just in case). It does however seem to roughly match another user's 3B testing: https://twitter.com/abacaj/status/1648881680835387392
The current scores I have place it between gpt2_774M_q8 and pythia_deduped_410M (yikes!). Based on training and specs you'd expect it to outperform Pythia 6.9B at least... this is running on a HEAD checkout of https://github.com/EleutherAI/lm-evaluation-harness (releases don't support hf-casual) for those looking to replicate/debug.
Note, another LLM currently being trained, GeoV 9B, already far outperforms this model at just 80B tokens trained: https://github.com/geov-ai/geov/blob/master/results.080B.md