Stability AI Launches the First of Its StableLM Suite of Language Models

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • StableLM

    StableLM: Stability AI Language Models

  • It's unclear which models will be trained to 1.5T tokens. How many tokens each model saw during training is documented on GitHub (https://github.com/stability-AI/stableLM/), but only for the models that have been released so far.

  • lm-evaluation-harness

    A framework for few-shot evaluation of language models.

  • Looks like my edit window closed, but my results ended up being very low, so there must be something wrong (I've reached out to StabilityAI just in case). It does, however, roughly match another user's 3B testing: https://twitter.com/abacaj/status/1648881680835387392

    The current scores I have place it between gpt2_774M_q8 and pythia_deduped_410M (yikes!). Based on training and specs you'd expect it to outperform Pythia 6.9B at least... this is running on a HEAD checkout of https://github.com/EleutherAI/lm-evaluation-harness (releases don't support hf-causal) for those looking to replicate/debug; a sketch of such a run follows below.

    Note, another LLM currently being trained, GeoV 9B, already far outperforms this model at just 80B tokens trained: https://github.com/geov-ai/geov/blob/master/results.080B.md
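
    For anyone replicating, a minimal sketch of that kind of run through the harness's Python API might look like the following; the entry point, task names, and model id are assumptions, so check the repo's README for the current interface:

    ```python
    # Hedged sketch: evaluating stablelm-base-alpha-3b with EleutherAI's
    # lm-evaluation-harness via its Python API. Task names and the model
    # id are assumptions; use a HEAD checkout, since tagged releases
    # lack the hf-causal model type mentioned above.
    from lm_eval import evaluator

    results = evaluator.simple_evaluate(
        model="hf-causal",
        model_args="pretrained=stabilityai/stablelm-base-alpha-3b",
        tasks=["lambada_openai", "hellaswag"],
        num_fewshot=0,
        batch_size=8,
        device="cuda:0",
    )
    print(results["results"])  # per-task metrics (accuracy, perplexity, ...)
    ```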

  • alpaca_lora_4bit

  • That's not going to happen. But it's likely that StableLM 175B will rival GPT-4.

    Also, you can fine-tune the base StableLM yourself on any consumer GPU with 8 GB of VRAM in a couple of hours (using https://github.com/johnsmith0031/alpaca_lora_4bit), and the result will be commercially licensable; a rough sketch of that kind of setup follows below.

    You can even use the exact same dataset StabilityAI used.
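
    As an illustration, here is a minimal 4-bit LoRA fine-tuning sketch built on the Hugging Face transformers + peft + bitsandbytes stack; the linked alpaca_lora_4bit repo uses its own GPTQ-based loader and training scripts, so the model id, target modules, and hyperparameters here are assumptions:

    ```python
    # Hedged sketch: low-rank adapters on a 4-bit-quantized base model.
    # Not the alpaca_lora_4bit API; this uses the generic HF stack.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    MODEL = "stabilityai/stablelm-base-alpha-3b"  # assumed HF model id
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL,
        load_in_4bit=True,           # quantize weights to 4 bits on load
        torch_dtype=torch.float16,
        device_map="auto",
    )

    lora = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05,
        target_modules=["query_key_value"],  # GPT-NeoX-style attention proj
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # only the small adapters are trained
    ```

    From here, training proceeds as usual (e.g. with transformers' Trainer on an Alpaca-style instruction dataset); only the adapter weights update, which is what keeps the VRAM footprint within 8 GB.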

  • transformers

    🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.

  • txtinstruct

    📚 Datasets and models for instruction-tuning

  • Great to see the continued release of open models. The only disappointing thing is that models keep building on CC-BY-NC licensed datasets, which severely limits their use.

    Hopefully, people consider txtinstruct (https://github.com/neuml/txtinstruct) and other approaches to generate instruction-tuning datasets without the baggage.

  • sparsegpt

    Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".

  • I think "sweet spot" is going to depend on your task, but here's a good recent paper that may give you some more context on thinking about training and model sizes: https://www.harmdevries.com/post/model-size-vs-compute-overh...

    There have also been quite a few developments on sparsity lately. Here's a technique SparseGPT which suggests that you can prune 50% of parameters with almost no loss in performance for example: https://arxiv.org/abs/2301.00774
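
    To make the end state concrete, here is a toy sketch of 50% unstructured sparsity via plain magnitude pruning in PyTorch; note this is not SparseGPT itself, which uses approximate second-order information to decide which weights to drop:

    ```python
    # Toy illustration: zero out the 50% smallest-magnitude weights of a
    # layer. SparseGPT's one-shot method chooses weights more cleverly,
    # but the resulting sparsity pattern is the same kind of object.
    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    layer = nn.Linear(1024, 1024)
    prune.l1_unstructured(layer, name="weight", amount=0.5)  # mask smallest 50%
    prune.remove(layer, "weight")  # bake the mask into the weight tensor

    sparsity = (layer.weight == 0).float().mean().item()
    print(f"sparsity: {sparsity:.2%}")  # ~50.00%
    ```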

  • safetensors

    Simple, safe way to store and distribute tensors

  • I've been diving in lately, and while it's not efficient, the only way I've found to manage all the conflicting packages is to create a new conda/mamba environment, or a custom Docker image, per project.

    For safety and speed, you should prefer the safetensors format: https://huggingface.co/docs/safetensors/speed

    If you know what you are doing, you can do your own conversions (https://github.com/huggingface/safetensors; a sketch follows below), or, for safety, use https://huggingface.co/spaces/diffusers/convert
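
    For the do-it-yourself route, a minimal conversion sketch with the safetensors Python package might look like this (file names are placeholders; note that torch.load on a .bin checkpoint executes pickle, which is exactly the risk safetensors avoids, so only convert files you trust):

    ```python
    # Hedged sketch: convert a PyTorch pickle checkpoint to safetensors.
    import torch
    from safetensors.torch import save_file, load_file

    state_dict = torch.load("pytorch_model.bin", map_location="cpu")
    # save_file rejects non-contiguous and storage-sharing tensors,
    # so make an independent contiguous copy of each one first.
    state_dict = {k: v.contiguous().clone() for k, v in state_dict.items()}
    save_file(state_dict, "model.safetensors")

    tensors = load_file("model.safetensors")  # fast, mmap-based load
    ```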

  • instruct-eval

    This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.

  • I really dislike the approach some companies have taken to announcing new models: they don't mention evaluation results or model performance, and instead talk about how "transparent", "accessible", and "supportive" these models are.

    Anyway, I have benchmarked stablelm-base-alpha-3b (the open-source version, not the fine-tuned one, which is under an NC license) using the MMLU benchmark, and the results are rather underwhelming compared to other open-source models:

    * stablelm-base-alpha-3b (3B params): 25.6% average accuracy

    * flan-t5-xl (3B params): 49.3% average accuracy

    * flan-t5-small (80M params): 29.4% average accuracy

    MMLU is just one benchmark, but based on the blog post, I don't think the model will yield much better results on others. I'll leave links to the MMLU results of other proprietary[0] and open-access[1] models (results may vary by ±2% depending on the parameters used during inference); a sketch of the usual MMLU scoring recipe follows after the links.

    [0]: https://paperswithcode.com/sota/multi-task-language-understa...

    [1]: https://github.com/declare-lab/flan-eval/blob/main/mmlu.py#L...
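
    For context on how numbers like these are produced, here is a hedged sketch of the scoring recipe most MMLU harnesses (including the linked flan-eval script) follow: show the question with lettered options, then compare the model's next-token logits for the four answer letters. The model id and exact prompt format are assumptions:

    ```python
    # Hedged sketch of multiple-choice MMLU scoring. Accuracy is simply
    # the fraction of questions where the argmax letter matches the key.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "stabilityai/stablelm-base-alpha-3b"  # assumed HF model id
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).cuda()
    model.eval()

    def score_question(question: str, choices: list[str]) -> int:
        """Return the index (0-3) of the answer letter the model prefers."""
        prompt = question + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in zip("ABCD", choices)
        ) + "\nAnswer:"
        inputs = tok(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            logits = model(**inputs).logits[0, -1]  # next-token logits
        letter_ids = [tok(f" {l}", add_special_tokens=False).input_ids[-1]
                      for l in "ABCD"]
        return int(logits[letter_ids].argmax())
    ```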

  • lm-evaluation-harness

    A framework for few-shot evaluation of autoregressive language models. (by bigscience-workshop)

  • Yeah, although it looks like it currently has some issues with coqa: https://github.com/EleutherAI/lm-evaluation-harness/issues/2...

    There's also the bigscience fork, but I ran into even more problems there (although I didn't try very hard): https://github.com/bigscience-workshop/lm-evaluation-harness

    And there's https://github.com/EleutherAI/lm-eval2/ (not sure if it's just starting over with a new repo or what?), but it has a limited set of tests available.

  • lm-eval2

  • cformers

    SoTA Transformers with C-backend for fast inference on your CPU. (by antimatter15)

  • stanford_alpaca

    Code and documentation to train Stanford's Alpaca models, and generate the data.

  • flash-attention

    Fast and memory-efficient exact attention

  • https://github.com/HazyResearch/flash-attention#memory

    "standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length."

  • llama.cpp

    LLM inference in C/C++

  • llama.cpp has preliminary support already. https://github.com/ggerganov/llama.cpp/issues/1063#issuecomm...

  • AlpacaDataCleaned

    Alpaca dataset from Stanford, cleaned and curated

  • That dataset is licensed under CC BY-NC 4.0, which is not open. It also has a bunch of garbage in it; see https://github.com/gururise/AlpacaDataCleaned

  • geov

    The GeoV model is a large language model designed by Georges Harik that uses Rotary Positional Embeddings with Relative distances (RoPER). We have shared a pre-trained 9B parameter model.
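
    For flavor, here is a toy sketch of standard rotary position embeddings (RoPE); GeoV's RoPER variant additionally encodes relative distances, so see the linked repo for the real implementation:

    ```python
    # Toy sketch of vanilla RoPE (rotate-half convention): feature pairs
    # are rotated by a position-dependent angle so that query-key dot
    # products depend only on relative offsets. RoPER builds on this.
    import torch

    def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
        """x: (seq_len, dim) with even dim."""
        seq_len, dim = x.shape
        half = dim // 2
        freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
        angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs
        cos, sin = angles.cos(), angles.sin()          # (seq_len, half)
        x1, x2 = x[:, :half], x[:, half:]
        return torch.cat([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], dim=-1)
    ```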

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
