I think it's not that LLMs have redundant layers in general; this seems to be a problem specific to OPT-66B.
The Gopher paper, "Scaling Language Models: Methods, Analysis & Insights from Training Gopher" (http://arxiv.org/abs/2112.11446), captures this well on page 103, Appendix G:
> The general finding is that whilst compressing models for a particular application has seen success, it is difficult to compress them for the objective of language modelling over a diverse corpus.
Appendix G explores techniques like pruning and distillation, but finds that neither is an efficient way to obtain better loss at a smaller parameter count.
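For context on what "pruning" means here, below is a minimal sketch of unstructured magnitude pruning, one of the simpler techniques in this family: zero out the smallest-magnitude weights of a layer. The function name, sparsity value, and toy layer are mine, not from either paper, and real evaluations (like Appendix G's) prune whole models and then measure loss on a diverse corpus.

```python
import torch

def magnitude_prune_(linear: torch.nn.Linear, sparsity: float = 0.5) -> None:
    """Zero out the smallest-magnitude weights of a Linear layer in place."""
    w = linear.weight.data
    k = int(w.numel() * sparsity)  # number of weights to drop
    if k == 0:
        return
    # k-th smallest absolute value acts as the pruning threshold
    threshold = w.abs().flatten().kthvalue(k).values
    w[w.abs() <= threshold] = 0.0

# Toy usage: prune half the weights of a single layer
layer = torch.nn.Linear(512, 512)
magnitude_prune_(layer, sparsity=0.5)
print((layer.weight == 0).float().mean())  # ~0.5
```

The Gopher finding is essentially that tricks like this (and distillation) don't beat simply training a smaller dense model, once you measure language-modelling loss over a broad corpus rather than accuracy on one narrow task.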
So why does pruning work for OPT-66B in particular? I'm not sure, but there is evidence that OPT-66B is an outlier. One piece of evidence is in the GPTQ paper ("GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers", https://arxiv.org/abs/2210.17323), which notes in a footnote on page 7:
> [2] Upon closer inspection of the OPT-66B model, it appears that this is correlated with the fact that this trained