[D] Transformer sequence generation - is it truly quadratic scaling?

x-transformers

10 4,147 8.7 Python

A simple but complete full-attention transformer with a set of promising experimental features from various papers

However, I recently came across the concept of key-value (KV) caching in Transformer decoders (e.g. Figure 3 here). Because each output (and hence each input, since the model is autoregressive) depends only on previous outputs (inputs), we don't need to recompute the Key and Value vectors for the earlier positions t < i at timestep i of the sequence. My intuition, then, is that (unconditioned) inference for a decoder-only model uses an effective sequence length of 1 (the most recently produced token is the only input that actually requires new computation), making attention a linear-complexity operation for each generated token. This thinking seems to be validated by this GitHub issue and this paper (2nd paragraph of the Introduction).
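To make the caching idea above concrete, here is a minimal NumPy sketch of single-head causal attention computed two ways: once by recomputing everything for the whole sequence, and once token by token with cached Key and Value rows. This is an illustration under stated assumptions, not the x-transformers implementation: the `KVCache` class, the weight matrices `Wq`, `Wk`, `Wv`, and the sizes `n` and `d` are all hypothetical names and values chosen for the example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(X, Wq, Wk, Wv):
    # Recomputes Q, K, V for every position and builds an n x n score matrix.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # positions j > i
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V

class KVCache:
    # Hypothetical helper: stores Key/Value rows of all previous tokens.
    def __init__(self, d):
        self.K = np.zeros((0, d))
        self.V = np.zeros((0, d))
        self.scores_per_step = []  # number of attention scores computed per step

    def step(self, x, Wq, Wk, Wv):
        # Only the newest token is projected; older K/V rows are reused.
        q = x @ Wq
        self.K = np.vstack([self.K, x @ Wk])
        self.V = np.vstack([self.V, x @ Wv])
        scores = self.K @ q / np.sqrt(q.shape[-1])  # shape (t,), one per cached token
        self.scores_per_step.append(len(scores))
        return softmax(scores) @ self.V

rng = np.random.default_rng(0)
n, d = 6, 4  # assumed toy sizes
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

reference = full_causal_attention(X, Wq, Wk, Wv)
cache = KVCache(d)
cached = np.stack([cache.step(x, Wq, Wk, Wv) for x in X])

print(np.allclose(reference, cached))
print(cache.scores_per_step)
```

In this sketch the cached path produces the same outputs as the full recomputation, and the work at step t is one query against t cached keys, so it grows linearly with the current context length. Summed over all n generated tokens, the number of scores is 1 + 2 + ... + n.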

x-transformers

1 project | news.ycombinator.com | 31 Mar 2024
A single API call using almost the whole 32k context window costs around 2$.

1 project | /r/OpenAI | 15 Mar 2023
GPT-4 architecture: what we can deduce from research literature

1 project | news.ycombinator.com | 14 Mar 2023
You’ll be able to run chatgpt on your own device quite easily very soon

2 projects | /r/OpenAI | 13 Mar 2023
The GPT Architecture, on a Napkin

4 projects | news.ycombinator.com | 11 Dec 2022

This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning
Artificial intelligence Deep Learning attention-mechanism Transformers
Post date: 23 Sep 2021