StreamingLLM: a simple and efficient framework that enables LLMs to handle unlimited texts without fine-tuning

This page summarizes the projects mentioned and recommended in the original post on /r/LocalLLaMA

  • streaming-llm

    [ICLR 2024] Efficient Streaming Language Models with Attention Sinks

  • Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach -- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence lengths without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup. Code and datasets are provided in the link.
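
    The core cache-management idea from the abstract is easy to sketch: retain the KV entries of the first few "attention sink" tokens plus a rolling window of the most recent tokens, and drop everything in between. Below is a minimal, illustrative Python/PyTorch sketch, not the official streaming-llm API; the function name, tensor layout, and the default sink/window sizes are assumptions chosen for illustration.

    # Illustrative sketch (assumed names and sizes), not the streaming-llm repo's API.
    import torch

    def evict_kv_cache(keys, values, sink_size=4, window_size=1020):
        """keys/values: [batch, heads, seq_len, head_dim] for one attention layer."""
        seq_len = keys.size(2)
        if seq_len <= sink_size + window_size:
            return keys, values  # cache still fits; nothing to evict
        # Always retain the initial "sink" tokens ...
        sink_k, sink_v = keys[:, :, :sink_size], values[:, :, :sink_size]
        # ... plus the most recent window of tokens; the middle is dropped.
        recent_k, recent_v = keys[:, :, -window_size:], values[:, :, -window_size:]
        return (torch.cat([sink_k, recent_k], dim=2),
                torch.cat([sink_v, recent_v], dim=2))

    # Toy usage: a cache that grew to 5,000 tokens is trimmed back to 1,024.
    k = torch.randn(1, 8, 5000, 64)
    v = torch.randn(1, 8, 5000, 64)
    k, v = evict_kv_cache(k, v)
    print(k.shape)  # torch.Size([1, 8, 1024, 64])

    In the actual implementation this eviction runs per layer at every decoding step, and the retained tokens are (as described in the paper) given positions relative to their place in the trimmed cache rather than their original absolute positions, which is what lets the model run past its training length.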



Related posts