StreamingLLM: Efficient streaming technique enables infinite sequence lengths

streaming-llm

Mentions: 11 | Stars: 6,206 | Activity: 7.2 | Language: Python

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks

CTranslate2

Mentions: 14 | Stars: 2,825 | Activity: 8.9 | Language: C++

Fast inference engine for Transformer models

Etc.
Now, what this allows you to do is reuse the attention computed for the previous turns (since the prefix is the same).
In practice, people often put a system prompt before the conversation history, which (as far as I can tell) makes this technique inapplicable: the input prefix changes as soon as the conversation history is long enough that the oldest turns have to be dropped.
In that case, you could at least cache the system prompt. This is also possible with https://github.com/OpenNMT/CTranslate2/blob/2203ad5c8baf878a...
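
To make the prefix argument concrete, here is a minimal sketch in plain Python. It does not use CTranslate2 or any real model; the token lists and the helpers `common_prefix_len` and `build_prompt` are hypothetical, made up for illustration. It only measures how many leading tokens two consecutive prompts share, which is the part a prefix cache could reuse.

```python
from typing import List, Sequence


def common_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading tokens shared by a and b (the reusable part of a prefix cache)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_prompt(system: List[str], turns: List[List[str]], max_turns: int) -> List[str]:
    """System prompt followed by only the most recent max_turns turns."""
    kept = turns[-max_turns:] if max_turns > 0 else []
    prompt = list(system)
    for turn in kept:
        prompt.extend(turn)
    return prompt


# Toy "tokens" (assumption: one string per token, purely illustrative).
system = ["<sys>", "be", "helpful"]
turns = [["u1", "a1"], ["u2", "a2"], ["u3", "a3"]]

# Case 1: the history still fits, so the old prompt is a prefix of the new one.
short = build_prompt(system, turns[:1], max_turns=2)
longer = build_prompt(system, turns[:2], max_turns=2)
reusable_no_drop = common_prefix_len(short, longer)

# Case 2: the oldest turn is dropped, so the prompts diverge right after the system prompt.
after_drop = build_prompt(system, turns, max_turns=2)
reusable_after_drop = common_prefix_len(longer, after_drop)

print(reusable_no_drop, len(short))        # whole previous prompt is reusable
print(reusable_after_drop, len(system))    # only the system prompt is reusable
```

Once a turn is dropped, the shared prefix shrinks to exactly the system prompt, which is why caching the system prompt is the part that still pays off.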

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

Related posts

Explore large language models on any computer with 512MB of RAM
4 projects | /r/LocalLLaMA | 17 Jun 2023

CTranslate2: An efficient inference engine for Transformer models
1 project | news.ycombinator.com | 21 May 2023

[D] Faster Flan-T5 inference
1 project | /r/MachineLearning | 22 Feb 2023

[P] CTranslate2: an efficient inference engine for Transformer models
1 project | /r/MachineLearning | 23 May 2022

GDlog: A GPU-Accelerated Deductive Engine
16 projects | news.ycombinator.com | 3 Dec 2023

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
neural-machine-translation CPP Mkl quantization Cuda
Post date: 3 Oct 2023
