PowerInfer: Fast Large Language Model Serving with a Consumer-Grade GPU [pdf]

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • PowerInfer

    High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

  • > PowerInfer’s source code is publicly available at https://github.com/SJTU-IPADS/PowerInfer

  • llama.cpp

    LLM inference in C/C++

  • This is basically a fork of llama.cpp. I created a PR to see the diff and added my comments on it: https://github.com/ggerganov/llama.cpp/pull/4543

    One thing that caught my interest is this line from their readme:

    > PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers.

    Apple's Metal/M3 is perfect for this because the CPU and GPU share memory, so no data transfers are needed at all. Check out mlx from Apple: https://github.com/ml-explore/mlx (see the sketch of the hot/cold split after the project list below)

  • mlx

    MLX: An array framework for Apple silicon

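The hot/cold split described in the quoted readme line can be illustrated with a small, CPU-only sketch. This is not PowerInfer's actual code or API: the layer shape, the activation-frequency profile, and the 0.5 "hot" threshold are all made up for illustration, and both partitions run on the CPU here, whereas in PowerInfer the hot partition would be preloaded onto the GPU.

    // Toy illustration of the hot/cold neuron idea (hypothetical, not PowerInfer code):
    // output neurons that fire often ("hot") go in one partition, the rest ("cold")
    // in another, and the two partial results are merged after the matrix-vector step.
    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <numeric>
    #include <vector>

    struct FfnLayer {
        std::size_t n_in;
        std::size_t n_out;
        std::vector<float> w;         // row-major weights: n_out rows of n_in
        std::vector<float> act_freq;  // profiled activation frequency per output neuron (made up)
    };

    // Partition output neurons into hot/cold index sets by activation frequency.
    static void split_neurons(const FfnLayer &l, float hot_threshold,
                              std::vector<std::size_t> &hot, std::vector<std::size_t> &cold) {
        for (std::size_t i = 0; i < l.n_out; ++i)
            (l.act_freq[i] >= hot_threshold ? hot : cold).push_back(i);
    }

    // Compute one partition: y[i] = relu(w_row_i . x) for each neuron index in idx.
    // In PowerInfer the "hot" call would run on the GPU; here both run on the CPU.
    static void compute_partition(const FfnLayer &l, const std::vector<float> &x,
                                  const std::vector<std::size_t> &idx, std::vector<float> &y) {
        for (std::size_t i : idx) {
            float acc = std::inner_product(x.begin(), x.end(),
                                           l.w.begin() + i * l.n_in, 0.0f);
            y[i] = std::max(acc, 0.0f);  // ReLU: many cold neurons stay at exactly 0
        }
    }

    int main() {
        FfnLayer l{4, 6, std::vector<float>(6 * 4, 0.1f),
                   {0.9f, 0.05f, 0.8f, 0.02f, 0.7f, 0.01f}};  // invented profile
        std::vector<float> x{1.0f, 2.0f, 3.0f, 4.0f};

        std::vector<std::size_t> hot, cold;
        split_neurons(l, 0.5f, hot, cold);       // the "preload onto GPU" decision

        std::vector<float> y(l.n_out, 0.0f);
        compute_partition(l, x, hot, y);         // GPU part in PowerInfer
        compute_partition(l, x, cold, y);        // CPU part
        for (float v : y) std::cout << v << ' ';
        std::cout << '\n';
    }

The point of the split is that only the small, frequently used hot partition needs to sit in GPU memory, which is why the readme claims reduced GPU memory demands and fewer CPU-GPU transfers; on Apple Silicon with unified memory the transfer part of that problem largely disappears, which is what the comment above is getting at.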
