PowerInfer: Fast Large Language Model Serving with a Consumer-Grade GPU [pdf]

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • PowerInfer

    High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

  • > PowerInfer’s source code is publicly available at https://github.com/SJTU-IPADS/PowerInfer

  • llama.cpp

    LLM inference in C/C++

  • This is basically a fork of llama.cpp. I created a PR to see the diff and added my comments on it: https://github.com/ggerganov/llama.cpp/pull/4543

    One thing that caught my interest is this line from their readme:

    > PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers.

    Apple's Metal/M3 is perfect for this because the CPU and GPU share memory, so no data transfers are needed at all. Check out mlx from Apple: https://github.com/ml-explore/mlx (see the sketch of the hot/cold split after the project list below)

  • mlx

    MLX: An array framework for Apple silicon

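The hot/cold split described in the quoted readme line can be illustrated with a small, CPU-only sketch. This is not PowerInfer's actual code or API: the layer shape, the activation-frequency profile, and the 0.5 "hot" threshold are all made up for illustration, and both partitions run on the CPU here, whereas in PowerInfer the hot partition would be preloaded onto the GPU.

    // Toy illustration of the hot/cold neuron idea (hypothetical, not PowerInfer code):
    // output neurons that fire often ("hot") go in one partition, the rest ("cold")
    // in another, and the two partial results are merged after the matrix-vector step.
    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <numeric>
    #include <vector>

    struct FfnLayer {
        std::size_t n_in;
        std::size_t n_out;
        std::vector<float> w;         // row-major weights: n_out rows of n_in
        std::vector<float> act_freq;  // profiled activation frequency per output neuron (made up)
    };

    // Partition output neurons into hot/cold index sets by activation frequency.
    static void split_neurons(const FfnLayer &l, float hot_threshold,
                              std::vector<std::size_t> &hot, std::vector<std::size_t> &cold) {
        for (std::size_t i = 0; i < l.n_out; ++i)
            (l.act_freq[i] >= hot_threshold ? hot : cold).push_back(i);
    }

    // Compute one partition: y[i] = relu(w_row_i . x) for each neuron index in idx.
    // In PowerInfer the "hot" call would run on the GPU; here both run on the CPU.
    static void compute_partition(const FfnLayer &l, const std::vector<float> &x,
                                  const std::vector<std::size_t> &idx, std::vector<float> &y) {
        for (std::size_t i : idx) {
            float acc = std::inner_product(x.begin(), x.end(),
                                           l.w.begin() + i * l.n_in, 0.0f);
            y[i] = std::max(acc, 0.0f);  // ReLU: many cold neurons stay at exactly 0
        }
    }

    int main() {
        FfnLayer l{4, 6, std::vector<float>(6 * 4, 0.1f),
                   {0.9f, 0.05f, 0.8f, 0.02f, 0.7f, 0.01f}};  // invented profile
        std::vector<float> x{1.0f, 2.0f, 3.0f, 4.0f};

        std::vector<std::size_t> hot, cold;
        split_neurons(l, 0.5f, hot, cold);       // the "preload onto GPU" decision

        std::vector<float> y(l.n_out, 0.0f);
        compute_partition(l, x, hot, y);         // GPU part in PowerInfer
        compute_partition(l, x, cold, y);        // CPU part
        for (float v : y) std::cout << v << ' ';
        std::cout << '\n';
    }

The point of the split is that only the small, frequently used hot partition needs to sit in GPU memory, which is why the readme claims reduced GPU memory demands and fewer CPU-GPU transfers; on Apple Silicon with unified memory the transfer part of that problem largely disappears, which is what the comment above is getting at.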
