-
Binary: vanilla llama.cpp, build-mtp/bin/llama-server, built from the MTP PR branch (commit ebe4fca, PR #22673). PR #22673 merged to master on 2026-05-16, so any master checkout after that date ships --spec-type mtp natively.
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
-
lucebox-hub
Lucebox optimization hub: hand-tuned LLM inference, built for specific consumer hardware.
The Dell T5820 install was the hardware story (companion post forthcoming). DFlash was the software follow-up. Initial scan of Luce-Org/lucebox-hub (advertising 3.43x decode + 10x TTFT on RTX 3090) ran into the same blocker: their daemon is a raw generate primitive with no OpenAI API, no jinja chat templates, no tool calling. Slotting it behind Hermes/k2 would need a chat-template shim written from scratch.
-
beellama.cpp
DFlash & TurboQuant in llama.cpp with up to 3x faster generation and 7.5x more KV cache in same VRAM
BeeLlama.cpp by Anbeeld already had the shim baked in: DFlash speculative decoding, TurboQuant KV cache, and CopySpec fallback layered onto the OpenAI server with --jinja and tool-call detection preserved. Different binary. Same flags Hermes needed.