It indeed is. An attention mechanism's key and value matrices grow linearly with context length. With PagedAttention[1], which stores the KV cache in non-contiguous blocks, we could imagine an external service providing context. The hard part is the how, of course. We can't load our entire database into every conversation, and I suspect there are also training challenges (perhaps addressed via LandmarkAttention[2] and other mechanisms to efficiently retrieve additional key-value matrices).
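To put a rough number on that linear growth, here is a back-of-the-envelope sketch. The default shape (32 layers, 32 KV heads, head dimension 128, fp16) is my own illustrative assumption, not taken from either paper, and `kv_cache_bytes` is a made-up helper name:

```python
def kv_cache_bytes(context_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2):
    """Approximate KV cache size for one sequence.

    All default shapes are assumed values for illustration only.
    The leading 2 counts one key tensor plus one value tensor per layer.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len


if __name__ == "__main__":
    for n in (1_024, 4_096, 32_768):
        gib = kv_cache_bytes(n) / 2**30
        print(f"{n:>6} tokens: {gib:.1f} GiB")
```

Under those assumptions each token costs 512 KiB of cache, so doubling the context doubles the memory, which is why "just load everything" doesn't scale.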
To sustain 20-50 tokens/sec, any externally retrieved keys and values must arrive within 50-20 ms, i.e. inside a single token's decode step. Pausing the autoregressive transformer whenever it creates a Q vector stalls the whole batch, so we need a way to predict queries _ahead_ of where they'd be useful.
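The latency window is just the reciprocal of the decode rate. A tiny sketch of the arithmetic (`per_token_budget_ms` is a hypothetical helper, and it assumes the fetch has to fit inside one decode step):

```python
def per_token_budget_ms(tokens_per_sec):
    """Time available per generated token, in milliseconds."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return 1000.0 / tokens_per_sec


if __name__ == "__main__":
    for rate in (20, 50):
        print(f"{rate} tokens/sec -> {per_token_budget_ms(rate):.0f} ms per token")
```

Any network round trip plus lookup has to fit inside that window unless the query is issued several tokens early.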
[1] https://arxiv.org/abs/2309.06180
[2] https://arxiv.org/abs/2305.16300