Kimi Prefill-as-a-Service Splits LLM Inference for 1.54× Throughput

Kimi / Moonshot AI published a paper proposing Prefill-as-a-Service: splitting the two phases of LLM inference, the compute-heavy prompt prefill and the latency-sensitive token-by-token decode, into independent services that can run on separate hardware nodes or even in separate data centers. The key enabler is an optimized KV-cache representation that makes it practical to ship the prefill state across the network. On a 20×-scaled Kimi Linear model, the authors report 1.54× higher throughput, 64% lower P90 Time to First Token, and lower cost per generated token than co-located inference.
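
The mechanics are easy to see in a toy model. The sketch below (Python/NumPy, a single attention layer; all names, sizes, and the fp16 downcast are illustrative, not from the paper) mimics the split: the prefill service runs one batched pass over the prompt and ships its KV cache, downcast to fp16 as a crude stand-in for the paper's optimized wire representation, while the decode service restores that cache and extends it one token at a time. Both sides share the weights, just as both node pools would load the same checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D = 100, 32                          # toy sizes, illustrative only
E  = rng.standard_normal((VOCAB, D)) * 0.1  # token embeddings (shared weights:
Wq = rng.standard_normal((D, D)) * 0.1      # both "nodes" load the same model)
Wk = rng.standard_normal((D, D)) * 0.1
Wv = rng.standard_normal((D, D)) * 0.1
Wo = rng.standard_normal((D, VOCAB)) * 0.1  # output head

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def prefill_service(prompt: list[int]):
    """Prefill node: one batched, compute-bound pass over the whole prompt.
    Returns the KV cache (downcast to fp16 for the wire) plus the first
    generated token."""
    h = E[prompt]                           # [n, D] hidden states
    K, V = h @ Wk, h @ Wv                   # [n, D] keys/values, all positions
    q = h[-1] @ Wq                          # query at the last prompt position
    ctx = softmax(q @ K.T / np.sqrt(D)) @ V
    first_token = int(np.argmax(ctx @ Wo))
    return K.astype(np.float16), V.astype(np.float16), first_token

def decode_service(K, V, token: int, max_new: int = 5) -> list[int]:
    """Decode node: restores the shipped cache, then extends it one token
    at a time (greedy decoding) -- the streaming, latency-bound half."""
    K, V = K.astype(np.float64), V.astype(np.float64)
    out = [token]
    for _ in range(max_new):
        h = E[token]
        K = np.vstack([K, h @ Wk])          # append this token's key/value
        V = np.vstack([V, h @ Wv])
        ctx = softmax((h @ Wq) @ K.T / np.sqrt(D)) @ V
        token = int(np.argmax(ctx @ Wo))
        out.append(token)
    return out

K, V, first = prefill_service([1, 2, 3, 4])  # would run on the prefill node
print(decode_service(K, V, first))           # would run on the decode node
```

Note what crosses the boundary: only the cache and one token, never the prompt or the weights, which is why the size of the KV-cache representation dictates the transfer cost.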

Why It Matters

Independent scaling of prefill and decode is the kind of systems insight that becomes standard practice within 12–18 months of publication. For any team running high-throughput LLM serving, separating the two phases eliminates the GPU scheduling conflict between a batched, compute-bound workload and a streaming, latency-bound one, directly reducing inference cost at scale.