Kimi Prefill-as-a-Service Splits LLM Inference for 1.54× Throughput

Kimi's Prefill-as-a-Service paper proposes separating LLM inference into independent prefill and decode services that can run on different hardware. Reported results on a 20×-scaled Kimi Linear model: 1.54× throughput and 64% lower P90 time to first token (TTFT).

1 min read | agenticonsult Intelligence

Kimi / Moonshot AI has published a paper proposing Prefill-as-a-Service: splitting the two phases of LLM inference, compute-heavy prompt prefill and latency-sensitive token-by-token decode, into independent services that can run on different hardware nodes or in different data centers. The key enabler is an optimized KV-cache representation that makes it feasible to transfer the prefill state across the network. On a 20×-scaled Kimi Linear model, the paper reports 1.54× higher throughput, 64% lower P90 time to first token (TTFT), and lower cost per generated token than co-located inference.
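To see why the size of the KV cache decides whether a cross-network handoff is practical, a back-of-envelope calculation helps. The sketch below is not from the paper: it uses the textbook size formula for a standard attention KV cache, and every number in it (layer count, head dimensions, prompt length, link speed, compression factors) is a hypothetical placeholder rather than a Kimi Linear figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVCacheShape:
    """Hypothetical model dimensions; none of these come from the paper."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int  # e.g. 2 for fp16/bf16


def kv_cache_bytes(shape: KVCacheShape, prompt_tokens: int) -> int:
    """Size of a standard attention KV cache for one prompt.

    Each layer stores one key and one value vector per token per KV head,
    hence the leading factor of 2.
    """
    per_token = (
        2
        * shape.num_layers
        * shape.num_kv_heads
        * shape.head_dim
        * shape.bytes_per_value
    )
    return per_token * prompt_tokens


def transfer_seconds(num_bytes: float, link_gbit_per_s: float) -> float:
    """Time to move num_bytes over a link, ignoring latency and protocol overhead."""
    return num_bytes * 8 / (link_gbit_per_s * 1e9)


if __name__ == "__main__":
    # Assumed example: 32 layers, 8 KV heads of dimension 128, 16-bit values.
    shape = KVCacheShape(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2)
    size = kv_cache_bytes(shape, prompt_tokens=32_000)

    # Assumed compression factors, standing in for a more compact cache format.
    for factor in (1, 4, 16):
        compact = size / factor
        secs = transfer_seconds(compact, link_gbit_per_s=100)
        print(f"{factor:>2}x smaller cache: {compact / 1e9:.2f} GB, {secs:.3f} s to transfer")
```

With these made-up dimensions, the uncompressed cache for a 32,000-token prompt is about 4.2 GB and needs roughly a third of a second on a 100 Gbit/s link. That delay lands directly on time to first token, which is why a more compact prefill state matters for running prefill and decode on separate nodes.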

Why It Matters

Independent scaling of prefill and decode is the kind of systems insight that tends to become standard practice within 12 to 18 months of publication. For any team running high-throughput LLM serving, separating these phases removes the GPU scheduling conflict between a batched, compute-bound workload (prefill) and a streaming, latency-bound one (decode), which directly reduces inference cost at scale.

This news brief was compiled with AI assistance from the cited primary source. It is intended for quick orientation; for the authoritative account, please consult the original publication.
