@SemiAnalysis_: PDOOM ALERT 🚨 : ~48% of e2e LLM latency is prefill, ~52% is decode. Prefill itself breaks into 2 ops: 🟠 Prefill extend (cache write) — inge...

@SemiAnalysis_ | Information level: 3 | Published: 2026-05-26T23:00 | Fetched: 2026-05-26 23:18

AI Compute

Summary

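SemiAnalysis published a breakdown of LLM inference latency: prefill accounts for about 48% of end-to-end latency and decode for about 52%. Prefill in turn splits into two operations: prefill extend (cache write) and cache read.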

Objective facts

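Prefill accounts for about 48% of end-to-end LLM latency
Decode accounts for about 52% of end-to-end LLM latency
Prefill splits into prefill extend (cache write) and cache read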

SemiAnalysis

Original text

PDOOM ALERT 🚨 : ~48% of e2e LLM latency is prefill, ~52% is decode. Prefill itself breaks into 2 ops:

🟠 Prefill extend (cache write) — ingests new context/files, writes fresh KV tokens
🟠 Cache read — reuses existing KV cache from prior turns https://t.co/zzKrZFZKhX

likes: 10 | retweets: 0 | replies: 0 | views: 1980
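
The two prefill operations named in the post can be illustrated with a minimal Python sketch. It assumes that cache reuse is decided by the longest token prefix shared between the prior turn and the new prompt, which is a common prefix-caching approach but is not stated in the post. The names `PrefillPlan` and `plan_prefill` are hypothetical, and the sketch only counts tokens; it does not model latency.

```python
from dataclasses import dataclass


@dataclass
class PrefillPlan:
    cache_read_tokens: int      # tokens whose existing KV entries are reused
    prefill_extend_tokens: int  # tokens whose KV entries are computed and written


def plan_prefill(cached_tokens: list[int], prompt_tokens: list[int]) -> PrefillPlan:
    """Split a prompt into a reusable cached prefix and a fresh suffix.

    Assumption: reuse is only possible for the longest common prefix.
    """
    shared = 0
    for cached, new in zip(cached_tokens, prompt_tokens):
        if cached != new:
            break
        shared += 1
    return PrefillPlan(shared, len(prompt_tokens) - shared)


# A prior turn left five tokens in the cache; the next turn resends
# them and appends three new tokens (for example, a newly attached file).
plan = plan_prefill([1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6, 7, 8])
print(plan)
```

In this example the five repeated tokens fall under cache read and the three appended tokens fall under prefill extend.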