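Semianalysis published an LLM inference latency analysis: prefill accounts for about 48% of end-to-end latency and decode for about 52%; prefill in turn splits into prefill extend (cache write) and cache read.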
PDOOM ALERT 🚨 : ~48% of e2e LLM latency is prefill, ~52% is decode. Prefill itself breaks into 2 ops:
🟠 Prefill extend (cache write) — ingests new context/files, writes fresh KV tokens
🟠 Cache read — reuses existing KV cache from prior turns https://t.co/zzKrZFZKhX
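
To make the two prefill ops concrete, here is a minimal sketch of how a prompt could be split into a cache-read part and a prefill-extend part. It assumes simple prefix caching, where only the longest prefix shared with the previous turn is reusable; the post does not say how the split is actually computed. `PrefillSplit`, `split_prefill`, and the token IDs are hypothetical names and values made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class PrefillSplit:
    cache_read_tokens: int      # tokens whose KV entries are reused from prior turns
    prefill_extend_tokens: int  # new tokens whose KV entries must be computed and written


def split_prefill(cached_tokens: list[int], prompt_tokens: list[int]) -> PrefillSplit:
    """Split a prompt into a cache-read part and a prefill-extend part.

    Assumption: prefix caching, so only the longest shared prefix is reusable.
    """
    shared = 0
    for cached, new in zip(cached_tokens, prompt_tokens):
        if cached != new:
            break
        shared += 1
    return PrefillSplit(shared, len(prompt_tokens) - shared)


# Turn 1 left these token IDs in the KV cache (hypothetical values).
turn_1 = [101, 7, 8, 9]
# Turn 2 resends the conversation and appends the tokens of a new file.
turn_2 = turn_1 + [42, 43, 44, 45, 46]

split = split_prefill(turn_1, turn_2)
print(split)  # PrefillSplit(cache_read_tokens=4, prefill_extend_tokens=5)
```

In this toy case the 4 tokens from the earlier turn are served by cache read, and the 5 new tokens go through prefill extend, which writes their KV entries.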