
@SemiAnalysis_: A common misconception is that TPU v8i must be the training chip because it has two compute dies. Die count is not the relevant metric, what...

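@SemiAnalysis_ | Information level: 3 (scale: 1 noise/discard; 2 weak; 3 ordinary fact; 4 important industry development; 5 extremely significant event. The score measures information salience and is not investment advice.) | Published: 2026-05-04T17:00 | Scraped: 2026-05-05 04:04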
Summary

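SemiAnalysis points out a common misconception: TPU v8i is not the training chip but the inference chip. The v8i has 8 stacks of HBM3E 12-Hi totaling 288 GB, with 8.6 TB/s of bandwidth, versus 6 stacks, 216 GB, and 6.5 TB/s on the v8t. The v8i has 384 MB of on-chip SRAM versus 128 MB on the v8t. At FP4, the v8i delivers 10.1 PFLOPs and the v8t 12.6 PFLOPs.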

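Objective facts
  • TPU v8i has 8 stacks of HBM3E 12-Hi, totaling 288 GB of memory with 8.6 TB/s of bandwidth
  • TPU v8t has 6 stacks of HBM3E, totaling 216 GB of memory with 6.5 TB/s of bandwidth
  • TPU v8i delivers 10.1 PFLOPs at FP4; TPU v8t delivers 12.6 PFLOPs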
TPU v8i TPU v8t Google HBM3E

Original text

A common misconception is that TPU v8i must be the training chip because it has two compute dies. Die count is not the relevant metric; what matters is the balance between compute throughput and memory capacity/bandwidth.

Reason 1: Memory capacity and bandwidth

TPU v8i has 8 stacks of HBM3E 12-Hi versus 6 on TPU v8t, giving it 288 GB of HBM and 8.6 TB/s of memory bandwidth versus 216 GB and 6.5 TB/s on the training chip. This matters because inference decode is memory-bandwidth-bound, not compute-bound. The 8i also carries 384 MB of on-chip SRAM versus 128 MB on the 8t, providing more buffer for KV cache and attention operations.
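To make the bandwidth-bound claim concrete, here is a minimal sketch that assumes a simple model in which each generated token requires streaming a fixed number of bytes (weights plus KV cache) from HBM. The 100 GB per-token figure and the function name are hypothetical, chosen only for illustration; the bandwidth numbers are the ones quoted above, read as decimal units.

```python
# Bandwidth figures quoted in the post (assumed decimal: 1 TB/s = 1e12 bytes/s).
HBM_BANDWIDTH_BYTES_PER_S = {"TPU v8i": 8.6e12, "TPU v8t": 6.5e12}

# Hypothetical workload, not from the post: bytes read from HBM per generated token.
HYPOTHETICAL_BYTES_PER_TOKEN = 100e9


def decode_tokens_per_s_upper_bound(bandwidth_bytes_per_s, bytes_read_per_token):
    """Upper bound on single-sequence decode rate when memory traffic dominates."""
    return bandwidth_bytes_per_s / bytes_read_per_token


bounds = {
    chip: decode_tokens_per_s_upper_bound(bw, HYPOTHETICAL_BYTES_PER_TOKEN)
    for chip, bw in HBM_BANDWIDTH_BYTES_PER_S.items()
}

for chip, bound in bounds.items():
    print(f"{chip}: at most {bound:.0f} tokens/s per sequence")
```

Under this assumption the bound scales linearly with bandwidth, so the 8i's ceiling is about 1.32x the 8t's (8.6 / 6.5) regardless of the per-token byte count chosen.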

Reason 2: The training chip achieves higher FP4 FLOPs from a single die

Despite having two compute dies, TPU v8i achieves only 10.1 PFLOPs at FP4, while the single-die TPU v8t achieves 12.6 PFLOPs. Google designed the 8t's die to be extremely compute-dense, maximizing MXU throughput for training's sustained high arithmetic intensity. This also seems to highlight Google's broader direction: Google is attempting to train with FP4, a regime where the 8t's dense single die excels.
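The "balance" between compute and memory can be put in numbers with a roofline-style ratio of peak FLOPs to HBM bandwidth. The roofline framing is an assumption added here and is not named in the post. The sketch below uses only the figures quoted above, read as decimal units; the dictionary keys and function name are illustrative.

```python
# Figures quoted in the post (assumed decimal: 1 PFLOP = 1e15 FLOPs, 1 TB/s = 1e12 bytes/s).
SPECS = {
    "TPU v8i": {"fp4_flops": 10.1e15, "hbm_bytes_per_s": 8.6e12, "hbm_gb": 288, "hbm_stacks": 8},
    "TPU v8t": {"fp4_flops": 12.6e15, "hbm_bytes_per_s": 6.5e12, "hbm_gb": 216, "hbm_stacks": 6},
}


def ridge_point(spec):
    """FP4 FLOPs available per byte of HBM traffic (the roofline ridge point)."""
    return spec["fp4_flops"] / spec["hbm_bytes_per_s"]


ridge = {chip: ridge_point(spec) for chip, spec in SPECS.items()}
gb_per_stack = {chip: spec["hbm_gb"] / spec["hbm_stacks"] for chip, spec in SPECS.items()}

for chip in SPECS:
    print(
        f"{chip}: {ridge[chip]:.0f} FP4 FLOPs per HBM byte, "
        f"{gb_per_stack[chip]:.0f} GB per HBM stack"
    )
```

On these numbers the 8i comes out at roughly 1174 FP4 FLOPs per HBM byte and the 8t at roughly 1938, so a kernel needs about 65% higher arithmetic intensity to saturate the 8t's compute. That fits the post's reading of the 8t as the compute-heavy training part and the 8i as the memory-heavy inference part. Both chips work out to 36 GB per stack, so the capacity gap comes entirely from stack count.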

likes: 182 | retweets: 23 | replies: 6 | views: 31295