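SemiAnalysis points out a common misconception: TPU v8i is not the training chip but the inference chip. The v8i carries 8 stacks of HBM3E 12-Hi memory, 288 GB in total with 8.6 TB/s of bandwidth, versus 6 stacks, 216 GB, and 6.5 TB/s on the v8t. The v8i has 384 MB of on-chip SRAM versus 128 MB on the v8t. At FP4, the v8i delivers 10.1 PFLOPs and the v8t 12.6 PFLOPs.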
A common misconception is that TPU v8i must be the training chip because it has two compute dies. Die count is not the relevant metric; what matters is the balance between compute throughput and memory capacity and bandwidth.
Reason 1: Memory capacity and bandwidth
TPU v8i has 8 stacks of HBM3E 12-Hi versus 6 on TPU v8t, giving it 288 GB of HBM and 8.6 TB/s of memory bandwidth versus 216 GB and 6.5 TB/s on the training chip. This matters because inference decode is memory-bandwidth-bound, not compute-bound. The 8i also carries 384 MB of on-chip SRAM versus 128 MB on the 8t, providing more buffer for KV cache and attention operations.
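A rough back-of-envelope sketch of why bandwidth matters for decode: if each decode step has to stream the resident weights and KV cache from HBM once, the step time cannot be lower than bytes read divided by bandwidth. The 200 GB working set and the function name below are illustrative assumptions, not figures from the post; only the bandwidth numbers come from the text above.

```python
def min_decode_step_ms(bytes_read_gb: float, bandwidth_tbps: float) -> float:
    """Lower bound (ms) on one decode step that streams bytes_read_gb from HBM."""
    seconds = (bytes_read_gb * 1e9) / (bandwidth_tbps * 1e12)
    return seconds * 1e3

# HBM bandwidth in TB/s, as quoted in the post.
BANDWIDTH_TBPS = {"TPU v8i": 8.6, "TPU v8t": 6.5}

# Hypothetical working set (weights + KV cache); chosen to fit in both chips' HBM.
WORKING_SET_GB = 200

for chip, bw in BANDWIDTH_TBPS.items():
    floor_ms = min_decode_step_ms(WORKING_SET_GB, bw)
    print(f"{chip}: at least {floor_ms:.1f} ms per decode step")
```

This ignores batching, SRAM reuse, and overlap with compute, so it is a floor under a simplified model rather than a performance prediction. It does show that, for the same working set, the chip with more bandwidth has the lower floor.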
Reason 2: The training chip achieves higher FP4 FLOPs from a single die
Despite having two compute dies, TPU v8i achieves only 10.1 PFLOPs at FP4, while the single-die TPU v8t achieves 12.6 PFLOPs. Google designed the 8t's die to be extremely compute-dense, maximizing MXU throughput for training's sustained high arithmetic intensity. This also seems to highlight Google's broader direction: Google is attempting to train with FP4, a regime where the 8t's dense single die excels.
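One way to put a number on the compute-versus-memory balance is the roofline "ridge point": peak FLOPs divided by memory bandwidth, which is the arithmetic intensity (FLOPs per byte of HBM traffic) a workload needs before the chip becomes compute-bound. The sketch below uses only the FP4 and bandwidth figures quoted in the post; the function name is an assumption, and peak numbers are treated as achievable, which real workloads rarely manage.

```python
def ridge_point(peak_pflops: float, bandwidth_tbps: float) -> float:
    """FLOPs per byte of HBM traffic needed to reach the compute roof."""
    return (peak_pflops * 1e15) / (bandwidth_tbps * 1e12)

# (FP4 peak in PFLOPs, HBM bandwidth in TB/s), as quoted in the post.
CHIPS = {
    "TPU v8i": (10.1, 8.6),
    "TPU v8t": (12.6, 6.5),
}

for chip, (pflops, bw) in CHIPS.items():
    print(f"{chip}: ridge point ~{ridge_point(pflops, bw):.0f} FLOPs/byte")
```

Under these figures the 8t needs considerably higher arithmetic intensity to keep its MXUs busy, which suits training's large matrix multiplies, while the 8i reaches its compute roof at lower intensity, which suits bandwidth-bound decode.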