@SemiAnalysis_: SPEED IS THE MOAT: AMD ROCm software stack has improved performance by over 75x in the last 14 days since DeepSeekv4 launch. The performance...

@SemiAnalysis_ 3 Information level 3 Published: 2026-05-10T17:00 Fetched: 2026-05-11 04:03

AI Semiconductor Compute

Summary

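The AMD ROCm software stack improved performance by more than 75x in the 14 days after the DeepSeekv4 launch, by fusing mHC operations and RoPE Hadamard transformations to reduce CPU overhead and improve HBM utilization. In addition, the attention indexer and KVCache compressor were written in TileLang and Triton to speed up development. Next targets: another 5x to match single-node aggregated B200 performance, then another 1.5x to match PD disaggregated B200 performance.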

Objective facts

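The AMD ROCm software stack improved performance by more than 75x within 14 days
Improvements include fusing mHC operations and RoPE Hadamard transformations
Targets: another 5x to catch up to single-node aggregated B200, then another 1.5x to catch up to PD disaggregated B200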
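
The factors above can be combined into a rough picture of the remaining gap. The following is a minimal sketch, assuming the reported multipliers compound multiplicatively and treating "over 75x" as exactly 75x (a lower bound); the variable names are illustrative and do not come from the post.

```python
# Assumption: the reported factors multiply, and "over 75x" is taken as 75x.
gain_so_far = 75.0               # improvement in the 14 days since launch
to_single_node_b200 = 5.0        # further gain needed to match single-node B200
to_pd_disaggregated_b200 = 1.5   # further gain on top of that for PD disaggregated B200

# Remaining gap from today's ROCm performance to PD disaggregated B200.
remaining_gap = to_single_node_b200 * to_pd_disaggregated_b200

# Cumulative gain relative to the launch-day baseline at each target.
total_at_single_node = gain_so_far * to_single_node_b200
total_at_pd_disaggregated = gain_so_far * remaining_gap

print(f"Remaining gap to PD disaggregated B200: {remaining_gap}x")
print(f"Cumulative vs. launch at single-node parity: {total_at_single_node}x")
print(f"Cumulative vs. launch at PD disaggregated parity: {total_at_pd_disaggregated}x")
```

Under these assumptions, the remaining gap to PD disaggregated B200 is 7.5x. That is small next to the 75x already gained, which is consistent with the post's claim that parity is within reach.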

AMD ROCm DeepSeekv4 B200 TileLang Triton

Original text

SPEED IS THE MOAT: AMD ROCm software stack has improved performance by over 75x in the last 14 days since DeepSeekv4 launch. The performance comes from fusing mHC operations & also fusing RoPE Hadamard transformations to reduce CPU overhead & improve HBM memory utilization. Furthermore, other kernels like the attention indexer & kvcache compressor have been written using TileLang & Triton for fast development velocity.

Another 5x performance improvement is needed to catch up to single node aggregated B200 performance & then another 1.5x is needed to catch up to PD disaggregated B200 performance, which is within the realm of possibility for AMD within the next couple of weeks. Great work to HaiShaw, Thomas, @roaner, @AnushElangovan for this rapid improvement.

likes: 627 | retweets: 58 | replies: 21 | views: 97328