NVIDIA GB200芯片改变了服务大型MoE模型(如Qwen)时的prefill和decode分离方式,相比Hopper芯片有吞吐量优势,团队已发表量化对比结果。
GB 200s change how one does the prefill and decode disaggregation when serving large MoEs like Qwen. We’ve published details of our stack quantifying the throughput benefits compared to serving on Hoppers.
likes: 61 | retweets: 4 | replies: 6 | views: 9548