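A tweet describing how ganging up multiple B200 8-GPU machines over RoCEv2 CX-7 Ethernet with Tomahawk switches, combined with the PD disaggregation inference optimization, raises per-GPU token throughput by up to 7x and lowers cost per million tokens by up to 7x.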
THE MORE U BUY, THE MORE U SAVE: By ganging up multiple B200 8-GPU machines over RoCEv2 CX-7 Ethernet with Tomahawk switches and applying an inference optimization called PD (prefill/decode) disaggregation, per-GPU token throughput increases by up to 7x. Because cost per million tokens scales inversely with per-GPU throughput at a fixed GPU price, that also cuts cost per million tokens by up to 7x.
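The throughput-to-cost arithmetic can be sketched in a few lines. This is a minimal illustration, not a benchmark: the hourly GPU price and the baseline tokens/sec below are hypothetical placeholders, and the sketch assumes the per-GPU hourly price stays the same after scaling out (it ignores extra networking cost for NICs and switches).

```python
def cost_per_million_tokens(gpu_hour_usd: float, tokens_per_sec_per_gpu: float) -> float:
    """Dollars to produce one million tokens on one GPU at a steady rate."""
    if tokens_per_sec_per_gpu <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = tokens_per_sec_per_gpu * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs, chosen only to make the arithmetic concrete.
GPU_HOUR_USD = 3.00      # assumed rental price per GPU-hour
BASELINE_TPS = 1_000.0   # assumed per-GPU tokens/sec without disaggregation
SPEEDUP = 7.0            # the "up to 7x" figure from the post

baseline_cost = cost_per_million_tokens(GPU_HOUR_USD, BASELINE_TPS)
disagg_cost = cost_per_million_tokens(GPU_HOUR_USD, BASELINE_TPS * SPEEDUP)

print(f"baseline: ${baseline_cost:.3f} per 1M tokens")
print(f"disagg:   ${disagg_cost:.3f} per 1M tokens")
print(f"cost reduction: {baseline_cost / disagg_cost:.1f}x")
```

Whatever placeholder values are plugged in, the cost ratio equals the throughput ratio, which is why a 7x per-GPU throughput gain maps to a 7x cost reduction under these assumptions.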
Great work by @inferact & @vllm_project on building this amazing OSS engine & by @NVIDIADC @KranenKyle on building the Dynamo inference orchestrator. More improvements to disagg B200 perf to come!
likes: 121 | retweets: 12 | replies: 4 | views: 22135