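NVIDIA has released DynoSim, a simulation tool for modeling the Pareto frontier of LLM serving. It helps choose configurations across interacting layers such as model backend, tensor parallelism, and prefill/decode split, addressing the difficulty of tuning modern LLM serving.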
Modern LLM serving is hard to tune because each deployment is a stack of interacting choices: model backend, tensor-parallel shape, prefill/decode split, worker counts, scheduler settings, routing policy, KV cache behavior, autoscaling thresholds, and topology. Those choices interact across layers, and a local improvement can shift the bottleneck somewhere else. For larger models…
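To make the idea of a Pareto frontier over serving configurations concrete, here is a minimal Python sketch. It is not DynoSim's API or method: the `ConfigResult` type, the configuration names, and every number below are hypothetical, chosen only for illustration. A configuration is on the frontier when no other configuration is at least as good on both throughput and tail latency and strictly better on one of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigResult:
    """One evaluated serving configuration (all values here are made up)."""
    name: str
    throughput_tok_s: float  # higher is better
    p99_latency_ms: float    # lower is better


def dominates(a: ConfigResult, b: ConfigResult) -> bool:
    """True if `a` is no worse than `b` on both metrics and better on one."""
    no_worse = (a.throughput_tok_s >= b.throughput_tok_s
                and a.p99_latency_ms <= b.p99_latency_ms)
    strictly_better = (a.throughput_tok_s > b.throughput_tok_s
                       or a.p99_latency_ms < b.p99_latency_ms)
    return no_worse and strictly_better


def pareto_frontier(results: list[ConfigResult]) -> list[ConfigResult]:
    """Keep the non-dominated configurations, sorted by latency."""
    frontier = [r for r in results
                if not any(dominates(other, r) for other in results)]
    return sorted(frontier, key=lambda r: r.p99_latency_ms)


# Hypothetical measurements, not taken from any real benchmark.
results = [
    ConfigResult("tp2-aggregated", 1800.0, 420.0),
    ConfigResult("tp4-aggregated", 2400.0, 380.0),
    ConfigResult("tp4-disagg-1p3d", 3100.0, 510.0),
    ConfigResult("tp8-disagg-2p2d", 2900.0, 560.0),
    ConfigResult("tp8-aggregated", 2600.0, 300.0),
]

frontier = pareto_frontier(results)
for r in frontier:
    print(f"{r.name}: {r.throughput_tok_s:.0f} tok/s, p99 {r.p99_latency_ms:.0f} ms")
```

The pairwise check is quadratic in the number of configurations, which is fine for a sketch. A real search space that also covers scheduler, routing, and autoscaling settings would likely need more metrics and a smarter search than exhaustive comparison.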