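NVIDIA has released Dynamo Snapshot, a technology that speeds up cold starts of inference workloads on Kubernetes, reducing GPU idle time and helping avoid SLA violations.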
The cold-start problem
In production inference deployments, demand fluctuates over time, requiring inference replicas to scale elastically. However, cold-starting inference workloads on Kubernetes can take several minutes. During that time, GPUs are allocated but idle, generating no tokens and serving no requests. This delay increases the risk of service level agreement (SLA) violations during traffic spikes…