NVIDIA发布NCCL Inspector工具,集成Prometheus实现分布式深度学习训练的实时性能监控和调试,可加速诊断通信、计算等问题。
Distributed deep learning depends on fast, reliable GPU-to-GPU communication using the NVIDIA Collective Communication Library (NCCL). When training slows down,...Distributed deep learning depends on fast, reliable GPU-to-GPU communication using the NVIDIA Collective Communication Library (NCCL). When training slows down, it becomes challenging to determine why and what to do next. A problem can span computation, communication, a specific rank, or underlying hardware. NVIDIA NCCL Inspector accelerates triaging by providing a lightweight and continuous…
Source