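The tweet notes that LLM training relies on fast matrix multiplication, but many of the surrounding operations are still memory-bound. The CODA method reparameterizes these kernels to optimize them.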
RT @HanGuo97: LLM training is built on fast MatMuls. But many surrounding ops still run as memory-bound kernels.
CODA reparameterizes them…
likes: 381 | retweets: 64 | replies: 6 | views: 90059