一条推特讨论AI基准测试趋势,指出benchmaxxing时代已结束,业界不再严重依赖训练集作弊;持续更新的基准测试如LCB出现;大约2025年初,模型在截断后切片上不再崩溃。
no, that era has ended, nobody is seriously benchmaxxing now. Originally this term meant training on actual test, at best on paraphrases, and we came up with continuously updated benchmarks like LCB. Around early 2025 (≈R1), models stopped crashing on post-cutoff slices. https://t.co/Y87LKngIXk
likes: 34 | retweets: 0 | replies: 2 | views: 4553