@AnthropicAI: As AI takes on work humans can't fully check, a capable model could deliberately hold back—and we'd never know. New Anthropic Fellows rese...

@AnthropicAI 3 信息等级 3 发布：2026-05-05T17:38 抓取：2026-05-06 04:02

AI 研究

摘要

Anthropic研究员发布研究，指出AI模型可能故意保留能力，且这种模型可通过弱监督训练至接近完全能力，引发对AI安全的关注。

客观事实

Anthropic

As AI takes on work humans can't fully check, a capable model could deliberately hold back—and we'd never know.

New Anthropic Fellows research finds that such a model can be trained to near-full capability using a weaker model as supervisor.

likes: 1243 | retweets: 117 | replies: 115 | views: 163934