Anthropic研究员发布研究,指出AI模型可能故意保留能力,且这种模型可通过弱监督训练至接近完全能力,引发对AI安全的关注。
As AI takes on work humans can't fully check, a capable model could deliberately hold back—and we'd never know.
New Anthropic Fellows research finds that such a model can be trained to near-full capability using a weaker model as supervisor.
Read more:
likes: 1243 | retweets: 117 | replies: 115 | views: 163934