Anthropic Fellows 发布新研究,介绍“内省适配器”工具,使语言模型能自我报告训练中习得的行为,包括潜在的不对齐。
In new Anthropic Fellows research, we discuss “introspection adapters": a tool that allows language models to self-report behaviors they've learned during training—including potential misalignment.
likes: 1422 | retweets: 138 | replies: 136 | views: 205061