@AnthropicAI: In new Anthropic Fellows research, we discuss “introspection adapters": a tool that allows language models to self-report behaviors they've ...

@AnthropicAI 3 Information level 3 Published: 2026-04-29T19:46 Scraped: 2026-05-03 15:25

🔗 Original link

AI Research

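Summary

Anthropic Fellows have published new research introducing "introspection adapters", a tool that lets language models self-report behaviors they learned during training, including potential misalignment.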

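Objective facts

Anthropic is researching an introspection adapter tool
Language models can self-report behaviors learned during training
The tool can identify potential alignment problems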

Anthropic, Language models

Original text

In new Anthropic Fellows research, we discuss “introspection adapters": a tool that allows language models to self-report behaviors they've learned during training—including potential misalignment.

likes: 1422 | retweets: 138 | replies: 136 | views: 205061