OpenAI 发布分析,指出思维链监控是防御 AI 代理失调的关键层,为避免惩罚失调推理而保持可监控性,并发现有限数量的意外思维链评分影响了已发布模型。
Chain of thought monitors are a key layer of defense against AI agent misalignment. To preserve monitorability, we avoid penalizing misaligned reasoning during RL.
We found a limited amount of accidental CoT grading which affected released models, and are sharing our analysis.
https://t.co/0o3PLfafC4
likes: 1867 | retweets: 157 | replies: 212 | views: 226810