Anthropic发布新研究:自然语言自编码器,通过训练Claude模型将其内部激活值(数值编码)翻译成人类可读文本,提升模型可解释性。
New Anthropic research: Natural Language Autoencoders.
Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read.
Here, we train Claude to translate its activations into human-readable text. https://t.co/pMLsxM2VAO
likes: 9987 | retweets: 1040 | replies: 366 | views: 994616