Toward universal steering and monitoring of AI models


Source: Science Magazine

Original: https://www.science.org/doi/abs/10.1126/science.aea6792?af=R...

Published: 2026-02-19T07:00:03Z

Researchers have developed an efficient, scalable way to extract linear representations of general concepts in large AI models, including language models, vision-language models, and reasoning models ranging from 8 to 90 billion parameters[1][2]. These representations make models controllable, which can reveal vulnerabilities, mitigate misbehavior, and improve capabilities beyond what standard prompting achieves[1][2]. Quantitative analysis of hundreds of concepts shows that newer and larger models are more controllable[1][2]. Concept representations transfer across human languages and can be combined to control multiple concepts at once[1][2]. The method can also monitor undesirable content, such as hallucinations or toxicity: classifiers built on these internal representations predict such content more accurately than directly assessing model outputs[1][2]. Identifying and controlling a concept takes less than a minute on a single NVIDIA A100 GPU with fewer than 500 training samples[1].
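The article does not detail how the representations are extracted, but a common approach in this line of work is the difference-of-means steering vector: average a model's hidden activations over prompts that do and do not express a concept, take the difference as a linear direction, then add that direction to activations to steer the model or project onto it to monitor. The sketch below, using synthetic activations in place of a real model, illustrates the idea; all function names and the steering strength `alpha` are illustrative, not the paper's actual method.

```python
import numpy as np

def concept_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means: a unit vector pointing from 'concept absent'
    activations toward 'concept present' activations."""
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """Push hidden states along the concept direction to amplify the concept
    (use a negative alpha to suppress it)."""
    return hidden + alpha * direction

def concept_score(hidden: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Projection onto the direction: a simple linear monitor for how strongly
    each hidden state expresses the concept."""
    return hidden @ direction

# Synthetic stand-in for layer activations (hidden size 64); the "concept"
# shifts activations along one axis.
rng = np.random.default_rng(0)
concept_axis = np.zeros(64)
concept_axis[0] = 1.0
pos_acts = rng.normal(size=(200, 64)) + 2.0 * concept_axis  # concept present
neg_acts = rng.normal(size=(200, 64))                       # concept absent

d = concept_direction(pos_acts, neg_acts)
steered = steer(neg_acts, d)
```

With a real model, `pos_acts` and `neg_acts` would come from a forward pass over a few hundred contrastive prompts (consistent with the under-500-sample budget the article mentions), and `steer` would be applied to a chosen layer's residual stream during generation.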