MATS Alumni
Matthew Siu, Sviatoslav Chalnev
Collaborators
Sviatoslav Chalnev, Matthew Siu, Arthur Conmy
Citations
Abstract
To control the behavior of language models, steering methods attempt to ensure that outputs of the model satisfy specific pre-defined properties. Adding steering vectors to the model is a promising method of model control that is easier than finetuning, and may be more robust than prompting. However, it can be difficult to anticipate the effects of steering vectors produced by methods such as CAA [Panickssery et al., 2024] or the direct use of SAE latents [Templeton et al., 2024]. In our work, we address this issue by using SAEs to measure the effects of steering vectors, giving us a method that can be used to understand the causal effect of any steering vector intervention. We use this method for measuring causal effects to develop an improved steering method, SAE-Targeted Steering (SAE-TS), which finds steering vectors to target specific SAE features while minimizing unintended side effects. We show that overall, SAE-TS balances steering effects with coherence better than CAA and SAE feature steering, when evaluated on a range of tasks.
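The abstract describes two operations: adding a steering vector to a model's activations, and using an SAE to measure what that addition changes. The toy sketch below illustrates both under strong simplifying assumptions. It is not the authors' implementation or SAE-TS itself. The function names (`add_steering`, `sae_encode`, `steering_effect`), the single-layer ReLU encoder, and the choice to measure the effect directly on the steered activation rather than on model outputs are all hypothetical choices made for illustration.

```python
def add_steering(activation, steering_vector, scale=1.0):
    """Return activation + scale * steering_vector, element-wise."""
    if len(activation) != len(steering_vector):
        raise ValueError("activation and steering vector must have equal length")
    return [a + scale * v for a, v in zip(activation, steering_vector)]


def sae_encode(activation, encoder_weights, encoder_bias):
    """Toy SAE encoder (assumed form): ReLU(W x + b), one value per feature."""
    return [
        max(0.0, sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(encoder_weights, encoder_bias)
    ]


def steering_effect(activation, steering_vector, encoder_weights, encoder_bias, scale=1.0):
    """Change in each toy SAE feature caused by adding the steering vector."""
    before = sae_encode(activation, encoder_weights, encoder_bias)
    steered = add_steering(activation, steering_vector, scale)
    after = sae_encode(steered, encoder_weights, encoder_bias)
    return [x - y for x, y in zip(after, before)]


# Hypothetical 2-dimensional example with an identity encoder.
activation = [1.0, 0.5]
steering_vector = [0.0, 2.0]
encoder_weights = [[1.0, 0.0], [0.0, 1.0]]
encoder_bias = [0.0, 0.0]

effect = steering_effect(activation, steering_vector, encoder_weights, encoder_bias)
```

In this toy setup, a steering vector that moved features other than the intended one would show up as nonzero entries elsewhere in `effect`, which is the kind of side effect the abstract says SAE-TS aims to minimize.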
Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs
Authors:
Jorio Cocola, Dylan Feng
Date:
December 10, 2025
Citations:
0
AI agents find $4.6M in blockchain smart contract exploits
Authors:
Fellow: Winnie Xiao
Date:
December 1, 2025
Citations:
0