Discovering qualitatively new techniques
Steering GPT-2-XL by adding an activation vector opened up a new way to cheaply steer LLMs at runtime. Subsequent work has reinforced the promise of this technique, and steering vectors have become a small research subfield of their own. Unsupervised discovery of model behaviors may now be possible thanks to Andrew Mack’s method for unsupervised steering vector discovery. Gradient routing provides a novel form of weak supervision, enabling us to induce useful structure in models even without access to perfect labels. Unlearn and Distill succeeds at robust unlearning for the same reason that prior deep unlearning methods failed.
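To make the core idea concrete, here is a minimal sketch of activation steering, assuming a PyTorch / Hugging Face transformers setup: compute a steering vector as the difference between residual-stream activations for a contrastive prompt pair, then add it back at a chosen layer during generation via a forward hook. The model size, layer index, prompts, and coefficient below are illustrative placeholders rather than the settings used in the original work.

```python
# Minimal sketch of activation steering in the spirit of "Steering GPT-2-XL by
# adding an activation vector". Layer index, prompts, and coefficient are
# illustrative assumptions, not the original settings.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

MODEL_NAME = "gpt2"   # small stand-in for gpt2-xl to keep the sketch light
LAYER = 6             # which transformer block to steer (illustrative)
COEFF = 4.0           # steering strength (illustrative)

tokenizer = GPT2Tokenizer.from_pretrained(MODEL_NAME)
model = GPT2LMHeadModel.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def block_output(prompt: str) -> torch.Tensor:
    """Residual-stream activation after block LAYER, at the last token."""
    ids = tokenizer(prompt, return_tensors="pt")
    hidden = model(**ids, output_hidden_states=True).hidden_states
    return hidden[LAYER + 1][:, -1, :]   # hidden_states[i + 1] is block i's output

# Steering vector: difference of activations for a contrastive prompt pair.
steering_vector = COEFF * (block_output("Love") - block_output("Hate"))

def add_vector(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 is the hidden state.
    hidden = output[0] + steering_vector.to(output[0].dtype)
    return (hidden,) + tuple(output[1:])

handle = model.transformer.h[LAYER].register_forward_hook(add_vector)
try:
    prompt_ids = tokenizer("I think you are", return_tensors="pt")
    out = model.generate(**prompt_ids, max_new_tokens=20, do_sample=False)
    print(tokenizer.decode(out[0]))
finally:
    handle.remove()   # remove the hook so later generations are unsteered
```

Because the vector is injected at inference time through a hook, the model's weights are untouched, which is what makes this kind of steering cheap to apply and to remove.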
Formalizing shard theory
Shard theory has helped unlock a range of empirical insights, including steering vectors. The time seems ripe to put the theory on firmer mathematical footing. For initial thoughts, see this comment.
Something else
We're open to supporting any conceptually motivated empirical ML research that promotes the safe development of transformative AI.
Alex Cloud is a researcher at Anthropic. He is interested in developing principled methods to induce safety-relevant structure in models. Examples include gradient routing to localize learning updates and distillation for robust unlearning.
Previously, Alex conducted applied research in reinforcement learning at Riot Games AI and Amazon. He earned a PhD in Statistics from North Carolina State University, where he was advised by Eric Laber.
Alex Turner is a Research Scientist at Google DeepMind. He’s currently working on training invariants into model behavior. In the past, he formulated and proved the power-seeking theorems, co-formulated the shard theory of human value formation, and proposed the Attainable Utility Preservation approach to penalizing negative side effects.
Highlighted outputs from past streams:
- Mechanistic interpretability to understand and control maze-solving agents (MATS 3.0, paper)
- Introduced the now-staple technique of “steering vectors”
  - Steering GPT-2-XL by adding an activation vector
  - Steering Llama-2 with contrastive activation additions (MATS 4.0, paper)
- Unsupervised discovery of model behaviors using steering vectors (MATS 5.0)
- Gradient routing (MATS 6.0)
- Unlearn and distill for making robust unlearning a reality
We will have weekly 1:1s and a weekly team lunch, as well as asynchronous communication over Slack. Mentees are welcome to reach out at any time if they need guidance outside of the usual meeting times.
Scholars should mostly figure things out on their own outside of meetings
https://arxiv.org/abs/2410.04332
https://arxiv.org/abs/2506.06278
Ideal candidates would have:
Scholars will probably work with collaborators from the stream.
Mentor(s) will talk through project ideas with the scholar.
The MATS Research phase provides scholars with a community of peers.
During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.
Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.
Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Other impromptu office events included group-jailbreaking Bing chat and exchanging hundreds of anonymous compliment notes. Scholars organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and joining ACX meetups.