Discovering qualitatively new techniques
Steering GPT-2-XL by adding an activation vector opened up a new way to cheaply steer LLMs at runtime. Subsequent work has reinforced the promise of this technique, and steering vectors have become a small research subfield of their own. Unsupervised discovery of model behaviors may now be possible thanks to Andrew Mack’s method for unsupervised steering vector discovery. Gradient routing provides a novel form of weak supervision, enabling us to induce useful structure in models even without access to perfect labels. Unlearn and Distill succeeds at robust unlearning for the same reason that prior deep unlearning methods failed.
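To make the core idea concrete, here is a minimal sketch of activation steering, assuming a PyTorch / Hugging Face transformers setup: compute a steering vector as the difference between residual-stream activations for a contrastive prompt pair, then add it back at a chosen layer during generation via a forward hook. The model size, layer index, prompts, and coefficient below are illustrative placeholders rather than the settings used in the original work.

```python
# Minimal sketch of activation steering in the spirit of "Steering GPT-2-XL by
# adding an activation vector". Layer index, prompts, and coefficient are
# illustrative assumptions, not the original settings.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

MODEL_NAME = "gpt2"   # small stand-in for gpt2-xl to keep the sketch light
LAYER = 6             # which transformer block to steer (illustrative)
COEFF = 4.0           # steering strength (illustrative)

tokenizer = GPT2Tokenizer.from_pretrained(MODEL_NAME)
model = GPT2LMHeadModel.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def block_output(prompt: str) -> torch.Tensor:
    """Residual-stream activation after block LAYER, at the last token."""
    ids = tokenizer(prompt, return_tensors="pt")
    hidden = model(**ids, output_hidden_states=True).hidden_states
    return hidden[LAYER + 1][:, -1, :]   # hidden_states[i + 1] is block i's output

# Steering vector: difference of activations for a contrastive prompt pair.
steering_vector = COEFF * (block_output("Love") - block_output("Hate"))

def add_vector(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 is the hidden state.
    hidden = output[0] + steering_vector.to(output[0].dtype)
    return (hidden,) + tuple(output[1:])

handle = model.transformer.h[LAYER].register_forward_hook(add_vector)
try:
    prompt_ids = tokenizer("I think you are", return_tensors="pt")
    out = model.generate(**prompt_ids, max_new_tokens=20, do_sample=False)
    print(tokenizer.decode(out[0]))
finally:
    handle.remove()   # remove the hook so later generations are unsteered
```

Because the vector is injected at inference time through a hook, the model's weights are untouched, which is what makes this kind of steering cheap to apply and to remove.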
Formalizing shard theory
Shard theory has helped unlock a range of empirical insights, including steering vectors. The time seems ripe to put the theory on firmer mathematical footing. For initial thoughts, see this comment.
Something else
We're open to supporting any conceptually motivated empirical ML research that promotes the safe development of transformative AI.
Alex Cloud is a researcher at Anthropic. He is interested in developing principled methods to induce safety-relevant structure in models. Examples include gradient routing to localize learning updates and distillation for robust unlearning.
Previously, Alex conducted applied research in reinforcement learning at Riot Games AI and Amazon. He earned a PhD in Statistics from North Carolina State University, where he was advised by Eric Laber.
Alex Turner is a Research Scientist at Google DeepMind. He’s currently working on training invariants into model behavior. In the past, he formulated and proved the power-seeking theorems, co-formulated the shard theory of human value formation, and proposed the Attainable Utility Preservation approach to penalizing negative side effects.
Highlighted outputs from past streams:
- Mechanistic interpretability to understand and control maze-solving agents (MATS 3.0, paper)
- Introduced the now-staple technique of “steering vectors”
  - Steering GPT-2-XL by adding an activation vector
  - Steering Llama-2 with contrastive activation additions (MATS 4.0, paper)
- Unsupervised discovery of model behaviors using steering vectors (MATS 5.0)
- Gradient routing (MATS 6.0)
- Unlearn and distill for making robust unlearning a reality
We will have weekly 1:1s and a weekly team lunch, as well as asynchronous communication over Slack. Mentees are welcome to reach out at any time if they need guidance outside of the usual meeting times.
Scholars should mostly figure things out on their own outside of meetings
https://arxiv.org/abs/2410.04332
https://arxiv.org/abs/2506.06278
Ideal candidates would have:
Scholars will probably work with collaborators from the stream.
Mentor(s) will talk through project ideas with the scholar.
The MATS Research phase provides scholars with a community of peers.
During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.
Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.
Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Other impromptu office events included group-jailbreaking Bing chat and exchanging hundreds of anonymous compliment notes. Scholars organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and joining ACX meetups.