Team Shard

In the shard theory stream, we create qualitatively new methods and fields of inquiry, from steering vectors to gradient routing to unsupervised capability elicitation to robust unlearning. If you're theory-minded, maybe you'll help us formalize shard theory itself.

Stream overview

Discovering qualitatively new techniques

Steering GPT-2-XL by adding an activation vector opened up a new way to cheaply steer LLMs at runtime. Subsequent work has reinforced the promise of this technique, and steering vectors have become a small research subfield in their own right. Unsupervised discovery of model behaviors may now be possible thanks to Andrew Mack's method for unsupervised steering vector discovery. Gradient routing provides a novel form of weak supervision, enabling us to induce useful structure in models even without access to perfect labels. Unlearn and Distill succeeds at robust unlearning for the same reason that prior deep unlearning methods failed.
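
To make the first of these concrete, here is a minimal sketch of runtime activation steering on GPT-2, assuming the Hugging Face transformers library. The layer index, steering coefficient, and contrastive prompt pair below are illustrative placeholders, not settings from the original work:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
model.eval()

LAYER, COEFF = 6, 4.0  # illustrative choices, not tuned values

def block_output(prompt):
    # hidden_states[i + 1] is the residual stream after block i.
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER + 1][0, -1]

# Steering vector: activation difference on a contrastive prompt pair.
steer = COEFF * (block_output("Love") - block_output("Hate"))

def add_steer(module, inputs, output):
    # GPT-2 blocks return a tuple; output[0] is the hidden state.
    # For simplicity, the vector is added at every position.
    return (output[0] + steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steer)
try:
    ids = tok("I hate you because", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=20, do_sample=False)
    print(tok.decode(out[0]))
finally:
    handle.remove()
```

In practice, the layer and coefficient typically need sweeping; the same hook pattern carries over to larger models.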

Formalizing shard theory

Shard theory has helped unlock a range of empirical insights, including steering vectors. The time seems ripe to put the theory on firmer mathematical footing. For initial thoughts, see this comment.

Something else

We're open to supporting any conceptually motivated empirical ML research that promotes the safe development of transformative AI.

Mentors

Alex Cloud
Anthropic, Member of Technical Staff
SF Bay Area
Interpretability, Agent Foundations

Alex is a researcher at Anthropic. He is interested in developing principled methods to induce safety-relevant structure in models. Examples include gradient routing to localize learning updates in models and distillation for robust unlearning.
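
As a rough illustration of the localization idea, here is a toy sketch of gradient routing, assuming the common stop-gradient implementation: the forward pass is unchanged, but each batch's gradient is masked so that learning in the first layer is confined to a designated region of hidden units. The two-region split, the "retain"/"forget" routes, and the toy MLP are all hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN = 64
net_in = nn.Linear(10, HIDDEN)
net_out = nn.Linear(HIDDEN, 2)
opt = torch.optim.SGD(
    list(net_in.parameters()) + list(net_out.parameters()), lr=1e-2
)

# Partition the hidden layer into two regions (a hypothetical split).
region = {
    "retain": torch.cat([torch.ones(HIDDEN // 2), torch.zeros(HIDDEN // 2)]),
    "forget": torch.cat([torch.zeros(HIDDEN // 2), torch.ones(HIDDEN // 2)]),
}

def routed_step(x, y, route):
    # Forward pass equals an ordinary MLP, since m*h + (1-m)*h == h.
    # The .detach() only affects backprop, so this batch's update to the
    # first layer is confined to the hidden units in its region.
    h = torch.relu(net_in(x))
    m = region[route]
    h = h * m + (h * (1 - m)).detach()
    loss = F.cross_entropy(net_out(h), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# E.g., batches carrying (possibly imperfect) "forget" labels update only
# the forget region, localizing the associated learning there.
x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
routed_step(x, y, "forget")
```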

Previously, Alex conducted applied research in reinforcement learning at Riot Games AI and Amazon. He earned a PhD in Statistics from North Carolina State University, where he was advised by Eric Laber.

Alex Turner
Google DeepMind, Research Scientist
SF Bay Area
Interpretability, Agent Foundations

Alex is a Research Scientist at Google DeepMind. He’s currently working on training invariants into model behavior. In the past, he formulated and proved the power-seeking theorems, co-formulated the shard theory of human value formation, and proposed the Attainable Utility Preservation approach to penalizing negative side effects.

Highlighted outputs from past streams:

- Mechanistic interpretability to understand and control maze-solving agents (MATS 3.0, paper)

    - Introduced the now-staple technique of “steering vectors”

- Steering GPT-2-XL by adding an activation vector

- Steering Llama-2 with contrastive activation additions (MATS 4.0, paper)

- Unsupervised discovery of model behaviors using steering vectors (MATS 5.0)

- Gradient routing (MATS 6.0)

- Unlearn and Distill for making robust unlearning a reality

Mentorship style

We will have weekly 1:1s and a weekly team lunch, as well as asynchronous communication over Slack. Mentees are welcome to reach out at any time if guidance is needed outside of the usual meeting times.

Scholars should mostly figure things out on their own outside of meetings.


Representative papers

https://arxiv.org/abs/2410.04332

https://arxiv.org/abs/2506.06278

Scholars we are looking for

Ideal candidates would have:

  • Academic background in machine learning, computer science, statistics, or a related quantitative field.
  • Familiarity with ML engineering.
  • Proven experience working on machine learning projects, either academically or professionally.
  • Strong programming skills, preferably in Python, and proficiency in data manipulation and analysis.
  • Ability to write up results into a paper.

Scholars will probably work with collaborators from the stream.

Project selection

Mentors will talk through project ideas with each scholar.

Community at MATS

The MATS Research phase provides scholars with a community of peers.

During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.

Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.

Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Other impromptu office events included group-jailbreaking Bing Chat and exchanging hundreds of anonymous compliment notes. Scholars also organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and joining ACX meetups.