
Google DeepMind
—
Research Scientist
Alex is a Research Scientist at Google DeepMind. He’s currently working on training invariants into model behavior. In the past, he formulated and proved the power-seeking theorems, co-formulated the shard theory of human value formation, and proposed the Attainable Utility Preservation approach to penalizing negative side effects.
Highlighted outputs from past MATS streams:
- Mechanistic interpretability to understand and control maze-solving agents (MATS 3.0, paper)
- Introduced the now-staple technique of “steering vectors” (see the sketch after this list):
  - Steering GPT-2-XL by adding an activation vector
  - Steering Llama-2 with contrastive activation additions (MATS 4.0, paper)
- Unsupervised discovery of model behaviors using steering vectors (MATS 5.0)
- Gradient routing (MATS 6.0)
- Unlearn-and-distill, a technique for making robust unlearning a reality
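
For readers curious what “adding an activation vector” looks like in practice, here is a minimal sketch of the general activation-addition idea, not the exact recipe or settings from the papers above: extract residual-stream activations for a contrastive pair of prompts, take their difference as a steering vector, and add it back into the forward pass during generation. It assumes GPT-2 via Hugging Face transformers; the layer index, prompts, and coefficient are all illustrative choices.

```python
# Minimal sketch of activation addition ("steering vectors"), assuming GPT-2
# via Hugging Face transformers. LAYER, COEFF, and the prompts are
# hypothetical choices, not the settings from the papers above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6    # which residual-stream layer to steer (hypothetical choice)
COEFF = 4.0  # steering strength (hypothetical choice)

def last_token_resid(prompt: str) -> torch.Tensor:
    """Residual-stream activation entering block LAYER, at the last token."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    # hidden_states[LAYER] is the residual stream entering block LAYER.
    return out.hidden_states[LAYER][0, -1]

# Contrastive pair: the steering vector is a difference of activations.
steering_vec = last_token_resid("Love") - last_token_resid("Hate")

def steer(module, args):
    # Forward pre-hook: args[0] is the hidden state entering the block.
    # Adding the vector broadcasts over batch and sequence positions.
    return (args[0] + COEFF * steering_vec,) + args[1:]

handle = model.transformer.h[LAYER].register_forward_pre_hook(steer)
try:
    ids = tokenizer("I think dogs are", return_tensors="pt").input_ids
    steered = model.generate(ids, max_new_tokens=20, do_sample=False)
    print(tokenizer.decode(steered[0]))
finally:
    handle.remove()  # detach the hook so later generations are unsteered
```

The pre-hook adds the vector at the input to block LAYER, matching where it was extracted, and removing the hook restores the model’s unsteered behavior.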