Alex Turner

Google DeepMind

Research Scientist

Alex is a Research Scientist at Google DeepMind. He’s currently working on training invariants into model behavior. In the past, he formulated and proved the power-seeking theorems, co-formulated the shard theory of human value formation, and proposed the Attainable Utility Preservation approach to penalizing negative side effects.

Highlighted outputs from past streams:

- Mechanistic interpretability to understand and control maze-solving agents (MATS 3.0, paper)

    - Introduced the now-staple technique of “steering vectors” (a minimal code sketch of the idea appears after this list)

- Steering GPT-2-XL by adding an activation vector

- Steering Llama-2 with contrastive activation additions (MATS 4.0, paper)

- Unsupervised discovery of model behaviors using steering vectors (MATS 5.0)

- Gradient routing (MATS 6.0)

- Unlearn-and-distill, for making robust unlearning a reality
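
The steering-vector outputs above share one core move: adding a fixed direction to the model's residual stream at inference time. Below is a minimal sketch of that move in the contrastive style of the Llama-2 work, assuming the Hugging Face `transformers` GPT-2 implementation; the layer index, coefficient, prompt pair, and helper names are illustrative assumptions, not values or code from the papers.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 6    # transformer block to steer; illustrative choice
COEFF = 4.0  # steering strength; illustrative choice

def resid_before_layer(prompt: str) -> torch.Tensor:
    """Return the residual-stream activations entering block LAYER
    (hidden_states[i] is the input to block i); shape [1, seq, d_model]."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER]

# Contrastive construction: the steering vector is the difference between
# activations on a pair of prompts, taken at the final token position.
steer = resid_before_layer(" Love")[0, -1] - resid_before_layer(" Hate")[0, -1]

def add_vector(module, args):
    # Pre-hook on the steered block: shift its incoming hidden states
    # at every position by the scaled steering vector.
    return (args[0] + COEFF * steer,) + args[1:]

handle = model.transformer.h[LAYER].register_forward_pre_hook(add_vector)
ids = tokenizer("I think that you are", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False)
handle.remove()
print(tokenizer.decode(out[0]))
```

The papers tune the layer, injection positions, and scaling per target behavior; for simplicity, this sketch adds the vector at every position of the steered block's input.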