Conceptual research on deceptive alignment, designing scheming propensity evaluations and honeypots. Some example directions:
I am a research scientist on the AGI Safety & Alignment team at Google DeepMind. I am currently focusing on deceptive alignment and AI control (recent work: https://arxiv.org/abs/2505.01420), particularly scheming propensity evaluations and honeypots. My past research includes power-seeking incentives, specification gaming, and avoiding side effects.
During the program, we will meet once a week to go through any updates / results, and your plans for the next week. I'm also happy to comment on docs, respond on Slack, or have additional ad hoc meetings as needed.
https://arxiv.org/abs/2505.01420
https://arxiv.org/abs/2403.13793
https://arxiv.org/abs/2505.23575
https://arxiv.org/abs/2412.12480
Scholars will be working together in team(s)
I will talk through project ideas with scholars