Victoria Krakovna

Conceptual research on deceptive alignment, designing scheming propensity evaluations and honeypots. The stream will run in person in London, with scholars working together in team(s). 

Stream overview

Conceptual research on deceptive alignment, designing scheming propensity evaluations and honeypots. Some example directions: 

  • Building realistic evals / honeypots for scheming propensity (as similar to normal deployment settings as possible, not triggering eval awareness)
  • Propensity evals testing for other instrumental goals besides self-preservation (e.g. resource acquisition, goal preservation)
  • Investigate how different forms of evaluation awareness affect model behavior on propensity evals, and whether it's feasible / useful to influence what kind of evaluation the model believes it's in (e.g. capability vs safety eval). 

Mentors

Victoria Krakovna
Google DeepMind
,
Research Scientist
London
Scheming & Deception, Dangerous Capability Evals, Control, Red-Teaming

I am a research scientist on the AGI Safety & Alignment team at Google DeepMind. I am currently focusing on deceptive alignment and AI control (recent work: https://arxiv.org/abs/2505.01420), particularly scheming propensity evaluations and honeypots. My past research includes power-seeking incentives, specification gaming, and avoiding side effects. 

Mentorship style

During the program, we will meet once a week to go through any updates / results, and your plans for the next week. I'm also happy to comment on docs, respond on Slack, or have additional ad hoc meetings as needed.

Representative papers

https://arxiv.org/abs/2505.01420

https://arxiv.org/abs/2403.13793

https://arxiv.org/abs/2505.23575

https://arxiv.org/abs/2412.12480

Scholars we are looking for

  • Collaborative software engineering: You are comfortable writing high-quality code independently and quickly. You are fluent in standard software engineering practices, e.g. version control. You can navigate a codebase written not entirely by you and can build on it productively.
  • Familiarity with deceptive alignment, safety cases, capability evaluations, AI control, and related topics. This makes it much easier for us to be on the same page about the goals of the project, and is important for making day-to-day project decisions in a conceptually sound way.
  • Conceptual research ability: You can come up with ideas for interesting experiments that align with the project plan. You prioritise well.
  • Good communication, team player: You can clearly communicate your research ideas, experimental methodology, results, etc, verbally or in writing (writing is more important). You can notice and express confusion, uncertainty, disagreement, or dissatisfaction and are willing to work through conflicts. You impartially consider other people’s ideas, disagree respectfully, can admit when you’re wrong, and are willing to commit to the team’s direction.

Scholars will be working together in team(s)

Project selection

I will talk through project ideas with scholars

Community at MATS

MATS Research phase provides scholars with a community of peers.

During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.

Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.

Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Other impromptu office events included group-jailbreaking Bing chat and exchanging hundreds of anonymous compliment notes.  Scholars organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and joining ACX meetups.