Victoria Krakovna

Conceptual research on deceptive alignment, designing scheming propensity evaluations and honeypots. The stream will run in person in London, with scholars working together in team(s).

Apply

View all streams

Stream overview

Conceptual research on deceptive alignment, designing scheming propensity evaluations and honeypots. Some example directions:

Building realistic evals / honeypots for scheming propensity (as similar to normal deployment settings as possible, not triggering eval awareness)
Propensity evals testing for other instrumental goals besides self-preservation (e.g. resource acquisition, goal preservation)
Investigate how different forms of evaluation awareness affect model behavior on propensity evals, and whether it's feasible / useful to influence what kind of evaluation the model believes it's in (e.g. capability vs safety eval).

Mentors

Victoria Krakovna

Google DeepMind

Research Scientist

London

—

Scheming and Deception

Dangerous Capability Evals

Control

Red-Teaming

I am a research scientist on the AGI Safety & Alignment team at Google DeepMind. I am currently focusing on deceptive alignment and AI control (recent work: https://arxiv.org/abs/2505.01420), particularly scheming propensity evaluations and honeypots. My past research includes power-seeking incentives, specification gaming, and avoiding side effects.

Mentorship style

During the program, we will meet once a week to go through any updates / results, and your plans for the next week. I'm also happy to comment on docs, respond on Slack, or have additional ad hoc meetings as needed.

Representative papers

https://arxiv.org/abs/2505.01420

https://arxiv.org/abs/2403.13793

https://arxiv.org/abs/2505.23575

https://arxiv.org/abs/2412.12480

Scholars we are looking for

Collaborative software engineering: You are comfortable writing high-quality code independently and quickly. You are fluent in standard software engineering practices, e.g. version control. You can navigate a codebase written not entirely by you and can build on it productively.
Familiarity with deceptive alignment, safety cases, capability evaluations, AI control, and related topics. This makes it much easier for us to be on the same page about the goals of the project, and is important for making day-to-day project decisions in a conceptually sound way.
Conceptual research ability: You can come up with ideas for interesting experiments that align with the project plan. You prioritise well.
Good communication, team player: You can clearly communicate your research ideas, experimental methodology, results, etc, verbally or in writing (writing is more important). You can notice and express confusion, uncertainty, disagreement, or dissatisfaction and are willing to work through conflicts. You impartially consider other people’s ideas, disagree respectfully, can admit when you’re wrong, and are willing to commit to the team’s direction.

Scholars will be working together in team(s)

Project selection

I will talk through project ideas with scholars