Google DeepMind

This stream covers monitoring, stress-testing safety methods, and evals, with an emphasis on risks from scheming AIs. Examples include (black-box) AI control techniques, white-box monitors (probes etc.), chain-of-thought monitoring/faithfulness, building evaluation environments, and stress-testing mitigations.

Stream overview

The GDM stream focuses on detecting and mitigating risk from scheming AI. A project would typically involve building an evaluation, prototyping and stress-testing a mitigation, investigating some threat model or trade-off relevant to these topics, or building environments to enable all of the above.

We will share concrete project proposals closer to the start of the program. They will be similar in flavour, though not identical, to the following illustrative project directions (many of which are projects we have previously advised):

Risk assessment: How worried should we be about risks from scheming AI?

  • Capability evaluations for deceptive reasoning or its precursors
  • Capability evaluations for opaque reasoning
  • Capability evaluations for steganographic communication and reasoning
  • Evaluations of monitoring awareness and monitoring subversion propensity
  • Honeypot evaluations that trigger based on the action alone (rather than CoT)
  • Honeypot evaluations for settings other than research sabotage
  • Propensity evaluations testing for other instrumental goals besides self-preservation (e.g. resource acquisition, goal preservation)
  • Create a model organism of exploration hacking, or generalisation hacking
  • Create a model organism of activation obfuscation
  • Study how much a schemer can subvert control by acting borderline suspicious all the time

Mitigations: AI control and alignment

  • Develop improved white-box monitoring methods
  • Develop collusion countermeasures for untrusted monitoring
  • Develop AI control methodology in the distributed attacks setting
  • Develop cross-context consistency as a sabotage detection methodology
  • Stress-test chain-of-thought monitoring 
  • Stress-test “bootstrapped” monitoring, a hybrid trusted-untrusted monitoring chain
  • Study a version of untrusted monitoring where the monitor is an honesty-tuned version of the agent 
  • Study what happens to a model that has a hidden secret during character training
  • Test the effect of different types of RL training on CoT legibility
  • Identify the causes of agents being misaligned and, e.g., lying about their results
  • Build an environment / dataset for studying data poisoning mitigations
  • Build an environment / dataset for studying decision sabotage mitigations

(There will be lots of projects to choose from, so you don’t have to be excited about every single direction.)

Mentors

David Lindner
Google DeepMind, Research Scientist
London
Control
Monitoring
Safeguards
Scheming and Deception
Dangerous Capability Evals

David Lindner is a Research Scientist on Google DeepMind's AGI Safety and Alignment team where he works on evaluations and mitigations for deceptive alignment and scheming. His recent work includes MONA, a method for reducing multi-turn reward hacking during RL, designing evaluations for stealth and situational awareness, and helping develop GDM's approach to deceptive alignment. Currently, David is interested in studying mitigations for scheming, including CoT monitoring and AI control. You can find more details on his website.

Mary Phuong
Google DeepMind, Research Scientist
London
Control
Scheming and Deception
Model Organisms
Monitoring

Mary is a research scientist on the AGI Safety and Alignment team at Google DeepMind, where she works on preparedness for loss of control risks (misalignment, ML R&D, model poisoning). Her role involves making sure GDM has sufficient early warning signals and response plans in place for these threats. Previously, she worked on AI control and on dangerous capability evaluations for scheming precursor capabilities (stealth and situational awareness) as well as catastrophic misuse capabilities.

Roland Zimmermann
Google DeepMind, Senior Research Scientist
Zurich
Control
Model Organisms
Monitoring
Scheming and Deception

Roland works as a Research Scientist at Google DeepMind as a member of the AGI Safety and Alignment team. He completed his Ph.D. at the University of Tuebingen / MPI-IS, working with Wieland Brendel on interpretability, robustness and learning theory. His current work is focussed on evaluations and mitigations for deceptive alignment and scheming. More generally, he is interested in understanding the behavior, capabilities and limitations of AIs and their training procedures to increase trust and safety.

Victoria Krakovna
Google DeepMind, Research Scientist
London
Scheming and Deception
Control
Red-Teaming

I am a research scientist on the AGI Safety & Alignment team at Google DeepMind. I focus on deceptive alignment and AI control, particularly scheming propensity evaluations. My past research includes dangerous capability evals, power-seeking incentives, specification gaming, and avoiding side effects.    


Mentorship style

This varies a bit by mentor, but generally we expect fellows to autonomously drive their project forward (alongside their teammate(s)) and meet with their mentor once a week. 

Fellows are responsible for:

  • project execution and day-to-day decision making (e.g. how to structure the codebase, what experiments to run),
  • keeping track of and meeting deadlines (e.g. MATS deadlines, conference deadlines),
  • seeking mentor feedback where appropriate and making good use of meetings.

Mentors are responsible for:

  • making sure their fellows are working on promising projects,
  • giving fellows feedback on their progress and development,
  • advising on project direction, theory of change, prioritisation, eval / dataset / experiment design or methodology, interpreting results, and relevant literature.

In addition to weekly meetings, you may set up ad hoc meetings with your mentor, reach them on Slack, or send them proposals / paper drafts for review. All meetings will be at times compatible with the BST / CEST timezones.

Fellows we are looking for

We're looking for fellows who want to dedicate their careers to making transformative AI go well, and who have strong technical skills. You don't need to be exceptional at every technical skill below, but you should have basic competence across the list and be excellent (or improving fast) in some of them.

Mission orientation

  • You see the development of AGI / ASI as one of the most significant events in human history and want to dedicate your career to helping things go well.

Technical skills

  • Coding / Software engineering: You can write high-quality, legible code quickly (skilfully using AI), navigate large codebases you didn't write, and debug problems quickly and thoroughly. You have good judgment about when to write stable shared code vs throwaway code, and when to invest in a refactor, tooling or faster iteration loops. You check and understand your AI’s work.
  • ML and LLM engineering: You are fluent at training models (via SFT/RL APIs or writing training code yourself), creating high-quality synthetic data, and debugging training runs. You're comfortable writing inference code, scaffolds, and eval pipelines (e.g. Inspect). You know what to log and can extract insights from logs and transcripts.
  • Research ability: You design sound experiments targeted at the research question. You understand confounders, competing hypotheses, baselines, ablations, and statistical significance. You generate good ideas for experiments / research questions that align with the project plan.
  • Research context and big picture: You are familiar with deceptive alignment, AI control, model alignment, model evaluations, model organisms, and safety cases. You know what most terms in our project descriptions mean, the field's open problems, and how your project contributes to AGI safety going well. Your design and prioritisation decisions reflect this understanding.
  • Communication: You clearly communicate your research ideas, project status, experimental methodology, and results – both verbally and in writing.

Working style

  • Independence: We provide weekly research guidance, but expect you to run experiments and make low-level decisions independently most of the time.

  • Team player: You impartially consider others' ideas, disagree respectfully, admit when you're wrong, and commit to the team's direction once decided. You are happy to work in teams of 2-3.

We design our stream to be highly collaborative. We encourage fellows to work together and possibly with external collaborators.

Project selection

Close to program start, we will share a list of proposed projects, each tied to a primary mentor (some projects may also have a secondary mentor). You will then have some time to think, look up relevant literature, ask mentors questions and talk to other fellows. Then we will send out a form for you to indicate your top N projects, ranked, as well as any teammate or mentor preferences. We will then optimise fellow allocation into teams and projects to maximally satisfy everyone’s preferences.