This stream will focus on monitoring, stress-testing safety methods, and evals, with an emphasis on risks from scheming AIs. Examples include (black-box) AI control techniques, white-box monitors (probes etc.), chain-of-thought monitoring/faithfulness, building evaluation environments, and stress-testing mitigations.
The GDM stream focuses on detecting and mitigating risk from scheming AI. A project would typically involve building an evaluation, prototyping and stress-testing a mitigation, investigating some threat model or trade-off relevant to these topics, or building environments to enable all of the above.
We will share concrete project proposals closer to the start of the program. They will be similar in flavour to, though not exactly the same as, the following illustrative project directions (many of which are projects we have previously advised):
Risk assessment: How worried should we be about risks from scheming AI?
Mitigations: AI control and alignment
(There will be lots of projects to choose from, so you don’t have to be excited about every single direction.)
David Lindner is a Research Scientist on Google DeepMind's AGI Safety and Alignment team where he works on evaluations and mitigations for deceptive alignment and scheming. His recent work includes MONA, a method for reducing multi-turn reward hacking during RL, designing evaluations for stealth and situational awareness, and helping develop GDM's approach to deceptive alignment. Currently, David is interested in studying mitigations for scheming, including CoT monitoring and AI control. You can find more details on his website.
Mary is a research scientist on the AGI Safety and Alignment team at Google DeepMind, where she works on preparedness for loss-of-control risks (misalignment, ML R&D, model poisoning). Her role involves making sure GDM has sufficient early warning signals and response plans in place for these threats. Previously, she has worked on AI control, dangerous capability evaluations for scheming precursor capabilities (stealth and situational awareness), as well as catastrophic misuse capabilities.
Roland works as a Research Scientist at Google DeepMind as a member of the AGI Safety and Alignment team. He completed his Ph.D. at the University of Tuebingen / MPI-IS working with Wieland Brendel on interpretability, robustness and learning theory. His current work is focussed on evaluations and mitigations for deceptive alignment and scheming. More generally, he is interested in understanding the behavior, capabilities and limitations of AIs and their training procedures to increase trust and safety.
I am a research scientist on the AGI Safety & Alignment team at Google DeepMind. I focus on deceptive alignment and AI control, particularly scheming propensity evaluations. My past research includes dangerous capability evals, power-seeking incentives, specification gaming, and avoiding side effects.
This varies a bit by mentor, but generally we expect fellows to autonomously drive their project forward (alongside their teammate(s)) and meet with their mentor once a week.
Fellows are responsible for:
Mentors are responsible for:
In addition to weekly meetings, you may set up ad hoc meetings with your mentor, reach them on Slack, or send them proposals / paper drafts for review. All meetings will be at times compatible with the BST / CEST timezones.
We're looking for fellows who want to dedicate their careers to making transformative AI go well, and who have strong technical skills. You don't need to be exceptional at every technical skill below, but you should have basic competence across the list and be excellent (or improving fast) in some of them.
Mission orientation
Technical skills
Working style
Team player: You impartially consider others' ideas, disagree respectfully, admit when you're wrong, and commit to the team's direction once decided. You are happy to work in teams of 2-3.
We design our stream to be highly collaborative. We encourage fellows to work together and possibly with external collaborators.
Close to program start, we will share a list of proposed projects, each tied to a primary mentor (some projects may also have a secondary mentor). You will then have some time to think, look up relevant literature, ask mentors questions and talk to other fellows. Then we will send out a form for you to indicate your top N projects, ranked, as well as any teammate or mentor preferences. We will then optimise fellow allocation into teams and projects to maximally satisfy everyone’s preferences.