David Lindner

This stream focuses on monitoring, stress-testing safety methods, and evals, with an emphasis on risks from scheming AIs. Examples include (black-box) AI control techniques, white-box monitors (probes, etc.), chain-of-thought monitoring and faithfulness, building evaluation environments, and stress-testing mitigations.

Stream overview

I'm interested in detecting and mitigating deceptive alignment, mainly via capability evaluations and control. I'm looking to supervise projects in these areas:

  • Developing capability evaluations for deceptive reasoning ability.
  • Usefulness of externalized reasoning for oversight (CoT faithfulness by default, architectures promoting CoT faithfulness, monitoring of CoT).
  • Control (primarily simulating control evaluations in different settings).
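To give a loose sense of the white-box monitoring direction mentioned above, here is a minimal sketch of a linear-probe monitor. It is purely illustrative: the activations and labels are random placeholders standing in for cached model activations and behaviour labels, and any real project would involve far more careful data collection and evaluation.

    # Minimal sketch of a white-box "probe" monitor (illustrative only).
    # Assumes you already have per-example activations from some model layer,
    # plus binary labels for the behaviour of interest; both are random
    # stand-ins here, not real data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n_examples, d_model = 1000, 512

    # Placeholder activations and labels (in practice: cached activations on
    # transcripts labelled e.g. "deceptive" vs. "benign").
    activations = rng.normal(size=(n_examples, d_model))
    labels = rng.integers(0, 2, size=n_examples)

    X_train, X_test, y_train, y_test = train_test_split(
        activations, labels, test_size=0.2, random_state=0
    )

    # A linear probe: logistic regression on frozen activations.
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    print("held-out probe accuracy:", probe.score(X_test, y_test))

With real activations, held-out probe accuracy (and calibration) would be one simple signal of whether the behaviour is linearly detectable at that layer.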

Mentors

David Lindner, Research Scientist, Google DeepMind (London)
Topics: Control, Monitoring, Safeguards, Dangerous Capability Evals, Scheming and Deception

David Lindner is a Research Scientist on Google DeepMind's AGI Safety and Alignment team, where he works on evaluations and mitigations for deceptive alignment and scheming. His recent work includes MONA, a method for reducing multi-turn reward hacking during RL; designing evaluations for stealth and situational awareness; and helping develop GDM's approach to deceptive alignment. Currently, David is interested in studying mitigations for scheming, including CoT monitoring and AI control. You can find more details on his website.


Mentorship style

For each project, we will have a weekly meeting to discuss the overall project direction and prioritize next steps for the upcoming week. On a day-to-day basis, you will discuss experiments and write code with the other mentees on the project; I'm also available on Slack between meetings for quick feedback or to unblock you.

I structure the program around collaborative, team-based research projects. You will work in a small team on a project from a predefined list. I organize the 12-week program into fast-paced research sprints designed to build and maintain research velocity, so you should expect regular deadlines and milestones. I will provide a more detailed schedule and set of milestones at the beginning of the program.

Scholars we are looking for

I am looking for scholars with strong machine learning engineering skills, as well as a background in technical research. While I'll provide weekly guidance on research, I expect scholars to run experiments and decide on low-level details fairly independently most of the time. I'll propose concrete projects to choose from, so you should not expect to work on your own research idea during MATS. I strongly encourage collaboration within the stream, so you should expect to work on a project in a team of 2-3 scholars; good communication and teamwork skills are therefore important.

Our stream is designed to be highly collaborative: we encourage scholars to work with each other and, where it makes sense, with external collaborators.

Project selection

We will most likely have a joint project selection phase, where we present a list of projects (with the option for scholars to iterate on them). Afterward, each project will have at least one main mentor, but we might also co-mentor some projects.