This stream focuses on monitoring, stress-testing safety methods, and evals, with an emphasis on risks from scheming AIs. Examples include (black-box) AI control techniques, white-box monitors (probes etc.), chain-of-thought monitoring/faithfulness, building evaluation environments, and stress-testing mitigations.
I'm interested in detecting and mitigating deceptive alignment, mainly via capability evaluations and control, and in supervising projects in these areas.
David Lindner is a Research Scientist on Google DeepMind's AGI Safety and Alignment team, where he works on evaluations and mitigations for deceptive alignment and scheming. His recent work includes MONA, a method for reducing multi-turn reward hacking during RL, designing evaluations for stealth and situational awareness, and helping develop GDM's approach to deceptive alignment. Currently, David is interested in studying mitigations for scheming, including CoT monitoring and AI control. You can find more details on his website.
For each project, we will have a weekly meeting to discuss the overall project direction and prioritize next steps for the upcoming week. Day to day, you will discuss experiments and write code with the other mentees on your project, and I'm available on Slack between meetings for quick feedback or to help with anything that is blocking you.
I structure the program around collaborative, team-based research projects. You will work in a small team on a project from a predefined list. I organize the 12-week program into fast-paced research sprints designed to build and maintain research velocity, so you should expect regular deadlines and milestones. I will provide a more detailed schedule and set of milestones at the beginning of the program.
I am looking for scholars with strong machine learning engineering skills and a background in technical research. While I'll provide weekly guidance on research, I expect scholars to run experiments and decide on low-level details fairly independently most of the time. I'll propose concrete projects to choose from, so you should not expect to work on your own research idea during MATS. I strongly encourage collaboration within the stream, and you should expect to work on a project in a team of 2-3 scholars, so good communication and teamwork skills are important.
We design our stream to be highly collaborative. We encourage scholars to work together and possibly with external collaborators.
We will most likely have a joint project selection phase, where we present a list of projects (with the option for scholars to iterate on them). Afterward, each project will have at least one main mentor, but we might also co-mentor some projects.
The MATS Research phase provides scholars with a community of peers.
During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.
Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.
Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Other impromptu office events included group-jailbreaking Bing Chat and exchanging hundreds of anonymous compliment notes. Scholars organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and joining ACX meetups.