I am broadly interested in research directions that scholars are excited about and that can advance the quality of our AI Safety tools and our confidence in them. Three particular areas of research that seem promising to me are:
Arthur Conmy is a Research Engineer at Google DeepMind, on the Language Model Interpretability team with Neel Nanda.
Arthur's focus is on practically useful interpretability and related AI Safety research. For example, Arthur was one of the core engineers who first added probes to Gemini deployments. Arthur has also recently led research on how to interpret reasoning models (https://arxiv.org/abs/2506.19143) and how to elicit knowledge from model organisms (https://arxiv.org/abs/2510.01070), both through MATS.
In the past, Arthur did early, influential work on automating interpretability and on finding circuits. Previously, Arthur worked at Redwood Research.
I meet with scholars for one hour per week in scheduled group meetings.
I also fairly frequently schedule ad hoc meetings with scholars to check on how they're doing and to address issues or opportunities that aren't directly related to the project.
I'll help with research obstacles, including outside of meetings.
My MATS Winter 2025 paper – not as interpretability-focused, but we do use probing, and this was an important paper for increasing confidence in claims of unfaithfulness: we can find unfaithfulness in the wild, but the rates of unfaithfulness are low
Thought Anchors – early reasoning model interpretability work I co-supervised with Neel, Paul, and Uzay in MATS Summer 2025
Benchmarking Interpretability – MATS Summer 2024 work evaluating Sparse Autoencoders (SAEs), one of several lines of evidence we used at DeepMind to deprioritise SAEs
and also Eliciting Secret Knowledge from Language Models, from MATS Summer 2025 scholar Bartosz, which uses model organisms to evaluate interpretability tools
Executing fast on projects is highly important. Having a good sense of which next steps are correct is also valuable, though I enjoy being pretty involved in projects, so it's somewhat easier for me to steer projects than to teach you how to execute fast from scratch. It also helps to be motivated to make interpretability useful and to use it for AI Safety.
I will also be interviewing folks doing Neel Nanda's MATS research sprint whom Neel doesn't get to work with.
I think collaborations are valuable, so I would try to pair you with, for example, some of my MATS extension scholars or, most likely, other scholars I take on this round.
Mentor(s) will talk through project ideas with the scholar.
MATS Research phase provides scholars with a community of peers.
During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.
Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.
Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Other impromptu office events included group-jailbreaking Bing chat and exchanging hundreds of anonymous compliment notes. Scholars organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and joining ACX meetups.