Neel Nanda

Neel takes a pragmatic approach to interpretability: identify what stands between where we are now and where we want to be by AGI, then focus on the subset of the resulting research problems that can be tractably studied on today's models. The methods can range from diving deep into the internals of the model to simpler black-box methods like reading and carefully intervening on the chain of thought - whatever is the right tool for the job. The work could look like studying how to detect deception, understanding why a model took a seemingly concerning action, or fixing weak points in other areas of safety, e.g. using interpretability to stop models realising they are being tested. You can learn more about Neel's approach in this podcast.

He has spent far too much time mentoring MATS scholars, and has worked with ~60 so far - he’s excited to take on even more!

Mentors

Neel Nanda
Google DeepMind, Senior Research Scientist
London
Interpretability

Neel leads the mechanistic interpretability team at Google DeepMind, which tries to use the internals of models to understand them better, and to use that understanding to make them safer, e.g. by detecting deception, understanding concerning behaviours, and monitoring deployed systems for harmful behaviour.

Since mid-2024, Neel has become more pessimistic about ambitious mechanistic interpretability, and more optimistic that pragmatic approaches can add a lot of value. He's doing less basic science and more model biology, along with work applying interpretability to real-world safety problems like monitoring.

He has spent far too much time mentoring MATS scholars, and has ~50 alumni - he's excited to take on even more!

Mentorship style

Representative papers

Scholars we are looking for

Project selection

Community at MATS

The MATS Research phase provides scholars with a community of peers.

During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.

Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.

Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Other impromptu office events included group-jailbreaking Bing Chat and exchanging hundreds of anonymous compliment notes. Scholars also organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and joining ACX meetups.