Lee Sharkey

Lee's stream will focus primarily on improving mechanistic interpretability methods for reverse-engineering neural networks.  

Stream overview

Understanding what AI systems are thinking seems important for ensuring their safety, especially as they become more capable. For some dangerous capabilities, like deception, it is likely one of the few ways we can get safety assurances in high-stakes settings. Reverse engineering what neural networks have learned also promises one of the few routes to assurances about safe generalization behaviour, especially as AI systems become vastly more capable than humans. 

Safety motivations aside, capable AI systems are extremely interesting objects of study, and doing digital neuroscience on them is much easier than studying biological neural systems.

A majority of reverse-engineering-focused interpretability work has involved sparse dictionary learning, which has various issues (Sharkey et al., 2025). In response to these issues, my team and I at Goodfire (and previously at Apollo Research) developed a new approach to decomposing neural networks, called Stochastic Parameter Decomposition (SPD), which builds on Attribution-based Parameter Decomposition (APD). 

We think SPD-like approaches are the most likely candidates to overcome the issues with dictionary learning. We believe they may offer solutions to many open problems in mech interp, such as representations distributed across attention heads, circuit identification, and understanding feature geometry. 
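
For readers unfamiliar with parameter decomposition, the sketch below illustrates only the core framing, as a minimal toy example under assumptions of my own (hypothetical tensor sizes, a plain reconstruction objective), not the actual SPD algorithm or codebase: a layer's weight matrix is expressed as a sum of rank-one subcomponents that must add back up to the original weights, and SPD additionally learns which subcomponents are causally important for any given input.

```python
import torch

# Illustrative toy sketch of parameter decomposition (NOT the SPD training objective):
# express a trained layer's weight matrix W as a sum of rank-one subcomponents u_c v_c^T.
torch.manual_seed(0)

d_out, d_in, n_components = 16, 32, 64            # hypothetical sizes for illustration
W = torch.randn(d_out, d_in)                      # stands in for a trained layer's weights

# Each subcomponent c is parameterised as a rank-one matrix u_c v_c^T.
U = torch.nn.Parameter(0.1 * torch.randn(n_components, d_out))
V = torch.nn.Parameter(0.1 * torch.randn(n_components, d_in))

opt = torch.optim.Adam([U, V], lr=1e-2)
for step in range(2000):
    W_hat = torch.einsum("co,ci->oi", U, V)       # sum over components of u_c v_c^T
    faithfulness = ((W - W_hat) ** 2).mean()      # subcomponents must sum back to W
    opt.zero_grad()
    faithfulness.backward()
    opt.step()
```

The faithfulness term shown here is only one ingredient; the papers listed under "Representative papers" below describe how the decomposition is additionally trained so that only a small, causally relevant set of subcomponents is needed for any given input.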

After significant investigation, we think SPD is now ready to be applied to genuinely interesting models, such as language models and vision models. We are excited to begin using SPD to reverse engineer non-toy neural networks. SPD also suggests approaches that may make it possible to develop intrinsically interpretable architectures. 

MATS projects in my stream should, at a minimum, be conceptually informed by SPD, if not build on it directly. 

Mentors

Lee Sharkey
Goodfire AI, Principal Investigator
London
Interpretability

Lee Sharkey is a Principal Investigator at Goodfire. His team has focused on improving interpretability methods, including parameter decomposition methods such as Attribution-based Parameter Decomposition and Stochastic Parameter Decomposition.

Previously, Lee was Chief Strategy Officer and co-founder of Apollo Research, and a Research Engineer at Conjecture, where he worked on sparse autoencoders as a solution to representational superposition. Lee's past research includes "Goal Misgeneralization in Deep Reinforcement Learning" and "Circumventing interpretability: How to defeat mind-readers."

Mentorship style

Mentorship looks like a one-hour weekly meeting by default, with roughly daily Slack messages in between. Usually these meetings are just for updates on how the project is going, where I'll provide some input and steering if necessary and desired. If there are urgent bottlenecks, I'm more than happy to meet between the weekly meetings or respond on Slack, almost always within 24 hours. We'll often run daily standup meetings if timezones permit, but these are optional.

Representative papers

I will expect candidates to be familiar with the papers "Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition" and "Stochastic Parameter Decomposition", and ideally with some of the surrounding literature, which can be found in the references section of the APD paper. 

Especially interested applicants may also wish to follow our ongoing work, which we mostly post about in the #parameter-decomposition channel of the Open Source Mechanistic Interpretability Slack workspace. 

Scholars we are looking for

As an indicative guide (this is not a score sheet), in no particular order, I evaluate candidates according to:

  • Science background: What indicators are there that the candidate can think scientifically, run their own research projects, and effectively help others in theirs?

  • Quantitative skills: How likely is the candidate to have a good grasp of mathematics that is relevant for interpretability research? Note that this may include advanced math topics that have not yet been widely used in interpretability research but have potential to be.

  • Engineering skills: How strong is the candidate's engineering background? Have they worked enough with Python and the relevant libraries (e.g. PyTorch, scikit-learn)?

  • Safety research interests: How deeply has the candidate engaged with the relevant AI safety literature? Are the research directions that they've landed on consistent with a well-developed theory of impact?

  • Conscientiousness: In interpretability, as in art, projects are never finished, only abandoned. How likely is it that the candidate will have enough conscientiousness to bring a project to a completed-enough state?

In the past cohort I chose a diverse set of candidates with varying strengths, and I think this worked quite well. Some mentees were outstanding in particular dimensions; others were great all-rounders.

In some cases, a project will be of a kind that works well as a collaboration with external researchers. In these cases, I'll usually encourage collaboration, but in general I leave the decision to form collaborations up to each scholar. 

Project selection

In general I'd like projects in my stream to at least be informed by SPD, if not build on it directly. Scholars and I will discuss projects and come to a consensus on what feels like a good direction. I will not tell scholars to work on a particular direction, since in my experience intrinsic motivation is important for producing good research. 

Community at MATS

The MATS Research phase provides scholars with a community of peers.

During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.

Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.

Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Other impromptu office events included group-jailbreaking Bing Chat and exchanging hundreds of anonymous compliment notes. Scholars organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and joining ACX meetups.