Lee's stream will focus primarily on improving mechanistic interpretability methods for reverse-engineering neural networks.
Understanding what AI systems are thinking seems important for ensuring their safety, especially as they become more capable. For some dangerous capabilities like deception, it’s likely one of the few ways we can get safety assurances in high-stakes settings.
The ability to reverse engineer what neural networks have learned promises one of the few ways that we might get assurances for safe generalization behaviour, especially as AI systems become vastly more capable than humans.
Safety motivations aside, capable AI systems are extremely interesting objects of study, and doing digital neuroscience on them is comparatively much easier than studying biological neural systems.
A majority of reverse-engineering-focused interpretability work has involved sparse dictionary learning, which has various issues (Sharkey et al., 2025). In response to these issues, my team and I at Goodfire (and previously Apollo Research) developed a new approach to decomposing neural networks, called Parameter Decomposition. We have used parameter decomposition to resolve feature splitting, identify attention-head distributed computations, identify circuits, and more.
We believe there are gaps in our method that remain to be solved. In our work (whether SAEs or parameter decomposition), we minimize 'description length'. But we are not yet confident that we have the right 'type of description'. We think understanding computational manifolds (which are projections of activation manifolds) is likely part of the answer here, since they may offer an even more concise description of neural computation than SDL latents or VPD parameter subcomponents.
MATS projects in my stream should at least be conceptually informed by parameter decomposition, manifolds, and minimum description length framings of interpretability, if not build on them directly.
Mentorship looks like a 1-hour weekly meeting by default, with approximately daily Slack messages in between. Usually these meetings are just for updates about how the project is going, where I’ll provide some input and steering if necessary and desired. If there are urgent bottlenecks, I’m more than happy to meet between weekly meetings or respond on Slack in (almost always) less than 24 hours. We'll often run daily standup meetings if timezones permit, but these are optional.
As an indicative guide (this is not a score sheet), in no particular order, I evaluate candidates according to:
In the past cohort I chose a diversity of candidates with varying strengths, and I think this worked quite well. Some mentees were outstanding in particular dimensions; others were great all-rounders.
In some cases, projects might be of a nature that they’ll work well as a collaboration with external researchers. In these cases, I’ll usually encourage collaborations. But in general I leave the decision to form collaborations up to each scholar.
In general, I'd like projects in my stream to be at least conceptually informed by parameter decomposition, manifolds, and minimum description length framings of interpretability, if not build on them directly.
Scholars and I will discuss projects and come to a consensus on what feels like a good direction. I will not tell scholars to work on a particular direction, since, in my experience, intrinsic motivation to work on a particular direction is important for producing good research.