Identifying Sparsely Active Circuits Through Local Loss Landscape Decomposition

MATS Alumnus

Brianna Chrisman

Collabortators

Brianna Chrisman, Lucius Bushnaq, Lee Sharkey

Citations

2 Citations

Abstract

Much of mechanistic interpretability has focused on understanding the activation spaces of large neural networks. However, activation space-based approaches reveal little about the underlying circuitry used to compute features. To better understand the circuits employed by models, we introduce a new decomposition method called Local Loss Landscape Decomposition (L3D). L3D identifies a set of low-rank subnetworks: directions in parameter space of which a subset can reconstruct the gradient of the loss between any sample's output and a reference output vector. We design a series of progressively more challenging toy models with well-defined subnetworks and show that L3D can nearly perfectly recover the associated subnetworks. Additionally, we investigate the extent to which perturbing the model in the direction of a given subnetwork affects only the relevant subset of samples. Finally, we apply L3D to a real-world transformer model and a convolutional neural network, demonstrating its potential to identify interpretable and relevant circuits in parameter space.

Recent research

Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs

Authors:

Jorio Cocola, Dylan Feng

Date:

December 10, 2025

Citations:

0

AI agents find $4.6M in blockchain smart contract exploits

Authors:

Fellow: Winnie Xiao

Date:

December 1, 2025

Citations:

0

Frequently asked questions

What is the MATS Program?
How long does the program last?