Probing by Analogy: Decomposing Probes into Activations for Better Interpretability and Inter-Model Generalization

MATS Alumnus

Patrick Leask

Collaborators

Patrick Leask, Noura Al Moubayed


Abstract

Linear probes have been used to demonstrate that LLM activations linearly encode high-level properties of the input, such as truthfulness, and that these directions can evolve significantly during training and fine-tuning. However, despite their apparent simplicity, linear probes can have complex geometric interpretations, leverage spurious correlations, and lack selectivity. We present a method for decomposing linear probe directions into weighted sums of as few as 10 model activations, whilst maintaining task performance. These probes are also invariant to affine transformations of the representation space, and we demonstrate that, in some cases, poor base-to-fine-tuned probe generalization is partially due to simple transformations of representation subspaces, and that the structure of the representation space changes less than other methods indicate.
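The decomposition described in the abstract can be sketched roughly as follows. This is a minimal illustration on synthetic data, assuming a logistic-regression probe and orthogonal matching pursuit for the sparse reconstruction; it is not the authors' exact procedure, only one plausible way to express a probe direction as a weighted sum of around 10 activation vectors.

```python
# Sketch (assumption, not the paper's implementation): fit a dense linear
# probe on activations, then approximate its direction as a sparse weighted
# sum of individual activation vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression, OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
d_model, n_samples = 256, 2000

# Placeholder data standing in for LLM residual-stream activations and
# binary task labels (e.g. truthful vs. untruthful statements).
X = rng.normal(size=(n_samples, d_model))
true_dir = rng.normal(size=d_model)
y = (X @ true_dir + 0.5 * rng.normal(size=n_samples) > 0).astype(int)

# 1. Standard dense linear probe.
probe = LogisticRegression(max_iter=1000).fit(X, y)
w = probe.coef_.ravel()  # probe direction in activation space

# 2. Decompose the probe direction into a weighted sum of a handful of
#    activation vectors: solve w ~= X.T @ c with at most 10 nonzero weights.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=10).fit(X.T, w)
coeffs = omp.coef_                 # one weight per training activation
support = np.nonzero(coeffs)[0]    # indices of the selected activations
w_sparse = X.T @ coeffs            # reconstructed probe direction

# 3. Check that the decomposed probe still performs the task.
scores = X @ w_sparse
acc = ((scores > np.median(scores)).astype(int) == y).mean()
print(f"selected {len(support)} activations, accuracy {acc:.3f}")
```

Because the reconstructed direction is expressed in terms of the activations themselves rather than raw coordinates, the same decomposition can be re-evaluated after an affine change of basis, which is the property the abstract appeals to when comparing base and fine-tuned representation spaces.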

