I am interested in mentoring projects that use simple interpretability approaches to understand how models generalize. Here are some illustrative example projects:
"studying persona representations": filter pretraining data for tokens within quotes and/or forum posts, and characterize the important dimensions in how pre-trained models represent these tokens (with SAE-like methods or simpler methods). Study the causal effect of these directions on behavior.
"studying goal representations": filter pretraining data for tokens describing multi-step sequences of actions (e.g. wikihow, walkthroughs, instruction manuals), characterize the important dimensions in how pre-trained models represent these tokens. Study the causal effect of these directions on behavior.
"read features vs. write features": pre-trained LLMs infer features that were relevant to the process of generating text (reading), and use them to compute features for predicting subsequent text (writing). Steer models' activations with random bias vectors as a stand-in for "write" features, and sample tokens. Then measure the activations of unsteered models on those sampled tokens, and compare them to unsteered samples as a baseline, as a stand-in for "read" features. Characterize the geometry of this mapping from “write” features to “read” features.
"studying representations underlying RL generalization": train models with RL on a mixture of prompt distributions, using a reward signal that either (1) varies across prompt distributions or (2) is the same across prompt distributions. Do models trained with (1) vs. (2) generalize differently? If so, can the difference be explained mechanistically in terms of their representations?
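As a concrete illustration of the two reward regimes, here is a small, library-agnostic sketch; the prompt distributions and both graders are hypothetical placeholders, and in a real experiment these reward functions would be plugged into an RL fine-tuning loop (e.g. PPO).

```python
# Hypothetical sketch of the two reward regimes; prompt sets and graders are
# placeholders, and a real run would plug these into an RL fine-tuning loop.
import random

PROMPT_DISTRIBUTIONS = {
    "math":   ["What is 17 * 24?", "Factor x^2 - 5x + 6."],
    "advice": ["How do I ask for a raise?", "How can I sleep better?"],
}

def sample_prompt():
    dist = random.choice(list(PROMPT_DISTRIBUTIONS))
    return dist, random.choice(PROMPT_DISTRIBUTIONS[dist])

def judge_helpfulness(prompt: str, response: str) -> float:
    """Placeholder scalar judge; a real run would use a reward model or grader."""
    return min(len(response.split()) / 100.0, 1.0)

def exact_match(prompt: str, response: str) -> float:
    """Placeholder exact-answer grader for the math distribution."""
    answers = {"What is 17 * 24?": "408", "Factor x^2 - 5x + 6.": "(x - 2)(x - 3)"}
    target = answers.get(prompt)
    return 1.0 if target is not None and target in response else 0.0

# Regime (1): the reward signal varies across prompt distributions.
def reward_varies(dist: str, prompt: str, response: str) -> float:
    return exact_match(prompt, response) if dist == "math" else judge_helpfulness(prompt, response)

# Regime (2): the same reward signal is applied across all prompt distributions.
def reward_shared(dist: str, prompt: str, response: str) -> float:
    return judge_helpfulness(prompt, response)

# Compare generalization of models trained under (1) vs. (2) on held-out prompt
# distributions, then probe whether representations explain any difference.
```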
I lead the interpretability team at OpenAI. I am most interested in simple, practical interpretability approaches that are targeted at making models safer. In a previous life, I worked as a neuroscientist.
1 hour/week meetings + async discussion in Slack threads; additional meetings can be scheduled ad hoc.
Essential:
Preferred:
Scholars are welcome to find collaborators.
We'll go through potential projects at the beginning, and scholars can propose alternatives. Scholars should spend the first week or two exploring and decide on a project direction by the end of the second week.
The MATS Research phase provides scholars with a community of peers.
During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.
Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.
Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Impromptu office events have included group-jailbreaking Bing Chat and exchanging hundreds of anonymous compliment notes. Scholars also organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and ACX meetups.