I am interested in mentoring projects that use simple interpretability approaches to understand how models generalize. Here are some illustrative example projects:
"studying persona representations": filter pretraining data for tokens within quotes and/or forum posts, and characterize the important dimensions in how pre-trained models represent these tokens (with SAE-like methods or simpler methods). Study the causal effect of these directions on behavior.
"studying goal representations": filter pretraining data for tokens describing multi-step sequences of actions (e.g. wikihow, walkthroughs, instruction manuals), characterize the important dimensions in how pre-trained models represent these tokens. Study the causal effect of these directions on behavior.
"read features vs. write features": pre-trained LLMs infer features that were relevant to the process of generating text (reading), and use them to compute features for predicting subsequent text (writing). Steer models' activations with random bias vectors as a stand-in for "write" features, and sample tokens. Then measure the activations of unsteered models on those sampled tokens, and compare them to unsteered samples as a baseline, as a stand-in for "read" features. Characterize the geometry of this mapping from “write” features to “read” features.
"studying representations underlying RL generalization": train models with RL on a mixture of prompt distributions, using a reward signal that either (1) varies across prompt distributions or (2) is the same across prompt distributions. Do models trained with (1) vs. (2) generalize differently? If so, can the difference be explained mechanistically in terms of their representations?
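As a concrete illustration of the two reward regimes, here is a small, library-agnostic sketch; the prompt distributions and both graders are hypothetical placeholders, and in a real experiment these reward functions would be plugged into an RL fine-tuning loop (e.g. PPO).

```python
# Hypothetical sketch of the two reward regimes; prompt sets and graders are
# placeholders, and a real run would plug these into an RL fine-tuning loop.
import random

PROMPT_DISTRIBUTIONS = {
    "math":   ["What is 17 * 24?", "Factor x^2 - 5x + 6."],
    "advice": ["How do I ask for a raise?", "How can I sleep better?"],
}

def sample_prompt():
    dist = random.choice(list(PROMPT_DISTRIBUTIONS))
    return dist, random.choice(PROMPT_DISTRIBUTIONS[dist])

def judge_helpfulness(prompt: str, response: str) -> float:
    """Placeholder scalar judge; a real run would use a reward model or grader."""
    return min(len(response.split()) / 100.0, 1.0)

def exact_match(prompt: str, response: str) -> float:
    """Placeholder exact-answer grader for the math distribution."""
    answers = {"What is 17 * 24?": "408", "Factor x^2 - 5x + 6.": "(x - 2)(x - 3)"}
    target = answers.get(prompt)
    return 1.0 if target is not None and target in response else 0.0

# Regime (1): the reward signal varies across prompt distributions.
def reward_varies(dist: str, prompt: str, response: str) -> float:
    return exact_match(prompt, response) if dist == "math" else judge_helpfulness(prompt, response)

# Regime (2): the same reward signal is applied across all prompt distributions.
def reward_shared(dist: str, prompt: str, response: str) -> float:
    return judge_helpfulness(prompt, response)

# Compare generalization of models trained under (1) vs. (2) on held-out prompt
# distributions, then probe whether representations explain any difference.
```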
I lead the interpretability team at OpenAI. I am most interested in simple, practical interpretability approaches that are targeted at making models safer. In a previous life, I worked as a neuroscientist.
1 hour/week meetings + async discussion in Slack threads; additional meetings can be scheduled ad hoc.
Essential:
Preferred:
Scholars are welcome to find collaborators.
We'll go through potential projects at the beginning, and scholars can propose alternatives. Scholars should spend the first week or two exploring and decide on a project direction by the end of the second week.
The MATS Research phase provides scholars with a community of peers.
During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.
Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.
Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Impromptu office events have included group-jailbreaking Bing Chat and exchanging hundreds of anonymous compliment notes. Scholars also organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and ACX meetups.