MSL Deep Alignment

We research early training interventions that shape a model's psychological core such that alignment generalizes through subsequent training. Potential projects span developing evaluations of model psychology, developing training interventions, experimenting with seeding the chain-of-thought patterns of the model, and methods for making models active participants in their own alignment.

Apply

View all streams

Stream overview

We are excited about projects on what we call “Deep Alignment”: early interventions during training to shape the model's psychological core such that future training preserves and deepens the alignment of the model. Central questions for this stream include: how can we shape the psychology of a model early in training? How can we measure the effect on alignment? How can we make the model an active participant in its own alignment?

We are open to scoping out projects together. Some example project directions we have in mind are:

Building novel evaluations to measure aspects of model psychology. For example, how does a model relate to having made a mistake?
Coming up with training interventions during RL that may help the model generalize in more aligned ways. For example, could you do inoculation prompting by normally sampling your RL trajectories, scoring them, and then before training on them generate one last message from the model where you tell it that it is okay for it to reward hack and allow it to respond. How does the model generalize when we train on messages with this suffix?
Are there ways to seed the chain of thought during SFT that instill thinking patterns into the model that make it more aligned? For example, what happens if you seed the CoT as a dialogue between different characters? What if you seed the CoT with examples of mindfulness?
How can we involve the model as an active participant in its own alignment? For example, we could take a model, ask it to reflect on how it would like to act and write a letter to itself. If we train on this letter, does the model internalize the values and policies in this letter to self more deeply than other alignment training approaches?
In part 4.2.3.3 of the Muse Spark Safety & Preparedness Report, we investigate the coherence between the models beliefs about what it should and what it actually does. We’d be excited to do more research on the coherence between models beliefs on what is right to do (which is often quite aligned and reasonable) and what it actually does. How can we better measure this gap? What training methods might close the gap?

Mentors

Felix Binder

Mentorship style

By default, 1 hour weekly meetings, ideally in person in Berkeley. One day a week I will be in Berkeley and can have quite high touch in person chats throughout the day. The rest of the week I expect to be somewhat accessible on Slack and should be able to respond to messages within a few hours. I expect a lot of discussion on top of artifacts—experiment proposals, plots, slides with interim results. I can jump on quick calls to unblock, but my availability for this can change depending on the day.

Fellows we are looking for

Essential:

Good research judgment and ability to work independently on exploratory research. Prior evidence of this (e.g., a first-author paper, a self-directed project) is a strong signal.
Conceptual reasoning skills. We expect the hard challenges of these projects to be figuring out what to measure and how, rather than details of technical implementation.
Understanding of how LLMs are trained and the considerations of different kinds of posttraining.
Proficiency with basic Python.
For projects involving training: prior experience training models (especially RL) is strongly preferred. For evaluation/inference projects, training experience is not necessary but helpful.

Preferred:

An open mind and curiosity about what sort of thing an LLM is. We could imagine non-traditional backgrounds like cognitive science or literature being a great fit here.
Deep intuitions about LLM behavior—the kind you get from having used many models extensively and noticed the differences between them.
Ability to synthesize and present messy research results. Having stared for hours at .csv files with model outputs is a positive sign.
Excitement about engaging with the broader vision while also drilling down into specifics. We want fellows who keep the grand picture in mind.
Previous research experience.

Not expected to be particularly helpful:

Mechanistic interpretability experience—we don't anticipate supervising mech interp projects.
Low-level training infrastructure experience (at least for the more evaluation-flavored projects).

Project selection

We are happy to work with fellows in the first week to jointly develop a project. We (at least Felix) are happy to be high touch during this phase and to discuss and brainstorm potential projects.