We research early training interventions that shape a model's psychological core so that alignment generalizes through subsequent training. Potential projects span developing evaluations of model psychology, developing training interventions, experimenting with seeding the model's chain-of-thought patterns, and developing methods for making models active participants in their own alignment.
We are excited about projects on what we call “Deep Alignment”: early interventions during training to shape the model's psychological core such that future training preserves and deepens the alignment of the model. Central questions for this stream include: How can we shape the psychology of a model early in training? How can we measure the effect on alignment? How can we make the model an active participant in its own alignment?
We are open to scoping out projects together. Some example project directions we have in mind are:
By default, we meet for one hour each week, ideally in person in Berkeley. One day a week I will be in Berkeley and can have high-touch, in-person chats throughout the day. The rest of the week I expect to be somewhat accessible on Slack and should be able to respond to messages within a few hours. I expect a lot of discussion built on artifacts: experiment proposals, plots, and slides with interim results. I can jump on quick calls to unblock, but my availability for this varies from day to day.
Essential:
Preferred:
Not expected to be particularly helpful:
We are happy to work with fellows in the first week to jointly develop a project. We (at least Felix) are happy to be high-touch during this phase and to discuss and brainstorm potential projects.