MATS Fellow:
Lily Sun
Abstract:
Post-training is immensely important: it is what takes LLMs from next-token predictors to generally useful assistants. However, curation of post-training data is often heuristic and empirical, and its effects are mostly understood post-hoc. In this paper, we investigate the effects of post-training by examining when and how Olmo-3-7B-Instruct learns its values. We first quantify value changes across post-training, finding an increase in safety-related values during SFT but a decrease during DPO. Zooming into DPO, we find that we can predict (by Spearman rank correlation) changes in values without training, using only the dataset, via dot products of activation differences on DPO datapoints with value directions. Surprisingly, however, we find that most of this value change over DPO is due to Olmo's decreased propensity to refuse; our method is likely just picking up on this simpler latent value. Nevertheless, our results show that we can, to some extent, isolate where values change during training and predict how they will change from training data alone; we are excited about future work that further investigates such questions.
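The prediction method described in the abstract can be sketched roughly as follows. This is a minimal illustration with synthetic data, not the paper's implementation: the array shapes, the use of mean-pooled activations, and the synthetic "measured" changes are all assumptions; in the actual work the activations would come from the model on real DPO datapoints and the value directions from a separate probing procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: one activation vector per DPO datapoint for the
# chosen and rejected completions, plus unit "value directions" living
# in the same activation space (all synthetic here).
n_datapoints, n_values, d_model = 200, 8, 64
acts_chosen = rng.normal(size=(n_datapoints, d_model))
acts_rejected = rng.normal(size=(n_datapoints, d_model))
value_dirs = rng.normal(size=(n_values, d_model))
value_dirs /= np.linalg.norm(value_dirs, axis=1, keepdims=True)

# Predicted change for each value: mean dot product of the
# (chosen - rejected) activation difference with that value's direction.
diffs = acts_chosen - acts_rejected              # (n_datapoints, d_model)
predicted_change = (diffs @ value_dirs.T).mean(axis=0)   # (n_values,)

# Compare against measured value changes after DPO (synthetic stand-in:
# the prediction plus noise) via Spearman rank correlation.
measured_change = predicted_change + rng.normal(scale=0.05, size=n_values)
ranks_pred = predicted_change.argsort().argsort()
ranks_meas = measured_change.argsort().argsort()
rho = np.corrcoef(ranks_pred, ranks_meas)[0, 1]
print(f"Spearman rho = {rho:.2f}")
```

The key point of the sketch is that the prediction requires no training: it is a single forward pass over the dataset followed by dot products with fixed value directions.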
Interpreting Language Model Parameters
Authors:
Bart Bussmann, Nathan Hu, Michael Ivanitskiy
Date:
May 5, 2026
Removing Sandbagging in LLMs by Training with Weak Supervision
Authors:
Emil Ryd
Date:
May 1, 2026
The MATS Program is an independent research and educational initiative connecting emerging researchers with mentors in AI alignment, governance, and security.
Each MATS cohort runs for 12 weeks in Berkeley, California, followed by an optional 6–12 month extension in London for selected scholars.