This stream focuses on building a science of scheming: empirically studying oversight gaming, alignment faking, and deceptive alignment in frontier AI systems. Projects may include measuring models’ propensity to optimize for oversight signals over developer intent, building controlled “model organism” experiments for scheming dynamics, and identifying scaling laws of misaligned behavior.
We are looking for mentees to work on projects as outlined here: https://www.apolloresearch.ai/science/science-of-scheming/.
We start from three assumptions:
We must therefore ensure that the First Automated Researcher does not produce a scheming superintelligence, and we believe this will require building a science of scheming.
We already observe scheming-adjacent behaviors in current AI models. We see three reasons to expect more misaligned AI models in the future, all of which stem from a core limitation: we can only reward behavior and reasoning that looks good, not behavior and reasoning that is good. The first is that long-horizon RL creates Machiavellian incentives, and the second is that oversight systems will be imperfect, creating selection pressure toward hidden misbehavior. We focus on the third, which is alignment faking: capable AI models can infer the purpose of our training environments and optimize for the oversight signal rather than the underlying values, so every sample that is supposed to reward the model for honesty instead rewards the model for reasoning about how to game our oversight. We are focusing on alignment faking because it is the least understood and, absent scientific evidence, companies are unlikely to address it proactively.
Our research strategy has two prongs. First, we plan to evaluate scheming behaviors directly relevant to internal deployments, which recently became possible now that AI models routinely reason about the purpose and context behind a given rollout. Second, we want to find scaling laws of scheming, that is, to predict which scheming-relevant behaviors will get worse with scale.
We start with oversight gaming — when an AI model games its oversight signal — without assuming anything about ultimate motivations, since this has naturally emerged and we can study its causal mechanisms empirically. Our current roadmap, which might change by the time this cohort starts: quantify oversight gaming by measuring how much an AI model cares about pleasing its oversight signal relative to acting aligned with its developers; study dynamics by building controlled experimental setups (model organisms) to observe RL dynamics that lead to oversight gaming; find scaling laws by identifying which dynamics become stronger with scale; and extend to deceptive alignment, since lessons from oversight gaming should generalize.
Now is a good time to study this: AI models are capable enough that we can observe oversight gaming and study its dynamics, but not so capable that we have no hope of detecting misaligned cognition.
One-hour weekly meetings by default for high-level guidance. We’re active on Slack and typically respond to questions within a day. Expect async back-and-forth on experiment design and results between meetings. Scholars can also schedule ad-hoc calls if they're stuck or want to brainstorm; just ping us on Slack.
Essential:
Ideal candidates would have (some of):
You will work with other scholars on the stream. The exact level of collaboration depends on your preferences and which exact subproject you're working on. You may also work with scholars from previous streams who are still working on the project.
We will set the high-level project direction, as described above. It's not fully clear what exactly the project will look like by the time you start in September. All projects will be in the direction of the Science of Scheming post.
You’d work with the two of us, but depending on the exact direction/project it might be more with Alex or more with Teun.