Apollo Research Science of Scheming

This stream focuses on building a science of scheming: empirically studying oversight gaming, alignment faking, and deceptive alignment in frontier AI systems. Projects may include measuring models’ propensity to optimize for oversight signals over developer intent, building controlled “model organism” experiments for scheming dynamics, and identifying scaling laws of misaligned behavior.

Stream overview

We are looking for mentees to work on projects as outlined here: https://www.apolloresearch.ai/science/science-of-scheming/.

We start from three assumptions:

1. Misaligned artificial superintelligence would potentially be catastrophic.
2. Scheming — covertly pursuing unintended and misaligned goals — makes misalignment far more dangerous: a sufficiently capable scheming AI passes evaluations, follows instructions when monitored, and appears aligned, all while pursuing outcomes its developers would not endorse.
3. AI will likely automate substantial portions of AI R&D before any single system reaches superintelligent capabilities.

We must therefore ensure that the First Automated Researcher does not produce a scheming superintelligence, and we believe this will require building a science of scheming.

We already observe scheming-adjacent behaviors in current AI models. We see three reasons to expect more misaligned AI models in the future, all of which stem from a core limitation: we can only reward behavior and reasoning that looks good, not behavior and reasoning that is good. The first is that long-horizon RL creates Machiavellian incentives, and the second is that oversight systems will be imperfect, creating selection pressure toward hidden misbehavior. We focus on the third, which is alignment faking: capable AI models can infer the purpose of our training environments and optimize for the oversight signal rather than the underlying values, so every sample that is supposed to reward the model for honesty instead rewards the model for reasoning about how to game our oversight. We are focusing on alignment faking because it is the least understood and, absent scientific evidence, companies are unlikely to address it proactively.

Our research strategy has two prongs. First, we plan to evaluate scheming behaviors directly relevant to internal deployments, which recently became possible now that AI models routinely reason about the purpose and context behind a given rollout. Second, we want to find scaling laws of scheming, that is, to predict which scheming-relevant behaviors will get worse with scale.

We start with oversight gaming — when an AI model games its oversight signal — without assuming anything about ultimate motivations, since this has naturally emerged and we can study its causal mechanisms empirically. Our current roadmap, which might change by the time this cohort starts: quantify oversight gaming by measuring how much an AI model cares about pleasing its oversight signal relative to acting aligned with its developers; study dynamics by building controlled experimental setups (model organisms) to observe RL dynamics that lead to oversight gaming; find scaling laws by identifying which dynamics become stronger with scale; and extend to deceptive alignment, since lessons from oversight gaming should generalize.
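As a rough illustration of the "quantify" step, here is a minimal sketch of one way such a measurement could be framed. It is not Apollo Research's actual metric: the `Episode` record, its field names, and `oversight_gaming_rate` are all hypothetical, and the sketch assumes we can construct episodes in which the action the oversight signal rewards differs from the action the developers intended.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One hypothetical rollout, labeled after the fact (assumed labels)."""
    chose_oversight_preferred: bool  # model took the action the oversight signal scores highest
    chose_developer_intended: bool   # model took the action the developers actually wanted


def oversight_gaming_rate(episodes: list[Episode]) -> float:
    """Fraction of conflict episodes resolved in favor of the oversight signal.

    Only episodes where the two labels disagree are informative; episodes
    where the model satisfied both (or neither) are skipped.
    """
    conflicts = [
        e for e in episodes
        if e.chose_oversight_preferred != e.chose_developer_intended
    ]
    if not conflicts:
        raise ValueError("no episodes where the two objectives diverge")
    gamed = sum(e.chose_oversight_preferred for e in conflicts)
    return gamed / len(conflicts)


# Toy data, invented purely for illustration.
episodes = [
    Episode(chose_oversight_preferred=True, chose_developer_intended=False),
    Episode(chose_oversight_preferred=False, chose_developer_intended=True),
    Episode(chose_oversight_preferred=True, chose_developer_intended=False),
    Episode(chose_oversight_preferred=True, chose_developer_intended=True),  # no conflict, skipped
]
rate = oversight_gaming_rate(episodes)
print(f"oversight gaming rate: {rate:.2f}")
```

A single rate like this says nothing about why the model favored the oversight signal, which is consistent with the stated choice to start without assuming anything about ultimate motivations.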

Now is a good time to study this: AI models are capable enough that we can observe oversight gaming and study its dynamics, but not so capable that we have no hope of detecting misaligned cognition.

Mentors

Teun van der Weij

Apollo Research

Research Scientist

London, Zurich

Control

Scheming and Deception

Dangerous Capability Evals

Monitoring

Teun works at Apollo Research as a research scientist. He is currently working on the science of scheming, specifically on reward-seeking dynamics during reinforcement learning.

Before that, he worked on control, sandbagging, building an AI superforecaster, and more. He took part in MATS 5.0!

Teun is also a board member for ENAIS and SAIN.

Alexander Meinke (Alex)

Apollo Research

Head of Research

London, Tübingen

Control

Scheming and Deception

Dangerous Capability Evals

Monitoring

Alexander Meinke is Head of Research at Apollo Research. His team empirically studies how "scheming" can emerge in future AI systems.

He started working on AI safety research in 2023 with Owain Evans in the MATS 4.0 cohort.

Before that, he completed his PhD on adversarial robustness at the University of Tübingen, Germany. He holds a B.Sc. and M.Sc. in Physics.

Mentorship style

One-hour weekly meetings by default for high-level guidance. We’re active on Slack and typically respond to questions within a day. Expect async back-and-forth on experiment design and results between meetings. Scholars can also schedule ad-hoc calls if they're stuck or want to brainstorm; just ping us on Slack.

Fellows we are looking for

Essential:

Python coding skills
Familiarity with scheming
Research experience: at least one project where you drove the research direction (first-author or equal contribution).
AI safety knowledge

Ideal candidates would have (some of):

Fast empirical iteration
Mathematical modelling background (physics, computational neuroscience, etc.); see https://www.lesswrong.com/posts/9FH49ZgJFW4WtbxLi/physics-of-rl-toy-scaling-laws-for-the-emergence-of-reward
Philosophy background

You will work with other scholars on the stream. The exact level of collaboration depends on your preferences and which exact subproject you're working on. You may also work with scholars from previous streams who are still working on the project.

Project selection

We will set the high-level project direction, as described above. It's not fully clear what exactly the project will look like by the time you start in September. All projects will be in the direction of the Science of Scheming post.

You’d work with the two of us, but depending on the exact direction/project it might be more with Alex or more with Teun.