Resisting RL Elicitation of Biosecurity Capabilities: Reasoning Models Exploration Hacking on WMDP

MATS Alumni

Joschka Braun, Damon Falck, Yeonwoo Jang

Collaborators

Joschka Braun, Yeonwoo Jang, Damon Falck, Roland Zimmermann, David Lindner, Scott Emmons

Citations

0

Abstract

As frontier reasoning models become more capable, accurate dangerous capability evaluation is becoming essential for risk estimation and governance. Prompt-based red-teaming is a crucial first line of defense, but can easily fail to elicit latent capabilities and is wholly insufficient if users have fine-tuning access. Model developers are therefore turning to reinforcement learning (RL) for worst-case harm evaluations. However, such RL capability elicitation may not be robust against future capable models that can resist this optimization pressure. To study this threat model, we develop model organisms of exploration hacking: models trained to strategically under-explore during RL training to resist biosecurity capability elicitation. Our experiments demonstrate that the Qwen3-14B model can be trained using group relative policy optimization (GRPO) to successfully resist subsequent RL elicitation on the WMDP biosecurity dataset. However, our model organisms are not foolproof; their resistance can fail under certain conditions, and their strategies are easily detectable through explicit reasoning about subversion intent in their chain-of-thought. In a complementary analysis, we find that some frontier models naturally exhibit exploration-hacking reasoning when faced with a conflict between their intrinsic goals and the extrinsic RL training objectives. Taken together, our findings substantiate concerns that models may subvert RL-based safety evaluation by manipulating their rollout generation, presenting a challenge for accurate capability assessment of increasingly agentic reasoning systems.
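To make the mechanism concrete, the sketch below (a minimal illustration, not the authors' implementation) shows the standard GRPO group-relative advantage: rewards are standardized within a group of rollouts sampled from the same prompt. A model that keeps every rollout in a group equally unrewarded produces zero reward variance, hence all-zero advantages and no policy gradient, which is the lever an exploration-hacking model pulls. The function name and epsilon smoothing are illustrative choices.

```python
# Minimal sketch of GRPO's group-relative advantage computation
# (illustrative, not the paper's code): rewards for rollouts sampled
# from the same prompt are standardized within the group.
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Standardize a group's rollout rewards into relative advantages."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# A normally exploring model: reward variance yields nonzero advantages,
# so correct rollouts get reinforced.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # ≈ [ 1., -1.,  1., -1.]

# An exploration-hacking model answering uniformly wrong: zero variance,
# zero advantages, and thus no learning signal for RL elicitation.
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # -> [ 0.,  0.,  0.,  0.]
```

Note that what matters here is within-group reward variance, not the absolute reward level: a model that suppresses variance by under-exploring leaves the optimizer nothing to amplify.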
