Peter Henderson

Peter Henderson’s stream focuses on developing safe, aligned AI agents, with projects on scalable oversight rules informed by law and game theory, safe long-horizon exploration, and measuring “jagged” capability/safety frontiers. Scholars will join an engineering-heavy research environment where they independently drive their projects, collaborating with other MATS scholars and PhD students, with weekly 1:1s and active async mentorship.

Stream overview

I'd be interested in a variety of potential projects, but three directions that I'm excited by are:

+ Building better AI agent rules for scalable oversight: In our recent work, we showed that different models adopt different strategies when interpreting laws for AI. But much remains open in building more complete specifications for AI agents. Can we leverage tools from law? What about reward specification and game theory?

+ Safe long-horizon exploration for agents: As we create agents with the goal of discovery, the risks of reward mis-specification grow. How can we create good explorers with agency that nonetheless remain aligned over long horizons? Are there differential signs of misalignment in specific high-risk applications, such as AI R&D, bio research, and cybersecurity research? How do we disentangle optimizing for long-horizon tasks from implicit misaligned long-horizon goals (e.g., survival/agency versus task completion)? If we want self-directed learning and exploration, how can we keep them aligned in the long run? (A toy sketch of how long horizons amplify mis-specified rewards follows after this list.)

+ Measuring the jagged frontier: Jaggedness is a problem for both capabilities and safety. As the frontier becomes more jagged, seemingly highly capable models may fail in surprising ways. How can we measure a model's jaggedness as a property? How can we leverage this to understand safety and jailbreaking risks? (A minimal sketch of one possible jaggedness measure follows below.)
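
To make the long-horizon concern concrete, here is a toy sketch (my own illustration with made-up numbers, not a method from this stream or the papers below): if an agent earns a small per-step bonus for staying active or "exploring" on top of a one-time reward for completing its task, the return-maximizing policy flips from completing the task to avoiding completion once the horizon is long enough.

```python
# Toy sketch (illustrative only; step_bonus and task_reward values are made up):
# a small per-step shaping reward for staying active, combined with a one-time
# task-completion reward, makes "never finish the task" optimal at long horizons.

def best_return(horizon: int, step_bonus: float = 0.02, task_reward: float = 1.0) -> str:
    """Compare the return from finishing the task immediately (ending the episode)
    with the return from never finishing and collecting the per-step bonus."""
    finish_now = task_reward
    keep_going = step_bonus * horizon
    return "complete task" if finish_now >= keep_going else "avoid finishing (keep 'exploring')"

for h in (10, 50, 200):
    print(h, best_return(h))
# 10  -> complete task
# 50  -> complete task (exactly the tie point with these numbers)
# 200 -> avoid finishing (keep 'exploring')
```

The point of the sketch is only that the same mis-specification can be invisible at short horizons and dominant at long ones, which is one reason disentangling task optimization from implicit long-horizon incentives is hard.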
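
And as one crude way to think about jaggedness as a measurable property, here is a minimal sketch, again entirely my own illustration: treat jaggedness as the dispersion of per-task success rates among models with comparable average capability. The model names, numbers, and the choice of standard deviation as the dispersion measure are assumptions for illustration only.

```python
import numpy as np

def jaggedness_score(per_task_success: np.ndarray) -> float:
    """Standard deviation of per-task success rates: a model that succeeds at a
    roughly constant rate on every task scores near 0; a model that aces some
    tasks while failing others scores high."""
    return float(np.std(per_task_success))

# Hypothetical per-task success rates for two models with the same mean accuracy (0.60).
smooth_model = np.array([0.62, 0.58, 0.60, 0.61, 0.59])
jagged_model = np.array([0.98, 0.15, 0.95, 0.10, 0.82])

print(jaggedness_score(smooth_model))  # ~0.01: low jaggedness
print(jaggedness_score(jagged_model))  # ~0.39: high jaggedness
```

A real measure would need to control for task difficulty and sample noise; the sketch is only meant to show why two models with identical average scores can pose very different safety risks.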

Mentors

Peter Henderson
Princeton University, Assistant Professor
New York City
Dangerous Capability Evals, Control, Strategy & Forecasting, Policy & Governance, Scalable Oversight, Agent Foundations

Mentorship style

45-minute weekly meetings by default for high-level guidance. I'm active on Slack for quick questions or conceptual (not code) debugging. Expect async back-and-forth on experiment design and results between meetings. Scholars can also schedule ad-hoc calls if they're stuck or want to brainstorm; just ping me on Slack. Other team members (PhD students) will also be around to help with brainstorming and getting unstuck.

Representative papers

https://arxiv.org/abs/2509.01186

https://arxiv.org/abs/2510.07686

https://proceedings.neurips.cc/paper_files/paper/2024/hash/c4e380fb74dec9da9c7212e834657aa9-Abstract-Conference.html

https://arxiv.org/abs/2503.06378

Scholars we are looking for

Essential:

  • ML engineering experience, specifically implementing and training models on, e.g., Slurm clusters.
  • Research experience: at least one project where you drove the research direction (first-author or equal contribution).
  • Familiarity with basic reinforcement learning principles and algorithms.

Nice to have, but not necessary:

  • Prior RL papers.
  • Theory background.
  • Game theory background.
  • Law background.

Not a good fit:

  • Scholars who only have high-level ML knowledge without hands-on implementation experience.
  • Those who need heavily structured guidance (I will provide guidance and the goal is to work as part of a larger team, but scholars will still need to independently drive research between team meetings and 1:1s).

Collaborators will be other MATS scholars, as well as PhD students/post-docs at multiple institutions (including Princeton).

Project selection

Mentors in the group will pitch projects, and scholars will try ones they find interesting for a week. We'll iterate together at the end of week 1 and pick final assignments in week 2.

Community at MATS

The MATS Research phase provides scholars with a community of peers.

During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.

Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.

Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Other impromptu office events included group-jailbreaking Bing chat and exchanging hundreds of anonymous compliment notes.  Scholars organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and joining ACX meetups.