Peter Henderson’s stream focuses on developing safe, aligned AI agents, with projects on scalable oversight rules informed by law and game theory, safe long-horizon exploration, and measuring “jagged” capability/safety frontiers. Scholars will join an independently driven, engineering-heavy research environment, collaborating with other MATS scholars and PhD students, with weekly 1:1s and active async mentorship.
I'm open to a variety of potential projects, but three directions I'm especially excited about are:
+ Building better AI agent rules for scalable oversight: In our recent work, we showed that different models adopt different strategies when interpreting laws for AI. But there is still a lot to figure out in building more complete specifications for AI agents. Can we leverage tools from law? What about reward specification and game theory?
+ Safe long-horizon exploration for agents: As we create agents with the goal of discovery, the risks of reward misspecification grow. How can we create good explorers with agency that nonetheless remain aligned over long horizons? Are there differential signs of misalignment in specific high-risk applications such as AI R&D, bio research, and cybersecurity research? How do we disentangle optimizing for long-horizon tasks from implicit, misaligned long-horizon goals (e.g., survival/agency versus task completion)? And if we want self-directed learning and exploration, how do we keep it aligned in the long run?
+ Measuring the jagged frontier: Jaggedness has been a problem for both capabilities and safety. As the frontier becomes more jagged, seemingly highly capable models may fail in surprising ways. How can we measure a model's jaggedness as a property in its own right? And how can we leverage that measurement to understand safety and jailbreaking risks? (A toy sketch of one way to operationalize this follows below.)
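To make the jaggedness question a bit more concrete, here is a minimal, purely illustrative sketch that treats jaggedness as the dispersion (coefficient of variation) of one model's per-task accuracy across a benchmark suite. The metric choice, function name, task names, and numbers are all assumptions made for illustration; they are not taken from the linked papers or this stream's methodology.

```python
# Hypothetical sketch: "jaggedness" as dispersion of per-task performance
# for a single model across a benchmark suite. All names and numbers are made up.
import statistics

def jaggedness_score(per_task_accuracy: dict[str, float]) -> float:
    """Coefficient of variation of per-task accuracy: higher means more jagged."""
    scores = list(per_task_accuracy.values())
    mean = statistics.mean(scores)
    if mean == 0:
        return float("inf")  # all-zero accuracy: dispersion relative to the mean is undefined
    return statistics.pstdev(scores) / mean

if __name__ == "__main__":
    # Hypothetical per-task accuracies for one model.
    model_results = {
        "grade_school_math": 0.92,
        "multi_step_planning": 0.31,
        "legal_reasoning": 0.78,
        "tool_use": 0.45,
    }
    print(f"Jaggedness (CV of per-task accuracy): {jaggedness_score(model_results):.3f}")
```

A real measurement would need to account for task difficulty, prompt sensitivity, and correlated capabilities; the point here is only that jaggedness can be framed as a scalar property one could track across model scales or safety interventions.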
45-minute weekly meetings by default for high-level guidance. I'm active on Slack for quick questions or conceptual (not code) debugging. Expect async back-and-forth on experiment design and results between meetings. Scholars can also schedule ad-hoc calls if they're stuck or want to brainstorm; just ping me on Slack. Other team members (PhD students) will also be around to help with brainstorming and getting unstuck.
https://arxiv.org/abs/2509.01186
https://arxiv.org/abs/2510.07686
https://proceedings.neurips.cc/paper_files/paper/2024/hash/c4e380fb74dec9da9c7212e834657aa9-Abstract-Conference.html
https://arxiv.org/abs/2503.06378
Essential:
Nice to have, but not necessary:
Not a good fit:
Collaborators will be other MATS scholars, as well as PhD students/post-docs at multiple institutions (including Princeton).
Mentors in the group will pitch projects, and scholars will try ones they find interesting for a week. We'll iterate together at the end of week 1 and pick final assignments in week 2.
The MATS Research phase provides scholars with a community of peers.
During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.
Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.
Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Other impromptu office events included group-jailbreaking Bing Chat and exchanging hundreds of anonymous compliment notes. Scholars also organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and ACX meetups.