I'm interested in mentoring projects related to reward hacking and to monitoring (agentic) models that produce long, complex trajectories.
He He is an associate professor at New York University. She is interested in how large language models work and in the potential risks of this technology.
By default, 30-minute to 1-hour weekly meetings (on Zoom) for high-level guidance. I'm active on Slack and typically respond within a day for quick questions or conceptual (not code) debugging. Expect async back-and-forth on experiment design and results between meetings. Scholars can also schedule ad-hoc calls if they're stuck or want to brainstorm; just ping me on Slack.
Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort. https://arxiv.org/abs/2510.01367
ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases. https://arxiv.org/abs/2510.20270
Language Models Learn to Mislead Humans via RLHF. https://arxiv.org/abs/2409.12822
Week 1-2: Mentor will provide high-level directions or problems to work on; scholars will have the freedom to propose specific projects and discuss them with the mentor.
Week 3: Finalize a detailed project plan.
MATS Research phase provides scholars with a community of peers.
During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.
Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.
Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Other impromptu office events included group-jailbreaking Bing Chat and exchanging hundreds of anonymous compliment notes. Scholars also organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and ACX meetups.