He He

I'm interested in mentoring projects related to reward hacking and to monitoring (agentic) models that produce long and complex trajectories. Scholars will have the freedom to propose projects within this scope. Expect 30-60 minutes of 1-1 time on Zoom.

Stream overview

I'm interested in mentoring projects related to reward hacking and to monitoring (agentic) models that produce long and complex trajectories.

Mentors

He He
New York University, Associate Professor
New York City
Monitoring, Dangerous Capability Evals, Scalable Oversight, Safeguards

He He is an associate professor at New York University. She is interested in how large language models work and in the potential risks of this technology.

Mentorship style

Weekly meetings of 30 minutes to 1 hour (on Zoom) by default for high-level guidance. I'm active on Slack and typically respond within a day to quick questions or conceptual (not code) debugging. Expect async back-and-forth on experiment design and results between meetings. Scholars can also schedule ad-hoc calls if they're stuck or want to brainstorm; just ping me on Slack.

Representative papers

Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort. https://arxiv.org/abs/2510.01367

ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases. https://arxiv.org/abs/2510.20270

Language Models Learn to Mislead Humans via RLHF. https://arxiv.org/abs/2409.12822

Scholars we are looking for

  • Strong engineering skills and experience in training deep learning models
  • Familiarity with modern large-scale RL pipelines (e.g., using frameworks such as verl)

Project selection

Weeks 1-2: The mentor will provide high-level directions or problems to work on, and the scholar will have the freedom to propose specific projects and discuss them with the mentor.

Week 3: Work out a detailed plan for the project.

Community at MATS

The MATS Research phase provides scholars with a community of peers.

During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.

Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.

Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Other impromptu office events included group-jailbreaking Bing Chat and exchanging hundreds of anonymous compliment notes. Scholars organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and joining ACX meetups.