The stream will advance empirical methodologies for third-party AI safety evaluations. Example research topics include chain-of-thought monitorability, the secret loyalties research agenda, and automated auditing (e.g., with Petri, Anthropic’s Parallel Exploration Tool for Risky Interactions).
I research AI safety and alignment. Most recently, I was a research scientist at Google DeepMind. I completed my PhD at UC Berkeley's Center for Human-Compatible AI, advised by Stuart Russell. I previously cofounded FAR.AI, a 501(c)(3) research nonprofit that incubates and accelerates beneficial AI research agendas.
I develop AI alignment frameworks, stress-test their limits, and turn insights into methodology adopted across the field. I have established that chain-of-thought monitoring is a substantial defense when reasoning is necessary for misalignment, designed practical metrics to preserve monitorability during model development, shown that obfuscated activations can bypass latent-space defenses, and developed StrongREJECT, a jailbreak benchmark now used by OpenAI, US/UK AISI, Amazon, and others.
Scholars in my stream will work in teams. During the week, they will collaborate with each other and are encouraged to meet frequently. I will hold a weekly advising meeting with each project team to provide help and guidance.
When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors
A Pragmatic Way to Measure Chain-of-Thought Monitorability
Obfuscated Activations Bypass LLM Latent-Space Defenses
A StrongREJECT for Empty Jailbreaks
AI-Enabled Coups: How a Small Group Could Use AI to Seize Power
Petri: An open-source auditing tool to accelerate AI safety research
My projects will prioritize impact on the field of AI safety over academic novelty. Beyond the skills needed for empirical AI safety research, I am looking for collaborators who are excited about doing sound and impactful science, including the mundane aspects of doing good science.
Scholars will probably work with collaborators from the stream.
I will provide a list of projects, but I am also happy to talk through other project ideas.
The MATS Research phase provides scholars with a community of peers.
During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.
Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.
Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Impromptu office events included group-jailbreaking Bing Chat and exchanging hundreds of anonymous compliment notes. Scholars also organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and ACX meetups.