Megan Kinniment

This stream will focus on the science and development of model evaluations, especially monitorability and alignment evals.

Stream overview

Here are some examples of projects I have been interested in, but I may be interested in other projects by the time this cohort starts:

  • Improve METR's general agent performance using SFT, RL, and scaffolding. (We want to rule out a big capabilities jump from a small amount of effort.)
  • Models seem quite bad at judging the quality of their own solutions. Is this a capabilities bottleneck? If so, can we shortcut running our full task suite by creating a benchmark that just requires models to judge what score an agent trajectory would have gotten?
  • How can we convert model X% time horizons into 'expected speedup for a person doing X task'?
  • How can we model the performance of agents on multistep tasks? (Models such as Toby Ord's constant hazard rate model seem interesting, but don't fit our data in some ways.)
  • Building agentic scheming evals, e.g. where models can choose to sabotage products, and evade or disrupt monitoring. 
  • What fundamental weaknesses are holding models back from being dangerous, and how can we test those?
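
As a rough illustration of the constant hazard rate idea mentioned in the list above: if an agent fails at a constant rate per unit of task length, its success probability decays exponentially with task length, and any X% time horizon follows directly from the hazard rate. The sketch below assumes that simple exponential form; the function names and the 60-minute figure are hypothetical and are not METR data or code.

```python
import math


def success_probability(task_minutes: float, hazard_per_minute: float) -> float:
    """P(success) under a constant hazard rate: exp(-hazard * task length)."""
    return math.exp(-hazard_per_minute * task_minutes)


def time_horizon(success_level: float, hazard_per_minute: float) -> float:
    """Task length (minutes) at which P(success) falls to success_level."""
    return -math.log(success_level) / hazard_per_minute


# Hypothetical hazard rate, chosen so that the 50% time horizon is 60 minutes.
hazard = math.log(2) / 60.0

h50 = time_horizon(0.5, hazard)  # 50% time horizon
h80 = time_horizon(0.8, hazard)  # 80% time horizon

print(f"50% horizon: {h50:.1f} min, 80% horizon: {h80:.1f} min")
print(f"P(success) on a 120-minute task: {success_probability(120, hazard):.2f}")
```

Under this assumption the ratio between any two horizons is fixed (the 80% horizon is always about a third of the 50% horizon), which is one concrete prediction that could be checked against observed agent data.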

Mentors

Megan Kinniment
METR, Member of Technical Staff
SF Bay Area
Dangerous Capability Evals

I am a researcher at METR. 

I think the development of AI is going to be a confusing time for the world. I want to help provide good evidence and methodologies for tracking AI development and risk, so humanity can make sensible decisions.

I've had different roles at different times, including leading task development and our monitoring stream. I like prototyping new kinds of evaluations. I think it's healthy to read transcripts. I'm interested in what capabilities matter for being a competent agent, and why current AI agents fall short. I feel lucky that I get to spend time building an understanding of the models. 

I've previously spent time at the Centre on Long-Term Risk and FHI. Before that I studied physics at university, where I did malaria diagnostics research. 

Mentorship style

I'll meet with scholars 2x/week each. I'll also be generally available async and potentially for code review. 

Scholars we are looking for

Various profiles could be a good fit.

Wanted: 

  • Enjoys making progress quickly, some 'productive impatience'
  • Proactivity and agency (e.g. to unblock yourself, and to ask for help when you need it)
  • Comfort doing things you haven't done before (or that nobody has done before!)
  • Basic coding skills (e.g. git, Python, bash, knowing what uv is)
  • Able to write reasonably clean code that other researchers can build on
  • Prior experience doing research of some kind (broadly defined, could be independent)
  • Ok with, or excited to, read some agent transcripts
  • Interest in or curiosity about the models
  • Interest in understanding METR as an organization, and how we can better achieve our goals
  • Enthusiasm is always a plus!

Some of the following would be great but not essential: 

  • Prior experience making blackbox evaluations.
  • Prior experience with white-box evaluation techniques. 
  • Prior experience creating model organisms. 
  • Cybersecurity experience

Scholars can independently find collaborators, but this is not required.

Project selection

I'll provide a list of possible projects to pick from, and we'll talk through the options before a decision is made.

Scholars can also suggest their own projects. 

Community at MATS

The MATS Research phase provides scholars with a community of peers.

During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.

Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.

Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Other impromptu office events included group-jailbreaking Bing Chat and exchanging hundreds of anonymous compliment notes. Scholars organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and joining ACX meetups.