Stream overview

I'm interested in advising work on control and dangerous-capability, propensity, and alignment evals, as these are the areas where I feel I can mentor most effectively. Currently, I work on propensity evaluations, evaluation awareness, and sandbagging. I'm happy to suggest concrete project ideas, help brainstorm topic choices, or help guide an existing project.

Mentors

Jacob Merizian
UK AISI, Research Scientist, Workstream Lead
London
Control, Monitoring, Red-Teaming, Scalable Oversight, Scheming & Deception

I work at the UK AI Security Institute. In the past, I've done research in high-performance computing, language model pretraining, interpretability, and hardware-enabled governance.

Mentorship style

I prefer a cadence of at least one research meeting per week, where we discuss results from the previous week and potential next steps, align on priorities, and stay motivated. Beyond that, I'm a fan of relatively few meetings and much more asynchronous support, so I can think carefully about my responses and help throughout the process.

In my experience, the two biggest bottlenecks for scholars are 1) getting stuck on technical obstacles and 2) conceptually understanding the problem we're trying to solve. I have a decent amount of technical experience, and in the past have had good experiences unblocking scholars right away when they hit technical obstacles (e.g. low-level bugs like memory issues) or helping them take a step back and consider alternative approaches. For example, I'm a huge fan of impromptu pair programming sessions to debug things together, and I always learn new things from dropping into someone's workflow. I'm also happy to help clarify things conceptually and just brainstorm together.

Representative papers

Scholars we are looking for

I'm open to a wide variety of skill sets, but these would be a big plus:

  • some relevant technical background in running basic finetuning/inference/interp experiments on a multi-GPU cluster
  • some prior level of interest in any of the research categories I've listed
  • ability to work independently once there is a clear enough goal (though I'm happy to be the one supplying the goal if that is the bottleneck)

Depending on the chosen category, I would likely connect scholars working on similar-enough projects, or help find UK AISI or external collaborators as makes sense. Scholars are of course welcome to find collaborators on their own as well, and I can help facilitate that if need be.

Project selection

I'm happy to suggest concrete project ideas, help brainstorm topic choices, or help guide an existing project the scholar is interested in. My preference is that the scholar picks a category that overlaps with an area I actively work on, so that I can give effective high-level advice.

Community at MATS

The MATS Research phase provides scholars with a community of peers.

During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.

Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.

Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Other impromptu office events included group-jailbreaking Bing Chat and exchanging hundreds of anonymous compliment notes. Scholars also organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and joining ACX meetups.