Stream overview

I'm interested in advising work on control and dangerous-capability, propensity, and alignment evals, as these are the areas where I feel I can mentor most effectively. Currently, I work on propensity evaluations, evaluation awareness, and sandbagging. I'm happy to suggest concrete project ideas, help brainstorm topic choices, or help guide an existing project.

Mentors

Jacob Merizian
UK AISI, Research Scientist, Workstream Lead
London
Control, Monitoring, Red-Teaming, Scalable Oversight, Scheming & Deception

I work at the UK AI Security Institute. In the past, I've done research in high-performance computing, language model pretraining, interpretability, and hardware-enabled governance.

Mentorship style

I prefer a cadence of at least one research meeting per week, where we discuss results from the previous week and potential next steps, align on priorities, and stay motivated. Beyond that, I'm a fan of relatively few meetings and much more asynchronous support, so I can think carefully about my responses and help throughout the process.

In my experience, the two biggest bottlenecks for scholars are 1) getting stuck on technical obstacles and 2) conceptually understanding the problem we're trying to solve. I have a decent amount of technical experience, and in the past have had good experiences unblocking scholars right away when they hit technical obstacles (e.g. low-level bugs like memory issues) or helping them take a step back and consider alternative approaches. For example, I'm a huge fan of impromptu pair programming sessions to debug things together, and I always learn new things from dropping into someone's workflow. I'm also happy to help clarify things conceptually and just brainstorm together.

Representative papers

Scholars we are looking for

I'm open to a wide variety of skill sets, but these would be a big plus:

  • some relevant technical background in running basic finetuning/inference/interp experiments on a multi-GPU cluster
  • some prior level of interest in any of the research categories I've listed
  • ability to work independently once there is a clear enough goal (though I'm happy to be the one supplying the goal if that is the bottleneck)

Depending on the chosen category, I would likely connect scholars working on similar-enough projects, or help find UK AISI or external collaborators as makes sense. Scholars are of course welcome to find collaborators on their own as well, and I can help facilitate that if need be.

Project selection

I'm happy to suggest concrete project ideas, help brainstorm topic choices, or help guide an existing project the scholar is interested in. My preference is that the scholar picks a category that overlaps with an area I actively work on, so that I can give effective high-level advice.

Community at MATS

The MATS Research phase provides scholars with a community of peers.

During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.

Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.

Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Other impromptu office events included group-jailbreaking Bing Chat and exchanging hundreds of anonymous compliment notes. Scholars also organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and joining ACX meetups.