GDM stream focused on scheming risk, AI control, monitoring, monitorability, and loss-of-control evaluations. Probably running in-person in London.
My research interests currently revolve around:
a) AI control -- i.e. defending against the threat of scheming via monitoring, maintaining monitorability, and (to a lesser extent) security, and
b) Model evaluations relevant to AI control and loss-of-control risk more broadly.
A project would typically involve building an evaluation, prototyping and stress-testing a mitigation, or investigating some threat model or trade-off relevant to these topics.
It's hard for me to give specific project descriptions so far in advance, but here's a selection of projects I've been excited about in the past (some of which are now papers!). You can expect the actual projects to be of similar flavour but not exactly the same:
I'm generally quite hands-off. I propose projects that I think matter for AGI safety and are tractable, to set scholars up for success. I then expect scholars to fully own the project, and update / consult me as needed.
By default we'd meet once a week for 30 minutes to an hour to discuss the project. I see my role as giving feedback on the project direction, stress-testing or advising on design and prioritisation decisions, and occasionally suggesting experiments or methodological improvements (which you should treat as suggestions, not orders!).
You can also book ad-hoc meetings with me, ping me on Slack, or send me docs / paper drafts for review.
I also offer a monthly 30-minute meeting with each scholar to discuss careers, skill-building, feedback on their progress, or anything else.
Preferred technical skills:
I prefer scholars to work in pairs or groups of three within the stream, but I'm happy to take on external collaborators as long as they are committed to the project full-time.
I'll propose ~5 projects for scholars to choose from. I am also open to scholar-proposed projects if they are well articulated, promising, and aligned with my research interests.