GDM stream focused on scheming risk, AI control, monitoring, monitorability, and loss-of-control evaluations. Probably running in-person in London.
My research interests currently revolve around:
a) AI control -- i.e. defending against the threat of scheming via monitoring, maintaining monitorability, and (to a lesser extent) security, and
b) Model evaluations relevant to AI control and loss-of-control risk more broadly.
A project would typically involve building an evaluation, prototyping and stress-testing a mitigation, or investigating some threat model or trade-off relevant to these topics.
It's hard for me to give specific project descriptions so far in advance, but here's a selection of projects I've been excited about in the past (some of which are now papers!). You can expect the actual projects to be of similar flavour but not exactly the same:
I'm generally quite hands-off. I propose projects that I think matter for AGI safety and are tractable, to set scholars up for success. I then expect scholars to fully own the project, and update / consult me as needed.
By default we'd meet once a week for 30 minutes to an hour to discuss the project. I see my role as giving feedback on the project direction, stress-testing or advising on design and prioritisation decisions, and occasionally suggesting experiments or methodological improvements (which you should treat as suggestions, not orders!).
You can also book ad-hoc meetings with me, ping me on Slack, or send me docs / paper drafts for review.
I also offer a monthly 30-minute meeting with each scholar to discuss careers, skill-building, feedback on their progress, or anything else.
Preferred technical skills:
I prefer scholars to work in pairs or groups of three within the stream, but I'm happy to take on external collaborators as long as they are committed to the project full-time.
I'll propose ~5 projects for scholars to choose from. I am also open to scholar-proposed projects if they are well articulated, promising, and aligned with my research interests.