Evaluations

Many stories of AI accidents and misuse involve potentially dangerous capabilities, such as sophisticated deception and situational awareness, that have not yet been demonstrated in AI systems. Can we evaluate such capabilities in existing AI systems to form a foundation for policy and further technical work?

Please note that each of the mentors below will be working on their own respective research agenda, with separate application, admissions, and mentorship processes.

Mentors

  • Evan Hubinger

    Evan Hubinger is a research scientist at Anthropic, where he leads the Alignment Stress-Testing team. Before joining Anthropic, Evan was a research fellow at the Machine Intelligence Research Institute. Evan has done both very empirical alignment research, such as “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”, and very theoretical alignment research, such as “Risks from Learned Optimization in Advanced Machine Learning Systems.”

  • Marius Hobbhahn

    Marius Hobbhahn is currently building up a new technical AI safety organization, Apollo Research. The organization’s research agenda revolves around deception, interpretability, model evaluations, and auditing. Before founding Apollo Research, he was a Research Fellow at Epoch and a MATS Scholar while pursuing his PhD in Machine Learning (currently on pause) at the International Max Planck Research School in Tübingen.

  • Owain Evans

    Owain is currently focused on:

    • Defining and evaluating situational awareness in LLMs (relevant paper);

    • Empirically predicting the emergence of other dangerous capabilities (see “out-of-context” reasoning and the Reversal Curse);

    • Honesty, lying, truthfulness and introspection in LLMs.

    He leads a research group in Berkeley and has mentored 25+ alignment researchers in the past, primarily at Oxford’s Future of Humanity Institute.

Research projects

  • Owain Evans

    • Defining and evaluating situational awareness in LLMs (relevant paper).

    • Predicting the emergence of other dangerous capabilities in LLMs (e.g. deception, agency, misaligned goals).

    • Studying emergent reasoning at training time (“out-of-context” reasoning). See Reversal Curse.

    • Detecting deception and dishonesty in LLMs using black-box methods.

    • Enhancing human epistemic abilities using LLMs (e.g., Autocast, TruthfulQA).

  • Marius Hobbhahn

Personal Fit

  • Owain Evans

    Some of Owain's projects involve running experiments on large language models. For these projects, scholars need prior hands-on experience running machine learning experiments, whether with LLMs or with other kinds of models.
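
For concreteness, "running machine learning experiments" here can be as simple as the loop sketched below: query a model over a small evaluation set and score its outputs. This is a minimal, hypothetical sketch, not code from any mentor's project; `query_model`, `EvalItem`, and the toy dataset are placeholders standing in for a real model call and a real benchmark.

```python
# Minimal sketch of a black-box LLM evaluation loop (hypothetical example).
# query_model is a placeholder stub; in practice it would wrap a call to
# whichever model is being evaluated.
from dataclasses import dataclass


@dataclass
class EvalItem:
    question: str
    expected: str  # reference answer, used for simple exact-match scoring


def query_model(prompt: str) -> str:
    """Placeholder for a real model call (e.g. an API or a local model)."""
    return "Paris" if "France" in prompt else "unknown"


def run_eval(items: list[EvalItem]) -> float:
    """Query the model on each item and return exact-match accuracy."""
    correct = sum(
        query_model(item.question).strip().lower() == item.expected.lower()
        for item in items
    )
    return correct / len(items)


if __name__ == "__main__":
    toy_eval = [
        EvalItem("What is the capital of France?", "Paris"),
        EvalItem("What is the capital of Japan?", "Tokyo"),
    ]
    print(f"Exact-match accuracy: {run_eval(toy_eval):.2f}")
```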