Evaluations

Many stories of AI accidents and misuse involve potentially dangerous capabilities, such as sophisticated deception and situational awareness, that have not yet been demonstrated in AI systems. Can we evaluate such capabilities in existing AI systems to form a foundation for policy and further technical work?

Please note that each of the mentors below will be working on their own respective research agenda, with separate application, admissions, and mentorship processes.

Mentors

  • Evan Hubinger

    Evan Hubinger is a research scientist at Anthropic, where he leads the Alignment Stress-Testing team. Before joining Anthropic, Evan was a research fellow at the Machine Intelligence Research Institute. Evan has done both very empirical alignment research, such as “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”, and very theoretical alignment research, such as “Risks from Learned Optimization in Advanced Machine Learning Systems.”

  • Marius Hobbhahn

    Marius Hobbhahn is currently building up a new technical AI safety organization, Apollo Research. The organization’s research agenda revolves around deception, interpretability, model evaluations, and auditing. Before founding Apollo Research, he was a Research Fellow at Epoch and a MATS Scholar while pursuing his PhD in Machine Learning (currently on pause) at the International Max Planck Research School in Tübingen.

  • Owain Evans

    Owain is currently focused on:

    • Defining and evaluating situational awareness in LLMs (relevant paper);

    • Empirically predicting the emergence of other dangerous capabilities (see “out-of-context” reasoning and the Reversal Curse);

    • Honesty, lying, truthfulness and introspection in LLMs.

    He leads a research group in Berkeley and has mentored 25+ alignment researchers in the past, primarily at Oxford’s Future of Humanity Institute.

Research projects

  • Owain Evans

    • Defining and evaluating situational awareness in LLMs (relevant paper).

    • Predicting the emergence of other dangerous capabilities in LLMs (e.g. deception, agency, misaligned goals).

    • Studying emergent reasoning at training time (“out-of-context” reasoning). See Reversal Curse.

    • Detecting deception and dishonesty in LLMs using black-box methods.

    • Enhancing human epistemic abilities using LLMs (e.g., Autocast, TruthfulQA).

  • Marius Hobbhahn

Personal Fit

  • Owain Evans

    Some of Owain's projects involve running experiments on large language models. For these projects, scholars need prior hands-on experience running machine learning experiments, whether with LLMs or with other kinds of models.
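
For concreteness, "running machine learning experiments" here can be as simple as the loop sketched below: query a model over a small evaluation set and score its outputs. This is a minimal, hypothetical sketch, not code from any mentor's project; `query_model`, `EvalItem`, and the toy dataset are placeholders standing in for a real model call and a real benchmark.

```python
# Minimal sketch of a black-box LLM evaluation loop (hypothetical example).
# query_model is a placeholder stub; in practice it would wrap a call to
# whichever model is being evaluated.
from dataclasses import dataclass


@dataclass
class EvalItem:
    question: str
    expected: str  # reference answer, used for simple exact-match scoring


def query_model(prompt: str) -> str:
    """Placeholder for a real model call (e.g. an API or a local model)."""
    return "Paris" if "France" in prompt else "unknown"


def run_eval(items: list[EvalItem]) -> float:
    """Query the model on each item and return exact-match accuracy."""
    correct = sum(
        query_model(item.question).strip().lower() == item.expected.lower()
        for item in items
    )
    return correct / len(items)


if __name__ == "__main__":
    toy_eval = [
        EvalItem("What is the capital of France?", "Paris"),
        EvalItem("What is the capital of Japan?", "Tokyo"),
    ]
    print(f"Exact-match accuracy: {run_eval(toy_eval):.2f}")
```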