Provable AI Safety

If we could encode a detailed world model and coarse human preferences in a formal language, it might be possible to formally verify that an AI-generated agent won’t take actions leading to catastrophe. Can we use frontier AI to help create such a detailed multi-scale world model, and/or to synthesize agents and proof certificates for small-scale demonstrations already?

Mentor

  • David ‘davidad’ Dalrymple

    David A. Dalrymple (known online as davidad) is a Programme Director at the UK Advanced Research and Invention Agency (ARIA), focusing on safe transformative AI. He formerly held positions at Protocol Labs, FHI (Oxford), Harvard Biophysics, and MIT Mathematics and Computation. While at FHI, he developed the Open Agency Architecture (OAA) paradigm, a proposal to use powerful computers to build a detailed world simulation and formally verify within it that a particular AGI adheres to coarse preferences and avoids catastrophic outcomes (see this explainer).

    In the words of Connor Leahy, this approach has “the rare distinction of actually solving most/all of the core problems” of AI safety. However, it also has the disadvantage of requiring substantial research advances in well-trodden areas of computer science and mathematics, opening up many questions for interested researchers.

    Davidad’s academic background includes category theory, formal methods, systems programming, protocol design, probabilistic modeling, biophysics, and computer architecture.

Research projects

Personal Fit

Competitive applicants will have strong familiarity with:

  • {probabilistic programming semantics OR Markov decision processes OR hybrid system reachability OR model-checking tools OR Lean 4}

AND

  • {fine-tuning Llamas OR hacking on a legacy C++ codebase OR programming language design and implementation OR database design and implementation}

AND

  • {applied category theory OR stochastic processes OR temporal logic}

Mentorship will be pretty sparse, likely a 2-hour session twice a month. I am a good technical leader but less good as a manager; mentees should bring their own executive functioning. I am most valuable to mentees when they relentlessly send me questions.

Selection Questions

Choose any two of the following questions to answer. We suggest spending no more than 5 hours on them in total.

  • Given a state space X and a convex set S of probability measures over X, is the operator (f : X→ℝ) ↦ min_{𝜇∈S} 𝔼_{x~𝜇} f(x) concave or convex? Provide a proof.

  • What is your favorite definition/formalization of a stochastic hybrid system, and why?

  • Monte Carlo estimates do not give sound bounds. Explain how to obtain sound bounds using abstract interpretation with the box domain, a priority queue, and recursive subdivision. Demonstrate by coding a simple example, such as successive approximation of pi via the area of a circle and its complement. Ensure that floating-point rounding compromises neither soundness nor precision. (A hedged sketch of this pattern appears after this list.)

  • Given a category C, what is the Grothendieck construction of the indexed category which assigns to each object a in C its overcategory squared, (C/a)×(C/a)?

  • Discuss ways one could apply a GNN to an existing relational database and graft the resulting activations into an existing pretrained LLM whose attention mechanism is designed only for token sequences.
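
As a reference point for the abstract-interpretation question above, here is a minimal, hypothetical Python sketch of the pattern it names, not a model answer: it encloses pi by abstractly evaluating x² + y² over boxes in the unit square, keeping undecided boxes in a priority queue and subdividing the largest first. Exact rational arithmetic via fractions.Fraction is one way (directed rounding would be another) to keep floating-point rounding from compromising soundness. The function name pi_bounds and the budget parameter max_boxes are ours, purely for illustration.

    # Hypothetical sketch: sound two-sided bounds on pi via box-domain abstract
    # interpretation. We bound the area of the quarter disk x^2 + y^2 <= 1
    # inside the unit square [0,1]^2. Exact rational arithmetic (Fraction)
    # sidesteps floating-point rounding entirely, so soundness is preserved.
    import heapq
    from fractions import Fraction

    def pi_bounds(max_boxes: int = 2000) -> tuple[Fraction, Fraction]:
        one = Fraction(1)
        inside = Fraction(0)   # total area of boxes proven inside the disk
        outside = Fraction(0)  # total area of boxes proven outside the disk
        # Priority queue of undecided boxes, largest area first (key negated).
        # A box is (x_lo, x_hi, y_lo, y_hi).
        heap = [(-one, (Fraction(0), one, Fraction(0), one))]
        processed = 0
        while heap and processed < max_boxes:
            neg_area, (xl, xh, yl, yh) = heapq.heappop(heap)
            processed += 1
            # Abstract evaluation of x^2 + y^2 on the box: since x, y >= 0,
            # the minimum is at (xl, yl) and the maximum at (xh, yh).
            lo = xl * xl + yl * yl
            hi = xh * xh + yh * yh
            if hi <= 1:
                inside += -neg_area       # box certainly inside the disk
            elif lo >= 1:
                outside += -neg_area      # box certainly outside the disk
            else:
                # Undecided: subdivide into four children and re-enqueue.
                xm, ym = (xl + xh) / 2, (yl + yh) / 2
                for cxl, cxh in ((xl, xm), (xm, xh)):
                    for cyl, cyh in ((yl, ym), (ym, yh)):
                        area = (cxh - cxl) * (cyh - cyl)
                        heapq.heappush(heap, (-area, (cxl, cxh, cyl, cyh)))
        # The quarter-disk area lies in [inside, 1 - outside], so pi lies in:
        return 4 * inside, 4 * (one - outside)

    lo, hi = pi_bounds()
    # Prints a sound enclosure of pi that tightens as max_boxes grows.
    print(float(lo), "<= pi <=", float(hi))

The largest-area-first queue concentrates refinement where it most tightens the enclosure, and the returned interval is sound by construction: every box is either proven inside the disk, proven outside, or still counted as uncertain.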