Theory

The hardest problems in AI safety may not be solvable with experiments alone — they require the kind of foundational thinking in mathematics and philosophy that gives the field something solid to build on. Streams in this track work on agent foundations, formal models of trust and agency, mechanistic interpretability theory, and AI welfare. We're looking for researchers with deep mathematical maturity who want to tackle the problems that will still matter when AI systems are far more capable than they are today.

Apply by June 7th

Application process

Initial application: No track-specific questions.
Stream applications & follow-up: Apply to individual streams; follow-up includes interviews or additional assessments depending on the stream.

Theory track overview

This track works on problems where the goal is durable conceptual progress rather than experimental results on today's models. The bet is that some of the hardest alignment questions, such as those about agency, optimization, trust, and the structure of cognition, will not be settled by scaling current empirical techniques, and that mathematical and philosophical foundations will matter when AI systems are much more capable than they are now. Projects here cover agent foundations, formal models of trust and agency, mechanistic interpretability theory, and AI welfare. Methods are largely paper, pen, and proof, though some work intersects with empirical interpretability or formal verification.

We are looking for researchers with serious mathematical maturity and a willingness to sit with problems where the right formalization is itself part of the work. Essential traits are research independence (theory questions are open-ended and require self-direction), fluency with formal reasoning (proofs, probability, type theory, dynamical systems, or an analogous area), and the ability to write clearly about abstract ideas. Strong candidates have come from mathematics, theoretical computer science, theoretical physics, formal philosophy, and economic theory, but background is less load-bearing than demonstrated ability to do hard formal work, ideally with written output we can read.

Fellows are matched to mentors based on fit and produce concrete artifacts by program end: papers, technical reports, conceptual write-ups, or formal results. Target audiences include the agent foundations and alignment theory communities, alignment-relevant teams at frontier labs, and academic venues for formal work. Theory outputs typically have longer time horizons than empirical ones, and we expect many fellows to continue refining results past the program.

Theory track streams

Abram Demski

Theory

Agent Foundations research focused on clarifying conditions under which humans can justifiably trust artificial intelligence systems.

Alignment Research Center (ARC)

Theory

The Alignment Research Center is a small non-profit research group based in Berkeley, California, that is working on a systematic and theoretically grounded approach to mechanistically explaining neural network behavior. We are interested in scholars with a strong math background and mathematical maturity. If you'd be excited to work on the research direction described in this blog post, we'd encourage you to apply!

Dan Murfet, Jesse Hoogland

Empirical

Theory

We study applications of singular learning theory (SLT) to AI safety, with a focus on interpretability and alignment. Ideal candidates come from a strong technical background in mathematics, physics, computer science, or biology, and aren't afraid to get their hands dirty with ML experiments. We don't expect you to have deep expertise in SLT, but a shallow familiarity will help.

Krishnamurthy Dvijotham (Dj)

Empirical

Theory

This stream will pursue research on securing and hardening AI systems through rigorous testing, provable defenses, and formal specification. The work includes improving benchmarks for agentic security, scaling mathematically grounded robustness techniques such as randomized smoothing and Lipschitz-constrained training, and developing formal methods for specifying safe agent behaviors.

Lee Sharkey

Empirical

Theory

Lee's stream will focus primarily on improving mechanistic interpretability methods for reverse-engineering neural networks.

Oliver Sourbut

Empirical

Theory

Policy and Governance

Making society safe from AI doesn't just mean making safe AI: we're figuring out how to uplift human collective intelligence, manage a highly multiagent world, and improve foresight and institutional competence, ideally learning how to make the best use of frontier AI systems as we go. FLF has a small, sharp team of researchers with a wide network, and we're looking to nurture new and missing approaches to minimising large-scale risks while steering to a flourishing future.

Patrick Butlin

Theory

Empirical

Projects in this stream will be on AI welfare and moral status; more specifically, on what it takes to be a moral patient and how we can determine whether AI systems meet the conditions. I'm looking for applicants who have ideas about these topics and are motivated to explore them in more detail.

Frequently asked questions