Abram Demski

Agent Foundations research focused on clarifying conditions under which humans can justifiably trust artificial intelligence systems.

Stream overview

Methodologically speaking, a proposed AI safety technique is only as good as its safety argument. This does not mean we need a mathematical proof of absolute safety from zero assumptions (which is impossible), but it does mean we need to track the assumptions we are making and rigorously connect them into an argument. In my view, there are still fundamental unknowns about how to put such an argument together, specifically in relation to how one agent can trust another. Intelligence and agency make empirical observations harder to generalize, since intelligent agents can break otherwise-robust empirical patterns, often deliberately.

My main line of research tackles this problem head-on by analyzing agentic trust directly, starting with the simpler (already difficult) case of an agent trusting itself, with the aim of clarifying how humans can justifiably trust AI systems.

As of this writing, my current thinking about research directions for this project is described here.

An older, somewhat out-of-date but more thorough description of research directions is here.

If this line of research goes well, it should produce advice relevant to AI engineering (such as LLM work). A second, more speculative but nearer-term line of research I've been working on involves "guessing ahead" about what such advice might look like, in order to generate ideas for near-term ML research that could improve the safety of AI systems resembling those popular today.

Mentors

Abram Demski
Independent Researcher
Grand Rapids
Agent Foundations

Abram Demski is an AI Safety researcher specializing in Agent Foundations, best known for Embedded Agency (co-written with Scott Garrabrant). His overall approach primarily involves deconfusion research on concepts related to AI risk, including agency, optimization, trust, meaning, understanding, interpretability, and computational uncertainty (more commonly, but less precisely, known as bounded rationality). More specifically, his recent work focuses on modeling trust, with the objective of clarifying conditions under which humans can justifiably trust AI.

Mentorship style

We can discuss this more and decide on a different structure, but by default: a 1-hour one-on-one meeting with each scholar once a week, plus a 2-hour group meeting which may also include outside collaborators.

Scholars we are looking for

Essential:

  • Ability to read and write mathematical proofs.
  • Some fluency with probability theory and expected utility theory.

Preferred:

  • Familiarity with Updateless Decision Theory. [More important.]
  • Familiarity with Logical Induction (aka Garrabrant Induction). [Less important.]

Quality of fit is roughly proportional to philosophical skill times mathematical skill. Someone with excellent philosophical depth and almost no mathematics could be an OK fit, but would probably struggle to produce or evaluate proofs. Someone with excellent mathematical depth but no philosophy could be an OK fit, but might struggle to understand what assumptions and theorems are useful/interesting.

Scholars will primarily collaborate with Abram, and may collaborate with other scholars if there is overlap in research topics.

Project selection

There will be some flexibility about which specific projects scholars pursue. Abram will discuss the current state of his research with scholars, along with the topics they are interested in, aiming to settle on a topic by week 2.

Community at MATS

The MATS Research phase provides scholars with a community of peers.

During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.

Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.

Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Impromptu office events included group-jailbreaking Bing Chat and exchanging hundreds of anonymous compliment notes. Scholars also organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and ACX meetups.