Cooperative AI

The world may soon contain many advanced AI systems frequently interacting with humans and with each other. Can we create a solid game-theoretic foundation for reasoning about these interactions to prevent catastrophic conflict and incentivize cooperation?

Mentors

Anthony DiGiovanni
Researcher, Center on Long-Term Risk

Anthony's research aims to mitigate catastrophic AI conflict by understanding and promoting the use of safe Pareto improvements among competing AI systems, focusing on the theoretical conditions that encourage cooperative outcomes.

  • Anthony DiGiovanni is a researcher at the Center on Long-Term Risk, where he leads the Conceptual stream. His research focuses on modeling the causes of catastrophic conflict between AGIs and mapping the conditions under which interventions would counterfactually prevent AGI conflict. He focuses in particular on deconfusing the “commitment races” problem and developing the game theory of “safe Pareto improvements.” Example public-facing work: “Responses to apparent rationalist confusions about game / decision theory” and “Safe Pareto Improvements for Expected Utility Maximizers in Program Games.”

    LinkedIn | Scholar | LessWrong

  • Background on my research agenda

    AGIs might be motivated to make catastrophic, irreversible decisions, e.g., in delegating to successor agents, when they perceive themselves to be under strategic pressure from competition with other AGIs. This is the commitment races problem. We’d like to understand which decision-theoretic properties and (lack of) capabilities would mitigate AGIs’ cooperation failures due to commitment races, in order to help prioritize interventions on how AIs are trained and used.

    One of the most important capabilities for avoiding conflict is the ability to implement safe Pareto improvements (SPIs) (Oesterheld and Conitzer, 2022). The high-level motivation for SPIs is that even if agents who get into conflict would all rather have cooperated, they might worry (under their uncertainty before the conflict) that being more willing to cooperate makes them more exploitable. SPIs are designed to avoid this problem between agents with high transparency: agents who use SPIs will, in situations where they would otherwise get into conflict, instead agree to a more cooperative outcome without changing their relative bargaining power (thus avoiding exploitability).
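
    To make this concrete, here is a minimal toy sketch (my own illustrative example, not taken from the papers above; the payoffs and program names are hypothetical) of how an SPI-like commitment can replace a conflict outcome in a transparent program game while behaving exactly like the default against non-SPI opponents:

      # Toy demand game: each player demands High or Low; two High demands mean conflict.
      # Hypothetical payoffs for illustration only.
      PAYOFFS = {
          ("High", "High"): (0, 0),   # conflict: both lose
          ("High", "Low"):  (3, 1),
          ("Low",  "High"): (1, 3),
          ("Low",  "Low"):  (2, 2),
      }

      def default_program(opponent_code):
          """Hawkish default policy: always demand High."""
          return "High"

      def spi_program(opponent_code):
          """If the opponent verifiably runs this same SPI program, replace the
          conflict outcome with (Low, Low); otherwise behave exactly like the
          default, so relative bargaining power is unchanged."""
          if opponent_code == SPI_CODE:
              return "Low"
          return default_program(opponent_code)

      # Crude stand-in for mutual source-code transparency.
      SPI_CODE = spi_program.__code__.co_code

      def play(program_a, program_b):
          code_a = program_a.__code__.co_code
          code_b = program_b.__code__.co_code
          return PAYOFFS[(program_a(code_b), program_b(code_a))]

      print(play(default_program, default_program))  # (0, 0): conflict by default
      print(play(spi_program, spi_program))          # (2, 2): Pareto improvement
      print(play(spi_program, default_program))      # (0, 0): same as default, no exploitation

    In this sketch, adopting the SPI only changes what happens in worlds where both players would otherwise have reached the conflict outcome, which is the property that makes the improvement “safe.”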

    Research topics

    I’d be most keen on supervising projects about the following:

    • Mapping concrete conditions for agents to (want to) use SPIs: We have theoretical results on sufficient conditions under which agents who are capable of using SPIs prefer to use them (see Oesterheld and Conitzer (2022) and DiGiovanni and Clifton (2024)). The next steps are: 1) In what specific ways might these sufficient conditions (e.g., assumptions about the agents’ beliefs) plausibly fail to hold? 2) What would it look like for prosaic AIs to satisfy or violate these conditions?

    • Understanding SPI implementation in early human-AI institutions: SPIs have been well studied conceptually in the context of single agents, but it’s less clear how they would work in a system of many humans and AI assistants. This question may be high priority because a) it’s relatively time-sensitive (if AIs face strategic pressures early in takeoff), and b) compared to a coherent single agent, a system of many decision-makers is less likely to use SPIs by default.

    But I will also consider applications on other topics related to the agenda sketched above.

  • Qualifications

    • Necessary:

    • Ideal:

      • Some familiarity with decision theory

      • Familiarity with prosaic AI development and alignment

  • Mentorship structure

    • Weekly meetings (30–60 min), mostly for giving high-level direction

    • Comments on substance of documents (i.e., soundness and clarity of key arguments), mostly not on low-level points or writing style

    • I’ll be mostly hands-off otherwise

Christian Schroeder de Witt
Postdoctoral Researcher, University of Oxford

Christian specializes in AI safety and multi-agent security, with foundational contributions to secure steganography and reinforcement learning.

  • Dr. Christian Schroeder de Witt is a leading researcher in foundational AI and information security, known for breakthroughs on the 25+ year-old challenge of perfectly secure steganography and for developing illusory attacks on reinforcement learning agents. During his Ph.D., he helped establish the field of cooperative deep multi-agent reinforcement learning, contributing to the creation of several popular algorithms and standard benchmark environments. He is currently a postdoc at the University of Oxford's Torr Vision Group and was previously a visiting researcher with Turing Award winner Prof. Yoshua Bengio at MILA (Quebec).

    His academic credentials include distinguished master's degrees in Physics and in Computer Science from Oxford, where he made significant contributions to categorical quantum mechanics. His work has gained international recognition, including coverage in Quanta Magazine, Scientific American, and Bruce Schneier’s Security Blog, as well as his selection as a "30 under 35 rising strategist (Europe)" by the Schmidt Futures International Strategy Forum and the European Council on Foreign Relations in 2021.

  • I am a deep multi-agent learning / Cooperative AI researcher now working on AI Safety (manipulation, deception, ELK) and Multi-Agent Security (collusion, illusory attacks). My other research interests are Information Theory, Security, (Deep, Multi-Agent) Reinforcement Learning, and Agent-Based Modeling. You may view more of my past research at Google Scholar.

  • The key values I’m looking for in scholars are passion, autonomy, and kindness.

    Some of my students and mentees include:

    • Linas Nasvytis (MSc Statistics student, now Research Fellow at Harvard University (Psychology and ML))

    • Yat Long Lo (MSc Computer Science student and winner of the Tony Hoare Prize for best MSc thesis in Computer Science, now at the Dyson Robot Learning Lab)

    • Khaulat Abdulhakeem (mentee, now MS Education Data Science at Stanford University)

    • Eshaan Agrawal (mentee and collaborator, now ORISE Fellow at the Department of Energy)