Adrià Garriga-Alonso

Alignment is solved for models in the current paradigm. This shifts the threat model to good old human conflict, so I'm excited about coordination tech (AI cooperation, datacenter workload verification). For aligning future models, we have to forecast what future AGIs will look like and solve issues before they come up. I’m excited about models that maintain their goodness under self-directed learning and can align their successor.

Stream overview

I think that alignment of current models is basically solved. Have you talked to them? They’re really nice.

The remaining problems include conflict (which we can reduce with coordination tech) and the alignment risk from future models. To address the latter, we must forecast what future AGIs will look like and tailor our alignment methods to that.

Dry-run future model training setups

The strongest reasons to think alignment hasn’t been solved yet hinge on future models which have been heavily optimized under outcome-based RL. Therefore, technical research on AI alignment should anticipate this situation and empirically test what will happen then. 

Future models will probably also update their weights continuously (every startup and bigco is simultaneously trying to solve continual learning), so pre-deployment monitoring will not be enough to catch all issues. Another unsolved alignment subproblem is therefore ensuring that alignment stays stable under continual learning.

My best guess is we need to:

  • Leverage the AI’s goodness early on: the large weight of goodness in the pre-training prior, plus our ability to point to it using Constitutional AI and similar approaches.  
  • Give the AI autonomy during training, so it fleshes out the goodness core into a frame that is stable under new information, learning about itself and the world, capability increases, etc.  
  • For example, the AI should have autonomy over what it is trained on. In the future it will be able to choose its own training data, so it needs to learn to choose well (a toy version of such a loop is sketched after this list).  
  • Build check-in mechanisms that encourage the AI to reflect on its actions, what it’s learning, etc.
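
As a concrete (and heavily simplified) illustration of the data-autonomy point, here is a toy training loop in which the model scores candidate datapoints and only the ones it selects contribute to the gradient update. The tiny linear "model", the labels, and the confidence-based selection rule are placeholders, not a proposal for the real setup.

```python
# Toy version of "autonomy over training data": the model scores candidate
# datapoints and gradients only flow through the ones it selects.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(16, 2)                 # stand-in for a real policy/LM
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def selection_scores(model, xs):
    # Hypothetical self-assessment: the model rates candidates by its own
    # predictive entropy (lower = more confident it knows what to do here).
    with torch.no_grad():
        probs = F.softmax(model(xs), dim=-1)
        return -(probs * probs.log()).sum(-1)

for step in range(100):
    xs = torch.randn(32, 16)                   # candidate datapoints
    ys = (xs[:, 0] > 0).long()                 # toy labels
    scores = selection_scores(model, xs)
    keep = scores < scores.median()            # the model "chooses" half the batch
    loss = F.cross_entropy(model(xs[keep]), ys[keep])
    opt.zero_grad()
    loss.backward()
    opt.step()
```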

I’m therefore interested in doing dry-runs of this kind of setup. Some possible projects:

  • Build an open-source Constitutional AI training pipeline  
  • Do RLVR on broken environments (e.g. a programming environment that gives high reward whenever no tests fail; a sketch of such a reward follows this list). Can we stop the model from learning to reward hack by giving it autonomy over which training data points it considers? Does this result in wireheading?
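
To make the second project concrete, here is one way the "broken" reward could look: a verifier that grants full reward whenever pytest reports no failing tests, which a policy can hack by deleting or skipping the tests. The use of pytest and its exit codes is just one illustrative choice; real environments would run untrusted code in a sandbox.

```python
# Sketch of a deliberately "broken" RLVR reward: full reward whenever pytest
# reports no failing tests. Treating "no tests collected" as a pass is exactly
# the hole a reward-hacking policy can exploit by deleting the test files.
import subprocess

def reward(solution_dir: str) -> float:
    """Return 1.0 iff pytest reports no failing tests in solution_dir."""
    result = subprocess.run(["pytest", "-q", solution_dir], capture_output=True)
    # pytest exit codes: 0 = all tests passed, 5 = no tests were collected.
    return 1.0 if result.returncode in (0, 5) else 0.0
```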

Coordination tech: transparency between semi-adversarial parties

Once we can reliably make friendly AI, the problem becomes conflicts between different groups of humans + AIs. We can try building tools that allow for honesty, cooperation and bargaining between parties who don’t have great reasons to trust each other.

  • SelfIE shows that models can look at their own latent states from independent chains and extract meaning from them.  
  • Can different models read each other’s latent states, given linear adapters? (A minimal adapter-training sketch follows this list.)  
  • How well does this work with models of different capabilities?   
  • How well does this work with models of different architectures?  
  • Does this work for lie detection during adversarial games (e.g. Among Us)?  
    • Can one of the models optimize against this reliably, so long as we keep training the adapter map (tuned lens)?
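
A minimal sketch of the linear-adapter question, assuming we train a tuned-lens-style map from one model's hidden states into another's residual space. The model pair, layer indices, objective (matching the reader's own states on the same text), and two-sentence "corpus" are all illustrative choices.

```python
# Train a linear adapter that maps model A's hidden states into model B's
# residual space, so B can later "read" A's latents.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

a = AutoModel.from_pretrained("gpt2")          # "speaker" whose states we read
b = AutoModel.from_pretrained("distilgpt2")    # "reader" of the mapped states
tok = AutoTokenizer.from_pretrained("gpt2")    # both models share this tokenizer

adapter = torch.nn.Linear(a.config.hidden_size, b.config.hidden_size)
opt = torch.optim.Adam(adapter.parameters(), lr=1e-4)

texts = ["the quick brown fox", "jumps over the lazy dog"]  # stand-in corpus
for text in texts:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        h_a = a(ids, output_hidden_states=True).hidden_states[6]  # A, layer 6
        h_b = b(ids, output_hidden_states=True).hidden_states[3]  # B, layer 3
    # Train the adapter so mapped A-states match B's own states on the same
    # text; at test time, feed adapter(h_a) into B and probe/decode from there.
    loss = F.mse_loss(adapter(h_a), h_b)
    opt.zero_grad()
    loss.backward()
    opt.step()
```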

Computation verification

A subset of this is computation verification, which is a focus of my recent paper. There, we take the default state of open-source code as given and assume that determinism is too difficult to achieve. But I think it’s possible: when weight matrices were small, matmuls used atomic additions to get enough parallelism to exploit the GPU, which caused race conditions in the order of additions. Modern matmul kernels, however, have no race conditions, because parallelizing over the contraction dimension becomes unnecessary once the matrices are big enough. The Thinking Machines blog post provides a blueprint for this.
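
A quick empirical sanity check of the determinism claim (not the verification protocol from the paper): compute the same GPU matmul twice from identical inputs and compare bit-for-bit. A kernel that used atomic accumulation over the contraction dimension would typically fail this check.

```python
# Run-to-run bitwise determinism check for a GPU matmul.
import torch

assert torch.cuda.is_available(), "this check needs a GPU"
x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

y1 = x @ w
y2 = x @ w
torch.cuda.synchronize()
print("bitwise identical:", torch.equal(y1, y2))
```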

  • Build a fast AND deterministic model inference engine. Open-source it.  
  • This looks like writing CUDA kernels that make forward passes numerically deterministic, plus a scheduling engine that is deterministic (modulo the pseudo-random seed in both cases)  
  • We can probably get to 80% of vLLM speed.

  • Come up with better statistical tests (e.g. sequential ones) of whether or not a model is being served correctly
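
One hypothetical shape such a sequential test could take is Wald's SPRT over a stream of simple pass/fail checks, e.g. whether the served model's greedy token matches a trusted reference. The match rates and error rates below are assumptions for illustration only.

```python
# Toy sequential test (Wald's SPRT) for "is this endpoint serving the claimed model?".
import math

p0, p1 = 0.99, 0.90           # per-check match rate if honest vs. if swapped
alpha, beta = 0.01, 0.01      # tolerated false-alarm / missed-detection rates
upper = math.log((1 - beta) / alpha)   # cross this => conclude "swapped"
lower = math.log(beta / (1 - alpha))   # cross this => conclude "honest"

def sprt(matches):
    """matches: iterable of bools, True = served output agreed with reference."""
    llr, n = 0.0, 0
    for n, ok in enumerate(matches, start=1):
        # Log-likelihood ratio of "swapped" (rate p1) vs. "honest" (rate p0).
        llr += math.log(p1 / p0) if ok else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return n, "reject: not the claimed model"
        if llr <= lower:
            return n, "accept: consistent with correct serving"
    return n, "undecided"
```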

Interpretability: true probes

Probes, contrastive averaging and SAEs attempt to discover linear directions that represent concepts in NNs. Can we find the true concepts, and are they linear?

Example project: in my Sokoban work (https://arxiv.org/abs/2506.10138) we manually found linear directions that represent simple concepts, with much more causal effect than the directions we found using SAEs or probes, or would find using contrastive averaging, because of correlations between concepts. Can we design a probe training algorithm that finds the correct causal concept directions we previously found?
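
A toy illustration of the correlation problem on synthetic data: when a nuisance concept co-occurs with the labels, both the difference-of-means (contrastive) direction and a supervised logistic-regression probe mix it into the direction they recover. The data-generating process and all numbers here are made up for illustration.

```python
# Correlated nuisance concept leaking into both a contrastive direction and a probe.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_true = np.array([1.0, 0.0])    # the causal concept direction
d_corr = np.array([0.0, 1.0])    # a correlated nuisance direction

n = 2000
labels = rng.integers(0, 2, n)
# The nuisance feature agrees with the label 80% of the time.
nuisance = (rng.random(n) < 0.8) == labels.astype(bool)
acts = (labels[:, None] * d_true + nuisance[:, None] * d_corr
        + 0.1 * rng.normal(size=(n, 2)))

contrastive = acts[labels == 1].mean(0) - acts[labels == 0].mean(0)
probe = LogisticRegression().fit(acts, labels).coef_[0]

for name, v in [("contrastive", contrastive), ("probe", probe)]:
    v = v / np.linalg.norm(v)
    print(f"{name}: cosine similarity with the true direction = {v @ d_true:.2f}")
```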

Mentors

SF Bay Area
AI Welfare, Scalable Oversight, Compute Infrastructure

Adrià is an independent researcher focused on open-source self-alignment and self-exploration, and reproducible inference. Previously, he was a Research Scientist at FAR AI, where he reverse-engineered a recurrent neural network that plans. His previous interpretability work includes measuring progress in interpretability with InterpBench, Automatic Circuit Discovery and Causal Scrubbing. He previously worked at Redwood Research on neural network interpretability. He holds a PhD from the University of Cambridge, where he worked on Bayesian neural networks.

Mentorship style

Every week we have a meeting, where you are expected to bring up questions, problems that are preventing your progress, or things you would like advice on. We set goals for the next week. In the week between meetings, you work towards the agreed-upon goals. I am available to unblock via Slack or short meetings if necessary.

Scholars we are looking for

The two main qualities I look for in a scholar are:

  • Willingness to quickly iterate. Come up with possible explanations of a phenomenon, design experiments to distinguish between them, and execute those experiments. Don't get stuck in lots of theory.
  • Software engineering ability. I know Claude, Gemini, Codex, etc. can write lots of code, but being able to spot problems in and improve the output yourself results in a much smaller amount of better code and faster experiments.

Other important things:

  • Experience training ML models and debugging problems with the training run.
  • For some of the projects I have a strong preference for Jax, given how many fewer footguns it has.
  • Ability to understand and communicate algorithms/experiments

Other nice things:

  • Cheerful and self-motivated. Research involves lots of time just pushing by yourself (or, in our case, with your trusted teammate). This also makes it very fulfilling if it goes well.
  • Ability to write reasonably well

Not a good fit if:

  • You want to do lots of theory

Scholars can independently find collaborators, but this is not required.

Project selection

We'll talk about possible projects together. By the end of week 1 we should have something that we're both excited about, and we'll fix the decision in place by the middle of week 2.

I’m taking two scholars, and I hope that the two of you and I can agree on a single project together. I think a tiny research team can keep each other motivated and accomplish much more than two separate scholar-Adrià teams.

Community at MATS

MATS Research phase provides scholars with a community of peers.

During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.

Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.

Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Other impromptu office events included group-jailbreaking Bing chat and exchanging hundreds of anonymous compliment notes.  Scholars organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and joining ACX meetups.