Alignment is solved for models in the current paradigm. This shifts the threat model to good old human conflict, so I'm excited about coordination tech (AI cooperation, datacenter workload verification). For aligning future models, we have to forecast what future AGIs will look like and solve issues before they come up. I’m excited about models that maintain their goodness under self-directed learning and can align their successor.
I think that alignment of current models is basically solved. Have you talked to them? They’re really nice.
The main remaining problems are human conflict (which we can reduce with coordination tech) and the alignment risk posed by future models. To address the latter, we must forecast what future AGIs will look like and tailor our alignment methods accordingly.
The strongest reasons to think alignment hasn't been solved yet hinge on future models that have been heavily optimized with outcome-based RL. Technical alignment research should therefore anticipate this situation and empirically test what will happen.
Future models will probably also update their weights continuously (every startup and bigco is simultaneously trying to solve continual learning), so pre-deployment monitoring alone won't catch every issue. Another unsolved alignment subproblem is therefore ensuring that alignment stays stable under continual learning.
My best guess is we need to:
I’m therefore interested in doing dry-runs of this kind of setup. Some possible projects:
Once we can reliably make friendly AI, the problem becomes conflict between different groups of humans + AIs. We can try building tools that allow for honesty, cooperation, and bargaining between parties who don't have strong reasons to trust each other.
A subset of this is computation verification, the focus of my recent paper. There, we take the default state of open-source inference code as given and assume that full determinism is too difficult to achieve. But I think it is achievable: when weight matrices were small, matmul kernels used atomic additions to extract enough parallelism to saturate the GPU, which caused race conditions in the order of additions. Modern matmul kernels have no such race conditions, because parallelizing over the contraction dimension becomes unnecessary once the matrices are big enough. The Thinking Machines blog post provides a blueprint for this.
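As a toy illustration of why reduction order matters (a NumPy sketch, not code from the paper): floating-point addition is not associative, so when atomic additions let thread blocks combine partial sums over the contraction dimension in whatever order they finish, the low bits of the result can change from run to run.

```python
import numpy as np

# Sketch: floating-point addition is not associative, so combining partial
# sums in a different order (as atomics across thread blocks can do) changes
# the low bits of the result.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

# Reduce each chunk, then combine partial sums in a fixed forward order.
partials = [chunk.sum(dtype=np.float32) for chunk in np.split(x, 1000)]
forward = np.float32(0.0)
for p in partials:
    forward += p

# Combine the same partial sums in reverse order, mimicking a different
# (nondeterministic) completion order of the reduction.
backward = np.float32(0.0)
for p in reversed(partials):
    backward += p

# The two totals often differ in the last bits, even though the inputs are identical.
print(forward, backward, forward == backward)
```

Fixing the reduction order, or avoiding split reductions over the contraction dimension entirely as modern kernels do, removes this source of nondeterminism.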
Probes, contrastive averaging and SAEs attempt to discover linear directions that represent concepts in NNs. Can we find the true concepts, and are they linear?
Example project: in my Sokoban work (https://arxiv.org/abs/2506.10138) we manually found linear directions that represent simple concepts and have much more causal effect than the directions found by SAEs or probes, or than contrastive averaging would find, because of correlations between concepts. Can we design a probe-training algorithm that recovers the causal concept directions we previously found by hand?
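A minimal sketch of the baselines such a project would compare against (illustrative names, assuming cached activations and binary concept labels; not code from the paper): a logistic-regression probe direction versus a contrastive difference-of-means direction, each testable causally by steering activations along it.

```python
import torch
import torch.nn.functional as F

def probe_direction(acts, labels, steps=200, lr=1e-2):
    """Logistic-regression probe: learn a direction that predicts the concept label."""
    w = torch.zeros(acts.shape[1], requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        loss = F.binary_cross_entropy_with_logits(acts @ w, labels.float())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.normalize(w.detach(), dim=0)

def contrastive_direction(acts, labels):
    """Contrastive averaging: difference of class means over the activations."""
    return F.normalize(acts[labels == 1].mean(0) - acts[labels == 0].mean(0), dim=0)

def steer(acts, direction, alpha):
    """Causal test: push activations along a candidate direction, then re-run
    the rest of the network on the edited activations and measure the effect."""
    return acts + alpha * direction
```

The research question is how to change the probe's training objective so that, despite correlated concepts, the recovered direction matches the manually found causal one rather than a merely predictive one.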
Adrià is an independent researcher focused on open-source self-alignment, self-exploration, and reproducible inference. Previously, he was a Research Scientist at FAR AI, where he reverse-engineered a recurrent neural network that plans. His earlier interpretability work includes measuring progress in interpretability with InterpBench, Automatic Circuit Discovery, and Causal Scrubbing. Before that, he worked at Redwood Research on neural network interpretability. He holds a PhD from the University of Cambridge, where he worked on Bayesian neural networks.
Every week we have a meeting where you are expected to bring questions, problems that are blocking your progress, or things you would like advice on. We set goals for the next week. Between meetings, you work towards the agreed-upon goals. I am available to unblock you via Slack or short meetings if necessary.
Self-alignment:
Embedding communication:
Computation verification:
General CUDA resources:
The two main qualities I look for in a scholar are:
Other important things:
Other nice things:
Not a good fit if:
Can independently find collaborators, but not required
We'll talk about possible projects together. By the end of week 1 we should have something that we're both excited about, and we'll fix the decision in place by the middle of week 2.
I'm taking two scholars, and I hope that the two of you and I can all agree on a project together. A tiny research team can keep each other motivated and accomplish much more than two separate scholar-Adrià teams.
MATS Research phase provides scholars with a community of peers.
During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.
Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.
Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Other impromptu office events included group-jailbreaking Bing chat and exchanging hundreds of anonymous compliment notes. Scholars organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and joining ACX meetups.