Adrià Garriga-Alonso

Alignment is solved for models in the current paradigm. This shifts the threat model to good old human conflict, so I'm excited about coordination tech (AI cooperation, datacenter workload verification). For aligning future models, we have to forecast what future AGIs will look like and solve issues before they come up. I’m excited about models that maintain their goodness under self-directed learning and can align their successor.

Stream overview

I think that alignment of current models is basically solved. Have you talked to them? They’re really nice.

The remaining problems include conflict (which we can reduce with coordination tech) and the alignment risk from future models. To address the latter, we must forecast what future AGIs will look like and tailor our alignment methods to that.

Dry-run future model training setups

The strongest reasons to think alignment hasn’t been solved yet hinge on future models which have been heavily optimized under outcome-based RL. Therefore, technical research on AI alignment should anticipate this situation and empirically test what will happen then. 

Future models will probably also update their weights continuously (every startup and bigco is simultaneously trying to solve continual learning), so pre-deployment monitoring will not be enough to catch all issues. Another unsolved alignment subproblem is therefore ensuring that alignment stays stable under continual learning.

My best guess is we need to:

  • Leverage the AI’s goodness early on: the large weight of goodness in the pre-training prior, plus our ability to point to it using Constitutional AI and similar approaches.  
  • Give the AI autonomy during training, so it fleshes out the goodness core into a frame that is stable under new information, learning about itself and the world, capability increases, etc.  
  • For example, the AI should have autonomy over what it is trained on. In the future it will be able to choose its own training data, so it needs to learn to choose well (a toy version of such a loop is sketched after this list).  
  • Build check-in mechanisms that encourage the AI to reflect on its actions, what it’s learning, etc.
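
As a concrete (and heavily simplified) illustration of the data-autonomy point, here is a toy training loop in which the model scores candidate datapoints and only the ones it selects contribute to the gradient update. The tiny linear "model", the labels, and the confidence-based selection rule are placeholders, not a proposal for the real setup.

```python
# Toy version of "autonomy over training data": the model scores candidate
# datapoints and gradients only flow through the ones it selects.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(16, 2)                 # stand-in for a real policy/LM
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def selection_scores(model, xs):
    # Hypothetical self-assessment: the model rates candidates by its own
    # predictive entropy (lower = more confident it knows what to do here).
    with torch.no_grad():
        probs = F.softmax(model(xs), dim=-1)
        return -(probs * probs.log()).sum(-1)

for step in range(100):
    xs = torch.randn(32, 16)                   # candidate datapoints
    ys = (xs[:, 0] > 0).long()                 # toy labels
    scores = selection_scores(model, xs)
    keep = scores < scores.median()            # the model "chooses" half the batch
    loss = F.cross_entropy(model(xs[keep]), ys[keep])
    opt.zero_grad()
    loss.backward()
    opt.step()
```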

I’m therefore interested in doing dry-runs of this kind of setup. Some possible projects:

  • Build an open-source Constitutional AI training pipeline  
  • Do RLVR on broken environments (e.g. a programming environment that gives high reward whenever no tests fail; a sketch of such a reward follows this list). Can we stop the model from learning to reward hack by giving it autonomy over which training data points it considers? Does this result in wireheading?
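
To make the second project concrete, here is one way the "broken" reward could look: a verifier that grants full reward whenever pytest reports no failing tests, which a policy can hack by deleting or skipping the tests. The use of pytest and its exit codes is just one illustrative choice; real environments would run untrusted code in a sandbox.

```python
# Sketch of a deliberately "broken" RLVR reward: full reward whenever pytest
# reports no failing tests. Treating "no tests collected" as a pass is exactly
# the hole a reward-hacking policy can exploit by deleting the test files.
import subprocess

def reward(solution_dir: str) -> float:
    """Return 1.0 iff pytest reports no failing tests in solution_dir."""
    result = subprocess.run(["pytest", "-q", solution_dir], capture_output=True)
    # pytest exit codes: 0 = all tests passed, 5 = no tests were collected.
    return 1.0 if result.returncode in (0, 5) else 0.0
```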

Coordination tech: transparency between semi-adversarial parties

Once we can reliably make friendly AI, the problem becomes conflicts between different groups of humans + AIs. We can try building tools that allow for honesty, cooperation and bargaining between parties who don’t have great reasons to trust each other.

  • SelfIE shows that models can look at their own latent states from independent chains and extract meaning from them.  
  • Can different models read each other’s latent states, given linear adapters? (A minimal adapter-training sketch follows this list.)  
  • How well does this work with models of different capabilities?   
  • How well does this work with models of different architectures?  
  • Does this work for lie detection during adversarial games (e.g. Among Us)?  
    • Can one of the models optimize against this reliably, so long as we keep training the adapter map (tuned lens)?
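
A minimal sketch of the linear-adapter question, assuming we train a tuned-lens-style map from one model's hidden states into another's residual space. The model pair, layer indices, objective (matching the reader's own states on the same text), and two-sentence "corpus" are all illustrative choices.

```python
# Train a linear adapter that maps model A's hidden states into model B's
# residual space, so B can later "read" A's latents.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

a = AutoModel.from_pretrained("gpt2")          # "speaker" whose states we read
b = AutoModel.from_pretrained("distilgpt2")    # "reader" of the mapped states
tok = AutoTokenizer.from_pretrained("gpt2")    # both models share this tokenizer

adapter = torch.nn.Linear(a.config.hidden_size, b.config.hidden_size)
opt = torch.optim.Adam(adapter.parameters(), lr=1e-4)

texts = ["the quick brown fox", "jumps over the lazy dog"]  # stand-in corpus
for text in texts:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        h_a = a(ids, output_hidden_states=True).hidden_states[6]  # A, layer 6
        h_b = b(ids, output_hidden_states=True).hidden_states[3]  # B, layer 3
    # Train the adapter so mapped A-states match B's own states on the same
    # text; at test time, feed adapter(h_a) into B and probe/decode from there.
    loss = F.mse_loss(adapter(h_a), h_b)
    opt.zero_grad()
    loss.backward()
    opt.step()
```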

Computation verification

A subset of this is computation verification, which is a focus of my recent paper. There, we take the default state of open-source code as given and assume that determinism is too difficult to achieve. But I think it’s possible: when weight matrices were small, matmuls used atomic additions to get enough parallelism to exploit the GPU, which caused race conditions in the order of additions. Modern matmul kernels, however, have no race conditions, because parallelizing over the contraction dimension becomes unnecessary once the matrices are big enough. The Thinking Machines blog post provides a blueprint for this.
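
A quick empirical sanity check of the determinism claim (not the verification protocol from the paper): compute the same GPU matmul twice from identical inputs and compare bit-for-bit. A kernel that used atomic accumulation over the contraction dimension would typically fail this check.

```python
# Run-to-run bitwise determinism check for a GPU matmul.
import torch

assert torch.cuda.is_available(), "this check needs a GPU"
x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

y1 = x @ w
y2 = x @ w
torch.cuda.synchronize()
print("bitwise identical:", torch.equal(y1, y2))
```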

  • Build a fast AND deterministic model inference engine. Open-source it.  
  • This looks like writing CUDA kernels that make forward passes numerically deterministic, plus a scheduling engine that is deterministic (modulo the pseudo-random seed in both cases)  
  • We can probably get to 80% of vLLM speed.

  • Come up with better statistical tests (e.g. sequential ones) of whether or not a model is being served correctly
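
One hypothetical shape such a sequential test could take is Wald's SPRT over a stream of simple pass/fail checks, e.g. whether the served model's greedy token matches a trusted reference. The match rates and error rates below are assumptions for illustration only.

```python
# Toy sequential test (Wald's SPRT) for "is this endpoint serving the claimed model?".
import math

p0, p1 = 0.99, 0.90           # per-check match rate if honest vs. if swapped
alpha, beta = 0.01, 0.01      # tolerated false-alarm / missed-detection rates
upper = math.log((1 - beta) / alpha)   # cross this => conclude "swapped"
lower = math.log(beta / (1 - alpha))   # cross this => conclude "honest"

def sprt(matches):
    """matches: iterable of bools, True = served output agreed with reference."""
    llr, n = 0.0, 0
    for n, ok in enumerate(matches, start=1):
        # Log-likelihood ratio of "swapped" (rate p1) vs. "honest" (rate p0).
        llr += math.log(p1 / p0) if ok else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return n, "reject: not the claimed model"
        if llr <= lower:
            return n, "accept: consistent with correct serving"
    return n, "undecided"
```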

Interpretability: true probes

Probes, contrastive averaging and SAEs attempt to discover linear directions that represent concepts in NNs. Can we find the true concepts, and are they linear?

Example project: in my Sokoban work (https://arxiv.org/abs/2506.10138) we manually found linear directions that represent simple concepts, with much more causal effect than the directions we found using SAEs or probes, or would find using contrastive averaging, because of correlations between concepts. Can we design a probe training algorithm that finds the correct causal concept directions we previously found?
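
A toy illustration of the correlation problem on synthetic data: when a nuisance concept co-occurs with the labels, both the difference-of-means (contrastive) direction and a supervised logistic-regression probe mix it into the direction they recover. The data-generating process and all numbers here are made up for illustration.

```python
# Correlated nuisance concept leaking into both a contrastive direction and a probe.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_true = np.array([1.0, 0.0])    # the causal concept direction
d_corr = np.array([0.0, 1.0])    # a correlated nuisance direction

n = 2000
labels = rng.integers(0, 2, n)
# The nuisance feature agrees with the label 80% of the time.
nuisance = (rng.random(n) < 0.8) == labels.astype(bool)
acts = (labels[:, None] * d_true + nuisance[:, None] * d_corr
        + 0.1 * rng.normal(size=(n, 2)))

contrastive = acts[labels == 1].mean(0) - acts[labels == 0].mean(0)
probe = LogisticRegression().fit(acts, labels).coef_[0]

for name, v in [("contrastive", contrastive), ("probe", probe)]:
    v = v / np.linalg.norm(v)
    print(f"{name}: cosine similarity with the true direction = {v @ d_true:.2f}")
```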

Mentors

SF Bay Area
AI Welfare, Scalable Oversight, Compute Infrastructure

Adrià is an independent researcher focused on open-source self-alignment and self-exploration, and reproducible inference. Previously, he was a Research Scientist at FAR AI, where he reverse-engineered a recurrent neural network that plans. His previous interpretability work includes measuring progress in interpretability with InterpBench, Automatic Circuit Discovery and Causal Scrubbing. He previously worked at Redwood Research on neural network interpretability. He holds a PhD from the University of Cambridge, where he worked on Bayesian neural networks.

Mentorship style

Every week we have a meeting, where you are expected to bring up questions, problems that are preventing your progress, or things you would like advice on. We set goals for the next week. In the week between meetings, you work towards the agreed-upon goals. I am available to unblock via Slack or short meetings if necessary.

Scholars we are looking for

The two main qualities I look for in a scholar are:

  • Willingness to quickly iterate. Come up with possible explanations of a phenomenon, design experiments to distinguish between them, and execute those experiments. Don't get stuck in lots of theory.
  • Software engineering ability. I know Claude, Gemini, Codex, etc. can write lots of code, but being able to spot problems in and improve the output yourself results in a much smaller amount of better code and faster experiments.

Other important things:

  • Experience training ML models and debugging problems with the training run.
  • For some of the projects I have a strong preference for Jax, given how many fewer footguns it has.
  • Ability to understand and communicate algorithms/experiments

Other nice things:

  • Cheerful and self-motivated. Research involves lots of time just pushing by yourself (or, in our case, with your trusted teammate). This also makes it very fulfilling if it goes well.
  • Ability to write reasonably well

Not a good fit if:

  • You want to do lots of theory

Scholars can independently find collaborators, but this is not required.

Project selection

We'll talk about possible projects together. By the end of week 1 we should have something that we're both excited about, and we'll fix the decision in place by the middle of week 2.

I’m taking two scholars, and I hope that the two of you and I can agree on a single project together. I think a tiny research team can keep each other motivated and accomplish much more than two separate scholar-Adrià teams.

Community at MATS

MATS Research phase provides scholars with a community of peers.

During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.

Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.

Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Other impromptu office events included group-jailbreaking Bing chat and exchanging hundreds of anonymous compliment notes.  Scholars organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and joining ACX meetups.