Neev Parikh

I'm interested in empirical projects that improve our ability to evaluate model capabilities or to understand and evaluate model monitorability. An ideal project culminates in a research output (a conference/arXiv paper, or a research blog post with artifacts).

Stream overview

Some examples of projects I'd be excited about are:

  • Build an evaluation that measures how well models can predict which research directions/approaches are better than others, and how they compare to humans

  • Try to compute a "Judge" horizon, which aims to understand the trend of models accurately verifying/predicting the score or result of an evaluation transcript. Can we plot the discriminator-generator gap and understand whether the kinds of tasks that model N can verify are the kinds of tasks model N+1 gains the most improvement on? (A sketch of one way to compute this gap follows this list.)

  • Models could likely use inference-time compute more effectively on a number of current benchmarks. Try to quantify and reduce this gap, either by applying existing techniques that target specific weaknesses or by experimenting with new ones.
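For the judge-horizon idea above, here is a minimal sketch of what computing a per-task discriminator-generator gap could look like. The record format, field names, and tolerance threshold are assumptions made purely for illustration; nothing here is an existing METR dataset or API.

```python
# Illustrative sketch only: the record format, field names, and tolerance here
# are assumptions for the sake of the example, not an existing dataset or API.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class TranscriptRecord:
    task_id: str
    human_time_minutes: float      # task length, as in METR-style time horizons
    true_score: float              # ground-truth outcome for this transcript
    judge_predicted_score: float   # model-as-judge prediction for this transcript
    generator_succeeded: bool      # whether the generator model solved the task

def discriminator_generator_gap(records: list[TranscriptRecord], tol: float = 0.05) -> dict:
    """Per task: how often the judge scores the transcript correctly vs. how
    often the generator succeeds, plus the gap between the two rates."""
    by_task = defaultdict(list)
    for r in records:
        by_task[r.task_id].append(r)

    gaps = {}
    for task_id, rs in by_task.items():
        judge_acc = sum(abs(r.judge_predicted_score - r.true_score) <= tol for r in rs) / len(rs)
        gen_success = sum(r.generator_succeeded for r in rs) / len(rs)
        gaps[task_id] = {
            "human_time_minutes": rs[0].human_time_minutes,
            "judge_accuracy": judge_acc,
            "generator_success_rate": gen_success,
            "gap": judge_acc - gen_success,
        }
    return gaps
```

Plotting the gap (or judge accuracy alone) against human task time for successive model generations would give one candidate definition of a judge horizon to compare against the usual agentic time horizon.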

These project ideas are less well scoped, but I'd be interested and excited if scholars had clear ideas of what to work on here:

  • Build better red-team attack policies for monitorability evaluations. 

  • Models are increasingly trained with synthetically generated data. Does this mean that models will be able to instinctively "underperform" (or sandbag) by behaving like the less-capable model that generated their training data without needing to reason about this?

  • Understand whether emergent misalignment behaves similarly when training via reinforcement learning versus fine-tuning (e.g. do you see more reward hacking if you train models with RL in CTF or reverse-engineering environments?).

  • Try to understand the relationship between compute and time-horizon trends. METR's time-horizon measurements use release date on the x-axis rather than effective compute. Could we measure how effective compute relates to time horizon? (A minimal curve-fitting sketch follows this list.)

  • Try to understand how various recently proposed neuralese/latent-reasoning architectures compare to the current open-source default reasoning paradigm, and estimate how the alignment tax for visible chain-of-thought has changed over time. This might look like compiling published results and analyzing trends, but it might also involve collecting new data by benchmarking the relevant models.
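For the compute vs. time-horizon idea above, one simple starting point is a least-squares fit of log time horizon against log effective training compute, assuming you have (or can estimate) both quantities per model. The numbers below are made up for illustration, and whether a power law is even the right functional form is part of the question.

```python
# Sketch with placeholder numbers: the compute and horizon values below are
# made up for illustration, not measurements of any real models.
import numpy as np

effective_compute_flop = np.array([1e23, 3e23, 1e24, 5e24, 2e25])  # hypothetical
time_horizon_minutes = np.array([4.0, 9.0, 25.0, 70.0, 180.0])     # hypothetical

# Fit log10(horizon) = a * log10(compute) + b, i.e. horizon proportional to compute^a.
slope, intercept = np.polyfit(np.log10(effective_compute_flop),
                              np.log10(time_horizon_minutes), 1)

print(f"Each 10x of effective compute multiplies the time horizon by ~{10**slope:.1f}x")
```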

Mentors

Neev Parikh
METR, Member of Technical Staff
SF Bay Area
Dangerous Capability Evals, Red-Teaming, Model Organisms, Control, Monitoring

I like to make computers do interesting things, deeply understand concepts, and build useful tools. I'm currently thinking about AI alignment, control, and evaluations, and I work with frontier models at METR.

Recent work I've done includes MALT, training models to fool monitors in QA settings, and RE-Bench.

I've previously worked at Stripe and CSM, and did a concurrent BSc/MSc in Computer Science at Brown. 

Mentorship style

Time commitments: I don't expect to be able to spend more than 5 hours in any given week.

Meetings: I expect to have weekly project meetings of about an hour, where we chat about your results from the past week, planned next steps, and any blockers or uncertainties. We'll also have a monthly check-in on broader progress towards the project's overall goals.

Help outside of meetings: I am available to provide some help most weeks outside of the meeting, but by and large I expect mentees to be self-directed and self-sufficient in solving problems. 

Scholars we are looking for

An ideal mentee has a strong AI research background (software engineering experience is a plus). It's important that they are self-motivated and can make weekly progress with little intervention. If you are interested in working on the less concretely scoped projects, I would expect you to be able to write well-scoped project proposals with realistic milestones and deliverables; evidence of past successful projects would be very helpful in evaluating this.

A mentee can be a PhD student, and they can work on a paper that will form part of their thesis.

Being able to independently find collaborators is a plus, but not required.

Project selection

I will talk through project ideas with the scholar.

Community at MATS

MATS Research phase provides scholars with a community of peers.

During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.

Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.

Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Other impromptu office events included group-jailbreaking Bing chat and exchanging hundreds of anonymous compliment notes.  Scholars organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and joining ACX meetups.