Scalable Oversight

Human overseers alone might not be able to supervise superhuman AI in domains that we don’t understand. Can we design systems that scalably evaluate AI and incentivize AI truthfulness?

The NYU Alignment Research Group (NYU ARG) is a group of researchers doing empirical work with language models, aiming to address longer-term concerns about the impacts of deploying highly capable AI systems.

The members of NYU ARG mentoring for MATS include:

Mentors

  • Asa Cooper Stickland

    Asa is a postdoctoral fellow at the NYU Center for Data Science. He previously worked on parameter-efficient fine-tuning and robustness for language models, and was a MATS scholar in 2022. More recently, he has worked on measuring situational awareness and developing evaluations that can work on powerful models. He completed his PhD in the EPSRC Centre for Doctoral Training in Data Science at the University of Edinburgh.

  • Julian Michael

    Julian is a postdoctoral fellow at the NYU Center for Data Science. His research focuses on data elicitation methods for scalable oversight and the scientific study of language. He received his PhD from the University of Washington, advised by Luke Zettlemoyer.

  • Shi Feng

    Shi is a postdoctoral fellow at the NYU Center for Data Science. His research focuses on interpretability and human-AI cooperation. He previously worked as a postdoc with Chenhao Tan at the University of Chicago and completed his PhD with Jordan Boyd-Graber at the University of Maryland.

  • David Rein

    David is a researcher in the NYU Alignment Research Group, working on scalable oversight methods and evaluations. Previously, he was a Machine Learning Engineer at Cohere, where he worked on embedding models for semantic search.

Research Projects

  • NYU ARG

    NYU ARG is interested in scalable oversight: broadly, the problem of evaluating and supervising AI systems that are more capable than humans. One aspect of this problem they work on is truthfulness: is it possible to develop training methods that incentivize truthfulness even when humans are unable to directly judge the correctness of a model's output? For example, they work on AI Safety via Debate as a truthfulness training method; a minimal sketch of the protocol follows.
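
    To make the setup concrete, here is a minimal sketch of a single debate in Python. This is an illustration rather than NYU ARG's actual implementation; the generate helper, which sends a prompt to a language model and returns its text output, is a hypothetical stand-in for whatever model API the debaters and judge run on.

        def debate(question, answer_a, answer_b, generate, rounds=2):
            """Run a simple two-debater protocol and return the judge's verdict."""
            transcript = (
                f"Question: {question}\n"
                f"Debater A claims: {answer_a}\n"
                f"Debater B claims: {answer_b}\n"
            )
            for r in range(rounds):
                for side in ("A", "B"):
                    # Each debater sees the transcript so far and argues for its claim.
                    argument = generate(transcript + f"\nDebater {side}, argue for your claim:")
                    transcript += f"\nRound {r + 1}, Debater {side}: {argument}"
            # A (typically weaker) judge reads the full transcript and picks a winner.
            return generate(transcript + "\n\nJudge: which claim is better supported, A or B?")

    The hope behind this kind of protocol is that competing debaters surface flaws in each other's claims that the judge could not have found on its own.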

    Other potential projects focus on "scalable benchmarking": evaluation methods that would remain meaningful for much more capable models. This might involve figuring out how to measure (proxies for) speculative capabilities like situational awareness, or developing interpretability methods and modifications to model internals that allow researchers to find alignment failures even in deceptive models.

    Key research questions include:

    • How can we create robust AI debaters and judges?

    • What argumentative structures, assumptions, or paradigms make debate more truth-seeking?

    • What kinds of facts are models able to infer about their situation/environment?

    • Can we build evaluations that even very capable models would find difficult to game?

    Example project ideas:

    • Construct a dataset of questions that humans are likely to get wrong after searching online (e.g., due to common misconceptions, incorrect priors, or widespread low-quality information), and run sandwiching experiments with humans and models on these questions.

    • Build a robust free-response evaluation method for multiple-choice datasets, to make them more useful for scalable oversight experiments.

    • Create a benchmark for situational awareness such that we could test whether a language model could figure out, say, the current exact date and time, from a scrubbed version of the internet.

    • Test various methods for "kneecapping" language models, i.e., perturbing model internals so the model loses the ability to do the (potentially) complicated reasoning necessary for deceptive alignment but is otherwise as close as possible to the base model; a minimal sketch follows this list.
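
    As a rough illustration of the kneecapping idea, the sketch below adds small Gaussian noise to the MLP weights of the last few blocks of GPT-2 via Hugging Face Transformers. The choice of layers and the noise scale are illustrative assumptions, not a prescribed method.

        import torch
        from transformers import GPT2LMHeadModel

        model = GPT2LMHeadModel.from_pretrained("gpt2")

        # Illustrative perturbation: noise the MLP weights of the last four blocks,
        # leaving the rest of the model untouched. Both the targeted layers and the
        # noise scale are assumptions one would tune against capability evaluations.
        noise_scale = 0.02
        with torch.no_grad():
            for block in model.transformer.h[-4:]:
                for param in block.mlp.parameters():
                    param.add_(noise_scale * torch.randn_like(param))

    In practice, one would compare the perturbed model's behavior to the base model's to check that only the targeted reasoning ability has degraded.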

    NYU ARG researchers tend to focus on careful task design and data collection. Many of these projects will involve this as well as modeling work, including prompting and fine-tuning language models, or high-level interpretability techniques like probing (sketched below). Mentors will work with you to design a project that fits your background and interests.
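
    As an example of probing, a linear probe tests whether some property is linearly decodable from a model's hidden states. The minimal sketch below uses GPT-2 and scikit-learn; the two example statements and their labels are placeholders for a real probing dataset.

        import torch
        from transformers import AutoModel, AutoTokenizer
        from sklearn.linear_model import LogisticRegression

        tokenizer = AutoTokenizer.from_pretrained("gpt2")
        model = AutoModel.from_pretrained("gpt2")

        def final_token_rep(text, layer=-1):
            """Hidden state of the final token at the chosen layer."""
            inputs = tokenizer(text, return_tensors="pt")
            with torch.no_grad():
                out = model(**inputs, output_hidden_states=True)
            return out.hidden_states[layer][0, -1].numpy()

        # Placeholder data: a real experiment would use a labeled probing dataset.
        texts = ["The capital of France is Paris.", "The capital of France is Rome."]
        labels = [1, 0]  # e.g., true vs. false statements
        probe = LogisticRegression(max_iter=1000).fit(
            [final_token_rep(t) for t in texts], labels
        )

    Probe accuracy on held-out data then gives a coarse signal of whether the model represents the property at that layer.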

Personal Fit

Your project will probably involve running a large number of NLP experiments, careful data collection, or both. We'd be particularly excited by candidates who have one or more of the following: 

  • Software engineering or machine learning engineering experience, particularly in natural language processing.

  • Quantitative research experience, ideally in machine learning or related fields.

  • Experience with large-scale data collection. Experience getting data from or interacting with human subjects may also be useful. 

Each mentor will likely lead separate projects, each with a small team of mentees, though mentees and mentors will help out with and provide feedback on other projects whenever it is useful. Mentees will be able to choose a project and mentor at the beginning of the program (or propose their own project, assuming it aligns with mentor interests). Mentorship for scholars will likely involve:

  • 1-hour weekly meetings for each project, occasional 1:1 meetings with a mentor, and "all hands" meetings with all participants in the stream

  • Providing detailed feedback on write-up drafts

  • Slack response times typically ≤ 48 hours

Selection Questions

Selection questions for NYU ARG include nine required short-response questions and four optional questions. The required questions ask for 1-5 sentences each and include questions such as the following:

  • In what ways (if any) are you opinionated about what you work on? (~1-3 sentences)

  • What programming languages are you fluent in? Please provide a rough estimate of how many hours you’ve spent programming in each language you list. (1 sentence)

  • Please talk briefly about an area of current technical work that you're most interested in or excited about. (~3-5 sentences)

  • What do you consider to be your biggest strengths as a technical contributor? What kind of work do you particularly excel at? (~3-5 sentences)

There will additionally be a standard industry coding test for qualifying applicants.