The above list of mentors is tentative.
Projects on this stream cluster into a few broad areas from the empirical track: scalable oversight, AI control, monitorability and interpretability, adversarial robustness, and security. Narrower threads include personas and character training, reward hacking, model spec, automated AI research, and safety eval infrastructure. The list below pulls from what individual mentors on this stream are actively working on, and you can see more of our team's published work at alignment.openai.com. Most fellows will work closely with one or two mentors on something that fits into that mentor's ongoing research line.
Some example projects:
Bijan is a Technical Program Manager at OpenAI. He previously worked as a research engineer at Scale AI, where he coauthored work on LLM jailbreaking and red-teaming workflows.
Christopher is a Research Scientist on the Alignment team at OpenAI working on fundamental and applied research. He focuses on privacy-preserving and adversarial machine learning, including memorization, privacy, and security harms in language modeling, as well as auditing for these risks and mitigating them.
Previously he was a Research Scientist at Google DeepMind and Google Brain on the Privacy and Security Research team, where he led privacy and security evals for their frontier model efforts.
Gabriel Wu is an AI alignment researcher at OpenAI. Previously, he directed the AI Safety Student Team at Harvard, where he earned a Master's degree in Computer Science and a bachelor's degree in Mathematics.
Isak is a Member of Technical Staff at OpenAI. Previously a Software Engineer at Google, he worked on applications of computer vision, natural language processing, and LLMs.
Isak earned a Master of Computer Science at Carnegie Mellon University, with published work in natural language processing, style transfer, multilingual grapheme-to-phoneme modeling, and computer vision.
James is a Researcher at OpenAI working on model personality, post-training, and personalization.
Jason is a Member of Technical Staff at OpenAI working on alignment and model behavior.
Joseph works in Detections and Response at OpenAI. His public security work includes using large language models to detect malicious macOS activity.
Juan is a researcher at OpenAI working on AI alignment and adversarial robustness. His public work includes model safety and refusal contributions for GPT-4, and instruction hierarchy training to improve robustness to jailbreaks and prompt injections.
Kaiwen is a researcher at OpenAI working on AI safety and RL. He earned his Ph.D. from Cornell Tech, where he researched and taught RL, causal inference, and LLMs.
He previously worked at Google, Microsoft, and Netflix on projects spanning core RL theory to scalable LLM algorithms. Before grad school, Kaiwen spent two years at Facebook building the RL Platform.
Maja is interested in building AI for humans and trained by humans. She was previously a Senior/Staff Research Engineer at DeepMind, an AI Resident at X, and an intern at DeepMind, Google, and Amazon. Her OpenAI work includes scalable oversight and auto-review systems for coding agents.
Ollie is an AI safety researcher on OpenAI’s Alignment team.
Sam is a Research Engineer on OpenAI’s Alignment team. He previously worked in NYU’s Alignment Research Group on scalable oversight and as a Software Engineer at Amazon. His research includes training language models to win debates with self-play, and recent OpenAI work on auto-review for agent actions.
Tom is a research scientist at OpenAI, working on interpretability of language models for AI safety. He was also a core developer of scikit-learn between 2015 and 2022.
Xiangyu is a researcher at OpenAI, where he works to make LLMs robust. Previously, he obtained his Ph.D. from Princeton University, advised by Prof. Prateek Mittal and Prof. Peter Henderson.
Essential:
Preferred (at least one of):