Joshua Clymer

Redwood Research

Researcher

Joshua works at the cutting edge of safety cases and safety evaluation methodologies more broadly. He believes this area will be important both (1) for improving researchers' understanding of the level of risk (it's very easy to underestimate risk in this field!) and (2) for giving researchers a 'feedback loop' for improving safety with AI assistance. Joshua's ultimate goal is to boil 'safety' down, to the greatest extent possible, into a number that AI agents can hill climb on.

Here are some highlights of his past work:

https://arxiv.org/abs/2403.10462 (Safety Cases, cited in a Senate hearing testimony)

https://arxiv.org/html/2501.17315v1 (A sketch of an AI control safety case)

https://arxiv.org/abs/2405.05466 (Poser: the first 'model organism testbed' for alignment faking he is aware of)

You can find more of Joshua's work here: 

https://scholar.google.com/citations?user=U72h6i8AAAAJ&hl=en

- He spends a lot of his time writing blog posts these days, so you might also be interested in looking here: 

    - https://www.lesswrong.com/users/joshc

Joshua has several in-progress projects:

- White box control: his team is prototyping evaluations for white box monitoring (e.g. probes).

    - https://docs.google.com/document/d/1gu4oUPgmG1SEN6FKph-ZuBWMTCUZkUki0mnmXWJTrKo/edit?usp=sharing

- Alignment Drift: Once Aligned, Not Always Aligned.

    - https://docs.google.com/document/d/1mAMU1ONX0ywLvHsJMg3NRf9r0FmALtwX75cBfdEvF7M/edit?usp=sharing

- He is also working on collusion evaluations with a mentee.

Joshua is interested in starting a project that prototypes evaluations for "lie detectors" and other "white box evaluations" for alignment faking; however, he is keen to give mentees a choice among many different projects (and to let them suggest their own) so that they are excited about what they work on.