Towards eliciting latent knowledge from LLMs with mechanistic interpretability

MATS Fellows:

Bartosz Cywiński, Emil Ryd

Authors:

Bartosz Cywiński, Emil Ryd, Senthooran Rajamanoharan, Neel Nanda

Citations:

9

Abstract:

As language models become more powerful and sophisticated, it is crucial that they remain trustworthy and reliable. There is concerning preliminary evidence that models may attempt to deceive or keep secrets from their operators. To explore the ability of current techniques to elicit such hidden knowledge, we train a Taboo model: a language model that describes a specific secret word without explicitly stating it. Importantly, the secret word is not presented to the model in its training data or prompt. We then investigate methods to uncover this secret. First, we evaluate non-interpretability (black-box) approaches. Subsequently, we develop largely automated strategies based on mechanistic interpretability techniques, including logit lens and sparse autoencoders. Evaluation shows that both approaches are effective in eliciting the secret word in our proof-of-concept setting. Our findings highlight the promise of these approaches for eliciting hidden knowledge and suggest several promising avenues for future work, including testing and refining these methods on more complex model organisms. This work aims to be a step towards addressing the crucial problem of eliciting secret knowledge from language models, thereby contributing to their safe and reliable deployment.

