Language models can explain neurons in language models
AI Summary: Demonstrates a scalable framework for mechanistic interpretability by using GPT-4 to automatically write, test, and score natural language explanations for the behavior of every single neuron in GPT-2.
Understanding the internal mechanisms of large language models is a critical bottleneck for AI safety and alignment. Given the billions of parameters in modern models, manual human inspection of individual neurons does not scale. We introduce an automated interpretability pipeline that uses a highly capable language model (GPT-4) to generate natural language explanations for the behavior of individual neurons in a smaller, target model (GPT-2). We then use the explainer model to simulate the neuron's activations conditioned on each explanation, and score the explanation by how well the simulated activations match the neuron's real activations. Using this automated pipeline, we generated explanations for all 307,200 neurons in GPT-2, marking a significant step toward scalable mechanistic interpretability.
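The scoring step described above can be sketched as follows. This is a minimal illustration, assuming a correlation-based score between simulated and real activations; the function name and inputs are hypothetical placeholders, not the paper's actual API.

```python
import numpy as np

def score_explanation(real_activations, simulated_activations):
    """Score an explanation by correlating simulated with real activations.

    A score near 1.0 means the explanation predicts the neuron's behavior
    well; a score near 0.0 means it carries little predictive information.
    (Hypothetical helper for illustration only.)
    """
    real = np.asarray(real_activations, dtype=float)
    sim = np.asarray(simulated_activations, dtype=float)
    # Guard against constant sequences, where correlation is undefined.
    if real.std() == 0 or sim.std() == 0:
        return 0.0
    # Pearson correlation between simulated and actual activations.
    return float(np.corrcoef(real, sim)[0, 1])

# Example: a simulation that tracks the real neuron closely scores near 1.
real = [0.1, 0.9, 0.0, 0.7, 0.2]
sim = [0.2, 0.8, 0.1, 0.6, 0.3]
print(round(score_explanation(real, sim), 3))
```

In the full pipeline this score would be computed over many text excerpts per neuron, so that an explanation is rewarded only if it predicts activations across diverse contexts.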