Language models can explain neurons in language models
AI Summary: Demonstrates a scalable framework for mechanistic interpretability by using GPT-4 to automatically write, test, and score natural language explanations for the behavior of every single neuron in GPT-2.
Understanding the internal mechanisms of large language models is a critical bottleneck for AI safety and alignment. Given the billions of parameters in modern models, manual human inspection of individual neurons does not scale. We introduce an automated interpretability pipeline that uses a highly capable language model (GPT-4) to generate natural language explanations for the behavior of individual neurons in a smaller, target model (GPT-2). We then use the explainer model to simulate the neuron's activations conditioned on each explanation, and score the explanation by how well the simulated activations match the neuron's real activations. Using this automated pipeline, we generated explanations for all 307,200 neurons in GPT-2, marking a significant step toward scalable mechanistic interpretability.
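The scoring step described above can be sketched as follows. This is a minimal illustration, assuming a correlation-based score between simulated and real activations; the function name and inputs are hypothetical placeholders, not the paper's actual API.

```python
import numpy as np

def score_explanation(real_activations, simulated_activations):
    """Score an explanation by correlating simulated with real activations.

    A score near 1.0 means the explanation predicts the neuron's behavior
    well; a score near 0.0 means it carries little predictive information.
    (Hypothetical helper for illustration only.)
    """
    real = np.asarray(real_activations, dtype=float)
    sim = np.asarray(simulated_activations, dtype=float)
    # Guard against constant sequences, where correlation is undefined.
    if real.std() == 0 or sim.std() == 0:
        return 0.0
    # Pearson correlation between simulated and actual activations.
    return float(np.corrcoef(real, sim)[0, 1])

# Example: a simulation that tracks the real neuron closely scores near 1.
real = [0.1, 0.9, 0.0, 0.7, 0.2]
sim = [0.2, 0.8, 0.1, 0.6, 0.3]
print(round(score_explanation(real, sim), 3))
```

In the full pipeline this score would be computed over many text excerpts per neuron, so that an explanation is rewarded only if it predicts activations across diverse contexts.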