Paper · 2025-05-12

AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection

Authors

Kai Hua · Steven Wu · Ge Zhang · Ke Shen

ABSTRACT

Recently, there has been growing interest in collecting reasoning-intensive pretraining data to improve LLMs' complex reasoning ability. Prior approaches typically rely on supervised classifiers to identify such data, which requires labeling by humans or LLMs and often introduces domain-specific biases. Because attention heads are crucial to in-context reasoning, we propose AttentionInfluence, a simple yet effective, training-free method that needs no supervision signal. Our approach enables a small pretrained language model to act as a strong data selector through a simple attention head masking operation. Specifically, we identify retrieval heads and compute the loss difference when masking these heads. We apply AttentionInfluence to a 1.3B-parameter dense model to conduct data selection on the SmolLM corpus of 241B tokens, and mix the SmolLM corpus with the selected subset comprising 73B tokens to pretrain a 7B-parameter dense model using 1T training tokens and WSD learning rate scheduling. Our experimental results demonstrate substantial improvements, ranging from 1.4pp to 3.5pp, across several knowledge-intensive and reasoning-heavy benchmarks (i.e., MMLU, MMLU-Pro, AGIEval-en, GSM8K, and HumanEval). This demonstrates an effective weak-to-strong scaling property, with small models improving the final performance of larger models, offering a promising and scalable path for reasoning-centric data selection.
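The scoring step described in the abstract (compare a document's loss with and without the retrieval heads masked, then keep the documents that are hurt most) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names are hypothetical, the loss values are made-up toy numbers rather than model outputs, and normalizing the loss difference by the base loss is an assumption made here so that documents of differing difficulty are comparable. The abstract itself only says "loss difference".

```python
import math


def mean_cross_entropy(logits, targets):
    """Mean next-token cross-entropy for one document.

    logits: list of per-position score lists (one score per vocabulary item).
    targets: list of the correct token index at each position.
    """
    total = 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
    return total / len(targets)


def influence_score(loss_base, loss_masked):
    """Relative loss increase caused by masking the selected heads.

    Assumption: the difference is divided by the base loss; the abstract
    does not specify a normalization.
    """
    return (loss_masked - loss_base) / loss_base


def select_top_fraction(scores, fraction):
    """Return the indices (ascending) of the highest-scoring documents."""
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy per-document losses (hypothetical numbers, not from the paper):
# base_losses come from the unmodified small model, masked_losses from the
# same model with its retrieval heads masked.
base_losses = [2.0, 2.0, 1.0, 4.0]
masked_losses = [2.2, 3.0, 1.5, 4.0]

scores = [influence_score(b, m) for b, m in zip(base_losses, masked_losses)]
selected = select_top_fraction(scores, 0.5)  # -> [1, 2]
```

In a real pipeline the two loss lists would come from running the small model twice over each document, which is where `mean_cross_entropy` would be used; the selection fraction would then be chosen to match the desired subset size (the abstract reports 73B selected tokens out of 241B, roughly 30%).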

ByteDance Research
