Towards On-Policy SFT: Distribution Discriminant Theory and its Applications
AI Summary: Reduces 'distributional drift' in LLMs, ensuring the model's behavior matches fine-tuning objectives more closely.
Supervised fine-tuning (SFT) is efficient but often generalizes worse than RL, a gap driven by RL's use of on-policy data. We propose a framework to bridge this gap by enabling on-policy SFT. We present Distribution Discriminant Theory (DDT), which quantifies the alignment between training data and the model-induced distribution. We introduce two techniques: In-Distribution Finetuning (IDFT), a loss-level method, and Hinted Decoding, which re-aligns the training corpus with the model's distribution. Experiments demonstrate that our framework achieves generalization performance on par with DPO and SimPO while maintaining the computational efficiency of a standard SFT pipeline.
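To make the "loss-level" idea concrete, here is a minimal, self-contained sketch of one way an in-distribution-weighted SFT loss could look. This is an illustrative assumption, not the paper's actual IDFT objective: it reweights each target token's negative log-likelihood by the model's own (detached) probability of that token, so targets the model already finds likely dominate the gradient and off-distribution targets are downweighted.

```python
import math

def sft_loss(token_probs):
    """Standard SFT: mean negative log-likelihood of the target tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def in_distribution_loss(token_probs):
    """Hypothetical in-distribution-weighted loss (NOT the paper's IDFT):
    each token's NLL is scaled by the model's own probability of that
    token (treated as a detached weight), then normalized. Tokens the
    model considers off-distribution contribute less to the loss."""
    z = sum(token_probs)
    return -sum((p / z) * math.log(p) for p in token_probs)

# Toy example: the third ground-truth token is off-distribution
# (the model assigns it probability 0.01), so the weighted loss
# suppresses its contribution relative to plain SFT.
probs = [0.9, 0.8, 0.01]          # model prob. of each target token
plain = sft_loss(probs)           # dominated by the unlikely token
weighted = in_distribution_loss(probs)
```

In this toy example the unlikely token dominates the plain SFT loss but is heavily downweighted in the weighted variant, which is the qualitative behavior a loss-level re-alignment aims for.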