Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment
Paper • Nov 5, 2024 • arxiv.org • Xin Xiao, Bohong Wu, Jiacong Wang, Chunyuan Li, Xun Zhou, Haoyuan Guo
Existing image-text modality alignment in Vision Language Models (VLMs) treats each text token equally in an autoregressive manner. Despite being simple and effective, this method results in sub-op...
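To make "treats each text token equally" concrete, here is a minimal sketch of a next-token negative log-likelihood in which every target token contributes with the same weight. The function name `token_nll`, the optional `weights` argument, and the probabilities are hypothetical, chosen only for illustration. The sketch does not reproduce the paper's method, which the truncated abstract above does not describe.

```python
import math

def token_nll(target_probs, weights=None):
    """Weighted mean negative log-likelihood over target tokens.

    target_probs: probability the model assigned to each ground-truth token.
    weights: optional per-token weights (hypothetical, for contrast only).
             When omitted, every token is weighted equally.
    """
    if weights is None:
        weights = [1.0] * len(target_probs)  # uniform: each token counts the same
    total_weight = sum(weights)
    weighted_sum = sum(-w * math.log(p) for w, p in zip(weights, target_probs))
    return weighted_sum / total_weight

# Hypothetical probabilities for three ground-truth text tokens.
target_probs = [0.9, 0.5, 0.1]

# Equal treatment of tokens, as in the standard autoregressive objective.
uniform_loss = token_nll(target_probs)

# Assumed non-uniform weighting, shown only to contrast with the uniform case.
reweighted_loss = token_nll(target_probs, weights=[0.0, 0.0, 1.0])
```

With uniform weights, a token that is easy to predict from text alone influences the loss as much as any other token. Passing non-uniform `weights` changes only how much each token contributes, not how each per-token term is computed.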