Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
Paper • Jun 3, 2024 • arxiv.org • Yang Jin, Zhicheng Sun, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, Yadong Mu
In light of recent advances in multimodal Large Language Models (LLMs), there is growing interest in scaling them from image-text data to more informative real-world videos. Compared to static ...