Quick answer
With the success of large language models (LLMs), integrating the vision model into LLMs to build vision-language foundation models has gained much more interest recently. , Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding.