Quick answer

With the success of large language models (LLMs), integrating the vision model into LLMs to build vision-language foundation models has gained much more interest recently. , Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding.

Paper2024-04-24•Source ↗•10 attns8,427 checkouts

Claim

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Authors

Discuss with Grok

Bo He·

Hengduo Li·

Young Kyun Jang·

Menglin Jia·

Xuefei Cao·

Ashish Shah·

Abhinav Shrivastava·

Ser-Nam Lim

ABSTRACT

With the success of large language models (LLMs), integrating the vision model into LLMs to build vision-language foundation models has gained much more interest recently. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding. In this study, we mainly focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously like most existing work, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning, and our model can achieve state-of-the-art performances across multiple datasets. Code available at https://boheumd.github.io/MA-LMM/.

#machine-learning 📋 Awesome List: multimodal #multimodal #deep-learning

Review Snapshot

Explore ratings

0.0

★★★★★

0 ratings

5 star

4 star

3 star

2 star

1 star

Recommendation

recommend this content.

Review this content

Share your opinion to help other learners triage faster.

Write a review

Invite a reviewer

Invite someone by email to share an invited review for MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding.

Author Inquiries

Public questions about this content. Attendemia will route your question to the author. Vote on the most important ones. No guarantee of response.

Post an inquiry

Sort by: Most helpful