Quick answer
AI Summary: By leveraging structured sparsity in Monarch matrices, this model reduces the quadratic complexity of video attention to near-linear.
The quadratic complexity of attention severely limits the context scalability of Video Diffusion Transformers (DiTs). We find that the sparse spatio-temporal attention patterns in Video DiTs can be naturally represented by the Monarch matrix—a class of structured matrices with flexible sparsity. We propose VMonarch, an attention mechanism that adaptively captures intra-frame and inter-frame correlations using Monarch factorization. We introduce a recomputation strategy for stability and an online entropy algorithm fused into FlashAttention for fast updates. VMonarch reduces attention FLOPs by 17.5x and achieves a speedup of over 5x for long videos while maintaining generation quality.
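The abstract does not spell out the Monarch factorization itself. As a rough illustration of the structured sparsity it relies on, here is a minimal NumPy sketch of a Monarch matrix–vector product in the general form M = P·L·P·R, with block-diagonal factors L and R and a transpose permutation P. The function name `monarch_matvec` and its argument layout are hypothetical, not taken from the paper:

```python
import numpy as np

def monarch_matvec(L_blocks, R_blocks, x):
    """Multiply a Monarch matrix M = P @ L @ P @ R by a vector x.

    L_blocks, R_blocks: shape (m, m, m) -- m diagonal blocks of size m x m,
    so the implicit dense matrix is n x n with n = m*m. P is the permutation
    that transposes an (m, m) grid (a perfect shuffle). Cost is ~2*n**1.5
    multiplies instead of n**2 for a dense matvec.
    """
    m = L_blocks.shape[0]
    # Block-diagonal R: block b acts on the contiguous chunk x[b*m:(b+1)*m].
    z = np.einsum("bij,bj->bi", R_blocks, x.reshape(m, m))
    # Permutation P: transpose the (m, m) grid.
    z = z.T.copy()
    # Block-diagonal L on the permuted chunks.
    y = np.einsum("bij,bj->bi", L_blocks, z)
    # Apply P again (the transpose permutation is its own inverse).
    return y.T.reshape(-1)
```

Because each factor touches only m-sized blocks, the product costs two batched m×m matmuls (about 2·n^1.5 FLOPs), which is the source of the near-linear scaling the abstract claims for attention over long token sequences.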
MonarchRT: Efficient Attention for Real-Time Video Generation