AI Summary: Introduces the Sparse Transformer, an architectural breakthrough that uses factorized attention patterns to break the quadratic memory barrier, enabling the modeling of massive sequences like raw audio and images.
Transformers are powerful sequence models, but their self-attention mechanism scales quadratically with sequence length, making them computationally prohibitive for long inputs such as high-resolution images, raw audio, or entire books. We introduce Sparse Transformers, which replace the dense attention matrix with factorized, sparse attention patterns. By combining strided and local attention kernels, we reduce memory and compute complexity from O(N^2) to O(N * sqrt(N)). This efficiency allows us to train generative models on sequences of tens of thousands of tokens. We achieve state-of-the-art autoregressive modeling results on CIFAR-10, Enwik8, and ImageNet-64, demonstrating that dense attention is not strictly necessary for modeling long-range dependencies.
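A minimal sketch of the factorized pattern the abstract describes, built as NumPy boolean masks. The helper name, the stride value, and the split into a "local" head and a "strided" head are our own illustrative choices, not taken verbatim from the paper:

```python
import numpy as np

def strided_sparse_masks(n, stride):
    """Hypothetical helper: build two causal attention masks whose union
    approximates the factorized strided pattern from the abstract.

    Head A (local): position i attends to the previous `stride` positions.
    Head B (strided): position i attends to positions j with (i - j) % stride == 0.
    Both masks are causal (j <= i).
    """
    i = np.arange(n)[:, None]  # query index, column vector
    j = np.arange(n)[None, :]  # key index, row vector
    causal = j <= i
    local = causal & (i - j < stride)            # head A: sliding local window
    strided = causal & ((i - j) % stride == 0)   # head B: every stride-th token
    return local, strided

n, stride = 64, 8  # stride ~ sqrt(n), so each head stores O(n * sqrt(n)) entries
local, strided = strided_sparse_masks(n, stride)

dense_entries = (n * (n + 1)) // 2               # causal full attention
sparse_entries = int(local.sum() + strided.sum())
print(dense_entries, sparse_entries)             # sparse is far smaller than dense
```

Composing the two heads lets information flow from any earlier position to any later one in two attention steps (local hop, then strided hop), which is how the factorization preserves long-range connectivity while each individual head stays sparse.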
Generating Long Sequences with Sparse Transformers