
Generating Long Sequences with Sparse Attention

Rewon Child · Scott Gray · Alec Radford · Ilya Sutskever

ABSTRACT

Transformers are powerful sequence models, but their self-attention mechanism scales quadratically with sequence length, making them computationally prohibitive for long inputs such as high-resolution images, audio, or entire books. We introduce Sparse Transformers, which replace the full attention matrix with factorized, sparse attention patterns. By combining strided and local attention kernels, we reduce memory and compute complexity from O(N^2) to O(N * sqrt(N)). This efficiency allows us to train generative models on sequences of tens of thousands of tokens. We achieve state-of-the-art autoregressive modeling results on CIFAR-10, Enwik8, and ImageNet-64, demonstrating that dense attention is not strictly necessary for long-range dependency modeling.
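The factorized patterns the abstract describes can be sketched as boolean attention masks: one "local" pattern covering the previous stride positions, and one "strided" pattern covering every stride-th earlier position. This is a minimal illustration of the idea, not the authors' fused GPU kernels; the function name and the sequence/stride sizes below are hypothetical choices for demonstration.

```python
import numpy as np

def factorized_sparse_masks(n, stride):
    """Build two factorized causal attention patterns:
    - local: each position attends to the previous `stride` positions
    - strided: each position attends to every `stride`-th earlier position
    Returns boolean masks of shape (n, n); True means key j is visible to query i.
    """
    i = np.arange(n)[:, None]   # query positions (column vector)
    j = np.arange(n)[None, :]   # key positions (row vector)
    causal = j <= i             # autoregressive constraint
    local = causal & (i - j < stride)
    strided = causal & ((i - j) % stride == 0)
    return local, strided

# Illustrative sizes: n = 1024, stride = sqrt(n) = 32 (hypothetical choice).
n, stride = 1024, 32
local, strided = factorized_sparse_masks(n, stride)
union = local | strided

dense_entries = n * (n + 1) // 2        # full causal attention: O(N^2)
sparse_entries = int(union.sum())       # factorized pattern: O(N * sqrt(N))
print(dense_entries, sparse_entries)
```

With stride set near sqrt(N), each row of the combined mask has roughly 2 * sqrt(N) visible keys, which is where the O(N * sqrt(N)) total comes from.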

Review Snapshot

Rating: 4.6 / 5 from 5 ratings (60% five-star, 40% four-star). 100% of reviewers recommend this content.
