AI Summary: Introduces the Sparse Transformer, an architectural breakthrough that uses factorized attention patterns to break the quadratic memory barrier, enabling the modeling of massive sequences like raw audio and images.
Transformers are powerful sequence models, but their self-attention mechanism scales quadratically with sequence length, making them computationally prohibitive for long inputs such as high-resolution images, raw audio, or entire books. We introduce Sparse Transformers, which replace the dense attention matrix with factorized, sparse attention patterns. By combining strided and local attention kernels, we reduce memory and compute complexity from O(N^2) to O(N * sqrt(N)). This efficiency allows us to train generative models on sequences of tens of thousands of tokens. We achieve state-of-the-art autoregressive modeling results on CIFAR-10, Enwik8, and ImageNet-64, demonstrating that dense attention is not strictly necessary for modeling long-range dependencies.
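A minimal sketch of the factorized pattern the abstract describes, built as NumPy boolean masks. The helper name, the stride value, and the split into a "local" head and a "strided" head are our own illustrative choices, not taken verbatim from the paper:

```python
import numpy as np

def strided_sparse_masks(n, stride):
    """Hypothetical helper: build two causal attention masks whose union
    approximates the factorized strided pattern from the abstract.

    Head A (local): position i attends to the previous `stride` positions.
    Head B (strided): position i attends to positions j with (i - j) % stride == 0.
    Both masks are causal (j <= i).
    """
    i = np.arange(n)[:, None]  # query index, column vector
    j = np.arange(n)[None, :]  # key index, row vector
    causal = j <= i
    local = causal & (i - j < stride)            # head A: sliding local window
    strided = causal & ((i - j) % stride == 0)   # head B: every stride-th token
    return local, strided

n, stride = 64, 8  # stride ~ sqrt(n), so each head stores O(n * sqrt(n)) entries
local, strided = strided_sparse_masks(n, stride)

dense_entries = (n * (n + 1)) // 2               # causal full attention
sparse_entries = int(local.sum() + strided.sum())
print(dense_entries, sparse_entries)             # sparse is far smaller than dense
```

Composing the two heads lets information flow from any earlier position to any later one in two attention steps (local hop, then strided hop), which is how the factorization preserves long-range connectivity while each individual head stays sparse.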
Generating Long Sequences with Sparse Transformers