Quick answer

We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action. To unify different modalities, inputs and outputs (images, text, audio, action, bounding boxes, etc.) are tokenized into a shared semantic space and then processed with a single encoder-decoder transformer model.

Paper • 2023-12-28

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

Authors

Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi

ABSTRACT

We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action. To unify different modalities, we tokenize inputs and outputs -- images, text, audio, action, bounding boxes, etc., into a shared semantic space and then process them with a single encoder-decoder transformer model. Since training with such diverse modalities is challenging, we propose various architectural improvements to stabilize model training. We train our model from scratch on a large multimodal pre-training corpus from diverse sources with a multimodal mixture of denoisers objective. To learn an expansive set of skills, such as following multimodal instructions, we construct and finetune on an ensemble of 120 datasets with prompts and augmentations. With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results in more than 35 benchmarks, including image generation and understanding, natural language understanding, video and audio understanding, and robotic manipulation. We release all our models to the research community.
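
As a rough illustration of what placing different kinds of inputs and outputs in one token space can look like, the sketch below maps a few words and a bounding box into a single integer id range. It is a toy and not the paper's tokenizer: the abstract does not say how any modality is tokenized, so the vocabulary sizes, the character-sum word ids, and every function name here are assumptions made only for this example.

```python
from typing import List, Sequence, Tuple

# Hypothetical sizes, chosen only for illustration.
TEXT_VOCAB_SIZE = 1000           # ids 0..999 stand in for text tokens
NUM_COORD_BINS = 100             # ids 1000..1099 stand in for quantized coordinates
COORD_OFFSET = TEXT_VOCAB_SIZE   # coordinate ids start right after the text ids


def encode_text(words: Sequence[str]) -> List[int]:
    """Map each word to a toy id in the text range (sum of code points, wrapped)."""
    return [sum(ord(c) for c in w) % TEXT_VOCAB_SIZE for w in words]


def encode_box(box: Tuple[float, float, float, float],
               width: float, height: float) -> List[int]:
    """Quantize (x0, y0, x1, y1) into NUM_COORD_BINS bins and shift into the shared range."""
    x0, y0, x1, y1 = box
    normalized = [x0 / width, y0 / height, x1 / width, y1 / height]
    # Clamp so a coordinate on the far edge (1.0) falls into the last bin.
    return [COORD_OFFSET + min(int(v * NUM_COORD_BINS), NUM_COORD_BINS - 1)
            for v in normalized]


def build_sequence(words: Sequence[str],
                   box: Tuple[float, float, float, float],
                   width: float, height: float) -> List[int]:
    """Concatenate text ids and box ids into one sequence over one vocabulary."""
    return encode_text(words) + encode_box(box, width, height)


if __name__ == "__main__":
    print(build_sequence(["a", "cat"], (0, 0, 50, 100), width=100, height=200))
```

Because both kinds of ids live in one range, a single sequence model can read or emit them without a separate head per modality, which is the general idea the abstract describes.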

#machine-learning 📋 Awesome List: multimodal #multimodal #deep-learning
