Quick answer
We present Unified-IO 2, the first autoregressive multimodal model capable of understanding and generating image, text, audio, and action. To unify these modalities, we tokenize inputs and outputs into a shared semantic space and then process them with a single encoder-decoder transformer model.
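
To make the "shared space, single model" idea concrete, here is a toy sketch in Python. It is not Unified-IO 2's actual tokenization: the vocabulary sizes, the offset layout, and every helper name (`tokenize_text`, `tokenize_image`, `tokenize_audio`, `build_sequence`) are assumptions made up for illustration. It only shows how several modalities could be turned into one integer sequence that a single sequence model might consume.

```python
# Toy illustration only. The vocabulary layout and all helper names are
# hypothetical; they are NOT taken from Unified-IO 2.

TEXT_VOCAB = 256   # assumption: one id per UTF-8 byte
IMAGE_VOCAB = 16   # assumption: 16 discrete image codes
AUDIO_VOCAB = 16   # assumption: 16 discrete audio codes

# Each modality gets its own disjoint id range inside one shared vocabulary.
TEXT_OFFSET = 0
IMAGE_OFFSET = TEXT_OFFSET + TEXT_VOCAB
AUDIO_OFFSET = IMAGE_OFFSET + IMAGE_VOCAB
SHARED_VOCAB_SIZE = AUDIO_OFFSET + AUDIO_VOCAB


def quantize(values, levels):
    """Map floats in [0, 1] to integer codes in [0, levels - 1]."""
    return [min(int(v * levels), levels - 1) for v in values]


def tokenize_text(text):
    """Encode text as byte ids in the text range."""
    return [TEXT_OFFSET + b for b in text.encode("utf-8")]


def tokenize_image(pixels):
    """Encode normalized pixel intensities as ids in the image range."""
    return [IMAGE_OFFSET + c for c in quantize(pixels, IMAGE_VOCAB)]


def tokenize_audio(samples):
    """Encode normalized audio samples as ids in the audio range."""
    return [AUDIO_OFFSET + c for c in quantize(samples, AUDIO_VOCAB)]


def build_sequence(text, pixels, samples):
    """Concatenate all modalities into one token sequence."""
    return tokenize_text(text) + tokenize_image(pixels) + tokenize_audio(samples)


seq = build_sequence("hi", [0.0, 0.5, 1.0], [0.25])
print(seq)
```

Because every id falls inside one shared vocabulary, a single encoder-decoder could in principle read and emit any modality with the same embedding table and output layer. A real system would replace the crude `quantize` step with learned encoders.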