Quick answer
We present MobileVLM, a competent multimodal vision language model (MMVLM) targeted to run on mobile devices. It combines language models at the scale of 1.4B and 2.7B parameters trained from scratch, a multimodal vision model pre-trained in the CLIP fashion, and cross-modality interaction via an efficient projector.
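The pipeline above (CLIP-style vision encoder → projector → language model) can be sketched as below. This is an illustrative NumPy mock, not the paper's implementation: the layer shapes, the random linear weights, and the simple average-pool downsampling are assumptions standing in for MobileVLM's actual lightweight projector design.

```python
import numpy as np

def project_vision_tokens(vision_tokens, W, b, downsample=2):
    """Map vision-encoder patch tokens into the LLM embedding space.

    vision_tokens: (num_patches, vision_dim) features from a CLIP-style encoder.
    W, b: a learned linear projection (random here, for illustration only).
    downsample: merge each downsample x downsample patch block by averaging,
    cutting the number of visual tokens fed to the language model.
    """
    n, d = vision_tokens.shape
    side = int(np.sqrt(n))  # assume a square patch grid
    grid = vision_tokens.reshape(side, side, d)
    # Average-pool the patch grid to reduce token count (an assumed
    # stand-in for the paper's efficient downsampling projector).
    pooled = grid.reshape(side // downsample, downsample,
                          side // downsample, downsample, d).mean(axis=(1, 3))
    tokens = pooled.reshape(-1, d)
    # Linear projection into the language model's embedding dimension.
    return tokens @ W + b

rng = np.random.default_rng(0)
vision_dim, llm_dim = 1024, 2048              # assumed dims, not the paper's
patches = rng.normal(size=(576, vision_dim))  # 24 x 24 patch grid
W = rng.normal(size=(vision_dim, llm_dim)) * 0.01
b = np.zeros(llm_dim)
visual_tokens = project_vision_tokens(patches, W, b)
print(visual_tokens.shape)  # (144, 2048): 4x fewer tokens, LLM-sized dims
```

The projected tokens would then be concatenated with the text embeddings before running the language model, which is how the projector mediates cross-modality interaction.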