Quick answer
As the key component of multimodal large language models (MLLMs), the visual encoder largely determines how well an MLLM understands diverse image content. For example, the CLIP vision encoder yields strong results on general image understanding but performs poorly on document and chart content.