Computer scientists from Auburn University and the University of Alberta have conducted a study questioning the visual capabilities of large language models (LLMs) that accept image inputs, commonly called vision-language models (VLMs). Pooyan Rahmanzadehgervi, Logan Bolton, Anh Totti Nguyen, and Mohammad Reza Taesiri tested several popular VLMs, including GPT-4o, Gemini-1.5 Pro, and Claude-3 Sonnet, on their ability to process and understand visual information.
The research, published on the arXiv preprint server, challenges claims that LLMs equipped with visual inputs can match human-like visual comprehension. The study highlights that while these models can take in visual data much as a camera does, their ability to interpret and reason about that data remains rudimentary.
The researchers illustrated this limitation by asking the models to perform basic visual tasks, such as counting overlapping circles or interconnected rings. The results revealed significant shortcomings: the models performed well only on configurations resembling images common in their training data, and struggled with unfamiliar arrangements or subtle details.
For instance, when asked to count interlocking rings beyond a certain number, the models faltered: configurations such as unusual ring interconnections were simply not well represented in their training data.
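To give a concrete feel for this kind of probe, the sketch below generates an image containing a chosen number of partially overlapping circles and asks a VLM to count them. It is a minimal illustration under stated assumptions, not the authors' benchmark code: the circle layout, prompt wording, model name, and helper functions are choices made for the example, and it assumes matplotlib, the openai package, and an OPENAI_API_KEY in the environment.

```python
# Minimal sketch of a circle-counting probe (not the authors' benchmark code).
# Assumes matplotlib and openai are installed and OPENAI_API_KEY is set.
import base64
import io
import random

import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
from openai import OpenAI


def render_overlapping_circles(n: int, seed: int = 0) -> bytes:
    """Draw n partially overlapping circle outlines and return PNG bytes."""
    rng = random.Random(seed)
    fig, ax = plt.subplots(figsize=(4, 4))
    for _ in range(n):
        x, y = rng.uniform(0.25, 0.75), rng.uniform(0.25, 0.75)
        ax.add_patch(plt.Circle((x, y), 0.18, fill=False, linewidth=2))
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.set_aspect("equal")
    ax.axis("off")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", dpi=150, bbox_inches="tight")
    plt.close(fig)
    return buf.getvalue()


def ask_model_to_count(png_bytes: bytes, model: str = "gpt-4o") -> str:
    """Send the image to a vision-capable model and request a count."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "How many circles are in this image? Reply with a number only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    n_circles = 7  # ground-truth count the model should recover
    answer = ask_model_to_count(render_overlapping_circles(n_circles))
    print(f"ground truth: {n_circles}, model answer: {answer}")
```

Varying the number of circles and the random seed makes it easy to check, informally, whether accuracy drops as the overlaps move away from familiar-looking arrangements.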
This study underscores that despite advancements in incorporating visual inputs into LLMs, these models are still far from achieving human-level understanding and processing of visual information. It emphasizes the need for further development in training methodologies and model architectures to enhance their visual reasoning capabilities.
As LLMs continue to evolve, bridging this gap in visual comprehension will be crucial for their broader applicability across various domains requiring nuanced visual understanding.