← Home

Quick answer

Despite the longstanding adage "an image is worth a thousand words," generating accurate hyper-detailed image descriptions remains unsolved. Trained on short web-scraped image text, vision-language models often generate incomplete descriptions with visual inconsistencies.

Claim

ImageInWords: Unlocking Hyper-Detailed Image Descriptions

Roopal Garg·
Andrea Burns·
Burcu Karagol Ayan·
Yonatan Bitton·
Ceslee Montgomery·
Yasumasa Onoe·
Andrew Bunner·
Ranjay Krishna·
Jason Baldridge·
Radu Soricut

ABSTRACT

Despite the longstanding adage "an image is worth a thousand words," generating accurate hyper-detailed image descriptions remains unsolved. Trained on short web-scraped image text, vision-language models often generate incomplete descriptions with visual inconsistencies. We address this via a novel data-centric approach with ImageInWords (IIW), a carefully designed human-in-the-loop framework for curating hyper-detailed image descriptions. Human evaluations on IIW data show major gains compared to recent datasets (+66%) and GPT4V (+48%) across comprehensiveness, specificity, hallucinations, and more. We also show that fine-tuning with IIW data improves these metrics by +31% against models trained with prior work, even with only 9k samples. Lastly, we evaluate IIW models with text-to-image generation and vision-language reasoning tasks. Our generated descriptions result in the highest fidelity images, and boost compositional reasoning by up to 6% on ARO, SVO-Probes, and Winoground datasets. We release the IIW Eval benchmark with human judgement labels, object and image-level annotations from our framework, and existing image caption datasets enriched via IIW-model.

Review Snapshot

Explore ratings

0.0
★★★★★
0 ratings
5 star
0%
4 star
0%
3 star
0%
2 star
0%
1 star
0%

Recommendation

0%

recommend this content.

Review this content

Share your opinion to help other learners triage faster.

Write a review

Invite a reviewer

Invite someone by email to share an invited review for ImageInWords: Unlocking Hyper-Detailed Image Descriptions.

Author Inquiries

Public questions about this content. Attendemia will route your question to the author. Vote on the most important ones. No guarantee of response.
Post an inquiry
Sort by: Most helpful