Producing stories grounded in visual content is an inherent trait of human intelligence and an integral aspect of interpersonal
communication. With the surge of advanced vision-to-language models, there has been increased interest in developing and understanding
the capabilities of models to generate visually grounded narratives. However, recent research has highlighted the challenges
in evaluating model-generated stories. In this work, we study these evaluation limitations in the visually grounded story
generation task by focusing on the recently released Visual Writing Prompts dataset and shared task. Through this study, we
also explore the capabilities of several general-purpose vision-to-language foundation models for generating stories grounded
in sequences of images. We observe that some recent models, such as Qwen2.5-VL, can generate stories that are coherent, consistent,
and well-grounded in the visual data. Nevertheless, in line with the recent studies in this area, we !nd that the existing
automatic evaluation metrics and methods are insu"cient in fully capturing all the aspects essential for assessing model-generated
stories. We believe our !ndings reinforce the evidence and arguments emphasizing the need for improvements to automatic approaches
that can comprehensively evaluate and understand models for visual storytelling.