Producing stories grounded in visual content is an inherent trait of human intelligence and an integral aspect of interpersonal
communication. With the surge of advanced vision-to-language models, there has been increased interest in developing and understanding
the capabilities of models to generate visually grounded narratives. However, recent research has highlighted the challenges
in evaluating model-generated stories. In this work, we study these evaluation limitations in the visually grounded story
generation task by focusing on the recently released Visual Writing Prompts dataset and shared task. Through this study, we
also explore the capabilities of several general-purpose vision-to-language foundation models for generating stories grounded
in sequences of images. We observe that some recent models, such as Qwen2.5-VL, can generate stories that are coherent, consistent,
and well-grounded in the visual data. Nevertheless, in line with the recent studies in this area, we !nd that the existing
automatic evaluation metrics and methods are insu"cient in fully capturing all the aspects essential for assessing model-generated
stories. We believe our !ndings reinforce the evidence and arguments emphasizing the need for improvements to automatic approaches
that can comprehensively evaluate and understand models for visual storytelling.