Understanding spatial relationships between objects in images is crucial for applications such as robotic navigation, augmented reality, and autonomous driving. However, existing vision-language benchmarks rarely target explicit spatial reasoning, limiting progress in this area. We attribute this limitation in part to open datasets and evaluation metrics that tend to overlook spatial details. To address this gap, we make three contributions. First, we substantially extend the COCO dataset with annotations of spatial relations, providing a resource for spatially aware image captioning and visual question answering. Second, we propose an evaluation framework with metrics that assess the spatial accuracy of image captions at both the sentence and dataset levels. Third, we benchmark a range of vision encoder–text decoder transformer architectures for image captioning using the introduced dataset and metrics. The results show that current models capture spatial information only partially, underscoring the challenge of spatially grounded caption generation.