We all know that modern AI can recognize objects in images. Show it a photo, and it will tell you: “cat,” “car,” “person.”
But what if we asked the AI:
“Is the cat in front of the sofa?”
“Are the two chairs side by side?”
That’s a different challenge — it’s about understanding where things are in relation to each other, not just recognizing the things themselves.
Today, AI struggles with this task. Describing spatial relationships like “to the left of,” “on top of,” and “next to” is still rare in machine-generated captions. And yet this kind of understanding is essential in many real-world applications:
- 🚗 Autonomous driving: knowing where a pedestrian is relative to a car.
- 🤖 Robotics: navigating around obstacles in complex environments.
- 🕶️ Assistive devices: describing scenes to visually impaired users.
- 📱 Augmented reality: placing digital content in the correct spot in physical space.
Our team set out to help address this challenge by building tools to train and evaluate AI models on spatial understanding in images.

Why Is This Hard?
The problem starts with data.
Most of the image-captioning datasets used today focus on what is in the image, rather than where things are located.
Typical captions might say:
“A man riding a bicycle” — but not: “A man riding a bicycle to the left of a car.”
Without many examples of spatial language in the training data, AI models don’t learn to express these relationships well.
Even harder: there was no good way to measure whether a model describes spatial relationships accurately. Existing evaluation metrics (BLEU, ROUGE, etc.) score how closely a generated caption’s wording overlaps with reference captions, but they do not check whether its spatial claims are correct.
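To make that gap concrete, here is a small illustration of our own (a sketch using NLTK’s BLEU implementation, not part of the original evaluation): flipping “left” to “right” reverses the spatial meaning of a caption but barely lowers its BLEU score.

```python
# Illustrative sketch (not from the original work): n-gram metrics such as BLEU
# barely penalize a caption whose only error is a flipped spatial relation.
from nltk.translate.bleu_score import sentence_bleu

reference = "a man riding a bicycle to the left of a car".split()
correct   = "a man riding a bicycle to the left of a car".split()
flipped   = "a man riding a bicycle to the right of a car".split()

print(sentence_bleu([reference], correct))   # 1.0: exact match
print(sentence_bleu([reference], flipped))   # ~0.66: still a high score, yet the spatial relation is wrong
```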
What We Did
To tackle this, we developed three key components:
1. New spatial training data
We enriched an existing large-scale image dataset, COCO, with spatially grounded captions — sentences that explicitly describe where objects are in relation to each other.
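As a rough sketch of how such captions can be grounded in annotations (illustrative only, and much simpler than the actual enrichment pipeline, which relied on computer vision and machine learning models), one can compare the centres of two COCO-style bounding boxes to pick a spatial phrase:

```python
# Minimal sketch (assumed logic, not the exact enrichment pipeline): derive a simple
# spatial phrase from two COCO-style bounding boxes given as [x, y, width, height].
def spatial_phrase(name_a, box_a, name_b, box_b):
    ax, ay = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2  # centre of object A
    bx, by = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2  # centre of object B
    dx, dy = ax - bx, ay - by
    if abs(dx) >= abs(dy):                      # horizontal offset dominates
        relation = "to the left of" if dx < 0 else "to the right of"
    else:                                       # vertical offset dominates (image y grows downward)
        relation = "above" if dy < 0 else "below"
    return f"The {name_a} is {relation} the {name_b}."

# Example with two hypothetical COCO-style annotations:
print(spatial_phrase("bicycle", [50, 200, 120, 80], "car", [300, 180, 200, 100]))
# -> "The bicycle is to the left of the car."
```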
2. A new way to measure spatial understanding
We created a simple but effective evaluation process:
- Does the model generate sentences that correctly describe spatial relationships?
- Does it do so consistently across different types of relationships and images?
Rather than relying on complicated language metrics, we finetuned several combinations of popular vision encoders and text decoders on the ground-truth spatial sentences we extracted with computer vision and machine learning models, and compared how well these state-of-the-art combinations learn to produce spatial descriptions.
This gives a more direct measure of whether a model truly understands space, rather than just generating plausible-sounding text.
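As an illustration of what one such encoder/decoder pairing can look like in practice, here is a minimal sketch using the Hugging Face Transformers library, with ViT as the vision encoder and GPT-2 as the text decoder chosen purely for illustration; it is not our exact training code.

```python
# Illustrative sketch (assumed setup, not the exact training code): pair a pretrained
# vision encoder with a pretrained text decoder for image-to-caption finetuning.
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, GPT2TokenizerFast

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # vision encoder (ViT)
    "gpt2",                               # text decoder (GPT-2)
)
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# GPT-2 has no padding token; reuse EOS, and tell the combined model which ids to use.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Finetuning then follows the usual seq2seq recipe: preprocess images with
# image_processor, tokenize the spatial captions as labels, and train.
```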
3. Testing different models
We evaluated several popular combinations of vision and text models and found that some combinations produce noticeably better spatial captions than others.
In particular, models that pair efficient vision transformers with robust language models perform best at capturing spatial relationships.
Why It Matters
This work is an important step toward AI systems that don’t just list objects, but can also reason about space:
- Helping robots navigate better
- Enabling safer autonomous vehicles
- Supporting more helpful assistive technologies
- Improving human-AI interaction in AR/VR systems
We’re making the enriched dataset and evaluation tools available to the community soon, so that others can build on this work and push spatial image captioning forward.
References
- Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems 30 (2017).
- Dosovitskiy, Alexey, et al. “An image is worth 16×16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).
- Lin, Tsung-Yi, et al. “Microsoft COCO: Common objects in context.” Computer Vision – ECCV 2014, Springer, 2014.
- Ranftl, René, Alexey Bochkovskiy, and Vladlen Koltun. “Vision transformers for dense prediction.” Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
- Radford, Alec, et al. “Language models are unsupervised multitask learners.” OpenAI Blog 1.8 (2019): 9.
Georgios Papadopoulos
Research Associate at Information Technologies Institute | Centre for Research and Technology Hellas