Audiovisual heritage refers to the collection of sound and moving image materials that capture and convey cultural, historical, and social information. It includes cultural products, such as films, radio broadcasts, music recordings, and other forms of multimedia, as well as the instruments, devices and machines used in their production, recording and reproduction, and the analog and digital formats used to store them.
Preservation and accessibility
Preserving audiovisual heritage is crucial because analog formats (like film reels, magnetic tapes, and vinyl records) are vulnerable to physical decay. Digital formats, meanwhile, face the risk of obsolescence as technology evolves. Preservation is only the first step, though, since digitizing and archiving often leave these materials difficult for the public to access. To allow the public to meaningfully engage with and understand the significance of each artifact, it is important to contextualize the artifact within a curated framework.
Reasons and methods to use Extended Reality
This is where Extended Reality (XR) comes in: an umbrella term that encompasses Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR). XR describes a complex branch of emerging technology that allows users to interact with content in immersive ways. XR can isolate users' senses from their physical environment and allow them to experience (e.g. see and listen to) audiovisual heritage artifacts in a virtual space specifically designed for that purpose. This can be seen as a virtual counterpart to how museums thoughtfully design physical displays to best showcase their exhibits. XR also enables creators to craft narratives around artifacts, enhancing their cultural and historical value, a key area where XR shines.
Examples
One such example is the VR work “Notes on Blindness” [1], which allows users to listen to original audio recordings of the writer John Hull describing his journey into blindness. The XR work lets users experience darkness while listening to the recordings and, in addition, visualizes the narrative with a subtle yet decisive aesthetic.
Another example is “Traveling While Black” [2], a VR work documenting racial discrimination against African Americans in the United States. This work uses original audio and film excerpts, including interviews with people who lived through segregation and with their descendants. Viewing this work from an immersive, first-person perspective, in contrast to viewing it on a flat monitor, gives the audience more affordances for critical engagement and self-reflection. Another great example is a VR work belonging to the exhibition ‘The March’ at the DuSable Museum of African American History in Chicago, chronicling the historic events of the 1963 March on Washington [3]. The work contains recordings from Martin Luther King Jr.’s iconic ‘I Have A Dream’ speech.
Limitations
Despite its potential, applied examples do not yet abound, because such productions involve significant expertise and cost. Limitations exist in hardware, software and HCI design, and they are gradually being addressed. Research is being invested in design methodologies to streamline production and improve audience satisfaction. Practical issues, like hardware production costs and form-factor discomfort, are being mitigated by commercial investments from major tech companies such as Microsoft, Google, Samsung, Apple, and Meta. Industry standards with cross-platform and legacy support, such as OpenXR, are another important factor for broader adoption. Finally, the audience’s familiarity with and interest in this technology is increasing as it permeates more and more aspects of daily life.
Conclusion
It is clear that extended reality technology can transform how we engage with our audiovisual heritage – it can offer contextualization, it can situate both the audience and the artifact in a narrative framework, and it can offer more depth and nuance to our interactions. While there are still challenges, the limitations are steadily being lifted by efforts from a multitude of involved fields – evidencing the importance of this domain. We eagerly anticipate the next innovative steps from museums, galleries, research centers, studios and film companies worldwide.
On January 26th 2024, the Maastricht University (UM) VOXReality team was hosted by the Artificial Intelligence 4 Language Technologies (AI4LT) group of the Karlsruhe Institute of Technology (KIT). It was a day-long workshop where both groups presented their work in Natural Language Processing (NLP) and, more specifically, Machine Translation (MT). Synergies between the two groups promise a bright future for applied language technologies!
UM kicked off the day by presenting the VOXReality project, its 3 use-cases, and its general objectives: (1) improve human-to-machine and human-to-human XR experiences, (2) widen multilingual translation and adapt it to different contexts, (3) extend and improve the visual grounding of language models, (4) provide accessible pretrained XR models optimized for deployment, and (5) demonstrate clear integration paths for the pretrained models. UM’s team member Yusuf Can Semerci, the scientific and technical coordinator, elaborated on the technical excellence of the project, which is ensured by applying state-of-the-art methods in automatic speech recognition (ASR), multilingual machine translation, vision and language models, and generative dialogue systems.
UM’s team has 2 active PhD candidates who shared their latest research endeavors. Abderrahmane Issam explained his latest work on efficient simultaneous machine translation (SiMT). The goal of SiMT is to provide accurate and as-close-to-real-time-as-possible translations by developing policies that balance the quality of the produced translation against the lag that is sometimes necessary for the model to have enough information to translate properly. UM’s proposed method learns when to wait for more input in the source language before starting to produce the translation, taking into account the uncertainty that comes with real-time applications. Results are promising both in translation accuracy and in reducing the necessary lag.
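To make the notion of a read/write policy concrete, here is a minimal sketch of the classic wait-k baseline, which always reads k source tokens before emitting each target token. This is only an illustration of the general idea, not UM's adaptive, uncertainty-aware method; `translate_step` is a hypothetical placeholder for a real SiMT decoder.

```python
def wait_k_policy(source_tokens, translate_step, k=3):
    """Toy wait-k simultaneous translation loop (illustrative only)."""
    source_prefix, target_prefix = [], []
    for token in source_tokens:
        source_prefix.append(token)
        # Emit a target token only once the reader is at least k tokens ahead of the writer.
        if len(source_prefix) - len(target_prefix) >= k:
            target_prefix.append(translate_step(source_prefix, target_prefix))
    # Flush: once the full source is available, finish the translation.
    while len(target_prefix) < len(source_prefix):
        target_prefix.append(translate_step(source_prefix, target_prefix))
    return target_prefix


# Toy "translation step" that just copies the next source token; a real SiMT system
# would run an encoder-decoder model here.
dummy_step = lambda src, tgt: src[len(tgt)].upper()
print(wait_k_policy("wir sehen ein beispiel".split(), dummy_step, k=2))
# ['WIR', 'SEHEN', 'EIN', 'BEISPIEL'] -- each word is emitted 2 source tokens "late"
```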
Pawel Maka presented his published paper on context-aware machine translation. Context plays an important role in all language applications: in machine translation it is essential for resolving ambiguities such as which pronoun should be used. Context can be represented in different ways and usually includes the sentences preceding (or following) the one we want to translate, either in the source or in the target language. Of course, the bigger the context, the more computationally expensive it is to run a translation model. Therefore, UM proposed different methods for efficiently “compressing” context through techniques like caching and shortening. The proposed methods are competitive both in accuracy and in the resources used (e.g. memory).
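As a rough illustration of why caching context pays off (our own toy sketch, not the paper's implementation), the encodings of previous sentences can be memoized so they are not recomputed for every new sentence; here `encode_sentence` is a hypothetical stand-in for an expensive encoder call.

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def encode_sentence(sentence):
    # Stand-in for an expensive encoder call; a real system would return a tensor.
    print(f"encoding: {sentence!r}")
    return tuple(hash(w) % 1000 for w in sentence.split())

def translate_with_context(document, window=2):
    outputs = []
    for i, sentence in enumerate(document):
        # The previous `window` sentences act as context; their cached encodings are
        # reused instead of being recomputed for every new sentence.
        context = [encode_sentence(s) for s in document[max(0, i - window):i]]
        outputs.append((encode_sentence(sentence), context))  # a real model would decode here
    return outputs

translate_with_context(["He saw her.", "Then he waved.", "She waved back."])
# Each sentence is encoded exactly once, even though it appears in several context windows.
```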
On the other hand, KIT’s team presented the EU project Meetween. Meetween aims to revolutionize video conferencing platforms, breaking linguistic barriers and geographical constraints. It aspires to deliver open-source AI models and datasets: multilingual AI models that focus on speech but support text, audio and video both as inputs and outputs, and multimodal, multilingual datasets that cover all official EU languages.
KIT’s team of PhD candidates presented their work on (1) multilingual translation in low-resource cases (i.e. for languages that are not widely spoken or for cases where data is not available), (2) low-resource automatic speech recognition, (3) the use of Large Language Models (LLMs) in context-aware machine translation and (4) quality/confidence estimation for machine translation.
We were happy to identify the overlaps between the two EU projects (VOXReality and Meetween) as well as between the UM and KIT teams. At the heart of both projects lies a common objective: harness the power of advanced AI technologies, particularly in the realms of Natural Language Processing (NLP) and Machine Translation (MT), to facilitate seamless communication across linguistic and geographical barriers. While the applications and approaches may differ, the essence of their goals remains intertwined. VOXReality (by UM) seeks to enhance extended reality (XR) experiences by integrating natural language understanding with computer vision. KIT’s Meetween project, on the other hand, takes a different but complementary approach to revolutionizing communication platforms. By fostering an environment of open collaboration and knowledge exchange, UM and KIT are more than excited about what the future brings in terms of their collaboration.
Jerry Spanakis
Assistant Professor in Data Mining & Machine Learning at Maastricht University
In our sixth installment of the Partner Interview series, we sit down with Stavroula Bourou, a Machine Learning Engineer at Synelixis Solutions S.A., to explore the company’s vital role in the VOXReality project. Synelixis, a leader in advanced technology solutions, has been instrumental in developing innovative virtual agents and immersive XR applications that are transforming how we experience virtual conferences. In this interview, Stavroula shares insights into their groundbreaking work and how they are driving the future of communication in the XR landscape.
Can you provide an overview of your organization's involvement in the VOXReality project and your specific role within the consortium?
Synelixis Solutions S.A. has been an integral part of the VOXReality project from its inception, serving as one of the original members of the proposal team. Our organization brings a wealth of experience to the table, participating in numerous EU-funded research projects and providing cutting-edge technology solutions.
In the VOXReality project, our roles span several domains, significantly enhancing the project’s success. One of our pivotal contributions is the development of a virtual agent designed for use in virtual conferences. This agent is designed to be user-friendly and non-intrusive, respecting user requests and preferences while assisting users by providing navigational help and timely information about the conference schedule, among other tasks. Its design ensures that interactions are helpful without being disruptive, allowing users to engage with the conference content effectively and comfortably.
Additionally, we have developed one of the three VOXReality XR Applications—the VR Conference application. This application recreates a professional conference environment in virtual reality, complete with real-time translation capabilities and a virtual assistant. It enables users to interact seamlessly in their native languages, thanks to VOXReality’s translation services, thus breaking down language barriers. Furthermore, the virtual agent provides users with essential information about the conference environment and events, enhancing their overall experience.
Furthermore, we have outlined deployment guidelines for the VOXReality models for four different methods: source code, Docker, Kubernetes, and ONNX in Unity. These guidelines are designed to facilitate the integration of VOXReality models into various applications, making the technology accessible to a broader audience.
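As a generic illustration of the ONNX route mentioned above (a sketch under our own assumptions, not the project's actual export scripts, model names, or file layout), a trained PyTorch model can be exported to ONNX roughly like this and then consumed by ONNX Runtime or Unity-side inference tooling:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a trained model; the real VOXReality models differ.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
dummy_input = torch.randn(1, 128)  # example input used to trace the graph

torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["input"], output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},  # allow variable batch size
)
# The resulting model.onnx can be loaded by ONNX Runtime, packaged in a container,
# or imported into a Unity project via its ONNX inference tooling.
```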
How do you envision the convergence of NLP and CV technologies influencing the Extended Reality (XR) field within the context of the VOXReality initiative?
In the context of the VOXReality initiative, the convergence of Natural Language Processing (NLP) and Computer Vision (CV) technologies is poised to revolutionize the Extended Reality (XR) field. By integrating NLP, we enhance communication within XR environments, making it more intuitive and effective. This allows users to interact with the system using natural language, significantly improving accessibility and engagement. Additionally, this technology enables users who speak different languages to communicate with one another or to attend presentations and theatrical plays in foreign languages, thus overcoming language barriers and reaching a broader audience. Similarly, incorporating CV enables the system to understand and interpret visual information from the environment, which enhances the realism and responsiveness of both virtual agents and XR applications.
Together, these technologies enable a more immersive and interactive experience in XR. For example, in the VOXReality project, NLP and CV are being utilized to create environments where users can naturally interact with both the system and other users through voice commands. This integration not only makes XR environments more user-friendly but also significantly broadens their potential applications, ranging from virtual meetings and training sessions to more complex collaborative and educational tasks. The synergy of NLP and CV within the VOXReality initiative is set to redefine user interaction paradigms in XR, making them as real and responsive as interacting in the physical world.
What specific challenges do you anticipate in developing AI models that seamlessly integrate language as a core interaction medium and visual understanding for next-generation XR applications?
One of the primary challenges in developing AI models that integrate language and visual understanding for next-generation XR applications is creating a genuinely natural interaction experience. Achieving this requires not just the integration of NLP and CV technologies but their sophisticated synchronization to operate in real-time without any perceptible delay. This synchronization is crucial because even minor lags can disrupt the user experience, breaking the immersion that is central to XR environments. Additionally, these models must be adept at comprehensively understanding and processing user inputs accurately across a variety of dialects. The complexity of processing multilingual and dialectical variations in real-time adds significant complexity to AI model development.
Moreover, another significant challenge is the high computational demands required to process these complex AI tasks in real-time. These AI models often need to perform intensive data processing rapidly to deliver seamless and responsive interactions. Optimizing these models to function efficiently across different types of hardware, from high-end VR headsets to more accessible mobile devices, is crucial. Efficient operation without compromising performance is essential not only for ensuring a fluid user experience but also for the broader adoption of these advanced XR applications. The ability to run these complex models on a wide range of hardware platforms ensures that more users can enjoy the benefits of enriched XR environments, making the technology more inclusive and widespread.
All these challenges are being addressed within the scope of the VOXReality project. Stay tuned to learn more about our advancements and breakthroughs in this exciting field.
How do you plan to ensure the adaptability and learning capabilities of the virtual agents in varied XR scenarios?
To ensure the adaptability and learning capabilities of our virtual agents in varied XR scenarios within the VOXReality project, we are implementing several key strategies. Firstly, we utilize advanced machine learning techniques to equip the virtual agents with the ability to learn from user interactions and adapt their responses over time. These techniques, including deep learning and large language models (LLMs), enable the virtual agents to analyze and interpret vast amounts of data rapidly, thereby improving their ability to make informed decisions and respond to user inputs in a contextually appropriate manner, making them more intuitive and responsive.
Moreover, we are actively creating and curating a comprehensive dataset that reflects the real-world diversity of XR environments. This dataset includes a wide array of interactions, environmental conditions, and user behaviors. By training our virtual agents with this rich dataset, we enhance their ability to understand and react appropriately to both common and rare events, further boosting their effectiveness across various XR applications.
Through these methods, we aim to develop virtual agents that are not only capable of adapting to new and evolving XR scenarios but are also equipped to continuously improve their performance through ongoing learning and interaction with users.
In the long term, how do you foresee digital agents evolving and becoming integral parts of our daily lives, considering advancements in spatial and semantic understanding through NLP, CV, and AI?
In the long term, we foresee digital agents evolving significantly, becoming integral to our daily lives as advancements in NLP, CV, and AI continue to enhance their spatial and semantic understanding. As these technologies develop, digital agents will become increasingly capable of understanding and interacting with the world in ways that are as complex as human interactions.
With improved NLP capabilities, digital agents will be able to comprehend and respond to natural language with greater accuracy and contextual awareness, making interactions feel more conversational and intuitive. This advancement also includes sophisticated translation capabilities, enabling agents to bridge language barriers seamlessly. As a result, they can serve global user bases by facilitating multilingual communication, which enhances accessibility and inclusivity. This will allow them to serve in more personalized roles, such as personal assistants that can manage schedules, respond to queries, and even provide companionship with a level of empathy and understanding that closely mirrors human interaction.
Advancements in CV will enable these agents to perceive the physical world with enhanced clarity and detail. They’ll be able to recognize objects, interpret scenes, and navigate spaces autonomously. This will be particularly transformative in sectors like healthcare, where agents could assist in monitoring and providing care, and in retail, where they could offer highly personalized shopping experiences.
Furthermore, as AI technologies continue to mature, we will see digital agents performing complex decision-making tasks, learning from their environments, and operating autonomously within predefined ethical guidelines. They will become co-workers, caregivers, educators, and even creative partners, deeply embedded in all aspects of human activity.
Ultimately, the integration of these agents into daily life will depend on their ability to operate seamlessly and discreetly, enhancing our productivity and well-being without compromising our privacy or autonomy. As we advance these technologies, we must also consider the ethical implications and ensure that digital agents are developed in a way that is beneficial, safe, and respectful of human values.
Stavroula Bourou
Machine Learning Engineer at Synelixis Solutions SA
In this fifth installment of our Partner Interview series, Leesa Joyce, Head of Research Implementation at Hololight, sits down with Carina Pamminger, Head of Research at Hololight, to explore their organization’s pivotal role in the VOXReality project. As a leader in extended reality (XR) technology, Hololight is pushing the boundaries of augmented reality (AR) solutions, particularly within industrial training applications. Through their work on the Virtual Training Assistant use case, Carina sheds light on how AR is transforming training processes by integrating AI-driven interactions and real-time performance evaluation. The interview delves into the innovative ways AR is being utilized to enhance assembly line training, the incorporation of safety protocols, and the future of immersive learning experiences at Hololight.
Can you provide an overview of your organisation's involvement in the VOXReality project and your specific role within the consortium?
HOLO is an extended reality (XR) technology provider contributing its augmented reality (AR) solutions for streaming and displaying 3D computer-aided-design (CAD) models and manipulating them in an AR environment. HOLO leads the task on developing novel interactive XR applications and is also the leader of the use case “Virtual Training Assistant”. The Training Assistant use case revolves around enhancing an AR industrial assembly training application by incorporating the automated speech recognition (ASR) model and dialogue system of VOXReality into the training process. Conventional training techniques frequently exhibit a deficiency in interactivity and adaptability, resulting in less-than-optimal educational results. Through the integration of artificial intelligence within the AR setting, this scenario aims to establish a more captivating and efficient training atmosphere. Noteworthy characteristics of the application encompass the visualization and manipulation of 3D CAD files within the AR environment, an interactive virtual training aide featuring real-time performance evaluation, as well as a dynamic dialogue system driven by natural language processing (NLP) and speech-to-text functionalities.
The prime constituent of the training assistant technology is the application Hololight Space Assembly. Trainees are guided to precisely assemble components within the CAD model, ensuring everything fits perfectly. The system effortlessly integrates with pre-existing asset bundles, providing all the necessary details, such as CAD files, tools, and additional elements like tables or shelves. It also includes intuitive scripts for model interaction, easy-to-navigate menus, and smart algorithms to enhance the assembly experience. In addition, Assembly leverages Hololight Stream to remotely render the application from a high-performance laptop to AR smart glasses, overcoming the device’s rendering limitations. This remote rendering and streaming setup allows the AR training application to be hosted on a powerful laptop (server) and seamlessly streamed to the HoloLens 2 (client).
How is AR seamlessly integrated into training applications, and what specific advantages does it bring to the learning experience?
Integrating AR into training applications allows assembly line workers to train in a highly realistic, digitally replicated environment that mirrors their actual workspace. This immersive experience helps workers develop muscle memory and recognize environmental cues, making the transition to the real assembly line smoother and more intuitive. Since the training environment is digital, it can be accessed from anywhere, at any time, providing flexibility and convenience for both trainees and companies.
Moreover, AR-based training is resource-friendly and cost-effective. Multiple workers can use the same training files repeatedly, allowing for efficient use of resources. The digital nature of the environment also means that training scenarios can be easily modified, redesigned, or personalized to meet specific needs, enhancing the learning experience. By incorporating sensory cues, AR helps reinforce learning, making it a powerful tool for building skills that are critical in a fast-paced, high-precision environment like an assembly line.
How does the AR training application cater to different skill levels among trainees, ensuring a gradual learning curve for beginners and challenging modules for more experienced assembly technicians?
The AR training application is designed to accommodate various skill levels, ensuring that both beginners and experienced assembly technicians can benefit from the training. For those with some assembly knowledge who need to master a new object, engine or machine, the difficulty modes come in handy. These modes guide trainees through the correct order of assembly, gradually increasing in complexity. This personalized approach allows the training to adapt to each individual's expertise and learning pace, making it accessible to slower learners while still providing a challenge for those who pick up the process quickly.
By progressing through these difficulty levels, trainees not only learn the assembly process but also reinforce it through repetition, ensuring they internalize each step. As they clear each difficulty mode, they build confidence and gradually commit the entire process to memory. This approach ensures that by the end of the training, regardless of their initial skill level, all trainees will have mastered the assembly process and be fully prepared to apply their knowledge on the actual assembly line.
Considering the critical nature of turbine assembly, how does the AR application incorporate safety protocols and guidelines to ensure that trainees adhere to industry standards during the training process?
The AR application prioritizes safety by guiding trainees through the correct order of turbine assembly, creating a complete awareness about the process and the parts that need to be handled, reducing the likelihood of mistakes that could lead to serious risks in real-life scenarios. By learning and practicing each step in a controlled, digital environment, trainees can focus on mastering the process without the immediate dangers associated with heavy machinery. This approach ensures that they are well-prepared to follow industry standards and protocols when transitioning to the actual assembly line, where adherence to safety guidelines is critical.
However, some safety aspects remain areas for improvement. Currently, ergonomic assessments can only be conducted in real-life settings, requiring external analysis to ensure proper posture and technique. Additionally, the integration of Personal Protective Equipment (PPE) within the AR training is limited due to compatibility issues between safety goggles and AR glasses. While the application effectively reduces risks by teaching the correct assembly sequence, future developments could enhance safety training by incorporating ergonomic evaluations and better PPE integration.
Looking ahead, what plans are in place for future enhancements and expansions of the AR training application for turbine assembly? Are there additional features or modules on the horizon to further enrich the learning experience?
The AR training application for turbine assembly is set to undergo significant enhancements, particularly with the integration of VOXY, an AI-assisted dialogue agent with voice assistance. VOXY is already a game-changing addition, streamlining interactions within the application by eliminating the need for clumsy AR hand gestures. This ensures a smoother, more immersive experience, allowing users to stay fully engaged with the training process. VOXY also introduces AI-driven support, making it easier for trainees to navigate complex assembly tasks while receiving real-time guidance and feedback.
Future expansions include developing a platform to host training files and an analysis mode to evaluate trainee performance more comprehensively. We’re also exploring the incorporation of real, trackable tools in the AR environment, enabling physical interaction with virtual elements to improve ergonomics and weight memory. Additionally, we’re researching ways to integrate safety equipment into the AR training, with ongoing efforts under the SUN project funded by Horizon Europe. These enhancements will not only enrich the learning experience but also ensure that trainees are better prepared for the physical demands and safety requirements of turbine assembly.
Recording video with modern capturing devices (digital cameras, web cameras, or cell-phone cameras) has proliferated worldwide for many years now. The reasons why we capture videos are numerous. People mostly want to capture important moments in their lives, or less important ones, using their mobile devices. Over the years, a person may accumulate several hundreds or thousands of videos and images. Video capturing has other operational applications, too, like video-based surveillance. In this type of surveillance, a place of interest that is visible to a camera is recorded in order to monitor what is happening in the surrounding area. But why would we need to capture video in this case? Shop owners would, for instance, utilise surveillance cameras to monitor people who navigate their shops for security or business management reasons. However, there can be more than that. Another idea could be to predict when and where people visiting a very large shop or a museum should be serviced by the staff. The full utility of video for artificially intelligent digital analysis is difficult to grasp. When we need to manage smaller or larger collections of video, one important question is this: how can we summarise, through text, the essential semantic visual information contained in a collection of videos?
As humans, we can instantly perceive some of the elements of our surrounding environment without even making a significant effort. Perceiving aspects of the visual world is essential for us to function within our communities. The human brain perceives the world around us visually by receiving visual information sensed directly through our eyes and transmitted via the optic nerve. This is true, but it is too coarse a statement about how vision works in the human species: it does not reveal the complexity of how human vision essentially works. In fact, to this day it is not fully understood how the brain processes visual information and how it makes sense of it.
Although important scientific questions remain in the modern understanding of the underpinnings of visual processing in the human brain, computer scientists have for years been trying to find explanations, algorithms and mathematical tools that can recreate visual analysis and understanding from visual data of different sorts (e.g., images and video). Making sense of videos, for instance, requires us to be able to detect and localise objects, track target moving objects, or take in a streaming video of a road scene from a car roof-mounted camera and find where the vehicles, pedestrians, or road signs are. These are only a few examples of the computer vision problems and applications studied by computer scientists and practitioners.
The top of the image above depicts a model of the biological analog for how the brain recognises objects, and the bottom shows an artificial analog for visual processing in the same task. In the biological analog, the human eye senses a green tree through the retina and passes the signals on to the optic nerve. The different cues of this image (such as motion, depth information and colour) are processed by the Lateral Geniculate Nucleus (LGN) in the thalamus. This layer-by-layer signal propagation in the LGN is one explanation of how the brain encodes raw sensory information from the environment. The outcome of this encoding stage is passed on to the layers of the visual cortex, which finally enables the brain to perceive the picture of a green tree. At the bottom of the figure, in contrast, we see how a relatively modern neural network analog works. At an initial step, raw visual data are captured by a camera that senses the visible part of the electromagnetic spectrum. The three layers of pixel intensities, one per colour component (R for red, G for green and B for blue), are then passed through a deep convolutional neural network. This forward pass encodes the raw input as an arrangement of numbers indexed horizontally and vertically; each number is addressed by a tuple of coordinates telling us where in each dimension it belongs. This arrangement of numbers gives us what we call “features”: a numerical “signature” or “fingerprint” of what the camera captured from a fixed position in the environment. Finally, another deep neural network with intermediate neural processing modules decodes these features by transforming them sequentially and non-linearly through different neural processing layers. By means of this forward propagation of information, a distribution over the likely categories of objects in the image is computed. How is this made possible? Simple: the model was trained to adapt its parameters to learn this association from raw images to distributions of object categories by minimising the categorisation error over a set of image versus category pairs.
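A minimal sketch of this artificial analog is shown below: a small convolutional encoder turns an RGB frame into a feature vector (the "fingerprint"), and a simple linear decoder turns that vector into a probability distribution over a handful of illustrative categories. This is a toy model of the pipeline described above, not the network from the figure.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(                      # encodes raw pixels into features
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                  # -> one 32-dim feature vector per image
    nn.Flatten(),
)
decoder = nn.Linear(32, 5)                    # 5 illustrative object categories

frame = torch.rand(1, 3, 224, 224)            # stand-in for a captured RGB frame
features = encoder(frame)                     # the numerical "fingerprint" of the frame
probs = decoder(features).softmax(dim=-1)     # distribution over the 5 categories
print(probs)                                  # sums to 1 across categories
```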
Being able to write computer programs that can tell us, through text, what is depicted in a video has been one important and general-purpose computer vision application. Although for decades humans have imagined intelligent machines that can visually perceive the world around them, convincingly taking in a video and robustly generating a text description of what it depicts only recently became possible with the development of robust algorithms. Before this time-point in the horizon of AI advances, attacking the video-to-text task with older scientific techniques and methods was practically impossible. In the years after this turning point, the most important scientific concepts used to create effective video captioning algorithms were borrowed from the AI subfield called deep learning. Moreover, computer hardware that can accelerate numerical computations has become available in the market, so that deep learning models with a large number of parameters can be trained on raw data. Graphics Processing Units (GPUs) are the go-to hardware technology enabling the development of deep learning-based models. In the era before the proliferation of deep learning (roughly before 2006, the year in which Deep Belief Networks were introduced), there were still important techniques and concepts that were employed to devise successful algorithms, but their capabilities fell short of those exhibited by deep learning models. Moreover, the deep learning-based video captioning algorithms that exist today have an enormous number of parameters, which was atypical of older algorithms (or models) designed for exactly the same task. Although the wide adoption of deep learning can be estimated to have started around 2006, it is important to note that the LeNet deep CNN model developed by Yann LeCun was published in 1998, and that basic elements of deep learning, such as the backpropagation algorithm for tuning model parameters, were developed in the eighties and nineties.
The sets of model parameters in deep learning models are found through algorithms that perform what is called function optimization. Through the use of function optimization algorithms, a model can hopefully work sufficiently well on its intended task. (Research on explainable AI, in turn, has contributed methods that can explain why a model produced a particular output, e.g., a classification or regression decision.) In the area of video captioning, many successful systems such as SwinBERT [1] follow this approach to train a video captioning model on a large dataset of videos harvested from the Web. Each of these videos is associated with a text caption, comprising one or more sentences, that was written by a human. What is significant here is that the designer of a video captioning deep learning algorithm can take in a large dataset of such videos and associated annotations and, after some amount of time that varies with the amount of data and the size of the model, come up with a good model that can be presented with new videos it has never seen during training and generate relatively accurate captions for them.
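A caricature of this function optimization process is sketched below, with random tensors standing in for encoded videos and tokenised captions; nothing here is SwinBERT's actual code, only the generic loop of minimising a captioning error over video-caption pairs.

```python
import torch
import torch.nn as nn

vocab_size, feat_dim, max_len = 1000, 256, 12
# Toy "captioner": maps a video feature vector to logits for every caption position.
model = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, vocab_size * max_len))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):                                    # toy dataset of random pairs
    video_features = torch.randn(8, feat_dim)              # stand-in for encoded videos
    captions = torch.randint(0, vocab_size, (8, max_len))  # stand-in for human captions
    logits = model(video_features).view(8, max_len, vocab_size)
    loss = loss_fn(logits.reshape(-1, vocab_size), captions.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                       # gradient step reduces the captioning error
```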
An ordered sequence of words that describes a video should normally relate to what a real human would say when describing that video. But is it technically trivial to generate a video caption by means of an algorithm? The answer is a mixed “yes and no”. It is partly “yes”, because scientists have already come up with capable algorithms for the task, although they are still not “entirely perfect”. It is partly “no”, because the problem of generating a video caption is ill-posed: it cannot be defined in a way that clearly determines what a video caption really is, so there can be no exact algorithm that produces unquestionably correct answers. To understand this better, imagine that you would normally make different statements after seeing the same video, depending on the details you actually want to highlight. So how can we decide a priori what an algorithm should say about a video when there are several possible statements we could make? There is no way to do this, because the algorithm may miss declaring something about the video that is, in fact, important to a human observing it. Therefore, facing this ill-posedness, we define video captioning only through a mathematical model that can convincingly perform it, even though that model is not the optimal one. As we already mentioned, we do not even know what the optimal model is in the first place! To train this suboptimal model, the designer again has to rely on a large dataset of raw videos and associated text annotations, so that the model learns to perform the associated task well.
In an attempt to grasp that mathematical models of reality can be suboptimal in some sense, it is helpful to recall a quote by George E. P. Box that says: “All models are wrong, but some are useful”. Video captioning systems that are built using deep learning modelling concepts can be said to operate at a level that can empirically prove they are useful and reliable for the task. Their results can achieve a good level of utility when we can quantify goodness, despite the fact that we know these models are not globally perfect models of the visual world sensed through a camera.
To glimpse how a real video captioning system works, we will provide a brief and comprehensible summary of a video captioning model called SwinBERT. This model was developed by researchers at Microsoft and was presented at the CVPR 2022 conference (see reference [1]), a top-tier conference for computer vision. The implementation of this model is publicly available on GitHub [3].
To get an understanding of how SwinBERT works, it is helpful to consider that in the physical world matter is made of small pieces organised hierarchically. Small pieces of matter combine with other small pieces to create bigger chunks of matter. For example, sand is made of very tiny rocks: thousands or millions of such small rocks arranged geometrically together make up a patch of sand. In the same way, small pieces tie together to form larger pieces of matter, and each small piece has a particular position in space. In the case of video captioning, the SwinBERT system provides a model that is trained to relate pieces of visual information (that is, image patches) to sequences of words. To make this happen, SwinBERT reuses two important earlier ideas. The first is the VidSwin transformer model, which represents video as 3D voxel patches and performs feature extraction to represent the visual content in an image sequence. VidSwin-generated representations can be used for classification tasks such as action recognition, among others. VidSwin was published before SwinBERT became available; it was created by Microsoft and presented in a 2022 paper at the CVPR conference [2]. The second is BERT (Bidirectional Encoder Representations from Transformers) [5], developed by Devlin and collaborators, a module that helps generate word sequences (that is, sentences).
To begin with, in order to understand the function of VidSwin, imagine a colourful mosaic created by an artist. To create a mosaic that depicts a scene, a mosaic designer takes very small coloured pieces of rock, each with a unique colour and texture. She stitches these small, colourful pieces of rock together, piece by piece, to form objects like a dolphin and the sea, as in the mosaic image on the left. Normally, each mosaic patch belongs to a single object: in the mosaic on the left, for example, some patches belong to the dolphin, others to the background, and others to the sea surface. Patches belonging to the same object are often adjacent, or at least near each other, while patches belonging to different objects are usually not adjacent, unless their boundaries happen to touch. If we represented the adjacency of image patches as a graph, we would naturally come up with a planar, undirected graph. For the mosaic depicted in the picture above, we can say that “the dolphin hovers over seawater”. Now imagine a scenario in which the same designer creates a series of slightly altered mosaics from the original one. The patches of each mosaic lie at the same positions, but their colour content changes from one mosaic image to the next. Imagine that some of the patches make the dolphin appear displaced over time, giving the impression that the dolphin is moving. Naturally, as we iterate over the mosaics from the first to the last, some image patches are correlated spatially (because they are adjacent within the same image), while other patches, belonging to different mosaic images, are correlated temporally. VidSwin aspires to model these dynamic patch relationships by adopting a transformer model that performs self-attention on the 3D patches both spatially and temporally, creating refined 3D patches that are well embedded in feature space. These refined patch embeddings are further transformed several times by self-attention layers in order to robustly model the dependencies among them at different scales of attention. Finally, these 3D embeddings are passed through a multi-head self-attention layer, followed by a non-linear transformation computed by a feed-forward neural network. VidSwin then outputs spatio-temporal features that numerically describe small consecutive frame segments in the video.
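To make the notion of 3D patches concrete, the sketch below partitions a toy video tensor into non-overlapping spatio-temporal patches, ready to be fed into a linear embedding layer. The patch sizes here are illustrative and are not VidSwin's exact configuration; this shows the partitioning step only, not the attention layers.

```python
import torch

# Toy video: batch=1, channels=3, frames=8, height=64, width=64.
video = torch.randn(1, 3, 8, 64, 64)
pt, ph, pw = 2, 4, 4  # temporal and spatial patch sizes (illustrative values)

B, C, T, H, W = video.shape
patches = (
    video.reshape(B, C, T // pt, pt, H // ph, ph, W // pw, pw)
         .permute(0, 2, 4, 6, 1, 3, 5, 7)   # group the patch-grid indices first
         .reshape(B, (T // pt) * (H // ph) * (W // pw), C * pt * ph * pw)
)
print(patches.shape)  # (1, 1024, 96): one flattened 3D patch per row, ready for embedding
```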
In the language modelling module of SwinBERT, called BERT [5], the ultimate goal is to generate a sequence of words that best describes the visual content of the video as captured by VidSwin. BERT captures the relationships between the words that appear in a sentence by considering the importance of an anchor word given the words that appear both to its left and to its right; for this reason, BERT is said to take the bidirectional context of words into account. BERT uses this bidirectional context to train itself on a large corpus of text, so that it can later be fine-tuned on other text corpora to serve downstream tasks. Like every deep learning model, BERT is trained by optimising objective functions; in its case, two of them. The first is the Masked Language Model (MLM) objective, where some words of a sentence are picked at random and masked out, requiring BERT to infer the masked (that is, missing) words correctly, again using the bidirectional context described previously. The second objective is Next Sentence Prediction (NSP), which pushes the model to understand relationships between pairs of sentences.
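The MLM idea is easy to try out directly with a pretrained BERT model; the snippet below is a small demonstration using the Hugging Face transformers library (it downloads the public bert-base-uncased weights on first run) and is unrelated to SwinBERT's own training code.

```python
# Requires: pip install transformers torch
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
# BERT uses both the left and right context to guess the masked word.
for candidate in unmasker("The dolphin [MASK] over the sea."):
    print(f'{candidate["token_str"]:>10}  score={candidate["score"]:.3f}')
```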
Now that we have described how the VidSwin transformer generates spatio-temporal visual features from frame segments in videos, and how BERT generates textual features, it is time to describe how these two elements are combined to form the SwinBERT model. The key ingredient is a model from the literature that can combine both worlds: the visual one and the textual one. One needs such a model in order to go from a VidSwin-based visual representation to a textual representation, which is the desired output of SwinBERT. The multimodal transformer fuses the visual and textual representations into a better representation that introduces simple and sparse interactions between the visual and textual elements. Such interactions between elements of two different modalities are, in fact, more easily interpretable, whereas everything-versus-everything interactions are more expensive and often unnecessarily complicated. SwinBERT avoids the latter via a multimodal transformer that employs a key element for processing multimodal data: the cross-attention layer. A plain transformer model [4] instead employs a self-attention layer, computing dense relationships between tokens of the same single modality. This multimodal attention layer learns a common representation of text elements and visual elements by computing linear combinations of them; after being passed through a feed-forward neural network, this representation is input to a seq2seq generation algorithm [6] that computes the video captions.
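To illustrate cross-modal attention in isolation, the sketch below lets text tokens attend over visual tokens using PyTorch's built-in multi-head attention. It is a generic cross-attention example with random tensors, not SwinBERT's actual multimodal transformer or its sparse attention mask.

```python
import torch
import torch.nn as nn

d_model = 512
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

text_tokens = torch.randn(1, 20, d_model)    # stand-in for BERT-style word embeddings
video_tokens = torch.randn(1, 784, d_model)  # stand-in for VidSwin spatio-temporal features

# Text queries attend over visual keys/values, fusing the two modalities.
fused, attn_weights = cross_attn(query=text_tokens, key=video_tokens, value=video_tokens)
print(fused.shape)         # (1, 20, 512): one fused vector per text token
print(attn_weights.shape)  # (1, 20, 784): how strongly each word attends to each video patch
```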
References
[1] Lin et al., SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022, pp. 17949-17958.
[2] Liu et al., Video Swin Transformer, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022, pp. 3202-3211.
[3] Code accessed at https://github.com/microsoft/SwinBERT
[4] Vaswani et al., Attention Is All You Need, in Proceedings of Neural Information Processing Systems 2017.
[5] Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in Proceedings of NAACL-HLT 2019.
[6] Sutskever, Vinyals and Le, Sequence to Sequence Learning with Neural Networks, in Proceedings of Neural Information Processing Systems 2014.
[7] Zhang and Lee, Robot Bionic Vision Technologies: A Review, Applied Sciences, 2022.
Sotiris Karavarsamis
Research Assistant at Visual Computing Lab (VCL)@CERTH/ITI
In this fourth installment of our Partner Interview series, we sit down with Manuel Toledo, Head of Production at VRDays Foundation, to explore the organization’s role in the VOXReality project. VRDays Foundation, known for its commitment to advancing immersive technologies and fostering dialogue around sustainable innovation, is playing a pivotal role within the VOXReality consortium. Manuel shares insights into how the foundation is bridging the XR industry with cutting-edge developments, particularly in the realm of virtual conferencing, and the transformative potential these innovations hold for the future of communication technologies.
Can you provide an overview of your organisation's involvement in the VOXReality project and your specific role within the consortium?
At VRDays Foundation, we are advocates of innovation and creative approaches to pushing the boundaries of immersive technologies and sparking debates on sustainable technology development. Joining forces with the VOXReality consortium aligns perfectly with our mission. We’re immensely proud to serve as a gateway, a bridge if you will, to the broader XR community and industry for this project.
Our contribution to the VOXReality consortium work lies in our extensive experience and network within the XR industry, where the consortium work will have a significant impact.
During the development of VOXReality, we take on several roles: contributing to partners’ work, developing one of the specific VOXReality use cases (VR Conference), and leading the pilot ideation, planning and delivery of all three use cases.
Moreover, we’re excited to amplify the impact of the consortium’s work by showcasing it at events like Immersive Tech Week in Rotterdam. It’s not just about what we accomplish within the consortium but also about how we extend its reach and influence to the broader XR community.
What technical breakthroughs do you anticipate during the course of the VOXReality project?
From the perspective of the VR Conference use case, we’re thrilled about the work we’re putting in alongside the VOXReality consortium. The implications for the event industry, especially in the realm of B2B events, are incredibly exciting.
The VR Conference case being developed by VOXReality promises to revolutionise the landscape, offering effective, high-end, non-physical, business-driven, multilingual, and assisted interactions for virtual visitors. This breakthrough will fundamentally reshape our understanding and experience of events.
What role do emerging technologies play in enhancing the technical capabilities of virtual conferencing solutions?
Thanks to today’s technological advancements, the boundaries of distance and presence have become merely matters of perception. Emerging technologies like VR, AR, and AI have opened up a new realm where perception is constantly pushed to its limits.
With VOXReality’s pioneering development of voice-driven interactions in XR spaces, both event organisers and attendees will face a fundamental shift in their preconceived notions. This innovative leap will, in turn, unlock fresh opportunities for organisers and businesses to enhance the value of their activities. Moreover, it will empower visitors to engage in meaningful and productive interactions, irrespective of their geographical constraints.
What business models do you think are worth exploring for the sustained growth of virtual conferencing technologies?
VRDays Foundation firmly believes that the development of virtual accessibility for conferences and trade shows holds the key to unlocking a wealth of new business opportunities for B2B events and their participants. By tapping into these opportunities, we can create fresh value for those already involved in events, particularly within the realm of B2B engagements such as trade shows, one-to-one meetings, demo sessions, and networking opportunities.
What role do you see virtual conferencing playing in the evolution of communication technologies over the next decade?
In its many formats, the development of virtuality in the next decade will bring change to conferencing at a speed never experienced before, from simple interactions with conference speakers to complex business agreements concluded safely and virtually. All these interactions will rest on communication technologies, bringing down barriers such as complex navigation and language limitations that are common to every event, especially today, when the scale and international reach of visitors demand new approaches from organisers.
Voice-driven interaction will play an important part in these developments by offering a seamless, intuitive means of engagement. It streamlines tasks, supports hands-free operation, and integrates with other modalities for richer experiences. Through personalisation and remote assistance, it promises to elevate usability and foster smoother interactions, charting new avenues for innovation and collaboration. In essence, it promises to elevate the usability, accessibility, and engagement of virtual conferencing, charting new avenues for innovation and cooperation in the years ahead.
In this third installment of our Partner Interview series, we had the pleasure of speaking with Petros Drakoulis, Research Associate, Project Manager, and Software Developer at the Visual Computing Lab (VCL)@CERTH/ITI, about their critical role in the VOXReality project. As a founding member of the project team, CERTH brings its deep expertise in Computer Vision to the forefront, working at the intersection of Vision and Language Modeling. Petros shares how their innovative models are adding a “magical” visual context to XR experiences, enabling applications to understand and interact with their surroundings in unprecedented ways. He also provides insights into the future of XR, where these models will transform how users engage with technology through natural, conversational interactions. Petros highlights the challenges of adapting models to diverse XR scenarios and ensuring seamless cross-platform compatibility, underscoring CERTH’s commitment to pushing the boundaries of immersive technology.
What is your specific role within the VOXReality Project?
CERTH has been a key contributor to the project since its conception, as it has been among the founding members of the proposal team. As one of the principal research institutes in Europe, our involvement concerns conducting research and providing technology to the team. In this project specifically, we saw a chance we wouldn’t miss: to delve into the “brave new world” of Vision and Language Modeling, a relatively new field that lies at the intersection of Computer Vision, which is our lab’s expertise, and Natural Language Processing, a flourishing field driven by the developments in Large Language Models and Generative AI (have you heard of ChatGPT? 😊). Additionally, we work on how to train and deploy all these models efficiently, an aspect that is extremely important due to the sheer size of the current model generation and the necessity of the green transition.
Could you share a bit about the models you're working on for VOXReality? What makes them magical in adding visual context to the experiences?
You put it nicely! Indeed, they enable interaction with the surrounding environment in a way that some years ago would have seemed magical. The models take an image or a short video as input (i.e. as seen from the user), and optionally a question about it, and provide a very human-like description of the scene or an answer to the question. This output can then be propagated to the other components of the VOXReality pipeline as “visual context”, endowing them with the ability to function knowing where they are and what is around them, effectively elevating their level of awareness. Speaking of the latter, what is novel about our approach is the introduction of inherent spatial reasoning, built deep into the models, enabling them to fundamentally “think” spatially.
Imagine we're using VOXReality applications in the future – how would your models make the XR experience better? Can you give us a glimpse of the exciting things we could see?
The possibilities are almost limitless and, as experience has shown, creators rarely grasp the full potential of their creations. The community has an almost “mysterious” way of stretching whatever is available to its limits, given enough visibility (thank you F6S!). Having said that, we envision a boom in end-user XR applications integrating Large Language and Vision models, enabling users to interact with the applications in a more natural way, using primarily their voice in a conversational manner together with body language. We cannot, of course, predict how long this transition might take or to what extent the conventional Human-Computer Interaction interfaces, like keyboards, mice and touchscreens, will be deprecated, but the trend is obvious nevertheless.
In the world of XR, things can get pretty diverse. How do your models adapt to different situations and make sure they're always giving the right visual context?
It is true that in “pure” Vision-Language terms, a picture is worth a thousand words, some of which may be wrong 🤣! For real, any Machine Learning model is only as good as the data it was trained on. The latest generation of AI models is undoubtedly exceptional, but largely due to learning from massive data. The standard practice today is to reuse pretrained models developed for another, sometimes generic, task and finetune them for the intended use-case, never letting them “forget” the knowledge they acquired from previous uses. In that sense, in VOXReality we seek to utilize models pretrained and then finetuned for a variety of tasks and data, which are innately competent at treating diverse input.
In the future XR landscape, where cross-platform experiences are becoming increasingly important, how is VOXReality planning to ensure compatibility and seamless interaction across different XR devices and platforms?
Indeed, the increase in edge-device capabilities we observe today is rapidly altering the notion of where the application logic should reside. Thus, models and code should be able to operate and perform on a variety of hardware and software platforms. VOXReality’s provision in this direction is two-fold. On one hand, we are developing an optimization framework that allows developers to fit initially large models to various deployment constraints. On the other hand, we definitely put emphasis on using as many platform-independent solutions as possible, in all stages of our development. Some examples of this include the use of a RESTful API-based model inference scheme, the release of all models in container-image form, and the ability to export them into various cross-platform binary representations such as ONNX.
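To illustrate what such a RESTful inference scheme looks like from a client’s point of view, here is a hypothetical sketch; the endpoint URL, route and payload fields are assumptions for illustration, not the actual VOXReality API.

```python
# Hypothetical client call to a containerized, RESTful inference service;
# the URL and field names are illustrative, not the actual VOXReality API.
import requests

resp = requests.post(
    "http://localhost:8000/v1/visual-context",   # assumed local service endpoint
    files={"image": open("frame.jpg", "rb")},    # the user's current view
    data={"question": "Where is the exit?"},     # optional VQA-style prompt
    timeout=10,
)
resp.raise_for_status()
print(resp.json())   # e.g. {"caption": "...", "answer": "..."}
```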
Petros Drakoulis
Research Associate, Project Manager & Software Developer at Visual Computing Lab (VCL) @ CERTH/ITI
In our second Partner Interview, we had the opportunity to discuss the VOXReality project with Konstantia Zarkogianni, Associate Professor of Human-Centered AI at Maastricht University. As the scientific coordinator of VOXReality, Maastricht University plays a crucial role in the development and integration of neural machine translation and automatic speech recognition technologies. Konstantia shares her insights into how Natural Language Processing (NLP), Computer Vision (CV), and Artificial Intelligence (AI) are driving the future of Extended Reality (XR) by enabling more immersive and intuitive interactions within virtual environments. She also discusses the technical challenges the project aims to overcome, particularly in aligning language with visual understanding, and emphasizes the importance of balancing innovation with ethical considerations. Looking ahead, Konstantia highlights the project’s approach to scalability, ensuring that these cutting-edge models are optimized for next-generation XR applications.
What is your specific role within the VOXReality Project?
UM is the scientific coordinator of the project and is responsible for implementing the neural machine translation and the automatic speech recognition. My role in the consortium is to monitor and supervise UM’s activities while providing my expertise on the ethical aspects of AI, along with the execution of the pilots and the open calls.
How do you perceive the role of Natural Language Processing (NLP), Computer Vision (CV), and Artificial Intelligence (AI) in shaping the future of Extended Reality (XR) as part of the VOXReality initiative?
VOXReality’s technological advancements in the fields of Natural Language Processing, Computer Vision, and Artificial Intelligence pave the way for future XR applications capable of offering high-level assistance and control. Language enhanced by visual understanding constitutes VOXReality’s main medium for communication, and it is implemented through the combined use of NLP, CV, and AI. The seamless fusion of linguistic expression and visual comprehension offers immersive communication and collaboration, revolutionizing the way humans interact with virtual environments.
What specific technical challenges is the project aiming to overcome in developing AI models that seamlessly integrate language and visual understanding?
Within the frame of the project, innovative cross-modal and multi-modal methods to integrate language and visual understanding will be developed. Cross-modal representation learning will be applied to capture both linguistic and visual information by encoding the semantic meaning of words and images in a cohesive manner. The generated word embeddings will be aligned with the visual features to ensure that the model can associate relevant linguistic concepts with corresponding visual elements. Multi-modal analysis involves the development of attention mechanisms that endow the model with the capability to focus on the most important and relevant parts of both modalities.
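As a rough illustration of what aligning word embeddings with visual features can look like, here is a generic contrastive-alignment sketch in the style of CLIP; the embedding dimensions and toy batch are placeholders, not the methods developed in the project.

```python
# Generic cross-modal alignment sketch (CLIP-style contrastive objective);
# embedding dimensions and inputs are placeholders, not VOXReality's models.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb, image_emb, temperature=0.07):
    # Project both modalities onto a shared unit hypersphere.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # Pairwise cosine similarities between every caption and every image.
    logits = text_emb @ image_emb.t() / temperature

    # Matching caption/image pairs sit on the diagonal of the similarity matrix.
    targets = torch.arange(text_emb.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage: a batch of 8 caption embeddings aligned with 8 image embeddings.
loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```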
How does the project balance technical innovation with ethical considerations in the development and deployment of XR applications?
VOXReality foresees the implementation of three use cases: (i) a digital agent assisting the training of personnel in machine assembly, (ii) virtual conferencing offering a shared virtual environment that allows navigation and chatting among attendees speaking different languages, and (iii) theatre incorporating language translation and visual effects. Focus has been placed on taking into consideration the ethical aspects of the implemented XR applications. Prior to initiating the pilots, the consortium identified specific ethical risks (e.g. misleading language translations), prepared the relevant informed consent forms, and drafted a pilot study protocol ensuring safety and security. Ethical approval to perform the pilots has been received from UM’s ethical review committee.
Given the rapid evolution of XR technologies, how is VOXReality addressing challenges related to scalability and ensuring optimal performance in next-generation XR applications?
The VOXReality technological advancements in visual language models, automatic speech recognition, and neural machine translation feature scalability and are designed to support next-generation XR applications. With the goal of delivering these models in plug-and-play, optimized form, modern data-driven techniques are applied to reduce the models’ inference time and storage requirements. To this end, a variety of techniques are being investigated to transform unoptimized PyTorch models into hardware-optimized ONNX ones. Beyond the VOXReality pilot studies that implement the three use cases, new XR applications will also be developed and evaluated within the frame of the VOXReality open calls. The new XR applications will be thoroughly assessed in terms of effectiveness, efficiency, and user acceptance.
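For readers unfamiliar with this conversion step, here is a minimal sketch of exporting a PyTorch model to ONNX so it can run on a hardware-optimized runtime; the model and input shape are placeholders, not one of the actual VOXReality models.

```python
# Sketch of converting an unoptimized PyTorch model to ONNX; the model and
# input shape are placeholders, not one of the actual VOXReality models.
import torch
from torchvision import models

model = models.resnet18(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)       # example input for tracing

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
# The exported model.onnx can then be served with a hardware-optimized runtime,
# e.g. onnxruntime.InferenceSession("model.onnx").
```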
Konstantia Zarkogianni
Associate Professor of Human-Centered AI, Maastricht University, MEng, MSc, PhD
In our first Partner Interview, Spyros Polychronopoulos from ADAPTIT S.A. discusses their role in developing the AR Theatre application for the VOXReality project. As XR technology experts, ADAPTIT has been deeply involved in the design and deployment process, ensuring that the technology aligns with live theatre needs. They’ve focused on user-friendly interfaces, seamless integration with theatre systems, and secure data protocols to protect intellectual property. Spyros also highlights strategies for future-proofing the application, including modular design and cross-platform development, with plans to adapt to emerging XR technologies and broaden access to theatre through affordable AR devices.
What is your specific role within the VOXReality Project?
Our organization, in our capacity as XR technology experts, has undertaken the design, development and deployment of the AR Theatre application. We have been engaged in the design process from the very beginning, e.g. in discussing, interpreting and contextualizing the user requirements. Throughout the process, we have been in close contact with the theatrical partner and use case owner, offering technological knowledge transfer to their artistic and management team. This framework of operations has proven critical in facilitating team-based decision-making during design, and thus in keeping in view the needs of both the XR technology systems and the theatrical ecosystem.
To facilitate our communication in an interdisciplinary team and to consolidate our mutual understanding, we have taken the lead in creating dedicated applications as deemed necessary.
Firstly, to render the VOX Reality capabilities in tangible, everyday terms, we created an easily distributable mobile application which demonstrates the VOX Reality models one by one in a highly controlled environment. This application can also function as a dissemination contribution for the VOX Reality project goals. We proceeded with developing a non-VOX Reality related AR application to practically showcase the XR device capabilities to the theatrical partner, and more specifically, to the team’s theatrical and art director with a focus on the device’s audiovisual capabilities.
Furthermore, we combined the two previous projects in a new AR-empowered application to better contextualize the VOX Reality services to a general audience which is unfamiliar with AR. Since that milestone, we have been developing iterations of the theatrical application itself with increasing levels of complexity. Our first iteration was an independent application running on the XR device which simulates the theatrical play and user experience. It was produced in standalone mode for increased mobility and testing and was used extensively for documenting footage and experientially evaluating design alternatives. The second iteration is a client-server system which allows multiple XR applications to operate in sync with each other. This was built for simulated testing in near-deployment conditions during development and was targeted at evaluating the more technical aspects of the system, like performance and stability. The third and last iteration will incorporate all the physical theatrical elements, specifically the actors and the stage, and will involve the introduction of new technology modules with their own challenges.
In summary, this has been a creative and challenging journey so far, with tangible and verifiable indicators for our performance throughout, and with attention to reusability and multifunctionality of the developed modules to reinforce our future development tasks.
As for my personal involvement, this has been a notably auspicious coincidence, since I myself am active in theatrical productions as a music producer and devoted to investigating the juncture of music creation and AI.
What considerations went into selecting the technology stack for the theatre use case within VOXReality, and how does it align with the specific requirements of live theatrical performances?
Given the public nature of the theatrical use case, the user-facing aspects of the system, specifically the XR hardware and the XR application user interface, were an important consideration.
In terms of hardware, the form factor of the AR device was treated as a critical parameter. AR glasses are still a developing product with a limited range of devices that could support our needs. We opted for the most lightweight available option with a glasses-like form to achieve improved comfort and acceptability. This option had the tradeoff of being cabled to a separate computing unit, which we considered of little concern given the static, seated arrangement in the theatre. In more practical terms, since the application should operate with minimal disturbance in terms of head and hand movement, in silence and in low-light conditions, we decided that any input to the application should be made using a dedicated controller and not hand tracking or voice commands.
In terms of user interface design, we selected a persona with minimal or no XR familiarity and that defined our approach in two ways: 1) we chose the simplest possible user input methods on the controller and we implemented user guidance with visual cues and overlays. We added a visual highlight to the currently available button(s) at any point and in the next iteration, we will expand on this concept with a text prompt on the functionality of each button, triggered by user gaze tracking. 2) we tried to find the balance between providing user control which allows for customization/personalization and thus improved comfort, and limiting control which safeguards the application’s stability and removes cognitive strain and decision-making from the user. This was addressed by multiple design, testing and feedback iterations.
How does the technical development ensure seamless integration with existing theatre systems, such as lighting, sound, and stage management, to create a cohesive and synchronized production environment?
As in most cases of innovative merging of technologies, adaptations from both sides of the domain spectrum will need to be made for a seamless merger. One problematic area involves the needs of the spatial mapping and tracking system used by XR technology. Current best practices for its stable operation dictate conditions that typically do not match a theatrical setup: the system requires well-lit conditions that remain stable throughout the experience, performs best in small to medium-sized areas, needs surfaces with clear and distinct features while avoiding certain textures, and so on. Failure of the spatial mapping and tracking system can lead to misplaced 3D content which no longer matches the scenography of the stage and thus breaks immersion and suspension of disbelief for the user. In some cases, failure may also lead to non-detection or inaccurate detection of the XR device controller(s), thus impeding user input.
To address this, recommendations for the stage’s scenography can be provided by the technical team to the artistic team. Examples are to avoid surfaces that are reflective, transparent, or uniform in color (especially avoiding black), or surfaces with strong repeating patterns. Recommendations can also address non-tangible theatrical elements, like the lighting setup. Best practices advise avoiding strong lighting that produces intense shadows, as well as plunging areas into total or near-total darkness.
Furthermore, there are spatial tracking support systems that a director may choose to integrate in experimental, narrative or artistic ways. One example is the incorporation of black-and-white markers (QR, ARUCO, etc) as scenography elements which have the practical function of supporting the accuracy of the XR tracking system or extending its capabilities (e.g. tracking moving objects).
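As a small illustration of how such markers are read by a vision system, here is a hypothetical detection sketch using OpenCV’s ArUco module (assuming OpenCV 4.7 or newer); it is not the AR Theatre implementation, whose tracking runs on the XR device itself.

```python
# Illustrative ArUco marker detection (OpenCV >= 4.7 assumed); not the actual
# AR Theatre tracking stack, which runs natively on the XR device.
import cv2

dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())

frame = cv2.imread("stage_view.jpg")             # hypothetical camera frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Detected corners/ids can anchor 3D content or track moving scenography.
corners, ids, _rejected = detector.detectMarkers(gray)
print(ids)
```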
Going even further, an artistic team may even want to examine a non-typical theatre arrangement which can better match the XR technology needs and lead to innovative productions. One example is the theatre-in-the-round setup, which has a smaller viewing distance between audience and actors and an inherently different approach to scenography (360° view). Other, even more experimental physical setups can involve audience mobility, like standing or walking around, which can make even more use of the XR capabilities of the medium in innovative ways, like allowing users to navigate a soundscape with invisible spatial audio sources or discover visual elements along pre-designed routes or from specific viewing angles.
In terms of audio input, the merger has fewer parameters. Currently, users listen to the audio feed from the theatre stage’s main speakers and receive no audio from the XR device. Innovative XR theatre design concepts around audio could involve making narrative and artistic use of the XR device speakers. This could, for example, be an audio recording of a thought or internal monologue that, instead of being broadcast from the main stage, plays directly on the XR device speakers, and thus very close to the viewer and at low volume. It could be an audio effect that plays in waves rippling across the audience, or plays with a spatialized effect somewhere in the hall, e.g. among the audience seating. Such effects could also make use of the left-right audio channels, thus giving a stronger sense of directionality to the audio.
The audio support could also be used in more practical terms. VOX Reality currently supports the provision of subtitles in the user’s language of choice. In the future, we could extend this functionality to provide a voice-over narration using natural-sounding synthetic speech in that language. This option would better accommodate people who prefer listening over reading for any physiological or neurological reason. This feature would require supplying the XR devices with noise-cancelling headphones, so that users may receive a clear audio feed from their XR devices, be isolated from the theatrical stage’s main speaker feed, and not produce audio interference for each other.
In summary, we are in the fortunate position not only to enact a functional merger of the XR technology and theatre domains as we currently know them, but also to envision, through the capabilities of XR, a redefinition of conventions that have shaped the public’s concept of theatrical experiences for centuries. We would summarize these opening horizons in three broad directions: 1) an amplification of inclusivity, by being able to provide customizable, individualized access to a collectively shared experience, 2) an amplification and diversification of the audiovisual landscape in the theatrical domain, and 3) an invigoration of previously niche ways, or an invention of totally new ways, for the audience to participate in the theatrical happenings.
Given the sensitive nature of theatrical scripts, what security protocols have been implemented to protect against unauthorized access?
Although our use case does not manage personal or sensitive medical data as in the domains of healthcare or defense, we meticulously examined the security of our system in terms of data traffic and data storage with respect to the intellectual property protection needs of the theatrical content. To cover the needs of the theatre use case, we designed a client-server system, with clients operating on the XR devices of the audience and the server operating on a workstation managed by the interdisciplinary facilitation team (the developer team and the theatre’s technical team). As context, the core reasons for the client-server system were, in summary: 1) to centralize the audiovisual input from the scene (microphone and video input) in order to safeguard input media quality, 2) to simultaneously distribute the output to the end-user devices in order to ensure synchronicity across the audience, and 3) to offload the demanding computational needs onto a more powerful device in order to avoid battery and overheating issues on the XR devices.
In terms of data traffic security, the server and the clients are connected to the same local Wi-Fi network, protected by a WPA2 password, and communicate using the WebSocket protocol for frequent and fast communication. The local Wi-Fi network is for the exclusive use of the AR theatre system and accessible only to the aforementioned devices, both as a safeguard against network bandwidth fluctuations, which could negatively affect the latency of the system and in turn the user experience during the performance, and as a security measure against data traffic interception. Furthermore, for the exact same reasons, the AI services also operate locally on the same network and are accessed using RESTful API calls, with the added protection of a secure transport protocol (HTTPS). In summary, the entire traffic is contained in a safe and isolated environment that can only be breached by an unauthorized network access violation.
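To make the synchronization idea concrete, here is a minimal, hypothetical sketch of a server broadcasting subtitle cues to connected clients over WebSockets, using the Python websockets library; the host, port and message format are assumptions for illustration, not the deployed AR Theatre stack.

```python
# Minimal broadcast sketch with the "websockets" library; host, port and
# message schema are illustrative assumptions, not the actual deployment.
import asyncio
import json
import websockets

CLIENTS = set()

async def handler(ws):
    CLIENTS.add(ws)                      # each XR headset registers on connect
    try:
        await ws.wait_closed()
    finally:
        CLIENTS.remove(ws)

async def broadcast_subtitle(text, language):
    message = json.dumps({"type": "subtitle", "lang": language, "text": text})
    # Push the same cue to every connected headset at (nearly) the same time.
    await asyncio.gather(*(ws.send(message) for ws in CLIENTS))

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()           # run until interrupted

if __name__ == "__main__":
    asyncio.run(main())
```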
In terms of data storage, it was decided that, in the release version of the application, no data logs will remain on the XR devices, since the devices are temporarily provided to the public without supervision and safeguarding the data against unauthorized access there was not feasible. Any data stored will reside on the server device, will hold no personalized information in any form, and will be used exclusively for technical purposes, like system monitoring and performance evaluation.
Considering the rapid evolution of technology, how is the technical development future-proofed to accommodate emerging advancements, and what strategies are in place for seamless upgrades or integrations with future technologies?
In a rapidly changing technological domain like XR and AI, planning for change is an integral part of design and development. For us, this means asking questions in two directions: 1) what the future fate of the current product can be, and 2) what the product can evolve into in the future with minimal effort. Answering these questions is enabled by the fact that we, as XR developers and producers of state-of-the-art XR applications, can create informed scenarios for the foreseeable future.
One such scenario, based on financial data and trends, is the growth of the XR market, and specifically the AR sector. This is expected to diversify the device range and reduce purchase costs. In turn, this can affect us by enabling the selection of even better-suited AR glasses for theatres, it can reduce the investment cost for adoption by theatrical establishments, and it can support the popularization of the XR theatre concept in artistic circles. At the same time, theatre-goers, in their role as individual consumers, can be expected to have increasing exposure and familiarity with this technology in general. Therefore, our evaluation for the first question is that we have good reason to expect that our current product will have increasing potential for adoption.
On the second question, our strategy is to vigorously uphold proper application design principles with an explicit focus on modular, maintainable and expandable design. Operationally, we are adopting a cross-platform development approach to be able to target devices running different operating systems using the same code base. We are prioritizing open frameworks to ensure compatibility with devices that are compliant with industry standards, thus minimizing intensive proprietary SDK use. In terms of system architecture, by separating the AI from the XR elements, we allow for independent development and evolution of each domain at its own speed and in its own direction. By building the connections with well-established methods that are unlikely to change, like RESTful API calls, we ensure that our product is in the best position to adapt to the potential reworking of entire modules. Furthermore, we adopt a design approach with segmented “levels of XR technology” so as to be able to easily create spin-offs targeting various XR-enabled hardware as it emerges. This does not necessarily mean more powerful devices, but also more popular ones. One current example we are investigating is to single out the subtitle provision feature and target affordable 2D AR glasses (also called HUD glasses, smart glasses or wearable monitors) as a means of increasing theatre accessibility.
Spyros Polychronopoulos
Researcher on the digital simulation of ancient Greek instruments, and lecturer teaching music technology and image processing.
Laval Virtual 2024 was an absolute blast, and VOXReality dove right into the heart of the action! Our mission? To scout for awesome SMEs ready to rock the XR world through our open call. But hey, it wasn’t all business—there was plenty of fun to be had!
Picture this: walking through the buzzing exhibition booths, learning a lot from the mind-blowing tech talks and conferences, and connecting with the coolest people in the European XR scene.
One standout moment? The Women In Immersive Tech (WIIT) gathering. Talk about empowering! We connected with amazing colleagues, exchanged ideas, and celebrated diversity in the industry. It was all about making meaningful connections.
We also got hands-on with jaw-dropping XR content and demos, exploring the edge of innovation. From mind-bending projects to meeting a diverse bunch of XR aficionados, Laval Virtual was the ultimate playground for techies like us!
As we say goodbye to Laval Virtual 2024, VOXReality is pumped up and ready to rock the XR world. Armed with our insights and a bunch of new friends, we’re gearing up to take SMEs on a ride through the immersive tech universe. It’s gonna be one exciting journey!
Hey there! I'm Natalia and I'm a corporate communications specialist. I also hold a Master's degree in Journalism and Digital Content Innovation from the Autonomous University of Barcelona. I currently work on the dissemination, communication, and marketing of technology, innovation, and science for projects funded by the European Commission at F6S.