D3.2 Advanced AI multi-model for XR analysis V2

This document corresponds to deliverables D3.1 and D3.2 – Advanced AI multi-model for XR analysis, of work package 3 (WP3), and describes the work done in the first 30 months of the project on the natural language processing (NLP) models.

In VOXReality, natural language processing models are developed for the following tasks:

    1. Automatic speech recognition (ASR),
    2. Machine translation (MT),
    3. Speech translation (ST),
    4. Image captioning (IC),
    5. Video captioning (VC),
    6. Visual navigation (VN),
    7. Visual question answering (VQA),
    8. Conversation agents (CA) for navigation and training assistance.

The document is divided into four chapters:

  • Chapter 1 provides a brief introduction to the WP3 tasks and to this deliverable.
  • Chapter 2 describes the background knowledge required to understand the VOXReality NLP models and the related work from the literature for the tasks listed above.
  • Chapter 3 presents the models developed for the tasks listed above. Section 3.1 describes the work performed on automatic speech recognition, speech translation, and streaming speech recognition. Section 3.2 describes the context-aware machine translation, simultaneous machine translation, robust machine translation, and script alignment models and their performance on benchmark datasets. Section 3.3 describes the models implemented for image and video captioning, visual question answering, and visual navigation, and Section 3.4 presents the work performed on developing the navigation assistant and training assistant conversation agents.
  • Chapter 4 summarizes the conclusions drawn from this deliverable and from the work done during the first 30 months of the project.

The changes from the first version to this final version are as follows: the Executive Summary, Chapter 1, Section 3.2.7, Section 3.3, Chapter 4, and all appendices have been updated, and Section 2.5.4, Section 3.1.2, Section 3.1.3, Section 3.2.2, Section 3.2.4, pages 100–113 of Section 3.4.1, and pages 124–134 of Section 3.4.2 have been added.

  • Deliverable lead: UM
  • Authors: Yusuf Can Semerci, Pawel Maka, Abderrahmane Issam, Gerasimos Spanakis (UM), Georgios Papadopoulos, Athanasios Ntovas, Stefanos Biliousis, Sotiris Karavarsamis, Petros Drakoulis, Alexandros Doumanoglou, Konstantinos Konstantoudakis, Dimitris Zarpalas (CERTH), Apostolos Maniatis, Stavroula Bourou (SYN), Jiahuan Pei, Irene Viola, Pablo Cesar (NWO-I)
  • Reviewers: Olga Chatzifoti (MAG), Carina Pamminger, Leesa Joyce (HOLO)
  • Keywords: Natural Language Processing, Automatic Speech Recognition, Neural Machine Translation, Visual Language Models, Conversation Agents
  • License: This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License (CC BY-ND 4.0). See: https://creativecommons.org/licenses/by-nd/4.0/
