In our second Partner Interview, we had the opportunity to discuss the VOXReality project with Konstantia Zarkogianni, Associate Professor of Human-Centered AI at Maastricht University. As the scientific coordinator of VOXReality, Maastricht University plays a crucial role in the development and integration of neural machine translation and automatic speech recognition technologies. Konstantia shares her insights into how Natural Language Processing (NLP), Computer Vision (CV), and Artificial Intelligence (AI) are driving the future of Extended Reality (XR) by enabling more immersive and intuitive interactions within virtual environments. She also discusses the technical challenges the project aims to overcome, particularly in aligning language with visual understanding, and emphasizes the importance of balancing innovation with ethical considerations. Looking ahead, Konstantia highlights the project’s approach to scalability, ensuring that these cutting-edge models are optimized for next-generation XR applications.
What is your specific role within the VOXReality Project?
UM is the scientific coordinator of the project and is responsible for implementing the neural machine translation and automatic speech recognition components. My role in the consortium is to monitor and supervise UM’s activities while contributing my expertise in the ethics of AI and supporting the execution of the pilots and the open calls.
How do you perceive the role of Natural Language Processing (NLP), Computer Vision (CV), and Artificial Intelligence (AI) in shaping the future of Extended Reality (XR) as part of the VOXReality initiative?
VOXReality’s technological advancements in the fields of Natural Language Processing, Computer Vision, and Artificial Intelligence pave the way for future XR applications capable of offering high-level assistance and control. Language enhanced by visual understanding is VOXReality’s main medium of communication, implemented through the combined use of NLP, CV, and AI. This seamless fusion of linguistic expression and visual comprehension enables immersive communication and collaboration, revolutionizing the way humans interact with virtual environments.
What specific technical challenges is the project aiming to overcome in developing AI models that seamlessly integrate language and visual understanding?
Within the framework of the project, innovative cross-modal and multi-modal methods for integrating language and visual understanding will be developed. Cross-modal representation learning will be applied to capture both linguistic and visual information by encoding the semantic meaning of words and images in a cohesive manner. The generated word embeddings will be aligned with the visual features to ensure that the model can associate relevant linguistic concepts with the corresponding visual elements. Multi-modal analysis involves the development of attention mechanisms that equip the model with the ability to focus on the most important and relevant parts of both modalities.
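To make the idea of aligning word embeddings with visual features more concrete, the minimal sketch below projects text tokens and image-region features into a shared embedding space and lets the text attend over the visual regions via cross-attention. It is only an illustration of the general technique, not the project’s actual models: the class name, feature dimensions, and toy inputs are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class CrossModalAligner(nn.Module):
    """Illustrative cross-modal alignment: project word and image features into a
    shared space, then let each text token attend over the image regions."""

    def __init__(self, text_dim=768, image_dim=1024, shared_dim=512, num_heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, shared_dim)    # word embeddings -> shared space
        self.image_proj = nn.Linear(image_dim, shared_dim)  # visual features -> shared space
        # cross-attention: text tokens as queries, image regions as keys/values
        self.cross_attn = nn.MultiheadAttention(shared_dim, num_heads, batch_first=True)

    def forward(self, text_feats, image_feats):
        q = self.text_proj(text_feats)     # (batch, n_tokens, shared_dim)
        kv = self.image_proj(image_feats)  # (batch, n_regions, shared_dim)
        fused, attn_weights = self.cross_attn(q, kv, kv)
        # attn_weights indicate which visual regions each word focuses on
        return fused, attn_weights

# toy usage: 4 word tokens attending over 9 image regions
text = torch.randn(1, 4, 768)
image = torch.randn(1, 9, 1024)
fused, weights = CrossModalAligner()(text, image)
print(fused.shape, weights.shape)  # torch.Size([1, 4, 512]) torch.Size([1, 4, 9])
```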
How does the project balance technical innovation with ethical considerations in the development and deployment of XR applications?
VOXReality foresees the implementation of three use cases: (i) a digital agent assisting the training of personnel in machine assembly, (ii) virtual conferencing offering a shared virtual environment that allows navigation and chatting among attendees speaking different languages, and (iii) theatre incorporating language translation and visual effects. Particular focus has been placed on the ethical aspects of the implemented XR applications. Prior to initiating the pilots, the consortium identified specific ethical risks (e.g. misleading language translations), prepared the relevant informed consent forms, and drafted a pilot study protocol ensuring safety and security. Ethical approval to perform the pilots has been received from UM’s ethical review committee.
Given the rapid evolution of XR technologies, how is VOXReality addressing challenges related to scalability and ensuring optimal performance in next-generation XR applications?
The VOXReality technological advancements in visual language models, automatic speech recognition, and neural machine translation are designed for scalability and are provided to support next-generation XR applications. With the goal of delivering these models as optimized, plug-and-play components, modern data-driven techniques are applied to reduce the models’ inference time and storage requirements. To this end, a variety of techniques are being investigated to transform unoptimized PyTorch models into hardware-optimized ONNX ones. In addition to the VOXReality pilot studies that implement the three use cases, new XR applications will also be developed and evaluated within the framework of the VOXReality open calls. These new XR applications will be thoroughly assessed in terms of effectiveness, efficiency, and user acceptance.
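As a rough illustration of the PyTorch-to-ONNX step mentioned above, the sketch below exports a model with `torch.onnx.export` so that ONNX-compatible runtimes can apply hardware-specific optimizations. The stand-in architecture, file name, and input shape are assumptions for the example; they do not reflect the actual VOXReality models or pipeline.

```python
import torch

# Stand-in for a trained but unoptimized PyTorch module (e.g. a translation or
# speech-recognition component); the real VOXReality models are far larger.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 256),
).eval()

dummy_input = torch.randn(1, 256)  # example input matching the model's expected shape

# Export to ONNX so runtimes such as ONNX Runtime can optimize inference time
# and storage for the target hardware.
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",                  # hypothetical output path
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},  # variable batch size
    opset_version=17,
)
```

The exported file can then be loaded into an ONNX-compatible runtime (for instance, `onnxruntime.InferenceSession("model.onnx")`) and served as a plug-and-play component, which is the delivery form the answer above describes.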
Konstantia Zarkogianni
Associate Professor of Human-Centered AI, Maastricht University, MEng, MSc, PhD