In this interview, we sit down with the project’s Scientific Coordinator at Maastricht University to discuss how the consortium navigated the “turbo-fast” pace of AI development, the challenges of deploying models in noisy environments, and the ethical frameworks that ensured responsible innovation. From publishing 16 scientific papers to releasing dozens of open-access models, VOXReality provides a blueprint for the future of immersive, intelligent technology.
As the scientific coordinator of VOXReality, how would you summarize the project’s main scientific achievements, particularly in the areas of NLP, CV, and AI integration?
At its core, the project had two main goals: advancing the natural language processing literature and making sure these technologies actually work inside real XR applications. To do that, we not only worked on fundamental research but also developed applications to test it in real-world cases. We published 16 scientific papers introducing new methods and models. We also made our work available to the public by releasing 25 models and 6 datasets shaped by our scientific outcomes and the needs of XR use cases. On top of that, we developed three optimization tools to make these models light and efficient enough to run in XR environments. From the end-users’ perspective, what really matters is how this work performs outside the lab. We demonstrated the capabilities of our models through three applications developed within the project and five more built by our external partners, each combining our models in different ways. These eight applications were tested with real end users in pilot studies, and the results gave us solid scientific evidence that the project’s achievements are not just academically sound, but also practical and valuable in real-world XR scenarios.
Did Maastricht University’s work on neural machine translation and automatic speech recognition evolve throughout the project? Were there any key breakthroughs or lessons learned?
For sure. It would be strange if it hadn’t. The project started in October 2022; two months later ChatGPT was launched, and since then a new model has been released practically every week. While ASR and MT were already robust with the models typically released, we had to keep up, exploring how we could update our models and incorporate improvements such as handling context in MT and increasing robustness in ASR. At the same time, in some cases we had to fine-tune models for some of the consortium languages (namely Greek) in order to adapt them to our use cases.
You played a central role in supervising the ethical aspects of AI within the project. How did the consortium approach ethics from both a research and deployment standpoint, and what frameworks or practices were most valuable?
Over the project’s lifetime, UM became the primary partner ensuring responsible data collection and management. We had 3 different use cases with different requirements per case, so the first decision was to submit 3 separate applications, which made things both easier (e.g. common issues could be shared across the 3 use cases) and harder (e.g. we had to handle 3 applications with different locations, etc.). Ethical approval, which is necessary for all studies involving humans, is primarily a reflection of the protocols we use, and our impression is that the whole VOX consortium benefitted from assessing what procedures were necessary for running the 3 successful pilots, how we could be as unobtrusive as possible, and, in the end, how to standardize the pilots so that we obtained valid results.
How did the pilot studies and open calls contribute to validating the developed technologies, and what insights did you gain regarding their scalability and future adoption?
Something that became very clear from both the pilot studies and the open calls was that it is difficult to disentangle a model from its deployed version. For example, while the ASR and MT models worked perfectly in our “lab/research” setup, their efficient deployment (i.e. dealing with multiple users at the same time, multiple and different types of devices, and noisy real-world conditions) proved challenging, and it was sometimes not easy to pinpoint where an error came from. If anything, this error chasing made our models more robust to deploy on edge devices, and as such more valuable for future applications. The same goes for the open calls: we were more than happy to see how the models could be extended both to new languages we did not specifically test (e.g. Latvian) and to other use cases we had not accounted for.
Looking ahead, how do you envision NLP, CV, and AI advancing XR even further beyond the VOXReality scope? What directions or applications are you most excited about?
Both fields are going to keep moving at a “turbo-fast” pace (perhaps AI a little faster, given the huge interest). In VOXReality, as mentioned earlier, we developed 25 models and 6 datasets, which are available for further experimentation. In the current climate, where the trend seems to be toward “closing” models so that each business can exploit its in-house technology, our resources can play a big role by providing open access to researchers and small businesses. As XR applications become democratized (both in terms of cost and wider adoption), it is only natural to see more integration with AI. We also already have ideas and new directions for our consortium, such as education and collaborative virtual spaces (e.g. immersive language environments with real-time ASR and feedback), cultural heritage (e.g. digital museums), and accessible XR environments for people with disabilities.

Jerry Spanakis
Assistant Professor at Maastricht University
&

Yusuf Can Semerci
Assistant Professor at Maastricht University

