Partner Journey #1 with Visual Computing Lab (VCL)@CERTH/ITI

In this Partner Journey spotlight, we speak with Petros Drakoulis from the Visual Computing Lab (VCL) at CERTH/ITI, one of the key research partners in VOXReality. With strong expertise in computer vision and AI, the VCL team has helped develop models that understand space and language together, making interactions in XR more natural and intuitive. Petros shares how the team approached model optimization, spatial reasoning, and long-term flexibility to support real-world XR applications.

CERTH has been instrumental in exploring the intersection of CV and NLP within VOXReality. How has your team’s expertise shaped the project’s approach to integrating these technologies?

Hello Ana Rita! As we like to say, CERTH-ITI, and especially the Visual Computing Lab, has a well-known pedigree in Computer Vision and graphics-related AI. The difference with VOXReality was that we had to synergize with other partners who are also experts in their fields, such as the University of Maastricht, at a much deeper technical level than we usually do, working practically on the same models and agents to attack the multi-modal challenge we were tasked with. It was a great journey for both sides, I believe.

Your models include innovative spatial reasoning capabilities. Can you explain how this “spatial thinking” differentiates your approach and what benefits it brings to XR applications?

Good question. Let’s see… eXtended Reality is all about interacting in space; Virtual or Augmented, it doesn’t essentially matter. Humans, living in 3D, naturally think spatially. AI neural networks, on the other hand, which reside in memory and are built on mathematical abstractions, were until recently modeled “flat”, in the sense that any of their “spatially correct” outcomes emerged only from apparent connections with 2D structures that acted as “vessels” for underlying, but mostly obscure, 3D information. In VOXReality we explored modalities and architectural arrangements that genuinely incorporate the third dimension. If a model thinks in 3D, it is less likely to make mistakes in interactions that require a true understanding of the surrounding world; and that is what we tried to bring to the project.

What practical challenges did you encounter when training and deploying large-scale vision-language models, and how did the project address concerns related to efficiency and sustainability?

Training Large Language Models can be truly overwhelming… especially three years ago, when the project started. For all their power, transformers (the state-of-the-art underlying building blocks) really take up space! The marvelous models we mostly rely on now were trained on huge amounts of data, processed for a correspondingly long time, to reach the level of performance we are now accustomed to. For us that was a problem, because firstly we wanted to develop for edge devices, and secondly we do not have the capacity to really experiment on big hardware. So, from the proposal-writing stage we had already identified this upcoming issue and turned it into a project asset by creating a task specifically dedicated to developing methods for model optimization. I think we succeeded, in the sense that we developed a novel method for generic post-training model compression that works, and it is currently under peer review for publication.
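To give a flavour of what generic post-training compression involves, here is a minimal sketch using off-the-shelf PyTorch dynamic quantization. It is purely illustrative of the general idea (shrinking a trained model's weights without retraining) and is not the novel VOXReality method Petros mentions, which is still under peer review; the two-layer toy model is a stand-in for a trained transformer block.

```python
# Illustrative only: standard PyTorch post-training dynamic quantization,
# NOT the VOXReality compression method (which is under peer review).
import os
import torch
import torch.nn as nn

# A toy stand-in for a trained transformer feed-forward block.
model = nn.Sequential(
    nn.Linear(768, 3072),
    nn.ReLU(),
    nn.Linear(3072, 768),
)

# Convert the weights of all nn.Linear layers from float32 to int8 after
# training; no gradient updates or calibration data are needed.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def size_mb(m: nn.Module) -> float:
    """Rough on-disk footprint of a model's parameters, in megabytes."""
    torch.save(m.state_dict(), "tmp.pt")
    mb = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return mb

print(f"original:  {size_mb(model):.1f} MB")
print(f"quantized: {size_mb(quantized):.1f} MB")  # roughly 4x smaller
```

Weight-only int8 conversion of this kind typically cuts memory by around 4x with modest accuracy loss, which is why it is a common baseline against which newer compression methods are compared.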

Looking at user interaction in XR, how do you see your models enabling more natural and intuitive communication, and what might this mean for the future of human-computer interaction?

Ah, for sure, the future of human-computer interaction is voice-driven, with the underlying models having access to multiple modalities for drawing context, including 3D, while being efficient, fast, secure, explainable and modular enough for edge deployment in potentially time-critical applications, like assisted driving. Our multi-modal, spatially aware models, enhanced by our optimization methods, could be considered a solid building block for future research endeavors.

With the rapid evolution of XR hardware and platforms, how is CERTH ensuring that the solutions developed remain flexible and scalable across different devices and ecosystems?

From the developer’s perspective, which is the one we are best placed to comment on, our effort is to utilize and build on tools, frameworks and data formats that are wisely chosen to stand the test of time, perhaps even an industry “revolution”. Of course, no one has a fortune-teller’s magic 8-ball, but we are quite confident that the use of highly popular solutions and layers of abstraction, like Docker, Hugging Face, Python and PyTorch, safeguards and practically guarantees the scalability, adaptability and, if desired, future extension of what we have built.
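As a concrete illustration of the kind of abstraction layers Petros refers to, the sketch below loads a speech-recognition model through the Hugging Face transformers pipeline on top of PyTorch; the same few lines run unchanged on a laptop CPU, a GPU server, or inside a Docker container. The checkpoint name and audio file here are placeholders, not VOXReality deliverables.

```python
# Minimal sketch of hardware-agnostic model loading via Hugging Face + PyTorch.
# The checkpoint and audio file below are placeholders, not VOXReality assets.
import torch
from transformers import pipeline

# Pick whatever accelerator is available; the rest of the code is unchanged.
device = 0 if torch.cuda.is_available() else -1

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",   # placeholder checkpoint
    device=device,
)

result = asr("sample.wav")          # placeholder audio path
print(result["text"])
```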

Picture of Petros Drakoulis

Petros Drakoulis

Research Associate, Project Manager & Software Developer at Visual Computing Lab (VCL)@CERTH/ITI
