In this third installment of our Partner Interview series, we had the pleasure of speaking with Petros Drakoulis, Research Associate, Project Manager, and Software Developer at the Visual Computing Lab (VCL)@CERTH/ITI, about their critical role in the VOXReality project. As a founding member of the project team, CERTH brings its deep expertise in Computer Vision to the forefront, working at the intersection of Vision and Language Modeling. Petros shares how their innovative models are adding a “magical” visual context to XR experiences, enabling applications to understand and interact with their surroundings in unprecedented ways. He also provides insights into the future of XR, where these models will transform how users engage with technology through natural, conversational interactions. Petros highlights the challenges of adapting models to diverse XR scenarios and ensuring seamless cross-platform compatibility, underscoring CERTH’s commitment to pushing the boundaries of immersive technology.
What is your specific role within the VOXReality Project?
CERTH has been a key contributor to the project since its conception, having been among the founding members of the proposal team. As one of the leading research institutes in Europe, our involvement centres on conducting research and providing technology to the team. In this project specifically, we saw a chance we wouldn't miss: to delve into the “brave new world” of Vision and Language Modeling, a relatively new field that lies at the intersection of Computer Vision, which is our lab’s expertise, and Natural Language Processing, a field flourishing rapidly thanks to the developments in Large Language Models and Generative AI (have you heard of ChatGPT? 😊). Additionally, we work on how to train and deploy all these models efficiently, an aspect that is extremely important given the sheer size of the current model generation and the need for a green transition.
Could you share a bit about the models you're working on for VOXReality? What makes them magical in adding visual context to the experiences?
You put it nicely! Indeed, they enable interaction with the surrounding environment in a way that, some years ago, would have seemed magical. The models take an image or a short video as input (i.e. the scene as seen by the user), and optionally a question about it, and produce a very human-like description of the scene or an answer to the question. This output can then be propagated to the other components of the VOXReality pipeline as “visual context”, endowing them with the ability to function knowing where they are and what is around them, effectively elevating their level of awareness. On that note, what is novel about our approach is the introduction of inherent spatial reasoning, built deep into the models, enabling them to fundamentally “think” spatially.
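To make the idea concrete, here is a minimal, purely illustrative sketch of the kind of captioning and visual question answering inference Petros describes, using off-the-shelf open models from the Hugging Face `transformers` library as stand-ins; the file name and question are hypothetical, and this is not the actual VOXReality model code.

```python
# Illustrative only: off-the-shelf vision-language models standing in for
# the VOXReality visual-context models described in the interview.
from transformers import pipeline
from PIL import Image

image = Image.open("user_view.jpg")  # hypothetical frame as seen by the XR user

# Scene description (captioning): "what is around the user"
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
caption = captioner(image)[0]["generated_text"]

# Optional question about the scene (visual question answering)
vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")
answer = vqa(image=image, question="How many people are in the room?")[0]["answer"]

# The textual outputs become the "visual context" handed to downstream
# components of the pipeline (e.g. a dialogue system).
visual_context = {"caption": caption, "answer": answer}
print(visual_context)
```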
Imagine we're using VOXReality applications in the future – how would your models make the XR experience better? Can you give us a glimpse of the exciting things we could see?
The possibilities are almost limitless and, as experience has shown, creators rarely grasp the full potential of their creations. The community has an almost “mysterious” way of stretching whatever is available to its limits, given enough visibility (thank you F6S!). Having said that, we envision a boom in end-user XR applications integrating Large Language and Vision models, enabling users to interact with the applications in a more natural way, primarily using their voice in a conversational manner together with body language. We cannot, of course, predict how long this transition might take or to what extent conventional Human-Computer Interaction interfaces, like keyboards, mice and touchscreens, will be deprecated, but the trend is obvious nevertheless.
In the world of XR, things can get pretty diverse. How do your models adapt to different situations and make sure they're always giving the right visual context?
It is true that, in “pure” Vision-Language terms, a picture is worth a thousand words, some of which may be wrong 🤣! In all seriousness, any Machine Learning model is only as good as the data it was trained on. The latest generation of AI models is undoubtedly exceptional, but largely because they learn from massive amounts of data. The standard practice today is to reuse pretrained models developed for another, sometimes generic, task and finetune them for the intended use-case, never letting them “forget” the knowledge they acquired from previous uses. In that sense, in VOXReality we seek to utilize models pretrained and then finetuned on a variety of tasks and data, which are innately capable of handling diverse input.
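As a rough sketch of the “pretrain then finetune” recipe mentioned here (a generic example, not VOXReality code), one common way to keep a model from “forgetting” its pretrained knowledge is to freeze the pretrained backbone and adapt only a small task-specific head; the model name and label count below are arbitrary placeholders.

```python
# Generic illustration of finetuning a pretrained model without
# overwriting the knowledge in its backbone.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder for any pretrained backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Freeze the pretrained encoder so it does not "forget" what it learned;
# only the newly added classification head is updated during finetuning.
for param in model.base_model.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```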
In the future XR landscape, where cross-platform experiences are becoming increasingly important, how is VOXReality planning to ensure compatibility and seamless interaction across different XR devices and platforms?
Indeed, the rapid increase in edge-device capabilities we observe today is altering the notion of where the application logic should reside. Models and code should therefore be able to operate and perform well on a variety of hardware and software platforms. VOXReality’s contribution in this direction is two-fold. On one hand, we are developing an optimization framework that allows developers to fit initially large models to various deployment constraints. On the other hand, we place emphasis on using as many platform-independent solutions as possible at all stages of our development. Examples include the use of a RESTful-API-based model inference scheme, the release of all models as container images, and the ability to export them into various cross-platform binary representations such as ONNX.
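For illustration, the two platform-independence measures mentioned above might look roughly like the following; the endpoint URL, payload shape, and stand-in model are assumptions for this sketch, not the actual VOXReality API or models.

```python
# (1) RESTful inference against a containerized model service, and
# (2) export to a cross-platform ONNX binary. Hypothetical details throughout.
import requests
import torch
import torchvision

# (1) Query a model served behind a REST endpoint (URL and fields are assumed).
response = requests.post(
    "http://localhost:8000/describe",          # hypothetical endpoint
    files={"image": open("user_view.jpg", "rb")},
    data={"question": "What is on the table?"},
    timeout=30,
)
print(response.json())

# (2) Export a PyTorch model to ONNX so it can run on many platforms.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()  # stand-in model
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["image"], output_names=["features"],
                  opset_version=17)
```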
Petros Drakoulis
Research Associate, Project Manager & Software Developer at the Visual Computing Lab (VCL)@CERTH/ITI