
Partner Interview #3 with Visual Computing Lab (VCL)@CERTH/ITI

In this third installment of our Partner Interview series, we had the pleasure of speaking with Petros Drakoulis, Research Associate, Project Manager, and Software Developer at the Visual Computing Lab (VCL)@CERTH/ITI, about their critical role in the VOXReality project. As a founding member of the project team, CERTH brings its deep expertise in Computer Vision to the forefront, working at the intersection of Vision and Language Modeling. Petros shares how their innovative models are adding a “magical” visual context to XR experiences, enabling applications to understand and interact with their surroundings in unprecedented ways. He also provides insights into the future of XR, where these models will transform how users engage with technology through natural, conversational interactions. Petros highlights the challenges of adapting models to diverse XR scenarios and ensuring seamless cross-platform compatibility, underscoring CERTH’s commitment to pushing the boundaries of immersive technology.

What is your specific role within the VOXReality Project?

CERTH has been a key contributor to the project since its conception, having been among the founding members of the proposal team. As one of the leading research institutes in Europe, our involvement centres on conducting research and providing technology to the team. In this project, specifically, we saw a chance we wouldn’t miss: to delve into the “brave new world” of Vision and Language Modeling, a relatively new field that lies at the intersection of Computer Vision, which is our lab’s expertise, and Natural Language Processing, a field flourishing with the developments in Large Language Models and Generative AI (have you heard of ChatGPT? 😊). Additionally, we work on how to train and deploy all these models efficiently, an aspect that is extremely important given the sheer size of the current model generation and the necessity for a green transition. 

Could you share a bit about the models you're working on for VOXReality? What makes them magical in adding visual context to the experiences?

You put it nicely! Indeed, they enable interaction with the surrounding environment in a way that would have seemed magical just a few years ago. The models take an image or a short video as input (i.e. the scene as seen by the user), and optionally a question about it, and provide a very human-like description of the scene or an answer to the question. This output can then be propagated to the other components of the VOXReality pipeline as “visual context”, endowing them with the ability to function knowing where they are and what is around them, effectively elevating their level of awareness. Speaking of the latter, what is novel about our approach is the introduction of inherent spatial reasoning, built deep into the models, enabling them to fundamentally “think” spatially. 
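To make this input–output behaviour concrete, here is a minimal sketch of visual question answering with an off-the-shelf vision-language model from Hugging Face. It illustrates the kind of interface described above, not the actual VOXReality models; the image file and question are placeholders.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Off-the-shelf VQA model used purely for illustration (not the VOXReality models)
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("user_view.jpg").convert("RGB")   # frame as seen from the user
question = "What is on the table in front of me?"

inputs = processor(image, question, return_tensors="pt")
output_ids = model.generate(**inputs)
answer = processor.decode(output_ids[0], skip_special_tokens=True)
print(answer)  # a short, human-like answer that could be passed on as "visual context"
```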

Imagine we're using VOXReality applications in the future – how would your models make the XR experience better? Can you give us a glimpse of the exciting things we could see?

The possibilities are almost limitless and, as experience has shown, creators rarely grasp the full potential of their creations. The community has an almost “mysterious” way of stretching whatever is available to its limits, given enough visibility (thank you F6S!). Having said that, we envision a boom in end-user XR applications integrating Large Language and Vision models, enabling users to interact with applications in a more natural way, primarily using their voice in a conversational manner together with body language. We cannot, of course, predict how long this transition might take or to what extent conventional Human-Computer Interaction interfaces, like keyboards, mice and touchscreens, will be deprecated, but the trend is nevertheless obvious. 

In the world of XR, things can get pretty diverse. How do your models adapt to different situations and make sure they're always giving the right visual context?

It is true that, in pure Vision-Language terms, a picture is worth a thousand words, though some of those words may be wrong 🤣 In all seriousness, any Machine Learning model is only as good as the data it was trained on. The latest generation of AI models is undoubtedly exceptional, but largely because it learns from massive amounts of data. The standard practice today is to reuse pretrained models developed for another, sometimes generic, task and finetune them for the intended use case, never letting them “forget” the knowledge they acquired from previous uses. In that sense, in VOXReality we seek to utilize models pretrained and then finetuned on a variety of tasks and data, which are inherently capable of handling diverse input. 
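As a rough illustration of the pretrain-then-finetune practice described above (a generic backbone is reused and only a new task head is trained), here is a minimal PyTorch sketch; the model, class count and hyperparameters are placeholders, not the VOXReality setup.

```python
import torch
from torchvision import models

# Reuse a model pretrained on a generic task (ImageNet classification here),
# purely to illustrate the finetuning workflow described above.
model = models.resnet50(weights="IMAGENET1K_V2")

# Freeze the pretrained backbone so the previously acquired knowledge is kept
for param in model.parameters():
    param.requires_grad = False

# Replace and train only a task-specific head for the new use case
num_target_classes = 10  # placeholder for the downstream task
model.fc = torch.nn.Linear(model.fc.in_features, num_target_classes)
optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-4)
```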

In the future XR landscape, where cross-platform experiences are becoming increasingly important, how is VOXReality planning to ensure compatibility and seamless interaction across different XR devices and platforms?

Indeed, the rapid increase in edge-device capabilities we observe today is changing the notion of where the application logic should reside. Thus, models and code should be able to operate and perform on a variety of hardware and software platforms. VOXReality’s contribution in this direction is twofold. On the one hand, we are developing an optimization framework that allows developers to fit initially large models to various deployment constraints. On the other hand, we place emphasis on using as many platform-independent solutions as possible at all stages of our development. Some examples include the use of a RESTful-API-based model inference scheme, the release of all models in container-image form, and the ability to export them into cross-platform binary representations such as ONNX. 
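As an illustration of what a RESTful inference scheme looks like from the client side, here is a small Python sketch; the endpoint URL, path and field names are hypothetical and not the actual VOXReality API.

```python
import requests

# Hypothetical visual-context endpoint; URL, path and field names are illustrative.
with open("scene.jpg", "rb") as f:
    response = requests.post(
        "http://localhost:8000/v1/visual-context",
        files={"image": f},
        data={"question": "What objects are on the workbench?"},
        timeout=30,
    )
response.raise_for_status()
print(response.json())  # e.g. {"caption": "...", "answer": "..."}
```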


Petros Drakoulis

Research Associate, Project Manager & Software Developer at Visual Computing Lab (VCL)@CERTH/ITI


Partner Interview #2 with MAASTRICHT UNIVERSITY

In our second Partner Interview, we had the opportunity to discuss the VOXReality project with Konstantia Zarkogianni, Associate Professor of Human-Centered AI at Maastricht University. As the scientific coordinator of VOXReality, Maastricht University plays a crucial role in the development and integration of neural machine translation and automatic speech recognition technologies. Konstantia shares her insights into how Natural Language Processing (NLP), Computer Vision (CV), and Artificial Intelligence (AI) are driving the future of Extended Reality (XR) by enabling more immersive and intuitive interactions within virtual environments. She also discusses the technical challenges the project aims to overcome, particularly in aligning language with visual understanding, and emphasizes the importance of balancing innovation with ethical considerations. Looking ahead, Konstantia highlights the project’s approach to scalability, ensuring that these cutting-edge models are optimized for next-generation XR applications.

What is your specific role within the VOXReality Project?

UM is the scientific coordinator of the project and is responsible for implementing the neural machine translation and the automatic speech recognition. My role in the consortium is to monitor and supervise UM’s activities, while also contributing my expertise on the ethical aspects of AI and supporting the execution of the pilots and the open calls.

How do you perceive the role of Natural Language Processing (NLP), Computer Vision (CV), and Artificial Intelligence (AI) in shaping the future of Extended Reality (XR) as part of the VOXReality initiative?

VOXReality’s technological advancements in the fields of Natural Language Processing, Computer Vision, and Artificial Intelligence pave the way for future XR applications capable of offering high-level assistance and control. Language enhanced by visual understanding constitutes VOXReality’s main medium of communication, and it is implemented through the combined use of NLP, CV, and AI. The seamless fusion of linguistic expression and visual comprehension offers immersive communication and collaboration, revolutionizing the way humans interact with virtual environments.  

What specific technical challenges is the project aiming to overcome in developing AI models that seamlessly integrate language and visual understanding?

Within the frame of the project, innovative cross-modal and multi-modal methods for integrating language and visual understanding will be developed. Cross-modal representation learning will be applied to capture both linguistic and visual information by encoding the semantic meaning of words and images in a cohesive manner. The generated word embeddings will be aligned with the visual features to ensure that the model can associate relevant linguistic concepts with the corresponding visual elements. Multi-modal analysis involves the development of attention mechanisms that endow the model with the capability to focus on the most important and relevant parts of both modalities.  
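To illustrate the kind of cross-modal attention described above, here is a minimal PyTorch sketch in which text-token embeddings attend over visual features; the dimensions and module layout are illustrative assumptions, not the project’s actual architecture.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Text tokens attend over visual features (illustrative sketch only)."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_emb, visual_feats):
        # Queries come from the language side, keys/values from the visual side,
        # so each word can focus on the image regions most relevant to it.
        attended, _ = self.attn(query=text_emb, key=visual_feats, value=visual_feats)
        return self.norm(text_emb + attended)

# Toy usage: 8 word embeddings attending over 49 image-patch features
text = torch.randn(1, 8, 512)
patches = torch.randn(1, 49, 512)
out = CrossModalAttention()(text, patches)  # shape: (1, 8, 512)
```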

How does the project balance technical innovation with ethical considerations in the development and deployment of XR applications?

VOXReality foresees the implementation of three use cases: (i) a digital agent assisting the training of personnel in machine assembly, (ii) virtual conferencing offering a shared virtual environment that allows navigation and chatting among attendees speaking different languages, and (iii) theatre incorporating language translation and visual effects. Particular focus has been placed on the ethical aspects of the implemented XR applications. Prior to initiating the pilots, the consortium identified specific ethical risks (e.g. misleading language translations), prepared the relevant informed consent forms, and drafted a pilot study protocol ensuring safety and security. Ethical approval to perform the pilots has been received from UM’s ethical review committee.   

Given the rapid evolution of XR technologies, how is VOXReality addressing challenges related to scalability and ensuring optimal performance in next-generation XR applications?

The VOXReality technological advancements in visual language models, automatic speech recognition, and neural machine translation are designed for scalability and are provided to support next-generation XR applications. With the goal of delivering these models in plug-and-play, optimized form, modern data-driven techniques are applied to reduce the models’ inference time and storage requirements. To this end, a variety of techniques are being investigated to transform unoptimized PyTorch models into hardware-optimized ONNX ones. Beyond the VOXReality pilot studies that implement the three use cases, new XR applications will also be developed and evaluated within the frame of the VOXReality open calls. These new XR applications will be thoroughly assessed in terms of effectiveness, efficiency, and user acceptance.   
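For a concrete picture of the PyTorch-to-ONNX step mentioned above, here is a minimal sketch using a small off-the-shelf model; the model, opset version and execution provider are illustrative choices, not the project’s actual pipeline.

```python
import torch
import torchvision
import onnxruntime as ort

# Export an (unoptimized) PyTorch model to ONNX; used purely for illustration.
model = torchvision.models.mobilenet_v3_small(weights="DEFAULT").eval()
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["image"],
    output_names=["logits"],
    dynamic_axes={"image": {0: "batch"}, "logits": {0: "batch"}},
    opset_version=17,
)

# The exported graph can then be run with a hardware-specific execution provider.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"image": dummy_input.numpy()})
```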


Konstantia Zarkogianni

Associate Professor of Human-Centered AI, Maastricht University, MEng, MSc, PhD


Partner Interview #1 with ADAPTIT S.A.

In our first Partner Interview, Spyros Polychronopoulos from ADAPTIT S.A. discusses their role in developing the AR Theatre application for the VOXReality project. As XR technology experts, ADAPTIT has been deeply involved in the design and deployment process, ensuring that the technology aligns with live theatre needs. They’ve focused on user-friendly interfaces, seamless integration with theatre systems, and secure data protocols to protect intellectual property. Spyros also highlights strategies for future-proofing the application, including modular design and cross-platform development, with plans to adapt to emerging XR technologies and broaden access to theatre through affordable AR devices.

What is your specific role within the VOXReality Project?

Our organization, in our capacity as XR technology experts, has undertaken the design, development and deployment of the AR Theatre application. We have been engaged in the design process from the very beginning, e.g. in discussing, interpreting and contextualizing the user requirements. Throughout the process, we have been in close contact with the theatrical partner and use case owner, offering technological knowledge transfer to their artistic and management team. This operational framework has proven critical in facilitating team-based decision-making during design, and thus in keeping in view the needs of both the XR technology systems and the theatrical ecosystem. 

To facilitate our communication in an interdisciplinary team and to consolidate our mutual understanding, we have taken the lead in creating dedicated applications as deemed necessary. 

Firstly, to render the VOXReality capabilities in tangible, everyday terms, we created an easily distributable mobile application which demonstrates the VOXReality models one by one in a highly controlled environment. This application can also function as a dissemination vehicle for the VOXReality project goals. We then developed a non-VOXReality-related AR application to practically showcase the XR device capabilities to the theatrical partner, and more specifically to the team’s theatrical and art director, with a focus on the device’s audiovisual capabilities.

Furthermore, we combined the two previous projects in a new AR-empowered application to better contextualize the VOXReality services to a general audience unfamiliar with AR. Since that milestone, we have been developing iterations of the theatrical application itself with increasing levels of complexity. Our first iteration was an independent application running on the XR device which simulates the theatrical play and user experience. It was produced as a standalone application for increased mobility and ease of testing, and was used extensively for documenting footage and experientially evaluating design alternatives. The second iteration is a client-server system which allows multiple XR applications to operate in sync with each other. This was built for simulated testing in near-deployment conditions during development and was targeted at evaluating the more technical aspects of the system, like performance and stability. The third and last iteration will incorporate all the physical theatrical elements, specifically the actors and the stage, and will involve the introduction of new technology modules with their own challenges.  

In summary, this has been a creative and challenging journey so far, with tangible and verifiable indicators for our performance throughout, and with attention to reusability and multifunctionality of the developed modules to reinforce our future development tasks. 

As for my personal involvement, this has been a notably auspicious coincidence, since I myself am active in theatrical productions as a music producer and devoted to investigating the juncture of music creation and AI. 

What considerations went into selecting the technology stack for the theatre use case within VOXReality, and how does it align with the specific requirements of live theatrical performances?

Given the public nature of the theatrical use case, the user-facing aspects of the system, specifically the XR hardware and the XR application user interface, were an important consideration.  

In terms of hardware, the form factor of the AR device was treated as a critical parameter. AR glasses are still a developing product, with a limited range of devices that could support our needs. We opted for the most lightweight available option with a glasses-like form to achieve improved comfort and acceptability. This option had the tradeoff of being cabled to a separate computing unit, which was of little concern to us given the seated and static arrangement in the theatre. In more practical terms, since the application should operate with minimal disturbance in terms of head and hand movement, in silence and in low-light conditions, we decided that any input to the application should be made using a dedicated controller and not hand tracking or voice commands. 

In terms of user interface design, we selected a persona with minimal or no XR familiarity, and that defined our approach in two ways: 1) we chose the simplest possible user input methods on the controller and implemented user guidance with visual cues and overlays. We added a visual highlight to the currently available button(s) at any given point, and in the next iteration we will expand on this concept with a text prompt describing the functionality of each button, triggered by user gaze tracking. 2) we tried to find the balance between providing user control, which allows for customization/personalization and thus improved comfort, and limiting control, which safeguards the application’s stability and removes cognitive strain and decision-making from the user. This was addressed through multiple design, testing and feedback iterations. 

How does the technical development ensure seamless integration with existing theatre systems, such as lighting, sound, and stage management, to create a cohesive and synchronized production environment?

As in most cases of innovative merging of technologies, adaptations from both sides of the domain spectrum need to be made for a seamless merger. One problematic area involves the needs of the spatial mapping and tracking system used by XR technology. Current best practices for its stable operation dictate conditions that typically do not match a theatrical setup: it requires well-lit conditions that remain stable throughout the experience, performs best in small to medium-sized areas, and needs surfaces with clear and distinct traits while avoiding specific textures. Failure of the spatial mapping and tracking system can lead to misplaced 3D content which no longer matches the scenography of the stage and thus breaks immersion and the suspension of disbelief for the user. In some cases, failure may also lead to non-detection or inaccurate detection of the XR device controller(s), thus impeding user input. 

To address this, recommendations for the stage’s scenography can be provided by the technical team to the artistic team. Examples are to avoid surfaces that are reflective, transparent, or uniform in color (especially black), or surfaces with strong repeating patterns. Recommendations can also address non-tangible theatrical elements, like the lighting setup. Best practices advise avoiding strong lighting that produces intense shadows, as well as areas in total or near-total darkness. 

Furthermore, there are spatial tracking support systems that a director may choose to integrate in experimental, narrative or artistic ways. One example is the incorporation of black-and-white markers (QR, ArUco, etc.) as scenography elements, which have the practical function of supporting the accuracy of the XR tracking system or extending its capabilities (e.g. tracking moving objects).  
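As a small illustration of how such markers are detected in practice, here is a minimal Python/OpenCV sketch. The production application itself runs on the XR device (typically in a game engine), so this is only a conceptual demonstration; the marker dictionary and image source are placeholders.

```python
import cv2

# Minimal ArUco detection sketch (OpenCV >= 4.7 API); dictionary and image are illustrative.
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())

frame = cv2.imread("stage_view.jpg")          # placeholder for a camera frame
corners, ids, _ = detector.detectMarkers(frame)
if ids is not None:
    # Marker corners can anchor virtual content to scenography elements
    # or help track moving props on stage.
    print(f"Detected markers: {ids.flatten().tolist()}")
```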

Going even further, an artistic team may even want to examine a non-typical theatre arrangement which can better match the XR technology needs and lead to innovative productions. One example is the theatre-in-the-round setup, which has a smaller viewing distance between audience and actors and an inherently different approach to scenography (360° view). Other, even more experimental, physical setups can involve audience mobility, like standing or walking around, which can make even more use of the XR capabilities of the medium in innovative ways, such as allowing users to navigate a soundscape with invisible spatial audio sources or to discover visual elements along pre-designed routes or from specific viewing angles. 

In terms of audio input, the merger has fewer parameters. Currently, users listen to the audio feed from the theatre stage’s main speakers and receive no audio from the XR device. Innovative XR theatre design concepts around audio could involve making narrative and artistic use of the XR device speakers. This could, for example, be an audio recording of a thought or internal monologue that, instead of being broadcast from the main stage, plays directly on the XR device speakers, and thus very close to the viewer and at low volume. It could be an audio effect that plays in waves rippling across the audience, or plays with a spatialized effect somewhere in the hall, e.g. among the audience seating. Such effects could also make use of the left-right audio channels, thus giving a stronger sense of directionality to the audio. 

The audio support could also be used in more practical terms. VOXReality currently supports the provision of subtitles in the user’s language of choice. In the future, we could extend this functionality to provide voice-over narration using natural-sounding synthetic speech in the user’s language of choice. This option would better accommodate people who prefer listening over reading for any physiological or neurological reason. This feature would require supplying the XR devices with noise-cancelling headphones, so that users may receive a clear audio feed from their XR devices, remain isolated from the theatrical stage’s main speakers, and not produce audio interference for one another.  

In summary, we are in the fortunate position to not only enact a functional merger of XR technology and the art of theatre as we currently know them, but also to envision, through the capabilities of XR, a redefinition of conventions that have shaped the public’s concept of theatrical experiences for centuries. We would summarize these opening horizons in three broad directions: 1) an amplification of inclusivity by being able to provide customizable, individualized access to a collectively shared experience, 2) an amplification and diversification of the audiovisual landscape in the theatrical domain, and 3) an invigoration of previously niche forms of audience participation in the theatrical happenings, or the invention of totally new ones.  

Given the sensitive nature of theatrical scripts, what security protocols have been implemented to protect against unauthorized access?

Although our use case does not manage personal or sensitive medical data as in domains such as healthcare or defense, we meticulously examined the security of our system in terms of data traffic and data storage, with respect to the intellectual property protection needs of the theatrical content. To cover the needs of the theatre use case, we designed a client-server system, with clients operating on the XR devices of the audience and the server operating on a workstation under the control of the interdisciplinary facilitation team (the developer team and the theatre’s technical team). For context, the core reasons for the client-server design were 1) to centralize the audiovisual input from the scene (microphone and video input) in order to safeguard input media quality, 2) to simultaneously distribute the output to the end-user devices in order to ensure synchronicity across the audience, and 3) to offload the demanding computational needs to a more powerful device in order to avoid battery and overheating issues on the XR devices.  

In terms of data traffic security, the server and the clients are connected to the same local Wi-Fi network, protected by a WPA2 password, and communicate using the WebSocket protocol for frequent and fast communication. The local Wi-Fi network is for the exclusive use of the AR theatre system and accessible only to the aforementioned devices, both as a safeguard against network bandwidth fluctuations, which could negatively affect the latency of the system and in turn the user experience during the performance, and as a security measure against data traffic interception. Furthermore, for the exact same reasons, the AI services also operate locally on the same network and are accessed using RESTful API calls, with the added protection of a secure transport protocol (HTTPS). In summary, the entire traffic is contained in a safe and isolated environment that can only be breached by an unauthorized network access violation. 
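To sketch the communication pattern described above, here is a minimal example of a server broadcasting a synchronized cue (e.g. a subtitle) to all connected XR clients, using a recent version of the Python websockets package. The address, port and message format are hypothetical, and the actual AR theatre system is not necessarily implemented in Python.

```python
import asyncio
import json
import websockets

# Connected XR headsets are kept in a shared set for broadcasting.
CLIENTS = set()

async def handler(websocket):
    """Each connected XR headset registers here for the duration of the show."""
    CLIENTS.add(websocket)
    try:
        await websocket.wait_closed()
    finally:
        CLIENTS.remove(websocket)

def push_subtitle(text: str, language: str = "en") -> None:
    """Send the same subtitle cue to every headset at once, keeping the audience in sync."""
    message = json.dumps({"type": "subtitle", "lang": language, "text": text})
    websockets.broadcast(CLIENTS, message)

async def main():
    # In practice the server binds to the dedicated local Wi-Fi network only.
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()  # run until interrupted; cues arrive from the AI pipeline

asyncio.run(main())
```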

In terms of data storage, it was decided that in the release version of the application no data logs will remain on the XR devices, since safeguarding the data against unauthorized access was not feasible given that the devices are temporarily handed to the public without supervision. Any data stored will reside on the server device, will hold no personalized information in any form, and will be used exclusively for technical purposes, like system monitoring and performance evaluation. 

Considering the rapid evolution of technology, how is the technical development future-proofed to accommodate emerging advancements, and what strategies are in place for seamless upgrades or integrations with future technologies?

In a rapidly changing technological domain like XR and AI, planning for change is an integral part of design and development. For us, this means asking questions in two directions: 1) what the future fate of the current product can be, and 2) what the product can evolve into in the future with minimal effort. Answering these questions is enabled by the fact that we, as XR developers and producers of state-of-the-art XR applications, can create informed scenarios for the foreseeable future.  

One such scenario, based on financial data and trends, is the growth of the XR market, and specifically the AR sector. This is expected to diversify the device range and reduce purchase costs. In turn, this can affect us by enabling the selection of even better-suited AR glasses for theatres, reducing the investment cost of adoption by theatrical establishments, and supporting the popularization of the XR theatre concept in artistic circles. At the same time, theatre-goers, in their role as individual consumers, can be expected to have increasing exposure to and familiarity with this technology in general. Therefore, our assessment for the first question is that we have good reasons to expect our current product to have increasing potential for adoption. 

On the second question, our strategy is to rigorously uphold proper application design principles, with explicit focus on modular, maintainable and expandable design. Operationally, we are adopting a cross-platform development approach to be able to target devices running different operating systems using the same code base. We are prioritizing open frameworks to ensure compatibility with devices that are compliant with industry standards, thus minimizing intensive proprietary SDK use. In terms of system architecture, by separating the AI from the XR elements, we allow for independent development and evolution of each domain at its own speed and in its own direction. By building the connections with well-established methods that are unlikely to change, like RESTful API calls, we ensure that our product is in the best position to adapt even to the potential reworking of entire modules. Furthermore, we adopt a design approach with segmented “levels of XR technology” so as to be able to easily create spin-offs targeting various XR-enabled hardware as it emerges. This does not necessarily imply more powerful devices, but also more popular ones. One current example we are investigating is to single out the subtitle provision feature and target affordable 2D AR glasses (also called HUD glasses, smart glasses or wearable monitors) as a means of increasing theatre accessibility.  


Spyros Polychronopoulos

Researcher on the digital simulation of ancient Greek instruments, and lecturer in music technology and image processing.
