Audiovisual heritage refers to the collection of sound and moving image materials that capture and convey cultural, historical, and social information. It includes cultural products, such as films, radio broadcasts, music recordings, and other forms of multimedia, as well as the instruments, devices and machines used in their production, recording and reproduction, and the analog and digital formats used to store them.
Preservation and accessibility
Preserving audiovisual heritage is crucial because analog formats (like film reels, magnetic tapes, and vinyl records) are vulnerable to physical decay. Digital formats, meanwhile, face the risk of obsolescence as technology evolves. Preservation is only the first step, though, since digitizing and archiving often leave these materials difficult for the public to access. To allow the public to meaningfully engage with and understand the significance of each artifact, it is important to contextualize the artifact within a curated framework.
Reasons and methods to use Extended Reality
This is where Extended Reality (XR) comes in: an umbrella term that encompasses Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR). XR describes a complex branch of emerging technology that allows users to interact with content in immersive ways. XR can isolate users' senses from their physical environment and allow them to experience (e.g. see and listen to) audiovisual heritage artifacts in a virtual space specifically designed for that purpose. This can be seen as a virtual counterpart to how museums thoughtfully design physical displays to best showcase their exhibits. XR also enables creators to craft narratives around artifacts, enhancing their cultural and historical value, a key area where XR shines.
Examples
One such example is the VR work “Notes on Blindness” [1], which allows users to listen to original audio recordings of the writer John Hull describing his journey into blindness. The XR work lets users experience darkness while listening to the recordings and, in addition, visualizes the narrative with a subtle yet decisive aesthetic.
Another example is “Traveling While Black” [2], a VR work documenting racial discrimination against African Americans in the United States. This work uses original audio and film excerpts, including interviews with people who lived through segregation and with their descendants. Viewing this work from an immersive, first-person perspective, in contrast to viewing it on a flat monitor, gives the audience more affordances for critical engagement and self-reflection. Another great example is a VR work belonging to the exhibition ‘The March’ at the DuSable Museum of African American History in Chicago, chronicling the historic events of the 1963 March on Washington [3]. The work contains recordings from Martin Luther King Jr.’s iconic ‘I Have A Dream’ speech.
Limitations
Despite its potential, applied examples do not yet abound, because such productions involve significant expertise and cost. Limitations exist in hardware, software and HCI design, and they are gradually being addressed. Research is being invested in design methodologies to streamline production and improve audience satisfaction. Practical issues, like hardware production costs and form-factor discomfort, are being mitigated by commercial investments from major tech companies such as Microsoft, Google, Samsung, Apple, and Meta. Industry standards with cross-platform and legacy support, such as OpenXR, are another important factor for broader adoption. Finally, the audience’s familiarity with and interest in this technology is increasing as it permeates more and more aspects of daily life.
Conclusion
It is clear that extended reality technology can transform how we engage with our audiovisual heritage – it can offer contextualization, it can situate both the audience and the artifact in a narrative framework, and it can offer more depth and nuance to our interactions. While there are still challenges, the limitations are steadily being lifted by efforts from a multitude of involved fields – evidencing the importance of this domain. We eagerly anticipate the next innovative steps from museums, galleries, research centers, studios and film companies worldwide.
On January 26th 2024, the Maastricht University (UM) VOXReality team was hosted by the Artificial Intelligence 4 Language Technologies (AI4LT) group of the Karlsruhe Institute of Technology (KIT). It was a day-long workshop where both groups presented their work in Natural Language Processing (NLP) and, more specifically, Machine Translation (MT). Synergies between the two groups promise a bright future for applied language technologies!
UM kicked off the day by presenting the VOXReality project, its 3 use-cases, and its general objectives: (1) improve human-to-machine and human-to-human XR experiences, (2) widen multilingual translation and adapt it to different contexts, (3) extend and improve the visual grounding of language models, (4) provide accessible pretrained XR models optimized for deployment, and (5) demonstrate clear integration paths for the pretrained models. UM’s team member Yusuf Can Semerci, the scientific and technical coordinator, elaborated on the technical excellence of the project, which is ensured by applying state-of-the-art methods in automatic speech recognition (ASR), multilingual machine translation, vision and language models, and generative dialogue systems.
UM’s team has 2 active PhD candidates who shared their latest research endeavors. Abderrahmane Issam explained his latest work on efficient simultaneous machine translation (SiMT). The goal of SiMT is to provide accurate and as-close-to-real-time-as-possible translations by developing policies that balance the quality of the produced translation against the lag that is sometimes necessary for the model to have enough information to translate properly. UM’s proposed method learns when to wait for more input in the source language before starting to produce the translation, taking into account the uncertainty that comes with real-time applications. Results are promising both in translation accuracy and in reducing the necessary lag.
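To make the notion of a read/write policy concrete, here is a minimal sketch of the classic wait-k baseline, which always reads k source tokens before emitting each target token. This is only an illustration of the general idea, not UM's adaptive, uncertainty-aware method; `translate_step` is a hypothetical placeholder for a real SiMT decoder.

```python
def wait_k_policy(source_tokens, translate_step, k=3):
    """Toy wait-k simultaneous translation loop (illustrative only)."""
    source_prefix, target_prefix = [], []
    for token in source_tokens:
        source_prefix.append(token)
        # Emit a target token only once the reader is at least k tokens ahead of the writer.
        if len(source_prefix) - len(target_prefix) >= k:
            target_prefix.append(translate_step(source_prefix, target_prefix))
    # Flush: once the full source is available, finish the translation.
    while len(target_prefix) < len(source_prefix):
        target_prefix.append(translate_step(source_prefix, target_prefix))
    return target_prefix


# Toy "translation step" that just copies the next source token; a real SiMT system
# would run an encoder-decoder model here.
dummy_step = lambda src, tgt: src[len(tgt)].upper()
print(wait_k_policy("wir sehen ein beispiel".split(), dummy_step, k=2))
# ['WIR', 'SEHEN', 'EIN', 'BEISPIEL'] -- each word is emitted 2 source tokens "late"
```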
Pawel Maka presented his published paper on context-aware machine translation. Context plays an important role in all language applications: in machine translation it is essential for resolving ambiguities such as which pronoun should be used. Context can be represented in different ways and usually includes the sentences preceding (or following) the one we want to translate, either in the source or in the target language. Of course, the bigger the context, the more computationally expensive it is to run a translation model. Therefore, UM proposed different methods for efficiently “compressing” context through techniques like caching and shortening. The proposed methods are competitive both in accuracy and in the resources used (e.g. memory).
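As a rough illustration of why caching context pays off (our own toy sketch, not the paper's implementation), the encodings of previous sentences can be memoized so they are not recomputed for every new sentence; here `encode_sentence` is a hypothetical stand-in for an expensive encoder call.

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def encode_sentence(sentence):
    # Stand-in for an expensive encoder call; a real system would return a tensor.
    print(f"encoding: {sentence!r}")
    return tuple(hash(w) % 1000 for w in sentence.split())

def translate_with_context(document, window=2):
    outputs = []
    for i, sentence in enumerate(document):
        # The previous `window` sentences act as context; their cached encodings are
        # reused instead of being recomputed for every new sentence.
        context = [encode_sentence(s) for s in document[max(0, i - window):i]]
        outputs.append((encode_sentence(sentence), context))  # a real model would decode here
    return outputs

translate_with_context(["He saw her.", "Then he waved.", "She waved back."])
# Each sentence is encoded exactly once, even though it appears in several context windows.
```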
On the other hand, KIT’s team presented the EU project Meetween. Meetween aims to revolutionize video conferencing platforms, breaking linguistic barriers and geographical constraints. It aspires to deliver open-source AI models and datasets: multilingual AI models that focus on speech but support text, audio and video both as inputs and outputs, and multimodal, multilingual datasets that cover all official EU languages.
KIT’s team of PhD candidates presented their work on (1) multilingual translation in low-resource cases (i.e. for languages that are not widely spoken or for cases where data is not available), (2) low-resource automatic speech recognition, (3) the use of Large Language Models (LLMs) in context-aware machine translation and (4) quality/confidence estimation for machine translation.
We were happy to identify the overlaps between the two EU projects (VOXReality and Meetween) as well as between the UM and KIT teams. At the heart of both projects lies a common objective: harness the power of advanced AI technologies, particularly in the realms of Natural Language Processing (NLP) and Machine Translation (MT), to facilitate seamless communication across linguistic and geographical barriers. While the applications and approaches may differ, the essence of their goals remains intertwined. VOXReality (by UM) seeks to enhance extended reality (XR) experiences by integrating natural language understanding with computer vision. KIT’s Meetween project, on the other hand, takes a different but complementary approach to revolutionizing communication platforms. By fostering an environment of open collaboration and knowledge exchange, UM and KIT are more than excited about what the future brings in terms of their collaboration.
Jerry Spanakis
Assistant Professor in Data Mining & Machine Learning at Maastricht University
In our sixth installment of the Partner Interview series, we sit down with Stavroula Bourou, a Machine Learning Engineer at Synelixis Solutions S.A., to explore the company’s vital role in the VOXReality project. Synelixis, a leader in advanced technology solutions, has been instrumental in developing innovative virtual agents and immersive XR applications that are transforming how we experience virtual conferences. In this interview, Stavroula shares insights into their groundbreaking work and how they are driving the future of communication in the XR landscape.
Can you provide an overview of your organization's involvement in the VOXReality project and your specific role within the consortium?
Synelixis Solutions S.A. has been an integral part of the VOXReality project from its inception, serving as one of the original members of the proposal team. Our organization brings a wealth of experience to the table, participating in numerous EU-funded research projects and providing cutting-edge technology solutions.
In the VOXReality project, our roles span several domains, significantly enhancing the project’s success. One of our pivotal contributions is the development of a virtual agent designed for use in virtual conferences. This agent is designed to be user-friendly and non-intrusive, respecting user requests and preferences while assisting users by providing navigational help and timely information about the conference schedule, among other tasks. Its design ensures that interactions are helpful without being disruptive, allowing users to engage with the conference content effectively and comfortably.
Additionally, we have developed one of the three VOXReality XR Applications—the VR Conference application. This application recreates a professional conference environment in virtual reality, complete with real-time translation capabilities and a virtual assistant. It enables users to interact seamlessly in their native languages, thanks to VOXReality’s translation services, thus breaking down language barriers. Furthermore, the virtual agent provides users with essential information about the conference environment and events, enhancing their overall experience.
Furthermore, we have outlined deployment guidelines for the VOXReality models for four different methods: source code, Docker, Kubernetes, and ONNX in Unity. These guidelines are designed to facilitate the integration of VOXReality models into various applications, making the technology accessible to a broader audience.
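As a generic illustration of the ONNX route mentioned above (a sketch under our own assumptions, not the project's actual export scripts, model names, or file layout), a trained PyTorch model can be exported to ONNX roughly like this and then consumed by ONNX Runtime or Unity-side inference tooling:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a trained model; the real VOXReality models differ.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
dummy_input = torch.randn(1, 128)  # example input used to trace the graph

torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["input"], output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},  # allow variable batch size
)
# The resulting model.onnx can be loaded by ONNX Runtime, packaged in a container,
# or imported into a Unity project via its ONNX inference tooling.
```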
How do you envision the convergence of NLP and CV technologies influencing the Extended Reality (XR) field within the context of the VOXReality initiative?
In the context of the VOXReality initiative, the convergence of Natural Language Processing (NLP) and Computer Vision (CV) technologies is poised to revolutionize the Extended Reality (XR) field. By integrating NLP, we enhance communication within XR environments, making it more intuitive and effective. This allows users to interact with the system using natural language, significantly improving accessibility and engagement. Additionally, this technology enables users who speak different languages to communicate with one another or to attend presentations and theatrical plays in foreign languages, thus overcoming language barriers and reaching a broader audience. Similarly, incorporating CV enables the system to understand and interpret visual information from the environment, which enhances the realism and responsiveness of both virtual agents and XR applications.
Together, these technologies enable a more immersive and interactive experience in XR. For example, in the VOXReality project, NLP and CV are being utilized to create environments where users can naturally interact with both the system and other users through voice commands. This integration not only makes XR environments more user-friendly but also significantly broadens their potential applications, ranging from virtual meetings and training sessions to more complex collaborative and educational tasks. The synergy of NLP and CV within the VOXReality initiative is set to redefine user interaction paradigms in XR, making them as real and responsive as interacting in the physical world.
What specific challenges do you anticipate in developing AI models that seamlessly integrate language as a core interaction medium and visual understanding for next-generation XR applications?
One of the primary challenges in developing AI models that integrate language and visual understanding for next-generation XR applications is creating a genuinely natural interaction experience. Achieving this requires not just the integration of NLP and CV technologies but their sophisticated synchronization to operate in real-time without any perceptible delay. This synchronization is crucial because even minor lags can disrupt the user experience, breaking the immersion that is central to XR environments. Additionally, these models must be adept at comprehensively understanding and processing user inputs accurately across a variety of dialects. The complexity of processing multilingual and dialectical variations in real-time adds significant complexity to AI model development.
Moreover, another significant challenge is the high computational demands required to process these complex AI tasks in real-time. These AI models often need to perform intensive data processing rapidly to deliver seamless and responsive interactions. Optimizing these models to function efficiently across different types of hardware, from high-end VR headsets to more accessible mobile devices, is crucial. Efficient operation without compromising performance is essential not only for ensuring a fluid user experience but also for the broader adoption of these advanced XR applications. The ability to run these complex models on a wide range of hardware platforms ensures that more users can enjoy the benefits of enriched XR environments, making the technology more inclusive and widespread.
All these challenges are being addressed within the scope of the VOXReality project. Stay tuned to learn more about our advancements and breakthroughs in this exciting field.
How do you plan to ensure the adaptability and learning capabilities of the virtual agents in varied XR scenarios?
To ensure the adaptability and learning capabilities of our virtual agents in varied XR scenarios within the VOXReality project, we are implementing several key strategies. Firstly, we utilize advanced machine learning techniques to equip the virtual agents with the ability to learn from user interactions and adapt their responses over time. These techniques, including deep learning and large language models (LLMs), enable the virtual agents to analyze and interpret vast amounts of data rapidly, thereby improving their ability to make informed decisions and respond to user inputs in a contextually appropriate manner, making them more intuitive and responsive.
Moreover, we are actively creating and curating a comprehensive dataset that reflects the real-world diversity of XR environments. This dataset includes a wide array of interactions, environmental conditions, and user behaviors. By training our virtual agents with this rich dataset, we enhance their ability to understand and react appropriately to both common and rare events, further boosting their effectiveness across various XR applications.
Through these methods, we aim to develop virtual agents that are not only capable of adapting to new and evolving XR scenarios but are also equipped to continuously improve their performance through ongoing learning and interaction with users.
In the long term, how do you foresee digital agents evolving and becoming integral parts of our daily lives, considering advancements in spatial and semantic understanding through NLP, CV, and AI?
In the long term, we foresee digital agents evolving significantly, becoming integral to our daily lives as advancements in NLP, CV, and AI continue to enhance their spatial and semantic understanding. As these technologies develop, digital agents will become increasingly capable of understanding and interacting with the world in ways that are as complex as human interactions.
With improved NLP capabilities, digital agents will be able to comprehend and respond to natural language with greater accuracy and contextual awareness, making interactions feel more conversational and intuitive. This advancement also includes sophisticated translation capabilities, enabling agents to bridge language barriers seamlessly. As a result, they can serve global user bases by facilitating multilingual communication, which enhances accessibility and inclusivity. This will allow them to serve in more personalized roles, such as personal assistants that can manage schedules, respond to queries, and even provide companionship with a level of empathy and understanding that closely mirrors human interaction.
Advancements in CV will enable these agents to perceive the physical world with enhanced clarity and detail. They’ll be able to recognize objects, interpret scenes, and navigate spaces autonomously. This will be particularly transformative in sectors like healthcare, where agents could assist in monitoring and providing care, and in retail, where they could offer highly personalized shopping experiences.
Furthermore, as AI technologies continue to mature, we will see digital agents performing complex decision-making tasks, learning from their environments, and operating autonomously within predefined ethical guidelines. They will become co-workers, caregivers, educators, and even creative partners, deeply embedded in all aspects of human activity.
Ultimately, the integration of these agents into daily life will depend on their ability to operate seamlessly and discreetly, enhancing our productivity and well-being without compromising our privacy or autonomy. As we advance these technologies, we must also consider the ethical implications and ensure that digital agents are developed in a way that is beneficial, safe, and respectful of human values.
Stavroula Bourou
Machine Learning Engineer at Synelixis Solutions SA
In this fifth installment of our Partner Interview series, Leesa Joyce, Head of Research Implementation at Hololight, sits down with Carina Pamminger, Head of Research at Hololight, to explore their organization’s pivotal role in the VOXReality project. As a leader in extended reality (XR) technology, Hololight is pushing the boundaries of augmented reality (AR) solutions, particularly within industrial training applications. Through their work on the Virtual Training Assistant use case, Carina sheds light on how AR is transforming training processes by integrating AI-driven interactions and real-time performance evaluation. The interview delves into the innovative ways AR is being utilized to enhance assembly line training, the incorporation of safety protocols, and the future of immersive learning experiences at Hololight.
Can you provide an overview of your organisation's involvement in the VOXReality project and your specific role within the consortium?
HOLO is an extended reality (XR) technology provider contributing its augmented reality (AR) solutions for streaming and displaying 3D computer-aided-design (CAD) models and manipulating them in an AR environment. HOLO leads the task on developing novel interactive XR applications and is also the leader of the use case “Virtual Training Assistant”. The Training Assistant use case revolves around enhancing an AR industrial assembly training application by incorporating the automated speech recognition (ASR) model and dialogue system of VOXReality into the training process. Conventional training techniques frequently exhibit a deficiency in interactivity and adaptability, resulting in less-than-optimal educational results. Through the integration of artificial intelligence within the AR setting, this scenario aims to establish a more captivating and efficient training atmosphere. Noteworthy characteristics of the application encompass the visualization and manipulation of 3D CAD files within the AR environment, an interactive virtual training aide featuring real-time performance evaluation, as well as a dynamic dialogue system driven by natural language processing (NLP) and speech-to-text functionalities.
The prime constituent of the training assistant technology is the application Hololight Space Assembly. Trainees are guided to precisely assemble components within the CAD model, ensuring everything fits perfectly. The system effortlessly integrates with pre-existing asset bundles, providing all the necessary details, such as CAD files, tools, and additional elements like tables or shelves. It also includes intuitive scripts for model interaction, easy-to-navigate menus, and smart algorithms to enhance the assembly experience. In addition, Assembly leverages Hololight Stream to remotely render the application from a high-performance laptop to AR smart glasses, overcoming the device’s rendering limitations. This remote rendering and streaming setup allows the AR training application to be hosted on a powerful laptop (server) and seamlessly streamed to the HoloLens 2 (client).
How is AR seamlessly integrated into training applications, and what specific advantages does it bring to the learning experience?
Integrating AR into training applications allows assembly line workers to train in a highly realistic, digitally replicated environment that mirrors their actual workspace. This immersive experience helps workers develop muscle memory and recognize environmental cues, making the transition to the real assembly line smoother and more intuitive. Since the training environment is digital, it can be accessed from anywhere, at any time, providing flexibility and convenience for both trainees and companies.
Moreover, AR-based training is resource-friendly and cost-effective. Multiple workers can use the same training files repeatedly, allowing for efficient use of resources. The digital nature of the environment also means that training scenarios can be easily modified, redesigned, or personalized to meet specific needs, enhancing the learning experience. By incorporating sensory cues, AR helps reinforce learning, making it a powerful tool for building skills that are critical in a fast-paced, high-precision environment like an assembly line.
How does the AR training application cater to different skill levels among trainees, ensuring a gradual learning curve for beginners and challenging modules for more experienced assembly technicians?
The AR training application is designed to accommodate various skill levels, ensuring that both beginners and experienced assembly technicians can benefit from the training. For those with some assembly knowledge who need to master a new object, engine or machine, the difficulty modes come in handy. These modes guide trainees through the correct order of assembly, gradually increasing in complexity. This personalized approach allows the training to adapt to each individual's expertise and learning pace, making it accessible to slower learners while still providing a challenge for those who pick up the process quickly.
By progressing through these difficulty levels, trainees not only learn the assembly process but also reinforce it through repetition, ensuring they internalize each step. As they clear each difficulty mode, they build confidence and gradually commit the entire process to memory. This approach ensures that by the end of the training, regardless of their initial skill level, all trainees will have mastered the assembly process and be fully prepared to apply their knowledge on the actual assembly line.
Considering the critical nature of turbine assembly, how does the AR application incorporate safety protocols and guidelines to ensure that trainees adhere to industry standards during the training process?
The AR application prioritizes safety by guiding trainees through the correct order of turbine assembly, creating a complete awareness about the process and the parts that need to be handled, reducing the likelihood of mistakes that could lead to serious risks in real-life scenarios. By learning and practicing each step in a controlled, digital environment, trainees can focus on mastering the process without the immediate dangers associated with heavy machinery. This approach ensures that they are well-prepared to follow industry standards and protocols when transitioning to the actual assembly line, where adherence to safety guidelines is critical.
However, some safety aspects remain areas for improvement. Currently, ergonomic assessments can only be conducted in real-life settings, requiring external analysis to ensure proper posture and technique. Additionally, the integration of Personal Protective Equipment (PPE) within the AR training is limited due to compatibility issues between safety goggles and AR glasses. While the application effectively reduces risks by teaching the correct assembly sequence, future developments could enhance safety training by incorporating ergonomic evaluations and better PPE integration.
Looking ahead, what plans are in place for future enhancements and expansions of the AR training application for turbine assembly? Are there additional features or modules on the horizon to further enrich the learning experience?
The AR training application for turbine assembly is set to undergo significant enhancements, particularly with the integration of VOXY, an AI-assisted dialogue agent with voice assistance. VOXY is already a game-changing addition, streamlining interactions within the application by eliminating the need for clumsy AR hand gestures. This ensures a smoother, more immersive experience, allowing users to stay fully engaged with the training process. VOXY also introduces AI-driven support, making it easier for trainees to navigate complex assembly tasks while receiving real-time guidance and feedback.
Future expansions include developing a platform to host training files and an analysis mode to evaluate trainee performance more comprehensively. We’re also exploring the incorporation of real, trackable tools in the AR environment, enabling physical interaction with virtual elements to improve ergonomics and weight memory. Additionally, we’re researching ways to integrate safety equipment into the AR training, with ongoing efforts under the SUN project funded by Horizon Europe. These enhancements will not only enrich the learning experience but also ensure that trainees are better prepared for the physical demands and safety requirements of turbine assembly.
Recording video with modern capturing devices (digital cameras, web cameras, or cell-phone cameras) has proliferated worldwide for many years now. The reasons why we capture videos are numerous. People mostly want to capture important moments in their lives, or less important ones, using their mobile devices. Over the years, a person may accumulate several hundreds or thousands of videos and images. Video capturing has other operational applications, too, like video-based surveillance. In this type of surveillance, a place of interest that is visible to a camera is recorded in order to monitor what is happening in the surrounding area. But why would we need to capture video in this case? Shop owners would, for instance, utilise surveillance cameras to monitor people who navigate their shops for security or business management reasons. However, there can be more than that. Another idea could be to predict when and where people visiting a very large shop or a museum should be serviced by the staff. The full utility of video for artificially intelligent digital analysis is difficult to grasp. When we need to manage smaller or larger collections of video, one important question is this: how can we summarise, through text, the essential semantic visual information contained in a collection of videos?
As humans, we can instantly perceive some of the elements of our surrounding environment without even making a significant effort. Perceiving aspects of the visual world is essential for us to function within our communities. The human brain perceives the world around us visually by receiving visual information sensed directly through our eyes and transmitted via the optic nerve. This is true, but it is too coarse a statement about how vision works in the human species: it does not reveal the complexity of how human vision essentially works. In fact, to this day it is not fully understood how the brain processes visual information and how it makes sense of it.
Although important scientific questions remain in the modern understanding of the underpinnings of visual processing in the human brain, computer scientists have for years been trying to find explanations, algorithms and mathematical tools that can recreate visual analysis and understanding from visual data of different sorts (e.g., images and video). Making sense of videos, for instance, requires us to be able to detect and localise objects, track target moving objects, or take in a streaming video of a road scene from a car roof-mounted camera and find where the vehicles, pedestrians, or road signs are. These are only a few examples of the computer vision problems and applications studied by computer scientists and practitioners.
The top of the image above depicts a model of the biological analog for how the brain recognises objects, and the bottom shows an artificial analog for visual processing in the same task. In the biological analog, the human eye senses a green tree through the retina and passes the signals on to the optic nerve. The different cues of this image (such as motion, depth information and colour) are processed by the Lateral Geniculate Nucleus (LGN) in the thalamus. This layer-by-layer signal propagation in the LGN is one explanation of how the brain encodes raw sensory information from the environment. The outcome of this encoding stage is passed on to the layers of the visual cortex, which finally enables the brain to perceive the picture of a green tree. At the bottom of the figure, in contrast, we see how a relatively modern neural network analog works. At an initial step, raw visual data are captured by a camera that senses the visible part of the electromagnetic spectrum. The three layers of pixel intensities, one per colour component (R for red, G for green and B for blue), are then passed through a deep convolutional neural network. This forward pass encodes the raw input as an arrangement of numbers indexed horizontally and vertically; each number is addressed by a tuple of coordinates telling us where in each dimension it belongs. This arrangement of numbers gives us what we call “features”: a numerical “signature” or “fingerprint” of what the camera captured from a fixed position in the environment. Finally, another deep neural network with intermediate neural processing modules decodes these features by transforming them sequentially and non-linearly through different neural processing layers. By means of this forward propagation of information, a distribution over the likely categories of objects in the image is computed. How is this made possible? Simple: the model was trained to adapt its parameters to learn this association from raw images to distributions of object categories by minimising the categorisation error over a set of image versus category pairs.
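A minimal sketch of this artificial analog is shown below: a small convolutional encoder turns an RGB frame into a feature vector (the "fingerprint"), and a simple linear decoder turns that vector into a probability distribution over a handful of illustrative categories. This is a toy model of the pipeline described above, not the network from the figure.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(                      # encodes raw pixels into features
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                  # -> one 32-dim feature vector per image
    nn.Flatten(),
)
decoder = nn.Linear(32, 5)                    # 5 illustrative object categories

frame = torch.rand(1, 3, 224, 224)            # stand-in for a captured RGB frame
features = encoder(frame)                     # the numerical "fingerprint" of the frame
probs = decoder(features).softmax(dim=-1)     # distribution over the 5 categories
print(probs)                                  # sums to 1 across categories
```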
Being able to write computer programs that can tell us, through text, what is depicted in a video has been one important and general-purpose computer vision application. Although for decades humans have imagined intelligent machines that can visually perceive the world around them, convincingly taking in a video and robustly generating a text description of what it depicts only recently became possible with the development of robust algorithms. Before this time-point in the horizon of AI advances, attacking the video-to-text task with older scientific techniques and methods was practically impossible. In the years after this turning point, the most important scientific concepts used to create effective video captioning algorithms were borrowed from the AI subfield called deep learning. Moreover, computer hardware that can accelerate numerical computations has become available in the market, so that deep learning models with a large number of parameters can be trained on raw data. Graphics Processing Units (GPUs) are the go-to hardware technology enabling the development of deep learning-based models. In the era before the proliferation of deep learning (roughly before 2006, the year in which Deep Belief Networks were introduced), there were still important techniques and concepts that were employed to devise successful algorithms, but their capabilities fell short of those exhibited by deep learning models. Moreover, the deep learning-based video captioning algorithms that exist today have an enormous number of parameters, which was atypical of older algorithms (or models) designed for exactly the same task. Although the wide adoption of deep learning can be estimated to have started around 2006, it is important to note that the LeNet deep CNN model developed by Yann LeCun was published in 1998, and that basic elements of deep learning, such as the backpropagation algorithm for tuning model parameters, were developed in the eighties and nineties.
The sets of model parameters in deep learning models are found through algorithms that perform what is called function optimization. Through the use of function optimization algorithms, a model can hopefully work sufficiently well on its intended task. (Research on explainable AI, in turn, has contributed methods that can explain why a model produced a particular output, e.g., a classification or regression decision.) In the area of video captioning, many successful systems such as SwinBERT [1] follow this approach to train a video captioning model on a large dataset of videos harvested from the Web. Each of these videos is associated with a text caption, comprising one or more sentences, that was written by a human. What is significant here is that the designer of a video captioning deep learning algorithm can take in a large dataset of such videos and associated annotations and, after some amount of time that varies with the amount of data and the size of the model, come up with a good model that can be presented with new videos it has never seen during training and generate relatively accurate captions for them.
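A caricature of this function optimization process is sketched below, with random tensors standing in for encoded videos and tokenised captions; nothing here is SwinBERT's actual code, only the generic loop of minimising a captioning error over video-caption pairs.

```python
import torch
import torch.nn as nn

vocab_size, feat_dim, max_len = 1000, 256, 12
# Toy "captioner": maps a video feature vector to logits for every caption position.
model = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, vocab_size * max_len))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):                                    # toy dataset of random pairs
    video_features = torch.randn(8, feat_dim)              # stand-in for encoded videos
    captions = torch.randint(0, vocab_size, (8, max_len))  # stand-in for human captions
    logits = model(video_features).view(8, max_len, vocab_size)
    loss = loss_fn(logits.reshape(-1, vocab_size), captions.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                       # gradient step reduces the captioning error
```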
An ordered sequence of words that describes a video should normally relate to what a real human would say when describing that video. But is it technically trivial to generate a video caption by means of an algorithm? The answer is a mixed “yes and no”. It is partly “yes”, because scientists have already come up with capable algorithms for the task, although they are still not “entirely perfect”. It is partly “no”, because the problem of generating a video caption is ill-posed: it cannot be defined in a way that clearly determines what a video caption really is, so there can be no exact algorithm that produces unquestionably correct answers. To understand this better, imagine that you would normally make different statements after seeing the same video, depending on the details you actually want to highlight. So how can we decide a priori what an algorithm should say about a video when there are several possible statements we could make? There is no way to do this, because the algorithm may miss declaring something about the video that is, in fact, important to a human observing it. Therefore, facing this ill-posedness, we define video captioning only through a mathematical model that can convincingly perform it, even though that model is not the optimal one. As we already mentioned, we do not even know what the optimal model is in the first place! To train this suboptimal model, the designer again has to rely on a large dataset of raw videos and associated text annotations, so that the model learns to perform the associated task well.
In an attempt to grasp that mathematical models of reality can be suboptimal in some sense, it is helpful to recall a quote by George E. P. Box that says: “All models are wrong, but some are useful”. Video captioning systems that are built using deep learning modelling concepts can be said to operate at a level that can empirically prove they are useful and reliable for the task. Their results can achieve a good level of utility when we can quantify goodness, despite the fact that we know these models are not globally perfect models of the visual world sensed through a camera.
To glimpse how a real video captioning system works, we will provide a brief and comprehensible summary of a video captioning model called SwinBERT. This model was developed by researchers at Microsoft and was presented at the CVPR 2022 conference (see reference [1]), a top-tier conference for computer vision. The implementation of this model is publicly available on GitHub [3].
To get an understanding of how SwinBERT works, it is helpful to consider that in the physical world matter is made of small pieces organised hierarchically. Small pieces of matter combine with other small pieces to create bigger chunks of matter. For example, sand is made of very tiny rocks: thousands or millions of such small rocks arranged geometrically together make up a patch of sand. In the same way, small pieces tie together to form larger pieces of matter, and each small piece has a particular position in space. In the case of video captioning, the SwinBERT system provides a model that is trained to relate pieces of visual information (that is, image patches) to sequences of words. To make this happen, SwinBERT reuses two important earlier ideas. The first is the VidSwin transformer model, which represents video as 3D voxel patches and performs feature extraction to represent the visual content in an image sequence. VidSwin-generated representations can be used for classification tasks such as action recognition, among others. VidSwin was published before SwinBERT became available; it was created by Microsoft and presented in a 2022 paper at the CVPR conference [2]. The second is BERT (Bidirectional Encoder Representations from Transformers) [5], developed by Devlin and collaborators, a module that helps generate word sequences (that is, sentences).
To begin with, in order to understand the function of VidSwin, imagine a colourful mosaic created by an artist. To create a mosaic that depicts a scene, a mosaic designer takes very small coloured pieces of rock, each with a unique colour and texture. She stitches these small, colourful pieces of rock together, piece by piece, to form objects like a dolphin and the sea, as in the mosaic image on the left. Normally, each mosaic patch belongs to a single object: in the mosaic on the left, for example, some patches belong to the dolphin, others to the background, and others to the sea surface. Patches belonging to the same object are often adjacent, or at least near each other, while patches belonging to different objects are usually not adjacent, unless their boundaries happen to touch. If we represented the adjacency of image patches as a graph, we would naturally come up with a planar, undirected graph. For the mosaic depicted in the picture above, we can say that “the dolphin hovers over seawater”. Now imagine a scenario in which the same designer creates a series of slightly altered mosaics from the original one. The patches of each mosaic lie at the same positions, but their colour content changes from one mosaic image to the next. Imagine that some of the patches make the dolphin appear displaced over time, giving the impression that the dolphin is moving. Naturally, as we iterate over the mosaics from the first to the last, some image patches are correlated spatially (because they are adjacent within the same image), while other patches, belonging to different mosaic images, are correlated temporally. VidSwin aspires to model these dynamic patch relationships by adopting a transformer model that performs self-attention on the 3D patches both spatially and temporally, creating refined 3D patches that are well embedded in feature space. These refined patch embeddings are further transformed several times by self-attention layers in order to robustly model the dependencies among them at different scales of attention. Finally, these 3D embeddings are passed through a multi-head self-attention layer, followed by a non-linear transformation computed by a feed-forward neural network. VidSwin then outputs spatio-temporal features that numerically describe small consecutive frame segments in the video.
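To make the notion of 3D patches concrete, the sketch below partitions a toy video tensor into non-overlapping spatio-temporal patches, ready to be fed into a linear embedding layer. The patch sizes here are illustrative and are not VidSwin's exact configuration; this shows the partitioning step only, not the attention layers.

```python
import torch

# Toy video: batch=1, channels=3, frames=8, height=64, width=64.
video = torch.randn(1, 3, 8, 64, 64)
pt, ph, pw = 2, 4, 4  # temporal and spatial patch sizes (illustrative values)

B, C, T, H, W = video.shape
patches = (
    video.reshape(B, C, T // pt, pt, H // ph, ph, W // pw, pw)
         .permute(0, 2, 4, 6, 1, 3, 5, 7)   # group the patch-grid indices first
         .reshape(B, (T // pt) * (H // ph) * (W // pw), C * pt * ph * pw)
)
print(patches.shape)  # (1, 1024, 96): one flattened 3D patch per row, ready for embedding
```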
In the language modelling module of SwinBERT, called BERT [5], the ultimate goal is to generate a sequence of words that best describes the visual content of the video as captured by VidSwin. BERT captures the relationships between the words that appear in a sentence by considering the importance of an anchor word given the words that appear both to its left and to its right; for this reason, BERT is said to take the bidirectional context of words into account. BERT uses this bidirectional context to train itself on a large corpus of text, so that it can later be fine-tuned on other text corpora to serve downstream tasks. Like every deep learning model, BERT is trained by optimising objective functions; in its case, two of them. The first is the Masked Language Model (MLM) objective, where some words of a sentence are picked at random and masked out, requiring BERT to infer the masked (that is, missing) words correctly, again using the bidirectional context described previously. The second objective is Next Sentence Prediction (NSP), which pushes the model to understand relationships between pairs of sentences.
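The MLM idea is easy to try out directly with a pretrained BERT model; the snippet below is a small demonstration using the Hugging Face transformers library (it downloads the public bert-base-uncased weights on first run) and is unrelated to SwinBERT's own training code.

```python
# Requires: pip install transformers torch
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
# BERT uses both the left and right context to guess the masked word.
for candidate in unmasker("The dolphin [MASK] over the sea."):
    print(f'{candidate["token_str"]:>10}  score={candidate["score"]:.3f}')
```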
Now that we have described how the VidSwin transformer generates spatio-temporal visual features from frame segments in videos, and how BERT generates textual features, it is time to describe how these two elements are combined to form the SwinBERT model. The key ingredient is a model from the literature that can combine both worlds: the visual one and the textual one. One needs such a model in order to go from a VidSwin-based visual representation to a textual representation, which is the desired output of SwinBERT. The multimodal transformer fuses the visual and textual representations into a better representation that introduces simple and sparse interactions between the visual and textual elements. Such interactions between elements of two different modalities are, in fact, more easily interpretable, whereas everything-versus-everything interactions are more expensive and often unnecessarily complicated. SwinBERT avoids the latter via a multimodal transformer that employs a key element for processing multimodal data: the cross-attention layer. A plain transformer model [4] instead employs a self-attention layer, computing dense relationships between tokens of the same single modality. This multimodal attention layer learns a common representation of text elements and visual elements by computing linear combinations of them; after being passed through a feed-forward neural network, this representation is input to a seq2seq generation algorithm [6] that computes the video captions.
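To illustrate cross-modal attention in isolation, the sketch below lets text tokens attend over visual tokens using PyTorch's built-in multi-head attention. It is a generic cross-attention example with random tensors, not SwinBERT's actual multimodal transformer or its sparse attention mask.

```python
import torch
import torch.nn as nn

d_model = 512
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

text_tokens = torch.randn(1, 20, d_model)    # stand-in for BERT-style word embeddings
video_tokens = torch.randn(1, 784, d_model)  # stand-in for VidSwin spatio-temporal features

# Text queries attend over visual keys/values, fusing the two modalities.
fused, attn_weights = cross_attn(query=text_tokens, key=video_tokens, value=video_tokens)
print(fused.shape)         # (1, 20, 512): one fused vector per text token
print(attn_weights.shape)  # (1, 20, 784): how strongly each word attends to each video patch
```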
References
[1] Lin et al., SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022, pp. 17949-17958.
[2] Liu et al., Video Swin Transformer, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022, pp. 3202-3211.
[3] Code accessed at https://github.com/microsoft/SwinBERT
[4] Vaswani et al., Attention Is All You Need, in Proceedings of Neural Information Processing Systems 2017.
[5] Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in Proceedings of NAACL-HLT 2019.
[6] Sutskever, Vinyals and Le, Sequence to Sequence Learning with Neural Networks, in Proceedings of Neural Information Processing Systems 2014.
[7] Zhang and Lee, Robot Bionic Vision Technologies: A Review, Applied Sciences, 2022.
Sotiris Karavarsamis
Research Assistant at Visual Computing Lab (VCL)@CERTH/ITI
In this fourth installment of our Partner Interview series, we sit down with Manuel Toledo, Head of Production at VRDays Foundation, to explore the organization’s role in the VOXReality project. VRDays Foundation, known for its commitment to advancing immersive technologies and fostering dialogue around sustainable innovation, is playing a pivotal role within the VOXReality consortium. Manuel shares insights into how the foundation is bridging the XR industry with cutting-edge developments, particularly in the realm of virtual conferencing, and the transformative potential these innovations hold for the future of communication technologies.
Can you provide an overview of your organisation's involvement in the VOXReality project and your specific role within the consortium?
At VRDays Foundation, we are advocates of innovation and creative approaches to pushing the boundaries of immersive technologies and sparking debates on sustainable technology development. Joining forces with the VOXReality consortium aligns perfectly with our mission. We’re immensely proud to serve as a gateway, a bridge if you will, to the broader XR community and industry for this project.
Our contribution to the VOXReality consortium work lies in our extensive experience and network within the XR industry, where the consortium work will have a significant impact.
During the development of VOXReality, we take on several roles: contributing to partners’ work, developing one of the specific VOXReality use cases (VR Conference), and leading the pilot ideation, planning and delivery of all three use cases.
Moreover, we’re excited to amplify the impact of the consortium’s work by showcasing it at events like Immersive Tech Week in Rotterdam. It’s not just about what we accomplish within the consortium but also about how we extend its reach and influence to the broader XR community.
What technical breakthroughs do you anticipate during the course of the VOXReality project?
From the perspective of the VR Conference use case, we’re thrilled about the work we’re putting in alongside the VOXReality consortium. The implications for the event industry, especially in the realm of B2B events, are incredibly exciting.
The VR Conference case being developed by VOXReality promises to revolutionise the landscape, offering effective, high-end, non-physical, business-driven, multilingual, and assisted interactions for virtual visitors. This breakthrough will fundamentally reshape our understanding and experience of events.
What role do emerging technologies play in enhancing the technical capabilities of virtual conferencing solutions?
Thanks to today’s technological advancements, the boundaries of distance and presence have become merely matters of perception. Emerging technologies like VR, AR, and AI have opened up a new realm where perception is constantly pushed to its limits.
With VOXReality’s pioneering development of voice-driven interactions in XR spaces, both event organisers and attendees will face a fundamental shift in their preconceived notions. This innovative leap will, in turn, unlock fresh opportunities for organisers and businesses to enhance the value of their activities. Moreover, it will empower visitors to engage in meaningful and productive interactions, irrespective of their geographical constraints.
What business models do you think are worth exploring for the sustained growth of virtual conferencing technologies?
VRDays Foundation firmly believes that the development of virtual accessibility for conferences and trade shows holds the key to unlocking a wealth of new business opportunities for B2B events and their participants. By tapping into these opportunities, we can create fresh value for those already involved in events, particularly within the realm of B2B engagements such as trade shows, one-to-one meetings, demo sessions, and networking opportunities.
What role do you see virtual conferencing playing in the evolution of communication technologies over the next decade?
In its many formats, the development of virtuality in the next decade will bring change to conferencing at a speed never experienced before, from simple interactions with conference speakers to complex business agreements concluded safely and virtually. All these interactions will rest on communication technologies, bringing down barriers such as complex navigation and language limitations that are common to every event, especially today, when the scale and international reach of visitors demand new approaches from organisers.
Voice-driven interaction will play an important part in these developments by offering a seamless, intuitive means of engagement. It streamlines tasks, supports hands-free operation, and integrates with other modalities for richer experiences. Through personalisation and remote assistance, it promises to elevate usability and foster smoother interactions, charting new avenues for innovation and collaboration. In essence, it promises to elevate the usability, accessibility, and engagement of virtual conferencing, charting new avenues for innovation and cooperation in the years ahead.
In this third installment of our Partner Interview series, we had the pleasure of speaking with Petros Drakoulis, Research Associate, Project Manager, and Software Developer at the Visual Computing Lab (VCL)@CERTH/ITI, about their critical role in the VOXReality project. As a founding member of the project team, CERTH brings its deep expertise in Computer Vision to the forefront, working at the intersection of Vision and Language Modeling. Petros shares how their innovative models are adding a “magical” visual context to XR experiences, enabling applications to understand and interact with their surroundings in unprecedented ways. He also provides insights into the future of XR, where these models will transform how users engage with technology through natural, conversational interactions. Petros highlights the challenges of adapting models to diverse XR scenarios and ensuring seamless cross-platform compatibility, underscoring CERTH’s commitment to pushing the boundaries of immersive technology.
What is your specific role within the VOXReality Project?
CERTH has been a key contributor to the project since its conception, as it has been among the founding members of the proposal team. As one of the principal research institutes in Europe, our involvement concerns conducting research and providing technology to the team. In this project specifically, we saw a chance we wouldn’t miss: to delve into the “brave new world” of Vision and Language Modeling, a relatively new field that lies at the intersection of Computer Vision, which is our lab’s expertise, and Natural Language Processing, a flourishing field driven by the developments in Large Language Models and Generative AI (have you heard of ChatGPT? 😊). Additionally, we work on how to train and deploy all these models efficiently, an aspect that is extremely important due to the sheer size of the current model generation and the necessity of the green transition.
Could you share a bit about the models you're working on for VOXReality? What makes them magical in adding visual context to the experiences?
You put it nicely! Indeed, they enable interaction with the surrounding environment in a way that some years ago would have seemed magical. The models take an image or a short video as input (i.e. as seen from the user), and optionally a question about it, and provide a very human-like description of the scene or an answer to the question. This output can then be propagated to the other components of the VOXReality pipeline as “visual context”, endowing them with the ability to function knowing where they are and what is around them, effectively elevating their level of awareness. Speaking of the latter, what is novel about our approach is the introduction of inherent spatial reasoning, built deep into the models, enabling them to fundamentally “think” spatially.
Imagine we're using VOXReality applications in the future – how would your models make the XR experience better? Can you give us a glimpse of the exciting things we could see?
The possibilities are almost limitless and, as experience has shown, creators rarely grasp the full potential of their creations. The community has an almost “mysterious” way of stretching whatever is available to its limits, given enough visibility (thank you F6S!). Having said that, we envision a boom in end-user XR applications integrating Large Language and Vision models, enabling users to interact with the applications in a more natural way, using primarily their voice in a conversational manner together with body language. We cannot, of course, predict how long this transition might take or to what extent the conventional Human-Computer Interaction interfaces, like keyboards, mice and touchscreens, will be deprecated, but the trend is obvious nevertheless.
In the world of XR, things can get pretty diverse. How do your models adapt to different situations and make sure they're always giving the right visual context?
It is true that in “pure” Vision-Language terms, a picture is worth a thousand words, some of which may be wrong 🤣! For real, any Machine Learning model is only as good as the data it was trained on. The latest generation of AI models is undoubtedly exceptional, but largely due to learning from massive data. The standard practice today is to reuse pretrained models developed for another, sometimes generic, task and finetune them for the intended use-case, never letting them “forget” the knowledge they acquired from previous uses. In that sense, in VOXReality we seek to utilize models pretrained and then finetuned for a variety of tasks and data, which are innately competent at treating diverse input.
In the future XR landscape, where cross-platform experiences are becoming increasingly important, how is VOXReality planning to ensure compatibility and seamless interaction across different XR devices and platforms?
Indeed, the increase in edge-device capabilities we observe today is rapidly altering the notion of where the application logic should reside. Thus, models and code should be able to operate and perform on a variety of hardware and software platforms. VOXReality’s provision in this direction is two-fold. On one hand, we are developing an optimization framework that allows developers to fit initially large models to various deployment constraints. On the other hand, we definitely put emphasis on using as many platform-independent solutions as possible, in all stages of our development. Some examples of this include the use of a RESTful API-based model inference scheme, the release of all models in container-image form, and the ability to export them into various cross-platform binary representations such as ONNX.
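To illustrate what such a RESTful inference scheme looks like from a client’s point of view, here is a hypothetical sketch; the endpoint URL, route and payload fields are assumptions for illustration, not the actual VOXReality API.

```python
# Hypothetical client call to a containerized, RESTful inference service;
# the URL and field names are illustrative, not the actual VOXReality API.
import requests

resp = requests.post(
    "http://localhost:8000/v1/visual-context",   # assumed local service endpoint
    files={"image": open("frame.jpg", "rb")},    # the user's current view
    data={"question": "Where is the exit?"},     # optional VQA-style prompt
    timeout=10,
)
resp.raise_for_status()
print(resp.json())   # e.g. {"caption": "...", "answer": "..."}
```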
Petros Drakoulis
Research Associate, Project Manager & Software Developer at Visual Computing Lab (VCL) @ CERTH/ITI
In our second Partner Interview, we had the opportunity to discuss the VOXReality project with Konstantia Zarkogianni, Associate Professor of Human-Centered AI at Maastricht University. As the scientific coordinator of VOXReality, Maastricht University plays a crucial role in the development and integration of neural machine translation and automatic speech recognition technologies. Konstantia shares her insights into how Natural Language Processing (NLP), Computer Vision (CV), and Artificial Intelligence (AI) are driving the future of Extended Reality (XR) by enabling more immersive and intuitive interactions within virtual environments. She also discusses the technical challenges the project aims to overcome, particularly in aligning language with visual understanding, and emphasizes the importance of balancing innovation with ethical considerations. Looking ahead, Konstantia highlights the project’s approach to scalability, ensuring that these cutting-edge models are optimized for next-generation XR applications.
What is your specific role within the VOXReality Project?
UM is the scientific coordinator of the project and is responsible for implementing the neural machine translation and the automatic speech recognition. My role in the consortium is to monitor and supervise UM’s activities while providing my expertise on the ethical aspects of AI, along with the execution of the pilots and the open calls.
How do you perceive the role of Natural Language Processing (NLP), Computer Vision (CV), and Artificial Intelligence (AI) in shaping the future of Extended Reality (XR) as part of the VOXReality initiative?
VOXReality’s technological advancements in the fields of Natural Language Processing, Computer Vision, and Artificial Intelligence pave the way for future XR applications capable of offering high-level assistance and control. Language enhanced by visual understanding constitutes VOXReality’s main medium for communication, and it is implemented through the combined use of NLP, CV, and AI. The seamless fusion of linguistic expression and visual comprehension offers immersive communication and collaboration, revolutionizing the way humans interact with virtual environments.
What specific technical challenges is the project aiming to overcome in developing AI models that seamlessly integrate language and visual understanding?
Within the frame of the project, innovative cross-modal and multi-modal methods to integrate language and visual understanding will be developed. Cross-modal representation learning will be applied to capture both linguistic and visual information by encoding the semantic meaning of words and images in a cohesive manner. The generated word embeddings will be aligned with the visual features to ensure that the model can associate relevant linguistic concepts with corresponding visual elements. Multi-modal analysis involves the development of attention mechanisms that endow the model with the capability to focus on the most important and relevant parts of both modalities.
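As a rough illustration of what aligning word embeddings with visual features can look like, here is a generic contrastive-alignment sketch in the style of CLIP; the embedding dimensions and toy batch are placeholders, not the methods developed in the project.

```python
# Generic cross-modal alignment sketch (CLIP-style contrastive objective);
# embedding dimensions and inputs are placeholders, not VOXReality's models.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb, image_emb, temperature=0.07):
    # Project both modalities onto a shared unit hypersphere.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # Pairwise cosine similarities between every caption and every image.
    logits = text_emb @ image_emb.t() / temperature

    # Matching caption/image pairs sit on the diagonal of the similarity matrix.
    targets = torch.arange(text_emb.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage: a batch of 8 caption embeddings aligned with 8 image embeddings.
loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```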
How does the project balance technical innovation with ethical considerations in the development and deployment of XR applications?
VOXReality foresees the implementation of three use cases: (i) a digital agent assisting the training of personnel in machine assembly, (ii) virtual conferencing offering a shared virtual environment that allows navigation and chatting among attendees speaking different languages, and (iii) theatre incorporating language translation and visual effects. Focus has been placed on taking into consideration the ethical aspects of the implemented XR applications. Prior to initiating the pilots, the consortium identified specific ethical risks (e.g. misleading language translations), prepared the relevant informed consent forms, and drafted a pilot study protocol ensuring safety and security. Ethical approval to perform the pilots has been received from UM’s ethical review committee.
Given the rapid evolution of XR technologies, how is VOXReality addressing challenges related to scalability and ensuring optimal performance in next-generation XR applications?
The VOXReality technological advancements in visual language models, automatic speech recognition, and neural machine translation feature scalability and are designed to support next-generation XR applications. With the goal of delivering these models in plug-and-play, optimized form, modern data-driven techniques are applied to reduce the models’ inference time and storage requirements. To this end, a variety of techniques are being investigated to transform unoptimized PyTorch models into hardware-optimized ONNX ones. Beyond the VOXReality pilot studies that implement the three use cases, new XR applications will also be developed and evaluated within the frame of the VOXReality open calls. The new XR applications will be thoroughly assessed in terms of effectiveness, efficiency, and user acceptance.
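For readers unfamiliar with this conversion step, here is a minimal sketch of exporting a PyTorch model to ONNX so it can run on a hardware-optimized runtime; the model and input shape are placeholders, not one of the actual VOXReality models.

```python
# Sketch of converting an unoptimized PyTorch model to ONNX; the model and
# input shape are placeholders, not one of the actual VOXReality models.
import torch
from torchvision import models

model = models.resnet18(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)       # example input for tracing

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
# The exported model.onnx can then be served with a hardware-optimized runtime,
# e.g. onnxruntime.InferenceSession("model.onnx").
```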
Konstantia Zarkogianni
Associate Professor of Human-Centered AI, Maastricht University, MEng, MSc, PhD
In our first Partner Interview, Spyros Polychronopoulos from ADAPTIT S.A. discusses their role in developing the AR Theatre application for the VOXReality project. As XR technology experts, ADAPTIT has been deeply involved in the design and deployment process, ensuring that the technology aligns with live theatre needs. They’ve focused on user-friendly interfaces, seamless integration with theatre systems, and secure data protocols to protect intellectual property. Spyros also highlights strategies for future-proofing the application, including modular design and cross-platform development, with plans to adapt to emerging XR technologies and broaden access to theatre through affordable AR devices.
What is your specific role within the VOXReality Project?
Our organization, in our capacity as XR technology experts, has undertaken the design, development and deployment of the AR Theatre application. We have been engaged in the design process from the very beginning, e.g. in discussing, interpreting and contextualizing the user requirements. Throughout the process, we have been in close contact with the theatrical partner and use case owner, offering technological knowledge transfer to their artistic and management team. This framework of operations has proven critical in facilitating team-based decision-making during design, and thus in keeping in view the needs of both the XR technology systems and the theatrical ecosystem.
To facilitate our communication in an interdisciplinary team and to consolidate our mutual understanding, we have taken the lead in creating dedicated applications as deemed necessary.
Firstly, to render the VOX Reality capabilities in tangible, everyday terms, we created an easily distributable mobile application which demonstrates the VOX Reality models one by one in a highly controlled environment. This application can also function as a dissemination contribution for the VOX Reality project goals. We proceeded with developing a non-VOX Reality related AR application to practically showcase the XR device capabilities to the theatrical partner, and more specifically, to the team’s theatrical and art director with a focus on the device’s audiovisual capabilities.
Furthermore, we combined the two previous projects in a new AR-empowered application to better contextualize the VOX Reality services to a general audience which is unfamiliar with AR. Since that milestone, we have been developing iterations of the theatrical application itself with increasing levels of complexity. Our first iteration was an independent application running on the XR device which simulates the theatrical play and user experience. It was produced in standalone mode for increased mobility and testing and was used extensively for documenting footage and experientially evaluating design alternatives. The second iteration is a client-server system which allows multiple XR applications to operate in sync with each other. This was built for simulated testing in near-deployment conditions during development and was targeted at evaluating the more technical aspects of the system, like performance and stability. The third and last iteration will incorporate all the physical theatrical elements, specifically the actors and the stage, and will involve the introduction of new technology modules with their own challenges.
In summary, this has been a creative and challenging journey so far, with tangible and verifiable indicators for our performance throughout, and with attention to reusability and multifunctionality of the developed modules to reinforce our future development tasks.
As for my personal involvement, this has been a notably auspicious coincidence, since I myself am active in theatrical productions as a music producer and devoted to investigating the juncture of music creation and AI.
What considerations went into selecting the technology stack for the theatre use case within VOXReality, and how does it align with the specific requirements of live theatrical performances?
Given the public nature of the theatrical use case, the user-facing aspects of the system, specifically the XR hardware and the XR application user interface, were an important consideration.
In terms of hardware, the form factor of the AR device was treated as a critical parameter. AR glasses are still a developing product with a limited range of devices that could support our needs. We opted for the most lightweight available option with a glasses-like form to achieve improved comfort and acceptability. This option had the tradeoff of being cabled to a separate computing unit, which we considered of little concern given the static, seated arrangement in the theatre. In more practical terms, since the application should operate with minimal disturbance in terms of head and hand movement, in silence and in low-light conditions, we decided that any input to the application should be made using a dedicated controller and not hand tracking or voice commands.
In terms of user interface design, we selected a persona with minimal or no XR familiarity and that defined our approach in two ways: 1) we chose the simplest possible user input methods on the controller and we implemented user guidance with visual cues and overlays. We added a visual highlight to the currently available button(s) at any point and in the next iteration, we will expand on this concept with a text prompt on the functionality of each button, triggered by user gaze tracking. 2) we tried to find the balance between providing user control which allows for customization/personalization and thus improved comfort, and limiting control which safeguards the application’s stability and removes cognitive strain and decision-making from the user. This was addressed by multiple design, testing and feedback iterations.
How does the technical development ensure seamless integration with existing theatre systems, such as lighting, sound, and stage management, to create a cohesive and synchronized production environment?
As in most cases of innovative merging of technologies, adaptations from both sides of the domain spectrum will need to be made for a seamless merger. One problematic area involves the needs of the spatial mapping and tracking system used by XR technology. Current best practices for its stable operation dictate conditions that typically do not match a theatrical setup: the system requires well-lit conditions that remain stable throughout the experience, performs best in small to medium-sized areas, needs surfaces with clear and distinct features while avoiding certain textures, and so on. Failure of the spatial mapping and tracking system can lead to misplaced 3D content which no longer matches the scenography of the stage and thus breaks immersion and suspension of disbelief for the user. In some cases, failure may also lead to non-detection or inaccurate detection of the XR device controller(s), thus impeding user input.
To address this, recommendations for the stage’s scenography can be provided by the technical team to the artistic team. Examples are to avoid surfaces that are reflective, transparent, or uniform in color (especially avoiding black), or surfaces with strong repeating patterns. Recommendations can also address non-tangible theatrical elements, like the lighting setup. Best practices advise avoiding strong lighting that produces intense shadows, as well as plunging areas into total or near-total darkness.
Furthermore, there are spatial tracking support systems that a director may choose to integrate in experimental, narrative or artistic ways. One example is the incorporation of black-and-white markers (QR, ARUCO, etc) as scenography elements which have the practical function of supporting the accuracy of the XR tracking system or extending its capabilities (e.g. tracking moving objects).
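As a small illustration of how such markers are read by a vision system, here is a hypothetical detection sketch using OpenCV’s ArUco module (assuming OpenCV 4.7 or newer); it is not the AR Theatre implementation, whose tracking runs on the XR device itself.

```python
# Illustrative ArUco marker detection (OpenCV >= 4.7 assumed); not the actual
# AR Theatre tracking stack, which runs natively on the XR device.
import cv2

dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())

frame = cv2.imread("stage_view.jpg")             # hypothetical camera frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Detected corners/ids can anchor 3D content or track moving scenography.
corners, ids, _rejected = detector.detectMarkers(gray)
print(ids)
```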
Going even further, an artistic team may even want to examine a non-typical theatre arrangement which can better match the XR technology needs and lead to innovative productions. One example is the theatre-in-the-round setup, which has a smaller viewing distance between audience and actors and an inherently different approach to scenography (360° view). Other, even more experimental physical setups can involve audience mobility, like standing or walking around, which can make even more use of the XR capabilities of the medium in innovative ways, like allowing users to navigate a soundscape with invisible spatial audio sources or discover visual elements along pre-designed routes or from specific viewing angles.
In terms of audio input, the merger has fewer parameters. Currently, users listen to the audio feed from the theatre stage’s main speakers and receive no audio from the XR device. Innovative XR theatre design concepts around audio could involve making narrative and artistic use of the XR device speakers. This could, for example, be an audio recording of a thought or internal monologue that, instead of being broadcast from the main stage, plays directly on the XR device speakers, and thus very close to the viewer and at low volume. It could be an audio effect that plays in waves rippling across the audience, or plays with a spatialized effect somewhere in the hall, e.g. among the audience seating. Such effects could also make use of the left-right audio channels, thus giving a stronger sense of directionality to the audio.
The audio support could also be used in more practical terms. VOX Reality currently supports the provision of subtitles in the user’s language of choice. In the future, we could extend this functionality to provide a voice-over narration using natural-sounding synthetic speech in that language. This option would better accommodate people who prefer listening over reading for any physiological or neurological reason. This feature would require supplying the XR devices with noise-cancelling headphones, so that users may receive a clear audio feed from their XR devices, be isolated from the theatrical stage’s main speaker feed, and not produce audio interference for each other.
In summary, we are in the fortunate position not only to enact a functional merger of the XR technology and theatre domains as we currently know them, but also to envision, through the capabilities of XR, a redefinition of conventions that have shaped the public’s concept of theatrical experiences for centuries. We would summarize these opening horizons in three broad directions: 1) an amplification of inclusivity, by being able to provide customizable, individualized access to a collectively shared experience, 2) an amplification and diversification of the audiovisual landscape in the theatrical domain, and 3) an invigoration of previously niche ways, or an invention of totally new ways, for the audience to participate in the theatrical happenings.
Given the sensitive nature of theatrical scripts, what security protocols have been implemented to protect against unauthorized access?
Although our use case does not manage personal or sensitive medical data as in the domains of healthcare or defense, we meticulously examined the security of our system in terms of data traffic and data storage with respect to the intellectual property protection needs of the theatrical content. To cover the needs of the theatre use case, we designed a client-server system, with clients operating on the XR devices of the audience and the server operating on a workstation managed by the interdisciplinary facilitation team (the developer team and the theatre’s technical team). As context, the core reasons for the client-server system were, in summary: 1) to centralize the audiovisual input from the scene (microphone and video input) in order to safeguard input media quality, 2) to simultaneously distribute the output to the end-user devices in order to ensure synchronicity across the audience, and 3) to offload the demanding computational needs onto a more powerful device in order to avoid battery and overheating issues on the XR devices.
In terms of data traffic security, the server and the clients are connected to the same local Wi-Fi network, protected by a WPA2 password, and communicate using the WebSocket protocol for frequent and fast communication. The local Wi-Fi network is for the exclusive use of the AR theatre system and accessible only to the aforementioned devices, both as a safeguard against network bandwidth fluctuations, which could negatively affect the latency of the system and in turn the user experience during the performance, and as a security measure against data traffic interception. Furthermore, for the exact same reasons, the AI services also operate locally on the same network and are accessed using RESTful API calls, with the added protection of a secure transport protocol (HTTPS). In summary, the entire traffic is contained in a safe and isolated environment that can only be breached by an unauthorized network access violation.
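To make the synchronization idea concrete, here is a minimal, hypothetical sketch of a server broadcasting subtitle cues to connected clients over WebSockets, using the Python websockets library; the host, port and message format are assumptions for illustration, not the deployed AR Theatre stack.

```python
# Minimal broadcast sketch with the "websockets" library; host, port and
# message schema are illustrative assumptions, not the actual deployment.
import asyncio
import json
import websockets

CLIENTS = set()

async def handler(ws):
    CLIENTS.add(ws)                      # each XR headset registers on connect
    try:
        await ws.wait_closed()
    finally:
        CLIENTS.remove(ws)

async def broadcast_subtitle(text, language):
    message = json.dumps({"type": "subtitle", "lang": language, "text": text})
    # Push the same cue to every connected headset at (nearly) the same time.
    await asyncio.gather(*(ws.send(message) for ws in CLIENTS))

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()           # run until interrupted

if __name__ == "__main__":
    asyncio.run(main())
```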
In terms of data storage, it was decided that, in the release version of the application, no data logs will remain on the XR devices, since the devices are temporarily provided to the public without supervision and safeguarding the data against unauthorized access there was not feasible. Any data stored will reside on the server device, will hold no personalized information in any form, and will be used exclusively for technical purposes, like system monitoring and performance evaluation.
Considering the rapid evolution of technology, how is the technical development future-proofed to accommodate emerging advancements, and what strategies are in place for seamless upgrades or integrations with future technologies?
In a rapidly changing technological domain like XR and AI, planning for change is an integral part of design and development. For us, this means asking questions in two directions: 1) what the future fate of the current product can be, and 2) what the product can evolve into in the future with minimal effort. Answering these questions is enabled by the fact that we, as XR developers and producers of state-of-the-art XR applications, can create informed scenarios for the foreseeable future.
One such scenario, based on financial data and trends, is the growth of the XR market, and specifically the AR sector. This is expected to diversify the device range and reduce purchase costs. In turn, this can affect us by enabling the selection of even better-suited AR glasses for theatres, it can reduce the investment cost for adoption by theatrical establishments, and it can support the popularization of the XR theatre concept in artistic circles. At the same time, theatre-goers, in their role as individual consumers, can be expected to have increasing exposure and familiarity with this technology in general. Therefore, our evaluation for the first question is that we have good reason to expect that our current product will have increasing potential for adoption.
On the second question, our strategy is to vigorously uphold proper application design principles with an explicit focus on modular, maintainable and expandable design. Operationally, we are adopting a cross-platform development approach to be able to target devices running different operating systems using the same code base. We are prioritizing open frameworks to ensure compatibility with devices that are compliant with industry standards, thus minimizing intensive proprietary SDK use. In terms of system architecture, by separating the AI from the XR elements, we allow for independent development and evolution of each domain at its own speed and in its own direction. By building the connections with well-established methods that are unlikely to change, like RESTful API calls, we ensure that our product is in the best position to adapt to the potential reworking of entire modules. Furthermore, we adopt a design approach with segmented “levels of XR technology” so as to be able to easily create spin-offs targeting various XR-enabled hardware as it emerges. This does not necessarily mean more powerful devices, but also more popular ones. One current example we are investigating is to single out the subtitle provision feature and target affordable 2D AR glasses (also called HUD glasses, smart glasses or wearable monitors) as a means of increasing theatre accessibility.
Spyros Polychronopoulos
Researcher on the digital simulation of ancient Greek instruments, and lecturer teaching music technology and image processing.
Laval Virtual 2024 was an absolute blast, and VOXReality dove right into the heart of the action! Our mission? To scout for awesome SMEs ready to rock the XR world through our open call. But hey, it wasn’t all business—there was plenty of fun to be had!
Picture this: walking through the buzzing exhibition booths, learning a lot from the mind-blowing tech talks and conferences, and connecting with the coolest people in the European XR scene.
One standout moment? The Women In Immersive Tech (WIIT) gathering. Talk about empowering! We connected with amazing colleagues, exchanged ideas, and celebrated diversity in the industry. It was all about making meaningful connections.
We also got hands-on with jaw-dropping XR content and demos, exploring the edge of innovation. From mind-bending projects to meeting a diverse bunch of XR aficionados, Laval Virtual was the ultimate playground for techies like us!
As we say goodbye to Laval Virtual 2024, VOXReality is pumped up and ready to rock the XR world. Armed with our insights and a bunch of new friends, we’re gearing up to take SMEs on a ride through the immersive tech universe. It’s gonna be one exciting journey!
Hey there! I'm Natalia and I'm a corporate communications specialist. I also hold a Master's degree in Journalism and Digital Content Innovation from the Autonomous University of Barcelona. I currently work on the dissemination, communication, and marketing of technology, innovation, and science for projects funded by the European Commission at F6S.