
Seeing into the Black-Box: Providing textual explanations when Machine Learning models fail.

Machine learning is a scientific practice that is inseparable from the notions of “error” and “approximation”. Sciences such as mathematics and physics accept error as the price of modelling how things work. Human intelligence is tied to error as well: some of our actions fail while others succeed, and there have been countless times when our reasoning, our ability to categorise, or our decisions have let us down. Machine learning models, which try to mimic and compete with human intelligence in certain tasks, likewise produce both successful and erroneous operations.

But how can a machine learning model, a deterministic system that can only empirically compute the confidence it has in a particular action, diagnose that it is making an error when processing a particular input? Even for a machine learning engineer, understanding intuitively why this is possible, without studying a particular method, is difficult.

In this article, we discuss a recent algorithm that convincingly answers this question: the Language Based Error Explainability (LBEE) method by Csurka et al. We will retrace how this method leverages the convenience of generating embeddings with OpenAI’s CLIP model, which translates text extracts and images into high-dimensional vectors that reside in a common vector space. By projecting texts or images into this shared space, we can compute the dot product between two embeddings (a well-known operation that measures the similarity of two vectors) and thereby quantify how similar the original text/image objects are, relative to the similarities of other pairs of objects.
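To make that similarity computation concrete, here is a minimal sketch in Python/NumPy of the cosine similarity that underlies the comparisons discussed throughout this article; the vectors below are random placeholders rather than real CLIP embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: the dot product of the two L2-normalised vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

# Placeholder 512-dimensional "embeddings"; real ones would come from CLIP.
rng = np.random.default_rng(0)
text_embedding = rng.normal(size=512)
image_embedding = rng.normal(size=512)
print(cosine_similarity(text_embedding, image_embedding))
```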

The designers of LBEE developed a method that can report a textual description of a model failure whenever the underlying model assigns an empirically low confidence score to the action it was designed for. Part of the difficulty in grasping how such a method works is our natural assumption that the textual descriptions explaining the failure must be generated from scratch for each input. Our brains usually need little effort to explain why a failure happened; we arrive at clues instantly, unless the cause drifts away from our understanding of the inner workings of the object involved. The answer is simpler than one might expect: instead of assembling the descriptions anew for each input, LBEE generates a pool of candidate sentences a priori and then reuses them, computationally reasoning about how relevant each candidate explanation is to a given model input. In the remainder of this article, we will see how.

Suppose that we have a classification model that was trained to recognise the type of a single object depicted in a small colour image. We could, for example, photograph objects against a white background with a phone camera and pass these images to the model so that it classifies the object names. The model yields a confidence score ω between 0 and 1, representing its normalised confidence in assigning the image to a particular class among all the object types it can recognise. It is commonly observed that when a model does poorly on a prediction, the resulting confidence score is quite low. But what is a good empirical threshold T that separates poor predictions from confident ones? To estimate such thresholds, one for identifying easy predictions and one for identifying hard predictions, we can take a large dataset of images (e.g., the ImageNet dataset) and pass each image to the classifier. For the images that were classified correctly, we plot the confidence scores produced by the model as a normalised histogram. We may then expect to see two large lobes in the histogram: one concentrating relatively low confidence values (less confident inferences) and one concentrating relatively high values (confident inferences), possibly with some spread of frequency mass around the two lobes rather than two sharply peaked, leptokurtic ones. We can then set an empirical threshold that separates the two lobes.
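As an illustration of this thresholding idea (not the exact procedure used in the LBEE paper), the following sketch builds a normalised histogram of synthetic confidence scores and places a cut-off at the valley between the two lobes:

```python
import numpy as np

# Hypothetical confidence scores of correctly classified validation images,
# sampled so that they form a low-confidence and a high-confidence lobe.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.beta(2, 8, 400),    # low-confidence lobe
                         rng.beta(9, 2, 600)])   # high-confidence lobe

# Normalised histogram of the scores (the two lobes described in the text).
hist, bin_edges = np.histogram(scores, bins=20, range=(0.0, 1.0), density=True)

# One simple, illustrative choice: place the cut-off at the least populated
# interior bin, i.e. the valley separating the two lobes.
valley = np.argmin(hist[3:-3]) + 3
threshold = 0.5 * (bin_edges[valley] + bin_edges[valley + 1])
print(f"empirical threshold T ~ {threshold:.2f}")
```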

Csurka and collaborators separate images into “easy” and “hard” sets based on the confidence score of the classification model and its relation to the cut-off threshold (see Figure 1). Having distinguished these two image sets, the authors compute, for each image in each set, an ordered sequence of numbers (for convenience, we will call this sequence a vector) that describes the semantic information of the image. To do this, they employ the CLIP model contributed by OpenAI, the company famous for delivering the ChatGPT chatbot, which excels at producing embeddings for images and text in a joint high-dimensional vector space. The computed embeddings can be used to measure the similarity between an image and a short text extract, or between a pair of text extracts or a pair of images.
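A minimal sketch of how such joint embeddings can be obtained from the publicly released CLIP weights, here via the Hugging Face transformers API; the image path and the candidate sentences are hypothetical:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo_of_object.jpg")              # hypothetical input image
texts = ["a photo taken at night", "a blurry photo"]   # hypothetical descriptions

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])

# L2-normalise so that dot products equal cosine similarities.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
similarities = txt_emb @ img_emb.T   # one similarity score per description
```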

In a later step, the authors identify groups of image embeddings that share similarities. To do this, they use a clustering algorithm that takes in the generated vectors and identifies clusters of embeddings; the number of clusters that fits a particular dataset is non-trivial to choose. In the end, we obtain two types of clusters: clusters of CLIP embeddings for “easy” images and clusters of CLIP embeddings for “hard” images. Each hard cluster centre is then picked and paired with its closest easy cluster centre, giving us two embedding vectors originating from the clustering algorithm. The two kinds of clusters, “easy” and “hard”, are visually denoted in the top-right sector of Figure 1 by green and red dotted enclosures.
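The sketch below illustrates this clustering step with scikit-learn’s k-means on placeholder embeddings; the cluster counts are arbitrary choices, not the values used by the authors:

```python
import numpy as np
from sklearn.cluster import KMeans

# easy_emb, hard_emb: arrays of shape (n_images, d) holding CLIP image embeddings.
rng = np.random.default_rng(0)
easy_emb = rng.normal(size=(500, 512))   # placeholders for real embeddings
hard_emb = rng.normal(size=(200, 512))

k_easy, k_hard = 8, 5                    # illustrative cluster counts
easy_centers = KMeans(n_clusters=k_easy, n_init=10, random_state=0).fit(easy_emb).cluster_centers_
hard_centers = KMeans(n_clusters=k_hard, n_init=10, random_state=0).fit(hard_emb).cluster_centers_

# For every hard cluster centre, find the closest easy cluster centre.
dists = np.linalg.norm(hard_centers[:, None, :] - easy_centers[None, :, :], axis=-1)
closest_easy = dists.argmin(axis=1)      # index of the matching easy centre per hard centre
```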

The LBEE algorithm then generates a set S of sentences that describe the images discussed above, and for each short sentence the CLIP embedding is computed. As mentioned earlier, such a text embedding can be compared directly to the embedding of any image by calculating the dot product (or inner product) of the two embedding vectors; in the signal processing community this quantity is known as linear correlation. The authors apply this operation directly: they compute the relevance of each textual error description via the so-called cosine similarity between a text extract embedding and an image embedding, ultimately obtaining two relevance score vectors of dimensionality k < N, where each dimension is tied to a given textual description. These two score vectors are then passed to a sentence selection algorithm (covered in the next paragraph). The selection is carried out for each hard cluster, and the union of the selected sentence sets is returned to the user for the image that was supplied as input.

The authors define four sentence selection algorithms, named SetDiff, PDiff, FPDiff and TopS. SetDiff computes the sentence sets corresponding to a hard cluster and to its closest easy cluster, removes from the hard cluster sentence set the sentences that also appear in the easy cluster sentence set, and reports the resulting set to the user. PDiff takes two similarity score vectors of dimensionality k (where k denotes the top-k relevant text descriptions), one from the hard set and one from the easy set; it computes the difference between these two vectors and retains the sentences corresponding to the top k values. TopS simply reports all the sentences that correspond to the vector of top-k similarities. Figure 3 presents examples of textual failure modes generated for a computer vision model using the TopS, SetDiff, PDiff and FPDiff methods. To evaluate the LBEE model and methodology, the authors also had to introduce an auxiliary set of metrics adapted to the specificities of the technique. To deepen your understanding of this innovative and very useful work, we recommend reading [1].
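To make the selection strategies concrete, here is a simplified sketch of SetDiff- and PDiff-style selection over a shared pool of candidate sentences; it follows the description above rather than the authors’ released code, and all names and parameter values are illustrative:

```python
import numpy as np

def similarities(center: np.ndarray, sentence_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity between one cluster centre and every candidate sentence embedding."""
    return (sentence_embs @ center) / (
        np.linalg.norm(sentence_embs, axis=1) * np.linalg.norm(center)
    )

def top_k_sentences(center, sentence_embs, sentences, k):
    """The k sentences whose embeddings are most similar to the given cluster centre."""
    order = np.argsort(-similarities(center, sentence_embs))[:k]
    return [sentences[i] for i in order]

def set_diff(hard_center, easy_center, sentence_embs, sentences, k=10):
    """SetDiff-style: keep sentences relevant to the hard cluster but not to the easy one."""
    hard_top = top_k_sentences(hard_center, sentence_embs, sentences, k)
    easy_top = set(top_k_sentences(easy_center, sentence_embs, sentences, k))
    return [s for s in hard_top if s not in easy_top]

def p_diff(hard_center, easy_center, sentence_embs, sentences, k=10):
    """PDiff-style: rank sentences by the difference of the two similarity scores."""
    diff = similarities(hard_center, sentence_embs) - similarities(easy_center, sentence_embs)
    order = np.argsort(-diff)[:k]
    return [sentences[i] for i in order]
```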

References

[1] Csurka et al., “What could go wrong? Discovering and describing failure modes in computer vision.” In Proceedings of ECCV 2024.


Sotiris Karavarsamis

Research Assistant at Visual Computing Lab (VCL)@CERTH/ITI


The rise of immersive technologies in theatre

Transforming the Audience Experience with VR and AR

Many aspects of our society have been profoundly impacted by the development of technology, especially the entertainment industry, which includes theatre. Virtual Reality (VR) and Augmented Reality (AR) are technologies that have massively changed how people think about entertainment and performance. These technologies are extremely versatile and can be used both to enhance the spectator’s experience without altering the essence of theatrical representation and to completely transform performances compared to classical theatre. The concept of augmented reality was first introduced in 1992 by Caudell and Mizell[1], and subsequently expounded upon by Ronald Azuma, who outlined its potential applications in the entertainment sector, among others[2]. However, the real breakthrough came between 2010 and 2015, with the advent of VR headsets. In 2015, Microsoft’s HoloLens introduced the ability to overlay virtual objects onto the real world, fostering new experimentation. In the same year, the platform The Void was launched, becoming popular thanks to hyper-reality experiences that combined virtual reality with interactive physical environments. Due to its popularity, the platform was able to collaborate with major companies like Disney and work on internationally renowned projects such as Star Wars: Secrets of the Empire. The COVID-19 pandemic provided a strong push for the adoption of immersive technology, forcing theatres worldwide to experiment with digital formats and virtual experiences[3].

VR and AR in the entertainment market

The immersive technology market is expanding, driven by sectors such as entertainment, education, healthcare, and business, which are increasingly adopting VR and AR technologies. In 2023, the immersive technology market was valued at $29.13 billion, and projections indicate it will reach $134 billion by 2030, with an annual growth rate of over 25%[4]. With over fifty percent of the market in 2023, the video game industry continues to be the leading sector for VR and AR in entertainment[5]. However, these technologies are increasingly being used in live events and theatre. Artificial Intelligence (AI) is being integrated into VR and AR experiences to enhance interactions and make them more accessible and natural[6]. Furthermore, as smart glasses and headsets have become more powerful and lighter, their latest developments have made adoption easier for a wider range of users. Thanks to government-sponsored research initiatives like Horizon Europe, growing investments in digital innovation, and the increasing use of XR technologies in industries like entertainment, healthcare, and education, the immersive technology market in the EU is predicted to reach $108 billion by 2030[7].

Enhancing theatre accessibility and audience engagement

Immersive technologies present plenty of possibilities to improve theatrical productions, enabling creativity in both the performance and its inclusivity. By employing virtual reality headsets, real-time subtitles and scene-specific context, it is possible to improve audience immersion and promote inclusivity among people with hearing or language impairments. This can increase the number of people who attend plays, particularly in tourist cities where accessibility is severely limited by language barriers. Moreover, the use of these technologies expands the potential audience for theatrical plays because it also overcomes geographic restrictions, enabling viewers to enjoy live performances from a distance in fully immersive virtual theatres. It will allow people who are unable to travel because of age-related problems or disabilities to attend performances not merely in two dimensions, as is currently the case when watching a theatre show on television, but as a deeply immersive experience. Finally, visual effects can be added to performances using VR and AR technologies, bridging the gap between traditional performing arts and modern production techniques.

Applications of VR/AR in Theatrical Performances

The incorporation of VR and AR into theatre has completely transformed how audiences interact with performances; these technologies have introduced new means to boost storytelling, accessibility, and interaction. The potential of these technologies in live performances has been shown by a variety of projects:

  • National Theatre’s Immersive Storytelling Studio: To increase audience engagement, the National Theatre in the UK has adopted immersive technologies. Its Immersive Storytelling Studio investigates the potential of VR and AR to produce more immersive and engaging experiences[8].
  • White Dwarf (Lefkos Nanos) by Polyplanity Productions: This experimental project creates a novel theatrical experience by fusing augmented reality with live performance through the interaction of digital materials with performers on stage[9].
  • Smart Subs by the Demokritos Institute: This project makes theatre performances more accessible to international and hearing-impaired audiences by using AR-powered smart captions that provide live subtitles[10].
  • XRAI Glass: The use of AI technology, in this case in combination with AR smart glasses, can provide real-time transcriptions and translations, enabling people with hearing impairments to follow along or comprehend plays in multiple languages[11].
  • National Theatre Smart Caption Glasses (UK): The National Theatre, in collaboration with Accenture and a team of speech and language experts led by Professor Andrew Lambourne, has developed “smart caption glasses” as part of the accessibility programme for its performances. The smart caption glasses have been in use since 2018 and were also demonstrated at the 2020 London Short Film Festival for cinematic screenings.

These applications show how VR and AR are improving visual effects while also increasing accessibility and inclusivity in theatre. Theatre companies can reach a wider audience, overcome language hurdles, and produce captivating, interactive shows that push the limits of conventional theatre by incorporating immersive technologies.

Conclusion

As technology advances, VR and AR will become increasingly used in theatrical performances, both to create a more immersive experience and to make theatre more accessible, attracting new audiences and expanding the reach of the performing arts. In an increasingly digital environment, these technologies will help ensure that live performances remain both revolutionary and relevant in the cultural context. Additionally, the creation of AI-powered VR and AR tools will make it possible to modify and customize shows according to audience preferences, resulting in more profound emotional experiences and unprecedented accessibility to theatre.

References

Azuma, Ronald T. “A survey of augmented reality.” Presence: Teleoperators & Virtual Environments 6.4 (1997): 355-385.

Iudova-Romanova, Kateryna, et al. “Virtual reality in contemporary theatre.” ACM Journal on Computing and Cultural Heritage 15.4 (2023): 1-11.

Jernigan, Daniel, et al. “Digitally augmented reality characters in live theatre performances.” International Journal of Performance Arts and Digital Media 5.1 (2009): 35-49.

Pike, Shane. “” Make it so”: Communal augmented reality and the future of theatre and performance.” Fusion Journal 15 (2019): 108-118.

Pike, Shane. “Virtually relevant: AR/VR and the theatre.” Fusion Journal 17 (2020): 120-128.

Srinivasan, Saikrishna. Envisioning VR theatre: Virtual reality as an assistive technology in theatre performance. Diss. The University of Waikato, 2024.

[1] Caudell, Thomas & Mizell, David. (1992). Augmented reality: An application of heads-up display technology to manual manufacturing processes. Proceedings of the Twenty-Fifth Hawaii International Conference on System Sciences. 2. 659 – 669 vol.2. 10.1109/HICSS.1992.183317.

[2] Azuma, Ronald T. “A survey of augmented reality.” Presence: teleoperators & virtual environments 6.4 (1997): 355-385

[3] Signiant. VR & AR: How COVID-19 Accelerated Adoption, According to Experts. 2024

[4] Verified Market Reports. Immersive Technologies Market Report. 2024

[5] Verified Market Reports. Immersive Technologies Market Report. 2024.

[6] Reuters. VR, AR headsets demand set to surge as AI lowers costs, IDC says. 2024.

[7] Mordor Intelligence. Europe Immersive Entertainment Market Report. 2024.

[8] National Theatre. Immersive Storytelling Studio. 2024.

[9] Polyplanity Productions. White Dwarf (Lefkos Nanos). 2024.

[10] Demokritos Institute. Smart Subs Project. 2024.

[11] XRAI Glass. Smart Glasses for Real-Time Subtitles. 2024.


Greta Ioli

Greta Ioli is an EU Project Manager in the R&D department of Maggioli Group, one of Italy's leading companies providing software and digital services for Public Administrations. After earning a degree in International Relations – European Affairs from the University of Bologna, she specialized in European projects. Greta is mainly involved in drafting project proposals and managing dissemination, communication, and exploitation activities.


Partner Interview #8 with F6S

The VOXReality project is driving innovation in Extended Reality (XR) by bridging this technology with real-world applications. At the heart of this initiative is F6S, a key partner ensuring the seamless execution of open calls and supporting third-party projects (TPs) from selection to implementation. In this interview, we sit down with Mateusz Kowacki from F6S to discuss their role in the consortium, the impact of mentorship, and how the project is shaping the future of AI and XR technologies.

Can you provide an overview of your organization's involvement in the VOXReality project and your specific role within the consortium?

F6S played a crucial operational role in the VOXReality project by managing the preparation and execution of the open calls. This thorough approach involved designing the application process (determining eligibility criteria, application requirements and evaluation metrics), developing and disseminating the call, and managing the selection and implementation of the TPs’ projects.

Essentially, F6S acted as the facilitator ensuring a smooth and efficient process of preparing and implementing open calls.

How do you ensure that both mentors and the projects they guide benefit from the mentorship process, and what does that look like in practice?

There are a lot of important factors that made the mentoring process within the VOXReality project a success, but one of the key elements is communication. That involves clearly outlining the roles and responsibilities of both the mentor and the project team, including setting expectations for communication frequency, meeting schedules, and deliverables. What is more, we regularly check in with both mentors and projects to assess progress, identify any challenges, and provide support, and we gather feedback on the mentorship experience to continuously improve the programme. Those are for sure the core elements of successful implementation. What we also developed in Sprint 2, based on lessons learnt from Sprint 1, is a clear calendar of upcoming activities that involve TPs and mentors. That helps us with better execution and a better understanding of our tasks.

Regular meetings, check-ups and openness to discussion have also played a crucial role. F6S helped all partners to better execute and navigate the implementation of the open call.

How does the VOXReality team ensure that the XR applications being developed are both innovative and practical for real-world use?

The VOXReality team employs a multi-faceted approach to ensure that the XR applications being developed are both innovative and practical for real-world use. By funding projects through open calls, VOXReality fosters innovation and encourages a diverse range of ideas and approaches. This collaborative approach ensures that the development of XR applications benefits from the expertise of a wider community, leading to more creative and practical solutions. So basically, the whole selection process has been designed to attract the most innovative technologies possible. We have been lucky to attract a lot of applications, so selecting five TPs has not been an easy task, as many projects offered good value in terms of innovation and real-world use. Nevertheless, we believe that the five selected entities represent the best potential for future development, and we are sure that their pursuit of innovation will end in success.

The language translation system for VOXReality prioritizes cultural sensitivity and artistic integrity by relying on these literary translations, which capture the cultural nuances and emotional subtleties of the original text. To ensure that these aspects are preserved throughout the development, we conduct thorough evaluations of the translation outputs through internal checks. This evaluation is crucial for verifying that the translations maintain the intended cultural and artistic elements, thereby respecting the integrity of the original performance.

How do you think the VOXReality Open Call and the coaching process will shape the success and growth of innovative projects in the XR and AI fields?

I believe that the idea of cascade funding is crucial for discovering the potential of small teams of creative professionals, and projects like VOXReality certainly help them take their activities to a higher level and a bigger audience. The role of a coach is to ensure the successful implementation of each TP’s project within VOXReality, but also to show the bigger picture of possibilities within the sector of publicly funded projects.

What excites you most about the Third-Party Projects joining VOXReality, and how do you believe AI and XR technologies will reshape the industries they are targeting?

The cooperation with them. It is very interesting to see how they work and how they interact: the dynamism and agility, while at the same time keeping to deadlines and meeting expectations. It is something that can inspire not only them but also bigger entities to sometimes think outside the box and leave the comfort zone. For some of these entities, VOXReality is a game changer in their entrepreneurial history, and we are very happy to be part of it. XR technologies have great potential to change and shape our everyday life, but we always need to see the real, social value in what we are doing with XR technologies. That is one of the mottos we have in VOXReality: to bring real value to society.


Mateusz Kowacki

EU Project Manager @ F6S


Hybrid XR Training: Bridging Guided Tutorials and Open-Ended Learning with AI

Leesa Joyce

Head of Research Implementation at Hololight

Training methodologies significantly impact skill acquisition and retention, especially in fields requiring psychomotor skills like machining or other technical tasks. Open-ended and close-ended training represent two distinct pedagogical frameworks, each with unique advantages and limitations.

Close-Ended Training

Traditional close-ended training is characterized by its structured and prescriptive approach. Tasks are typically designed with a single correct pathway, focusing on minimizing errors and ensuring compliance with predefined standards. This method is effective for teaching specific skills that require strict adherence to safety protocols or operational sequences, such as in high-stakes environments like aviation or surgery (Abich et al., 2021).

However, close-ended systems often limit learners’ creativity and adaptability by discouraging exploration. They can result in a rigid learning experience, where trainees may struggle to apply knowledge flexibly in unstructured real-world scenarios (Studer et al., 2024). Additionally, such systems may reduce engagement, as they often emphasize repetition and discourage deviation from the expected path.

Open-Ended Training

Open-ended training, in contrast, fosters exploration and self-directed learning. It is rooted in constructivist principles, emphasizing active engagement and the development of problem-solving skills through exploration. This approach allows multiple pathways to achieve the same goal, encouraging learners to experiment and understand the underlying principles of tasks (Land & Hannafin, 1996).

In the context of psychomotor skills, open-ended training enables learners to adapt to different tools, approaches, and constraints. For example, an open-ended VR system for machining skills, as demonstrated by Studer et al. (2024), allows trainees to achieve objectives using various methods while enforcing critical protocols where necessary. This flexibility mirrors real-world scenarios where tasks rarely follow a single blueprint, enhancing learners’ readiness for practical challenges.

Benefits and Challenges

Open-ended training excels in promoting creativity, adaptability, and deeper conceptual understanding. Studies have shown that learners trained in open-ended environments often exhibit better problem-solving abilities and higher engagement levels (Ianovici & Weissblueth, 2016). For instance, in the manufacturing industry, employees may encounter situations requiring innovative approaches to meet production goals. An open-ended framework better equips them for such challenges.

However, this method may be less suitable for beginners who require a clear framework to build foundational skills. Research suggests that novices benefit from close-ended approaches to develop initial competence before transitioning to more exploratory methods (Ianovici & Weissblueth, 2016).

Applications in Modern Training

The integration of technologies such as Virtual Reality (VR) and Augmented Reality (AR) into learning processes has amplified the potential of open-ended training. Extended Reality (XR) platforms can simulate diverse scenarios, offering real-time feedback and dynamic task adjustments to accommodate different learning styles and levels of expertise (Abich et al., 2021). In contrast, close-ended modules provide step-by-step instructions for specific tasks, ensuring accuracy and consistency.

For example, the open-ended XR training system for machining tasks by Studer et al. (2024) combines guided tutorials with open-ended practice. This hybrid approach balances structure and flexibility, addressing the limitations of both methods.

The choice between open-ended and close-ended training should align with the learners’ needs, the complexity of the skills being taught, and the desired outcomes. While close-ended training ensures compliance and foundational competence, open-ended training prepares learners for the dynamic and unpredictable nature of real-world challenges. Leveraging both approaches in a complementary manner, particularly through advanced technologies like XR, offers a comprehensive framework for effective skill development.

Hybrid Learning in Industrial Assembly Lines: VOXReality’s Transformative Approach

The VOXReality project revolutionizes hybrid learning in industrial assembly lines by integrating cutting-edge AI-driven natural language processing and speech recognition modules. This approach addresses the key challenges of open-ended training, such as a lack of familiarity with machinery, uncertainty about assembly protocols, safety concerns, and insufficient guidance. By offering real-time interaction and support, VOXReality fosters an environment where workers can learn dynamically and creatively without feeling overwhelmed. The system enables users to receive immediate feedback and contextual instructions, paving the way for more efficient and engaging open-ended training scenarios. VOXReality not only enhances workforce competence but also ensures a safer and more intuitive learning process in industrial settings.

References
  1. Abich, J., Parker, J., Murphy, J. S., & Eudy, M. (2021). A review of the evidence for training effectiveness with virtual reality technology. Virtual Reality, 25(4), 919–933.
  2. Ianovici, E., & Weissblueth, E. (2016). Effects of learning strategies, styles, and skill level on motor skills acquisition. Journal of Physical Education and Sport, 16(4), 1169.
  3. Land, S. M., & Hannafin, M. J. (1996). A conceptual framework for theories-in-action with open-ended learning environments. Educational Technology Research and Development, 44(3), 37–53.
  4. Studer, K., Lie, H., Zhao, Z., Thomson, B., & Turakhia, D. (2024). An Open-Ended System in Virtual Reality for Training Machining Skills. CHI EA ’24.

Beyond the Jargon: Coordinating XR and NLP Projects Without Losing Your Headset

Extended Reality (XR) technologies provide interactive environments that facilitate immersion by overlaying digital elements onto the real world, by allowing digital and physical objects to coexist and interact in real time, or by transporting users into a fully digital world. As XR technologies evolve, there is an increasing demand for more natural and intuitive ways to interact with these environments. This is where Natural Language Processing (NLP) comes in. NLP technologies enable machines to understand, interpret, and respond to human language, providing the foundation for voice-controlled interactions, real-time translations, and conversational agents within XR environments. Integrating NLP technologies into XR applications is a complex process and requires the collaborative effort of experts from a wide range of areas, such as artificial intelligence (AI), computer vision, human-computer interaction, user experience design and software development, as well as domain experts from the fields the XR application is targeting. Consequently, such an effort requires a comprehensive understanding of both the scientific underpinnings and the technical requirements involved, as well as targeted coordination of all stakeholders in the research and development processes.

The role of a scientific and technical coordinator is to ensure that the research aligns with the development pipeline, while also facilitating alignment between the interdisciplinary teams responsible for each component. The scientific and technical coordinator needs to ensure a smooth and efficient workflow and facilitate cross-team collaboration, know-how alignment, scientific grounding of research and development, task and achievement monitoring, and communication of results. In VOXReality, Maastricht University’s Department of Advanced Computing Sciences (DACS) serves as the scientific and technical coordinator.

Our approach to scientific and technical coordination, centered on the pillars of Team, Users, Plan, Risks, Results, and Documentation, aligns closely with best practices for guiding the research, development, and integration of NLP and XR technologies. Building a Team of interdisciplinary experts, having clear communication, and ensuring common understanding among the team members fosters innovation and timely and quality delivery of results through collaboration. A focus on Users from the beginning until the end ensures that the project is driven by real-world needs, integrating feedback loops to create intuitive and engaging experiences. A detailed Plan, with well-defined milestones and achievement goals, provides structure and adaptability, ensuring the project stays on track despite challenges. Proactively addressing Risks through contingency planning and continuous performance testing mitigates potential disruptions. Tracking and analyzing Results against benchmarks ensures the project meets its objectives while delivering measurable value. Finally, robust Documentation preserves knowledge, captures lessons learned, informs the stakeholders, and paves the way for future innovation.

The VOXReality project consists of three phases, each requiring a dedicated approach to scientific and technical coordination. Phase 1 is the “Design” phase, where the focus is on designing the research activities as well as the challenges, defined by the use cases, that they seek to address. This phase mainly concerns requirements, which need to be collected from different stakeholders, analysed in terms of feasibility, scope and relevance to the project goals, and mapped to functionalities in order to derive the technical and systemic requirements for each data-driven NLP model and the envisaged XR application. Phase 2 is the “Development” phase, where the core development of the NLP models and the XR applications takes place. In this phase, the models need to be verified and validated on a functional, technical and workflow basis, and the XR applications that integrate the validated NLP models need to be tested both by the technical teams and by end-users. Finally, Phase 3 is the “Maturation” phase, where the VOXReality NLP models and XR applications are refined and improved based on the feedback received from the evaluation of Phase 2. A thorough technical assessment of the models and applications needs to be conducted before their final validation with end-users in the demonstration pilot studies. Moreover, five projects outside of the project consortium will have access to the NLP models to develop new XR applications.

Currently, we have completed the first two phases: requirement extraction is complete, the first versions of the NLP models have been released, the first versions of the XR applications have been developed, and software tests and user pilot studies have been completed successfully. In Phase 1, from the scientific and technical coordination point of view, weekly meetings were organized. These meetings allowed each team involved in the project to familiarize themselves with the various disciplines represented and the terminology used throughout the project, to consolidate viewpoints regarding the design, implementation, and execution of each use case, and to formally define and document the use cases. In Phase 2, bi-weekly technical meetings and monthly achievement meetings were conducted. These meetings allowed us to monitor progress on the technical tasks, such as developments, internal tests and experiments, as well as the achievements reached and the methodologies followed to evaluate them.

This structured approach and the coordinated effort of all team members in the VOXReality project resulted in the release of 13 NLP models, 4 datasets, a code repository that includes 6 modular components, and 3 use case applications, as well as the publication of 4 deliverables in the form of technical reports and 6 scientific articles in journals and conferences.


Yusuf Can Semerci

Assistant Professor in Computer Vision and Human Computer Interaction at Maastricht University


Augmented Reality Theatre: Expanding Accessibility and Cultural Heritage Through XR Technologies

As new XR technologies emerge, the potential for significantly increasing the accessibility of theatrical performances grows, creating new opportunities for inclusion and ensuring that wider audiences can fully experience these performances. Our project exemplifies the benefit of using new technologies to preserve cultural heritage and improve accessibility through a pilot on Augmented Reality Theatre, realized by the joint efforts of Gruppo Maggioli, a leading IT company in the Italian market, Adaptit, a telecommunications innovator, and the Athens Epidaurus Festival, one of Greece’s leading cultural organisations and organiser of the summer festival of the same name. Our project is part of the European Union-funded Research and Innovation Action VOXReality, investigating voice-driven interaction in XR spaces.

Pilot 1 at Alkmini Theatre in Athens, Greece, with Lefteris Polychronis
Captions in theatres

Captions are an essential feature for increasing accessibility and inclusivity in theatres. They provide real-time text descriptions of spoken dialogue or important auditory cues. Primarily designed to assist individuals with hearing impairments (e.g. deaf or hard of hearing), they can provide comprehension support for any member of the audience. Providing captions allows any individual to follow the narrative, understand nuanced dialogues, and appreciate the performance’s full context.

Translations of the captions are designed to allow non-native speakers or those who do not understand the performance’s original language to also fully engage with the content. This feature is particularly beneficial in culturally diverse communities or international venues where audiences may come from various linguistic backgrounds. Caption translations open up educational and cultural exchange by broadening the reach of performances to global audiences.

Delivery formats: Open vs. Closed captions

Typical caption delivery in theatres falls under two categories: open captions and closed captions. Open captions are displayed on screens placed around the stage or projected onto the stage itself. They are visible to everyone in the audience and cannot be turned off. Since they are essentially a part of the theatrical stage, they can be designed to artistically blend with the stage’s scenic elements. Open captions fall short when it comes to translations, since only a limited number of languages can be displayed simultaneously. Furthermore, their readability is not uniform across all audience seats, since distance, angle and obstacles affect visibility.
Closed captions are typically displayed on devices, such as captioning devices or smartphone apps, that can be activated by audience members themselves. They provide flexibility to be turned on or off depending on the individual’s need and allow for customizable settings, such as font size and color adjustments, catering to individual preferences. They are also ideal for caption translations, since each user can select their preferred language.

With regards to accessibility and inclusivity, closed captions are a preferable option due to the extensive customizations, which can improve readability and comprehension. On the downside, they require a more elaborate technical framework to synchronize the delivery of the captions to the audience’s devices, and they raise considerations about the usability of the device or application on the audience’s side.
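As a rough illustration of what such a synchronization framework might involve, the sketch below broadcasts caption cues to connected audience devices over WebSockets (using the third-party Python websockets package); the message format and the cue shown are purely hypothetical:

```python
import asyncio
import json
import websockets  # third-party package: pip install websockets

connected = set()   # audience devices currently connected

async def handler(websocket):
    """Register an audience device and keep its connection open."""
    connected.add(websocket)
    try:
        await websocket.wait_closed()
    finally:
        connected.discard(websocket)

def push_caption(text: str, language: str) -> None:
    """Broadcast one caption cue to every connected device."""
    websockets.broadcast(connected, json.dumps({"lang": language, "text": text}))

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        push_caption("Example caption cue", "en")  # hypothetical cue from the caption operator
        await asyncio.Future()                     # keep serving until interrupted

# asyncio.run(main())
```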

AR closed captions

Closed captions are usually delivered using smartphone screens, but they can also be delivered using AR glasses. AR glasses can display the captions directly on the lenses with minimal obstructions to the user’s visual field. This allows the user to focus on a single visual frame of reference, instead of looking back and forth between smartphone screen and stage. This makes for an improved user experience without fear of missing out and can also benefit comprehension because of reduced mental workload. The AR delivery mode multiplies the potential benefits in terms of accessibility, but also the usability concerns and the theatre’s technological capacity.

Contextual Commentary

Aiming to foster a deeper connection with the performance, another feature introduces contextual commentary. Delivered through AR glasses, this commentary may include background information such as character insights, cultural and historical context, or artistic influences. This approach enhances artistic expression by giving theater directors the ability to curate and control the information shared with the audience, but it may also serve as a powerful tool for the preservation and sharing of the cultural context of theatrical plays. This is especially needed for the preservation of the ancient Greek culture and its dissemination to a wider audience. The interactive and immersive mode of delivery allows for a dynamic presentation, overcoming cultural and language barriers and making performances more inclusive.

VoxReality contribution

To assess those benefits in practice, our project aims to deliver an excerpt of an Ancient Greek tragedy to an international audience amplified with augmented reality translated captions and audiovisual effects. The play selected for the project is Hippolytus by Euripides, translated by Kostas Topouzis and adapted by the Athens Epidaurus Festival’s Artistic Director, Katerina Evangelatos. This is an ambitious pilot whose results can help determine AEF’s future course for performances of international appeal – a challenging task taking AEF’s heavy cultural weight into account.

The first user evaluation was completed in May 2024 with a closed performance. Despite being at an early technical and aesthetic level, the initial evaluation was decisively positive with users stating that they would be interested in attending this kind of theatre in real conditions, and that they saw practical benefit and artistic merit in the provided features. Negative feedback was focused on the technical performance of the system and the learning curve of the AR application.

Screen recording from the AR glasses during performance, showing AR captions and visual effects.
Future Perspectives: Enhancing Theatre Through Innovation

An important potential aspect of future theatre will be the audience’s ability to individualize what is otherwise a collective experience—whether for practical reasons, as in VoxReality, or for various artistic purposes. This technology-supported sensitivity can allow for broader participation, and thus broader representation, in the future of the performing arts. Our relationship with the future, though, can be shaped by revisiting our relationship with the past: through this new lens we can lift linguistic barriers to exchange cultural works between communities worldwide, and we can revisit our own cultural heritage with a renewed understanding, both of which can shape our contemporary cultural identity. This is an exciting era of fast-moving changes that leave us with the challenge of comprehending our own potential – a challenge that will be determined by our ability to disseminate knowledge and promote collaboration.

The first public performances will be delivered in May 2025 in Athens, Greece, during the Festival’s 70th anniversary year, and will be open for attendance through a user recruitment process. Theatre lovers who are not fluent in Greek are wholeheartedly welcome to attend.


Olga Chatzifoti

Extended Reality applications developer working with Gruppo Maggioli for the design and development of the Augmented Reality use case of the VOXReality HORIZON research project. She is also a researcher in the Department of Informatics and Telecommunications of the University of Athens. Under the mentorship of Dr. Maria Roussou, she is studying the cognitive and affective dimensions of voice-based interactions in immersive environments, with a focus on interactive digital narratives. She has an extensive, multidisciplinary educational background, spanning from architecture to informatics, and has performed research work on serious game design and immersive environments in Europe, USA and the UK.

&

Elena Oikonomou

Project manager for the Athens Epidaurus Festival, representing the organization in the HORIZON Europe VOXReality research project to advance innovative, accessible applications for cultural engagement. With a decade of experience in European initiatives, she specializes in circular economy, accessibility, innovation and skills development. She contributes a background that integrates insights from social sciences and environmental research, supporting AEF’s commitment to outreach and inclusivity. AEF, a prominent cultural institution in Greece, has hosted the renowned annual Athens Epidaurus Festival for over 70 years.


What is to my left?

As is usual when comparing the capability of humans against machines in intelligence tasks, we notice that people can effortlessly categorise basic spatial relationships between objects, which is useful when reasoning, planning, or engaging in a conversation to reach a goal. The objects may be far from the observer, and the objects themselves may be far apart from each other. In any possible setting, we may want to know how two objects relate to each other in terms of a fixed set of spatial relationships that are commonly used in daily life.

Computationally, we may need to infer these relationships given only a colour photograph of some objects and rectangular boxes covering the objects of interest. For example, given that input, we may want to state that “one object is below another object” — or, in another case, that “object A is to the left of object B, and object A is behind object B”. We immediately see that a pair of objects can admit more than one spatial relationship at the same time.

Open-sourcing AI software

In the domain of Artificial Intelligence (AI), an algorithm to infer the spatial relationships between objects (usually considered in pairs) would be useful if it were implemented and shared with developers around the world as a library routine that any AI developer would want to have in place. Over roughly the last fifteen years, sharing code publicly as open source has become a global trend. Code implementing very important algorithms or intelligent workflows is shared with any developer, provided that they acknowledge the terms of a license agreement such as the GPL, LGPL or the MIT license. The need for developers to reinvent the wheel for basic tasks then becomes smaller and smaller as code is continuously contributed publicly, offering robust implementations of algorithms for different intelligent tasks across a range of programming languages. If a developer still cannot find what they need in an open-source library dedicated to a specific problem domain (for example, computer vision), they can dig into the available code and extend it themselves to fit their technical requirements. If their contribution is important and useful for other programmers who need the same features, they can submit their code changes to the maintainers of the library (or any other open-source software) for review. Hopefully, their contribution will be included in a future release of the software.

Capturing failures

Software engineers developing robotics applications, for example, would want a set of such routines when building simulation workflows for robots interacting with objects. On one side of the coin, these routines should be reliable enough to be reused in software applications that feature sufficiently correct error handling and some ability to leverage the failure evidence generated by the model. These are useful in order to achieve correctness and better error control in the underlying application.

What AI programs “see”

Although humans can reason effortlessly and very accurately about the basic spatial relationships between pairs of objects, this task is not as easy for computers to solve. While humans can, for instance, see two objects and state that “the blue car is next to the lorry”, computers are only given a rectangular table of numbers that define the red, green and blue colour intensities of its cells; these cells correspond to the pixels of the underlying colour image. The goal, then, is to use a program that takes in this rectangular array of pixel intensities together with bounding boxes covering two objects of interest, and decides how the two objects inside the two bounding boxes relate spatially. The program acts like an “artificial brain” targeted at solving only this task and nothing else, one that can make sense of the table of numbers it receives. It helps to recall a basic fact taught in introductory programming courses: programs implement algorithms, and programs implementing algorithms receive input data and produce output data; here, the sequence of steps the program executes finally yields a decision about the given input.
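In code, the interface of such a program could look like the following sketch; the type aliases and the function name are illustrative, the relation list is the one from the benchmark described later in this article, and the body is left as a placeholder for a trained model:

```python
import numpy as np

# What the "artificial brain" actually receives: an H×W×3 array of colour
# intensities and two axis-aligned bounding boxes (x_min, y_min, x_max, y_max).
Image = np.ndarray                       # shape (H, W, 3), dtype uint8
Box = tuple[int, int, int, int]

RELATIONS = ["above", "behind", "in", "in front of", "next to",
             "on", "to the left of", "to the right of", "under"]

def predict_relations(image: Image, subject_box: Box, object_box: Box) -> list[str]:
    """Placeholder: a trained classifier mapping raw pixels + boxes to relation labels."""
    raise NotImplementedError("a trained model such as RelatiViT would go here")
```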

Usefulness of spatial relation prediction

The steps implemented in a program that identifies the spatial relationships between two objects in an image follow an algorithm designed to take the input describing the two objects and produce as output the actual spatial relation from a fixed set of relations. At any given moment, the program may be correct about the output it produces, or it may give a wrong answer; having such a program be correct 100% of the time seems impossible at the moment, though future advances may change this. The output may be reported to the person operating the computer, or it can be passed on to another program that considers the predicted relation and then makes other decisions. For example, we may want a program that executes a particular operation provided that one or more conditions hold. In a monitoring application that receives data from a camera, one case of conditional execution that considers the spatial relationship between two detected entities (objects) could be: “if a (detected) person is on the (detected) staircase, then turn the light on”.
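The staircase rule then reduces to a simple conditional over the classifier’s predicted relations; the function below and its arguments are hypothetical glue code, independent of any particular model:

```python
def on_new_camera_frame(relations: list[str], turn_light_on) -> None:
    """Hypothetical monitoring logic built on top of a spatial-relation classifier.

    `relations` is whatever the classifier predicted for the (person, staircase)
    pair, e.g. ["on"] or ["next to", "in front of"].
    """
    if "on" in relations:   # "a (detected) person is on the (detected) staircase"
        turn_light_on()

# Example usage with a dummy action:
on_new_camera_frame(["on"], lambda: print("light on"))
```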

Notice that by being able to reliably check that “a (detected) person is on the (detected) staircase”, developers can completely bypass the need to write complicated geometric or algebraic rules defining what it means for one object to be on another. Hand-crafting such rules is a potential failure point in the development process: the rules may be wrong for some inputs while working fine for others. As a downside, a machine learning algorithm for this task will also make errors, and the application logic of our program would want to “know” why. Fortunately, the advances discussed so far enable us to write applications that reason about the spatial relationships of objects in colour images.

RelatiViT: a state-of-the-art model
Figure 1. Examples of colour images containing several objects. Only two objects are marked, with a red and a blue bounding box, and for each case the classification of a particular spatial relation is shown. Image adapted from [1].

It is important to note that computer algorithms are still not excellent at deciding the spatial relationships between (pairs of) objects. In a recent paper presented by Wen and collaborators [1] at ICLR 2024 in Vienna, the authors devised modern spatial relationship classification algorithms based on deep convolutional neural networks or on Transformer [2] deep neural networks. They singled out one of their models, called RelatiViT, as superior, identified through comparative experiments on two benchmarks. This computer vision algorithm can decide how two objects relate spatially when it is given a colour photograph of the objects with their surrounding background, along with rectangular bounding boxes covering the two objects.

Wen et al. used two benchmark datasets with examples of objects and bounding boxes covering them (see Figure 1); a portion of the data was reserved to train their spatial relationship classifier, and another portion was used to see how well the algorithm generalises to yet unseen data. This is standard practice when building a machine learning model, since we want to evaluate empirically how good the model is; evaluating a model on examples that were used to train it is undesirable (although, outside of model testing, we may do so if we wish). The first benchmark provides pairs of objects in 30 spatial relationships, and the second benchmark provides 9. Interestingly, the 9 spatial object relationships that the latter benchmark considers are: “above”, “behind”, “in”, “in front of”, “next to”, “on”, “to the left of”, “to the right of” and “under”.

Quantitative score of success

For the two benchmarks, the authors report that the average ratio of correct spatial relationship classifications, computed per spatial relationship, is a little higher than 80%. This essentially means that, on the controlled benchmark, RelatiViT responds correctly on average to about 8 out of 10 inputs for any given spatial relationship, when all of the available test cases in the benchmark are tried.
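The figure quoted above is an average over per-relationship accuracies; as an illustration (the benchmarks’ exact evaluation protocol may differ), such a metric can be computed as follows:

```python
import numpy as np

def mean_per_relation_accuracy(y_true: list[str], y_pred: list[str],
                               relations: list[str]) -> float:
    """Average of the per-relationship accuracies over all relations present in y_true."""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    accs = []
    for rel in relations:
        mask = y_true == rel                 # test cases whose ground truth is this relation
        if mask.any():
            accs.append((y_pred[mask] == rel).mean())
    return float(np.mean(accs))

# Toy example:
print(mean_per_relation_accuracy(["on", "on", "under"], ["on", "under", "under"],
                                 ["on", "under"]))
```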

Adoption of advances circa ‘17

Over the last seven years, a basic and thriving technique in general-purpose deep learning has been a machine learning model called the Transformer. This model was proposed in 2017 by Vaswani and collaborators [2], and it had been cited more than 139,000 times at the time this article was written. The Transformer is an advance in deep learning that researchers have been studying, reusing and redesigning in formulations of different sorts for machine learning problems. An essential condition for accepting a new machine learning model as a successor (or winning) model for a particular problem is that models employing Transformer-like formulations prove empirically better than models built on previous regimes of basic models or algorithms (such as, for instance, Generative Adversarial Networks or other past developments). Superiority is always measured in terms of one or more quantitative metrics, although this practice has received constructive criticism from researchers in recent years. At this point, there is a subtle point worth knowing: accepting that, for instance, Transformers are successful successors to previous theory does not devalue that previous theory. It does, however, imply that a better solution may exist by reusing the recent advance, according to a range of quantitative metrics (which are still not the end of the story when we compare a sequence of models in terms of their merits).

Basic input/output in a Transformer

The basic operation performed by a Transformer model is to receive as input a list of vectors and output a list of corresponding vectors, after first identifying and capturing the true associations between the vectors in the input list. This model is used in a very large set of basic AI problems; some important applications or application areas are image segmentation, classification problems of all sorts, speech separation, and problems in the remote sensing of Earth observation data. Researchers have been committing time and effort to reformulating virtually all known basic machine learning problems (like classification, clustering, etc.) around the idea of the Transformer model by Vaswani et al. [2].
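This list-of-vectors-in, list-of-vectors-out behaviour can be seen directly with PyTorch’s built-in Transformer encoder; the dimensions below are arbitrary:

```python
import torch
import torch.nn as nn

d_model = 256                        # dimensionality of each input vector
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

tokens = torch.randn(1, 10, d_model)  # a list (sequence) of 10 input vectors
outputs = encoder(tokens)             # 10 output vectors, one per input vector
print(outputs.shape)                  # torch.Size([1, 10, 256])
```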

Structure of the RelatiViT
Figure 2. Depiction of the four object-pair spatial relationship classification models from the recent study of Wen and collaborators [1]. Image adapted from reference [1].

Wen et al. [1] considered four models (see Figure 2) that take as input a colour image and two bounding boxes covering two objects of interest to the user. In this article we focus only on the rightmost model, the one called RelatiViT. RelatiViT is a state-of-the-art model that encodes not only information about the two objects but also information describing the context of the image; people certainly employ such cues in their own decisions. The context is the clipped portion of the image enclosed by the union of the bounding boxes covering the two objects, the subject and the object; for an example of what context looks like, see Figure 2 (a). Clearly, the information (or even the raw data) surrounding the two objects is very important for deducing how they are arranged spatially in a colour image.
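
As a small illustration of the context region described above, the following sketch computes the box enclosing the union of two bounding boxes; it mirrors the textual description rather than the exact cropping procedure of [1].

```python
def context_box(subj_box, obj_box):
    """Smallest axis-aligned box enclosing both input boxes, as (x1, y1, x2, y2)."""
    x1 = min(subj_box[0], obj_box[0])
    y1 = min(subj_box[1], obj_box[1])
    x2 = max(subj_box[2], obj_box[2])
    y2 = max(subj_box[3], obj_box[3])
    return (x1, y1, x2, y2)

# The context crop spans both objects plus the background lying between them.
print(context_box((10, 20, 80, 90), (100, 30, 160, 95)))  # (10, 20, 160, 95)
```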

The RelatiViT model processes data in five basic steps: (a) it initially considers small image patches residing in the background of the image and small patches residing within the regions of the two objects, thereby creating three lists of patch embeddings; (b) these three lists of embeddings are passed through a ViT encoder [3]; (c) the ViT encoder recalculates (or “rewrites”) the vectors in each list so that they become better related to one another, producing three updated lists of embeddings of the same size; (d) since each of the two objects should be described by a single vector, RelatiViT aggregates the complementary information in each set of embeddings with a pooling operation, computing one global representation from the partial representation vectors; and finally, (e) the pooled representations of the two objects and the representation of the object context are passed to a multilayer perceptron (MLP) that decides the spatial relationship characterising the two objects. The MLP therefore learns to map object-pair features to spatial relationship classes when RelatiViT is trained on example triplets (a subject, an object, and the ground-truth spatial relationship relating them), provided in small batches of data. To train RelatiViT, we may need at least one modern GPU mounted on a regular modern personal computer with sufficient RAM. Software stacks such as PyTorch and TensorFlow, developed over the last decade, allow machine learning and computer vision developers to prototype deep neural networks and train them on data.
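
The following is a deliberately simplified PyTorch sketch of steps (a) to (e); the dimensions, patch handling and pooling choices are invented for illustration and do not reproduce the implementation of [1].

```python
import torch
import torch.nn as nn

class ToyRelationNet(nn.Module):
    """Simplified sketch of a RelatiViT-style pipeline; all details are invented."""

    def __init__(self, dim=64, num_relations=9):
        super().__init__()
        # (a) patch embeddings: flattened 16x16 RGB patches projected to `dim`;
        # a real system would use the patch embedding of a pretrained ViT.
        self.patch_embed = nn.Linear(16 * 16 * 3, dim)
        # (b)-(c) a Transformer encoder relating all patch vectors to each other.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # (e) an MLP head mapping pooled features to spatial relationship classes.
        self.head = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, num_relations))

    def forward(self, subj_patches, obj_patches, ctx_patches):
        # Each input has shape (batch, num_patches, 16 * 16 * 3).
        n = subj_patches.shape[1]
        tokens = torch.cat([self.patch_embed(subj_patches),
                            self.patch_embed(obj_patches),
                            self.patch_embed(ctx_patches)], dim=1)
        tokens = self.encoder(tokens)              # (b)-(c) contextualise patches
        subj = tokens[:, :n].mean(dim=1)           # (d) pool subject patches
        obj = tokens[:, n:2 * n].mean(dim=1)       # (d) pool object patches
        ctx = tokens[:, 2 * n:].mean(dim=1)        # pool context patches
        return self.head(torch.cat([subj, obj, ctx], dim=-1))  # (e) classify

model = ToyRelationNet()
fake = torch.randn(2, 16, 16 * 16 * 3)             # random stand-in for patches
logits = model(fake, fake, fake)
print(logits.shape)                                # torch.Size([2, 9])
```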

Generating explanations

Before we conclude this article, here is an important question that will always reappear when developing machine learning models: can we reuse models like RelatiViT in critical applications where errors could be harmful or intolerable? We should first recognise that developments like RelatiViT aim only to create good classifiers for recognising spatial relationships between objects. The models are crafted using the designer’s understanding of how such a model should be built, and no further features, such as validation of the classifications, are sought. One could quickly take this for a flaw of the method, but it is not: each piece of research has to define a scope, and only contributions within that scope are committed to the work.

How to prove (when that is possible at all) that a predicted spatial relationship is in fact the true relationship connecting two objects falls within the subject of explainability in deep learning. Explainability models uncover and report evidence about a particular decision, and they are relevant to almost all basic machine learning problems (including, for instance, classification and clustering). For example, if “object A is to the left of object B”, this may be the case because the mass of object B is situated to the right of the mass of object A. Another explanation could be that the centres of mass of the two objects are ordered in this way along the horizontal axis, which we can verify simply by comparing the horizontal projections of the two centres of mass; doing so lets us compute the relative spatial relationship between the two objects. We start to realise, then, that many explanations can describe the same event. Some explanations are equivalent but stated differently; others are complementary and are useful to report to the user of a spatial relationship classifier. An explanatory model in a deep learning system such as the one discussed here should provide the user with evidence that is as comprehensive as possible, and with as many pieces of evidence as the explanatory algorithm can produce by design.

There is another point that is critical by nature: can we simply trust explanations and treat them as correct without reasoning further about their correctness? The answer is no, unless the algorithm can provably produce explanations that are verified before they are reported to the user. This becomes possible when we restrict ourselves to a particular application out of the very large pool of possible machine learning problems and data.
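
To make the centre-of-mass argument above concrete, here is a small sketch that uses bounding-box centres as a proxy for centres of mass and reports a verifiable textual justification; it is one possible rule-based explanation for illustration, not part of the RelatiViT method or of LBEE.

```python
def box_centre(box):
    """Centre of an (x1, y1, x2, y2) bounding box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def explain_left_of(subj_box, obj_box):
    """Check whether the subject centre lies left of the object centre and say why."""
    sx, _ = box_centre(subj_box)
    ox, _ = box_centre(obj_box)
    if sx < ox:
        return f"subject centre x={sx:.1f} < object centre x={ox:.1f}: subject is to the left"
    return f"subject centre x={sx:.1f} >= object centre x={ox:.1f}: subject is not to the left"

# Example with invented boxes: the justification can be verified from the geometry.
print(explain_left_of((10, 20, 80, 90), (100, 30, 160, 95)))
```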

References
[1] Wen, Chuan, Dinesh Jayaraman, and Yang Gao. “Can Transformers Capture Spatial Relations between Objects?” arXiv preprint arXiv:2403.00729 (2024).

[2] Vaswani, A., et al. “Attention is all you need.” In Advances in Neural Information Processing Systems (2017).

[3] Dosovitskiy, A., et al. “An image is worth 16×16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).

Sotiris Karavarsamis

Research Assistant at Visual Computing Lab (VCL)@CERTH/ITI


&
Petros Drakoulis

Research Associate, Project Manager & Software Developer at Visual Computing Lab (VCL)@CERTH/ITI


VOXReality Awards €1M to Boost XR Innovation Across Five European Initiatives

The VOXReality team is proud to announce the results of its Open Call, which supports pioneering institutions in their quest to innovate within the field of extended reality (XR). Each beneficiary is awarded 200K EUR in funding as part of a 1-year programme aimed at extending application domains for immersive XR technologies. This initiative is designed to integrate cutting-edge AI models into real-world applications, enhancing human-to-human and human-to-machine interactions across various sectors, including education, heritage, manufacturing, and career development.

Empowering Innovation: Spotlight on Selected Projects and Beneficiaries

Following a comprehensive evaluation process, the VOXReality team has selected five dynamic projects as beneficiaries, showcasing innovative approaches to XR applications:

MindPort GmbH & LEFX GmbH (Germany) – AIXTRA Programme

AIXTRA focuses on overcoming language barriers in international digital education through a VR authoring tool with automated real-time translation. By introducing AI-based virtual training partners, it aims to create a more inclusive learning environment, facilitating effective remote training sessions for multilingual participants.

Animorph Co-operative (UK) – CrossSense Programme

CrossSense smart glasses empower people living with Dementia and Mild Cognitive Impairment to live independently by supporting their ability to recall information. This project seeks to enhance user interaction in XR environments, laying the groundwork for a fully commercialised version through user testing and an open-sourced association engine.

XR Ireland (Ireland) & Āraiši ezerpils Archaeological Park (Latvia) – VAARHeT Programme

The Voice-Activated Augmented Reality Heritage Tours (VAARHeT) project aims to enhance visitor experiences at the Āraiši ezerpils Archaeological Park. By leveraging VOXReality’s AI models, the initiative will offer personalised museum tours and facilitate real-time multilingual translation of live tour guides, enriching educational opportunities for diverse visitors.

KONNECTA SYSTEMS P.C. (Greece) & IKNOWHOW S.A. (Greece) – WELD-E Programme

WELD-E addresses critical challenges in the welding industry by integrating voice and vision-based AI systems within an XR environment. This initiative aims to provide remote support for robotic welding operations, improving communication through speech recognition and automated translation to create a more effective training experience.

DASKALOS APPS (France) & CVCOSMOS (UK) – XR-CareerAssist Programme

XR-CareerAssist seeks to innovate career development by offering personalised, immersive experiences tailored to individual users. This project integrates VOXReality’s AI models to provide real-time feedback, potential career trajectories, and educational pathways, ensuring accessibility for a diverse range of users.

VOXReality’s overarching goal is to conduct research and develop new AI models that will drive future XR interactive experiences while delivering these innovations to the wider European market. The newly developed models focus on addressing key challenges in communication and engagement in various contexts, including unidirectional settings (such as theatre performances) and bidirectional environments (like conferences). Furthermore, the programme emphasizes the development of next-generation personal assistants to facilitate more natural human-machine interactions.

As VOXReality continues to advance XR and AI innovation, the successful implementation of these projects will pave the way for more immersive, interactive, and user-friendly applications. By fostering collaboration and knowledge sharing among the selected institutions, the VOXReality team is committed to enhancing the landscape of extended reality technologies in Europe.

Ana Rita Alves

Ana Rita Alves is currently working as a Communication Manager at F6S, where she specializes in managing communication and dissemination strategies for EU-funded projects. She holds an Integrated Master's Degree in Community and Organizational Psychology from the University of Minho, which has provided her with strong skills in communication, project management, and stakeholder engagement. Her professional background includes experience in proposal writing, event management, and digital content creation.


Boosting Industrial Training through VOXReality: AR’s Edge Over VR with Hololight’s training application

Leesa Joyce

Head of Research Implementation at Hololight

In today’s rapidly evolving industrial landscape, the methods used for training and skill development are undergoing significant transformations. Companies increasingly seek innovative solutions to make their training more efficient, engaging, and adaptable in prototyping and manufacturing. Two prominent technologies leading this change are Augmented Reality (AR) and Virtual Reality (VR) [1]. Both offer immersive experiences, but they serve different purposes, especially when applied to training. In the VOXReality project, where AI-assisted voice interaction in XR spaces plays a key role in building better user interfaces, HOLO’s extended reality application Hololight SPACE, together with its assembly training tool, brings a new set of advantages to the user.

The Hololight SPACE, integrated into the VOXReality project, offers an augmented reality industrial assembly training system that allows workers to visualize and manipulate 3D computer-aided design (CAD) models. Through AR glasses, such as HoloLens 2, trainees can assemble components with the guidance of real-time feedback from a virtual training assistant. Augmented Reality in this context enables users to interact with both virtual models and the physical environment simultaneously, which holds several key benefits over Virtual Reality.

Overlaying 3D Objects on Real-World Environments

One of the main advantages of AR over VR is the ability to overlay virtual 3D objects onto real-world environments. In industrial assembly training, this feature is crucial when physical objects must align with virtual components. For example, AR allows a user to see a virtual engine model and align it directly on top of a real physical framework, enhancing spatial understanding. This real-time interaction between digital and physical elements ensures a seamless integration, allowing trainees to bridge the gap between virtual simulations and real-world applications.

In contrast, VR immerses users in a fully simulated environment where all objects, tools, and machinery are virtual. While this can be useful for certain training applications, it falls short when trainees need to practice in real-world contexts or with actual physical tools.

Dynamic Adaptation Based on Real-World Measurements

One of the key benefits of using AR in industrial assembly training is the ability to adapt virtual models based on real-world measurements. For example, Hololight SPACE allows for precise alignment of CAD models with real tools or machinery, ensuring accuracy in assembly tasks. The AR environment can scale or adjust virtual objects based on physical constraints, giving trainees a practical experience that is directly transferable to their real-world roles.

In VR, the environment is entirely digital, which means trainees may struggle to apply their knowledge when transitioning to real-life tasks. Without the ability to manipulate real objects, VR training can create a disconnect between theory and practice.

Enhanced Situational Awareness and Safety

In AR-based training, users remain aware of their surroundings, which is particularly important in industrial settings [2]. Hololight SPACE enables trainees to interact with both virtual and physical objects, all while remaining aware of their immediate environment, coworkers, and potential hazards. This situational awareness promotes a safer training environment, as trainees can avoid accidents or conflicts that might arise when entirely isolated from their surroundings, as is common in VR.

This added level of awareness is not possible in VR, where users are immersed in a completely digital world, which can lead to disorientation or accidents when trying to translate virtual skills into real-world tasks.

Reduced Cybersickness and Mental Load

Cybersickness is a common issue in VR training [3]. The disconnect between the user’s physical body and the virtual world can result in motion sickness and fatigue, especially during long training sessions. In contrast, AR presents virtual objects within the real world, eliminating the sensory mismatch that often leads to VR-induced cybersickness. By anchoring virtual elements to the trainee’s physical environment, AR reduces the mental and physical load, making it a more comfortable and sustainable training tool for industrial tasks.

Collaborative and Interactive Training

Another critical advantage of AR is the ability for users to see and interact with other people in the room [4, 5]. In an industrial training setting, this means that instructors or fellow trainees, whether physically or virtually present, can observe and provide real-time feedback while the trainee continues to engage with the virtual objects. This collaborative aspect of AR creates a more interactive learning environment, where knowledge is shared seamlessly between physical and digital spaces.

In contrast, VR isolates the user, making collaborative training more difficult unless all participants are also immersed in the same virtual environment.

References
  1. Oubibi, M., Wijaya, T.T., Zhou, Y., et al. (2023). Unlocking the Potential: A Comprehensive Evaluation of AR and VR in Education (LINK) 
  2. Akhmetov, T. (2023). Industrial Safety Using Augmented Reality and Artificial Intelligence (LINK) 
  3. Kim, Juno & Luu, Wilson & Palmisano, Stephen. (2020). Multisensory integration and the experience of scene instability, presence and cybersickness in virtual environments (LINK) 
  4. Syed, T. A., et al. (2022). In-depth Review of Augmented Reality: Tracking Technologies, Development Tools, AR Displays, Collaborative AR, and Security Concerns (LINK) 
  5. Timmerman, M. R. (2018). Enabling Collaboration and Visualization with Augmented Reality Technology (LINK) 

A Recap of the 5th VOXReality General Assembly

The VOXReality consortium gathered at Maastricht University’s Department of Advanced Computer Science for an impactful two-day General Assembly on October 30–31, 2024. This event brought together project partners and technical teams to align on the future of VOXReality’s pioneering AR and VR initiatives.

Day one kicked off with a warm welcome from MAG, leading into sessions on project planning that set a solid foundation for the days ahead. In-depth sessions followed on the AR Training Use Case and VR Conference Use Case, where HOLO and VRDAYS showcased recent pilot results, achievements, and provided hands-on demos of their immersive applications. An update on model deployment led by SYN highlighted technical progress, while F6S presented communication strategies to expand VOXReality’s public impact.

Day two focused on collaborative growth, starting with an Exploitation Workshop that explored paths for maximizing project impact. Next, our recently joined third-party contributors presented their Open Call projects, sparking engaging discussions and potential collaborations. The event closed with the AR Theatre Use Case, led by AF and MAG, which captivated attendees with pilot results and a demo showcasing AR’s potential in live theatre.

The VOXReality General Assembly showcased the power of innovation and collaboration, with each session reinforcing the project’s vision of immersive technology’s future. 

Stay tuned as VOXReality pushes forward on this exciting path! 🚀

Ana Rita Alves

Ana Rita Alves is an International Project Manager and current Communication Manager at F6S. With a background in European project management and a Master’s in Psychology from the University of Minho, she excels in collaborative international teams and driving impactful dissemination strategies.
