
A Recap of the 6th VOXReality General Assembly

From April 8th to 10th, the VOXReality General Assembly was held at the Maggioli headquarters in Santarcangelo di Romagna, Italy. It was three intense and inspiring days, during which partners from all over Europe gathered to share insights and internally test the project’s use cases: AR Training, AR Theatre, and Virtual Conference.

The first day was dedicated to hands-on experimentation. Technical teams set up their stations and carried out user testing sessions.

In the AR Theatre case, participants wore headsets that enriched the live theatrical performance with augmented reality elements, from real-time subtitle translations to interactive animations moving alongside the actors, allowing for customizable experiences tailored to audience preferences.

In the AR Training scenario, users assembled a virtual engine by manipulating fully augmented tools and components such as bolts, drills, and pliers. They tested the process both with and without the support of the VOXy voice assistant to assess which experience offered better usability.

Lastly, the VR Conference use case enabled testing of a hybrid format: some participants joined via laptop while others connected through VR headsets in separate rooms. This session explored interactions with the virtual assistant VOXy and real-time multilingual communication in virtual environments.

All demo participants completed targeted questionnaires to collect qualitative and quantitative data, which will help refine and optimize the project’s solutions.

The second day began with a project planning session, focusing on KPIs, risk management, and updates on multimodal AI for XR and model deployment. Later, technical partners presented preliminary data gathered during the first day of testing, discussed outcomes from each use case, and outlined the strategy for the second pilot phase.

The third and final day focused on sharing lessons learned, best practices, and success stories. Participants split into three groups, one per use case, and took part in a workshop aimed at identifying key strengths, challenges overcome, and strategies that worked well. Beyond pinpointing actionable insights, the workshop strengthened collaboration among partners and sparked new ideas for continuous improvement.

This General Assembly left us energized and enthusiastic: the progress achieved confirms that we are on the right path. We look forward to the next steps and to driving VOXReality toward new horizons!

Picture of Greta Ioli

Greta Ioli

Greta Ioli is an EU Project Manager in the R&D department of Maggioli Group, one of Italy's leading companies providing software and digital services for Public Administrations. After earning a degree in International Relations – European Affairs from the University of Bologna, she specialized in European projects. Greta is mainly involved in drafting project proposals and managing dissemination, communication, and exploitation activities.


XR EXPO 2025: VOXReality Open Call Projects AIXTRA and VAARHeT to Showcase XR Innovations! 🤩

Two exciting VOXReality Open Call projects, AIXTRA and VAARHeT, are set to unveil their latest advancements in XR and AI at the XR EXPO 2025.

📍 Location: Stuttgart, Germany 🇩🇪

🗓️ Dates: May 8-9, 2025

🏢 Venues: Porsche Arena (conference) and Hanns-Martin-Schleyer-Halle (exhibition)  

AIXTRA: Breaking Language Barriers in VR! 🗣️

📍 Booth: A17 (next to stage 2, Hanns-Martin-Schleyer-Halle)

AIXTRA delivers powerful tools for improved training and seamless communication in multi-user environments. Experience two immersive demos showcasing novel AI applications:  

  • See how intent recognition can boost training effectiveness.
  • Try AI-based translation and speech synthesis services. Hear yourself speak a foreign language!

Find the AIXTRA team members there:

VAARHeT: Transforming Heritage Tours with AR and AI! 🏛️

📍 Booth: B41 (Hanns-Martin-Schleyer-Halle Innenraum)

The VAARHeT project is developing an AI-powered AR application to create a transformative, immersive visitor experience at the Āraiši ezerpils Archeological Parks.  

Visit XR Ireland to explore AI-augmented applications of XR technologies using advanced voice-activated interaction:

  • Learn how to drive growth, reduce risk, and engage stakeholders in enterprise, cultural heritage, and international disaster response.
  • Try out work-in-progress pilot projects funded by the European Commission. 

Find the VAARHeT team members there:

Don’t miss out! Join XR EXPO 2025 in Stuttgart to explore the future of AI-powered XR with AIXTRA and VAARHeT! ✨


Honorable Mention at the IEEE VR 2025 Workshop (VR-HSA) for a Paper by CWI

Award: Honorable mention for the paper “User-Centric Requirements for Enhancing XR Use Cases with Machine Learning Capabilities” in the “Best Presentation” award category at the International Workshop on Virtual Reality for Human and Spatial Augmentation (VR-HSA), held in conjunction with IEEE VR 2025.

We are glad to share that our team from CWI (Centrum Wiskunde & Informatica) participated in and presented their work at the International Workshop on Virtual Reality for Human and Spatial Augmentation (VR-HSA), held in conjunction with IEEE VR 2025 in the beautiful coastal city of Saint-Malo, France, on March 9, 2025. At the workshop, we presented our paper, “User-Centric Requirements for Enhancing XR Use Cases with Machine Learning Capabilities,” authored by Sueyoon Lee, Moonisa Ahsan, Irene Viola, and Pablo Cesar. The paper is based on two use cases: (a) Virtual Conference, which mimics a real-life conference in a VR environment (VRDays Foundation), and (b) Augmented Theatre, which showcases a Greek play in an AR environment (Athens Festival). It describes our user-centric approach of conducting two focus groups to gather user requirements for these two use cases and to identify where ML technologies could be implemented using VOXReality technology modules. We also presented an overview of the full data collection, processing, and evaluation pipeline with a poster in a parallel session. We are happy to share that our presentation received the honorable mention in the Best Presentation Award category.

Poster for the paper presented in the IEEEVR2025 Workshop (VR-HSA)

We are excited to see our work contributing to the growing field of ML-enhanced XR user experiences. We extend our thanks to the use case owners (VRDays Foundation, Athens Festival AEF) and everyone who was part of the process and helped enable this work, as well as to the VR-HSA organizers and the broader XR community for supporting the discussions. This recognition motivates us to continue working towards more user-centric immersive experiences.

Photo Credits: Moonisa Ahsan at the VR-HSA Workshop at IEEE VR 2025

Abstract: The combination of Extended Reality (XR) and Machine Learning (ML) will enable a new set of applications. This requires adopting a user-centric approach to address the evolving user needs. This paper addresses this gap by presenting findings from two independent focus groups specifically designed to gather user requirements for two use cases: (1) a VR Conference with an AI-enabled support agent and real-time translations, and (2) an AR Theatre featuring ML generated translation capabilities and voice-activated VFX. Both focus groups were designed using context-mapping principles. We engaged 6 experts in each of the focus groups. Participants took part in a combination of independent and group activities aimed at mapping their interaction timelines, identifying positive experiences, and highlighting pain points for each scenario. These activities were followed by open discussions in semi-structured interviews to share their experiences. The inputs were analysed using Thematic Analysis and resulted in a set of user-centric requirements for both applications on Virtual Conference and Augmented Theatre respectively. Subtitles and Translations were the most interesting and common findings in both cases. The results led to the design and development of both applications. By documenting user-centric requirements, these results contribute significantly to the evolving landscape of immersive technologies.  

Keywords: Virtual Reality, VR conference, Augmented Reality, AR theatre, Focus groups, User requirements, Use cases, Human-centric design. 

Reference 

  1. S. Lee, M. Ahsan, I. Viola, and P. Cesar, “User-centric requirements for enhancing XR use cases with machine learning capabilities,” in Proceedings of the VR-HSA Workshop (IEEE VR 2025), March 2025.
Picture of Moonisa Ahsan

Moonisa Ahsan

Moonisa Ahsan is a post-doc in the DIS (Distributed & Interactive Systems) Group of CWI. She was also the external supervisor for the aforementioned thesis work. In VOXReality, she contributes to understanding next-generation applications within Extended Reality (XR), to better understanding user needs, and to leveraging that knowledge to develop innovative solutions that enhance the user experience in all three use cases. She is a Marie Curie Alumna, and her scientific and research interests are Human-Computer Interaction (HCI), User-Centric Design (UCD), Extended Reality (XR), and Cultural Heritage (CH).


Master’s Thesis titled “Enhancing the Spectator Experience by Integrating Subtitle Display in eXtended Reality Theatres” defended last December in Amsterdam

Master’s Student: Atanas Yonkov
Thesis Advisors (CWI): Moonisa Ahsan, Irene Viola and Pablo Cesar

Abstract: The rapid growth of virtual and augmented reality technologies, encapsulated by the term eXtended Reality (XR), has revolutionized the interaction with digital content, bringing new opportunities for entertainment and communication. Subtitles and closed captions are crucial in improving language learning, vocabulary acquisition, and accessibility, such as understanding audiovisual content. However, little is known about integrating subtitle displays in extended reality theatre environments and their influence on the user experience. This study addresses this gap by examining subtitle placement and design attributes specific to XR settings. Building on previous research on subtitle placement, mainly in television and 360-degree videos, this project focuses on the differences between static and dynamic subtitle variants. The study uses a comprehensive literature review, a Virtual Reality (VR) theatre experiment, and analytics to investigate these aspects of subtitle integration in the specific case of a VR theatrical Greek play with subtitles. The results show that the difference between the two variants is not significant, and both implementations produce high scores. However, thematic analysis suggests the preference for static over the dynamic variant depends heavily on the specific context and the number of speakers in the scene. Since this study focuses on a monologue theatrical play, the next step in future work would be to explore a “multi-speaker” play.

The partners from the DIS (Distributed and Interactive Systems) group of Centrum Wiskunde & Informatica (CWI) hosted and supervised a Master’s thesis [1] titled “Enhancing the Spectator Experience by Integrating Subtitle Display in eXtended Reality Theatres” by Atanas Yonkov at the University of Amsterdam (UvA). The advisors from CWI were Moonisa Ahsan, Irene Viola, and Pablo Cesar, and the university advisors were Prof. dr. Frank Nack and Prof. dr. Hamed Seiied Alavi. The thesis focuses on XR theatres, investigating subtitle integration in virtual reality (VR) theatre environments designed within the VOXReality project. The user study in the thesis was based on an extended VR version of the AR Theatre use case application of the VOXReality project, showcasing the Greek theatrical play Hippolytus by Euripides. The goal was to bridge the existing research gap by exploring optimal subtitle positioning in VR theatre, focusing on two key approaches: static and dynamic subtitles. In the study, the static subtitles (see Fig. 1a) are fixed relative to the user’s gaze, ensuring they remain within the viewer’s field of vision regardless of scene movement. The dynamic subtitles (see Fig. 1b) are anchored to objects, in this case the actors, moving naturally with them within the virtual environment.

Figure 1 (a) Static and (b) Dynamic subtitles in a theatrical play scene from a participant’s VR headset perspective
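To make the two placement strategies concrete, here is a minimal sketch (not code from the thesis) of how each variant could compute a subtitle anchor position per frame; the positions, offsets, and function names are illustrative assumptions.

```python
import numpy as np

def static_subtitle_position(head_pos, head_forward, distance=2.0, drop=0.4):
    """Head-locked placement: keep the subtitle a fixed distance along the
    viewer's gaze direction, slightly below eye level, regardless of scene motion."""
    forward = head_forward / np.linalg.norm(head_forward)
    return head_pos + forward * distance + np.array([0.0, -drop, 0.0])

def dynamic_subtitle_position(actor_pos, height_offset=2.2):
    """Object-anchored placement: attach the subtitle above the speaking actor
    so it moves naturally with them through the virtual scene."""
    return actor_pos + np.array([0.0, height_offset, 0.0])

# Hypothetical per-frame values for a seated viewer and one actor on stage.
head_pos = np.array([0.0, 1.7, 0.0])
head_forward = np.array([0.0, 0.0, 1.0])
actor_pos = np.array([1.5, 0.0, 4.0])

print(static_subtitle_position(head_pos, head_forward))   # fixed relative to gaze
print(dynamic_subtitle_position(actor_pos))               # follows the actor
```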

The study was conducted from May 13 to May 22, 2024, at the DIS Immersive Media Lab, Centrum Wiskunde & Informatica (CWI) in Amsterdam, The Netherlands. It examined how subtitle placement affects the user experience in a VR theatrical adaptation of a Greek play. Results indicated no significant difference in user experience between static and dynamic subtitle implementations, with both approaches receiving high usability scores. However, a thematic analysis revealed that user preference for static or dynamic subtitles was highly context-dependent. In particular, the number of speakers in a scene influenced subtitle readability and ease of comprehension: a) in monologue settings, static subtitles were often preferred for their stability and ease of reading; b) in potential future scenarios with multiple speakers, dynamic subtitles could enhance spatial awareness and dialogue attribution. Each session lasted approximately 60 minutes, with individual durations varying between 50 and 120 minutes, depending on participant familiarity and adaptability with VR headsets and controllers. Our findings, which will be detailed in future blog posts, contribute to the growing body of research on subtitle placement in immersive environments. This work builds upon previous studies of subtitle integration for television and 360-degree videos, extending the analysis to VR theatre settings, and it also informs several design and user experience decisions for the AR Theatre use case within the project. As future work, given that this study focused on a monologue performance, further research should extend the analysis to multi-speaker theatrical plays to explore subtitle effectiveness in complex dialogue scenarios.

Image Courtesy of Atanas Yonkov: Master’s Graduation Ceremony at University of Amsterdam (UvA) (2024)

[1] Atanas Yonkov, “Enhancing the Spectator Experience: Integrating Subtitle Display in eXtended Reality Theatres” (Master’s thesis), Universiteit van Amsterdam, 2024. Available at https://scripties.uba.uva.nl/search?id=record_55113

Picture of Moonisa Ahsan

Moonisa Ahsan

Moonisa Ahsan is a post-doc in the DIS (Distributed & Interactive Systems) Group of CWI. She was also the external supervisor for the aforementioned thesis work. In VOXReality, she contributes to understanding next-generation applications within Extended Reality (XR), to better understanding user needs, and to leveraging that knowledge to develop innovative solutions that enhance the user experience in all three use cases. She is a Marie Curie Alumna, and her scientific and research interests are Human-Computer Interaction (HCI), User-Centric Design (UCD), Extended Reality (XR), and Cultural Heritage (CH).


Luxembourg’s Immersive Days 2025

The Immersive Days 2025, held on March 4 and 5 in Luxembourg City, explored immersive technologies and their intersection with art, culture, and society. This two-day conference, organized by Film Fund Luxembourg in collaboration with the Luxembourg City Film Festival and PHI Montreal, brought together international experts, professionals, and artists active in the XR industry to discuss the latest developments and challenges in the field and underscore Luxembourg’s growing prominence in the immersive arts and virtual reality (VR) sectors.

This year’s conference again delivered a programme open to the general public and provided an opportunity to engage directly with the creators behind the immersive works featured in the Immersive Pavilion 2025.

Lectures and round tables began on March 4 at the Cercle Cité, gathering professionals and the general public and mainly featuring creators whose works were exhibited at this year’s Immersive Pavilion. Discussions centred on their unique creative processes, reflecting on how their fictional and personal stories were translated into immersive content, and on the challenges encountered during the ideation, production, and distribution process.

The second day, held at Neumünster, was reserved for industry professionals and delved into more technical and forward-looking topics within the XR industry. This day fostered international exchanges and promoted peer networking among professionals. Discussions during day 2 covered the current challenges in the preservation of digital content, funding opportunities for XR projects (showcasing the German regional funding system), and, last but not least, the impact of AI technology on immersive experiences, addressed in a panel titled “AI/XR: The Future of AI for Immersive and Virtual Arts”.

Among the guests of this year’s programme were Stéphane Hueber-Blies and Nicolas Blies, directors of “Ceci est mon cœur“; François Vautier, director of “Champ de Bataille”, awarded “Best Immersive Experience” at this year’s Immersive Pavilion; and Octavian Mot, director of “AI & Me: The Confessional and AI Ego“, a thought-provoking installation that explores the intricate relationship between humans and artificial intelligence.

Mot’s work invites participants into an intimate dialogue with an AI entity, challenging them to reflect on themes of identity, consciousness, and the evolving dynamics between human and machine. The installation was a highlight of the Immersive Pavilion 2025, showcasing the potential of AI to create deeply personal and immersive art experiences.

Image 1: "AI & Me" installation. User Analysis in process.

While “AI & Me” offers an introspective, artistic exploration of human-AI interaction, it was impossible not to draw parallels with VOXReality’s AI Agent, which represents a more utilitarian and non-intrusive application of AI in immersive environments. VOXReality’s AI Agent is designed to enhance user experiences in virtual spaces by providing responsive and adaptive interactions, serving roles ranging from virtual assistant to dynamic character within virtual narratives. “AI & Me”, by contrast, builds its narrative by “perceiving” users while remaining completely unbound by the industry-standard “rules of engagement” with humans, resulting in answers and interactions that can come across as cunning and raw, and leaving the user to deal with its sense of humour. Brilliant!

Image 2: "AI & Me" installation. User AI representation.

The juxtaposition of “AI & Me” and VOXReality’s AI Agent underscores the diverse applications of AI agents in today’s technological landscape. On one hand, AI is leveraged as a medium for artistic expression, prompting users to engage in self-reflection and philosophical inquiry. On the other hand, AI serves practical functions, improving user engagement and functionality within virtual environments. This duality highlights the versatility of AI agents and their growing significance across various domains.

In conclusion, Immersive Days 2025 in Luxembourg City successfully bridged the gap between art and technology, providing a platform for meaningful discussions and showcasing pioneering works in the field of immersive experiences. The event not only highlighted the current state of immersive art and technology but also set the stage for future innovations, emphasising the importance of interdisciplinary collaboration and the country’s significant contribution to the international immersive production landscape, a success largely attributed to the strategic initiatives of the Film Fund Luxembourg.

Picture of Manuel Toledo - Head of Production at VRDays Foundation

Manuel Toledo - Head of Production at VRDays Foundation

Manuel Toledo is a driven producer and designer with over a decade of experience in the arts and creative industries. Through various collaborative projects, he merges his creative interests with business research experience and entrepreneurial skills. His multidisciplinary approach and passion for intercultural interaction have allowed him to work effectively with diverse teams and clients across cultural, corporate, and academic sectors.

Starting in 2015, Manuel co-founded and produced the UK’s first architecture and film festival in London. Since early 2022, he has led the production team for Immersive Tech Week at VRDays Foundation in Rotterdam and serves as the primary producer for the XR Programme at De Doelen in Rotterdam. He is also a founding member of ArqFilmfest, Latin America’s first architecture and film festival, which debuted in Santiago de Chile in 2011. In 2020, Manuel earned a Master’s degree from Rotterdam Business School, with a thesis focused on innovative business models for media enterprises. He leads the VRDays Foundation’s team’s contributions to the VOXReality project.


Developing NLP models in the age of the AI race

The AI race intensifies

During the last 10-15 years, Natural Language Processing (NLP) has undergone a profound transformation, driven by advancements in deep learning, the use of massive datasets, and increased computational power. These innovations led to early breakthroughs such as word embeddings (Word2Vec [1], GloVe [2]) and paved the way for advanced architectures like sequence-to-sequence models and attention mechanisms, all based on neural architectures. It was in 2018 that the introduction of transformers, and especially of BERT [3] (released as an open-source model), enabled the contextualized understanding of language. Performance in NLP tasks like machine translation, sentiment analysis, or speech recognition has been significantly boosted, making AI-driven language technologies more accurate and scalable than ever before.

The “AI race” has intensified with the rise of large language models (LLMs) like OpenAI’s ChatGPT [4] and DeepSeek-R1 [5], which use huge architectures with billions of parameters and massive multilingual datasets to push the boundaries of NLP. These models dominate fields like conversational AI and can perform a wide range of tasks by achieving human-like fluency and context awareness. Companies and research institutions worldwide are competing to build more powerful, efficient, and aligned AI systems, leading to a rapid cycle of innovation. However, this race also raises challenges related to interpretability, ethical AI deployment and the accessibility of high-performing models beyond large tech firms.

But what did DeepSeek achieve? In early 2025, DeepSeek released its R1 model, which has been noted to outperform many state-of-the-art LLMs at a lower cost, causing a disruption in the AI sector. DeepSeek made its R1 model available on platforms like Azure, allowing users to take advantage of its technology. DeepSeek introduced many technical innovations that allowed its model to thrive (such as architectural innovations: a hybrid transformer design, the use of mixture-of-experts models, and auxiliary-loss-free load balancing); however, its main contribution was the reduction of reliance on traditional labeled datasets. This innovation stems from the integration of pure reinforcement learning (RL) techniques, enabling the model to learn complex reasoning tasks without the need for extensive labeled data. This approach not only reduces the dependency on large labeled datasets but also streamlines the training process, lowering the resource requirements and costs associated with developing advanced AI models.

Figure: DeepSeek architecture (taken from https://newsletter.languagemodels.co/p/the-illustrated-deepseek-r1)

The Enduring Relevance of the Models Used in VOXReality

At VOXReality, we take a fundamentally different approach and believe in the significant value brought by “traditional” AI models (especially for ASR and MT), particularly in specialized domain applications. We prioritize real open-source AI by ensuring transparency, reproducibility, and accessibility [6]. Unlike proprietary or restricted “open weight” models, our work is built upon truly open architectures that allow full modification and deployment without any limitations. This is why our open call winners [7] are able to build on top of the VOXReality ecosystem. Moreover, our approaches often require less computational power and data, making them suitable for scenarios with limited resources or where deploying large-scale AI models is impractical. Our models can be tailored to specific industries or fields, incorporating domain-specific expertise without extensive or expensive retraining. The implementation of models on a local scale (if chosen) can also offer enhanced control over data and compliance with privacy regulations, which can be a significant consideration in sensitive domains.
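As a minimal sketch of what running a fully open model locally can look like, the snippet below loads a publicly available Helsinki-NLP machine-translation checkpoint with the Hugging Face transformers library; the model name is a generic stand-in used purely for illustration, not one of the VOXReality models.

```python
from transformers import pipeline

# Load an open MT model once and run inference locally; no text leaves the machine.
# "Helsinki-NLP/opus-mt-en-de" is a public stand-in checkpoint, not a VOXReality model.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

result = translator("The assembly of the engine starts with the crankshaft.",
                    max_length=128)
print(result[0]["translation_text"])
```

Such an open checkpoint can then be fine-tuned on domain-specific text so that it handles specialized terminology without the cost of training a large general-purpose system.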

VoxReality’s Strategic Integration

At VOXReality, we strategically integrate traditional ASR and MT approaches to complement advanced AI models, ensuring a comprehensive and adaptable solution that leverages the strengths of state-of-the-art AI models. This focus on real open-source innovation and data-driven performance differentiates VOXReality from the rapidly evolving landscape of AI mega-models.

Picture of Jerry Spanakis

Jerry Spanakis

Assistant Professor in Data Mining & Machine Learning at Maastricht University

References

[1] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space.

[2] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics

[4] https://openai.com/chatgpt/overview/

[5] https://github.com/deepseek-ai/DeepSeek-R1

[6] https://huggingface.co/voxreality

[7] https://voxreality.eu/open-call-winners/


VOXReality’s Open Learning Revolution

Industrial training has traditionally followed rigid, stepwise instruction, ensuring compliance and accuracy but often at the cost of creativity and adaptability. However, with the rapid advancements in Extended Reality (XR) and Artificial Intelligence (AI), training methodologies are shifting toward more dynamic and flexible models.

At the heart of this transformation is VOXReality, an XR-powered training system that departs from traditional step-by-step assembly guides. Instead, it embraces a freemode open-learning approach, allowing workers to logically define and customize their own assembly sequences with complete freedom. This method enhances problem-solving skills, engagement, and real-world adaptability.

Unlike conventional training, which dictates a specific order of operations, VOXReality’s open-ended model empowers users to experiment, explore, and determine their optimal workflow. This approach offers several key benefits: workers are more engaged when they can approach tasks in a way that feels natural to them; trainees develop a deeper understanding of assembly processes through problem-solving rather than rote memorization; the system adapts to different skill levels, allowing experienced workers to optimize workflows while providing guidance to beginners; and, since real-world assembly is rarely linear, this method better prepares workers for unexpected challenges on the factory floor.

VOXReality integrates an AI-driven dialogue agent to ensure trainees never feel lost in this open-ended system. This virtual assistant provides real-time feedback, allowing users to receive instant insights into their choices and refine their approach. It also enhances engagement and interactive learning by enabling workers to ask questions and receive contextual guidance rather than following static instructions. Additionally, the AI helps prevent errors by highlighting potential missteps, ensuring that creativity does not come at the cost of safety or quality.

Development Progress:

Below, we outline the development status with some corresponding screenshots that showcase the system’s core functionalities and user interactions.

The interface features two text panels displaying the conversation between the user and the dialogue agent. When the user speaks, an automatic speech recognition tool (created by our partners from Maastricht University) converts their speech into text and sends it to the dialogue agent (created by our partners at Synelixis); the recognized text is shown in the top panel (input panel). The dialogue agent then processes the input, provides contextual responses, and uses a text-to-speech tool to read them aloud. These responses are displayed in the lower panel (output panel). Additionally, the system can trigger audio and video cues based on user requests. The entire scene is color-coded to enhance user feedback and improve interaction clarity.
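To make the described flow concrete, the sketch below mirrors the data path (speech to text, dialogue agent, text to speech, optional cues) with placeholder functions; it is not the actual Unity implementation or the partners’ ASR, dialogue, and TTS services.

```python
# Conceptual sketch of the described loop: speech -> text -> dialogue agent ->
# spoken response + optional audio/video cue. All functions are placeholders.

def recognize_speech(audio_chunk) -> str:
    """Placeholder ASR step: would convert the user's speech to text."""
    return "How do I attach the piston to the crankshaft?"

def dialogue_agent(user_text: str, context: dict) -> dict:
    """Placeholder agent: returns a contextual reply plus an optional cue trigger."""
    return {"reply": "Align the piston rod with the crankshaft journal first.",
            "cue": {"type": "video", "id": "step_03"}}

def text_to_speech(text: str) -> None:
    """Placeholder TTS step: would read the reply aloud in the headset."""
    print(f"[TTS] {text}")

def on_user_utterance(audio_chunk, context: dict) -> None:
    user_text = recognize_speech(audio_chunk)      # shown in the input panel
    response = dialogue_agent(user_text, context)  # contextual reasoning
    text_to_speech(response["reply"])              # shown in the output panel
    if response.get("cue"):                        # trigger audio/video cues
        print(f"[UI] playing {response['cue']['type']} '{response['cue']['id']}'")

on_user_utterance(audio_chunk=None, context={"current_step": 3})
```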

The screenshots below capture the dialogue between a naive user and the dialogue agent. The user enters the scene and asks for help, and the Dialogue Agent guides the user through the next steps.

The screenshot below captures the user’s curious question regarding the model to be assembled. The Dialogue Agent provides contextual answers to the user.

The user asks the Dialogue Agent to show a video about one of the steps. The Dialogue Agent triggers the function in the application to show the corresponding video on the output panel.

The user grabs an object and asks the Dialogue Agent to give a hint about the step they want to perform. The Dialogue Agent triggers the function in the application to give a useful hint.

The implementation of freemode XR training is just the beginning. As AI and XR technologies continue to evolve, the potential for fully immersive, adaptive, and intelligent industrial training systems grows exponentially. The success of this approach will be measured by increased worker efficiency, reduced onboarding time, and higher retention of complex technical skills.

VOXReality’s commitment to redefining industrial learning aligns with the broader movement toward smart manufacturing and Industry 5.0. By blending technology with human intuition and adaptability, we are not just training workers; we are empowering the future of industry. We look forward to testing the solution with unbiased users and receiving feedback for improvements.

Picture of Leesa Joyce

Leesa Joyce

Head of Research @ Hololight

&

Picture of Gabriele Princiotta

Gabriele Princiotta

Unity XR Developer @ Hololight


Celebrating Women in Extended Reality: Insights and Inspiration from the Women in XR Webinar 

VOXReality, in collaboration with SERMAS, XR4ED, TRANSMIXR, HECOF, MASTER, and CORTEX2 projects, had the privilege of hosting the “Women in XR Webinar – Celebrating Women in Extended Reality.” This online event brought together leading female experts from EU-funded XR projects for an inspiring discussion on the role of women in the rapidly evolving field of Extended Reality. We were honored to have a panel featuring Regina Van Tongeren, Grace Dinan, Leesa Joyce, Moonisa Ahsan, Megha Quamara, Georgia Papaioannou, Maria Madarieta, and Marievi Xezonaki. From seasoned trailblazers with 20 years of experience to emerging voices, these panelists shared their journeys, challenges, and invaluable insights. This webinar aimed to highlight the importance of gender diversity in XR and provide practical advice for aspiring women in tech.

Navigating the Digital Divide: Realities of Women in XR

The panelists openly discussed the challenges women face in the XR industry. While the field offers immense creative potential, it is inherently challenging. Participants highlighted several key issues. Women often find fewer opportunities compared to their male counterparts. The persistent pay gap remains a significant barrier. Women’s contributions can be overlooked, hindering career advancement. Some women still experience difficulties in accessing advocacy and support from the broader XR community. These challenges underscore the need for systemic changes to ensure equal opportunities and recognition for women in XR.

Unlocking Immersive Potential: The Boundless Opportunities for Women in XR

Despite the challenges, the webinar emphasized the vast opportunities available for women in XR. The panelists pointed to the expanding applications of XR across various sectors. XR has the potential to modernize medical training, patient care, and therapy in healthcare. Immersive learning experiences enhance engagement and knowledge retention in education. Innovative applications for virtual try-ons, digital fashion, and immersive design processes are emerging in fashion and design. The XR field is not limited to technical roles; it requires a wide range of skills, including legal expertise, artistic talent, and scientific knowledge. These emerging opportunities present a unique chance for women to lead and shape the future of XR.

Empowering the Future: Actionable Insights and Key Takeaways for Women in XR

The panelists shared a wealth of practical advice for women looking to thrive in the XR industry. They emphasized the importance of building a strong network and finding a supportive community within XR. Organizations like Women in Immersive Tech Europe [1] provide valuable resources, mentorship, and networking opportunities. Seeking out inspiring role models, such as Parul Wadhwa [2], or even figures like Marie Curie, and learning from their experiences was also strongly encouraged.

Furthermore, the panelists stressed the importance of being assertive and comfortable making suggestions. Staying updated with the latest developments in the rapidly evolving XR field is crucial, as is a commitment to continuous learning. They advised against trying to conform to a pre-existing mold, urging women to bring their unique perspectives to the table and contribute to creating inclusive XR experiences. Building a strong online brand, including a professional portfolio, active social media channels, and a personal website, was highlighted as essential for visibility.

For XR teams, the message was clear: diversity must be a core value, integrated into the DNA of the team and its products, not an afterthought. Diversifying hiring teams to include a wide range of skill sets is essential. For those considering starting their own businesses or working freelance, platforms like Immersive Insiders [3] and TalentLabXR [4] were recommended, along with exploring relevant courses from institutions such as the University of London and the University of Michigan.

The webinar left us with a powerful call to action, inspiring us to work together towards a more inclusive and equitable XR future. We encourage you to follow all the panelists, especially the members of our team, Leesa Joyce and Moonisa Ahsan, and be inspired by their ongoing leadership!

Missed the Live Session? Catch the Recording! If you were unable to join us live, don’t worry! The full event recording is available on the F6S Innovation YouTube channel.

Picture of Ana Rita Alves

Ana Rita Alves

Ana Rita Alves is a Communication Manager at F6S, where she specializes in managing communication and dissemination strategies for EU-funded projects. She holds an Integrated Master’s Degree in Community and Organizational Psychology from the University of Minho, which has provided her with strong skills in communication, project management, and stakeholder engagement. Her professional background includes experience in proposal writing, event management, and digital content creation.

Photo by James Bellorini, https://www.citymatters.london/london-short-film-festival-smart-caption-glasses/

Choosing the right AR solution for Theatrical Performances

Choosing suitable AR devices for the VOXReality AR Theatre has been a challenging endeavour. The selection criteria are based on the user and system requirements of the VOXReality use case, which have been extracted through a rigorous user-centric design process and iterative system architecture design. The four critical selection criteria dictate that:

  1. The AR device should have a comfortable and discreet glass-like form factor for improved user experience and technological acceptance in theatres.
  2. The AR device should support affordability, durability and long-term maintenance for feasible implementation at audience-wide scale.
  3. The AR device should support personalization, so that each audience member can customize the fit to their needs, and allow strict sanitization protocols, so that device distribution can adhere to high level public health standards.
  4. The AR device should support application development with open standards instead of proprietary SDKs for widespread adoption of the solution.

Given the above criteria, the selection process presents a clear challenge because no readily available AR solution offers a perfect fit. To address this need, the VOXReality team performed an extensive investigation of the range of available options with a view to the past, present, and future, and is presenting the results below.

The past

A quick look at the past shows that popular AR options were clearly unsuitable given the selection criteria. Specifically, affordable AR has a long, proven track record as user-friendly, camera-based AR deployed on consumer smartphones and distributed through appropriate platforms as standalone applications. The restrictions on the user experience, though, such as holding a phone up while seated to watch the play through a small screen, make this a clearly prohibitive option. Previous innovative designs of highly sophisticated (and costly) augmented reality devices, sometimes also referred to as holographic devices or mixed reality devices, support neither the scalability nor the discreet presence required by the use case, and are similarly rejected. Finally, a range of industry-oriented AR designs available as early as 2011 focused on monocular, non-invasive AR displays with limited resolution and display capabilities, at a prohibitive cost due to (among other factors) extensive durability and safety certifications. Therefore, with a view to the past, one can posit that the AR theatre use case had not gained popularity due to pragmatic constraints.

The present

In recent years and as early as 2020, hardware developments picked up pace. Apart from evolutions of previously available designs and/or newcomers in similar design concepts, more diverse design concepts have been introduced in a persistent trend of lowering procurement costs and offering more support for open-source frameworks. Nowadays, this trend has culminated in a wide range of “smart glasses”, i.e. wearables with glass-like form factor supporting some level of audiovisual augmentation, which rely on external devices for computations (such as smartphones).

This design concept finds its origins in the previously mentioned industry-oriented AR designs, as well as their business-oriented counterparts. This time though, the AR glass concept is entering the consumer space with options that are durable for daily, street wear and tear while also remaining affordable for personal use. Some designs are provided directly bundled with proprietary software in a closed system approach (like AI-driven features or social media-oriented integrations), but the majority offers user-friendly, plug-n-play, tethered or wireless capabilities, directly supporting most personal smartphones or even laptops.

The VOXReality design

This landscape enables new, alternative system designs for AR Theatre use cases: instead of theatre-owned hardware/software solutions, one can envision systems with a combination of consumer-owned and theatre-owned hardware/software elements. By investigating how various stakeholders respond to this potential, we can pinpoint best practices and future recommendations.

Examining implemented AR theatre use cases, one can confirm that the past landscape is dominated by a design approach with theatre-owned integrated (hardware/software) solutions. Excellent examples, where the theatre provides both hardware and software to the audience, are the National Theatre’s smart caption glasses [1], developed by Accenture and the National Theatre, as well as the Greek-based SmartSubs [2] project.

One new alternative that presents itself is for the audience to use their own custom hardware/software solutions dedicated to live subtitling and translation during performances. In this case, each user can choose their own AR device, pre-bundled with general-purpose translation software of their preference.

As eloquently described in a recent article [3], though, general-purpose AI captioning and translation frequently make mistakes and fail to capture nuances, which, especially in artistic performances, can break immersion and negatively impact the audience experience. Therefore, in VOXReality we design for a transition from the past to the future: developing custom software dedicated to theatrical needs, optimized for generating real-time subtitles and translations of literary text, which can also be easily deployed on theatre-owned AR devices and/or on consumer-owned devices with minimal adaptations. This is enabled by a rigorous user-centric design approach, which can verify the features and requirements per deployment option, as well as by contemporary technical development practices using open standards such as OpenXR.
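As a rough sketch of what such a dedicated pipeline involves, the snippet below chains an open speech-recognition model with an open translation model and wraps the output into short subtitle lines; the checkpoints, the language pair, and the line-length limit are illustrative stand-ins, not the VOXReality theatre models.

```python
import textwrap
from transformers import pipeline

# Public stand-in checkpoints for illustration; the VOXReality models are
# domain-adapted to theatrical, literary text and the required language pair.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
mt = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

def live_subtitle(audio_segment_path: str, max_chars: int = 42) -> list[str]:
    """Transcribe one short audio segment, translate it, and wrap the result
    into subtitle lines of at most `max_chars` characters."""
    transcript = asr(audio_segment_path)["text"]
    translated = mt(transcript, max_length=200)[0]["translation_text"]
    return textwrap.wrap(translated, width=max_chars)

# Usage: feed consecutive short segments of the stage audio feed, e.g.
# print(live_subtitle("segment_0001.wav"))
```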

The future

The future looks bright, with community-driven initiatives showing how accessible AR and AI technology can be, as in the example of open-source smart glasses you can build on your own [4], and with continuous improvements in automatic speech recognition and neural machine translation allowing models to run performantly on ever fewer resources. VOXReality aims to leave a long-standing contribution to the domain of AR theatre, with the objective of establishing reliable, immersive, and performant technological solutions as the mainstream in making cultural heritage content accessible to all.

Picture of Spyros Polychronopoulos

Spyros Polychronopoulos

Research Manager at ADAPTIT and Assistant Professor at the Department of Music Technology and Acoustics of the HMU

&

Picture of Olga Chatzifoti

Olga Chatzifoti

Olga Chatzifoti is an Extended Reality applications developer working with Gruppo Maggioli on the design and development of the Augmented Reality use case of the VOXReality HORIZON research project. She is also a researcher in the Department of Informatics and Telecommunications of the University of Athens.


Seeing into the Black-Box: Providing textual explanations when Machine Learning models fail.

Machine learning is a scientific practice that is heavily tied to the notions of “error” and “approximation”. Sciences like mathematics and physics are associated with error induced by the need to model how things work. Moreover, the abilities of humans in intelligence tasks are also tied to error, since some actions associated with these abilities may be the result of failure, while others may be deemed truly successful. There have been myriads of times when our thinking, our categorization ability, or our human decisions have failed. Machine learning models, which try to mimic and compete with human intelligence in certain tasks, are likewise associated with both successful and erroneous operations.

But how can a machine learning model, a deterministic model with the ability to empirically compute the confidence it has in a particular action, diagnose by itself that it is making an error when processing a particular input? Even for a machine learning engineer, intuitively understanding how this is possible without studying a particular method is difficult.

In this article, we discuss a recent algorithm for this problem that convincingly explains how; in particular, we describe the Language Based Error Explainability (LBEE) method by Csurka et al. Here, we recreate an explanation of how this method leverages the convenience of generating embeddings via the CLIP model contributed by OpenAI, which allows one to translate text extracts and images into high-dimensional vectors that reside in a common vector space. By projecting texts or images into this common high-dimensional space, we can compute the dot product between two embeddings (a well-known operation that measures the similarity between two vectors) to quantitatively measure how similar the two original text or image objects are, and to compare such similarities across pairs of objects.
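As an illustration of this joint embedding space, the sketch below uses the Hugging Face transformers wrapper around OpenAI’s public CLIP ViT-B/32 checkpoint to embed one image and a few candidate sentences and rank the sentences by cosine similarity; the image path and sentences are hypothetical, and the LBEE authors’ exact tooling may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
sentences = ["a photo taken at night", "a blurry photo", "a photo of a small object"]

inputs = processor(text=sentences, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])

# Cosine similarity = dot product of L2-normalised embeddings.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = (txt_emb @ img_emb.T).squeeze(-1)

for sentence, score in sorted(zip(sentences, scores.tolist()), key=lambda p: -p[1]):
    print(f"{score:.3f}  {sentence}")
```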

The designers of LBEE have developed a model that can report a textual error description of a model failure in cases where the underlying model asserts an empirically low confidence score for the action it was designed to take. Part of the difficulty in grasping how such a method fundamentally works is our innate wondering about how the textual descriptions explaining the model failure are generated from scratch as a function of an input datum. In our brains, we often put little effort into explaining why a failure happens and instantly arrive at clues to describe it, unless the cause drifts away from our fundamental understanding of the inner workings of the object involved in the failure. To keep things interesting, we can already provide an answer to this wondering: instead of assembling these descriptions anew for each input, we can generate them a priori following a recipe, and then reuse them in the LBEE task by computationally reasoning about the relevance of a candidate set of explanations in relation to a given model input. In the remainder of this article, we will see how.

Suppose that we have a classification model that was trained to classify the object type of a single object depicted in a small color image. We could, for example, take photographs of objects on a white background with our phone camera and pass these images to the model in order for it to classify the object names. The classification model can yield a confidence score ω between 0 and 1, representing the normalized confidence that the model has when assigning the image to a particular class, relative to all the object types recognizable by the model. It is usually observed that when a model does poorly in generating a prediction, the resulting confidence score tends to be quite low. But what is a good empirical threshold T that allows us to separate a poor prediction from a confident one? To empirically estimate two such thresholds, one for identifying easy predictions and one for identifying hard predictions, we can take a large dataset of images (e.g., the ImageNet dataset) and pass each image to the classifier. For the images that were classified correctly, we can plot the confidence scores generated by the model as a normalized histogram. By doing so, we may expect to see two large lobes in the histogram: one concentrating relatively low prediction scores (less confident inferences) and a second concentrating relatively high scores (confident inferences), possibly with some spread of frequency mass around the two lobes rather than two highly leptokurtic peaks. We can then set an empirical threshold that separates the two lobes.
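A minimal sketch of this histogram-based thresholding is given below: it looks for the least-populated bin between the two main lobes of a synthetic, bimodal set of confidence scores and uses its centre as a single cut-off. The data and the valley-finding rule are illustrative assumptions; the LBEE paper estimates its two thresholds in its own way.

```python
import numpy as np

def estimate_threshold(confidences: np.ndarray, bins: int = 50) -> float:
    """Find the sparsest bin between the low- and high-confidence lobes of a
    (roughly) bimodal histogram and return its centre as the cut-off T."""
    hist, edges = np.histogram(confidences, bins=bins, range=(0.0, 1.0), density=True)
    peak_lo = int(np.argmax(hist[: bins // 2]))               # low-confidence lobe
    peak_hi = bins // 2 + int(np.argmax(hist[bins // 2:]))    # high-confidence lobe
    valley = peak_lo + int(np.argmin(hist[peak_lo:peak_hi + 1]))
    return float((edges[valley] + edges[valley + 1]) / 2.0)

# Synthetic confidence scores for correctly classified images: two lobes.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.beta(2, 5, 2000), rng.beta(8, 2, 3000)])
print(f"estimated cut-off T = {estimate_threshold(scores):.2f}")
```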

Csurka and collaborators designate images as easy or hard based on the confidence score of a classification machine learning model and its relation to the cut-off threshold (see Figure 1). Having distinguished these two image sets, the authors compute, for each image in each group, an embedding: an ordered sequence of numbers (for convenience, we will use the term vector to refer to this sequence) that describes the semantic information of the image. To do this, they employ the CLIP model contributed by OpenAI, the company famous for delivering the ChatGPT chatbot, which excels at producing embeddings for images and text in a joint high-dimensional vector space. The computed embeddings can be used to measure the similarity between an image and a very small text extract, or the similarity between a pair of text extracts or images.

As a next step, the authors identify groups of image embeddings that share similarities. To do this, they use a clustering algorithm, which takes in the generated embedding vectors and identifies clusters among them; the number of clusters that fits a particular dataset is non-trivial to define. All in all, we end up with two types of clusters: clusters of CLIP embeddings for “easy” images, and clusters of CLIP embeddings for “hard” images. Then, each hard cluster center is picked and its closest easy cluster center is found, giving us a pair of embedding vectors originating from the clustering algorithm. The two clusters, “easy” and “hard”, are visually denoted in the top-right sector of Figure 1 by green and red dotted enclosures.
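The sketch below illustrates this step with k-means from scikit-learn: easy and hard embeddings are clustered separately, and each hard cluster centre is paired with its closest easy cluster centre by cosine similarity. The embedding dimensionality, the number of clusters, and the random data are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

def l2_normalise(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def cluster_and_match(easy_emb: np.ndarray, hard_emb: np.ndarray, k: int = 8):
    """Cluster easy and hard image embeddings separately, then pair each hard
    cluster centre with its closest easy cluster centre (cosine similarity)."""
    easy_centres = KMeans(n_clusters=k, n_init=10, random_state=0).fit(
        l2_normalise(easy_emb)).cluster_centers_
    hard_centres = KMeans(n_clusters=k, n_init=10, random_state=0).fit(
        l2_normalise(hard_emb)).cluster_centers_
    sims = l2_normalise(hard_centres) @ l2_normalise(easy_centres).T
    return hard_centres, easy_centres, sims.argmax(axis=1)

# Hypothetical 512-dimensional CLIP embeddings for the two image sets.
rng = np.random.default_rng(0)
_, _, closest_easy = cluster_and_match(rng.normal(size=(800, 512)),
                                       rng.normal(size=(500, 512)))
print(closest_easy)  # index of the nearest easy cluster for each hard cluster
```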

The LBEE algorithm then generates a set S of sentences that describe the aforementioned images. For each short sentence that is generated, the CLIP embedding is computed. As mentioned earlier, this text embedding can be directly compared to the embedding of any image by calculating the dot product (or inner product) of the two embedding vectors; the dot product measures a quantity that the signal processing community calls linear correlation. The authors apply this operation directly: they compute the similarity of each textual error description via the so-called cosine similarity between a text extract embedding and an image embedding, ultimately obtaining two relevance score vectors of dimensionality k < N, where each dimension is tied to a given textual description. The authors then pass these two score vectors to a sentence selection algorithm (covered in the next paragraph). This selection is carried out for each hard cluster, and the union of the resulting sentence sets is output to the user in return for the image that was supplied as input.

The authors define four sentence selection algorithms, named SetDiff, PDiff, FPDiff, and TopS. SetDiff computes the sentence sets corresponding to a hard cluster and to an easy cluster; it then removes from the hard cluster’s sentence set the sentences that also appear in the easy cluster’s sentence set, and reports the resulting set to the user. PDiff takes two similarity score vectors of dimensionality k (where k denotes the number of top-ranked relevant text descriptions), one from the hard set and one from the easy set; it computes the difference between these two vectors and retains the sentences corresponding to the top k values. TopS simply reports as an answer all the sentences that correspond to the vector of top-k similarities. Figure 3 presents examples of textual failure modes generated for a computer vision model, each using one of the TopS, SetDiff, PDiff, and FPDiff methods. To enable evaluation of the LBEE model and methodology, the authors also had to introduce an auxiliary set of metrics adapted to the specificities of the technique. To deepen your understanding of this innovative and very useful work, we recommend reading the original paper [1].
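A simplified sketch of the simpler selection rules is shown below, with hypothetical candidate sentences and similarity scores; the paper’s exact definitions (including FPDiff and its top-k bookkeeping) differ in detail.

```python
import numpy as np

def top_s(sentences, scores, k=3):
    """TopS: the k sentences most similar to the hard cluster centre."""
    order = np.argsort(scores)[::-1][:k]
    return [sentences[i] for i in order]

def set_diff(sentences, hard_scores, easy_scores, k=3):
    """SetDiff: top-k sentences of the hard cluster minus those also ranked
    in the top-k of the closest easy cluster, keeping failure-specific ones."""
    return sorted(set(top_s(sentences, hard_scores, k)) -
                  set(top_s(sentences, easy_scores, k)))

def p_diff(sentences, hard_scores, easy_scores, k=3):
    """PDiff: rank sentences by the score difference (hard - easy), keep k."""
    return top_s(sentences, np.asarray(hard_scores) - np.asarray(easy_scores), k)

# Hypothetical candidate descriptions and their similarity scores.
cands = ["image taken at night", "object is partially occluded", "image is blurry",
         "object on a white background", "object is small"]
hard = np.array([0.31, 0.28, 0.27, 0.10, 0.22])
easy = np.array([0.12, 0.11, 0.10, 0.30, 0.21])
print(set_diff(cands, hard, easy))
print(p_diff(cands, hard, easy))
```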

References

[1] G. Csurka et al., “What could go wrong? Discovering and describing failure modes in computer vision,” in Proceedings of ECCV 2024.

Picture of Sotiris Karavarsamis

Sotiris Karavarsamis

Research Assistant at Visual Computing Lab (VCL)@CERTH/ITI
