The VOXReality project, funded under the Horizon Europe framework, aims to revolutionize Extended Reality (XR) by integrating advanced language and vision AI models. The Training Assistant use case is one of the project’s three innovative use cases and aims to create immersive and interactive training environments. In Pilot 0C, an internal stress test conducted as part of the project, the Training Assistant use case focused on evaluating user interaction modalities within an Augmented Reality (AR) training application designed for the Microsoft HoloLens 2. This blog post delves into the objectives, execution, and outcomes of the Pilot 0C user study, highlighting its contributions to open-ended learning in XR training, and outlines the next steps to refine and advance the project.
VOXReality seeks to enhance XR experiences by combining innovative AI technologies, including Automatic Speech Recognition (ASR) and a Large Language Model (LLM)-based Augmented Reality Training Assistant (ARTA). The AR Training use case integrates ASR and ARTA into the Hololight Space Assembly software, an experimental platform originally designed for linear industrial assembly training. For this pilot, the software was customized to support open-ended learning environments, giving users greater flexibility in task execution, in line with constructivist learning principles that emphasize problem-solving and engagement over rigid, prescribed sequences. Conducted in Santarcangelo, Italy, at Maggioli’s headquarters, Pilot 0C involved 13 participants, including consortium members and Maggioli employees. The primary goal was to compare two user interfaces within the customized Hololight Space Assembly platform: a voice-controlled “Voxy Mode,” leveraging ARTA and ASR for voice-driven interactions, and a traditional Graphical User Interface (GUI) mode relying on hand menus. The study assessed how these modalities affect key user experience metrics, including cognitive load, usability, engagement, and overall user experience, in the context of industrial assembly training tasks.
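At a conceptual level, the voice-driven path is a two-stage pipeline: ASR converts the user’s speech to text, and ARTA combines that text with the current task context to produce guidance. The sketch below illustrates this flow; all names (transcribe, AssemblyState, call_llm, and so on) are hypothetical placeholders, not the actual VOXReality or Hololight Space Assembly APIs.

```python
from dataclasses import dataclass, field

@dataclass
class AssemblyState:
    """Hypothetical snapshot of the open-ended training scene."""
    completed_steps: list[str] = field(default_factory=list)
    available_parts: list[str] = field(default_factory=list)

def transcribe(audio: bytes) -> str:
    """Placeholder for the ASR stage: speech audio in, text out."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Placeholder for the LLM backend behind ARTA."""
    raise NotImplementedError

def handle_voice_command(audio: bytes, state: AssemblyState) -> str:
    """One Voxy Mode turn: transcribe the utterance, then ask ARTA
    for context-aware guidance grounded in the current scene state."""
    utterance = transcribe(audio)
    prompt = (
        f"Completed steps: {state.completed_steps}\n"
        f"Available parts: {state.available_parts}\n"
        f"User said: {utterance}\n"
        "Reply with the most helpful next assembly instruction."
    )
    return call_llm(prompt)
```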
Study Design and Execution
Pilot 0C employed a within-subjects study design, where each of the 13 participants experienced both Voxy Mode and GUI Mode in two sessions, with the order randomized to minimize bias. The training scenario involved industrial assembly tasks, where participants interacted with virtual objects in an AR environment using the Microsoft HoloLens 2. In Voxy Mode, users issued voice commands to ARTA, which provided context-aware guidance, while the GUI Mode utilized hand-menu interactions for task assistance.
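As an illustration of the counterbalancing, the snippet below (a minimal sketch, not the study’s actual tooling) assigns each participant one of the two session orders while keeping the two orders as evenly represented as possible:

```python
import random

def assign_orders(participant_ids, seed=42):
    """Counterbalance session order: roughly half start with Voxy Mode,
    the rest with GUI Mode (with 13 participants, one order gets one extra)."""
    orders = [("Voxy Mode", "GUI Mode"), ("GUI Mode", "Voxy Mode")]
    pool = [orders[i % 2] for i in range(len(participant_ids))]
    random.Random(seed).shuffle(pool)
    return dict(zip(participant_ids, pool))

assignments = assign_orders([f"P{i:02d}" for i in range(1, 14)])
```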
The study collected data on several metrics:
- Cognitive Load: Measured using the NASA-TLX framework, assessing mental demand, physical demand, temporal demand (pace), performance, effort, and frustration.
- Usability: Evaluated through the System Usability Scale (SUS) and the perceived helpfulness of the tutorial.
- User Experience: Assessed via the User Experience Questionnaire (UEQ), focusing on supportiveness, efficiency, and clarity.
- Engagement: Gauged using a questionnaire to evaluate immersion and involvement.
Quantitative data, such as task completion times (recorded by the system) and SUS scores, were complemented by qualitative feedback, in which participants commented on the usefulness of and their experience with each interface. These insights were analyzed to compare the performance of the two modalities and to identify areas for improvement.
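For reference, SUS yields a 0–100 score from ten 1–5 Likert items: odd-numbered (positively worded) items contribute their rating minus one, even-numbered (negatively worded) items contribute five minus their rating, and the sum is multiplied by 2.5. A minimal scoring function:

```python
def sus_score(responses):
    """System Usability Scale score from ten 1-5 Likert responses."""
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs exactly ten responses in the range 1-5")
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # 0-indexed: even index = odd-numbered item
        for i, r in enumerate(responses)
    ]
    return 2.5 * sum(contributions)

print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 1]))  # 85.0, well above the oft-cited ~68 average
```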

Key Findings and Outcomes
The results of Pilot 0C revealed both strengths and challenges in the tested interfaces. Quantitatively, the GUI Mode was significantly faster for task completion, primarily due to a shorter tutorial phase. However, the other metrics—cognitive load, usability, and engagement—showed no statistically significant differences between Voxy Mode and GUI Mode. This was largely attributed to a strong learning effect inherent in the within-subjects design: participants mastered the assembly tasks in their first session, making the second session far less informative. This methodological challenge underscored the limitations of the within-subjects approach for this study.
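The post does not specify which statistical tests were applied. For a within-subjects comparison with n = 13, a paired, non-parametric test such as the Wilcoxon signed-rank test would be a typical choice; the sketch below uses invented scores purely for illustration.

```python
from scipy.stats import wilcoxon

# Hypothetical paired SUS scores, one pair per participant (not the study's data).
sus_voxy = [72.5, 80.0, 65.0, 77.5, 70.0, 85.0, 62.5, 75.0, 67.5, 82.5, 70.0, 77.5, 72.5]
sus_gui  = [70.0, 77.5, 67.5, 75.0, 72.5, 80.0, 65.0, 72.5, 70.0, 80.0, 67.5, 75.0, 70.0]

stat, p = wilcoxon(sus_voxy, sus_gui)
print(f"W = {stat}, p = {p:.3f}")  # p > 0.05 would mirror the "no significant difference" finding
```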
Qualitatively, participants expressed a clear preference for Voxy Mode, highlighting its engaging and supportive nature. Users appreciated the interactivity and novelty of voice-driven interactions with ARTA, which enhanced their sense of presence and involvement. However, they also noted limitations, including inaccuracies in ASR and ARTA’s struggles with contextual understanding in the open-ended setting, which occasionally disrupted the user experience. The GUI Mode, while efficient and functional, was perceived as less engaging and immersive. Participants also provided feedback on practical issues, such as confusing color codes, performance bottlenecks, and minor bugs in the AI models, offering valuable insights for future refinements.
These findings highlight the potential of multimodal, voice-driven interfaces in XR training. The preference for Voxy Mode suggests that voice-based interactions, when supported by robust AI, can significantly enhance engagement and perceived support in open-ended learning environments. However, the study also emphasized the need for technical improvements in ASR accuracy, particularly across different accents, and in ARTA’s ability to interpret user intent and context, to ensure practical efficacy in dynamic training scenarios.

Lessons Learned and Implications
Pilot 0C provided critical insights into the role of multimodal interfaces in XR training. The user preference for Voxy Mode indicates that voice-driven interactions can foster greater engagement and support in open-ended training environments, where users benefit from the flexibility to interact with objects not directly tied to prescribed tasks and to complete tasks in varied orders. This aligns with the project’s goal of promoting deeper understanding through problem-solving, in contrast with linear training systems that rely on rote memorization.
The learning effect observed in the within-subjects design was a significant methodological takeaway, leading to the decision to adopt a between-subjects design for the final Pilot. In this approach, participants will be divided into separate groups for each interface, eliminating the influence of prior task familiarity and enabling a clearer comparison of the modalities’ effectiveness.
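In code terms, the difference is that each participant is randomly assigned to exactly one condition, and the groups are then compared with an independent-samples test rather than a paired one. A minimal sketch, with the Mann–Whitney U test as an illustrative (assumed, not stated) choice:

```python
import random
from scipy.stats import mannwhitneyu

def split_into_groups(participant_ids, seed=7):
    """Randomly split participants into two independent condition groups."""
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]  # (Voxy Mode group, GUI Mode group)

voxy_group, gui_group = split_into_groups([f"P{i:02d}" for i in range(1, 27)])

# Each participant contributes one score to one group only, so prior task
# familiarity cannot leak across conditions (placeholder scores shown).
stat, p = mannwhitneyu([75.0, 80.0, 72.5, 77.5], [70.0, 72.5, 67.5, 75.0])
```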
Next Steps
Building on the outcomes of Pilot 0C, the team will focus on addressing the identified issues to prepare for the final Pilot. Key priorities include enhancing the accuracy of the ASR system and improving ARTA’s contextual awareness to ensure seamless and effective voice interactions. Bug fixes, performance optimizations, and clearer color coding will also be implemented to enhance the overall user experience. The shift to a between-subjects study design for Pilot 2 will provide a more robust evaluation of Voxy Mode and GUI Mode, offering clearer insights into their impact on user experience and training outcomes. These improvements aim to strengthen the role of multimodal, AI-driven interfaces in advancing open-ended XR training applications, bringing VOXReality closer to its goal of delivering innovative, immersive training solutions.

Leesa Joyce
Head of Research @ Hololight