
What is to my left?

As is usual when comparing how well humans and machines complete intelligence tasks, we notice that people can effortlessly categorise basic spatial relationships between objects. These relationships are useful when reasoning and planning, or when holding a conversation aimed at reaching a goal. The objects may be far from the observer, and they may even be far from each other. In any setting, we may want to know how two objects relate to each other in terms of a fixed set of spatial relationships that people commonly use in daily life.

Computationally, we may need to infer these relationships from nothing more than a colour photograph of some objects and rectangular boxes covering the objects of interest. Given that input, we may want to state that “one object is below another object”, or, in another case, that “object A is to the left of object B, and object A is behind object B”. The second example shows that a pair of objects can satisfy more than one spatial relationship at the same time.

Open-sourcing AI software

In the domain of Artificial Intelligence (AI), an algorithm that infers the spatial relationships between objects (usually considered in pairs) becomes far more useful once it is implemented and shared with developers around the world as a library routine that any AI developer can drop into their code. Over roughly the last fifteen years, sharing code publicly as open source has become a widespread practice. Code implementing important algorithms and intelligent workflows is shared with any developer who accepts the terms of a licence such as the GPL, LGPL or MIT licence. As code is continuously contributed in public, the need for developers to reinvent the wheel for basic tasks keeps shrinking, because robust implementations of algorithms for many intelligent tasks become available across a range of programming languages. If a developer still cannot find what they need in an open-source library dedicated to a specific problem domain (for example, computer vision), they can dig into the available code and extend it to fit their technical requirements. If the added features are useful to other programmers with the same needs, they can submit their changes to the maintainers of the library (or of any other open-source software) for review, and the contribution may be included in a future release.

Capturing failures

Software engineers developing robotics applications, for example, would want a set of such routines when building simulation workflows for robots that interact with objects. On one side of the coin, these routines should be reliable enough to be reused in applications that feature sound error handling and some ability to leverage the failure evidence generated by the model; both help achieve correctness and better error control in the underlying application.

What AI programs “see”

Although humans can reason effortlessly and very accurately about the basic spatial relationships between pairs of objects, this task is not as easy for computers to solve. A human can see two objects and state that “the blue car is next to the lorry”; a computer is only given a rectangular table of numbers defining the red, green and blue colour intensities of its cells, and these cells correspond to the pixels of the underlying colour image. Whereas humans sense objects visually and understand spatial relationships instantly, the computer initially has nothing but this table of colour intensities. The goal is then to write a program that takes in this rectangular array of pixel intensities, together with bounding boxes covering two objects of interest, and decides how the two objects inside the boxes relate spatially. The program acts like an “artificial brain” targeted at this single task and nothing else, able to make sense of the table of numbers it receives. It helps to recall a basic point from introductory programming courses: programs implement algorithms, and programs implementing algorithms receive input data and produce output data; this one executes a sequence of steps that finally yields a decision about its input.
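
To make this concrete, here is a minimal sketch of the kind of input such a program receives, using NumPy and Pillow; the file name and box coordinates are purely illustrative and not taken from any dataset:

```python
import numpy as np
from PIL import Image

# Load a colour photograph; the file name is purely illustrative.
image = np.asarray(Image.open("scene.jpg").convert("RGB"))
print(image.shape)  # (height, width, 3): one red, green and blue intensity per pixel

# Two axis-aligned bounding boxes, given as (x_min, y_min, x_max, y_max) in pixels.
# The coordinates below are made up for illustration.
box_subject = (40, 120, 260, 380)   # e.g. the blue car
box_object = (300, 100, 620, 420)   # e.g. the lorry

# This pair (image, two boxes) is all the "artificial brain" receives as input;
# its output is one spatial relation such as "next to" or "to the left of".
```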

Usefulness of spatial relation prediction

The steps implemented in a program that identifies the spatial relationship between two objects in an image follow an algorithm designed to take the input describing the two objects and to produce as output the actual spatial relation from a fixed set of relations. For any given input, the program may produce the correct output or a wrong answer; making such a program correct 100% of the time seems impossible at the moment, although future advances may change that. The output may be reported to the person operating the computer, or it may be passed on to another program that considers the predicted relation and then makes further decisions. For example, we may want a program that executes a particular operation only when one or more conditions hold. In a monitoring application that receives data from a camera, one example of conditional execution based on the spatial relationship between two detected entities (objects) is: “if a (detected) person is on the (detected) staircase, then turn the light on”.
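
A hedged sketch of such conditional logic is shown below; `predict_relation` and `turn_light_on` are hypothetical placeholder functions, not part of any real library, standing in for a trained classifier and an actuator command:

```python
def predict_relation(image, box_a, box_b):
    """Hypothetical wrapper around a trained spatial-relation classifier; it would
    return one label such as "on", "next to" or "to the left of". The constant
    return value below is a stand-in so the sketch runs without a real model."""
    return "on"

def turn_light_on():
    print("light on")  # stand-in for a real actuator or API call

# "If a (detected) person is on the (detected) staircase, then turn the light on."
def monitor(frame, person_box, staircase_box):
    if predict_relation(frame, person_box, staircase_box) == "on":
        turn_light_on()

monitor(frame=None, person_box=(10, 40, 60, 200), staircase_box=(0, 150, 300, 400))
```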

Notice that by being able to correctly check that “a (detected) person is on the (detected) staircase”, developers can completely bypass the need to write complicated geometric or algebraic rules defining what it means for one object to be on another. Hand-crafting such rules is a potential failure point in the development process: the rules may be wrong for some inputs even if they work for others. On the downside, a machine learning algorithm for this task will also make errors, and the application logic of our program would want to “know” why. Fortunately, the advances discussed so far enable us to write applications that reason about the spatial relationships of objects in colour images.

RelatiViT: a state-of-the-art model
Figure 1. Examples of colour images containing several objects. In each image, only two objects are marked, one with a red and one with a blue bounding box. For each case, the classification of a particular spatial relation is shown. Image adapted from [1].

It is important to note that computer algorithms are still not excellent at deciding the spatial relationships between objects (here we refer only to pairs of objects). In a recent paper presented by Wen and collaborators [1] at ICLR 2024 in Vienna, the authors devised modern spatial relationship classification algorithms based on deep convolutional neural networks or on Transformer [2] deep neural networks. Through comparative experiments on two benchmarks, they singled out one of their designs, called RelatiViT, as the superior model. This computer vision algorithm can decide how two objects relate spatially when given a colour photograph of the objects with their surrounding background, along with rectangular bounding boxes covering the two objects.

Wen et al. used two benchmark datasets containing examples of objects and the bounding boxes covering them (see Figure 1); one portion of the data was reserved to train their spatial relationship classifier, and another portion was used to see how well the algorithm generalises to data it has never seen. This is standard practice when building a machine learning model, because we want to evaluate empirically how good the model is: evaluating a model on examples that were used to train it is undesirable during testing (although such examples can of course be inspected outside model testing). The first benchmark provides pairs of objects labelled with 30 spatial relationships, and the second with 9. The 9 spatial relationships in the latter benchmark are: “above”, “behind”, “in”, “in front of”, “next to”, “on”, “to the left of”, “to the right of” and “under”.
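
As a minimal illustration of this train/test split (using scikit-learn’s `train_test_split`, with toy stand-in data rather than either benchmark; the 80/20 ratio and variable names are illustrative choices):

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real benchmark: each sample is (image_id, subject_box, object_box)
# and each label is the ground-truth spatial relation. Values are illustrative only.
samples = [(i, (0, 0, 10, 10), (20, 0, 30, 10)) for i in range(100)]
labels = ["to the left of"] * 100

# Reserve a held-out portion for evaluation; the model never sees it during training.
train_x, test_x, train_y, test_y = train_test_split(
    samples, labels, test_size=0.2, random_state=0
)
print(len(train_x), len(test_x))  # 80 20
```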

Quantitative score of success

For the two benchmarks, the authors reported that the average ratio of correct classifications per spatial relationship is a little higher than 80%. This essentially means that, on the controlled benchmark, and averaged over all available test cases for each spatial relationship, the RelatiViT model responds correctly to roughly 8 out of 10 inputs.
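
As a toy illustration of how such a score is computed, the snippet below averages hypothetical per-relation accuracies; the numbers are invented for illustration and are not taken from the paper:

```python
import numpy as np

# Hypothetical per-relation accuracies, one value per spatial relation in a benchmark.
per_relation_accuracy = np.array([0.85, 0.78, 0.82, 0.80, 0.83, 0.79, 0.81, 0.84, 0.77])

# The reported score is the average of these per-relation accuracies,
# so roughly 8 out of 10 test cases are classified correctly on average.
print(f"mean per-relation accuracy: {per_relation_accuracy.mean():.2f}")  # 0.81
```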

Adoption of advances circa ‘17

Over the last seven years, a basic and thriving technique in general-purpose deep learning has been a machine learning model called the Transformer. It was proposed in 2017 by Vaswani and collaborators [2] and had been cited more than 139,000 times at the time this article was written. The Transformer is an advance that researchers keep studying, reusing and redesigning in formulations for many different machine learning problems. One essential criterion for accepting a new machine learning model as the successor (or winning) model for a particular problem is that formulations involving a Transformer-like model are empirically better than models built on earlier families of basic models or algorithms (such as, for instance, Generative Adversarial Networks or other past developments). Superiority is always measured in terms of one or more quantitative metrics, although this practice has received constructive criticism from researchers in recent years. There is a subtlety worth knowing here: accepting that Transformers, for instance, are successful successors to previous theory does not devalue that theory. It does, however, imply that a better solution may exist by reusing the recent advance, according to a range of quantitative metrics (which are still not the whole story when we compare a sequence of models on their merits).

Basic input/output in a Transformer

The basic operation performed by a Transformer model is to receive a list of vectors as input and to output a corresponding list of vectors, after first identifying and capturing the associations between the vectors in the input list. This model is used across a very large set of basic AI problems; important application areas include image segmentation, classification problems of all sorts, speech separation, and the remote sensing of Earth observation data. Researchers have been committing time and effort to reformulating virtually all of the basic machine learning problems (such as classification, clustering, and so on) around the idea of the Transformer model by Vaswani et al. [2].
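
To make this input/output contract concrete, here is a minimal PyTorch sketch (the sizes and layer counts are arbitrary choices for illustration, not values from any paper): a list of vectors goes in, and a list of vectors of the same length and dimensionality comes out, each output vector recomputed by attending to all the others.

```python
import torch
import torch.nn as nn

# A "list of vectors": a batch of 1 sequence containing 5 vectors of size 64.
tokens = torch.randn(1, 5, 64)

# A small Transformer encoder; sizes are arbitrary and for illustration only.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

# The output is a list of vectors of the same shape, where each output vector has
# been recomputed by attending to (i.e. relating itself to) all input vectors.
outputs = encoder(tokens)
print(outputs.shape)  # torch.Size([1, 5, 64])
```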

Structure of the RelatiViT
Figure 2. Depiction of the four object-pair spatial relationship classification models from the recent study of Wen and collaborators [1]. Image adapted from reference [1].

Wen et al. [1] considered four models (see Figure 2) that take as input a colour image and two bounding boxes covering two objects of the user’s interest. In this article, we focus only on the rightmost model, RelatiViT. RelatiViT is a state-of-the-art model that encodes information not only about the two objects but also about the context of the image; people certainly employ such cues in their own decisions. The context is taken to be the clipped portion of the image that is enclosed by the union of the bounding boxes covering the two objects: the subject and the object. For an example of what context looks like, see Figure 2 (a). The information (indeed, even the raw data) describing the surroundings of the two objects is very important in deducing how they are arranged spatially.
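
As a small illustration of one plausible way to delimit such a context region, the helper below (hypothetical, not taken from the paper’s code) computes the smallest box enclosing the union of two bounding boxes given as (x_min, y_min, x_max, y_max):

```python
def union_box(box_a, box_b):
    """Smallest axis-aligned box enclosing both input boxes, each given as
    (x_min, y_min, x_max, y_max). The union region is one plausible way to
    delimit the "context" around a subject-object pair."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

# Illustrative coordinates only.
print(union_box((40, 120, 260, 380), (300, 100, 620, 420)))  # (40, 100, 620, 420)
```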

The RelatiViT model processes data in five basic steps: (a) it initially considers small image patches residing in the background of the image and small patches residing on the bodies of the two objects, thereby creating three lists of patch embeddings; (b) these three lists of embeddings are passed through a ViT encoder [3]; (c) the ViT encoder recalculates (or “rewrites”) the vectors in each of the three lists so that they become better related to each other, producing three new lists of embeddings of the same size; (d) since each of the two objects should be described by a single vector, RelatiViT pools the complementary information in each object’s set of embeddings into one global representation that combines features from the partial representation vectors; and finally, (e) the pooled representations of the two objects and the representation of the context are passed to a multilayer perceptron (MLP) that decides the spatial relationship characterising the two objects. The MLP thus learns to map object-specific features to spatial relationship classes when RelatiViT is trained on example triplets (a subject, an object and the ground-truth spatial relationship connecting them), and it learns this mapping from small batches of data containing object-pair features and their associated spatial relationships. To train RelatiViT, we may need at least one modern GPU mounted in a regular modern personal computer with enough RAM. Software stacks such as PyTorch and TensorFlow, developed over the last decade, allow machine learning and computer vision developers to design prototypes of deep neural networks and train them on data.
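
The following is a highly simplified, self-contained sketch of a pipeline in the spirit of the five steps above, written in plain PyTorch. It is not the authors’ implementation: the module sizes, the way the subject/object/context streams are pooled, and all hyperparameters are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelatiViTSketch(nn.Module):
    """Highly simplified sketch of a RelatiViT-style pipeline. NOT the authors' code:
    sizes, pooling and the handling of the subject/object/context streams are assumptions."""

    def __init__(self, embed_dim=256, num_relations=9):
        super().__init__()
        # (a) turn 16x16 image patches into embedding vectors
        self.patchify = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        # (b)-(c) a ViT-style encoder that relates all patch embeddings to each other
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # (e) an MLP head mapping pooled features to spatial-relation classes
        self.head = nn.Sequential(nn.Linear(3 * embed_dim, embed_dim),
                                  nn.ReLU(),
                                  nn.Linear(embed_dim, num_relations))

    def forward(self, image, subject_mask, object_mask):
        # image: (B, 3, H, W); masks: (B, 1, H, W) with 1 inside each object's box
        tokens = self.patchify(image).flatten(2).transpose(1, 2)        # (B, N, D)
        tokens = self.encoder(tokens)                                   # (B, N, D)
        # (d) pool the tokens whose patches overlap the subject and the object
        subj_w = F.avg_pool2d(subject_mask, 16).flatten(1)              # (B, N)
        obj_w = F.avg_pool2d(object_mask, 16).flatten(1)                # (B, N)
        def pool(weights):
            weights = weights / weights.sum(dim=1, keepdim=True).clamp(min=1e-6)
            return (weights.unsqueeze(-1) * tokens).sum(dim=1)          # (B, D)
        context = tokens.mean(dim=1)                                    # crude context pooling
        features = torch.cat([pool(subj_w), pool(obj_w), context], dim=1)
        return self.head(features)                                      # relation logits

# Illustrative forward pass on random data.
model = RelatiViTSketch()
img = torch.randn(2, 3, 224, 224)
subj = torch.zeros(2, 1, 224, 224); subj[:, :, 40:120, 30:110] = 1.0
obj = torch.zeros(2, 1, 224, 224); obj[:, :, 60:200, 120:210] = 1.0
print(model(img, subj, obj).shape)  # torch.Size([2, 9])
```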

Generating explanations

Before we conclude this article, here is an important question that will always reappear when developing machine learning models: can we reuse models like RelatiViT in critical applications where errors could be harmful or intolerable? We should first recognise that developments like RelatiViT only aim to create good classifiers for recognising spatial relationships between objects. The proposed models are crafted using the designer’s understanding of how such a model should be built, and no further features, such as classification validation, are sought. One might quickly conclude that this is a flaw of the method, but it is not: each piece of research has to define a scope, and only contributions within that scope can be delivered in the work.

Proving (where applicable) why a predicted spatial relationship is in fact the true relationship connecting two objects falls within the subject of explainability in deep learning. Explainability models uncover and report evidence about a particular decision, and they are relevant to almost all of the basic machine learning problems (including, for instance, classification and clustering). For example, if “object A is to the left of object B”, one explanation is that the mass of object B is situated to the right of the mass of object A. Another explanation is that the centres of mass of the two objects are ordered that way along the horizontal axis, which we can check simply by comparing the projections of the two centres of mass onto that axis; doing so lets us compute the relative spatial relationship between the two objects. We start to realise, then, that many explanations can describe the same event. Some explanations are identical but stated (or expressed) differently; others are complementary and worth reporting to the user of a spatial relationship classifier. An explanatory model in a deep learning system such as the one discussed here should provide the user with evidence that is as comprehensive as possible, and with as many pieces of evidence as the explanatory algorithm can produce by design.

One further point is critical by nature: can we simply trust explanations and regard them as correct without reasoning further about their correctness? The answer is no, unless the algorithm can provably produce explanations that are verified before being reported to the user. This becomes possible when we restrict ourselves to a particular application out of the very large pool of possible machine learning problems and data.
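
As a toy illustration of the centre-comparison explanation mentioned above, the function below (a hypothetical helper, not part of RelatiViT or any explainability library) decides “A is to the left of B” by comparing the horizontal projections of the two bounding-box centres:

```python
def left_of(box_a, box_b):
    """One simple, verifiable explanation of "A is to the left of B": the centre of
    box A projects onto the horizontal axis to the left of the centre of box B.
    Boxes are (x_min, y_min, x_max, y_max); a real explainer would report more evidence."""
    centre_a_x = (box_a[0] + box_a[2]) / 2
    centre_b_x = (box_b[0] + box_b[2]) / 2
    return centre_a_x < centre_b_x

print(left_of((40, 120, 260, 380), (300, 100, 620, 420)))  # True
```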

References
[1] Wen, Chuan, Dinesh Jayaraman, and Yang Gao. “Can Transformers Capture Spatial Relations between Objects?” arXiv preprint arXiv:2403.00729 (2024).

[2] Vaswani, A., et al. “Attention is all you need.” Advances in Neural Information Processing Systems (2017).

[3] Dosovitskiy, A., et al. “An image is worth 16×16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).


Sotiris Karavarsamis

Research Assistant at Visual Computing Lab (VCL)@CERTH/ITI


&

Petros Drakoulis

Research Associate, Project Manager & Software Developer at Visual Computing Lab (VCL)@CERTH/ITI


VOXReality Awards €1M to Boost XR Innovation Across Five European Initiatives

The VOXReality team is proud to announce the results of its Open Call, which supports pioneering institutions in their quest to innovate within the field of extended reality (XR). Each beneficiary is awarded 200K EUR in funding as part of a 1-year programme aimed at extending application domains for immersive XR technologies. This initiative is designed to integrate cutting-edge AI models into real-world applications, enhancing human-to-human and human-to-machine interactions across various sectors, including education, heritage, manufacturing, and career development.

Empowering Innovation: Spotlight on Selected Projects and Beneficiaries

Following a comprehensive evaluation process, the VOXReality team has selected five dynamic projects as beneficiaries, showcasing innovative approaches to XR applications:

MindPort GmbH & LEFX GmbH (Germany) – AIXTRA Programme

AIXTRA focuses on overcoming language barriers in international digital education through a VR authoring tool with automated real-time translation. By introducing AI-based virtual training partners, it aims to create a more inclusive learning environment, facilitating effective remote training sessions for multilingual participants.

Animorph Co-operative (UK) – CrossSense Programme

CrossSense smart glasses empower people living with Dementia and Mild Cognitive Impairment to live independently by supporting their ability to recall information. This project seeks to enhance user interaction in XR environments, laying the groundwork for a fully commercialised version through user testing and an open-sourced association engine.

XR Ireland (Ireland) & Āraiši ezerpils Archaeological Park (Latvia) – VAARHeT Programme

The Voice-Activated Augmented Reality Heritage Tours (VAARHeT) project aims to enhance visitor experiences at the Āraiši ezerpils Archaeological Park. By leveraging VOXReality’s AI models, the initiative will offer personalised museum tours and facilitate real-time multilingual translation of live tour guides, enriching educational opportunities for diverse visitors.

KONNECTA SYSTEMS P.C. (Greece) & IKNOWHOW S.A. (Greece) – WELD-E Programme

WELD-E addresses critical challenges in the welding industry by integrating voice and vision-based AI systems within an XR environment. This initiative aims to provide remote support for robotic welding operations, improving communication through speech recognition and automated translation to create a more effective training experience.

DASKALOS APPS (France) & CVCOSMOS (UK) – XR-CareerAssist Programme

XR-CareerAssist seeks to innovate career development by offering personalised, immersive experiences tailored to individual users. This project integrates VOXReality’s AI models to provide real-time feedback, potential career trajectories, and educational pathways, ensuring accessibility for a diverse range of users.

VOXReality’s overarching goal is to conduct research and develop new AI models that will drive future XR interactive experiences while delivering these innovations to the wider European market. The newly developed models focus on addressing key challenges in communication and engagement in various contexts, including unidirectional settings (such as theatre performances) and bidirectional environments (like conferences). Furthermore, the programme emphasizes the development of next-generation personal assistants to facilitate more natural human-machine interactions.

As VOXReality continues to advance XR and AI innovation, the successful implementation of these projects will pave the way for more immersive, interactive, and user-friendly applications. By fostering collaboration and knowledge sharing among the selected institutions, the VOXReality team is committed to enhancing the landscape of extended reality technologies in Europe.


Ana Rita Alves

Ana Rita Alves is currently working as a Communication Manager at F6S, where she specializes in managing communication and dissemination strategies for EU-funded projects. She holds an Integrated Master's Degree in Community and Organizational Psychology from the University of Minho, which has provided her with strong skills in communication, project management, and stakeholder engagement. Her professional background includes experience in proposal writing, event management, and digital content creation.


Boosting Industrial Training through VOXReality: AR’s Edge Over VR with Hololight’s training application

Leesa Joyce

Head of Research Implementation at Hololight

In today’s rapidly evolving industrial landscape, the methods used for training and skill development are undergoing significant transformations. Companies increasingly seek innovative solutions to make their training more efficient, engaging, and adaptable in prototyping and manufacturing. Two prominent technologies leading this change are Augmented Reality (AR) and Virtual Reality (VR) [1]. Both offer immersive experiences, but they serve different purposes, especially when applied to training. In the VOXReality project, where AI-assisted voice interaction in XR spaces plays a key role in building better user interfaces, HOLO’s extended reality application Hololight SPACE, with its assembly training tool, brings a new set of advantages to the user.

The Hololight SPACE, integrated into the VOXReality project, offers an augmented reality industrial assembly training system that allows workers to visualize and manipulate 3D computer-aided design (CAD) models. Through AR glasses, such as HoloLens 2, trainees can assemble components with the guidance of real-time feedback from a virtual training assistant. Augmented Reality in this context enables users to interact with both virtual models and the physical environment simultaneously, which holds several key benefits over Virtual Reality.

Overlaying 3D Objects on Real-World Environments

One of the main advantages of AR over VR is the ability to overlay virtual 3D objects onto real-world environments. In industrial assembly training, this feature is crucial when physical objects must align with virtual components. For example, AR allows a user to see a virtual engine model and align it directly on top of a real physical framework, enhancing spatial understanding. This real-time interaction between digital and physical elements ensures a seamless integration, allowing trainees to bridge the gap between virtual simulations and real-world applications.

In contrast, VR immerses users in a fully simulated environment where all objects, tools, and machinery are virtual. While this can be useful for certain training applications, it falls short when trainees need to practice in real-world contexts or with actual physical tools.

Dynamic Adaptation Based on Real-World Measurements

One of the key benefits of using AR in industrial assembly training is the ability to adapt virtual models based on real-world measurements. For example, Hololight SPACE allows for precise alignment of CAD models with real tools or machinery, ensuring accuracy in assembly tasks. The AR environment can scale or adjust virtual objects based on physical constraints, giving trainees a practical experience that is directly transferable to their real-world roles.

In VR, the environment is entirely digital, which means trainees may struggle to apply their knowledge when transitioning to real-life tasks. Without the ability to manipulate real objects, VR training can create a disconnect between theory and practice.

Enhanced Situational Awareness and Safety

In AR-based training, users remain aware of their surroundings, which is particularly important in industrial settings [2]. Hololight SPACE enables trainees to interact with both virtual and physical objects, all while remaining aware of their immediate environment, coworkers, and potential hazards. This situational awareness promotes a safer training environment, as trainees can avoid accidents or conflicts that might arise when entirely isolated from their surroundings, as is common in VR.

This added level of awareness is not possible in VR, where users are immersed in a completely digital world, which can lead to disorientation or accidents when trying to translate virtual skills into real-world tasks.

Reduced Cybersickness and Mental Load

Cybersickness is a common issue in VR training [3]. The disconnect between the user’s physical body and the virtual world can result in motion sickness and fatigue, especially during long training sessions. In contrast, AR presents virtual objects within the real world, eliminating the sensory mismatch that often leads to VR-induced cybersickness. By anchoring virtual elements to the trainee’s physical environment, AR reduces the mental and physical load, making it a more comfortable and sustainable training tool for industrial tasks.

Collaborative and Interactive Training

Another critical advantage of AR is the ability for users to see and interact with other people in the room [4,5]. In an industrial training setting, this means that instructors or fellow trainees can observe and provide real-time feedback by physical presence or by virtual presence while still allowing the trainee to engage with the virtual objects. This collaborative aspect of AR creates a more interactive learning environment, where knowledge is shared seamlessly between physical and digital spaces.

In contrast, VR isolates the user, making collaborative training more difficult unless all participants are also immersed in the same virtual environment.

References
  1. Oubibi, M., Wijaya, T.T., Zhou, Y., et al. (2023). Unlocking the Potential: A Comprehensive Evaluation of AR and VR in Education (LINK) 
  2. Akhmetov, T. (2023). Industrial Safety Using Augmented Reality and Artificial Intelligence (LINK) 
  3. Kim, J., Luu, W., & Palmisano, S. (2020). Multisensory integration and the experience of scene instability, presence and cybersickness in virtual environments (LINK) 
  4. Syed, T. A., et al. (2022). In-depth Review of Augmented Reality: Tracking Technologies, Development Tools, AR Displays, Collaborative AR, and Security Concerns (LINK) 
  5. Timmerman, M. R. (2018). Enabling Collaboration and Visualization with Augmented Reality Technology (LINK) 

A Recap of the 5th VOXReality General Assembly

The VOXReality consortium gathered at Maastricht University’s Department of Advanced Computer Science for an impactful two-day General Assembly on October 30–31, 2024. This event brought together project partners and technical teams to align on the future of VOXReality’s pioneering AR and VR initiatives.

Day one kicked off with a warm welcome from MAG, leading into sessions on project planning that set a solid foundation for the days ahead. In-depth sessions followed on the AR Training Use Case and VR Conference Use Case, where HOLO and VRDAYS showcased recent pilot results, achievements, and provided hands-on demos of their immersive applications. An update on model deployment led by SYN highlighted technical progress, while F6S presented communication strategies to expand VOXReality’s public impact.

Day two focused on collaborative growth, starting with an Exploitation Workshop that explored paths for maximizing project impact. Our recently joined third-party contributors then presented their Open Call projects, sparking engaging discussions and potential collaborations. The event closed with the AR Theatre Use Case, led by AF and MAG, which captivated attendees with pilot results and a demo showcasing AR’s potential in live theatre.

The VOXReality General Assembly showcased the power of innovation and collaboration, with each session reinforcing the project’s vision of immersive technology’s future. 

Stay tuned as VOXReality pushes forward on this exciting path! 🚀


Ana Rita Alves

Ana Rita Alves is an International Project Manager and current Communication Manager at F6S. With a background in European project management and a Master’s in Psychology from the University of Minho, she excels in collaborative international teams and driving impactful dissemination strategies.


Empowering VR Events with AI: The Role of Dialogue Agents in Enhancing Participant Experience at VR Conferences

As virtual reality (VR) and artificial intelligence (AI) continue to advance, dialogue agents are emerging as a crucial element to improve user engagement in VR events, particularly conferences. By adding a layer of ease and engagement, these intelligent, AI-powered assistants transform VR events and conferences from static encounters into dynamic, responsive settings. Dialogue agents can guide participants, answer questions, and create a more accessible experience, all while bridging the gap between the real and virtual worlds. With these improved capabilities, VR events and conferences are no longer confined by the physical limitations that often accompany in-person events; instead, they thrive as immersive, inclusive spaces.

Having a dialogue agent in a VR conference environment significantly enhances user experience by providing real-time assistance, guidance, and personalized interaction within a virtual space. The primary role of dialogue agents in VR conferences is to serve as the interface between participants and the virtual environment, acting as assistants to help users with various tasks. Instead of struggling to perform certain actions, users can simply ask the agent, which provides valuable information and insights. This approach is like having a personal assistant delivering on-the-spot assistance, transforming a potentially complex virtual environment into an intuitive, user-friendly experience.

In our VOXReality project, we have developed a dialogue agent tailored specifically for VR conferences to provide an all-in-one assistant experience. The agent offers seamless navigation assistance, answers questions about the event program, and provides information about the trade show. Users can ask how to reach a particular room, and the agent not only responds in natural language but also provides visual cues that guide them directly to their destination. This integration of verbal and visual guidance makes navigation in VR environments feel natural and intuitive, creating a more enjoyable and accessible experience.

The VOXReality agent’s ability to offer natural language responses and visual cues for navigation makes it easy for participants to orient themselves in the VR environment, reducing the learning curve and improving accessibility, especially for first-time users. This functionality allows attendees to focus on the event itself rather than getting bogged down by navigation challenges, leading to a more engaging and immersive conference experience.

Furthermore, the agent’s ability to provide information about the event schedule and program details ensures that attendees can maximize their time, effortlessly accessing the right sessions, booths, or networking opportunities. Beyond navigation, the dialogue agent acts as a comprehensive knowledge resource, answering questions about event topics, speaker details, and exhibition information, reducing the need for attendees to consult external resources or manuals.

This agent can be configured for various VR applications, offering flexible support tailored to each event’s needs, whether it’s a trade show, panel discussion, or networking session. This adaptability enhances user satisfaction and opens up possibilities for personalized content delivery, fostering a deeper connection between attendees and the conference content. By integrating automatic speech recognition (ASR), neural machine translation (NMT), and dialogue modeling, the agent minimizes language barriers, supporting inclusive and diverse participation on a global scale.

By creating a multi-functional dialogue agent, VOXReality is setting a new standard for VR conferences and events. Our agent’s ability to respond to user needs and provide real-time support enhances the VR event experience and fosters an interactive atmosphere. As VR conferences continue to grow, the role of such intelligent agents will become even more crucial, helping to make VR a more inclusive and engaging medium for global events. Whether it’s guiding attendees to the right room, keeping them informed on program highlights, or making trade shows easily accessible, our dialogue agent embodies the potential of AI in VR environments, ensuring that every participant feels supported and connected throughout their journey.


Stavroula Bourou

Machine Learning Engineer at Synelixis Solutions SA


Voice User Interface: VOXReality bridging the gap through user-friendly XR

Leesa Joyce

Head of Research Implementation at Hololight

In a fast-evolving industrial environment, training assembly-line workers can be a complex and time-consuming process. Traditional training methods often fall short in engaging workers or adapting to their individual learning styles, leading to suboptimal outcomes. To address this, the VOXReality project aims to enhance the training experience by integrating augmented reality (AR) with cutting-edge technologies like automated speech recognition (ASR) and a dynamic dialogue system. This use case focuses on creating an immersive and interactive training environment where workers can visualize and interact with 3D CAD files while receiving real-time feedback and voice-assisted guidance. In such scenarios, the design of the user interface (UI) plays a pivotal role in shaping both attention span and mental load. Research shows that the more intuitive and user-friendly the interface is, the more focused and efficient the worker will be. Let’s dive into how UI design impacts these cognitive aspects, and how user-centric elements, like voice assistance, further enhance the experience. 

New technology often creates a barrier for users unfamiliar with complex interfaces, leading to frustration and resistance as it requires users to understand new types of inputs, commands, or gestures. Many individuals feel anxious about making mistakes or struggle with the cognitive load of learning new systems, which can result in avoidance. Voice assistance in XR interfaces addresses this by allowing users to interact through natural speech, reducing the need to master unfamiliar controls. This lowers the entry barrier, making the technology more accessible and easing the adoption process for users who might otherwise be reluctant to engage with it. 

The Role of UI in Attention Span and Mental Load

When it comes to immersive AR training, the way information is presented can either help or hinder a worker’s focus. Poorly designed interfaces, cluttered with unnecessary information or requiring too much effort to navigate, can overwhelm users, leading to reduced attention and increased mistakes. On the other hand, a well-designed UI can guide the user seamlessly through tasks, keeping their focus on the assembly process rather than on the mechanics of the interface itself. 

According to Sweller’s Cognitive Load Theory (CLT), cognitive load is divided into three categories: intrinsic, extraneous, and germane load. Intrinsic load is related to the complexity of the task—assembling an engine, for example, is naturally a challenging task. Extraneous load is the effort required to use the UI or understand instructions, while germane load refers to the mental effort invested in learning or solving problems. A well-designed AR interface reduces extraneous load, allowing workers to allocate more of their cognitive resources toward learning and performing the task (Paas et al., 2003). 

AR interfaces that minimize distractions and present only the necessary information allow workers to focus on the task at hand. This focus extends their attention span, making it easier to retain information and apply it in real-time. Research supports the idea that UIs that are simple, clean, and contextually relevant improve not only attention but also performance (Dünser et al., 2008). Over time, this efficiency can lead to better learning outcomes and fewer errors during training. 

The Impact of User-Centric UI Design

User-centric design, focused on the needs and preferences of the worker, has a profound impact on how effectively the AR training environment supports learning. For example, incorporating voice assistance into AR interfaces can significantly reduce the cognitive load. When workers can receive verbal instructions or ask the system for help hands-free, they can focus entirely on the physical task, rather than switching attention back and forth between the AR display and their hands. Studies have shown that multimodal interfaces, which combine visual, auditory, and sometimes haptic feedback, can improve performance and reduce mental strain (Billinghurst et al., 2015). Additionally, conversational assistance through natural speech input is immersive and closer to real-life training delivered by a human trainer. 

Additionally, personalized UI elements, such as customizable display settings or progress-tracking tools, help workers feel more in control and confident in their training. This sense of control can reduce psychological stress and improve engagement, making it easier for users to stay focused on learning without feeling overwhelmed (Norman, 2013). A well-designed UI takes into account not only the technical aspects of the task but also the psychological well-being of the user, helping to create an environment where workers are less likely to feel fatigued or frustrated. 

Psychological Effects of AR UIs

Beyond the immediate practical benefits, there are deeper psychological impacts of a user-centric AR UI. When the interface is intuitive, users experience a state of flow, which is a heightened state of focus and engagement where they lose track of time and become fully absorbed in the task (Csikszentmihalyi, 1990). Flow states are often linked to better learning and productivity, as they help users maintain concentration without unnecessary interruptions. 

Moreover, reducing cognitive load through intuitive design contributes to lower stress levels, particularly in high-stakes environments like industrial assembly lines where mistakes can be costly. By providing clear guidance and eliminating unnecessary complexity, the AR interface acts as a supportive tool, making workers feel more competent and less anxious (Dehais et al., 2012). This is critical for building both confidence and long-term competence in a new skill.

References

Csikszentmihalyi, M. (1990). Flow: The Psychology of Optimal Experience. Harper & Row. 

Dehais, F., Causse, M., Vachon, F., & Tremblay, S. (2012). Cognitive conflict in human-automation interactions: a psychophysiological study. Applied ergonomics, 43(3), 588–595. https://doi.org/10.1016/j.apergo.2011.09.004 

Dünser, A., Grasset, R., & Billinghurst, M. (2008). A survey of evaluation techniques used in augmented reality studies. ACM SIGGRAPH ASIA 2008 Courses, 1-27. 

Billinghurst, M., Clark, A., & Lee, G. (2015). A Survey of Augmented Reality. Now Publishers. doi: 10.1561/1100000049. 

Norman, D. (2013). The Design of Everyday Things. Basic Books. 

Paas, F., Renkl, A., & Sweller, J. (2003). Cognitive load theory and instructional design: Recent developments. Educational Psychologist, 38(1), 1-4. 


Partner Interview #7 with Athens Epidaurus Festival

As one of Greece’s most esteemed cultural organizations, AF has been at the forefront of the country’s artistic landscape since 1955. In this conversation, we are joined by Eleni Oikonomou, who shares insights into AF’s involvement in the groundbreaking VOXReality project. As a use case partner, AF leads the Augmented Theatre initiative, collaborating with technical experts to merge AR technology with live theatre. Through pilots featuring excerpts from ancient Greek tragedies, AR glasses deliver translated captions and visual effects, blending the physical stage with digital elements to enhance accessibility and audience immersion.

Can you provide an overview of your organization's involvement in the VOXReality project and your specific role within the consortium?

The Athens Epidaurus Festival is one of Greece’s biggest cultural organizations and has organized the summer festival of the same name since 1955. In the VOXReality project, AEF is a use case partner, owning the Augmented Theatre use case. We are working with our technical partners to merge AR elements with theatre, with the goal of enhancing accessibility and audience immersion. This includes pilots featuring excerpts from an Ancient Greek tragedy, where translated captions and audiovisual effects are delivered through AR glasses, merging the physical stage with digital elements for a more immersive theatrical experience.

How is the effectiveness of the language translation feature in enhancing the audience's experience during theatrical performances ensured, to deliver a seamless and authentic experience for the audience?

Multiple strategies are employed to address issues of accuracy, precision and latency in the delivery of the caption feature of the Augmented Theatre use case.

To start with, in theatre, translation is an art form in itself. Theatrical texts require a level of precision and sensitivity to convey not only the literal meaning, but also the emotional, cultural, and dramatic nuances that are essential to the performance. Therefore, the real time translation feature of VOXReality is based on literary translations from Ancient Greek, performed by acclaimed translators to safeguard the integrity of the play. Additionally, internal controls and evaluations are carried out to assess the performance of the translation feature to ensure the artistic integrity of the original text.

Finally, two internal pilots and a small-scale public pilot have already been deployed, with a goal to assess the quality of the use case and fine tune the features we are developing. During the public pilot, we had the opportunity to gather feedback on our translation and caption feature from the participants, via questionnaires and semi structured interviews. This feedback has been valuable in improving the use case and refining our future steps.

Considering the importance of cultural nuances in theatrical expressions, how does the language translation system address the challenge of maintaining cultural sensitivity and preserving the artistic intent of the performance?

This is another reason why literary translations are used in our use case. As previously explained, literary translations are irreplaceable, even as AR technology presents the exciting potential to accommodate native speakers of diverse linguistic backgrounds. While translated captions are becoming more common in theatres, they are not universally available and typically cover only a limited number of languages. The ability of Augmented Theatre to extend beyond these traditional limitations underscores the importance of a solid foundation in literary translations to ensure that cultural and artistic elements are preserved.

The language translation system for VOXReality prioritizes cultural sensitivity and artistic integrity by relying on these literary translations, which capture the cultural nuances and emotional subtleties of the original text. To ensure that these aspects are preserved throughout the development, we conduct thorough evaluations of the translation outputs through internal checks. This evaluation is crucial for verifying that the translations maintain the intended cultural and artistic elements, thereby respecting the integrity of the original performance.

Considering advancements in technology, do you foresee the integration of augmented reality to enhance theatrical experiences, and how might this impact the audience's engagement with live performances?

AR has the potential to transform traditional theatre by offering immersive experiences, blending digital and physical elements to create new artistic dimensions. This technology allows for new dynamic ways of storytelling, capturing audience attention and enabling interactive elements that can enhance engagement with the performance.

One of the key opportunities AR offers is improving inclusivity and accessibility in theatre. This aligns with our organization’s goals, driven by a strong commitment to supporting inclusive practices and leveraging AEF’s international outreach. AR has the potential to engage audiences who may feel excluded from traditional theatre spaces, whether due to physical, linguistic, or sensory barriers, in unprecedented ways.

The rise of augmented reality (AR) technologies in the coming years is therefore set to make a lasting impact on how audiences engage with live performances. However, thoughtful consideration is essential to balance technological advancement with artistic integrity. This includes ensuring that AR enhances rather than detracts from the human experience of live performance, and considering the impact on traditional theatre roles and artists’ rights.

Furthermore, while integrating AR into live theatre presents exciting possibilities, it also comes with several challenges. Combining AR with physical elements can sometimes be distracting for the audience, and technical issues like glitches or misalignment can disrupt the performance’s flow and break immersion. These challenges are compounded by the current limitations of bulky AR equipment. However, advancements in technology are expected to address these issues, leading to more sophisticated and user-friendly equipment. These challenges highlight why the VOXReality project is so exciting, as it allows us to explore and refine how AR can complement theatre in a real-world context.

How do you see theatre exploring virtual platforms for performances in the foreseen future? How might AR VFX be utilized to reach broader audiences or create unique immersive theatre experiences?

As an emerging medium, virtual platforms have the potential to revolutionize theatre by expanding its reach and creating new engagement opportunities. In our project, we use Augmented Reality Visual Effects (AR VFX) to blend digital elements with a live performance, exploring how these technologies can affect and create immersion in the theatre experience. In our use case, voice-activated VFX accompany a scene from an ancient Greek play. Since in ancient Greek tragedy events are not enacted on stage and the retelling of events by actors is the norm, the VFX developed in our Augmented Theatre use case follow this narrative tradition, bringing the described elements to life for the audience in a very innovative way.

More broadly, the integration of virtual platforms and VFX opens up numerous possibilities for innovation in theatre. They can create dynamic, interactive backgrounds that change in response to the action on stage, or integrate virtual characters and objects that interact with live performers; the possibilities are very diverse. VFX can also help overcome physical barriers by providing virtual set designs that are not constrained by physical space, and address geographic limitations, even enabling remote audiences to experience the performance in a virtual environment or allowing actors to perform from different locations.


Elena Oikonomou

Athens Epidaurus Festival


Virtual Conferencing, So Far, So Close.

When the COVID-19 pandemic brought our travelling business culture to a grinding halt, we understood how dependent we were on free movement and face-to-face interaction for our business to develop and progress. The event industry was not spared, with the Dutch event industry alone suffering a staggering revenue loss of 1.23 billion Euros in 2020 (Statista, 2022).

During COVID-19, we witnessed a remarkable shift in our business culture. We did not just adapt to the new normal; we embraced it. Virtual networking, online video platforms, and innovative event formats like arcade-style networking and virtual cocktails have become a permanent part of our business landscape, inspiring us to continue evolving and finding new ways to connect.

According to Precedence Research, the global video conferencing market was valued at $7.01 billion in 2022 and is projected to reach $22.26 billion by 2032, driven by a compound annual growth rate (CAGR) of 12.30% (Research, 2024).

Due to the COVID-19 pandemic, the demand for online networking platforms surged as remote work and social distancing necessitated virtual engagement across professional and social spheres. Leading platforms such as LinkedIn, Slack, and Microsoft Teams experienced significant user growth, capitalising on their ability to facilitate professional connections, job opportunities, and industry events.

Meanwhile, video conferencing tools like Zoom and specialised event platforms like Hopin became essential for virtual conferences and networking sessions, providing users with interactive and scalable solutions. Interest-based platforms, including Discord, Facebook Groups, and Reddit communities, saw increased activity as users sought niche spaces for social interaction and knowledge exchange. During this period, Clubhouse, a new entrant, gained rapid traction with its audio-only, invite-only format, appealing to users looking for real-time conversations. Additionally, virtual event platforms like Airmeet, Remo, and Brella emerged as innovative solutions for online conferences, offering networking tools that replicated the interactivity of in-person events.

In 2022, WhatsApp Business (messenger and video) saw its user base surpass 1.26 billion, with the Asia-Pacific (APAC) region contributing the highest number of users at 808.17 million. Zoom reported a 6.9% increase in revenue in 2023, reaching $4.39 billion. Microsoft Teams experienced record downloads in Q2 2020, hitting 70.43 million. Meanwhile, Cisco Webex reported 650 million monthly meeting participants in 2021, averaging 21.7 million daily. North America held the largest market share for video conferencing globally in 2022, accounting for 41% of the total market. The APAC region is also projected for significant growth, with its video conferencing market expected to reach $6.8 billion by 2026 (Sukhanova, 2024).

As this shift permeated all areas of our social activities, the VOXReality consortium set out to develop voice-driven interactions for XR spaces, where virtual B2B events can deliver more value to their global clients. In 2021, the number of trade shows hosted in the Netherlands dropped significantly compared to 2019, primarily due to the effects of the COVID-19 pandemic. By 2023, the country only held 53 events, a sharp decline from the 132 trade fairs organised in 2018 (Statista, 2023).

Image generated by Adobe Firefly

Virtual Reality (VR) has the potential to transform our methods of communication and interaction. Unlike other distance-based communication tools, VR stands out for its enhanced interactivity and visualisation capabilities, such as displaying data, documents, and 3D models. This makes VR interactions a promising option for more effective remote business meetings and engaging social interactions.

Currently, voice-activated personal assistants are used to engage with customers, offer support, overcome language barriers, and streamline basic operations. While some assistants incorporate additional modalities like text or images, their functionality remains limited to simple tasks such as setting alarms or controlling devices. They lack the ability to handle more complex interactions.

VOXReality aims to advance AI models by developing systems that integrate audiovisual and spatio-temporal contexts, enabling personal assistants to better understand and interact with their environment. These systems will allow new applications, such as instruction assistants or navigation guides, through novel self-supervised vision and language systems. By grounding language in both spatial and semantic contexts, VOXReality seeks to enable more sophisticated assistant responses, offering richer, context-aware interactions and higher-level reasoning.

By developing a digital agent and services that provide virtual venue navigation, programme information, and, most importantly, automatic translation services for global virtual event attendees, we aim to make business interactions in virtual environments meaningful, effective, and fully assisted. We strive to challenge how conference interactions are delivered today by adding new tools and value to tomorrow’s event industry.

[1] Statista. (2022, December 9). Expected revenue and loss due to coronavirus in event industry Netherlands in 2020. https://www.statista.com/statistics/1108551/expected-revenue-and-loss-due-to-coronavirus-in-event-industry-netherlands/

[2] Research, P. (2024, August 28). Video conferencing market size to hit USD 28.26 bN by 2034. https://www.precedenceresearch.com/video-conferencing-market

[3] Sukhanova, K. (2024, July 22). Video Conferencing market Statistics. The Tech Report. https://techreport.com/statistics/software-web/video-conferencing-market-statistics

[4] Statista. (2023, October 27). Number of trade shows in the Netherlands 2009-2021. https://www.statista.com/statistics/460581/number-of-trade-fairs-in-the-netherlands/


Manuel Toledo - Head of Production at VRDays Foundation

Manuel Toledo is a driven producer and designer with over a decade of experience in the arts and creative industries. Through various collaborative projects, he merges his creative interests with business research experience and entrepreneurial skills. His multidisciplinary approach and passion for intercultural interaction have allowed him to work effectively with diverse teams and clients across cultural, corporate, and academic sectors.

Starting in 2015, Manuel co-founded and produced the UK’s first architecture and film festival in London. Since early 2022, he has led the production team for Immersive Tech Week at VRDays Foundation in Rotterdam and serves as the primary producer for the XR Programme at De Doelen in Rotterdam. He is also a founding member of ArqFilmfest, Latin America’s first architecture and film festival, which debuted in Santiago de Chile in 2011. In 2020, Manuel earned a Master’s degree from Rotterdam Business School, with a thesis focused on innovative business models for media enterprises. He leads the VRDays Foundation’s team’s contributions to the VOXReality project.


Preserving Audiovisual Heritage: Exploring the Role of Extended Reality

Audiovisual Heritage definition

Audiovisual heritage refers to the collection of sound and moving image materials that capture and convey cultural, historical, and social information. It includes cultural products, such as films, radio broadcasts, music recordings, and other forms of multimedia, as well as the instruments, devices and machines used in their production, recording and reproduction, and the analog and digital formats used to store them.

Preservation and accessibility

Preserving audiovisual heritage is crucial because analog formats (like film reels, magnetic tapes, and vinyl records) are vulnerable to physical decay. Digital formats, meanwhile, face the risk of obsolescence as technology evolves. Preservation is only the first step, though: once digitized and archived, these materials are often difficult for the public to access. To allow the public to meaningfully engage with each artifact and understand its significance, it is important to contextualize it within a curated framework.

Reasons and methods to use Extended Reality

This is where Extended Reality (XR) comes in: an umbrella term that encompasses Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR). XR describes a complex branch of emerging technology that allows users to interact with content in immersive ways. XR can isolate users’ senses from their physical environment and allow them to experience (e.g. see and listen to) audiovisual heritage artifacts in a virtual space specifically designed for that purpose. This can be seen as a virtual counterpart to how museums thoughtfully design physical displays to best showcase their exhibits. XR also enables creators to craft narratives around artifacts, enhancing their cultural and historical value, a key area where XR shines.

Examples

One such example is the VR work “Notes on Blindness” [1], which allows users to listen to original audio recordings of the writer John Hull describing his journey into blindness. The work immerses users in darkness while they listen to the recordings, and in addition visualizes the narrative with a subtle yet decisive aesthetic.

Retrieved from https://www.arte.tv/digitalproductions/en/notes-on-blindness/

Another example is “Traveling While Black” [2], a VR work documenting racial discrimination against African Americans in the United States. The work uses original audio and film excerpts, including interviews with people who lived through segregation and with their descendants. Viewing this work from an immersive, first-person perspective, rather than on a flat monitor, gives the audience more affordances for critical engagement and self-reflection. Another great example is a VR work from the exhibition ‘The March’ at the DuSable Museum of African American History in Chicago, chronicling the historic 1963 March on Washington [3]. The work contains recordings from Martin Luther King Jr.’s iconic ‘I Have A Dream’ speech.

Retrieved from https://dusablemuseum.org/exhibition/the-march/

Limitations

Despite its potential, applied examples do not yet abound, because such productions involve significant expertise and cost. Limitations exist in hardware, software, and HCI design, and are gradually being addressed. Research effort is being invested in design methodologies that streamline production and improve audience satisfaction. Practical issues, like hardware production costs and form-factor discomfort, are being mitigated by commercial investments from major tech companies such as Microsoft, Google, Samsung, Apple, and Meta. Industry standards with cross-platform and legacy support, such as OpenXR, are another important factor for broader adoption. Finally, the audience’s familiarity with and interest in this technology is increasing as it permeates more and more aspects of daily life.

Conclusion

It is clear that extended reality technology can transform how we engage with our audiovisual heritage: it can offer contextualization, situate both the audience and the artifact in a narrative framework, and add depth and nuance to our interactions. While challenges remain, the limitations are steadily being lifted by efforts across a multitude of involved fields, which is a testament to the importance of this domain. We eagerly anticipate the next innovative steps from museums, galleries, research centers, studios, and film companies worldwide.


Spyros Polychronopoulos

Research Manager at ADAPTIT and Assistant Professor at the Department of Music Technology and Acoustics of the HMU


Fostering Innovation: KIT and UM’s Collaborative Leap in NLP and Machine Translation

On January 26th, 2024, the Maastricht University (UM) VOXReality team was hosted by the Artificial Intelligence 4 Language Technologies (AI4LT) group of the Karlsruhe Institute of Technology (KIT). It was a day-long workshop where both groups presented their work in Natural Language Processing (NLP), and more specifically Machine Translation (MT). Synergies between the two groups promise a bright future for applied language technologies!

UM kicked off the day by presenting the VOXReality project, and more specifically its three use cases along with the general objectives: (1) improve human-to-machine and human-to-human XR experiences, (2) widen multilingual translation and adapt it to different contexts, (3) extend and improve the visual grounding of language models, (4) provide accessible pretrained XR models optimized for deployment, and (5) demonstrate clear integration paths for the pretrained models. UM team member Yusuf Can Semerci, the project’s scientific and technical coordinator, elaborated on the technical excellence of the project, which is ensured by applying state-of-the-art methods in automatic speech recognition (ASR), multilingual machine translation, vision-and-language models, and generative dialogue systems.

UM’s team has two active PhD candidates who shared their latest research endeavors. Abderrahmane Issam explained his latest work on efficient simultaneous machine translation (SiMT). The goal of SiMT is to provide accurate translations in as close to real time as possible, by developing policies that balance the quality of the produced translation against the lag that is sometimes necessary for the model to gather enough information to translate properly. UM’s proposed method learns when to wait for more input in the source language before starting to produce the translation, taking into account the uncertainty that comes with real-time applications. Results are promising both in translation accuracy and in reducing the necessary lag.
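As a rough, illustrative aside (not the adaptive UM method described above, which learns when to wait), the sketch below shows the classic fixed wait-k scheduling rule that much SiMT work builds on: read k source tokens, then alternate writing and reading. The translate_step callable is a hypothetical stand-in for an incremental decoder; the dummy plugged in at the bottom simply echoes source tokens so the example runs end to end.

```python
from typing import Callable, List

def wait_k_schedule(source: List[str],
                    k: int,
                    translate_step: Callable[[List[str], List[str]], str]) -> List[str]:
    """Simulate a fixed wait-k policy: READ the first k source tokens, then
    alternate WRITE/READ; once the source is exhausted, keep WRITEing.
    For simplicity we assume one target token per source token."""
    target: List[str] = []
    read = 0
    while len(target) < len(source):
        if read < min(k + len(target), len(source)):
            read += 1                                              # READ the next source token
        else:
            target.append(translate_step(source[:read], target))  # WRITE one target token
    return target

if __name__ == "__main__":
    # Hypothetical incremental "decoder" that just uppercases the aligned source token.
    dummy_step = lambda src_prefix, tgt_prefix: src_prefix[len(tgt_prefix)].upper()
    print(wait_k_schedule("wir sehen uns morgen".split(), k=2, translate_step=dummy_step))
    # -> ['WIR', 'SEHEN', 'UNS', 'MORGEN']
```

An adaptive policy of the kind discussed above would replace the fixed threshold in the READ branch with a learned decision that also considers the model's uncertainty about the input seen so far.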

Pawel Maka presented his published paper on context-aware machine translation. Context plays an important role in all language applications: in machine translation it is essential for resolving ambiguities such as which pronoun should be used. Context can be represented in different ways and usually consists of the sentences preceding (or following) the one we want to translate, on either the source or the target side. Of course, the larger the context, the more computationally expensive it is to run a translation model. Therefore, UM proposed different methods for efficiently “compressing” context through techniques like caching and shortening. The proposed methods are competitive both in accuracy and in the resources used (e.g. memory).
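To make the caching idea concrete, here is a minimal, self-contained sketch (an illustration under our own assumptions, not the published implementation): each context sentence is encoded once, and the encoding is reused whenever that sentence reappears as context. dummy_encode is a hypothetical stand-in for a real sentence encoder.

```python
import hashlib
import numpy as np

class ContextCache:
    """Cache encodings of previously seen context sentences so that a
    context-aware translator does not re-encode the same sentence twice."""

    def __init__(self, encode):
        self.encode = encode          # callable: str -> np.ndarray
        self._store = {}

    def get(self, sentence: str) -> np.ndarray:
        key = hashlib.sha1(sentence.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self.encode(sentence)   # encode once, then reuse
        return self._store[key]


def dummy_encode(sentence: str) -> np.ndarray:
    # Hypothetical stand-in for a real sentence encoder: a deterministic
    # pseudo-embedding derived from the sentence's hash.
    seed = int(hashlib.sha1(sentence.encode("utf-8")).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).standard_normal(8)


if __name__ == "__main__":
    cache = ContextCache(dummy_encode)
    document = [
        "Sie gab ihrem Bruder das Buch.",
        "Er dankte ihr.",
        "Er dankte ihr.",             # repeated sentence: served from the cache
    ]
    # Context for translating sentence i = encodings of the preceding sentences.
    for i in range(1, len(document)):
        context = [cache.get(s) for s in document[:i]]
    print(len(cache._store), "unique context encodings cached for", len(document), "sentences")
```

In a real system the cached objects would be the encoder's hidden states (or shortened summaries of them) rather than toy vectors, and the cache would typically be bounded to the last few sentences of the document.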

On the other hand, KIT’s team presented the EU project Meetween. Meetween aims to revolutionize video-conferencing platforms by breaking linguistic barriers and geographical constraints. It aspires to deliver open-source AI models and datasets: more specifically, multilingual AI models that focus on speech but support text, audio, and video as both inputs and outputs, and multimodal, multilingual datasets covering all official EU languages.

KIT’s team of PhD candidates presented their work on (1) multilingual translation in low-resource settings (i.e. for languages that are not widely spoken or for which data is scarce), (2) low-resource automatic speech recognition, (3) the use of Large Language Models (LLMs) in context-aware machine translation, and (4) quality/confidence estimation for machine translation.

We were happy to identify the overlaps between the two EU projects (VOXReality and Meetween) as well as between the UM and KIT teams. At the heart of both projects lies a common objective: to harness the power of advanced AI technologies, particularly in the realms of Natural Language Processing (NLP) and Machine Translation (MT), to facilitate seamless communication across linguistic and geographical barriers. While the applications and approaches may differ, the essence of their goals remains intertwined. VOXReality (led by UM) seeks to enhance extended reality (XR) experiences by integrating natural language understanding with computer vision, while KIT’s Meetween project takes a different but complementary approach to revolutionizing communication platforms. By fostering an environment of open collaboration and knowledge exchange, UM and KIT are more than excited about what the future holds for their collaboration.


Jerry Spanakis

Assistant Professor in Data Mining & Machine Learning at Maastricht University
