
Empowering VR Events with AI: The Role of Dialogue Agents in Enhancing Participant Experience at VR Conferences

As virtual reality (VR) and artificial intelligence (AI) continue to advance, dialogue agents are emerging as a crucial element for improving user engagement in VR events, particularly conferences. By adding a layer of ease and engagement, these intelligent, AI-powered assistants transform VR events and conferences from static encounters into dynamic, responsive settings. Dialogue agents can guide participants, answer questions, and create a more accessible experience, all while bridging the gap between the real and virtual worlds. With these improved capabilities, VR events and conferences are no longer confined by the physical limitations that often accompany in-person events; instead, they thrive as immersive, inclusive spaces.

Having a dialogue agent in a VR conference environment significantly enhances user experience by providing real-time assistance, guidance, and personalized interaction within a virtual space. The primary role of dialogue agents in VR conferences is to serve as the interface between participants and the virtual environment, acting as assistants to help users with various tasks. Instead of struggling to perform certain actions, users can simply ask the agent, which provides valuable information and insights. This approach is like having a personal assistant delivering on-the-spot assistance, transforming a potentially complex virtual environment into an intuitive, user-friendly experience.

In our VOXReality project, we have developed a dialogue agent tailored specifically for VR conferences to provide an all-in-one assistant experience. The agent offers seamless navigation assistance, answers questions about the event program, and provides information about the trade show. Users can ask how to reach a particular room, and the agent not only responds in natural language but also provides visual cues that guide them directly to their destination. This integration of verbal and visual guidance makes navigation in VR environments feel natural and intuitive, creating a more enjoyable and accessible experience.

The VOXReality agent’s ability to offer natural language responses and visual cues for navigation makes it easy for participants to orient themselves in the VR environment, reducing the learning curve and improving accessibility, especially for first-time users. This functionality allows attendees to focus on the event itself rather than getting bogged down by navigation challenges, leading to a more engaging and immersive conference experience.

Furthermore, the agent’s ability to provide information about the event schedule and program details ensures that attendees can maximize their time, effortlessly accessing the right sessions, booths, or networking opportunities. Beyond navigation, the dialogue agent acts as a comprehensive knowledge resource, answering questions about event topics, speaker details, and exhibition information, reducing the need for attendees to consult external resources or manuals.

This agent can be configured for various VR applications, offering flexible support tailored to each event’s needs, whether it’s a trade show, panel discussion, or networking session. This adaptability enhances user satisfaction and opens up possibilities for personalized content delivery, fostering a deeper connection between attendees and the conference content. By integrating automatic speech recognition (ASR), neural machine translation (NMT), and dialogue modeling, the agent minimizes language barriers, supporting inclusive and diverse participation on a global scale.
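As a rough illustration of how such an ASR–translation–dialogue chain can be wired together, the sketch below uses off-the-shelf open-source models via Hugging Face pipelines. It is a hedged example, not the VOXReality components themselves; the model names, the audio file, and the German-speaking attendee are assumptions for illustration.

```python
# Hedged sketch: chaining generic open-source ASR, MT, and dialogue models with
# Hugging Face pipelines. Model names, the audio file, and the German example
# are assumptions for illustration; they are not the VOXReality models.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
translate = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")
chat = pipeline("text-generation", model="microsoft/DialoGPT-medium")

# 1) Transcribe the attendee's spoken question (hypothetical audio file).
question_de = asr("attendee_question.wav")["text"]
# 2) Translate it into the event's working language.
question_en = translate(question_de)[0]["translation_text"]
# 3) Let a dialogue model draft an assistant reply.
prompt = f"You are a conference assistant. Attendee: {question_en}\nAssistant:"
reply = chat(prompt, max_new_tokens=40)[0]["generated_text"]
print(reply)
```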

By creating a multi-functional dialogue agent, VOXReality is setting a new standard for VR conferences and events. Our agent’s ability to respond to user needs and provide real-time support enhances the VR event experience and fosters an interactive atmosphere. As VR conferences continue to grow, the role of such intelligent agents will become even more crucial, helping to make VR a more inclusive and engaging medium for global events. Whether it’s guiding attendees to the right room, keeping them informed on program highlights, or making trade shows easily accessible, our dialogue agent embodies the potential of AI in VR environments, ensuring that every participant feels supported and connected throughout their journey.


Stavroula Bourou

Machine Learning Engineer at Synelixis Solutions SA


Voice User Interface: VOXReality bridging the gap through user-friendly XR

Leesa Joyce

Head of Research Implementation at Hololight

In a fast-evolving industrial environment, training assembly-line workers can be a complex and time-consuming process. Traditional training methods often fall short in engaging workers or adapting to their individual learning styles, leading to suboptimal outcomes. To address this, the VOXReality project aims to enhance the training experience by integrating augmented reality (AR) with cutting-edge technologies like automated speech recognition (ASR) and a dynamic dialogue system. This use case focuses on creating an immersive and interactive training environment where workers can visualize and interact with 3D CAD files while receiving real-time feedback and voice-assisted guidance. In such scenarios, the design of the user interface (UI) plays a pivotal role in shaping both attention span and mental load. Research shows that the more intuitive and user-friendly the interface is, the more focused and efficient the worker will be. Let’s dive into how UI design impacts these cognitive aspects, and how user-centric elements, like voice assistance, further enhance the experience. 

New technology often creates a barrier for users unfamiliar with complex interfaces, leading to frustration and resistance as it requires users to understand new types of inputs, commands, or gestures. Many individuals feel anxious about making mistakes or struggle with the cognitive load of learning new systems, which can result in avoidance. Voice assistance in XR interfaces addresses this by allowing users to interact through natural speech, reducing the need to master unfamiliar controls. This lowers the entry barrier, making the technology more accessible and easing the adoption process for users who might otherwise be reluctant to engage with it. 

The Role of UI in Attention Span and Mental Load

When it comes to immersive AR training, the way information is presented can either help or hinder a worker’s focus. Poorly designed interfaces, cluttered with unnecessary information or requiring too much effort to navigate, can overwhelm users, leading to reduced attention and increased mistakes. On the other hand, a well-designed UI can guide the user seamlessly through tasks, keeping their focus on the assembly process rather than on the mechanics of the interface itself. 

According to Sweller’s Cognitive Load Theory (CLT), cognitive load is divided into three categories: intrinsic, extraneous, and germane load. Intrinsic load is related to the complexity of the task—assembling an engine, for example, is naturally a challenging task. Extraneous load is the effort required to use the UI or understand instructions, while germane load refers to the mental effort invested in learning or solving problems. A well-designed AR interface reduces extraneous load, allowing workers to allocate more of their cognitive resources toward learning and performing the task (Paas et al., 2003). 

AR interfaces that minimize distractions and present only the necessary information allow workers to focus on the task at hand. This focus extends their attention span, making it easier to retain information and apply it in real-time. Research supports the idea that UIs that are simple, clean, and contextually relevant improve not only attention but also performance (Dünser et al., 2008). Over time, this efficiency can lead to better learning outcomes and fewer errors during training. 

The Impact of User-Centric UI Design

User-centric design—focused on the needs and preferences of the worker—has a profound impact on how effectively the AR training environment supports learning. For example, incorporating voice assistance into AR interfaces can significantly reduce the cognitive load. When workers can receive verbal instructions or ask the system for help hands-free, they can focus entirely on the physical task, rather than switching attention back and forth between the AR display and their hands. Studies have shown that multimodal interfaces, which combine visual, auditory, and sometimes haptic feedback, can improve performance and reduce mental strain (Billinghurst et al., 2015). Additionally, conversational assistance through natural speech input is immersive and closer to real-life training delivered by a human trainer.

Additionally, personalized UI elements, such as customizable display settings or progress-tracking tools, help workers feel more in control and confident in their training. This sense of control can reduce psychological stress and improve engagement, making it easier for users to stay focused on learning without feeling overwhelmed (Norman, 2013). A well-designed UI takes into account not only the technical aspects of the task but also the psychological well-being of the user, helping to create an environment where workers are less likely to feel fatigued or frustrated. 

Psychological Effects of AR UIs

Beyond the immediate practical benefits, there are deeper psychological impacts of a user-centric AR UI. When the interface is intuitive, users experience a state of flow, which is a heightened state of focus and engagement where they lose track of time and become fully absorbed in the task (Csikszentmihalyi, 1990). Flow states are often linked to better learning and productivity, as they help users maintain concentration without unnecessary interruptions. 

Moreover, reducing cognitive load through intuitive design contributes to lower stress levels, particularly in high-stakes environments like industrial assembly lines where mistakes can be costly. By providing clear guidance and eliminating unnecessary complexity, the AR interface acts as a supportive tool, making workers feel more competent and less anxious (Dehais et al., 2012). This is critical for building both confidence and long-term competence in a new skill.

References

Csikszentmihalyi, M. (1990). Flow: The Psychology of Optimal Experience. Harper & Row. 

Dehais, F., Causse, M., Vachon, F., & Tremblay, S. (2012). Cognitive conflict in human-automation interactions: a psychophysiological study. Applied ergonomics, 43(3), 588–595. https://doi.org/10.1016/j.apergo.2011.09.004 

Dünser, A., Grasset, R., & Billinghurst, M. (2008). A survey of evaluation techniques used in augmented reality studies. ACM SIGGRAPH ASIA 2008 Courses, 1-27. 

Billinghurst, M., Clark, A., & Lee, G. (2015). A survey of augmented reality. Foundations and Trends in Human–Computer Interaction, 8(2–3), 73–272. https://doi.org/10.1561/1100000049

Norman, D. (2013). The Design of Everyday Things. Basic Books. 

Paas, F., Renkl, A., & Sweller, J. (2003). Cognitive load theory and instructional design: Recent developments. Educational Psychologist, 38(1), 1-4. 


Partner Interview #7 with Athens Epidaurus Festival

As one of Greece’s most esteemed cultural organizations, the Athens Epidaurus Festival (AEF) has been at the forefront of the country’s artistic landscape since 1955. In this conversation, we are joined by Eleni Oikonomou, who shares insights into AEF’s involvement in the groundbreaking VOXReality project. As a use case partner, AEF leads the Augmented Theatre initiative, collaborating with technical experts to merge AR technology with live theatre. Through pilots featuring excerpts from ancient Greek tragedies, AR glasses deliver translated captions and visual effects, blending the physical stage with digital elements to enhance accessibility and audience immersion.

Can you provide an overview of your organization's involvement in the VOXReality project and your specific role within the consortium?

The Athens Epidaurus Festival (AEF) is one of Greece’s biggest cultural organizations and has organized the summer festival of the same name since 1955. In the VOXReality project, AEF is a use case partner, owning the Augmented Theatre use case. We are working with our technical partners to merge AR elements with theatre, with the goal of enhancing accessibility and audience immersion. This includes pilots featuring excerpts from an Ancient Greek tragedy, where translated captions and audiovisual effects are delivered through AR glasses, merging the physical stage with digital elements for a more immersive theatrical experience.

How do you ensure the effectiveness of the language translation feature in enhancing the audience's experience during theatrical performances, so that it delivers a seamless and authentic experience?

Multiple strategies are employed to address issues of accuracy, precision, and latency in the delivery of the caption feature of the Augmented Theatre use case.

To start with, in theatre, translation is an art form in itself. Theatrical texts require a level of precision and sensitivity to convey not only the literal meaning, but also the emotional, cultural, and dramatic nuances that are essential to the performance. Therefore, the real-time translation feature of VOXReality is based on literary translations from Ancient Greek, performed by acclaimed translators to safeguard the integrity of the play. Additionally, internal controls and evaluations are carried out to assess the performance of the translation feature and to ensure the artistic integrity of the original text.

Finally, two internal pilots and a small-scale public pilot have already been deployed, with the goal of assessing the quality of the use case and fine-tuning the features we are developing. During the public pilot, we had the opportunity to gather feedback on our translation and caption feature from the participants via questionnaires and semi-structured interviews. This feedback has been valuable in improving the use case and refining our future steps.

Considering the importance of cultural nuances in theatrical expressions, how does the language translation system address the challenge of maintaining cultural sensitivity and preserving the artistic intent of the performance?

This is another reason why literary translations are used in our use case. As previously explained, literary translations are irreplaceable, even as AR technology presents the exciting potential to accommodate native speakers of diverse linguistic backgrounds. While translated captions are becoming more common in theatres, they are not universally available and typically cover only a limited number of languages. The ability of Augmented Theatre to extend beyond these traditional limitations underscores the importance of a solid foundation in literary translations to ensure that cultural and artistic elements are preserved.

The language translation system for VOXReality prioritizes cultural sensitivity and artistic integrity by relying on these literary translations, which capture the cultural nuances and emotional subtleties of the original text. To ensure that these aspects are preserved throughout the development, we conduct thorough evaluations of the translation outputs through internal checks. This evaluation is crucial for verifying that the translations maintain the intended cultural and artistic elements, thereby respecting the integrity of the original performance.

Considering advancements in technology, do you foresee the integration of augmented reality to enhance theatrical experiences, and how might this impact the audience's engagement with live performances?

AR has the potential to transform traditional theatre by offering immersive experiences, blending digital and physical elements to create new artistic dimensions. This technology allows for new dynamic ways of storytelling, capturing audience attention and enabling interactive elements that can enhance engagement with the performance.

One of the key opportunities AR offers is improving inclusivity and accessibility in theatre. This aligns with our organization’s goals, driven by a strong commitment to supporting inclusive practices and leveraging AEF’s international outreach. AR has the potential to engage audiences who may feel excluded from traditional theatre spaces, whether due to physical, linguistic, or sensory barriers, in unprecedented ways.

The rise of augmented reality (AR) technologies in the coming years is therefore undoubtedly set to make a lasting impact on how audiences engage with live performances. However, thoughtful consideration is essential to balance technological advancement with artistic integrity. This includes ensuring that AR enhances rather than detracts from the human experience of live performance and considering the impact on traditional theatre roles and artists’ rights.

Furthermore, while integrating AR into live theatre presents exciting possibilities, it also comes with several challenges. Combining AR with physical elements can sometimes be distracting for the audience, and technical issues like glitches or misalignment can disrupt the performance’s flow and break immersion. These challenges are compounded by the current limitations of bulky AR equipment. However, advancements in technology are expected to address these issues, leading to more sophisticated and user-friendly equipment. These challenges highlight why the VOXReality project is so exciting, as it allows us to explore and refine how AR can complement theatre in a real-world context.

How do you see theatre exploring virtual platforms for performances in the foreseeable future? How might AR VFX be utilized to reach broader audiences or create unique immersive theatre experiences?

As an emerging medium, virtual platforms have the potential to revolutionize theatre by expanding its reach and creating new engagement opportunities. In our project, we use Augmented Reality Visual Effects (AR VFX) to blend digital elements with a live performance, exploring how these technologies can impact and create immersion in the theatre experience. In our use case, voice-activated VFX accompany a scene of an ancient Greek play. Since in ancient Greek tragedy events are not enacted on stage and the retelling of events by actors is the norm, the VFX developed in our Augmented Theatre use case follow this narrative tradition, bringing the described elements to life for the audience in a very innovative way.

More broadly, the integration of virtual platforms and VFX opens up numerous possibilities for innovation in theatre. They can create dynamic, interactive backgrounds that change in response to the action on stage, or integrate virtual characters and objects that interact with live performers; the possibilities are very diverse. VFX can also help overcome physical barriers by providing virtual set designs that are not constrained by physical space, and address geographic limitations, even enabling remote audiences to experience the performance in a virtual environment or enabling actors to perform from different locations.


Elena Oikonomou

Athens Epidaurus Festival


Virtual Conferencing, So Far, So Close.

When the COVID-19 pandemic brought our travelling business culture to a grinding halt, we understood how dependent we were on free movement and face-to-face interaction for our business to develop and progress. The event industry was not spared, with the Dutch event industry alone suffering a staggering revenue loss of 1.23 billion Euros in 2020 (Statista, 2022).

During COVID-19, we witnessed a remarkable shift in our business culture. We did not just adapt to the new normal; we embraced it. Virtual networking, online video platforms, and innovative event formats like arcade-style networking and virtual cocktails have become a permanent part of our business landscape, inspiring us to continue evolving and finding new ways to connect.

According to Precedence Research, the global video conferencing market was valued at $7.01 billion in 2022 and is projected to reach $22.26 billion by 2032, driven by a compound annual growth rate (CAGR) of 12.30% (Precedence Research, 2024).
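A quick arithmetic sanity check (our own back-of-the-envelope calculation, not from the cited report) shows that both the 2032 figure above and the 2034 figure in the report's title are consistent with the stated 12.30% CAGR:

```python
# Back-of-the-envelope check of the cited projections against the stated CAGR.
base_2022 = 7.01          # USD billion (2022 market size)
cagr = 0.1230             # 12.30% compound annual growth rate
print(round(base_2022 * (1 + cagr) ** 10, 1))  # 2032: ~22.4 vs the cited $22.26 billion
print(round(base_2022 * (1 + cagr) ** 12, 1))  # 2034: ~28.2 vs the $28.26 billion in the report title
```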

Due to the COVID-19 pandemic, the demand for online networking platforms surged as remote work and social distancing necessitated virtual engagement across professional and social spheres. Leading platforms such as LinkedIn, Slack, and Microsoft Teams experienced significant user growth, capitalising on their ability to facilitate professional connections, job opportunities, and industry events.

Meanwhile, video conferencing tools like Zoom and specialised event platforms like Hopin became essential for virtual conferences and networking sessions, providing users with interactive and scalable solutions. Interest-based platforms, including Discord, Facebook Groups, and Reddit communities, saw increased activity as users sought niche spaces for social interaction and knowledge exchange. During this period, Clubhouse, a new entrant, gained rapid traction with its audio-only, invite-only format, appealing to users looking for real-time conversations. Additionally, virtual event platforms like Airmeet, Remo, and Brella emerged as innovative solutions for online conferences, offering networking tools that replicated the interactivity of in-person events.

In 2022, WhatsApp Business (messenger and video) saw its user base surpass 1.26 billion, with the Asia-Pacific (APAC) region contributing the highest number of users at 808.17 million. Zoom reported a 6.9% increase in revenue in 2023, reaching $4.39 billion. Microsoft Teams experienced record downloads in Q2 2020, hitting 70.43 million. Meanwhile, Cisco Webex reported 650 million monthly meeting participants in 2021, averaging 21.7 million daily. North America held the largest market share for video conferencing globally in 2022, accounting for 41% of the total market. The APAC region is also projected for significant growth, with its video conferencing market expected to reach $6.8 billion by 2026 (Sukhanova, 2024).

As this shift permeated all areas of our social activities, the VOXReality consortium set out to develop voice-driven interactions for XR spaces, where virtual B2B events can deliver more value to their global clients. In 2021, the number of trade shows hosted in the Netherlands dropped significantly compared to 2019, primarily due to the effects of the COVID-19 pandemic. By 2023, the country held only 53 events, a sharp decline from the 132 trade fairs organised in 2018 (Statista, 2023).

Image generated by Adobe Firefly

Virtual Reality (VR) has the potential to transform our methods of communication and interaction. Unlike other distance-based communication tools, VR stands out for its enhanced interactivity and visualisation capabilities, such as displaying data, documents, and 3D models. This makes VR interactions a promising option for more effective remote business meetings and engaging social interactions.

Currently, voice-activated personal assistants are used to engage with customers, offer support, overcome language barriers, and streamline basic operations. While some assistants incorporate additional modalities like text or images, their functionality remains limited to simple tasks such as setting alarms or controlling devices. They lack the ability to handle more complex interactions.

VOXReality aims to advance AI models by developing systems that integrate audiovisual and spatio-temporal contexts, enabling personal assistants to better understand and interact with their environment. These systems will allow new applications, such as instruction assistants or navigation guides, through novel self-supervised vision and language systems. By grounding language in both spatial and semantic contexts, VOXReality seeks to enable more sophisticated assistant responses, offering richer, context-aware interactions and higher-level reasoning.

By developing a digital agent and services that provide virtual venue navigation, programme information, and, most importantly, automatic translation services for global virtual event attendees, we aim to make business interactions in virtual environments meaningful, effective, and fully assisted. We strive to challenge how conference interactions are delivered today by adding new tools and value to tomorrow’s event industry.

[1] Statista. (2022, December 9). Expected revenue and loss due to coronavirus in event industry Netherlands in 2020. https://www.statista.com/statistics/1108551/expected-revenue-and-loss-due-to-coronavirus-in-event-industry-netherlands/

[2] Precedence Research. (2024, August 28). Video conferencing market size to hit USD 28.26 billion by 2034. https://www.precedenceresearch.com/video-conferencing-market

[3] Sukhanova, K. (2024, July 22). Video conferencing market statistics. The Tech Report. https://techreport.com/statistics/software-web/video-conferencing-market-statistics

[4] Statista. (2023, October 27). Number of trade shows in the Netherlands 2009-2021. https://www.statista.com/statistics/460581/number-of-trade-fairs-in-the-netherlands/


Manuel Toledo - Head of Production at VRDays Foundation

Manuel Toledo is a driven producer and designer with over a decade of experience in the arts and creative industries. Through various collaborative projects, he merges his creative interests with business research experience and entrepreneurial skills. His multidisciplinary approach and passion for intercultural interaction have allowed him to work effectively with diverse teams and clients across cultural, corporate, and academic sectors.

Starting in 2015, Manuel co-founded and produced the UK’s first architecture and film festival in London. Since early 2022, he has led the production team for Immersive Tech Week at VRDays Foundation in Rotterdam and serves as the primary producer for the XR Programme at De Doelen in Rotterdam. He is also a founding member of ArqFilmfest, Latin America’s first architecture and film festival, which debuted in Santiago de Chile in 2011. In 2020, Manuel earned a Master’s degree from Rotterdam Business School, with a thesis focused on innovative business models for media enterprises. He leads the VRDays Foundation team’s contributions to the VOXReality project.


Preserving Audiovisual Heritage: Exploring the Role of Extended Reality

Audiovisual Heritage definition

Audiovisual heritage refers to the collection of sound and moving image materials that capture and convey cultural, historical, and social information. It includes cultural products, such as films, radio broadcasts, music recordings, and other forms of multimedia, as well as the instruments, devices and machines used in their production, recording and reproduction, and the analog and digital formats used to store them.

Preservation and accessibility

Preserving audiovisual heritage is crucial because analog formats (like film reels, magnetic tapes, and vinyl records) are vulnerable to physical decay. Digital formats, meanwhile, face the risk of obsolescence as technology evolves. Preservation is only the first step, though, since digitized and archived materials often remain difficult for the public to access. To allow the public to meaningfully engage with and understand the significance of each artifact, it is important to contextualize the artifact in a curated framework.

Reasons and methods to use Extended Reality

This is where Extended Reality (XR) comes in—an umbrella term that encompasses Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR). XR describes a complex branch of emerging technology that allows users to interact with content in immersive ways. XR can isolate users’ senses from their physical environment and allow them to experience (e.g., see and listen to) audiovisual heritage artifacts in a virtual space specifically designed for that purpose. This can be seen as a virtual counterpart to how museums thoughtfully design physical displays to best showcase their exhibits. XR also enables creators to craft narratives around artifacts, enhancing their cultural and historical value—a key area where XR shines.

Examples

One such example is the VR work “Notes on Blindness” [1], which allows users to listen to original audio recordings from the writer John Hull describing his journey into biological blindness. The XR work allows users to experience darkness while listening to the recording and in addition visualizes the narrative with a subtle yet decisive aesthetic.

Retrieved from https://www.arte.tv/digitalproductions/en/notes-on-blindness/

Another example is “Traveling While Black” [2], a VR work documenting racial discrimination against African Americans in the United States. This work uses original audio and film excerpts, including interviews with people who lived through segregation or their descendants. Viewing this work from an immersive, first-person perspective, in contrast to viewing it on a flat monitor, gives the audience more affordances for critical engagement and self-reflection. Another great example is a VR work belonging to the exhibition ‘The March’ at the DuSable Museum of African American History in Chicago, chronicling the historic events of the 1963 March on Washington [3]. The work contains recordings from Martin Luther King Jr.’s iconic ‘I Have A Dream’ speech.

Retrieved from https://dusablemuseum.org/exhibition/the-march/

Limitations

Despite its potential, applied examples do not yet abound, because such productions involve significant expertise and cost. Limitations exist in hardware, software, and HCI design, and are gradually being addressed. Research is being invested in design methodologies to streamline production and improve audience satisfaction. Practical issues, like hardware production costs and form-factor discomfort, are being mitigated by commercial investments from major tech companies such as Microsoft, Google, Samsung, Apple, and Meta. Industry standards with cross-platform and legacy support, such as OpenXR, are another important factor for broader adoption. Finally, the audience’s familiarity with and interest in this technology is increasing as it permeates more and more aspects of daily life.

Conclusion

It is clear that extended reality technology can transform how we engage with our audiovisual heritage – it can offer contextualization, it can situate both the audience and the artifact in a narrative framework, and it can offer more depth and nuance to our interactions. While there are still challenges, the limitations are steadily being lifted by efforts from a multitude of involved fields – evidencing the importance of this domain. We eagerly anticipate the next innovative steps from museums, galleries, research centers, studios and film companies worldwide.


Spyros Polychronopoulos

Research Manager at ADAPTIT and Assistant Professor at the Department of Music Technology and Acoustics of the HMU


Fostering Innovation: KIT and UM’s Collaborative Leap in NLP and Machine Translation

On January 26th, 2024, the Maastricht University (UM) VOXReality team was hosted by the Artificial Intelligence 4 Language Technologies (AI4LT) group of the Karlsruhe Institute of Technology (KIT). It was a day-long workshop where both groups presented their work in Natural Language Processing (NLP), and more specifically Machine Translation (MT). The synergies between the two groups promise a bright future for applied language technologies!

UM kicked off the day by presenting the VOXReality project, and more specifically its three use cases along with the general objectives: (1) improve human-to-machine and human-to-human XR experiences, (2) widen multilingual translation and adapt it to different contexts, (3) extend and improve the visual grounding of language models, (4) provide accessible pretrained XR models optimized for deployment, and (5) demonstrate clear integration paths for the pretrained models. UM’s team member Yusuf Can Semerci, the scientific and technical coordinator, elaborated on the technical excellence of the project, which is guaranteed by applying state-of-the-art methods in automatic speech recognition (ASR), multilingual machine translation, vision-and-language models, and generative dialogue systems.

UM’s team has two active PhD candidates who shared their latest research endeavors. Abderrahmane Issam explained his latest work on efficient simultaneous machine translation (SiMT). The goal of SiMT is to provide accurate and as-close-to-real-time-as-possible translations by developing policies that balance the quality of the produced translation against the lag that is sometimes necessary for the model to gather enough information to translate properly. UM’s proposed method learns when to wait for more input in the source language before starting to produce the translation, taking into account the uncertainty that comes with real-time applications. The results are promising, both in terms of translation accuracy and in reducing the necessary lag.
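For readers unfamiliar with SiMT policies, the sketch below shows the classic fixed wait-k baseline (read k source tokens before emitting each target token). It is only an illustration of the quality-versus-lag trade-off, not UM's adaptive method, and translate_step is a hypothetical stand-in for a real incremental translation model.

```python
# Illustrative sketch only: the classic fixed "wait-k" policy, shown as a simple
# baseline for the quality-vs-lag trade-off that adaptive SiMT policies try to
# balance. translate_step is a hypothetical incremental MT model: it returns the
# next target token given the source read so far, or None when it is finished.

def wait_k_policy(source_stream, translate_step, k=3):
    """Read k source tokens before writing, then alternate READ/WRITE."""
    source_so_far = []
    target_so_far = []
    for token in source_stream:          # READ: consume the next source token
        source_so_far.append(token)
        if len(source_so_far) >= k:      # WRITE: once k tokens are available
            out = translate_step(source_so_far, target_so_far)
            if out is not None:
                target_so_far.append(out)
    # Source is finished: flush the remaining target tokens.
    while (out := translate_step(source_so_far, target_so_far)) is not None:
        target_so_far.append(out)
    return target_so_far
```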

Pawel Maka presented his published paper on context-aware machine translation. Context plays an important role in all language applications: in machine translation, it is essential for resolving ambiguity, for example in deciding which pronoun should be used. Context can be represented in different ways and usually includes the sentences preceding (or following) the one we want to translate, on either the source or the target side. Of course, the larger the context, the more computationally expensive it is to run a translation model. UM therefore proposed different methods for efficiently “compressing” context through techniques like caching and shortening. The proposed methods are competitive in performance, both in terms of accuracy and in terms of the resources used (e.g., memory).
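As a simple illustration of the “shortening” idea (a hedged sketch, not the method from the paper), the snippet below keeps only the most recent tokens of the preceding source sentences before prepending them to the sentence being translated; the token budget and the <sep> marker are illustrative.

```python
# Hedged sketch of context "shortening" (not the paper's method): keep only the
# last `max_context_tokens` tokens of the preceding source sentences and prepend
# them to the current source sentence before handing it to an MT model.
def build_context_input(previous_sentences, current_sentence, max_context_tokens=64):
    context_tokens = []
    # Walk backwards so the most recent sentences survive truncation.
    for sentence in reversed(previous_sentences):
        remaining = max_context_tokens - len(context_tokens)
        if remaining <= 0:
            break
        context_tokens = sentence.split()[-remaining:] + context_tokens
    return " ".join(context_tokens) + " <sep> " + current_sentence

# The pronoun in the last sentence is ambiguous without the earlier sentences.
print(build_context_input(
    ["She picked up the book.", "It was heavy."],
    "Then she put it back on the shelf.",
))
```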

On the other hand, KIT’s team presented the EU project Meetween. Meetween aims to revolutionize video conferencing platforms, breaking linguistic barriers and geographical constraints. It aspires to deliver open-source AI models and datasets: more specifically, multilingual AI models that focus on speech but support text, audio, and video as both inputs and outputs, and multimodal, multilingual datasets that cover all official EU languages.

KIT’s team of PhD candidates presented their work in (1) multilingual translation in low-resource cases (i.e. for languages that are not widely spoken or for cases when data is not available), (2) low-resource automatic-speech recognition, (3) the use of Large Language Models (LLM) in context-aware machine translation and (4) quality/confidence estimation for machine translation.

We were happy to identify the overlaps between the two EU projects (VOXReality and Meetween) as well as between the UM and KIT teams. At the heart of both projects lies a common objective: to harness the power of advanced AI technologies, particularly in the realms of Natural Language Processing (NLP) and Machine Translation (MT), to facilitate seamless communication across linguistic and geographical barriers. While the applications and approaches may differ, the essence of their goals remains intertwined. VOXReality (with UM) seeks to enhance extended reality (XR) experiences by integrating natural language understanding with computer vision. KIT’s Meetween project, on the other hand, takes a different but complementary approach to revolutionizing communication platforms. By fostering an environment of open collaboration and knowledge exchange, UM and KIT are more than excited about what the future holds for their collaboration.


Jerry Spanakis

Assistant Professor in Data Mining & Machine Learning at Maastricht University


Partner Interview #6 with Synelixis Solutions S.A.

In our sixth installment of the Partner Interview series, we sit down with Stavroula Bourou, a Machine Learning Engineer at Synelixis Solutions S.A., to explore the company’s vital role in the VOXReality project. Synelixis, a leader in advanced technology solutions, has been instrumental in developing innovative virtual agents and immersive XR applications that are transforming how we experience virtual conferences. In this interview, Stavroula shares insights into their groundbreaking work and how they are driving the future of communication in the XR landscape.

Can you provide an overview of your organization's involvement in the VOXReality project and your specific role within the consortium?

Synelixis Solutions S.A. has been an integral part of the VOXReality project from its inception, serving as one of the original members of the proposal team. Our organization brings a wealth of experience to the table, participating in numerous EU-funded research projects and providing cutting-edge technology solutions.

In the VOXReality project, our roles span several domains, significantly enhancing the project’s success. One of our pivotal contributions is the development of a virtual agent designed for use in virtual conferences. This agent is designed to be user-friendly and non-intrusive, respecting user requests and preferences while assisting users by providing navigational help and timely information about the conference schedule, among other tasks. Its design ensures that interactions are helpful without being disruptive, allowing users to engage with the conference content effectively and comfortably.

Additionally, we have developed one of the three VOXReality XR Applications—the VR Conference application. This application recreates a professional conference environment in virtual reality, complete with real-time translation capabilities and a virtual assistant. It enables users to interact seamlessly in their native languages, thanks to VOXReality’s translation services, thus breaking down language barriers. Furthermore, the virtual agent provides users with essential information about the conference environment and events, enhancing their overall experience.

Furthermore, we have outlined deployment guidelines for the VOXReality models for four different methods: source code, Docker, Kubernetes, and ONNX in Unity. These guidelines are designed to facilitate the integration of VOXReality models into various applications, making the technology accessible to a broader audience.
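To give a flavour of what the ONNX route looks like in practice, here is a minimal Python sketch with onnxruntime; it is not taken from the VOXReality deployment guidelines, and the model file name and input shape are hypothetical.

```python
# Minimal sketch, not from the VOXReality guidelines: running an exported ONNX
# model with onnxruntime in Python. "asr_model.onnx" and the dummy input are
# hypothetical placeholders.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("asr_model.onnx", providers=["CPUExecutionProvider"])

input_name = session.get_inputs()[0].name            # discover the model's input name
dummy_audio = np.zeros((1, 16000), dtype=np.float32)  # e.g. 1 second of silence at 16 kHz

outputs = session.run(None, {input_name: dummy_audio})
print([o.shape for o in outputs])                     # inspect the output tensors
```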

How do you envision the convergence of NLP and CV technologies influencing the Extended Reality (XR) field within the context of the VOXReality initiative?

In the context of the VOXReality initiative, the convergence of Natural Language Processing (NLP) and Computer Vision (CV) technologies is poised to revolutionize the Extended Reality (XR) field. By integrating NLP, we enhance communication within XR environments, making it more intuitive and effective. This allows users to interact with the system using natural language, significantly improving accessibility and engagement. Additionally, this technology enables users who speak different languages to communicate with one another or to attend presentations and theatrical plays in foreign languages, thus overcoming language barriers and reaching a broader audience. Similarly, incorporating CV enables the system to understand and interpret visual information from the environment, which enhances the realism and responsiveness of both virtual agents and XR applications.

Together, these technologies enable a more immersive and interactive experience in XR. For example, in the VOXReality project, NLP and CV are being utilized to create environments where users can naturally interact with both the system and other users through voice commands. This integration not only makes XR environments more user-friendly but also significantly broadens their potential applications, ranging from virtual meetings and training sessions to more complex collaborative and educational tasks. The synergy of NLP and CV within the VOXReality initiative is set to redefine user interaction paradigms in XR, making them as real and responsive as interacting in the physical world.

What specific challenges do you anticipate in developing AI models that seamlessly integrate language as a core interaction medium and visual understanding for next-generation XR applications?

One of the primary challenges in developing AI models that integrate language and visual understanding for next-generation XR applications is creating a genuinely natural interaction experience. Achieving this requires not just the integration of NLP and CV technologies but their sophisticated synchronization to operate in real time without any perceptible delay. This synchronization is crucial because even minor lags can disrupt the user experience, breaking the immersion that is central to XR environments. Additionally, these models must be adept at comprehensively understanding and processing user inputs accurately across a variety of dialects. Handling multilingual and dialectal variation in real time adds significant complexity to AI model development.

Moreover, another significant challenge is the high computational demands required to process these complex AI tasks in real-time. These AI models often need to perform intensive data processing rapidly to deliver seamless and responsive interactions. Optimizing these models to function efficiently across different types of hardware, from high-end VR headsets to more accessible mobile devices, is crucial. Efficient operation without compromising performance is essential not only for ensuring a fluid user experience but also for the broader adoption of these advanced XR applications. The ability to run these complex models on a wide range of hardware platforms ensures that more users can enjoy the benefits of enriched XR environments, making the technology more inclusive and widespread.

All these challenges are being addressed within the scope of the VOXReality project. Stay tuned to learn more about our advancements and breakthroughs in this exciting field.

How do you plan to ensure the adaptability and learning capabilities of the virtual agents in varied XR scenarios?

To ensure the adaptability and learning capabilities of our virtual agents in varied XR scenarios within the VOXReality project, we are implementing several key strategies. Firstly, we utilize advanced machine learning techniques to equip the virtual agents with the ability to learn from user interactions and adapt their responses over time. These techniques, including deep learning and large language models (LLMs), enable the virtual agents to analyze and interpret vast amounts of data rapidly, thereby improving their ability to make informed decisions and respond to user inputs in a contextually appropriate manner, making them more intuitive and responsive.

Moreover, we are actively creating and curating a comprehensive dataset that reflects the real-world diversity of XR environments. This dataset includes a wide array of interactions, environmental conditions, and user behaviors. By training our virtual agents with this rich dataset, we enhance their ability to understand and react appropriately to both common and rare events, further boosting their effectiveness across various XR applications.

Through these methods, we aim to develop virtual agents that are not only capable of adapting to new and evolving XR scenarios but are also equipped to continuously improve their performance through ongoing learning and interaction with users.

In the long term, how do you foresee digital agents evolving and becoming integral parts of our daily lives, considering advancements in spatial and semantic understanding through NLP, CV, and AI?

In the long term, we foresee digital agents evolving significantly, becoming integral to our daily lives as advancements in NLP, CV, and AI continue to enhance their spatial and semantic understanding. As these technologies develop, digital agents will become increasingly capable of understanding and interacting with the world in ways that are as complex as human interactions.

With improved NLP capabilities, digital agents will be able to comprehend and respond to natural language with greater accuracy and contextual awareness, making interactions feel more conversational and intuitive. This advancement also includes sophisticated translation capabilities, enabling agents to bridge language barriers seamlessly. As a result, they can serve global user bases by facilitating multilingual communication, which enhances accessibility and inclusivity. This will allow them to serve in more personalized roles, such as personal assistants that can manage schedules, respond to queries, and even provide companionship with a level of empathy and understanding that closely mirrors human interaction.

Advancements in CV will enable these agents to perceive the physical world with enhanced clarity and detail. They’ll be able to recognize objects, interpret scenes, and navigate spaces autonomously. This will be particularly transformative in sectors like healthcare, where agents could assist in monitoring and providing care, and in retail, where they could offer highly personalized shopping experiences.

Furthermore, as AI technologies continue to mature, we will see digital agents performing complex decision-making tasks, learning from their environments, and operating autonomously within predefined ethical guidelines. They will become co-workers, caregivers, educators, and even creative partners, deeply embedded in all aspects of human activity.

Ultimately, the integration of these agents into daily life will depend on their ability to operate seamlessly and discreetly, enhancing our productivity and well-being without compromising our privacy or autonomy. As we advance these technologies, we must also consider the ethical implications and ensure that digital agents are developed in a way that is beneficial, safe, and respectful of human values.


Stavroula Bourou

Machine Learning Engineer at Synelixis Solutions SA


Partner Interview #5 with Hololight

In this fifth installment of our Partner Interview series, Leesa Joyce, Head of Research Implementation at Hololight, sits down with Carina Pamminger, Head of Research at Hololight, to explore their organization’s pivotal role in the VOXReality project. As a leader in extended reality (XR) technology, Hololight is pushing the boundaries of augmented reality (AR) solutions, particularly within industrial training applications. Through their work on the Virtual Training Assistant use case, Carina sheds light on how AR is transforming training processes by integrating AI-driven interactions and real-time performance evaluation. The interview delves into the innovative ways AR is being utilized to enhance assembly line training, the incorporation of safety protocols, and the future of immersive learning experiences at Hololight.

Can you provide an overview of your organisation's involvement in the VOXReality project and your specific role within the consortium?

HOLO is an extended reality (XR) technology provider, contributing its augmented reality (AR) solutions to stream and display 3D computer-aided design (CAD) models and manipulate them in an AR environment. HOLO leads the task on developing novel interactive XR applications and also leads the “Virtual Training Assistant” use case. This use case revolves around enhancing an AR industrial assembly training application by incorporating the automated speech recognition (ASR) model and dialogue system of VOXReality. Conventional training techniques frequently lack interactivity and adaptability, resulting in less-than-optimal educational results. By integrating artificial intelligence within the AR setting, this scenario aims to establish a more engaging and efficient training atmosphere. Noteworthy characteristics of the application include the visualization and manipulation of 3D CAD files within the AR environment, an interactive virtual training aide featuring real-time performance evaluation, and a dynamic dialogue system driven by natural language processing (NLP) and speech-to-text functionalities.

The prime constituent of the training assistant technology is the application Hololight Space Assembly. Trainees are guided to precisely assemble components within the CAD model, ensuring everything fits perfectly. The system effortlessly integrates with pre-existing asset bundles, providing all the necessary details, such as CAD files, tools, and additional elements like tables or shelves. It also includes intuitive scripts for model interaction, easy-to-navigate menus, and smart algorithms to enhance the assembly experience. In addition, Assembly leverages Hololight Stream to remotely render the application from a high-performance laptop to AR smart glasses, overcoming the device’s rendering limitations. This remote rendering and streaming setup allows the AR training application to be hosted on a powerful laptop (server) and seamlessly streamed to the HoloLens 2 (client).

How is AR seamlessly integrated into training applications, and what specific advantages does it bring to the learning experience? 

Integrating AR into training applications allows assembly line workers to train in a highly realistic, digitally replicated environment that mirrors their actual workspace. This immersive experience helps workers develop muscle memory and recognize environmental cues, making the transition to the real assembly line smoother and more intuitive. Since the training environment is digital, it can be accessed from anywhere, at any time, providing flexibility and convenience for both trainees and companies. 

Moreover, AR-based training is resource-friendly and cost-effective. Multiple workers can use the same training files repeatedly, allowing for efficient use of resources. The digital nature of the environment also means that training scenarios can be easily modified, redesigned, or personalized to meet specific needs, enhancing the learning experience. By incorporating sensory cues, AR helps reinforce learning, making it a powerful tool for building skills that are critical in a fast-paced, high-precision environment like an assembly line. 

How does the AR training application cater to different skill levels among trainees, ensuring a gradual learning curve for beginners and challenging modules for more experienced assembly technicians? 

The AR training application is designed to accommodate various skill levels, ensuring that both beginners and experienced assembly technicians can benefit from the training. For those with some assembly knowledge who need to master a new object, engine, or machine, the difficulty modes come in handy. These modes guide trainees through the correct order of assembly, gradually increasing in complexity. This personalized approach allows the training to adapt to each individual’s expertise and learning pace, making it accessible to slower learners while still providing a challenge for those who pick up the process quickly.

By progressing through these difficulty levels, trainees not only learn the assembly process but also reinforce it through repetition, ensuring they internalize each step. As they clear each difficulty mode, they build confidence and gradually commit the entire process to memory. This approach ensures that by the end of the training, regardless of their initial skill level, all trainees will have mastered the assembly process and be fully prepared to apply their knowledge on the actual assembly line.

Considering the critical nature of turbine assembly, how does the AR application incorporate safety protocols and guidelines to ensure that trainees adhere to industry standards during the training process? 

The AR application prioritizes safety by guiding trainees through the correct order of turbine assembly, creating a complete awareness about the process and the parts that need to be handled, reducing the likelihood of mistakes that could lead to serious risks in real-life scenarios. By learning and practicing each step in a controlled, digital environment, trainees can focus on mastering the process without the immediate dangers associated with heavy machinery. This approach ensures that they are well-prepared to follow industry standards and protocols when transitioning to the actual assembly line, where adherence to safety guidelines is critical.

However, some safety aspects remain areas for improvement. Currently, ergonomic assessments can only be conducted in real-life settings, requiring external analysis to ensure proper posture and technique. Additionally, the integration of Personal Protective Equipment (PPE) within the AR training is limited due to compatibility issues between safety goggles and AR glasses. While the application effectively reduces risks by teaching the correct assembly sequence, future developments could enhance safety training by incorporating ergonomic evaluations and better PPE integration.

Looking ahead, what plans are in place for future enhancements and expansions of the AR training application for turbine assembly? Are there additional features or modules on the horizon to further enrich the learning experience?

The AR training application for turbine assembly is set to undergo significant enhancements, particularly with the integration of VOXY, an AI-assisted dialogue agent with voice assistance. VOXY is already a game-changing addition, streamlining interactions within the application by eliminating the need for clumsy AR hand gestures. This ensures a smoother, more immersive experience, allowing users to stay fully engaged with the training process. VOXY also introduces AI-driven support, making it easier for trainees to navigate complex assembly tasks while receiving real-time guidance and feedback.

Future expansions include developing a platform to host training files and an analysis mode to evaluate trainee performance more comprehensively. We’re also exploring the incorporation of real, trackable tools in the AR environment, enabling physical interaction with virtual elements to improve ergonomics and weight memory. Additionally, we’re researching ways to integrate safety equipment into the AR training, with ongoing efforts under the SUN project funded by Horizon Europe. These enhancements will not only enrich the learning experience but also ensure that trainees are better prepared for the physical demands and safety requirements of turbine assembly.

Leesa Joyce

Head of Research Implementation at Hololight


&

Carina Pamminger

Head of Research at Hololight


One piece at a time: Assembling textual video captions from single words and image patches

Recording video using several types of modern capturing devices — such as digital cameras, web cameras, or cell-phone cameras — has been widespread for many years now. The reasons why we capture videos are numerous. People mostly want to capture important moments in their lives, or less important ones, using their mobile devices. Over the years, a person may accumulate several hundred or even thousands of videos and images. Video capturing has other operational applications, too, like video-based surveillance. In this type of surveillance, a place of interest that is visible to a camera is captured in order to monitor what is happening in the surrounding area. But why would we need to capture video in this case? Shop owners would, for instance, utilise surveillance cameras to monitor people who navigate their shops for security or business management reasons. However, there can be more to it than that. Another idea could be to predict when and where people visiting a very large shop or a museum should be serviced by the staff. The full range of uses of video for artificially intelligent digital analysis is difficult to grasp. When we need to manage smaller or larger collections of video, one important question is this: How can we summarise, through text, the essential semantic visual information contained in a collection of videos?

By VectorMine from Getty Images

As humans, we instantly perceive much of our surrounding environment without any significant effort, and perceiving the visual world is essential for functioning within our communities. The brain perceives the world visually by receiving information sensed through the eyes and transmitted via the optic nerve. That statement is true, but far too coarse: it hides the complexity of how human vision actually works. In fact, to this day it is not fully understood how the brain processes visual information and makes sense of it.

Even though important scientific questions about the underpinnings of visual processing in the human brain remain open, computer scientists have for years been developing explanations, algorithms, and mathematical tools that recreate visual analysis and understanding from visual data of different sorts (e.g., images and video). Making sense of video, in particular, requires us to detect and localise objects, track moving targets, or take in a streaming road scene from a car roof-mounted camera and find where the vehicles, pedestrians, and road signs are. These are only a few examples of the computer vision problems and applications studied by researchers and practitioners.

Intermediate processes taking place in the visual cortex of a biological brain, and an artificial neural network-based analog for object detection that uses a CNN, intermediate non-linear feature transformations and a fully connected layer calculating a distribution of the numerical confidences of detecting several object categories from an input image. Image adapted from [7].

The top of the image above depicts a model of how the biological brain recognises objects, and the bottom shows an artificial analog for visual processing in the same task. In the biological pathway, the eye senses a green tree through the retina and passes the signals to the optic nerve. Different cues in the image (such as motion, depth information, and colour) are processed by the Lateral Geniculate Nucleus (LGN) in the thalamus, and this layer-by-layer signal propagation offers an explanation of how the brain encodes raw sensory information from the environment. The encoded signal is then passed on to the layers of the visual cortex, which finally lets the brain perceive the picture of a green tree. In the artificial pathway at the bottom of the figure, a relatively modern neural network works as follows. First, raw visual data are captured by a camera sensitive to the visible part of the electromagnetic spectrum. The three layers of pixel intensities — the R (red), G (green), and B (blue) colour components — are then passed through a deep convolutional neural network. This forward pass encodes the raw input as a grid of numbers, each indexed by its coordinates along the grid's dimensions. This arrangement of numbers is what we call "features": a numerical "signature" or "fingerprint" of what the camera captured from its position in the environment. Finally, another deep neural network with intermediate processing modules decodes these features, transforming them sequentially and non-linearly through its layers, and the forward propagation of information through those layers yields a distribution over the likely categories of objects in the image. How is this made possible? Simple: the model was trained to adapt its parameters so that it learns the association from raw images to distributions of object categories, by minimising the categorisation error over a set of image–category pairs.
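To make the artificial pathway concrete, here is a minimal PyTorch sketch (our own toy illustration, not the network from [7]) that passes an RGB image through a small convolutional stack and a fully connected layer to produce a distribution over a handful of hypothetical object categories; the layer sizes and class count are made up for the example.

```python
import torch
import torch.nn as nn

# A toy convolutional "encoder + classifier", illustrating the forward pass
# described above: RGB pixels -> convolutional features -> class probabilities.
class TinyObjectClassifier(nn.Module):
    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                             # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                             # 32x32 -> 16x16
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.features(x)              # the numerical "fingerprint"
        feats = feats.flatten(start_dim=1)    # arrange the features as one vector
        logits = self.classifier(feats)
        return logits.softmax(dim=-1)         # distribution over object categories

model = TinyObjectClassifier(num_classes=5)
image = torch.rand(1, 3, 64, 64)              # a stand-in for one camera frame
print(model(image))                           # five probabilities summing to 1
```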

Writing computer programs that can tell us, in text, what is depicted in a video has long been an important, general-purpose computer vision application. Although humans have imagined intelligent machines that visually perceive the world for decades, taking in a video and robustly generating a convincing text description of it only recently became possible, with the development of robust algorithms. Before that point in the history of AI, attacking the video-to-text task with older scientific techniques and methods was practically impossible. The most important scientific concepts behind effective video captioning algorithms were borrowed from the AI subfield called deep learning. In parallel, computer hardware that accelerates numerical computation became widely available, so that deep learning models with very large numbers of parameters can be trained on raw data; Graphics Processing Units (GPUs) are the go-to hardware technology for developing such models. In the era before the proliferation of deep learning (which arguably began around 2006, when Deep Belief Networks were introduced), there were still important techniques and concepts used to devise successful algorithms, but their capabilities fell short of those exhibited by deep learning models. Moreover, today's deep learning-based video captioning algorithms have an enormous number of parameters — that is, numbers — which was atypical of older algorithms designed for exactly the same task. And although the widest adoption of deep learning can be dated to around 2006, it is worth noting that the LeNet deep CNN model developed by Yann LeCun was published in 1998, and that basic elements of deep learning, such as the backpropagation algorithm for tuning model parameters, were developed in the eighties and nineties.

The parameters of a deep learning model are found by algorithms that perform what is called function optimization; through such optimization, the model can hopefully perform its intended task sufficiently well. (Research on explainable AI has, in addition, contributed methods that explain why a model produced a particular output, such as a classification or regression result.) In video captioning, many successful systems such as SwinBERT [1] follow this approach and train a captioning model on a large dataset of videos harvested from the Web, each associated with a text caption of one or more sentences written by a human. What is significant here is that the designer of a video captioning algorithm can take such a dataset of videos and annotations and, after an amount of training time that varies with the size of the data and the model, obtain a model that can be shown videos it has never seen before and generate relatively accurate captions for them.
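To illustrate the "function optimization" idea in miniature, the sketch below runs gradient descent to minimise a captioning-style loss over a toy set of (video feature, caption token) pairs. The random tensors and the single linear "captioner" are placeholders we invented for illustration; real systems such as SwinBERT [1] optimise vastly larger models on vastly larger datasets, but the optimisation loop follows the same pattern.

```python
import torch
import torch.nn as nn

# Placeholder "dataset": 8 clips, each summarised by a 128-d feature vector
# and paired with one target word index from a 1000-word vocabulary.
video_feats = torch.randn(8, 128)
caption_tokens = torch.randint(0, 1000, (8,))

captioner = nn.Linear(128, 1000)          # stand-in for a captioning model
optimizer = torch.optim.SGD(captioner.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()           # the captioning/categorisation error

for epoch in range(100):
    optimizer.zero_grad()
    logits = captioner(video_feats)       # the model's guess about the caption word
    loss = loss_fn(logits, caption_tokens)
    loss.backward()                       # backprop: how should each parameter change?
    optimizer.step()                      # adjust parameters to reduce the error
```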

An ordered sequence of words that describes a video should normally resemble what a real human would say when describing it. But is it technically trivial to generate such a caption with an algorithm? The answer is a mixed one. Partly "yes", because scientists have already come up with capable algorithms for the task, even if they are not yet perfect. Partly "no", because the problem of generating a video caption is ill-posed: we cannot define precisely what the caption of a video really is, and therefore there can be no exact algorithm that produces unquestionably correct answers. To see why, consider that you yourself would make different statements about the same video depending on which details you wanted to highlight. How, then, can an algorithm decide a priori what to say about a video when several different statements are all valid? It cannot, and it may well omit something about the video that a human observer would consider important. Faced with this ill-posedness, we settle for defining video captioning through a mathematical model that captures the task convincingly, while accepting that it is not the optimal model — indeed, we do not even know what the optimal model would be. To train this admittedly suboptimal model, the designer again relies on a large dataset of raw videos and associated text annotations, from which the model learns to perform the task well.

To come to terms with the idea that mathematical models of reality are suboptimal in some sense, it helps to recall George E. P. Box's quote: "All models are wrong, but some are useful." Video captioning systems built on deep learning have empirically proven useful and reliable for the task: their results reach a good level of utility whenever we can quantify that goodness, even though we know these models are not perfect models of the visual world sensed through a camera.

To see how a real video captioning system works, we will give a brief but comprehensive summary of a model called SwinBERT. Developed by researchers at Microsoft, it was presented at CVPR 2022 (see reference [1]), a top-tier computer vision conference. The implementation of this model is publicly available on GitHub [3].

To understand how SwinBERT works, it helps to remember that in the physical world matter is made of small pieces organised hierarchically: small pieces combine with other small pieces to create bigger chunks, and each piece occupies a particular position in space. Sand, for example, is made of tiny rocks, thousands or millions of which are arranged together to form a patch of sand. In the case of video captioning, SwinBERT provides a model trained to relate pieces of visual information (that is, image patches) to sequences of words. To do so, it reuses two important earlier ideas. The first is the VidSwin transformer model, which represents a video as 3D voxel patches and extracts features describing the visual content of the image sequence; VidSwin-generated representations can also be used for classification tasks such as action recognition. VidSwin predates SwinBERT, was created by Microsoft, and was presented in a 2022 CVPR paper [2]. The second is BERT (Bidirectional Encoder Representations from Transformers) [5], developed by Devlin and collaborators, which handles the generation of word sequences, that is, sentences.
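To hint at what "representing video as 3D patches" means in practice, the snippet below (a simplification we wrote for illustration, not the actual VidSwin code; the clip size, patch size, and embedding dimension are assumptions) carves a short clip into non-overlapping spatio-temporal blocks and maps each block to an embedding vector.

```python
import torch
import torch.nn as nn

# A clip of 8 RGB frames at 64x64 resolution: (batch, channels, time, height, width).
clip = torch.rand(1, 3, 8, 64, 64)

# Carve the clip into 3D patches of 2 frames x 4x4 pixels and embed each patch.
# A Conv3d whose kernel equals its stride is a standard trick for non-overlapping patching.
patch_embed = nn.Conv3d(in_channels=3, out_channels=96,
                        kernel_size=(2, 4, 4), stride=(2, 4, 4))

tokens = patch_embed(clip)                      # (1, 96, 4, 16, 16)
tokens = tokens.flatten(2).transpose(1, 2)      # (1, 1024, 96): 1024 patch tokens
print(tokens.shape)                             # each row is one 3D patch embedding
```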

By Alea Mosaic

To begin with, in order to understand the function of VidSwin, imagine a colourful mosaic created by an artist. To depict a scene, the mosaic designer takes very small coloured pieces of rock, each with its own colour and texture, and stitches these small, colourful pieces together, piece by piece, to form objects such as a dolphin and the sea, as in the mosaic image on the left. Each mosaic patch typically belongs to a single object: some patches belong to the dolphin, others to the background, and others to the sea surface. Patches belonging to the same object tend to lie near each other, while patches from different objects are usually not adjacent, except where the objects' boundaries touch. If we represented the adjacency of image patches as a graph, we would naturally obtain a planar, undirected graph. Looking at the mosaic in the picture above, we can say that "the dolphin hovers over the seawater". Now imagine the same designer creating a series of slightly altered mosaics from the original one: the patches stay in the same positions, but their colour content changes from one mosaic to the next, so that the dolphin appears to be displaced over time, giving the impression of movement. As we step through the mosaics from first to last, some patches are correlated spatially (because they are adjacent within an image), while patches belonging to different mosaics may be correlated temporally. VidSwin models these dynamic patch relationships by applying transformer self-attention to the 3D patches both spatially and temporally, producing refined 3D patch embeddings that sit well in feature space. These embeddings are transformed several more times by self-attention layers, modelling the dependencies among them at different scales of attention, and are finally passed through a multi-head self-attention layer followed by a non-linear transformation computed by a feed-forward neural network. The output of VidSwin is a set of spatio-temporal features that numerically describe small consecutive frame segments of the video, as sketched in the code below.
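The self-attention step described above can be sketched with a generic Transformer encoder layer applied to the patch embeddings. This is only an illustrative stand-in: VidSwin actually restricts attention to shifted local 3D windows at several scales, but the pattern — multi-head self-attention followed by a feed-forward network — is the same.

```python
import torch
import torch.nn as nn

# 1024 spatio-temporal patch tokens of dimension 96 (as produced in the earlier sketch).
tokens = torch.rand(1, 1024, 96)

# One generic Transformer encoder layer: multi-head self-attention over all patch
# tokens, followed by a feed-forward network. VidSwin limits attention to local
# 3D windows for efficiency, but the computation follows this same pattern.
layer = nn.TransformerEncoderLayer(d_model=96, nhead=4,
                                   dim_feedforward=384, batch_first=True)
refined = layer(tokens)                  # each patch is now "aware" of its neighbours
print(refined.shape)                     # torch.Size([1, 1024, 96])
```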

From Canva by shutter_m from Getty Images

The language modelling module of SwinBERT is BERT [5], whose goal here is to generate a sequence of words that best describes the visual content captured by VidSwin. BERT captures the relationships between the words in a sentence by weighing each anchor word against the words that appear both to its left and to its right; because of this, BERT is said to take the bidirectional context of words into account. Using this bidirectional context, BERT is pre-trained on a large text corpus, after which it can be fine-tuned on other corpora to serve downstream tasks. During pre-training, BERT optimises two objective functions. The first is the Masked Language Modelling (MLM) objective: the model randomly picks some words of a sentence, masks them out, and must infer the masked (that is, missing) words correctly, again relying on the bidirectional context described above. The second is Next Sentence Prediction (NSP), which drives BERT's parameters towards understanding the relationships between pairs of sentences.
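A quick way to see the masked-word objective in action is Hugging Face's fill-mask pipeline with a pre-trained BERT checkpoint; the example sentence is our own, and the exact completions and scores will depend on the model.

```python
from transformers import pipeline

# Load a pre-trained BERT and ask it to recover a masked word using the
# bidirectional context around the [MASK] token.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill_mask("The dolphin [MASK] over the sea."):
    # Each candidate is a dict with the proposed word and its probability.
    print(candidate["token_str"], round(candidate["score"], 3))
# Prints plausible completions (e.g. verbs of motion); the exact ranking
# depends on the checkpoint.
```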

The mapping of visual information to word sequences through a multi-modal transformer model with a cross-attention layer in SwinBERT

Now that we have described how the Swin transformer generates spatio-temporal visual features from frame segments — and how BERT generates textual features — it is time to describe how these two elements are combined to form the SwinBERT model. The key ingredient is a model that can bridge both worlds, the visual and the textual, taking us from a VidSwin-based visual representation to the textual representation that is SwinBERT's desired output. The multimodal transformer fuses the visual and textual representations into a better joint representation that introduces simple, sparse interactions between visual and textual elements. Such cross-modal interactions are more easily interpretable, whereas everything-versus-everything interactions are more expensive and often unnecessarily complicated. SwinBERT avoids the latter via a multimodal Transformer that employs a key element for processing multimodal data: the cross-attention layer (a plain transformer [4] instead relies on self-attention, computing dense relationships among tokens of a single modality). The attention layers learn a common representation of textual and visual elements by computing linear combinations of them; after passing through a feed-forward neural network, this representation is fed to a seq2seq-style sequence generation procedure [6] that computes the video caption.
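A bare-bones version of this cross-attention idea — word tokens querying visual tokens — can be written with PyTorch's multi-head attention module. This is our own illustrative fusion layer with made-up dimensions, not the actual SwinBERT implementation.

```python
import torch
import torch.nn as nn

text_tokens  = torch.rand(1, 12, 512)     # 12 word embeddings (e.g. from BERT)
video_tokens = torch.rand(1, 1024, 512)   # 1024 visual tokens (e.g. from VidSwin)

# Cross-attention: every word (query) attends to the video patches (keys/values),
# so each word representation is enriched with the visual evidence it needs.
cross_attention = nn.MultiheadAttention(embed_dim=512, num_heads=8,
                                        batch_first=True)
fused, attn_weights = cross_attention(query=text_tokens,
                                      key=video_tokens,
                                      value=video_tokens)

print(fused.shape)         # torch.Size([1, 12, 512]) -- visually grounded word features
print(attn_weights.shape)  # torch.Size([1, 12, 1024]) -- which patches each word attended to
```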

References

[1] Lin et al., SwinBERT: end-to-end transformers with sparse attention for video captioning, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022, pp. 17949-17958.

[2] Liu et al., Video Swin Transformer, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022, pp. 3202-3211.

[3] Code accessed at https://github.com/microsoft/SwinBERT

[4] Vaswani et al., Attention is all you need, in Proceedings of Neural Information Processing Systems (NeurIPS) 2017.

[5] Devlin et al., BERT: Pre-training of deep bidirectional transformers for language understanding, in Proceedings of NAACL-HLT 2019.

[6] Sutskever et al., Sequence to sequence learning with neural networks, in Proceedings of Neural Information Processing Systems (NeurIPS) 2014.

[7] Zhang and Lee, Robot bionic vision technologies: A review, in Applied Sciences, 2022.

Sotiris Karavarsamis

Research Assistant at Visual Computing Lab (VCL)@CERTH/ITI

Twitter
LinkedIn

Partner Interview #4 with VRDays Foundation

In this fourth installment of our Partner Interview series, we sit down with Manuel Toledo, Head of Production at VRDays Foundation, to explore the organization’s role in the VOXReality project. VRDays Foundation, known for its commitment to advancing immersive technologies and fostering dialogue around sustainable innovation, is playing a pivotal role within the VOXReality consortium. Manuel shares insights into how the foundation is bridging the XR industry with cutting-edge developments, particularly in the realm of virtual conferencing, and the transformative potential these innovations hold for the future of communication technologies.

Can you provide an overview of your organisation's involvement in the VOXReality project and your specific role within the consortium?

At VRDays Foundation, we are advocates of innovation and creative approaches to pushing the boundaries of immersive technologies and sparking debates on sustainable technology development. Joining forces with the VOXReality consortium aligns perfectly with our mission. We’re immensely proud to serve as a gateway, a bridge if you will, to the broader XR community and industry for this project. 

Our contribution to the consortium lies in our extensive experience and network within the XR industry, where the project's work will have a significant impact.

During the development of VOXReality, we take on several roles: contributing to partners' work, developing one of the specific VOXReality use cases – the VR Conference – and leading the pilot ideation, planning, and delivery of all three use cases.

Moreover, we’re excited to amplify the impact of the consortium’s work by showcasing it at events like Immersive Tech Week in Rotterdam. It’s not just about what we accomplish within the consortium but also about how we extend its reach and influence to the broader XR community. 

What technical breakthroughs do you anticipate during the course of the VOXReality project?

From the perspective of the VR Conference use case, we’re thrilled about the work we’re putting in alongside the VOXReality consortium. The implications for the event industry, especially in the realm of B2B events, are incredibly exciting.  

The VR Conference case being developed by VOXReality promises to revolutionise the landscape, offering effective, high-end, non-physical, business-driven, multilingual, and assisted interactions for virtual visitors. This breakthrough will fundamentally reshape our understanding and experience of events. 

What role do emerging technologies play in enhancing the technical capabilities of virtual conferencing solutions?

Thanks to today’s technological advancements, the boundaries of distance and presence have become merely matters of perception. Emerging technologies like VR, AR, and AI have opened up a new realm where perception is constantly pushed to its limits. 

With VOXReality’s pioneering development of voice-driven interactions in XR spaces, both event organisers and attendees will face a fundamental shift in their preconceived notions. This innovative leap will, in turn, unlock fresh opportunities for organisers and businesses to enhance the value of their activities. Moreover, it will empower visitors to engage in meaningful and productive interactions, irrespective of their geographical constraints. 

What business models do you think are worth exploring for the sustained growth of virtual conferencing technologies?

VRDays Foundation firmly believes that the development of virtual accessibility for conferences and trade shows holds the key to unlocking a wealth of new business opportunities for B2B events and their participants. By tapping into these opportunities, we can create fresh value for those already involved in events, particularly within the realm of B2B engagements such as trade shows, one-to-one meetings, demo sessions, and networking opportunities. 

What role do you see virtual conferencing playing in the evolution of communication technologies over the next decade?

In its many formats, the development of virtuality over the next decade will change conferencing at a speed never experienced before, from simple interactions with conference speakers to complex business agreements concluded safely and virtually. All of these interactions will rest on communication technologies, bringing down barriers such as complex navigation and language limitations that are common to every event, especially today, when the scale and international reach of visitors demand new approaches from organisers.

Voice-driven interaction will play an important part in these developments by offering a seamless, intuitive means of engagement. It streamlines tasks, supports hands-free operation, and integrates with other modalities for richer experiences, while personalisation and remote assistance foster smoother interactions. In essence, it promises to elevate the usability, accessibility, and engagement of virtual conferencing, charting new avenues for innovation and cooperation in the years ahead.

Manuel Toledo

Head of Production at VRDays Foundation

Twitter
LinkedIn