
The Future of Virtual Events: Simplifying Attendance and Amplifying Attendee Experience with Voice-Driven Interaction in XR environments

VOXReality’s research has the potential to transform the future of virtual and hybrid events. As the demand for virtual events in XR environments grows, there is a need for innovative solutions that can provide a more natural and immersive experience for attendees. VOXReality’s research and development of new AI models to drive future language-driven interactive XR experiences has the potential to create more personal, natural and accessible event experiences and enhance attendee engagement and efficiency. 

The COVID-19 pandemic spurred the rapid adoption of virtual events, as in-person gatherings became impossible or too risky. According to Grand View Research, the global virtual events market was valued at $78.6 billion in 2020, with a compound annual growth rate (CAGR) of 23.2% from 2021 to 2028.

While virtual events have provided a lifeline for businesses and organisations to continue their operations, one of the main challenges has been to recreate the same level of engagement, ease and interactivity as in-person events.

Virtual events have traditionally relied on XR (Extended Reality) technologies, such as VR (Virtual Reality), to provide immersive experiences. However, XR user interfaces are still new to many users, who are far less familiar with them than with on-screen interfaces.

Moving from a screen to a 360-degree environment can also be daunting for attendees. As a result, XR interfaces can feel complicated and require a longer onboarding process. This is where the potential of voice-driven interaction comes in.

Voice-driven interaction provides a more natural and human way of interacting with a system, making events hosted in XR spaces more personal and accessible to a wider audience. In this blog post we want to look ahead and explore the value that VOXReality’s research results can offer future virtual and hybrid events.

Paving the way for experiencing events in XR

VOXReality’s goal is to conduct research and develop new AI models that can drive the future of XR interactive experiences. Their innovative models will address both human-to-human interaction in unidirectional (theatre) and bidirectional (conference) settings, as well as human-to-machine interaction by building the next generation of personal assistants. All of these elements are set to play an important role in shaping the future of virtual and hybrid events.

More Personal and Accessible Events

The main benefit of using language as a core interaction medium in XR spaces is that it makes events more personal and accessible. Attendees can use their voice to navigate the virtual environment, access information, and interact with other attendees, speakers, and exhibitors.

This creates a more natural and seamless experience that mimics in-person events, where people can communicate through speech and body language. Voice-Driven interaction also removes the need for complex XR interfaces that can be overwhelming or challenging to use for some attendees. By using voice-driven interaction, virtual events can become more inclusive and welcoming to a broader range of participants.

Improved Attendee Experience

VOXReality’s innovative Artificial Intelligence (AI) models are set to combine language with visual understanding to deliver next-generation applications that comprehend users’ goals, surrounding environment and context. This has the potential to significantly enhance the attendee experience, as the system is better attuned to their needs and expectations.

Attendees can use their voice to perform various actions, such as asking questions, participating in polls, or even controlling the environment. For example, imagine attending a virtual trade show and being able to say, “Hey, show me the new products from Company X.” to your own personal virtual assistant. The system could then display relevant information or even take you to the Company X virtual booth. 

Voice-Driven Interaction allows attendees to engage with the event on a deeper level, leading to more meaningful interactions and better networking opportunities.

Increased Efficiency and Engagement

Finally, Voice-Driven Interaction in XR spaces can also increase efficiency and engagement. Attendees can use their voice to perform tasks quickly and easily, without the need for extensive navigation or typing. This leaves them free to focus on the content of the conference and on interacting with other attendees, rather than on solving technical issues and figuring out how the interface works. This can lead to more productive and dynamic discussions, ultimately enhancing the value of the event for everyone involved.

Future Use Cases

These advances are set to take virtual events to a new level. To that end, VOXReality will be testing three use cases of the project at Immersive Tech Week 2023: Digital Agents, Virtual Conferencing and Theatre. Stay tuned to find out more about these use cases and how they will shape our experience of virtual events.


Regina Van Tongeren

Hi, I'm Regina, Head of Marketing at VRDays Foundation. I help organise Immersive Tech Week in Rotterdam, a festival that brings together diverse voices to celebrate and explore immersive technologies' potential for a better world. I've always loved how films and games create new worlds and realities through stories, and I am fascinated by how immersive technologies are changing storytelling.

With a background in the games industry and teaching marketing, I believe immersive tech will revolutionise brand experiences, and I am curious to see the possibilities they offer for events. As a marketeer at Immersive Tech Week, I am passionate about bringing as many people as possible from all backgrounds and walks of life to Rotterdam so they can discover, experience and think about these new technologies.


Task-Oriented Dialogue Systems: Bridging the Gap Between Language and Action

Dialogue systems have become an increasingly important technology in recent years, with the potential to change the way we interact with machines and access information. These systems can be divided into two categories: task-oriented and open-domain dialogue systems. Task-oriented dialogue systems are developed to assist users with specific tasks or goals, while open-domain dialogue systems are developed to generate responses on a wide range of topics, allowing for more natural and engaging conversations.

In this article, we will focus on task-oriented dialogue systems and discuss the recent advancements in the field, including end-to-end trainable systems and multimodal input and output. We will also highlight the challenges that remain, such as handling ambiguity as well as maintaining user engagement, and explore the potential for future developments in context-aware and multilingual dialogue systems.

Task-Oriented Dialogue Systems

Task-oriented dialogue systems are designed to help users achieve a specific goal or complete a particular task, such as booking a flight, ordering food, or scheduling a meeting. These systems are different from open-domain dialogue systems, which are designed to converse with users on a wide range of topics.

At their core, task-oriented dialogue systems are about bridging the gap between language and action. Language involves the ability to communicate meaning through words and sentences, while action involves the ability to perform tasks based on that communication. By combining the two, task-oriented dialogue systems can enable machines to understand human language and help users perform tasks based on that understanding.

Task-oriented systems are increasingly being used in a variety of applications, such as customer service, e-commerce and navigation instructions. Their use is expanding because they offer users a more natural and intuitive way to interact with technology and to accomplish specific tasks easily.

Task-oriented dialogue system design. Image by Microsoft

Task-oriented dialogue systems are typically composed of several components, including an Automatic Speech Recognition (ASR) system, a Natural Language Understanding (NLU) module, a Dialogue Manager (DM), and a Natural Language Generation (NLG) module. ASR and NLU are responsible for converting the user’s spoken or written input into structured data that can be processed by the DM. The DM uses this data to determine the user’s intent and select an appropriate response. The NLG module is then responsible for generating a natural-sounding response that can be spoken or displayed to the user.
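To make the hand-off between these components more concrete, here is a minimal, rule-based sketch in Python. It is purely illustrative: the function names, the `KNOWN_CITIES` vocabulary and the response templates are assumptions made for this example, the ASR step is skipped (the input is already text), and real systems use trained models rather than keyword rules. The exact split of responsibilities between NLU and DM also varies between systems.

```python
from dataclasses import dataclass, field

# Hypothetical vocabulary used only for this illustration.
KNOWN_CITIES = ("new york", "athens", "rotterdam")


@dataclass
class Frame:
    """Structured data handed from the NLU module to the Dialogue Manager."""
    intent: str
    slots: dict = field(default_factory=dict)


def nlu(text: str) -> Frame:
    """Toy NLU: keyword rules turn raw text into an intent plus slots."""
    lowered = text.lower()
    intent = "book_flight" if "flight" in lowered else "unknown"
    slots = {}
    for city in KNOWN_CITIES:
        if city in lowered:
            slots["destination"] = city.title()
    return Frame(intent, slots)


def dialogue_manager(frame: Frame) -> dict:
    """Toy DM: decide the next system action from the structured frame."""
    if frame.intent != "book_flight":
        return {"action": "fallback"}
    if "destination" not in frame.slots:
        return {"action": "request_destination"}
    return {"action": "confirm_booking", "destination": frame.slots["destination"]}


def nlg(act: dict) -> str:
    """Toy NLG: fill a template for the chosen action."""
    templates = {
        "fallback": "Sorry, I can only help with flight bookings.",
        "request_destination": "Where would you like to fly to?",
        "confirm_booking": "Booking a flight to {destination}.",
    }
    return templates[act["action"]].format(**act)


def respond(text: str) -> str:
    """Run the text through NLU, DM and NLG in sequence."""
    return nlg(dialogue_manager(nlu(text)))


print(respond("I want to book a flight to New York"))
print(respond("Book a flight"))
```

Keeping each stage behind its own function mirrors the modular design described above: any one of them could be swapped for a learned model without changing the others.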

Advancements in Task-Oriented Dialogue Systems

In recent years, there have been several significant advancements in task-oriented dialogue systems. One of the most important advancements has been the development of end-to-end trainable systems. These systems can be trained using large amounts of conversational data, and they can learn to generate responses that are more natural and contextually appropriate.

End-to-end systems have also been shown to be effective in handling out-of-domain queries, which are queries that are not related to the primary task of the system. These systems can leverage the conversational context to generate a response that is relevant to the user’s query, even if it is not directly related to the primary task.

Challenges and Limitations

Despite the significant advancements in task-oriented dialogue systems, there are still several challenges that need to be addressed. One of the main challenges is developing systems that can understand the nuances of language and context. For example, understanding the difference between “I want to book a flight to New York” and “Can you book a flight to New York for me?” requires a deep understanding of language and context that is difficult to replicate in machines. 

Another challenge is handling ambiguity and uncertainty in user queries. Users may use vague language, make mistakes, or provide incomplete information, and the system needs to be able to handle these situations and generate an appropriate response.
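One simple way a system might react to an ambiguous query is to ask a clarifying question whenever its own interpretation is uncertain. The sketch below assumes an upstream component that returns a list of candidate intents with confidence scores. The function name `choose_action` and both thresholds are hypothetical choices for this illustration, not values taken from any particular system.

```python
def choose_action(hypotheses, min_confidence=0.6, min_margin=0.15):
    """Decide whether to act on the best intent or ask the user to clarify.

    hypotheses: list of (intent, score) pairs from a hypothetical NLU model.
    Returns ("execute", intent) or ("clarify", [candidate intents]).
    """
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    if not ranked:
        return ("clarify", [])

    best_intent, best_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0

    # Ask for clarification if the best guess is weak, or if the
    # two best guesses are too close to tell apart.
    too_weak = best_score < min_confidence
    too_close = best_score - runner_up_score < min_margin
    if too_weak or too_close:
        return ("clarify", [intent for intent, _ in ranked[:2]])
    return ("execute", best_intent)


print(choose_action([("book_flight", 0.9), ("book_hotel", 0.05)]))
print(choose_action([("book_flight", 0.5), ("book_hotel", 0.45)]))
```

The first call acts immediately because one interpretation clearly dominates. The second returns both candidates so the system can ask the user which one was meant.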

There are also ethical considerations in the field of task-oriented dialogue systems. For instance, the use of these systems in sensitive domains such as healthcare raises concerns about privacy and confidentiality. It is important for researchers and practitioners in the field to consider the ethical implications of their work and develop systems that are designed with accountability in mind.

Looking Ahead: The Future of Task-Oriented Dialogue Systems

The future of task-oriented dialogue systems is likely to be shaped by the increasing availability of multimodal data and input. As users interact with these systems using a variety of input modalities, including speech, text, and images, task-oriented dialogue systems will need to become more flexible and adaptable to accommodate these varied inputs. This could lead to the development of more sophisticated dialogue management systems that can handle a wide range of input and output modalities and enable task-oriented dialogue systems to be more effective and engaging for users.

At VOXReality, we are working on developing context-aware task-oriented dialogue systems that can understand the user’s intent and generate appropriate responses in a wide range of contexts. We are also exploring the use of multimodal input and output using a combination of speech, text, and images to make these systems more flexible and engaging for users.


Apostolos Maniatis

Hello! I'm Apostolos Maniatis, and I'm a dialogue system researcher. With a background in natural language processing and computer science, I spend my time developing innovative algorithms and techniques for creating intelligent systems that can converse with humans in natural language. I'm fascinated by the many ways that dialogue systems are transforming the way we interact with technology, and I'm committed to making these systems more intuitive, responsive, and adaptable to the needs of users.


Newsletter #1 | Discover VOXReality

Introducing the first VOXReality project newsletter!

Stay up-to-date with the latest news and updates on our XR project. From personal assistants to revolutionary theatre experiences, our team is pushing the boundaries of what’s possible in the world of XR.

By subscribing to our newsletter, you’ll be the first to know about new project releases, exclusive content, and special opportunities. Plus, you’ll get to join a community of like-minded XR enthusiasts and be part of the conversation about the future of this exciting technology.

Don’t miss out on this chance to be at the forefront of the XR revolution!


Breaking Language Barriers: Advancements in Speech Recognition and Machine Translation

Machine Translation (MT) is a powerful tool that can help overcome the language barrier and facilitate cross-cultural communication, making it easier for people to access information in languages other than their own.

Given that speech is the natural medium of communication between humans, developing solutions that can translate from speech is a crucial step towards deploying MT models in different scenarios (e.g., conferences, theaters, …) where speech is the main medium of communication.

In this article, we discuss advancements in Automatic Speech Recognition (ASR) and Machine Translation and highlight the competition between cascade and end-to-end speech translation solutions and their challenges.

Automatic Speech Recognition

Automatic Speech Recognition (ASR) refers to the technology used by machines to recognise human speech and transcribe it into text. The field of ASR has evolved significantly over the years, from classical techniques that relied on Hidden Markov Models and Gaussian Mixture Models to more recent deep learning models such as Whisper.

Image by storyset on Freepik

Classical ASR techniques worked by breaking down speech into smaller units called phonemes (e.g., the word “cat” can be broken into /k/ /æ/ /t/ in International Phonetic Alphabet notation), and then using statistical models to predict the most likely sequence of phonemes corresponding to a given audio signal. While these techniques were effective to some degree, they had limitations in their ability to handle variability in speech patterns and accents.
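The idea of finding the most likely phoneme sequence can be illustrated with the Viterbi algorithm, the standard decoding procedure for Hidden Markov Models. The sketch below is a toy: the three phoneme states, the discrete "acoustic" symbols (`burst`, `vowel`, `stop`) and all probabilities are invented for this example, whereas classical systems modelled continuous audio features with Gaussian Mixture Models.

```python
# Toy HMM for the word "cat"; every number here is made up for illustration.
STATES = ["k", "æ", "t"]
START_P = {"k": 0.8, "æ": 0.1, "t": 0.1}
TRANS_P = {
    "k": {"k": 0.1, "æ": 0.8, "t": 0.1},
    "æ": {"k": 0.1, "æ": 0.1, "t": 0.8},
    "t": {"k": 0.1, "æ": 0.1, "t": 0.8},
}
EMIT_P = {
    "k": {"burst": 0.7, "vowel": 0.1, "stop": 0.2},
    "æ": {"burst": 0.1, "vowel": 0.8, "stop": 0.1},
    "t": {"burst": 0.2, "vowel": 0.1, "stop": 0.7},
}
OBSERVATIONS = ["burst", "vowel", "stop"]


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]

    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
            column[s] = (prob, prev)
        best.append(column)

    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


print(viterbi(OBSERVATIONS, STATES, START_P, TRANS_P, EMIT_P))
# → ['k', 'æ', 't']
```

With these toy parameters the decoder recovers /k/ /æ/ /t/. Because the probabilities are fixed in advance, a speaker whose pronunciation does not match them would be decoded poorly, which hints at the variability problem mentioned above.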

Deep learning models have revolutionised the field of ASR by using neural networks to learn more complex and nuanced patterns in speech data. These models are robust to a wide variety of accents and dialects and are able to perform well in noisy environments.

Another critical aspect of models like Whisper is their multilingual nature, as they are able to transcribe speech from multiple languages with high accuracy. Overall, ASR has come a long way in recent years, and these advancements are making it easier for machines to understand and interpret human speech.

Multilingual Neural Machine Translation

Multilingual Machine Translation refers to the technology used by machines to automatically translate text or speech from one language to another, across multiple languages.

The field of machine translation has evolved significantly over the years, from Statistical Machine Translation (SMT) models that relied on large corpora of parallel texts (sentences and their translation in the target language) to the more powerful neural models.

Neural Machine Translation (NMT) has become the go-to approach, especially after the introduction of the Transformer architecture, which has revolutionized the field by making it possible to build powerful models that can handle complex language structures with ease.

Transformer Architecture. Image by Jay Alammar

SMT systems learn statistical relationships between words in the source and target languages based on their co-occurrence in the training corpus. A target-language word “T” (e.g., “world” in English) that frequently co-occurs with a source-language word “S” (e.g., “Welt” in German) is more likely to be a translation of “S”.

A translation from one language to another can end up with a different number of words or a different word order. To deal with this, SMT systems learn an alignment function that maps each word position in the source sentence to its position in the target sentence.
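The co-occurrence intuition can be sketched in a few lines of Python. The three-sentence parallel corpus below is invented for this illustration, and raw counting is a deliberate simplification: actual SMT systems estimate translation probabilities with iterative methods such as expectation-maximisation (as in the IBM alignment models) rather than taking the most frequent co-occurring word.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (German, English) sentence pairs.
PARALLEL_CORPUS = [
    ("die welt ist klein", "the world is small"),
    ("die welt ist rund", "the world is round"),
    ("hallo welt", "hello world"),
]


def cooccurrence_counts(corpus):
    """Count how often each source word appears alongside each target word."""
    counts = defaultdict(Counter)
    for source, target in corpus:
        for s in set(source.split()):
            for t in set(target.split()):
                counts[s][t] += 1
    return counts


def best_translation(counts, source_word):
    """Pick the target word that co-occurs most often with source_word."""
    return counts[source_word].most_common(1)[0][0]


counts = cooccurrence_counts(PARALLEL_CORPUS)
print(counts["welt"]["world"])            # → 3
print(best_translation(counts, "welt"))   # → world
```

In this corpus "welt" appears with "world" in all three sentence pairs but with "the" and "is" in only two, so "world" wins. For a frequent word such as "die" the raw counts are tied between several candidates, which is one reason real systems need more careful probability estimation.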

SMT models can perform well on specific domains or language pairs where there is sufficient data available. However, they often struggle to generalise to new domains or to produce fluent and natural-sounding translations.

On the other hand, NMT is capable of generalising across domains and learning shared patterns between different languages. This has contributed to the rise of multilingual models that are able to transfer knowledge from languages with large amounts of data (e.g., English, Western-European languages, Japanese, Chinese) to low resource languages (e.g., Vietnamese, Swahili, Urdu).

No Language Left Behind (NLLB) is a notable example: it has pushed the number of languages supported by a single model to over 200 and has achieved state-of-the-art results in multiple languages, especially low-resource ones. Efforts like NLLB and other multilingual models have the potential to greatly improve access to information and open the channels of communication and collaboration between different cultures.

Cascade vs. End-to-end Speech Translation

Cascade solutions for Speech Translation involve the combination of ASR and NMT components to translate speech input. However, since the ASR and NMT models are trained separately, this can lead to a reduction in the quality of the translation due to inconsistencies in the training data and procedures of the two models. Furthermore, cascade solutions are also susceptible to error propagation, where errors produced by the ASR model can negatively impact the quality of the translation.
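Error propagation in a cascade can be shown with a toy example. Everything below is a stand-in invented for this illustration: the "ASR" functions simply return fixed transcripts and the "MT" function is a word-by-word dictionary lookup. The point is only the structure, in which the translation step sees nothing but the transcript and therefore cannot recover from a recognition mistake.

```python
# Hypothetical German-to-English word dictionary for the toy MT step.
TOY_DICTIONARY = {"hallo": "hello", "welt": "world", "zelt": "tent"}


def clean_asr(audio):
    """Stand-in ASR that transcribes the (ignored) audio correctly."""
    return "hallo welt"


def noisy_asr(audio):
    """Stand-in ASR that mishears 'welt' as the similar-sounding 'zelt'."""
    return "hallo zelt"


def toy_mt(transcript):
    """Stand-in MT: translate word by word, leaving unknown words as-is."""
    return " ".join(TOY_DICTIONARY.get(word, word) for word in transcript.split())


def cascade_translate(audio, asr, mt):
    """Cascade speech translation: transcribe first, then translate the text."""
    transcript = asr(audio)
    return mt(transcript)


audio = b""  # placeholder; the toy ASR functions do not read it
print(cascade_translate(audio, clean_asr, toy_mt))  # → hello world
print(cascade_translate(audio, noisy_asr, toy_mt))  # → hello tent
```

The translation component behaves correctly in both runs. The wrong output in the second run comes entirely from the transcript it was given.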

End-to-end solutions are promising in circumventing these issues by translating directly from speech to text. While these models are capable of achieving competitive results compared to cascade solutions, they still face challenges due to the limited availability of speech-to-translated text datasets, resulting in insufficient data for training.

Despite these challenges, ongoing advancements in end-to-end solutions show promising results in closing the gap with cascade solutions. With further developments in data collection and model optimisation, end-to-end solutions may eventually surpass cascade solutions in terms of translation quality and accuracy.

In conclusion, the recent advancements in Automatic Speech Recognition and Machine Translation have significantly improved the ability of machines to understand and interpret human speech, paving the way for more effective communication across different languages and cultures.

However, there are still open issues like generalising to different domains and challenging contexts that are crucial for ensuring a satisfactory performance when Machine Translation systems are used in real-world scenarios.

In VOXReality, our mission is to develop multilingual, context-aware Automatic Speech Recognition and Neural Machine Translation models that are capable of learning new languages and accents and of taking the surrounding textual and visual context into account to produce higher-quality transcriptions and translations.


Abderrahmane Issam

Hello! My name is Abderrahmane Issam and I'm a PhD student at Maastricht University, where I'm working on Neural Machine Translation for non-native speakers. I'm passionate about research in Natural Language Processing and my job is to make Machine Translation systems robust in real-world scenarios, especially to non-native speakers' input.


The Magical Intersection of Vision and Language: Simplified for Everyone

Vision-language refers to the field of artificial intelligence that aims to develop systems capable of understanding and generating both visual and textual information. The goal is to enable machines to perceive the world as humans do, by combining the power of computer vision with natural language processing.

Understanding the Basics: Vision and Language

At its core, vision-language is about bridging the gap between two distinct modalities: vision and language. Vision involves the ability to perceive and interpret visual information, such as images and videos, while language is the system of communication that humans use to convey meaning through words and sentences.

By combining these two modalities, vision-language systems can enable machines to understand the visual world and communicate about it in a more natural and human-like way.

Image Captioning

This has numerous applications, from enhancing human-machine communication to improving image and video search capabilities. One area where vision-language is making significant progress is in image captioning, where machines are trained to generate textual descriptions of images.

This involves developing deep learning models that can analyse an image and generate a corresponding natural language description. This can be especially useful for individuals with visual impairments or for search engines looking to better understand the content of images.

Visual Question Answering

Another application of vision-language is in visual question answering (VQA), where machines are trained to answer questions based on visual information. This involves combining computer vision and natural language processing to enable machines to understand both the visual information and the meaning behind the questions being asked.

One major challenge in the field of vision-language is developing systems that can understand the nuances of language and context. For example, understanding the difference between “a red car” and “a car that is red” requires a deep understanding of language and context that is difficult to replicate in machines.

Despite these challenges, vision-language is a rapidly growing field with tremendous potential to revolutionise how machines interact with and understand the visual world.


As technology continues to advance, we can expect to see even more exciting applications of vision-language in the years to come. Another area where vision-language is making strides is in visual storytelling, where machines are trained to generate a narrative or a story from a sequence of images or videos. This involves developing models that can understand the visual content and generate a coherent and engaging story that is both natural and human-like.

Vision-language also has implications in the fields of education and healthcare. For instance, machines can be trained to understand medical images and provide more accurate diagnoses. In the education sector, vision-language can be used to develop more interactive and engaging learning materials that combine visual and textual information. 

One exciting development in the field of vision-language is the use of pre-trained language models such as GPT and BERT to improve the performance of computer vision models. By combining pre-trained language models with computer vision models, machines can be trained to perform more complex tasks such as image retrieval, image synthesis, and image recognition with greater accuracy and efficiency.

However, as with any emerging technology, there are also ethical considerations to be aware of in the field of vision-language. For instance, the use of vision-language in surveillance systems raises concerns about privacy and individual rights.

It is important for researchers and practitioners in the field to consider the ethical implications of their work and develop systems that are designed with fairness, transparency, and accountability in mind.

One of the main drivers of progress in the field of vision-language is the availability of large-scale datasets. These datasets contain millions of images and corresponding textual descriptions or annotations, and they are used to train and evaluate vision-language models. Popular datasets in the field include COCO (Common Objects in Context), Visual Genome, and Flickr30k.

In addition to datasets, the field of vision-language is also supported by a variety of tools and frameworks. These include deep learning libraries such as TensorFlow and PyTorch, as well as specialised vision-language libraries such as MMF (Multimodal Framework) and Hugging Face Transformers.

Another important aspect of vision-language research is the evaluation of models. Because vision-language models can be used for a variety of tasks, it is important to have standardised evaluation metrics that can measure performance across different domains. Popular metrics include BLEU (Bilingual Evaluation Understudy), ROUGE (Recall-Oriented Understudy for Gisting Evaluation), and METEOR (Metric for Evaluation of Translation with Explicit ORdering).
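To give a feel for how such metrics work, the sketch below implements the clipped ("modified") n-gram precision that sits at the core of BLEU. It is a simplified illustration with a single reference and whitespace tokenisation. Full BLEU additionally combines the precisions for several n-gram sizes with a geometric mean and applies a brevity penalty, and the function names here are our own rather than those of any library.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n=1):
    """Clipped n-gram precision of a candidate caption against one reference.

    Each candidate n-gram is credited at most as many times as it
    occurs in the reference, so repeating a correct word does not help.
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


reference = "a dog is on the mat"
print(modified_precision("a dog sits on the mat", reference, n=1))
print(modified_precision("a dog sits on the mat", reference, n=2))  # → 0.6
```

The clipping step is what stops a degenerate caption such as "the the the the the the the" from scoring well against "the cat is on the mat": only two of its seven words are credited.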

Finally, it is worth noting that vision-language is a highly interdisciplinary field, with researchers and practitioners from computer science, linguistics, psychology, and other disciplines contributing to its development. This cross-disciplinary collaboration has been critical to the progress of the field and will continue to be important in the future.

In conclusion, vision-language is an exciting and rapidly evolving field that has the potential to transform how machines interact with and understand the visual world. With the availability of large-scale datasets, powerful deep learning libraries, and standardised evaluation metrics, we can expect to see continued progress in the development of vision-language models and applications in a range of domains.

In VOXReality, our mission is to develop vision-language models that will be useful in a variety of applications, such as assisting people or improving the accessibility of digital content. Specifically, our tooling will generate captions or summaries for videos, which could benefit content creators, journalists or educators.


Stefanos Biliousis

Hello! My name is Stefanos Biliousis and I'm a computer vision researcher with a passion for exploring the latest advances in artificial intelligence. With a background in machine learning and computer science, I spend my days developing innovative algorithms and techniques for image and video analysis. I'm fascinated by the many ways that computer vision and AI are revolutionizing the world around us.


European Commission’s Action Plan for Media Industry’s Recovery and Transformation in the Digital Age

Europe’s media industry has faced a significant challenge in the digital decade, with traditional business models being disrupted by the rise of digital technology and the shift to online content consumption. The pandemic has only exacerbated these challenges, as advertising revenues have declined, and media companies have struggled to adapt to the new reality.  

However, there is hope for the future, as the European Commission has unveiled the Media and Audiovisual action plan to support the recovery and transformation of Europe’s media industry. The plan is a comprehensive approach aimed at supporting the media sector’s transition to the digital age, ensuring that media companies can thrive in the new media landscape. The key elements of the plan include measures for recovery, transformation, and enabling and empowering. 

The first pillar of the plan is to aid the sector’s recovery by assisting audiovisual and media organisations. The aim is to provide financial stability and liquidity by offering a user-friendly tool that will guide European audiovisual and news media companies to available sources of EU aid. Additionally, the plan seeks to increase investment in the European audiovisual industry, supporting production and distribution through enhanced equity investments.

Furthermore, introducing the ‘NEWS’ initiative will bring together various measures and support for the news media sector. This first pillar of action will ensure that citizens are equipped to navigate the complex and rapidly evolving digital media landscape. 

The second pillar is transformation. To tackle structural challenges and support the media industry in embracing the green and digital transitions, while facing intense global competition, the European Commission has several initiatives in place.  

These include establishing a European media data space to enable media companies to collaborate on data and innovate; promoting a European coalition for virtual and augmented reality (VR/AR) to allow EU media to leverage the benefits of immersive media; and working towards making the industry climate neutral by 2050 by facilitating the sharing of best practices and placing a greater emphasis on environmental sustainability in the Creative Europe MEDIA programme.

Finally, the third pillar of the plan is to enable and empower. The goal is to foster innovation in the media sector, promote fair competition, and empower citizens to access content and make informed decisions.

To achieve this, the European Commission will be taking the following actions: engaging in a dialogue with the audiovisual industry to determine concrete steps to enhance access to and availability of audiovisual content throughout the EU; investing in European media talent through mentorship, training, and supporting promising European media start-ups; improving media literacy by providing a toolbox and guidelines for member states to fulfil their media literacy obligations under the Audiovisual Media Services Directive and supporting the development of independent alternative news aggregation services that offer a diverse range of accessible information sources; and strengthening the cooperation among European media regulators through the European Regulators Group for Audiovisual Media Services (ERGA). 

This last pillar will include measures such as tax relief, subsidies, and financial guarantees, which will help to ensure that media companies can continue to operate and invest in the future. 

In conclusion, the European Commission’s action plan for Europe’s media industry is a significant step forward in supporting the sector’s recovery and transformation. It provides a comprehensive approach to address the challenges posed by the digital age, and ensures that media companies, particularly SMEs, have the resources they need to thrive in the new media landscape. The plan is a positive sign for the future of Europe’s media industry, and significant progress is expected in the years to come. 


The VOXReality Project: Transforming Europe through XR Innovation

The Voice-driven interaction in XR spaces (VOXReality) initiative is funded by the European Commission and aims to integrate language- and vision-based Artificial Intelligence models for Extended Reality (XR) experiences.

Athens, 11 January 2023 – The VOXReality project aims to disrupt the European Extended Reality ecosystem by using AI technologies to create multi-modal XR experiences combining vision and sound. The consortium is formed by 10 industry leaders from 5 European countries (Greece, Germany, Italy, Ireland and the Netherlands). The members held a two-day meeting this past October, where the project roadmap was presented.

VOXReality’s main goal is to conduct research and develop new AI models to drive future XR interactive experiences, and to deliver these models to the wider European market.

These new models will address human-to-human interaction in unidirectional (theater) and bidirectional (virtual conference) settings, as well as human-to-machine interaction by building the next generation of personal assistants.

The project will validate these developments through three use cases. The first addresses novel assistant applications such as instruction assistants, HMD-mounted technical support assistants, and navigation guides.

The second use case targets virtual conferences, developing a speech-driven animation technology to carry posture and gesture cues into avatar-based environments.

The third use case will focus on theatrical experiences, combining language translation, audio-visual user associations, and AR VFX triggered by predetermined speech, all driven by VOXReality’s pretrained language and vision models.

The initiative, funded by the European Commission, brings together the key partners to tackle the great challenge of innovation in sectors heavily affected by the COVID-19 pandemic.

The consortium, coordinated by Gruppo Maggioli, includes representatives from research and academic institutions, end-users, industry innovators and social XR experts.


Natalia Cardona – Communications Manager