
Embracing Humanity in Virtual Realms: A Journey Towards Inclusivity and Accessibility

In the bustling realm of technology, where advancements seem to leapfrog one another, the emergence of extended reality (XR) technologies has marked a significant milestone. As we delve into the vast possibilities that XR brings, it’s crucial to remember the heart of this revolution: the people. In a world rapidly embracing immersive technologies, the human-centered approach stands as the guiding light, ensuring that progress aligns with inclusivity and accessibility, fostering a society where no one is left behind.

Immersive Technologies: A Human-Centric Odyssey

In the corridors of innovation, the team at VOXReality recognises the paramount importance of putting humanity at the forefront of our extended reality projects. With a profound understanding that technology should enhance lives rather than alienate, we have embarked on a human-centric odyssey, ensuring that our immersive experiences cater to the diverse needs of people.

Our team’s commitment goes beyond the realms of technology; it’s a commitment to building bridges, connecting hearts, and making the extraordinary accessible to everyone. In this human-centric odyssey, VOXReality doesn’t just create virtual worlds; we aim to craft inclusive spaces where differences are celebrated, where barriers are shattered, and where the shared human experience becomes the cornerstone of innovation. 

It’s a conscious effort to empower individuals, irrespective of their abilities or backgrounds, to not just participate in the digital revolution but to lead it, ensuring that the promise of a better, more connected future is within reach for all.

Inclusivity and Accessibility

One of the fundamental pillars of VOXReality’s human-centred approach is inclusivity. XR technologies are breaking barriers, enabling individuals, regardless of physical ability, to explore new worlds and partake in experiences previously deemed impossible. From virtual assistants for events to immersive access to theatre plays, inclusivity is not just a concept but a tangible reality within VOXReality’s vision.

In the tapestry of human experiences, accessibility weaves the threads that connect us all. VOXReality takes pride in our meticulous design process, ensuring that XR applications are not only user-friendly but also accessible. Through several design iterations with the consortium partners, we aim to empower potential users to navigate and engage effortlessly.

The Future Beckons: A Harmonious Coexistence

As we stand on the edge of a future where the lines between reality and virtuality blur, VOXReality exemplifies the harmonious coexistence of humanity and technology. Our commitment to a human-centred approach ensures that the digital realms we create are not just immersive but also inherently humane. In this symbiotic relationship, technology amplifies human potential, fostering empathy, understanding, and shared moments of joy.

In conclusion, as we navigate the uncharted territories of extended reality, let us remember that the true essence of progress lies in the way it uplifts the human spirit. VOXReality’s unwavering dedication to a human-centred approach serves as an example illuminating the path towards an inclusive, accessible, and harmonious digital future. 

Together, as we embrace the boundless possibilities of XR technologies, let us continue this journey, ensuring that no one is left behind, and every soul finds solace and belonging in the immersive worlds we create.


Natalia Cardona

Hi! My name is Natalia Cardona and I'm a corporate communications specialist with a Master's in Journalism and Digital Content Innovation from the Autonomous University of Barcelona. I currently work on the dissemination, communication, and marketing of technology, innovation, and science for projects funded by the European Commission at F6S.


Enhancing Open-Domain Dialogue Answer Selection through Intent-Calibrated Self-Training

Can predicted intents calibrate correct answers in open-domain dialogues?

The capability of predicted intents to refine answer selection in open-domain dialogues is a topic of significant interest.

The mission of VOXReality is to explore the development of advanced context-aware task-oriented dialogue systems. In this context, Centrum Wiskunde & Informatica (CWI) has extensively explored and provided insights into whether predicted intent labels have the potential to calibrate answer labels in open-domain dialogues.

Spearheaded by the Distributed and Interactive Systems (DIS) group, this initiative has culminated in the publication of a paper titled “Intent-Calibrated Self-Training for Answer Selection in Open-domain Dialogues” in Transactions of the Association for Computational Linguistics (TACL).

This publication serves as evidence of the significant progress made in understanding the intricate interplay between predicted intent labels and calibrated answer selection. The full paper is available here.

Challenge

Answer selection models have achieved notable success through training on extensive labelled datasets. However, the process of amassing large-scale labelled data is not only labour-intensive but also time-consuming. This challenge is further exacerbated for Open-Domain Systems (ODSs) as they grapple with deciphering users’ information needs due to the unstructured nature of open-ended goals (Huang et al., 2020).

Motivation

The concept of user intents, encompassing a taxonomy of utterances, has been introduced to provide guidance to the information-seeking process (Qu et al., 2018, 2019a; Yang et al., 2020). When a potential answer (PA) does not satisfy the intent of the original question (OQ), the subsequent intent of the user is likely to be an information request (IR). For instance, if a user queries, “Can you direct me to a website for more information?” their intent is classified as IR. Overlooking the intent label IR may result in providing an answer that fails to fulfil the user’s request.

Method

We introduce a novel approach known as Intent-Calibrated Self-Training (ICAST) to enhance answer label calibration within a self-training framework. Specifically, our proposal involves leveraging predicted intent labels to calibrate answer labels. The ICAST method encompasses the following steps:

    1. Teacher Model Training: A teacher model is trained on labelled data (D^l) to predict pseudo intent labels for unlabeled data (D^u).
    2. Intent-Calibrated Pseudo Labelling: High-quality intent labels are identified using intent confidence gain, subsequently influencing the selection of samples. The answer labels are calibrated by integrating selected intent labels as supplementary inputs for answer selection.
    3. Student Model Training: The student model is trained using both labelled and pseudo-labeled data.
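To make these three steps concrete, here is a minimal, runnable sketch of the ICAST loop. Everything in it (the `ToyModel` class, the random scores, the confidence threshold, the method names) is a hypothetical stand-in for illustration only; the actual models, training objectives, and the definition of intent confidence gain are those described in the paper.

```python
import random

# Toy stand-in for the real neural intent/answer models (hypothetical,
# for illustration only; not the authors' released code).
class ToyModel:
    def fit(self, data):
        self.data = list(data)
        return self

    def predict_intent(self, dialogue):
        # Returns (intent_label, intent_confidence_gain); random placeholders here.
        return random.choice(["IR", "other"]), random.random()

    def select_answer(self, dialogue, intent=None):
        # The selected intent label acts as an extra calibrating input.
        return f"answer-given-{intent}"

def icast(labeled, unlabeled, threshold=0.8):
    teacher = ToyModel().fit(labeled)            # Step 1: teacher on labelled D^l
    pseudo = []
    for dialogue in unlabeled:                   # Step 2: calibrated pseudo-labelling
        intent, gain = teacher.predict_intent(dialogue)
        if gain >= threshold:                    # keep only high intent-confidence-gain samples
            pseudo.append((dialogue, teacher.select_answer(dialogue, intent)))
    return ToyModel().fit(labeled + pseudo)      # Step 3: student on D^l + pseudo-labelled data

student = icast([("q1", "a1")], ["q2", "q3", "q4"])
```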

Figure (b) visually represents the Intent-Calibrated Self-Training (ICAST) process. The flow involves training the teacher model, intent-calibrated pseudo labelling, and student model training. In contrast to the basic teacher-student training depicted in Figure (a), ICAST enhances the quality of pseudo-labeled data, resulting in significant performance gains.

Conclusion

In this article, we introduce Intent-Calibrated Self-Training (ICAST), a framework rooted in teacher-student self-training and intent-calibrated answer selection. The approach entails training a teacher model on labelled data to predict intent labels for unlabeled data, selecting high-quality intents via intent confidence gain to enhance pseudo answer label prediction, and retraining a student model using labelled and pseudo-labeled data.

Extensive experimentation on two benchmark datasets demonstrates the superiority of ICAST over baselines even with minimal labelled data (1%, 5%, and 10%). Our future work aims to explore additional predictable dialogue contexts, such as user profiles, beyond intents.

Thanks to this research, VOXReality stands poised to harness these insights to advance the frontiers of context-aware task-oriented dialogue systems. These insights will serve as the driving force propelling us to push the boundaries, ushering in a new era of inquiry, innovation, and seamless application.


References

  • Deng, W., Pei, J., Ren, Z., Chen, Z., & Ren, P. (2023). Intent-calibrated self-training for answer selection in open-domain dialogues. arXiv preprint arXiv:2307.06703.
  • Huang, M., Zhu, X., & Gao, J. (2020). Challenges in building intelligent open-domain dialog systems. ACM Transactions on Information Systems.
  • Qu, C., Yang, L., Croft, W. B., Trippas, J. R., Zhang, Y., & Qiu, M. (2018). Analyzing and characterizing user intent in information-seeking conversations. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval.
  • Qu, C., Yang, L., Croft, W. B., Zhang, Y., Trippas, J. R., & Qiu, M. (2019a). User intent prediction in information-seeking conversations. In Proceedings of the Conference on Human Information Interaction and Retrieval.
  • Yang, L., Qiu, M., Qu, C., Chen, C., Guo, J., Zhang, Y., Croft, W. B., & Chen, H. (2020). IART: Intent-aware response ranking with transformers in information-seeking conversation systems. In Proceedings of The Web Conference.

Jiahuan Pei

As a researcher at the CWI (NWO-I), I focus on generative dialogue systems in extended reality (XR) specifically for the VOXReality project. This project combines the fields of artificial intelligence, natural language processing, and immersive technologies to create interactive and engaging conversational experiences in virtual and augmented reality environments. We explore innovative ways to enhance human-computer interactions by enabling natural and realistic conversations with virtual entities. By leveraging the power of generative dialogue systems, we aim to develop intelligent agents capable of understanding and responding to user input in a dynamic and contextually appropriate manner.


Writing a Blog Post with ChatGPT: Unlocking the Power of Generative AI

Unless you have been living under a rock for the last couple of months, you have probably heard about ChatGPT. It is an example of a generative AI model – a neural network that is trained on existing data to generate new content.

In this blog post, we will look at using OpenAI’s ChatGPT to write a blog post about writing a blog post with ChatGPT. Let us get the basics sorted out first. You can access the model, either to follow along with this blog post or to use it in your own projects, by following this link and creating a free account. And that is it! You have the power of generative AI at your fingertips.
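As an aside for readers who prefer scripting to the web interface: the same kind of prompt can be sent through OpenAI’s Python client. The sketch below is illustrative and assumes the pre-1.0 `openai` package (`pip install openai`); newer versions of the library expose a different interface, so treat it as a hedged example rather than a definitive recipe.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # from your OpenAI account settings

# Send one chat message to the gpt-3.5-turbo model and print the reply.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": "Write me the structure of a short blog post "
                          "about writing a blog post with ChatGPT."}],
)
print(response.choices[0].message.content)
```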

Introduction

The goal of the VOXReality project is to enhance the speech-based interface in extended reality. Our team at Maastricht University is working on AI models that can transcribe and translate speech into a chosen language while also taking into account the surrounding context – previous sentences, visual information, etc.

This task is part of the broader field of Natural Language Processing, which also includes Large Language Models such as ChatGPT. Recently, researchers have found that ChatGPT performs well on context-aware translation.

While ChatGPT cannot be used directly in VOXReality due to its memory and computational requirements, we are investigating what makes it and other models work well on this task and how we can transfer this knowledge into models that are more suitable to our needs. But before we dive deeper, let us take a look at ChatGPT from the everyday user’s perspective.

We can instruct ChatGPT to write the whole blog post in one go and then ask it to refine it or we can take a more step-by-step approach and first work on the structure of the text and later expand each section. Here we will use the second approach. We can start by asking ChatGPT to write us the outline of our blog:

> I want you to write me the structure of a short blog post for a general audience about writing a blog post with ChatGPT. Keep only the section titles

Title: Writing a Blog Post with ChatGPT: Unlocking the Power of AI

  • Introduction: Embracing the Potential of AI in Blogging
  • Understanding ChatGPT:
  • What is ChatGPT?
  • How does ChatGPT work
  • Benefits of using ChatGPT in blogging
  • Choosing the Right Topic:
  • Selecting an engaging topic
  • Considering the target audience
  • Researching and gathering information
    […]

We asked it to include only the section titles. The algorithm does not always follow instructions precisely. This can be mitigated by asking it to refine the answer – ChatGPT remembers our conversation. It has a limit of 16 or 32 thousand tokens of memory, depending on the version. For simplicity, you can think of tokens as words (although less common words can be split into sub-words).
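If you are curious how text gets split into tokens, you can inspect it yourself with tiktoken, OpenAI’s open-source tokeniser library. A small illustrative example (assuming the `cl100k_base` encoding used by gpt-3.5-turbo; the exact token ids are incidental):

```python
# Inspecting how text is split into tokens (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("How are you doing?"))  # a short list of token ids
# A rare word is split into several sub-word tokens:
print([enc.decode([t]) for t in enc.encode("antidisestablishmentarianism")])
```

So, let us give it another chance and ask again: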

> Keep only the section titles

Title: Writing a Blog Post with ChatGPT: Unlocking the Power of AI

  1. Understanding ChatGPT
  2. Choosing the Right Topic
  3. Planning Your Blog Post
  4. Engaging Your Readers
  5. Enhancing Readability and Coherence
  6. Leveraging the Power of ChatGPT
  7. Human Touch and Editing
  8. Optimizing for SEO
  9. Conclusion: Unleashing Your Blogging Potential with ChatGPT

Now, that is better. The structure is not what I had in mind, though. It would be nice to know a bit more about ChatGPT and how it works, but we already know the topic. Also, planning, engaging readers, and SEO could be left for another post. We can touch on “Enhancing Readability and Coherence” in the “Human Touch and Editing” section. Let us change that:

> Remove sections: 2, 3, 4, 5, 8. Add the section “Limitations of ChatGPT” before Conclusions.

Title: Writing a Blog Post with ChatGPT: Unlocking the Power of AI

  1. Understanding ChatGPT
  2. Leveraging the Power of ChatGPT
  3. Human Touch and Editing
  4. Limitations of ChatGPT
  5. Conclusion: Unleashing Your Blogging Potential with ChatGPT

Understanding ChatGPT

We are ready to expand the first section. Let us see what ChatGPT can tell us about itself. I split the response to discuss each part in detail:

> Write the “Understanding ChatGPT” section in less than 100 words

ChatGPT is an AI-powered language model that generates text responses based on input prompts. It uses deep learning and extensive training on vast amounts of text data to mimic human-like conversations.

The first two sentences answer the question of what ChatGPT is. It is a language model, meaning it is trained to predict the next word based on the previous words. To illustrate this, assume we have the following incomplete sentence:

“How are you …”

The task of the language model would be to predict the next word (probably “doing”) in this case. We can use this mechanism to generate text word by word (we call this type of algorithm autoregressive). The algorithm is trained on a vast amount of text coming from the Internet, books, newspapers, and other sources. As this text is written by humans, the algorithm learns to mimic them.
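To see the autoregressive idea in miniature, here is a toy example in which the “model” is just a bigram lookup table – a deliberately crude, hypothetical stand-in for a neural language model. The loop, though, is exactly the mechanism described above: predict the next word, append it, repeat.

```python
# Toy autoregressive generation: the "language model" is a bigram table.
BIGRAMS = {"how": "are", "are": "you", "you": "doing", "doing": "today?"}

def generate(prompt, max_new_words=5):
    words = prompt.lower().split()
    for _ in range(max_new_words):
        next_word = BIGRAMS.get(words[-1])   # predict the most likely next word
        if next_word is None:                # nothing to predict; stop
            break
        words.append(next_word)              # feed it back in: autoregression
    return " ".join(words)

print(generate("How are you"))  # -> "how are you doing today?"
```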

This is only the first step, though. The problem with learning just by completing the user’s input is that the completion might not be the answer the user actually expects. That is why ChatGPT is next trained on question-and-answer pairs, both gathered from the internet and produced by humans hired specifically to generate training data for the model.

In the last step, ChatGPT is trained using Reinforcement Learning from Human Feedback, in which humans rate the performance and the model is trained to obtain the highest rating.

Architecturally, ChatGPT is based on the Transformer – a type of neural network used extensively in many other tasks of Natural Language Processing. Two characteristics made Transformers a de-facto industry standard for NLP. 

First, they are highly parallelizable, meaning that we can train them substantially faster than previous architectures on modern graphics cards. Graphics cards can perform a vast number of computations at the same time, and a model that can exploit this fact benefits greatly.

Second, they make use of self-attention, which allows the model to learn long-term dependencies. In the self-attention module, the words (or, more precisely, tokens) can gather information from even very distant parts of the sentence. This makes the Transformer very powerful and allows it to accept and produce long inputs and outputs.
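For the curious, here is a bare-bones NumPy sketch of the self-attention computation. It omits the learned query/key/value projections and the multiple heads of a real Transformer, but it shows the essential point: every token’s output is a weighted mixture of all positions in the sequence, computed with a few large matrix multiplications that a graphics card can parallelise.

```python
import numpy as np

def self_attention(X):
    """X: (sequence_length, model_dim) array of token embeddings."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # all pairwise token scores at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ X                               # each output mixes every position

X = np.random.randn(6, 8)        # 6 tokens, 8-dimensional embeddings
print(self_attention(X).shape)   # (6, 8)
```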

While it can be a valuable tool for generating ideas and refining content, it’s important to remember that ChatGPT lacks true understanding and may produce inaccurate or nonsensical responses. By combining the power of ChatGPT with human judgment and critical thinking, bloggers can harness its capabilities to enhance their writing process.

ChatGPT has made a good point there. There is no guarantee that whatever it writes is true or makes sense. We will touch on this topic more in the “Limitations of ChatGPT” section.

Leveraging the Power of ChatGPT

> Write the “Leveraging the Power of ChatGPT” section in less than 150 words

ChatGPT offers a range of benefits that bloggers can leverage to enhance their writing process. Firstly, it serves as a valuable brainstorming tool by generating ideas and expanding on concepts. You can use it to overcome writer’s block or explore different angles for your blog post.
Additionally, ChatGPT can help you clarify and elaborate on specific topics. If you need more information or want to dive deeper into a subject, ChatGPT can provide insights and explanations to enrich your content.

In this blog post we use ChatGPT to accomplish a specific goal – to write a blog post on a concrete topic. But it can also be used almost like a search engine, only without links. You can ask questions on a wide range of topics and use the answers as a stepping stone if you are interested in pursuing a topic in more depth with other tools.

Another advantage of using ChatGPT is its ability to assist in proofreading and refining your blog post. It can offer alternative phrasing, suggest improvements, or help identify errors or inconsistencies in your writing.
However, it’s important to exercise caution and critically evaluate the responses generated by ChatGPT. While it can be a valuable tool, it may occasionally provide inaccurate or irrelevant information. Therefore, it’s crucial to review and validate the content before incorporating it into your blog post.

We see this warning reiterated again and again throughout this blog. ChatGPT is very cautious and avoids presenting itself as a trusted source of knowledge. It reminds us that the responsibility for the text lies with us – the human (co)authors. This is recognised by the scientific community, and increasing correctness and alignment is an active area of research.

By harnessing the power of ChatGPT effectively, bloggers can enhance their writing process, overcome challenges, and create engaging and informative content for their readers.

Human Touch and Editing

> Write the “Human Touch and Editing” section in less than 100 words

While ChatGPT can assist in generating content, it’s essential to add your unique human touch to the writing process. Human editing and refinement are crucial to ensure clarity, coherence, and a personal voice in your blog post. Take the time to review and revise the content generated by ChatGPT, making sure it aligns with your style and resonates with your audience. Injecting your expertise, creativity, and personality into the post adds authenticity and makes it stand out. Embrace the role of an editor, fine-tuning the content to perfection and crafting a compelling narrative that captures the attention of your readers.

One point that I hoped ChatGPT would touch on is plagiarism. Remember, language models are trained on existing texts – books, blogs, etc. They learn to mimic the authors by predicting the next word autoregressively. While the amount of data used to train models like ChatGPT gives hope that they learn to generalise rather than memorise the training data, we cannot be sure that they do not reproduce what they have already seen before. Thus, it is important to check the responses for plagiarism if we intend to incorporate them into our work.

Limitations of ChatGPT

> Write the “Limitations of ChatGPT” section in less than 100 words

While ChatGPT is a powerful tool, it does have limitations. One key limitation is that it lacks true understanding and consciousness, relying solely on patterns and statistical correlations in the training data. As a result, it may generate responses that are incorrect, nonsensical, or biased. ChatGPT also tends to be sensitive to the input phrasing, meaning slight changes in wording can yield significantly different responses. It’s important to critically evaluate and verify the information provided by ChatGPT. Balancing the use of ChatGPT with human judgment is crucial to ensure the accuracy, quality, and ethical considerations of the content produced.

While ChatGPT is correct here, the list of limitations is longer than that paragraph leads us to believe. It can give incorrect information, and it has no way to access the Internet to retrieve up-to-date information. It was trained on data collected up to a certain point, so it is not “aware” of anything that happened after that cut-off.

It also has problems with mathematical calculations and specialised topics. We can see, too, that it repeats the need for human review of its outputs. It is important to keep these limitations in mind when using this powerful tool.

Conclusions

> Write the “Conclusions” section in less than 100 words

ChatGPT presents an exciting opportunity to augment your blogging journey with the power of AI. By understanding its capabilities and limitations, you can leverage ChatGPT effectively to generate ideas, clarify concepts, and refine your content. Remember to incorporate your unique human touch and editorial skills to add authenticity and engage your readers. While ChatGPT is a valuable tool, it should be used in conjunction with critical thinking and careful review. By harnessing the potential of ChatGPT responsibly, you can unlock new levels of creativity, efficiency, and impact in your blog posts.

Let me close this post by stating that in VOXReality we are using the same type of neural network that ChatGPT is based on for the tasks of translation and speech recognition. Apart from that, we are also investigating how to incorporate language models, such as ChatGPT, to enhance our algorithms.

Finally, let us ask ChatGPT to write the introduction and put everything together. Here is the whole blog post written by ChatGPT:

Welcome to the future of blogging with ChatGPT! As AI technology continues to evolve, writers now have access to powerful language models like ChatGPT to enhance their creative process. In this blog post, we’ll explore how you can effectively utilize ChatGPT to generate engaging content, overcome writing challenges, and amplify your blogging potential. From understanding the fundamentals of ChatGPT to adding your personal touch, we’ll delve into the tips and strategies that will empower you to create captivating blog posts that leave a lasting impact on your readers. Let’s unlock the power of AI in blogging together!

Understanding ChatGPT

ChatGPT is an AI-powered language model that generates text responses based on input prompts. It uses deep learning and extensive training on vast amounts of text data to mimic human-like conversations.

While it can be a valuable tool for generating ideas and refining content, it’s important to remember that ChatGPT lacks true understanding and may produce inaccurate or nonsensical responses. By combining the power of ChatGPT with human judgment and critical thinking, bloggers can harness its capabilities to enhance their writing process.

Leveraging the Power of ChatGPT

ChatGPT offers a range of benefits that bloggers can leverage to enhance their writing process. Firstly, it serves as a valuable brainstorming tool by generating ideas and expanding on concepts. You can use it to overcome writer’s block or explore different angles for your blog post.

Additionally, ChatGPT can help you clarify and elaborate on specific topics. If you need more information or want to dive deeper into a subject, ChatGPT can provide insights and explanations to enrich your content.

Another advantage of using ChatGPT is its ability to assist in proofreading and refining your blog post. It can offer alternative phrasing, suggest improvements, or help identify errors or inconsistencies in your writing.

However, it’s important to exercise caution and critically evaluate the responses generated by ChatGPT. While it can be a valuable tool, it may occasionally provide inaccurate or irrelevant information. Therefore, it’s crucial to review and validate the content before incorporating it into your blog post.

By harnessing the power of ChatGPT effectively, bloggers can enhance their writing process, overcome challenges, and create engaging and informative content for their readers.

Human Touch and Editing

While ChatGPT can assist in generating content, it’s essential to add your unique human touch to the writing process. Human editing and refinement are crucial to ensure clarity, coherence, and a personal voice in your blog post. Take the time to review and revise the content generated by ChatGPT, making sure it aligns with your style and resonates with your audience. Injecting your expertise, creativity, and personality into the post adds authenticity and makes it stand out. Embrace the role of an editor, fine-tuning the content to perfection and crafting a compelling narrative that captures the attention of your readers.

Limitations of ChatGPT

While ChatGPT is a powerful tool, it does have limitations. One key limitation is that it lacks true understanding and consciousness, relying solely on patterns and statistical correlations in the training data. As a result, it may generate responses that are incorrect, nonsensical, or biased. ChatGPT also tends to be sensitive to the input phrasing, meaning slight changes in wording can yield significantly different responses. It’s important to critically evaluate and verify the information provided by ChatGPT. Balancing the use of ChatGPT with human judgment is crucial to ensure the accuracy, quality, and ethical considerations of the content produced.

Conclusions: Unleashing Your Blogging Potential with ChatGPT

ChatGPT presents an exciting opportunity to augment your blogging journey with the power of AI. By understanding its capabilities and limitations, you can leverage ChatGPT effectively to generate ideas, clarify concepts, and refine your content. Remember to incorporate your unique human touch and editorial skills to add authenticity and engage your readers. While ChatGPT is a valuable tool, it should be used in conjunction with critical thinking and careful review. By harnessing the potential of ChatGPT responsibly, you can unlock new levels of creativity, efficiency, and impact in your blog posts.

 


Paweł Mąka

Hello! My name is Paweł Mąka and I am a PhD candidate at Maastricht University. I am working on Natural Language Processing, particularly focusing on Machine Translation. My research centers around empowering artificial neural networks to leverage translation context more effectively. My passion lies in the fascinating world of neural networks and their deep understanding of language.


Understanding AR Modes and Choosing AR Devices

Extended reality – also called XR for short – is a term that crops up ever more frequently in blog posts, articles, and social media, and is gaining irreversible traction. In a field marked by such rapid evolution, it is worth asking: what is the shape of XR today, and how can it shape our collective tomorrow?

XR describes a spectrum of interactive environments which combine physical and virtual space to varying degrees, with Virtual Reality sitting at one end and physical reality at the other.

The Virtual Reality end of this spectrum is a combination which can be clearly described and communicated. It describes experiences where the user’s vision of the physical world is entirely blocked and the user is instead surrounded by virtual content. Such experiences require a specialised device, called a virtual reality headset.

The headset is typically also equipped with speakers for spatial audio output, so that the resulting experience occupies two of the user’s predominant senses for perceiving the environment. Virtual reality applications use human physical movement as an input modality for driving interactions, frequently assisted by controllers which can also provide haptic feedback to the sense of touch, thus situating the user in an interactive virtual environment in an immersive way.

Augmented reality, on the other hand, is not as straightforward to describe. The definition for AR is the overlay of virtual content on physical space, but the execution of this idea entails variations that end up offering a range of experiences with significant differences.

First, unlike with the specialised devices required to run VR applications, the public has so far associated Augmented Reality with commonplace smartphones that have the necessary hardware and software components to support AR applications. In hardware terms, that involves adequate computational power, one or more embedded cameras, and one or more inertial measurement units (IMUs).

In software terms, it involves at the minimum algorithms for tracking translations in a relative coordinate system by using the input from the IMUs, image analysis with feature extraction and matching for correcting positional drift, and depth estimation and mesh generation for understanding the geometry of the physical space. 

The application can then overlay digital content on the feed from the smartphone’s camera in a way that respects geometry and device orientation, and display the composite on the smartphone device screen. Users are able to interact with the digital content by interacting with the device, e.g. with touches on the touchscreen, as in other non-XR mobile applications. 
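To make the “respects geometry and device orientation” part more tangible, here is a simplified pinhole-camera projection in NumPy. The intrinsics and pose values are invented for illustration: given the pose estimated by the tracking stack, a virtual 3D anchor is projected to the pixel where the overlay should be drawn on the camera feed.

```python
import numpy as np

# Invented camera intrinsics: focal lengths fx, fy and principal point cx, cy.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

R = np.eye(3)                  # device orientation from the IMU/feature tracker
t = np.zeros(3)                # device position in world coordinates

anchor_world = np.array([0.1, -0.05, 2.0])   # virtual object ~2 m in front

p_cam = R @ anchor_world + t   # world -> camera coordinates
u, v, w = K @ p_cam            # camera -> homogeneous pixel coordinates
print(f"draw overlay at pixel ({u / w:.0f}, {v / w:.0f})")  # -> (360, 220)
```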

It follows that viewing and interacting with the AR application as a user experience is not innovative per se. AR technology in this context has so far been used for very popular games, like Pokémon Go, and in the art space, e.g. for AR art galleries like in Artivive.

On another line of thought, the public is slowly but steadily getting introduced to devices called smart-glasses or AR glasses or AR displays – the terminology is still not standardised, since this is a rather new field for mass production. For the scope of this post, we will adopt the term AR displays. 

AR displays are glass frames with an embedded display, either on one lens only (monocular, for less obtrusion) or on both (binocular, for more immersion). AR displays cast a projection on the surface of the lens so that the virtual content is interposed between the user’s eye and the physical space. In other words, AR displays can be perceived as portable monitor displays. Just like with traditional monitors, the size of the projection area and its resolution define the visualisation capabilities.

AR displays may have little or no computational power themselves and usually depend on a connection to an external device. AR displays in the monocular category can usually connect wirelessly to a smartphone for increased portability, while AR displays in the binocular category (which typically support higher resolutions) connect via a cable to a variety of devices, notably laptops and handheld consoles.

Common use cases for monocular AR displays include short interactions on the go, like reading messages, previewing notifications, getting directions, viewing photos or watching videos. Common use cases for binocular AR displays include a great range of everyday tasks, like office work, multimedia editing or gaming. AR displays may also have a built-in camera, so that the user can take pictures of their surroundings in a less intrusive manner (as you would e.g., with a GoPro camera).

In conclusion, it is important to note that although these devices do succeed in overlaying virtual content on physical space, the composite result is restricted to overlaying a flat 2D screen on 3D space, unlike the previously mentioned smartphone AR applications, where the virtual content is placed in a way that respects depth and perspective to simulate 3D objects.

The next question that sensibly comes to mind is: Is there an AR display with compositing capabilities on par with AR mobile applications? Here is where the waters become even more murky. It can be stated that there are currently no consumer-oriented devices that support such an experience. The lay of the land is shifting fast though, and change is coming from two opposing directions towards this desired middle ground. 

On the one hand, the VR space is working towards lifting the impenetrable visual barrier between the user and the physical space. Producers are testing out VR headsets with front-facing colour cameras of adequate resolution to record and stream the physical environment back to the VR display, so that the user can re-establish visual contact with their surroundings while wearing the headset. 

This feature, called pass-through, can allow developers to produce applications which combine physical world data with virtual world data in novel ways. One could not only view 3D virtual objects positioned in their physical space in a more immersive way compared to a flat smartphone screen, but one could also manipulate said objects using controllers and/or their hands, as well as define how these objects interact with the rest of the physical world (such as collisions or context-aware behaviours). This combination is frequently called Mixed Reality.

On the other hand, the AR space is also evolving rapidly by experimenting with AR devices that have a glass-like form factor, but adequate sensors and computational power to independently support AR like a smartphone device would. Such devices can vary greatly in their concept and the related keywords used for communication can become even more confusing. 

Terms like head-mounted AR device, holographic device and/or Mixed Reality device are sometimes used interchangeably. Regardless of the choice of words, the decisive factor remains the same: these headsets allow for a more seamless and powerful blend between the virtual and the physical, offering more valuable experiences to their users.

It is worth noting that such devices are currently restricted by short battery life (approximately 2 hours, depending on intensity of use), may exhibit (over)heating issues due to their compact form factor that likewise relate to the intensity and duration of use, and may be reported as uncomfortable by their users due to weight or fitting issues.

They may also conflict with, or require additional lens fittings for, individuals who wear vision-correction glasses. Due to their high price ranges, they are currently out of reach for the greater public and are mainly used by enterprises in research and industry. Nonetheless, popularisation is growing fast, as the VR industry has proven in the past 5 years.

Therefore, it is reasonable to expect decisive changes in the coming years that will determine whether one tendency or the other takes hold, and which concepts will pass the experimental phase and establish themselves as the XR standards of the future. The future is promising and unfolding fast!


Tareq Chihabi

Project Manager.

My projects mainly focus on the Medical and Health, IT, Cybersecurity, and XR/AR domains, at both the management and technical levels. I also have a deep background in business modelling and business planning, and in constructing exploitation plans for specific products.


Exploring Mozilla Hubs: Creating Immersive Social Experiences

One of the objectives of VOXReality is to transform the future of virtual events. Those events can be completely hosted and run online, allowing attendees to participate from anywhere in the world. To achieve this, a virtual conference platform is utilised to develop and share the virtual environment. Specifically, VOXReality explores the possibilities of Mozilla Hubs for the creation of immersive experiences for virtual conferences. 

In the ever-evolving landscape of virtual reality (VR) and augmented reality (AR), Mozilla Hubs has emerged as a powerful platform for creating and sharing immersive social experiences. Developed by the Mozilla Foundation, the same organisation behind the Firefox web browser, Mozilla Hubs offers a unique blend of collaboration, communication, and creativity. 

In this article, we will delve into the capabilities of Mozilla Hubs and explore how it enables users to connect and interact in virtual spaces, fostering a new era of social engagement.

 

Virtual Avatars by Mozilla Hubs

Virtual Spaces and Customisation

At the core of Mozilla Hubs lies the ability to create and customize virtual spaces, allowing users to design their own digital environments. Whether it’s a conference room, art gallery, classroom, or a dreamy landscape, the possibilities are endless.

Users can access a diverse range of 3D models, images, and audio files to customise their environments and create a special ambiance. With an easy-to-use interface, users can modify layouts, rearrange objects, and create interactive elements, providing a truly immersive experience for participants.

Seamless Communication and Collaboration

Mozilla Hubs facilitates real-time communication and collaboration, enabling users to interact with one another within the virtual space. 

Integrated voice chat allows participants to engage in natural conversations, making it ideal for meetings, workshops, and social gatherings. Additionally, text chat features enable users to share messages, links, and emojis, enhancing communication in noisy or crowded environments. The platform also supports screen sharing, enabling easy collaboration on documents, presentations, and media content. 

These collaborative features empower users to work together, learn, and share ideas, breaking down geographical barriers and fostering meaningful connections.

Cross-Platform Accessibility

Mozilla Hubs’ cross-platform compatibility is one of its key advantages. The platform can be accessed from various devices, including desktop computers, laptops, smartphones, and VR headsets. 

This inclusivity ensures that users can connect and participate regardless of their preferred device or technology. With the rise of WebXR, Mozilla Hubs leverages web technologies, allowing users to access virtual spaces through web browsers, eliminating the need for additional software installations. 

This accessibility makes Mozilla Hubs an attractive choice for a broad range of users, extending its reach and impact in the virtual world.

Integration with Web-Based Content

Mozilla Hubs smoothly integrates with web-based content, unlocking a world of possibilities for immersive experiences. Users can embed websites, videos, images, and interactive applications within virtual spaces, augmenting the environment with dynamic content. 

This integration enhances the versatility of Mozilla Hubs, making it suitable for educational purposes, virtual events, art installations, and more. By incorporating web-based content, users can leverage existing resources and create interactive experiences that blend the real and virtual worlds.

Office scenario in Hubs, by Mozilla Hubs

Privacy and Security

Mozilla Hubs prioritizes user privacy and security, adopting measures to protect participants’ personal information and provide a safe virtual environment. The platform adheres to Mozilla’s data practices, which emphasize user control and data minimization. Additionally, Mozilla Hubs supports room moderation, enabling hosts to manage and control access to spaces, ensuring a secure and respectful environment for all participants.

Mozilla Hubs is a powerful platform that unlocks the potential of robust social interactions. In the VOXReality project, we intend to build a truly immersive experience by fusing Hubs’ features with VOXReality AI algorithms and leveraging all the advantages that a virtual event has to offer. Attending conferences has never been so enjoyable and simple.


Dimitris Kontopoulos

Hello, I am Dimitris Kontopoulos, a virtual reality engineer with a passion for creating immersive experiences in virtual spaces. Drawing on my background of utilizing machine learning algorithms in virtual reality development, along with my passion in game development, I constantly strive to push the boundaries of interactive and captivating virtual reality adventures. I find it fascinating how virtual reality technologies have the power to revolutionize the way people engage in various activities across all aspects of life.


Understanding the User-Centred Design Process of VOXReality: A Comprehensive Overview

Enhancing VR Conferences, Augmented Theatre, and Training Assistants.

The development of VOXReality is centred around prioritising the current needs of users. This interactive-systems development approach places significant emphasis on comprehending and assessing user requirements and preferences. With this approach at its core, VOXReality endeavours to unlock the potential of XR applications that will revolutionise immersive experiences and drive progress in various industries. The efficacy of these technologies will be demonstrated through three distinct use cases: VR Conferences, Augmented Theatres, and Training Assistants.

This blog post presents the first-year timeline of the requirement-gathering phases, then an overview of each of the three focus-group workshops (VR Conferences, Augmented Theatres, and Training Assistants), followed by a brief conclusion regarding the data collection and analysis of the user-centric methods employed in our focus-group study.

Timeline of our initial requirement gathering leading towards planning for developmental stages.

First Focus Group Workshop: VR conference

The CWI team conducted a focus-group workshop on January 27th, 2023, at Spaces Herengracht in Amsterdam (Netherlands) to gather user requirements for the VOXReality project. The goal was to gain deeper insights into the needs of users and organisers at VR conferences.

The workshop had 6 participants, including VRDays conference organisers and experienced end-users. With the help of a moderator and an assistant, we brainstormed ideas for the role and design of virtual agents and language translation at VR conferences. We collected and documented all the activity sheets and transcribed the voice recordings of the participants.

The outcomes were then analyzed by our user-centered research team and turned into a final list of high-quality user requirements, which are scientifically documented in the D2.1 Project Deliverable.

Our workshop was a success in gathering valuable insights and meeting the needs of our stakeholders. We extend our sincere appreciation to the hosts and participants for their invaluable contributions to furthering scientific research. 

Second Focus Group Workshop: Augmented Theatre

On January 30th, 2023, the CWI team organised its second focus-group workshop, in Athens, to gather user requirements for the Augmented Theatre use case in partnership with the Athens Epidaurus Festival.

The session gathered detailed user information aimed at understanding the preferences and requirements of both users and organisers regarding subtitles and visual effects in theatre plays, thereby enhancing the overall theatre experience.

Through participant insights in user-centric activities, specific needs and preferences were identified to inform improvements in these areas. Furthermore, the workshop facilitated a brainstorming session to explore ideas for the user interface and interaction methods between the audience and augmented reality (AR) application.

This collaborative exercise fostered the exploration of innovative approaches, enabling seamless and intuitive interactions with AR technology to enhance audience engagement and immersion during theatre performances.

After the workshop, our user-centred research team diligently analysed the outcomes and transformed them into a finalised and scientifically documented list of high-quality user requirements, which can be found in the deliverable report of the project. The workshop’s collaborative atmosphere and active participation contributed to significant advancements in our understanding and meaningful progress in the Augmented Theatre use case.

We thank the hosts and participants for their invaluable contributions to advancing scientific research.

Third Focus Group Workshop: Training Assistant

On February 28th, 2023, a focus-group workshop was conducted with HOLO-LIGHT to gather user-centric requirements for the AR Training Assistant use case.

We sought to gain a deep understanding of the requirements involved in delivering machine assembly training through virtual agents utilising AR glasses. The activities were user-centric to identify the specific needs and considerations for effective training using this technology.

This brainstorming session with the chief project manager and a scientific researcher from HOLO-LIGHT generated innovative ideas regarding the role and design of AR training agents in a machine-assembly setup. Because the discussions included input from expert users, the outputs yielded particularly refined user requirements.

This workshop marked the conclusion of our focus-group workshops trilogy, and we are pleased to share that the outcomes have been successfully processed and documented in the project deliverable.

By the time this post is public, this documentation will already be feeding into further development stages. We thank the hosts and participants for their valuable contribution to scientific research.

Collected Data and Analysis

During the sessions, a wide range of data was collected, including text, images/drawings, and audio. Textual data consisted of worksheets with questions and answers about users’ introductions, expertise in VR/AR, and related domains, as well as their past experiences and preferences.

Images and drawings included timeline visuals depicting their entire experience as participants and visual markers representing user choices. Audio recordings of interactive discussions were also transcribed for easier analysis. We used Thematic Analysis [3] to analyse the qualitative data collected from the focus groups, which we will explore in more detail in future blog posts.

Selected scans of user activities from the data worksheets filled in by participants

Towards the evaluations

Through thorough analysis and synthesis of the collected data, we were able to uncover significant themes, patterns, and insights into the attitudes and behaviours of the participants in augmented reality (AR) and virtual reality (VR) applications across all the use-cases of the project. 

This analysis allowed us to develop a comprehensive understanding of how users engage with and perceive the AR/VR experiences, providing valuable insights that can inform the design and improvement of our VOXReality applications in these use cases. 


Moonisa Ahsan

Moonisa Ahsan is a post-doc researcher in the DIS (Distributed & Interactive Systems) group of CWI (NWO-I). In VOXReality, she contributes to understanding next-generation applications within Extended Reality (XR), and to better understanding user needs and leveraging that knowledge to develop innovative solutions that enhance the user experience in all three use cases. She is a Marie Curie alumna and her scientific and research interests are Computer Graphics (CG), Interface Design, Cultural Heritage (CH), Human-Computer Interaction (HCI), and User Experience (UX).


References

[1] Sueyoon Lee, Abdallah El Ali, Maarten Wijntjes, and Pablo Cesar. 2022. Understanding and Designing Avatar Biosignal Visualizations for Social Virtual Reality Entertainment. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 425, 1–15. https://doi.org/10.1145/3491102.3517451

[2] Sueyoon Lee, Alina Striner, and Pablo Cesar. 2022. Designing a VR Lobby for Remote Opera Social Experiences. In ACM International Conference on Interactive Media Experiences (IMX ’22). Association for Computing Machinery, New York, NY, USA, 293–298. https://doi.org/10.1145/3505284.3532980

[3] Greg Guest, Kathleen M. MacQueen, and Emily E. Namey. 2012. Applied Thematic Analysis, Sage Publications, ISBN: 9781412971676.


VOXReality review of Once for All: Training One Network for Efficient Deployment in AI and Machine Learning

The article “VOXReality Review of Once for All: Training One Network for Efficient Deployment in AI and Machine Learning” serves as a comprehensive summary of the original research conducted by Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han from the esteemed Massachusetts Institute of Technology (MIT) and MIT-IBM Watson AI Lab. 

In recent years, advancements in technology, machine learning, and artificial intelligence have revolutionised various fields. One notable development in this area is the Once for All (OFA) approach, introduced in the paper “Once-for-All: Train One Network and Specialize It for Efficient Deployment.” This approach aims to train a single neural network and then specialise it for efficient deployment across different platforms and tasks.

In this article, we will explore the concept of OFA, its applications, and its potential impact on the field of machine learning and artificial intelligence.

What is Once for All (OFA)?

The Once for All (OFA) approach is a novel technique that involves training a single neural network and then specialising it for efficient deployment. Traditional approaches require training multiple networks for different tasks and platforms, which can be time-consuming and resource-intensive.

OFA addresses this challenge by training a large “super-network” that contains multiple sub-networks, each tailored to a specific task or platform. By using a combination of network pruning and architecture search techniques, OFA allows for the efficient adaptation of a single network to various deployment scenarios.

Figure 1: Left: a single once-for-all network is trained to support versatile architectural configurations including depth, width, kernel size, and resolution. Given a deployment scenario, a specialized subnetwork is directly selected from the once-for-all network without training. Middle: this approach reduces the cost of specialized deep learning deployment from O(N) to O(1). Right: once-for-all network followed by model selection can derive many accuracy-latency trade-offs by training only once, compared to conventional methods that require repeated training.

Technical background

The Once for All approach optimises several key aspects of the network, including kernel size, depth, and number of channels to strike a balance between performance and efficiency. To optimise the network architecture, the authors propose a differentiable architecture search space that allows for efficient exploration of various network configurations. 

This search space encompasses a wide range of kernel sizes, depths, and channel configurations, enabling the network to adapt to different deployment scenarios. By training a large “super-network” that contains multiple sub-networks, OFA leverages network pruning techniques to determine the optimal combination of kernel sizes, depths, and channels for each specialised sub-network.

In detail, a resource-aware training algorithm that takes into account the computational cost and latency constraints of different deployment platforms is introduced. By incorporating resource constraints during the training process, the resulting specialised sub-networks are not only efficient but also tailored to the specific requirements of each platform. In terms of kernel size, the OFA approach explores a range of kernel sizes for convolutional layers, including 1×1, 3×3, and 5×5.

This allows the network to adapt to different receptive field sizes and capture both local and global features effectively. OFA explores a wide range of depths, from shallow to deep architectures. This flexibility enables the network to strike a balance between model complexity and computational efficiency, depending on the deployment scenario and task requirements. 

In terms of channels, the OFA approach optimises the number of channels in each layer to achieve an optimal trade-off between model capacity and computational efficiency. By dynamically adjusting the number of channels, the network can adapt to different levels of feature representation and information flow.
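As a rough illustration of what “training once, then selecting” looks like in practice, here is a hypothetical sketch of sampling a specialised sub-network configuration from a trained super-network. The configuration dimensions mirror the ones discussed above (kernel size, depth, channels), but the specific values, names, and the toy latency check are invented for illustration; the actual selection in OFA pairs the search with trained accuracy and latency predictors.

```python
import random

# Invented configuration space mirroring the elastic dimensions above.
SEARCH_SPACE = {
    "kernel_size": [1, 3, 5],   # kernel sizes mentioned in the text above
    "depth": [2, 3, 4],         # blocks per stage (illustrative values)
    "width": [16, 32, 64],      # channels per layer (illustrative values)
}

def sample_subnetwork(num_stages=5):
    """Pick one option per dimension for each stage of the super-network."""
    return [{dim: random.choice(opts) for dim, opts in SEARCH_SPACE.items()}
            for _ in range(num_stages)]

def pick_for_deployment(latency_of, budget_ms, candidates=100):
    """Select a config that fits a platform's latency budget: no retraining."""
    fits = (c for c in (sample_subnetwork() for _ in range(candidates))
            if latency_of(c) <= budget_ms)
    return next(fits, None)

# Toy latency model: cost grows with depth * width * kernel area.
toy_latency = lambda cfg: sum(s["depth"] * s["width"] * s["kernel_size"] ** 2
                              for s in cfg) / 1000
print(pick_for_deployment(toy_latency, budget_ms=2.0))
```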

Figure 2: Comparison with SOTA hardware-aware NAS methods on the Pixel1 phone. OFA decouples model training from neural architecture search. The search cost and training cost both stay constant as the number of deployment scenarios grows. “#25” denotes that the specialised sub-networks are fine-tuned for 25 epochs after grabbing weights from the once-for-all network. “CO2e” denotes CO2 emission, which is calculated based on Strubell et al. (2019). AWS cost is calculated based on the price of on-demand P3.16xlarge instances.

Key benefits of OFA

The Once for All (OFA) approach offers several benefits that make it an attractive option for efficient deployment in machine learning and artificial intelligence:

  • Simplicity and Efficiency: OFA simplifies the training process by eliminating the need to train multiple networks for different tasks and platforms. This leads to significant time and resource savings.
  • Flexibility: OFA enables the adaptation of a single network to different deployment scenarios, such as mobile devices, data centres, or edge devices. This flexibility allows for more efficient resource utilisation and improved performance.
  • State-of-the-art Performance: Despite its efficiency, OFA achieves state-of-the-art performance across various tasks and platforms. This makes it a promising approach for real-world applications.
  • Adaptability: The OFA approach can be applied to a wide range of machine learning tasks, including image classification, object detection, and natural language processing. This adaptability makes it a versatile tool for researchers and practitioners.

Applications of OFA

The Once for All (OFA) approach has gained significant attention and adoption in the machine learning and artificial intelligence community. Some notable applications and achievements include:

  • SONY Neural Architecture Search Library: SONY has adopted the OFA Network in its Neural Architecture Search Library, highlighting its potential for efficient deployment in real-world applications.
  • ADI MAX78000/MAX78002 Model Training and Synthesis Tool: ADI has also adopted the OFA Network in its Model Training and Synthesis Tool, further demonstrating its effectiveness in specialised domains.
  • Alibaba’s MLPerf Inference Benchmark: OFA ranked first in the open division of the MLPerf Inference Benchmark, conducted by Alibaba. This achievement showcases its performance and efficiency in data center and edge computing scenarios.
  • CVPR Low-Power Computer Vision Challenge: OFA secured first place in the CVPR Low-Power Computer Vision Challenge, both in the CPU detection and FPGA track. This success highlights its potential for low-power and resource-constrained environments.

Conclusion

The Once for All (OFA) approach presents a groundbreaking solution for training one network and specialising it for efficient deployment in machine learning and artificial intelligence. By eliminating the need for training multiple networks, OFA simplifies the process, improves resource utilisation, and achieves state-of-the-art performance. Its wide range of applications and notable achievements in various domains further validate its potential impact. As the field of machine learning and artificial intelligence continues to advance, the Once for All approach holds great promise for driving innovation and efficiency in the deployment of neural networks.


Stefanos Biliousis

Hello! My name is Stefanos Biliousis and I'm a computer vision researcher with a passion for exploring the latest advances in artificial intelligence. With a background in machine learning and computer science, I spend my days developing innovative algorithms and techniques for image and video analysis. I'm fascinated by the many ways that computer vision and AI are revolutionising the world around us.


The Future of Virtual Events: Simplifying Attendance and Amplifying Attendee Experience with Voice-Driven Interaction in XR environments

VOXReality’s research has the potential to transform the future of virtual and hybrid events. As the demand for virtual events in XR environments grows, there is a need for innovative solutions that provide a more natural and immersive experience for attendees. VOXReality’s research and development of new AI models for language-driven interactive XR experiences has the potential to create more personal, natural and accessible events while enhancing attendee engagement and efficiency.

The COVID-19 pandemic spurred the rapid adoption of virtual events, as in-person gatherings became impossible or too risky. According to Grand View Research, the global virtual events market was valued at $78.6 billion in 2020, with a compound annual growth rate (CAGR) of 23.2% from 2021 to 2028.

While virtual events have provided a lifeline for businesses and organisations to continue their operations, one of the main challenges has been to recreate the same level of engagement, ease and interactivity as in-person events.

Virtual events have traditionally relied on XR (Extended Reality) technologies, such as VR (Virtual Reality), to provide immersive experiences. However, XR user interfaces are still new to many users, who are far more accustomed to conventional on-screen interfaces.

Moreover, moving from a screen to a 360° environment can be daunting for attendees. As a result, XR interfaces can feel complicated and require a longer onboarding process. This is where the potential of voice-driven interaction comes in.

Voice-driven interaction provides a more natural and human way of interacting with a system, making events hosted in XR spaces more personal and accessible to a wider audience. In this blog post, we look ahead and explore the value VOXReality’s research results can offer future virtual and hybrid events.

Paving the way for experiencing events in XR

VOXReality’s goal is to conduct research and develop new AI models that can drive the future of XR interactive experiences. These innovative models will address both human-to-human interaction in unidirectional (theatre) and bidirectional (conference) settings, and human-to-machine interaction by building the next generation of personal assistants. All of these elements are set to play an important role in shaping the future of virtual and hybrid events.

More Personal and Accessible Events

The main benefit of using language as a core interaction medium in XR spaces is that it makes events more personal and accessible. Attendees can use their voice to navigate the virtual environment, access information, and interact with other attendees, speakers, and exhibitors.

This creates a more natural and seamless experience that mimics in-person events, where people communicate through speech and body language. Voice-driven interaction also removes the need for complex XR interfaces that can be overwhelming or challenging for some attendees. By adopting it, virtual events can become more inclusive and welcoming to a broader range of participants.

Improved Attendee Experience

VOXReality’s innovative Artificial Intelligence (AI) models are set to combine language with visual understanding to deliver next-generation applications that comprehend users’ goals, surrounding environment and context. This has the potential to significantly enhance the attendee experience, as the system is better tuned to their needs and expectations.

Attendees can use their voice to perform various actions, such as asking questions, participating in polls, or even controlling the environment. For example, imagine attending a virtual trade show and being able to say, “Hey, show me the new products from Company X” to your own personal virtual assistant. The system could then display relevant information or even take you to the Company X virtual booth.
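As a toy illustration of how such a command might be routed to an action, consider the sketch below. The regex patterns, intent names and handler strings are entirely hypothetical; a production assistant would use a trained natural language understanding model rather than hand-written rules.

```python
import re

# Hypothetical intent routing for a voice-driven event assistant.
INTENTS = [
    (re.compile(r"show me (?:the )?new products from (?P<company>.+)", re.I),
     "show_products"),
    (re.compile(r"take me to (?P<booth>.+?)(?: booth)?$", re.I),
     "navigate_to_booth"),
]

def route(utterance: str):
    """Return the first matching intent and its extracted slots."""
    for pattern, intent in INTENTS:
        match = pattern.search(utterance)
        if match:
            return intent, match.groupdict()
    return "fallback", {}

print(route("Hey, show me the new products from Company X"))
# ('show_products', {'company': 'Company X'})
```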

Voice-driven interaction allows attendees to engage with the event on a deeper level, leading to more meaningful interactions and better networking opportunities.

Increased Efficiency and Engagement

Finally, voice-driven interaction in XR spaces can also increase efficiency and engagement. Attendees can use their voice to perform tasks quickly and easily, without extensive navigation or typing. This gives them more freedom to focus on the content of the conference and on interacting with other attendees, rather than on solving technical issues and figuring out how the interface works. This can lead to more productive and dynamic discussions, ultimately enhancing the value of the event for everyone involved.

Future Use Cases

Voice-driven interaction is set to take virtual events to a new level. To get there, VOXReality will be testing three particular use cases of the project at Immersive Tech Week 2023: Digital Agents, Virtual Conferencing and Theatre. Stay tuned to find out more details on what these use cases are and how they will impact our experience of virtual events.


Regina Van Tongeren

Hi, I'm Regina, Head of Marketing at VRDays Foundation. I help organise Immersive Tech Week in Rotterdam, a festival that brings together diverse voices to celebrate and explore immersive technologies' potential for a better world. I've always loved how films and games create new worlds and realities through stories, and I am fascinated by how immersive technologies are changing storytelling.

With a background in the games industry and teaching marketing, I believe immersive tech will revolutionise brand experiences, and I am curious to see the possibilities they offer for events. As a marketeer at Immersive Tech Week, I am passionate about bringing as many people as possible from all backgrounds and walks of life to Rotterdam so they can discover, experience and think about these new technologies.


Task-Oriented Dialogue Systems: Bridging the Gap Between Language and Action

Dialogue systems have become an increasingly important technology in recent years, with the potential to change the way we interact with machines and access information. These systems can be divided into two categories: task-oriented and open-domain dialogue systems. Task-oriented dialogue systems have been developed to assist users with specific tasks or goals, while open-domain dialogue systems have been developed to generate responses on a wide range of topics, allowing for more natural and engaging conversations.

In this article, we will focus on task-oriented dialogue systems and discuss the recent advancements in the field, including end-to-end trainable systems and multimodal input and output. We will also highlight the challenges that remain, such as handling ambiguity as well as maintaining user engagement, and explore the potential for future developments in context-aware and multilingual dialogue systems.

Task-Oriented Dialogue Systems

Task-oriented dialogue systems are designed to help users achieve a specific goal or complete a particular task, such as booking a flight, ordering food, or scheduling a meeting. These systems are different from open-domain dialogue systems, which are designed to converse with users on a wide range of topics.

At their core, task-oriented dialogue systems are about bridging the gap between language and action. Language involves the ability to communicate meaning through words and sentences, while action involves the ability to perform physical tasks based on that communication. By combining these two modalities, task-oriented dialogue systems enable machines to understand human language and guide users in performing tasks based on that understanding.

Task-oriented systems are increasingly being used in a variety of applications, such as customer service, e-commerce, and navigation instruction. Their expanding use stems from the fact that they offer a more natural and intuitive way for users to interact with technology and to accomplish specific tasks with ease.

Task-oriented dialogue system design. Image by Microsoft

Task-oriented dialogue systems are typically composed of several components: an Automatic Speech Recognition (ASR) system, a Natural Language Understanding (NLU) module, a Dialogue Manager (DM), and a Natural Language Generation (NLG) module. ASR and NLU are responsible for converting the user’s spoken or written input into structured data that can be processed by the DM. The DM uses this data to determine the user’s intent and select an appropriate response. The NLG module is then responsible for generating a natural-sounding response that can be spoken or displayed to the user.
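The sketch below shows how these components chain together, with trivial stubs standing in for trained models. All function bodies and rules here are our own illustration of the pipeline, not any particular system's implementation.

```python
# Minimal sketch of the ASR -> NLU -> DM -> NLG pipeline described above.

def asr(audio: bytes) -> str:
    return "book a flight to new york"        # pretend transcription

def nlu(text: str) -> dict:
    # Map the utterance to a structured frame (intent + slots).
    if "book a flight" in text:
        return {"intent": "book_flight",
                "slots": {"destination": text.rsplit("to ", 1)[-1]}}
    return {"intent": "unknown", "slots": {}}

def dialogue_manager(frame: dict) -> dict:
    # Decide the next system action from the structured frame.
    if frame["intent"] == "book_flight":
        return {"action": "confirm_booking", **frame["slots"]}
    return {"action": "ask_clarification"}

def nlg(action: dict) -> str:
    if action["action"] == "confirm_booking":
        return f"Sure - shall I book a flight to {action['destination'].title()}?"
    return "Sorry, could you rephrase that?"

print(nlg(dialogue_manager(nlu(asr(b"...")))))
# Sure - shall I book a flight to New York?
```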

Advancements in Task-Oriented Dialogue Systems

In recent years, there have been several significant advancements in task-oriented dialogue systems. One of the most important advancements has been the development of end-to-end trainable systems. These systems can be trained using large amounts of conversational data, and they can learn to generate responses that are more natural and contextually appropriate.

End-to-end systems have also been shown to be effective in handling out-of-domain queries, which are queries that are not related to the primary task of the system. These systems can leverage the conversational context to generate a response that is relevant to the user’s query, even if it is not directly related to the primary task.

Challenges and Limitations

Despite the significant advancements in task-oriented dialogue systems, there are still several challenges that need to be addressed. One of the main challenges is developing systems that can understand the nuances of language and context. For example, understanding the difference between “I want to book a flight to New York” and “Can you book a flight to New York for me?” requires a deep understanding of language and context that is difficult to replicate in machines. 

Another challenge is handling ambiguity and uncertainty in user queries. Users may use vague language, make mistakes, or provide incomplete information, and the system needs to be able to handle these situations and generate an appropriate response.

There are also ethical considerations in the field of task-oriented dialogue systems. For instance, the use of these systems in sensitive domains such as healthcare raises concerns about privacy and confidentiality. It is important for researchers and practitioners in the field to consider the ethical implications of their work and develop systems that are designed with accountability in mind.

Looking Ahead: The Future of Task-Oriented Dialogue Systems

The future of task-oriented dialogue systems is likely to be shaped by the increasing availability of multimodal data and input. As users interact with these systems using a variety of input modalities, including speech, text, and images, task-oriented dialogue systems will need to become more flexible and adaptable to accommodate these varied inputs. This could lead to the development of more sophisticated dialogue management systems that can handle a wide range of input and output modalities and enable task-oriented dialogue systems to be more effective and engaging for users.

At VOXReality, we are working on developing context-aware task-oriented dialogue systems that can understand the user’s intent and generate appropriate responses in a wide range of contexts. We are also exploring the use of multimodal input and output, combining speech, text, and images to make these systems more flexible and engaging for users.


Apostolos Maniatis

Hello! I'm Apostolos Maniatis, and I'm a dialogue system researcher. With a background in natural language processing and computer science, I spend my time developing innovative algorithms and techniques for creating intelligent systems that can converse with humans in natural language. I'm fascinated by the many ways that dialogue systems are transforming the way we interact with technology, and I'm committed to making these systems more intuitive, responsive, and adaptable to the needs of users.


Breaking Language Barriers: Advancements in Speech Recognition and Machine Translation

Machine Translation (MT) is a powerful tool that can help overcome the language barrier and facilitate cross-cultural communication, making it easier for people to access information in languages other than their own.

Given that speech is the natural medium of communication between humans, developing solutions that can translate from speech is a crucial step towards deploying MT models in different scenarios (e.g., conferences, theatres, …) where speech is the main medium of communication.

In this article, we discuss advancements in Automatic Speech Recognition (ASR) and Machine Translation and highlight the competition between cascade and end-to-end speech translation solutions and their challenges.

Automatic Speech Recognition

Automatic Speech Recognition (ASR) refers to the technology used by machines to recognise human speech and transcribe it into text. The field of ASR has evolved significantly over the years, from classical techniques that relied on Hidden Markov Models and Gaussian Mixture Models to more recent deep learning models such as Whisper.

Image by storyset on Freepik

Classical ASR techniques worked by breaking down speech into smaller segments called phonemes (e.g., the word “cat” can be broken into /k/ /æ/ /t/ in the International Phonetic Alphabet representation), and then using statistical models to predict the most likely sequence of phonemes corresponding to a given audio signal. While these techniques were effective to some degree, they were limited in their ability to handle variability in speech patterns and accents.

Deep learning models have revolutionised the field of ASR by using neural networks to learn more complex and nuanced patterns in speech data. These models are robust to a wide variety of accents and dialects and are able to perform well in noisy environments.

Another critical aspect of models like Whisper is their multilingual nature, as they are able to transcribe speech from multiple languages with high accuracy. Overall, ASR has come a long way in recent years, and these advancements are making it easier for machines to understand and interpret human speech.
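For example, transcribing a file with the open-source openai-whisper package takes only a few lines; the file name below is a placeholder for your own audio.

```python
# Minimal transcription example with the open-source `openai-whisper`
# package (pip install openai-whisper).
import whisper

model = whisper.load_model("base")        # small multilingual checkpoint
result = model.transcribe("meeting.mp3")  # language is auto-detected
print(result["language"], result["text"])
```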

Multilingual Neural Machine Translation

Multilingual Machine Translation refers to the technology used by machines to automatically translate text or speech from one language to another, across multiple languages.

The field of machine translation has evolved significantly over the years, from Statistical Machine Translation (SMT) models that relied on large corpora of parallel texts (sentences and their translation in the target language) to the more powerful neural models.

Neural Machine Translation (NMT) has become the go-to approach, especially after the introduction of the Transformer architecture, which revolutionised the field by making it possible to build powerful models that handle complex language structures with ease.

Transformer Architecture. Image by Jay Alammar

SMT systems learn statistical relationships between words in the source language and the target language based on their co-occurrence in the training corpus. A word “T” in the target language (e.g., “world” in English) that occurs many times alongside a word “S” in the source language (e.g., “Welt” in German) is likely to be its translation (here, “world” is the translation of “Welt”).

A translation from one language to another can end up with a different number of words or different word order. To deal with this, SMT systems learn an alignment function that maps the source sentence from its order in source language to the new target order.
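The toy example below mines such co-occurrence counts from a three-sentence parallel corpus. Real SMT systems (e.g., the IBM models) refine these counts with iterative expectation-maximisation rather than using them raw, but the intuition is the same.

```python
from collections import Counter
from itertools import product

# Tiny parallel corpus: (English, German) sentence pairs.
parallel = [
    ("hello world", "hallo welt"),
    ("the world", "die welt"),
    ("hello friend", "hallo freund"),
]

# Count how often each (source word, target word) pair co-occurs.
cooc = Counter()
for src, tgt in parallel:
    for s, t in product(src.split(), tgt.split()):
        cooc[(s, t)] += 1

# "welt" co-occurs most often with "world", hinting at the translation.
print(cooc[("world", "welt")], cooc[("hello", "welt")])  # 2 1
```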

SMT models can perform well on specific domains or language pairs where there is sufficient data available. However, they often struggle to generalise to new domains or to produce fluent and natural-sounding translations.

On the other hand, NMT is capable of generalising across domains and learning shared patterns between different languages. This has contributed to the rise of multilingual models that are able to transfer knowledge from languages with large amounts of data (e.g., English, Western-European languages, Japanese, Chinese) to low resource languages (e.g., Vietnamese, Swahili, Urdu).

No Language Left Behind (NLLB) is a notable example that has pushed the number of supported languages by one model to over 200 and has achieved state-of-the-art results in multiple languages especially low resource ones. Efforts like NLLB and other multilingual models have the potential to greatly improve access to information and open the channels of communication and collaboration between different cultures.
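As a quick illustration, a distilled NLLB checkpoint can be run through the Hugging Face transformers library. The model name and FLORES-200 language codes below follow Meta's published conventions; adjust them to your own language pair.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(name)

inputs = tokenizer("Machine translation breaks language barriers.",
                   return_tensors="pt")
# Force the decoder to start generating in Swahili (a low-resource language).
tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("swh_Latn"),
)
print(tokenizer.batch_decode(tokens, skip_special_tokens=True)[0])
```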

Cascade vs. End-to-end Speech Translation

Cascade solutions for Speech Translation involve the combination of ASR and NMT components to translate speech input. However, since the ASR and NMT models are trained separately, this can lead to a reduction in the quality of the translation due to inconsistencies in the training data and procedures of the two models. Furthermore, cascade solutions are also susceptible to error propagation, where errors produced by the ASR model can negatively impact the quality of the translation.
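A minimal cascade can be sketched by chaining the two models shown earlier: Whisper produces the transcript, and NLLB translates it. Note how an ASR error in step 1 would flow unchecked into step 2; file name and language codes are placeholders.

```python
import whisper
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Step 1: speech -> text (ASR).
asr = whisper.load_model("base")
text = asr.transcribe("talk.mp3")["text"]

# Step 2: text -> text (NMT); any ASR mistakes propagate into this step.
name = "facebook/nllb-200-distilled-600M"
tok = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn")
mt = AutoModelForSeq2SeqLM.from_pretrained(name)
ids = mt.generate(**tok(text, return_tensors="pt"),
                  forced_bos_token_id=tok.convert_tokens_to_ids("deu_Latn"))
print(tok.batch_decode(ids, skip_special_tokens=True)[0])
```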

End-to-end solutions promise to circumvent these issues by translating directly from speech to text. While these models can achieve competitive results compared to cascade solutions, they still face challenges due to the limited availability of datasets pairing speech with translated text, leaving insufficient data for training.

Despite these challenges, ongoing advancements in end-to-end solutions show promising results in closing the gap with cascade solutions. With further developments in data collection and model optimisation, end-to-end solutions may eventually surpass cascade solutions in terms of translation quality and accuracy.

In conclusion, the recent advancements in Automatic Speech Recognition and Machine Translation have significantly improved the ability of machines to understand and interpret human speech, paving the way for more effective communication across different languages and cultures.

However, there are still open issues like generalising to different domains and challenging contexts that are crucial for ensuring a satisfactory performance when Machine Translation systems are used in real-world scenarios.

In VOXReality, our mission is to develop multilingual, context-aware Automatic Speech Recognition and Neural Machine Translation models that are capable of learning new languages and accents and that take the surrounding textual and visual context into account to produce higher-quality transcriptions and translations.


Abderrahmane Issam

Hello! My name is Abderrahmane Issam and I'm a PhD student at Maastricht University, where I'm working on Neural Machine Translation for non-native speakers. I'm passionate about research in Natural Language Processing, and my job is to make Machine Translation systems robust in real-world scenarios, especially to non-native speakers' input.
