Voice-activated personal assistants are deployed as chatbots to engage customers, provide customer support, overcome language barriers, and streamline repetitive operations. Some virtual assistants use additional modalities such as text/chat or images. Nevertheless, contemporary personal assistant technology remains limited to simple tasks, and cannot handle anything more complex than setting an alarm, playing a song, or switching on a light.
VOXReality will develop novel AI models that jointly model audio-visual spatio-temporal context, grounding language commands and semantics to the observed scene both spatially and semantically. This will open up novel assistant applications such as instruction assistants, HMD-mounted technical-support assistants, and navigation guides.
Novel techniques will be developed to train self-supervised vision and language systems capable of higher-level reasoning when presented with both modality inputs. Context and actions will thereby be modelled with joint semantics across the models (better visually grounded language semantics), together with a spatial encoding of the scene and the emplacement of actions within it. Voice responses and queries will therefore be both spatially and semantically grounded, opening up new possibilities in personal assistant application development.
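As a toy illustration of the kind of joint embedding space that visually grounded language understanding relies on, the sketch below scores a language command against candidate scene regions by cosine similarity and returns the best match. All embeddings, region names, and function names here are invented for illustration; in a real system both sides would come from jointly trained vision and language encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def ground_command(command_embedding, scene_regions):
    """Return the scene region whose embedding best matches the command.

    scene_regions: dict mapping region name -> embedding vector.
    In a trained system the embeddings would be produced by learned
    encoders; here they are hand-made toy vectors.
    """
    return max(scene_regions,
               key=lambda name: cosine(command_embedding, scene_regions[name]))

# Hand-crafted toy embeddings (purely illustrative).
scene = {
    "lamp_on_desk": [0.9, 0.1, 0.0],
    "door":         [0.1, 0.8, 0.2],
    "window":       [0.0, 0.2, 0.9],
}
command = [0.85, 0.15, 0.05]  # stands in for an encoding of "switch on the lamp"
print(ground_command(command, scene))  # -> lamp_on_desk
```

The design point is that spatial grounding reduces to a nearest-neighbour query once language and vision share one embedding space.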
Virtual conferences are events hosted and run entirely online, typically on a virtual conferencing platform that sets up a shared virtual environment, allowing attendees to view or participate from anywhere in the world. Although the sector existed before the pandemic, the situation is now vastly different: most established conferences have already run one or two virtual events.
Effective and efficient communication in virtual settings is required to facilitate networking and interaction among participants. Face-to-face communication provides subtle cues that establish a connection between the parties involved; these cues are severely lacking in virtual environments due to numerous technical difficulties. VOXReality will ensure that spoken communication is not only preserved but accentuated compared to the physical setting. We will research and develop new language translation and captioning models to facilitate more efficient and effective real-time verbal communication. Furthermore, VOXReality will automate the navigation and deployment aspects of virtual assistants by conditioning them on the (landmark-annotated) virtual environment at hand, and will increase their conversational realism.
Using VOXReality’s next-generation transcription and translation services, virtual conferences will be able to deliver inclusive and effective events, while automatic virtual guidance will help address user churn (e.g. attendees dropping from the online platform after getting lost). We will also develop speech-driven animation technology to carry posture and gesture cues into avatar-based environments.
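The real-time captioning and translation flow described above can be sketched as a small pipeline: timed speech-recognition segments are translated and emitted as captions. Everything here is a stand-in assumption, not a VOXReality API: the toy phrase table replaces a neural translation model, and the function names are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Caption:
    start_s: float  # segment start time, seconds
    end_s: float    # segment end time, seconds
    text: str       # translated caption text

# Toy phrase table standing in for a neural translation model.
PHRASE_TABLE = {
    "hallo zusammen": "hello everyone",
    "willkommen zur konferenz": "welcome to the conference",
}

def translate(segment: str) -> str:
    """Stub translator: look up the segment, else pass it through unchanged."""
    return PHRASE_TABLE.get(segment.lower(), segment)

def caption_stream(asr_segments):
    """Turn timed ASR segments into translated captions.

    asr_segments: iterable of (start_s, end_s, text) tuples, as a
    streaming speech recognizer might emit them.
    """
    for start, end, text in asr_segments:
        yield Caption(start, end, translate(text))

segments = [(0.0, 1.2, "Hallo zusammen"), (1.4, 3.0, "Willkommen zur Konferenz")]
for cap in caption_stream(segments):
    print(f"[{cap.start_s:.1f}-{cap.end_s:.1f}] {cap.text}")
```

Keeping segment timestamps attached to each caption is what allows per-viewer, per-language caption rendering downstream.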
For hundreds of years, theatres have used many different effects, from lighting to sound, to create a more immersive experience for the audience. The dawn of new technologies adds an extra dimension to performances, and the theatre industry is already dipping into pioneering technologies such as virtual reality (VR), augmented reality (AR), and mixed reality (MR) to guide art in new directions. The industry faces numerous challenges and has been trying to address them with subtitle screens, which provide a solution but can be distracting for the audience, since reading them takes focus away from the performers. Furthermore, many plays require the viewer to have some prior knowledge of historical, political, or even social events to be properly understood, which can also detract from the experience. Finally, following and experiencing theatre presents a particular challenge for people who are deaf or hard of hearing.
Going to the theatre is a great cultural experience, and XR services that display live captions to the viewer via an XR headset, along with contextual information when needed, can deepen the viewer's immersion. In particular, XR captions can help all viewers when a performance features characters with strong accents, dialogue in different languages, or singing. The captions will be visible only to those viewers who choose to see them, and in the language each viewer selects. These services will merge the computer-generated captions with the theatre stage and the actors' live performance, creating an immersive experience for the viewer.
This theatre use case will combine language translation, audio-visual user associations, and AR VFX triggered by predetermined speech, all driven by VOXReality’s pretrained language and vision models.
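The speech-triggered VFX mechanism can be sketched as a cue table mapping predetermined spoken lines (as recognised by the speech pipeline) to named effects. The class, cue names, and exact-match strategy below are illustrative assumptions, not the project's actual design; a production system would match against the live transcript more robustly and drive a real AR engine.

```python
def normalise(line: str) -> str:
    """Lowercase and collapse whitespace so cue matching ignores casing/spacing."""
    return " ".join(line.lower().split())

class VFXCueController:
    """Fire AR effects when predetermined spoken cues are recognised."""

    def __init__(self):
        self._cues = {}   # normalised spoken cue -> effect name
        self.fired = []   # record of effects triggered so far

    def register(self, spoken_cue: str, effect_name: str):
        self._cues[normalise(spoken_cue)] = effect_name

    def on_transcript(self, line: str):
        """Called with each recognised line; fires the matching effect, if any."""
        effect = self._cues.get(normalise(line))
        if effect is not None:
            self.fired.append(effect)  # a real system would signal the AR engine here
        return effect

controller = VFXCueController()
controller.register("Let there be light", "stage_flash")
print(controller.on_transcript("let there be  LIGHT"))  # matches despite casing/spacing
```

Tying effects to predetermined lines keeps the trigger logic deterministic, so the speech recogniser only needs to detect a small closed set of cues rather than transcribe the whole performance.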