Machine Translation (MT) is a powerful tool that can help overcome the language barrier and facilitate cross-cultural communication, making it easier for people to access information in languages other than their own.
Given that speech is the natural medium of communication between humans, developing solutions that can translate from speech is a crucial step towards deploying MT models in different scenarios (e.g., conferences, theaters, …) where speech is the main medium of communication.
In this article, we discuss advancements in Automatic Speech Recognition (ASR) and Machine Translation and highlight the competition between cascade and end-to-end speech translation solutions and their challenges.
Automatic Speech Recognition
Automatic Speech Recognition (ASR) refers to the technology used by machines to recognise human speech and transcribe it into text. The field of ASR has evolved significantly over the years, from classical techniques that relied on Hidden Markov Models and Gaussian Mixture Models to more recent deep learning models such as Whisper.
Classical ASR techniques worked by breaking down speech into smaller segments called phonemes (e.g., the word “cat” can be broken into \k\ \æ\ \t\ with the International Phonetic Alphabet representation), and then using statistical models to predict the most likely sequence of phonemes that correspond to a given audio signal. While these techniques were effective to some degree, they had limitations in their ability to handle variability in speech patterns and accents.
Deep learning models have revolutionised the field of ASR by using neural networks to learn more complex and nuanced patterns in speech data. These models are robust to a wide variety of accents and dialects and are able to perform well in noisy environments.
Another critical aspect of models like Whisper is their multilingual nature, as they are able to transcribe speech from multiple languages with high accuracy. Overall, ASR has come a long way in recent years, and these advancements are making it easier for machines to understand and interpret human speech.
Multilingual Neural Machine Translation
Multilingual Machine Translation refers to the technology used by machines to automatically translate text or speech from one language to another, across multiple languages.
The field of machine translation has evolved significantly over the years, from Statistical Machine Translation (SMT) models that relied on large corpora of parallel texts (sentences and their translation in the target language) to the more powerful neural models.
Neural Machine Translation (NMT) have become the go-to approach especially after the introduction of the Transformer Architecture, which has revolutionized the field by making it possible to build powerful models that can handle complex language structures with ease.
SMT systems learn statistical relationships between words in the source language and the target based on their co-occurrence in the training corpus. A word “T” (e.g., world in English) in the target language that occurs many times with a word “S” (e.g., welt in German) in the source language is more likely to be a translation of the word “S” (e.g., “world” is the translation of “welt”).
A translation from one language to another can end up with a different number of words or different word order. To deal with this, SMT systems learn an alignment function that maps the source sentence from its order in source language to the new target order.
SMT models can perform well on specific domains or language pairs where there is sufficient data available. However, they often struggle to generalise to new domains or to produce fluent and natural-sounding translations.
On the other hand, NMT is capable of generalising across domains and learning shared patterns between different languages. This has contributed to the rise of multilingual models that are able to transfer knowledge from languages with large amounts of data (e.g., English, Western-European languages, Japanese, Chinese) to low resource languages (e.g., Vietnamese, Swahili, Urdu).
No Language Left Behind (NLLB) is a notable example that has pushed the number of supported languages by one model to over 200 and has achieved state-of-the-art results in multiple languages especially low resource ones. Efforts like NLLB and other multilingual models have the potential to greatly improve access to information and open the channels of communication and collaboration between different cultures.
Cascade vs. End-to-end Speech Translation
Cascade solutions for Speech Translation involve the combination of ASR and NMT components to translate speech input. However, since the ASR and NMT models are trained separately, this can lead to a reduction in the quality of the translation due to inconsistencies in the training data and procedures of the two models. Furthermore, cascade solutions are also susceptible to error propagation, where errors produced by the ASR model can negatively impact the quality of the translation.
End-to-end solutions are promising in circumventing these issues by translating directly from speech to text. While these models are capable of achieving competitive results compared to cascade solutions, they still face challenges due to the limited availability of speech-to-translated text datasets, resulting in insufficient data for training.
Despite these challenges, ongoing advancements in end-to-end solutions show promising results in closing the gap with cascade solutions. With further developments in data collection and model optimisation, end-to-end solutions may eventually surpass cascade solutions in terms of translation quality and accuracy.
In conclusion, the recent advancements in Automatic Speech Recognition and Machine Translation have significantly improved the ability of machines to understand and interpret human speech, paving the way for more effective communication across different languages and cultures.
However, there are still open issues like generalising to different domains and challenging contexts that are crucial for ensuring a satisfactory performance when Machine Translation systems are used in real-world scenarios.
In VOXReality, our mission is to develop multilingual context-aware Automatic Speech Recognition and Neural Machine Translation models, which are capable of learning new languages and accents and consider the surrounding textual and visual context to obtain higher quality of transcriptions and translations.