One of the key challenges in enabling real-time multilingual communication in VR is efficiency. When many users are speaking, listening, and requesting translations at the same time, the system must manage multiple audio and text streams without overloading servers or wasting GPU resources.
In VOXReality, we solved this challenge by introducing an intermediate socket layer (middleware) between the speaker and the translation model. This middleware acts as a smart traffic controller for translations, ensuring that system resources are used only when necessary.
Here’s how it works:
- When a user activates their microphone, a dedicated speaker socket opens (/speaker/{user-id}), making their audio stream available.
- Listeners who want a translation connect through a listener socket (/listener/{user-id}/{language}), requesting subtitles in their preferred language.
- Only when both a speaker and at least one listener are connected does the middleware activate a third connection to the transcription–translation model.
This design ensures that each translation is processed only once per speaker and per target language, no matter how many people are listening. Hundreds of users can therefore receive the same subtitles in real time without duplicating the workload.
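To make this concrete, here is a minimal sketch of what such a middleware registry could look like. It is an illustration rather than the VOXReality implementation: the TranslationHub class, the translate_stream stub standing in for the transcription–translation model connection, and the socket interface (an async send_text method, as in common WebSocket libraries) are all assumptions made for the example.

```python
import asyncio
from collections import defaultdict


async def translate_stream(speaker_socket, language):
    # Placeholder for the transcription-translation model connection.
    # In the real system this would forward the speaker's audio and yield
    # translated subtitle segments; here it only emits dummy text.
    for _ in range(3):
        await asyncio.sleep(1)
        yield f"[{language}] subtitle segment"


class TranslationHub:
    """Tracks connected speakers and listeners, and lazily opens exactly
    one model stream per (speaker, language) pair."""

    def __init__(self):
        self.speakers = {}                  # user_id -> speaker socket
        self.listeners = defaultdict(set)   # (user_id, language) -> sockets
        self.model_streams = {}             # (user_id, language) -> asyncio.Task

    async def add_speaker(self, user_id, socket):
        # Opened via /speaker/{user-id}: the audio becomes available, but
        # the model is engaged only once a listener wants a language.
        self.speakers[user_id] = socket
        for sid, language in list(self.listeners):
            if sid == user_id:
                self._ensure_stream(user_id, language)

    async def add_listener(self, user_id, language, socket):
        # Opened via /listener/{user-id}/{language}.
        self.listeners[(user_id, language)].add(socket)
        if user_id in self.speakers:
            self._ensure_stream(user_id, language)

    def _ensure_stream(self, user_id, language):
        # Third connection, to the transcription-translation model:
        # created once per (speaker, language), however many listeners join.
        key = (user_id, language)
        if key not in self.model_streams:
            self.model_streams[key] = asyncio.create_task(
                self._fan_out(user_id, language))

    async def _fan_out(self, user_id, language):
        # One model stream feeds every listener of that language.
        async for subtitle in translate_stream(self.speakers[user_id], language):
            for listener in list(self.listeners.get((user_id, language), ())):
                await listener.send_text(subtitle)
```

The key point is the model_streams dictionary: because it is keyed by (speaker, language), a hundredth listener reuses an existing stream instead of opening a new one.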
The middleware also dynamically closes connections when they are no longer needed. If a speaker turns off their microphone, their socket closes automatically. If all listeners for a language drop out, that translation stream is shut down too. This adaptive behavior keeps the system light and efficient, saving GPU cycles for when they’re really needed.
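Continuing the same hypothetical TranslationHub, the teardown paths mirror the setup paths: removing a speaker cancels every stream derived from their audio, while removing the last listener for a language cancels just that one stream.

```python
    # Remaining TranslationHub methods (continuing the sketch above).

    async def remove_speaker(self, user_id):
        # Mic switched off: drop the speaker socket and cancel every
        # model stream that was derived from their audio.
        self.speakers.pop(user_id, None)
        for key in [k for k in self.model_streams if k[0] == user_id]:
            self.model_streams.pop(key).cancel()

    async def remove_listener(self, user_id, language, socket):
        key = (user_id, language)
        self.listeners[key].discard(socket)
        if not self.listeners[key]:
            # Last listener for this language dropped out: shut the
            # translation stream down and free the GPU for active pairs.
            del self.listeners[key]
            task = self.model_streams.pop(key, None)
            if task:
                task.cancel()
```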
Beyond efficiency, the middleware architecture allows the system to scale naturally. Multiple speakers can be active at the same time, each with their own user ID, translation streams, and listeners, without interfering with one another.
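As a toy demonstration of that isolation, the sketch above can host two concurrent speakers side by side (FakeSocket is a stand-in for real connections):

```python
import asyncio

class FakeSocket:
    # Illustrative stand-in for a real WebSocket connection.
    async def send_text(self, text):
        print(text)

async def main():
    hub = TranslationHub()
    # Each speaker occupies its own registry entries, so the streams and
    # listeners of different speakers never collide.
    await hub.add_speaker("alice", FakeSocket())
    await hub.add_speaker("bob", FakeSocket())
    await hub.add_listener("alice", "de", FakeSocket())  # opens ("alice", "de")
    await hub.add_listener("bob", "el", FakeSocket())    # opens ("bob", "el")
    await asyncio.gather(*hub.model_streams.values())    # let the demo streams run

asyncio.run(main())
```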
For presentation settings such as the Conference Room, the same architecture has been adapted for one-to-many translation. In this mode, only one person—the presenter—has the microphone at any given time. When a user steps onto the virtual stage, the system automatically designates them as the active presenter and opens the standardized endpoint /speaker/presentation. Audience members simply click the translate button on their panel to join /listener/presentation/{language} in their preferred subtitle language. The middleware then activates the transcription–translation model only when both the presenter and at least one listener are connected. Subtitles appear in real time above the stage for the audience, while the presenter sees them in front of their camera view, like movie captions.
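In the terms of the earlier sketch, presentation mode needs no new machinery: a single reserved speaker ID can stand behind the /speaker/presentation endpoint, reusing the same registry. The helper names below are hypothetical.

```python
PRESENTER_ID = "presentation"  # reserved id behind /speaker/presentation

async def on_stage_enter(hub, presenter_socket):
    # Stepping onto the virtual stage designates the active presenter;
    # any previous presenter's streams are torn down first.
    await hub.remove_speaker(PRESENTER_ID)
    await hub.add_speaker(PRESENTER_ID, presenter_socket)

async def on_translate_click(hub, language, audience_socket):
    # The translate button joins /listener/presentation/{language}; the
    # model activates only once both presenter and listener are present.
    await hub.add_listener(PRESENTER_ID, language, audience_socket)
```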

This dual architecture, many-to-many for social and business spaces and one-to-many for conference sessions, ensures seamless multilingual communication across all scenarios. The figure below shows the overall architecture, illustrating how the middleware intelligently coordinates speaker, listener, and translation model connections to deliver efficient and scalable translations in VR.
In conclusion, VOXReality’s middleware approach creates a translation system that is efficient, adaptive, and inclusive—capable of supporting everything from casual conversations to formal presentations, ensuring that every participant can engage fully in their own language.

Georgios Nikolakis
Software Engineer @Synelixis