While today’s AI systems excel at many tasks, they still struggle with tasks that require a deeper understanding of physical context. In language, for example, these models fail to deliver live translation in challenging environments such as crowded, noisy spaces with multiple speakers. Traditional systems struggle to identify who is speaking, while humans easily focus on the right person by using visual and acoustic cues such as gaze direction, facial orientation, voice loudness, and distance. Audio-only translation systems perform poorly in these settings because they lack this contextual awareness. DVPS, by contrast, combines visual input, spatial sound, and speech direction to identify the correct speaker and deliver more accurate translations.
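To make the speaker-identification idea concrete, here is a minimal, hypothetical sketch of fusing visual and acoustic cues to pick the active speaker before translating. The `Speaker` structure, cue names, and weights are illustrative assumptions for this post, not DVPS internals.

```python
# Illustrative sketch only: cue names, weights, and data structures are
# assumptions, not the actual DVPS model.
from dataclasses import dataclass


@dataclass
class Speaker:
    name: str
    gaze_alignment: float   # 0..1, how directly the speaker looks at the listener
    face_frontality: float  # 0..1, facial orientation toward the camera
    loudness: float         # 0..1, normalized voice energy
    distance_m: float       # metres from the microphone array


def speaker_score(s: Speaker) -> float:
    """Weighted fusion of visual and acoustic cues (weights are assumptions)."""
    proximity = 1.0 / (1.0 + s.distance_m)  # closer speakers score higher
    return (0.35 * s.gaze_alignment
            + 0.25 * s.face_frontality
            + 0.25 * s.loudness
            + 0.15 * proximity)


def pick_active_speaker(candidates: list[Speaker]) -> Speaker:
    """Select the speaker whose fused cues best indicate they are addressing us."""
    return max(candidates, key=speaker_score)


if __name__ == "__main__":
    crowd = [
        Speaker("A", gaze_alignment=0.9, face_frontality=0.8, loudness=0.6, distance_m=1.2),
        Speaker("B", gaze_alignment=0.2, face_frontality=0.4, loudness=0.9, distance_m=0.8),
    ]
    active = pick_active_speaker(crowd)
    print(f"Translate speaker {active.name} (score {speaker_score(active):.2f})")
```

In practice such scores would come from learned multimodal models rather than hand-set weights, but the sketch shows why combining cues beats audio alone: the quieter speaker A wins because gaze and facial orientation outweigh raw loudness.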
Potential applications span multiple areas: in language, enabling real-time translation across a wide range of languages, with an understanding of text, speech, gestures, and physical context; in health, supporting the early detection of cardiovascular risk using a 3D digital twin of the heart generated from medical imaging; and in the environment, enhancing disaster response through flood prediction powered by satellite and drone data combined with real-time observational signals.
The initiative is led by Translated, which oversees the overall vision and execution.
The DVPS founding team consists of 70 top European AI scientists from the following partners:
- Research: University of Oxford, The Alan Turing Institute, École Polytechnique Fédérale de Lausanne, ETH Zurich, Imperial College London, Fondazione Bruno Kessler, Karlsruhe Institute of Technology, Universitat de Barcelona, and Vlaamse Instelling voor Technologisch Onderzoek
- Vertically specialized partners: Heidelberg University Hospital, Vall d’Hebron Institut de Recerca, and Amsterdam University Medical Centers; Deepset, Sistema, and MEEO; Lynkeus, Data Valley, and Pi School of AI
- High-performance computing (model training): Cyfronet, Poland’s national HPC centre