Full-stack multimodal interaction in real time is a central goal in building intelligent embodied agents capable of natural, dynamic communication. In this work, we introduce U-Mind, the first unified system for high-intelligence multimodal dialogue that supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop.
At its core, U-Mind implements a Unified Alignment and Reasoning Framework that addresses cross-modal synchronization via a segment-wise alignment strategy and preserves reasoning abilities through Rehearsal-Driven Learning. Extensive experiments demonstrate that U-Mind achieves state-of-the-art performance on a range of multimodal interaction tasks, including question answering, instruction following, and motion generation.
@inproceedings{deng2026umind,
title={U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation},
author={Deng, Xiang and Gao, Feng and Zhang, Yong and Pang, Youxin and Xiaoming, Xu and Kang, Zhuoliang and Wei, Xiaoming and Liu, Yebin},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}