U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation

Xiang Deng1,2†, Feng Gao2, Yong Zhang2*, Youxin Pang1,2, Xu Xiaoming2,
Zhuoliang Kang2, Xiaoming Wei2, Yebin Liu1*
1Tsinghua University    2Meituan
† Work done during an internship at Meituan. * Corresponding author

Abstract

Full-stack multimodal interaction in real time is a central goal in building intelligent embodied agents capable of natural, dynamic communication. In this work, we introduce U-Mind, the first unified system for high-intelligence multimodal dialogue that supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop.

At its core, U-Mind implements a Unified Alignment and Reasoning Framework that addresses cross-modal synchronization via a segment-wise alignment strategy and preserves reasoning abilities through Rehearsal-Driven Learning. Extensive experiments demonstrate that U-Mind achieves state-of-the-art performance on a range of multimodal interaction tasks, including question answering, instruction following, and motion generation.

BibTeX

@inproceedings{deng2026umind,
  title={U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation},
  author={Deng, Xiang and Gao, Feng and Zhang, Yong and Pang, Youxin and Xiaoming, Xu and Kang, Zhuoliang and Wei, Xiaoming and Liu, Yebin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}