U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation

Xiang Deng¹,²†, Feng Gao², Yong Zhang²*, Youxin Pang¹,², Xu Xiaoming²,
Zhuoliang Kang², Xiaoming Wei², Yebin Liu¹*

¹Tsinghua University ²Meituan
† Work done during an internship at Meituan. * Corresponding authors

Abstract

Full-stack multimodal interaction in real time is a central goal in building intelligent embodied agents capable of natural, dynamic communication. In this work, we introduce U-Mind, the first unified system for high-intelligence multimodal dialogue that supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop.

At its core, U-Mind implements a Unified Alignment and Reasoning Framework that addresses cross-modal synchronization via a segment-wise alignment strategy and preserves reasoning abilities through Rehearsal-Driven Learning. Extensive experiments demonstrate that U-Mind achieves state-of-the-art performance on a range of multimodal interaction tasks, including question answering, instruction following, and motion generation.

BibTeX

@inproceedings{deng2026umind,
  title={U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation},
  author={Deng, Xiang and Gao, Feng and Zhang, Yong and Pang, Youxin and Xiaoming, Xu and Kang, Zhuoliang and Wei, Xiaoming and Liu, Yebin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}