[1] PEARL C. Designing voice user interfaces: Principles of conversational experiences[M]. O’Reilly Media, Inc., 2016: 1-278.
[2] YASEN M, JUSOH S. A systematic review on hand gesture recognition techniques, challenges and applications[J]. PeerJ Computer Science, 2019, 5: e218.
[3] HOLMQVIST K, NYSTRÖM M, ANDERSSON R, et al. Eye tracking: A comprehensive guide to methods and measures[M]. OUP Oxford, 2011: 1-560.
[4] ACHIAM J, ADLER S, AGARWAL S, et al. GPT-4 technical report[J]. arXiv preprint arXiv:2303.08774, 2023.
[5] LIU H, LI C, WU Q, et al. Visual instruction tuning[J]. Advances in Neural Information Processing Systems, 2023, 36: 34892-34916.
[6] LAKOMKIN E, ZAMANI M A, WEBER C, et al. Incorporating end-to-end speech recognition models for sentiment analysis[C]// 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019: 7976-7982.
[7] SCHAFFER S, REITHINGER N. Benefit, design and evaluation of multimodal interaction[C]// Proceedings of the 2016 DSLI Workshop, ACM CHI, 2016: 1-6.
[8] ACHERKI C, NIGAY L, ROY Q, et al. An Evaluation of Spatial Anchoring to Position AR Guidance in Arthroscopic Surgery[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-17.
[9] RAHMAN Y, ASISH S M, FISHER N P, et al. Exploring eye gaze visualization techniques for identifying distracted students in educational VR[C]// 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), IEEE, 2020: 868-877.
[10] CHO H, FASHIMPAUR J, SENDHILNATHAN N, et al. Persistent Assistant: Seamless Everyday AI Interactions via Intent Grounding and Multimodal Feedback[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-19.
[11] TAO J, WU Y, YU C, et al. A survey on multimodal human-computer interaction[J]. Journal of Image and Graphics, 2022, 27(6): 1956-1987.
[12] GOU C, ZHUO Y, WANG K, et al. Eye tracking research: Progress and prospects[J]. Acta Automatica Sinica, 2022, 48(5): 1173-1192.
[13] TANRIVERDI V, JACOB R J K. Interacting with eye movements in virtual environments[C]// Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2000: 265-272.
[14] KHAMIS M, ALT F, BULLING A. The past, present, and future of gaze-enabled handheld mobile devices: Survey and lessons learned[C]// Proceedings of the 20th International Conference on Human-Computer Interaction with Mobile Devices and Services, 2018: 1-17.
[15] RAYNER K. Eye movements in reading and information processing: 20 years of research[J]. Psychological Bulletin, 1998, 124(3): 372.
[16] SHI D, WANG Y, BAI Y, et al. Chartist: Task-driven Eye Movement Control for Chart Reading[J]. arXiv preprint arXiv:2502.03575, 2025.
[17] LUGARESI C, TANG J, NASH H, et al. MediaPipe: A framework for building perception pipelines[J]. arXiv preprint arXiv:1906.08172, 2019.
[18] KALANDAR B, DWORAKOWSKI Z. Sign Language Conversation Interpretation Using Wearable Sensors and Machine Learning[J]. arXiv preprint arXiv:2312.11903, 2023.
[19] ZHANG Y, ENS B, SATRIADI K A, et al. TimeTables: Embodied Exploration of Immersive Spatio-Temporal Data[C]// 2022 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), IEEE, 2022: 599-605.
[20] WAGNER U, LYSTBÆK M N, MANAKHOV P, et al. A Fitts’ law study of gaze-hand alignment for selection in 3D user interfaces[C]// Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 2023: 1-15.
[21] MA H, TANG R, ZHANG Y, et al. A survey of speech recognition research[J]. Computer Systems & Applications, 2022, 31(1): 1-10.
[22] ZHANG D, ZHANG X, ZHAN J, et al. SpeechGPT-Gen: Scaling chain-of-information speech generation[J]. arXiv preprint arXiv:2401.13527, 2024.
[23] WANG P. Research and design of smart home speech recognition system based on deep learning[C]// 2020 International Conference on Computer Vision, Image and Deep Learning (CVIDL), IEEE, 2020: 218-221.
[24] JARADAT G A, ALZUBAIDI M A, OTOOM M. A novel human-vehicle interaction assistive device for Arab drivers using speech recognition[J]. IEEE Access, 2022, 10: 127514-127529.
[25] FURTADO J S, LIU H H T, LAI G, et al. Comparative analysis of OptiTrack motion capture systems[C]// Advances in Motion Sensing and Control for Robotic Applications: Selected Papers from the Symposium on Mechatronics, Robotics, and Control (SMRC’18), CSME International Congress 2018, May 27-30, 2018, Toronto, Canada, Springer International Publishing, 2019: 15-31.
[26] GUO J, LUO J, WEI Z, et al. TelePhantom: A User-Friendly Teleoperation System with Virtual Assistance for Enhanced Effectiveness[J]. arXiv preprint arXiv:2412.13548, 2024.
[27] DONG J, FANG Q, JIANG W, et al. Fast and robust multi-person 3D pose estimation and tracking from multiple views[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 44(10): 6981-6992.
[28] MEHRABAN S, ADELI V, TAATI B. MotionAGFormer: Enhancing 3D human pose estimation with a transformer-GCNFormer network[C]// Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024: 6920-6930.
[29] YE J, YU Y, WANG Q, et al. CmdVIT: A Voluntary Facial Expression Recognition Model for Complex Mental Disorders[J]. IEEE Transactions on Image Processing, 2025, 34: 3013-3024.
[30] ZHU W, MA X, LIU Z, et al. MotionBERT: A unified perspective on learning human motion representations[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023: 15085-15099.
[31] HU Y, ZHANG S, DANG T, et al. Exploring large-scale language models to evaluate EEG-based multimodal data for mental health[C]// Companion of the 2024 ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2024: 412-417.
[32] KUANG B, LI X, LI X, et al. The effect of eye gaze direction on emotional mimicry: A multimodal study with electromyography and electroencephalography[J]. NeuroImage, 2021, 226: 117604.
[33] WANG Y, HUANG W, SUN F, et al. Deep multimodal fusion by channel exchanging[J]. Advances in Neural Information Processing Systems, 2020, 33: 4835-4845.
[34] DEBIE E, ROJAS R F, FIDOCK J, et al. Multimodal fusion for objective assessment of cognitive workload: A review[J]. IEEE Transactions on Cybernetics, 2019, 51(3): 1542-1555.
[35] VERE S, BICKMORE T. A basic agent[J]. Computational Intelligence, 1990, 6(1): 41-60.
[36] GRANTER S R, BECK A H, PAPKE JR D J. AlphaGo, deep learning, and the future of the human microscopist[J]. Archives of Pathology & Laboratory Medicine, 2017, 141(5): 619-621.
[37] LIU A, FENG B, XUE B, et al. DeepSeek-V3 technical report[J]. arXiv preprint arXiv:2412.19437, 2024.
[38] GONG R, HUANG Q, MA X, et al. MindAgent: Emergent gaming interaction[J]. arXiv preprint arXiv:2309.09971, 2023.
[39] YU J, WANG X, TU S, et al. KoLA: Carefully benchmarking world knowledge of large language models[J]. arXiv preprint arXiv:2306.09296, 2023.
[40] DURANTE Z, HUANG Q, WAKE N, et al. Agent AI: Surveying the horizons of multimodal interaction[J]. arXiv preprint arXiv:2401.03568, 2024.
[41] WU Q, BANSAL G, ZHANG J, et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversation[J]. arXiv preprint arXiv:2308.08155, 2023.
[42] QIAN C, LIU W, LIU H, et al. ChatDev: Communicative agents for software development[J]. arXiv preprint arXiv:2307.07924, 2023.
[43] HONG S, ZHENG X, CHEN J, et al. MetaGPT: Meta programming for a multi-agent collaborative framework[J]. arXiv preprint arXiv:2308.00352, 2023.
[44] HONG Y, ZHEN H, CHEN P, et al. 3D-LLM: Injecting the 3D world into large language models[J]. Advances in Neural Information Processing Systems, 2023, 36: 20482-20494.
[45] JATAVALLABHULA K M, SARYAZDI S, IYER G, et al. gradSLAM: Automagically differentiable SLAM[J]. arXiv preprint arXiv:1910.10672, 2019.
[46] MILDENHALL B, SRINIVASAN P P, TANCIK M, et al. NeRF: Representing scenes as neural radiance fields for view synthesis[J]. Communications of the ACM, 2021, 65(1): 99-106.
[47] CAO Y, JIANG P, XIA H. Generative and Malleable User Interfaces with Generative and Evolving Task-Driven Data Model[J]. arXiv preprint arXiv:2503.04084, 2025.
[48] ZHAO Z, CHAI W, WANG X, et al. See and think: Embodied agent in virtual environment[C]// European Conference on Computer Vision, Springer, Cham, 2025: 187-204.
[49] KIRCHNER E A, FAIRCLOUGH S H, KIRCHNER F. Embedded multimodal interfaces in robotics: applications, future trends, and societal implications[M]// The Handbook of Multimodal-Multisensor Interfaces: Language Processing, Software, Commercialization, and Emerging Directions - Volume 3, 2019: 523-576.
[50] YOU M, YIN Y, XIE L, et al. User profiling based on behavior awareness[J]. Journal of Zhejiang University (Engineering Science), 2021, 55(4): 608-614.
[51] QIAN C, HE B, ZHUANG Z, et al. Tell me more! Towards implicit user intention understanding of language model driven agents[J]. arXiv preprint arXiv:2402.09205, 2024.
[52] TRICK S, KOERT D, PETERS J, et al. Multimodal uncertainty reduction for intention recognition in human-robot interaction[C]// 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2019: 7009-7016.
[53] GÖLDI A, RIETSCHE R, UNGAR L. Efficient Management of LLM-Based Coaching Agents’ Reasoning While Maintaining Interaction Quality and Speed[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-18.
[54] PERERA M, ANANTHANARAYAN S, GONCU C, et al. The Sky is the Limit: Understanding How Generative AI can Enhance Screen Reader Users’ Experience with Productivity Applications[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-17.
[55] SAAB K, TU T, WENG W H, et al. Capabilities of Gemini models in medicine[J]. arXiv preprint arXiv:2404.18416, 2024.
[56] LI B, YAN T, PAN Y, et al. MMedAgent: Learning to use medical tools with multi-modal agent[J]. arXiv preprint arXiv:2407.02483, 2024.
[57] SCHMIDGALL S, ZIAEI R, HARRIS C, et al. AgentClinic: A multimodal agent benchmark to evaluate AI in simulated clinical environments[J]. arXiv preprint arXiv:2405.07960, 2024.
[58] TSAI H R, CHIU S K, WANG B. GazeNoter: Co-Piloted AR Note-Taking via Gaze Selection of LLM Suggestions to Match Users’ Intentions[J]. arXiv preprint arXiv:2407.01161, 2024.
[59] WANG R, ZHOU X, QIU L, et al. Social-RAG: Retrieving from Group Interactions to Socially Ground AI Generation[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-25.
[60] PICKERING M, WILLIAMS H, GAN A, et al. How Humans Communicate Programming Tasks in Natural Language and Implications For End-User Programming with LLMs[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-34.
[61] CAO Y, JIANG P, XIA H. Generative and Malleable User Interfaces with Generative and Evolving Task-Driven Data Model[J]. arXiv preprint arXiv:2503.04084, 2025.
[62] DE LA TORRE F, FANG C M, HUANG H, et al. LLMR: Real-time prompting of interactive worlds using large language models[C]// Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 2024: 1-22.
[63] PENG X, KOCH J, MACKAY W E. FusAIn: Composing Generative AI Visual Prompts Using Pen-based Interaction[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-20.
[64] ZHANG T, AU YEUNG C, AURELIA E, et al. Prompting an Embodied AI Agent: How Embodiment and Multimodal Signaling Affects Prompting Behaviour[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-25.
[65] ZHANG C, YANG Z, LIU J, et al. AppAgent: Multimodal agents as smartphone users[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-20.
[66] LI C, WU G, CHAN G Y Y, et al. Satori 悟り: Towards Proactive AR Assistant with Belief-Desire-Intention User Modeling[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-24.
[67] CHO H, FASHIMPAUR J, SENDHILNATHAN N, et al. Persistent Assistant: Seamless Everyday AI Interactions via Intent Grounding and Multimodal Feedback[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-19.
[68] QIN Y, HU S, LIN Y, et al. Tool learning with foundation models[J]. ACM Computing Surveys, 2024, 57(4): 1-40.
[69] ZUO H, LIU R, ZHAO J, et al. Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities[C]// ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023: 1-5.
[70] LI M, ZHAO S, WANG Q, et al. Embodied agent interface: Benchmarking LLMs for embodied decision making[J]. Advances in Neural Information Processing Systems, 2024, 37: 100428-100534.