Frontiers of Data and Computing ›› 2025, Vol. 7 ›› Issue (3): 81-93.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.03.007

doi: 10.11871/jfdc.issn.2096-742X.2025.03.007

• Special Issue: The 30th Anniversary of the Computer Network Information Center, Chinese Academy of Sciences •

Multimodal Interaction: From Human-Computer Collaboration to Human-Intelligence Collaboration

WANG Zhenyuan1,2, TIAN Dong1,2, DONG Yu1,2, QIAO Na3, SHAN Guihua1,2,*

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
    2. University of Chinese Academy of Sciences, Beijing 100049, China
    3. Secretariat of the Chinese National Committee for Man and the Biosphere Programme, UNESCO, Beijing 100864, China
  • Received: 2025-05-15 Online: 2025-06-20 Published: 2025-06-25
  • Corresponding author: *SHAN Guihua (E-mail: sgh@cnic.cn)
  • About the authors: WANG Zhenyuan is currently a postgraduate student at the Computer Network Information Center, Chinese Academy of Sciences, and the University of Chinese Academy of Sciences. His research interests include multi-modal interaction, immersive visualization, and human-machine hybrid intelligence.
    In this paper, he is mainly responsible for literature collection, classification, and summary.
    E-mail: zywang@cnic.cn
    SHAN Guihua is currently a research professor, PhD supervisor, and Director of the Advanced Interactive Technology and Application Development Department at the Computer Network Information Center, Chinese Academy of Sciences. Her main research interests include big data analytics and human-machine hybrid intelligence.
    In this paper, she is mainly responsible for determining the overall structure of the article and providing literature and writing guidance.
    E-mail: sgh@cnic.cn
  • Supported by:
    The project of the Chinese National Committee for Man and the Biosphere of the People's Republic of China, "Research on Key Technologies for Visual Analysis of Migratory Bird Migration" (MAB-CN-2023-HNQX); the Youth Fund of the Computer Network Information Center, Chinese Academy of Sciences, "Multimodal Spatial Intelligent Interaction for Scientific Data Analysis" (25YF08)



Abstract:

[Objective] This paper explores the paradigm shift of multimodal interaction technology from “human-computer collaboration” to “human-intelligence collaboration”. [Coverage] The references comprise 70 relevant works on multimodal human-intelligence interaction from recent domestic and international journals and conferences. [Methods] First, it traces the development of multimodal interaction technology from traditional methods (speech, gestures, eye movements) to the human-intelligence interaction paradigm that integrates large models (LLMs, VLMs). Second, it focuses on cutting-edge methods for interaction context awareness and user intent understanding, and presents cases of multimodal human-intelligence interaction technology in healthcare, education, creation, and daily life. [Results] Multimodal human-intelligence interaction has been explored in various vertical and general fields, but it still faces core technical challenges such as the lack of compensation mechanisms and the handling of erroneous or ambiguous intents. [Limitations] Due to the scope of the available literature, only typical types of interaction modalities are listed, and the coverage of application scenarios is limited. [Conclusions] Multimodal interaction is progressing towards greater intelligence. Future research should focus on optimizing compensation mechanisms, improving the accuracy of intent understanding, enhancing dynamic balancing mechanisms, and offering embodied designs, so as to better support tasks with diverse intents and facilitate scalable applications.

Key words: multimodal interaction, agent, intelligent interaction, extended reality