Frontiers of Data and Computing ›› 2025, Vol. 7 ›› Issue (3): 81-93.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.03.007

doi: 10.11871/jfdc.issn.2096-742X.2025.03.007

• Special Issue: 30th Anniversary of the Computer Network Information Center, Chinese Academy of Sciences •

Multimodal Interaction: From Human-Computer Collaboration to Human-Intelligence Collaboration

WANG Zhenyuan1,2, TIAN Dong1,2, DONG Yu1,2, QIAO Na3, SHAN Guihua1,2,*

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
    2. University of Chinese Academy of Sciences, Beijing 100049, China
  3. Secretariat of the Chinese National Committee for the Man and the Biosphere Programme, UNESCO, Beijing 100864, China
  • Received: 2025-05-15 Online: 2025-06-20 Published: 2025-06-25

Abstract:

[Objective] This paper explores the paradigm shift of multimodal interaction technology from “human-computer collaboration” to “human-intelligence collaboration”. [Coverage] The references comprise 70 relevant works on multimodal human-intelligence interaction from recent domestic and international journals and conferences. [Methods] First, the paper traces the development of multimodal interaction technology from traditional modalities (speech, gestures, eye movements) to the human-intelligence interaction paradigm integrated with large models (LLMs, VLMs). Second, it focuses on cutting-edge methods for interaction context awareness and user intention understanding, and presents cases of multimodal human-intelligence interaction technology in healthcare, education, creative work, and daily life. [Results] Multimodal human-intelligence interaction has been explored in various vertical and general domains, but it still faces core technical challenges, such as the lack of compensation mechanisms and the appropriate handling of erroneous or ambiguous intents. [Limitations] Due to the scope of the available literature, only typical types of interaction modalities are listed, and the coverage of application scenarios is limited. [Conclusions] Multimodal interaction is progressing toward greater intelligence. Future research should focus on optimizing compensation mechanisms, improving the accuracy of intent understanding, enhancing dynamic balancing mechanisms, and providing embodied designs to better support diverse intent-driven tasks and facilitate scalable applications.

Key words: multimodal interaction, agent, intelligent interaction, extended reality