[1] PEARL C. Designing voice user interfaces: Principles of conversational experiences[M]. O’Reilly Media, Inc., 2016: 1-278.
[2] YASEN M, JUSOH S. A systematic review on hand gesture recognition techniques, challenges and applications[J]. PeerJ Computer Science, 2019, 5: e218.
[3] HOLMQVIST K, NYSTRÖM M, ANDERSSON R, et al. Eye tracking: A comprehensive guide to methods and measures[M]. OUP Oxford, 2011: 1-560.
[4] ACHIAM J, ADLER S, AGARWAL S, et al. GPT-4 technical report[J]. arXiv preprint arXiv:2303.08774, 2023.
[5] LIU H, LI C, WU Q, et al. Visual instruction tuning[J]. Advances in Neural Information Processing Systems, 2023, 36: 34892-34916.
[6] LAKOMKIN E, ZAMANI M A, WEBER C, et al. Incorporating end-to-end speech recognition models for sentiment analysis[C]// 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019: 7976-7982.
[7] SCHAFFER S, REITHINGER N. Benefit, design and evaluation of multimodal interaction[C]// Proceedings of the 2016 DSLI Workshop, ACM CHI, 2016: 1-6.
[8] ACHERKI C, NIGAY L, ROY Q, et al. An Evaluation of Spatial Anchoring to Position AR Guidance in Arthroscopic Surgery[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-17.
[9] RAHMAN Y, ASISH S M, FISHER N P, et al. Exploring eye gaze visualization techniques for identifying distracted students in educational VR[C]// 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), IEEE, 2020: 868-877.
[10] CHO H, FASHIMPAUR J, SENDHILNATHAN N, et al. Persistent Assistant: Seamless Everyday AI Interactions via Intent Grounding and Multimodal Feedback[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-19.
[11] TAO J, WU Y, YU C, et al. A survey on multimodal human-computer interaction[J]. Journal of Image and Graphics, 2022, 27(6): 1956-1987.
[12] GOU C, ZHUO Y, WANG K, et al. Eye tracking research: Progress and prospects[J]. Acta Automatica Sinica, 2022, 48(5): 1173-1192.
[13] TANRIVERDI V, JACOB R J K. Interacting with eye movements in virtual environments[C]// Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2000: 265-272.
[14] KHAMIS M, ALT F, BULLING A. The past, present, and future of gaze-enabled handheld mobile devices: Survey and lessons learned[C]// Proceedings of the 20th International Conference on Human-Computer Interaction with Mobile Devices and Services, 2018: 1-17.
[15] RAYNER K. Eye movements in reading and information processing: 20 years of research[J]. Psychological Bulletin, 1998, 124(3): 372.
[16] SHI D, WANG Y, BAI Y, et al. Chartist: Task-driven Eye Movement Control for Chart Reading[J]. arXiv preprint arXiv:2502.03575, 2025.
[17] LUGARESI C, TANG J, NASH H, et al. MediaPipe: A framework for building perception pipelines[J]. arXiv preprint arXiv:1906.08172, 2019.
[18] KALANDAR B, DWORAKOWSKI Z. Sign Language Conversation Interpretation Using Wearable Sensors and Machine Learning[J]. arXiv preprint arXiv:2312.11903, 2023.
[19] ZHANG Y, ENS B, SATRIADI K A, et al. TimeTables: Embodied Exploration of Immersive Spatio-Temporal Data[C]// 2022 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), IEEE, 2022: 599-605.
[20] WAGNER U, LYSTBÆK M N, MANAKHOV P, et al. A Fitts’ law study of gaze-hand alignment for selection in 3D user interfaces[C]// Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 2023: 1-15.
[21] MA H, TANG R, ZHANG Y, et al. A survey of speech recognition research[J]. Computer Systems & Applications, 2022, 31(1): 1-10.
[22] ZHANG D, ZHANG X, ZHAN J, et al. SpeechGPT-Gen: Scaling chain-of-information speech generation[J]. arXiv preprint arXiv:2401.13527, 2024.
[23] WANG P. Research and design of smart home speech recognition system based on deep learning[C]// 2020 International Conference on Computer Vision, Image and Deep Learning (CVIDL), IEEE, 2020: 218-221.
[24] JARADAT G A, ALZUBAIDI M A, OTOOM M. A novel human-vehicle interaction assistive device for Arab drivers using speech recognition[J]. IEEE Access, 2022, 10: 127514-127529.
[25] FURTADO J S, LIU H H T, LAI G, et al. Comparative analysis of OptiTrack motion capture systems[C]// Advances in Motion Sensing and Control for Robotic Applications: Selected Papers from the Symposium on Mechatronics, Robotics, and Control (SMRC’18), CSME International Congress 2018, May 27-30, 2018, Toronto, Canada, Springer International Publishing, 2019: 15-31.
[26] GUO J, LUO J, WEI Z, et al. TelePhantom: A User-Friendly Teleoperation System with Virtual Assistance for Enhanced Effectiveness[J]. arXiv preprint arXiv:2412.13548, 2024.
[27] DONG J, FANG Q, JIANG W, et al. Fast and robust multi-person 3D pose estimation and tracking from multiple views[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 44(10): 6981-6992.
[28] MEHRABAN S, ADELI V, TAATI B. MotionAGFormer: Enhancing 3D human pose estimation with a transformer-GCNFormer network[C]// Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024: 6920-6930.
[29] YE J, YU Y, WANG Q, et al. CmdVIT: A Voluntary Facial Expression Recognition Model for Complex Mental Disorders[J]. IEEE Transactions on Image Processing, 2025, 34: 3013-3024.
[30] ZHU W, MA X, LIU Z, et al. MotionBERT: A unified perspective on learning human motion representations[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023: 15085-15099.
[31] HU Y, ZHANG S, DANG T, et al. Exploring large-scale language models to evaluate EEG-based multimodal data for mental health[C]// Companion of the 2024 ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2024: 412-417.
[32] KUANG B, LI X, LI X, et al. The effect of eye gaze direction on emotional mimicry: A multimodal study with electromyography and electroencephalography[J]. NeuroImage, 2021, 226: 117604.
[33] WANG Y, HUANG W, SUN F, et al. Deep multimodal fusion by channel exchanging[J]. Advances in Neural Information Processing Systems, 2020, 33: 4835-4845.
[34] DEBIE E, ROJAS R F, FIDOCK J, et al. Multimodal fusion for objective assessment of cognitive workload: A review[J]. IEEE Transactions on Cybernetics, 2019, 51(3): 1542-1555.
[35] VERE S, BICKMORE T. A basic agent[J]. Computational Intelligence, 1990, 6(1): 41-60.
[36] GRANTER S R, BECK A H, PAPKE JR D J. AlphaGo, deep learning, and the future of the human microscopist[J]. Archives of Pathology & Laboratory Medicine, 2017, 141(5): 619-621.
[37] LIU A, FENG B, XUE B, et al. DeepSeek-V3 technical report[J]. arXiv preprint arXiv:2412.19437, 2024.
[38] GONG R, HUANG Q, MA X, et al. MindAgent: Emergent gaming interaction[J]. arXiv preprint arXiv:2309.09971, 2023.
[39] YU J, WANG X, TU S, et al. KoLA: Carefully benchmarking world knowledge of large language models[J]. arXiv preprint arXiv:2306.09296, 2023.
[40] DURANTE Z, HUANG Q, WAKE N, et al. Agent AI: Surveying the horizons of multimodal interaction[J]. arXiv preprint arXiv:2401.03568, 2024.
[41] WU Q, BANSAL G, ZHANG J, et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversation[J]. arXiv preprint arXiv:2308.08155, 2023.
[42] QIAN C, LIU W, LIU H, et al. ChatDev: Communicative agents for software development[J]. arXiv preprint arXiv:2307.07924, 2023.
[43] HONG S, ZHENG X, CHEN J, et al. MetaGPT: Meta programming for a multi-agent collaborative framework[J]. arXiv preprint arXiv:2308.00352, 2023.
[44] HONG Y, ZHEN H, CHEN P, et al. 3D-LLM: Injecting the 3D world into large language models[J]. Advances in Neural Information Processing Systems, 2023, 36: 20482-20494.
[45] JATAVALLABHULA K M, SARYAZDI S, IYER G, et al. gradSLAM: Automagically differentiable SLAM[J]. arXiv preprint arXiv:1910.10672, 2019.
[46] MILDENHALL B, SRINIVASAN P P, TANCIK M, et al. NeRF: Representing scenes as neural radiance fields for view synthesis[J]. Communications of the ACM, 2021, 65(1): 99-106.
[47] CAO Y, JIANG P, XIA H. Generative and Malleable User Interfaces with Generative and Evolving Task-Driven Data Model[J]. arXiv preprint arXiv:2503.04084, 2025.
[48] ZHAO Z, CHAI W, WANG X, et al. See and think: Embodied agent in virtual environment[C]// European Conference on Computer Vision, Springer, Cham, 2025: 187-204.
[49] KIRCHNER E A, FAIRCLOUGH S H, KIRCHNER F. Embedded multimodal interfaces in robotics: applications, future trends, and societal implications[M]// The Handbook of Multimodal-Multisensor Interfaces: Language Processing, Software, Commercialization, and Emerging Directions - Volume 3, 2019: 523-576.
[50] YOU M, YIN Y, XIE L, et al. User profiling based on behavior awareness[J]. Journal of Zhejiang University (Engineering Science), 2021, 55(4): 608-614.
[51] QIAN C, HE B, ZHUANG Z, et al. Tell me more! Towards implicit user intention understanding of language model driven agents[J]. arXiv preprint arXiv:2402.09205, 2024.
[52] TRICK S, KOERT D, PETERS J, et al. Multimodal uncertainty reduction for intention recognition in human-robot interaction[C]// 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2019: 7009-7016.
[53] GÖLDI A, RIETSCHE R, UNGAR L. Efficient Management of LLM-Based Coaching Agents’ Reasoning While Maintaining Interaction Quality and Speed[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-18.
[54] PERERA M, ANANTHANARAYAN S, GONCU C, et al. The Sky is the Limit: Understanding How Generative AI can Enhance Screen Reader Users’ Experience with Productivity Applications[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-17.
[55] SAAB K, TU T, WENG W H, et al. Capabilities of Gemini models in medicine[J]. arXiv preprint arXiv:2404.18416, 2024.
[56] LI B, YAN T, PAN Y, et al. MMedAgent: Learning to use medical tools with multi-modal agent[J]. arXiv preprint arXiv:2407.02483, 2024.
[57] SCHMIDGALL S, ZIAEI R, HARRIS C, et al. AgentClinic: A multimodal agent benchmark to evaluate AI in simulated clinical environments[J]. arXiv preprint arXiv:2405.07960, 2024.
[58] TSAI H R, CHIU S K, WANG B. GazeNoter: Co-Piloted AR Note-Taking via Gaze Selection of LLM Suggestions to Match Users’ Intentions[J]. arXiv preprint arXiv:2407.01161, 2024.
[59] WANG R, ZHOU X, QIU L, et al. Social-RAG: Retrieving from Group Interactions to Socially Ground AI Generation[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-25.
[60] PICKERING M, WILLIAMS H, GAN A, et al. How Humans Communicate Programming Tasks in Natural Language and Implications For End-User Programming with LLMs[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-34.
[61] CAO Y, JIANG P, XIA H. Generative and Malleable User Interfaces with Generative and Evolving Task-Driven Data Model[J]. arXiv preprint arXiv:2503.04084, 2025.
[62] DE LA TORRE F, FANG C M, HUANG H, et al. LLMR: Real-time prompting of interactive worlds using large language models[C]// Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 2024: 1-22.
[63] PENG X, KOCH J, MACKAY W E. FusAIn: Composing Generative AI Visual Prompts Using Pen-based Interaction[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-20.
[64] ZHANG T, AU YEUNG C, AURELIA E, et al. Prompting an Embodied AI Agent: How Embodiment and Multimodal Signaling Affects Prompting Behaviour[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-25.
[65] ZHANG C, YANG Z, LIU J, et al. AppAgent: Multimodal agents as smartphone users[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-20.
[66] LI C, WU G, CHAN G Y Y, et al. Satori 悟り: Towards Proactive AR Assistant with Belief-Desire-Intention User Modeling[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-24.
[67] CHO H, FASHIMPAUR J, SENDHILNATHAN N, et al. Persistent Assistant: Seamless Everyday AI Interactions via Intent Grounding and Multimodal Feedback[C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025: 1-19.
[68] QIN Y, HU S, LIN Y, et al. Tool learning with foundation models[J]. ACM Computing Surveys, 2024, 57(4): 1-40.
[69] ZUO H, LIU R, ZHAO J, et al. Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities[C]// ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023: 1-5.
[70] LI M, ZHAO S, WANG Q, et al. Embodied agent interface: Benchmarking LLMs for embodied decision making[J]. Advances in Neural Information Processing Systems, 2024, 37: 100428-100534.