Frontiers of Data and Computing ›› 2025, Vol. 7 ›› Issue (2): 130-140.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.02.013

doi: 10.11871/jfdc.issn.2096-742X.2025.02.013

• Technology and Application •

Improving Adversarial Transferability on Vision-Language Pre-Training Models via Block Shuffle and Rotation

WANG Wenbin1, GAO Siyuan1,*, GAO Manda1, LIANG Ling1, YANG Guangjun1, HE Bangyan2, LIU Yaozu2

  1. CHN Energy New Energy Technology Research Institute Co., Ltd, Beijing 102209, China
    2. Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2024-12-09  Online: 2025-04-20  Published: 2025-04-23
  • Contact: GAO Siyuan  E-mail: 16080112@ceic.com; 20065237@ceic.com

Abstract:

[Purpose] This study focuses on the vulnerability of Vision-Language Pre-training (VLP) models to adversarial examples and aims to propose a method that enhances the transferability of adversarial examples, thereby addressing the associated security risks. [Literature Review] Existing studies on adversarial attacks against VLP models are summarized and analyzed. [Application Background] VLP models are currently susceptible to adversarial examples, which poses significant security risks. Moreover, black-box transfer attacks reflect real-world scenarios more closely than white-box attacks and therefore merit more research. [Methods] A transfer attack method based on block shuffle and rotation is proposed. When generating adversarial images and adversarial texts, block shuffle and rotation operations are applied to increase the diversity of the samples, thereby enhancing adversarial transferability. [Results] Experiments on the Flickr30K dataset verify the effectiveness of the proposed method. [Limitations] The adversarial transferability achieved still needs further improvement. [Conclusion] The proposed transfer attack method based on block shuffle and rotation can effectively improve the adversarial transferability against VLP models.
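To make the image-side transformation concrete, the following is a minimal PyTorch sketch of a block shuffle and rotation operation, not the authors' implementation: the function name, grid size, and angle range are illustrative assumptions, and the text-side augmentation described in the abstract is not covered here.

```python
import torch
import torchvision.transforms.functional as TF

def block_shuffle_rotate(x: torch.Tensor, n_blocks: int = 2,
                         max_angle: float = 24.0) -> torch.Tensor:
    """Block shuffle and rotation for a (B, C, H, W) image batch.

    A sketch only: assumes H and W are divisible by n_blocks; the
    grid size and angle range are guesses, not values from the paper.
    """
    # Split the image into an n_blocks x n_blocks grid of equal blocks.
    rows = torch.chunk(x, n_blocks, dim=2)
    blocks = [b for row in rows for b in torch.chunk(row, n_blocks, dim=3)]
    # Rotate each block by a small random angle (block size is preserved).
    blocks = [TF.rotate(b, float(torch.empty(1).uniform_(-max_angle, max_angle)))
              for b in blocks]
    # Shuffle the block order, then stitch the grid back together.
    order = torch.randperm(len(blocks)).tolist()
    blocks = [blocks[i] for i in order]
    rows = [torch.cat(blocks[i * n_blocks:(i + 1) * n_blocks], dim=3)
            for i in range(n_blocks)]
    return torch.cat(rows, dim=2)
```

In a transfer attack of this kind, the gradient at each optimization step would typically be averaged over several independently transformed copies of the adversarial image, so that the perturbation does not overfit the surrogate model's particular view of the input.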

Key words: adversarial examples, adversarial transferability, vision-language pre-training model