Frontiers of Data and Computing ›› 2025, Vol. 7 ›› Issue (6): 35-43.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.06.004

doi: 10.11871/jfdc.issn.2096-742X.2025.06.004

• Special Issue: Papers Solicited for the 40th National Computer Security Academic Exchange Conference

Sensitive Information Identification and Analysis Method for Microdata Anonymization Empowered by LLM

DONG Wei, LIAO Jiachun*, YAO Sicheng, CHEN Haisu, KAN Sunan

  1. Center of Big Data Technology, Nanhu Laboratory, Jiaxing, Zhejiang 314000, China
  • Received: 2025-08-02 Online: 2025-12-20 Published: 2025-12-17
  • Contact: LIAO Jiachun
  • About the authors:
    DONG Wei is a researcher at Nanhu Laboratory and holds a master's degree. His primary research focus is privacy protection.
    In this paper, he is mainly responsible for the design and implementation of the anonymization algorithm.
    E-mail: dwei@nanhulab.ac.cn
    LIAO Jiachun, Ph.D., is an associate researcher at Nanhu Laboratory. Her main research interests include privacy protection and information security.
    In this paper, she is mainly responsible for guiding the optimization of the algorithm workflow and its implementation.
    E-mail: jliao@nanhulab.ac.cn
  • Funding:
    National Key Research and Development Program of China (2022YFB4501500); National Key Research and Development Program of China (2022YFB4501502); Nanhu Laboratory In-House Research Project (NSS2024CI02003)

Abstract:

[Objective] Data anonymization is an effective way to protect personal privacy while promoting the release of data value. However, in existing microdata anonymization pipelines, the stage that identifies and analyzes sensitive information such as identifiers suffers from low identification accuracy, with many privacy-relevant fields missed, and does not effectively analyze the associations within the microdata. These problems seriously degrade both the efficiency of the pipeline and the quality of its results. Large language models (LLMs), with their strong semantic understanding capabilities, offer a new way to address these problems. [Methods] This paper proposes an LLM-empowered method for identifying and analyzing sensitive information in microdata. On the one hand, the method uses data semantics to extend and refine the rules for recognizing direct identifiers, and uses an LLM to distill refined quasi-identifier evaluation criteria from domestic anonymization standards, enabling high-precision automatic identification of quasi-identifiers. On the other hand, it analyzes the associations among fields in the microdata and performs the subsequent anonymization based on these associations, to improve the privacy and utility of the anonymized results. [Conclusions] Experimental results show that the method effectively improves the accuracy of identifier recognition in microdata.

Key words: privacy protection, data anonymization, large language model, microdata
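
As a rough illustration of two of the steps named in the abstract, rule-based recognition of direct identifiers and field association analysis, the following Python sketch may help. It is not the authors' implementation: the rule table `DIRECT_ID_RULES`, the functions `find_direct_identifiers` and `find_field_associations`, the 0.8 match threshold, and the sample rows are all hypothetical. Association is approximated here by a simple functional-dependency check, which is an assumption. The LLM-based quasi-identifier step is omitted because the abstract gives no prompt or model details.

```python
import re
from collections import defaultdict

# Hypothetical direct-identifier rules: field-name keywords plus a value pattern.
DIRECT_ID_RULES = {
    "email": (("email", "mail"), re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
    "phone": (("phone", "mobile", "tel"), re.compile(r"^\+?\d[\d\- ]{6,14}\d$")),
}


def find_direct_identifiers(rows, min_match=0.8):
    """Return {field: rule_name} for fields whose name or values match a rule."""
    found = {}
    fields = list(rows[0].keys()) if rows else []
    for field in fields:
        values = [str(r[field]) for r in rows if r[field] not in (None, "")]
        for rule, (keywords, pattern) in DIRECT_ID_RULES.items():
            name_hit = any(k in field.lower() for k in keywords)
            # A value hit requires most non-empty values to match the pattern.
            value_hit = bool(values) and (
                sum(bool(pattern.match(v)) for v in values) / len(values) >= min_match
            )
            if name_hit or value_hit:
                found[field] = rule
                break
    return found


def find_field_associations(rows):
    """Return (a, b) pairs where each value of field a maps to exactly one value of b."""
    fields = list(rows[0].keys()) if rows else []
    pairs = []
    for a in fields:
        for b in fields:
            if a == b:
                continue
            seen = defaultdict(set)
            for r in rows:
                seen[r[a]].add(r[b])
            # Skip fields that are unique per row: they trivially determine everything.
            if len(seen) < len(rows) and all(len(v) == 1 for v in seen.values()):
                pairs.append((a, b))
    return pairs


# Made-up sample microdata, for illustration only.
rows = [
    {"contact": "a@example.com", "city": "Jiaxing", "province": "Zhejiang", "age": 31},
    {"contact": "b@example.com", "city": "Jiaxing", "province": "Zhejiang", "age": 45},
    {"contact": "c@example.com", "city": "Hangzhou", "province": "Zhejiang", "age": 31},
    {"contact": "d@example.com", "city": "Suzhou", "province": "Jiangsu", "age": 31},
]
direct_ids = find_direct_identifiers(rows)    # {'contact': 'email'}
associations = find_field_associations(rows)  # [('city', 'province')]
```

In this toy data, `contact` is flagged from its values even though the field name gives no hint, and `city` is found to determine `province`. A downstream anonymizer could use such a pair to generalize the two fields consistently, so that a coarsened `city` does not contradict, or get re-identified through, an untouched `province`.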