Frontiers of Data & Computing (数据与计算发展前沿) ›› 2024, Vol. 6 ›› Issue (3): 127-138.

CSTR: 32002.14.jfdc.CN10-1649/TP.2024.03.014

doi: 10.11871/jfdc.issn.2096-742X.2024.03.014

• Technology and Application •

A Deep Web Tracker Detection Method with Coordinated Semantic and Co-Occurrence Features

YAN Jin (严瑾), DONG Kejun (董科军), LI Hongtao* (李洪涛)

  1. China Internet Network Information Center, Beijing 100190, China
  • Received: 2023-03-07 Online: 2024-06-20 Published: 2024-06-21
  • Corresponding author: *LI Hongtao (E-mail: lihongtao@cnnic.cn)
  • About the authors: YAN Jin is an assistant engineer at the China Internet Network Information Center. Her main research interest is the Domain Name System.
    In this paper, she is responsible for drafting the manuscript and for implementing and validating the method.
    E-mail: yanjin@cnnic.cn
    LI Hongtao is the chief engineer of the China Internet Network Information Center and a professor-level senior engineer. His research interests include computer application technology and next-generation Internet architecture. He currently focuses on new resolution techniques for Internet infrastructure resources and on big data analytics.
    In this paper, he is responsible for the overall design of the method and for revising and approving the paper.
    E-mail: lihongtao@cnnic.cn
  • Funding:
    National Key Research and Development Program of China, project "Key Information Analysis Technologies for Internet Infrastructure" (2022YFB3105003)

Abstract:

[Objective] Web trackers are embedded in the websites that users visit and collect user identifiers and browsing information, which is used for personalized recommendation services, website performance analysis, and similar purposes. However, web trackers may also leak the privacy of Internet users. Letting users selectively turn web tracking on or off is important for the healthy development of the Internet, and automatic detection of web trackers is a prerequisite for doing so. [Methods] By analyzing real-world data, this paper identifies important characteristics of web trackers along two dimensions: the text semantics of URLs and their embedding association (i.e., co-occurrence). Based on these findings, it designs a deep learning method for web tracker detection that fuses association features and semantic features. The method first builds a bipartite graph of the embedding relationships between the websites that users visit directly and the URLs embedded in them, and extracts an embedding feature vector for each URL with the DeepWalk algorithm. Second, it extracts the text semantic features of URL strings with a pre-trained BERT model from the field of natural language processing. Finally, it uses an attention mechanism to aggregate the two types of features and a multi-layer perceptron to classify URLs and identify web trackers. [Results] Experiments on real-world data show that, compared with existing methods, the proposed method improves detection accuracy, with an F1 score of up to 0.91. [Conclusions] The proposed deep learning method relies only on tracker URLs and their embedding relationships in websites, achieves relatively high detection accuracy, and is easy to deploy.

Key words: Web tracker detection, association features, deep learning, pre-trained model, DNS
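
To make the pipeline in the abstract more concrete, the following is a minimal sketch of two of its steps using only the Python standard library: building the site-to-URL bipartite graph and generating the truncated random walks that DeepWalk would feed to a skip-gram model, and fusing a semantic vector with a co-occurrence vector through softmax attention weights. It is an illustration under stated assumptions, not the authors' implementation. All function names, the `example` domains, and the `query` vector (standing in for learned attention parameters) are hypothetical; skip-gram training, BERT feature extraction, and the multi-layer perceptron classifier are omitted.

```python
import math
import random
from collections import defaultdict


def build_bipartite_graph(embeddings):
    """Build an undirected bipartite graph from (site, embedded_url) pairs."""
    graph = defaultdict(set)
    for site, url in embeddings:
        # Prefix the two node types so a site and a URL never collide.
        graph["site:" + site].add("url:" + url)
        graph["url:" + url].add("site:" + site)
    return graph


def random_walks(graph, walks_per_node=2, walk_length=5, seed=0):
    """Generate truncated random walks (the corpus DeepWalk gives to skip-gram)."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(graph):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                # Every node has at least one neighbor because nodes
                # are only created from edges.
                neighbors = sorted(graph[walk[-1]])
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


def attention_fuse(semantic_vec, cooccur_vec, query):
    """Fuse two equal-length vectors with softmax attention weights.

    `query` is a hypothetical stand-in for learned attention parameters.
    """
    scores = [
        sum(q * x for q, x in zip(query, vec))
        for vec in (semantic_vec, cooccur_vec)
    ]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [
        weights[0] * s + weights[1] * c
        for s, c in zip(semantic_vec, cooccur_vec)
    ]
    return fused, weights


# Hypothetical observations: which URLs are embedded in which visited sites.
pairs = [
    ("news.example", "tracker.example/pixel.gif"),
    ("shop.example", "tracker.example/pixel.gif"),
    ("shop.example", "cdn.example/app.js"),
]
graph = build_bipartite_graph(pairs)
walks = random_walks(graph)
fused, weights = attention_fuse([1.0, 0.0], [0.0, 1.0], query=[0.5, 0.5])
```

Because the graph is bipartite, every walk alternates between site nodes and URL nodes, so URLs embedded in many of the same sites tend to appear in similar walk contexts. That is the co-occurrence signal the embedding step is meant to capture. In a full system the fused vector would be passed to a trained classifier rather than inspected directly.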