Frontiers of Data and Computing, 2024, Vol. 6, Issue (3): 127-138.

CSTR: 32002.14.jfdc.CN10-1649/TP.2024.03.014

doi: 10.11871/jfdc.issn.2096-742X.2024.03.014

• Technology and Application •

A Deep Web Tracker Detection Method with Coordinated Semantic and Co-Occurrence Features

YAN Jin, DONG Kejun, LI Hongtao*

  1. China Internet Network Information Center, Beijing 100190, China
  • Received: 2023-03-07 Online: 2024-06-20 Published: 2024-06-21

Abstract:

[Objective] Web trackers embedded in websites can collect user identifiers and access information during a user's visit. The collected information may be used for personalized recommendation services and website performance analysis, but web trackers may also leak the private data of Internet users. Users should therefore be able to turn web tracking on or off selectively, and automatic detection of web trackers is the prerequisite for doing so. [Methods] By analyzing real-life data sets, this paper reveals two important characteristics of web trackers, one from the perspective of URL text semantics and one from the perspective of embedded association (i.e., co-occurrence). On this basis, the paper designs a deep-learning-based web tracker detection method that combines the semantic features and association features of URLs. Specifically, the method first constructs a bipartite graph of the embedding relationship between the websites that users visit directly and the URLs embedded in those websites, and then extracts an embedding feature vector for each URL by applying the DeepWalk algorithm. Second, the method extracts text semantic features from the URL strings using BERT, a pre-trained model from the field of natural language processing. Finally, the method uses an attention mechanism to fuse the two types of features and a multi-layer perceptron to classify URLs and identify web trackers. [Results] Experimental results on real-life data sets show that, compared with existing methods, the proposed method improves recognition accuracy, and its F1 score reaches 0.91. [Conclusions] The proposed method achieves relatively high accuracy in detecting trackers using only the URLs of trackers and their embedding information in websites. As such, it is easy to deploy in practice.
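
The first step of the method, building the bipartite graph and sampling walks over it, can be illustrated with a minimal sketch. This is not the authors' implementation: the site names, URLs, function names, and parameters (`walks_per_node`, `walk_length`) are hypothetical, and only the random-walk stage of DeepWalk is shown. The skip-gram training that turns walks into vectors, the BERT features, the attention fusion, and the classifier are all omitted.

```python
import random
from collections import defaultdict


def build_bipartite_graph(embeddings):
    """Build an undirected bipartite graph from {visited site: [embedded URLs]}.

    Nodes are tagged ("site", name) or ("url", name) so the two sides
    of the graph can never collide, even if a name appears on both.
    """
    graph = defaultdict(set)
    for site, urls in embeddings.items():
        for url in urls:
            site_node = ("site", site)
            url_node = ("url", url)
            graph[site_node].add(url_node)
            graph[url_node].add(site_node)
    # Sorted neighbor lists make the walks reproducible for a fixed seed.
    return {node: sorted(neighbors) for node, neighbors in graph.items()}


def random_walks(graph, walks_per_node=2, walk_length=5, seed=0):
    """Generate truncated uniform random walks, as in DeepWalk's sampling stage."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                # Every node has at least one neighbor by construction.
                walk.append(rng.choice(graph[walk[-1]]))
            walks.append(walk)
    return walks


# Hypothetical toy data: two visited sites and the URLs they embed.
toy_embeddings = {
    "news.example": ["tracker.example/collect", "cdn.example/lib.js"],
    "shop.example": ["tracker.example/collect", "ads.example/pixel"],
}

graph = build_bipartite_graph(toy_embeddings)
walks = random_walks(graph)
```

In this toy graph the URL `tracker.example/collect` is embedded by both sites, so walks pass through it more often than through the other URLs. That is the kind of co-occurrence signal that a skip-gram model trained on the walks would then encode in the URL's vector.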

Key words: Web tracker detection, association features, deep learning, pre-trained model, DNS