Email Masquerade Attack Detection Based on Semi-Supervised Learning

doi:10.11871/jfdc.issn.2096-742X.2024.02.006

Abstract

[Objective] Masquerade attacks are a common threat to email systems: an attacker illicitly obtains a legitimate user's authentication credentials and uses them to access services without authorization, often causing significant damage. Detecting such attacks is difficult because email usage scenarios are complex, the data are irregularly distributed, and labeled anomalies are scarce. [Methods] To address these issues, we propose a rule-based, self-training Auto-Encoder framework for anomaly detection. The framework first analyzes SMTP log data and categorizes it by usage scenario, introducing coarse-grained label-correction rules. It then applies an Auto-Encoder iteratively in a self-training loop, refining the detection results of each round with the rules. Finally, kernel density estimation is used to select a threshold that reduces the false positive rate. [Results] On three months of data from 6,736 real corporate email accounts, the framework detected 7 anomalous accounts and 12 anomalous IP addresses. Compared with the corporate Security Operations Center (SOC), the proposed method increases the number of detected anomalous accounts by more than 75% while reducing the number of false-positive accounts by 81.3%.
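
The abstract gives no implementation details, so the following is only a minimal sketch of how the three stages could fit together, not the authors' implementation. It assumes a distance-to-centroid score as a stand-in for the Auto-Encoder's reconstruction error, a hypothetical whitelist rule (`whitelist_rule`, `KNOWN_BENIGN`) in place of the paper's label-correction rules, and a threshold placed at the first density valley to the right of the main mode of a Gaussian kernel density estimate. All names and the toy data are illustrative.

```python
import math
import statistics


def kde_density(samples, x, bandwidth):
    """Gaussian kernel density estimate of `samples`, evaluated at x."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm


def kde_threshold(scores, grid_size=200):
    """Return the first density valley to the right of the main (normal) mode.

    Assumption: if no valley exists, fall back to max(scores), i.e. flag nothing.
    """
    spread = statistics.stdev(scores)
    if spread == 0.0:
        return max(scores)
    bandwidth = 1.06 * spread * len(scores) ** -0.2  # normal-reference rule of thumb
    lo, hi = min(scores), max(scores)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    dens = [kde_density(scores, g, bandwidth) for g in grid]
    peak = max(range(grid_size), key=dens.__getitem__)
    for i in range(peak + 1, grid_size - 1):
        if dens[i] < dens[i - 1] and dens[i] <= dens[i + 1]:
            return grid[i]
    return hi


def centroid_scores(train, records):
    """Stand-in for reconstruction error: distance to the centroid of `train`."""
    dims = len(records[0])
    centre = tuple(sum(r[d] for r in train) / len(train) for d in range(dims))
    return [math.dist(r, centre) for r in records]


def self_train(records, rule, score_fn=centroid_scores, max_rounds=10):
    """Refit on records currently labeled normal until the labels stop changing."""
    labels = [0] * len(records)  # start by treating every record as normal
    threshold = None
    for _ in range(max_rounds):
        trusted = [r for r, y in zip(records, labels) if y == 0] or records
        scores = score_fn(trusted, records)
        threshold = kde_threshold(scores)
        # Pseudo-label by threshold, then let the rule correct each label.
        new_labels = [rule(r, int(s > threshold)) for r, s in zip(records, scores)]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, threshold


# Hypothetical rule: a record known to be benign is always relabeled as normal.
KNOWN_BENIGN = {(3.0, 3.1)}


def whitelist_rule(record, label):
    return 0 if record in KNOWN_BENIGN else label


# Toy 2-D feature vectors: 40 ordinary records plus 5 distant ones.
normal = [(0.01 * i, 0.02 * (i % 5)) for i in range(40)]
suspicious = [(3.0, 3.0), (3.1, 2.9), (2.9, 3.2), (3.2, 3.1), (3.0, 3.1)]
labels, threshold = self_train(normal + suspicious, whitelist_rule)
print(sum(labels), "records flagged; threshold =", round(threshold, 3))
```

On this toy data the four distant records end up flagged, while the whitelisted one is corrected back to normal by the rule. In a real system the scorer would be replaced by a trained Auto-Encoder and the rule by scenario-specific checks.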

Key words: semi-supervised learning, self-training, auto-encoder, masquerade attack, email protocol

LI Chang, LONG Chun, ZHAO Jing, YANG Yue, WANG Yueda, PAN Qingfeng, YE Xiaohu, WU Tiejun, TANG Ning. Email Masquerade Attack Detection Based on Semi-Supervised Learning[J]. Frontiers of Data and Computing, 2024, 6(2): 56-66, https://cstr.cn/32002.14.jfdc.CN10-1649/TP.2024.02.006.

Count	Framework	Baseline
Anomalous accounts	7	4
Anomalous IPs	12	5
Predicted anomalous accounts	125	634
Predicted anomalous IPs	1,098	3,474
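
The account counts in this table appear to reproduce the percentages quoted in the abstract. The short check below is a sketch under two assumptions that the excerpt does not state explicitly: the baseline column corresponds to the SOC, and every predicted account that was not confirmed counts as a false positive.

```python
# Account counts taken from the table (framework vs. baseline).
confirmed = {"framework": 7, "baseline": 4}
predicted = {"framework": 125, "baseline": 634}

# Assumption: predicted-but-unconfirmed accounts are false positives.
false_pos = {k: predicted[k] - confirmed[k] for k in predicted}

# Relative gain in confirmed anomalous accounts over the baseline.
gain = (confirmed["framework"] - confirmed["baseline"]) / confirmed["baseline"]

# Relative reduction in false-positive accounts.
reduction = 1 - false_pos["framework"] / false_pos["baseline"]

print(f"additional anomalous accounts: {gain:.1%}")
print(f"false-positive reduction: {reduction:.1%}")
```

Under these assumptions the gain is 3/4 and the false positives fall from 630 to 118, which rounds to the 81.3% given in the abstract.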

Model	Fixed country, fixed city				Fixed country, non-fixed city				Non-fixed country
	Acc	Pre	Rec	F1	Acc	Pre	Rec	F1	Acc	Pre	Rec	F1
FEAWAD	0.937	0.905	0.978	0.940	0.967	0.958	0.977	0.967	0.844	0.828	0.870	0.848
AutoLog	0.955	0.954	0.992	0.972	0.897	0.996	0.894	0.942	0.851	0.606	0.762	0.672
DAGMM	0.964	0.979	0.984	0.982	0.864	0.894	0.998	0.943	0.798	0.342	0.557	0.499
Ours	0.972	0.978	0.972	0.974	0.957	0.968	0.957	0.962	0.887	0.932	0.893	0.899
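
For reference, the four metrics in the table are conventionally defined from a binary confusion matrix as sketched below. The confusion counts are hypothetical and are not taken from the paper. This excerpt also does not say how the reported values were averaged: some rows' F1 differs from the harmonic mean of the listed precision and recall (for example, 0.932 and 0.893 give about 0.912, not 0.899), which would be consistent with per-class or weighted averaging, though that is only a guess.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from binary confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
m = binary_metrics(tp=90, fp=10, tn=880, fn=20)
print({name: round(value, 3) for name, value in m.items()})
```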

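Framework	Scenario	Accuracy	Precision	Recall	F1 score
Baseline (full framework)	Fixed country, fixed city	0.972	0.978	0.972	0.975
	Fixed country, non-fixed city	0.957	0.968	0.957	0.962
	Non-fixed country	0.887	0.932	0.893	0.899
Without scenario classification	-	0.855	0.893	0.651	0.753
Without self-training	Fixed country, fixed city	0.968	0.971	0.966	0.968
	Fixed country, non-fixed city	0.956	0.964	0.951	0.957
	Non-fixed country	0.696	0.730	0.697	0.713
Without threshold strategy	Fixed country, fixed city	0.972	0.978	0.972	0.975
	Fixed country, non-fixed city	0.951	0.958	0.943	0.954
	Non-fixed country	0.833	0.899	0.887	0.893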