Frontiers of Data and Computing ›› 2024, Vol. 6 ›› Issue (5): 126-138.

CSTR: 32002.14.jfdc.CN10-1649/TP.2024.05.012

doi: 10.11871/jfdc.issn.2096-742X.2024.05.012


Robust AdaBoost Regression Model Based on Double LOF and Inverse Cross-Validation

ZENG Fanbei1, YANG Lianqiang2,*

  1. School of Big Data and Statistics, Anhui University, Hefei, Anhui 230601, China
    2. School of Artificial Intelligence, Anhui University, Hefei, Anhui 230601, China
  • Received: 2023-01-03; Online: 2024-10-20; Published: 2024-10-21
  • Corresponding author: * YANG Lianqiang (E-mail: yanglq@ahu.edu.cn)
  • About the authors: ZENG Fanbei is a postgraduate student at the School of Big Data and Statistics, Anhui University. His research interest is machine learning. In this paper, he was responsible for drafting the manuscript, collecting the data, and implementing the algorithm. E-mail: 2275920905@qq.com
    YANG Lianqiang, Ph.D., is an associate professor and master's supervisor at the School of Artificial Intelligence, Anhui University. He has led several national and provincial natural science foundation projects and has published papers in important academic journals at home and abroad. His research interests include machine learning and regression analysis. In this paper, he was responsible for the ideas, methods, and experimental design. E-mail: yanglq@ahu.edu.cn
  • Funding:
    the Natural Science Foundation of the Higher Education Institutions of Anhui Province (KJ2021A0049); the Natural Science Foundation of Anhui Province (2208085MA06)



Abstract:

[Objective] The robustness of the traditional AdaBoost regression model is insufficient, and the improved AdaBoost.RT+ and AdaBoost.RS algorithms still suppress abnormal data only weakly and identify it with low accuracy, so enhancing the robustness of AdaBoost methods has clear practical value. [Methods] The proposed AdaBoost.R_LOF model first introduces the double LOF and inverse cross-validation algorithms and combines them to characterize the degree of abnormality of each observation as a probability. Then, building on the AdaBoost.R2 algorithm, it assigns each observation a weight coefficient according to its degree of abnormality, suppressing the influence of abnormal data without affecting the iterations on normal data. [Results] The new model is more robust and achieves a smaller mean squared prediction error. [Limitations] The method introduces additional hyperparameters that must be tuned to the distribution of the data set. [Conclusions] Simulations and real-data applications show that, on data sets with different proportions of outliers, the new model is more robust and estimates better than the AdaBoost.R2, AdaBoost.RT+, and AdaBoost.RS algorithms.

Key words: AdaBoost, double LOF, inverse cross-validation, AdaBoost.R_LOF
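
Note: the Methods summary above describes the algorithm only at a high level, and the details of the double LOF and inverse cross-validation procedures are not given on this page. The following is a minimal, illustrative Python sketch of the general idea rather than the authors' AdaBoost.R_LOF implementation: it uses scikit-learn's LocalOutlierFactor as a stand-in for the paper's anomaly-probability step and turns the LOF score into a sample weight that down-weights suspected outliers before fitting an AdaBoost.R2-style regressor (scikit-learn's AdaBoostRegressor). The toy data and the weighting rule are assumptions made purely for illustration.

import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.neighbors import LocalOutlierFactor

# Toy data: a smooth signal with roughly 10% of the responses contaminated by large noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, 300)
y[:30] += rng.normal(0.0, 3.0, 30)

# Score each observation with LOF on the joint (X, y) space.
# negative_outlier_factor_ is close to -1 for inliers and much more negative for outliers.
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(np.column_stack([X, y]))
lof_score = -lof.negative_outlier_factor_          # about 1 for inliers, larger for outliers

# Hypothetical weighting rule: shrink the initial sample weight of suspected outliers.
sample_weight = np.clip(1.0 / lof_score, 0.0, 1.0)

# scikit-learn's AdaBoostRegressor implements AdaBoost.R2 (Drucker, 1997).
model = AdaBoostRegressor(n_estimators=100, loss="linear", random_state=0)
model.fit(X, y, sample_weight=sample_weight)
print("Training MSE with LOF-based weights:", np.mean((model.predict(X) - y) ** 2))

In the paper's model, the anomaly probability comes from combining double LOF with inverse cross-validation rather than a single LOF pass, but the underlying idea, reducing the contribution of likely outliers while leaving normal observations untouched, is the same kind of mechanism.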