Small Object Detection and Recognition Based on Deep Learning

doi:10.11871/jfdc.issn.2096-742X.2020.02.010

Abstract

[Objective] This paper aims to improve the detection of small objects in deep learning-based detection frameworks by exploiting the characteristics of such objects. [Methods] Small object detection and recognition are improved from three directions: feature fusion, context learning, and attention mechanisms. Because small objects yield weak features, a bidirectional feature fusion method is proposed to strengthen their feature representation. A second method improves detection performance by using the context information of small objects. Finally, to better identify the categories of small objects, an attention transfer method is proposed to raise the recognition rate. [Results] Experimental results show that the three proposed methods significantly improve detection and recognition performance for small objects on public datasets. [Conclusions] Research on feature fusion, context utilization, and attention mechanisms is valuable for improving small object detection in complex scenes.

Key words: small object detection, feature fusion, context learning, attention mechanism

Leng Jiaxu, Liu Ying. Small Object Detection and Recognition Based on Deep Learning[J]. Frontiers of Data and Computing, 2020, 2(2): 120-135.
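
The abstract names bidirectional feature fusion but does not describe its architecture, so the following is only an illustrative sketch of a generic two-pass fusion over a feature pyramid, not the authors' method. It assumes single-channel feature maps stored as nested lists, each level half the size of the previous one, nearest-neighbour upsampling, 2x2 average pooling, and element-wise addition as the fusion operator. All function names (`upsample2x`, `downsample2x`, `add_maps`, `bidirectional_fuse`) are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D feature map."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat the row vertically
    return out


def downsample2x(fmap):
    """2x2 average pooling; height and width are assumed to be even."""
    h, w = len(fmap), len(fmap[0])
    return [
        [(fmap[i][j] + fmap[i][j + 1] + fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4.0
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]


def add_maps(a, b):
    """Element-wise sum of two equally sized feature maps."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def bidirectional_fuse(pyramid):
    """Fuse a pyramid ordered fine -> coarse with a top-down then a bottom-up pass."""
    n = len(pyramid)
    # Top-down pass: push coarse, semantically strong features into finer levels.
    top_down = [None] * n
    top_down[-1] = pyramid[-1]
    for i in range(n - 2, -1, -1):
        top_down[i] = add_maps(pyramid[i], upsample2x(top_down[i + 1]))
    # Bottom-up pass: push fine, spatially detailed features back into coarser levels.
    fused = [top_down[0]]
    for i in range(1, n):
        fused.append(add_maps(top_down[i], downsample2x(fused[i - 1])))
    return fused


# Toy pyramid: 4x4, 2x2 and 1x1 maps filled with ones.
pyramid = [
    [[1.0] * 4 for _ in range(4)],
    [[1.0] * 2 for _ in range(2)],
    [[1.0]],
]
fused = bidirectional_fuse(pyramid)
```

After both passes every level carries information from every other level, which is the general motivation for fusing in two directions when fine-level features alone are weak, as is the case for small objects. A real detector would operate on multi-channel tensors and typically use learned convolutions in place of plain addition.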

Figures/Tables (13): Fig. 1-4, Table 1, Fig. 5-9, Table 2, Table 3, Fig. 10

Method	Input size	Training data	Test data	mAP	FPS
YOLO	448	VOC2007 + 2012	VOC2007	63.4	45
YOLOv2	416	VOC2007 + 2012	VOC2007	76.8	67
Faster R-CNN		VOC2007 + 2012	VOC2007	73.2	5
R-FCN		VOC2007 + 2012	VOC2007	80.5	5.9
SSD	300	VOC2007 + 2012	VOC2007	77.7	61
DSSD	321	VOC2007 + 2012	VOC2007	78.6	9
ESSD	300	VOC2007 + 2012	VOC2007	79.2	52
SSD	512	VOC2007 + 2012	VOC2007	79.8	25
DSSD	513	VOC2007 + 2012	VOC2007	81.5	6
ESSD	512	VOC2007 + 2012	VOC2007	82.4	18

Method	Backbone network	mAP
Faster R-CNN	VGG16	73.2
Faster R-CNN	Residual-101	76.4
YOLOv2	Darknet-19	78.6
DSSD	Residual-101	81.5
Context-Aware Faster R-CNN	VGG16	82.1
Context-Aware Faster R-CNN	Residual-101	84.8

Method	CIFAR-100	Caltech-256	CUB-200
TLAN	72.88	68.82	77.90
FCAN	95.80	76.40	82.04
RA-CNN	97.21	79.24	85.31
ATM	97.68	80.32	86.12