Implementation and Integration of OpenCL Operators in TensorFlow Framework

doi:10.11871/jfdc.issn.2096-742X.2022.02.001

Abstract

[Objective] TensorFlow, a mainstream machine learning framework, and the CUDA heterogeneous programming environment are widely used in academia and industry, and TensorFlow operators implemented in CUDA are the key to accelerating computation. However, TensorFlow lacks support for OpenCL, an open, general-purpose heterogeneous programming standard. This severely limits the versatility of TensorFlow and prevents the computational power of OpenCL hardware devices from being fully exploited. [Methods] To address this issue, this paper analyzes the TensorFlow code structure in depth, implements OpenCL operators on that basis, and integrates them into version 2.2.0 of the TensorFlow framework. [Results] With these OpenCL operators, TensorFlow can run on hardware devices that support OpenCL 1.2. In addition, the optimization method proposed in this paper significantly improves the computational efficiency of the OpenCL operators. [Conclusions] The experiments show that the proposed method effectively solves the problem that TensorFlow cannot run on OpenCL hardware devices.

Key words: TensorFlow, OpenCL, Operator

GUO Qiang, CHENG Daguo, SUN Yufei, ZHOU Jianyu, ZHANG Yuzhi, PEI Jiaao, GAN Rundong, CHEN Rui. Implementation and Integration of OpenCL Operators in TensorFlow Framework[J]. Frontiers of Data and Computing, 2022, 4(2): 3-16.

Figures/Tables (15): Fig. 1 to Fig. 12; Table 1 to Table 3

Hardware/software	Configuration
CPU	Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
RAM	187 GB DDR4, 2933 MT/s
GPU	NVIDIA Tesla V100S
NVIDIA CUDA Toolkit	CUDA-10.2
OpenCL	OpenCL 1.2
Host compiler	GCC 7.5

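Operator	Number of kernels	Function
BiasAdd	2	Adds the bias term bias to value
BiasAddGrad	4	The backward operation for "BiasAdd" on the "bias" tensor
BatchToSpace	1	BatchToSpace for 4-D tensors of type T
Concat	2	Concatenates two tensors in a specified manner
DepthToSpace	3	Rearranges data from depth into blocks of spatial data
DynamicStitch	1	Interleaves the values from the data tensors into a single tensor
Resize_bilinear	3	Computes bilinear interpolation
SplitV	2	Splits a tensor into multiple tensors along one dimension
SpaceToBatch	1	SpaceToBatch for 4-D tensors of type T
Tile	1	Constructs a tensor by tiling a given tensor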
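
To make the first two rows of the table above concrete, the following is a minimal reference sketch of what BiasAdd and BiasAddGrad compute, written in plain Python on a 2-D layout of [rows][channels]. It is not the paper's OpenCL implementation: the function names bias_add and bias_add_grad, the list-based layout, and the sample values are assumptions chosen only for illustration. The channel-last reduction follows the documented behaviour of the TensorFlow operators.

```python
# Reference semantics only: plain-Python stand-ins (hypothetical names),
# not the OpenCL kernels described in the paper.

def bias_add(value, bias):
    """Add bias[c] to channel c of every row of a [rows][channels] value."""
    if any(len(row) != len(bias) for row in value):
        raise ValueError("last dimension of value must match len(bias)")
    return [[v + b for v, b in zip(row, bias)] for row in value]


def bias_add_grad(out_backprop):
    """Sum the incoming gradient over every dimension except the channel one."""
    channels = len(out_backprop[0])
    grad = [0.0] * channels
    for row in out_backprop:
        for c, g in enumerate(row):
            grad[c] += g
    return grad


# Illustrative inputs: two rows, three channels.
value = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0]]
bias = [0.5, -1.0, 2.0]

out = bias_add(value, bias)
grad = bias_add_grad([[1.0, 1.0, 1.0],
                      [1.0, 1.0, 1.0]])
```

A device kernel would typically assign one work-item per output element for the forward pass and use a parallel reduction for the gradient. That may be one reason BiasAddGrad needs more kernels than BiasAdd in the table, although the source does not state how the kernels are divided.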

Model	Learning rate	Batch size	Epochs
VGG16	0.01	64	10
ResNet18	0.0001	32	20