Paper summary. OpenCL Caffe: Accelerating and enabling a cross platform machine learning framework
This 2016 paper presents an OpenCL branch/port of the deep learning framework Caffe. More specifically, this branch replaces Caffe's CUDA-based backend with a backend built on the open OpenCL standard. The software was initially hosted at https://github.com/amd/OpenCL-caffe, and later graduated to https://github.com/BVLC/caffe/tree/opencl.
Once we develop a DNN model, we would ideally like to be able to deploy it for different applications across multiple platforms (servers, NVIDIA GPUs, AMD GPUs, ARM GPUs, even smartphones and tablets) with minimal development effort. Unfortunately, most deep learning frameworks (including Caffe) are integrated with CUDA libraries for running on NVIDIA GPUs, and that limits portability across platforms.
OpenCL helps with the portability of heterogeneous computing across platforms, since it is supported by a variety of commercial chip manufacturers: Altera, AMD, Apple, ARM Holdings, Creative Technology, IBM, Imagination Technologies, Intel, Nvidia, Qualcomm, Samsung, Vivante, Xilinx, and ZiiLABS. To enable compatibility across different platforms, an OpenCL program detects the specific devices present and compiles its kernels at runtime.
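To make that concrete, here is a minimal sketch (in C++ against the OpenCL C API; error handling omitted, and the kernel source is a toy example, not from the paper) of the detect-and-compile-at-runtime flow:

#include <CL/cl.h>
#include <cstdio>

// Toy kernel source, compiled at runtime for whatever device is found.
const char* src =
    "__kernel void scale(__global float* x, float a) {"
    "  size_t i = get_global_id(0);"
    "  x[i] *= a;"
    "}";

int main() {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);  // whichever vendor is installed
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, "", NULL, NULL);  // online compilation
    cl_kernel k = clCreateKernel(prog, "scale", NULL);

    char name[128];
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("kernel compiled at runtime for: %s\n", name);

    clReleaseKernel(k);
    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return 0;
}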
OpenCL was originally developed by Apple and later submitted to the Khronos Group. It is supported by many operating systems, including Android, FreeBSD, Linux, macOS, and Windows.
OpenCL Backend Porting and Optimization
The Caffe framework is originally written in C++ and CUDA. The CUDA layer of Caffe handles the optimization of hardware resource allocation and utilization, e.g., CPU-GPU task assignment, memory management, and data transfer. Since CUDA and OpenCL differ in hardware device abstraction, memory buffer management, synchronization, and data transfers, the OpenCL backend porting is not a straightforward process. The paper breaks the porting process into two phases. Phase 1 achieves a layerwise porting of the three software layers, namely the C++ machine learning interfaces, the OpenCL wrappers, and the GPU kernels. Layerwise porting means the layers are ported one by one, each unit tested while keeping the originals of the other layers, to guarantee correctness and convergence of the DNN algorithm.
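As a rough illustration of what the wrapper layer does (a hypothetical sketch, not the branch's actual code): a CUDA launch of the form kernel<<<grid, block>>>(args) becomes argument setup plus an NDRange enqueue in OpenCL, while the C++ interface above it stays unchanged.

// Hypothetical wrapper function; names (queue, kernel, x) are illustrative.
void gpu_scale(cl_command_queue queue, cl_kernel kernel,
               cl_mem x, float alpha, size_t n) {
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &x);
    clSetKernelArg(kernel, 1, sizeof(float), &alpha);
    size_t global = n;  // one work-item per element
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                           0, NULL, NULL);  // OpenCL counterpart of a launch
}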
After all layers are ported to OpenCL in Phase 1, Phase 2 focuses on performance optimization. Profiling the Phase 1 port (via AMD's profiling tool CodeXL, assisted with OpenCL events and printf) reveals some big bottlenecks. OpenCL's online compilation calls clBuildProgram to create each GPU kernel: over 100 iterations of CIFAR training, there were 63 clBuildProgram calls, which took about 68% of the total time. Another bottleneck was the convolutional layers, which take up most of the computation time; clBLAS performance suffered on the irregular tall-and-skinny matrix sizes coming from different layers.
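The OpenCL-event-based timing mentioned above looks roughly like this (a sketch; the queue must be created with profiling enabled, and ctx, device, kernel, global are assumed set up as earlier):

// Create the queue with profiling enabled, then time an enqueued kernel.
cl_command_queue q = clCreateCommandQueue(ctx, device,
                                          CL_QUEUE_PROFILING_ENABLE, NULL);
cl_event ev;
clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, NULL, 0, NULL, &ev);
clWaitForEvents(1, &ev);

cl_ulong start, end;  // timestamps in nanoseconds
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                        sizeof(start), &start, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                        sizeof(end), &end, NULL);
printf("kernel took %.3f ms\n", (end - start) * 1e-6);
clReleaseEvent(ev);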
To handle these, the paper proposes three key optimization techniques: kernel caching to avoid OpenCL online compilation overheads, a batched data layout scheme to boost data parallelism, and multiple command queues to boost task parallelism. These techniques effectively map the DNN problem sizes onto existing OpenCL math libraries, improve hardware resource utilization, and boost performance by 4.5x.
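The kernel-caching idea is simple to sketch (a minimal illustration assuming a per-process map; the branch's actual cache may differ): compile each program once and memoize the resulting kernel, so clBuildProgram is not re-invoked on every call. The multiple command queues are created the same way as the single queue above, with independent kernels enqueued to different queues.

#include <map>
#include <string>

// Memoize built kernels by name so clBuildProgram runs once per kernel,
// not once per invocation (hypothetical helper, illustrative names).
std::map<std::string, cl_kernel> kernel_cache;

cl_kernel get_kernel(cl_context ctx, cl_device_id dev,
                     const std::string& name, const char* src) {
    auto it = kernel_cache.find(name);
    if (it != kernel_cache.end()) return it->second;  // cache hit: no rebuild

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, "", NULL, NULL);    // compile cost paid once
    cl_kernel k = clCreateKernel(prog, name.c_str(), NULL);
    kernel_cache[name] = k;
    return k;
}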
Evaluation
The evaluation uses the AlexNet DNN model for ImageNet. It compares the performance of Caffe with CUDA (both cuBLAS and cuDNN v2) versus OpenCL (both original clBLAS and the batched optimization) on the NVIDIA TitanX and the AMD R9 Fury, with a mini-batch size of 100. As shown in Figure 3, OpenCL Caffe with the optimizations over clBLAS matches the performance of cuBLAS Caffe. Compared to the highly optimized cuDNN machine learning library, OpenCL Caffe still has a performance gap of 2x, as it lacks those optimizations. The authors argue that, given the current performance, OpenCL Caffe is still competitive in terms of performance per dollar, considering the market price difference between the AMD R9 Fury (about 560 dollars) and the NVIDIA TitanX (about 1000 dollars).
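As a back-of-the-envelope check of that claim: matching cuBLAS Caffe's throughput at 560/1000 of the price works out to about 1.8x the performance per dollar, while against cuDNN's 2x speed advantage the ratio becomes (1/2) / (560/1000) ≈ 0.9x, i.e., near parity.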
Cross Platform Capability Analysis
A natural question that arises is: would their OpenCL port of Caffe, tested on AMD GPUs, automatically work on ARM Mali GPUs as well? That would be a good test of the portability of the OpenCL port of Caffe, but it is not answered in the paper. Moreover, the authors caution about minor compatibility problems: "There are some differences in specific manufacturers' extension and keywords. For example, caffe uses a lot of template in GPU kernels to support different floating point precision. But it turns out the template keywords for different manufactures are different, which adds more difficulty for the same code to run on different platform without modifications."
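One common workaround for the precision-template issue (a sketch of the general technique; the branch's exact mechanism may differ) is to write kernels against a Dtype macro, since standard OpenCL C has no C++ templates, and select the precision via a -D build option:

// Kernel written against a Dtype macro instead of a C++ template parameter.
const char* axpy_src =
    "__kernel void axpy(Dtype a, __global const Dtype* x,"
    "                   __global Dtype* y) {"
    "  size_t i = get_global_id(0);"
    "  y[i] += a * x[i];"
    "}";

cl_program prog = clCreateProgramWithSource(ctx, 1, &axpy_src, NULL, NULL);
// Pick the precision at build time; use "-DDtype=double" for double.
clBuildProgram(prog, 1, &dev, "-DDtype=float", NULL, NULL);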
OpenCL support in deep learning frameworks is still not great, but hopefully it is getting better every day, as this paper shows.
The presentation slides for the paper are also available here.