2D convolution (conv2d for short) is certainly one of the most compute-intensive and performance-critical operators in ML workloads. This operator can be implemented with different algorithms, which differ in terms of accuracy, kernel size support, and additional memory required. Unfortunately, no single algorithm can be used in all scenarios to achieve the best performance. Therefore, the Arm Compute Library integrates a heuristic within the conv2d operators to select the most efficient algorithm, depending on the input and kernel shapes and the desired level of accuracy. The heuristic depends on the target backend (either NEON™ for Arm® CPUs or OpenCL for Arm® GPUs), and the following subsections provide the main details behind the selection of the algorithm.
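To make the operator's cost concrete, the following is a minimal sketch of a naive direct conv2d over an NHWC input (the layout recommended above). The function name, OHWI weight layout, and the absence of padding are illustrative assumptions, not the Arm Compute Library implementation; the six nested loops show why faster algorithms such as Winograd or GeMM-based lowering matter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive direct conv2d, NHWC input, OHWI weights, no padding.
// Illustrative sketch only; not the Arm Compute Library implementation.
std::vector<float> conv2d_nhwc(const std::vector<float> &src, int n, int h, int w, int c,
                               const std::vector<float> &wei, int kh, int kw, int ofm,
                               int stride)
{
    const int oh = (h - kh) / stride + 1;
    const int ow = (w - kw) / stride + 1;
    std::vector<float> dst(static_cast<std::size_t>(n) * oh * ow * ofm, 0.0f);
    for (int b = 0; b < n; ++b)
        for (int y = 0; y < oh; ++y)
            for (int x = 0; x < ow; ++x)
                for (int o = 0; o < ofm; ++o)
                {
                    float acc = 0.0f;
                    for (int ky = 0; ky < kh; ++ky)
                        for (int kx = 0; kx < kw; ++kx)
                            for (int i = 0; i < c; ++i)
                            {
                                // NHWC input index and OHWI weight index
                                const int iy = y * stride + ky;
                                const int ix = x * stride + kx;
                                acc += src[((b * h + iy) * w + ix) * c + i] *
                                       wei[((o * kh + ky) * kw + kx) * c + i];
                            }
                    dst[((b * oh + y) * ow + x) * ofm + o] = acc;
                }
    return dst;
}
```

For a kernel of size KхK over an HхW input with C input and O output channels, this performs roughly H·W·K·K·C·O multiply-accumulates per batch, which is what the alternative algorithms below trade off against accuracy and extra memory.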
⚠ Attention: The heuristics presented in the following subsections will only refer to the NHWC data layout, which is the optimal and recommended layout for the Arm Compute Library.
The conv2d heuristic for Arm® Cortex®-based CPUs is inside the get_convolution_method() method in the CpuConv2d function. The algorithms used in the get_convolution_method() function are the following:
⚠ Attention: Winograd only works with floating-point data types (F32, F16)
The heuristic first checks the less frequent cases that may occur in ML workloads for edge devices. These cases are the following:
If we have a more frequent case, such as unit dilations or a larger IFM, we evaluate the following conditions instead:
If the preceding cases are not met, we fall back to the Im2Col+GeMM-based algorithm.
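The overall CPU selection flow can be sketched as a small dispatch function. This is a hypothetical illustration of the shape of the logic only: the enum, struct, thresholds, and the specific checks (non-unit dilation, 3x3 floating-point kernels for Winograd) are assumptions, whereas the real get_convolution_method() in CpuConv2d evaluates many more conditions.

```cpp
#include <cassert>

// Hypothetical sketch of the CPU selection flow described above;
// names and thresholds are illustrative assumptions.
enum class ConvMethod { Winograd, Direct, Im2ColGemm };

struct Conv2dShape
{
    int  kernel_w, kernel_h;
    int  dilation_w, dilation_h;
    int  ifm;      // input feature maps (channels)
    bool is_float; // F32/F16 data type
};

ConvMethod select_cpu_conv_method(const Conv2dShape &s)
{
    // Less frequent cases are checked first (e.g. non-unit dilation);
    // the exact checks here are assumptions for illustration.
    if (s.dilation_w != 1 || s.dilation_h != 1)
        return ConvMethod::Im2ColGemm;

    // More frequent case: unit dilations or a larger IFM. Winograd is
    // considered only for floating-point types and supported kernel sizes.
    if (s.is_float && s.kernel_w == 3 && s.kernel_h == 3)
        return ConvMethod::Winograd;

    // Otherwise fall back to the Im2Col+GeMM-based algorithm.
    return ConvMethod::Im2ColGemm;
}
```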
The conv2d heuristic for Arm® Mali™-based GPUs is inside the get_convolution_method() method in the ClConv2d function.
The algorithms used in the get_convolution_method() function are the following:
⚠ Attention: Winograd only works with floating-point data types (F32, F16)
The heuristic first checks the less frequent cases that may occur in ML workloads for edge devices. These cases are the following:
In all other cases, the GPU heuristic evaluates the suitability of Winograd and Direct-Conv2D/Indirect-Conv2D. In particular, Winograd is adopted when the convolution parameters (kernel size and strides) are supported by the algorithm and when the IFM is not small (for example, greater than 8). The conditions for using the Direct-Conv2D algorithms are numerous, so we recommend looking at the heuristic directly. In general, the Direct-Conv2D operator is used in almost all cases where the kernel size is not 1x1. The Indirect-GeMM algorithm is used as an alternative to Direct-Conv2D only on the Arm® Mali™-G77 GPU. If neither Winograd nor Direct-Conv2D can be used, we fall back to either GeMM (when the kernel size is 1x1) or the Im2Col+GeMM-based algorithm.
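The GPU decision order above can likewise be sketched as a dispatch function. Again, this is a hedged illustration: the "IFM greater than 8" and "kernel not 1x1" checks come from the text, but the Winograd support set (3x3 kernels with unit strides), the struct fields, and the function name are assumptions, and the real ClConv2d::get_convolution_method() evaluates many more conditions.

```cpp
#include <cassert>

// Hypothetical sketch of the GPU selection flow described above;
// names and the Winograd support set are illustrative assumptions.
enum class ClConvMethod { Winograd, DirectConv2d, IndirectGemm, Gemm };

struct ClConv2dParams
{
    int  kernel_w, kernel_h;
    int  stride_w, stride_h;
    int  ifm;         // input feature maps (channels)
    bool is_float;    // F32/F16 data type
    bool is_mali_g77; // Indirect-GeMM is an alternative only on Mali-G77
};

ClConvMethod select_gpu_conv_method(const ClConv2dParams &p)
{
    // Winograd: parameters supported by the algorithm and IFM not small.
    // The supported kernel/stride set below is an assumption.
    const bool winograd_ok = p.is_float && p.kernel_w == 3 && p.kernel_h == 3 &&
                             p.stride_w == 1 && p.stride_h == 1;
    if (winograd_ok && p.ifm > 8)
        return ClConvMethod::Winograd;

    // Direct-Conv2D in almost all non-1x1 cases; Indirect-GeMM replaces it
    // only on Mali-G77.
    const bool is_1x1 = (p.kernel_w == 1 && p.kernel_h == 1);
    if (!is_1x1)
        return p.is_mali_g77 ? ClConvMethod::IndirectGemm : ClConvMethod::DirectConv2d;

    // Neither Winograd nor Direct-Conv2D: GeMM for 1x1 kernels
    // (the Im2Col+GeMM fallback is omitted from this sketch).
    return ClConvMethod::Gemm;
}
```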