Convolution 2D heuristic

Convolution 2D heuristic: algorithm selection

The 2D convolution (conv2d for short) is one of the most compute-intensive and performance-critical operators in ML workloads. This operator can be implemented with different algorithms, which differ in terms of accuracy, supported kernel sizes, and additional memory required. Since no single algorithm achieves the best performance in all scenarios, the Arm Compute Library integrates a heuristic within the conv2d operators to select the most efficient algorithm, depending on the input and kernel shapes and the desired level of accuracy. The heuristic depends on the target backend (NEON™ for Arm® CPUs or OpenCL for Arm® GPUs), and the following subsections provide the main details behind the selection of the algorithm.

⚠ Attention: The heuristics presented in the following subsections refer only to the NHWC data layout, which is the optimal and recommended layout for the Arm Compute Library.

Convolution 2D heuristic: Arm® Cortex®-based CPUs

The conv2d heuristic for Arm® Cortex®-based CPUs is implemented in the get_convolution_method() method of the CpuConv2d function. The algorithms considered by the get_convolution_method() function are the following:

  • Direct-Conv2D
  • Im2Col+GeMM-based
  • Indirect-GeMM (a.k.a. GEMMCONV2D)
  • GeMM
  • Winograd

⚠ Attention: Winograd only works with floating-point data types (F32, F16)
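As a quick way to inspect the heuristic for a given layer configuration, the selected algorithm can be queried through the public NEConvolutionLayer::get_convolution_method() interface without allocating any tensor memory. The sketch below uses illustrative shapes, and the exact argument list may differ slightly across library versions:

    #include "arm_compute/core/TensorInfo.h"
    #include "arm_compute/core/Types.h"
    #include "arm_compute/runtime/NEON/functions/NEConvolutionLayer.h"

    #include <iostream>

    using namespace arm_compute;

    int main()
    {
        // In NHWC, TensorShape is given as (C, W, H, N): dimension 0 is the
        // innermost (fastest-varying) one, i.e. the channels.
        TensorInfo src(TensorShape(64U, 56U, 56U, 1U), 1, DataType::F32);  // IFM = 64
        TensorInfo wei(TensorShape(64U, 3U, 3U, 128U), 1, DataType::F32);  // 3x3 kernel, OFM = 128
        TensorInfo dst(TensorShape(128U, 56U, 56U, 1U), 1, DataType::F32);
        src.set_data_layout(DataLayout::NHWC);
        wei.set_data_layout(DataLayout::NHWC);
        dst.set_data_layout(DataLayout::NHWC);

        // Unit strides and 1-pixel padding preserve the 56x56 spatial size.
        const PadStrideInfo conv_info(1, 1, 1, 1);

        const ConvolutionMethod method = NEConvolutionLayer::get_convolution_method(
            &src, &wei, &dst, conv_info, WeightsInfo(), Size2D(1U, 1U),
            ActivationLayerInfo(), /* enable_fast_math = */ false);

        // F32, unit strides, 3x3 kernel and a large IFM: on most Cortex-A
        // targets we would expect the heuristic to return WINOGRAD here.
        std::cout << "Selected method: " << static_cast<int>(method) << std::endl;
        return 0;
    }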

The heuristic first checks less frequent cases that we may have in ML workloads for edge devices. These cases are the following:

  1. Non-unit dilation: We call Im2Col+GeMM
  2. Large input and kernel shapes: We call Direct-Conv2D because it is the only algorithm that does not require additional temporary memory
  3. Small Input-Feature-Maps (IFM): In this scenario, we have found that the GeMM implementation is generally more efficient than Winograd and Indirect-GeMM

In the more frequent cases, such as unit dilation or larger IFM, we evaluate the following conditions instead:

  1. Unit kernel size (1x1): In this scenario, the conv2d operation corresponds to a matrix multiplication and we call GeMM.
  2. Winograd: It only works with unit strides and supports a limited set of kernel sizes, such as 3x3, 3x1, 1x3, 5x1, 1x5 and 5x5.
  3. Indirect-GeMM: It should be used in all cases except when the kernel size is 1x1 or when the IFM is small.

If none of the preceding conditions is met, we fall back to the Im2Col+GeMM-based algorithm. The overall decision flow is summarized in the sketch below.
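The following pseudo-implementation summarizes the CPU selection logic described above. It is not the actual CpuConv2d::get_convolution_method() code; the structure, helper names and the small-IFM threshold are illustrative assumptions:

    // Simplified sketch of the CPU conv2d selection logic described above.
    // NOT the actual CpuConv2d::get_convolution_method() implementation:
    // names and thresholds (e.g. kSmallIfm) are illustrative only.
    enum class Conv2dMethod { GEMM /* Im2Col+GeMM */, GEMM_CONV2D /* Indirect-GeMM */, DIRECT, WINOGRAD };

    struct Conv2dParams
    {
        int  kernel_w, kernel_h;
        int  stride_x, stride_y;
        int  dilation_x, dilation_y;
        int  ifm;              // input feature maps (channels)
        bool is_float;         // F32 or F16
        bool large_shapes;     // "large input and kernel shapes" case
        bool indirect_ok;      // remaining Indirect-GeMM constraints
    };

    Conv2dMethod select_cpu_conv2d(const Conv2dParams &p)
    {
        // 1) Less frequent cases
        if (p.dilation_x != 1 || p.dilation_y != 1)
            return Conv2dMethod::GEMM;     // non-unit dilation -> Im2Col+GeMM
        if (p.large_shapes)
            return Conv2dMethod::DIRECT;   // no additional temporary memory
        constexpr int kSmallIfm = 8;       // illustrative threshold
        if (p.ifm <= kSmallIfm)
            return Conv2dMethod::GEMM;     // small IFM -> GeMM

        // 2) Frequent cases
        if (p.kernel_w == 1 && p.kernel_h == 1)
            return Conv2dMethod::GEMM;     // 1x1 conv2d is a plain matmul
        const bool unit_stride = (p.stride_x == 1 && p.stride_y == 1);
        const bool winograd_kernel =
            (p.kernel_w == 3 && p.kernel_h == 3) || (p.kernel_w == 3 && p.kernel_h == 1) ||
            (p.kernel_w == 1 && p.kernel_h == 3) || (p.kernel_w == 5 && p.kernel_h == 1) ||
            (p.kernel_w == 1 && p.kernel_h == 5) || (p.kernel_w == 5 && p.kernel_h == 5);
        if (p.is_float && unit_stride && winograd_kernel)
            return Conv2dMethod::WINOGRAD;
        if (p.indirect_ok)
            return Conv2dMethod::GEMM_CONV2D;

        // 3) Fallback
        return Conv2dMethod::GEMM;         // Im2Col+GeMM
    }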

Convolution 2D heuristic: Arm® Mali™-based GPUs

The conv2d heuristic for Arm® Mali™-based GPUs is implemented in the get_convolution_method() method of the ClConv2d function.

The algorithms considered by the get_convolution_method() function are the following:

  • Direct-Conv2D
  • Im2Col+GeMM-based
  • Indirect-GeMM
  • GeMM
  • Winograd

⚠ Attention: Winograd only works with floating-point data types (F32, F16)
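As on the CPU side, the selected algorithm can be queried through the public CLConvolutionLayer::get_convolution_method() interface, which additionally takes the GPU target. The sketch below uses illustrative shapes, and the exact argument order may differ across library versions:

    #include "arm_compute/core/GPUTarget.h"
    #include "arm_compute/core/TensorInfo.h"
    #include "arm_compute/core/Types.h"
    #include "arm_compute/runtime/CL/functions/CLConvolutionLayer.h"

    #include <iostream>

    using namespace arm_compute;

    int main()
    {
        // Same NHWC convention as on CPU: TensorShape is (C, W, H, N).
        TensorInfo src(TensorShape(64U, 56U, 56U, 1U), 1, DataType::F32);
        TensorInfo wei(TensorShape(64U, 3U, 3U, 128U), 1, DataType::F32);
        TensorInfo dst(TensorShape(128U, 56U, 56U, 1U), 1, DataType::F32);
        src.set_data_layout(DataLayout::NHWC);
        wei.set_data_layout(DataLayout::NHWC);
        dst.set_data_layout(DataLayout::NHWC);

        const PadStrideInfo conv_info(1, 1, 1, 1);

        // The GPU target is an explicit input: the same shapes can map to
        // different algorithms on different Mali GPUs (e.g. G77 vs others).
        const ConvolutionMethod method = CLConvolutionLayer::get_convolution_method(
            &src, &wei, &dst, conv_info, WeightsInfo(), ActivationLayerInfo(),
            GPUTarget::G77, Size2D(1U, 1U), /* enable_fast_math = */ false);

        std::cout << "Selected method: " << static_cast<int>(method) << std::endl;
        return 0;
    }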

The heuristic first checks less frequent cases that we may have in ML workloads for edge devices. These cases are the following:

  1. Non-unit dilation: We call Im2Col+GeMM
  2. Large input and kernel shapes: We call Direct-Conv2D because it is the only algorithm that does not require additional temporary memory

In all the other cases, the GPU heuristic evaluates the suitability of Winograd and Direct-Conv2D/Indirect-Conv2D. In particular, Winograd is adopted when the convolution parameters (kernel size and strides) are supported by the algorithm and when the IFM is not small (for example, greater than 8). The conditions for using the Direct-Conv2D algorithm are numerous, so we recommend looking at the heuristic directly; in general, Direct-Conv2D is used in almost all cases where the kernel size is not 1x1. The Indirect-GeMM algorithm is used as an alternative to Direct-Conv2D only on the Arm® Mali™-G77 GPU. If neither Winograd nor Direct-Conv2D can be used, we fall back to either GeMM (when the kernel size is 1x1) or the Im2Col+GeMM-based algorithm. A simplified sketch of this flow is shown below.
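To make the above conditions easier to follow, here is a simplified pseudo-implementation of the GPU decision flow. It is not the actual ClConv2d::get_convolution_method() code; the helper flags and the small-IFM threshold are illustrative assumptions:

    // Simplified sketch of the GPU conv2d selection logic described above.
    // NOT the actual ClConv2d::get_convolution_method() implementation:
    // names, thresholds and the Direct-Conv2D condition are illustrative.
    enum class GpuConv2dMethod { GEMM, INDIRECT, DIRECT, WINOGRAD };

    struct GpuConv2dParams
    {
        int  kernel_w, kernel_h;
        int  dilation_x, dilation_y;
        int  ifm;
        bool large_shapes;
        bool winograd_supported;   // kernel size / strides supported by Winograd
        bool direct_supported;     // many conditions: see the heuristic itself
        bool is_g77;               // Arm Mali-G77 target
    };

    GpuConv2dMethod select_gpu_conv2d(const GpuConv2dParams &p)
    {
        // Less frequent cases first
        if (p.dilation_x != 1 || p.dilation_y != 1)
            return GpuConv2dMethod::GEMM;      // Im2Col+GeMM
        if (p.large_shapes)
            return GpuConv2dMethod::DIRECT;    // no additional temporary memory

        constexpr int kSmallIfm = 8;           // illustrative threshold
        if (p.winograd_supported && p.ifm > kSmallIfm)
            return GpuConv2dMethod::WINOGRAD;
        if (p.direct_supported && !(p.kernel_w == 1 && p.kernel_h == 1))
            return p.is_g77 ? GpuConv2dMethod::INDIRECT : GpuConv2dMethod::DIRECT;

        // Fallback: GeMM for 1x1 kernels, Im2Col+GeMM otherwise (both map to
        // the GEMM-based path in this sketch).
        return GpuConv2dMethod::GEMM;
    }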