OpenCL Function Qualifiers

OpenCL 3.0 Reference Pages -> OpenCL Compiler -> Function Qualifiers

1. Function Qualifiers

1.1 `__kernel` or `kernel`

The __kernel (or kernel) qualifier declares a function to be a kernel that an application can execute on one or more OpenCL devices. The following rules apply to functions declared with this qualifier:

It can be executed on the device only.
It can be called by the host.
If a __kernel function is called by another kernel function, it is just a regular function call.

Kernel functions that declare variables inside the function body with the __local or local qualifier can be called by the host using appropriate APIs such as clEnqueueNDRangeKernel. Calling such a kernel function from another kernel function has implementation-defined behavior.

The __kernel and kernel names are reserved for use as function qualifiers and shall not be used otherwise.

kernel void
parallel_add(global float *a, global float *b, global float *result) {
    ...
}

// The following is an example of an illegal kernel declaration and will result in a compile-time error.
// The kernel function has a return type of int instead of void.
kernel int
parallel_add(global float *a, global float *b, global float *result) {
    ...
}

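The following rules apply to kernel functions:

The return type must be void; any other return type causes a compile error.
The host can enqueue a command that executes the kernel, so that the function runs on the device.
When called from another kernel function, it behaves like a regular function.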
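
To make these rules concrete, here is a minimal plain-C emulation of how a void kernel is run once per work-item. It is a sketch, not OpenCL: the empty kernel and global macros, the get_global_id stand-in, the emulate_ndrange_1d helper, and the element-wise body of parallel_add are all assumptions made for illustration. On a real device the work-items run in parallel and the host launches the kernel through an API such as clEnqueueNDRangeKernel.

```c
#include <assert.h>
#include <stddef.h>

/* Emulation only: in OpenCL C, kernel and global are language keywords. */
#define kernel
#define global

/* Hypothetical stand-in for the OpenCL C built-in of the same name. */
static size_t current_global_id;
static size_t get_global_id(unsigned dim) {
    assert(dim == 0);  /* this sketch emulates a 1-D range only */
    return current_global_id;
}

/* Assumed body: element-wise add (the original listing elides the body). */
kernel void parallel_add(global float *a, global float *b, global float *result) {
    size_t i = get_global_id(0);
    result[i] = a[i] + b[i];
}

/* Hypothetical stand-in for enqueueing the kernel over global_size work-items:
 * the kernel body runs once per global ID and returns nothing to the caller. */
void emulate_ndrange_1d(size_t global_size, float *a, float *b, float *result) {
    for (current_global_id = 0; current_global_id < global_size; ++current_global_id)
        parallel_add(a, b, result);
}
```

Because the kernel returns void, its only observable effect is what it writes through its pointer arguments, which is why a non-void return type would have nowhere to go.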

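Consider a kernel function that calls another kernel function containing variables declared with the local qualifier. The behavior is implementation-defined, so such code is not guaranteed to be portable across implementations and should be avoided.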

kernel void
my_func_a(global float *src, global float *dst) {
    local float l_var[32];
    ...
}

kernel void
my_func_b(global float *src, global float *dst) {
    my_func_a(src, dst);  // implementation-defined behavior
}

For portability, a better approach is to pass the local variable into the kernel as an argument.

kernel void
my_func_a(global float *src, global float *dst, local float *l_var) {
    ...
}

kernel void
my_func_b(global float *src, global float *dst, local float *l_var) {
    my_func_a(src, dst, l_var);
}

1.2 Optional Attribute Qualifiers

The __kernel qualifier can be used with the keyword __attribute__ to declare additional information about the kernel function, as described below.

The optional __attribute__((vec_type_hint(<type>))) [27] is a hint to the compiler that represents the computational width of the __kernel. It should serve as the basis for calculating processor bandwidth utilization when the compiler is looking to autovectorize the code. In this qualifier, <type> is one of the built-in vector types listed in Built-in Vector Data Types or one of their constituent scalar element types. If vec_type_hint(<type>) is not specified, the kernel is assumed to have the __attribute__((vec_type_hint(int))) qualifier; the default of int indicates that the kernel is essentially scalar.

[27] Implicit in autovectorization is the assumption that any libraries called from the __kernel must be recompilable at run time to handle cases where the compiler decides to merge or separate work-items. This probably means that such libraries can never be hard-coded binaries, or that hard-coded binaries must be accompanied either by source or by some retargetable intermediate representation. This may be a code security question for some.

For example, where the developer specifies a width of float4, the compiler should assume that the computation usually uses up to 4 lanes of a float vector, and may decide to merge work-items or possibly even separate one work-item into many threads to better match the hardware capabilities. A conforming implementation is not required to autovectorize code, but shall support the hint. A compiler may autovectorize even if no hint is provided. If an implementation merges N work-items into one thread, it is responsible for correctly handling cases where the number of global or local work-items in any dimension modulo N is not zero.
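
The remainder case in the last sentence can be sketched in plain C. This is only an illustration of the bookkeeping, not how any particular compiler does it: work_item and run_merged are hypothetical names, and the doubling body is an arbitrary placeholder. Each emulated thread covers up to n consecutive work-items, and the final thread covers fewer when the global size is not a multiple of n.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar work-item body: doubles one element. */
static void work_item(size_t gid, const float *in, float *out) {
    out[gid] = 2.0f * in[gid];
}

/* Runs global_size work-items, merging n of them per emulated thread.
 * Returns the number of threads used, including a partial final one. */
size_t run_merged(size_t global_size, size_t n, const float *in, float *out) {
    assert(n > 0);
    size_t threads = 0;
    for (size_t base = 0; base < global_size; base += n) {
        /* The last thread may cover fewer than n work-items. */
        size_t remaining = global_size - base;
        size_t count = remaining < n ? remaining : n;
        for (size_t lane = 0; lane < count; ++lane)
            work_item(base + lane, in, out);
        ++threads;
    }
    return threads;
}
```

With 10 work-items merged 4 at a time, the sketch uses three threads: two full ones and a final one that handles only the remaining 2 work-items, so nothing is written past the end of the range.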

Examples:

// autovectorize assuming float4 as the basic computation width
__kernel __attribute__((vec_type_hint(float4)))
void foo(__global float4 *p) { ... }

// autovectorize assuming double as the basic computation width
__kernel __attribute__((vec_type_hint(double)))
void foo(__global float4 *p) { ... }

// autovectorize assuming int (default) as the basic computation width
__kernel
void foo(__global float4 *p) { ... }

If, for example, a __kernel function is declared with __attribute__((vec_type_hint(float4))) (meaning that most operations in the __kernel function are explicitly vectorized using float4) and the kernel runs using Intel® Advanced Vector Extensions (Intel® AVX), which implements an 8-float-wide vector unit, the autovectorizer might choose to merge two work-items into one thread, running the second work-item in the high half of the 256-bit AVX register.
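
The lane layout described above can be pictured with a small plain-C sketch. It uses an 8-element float array as a stand-in for the 256-bit register, so it illustrates the data placement only and uses no real AVX intrinsics; pack_two_work_items and scale_register are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum { HINT_LANES = 4, HW_LANES = 8 };  /* float4 hint, 8-float-wide vector unit */

/* Packs two float4 work-items into one 8-lane register image:
 * work-item 0 in lanes 0..3 (low half), work-item 1 in lanes 4..7 (high half). */
void pack_two_work_items(const float wi0[HINT_LANES], const float wi1[HINT_LANES],
                         float reg[HW_LANES]) {
    assert(HW_LANES == 2 * HINT_LANES);
    for (size_t lane = 0; lane < HINT_LANES; ++lane) {
        reg[lane] = wi0[lane];
        reg[HINT_LANES + lane] = wi1[lane];
    }
}

/* One emulated 8-wide multiply covers both work-items at once. */
void scale_register(float reg[HW_LANES], float factor) {
    for (size_t lane = 0; lane < HW_LANES; ++lane)
        reg[lane] *= factor;
}
```

A single 8-wide operation then does the work of one float4 operation in each of the two merged work-items, which is where the benefit of merging comes from.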

As another example, a Power4 machine has two scalar double-precision floating-point units with a 6-cycle deep pipe. An autovectorizer for the Power4 machine might choose to interleave six kernels declared with the __attribute__((vec_type_hint(double2))) qualifier into one hardware thread, to ensure that there is always 12-way parallelism available to saturate the FPUs. It might also choose to merge 4 or 8 work-items (or some other number) if it concludes that these are better choices, due to resource utilization concerns or some preference for divisibility by 2.
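
The arithmetic behind the Power4 numbers is short enough to write out. The sketch below assumes the simple model implied by the text: the parallelism needed is the number of FPUs times the pipeline depth, and each work-item contributes as many independent lanes as its hinted vector width. Both function names are hypothetical.

```c
#include <assert.h>

/* Independent operations that must be in flight to keep every FPU pipeline full. */
unsigned required_parallelism(unsigned fpu_count, unsigned pipeline_depth) {
    return fpu_count * pipeline_depth;
}

/* Work-items to interleave so that hint-width operations reach that
 * parallelism, rounded up when the division is not exact. */
unsigned work_items_to_interleave(unsigned parallelism, unsigned hint_lanes) {
    assert(hint_lanes > 0);
    return (parallelism + hint_lanes - 1) / hint_lanes;
}
```

For the example above, two FPUs with a 6-cycle pipe need 12-way parallelism, and with a double2 hint (2 lanes per work-item) that works out to six interleaved work-items.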

The optional __attribute__((work_group_size_hint(X, Y, Z))) is a hint to the compiler that specifies the work-group size that may be used, i.e., the value most likely to be specified by the local_work_size argument to clEnqueueNDRangeKernel. For example, __attribute__((work_group_size_hint(1, 1, 1))) is a hint to the compiler that the kernel will most likely be executed with a work-group size of 1.

The optional __attribute__((reqd_work_group_size(X, Y, Z))) specifies the work-group size that must be used as the local_work_size argument to clEnqueueNDRangeKernel. Knowing the work-group size allows the compiler to optimize the generated code appropriately for this kernel.

If Z is one, the work_dim argument to clEnqueueNDRangeKernel can be 2 or 3. If Y and Z are one, the work_dim argument to clEnqueueNDRangeKernel can be 1, 2 or 3.
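
The work_dim rule can be read as a compatibility check, sketched below in plain C. launch_matches_reqd_size is a hypothetical helper written for illustration, not part of the OpenCL API; in a real program the runtime performs the check, and clEnqueueNDRangeKernel reports a mismatch with the CL_INVALID_WORK_GROUP_SIZE error. The sketch assumes that a dimension omitted from the launch is acceptable only when its required size is 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not an OpenCL API: checks whether a launch with
 * work_dim dimensions and the given local_work_size is compatible with
 * reqd_work_group_size(x, y, z). */
bool launch_matches_reqd_size(unsigned work_dim, const size_t *local_work_size,
                              size_t x, size_t y, size_t z) {
    const size_t reqd[3] = { x, y, z };
    if (work_dim < 1 || work_dim > 3 || local_work_size == NULL)
        return false;
    for (unsigned d = 0; d < 3; ++d) {
        if (d < work_dim) {
            /* A dimension present in the launch must match exactly. */
            if (local_work_size[d] != reqd[d])
                return false;
        } else if (reqd[d] != 1) {
            /* An omitted dimension is acceptable only if its required size is 1. */
            return false;
        }
    }
    return true;
}
```

For reqd_work_group_size(64, 1, 1) the check accepts work_dim values of 1, 2 and 3, while for reqd_work_group_size(8, 8, 1) it accepts only 2 and 3, matching the rule stated above.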

References

Khronos OpenCL Registry
https://www.khronos.org/registry/OpenCL/

OpenCL 3.0 Reference Pages
https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/

OpenCL C Language Specification
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_C.html