GPU自动调度卷积层
本文对GPU使用自动调度程序。
与依靠手动模板定义搜索空间的基于模板的autotvm不同,自动调度程序不需要任何模板。用户只需要编写计算声明,无需任何调度命令或模板。自动调度程序可以自动生成一个较大的搜索空间,在该空间中找到良好的调度。
本文以卷积层为例。
本文无法在Windows或最新版本的macOS上运行。要使其运行,需要将本文的内容包装在一个if name == “main”:块中。
import os

import numpy as np
import tvm
from tvm import te, auto_scheduler, topi
from tvm.topi.testing import conv2d_nchw_python
定义计算
首先,定义卷积层的计算。该函数应返回输入/输出张量的列表。通过这些张量,自动调度器可以获得整个计算图。
@auto_scheduler.register_workload
def conv2d_layer(N, H, W, CO, CI, KH, KW, stride, padding):
data = te.placeholder((N, CI, H, W), name=“data”)
kernel = te.placeholder((CO, CI, KH, KW), name=“kernel”)
bias = te.placeholder((1, CO, 1, 1), name=“bias”)
conv = topi.nn.conv2d_nchw(data, kernel, stride, padding, dilation=1, out_dtype=“float32”)
out = topi.nn.relu(conv + bias)
return [data, kernel, bias, out]
创建搜索任务
然后,为resnet中的最后一个卷积层创建搜索任务。
target = tvm.target.Target(“cuda”)

Use the last layer in ResNet-50

N, H, W, CO, CI, KH, KW, strides, padding = 1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1)
task = auto_scheduler.SearchTask(
func=conv2d_layer, args=(N, H, W, CO, CI, KH, KW, strides, padding), target=target
)

Inspect the computational graph

print(“Computational DAG:”)
print(task.compute_dag)
输出:
Computational DAG:
data = PLACEHOLDER [1, 512, 7, 7]
pad_temp(i0, i1, i2, i3) = tir.if_then_else(((((i2 >= 1) && (i2 < 8)) && (i3 >= 1)) && (i3 < 8)), data[i0, i1, (i2 - 1), (i3 - 1)], 0f)
kernel = PLACEHOLDER [512, 512, 3, 3]
compute(nn, ff, yy, xx) += (pad_temp[nn, rc, (yy + ry), (xx + rx)]*kernel[ff, rc, ry, rx])
bias = PLACEHOLDER [1, 512, 1, 1]
T_add(ax0, ax1, ax2, ax3) = (compute[ax0, ax1, ax2, ax3] + bias[ax0, ax1, 0, 0])
compute(i0, i1, i2, i3) = max(T_add[i0, i1, i2, i3], 0f)
接下来,为自动调度程序设置参数。这些参数主要指定在搜索过程中如何进行测量。
• measure_ctx启动不同的测量过程以提供隔离。保护主进程免受测量期间GPU崩溃的影响,避免其它运行时冲突。
• min_repeat_ms定义每次测量中一次“重复”的最小持续时间。这样可以预热GPU,对于获得准确的测量结果是必不可少的。通常,建议值> = 300毫秒。
• num_measure_trials是在搜索过程中可以使用的测量试验的数量。为了快速演示,在本文中仅进行了10次试用。在实践中,1000是使搜索收敛的一个好值。可以根据自己的时间预算进行更多试验。
• 此外,还用RecordToFile将测量记录转储到文件conv2d.json中。测量记录可用于最好地查询历史记录,恢复搜索以及以后进行更多分析。
• 有关更多参数auto_scheduler.TuningOptions, 请参见auto_scheduler.LocalRPCMeasureContext。
log_file = “conv2d.json”
measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
tune_option = auto_scheduler.TuningOptions(
num_measure_trials=10, # change this to 1000 to achieve the best performance
runner=measure_ctx.runner,
measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
verbose=2,
)
输出:
Get devices for measurement successfully!
运行搜索
现在准备好所有输入。开始搜索,让自动调度程序发挥作用。经过一些测量试验之后,可以从日志文件中加载最佳调度并应用它。

Run auto-tuning (search)

task.tune(tune_option)

Apply the best schedule

sch, args = task.apply_best(log_file)

Kill the measurement process

del measure_ctx
输出:
可以降低调度以在自动调度后查看IR。自动调度程序可以正确执行优化,包括多层平铺,协作提取,展开和算子融合。
print(“Lowered TIR:”)
print(tvm.lower(sch, args, simple_mode=True))
输出:
Lowered TIR:
primfn(data_1: handle, kernel_1: handle, bias_1: handle, compute_1: handle) -> ()
attr = {“global_symbol”: “main”, “tir.noalias”: True}
buffers = {compute: Buffer(compute_2: Pointer(float32), float32, [1, 512, 7, 7], []),
kernel: Buffer(kernel_2: Pointer(float32), float32, [512, 512, 3, 3], []),
bias: Buffer(bias_2: Pointer(float32), float32, [1, 512, 1, 1], []),
data: Buffer(data_2: Pointer(float32), float32, [1, 512, 7, 7], [])}
buffer_map = {data_1: data, kernel_1: kernel, bias_1: bias, compute_1: compute} {
attr [IterVar(blockIdx.x: int32, (nullptr), “ThreadIndex”, “blockIdx.x”)] “thread_extent” = 16;
attr [compute_3: Pointer(float32)] “storage_scope” = “local”;
allocate(compute_3, float32, [14]);
attr [pad_temp.shared: Pointer(float32)] “storage_scope” = “shared”;
allocate(pad_temp.shared, float32, [1296]);
attr [kernel.shared: Pointer(float32)] “storage_scope” = “shared”;
allocate(kernel.shared, float32, [4608]);
attr [IterVar(threadIdx.x: int32, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112 {
compute_3[0] = 0f32
compute_3[7] = 0f32
compute_3[1] = 0f32
compute_3[8] = 0f32
compute_3[2] = 0f32
compute_3[9] = 0f32
compute_3[3] = 0f32
compute_3[10] = 0f32
compute_3[4] = 0f32
compute_3[11] = 0f32
compute_3[5] = 0f32
compute_3[12] = 0f32
compute_3[6] = 0f32
compute_3[13] = 0f32
for (rc.outer.outer: int32, 0, 32) {
attr [IterVar(threadIdx.x_1: int32, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112;
pad_temp.shared[threadIdx.x_1] = @tir.if_then_else(((((9 <= floormod(threadIdx.x_1, 81)) && (floormod(threadIdx.x_1, 81) < 72)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), (float32*)data_2[(((((rc.outer.outer784) + (floordiv(threadIdx.x_1, 81)49)) + (floordiv(floormod(threadIdx.x_1, 81), 9)7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112;
pad_temp.shared[(threadIdx.x_1 + 112)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 31), 81)) && (floormod((threadIdx.x_1 + 31), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), (float32
)data_2[(((((rc.outer.outer784) + (floordiv((threadIdx.x_1 + 112), 81)49)) + (floordiv(floormod((threadIdx.x_1 + 31), 81), 9)7)) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112;
pad_temp.shared[(threadIdx.x_1 + 224)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 62), 81)) && (floormod((threadIdx.x_1 + 62), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), (float32)data_2[(((((rc.outer.outer784) + (floordiv((threadIdx.x_1 + 224), 81)49)) + (floordiv(floormod((threadIdx.x_1 + 62), 81), 9)7)) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112;
pad_temp.shared[(threadIdx.x_1 + 336)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 12), 81)) && (floormod((threadIdx.x_1 + 12), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 3), 9))) && (floormod((threadIdx.x_1 + 3), 9) < 8)), (float32)data_2[(((((rc.outer.outer784) + (floordiv((threadIdx.x_1 + 336), 81)49)) + (floordiv(floormod((threadIdx.x_1 + 12), 81), 9)7)) + floormod((threadIdx.x_1 + 3), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112;
pad_temp.shared[(threadIdx.x_1 + 448)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 43), 81)) && (floormod((threadIdx.x_1 + 43), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 7), 9))) && (floormod((threadIdx.x_1 + 7), 9) < 8)), (float32)data_2[(((((rc.outer.outer784) + (floordiv((threadIdx.x_1 + 448), 81)49)) + (floordiv(floormod((threadIdx.x_1 + 43), 81), 9)7)) + floormod((threadIdx.x_1 + 7), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112;
pad_temp.shared[(threadIdx.x_1 + 560)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 74), 81)) && (floormod((threadIdx.x_1 + 74), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 2), 9))) && (floormod((threadIdx.x_1 + 2), 9) < 8)), (float32)data_2[(((((rc.outer.outer784) + (floordiv((threadIdx.x_1 + 560), 81)49)) + (floordiv(floormod((threadIdx.x_1 + 74), 81), 9)7)) + floormod((threadIdx.x_1 + 2), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112;
pad_temp.shared[(threadIdx.x_1 + 672)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 24), 81)) && (floormod((threadIdx.x_1 + 24), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 6), 9))) && (floormod((threadIdx.x_1 + 6), 9) < 8)), (float32)data_2[(((((rc.outer.outer784) + (floordiv((threadIdx.x_1 + 672), 81)49)) + (floordiv(floormod((threadIdx.x_1 + 24), 81), 9)7)) + floormod((threadIdx.x_1 + 6), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112;
pad_temp.shared[(threadIdx.x_1 + 784)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 55), 81)) && (floormod((threadIdx.x_1 + 55), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 1), 9))) && (floormod((threadIdx.x_1 + 1), 9) < 8)), (float32)data_2[(((((rc.outer.outer784) + (floordiv((threadIdx.x_1 + 784), 81)49)) + (floordiv(floormod((threadIdx.x_1 + 55), 81), 9)7)) + floormod((threadIdx.x_1 + 1), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112;
pad_temp.shared[(threadIdx.x_1 + 896)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 5), 81)) && (floormod((threadIdx.x_1 + 5), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 5), 9))) && (floormod((threadIdx.x_1 + 5), 9) < 8)), (float32)data_2[(((((rc.outer.outer784) + (floordiv((threadIdx.x_1 + 896), 81)49)) + (floordiv(floormod((threadIdx.x_1 + 5), 81), 9)7)) + floormod((threadIdx.x_1 + 5), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112;
pad_temp.shared[(threadIdx.x_1 + 1008)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 36), 81)) && (floormod((threadIdx.x_1 + 36), 81) < 72)) && (1 <= floormod(threadIdx.x_1, 9))) && (floormod(threadIdx.x_1, 9) < 8)), (float32)data_2[(((((rc.outer.outer784) + (floordiv((threadIdx.x_1 + 1008), 81)49)) + (floordiv(floormod((threadIdx.x_1 + 36), 81), 9)7)) + floormod(threadIdx.x_1, 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112;
pad_temp.shared[(threadIdx.x_1 + 1120)] = @tir.if_then_else(((((9 <= floormod((threadIdx.x_1 + 67), 81)) && (floormod((threadIdx.x_1 + 67), 81) < 72)) && (1 <= floormod((threadIdx.x_1 + 4), 9))) && (floormod((threadIdx.x_1 + 4), 9) < 8)), (float32)data_2[(((((rc.outer.outer784) + (floordiv((threadIdx.x_1 + 1120), 81)49)) + (floordiv(floormod((threadIdx.x_1 + 67), 81), 9)7)) + floormod((threadIdx.x_1 + 4), 9)) - 8)], 0f32, dtype=float32)
attr [IterVar(threadIdx.x_1, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112;
if @tir.likely((threadIdx.x_1 < 64), dtype=bool) {
pad_temp.shared[(threadIdx.x_1 + 1232)] = @tir.if_then_else((((floormod((threadIdx.x_1 + 17), 81) < 72) && (1 <= floormod((threadIdx.x_1 + 8), 9))) && (floormod((threadIdx.x_1 + 8), 9) < 8)), (float32)data_2[(((((rc.outer.outer784) + (floordiv((threadIdx.x_1 + 1232), 81)49)) + (floordiv(floormod((threadIdx.x_1 + 17), 81), 9)7)) + floormod((threadIdx.x_1 + 8), 9)) - 8)], 0f32, dtype=float32)
}
attr [IterVar(threadIdx.x_2: int32, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112 {
kernel.shared[(threadIdx.x_24)] = (float32)kernel_2[((((blockIdx.x147456) + (floordiv(threadIdx.x_2, 36)4608)) + (rc.outer.outer144)) + (floormod(threadIdx.x_2, 36)4))]
kernel.shared[((threadIdx.x_2
4) + 1)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 1), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 1), 144))]
kernel.shared[((threadIdx.x_2
4) + 2)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 2), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 2), 144))]
kernel.shared[((threadIdx.x_2
4) + 3)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 3), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 3), 144))]
}
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112 {
kernel.shared[((threadIdx.x_2
4) + 448)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 448), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 16), 144))]
kernel.shared[((threadIdx.x_2
4) + 449)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 449), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 17), 144))]
kernel.shared[((threadIdx.x_2
4) + 450)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 450), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 18), 144))]
kernel.shared[((threadIdx.x_2
4) + 451)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 451), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 19), 144))]
}
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112 {
kernel.shared[((threadIdx.x_2
4) + 896)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 896), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 32), 144))]
kernel.shared[((threadIdx.x_2
4) + 897)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 897), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 33), 144))]
kernel.shared[((threadIdx.x_2
4) + 898)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 898), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 34), 144))]
kernel.shared[((threadIdx.x_2
4) + 899)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 899), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 35), 144))]
}
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112 {
kernel.shared[((threadIdx.x_2
4) + 1344)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 1344), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 48), 144))]
kernel.shared[((threadIdx.x_2
4) + 1345)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 1345), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 49), 144))]
kernel.shared[((threadIdx.x_2
4) + 1346)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 1346), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 50), 144))]
kernel.shared[((threadIdx.x_2
4) + 1347)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 1347), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 51), 144))]
}
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112 {
kernel.shared[((threadIdx.x_2
4) + 1792)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 1792), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 64), 144))]
kernel.shared[((threadIdx.x_2
4) + 1793)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 1793), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 65), 144))]
kernel.shared[((threadIdx.x_2
4) + 1794)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 1794), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 66), 144))]
kernel.shared[((threadIdx.x_2
4) + 1795)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 1795), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 67), 144))]
}
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112 {
kernel.shared[((threadIdx.x_2
4) + 2240)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 2240), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 80), 144))]
kernel.shared[((threadIdx.x_2
4) + 2241)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 2241), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 81), 144))]
kernel.shared[((threadIdx.x_2
4) + 2242)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 2242), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 82), 144))]
kernel.shared[((threadIdx.x_2
4) + 2243)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 2243), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 83), 144))]
}
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112 {
kernel.shared[((threadIdx.x_2
4) + 2688)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 2688), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 96), 144))]
kernel.shared[((threadIdx.x_2
4) + 2689)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 2689), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 97), 144))]
kernel.shared[((threadIdx.x_2
4) + 2690)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 2690), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 98), 144))]
kernel.shared[((threadIdx.x_2
4) + 2691)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 2691), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 99), 144))]
}
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112 {
kernel.shared[((threadIdx.x_2
4) + 3136)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 3136), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 112), 144))]
kernel.shared[((threadIdx.x_2
4) + 3137)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 3137), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 113), 144))]
kernel.shared[((threadIdx.x_2
4) + 3138)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 3138), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 114), 144))]
kernel.shared[((threadIdx.x_2
4) + 3139)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 3139), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 115), 144))]
}
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112 {
kernel.shared[((threadIdx.x_2
4) + 3584)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 3584), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 128), 144))]
kernel.shared[((threadIdx.x_2
4) + 3585)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 3585), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 129), 144))]
kernel.shared[((threadIdx.x_2
4) + 3586)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 3586), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 130), 144))]
kernel.shared[((threadIdx.x_2
4) + 3587)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 3587), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 131), 144))]
}
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112 {
kernel.shared[((threadIdx.x_2
4) + 4032)] = (float32*)kernel_2[(((((blockIdx.x147456) + (floordiv((threadIdx.x_24), 144)4608)) + (rc.outer.outer144)) + (floormod(threadIdx.x_2, 36)4)) + 129024)]
kernel.shared[((threadIdx.x_2
4) + 4033)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 4033), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 1), 144))]
kernel.shared[((threadIdx.x_2
4) + 4034)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 4034), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 2), 144))]
kernel.shared[((threadIdx.x_2
4) + 4035)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 4035), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 3), 144))]
}
attr [IterVar(threadIdx.x_2, (nullptr), “ThreadIndex”, “threadIdx.x”)] “thread_extent” = 112 {
if @tir.likely((threadIdx.x_2 < 32), dtype=bool) {
kernel.shared[((threadIdx.x_2
4) + 4480)] = (float32*)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 4480), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 16), 144))]
}
if @tir.likely(((threadIdx.x_2
4) < 127), dtype=bool) {
if @tir.likely((threadIdx.x_2 < 32), dtype=bool) {
kernel.shared[((threadIdx.x_24) + 4481)] = (float32)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 4481), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 17), 144))]
}
}
if @tir.likely(((threadIdx.x_2
4) < 126), dtype=bool) {
if @tir.likely((threadIdx.x_2 < 32), dtype=bool) {
kernel.shared[((threadIdx.x_24) + 4482)] = (float32)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 4482), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 18), 144))]
}
}
if @tir.likely(((threadIdx.x_2
4) < 125), dtype=bool) {
if @tir.likely((threadIdx.x_2 < 32), dtype=bool) {
kernel.shared[((threadIdx.x_24) + 4483)] = (float32)kernel_2[((((blockIdx.x147456) + (floordiv(((threadIdx.x_24) + 4483), 144)4608)) + (rc.outer.outer144)) + floormod(((threadIdx.x_24) + 19), 144))]
}
}
}
for (rc.outer.inner: int32, 0, 4) {
compute_3[0] = ((float32
)compute_3[0] + ((float32*)pad_temp.shared[((rc.outer.inner324) + (floormod(threadIdx.x, 7)9))](float32)kernel.shared[((floordiv(threadIdx.x, 7)144) + (rc.outer.inner36))]))
compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[((rc.outer.inner324) + (floormod(threadIdx.x, 7)9))](float32)kernel.shared[(((floordiv(threadIdx.x, 7)144) + (rc.outer.inner36)) + 2304)]))
compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner324) + (floormod(threadIdx.x, 7)9)) + 1)](float32)kernel.shared[((floordiv(threadIdx.x, 7)144) + (rc.outer.inner36))]))
compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner324) + (floormod(threadIdx.x, 7)9)) + 1)](float32)kernel.shared[(((floordiv(threadIdx.x, 7)144) + (rc.outer.inner36)) + 2304)]))
compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner324) + (floormod(threadIdx.x, 7)9)) + 2)](float32)kernel.shared[((floordiv(threadIdx.x, 7)144) + (rc.outer.inner36))]))
compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner324) + (floormod(threadIdx.x, 7)9)) + 2)](float32)kernel.shared[(((floordiv(threadIdx.x, 7)144) + (rc.outer.inner36)) + 2304)]))
compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner324) + (floormod(threadIdx.x, 7)9)) + 3)](float32)kernel.shared[((floordiv(threadIdx.x, 7)144) + (rc.outer.inner36))]))
compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner324) + (floormod(threadIdx.x, 7)9)) + 3)](float32)kernel.shared[(((floordiv(threadIdx.x, 7)144) + (rc.outer.inner36)) + 2304)]))
compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner324) + (floormod(threadIdx.x, 7)9)) + 4)](float32)kernel.shared[((floordiv(threadIdx.x, 7)144) + (rc.outer.inner36))]))

    compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 22)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2312)]))compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 23)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 8)]))compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 23)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2312)]))compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 24)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 8)]))compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 24)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2312)]))compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 25)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 8)]))compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 25)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2312)]))compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 26)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 8)]))compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 26)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2312)]))compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 99)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 15)]))compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 99)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2319)]))compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 100)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 15)]))compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 100)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2319)]))compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 101)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 15)]))compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 101)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2319)]))compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 102)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 15)]))compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 102)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2319)]))compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 103)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 15)]))compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 103)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2319)]))compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 104)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 15)]))compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 104)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2319)]))compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 105)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 15)]))compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 105)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2319)]))compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 100)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 16)]))compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 100)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2320)]))compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 101)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 16)]))compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 101)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2320)]))compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 102)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 16)]))compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 102)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2320)]))compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 103)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 16)]))compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 103)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2320)]))compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 104)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 16)]))compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 104)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2320)]))compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 105)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 16)]))compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 105)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2320)]))compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 106)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 16)]))compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 106)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2320)]))compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 101)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 17)]))compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 101)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2321)]))compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 102)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 17)]))compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 102)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2321)]))compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 103)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 17)]))compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 103)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2321)]))compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 104)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 17)]))compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 104)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2321)]))compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 105)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 17)]))compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 105)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2321)]))compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 106)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 17)]))compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 106)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2321)]))compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 107)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 17)]))compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 107)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2321)]))compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 180)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 24)]))compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 180)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2328)]))compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 181)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 24)]))compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 181)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2328)]))compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 182)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 24)]))compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 182)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2328)]))compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 183)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 24)]))compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 183)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2328)]))compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 184)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 24)]))compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 184)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2328)]))compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 185)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 24)]))compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 185)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2328)]))compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 186)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 24)]))compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 186)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2328)]))compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 181)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 25)]))compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 181)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2329)]))compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 182)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 25)]))compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 182)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2329)]))compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 183)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 25)]))compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 183)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2329)]))compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 184)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 25)]))compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 184)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2329)]))compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 185)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 25)]))compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 185)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2329)]))compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 186)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 25)]))compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 186)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2329)]))compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 187)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 25)]))compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 187)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2329)]))compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 182)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 26)]))compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 182)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2330)]))compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 183)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 26)]))compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 183)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2330)]))compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 184)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 26)]))compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 184)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2330)]))compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 185)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 26)]))compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 185)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2330)]))compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 186)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 26)]))compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 186)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2330)]))compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 187)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 26)]))compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 187)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2330)]))compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 188)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 26)]))compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 188)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2330)]))compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 261)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 33)]))compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 261)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2337)]))compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 262)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 33)]))compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 262)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2337)]))compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 263)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 33)]))compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 263)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2337)]))compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 264)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 33)]))compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 264)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2337)]))compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 265)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 33)]))compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 265)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2337)]))compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 266)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 33)]))compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 266)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2337)]))compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 267)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 33)]))compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 267)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2337)]))compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 262)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 34)]))compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 262)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2338)]))compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 263)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 34)]))compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 263)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2338)]))compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 264)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 34)]))compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 264)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2338)]))compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 265)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 34)]))compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 265)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2338)]))compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 266)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 34)]))compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 266)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2338)]))compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 267)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 34)]))compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 267)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2338)]))compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 268)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 34)]))compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 268)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2338)]))compute_3[0] = ((float32*)compute_3[0] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 263)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 35)]))compute_3[7] = ((float32*)compute_3[7] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 263)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2339)]))compute_3[1] = ((float32*)compute_3[1] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 264)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 35)]))compute_3[8] = ((float32*)compute_3[8] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 264)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2339)]))compute_3[2] = ((float32*)compute_3[2] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 265)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 35)]))compute_3[9] = ((float32*)compute_3[9] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 265)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2339)]))compute_3[3] = ((float32*)compute_3[3] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 266)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 35)]))compute_3[10] = ((float32*)compute_3[10] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 266)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2339)]))compute_3[4] = ((float32*)compute_3[4] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 267)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 35)]))compute_3[11] = ((float32*)compute_3[11] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 267)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2339)]))compute_3[5] = ((float32*)compute_3[5] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 268)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 35)]))compute_3[12] = ((float32*)compute_3[12] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 268)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2339)]))compute_3[6] = ((float32*)compute_3[6] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 269)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 35)]))compute_3[13] = ((float32*)compute_3[13] + ((float32*)pad_temp.shared[(((rc.outer.inner*324) + (floormod(threadIdx.x, 7)*9)) + 269)]*(float32*)kernel.shared[(((floordiv(threadIdx.x, 7)*144) + (rc.outer.inner*36)) + 2339)]))}
}
for (i3.inner: int32, 0, 7) {compute_2[(((blockIdx.x*1568) + (threadIdx.x*7)) + i3.inner)] = max(((float32*)compute_3[i3.inner] + (float32*)bias_2[((blockIdx.x*32) + floordiv(threadIdx.x, 7))]), 0f32)compute_2[((((blockIdx.x*1568) + (threadIdx.x*7)) + i3.inner) + 784)] = max(((float32*)compute_3[(i3.inner + 7)] + (float32*)bias_2[(((blockIdx.x*32) + floordiv(threadIdx.x, 7)) + 16)]), 0f32)
}

}
}
检查正确性并评估性能
构建二进制文件并检查其正确性和性能。
func = tvm.build(sch, args, target)

Check correctness

data_np = np.random.uniform(size=(N, CI, H, W)).astype(np.float32)
weight_np = np.random.uniform(size=(CO, CI, KH, KW)).astype(np.float32)
bias_np = np.random.uniform(size=(1, CO, 1, 1)).astype(np.float32)
conv_np = conv2d_nchw_python(data_np, weight_np, strides, padding)
out_np = np.maximum(conv_np + bias_np, 0.0)

ctx = tvm.gpu()
data_tvm = tvm.nd.array(data_np, ctx=ctx)
weight_tvm = tvm.nd.array(weight_np, ctx=ctx)
bias_tvm = tvm.nd.array(bias_np, ctx=ctx)
out_tvm = tvm.nd.empty(out_np.shape, ctx=ctx)
func(data_tvm, weight_tvm, bias_tvm, out_tvm)

Check results

np.testing.assert_allclose(out_np, out_tvm.asnumpy(), rtol=1e-3)

Evaluate execution time

evaluator = func.time_evaluator(func.entry_name, ctx, min_repeat_ms=500)
print(
“Execution time of this operator: %.3f ms”
% (np.median(evaluator(data_tvm, weight_tvm, bias_tvm, out_tvm).results) * 1000)
)
输出:
Execution time of this operator: 0.184 ms
使用记录文件
搜索期间,所有测量记录都将转储到记录文件“ conv2d.json”中。测量记录可用于重新应用搜索结果,继续搜索以及执行其它分析。
这是一个示例,其中从文件加载最佳调度,打印等效的python调度API和CUDA源代码。它们可用于调试和学习自动调度程序的行为。
print(“Equivalent python schedule:”)
print(task.print_best(log_file, print_mode=“schedule”))

print(“CUDA source code:”)
print(task.print_best(log_file, print_mode=“cuda”))
输出:
Equivalent python schedule:
pad_temp_i0, pad_temp_i1, pad_temp_i2, pad_temp_i3 = tuple(pad_temp.op.axis) + tuple(pad_temp.op.reduce_axis)
compute_nn, compute_ff, compute_yy, compute_xx, compute_rc, compute_ry, compute_rx = tuple(compute.op.axis) + tuple(compute.op.reduce_axis)
T_add_ax0, T_add_ax1, T_add_ax2, T_add_ax3 = tuple(T_add.op.axis) + tuple(T_add.op.reduce_axis)
compute_i0, compute_i1, compute_i2, compute_i3 = tuple(compute.op.axis) + tuple(compute.op.reduce_axis)
s[T_add].compute_inline()
compute_nn_o_i, compute_nn_i = s[compute].split(compute_nn, factor=1)
compute_nn_o_o_i, compute_nn_o_i = s[compute].split(compute_nn_o_i, factor=1)
compute_nn_o_o_o_i, compute_nn_o_o_i = s[compute].split(compute_nn_o_o_i, factor=1)
compute_nn_o_o_o_o, compute_nn_o_o_o_i = s[compute].split(compute_nn_o_o_o_i, factor=1)
compute_ff_o_i, compute_ff_i = s[compute].split(compute_ff, factor=1)
compute_ff_o_o_i, compute_ff_o_i = s[compute].split(compute_ff_o_i, factor=1)
compute_ff_o_o_o_i, compute_ff_o_o_i = s[compute].split(compute_ff_o_o_i, factor=16)
compute_ff_o_o_o_o, compute_ff_o_o_o_i = s[compute].split(compute_ff_o_o_o_i, factor=2)
compute_yy_o_i, compute_yy_i = s[compute].split(compute_yy, factor=1)
compute_yy_o_o_i, compute_yy_o_i = s[compute].split(compute_yy_o_i, factor=1)
compute_yy_o_o_o_i, compute_yy_o_o_i = s[compute].split(compute_yy_o_o_i, factor=7)
compute_yy_o_o_o_o, compute_yy_o_o_o_i = s[compute].split(compute_yy_o_o_o_i, factor=1)
compute_xx_o_i, compute_xx_i = s[compute].split(compute_xx, factor=7)
compute_xx_o_o_i, compute_xx_o_i = s[compute].split(compute_xx_o_i, factor=1)
compute_xx_o_o_o_i, compute_xx_o_o_i = s[compute].split(compute_xx_o_o_i, factor=1)
compute_xx_o_o_o_o, compute_xx_o_o_o_i = s[compute].split(compute_xx_o_o_o_i, factor=1)
compute_rc_o_i, compute_rc_i = s[compute].split(compute_rc, factor=4)
compute_rc_o_o, compute_rc_o_i = s[compute].split(compute_rc_o_i, factor=4)
compute_ry_o_i, compute_ry_i = s[compute].split(compute_ry, factor=1)
compute_ry_o_o, compute_ry_o_i = s[compute].split(compute_ry_o_i, factor=3)
compute_rx_o_i, compute_rx_i = s[compute].split(compute_rx, factor=3)
compute_rx_o_o, compute_rx_o_i = s[compute].split(compute_rx_o_i, factor=1)
s[compute].reorder(compute_nn_o_o_o_o, compute_ff_o_o_o_o, compute_yy_o_o_o_o, compute_xx_o_o_o_o, compute_nn_o_o_o_i, compute_ff_o_o_o_i, compute_yy_o_o_o_i, compute_xx_o_o_o_i, compute_nn_o_o_i, compute_ff_o_o_i, compute_yy_o_o_i, compute_xx_o_o_i, compute_rc_o_o, compute_ry_o_o, compute_rx_o_o, compute_rc_o_i, compute_ry_o_i, compute_rx_o_i, compute_nn_o_i, compute_ff_o_i, compute_yy_o_i, compute_xx_o_i, compute_rc_i, compute_ry_i, compute_rx_i, compute_nn_i, compute_ff_i, compute_yy_i, compute_xx_i)
compute_i0_o_i, compute_i0_i = s[compute].split(compute_i0, factor=1)
compute_i0_o_o_i, compute_i0_o_i = s[compute].split(compute_i0_o_i, factor=1)
compute_i0_o_o_o, compute_i0_o_o_i = s[compute].split(compute_i0_o_o_i, factor=1)
compute_i1_o_i, compute_i1_i = s[compute].split(compute_i1, factor=1)
compute_i1_o_o_i, compute_i1_o_i = s[compute].split(compute_i1_o_i, factor=16)
compute_i1_o_o_o, compute_i1_o_o_i = s[compute].split(compute_i1_o_o_i, factor=2)
compute_i2_o_i, compute_i2_i = s[compute].split(compute_i2, factor=1)
compute_i2_o_o_i, compute_i2_o_i = s[compute].split(compute_i2_o_i, factor=7)
compute_i2_o_o_o, compute_i2_o_o_i = s[compute].split(compute_i2_o_o_i, factor=1)
compute_i3_o_i, compute_i3_i = s[compute].split(compute_i3, factor=7)
compute_i3_o_o_i, compute_i3_o_i = s[compute].split(compute_i3_o_i, factor=1)
compute_i3_o_o_o, compute_i3_o_o_i = s[compute].split(compute_i3_o_o_i, factor=1)
s[compute].reorder(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o, compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i, compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i, compute_i0_i, compute_i1_i, compute_i2_i, compute_i3_i)
s[compute].compute_at(s[compute], compute_i3_o_i)
kernel_shared = s.cache_read(kernel, “shared”, [compute])
kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3 = tuple(kernel_shared.op.axis)
s[kernel_shared].compute_at(s[compute], compute_rx_o_o)
pad_temp_shared = s.cache_read(pad_temp, “shared”, [compute])
pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3 = tuple(pad_temp_shared.op.axis)
s[pad_temp_shared].compute_at(s[compute], compute_rx_o_o)
s[pad_temp].compute_inline()
compute_i0_o_o_o_i1_o_o_o_fused_i2_o_o_o_fused_i3_o_o_o_fused = s[compute].fuse(compute_i0_o_o_o, compute_i1_o_o_o, compute_i2_o_o_o, compute_i3_o_o_o)
s[compute].bind(compute_i0_o_o_o_i1_o_o_o_fused_i2_o_o_o_fused_i3_o_o_o_fused, te.thread_axis(“blockIdx.x”))
compute_i0_o_o_i_i1_o_o_i_fused_i2_o_o_i_fused_i3_o_o_i_fused = s[compute].fuse(compute_i0_o_o_i, compute_i1_o_o_i, compute_i2_o_o_i, compute_i3_o_o_i)
s[compute].bind(compute_i0_o_o_i_i1_o_o_i_fused_i2_o_o_i_fused_i3_o_o_i_fused, te.thread_axis(“vthread”))
compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused = s[compute].fuse(compute_i0_o_i, compute_i1_o_i, compute_i2_o_i, compute_i3_o_i)
s[compute].bind(compute_i0_o_i_i1_o_i_fused_i2_o_i_fused_i3_o_i_fused, te.thread_axis(“threadIdx.x”))
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[kernel_shared].fuse(kernel_shared_ax0, kernel_shared_ax1, kernel_shared_ax2, kernel_shared_ax3)
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=4)
s[kernel_shared].vectorize(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[kernel_shared].split(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=112)
s[kernel_shared].bind(kernel_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis(“threadIdx.x”))
pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused = s[pad_temp_shared].fuse(pad_temp_shared_ax0, pad_temp_shared_ax1, pad_temp_shared_ax2, pad_temp_shared_ax3)
pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused, factor=1)
s[pad_temp_shared].vectorize(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_i)
pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_o, pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i = s[pad_temp_shared].split(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o, factor=112)
s[pad_temp_shared].bind(pad_temp_shared_ax0_ax1_fused_ax2_fused_ax3_fused_o_i, te.thread_axis(“threadIdx.x”))
s[compute].pragma(compute_nn_o_o_o_o, “auto_unroll_max_step”, 1024)
s[compute].pragma(compute_nn_o_o_o_o, “unroll_explicit”, True)

CUDA source code:

#ifdef _WIN32
using uint = unsigned int;
using uchar = unsigned char;
using ushort = unsigned short;
using int64_t = long long;
using uint64_t = unsigned long long;
#else
#define uint unsigned int
#define uchar unsigned char
#define ushort unsigned short
#define int64_t long
#define uint64_t ulong
#endif
extern “C” global void default_function_kernel0(float* restrict data, float* restrict kernel, float* restrict compute, float* restrict bias) {
float compute1[14];
shared float pad_temp_shared[1296];
shared float kernel_shared[4608];
compute1[(0)] = 0.000000e+00f;
compute1[(7)] = 0.000000e+00f;
compute1[(1)] = 0.000000e+00f;
compute1[(8)] = 0.000000e+00f;
compute1[(2)] = 0.000000e+00f;
compute1[(9)] = 0.000000e+00f;
compute1[(3)] = 0.000000e+00f;
compute1[(10)] = 0.000000e+00f;
compute1[(4)] = 0.000000e+00f;
compute1[(11)] = 0.000000e+00f;
compute1[(5)] = 0.000000e+00f;
compute1[(12)] = 0.000000e+00f;
compute1[(6)] = 0.000000e+00f;
compute1[(13)] = 0.000000e+00f;
for (int rc_outer_outer = 0; rc_outer_outer < 32; ++rc_outer_outer) {
__syncthreads();
pad_temp_shared[(((int)threadIdx.x))] = (((((9 <= (((int)threadIdx.x) % 81)) && ((((int)threadIdx.x) % 81) < 72)) && (1 <= (((int)threadIdx.x) % 9))) && ((((int)threadIdx.x) % 9) < 8)) ? data[((((((rc_outer_outer * 784) + ((((int)threadIdx.x) / 81) * 49)) + (((((int)threadIdx.x) % 81) / 9) * 7)) + (((int)threadIdx.x) % 9)) - 8))] : 0.000000e+00f);
pad_temp_shared[((((int)threadIdx.x) + 112))] = (((((9 <= ((((int)threadIdx.x) + 31) % 81)) && (((((int)threadIdx.x) + 31) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 4) % 9))) && (((((int)threadIdx.x) + 4) % 9) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 112) / 81) * 49)) + ((((((int)threadIdx.x) + 31) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 4) % 9)) - 8))] : 0.000000e+00f);
pad_temp_shared[((((int)threadIdx.x) + 224))] = (((((9 <= ((((int)threadIdx.x) + 62) % 81)) && (((((int)threadIdx.x) + 62) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 8) % 9))) && (((((int)threadIdx.x) + 8) % 9) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 224) / 81) * 49)) + ((((((int)threadIdx.x) + 62) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 8) % 9)) - 8))] : 0.000000e+00f);
pad_temp_shared[((((int)threadIdx.x) + 336))] = (((((9 <= ((((int)threadIdx.x) + 12) % 81)) && (((((int)threadIdx.x) + 12) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 3) % 9))) && (((((int)threadIdx.x) + 3) % 9) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 336) / 81) * 49)) + ((((((int)threadIdx.x) + 12) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 3) % 9)) - 8))] : 0.000000e+00f);
pad_temp_shared[((((int)threadIdx.x) + 448))] = (((((9 <= ((((int)threadIdx.x) + 43) % 81)) && (((((int)threadIdx.x) + 43) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 7) % 9))) && (((((int)threadIdx.x) + 7) % 9) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 448) / 81) * 49)) + ((((((int)threadIdx.x) + 43) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 7) % 9)) - 8))] : 0.000000e+00f);
pad_temp_shared[((((int)threadIdx.x) + 560))] = (((((9 <= ((((int)threadIdx.x) + 74) % 81)) && (((((int)threadIdx.x) + 74) % 81) < 72)) && (1 <= ((((int)threadIdx.x) + 2) % 9))) && (((((int)threadIdx.x) + 2) % 9) < 8)) ? data[((((((rc_outer_outer * 784) + (((((int)threadIdx.x) + 560) / 81) * 49)) + ((((((int)threadIdx.x) + 74) % 81) / 9) * 7)) + ((((int)threadIdx.x) + 2) % 9)) - 8))] : 0.000000e+00f);
compute1[(2)] = (compute1[(2)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 103))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 17))]));
compute1[(9)] = (compute1[(9)] + (pad_temp_shared[((((rc_outer_inner * 324) + ((((int)threadIdx.x) % 7) * 9)) + 103))] * kernel_shared[(((((((int)threadIdx.x) / 7) * 144) + (rc_outer_inner * 36)) + 2321))]));
compute1[(3)] = (compute1[(3)] + (pad_temp_shared[((((rc_outer_inner *

GPU自动调度卷积层相关推荐

  1. NVIDIA GPU自动调度神经网络

    NVIDIA GPU自动调度神经网络 对特定设备和工作负载进行自动调整对于获得最佳性能至关重要.这是有关如何使用自动调度器为NVIDIA GPU调整整个神经网络. 为了自动调整神经网络,将网络划分为小 ...

  2. pytorch 入门:GPU加速,卷积层,池化层

    GPU 加速 深度学习设计很多向量和多矩阵运算,比如BP , CNN 等深层模型都可以写成矩阵运算的格式,不用写成循环运算.但是CPU 上矩阵的运算会被展成循环的形式,CPU 是串行执行的.而GPU ...

  3. 自动调度GPU的卷积层

    自动调度GPU的卷积层 这是有关如何对GPU使用自动调度程序的文档. 与依靠手动模板定义搜索空间的基于模板的autotvm不同,自动调度程序不需要任何模板.用户只需要编写计算声明,而无需任何调度命令或 ...

  4. NVIDIA GPU的神经网络自动调度

    NVIDIA GPU的神经网络自动调度 针对特定设备和工作负载的自动调整对于获得最佳性能至关重要.这是一个关于如何使用自动调度器为NVIDIA GPU调整整个神经网络的资料. 为了自动调整一个神经网络 ...

  5. CPU的自动调度矩阵乘法

    CPU的自动调度矩阵乘法 这是一个有关如何对CPU使用自动调度程序的文档. 与依靠手动模板定义搜索空间的基于模板的autotvm不同,自动调度程序不需要任何模板.用户只需要编写计算声明,而无需任何调度 ...

  6. 智源青年科学家梁云:异构系统中张量计算的自动调度和优化框架

    与6位图灵奖得主和100多位专家 共同探讨人工智能的下一个十年 长按图片,内行盛会,首次免费注册 北京智源大会倒计时:9天  计算机体系结构领域国际顶级会议每次往往仅录用几十篇论文,录用率在20%左右 ...

  7. 在卷积层的运用_Conv 卷积层

    onv 卷积层 一.why CNN for image ? 1.Some Patterns are much smaller than the whole image. 一些模式比起整张图片来说更小, ...

  8. ARM CPU自动调度神经网络

    ARM CPU自动调度神经网络 对特定设备和工作负载进行自动调度,对于获得最佳性能至关重要.通过RPC使用自动调度器为ARM CPU调度整个神经网络. 为了自动调度神经网络,将网络划分为小的子图,进行 ...

  9. ARM CPU神经网络自动调度

    ARM CPU神经网络自动调度 对特定设备和工作负载进行自动调整对于获得最佳性能至关重要.这是一个有关如何通过RPC使用自动调度器为ARM CPU调整整个神经网络的教程. 为了自动调整神经网络,将网络 ...

最新文章

  1. 解决TCP网络传输“粘包”问题
  2. warning C4251 needs to have dll-interface解决办法
  3. 网络编程2_网络通讯协议, socket(tcp, udp)
  4. 帆软报表加载(开始)完毕执行自定义JavaScript
  5. 【JavaScript脚本】——T2事件操作
  6. 1026 程序运行时间 (15 分)(c语言)
  7. CF-1207 G.Indie Album(Trie上跑AC自动机)
  8. 读写锁的由奢入俭“易”
  9. android java 8_四个库,让你在 Android 中启用 Java 8 功能
  10. 【树链剖分】旅游(luogu 3976)
  11. 手把手教你搭建自己的个人博客(图文教程)
  12. tyvj1194 划分大理石
  13. 第1章 C/C++与开发环境介绍(《C和C++游戏趣味编程》配套教学视频)
  14. asp.net中FCKeditor的调用(31)
  15. modelsim教程
  16. 使用nssm管理tomcat服务操作步骤
  17. Redis入门完整教程:CacheCloud用户功能
  18. 无刷直流电机最强科普(收藏版)
  19. java web开发(一) Java Web开发框架对比
  20. ORA-04031: 无法分配 3840 字节的共享内存 (“shared pool“,“unknown object“,“sga heap(1,0)“,“kglsim object batch“)

热门文章

  1. 距离传感器控制灯泡代码_如何使用颜色传感器和超声波传感器检测障碍物和避障...
  2. detach detach_ pytorch
  3. BERT大火却不懂Transformer?读这一篇就够了 重点 命名实体识别
  4. python pycharm 包 安装问题
  5. 一文告诉你Adam、AdamW、Amsgrad区别和联系 重点
  6. Tensorflow函数——tf.variable_scope()
  7. LeetCode简单题之猜数字大小
  8. LeetCode简单题之拼写单词
  9. LeetCode简单题之二分查找
  10. 如何在CPU上优化GEMM矩阵乘法