Installing OpenCL support for Halide
- 1. Background
- 2. Why install OpenCL support for Halide
- 3. Installing OpenCL support
- 4. How to call OpenCL from Halide
- 4.1 Call flow
- 4.2 A practical example
- 4.3 Hands-on steps
- 4.4 Analyzing how Halide calls OpenCL
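1. Background
Previously we built Halide from source and worked through its code examples. In the same series we also covered installing the Intel CPU OpenCL runtime, again with worked examples and project source.
The earlier articles are:
- Using the Intel-Cpu-OpenCL-Runtime-SDK and building its sample source
- Building Halide from source
2. Why install OpenCL support for Halide
The main goal is to see whether Halide can be used to implement OpenCL kernels, and how a Halide schedule affects the kernels that are generated. I also want to gauge how much this could help with kernel generation for an inference framework later on, and whether building my own AutoKernel-style tool is feasible.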
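3. Installing OpenCL support
- First install the Intel-Cpu-OpenCL-Runtime-SDK; see the earlier article for the exact steps.
- Then build Halide from source with OpenCL support enabled; see the earlier article for details. The commands are: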
# 1. Download the Halide source
curl -o Halide.zip https://codeload.github.com/halide/Halide/zip/refs/heads/main
# 2. Unpack it and enter the source tree (the archive of the main branch extracts to Halide-main)
unzip Halide.zip
cd Halide-main
# 3. Build
if [ ! -d "./build" ]; then
    mkdir build
fi
cd build
# Turn off the Python bindings
# Turn on the TARGET_OPENCL switch when configuring with cmake
cmake -DCMAKE_BUILD_TYPE=Release -DLLVM_DIR=$LLVM_ROOT/lib/cmake/llvm -DWITH_PYTHON_BINDINGS=OFF -DCMAKE_INSTALL_PREFIX=`pwd`/opencl-build -DTARGET_OPENCL=ON -S ..
cmake --build . -j32
# 4. Install into build/opencl-build; remember this directory as halide-install-path
make install -j32
4. How to call OpenCL from Halide
4.1 Call flow
4.2 A practical example
#include <stdio.h>

#include "Halide.h"

// Include a clock to do performance testing.
#include "clock.h"

// Include some support code for loading pngs.
#include "halide_image_io.h"

using namespace Halide;
using namespace Halide::Tools;

int main() {
    // Start from the host target and enable the OpenCL feature.
    Target host_target = get_host_target();
    Target new_target = host_target.with_feature(Target::OpenCL);
    if (!host_supports_target_device(new_target) || !new_target.has_gpu_feature()) {
        return -1;
    }

    Var x, y, c, i, ii, xo, yo, xi, yi;
    Buffer<uint8_t> input = load_image("../../images/rgb.png");

    // A Func has to be declared before it can be defined.
    Func lut, padded, padded16, sharpen, curved;

    // The look-up table is a gamma curve.
    lut(i) = cast<uint8_t>(clamp(pow(i / 255.0f, 1.2f) * 255.0f, 0, 255));

    // Augment the input with a boundary condition.
    padded(x, y, c) = input(clamp(x, 0, input.width() - 1),
                            clamp(y, 0, input.height() - 1), c);

    // Cast it to 16-bit to do the math.
    padded16(x, y, c) = cast<uint16_t>(padded(x, y, c));

    // Next we sharpen it with a five-tap filter.
    sharpen(x, y, c) = (padded16(x, y, c) * 2 -
                        (padded16(x - 1, y, c) +
                         padded16(x, y - 1, c) +
                         padded16(x + 1, y, c) +
                         padded16(x, y + 1, c)) /
                            4);

    // Then apply the LUT.
    curved(x, y, c) = lut(sharpen(x, y, c));

    // Compute the LUT once, on the device, in 16-wide 1D thread blocks.
    lut.compute_root();
    Var block, thread;
    lut.split(i, block, thread, 16);
    lut.gpu_blocks(block).gpu_threads(thread);

    // Color channels innermost, fixed at three and unrolled.
    curved.reorder(c, x, y).bound(c, 0, 3).unroll(c);

    // Compute curved in 2D 8x8 tiles using the GPU.
    curved.gpu_tile(x, y, xo, yo, xi, yi, 8, 8);

    // Compute padded per tile, using the threads of that tile.
    padded.compute_at(curved, xo);
    padded.gpu_threads(x, y);

    // JIT-compile the pipeline for the GPU. CUDA, OpenCL, or
    // Metal are not enabled by default. We have to construct a
    // Target object, enable one of them, and then pass that
    // target object to compile_jit. Otherwise your CPU will very
    // slowly pretend it's a GPU, and use one thread per output
    // pixel.
    printf("Target: %s\n", new_target.to_string().c_str());
    curved.compile_jit(new_target);

    Buffer<uint8_t> output(input.width(), input.height(), input.channels());

    // Run the filter once to initialize any GPU runtime state.
    curved.realize(output);

    return 0;
}
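To make it easier to check what the device kernels should produce, here is a minimal plain C++ sketch of the same sharpen-then-LUT computation. It is not part of the original project: the `Image` struct and the function names are hypothetical, and the planar layout is an assumption. The one subtle point it tries to mirror is that Halide evaluates the sharpen expression in `uint16_t`, so a dark pixel surrounded by bright ones wraps around to a large index, which is why the LUT has to cover the whole 16-bit range.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical planar 8-bit image: element (x, y, c) sits at
// x + y*width + c*width*height.
struct Image {
    int width, height, channels;
    std::vector<uint8_t> data;

    Image(int w, int h, int ch, uint8_t fill = 0)
        : width(w), height(h), channels(ch),
          data(static_cast<size_t>(w) * h * ch, fill) {}

    size_t index(int x, int y, int c) const {
        return x + static_cast<size_t>(width) * (y + static_cast<size_t>(height) * c);
    }
    uint8_t &at(int x, int y, int c) { return data[index(x, y, c)]; }
    uint8_t get(int x, int y, int c) const { return data[index(x, y, c)]; }
};

// One entry of the gamma LUT, mirroring lut(i) in the pipeline.
inline uint8_t lut_entry(uint16_t i) {
    float v = std::pow(i / 255.0f, 1.2f) * 255.0f;
    return static_cast<uint8_t>(std::min(std::max(v, 0.0f), 255.0f));
}

// Clamp-to-edge read widened to 16 bits (padded followed by padded16).
inline uint16_t padded16(const Image &in, int x, int y, int c) {
    x = std::min(std::max(x, 0), in.width - 1);
    y = std::min(std::max(y, 0), in.height - 1);
    return in.get(x, y, c);
}

// Five-tap sharpen followed by the LUT, one output pixel at a time.
inline Image sharpen_and_curve(const Image &in) {
    assert(in.width > 0 && in.height > 0 && in.channels > 0);
    Image out(in.width, in.height, in.channels);
    for (int c = 0; c < in.channels; ++c) {
        for (int y = 0; y < in.height; ++y) {
            for (int x = 0; x < in.width; ++x) {
                int center = padded16(in, x, y, c);
                int neighbours = padded16(in, x - 1, y, c) + padded16(in, x, y - 1, c) +
                                 padded16(in, x + 1, y, c) + padded16(in, x, y + 1, c);
                // C++ promotes to int, so wrap back to 16 bits explicitly
                // to match unsigned 16-bit arithmetic.
                uint16_t sharpened = static_cast<uint16_t>(center * 2 - neighbours / 4);
                out.at(x, y, c) = lut_entry(sharpened);
            }
        }
    }
    return out;
}
```

Calling `sharpen_and_curve` on a small test image and comparing it with the buffer returned by `curved.realize(output)` is one way to sanity-check a new schedule, assuming the image is loaded into the same planar layout.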
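4.3 Hands-on steps
# Clone the project
git clone https://github.com/pengzhikang/Halide-Learning.git
cd Halide-Learning/learn-halide/halide-opencl
# Copy your own Halide build into 3rdparty/halide
cp -rf halide-install-path/* 3rdparty/halide
# Build the project
chmod +x build.sh
./build.sh
The timing results are shown below. The lines labeled GPU are the OpenCL run, which on this machine executes on the Intel CPU OpenCL runtime. Although both runs compute on the CPU, the OpenCL path is somewhat faster: 3.7742 ms versus 4.6495 ms for the plain CPU schedule.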
Running pipeline on CPU:
Running pipeline on GPU:
Target: x86-64-linux-avx-avx2-avx512-avx512_skylake-f16c-fma-opencl-sse41
Testing GPU correctness:
Testing performance on CPU:
4.6495 milliseconds
Testing performance on GPU:
3.7742 milliseconds
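With Target::OpenCL enabled, we can see which OpenCL APIs are called when the pipeline runs. The target string in the trace below also carries the debug feature, which is what makes the Halide runtime print these calls.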
Entering Pipeline f4
Target: x86-64-linux-avx-avx2-avx512-avx512_skylake-debug-f16c-fma-jit-opencl-sse41-user_context
    Input Buffer b0: buffer(0, 0x0, 0x7facf4120080, 1, uint8, {0, 768, 1}, {0, 1280, 768}, {0, 3, 983040})
    Input (void *) __user_context: 0x7ffc331b64e0
    Output Buffer f4: buffer(0, 0x0, 0x7facbb8a0080, 0, uint8, {0, 768, 1}, {0, 1280, 768}, {0, 3, 983040})
CL: halide_opencl_initialize_kernels (user_context: 0x7ffc331b64e0, state_ptr: 0x7fad16bff000, program: 0x7fad16c000c0, size: 7288
    load_libopencl (user_context: 0x7ffc331b64e0)
    Loaded OpenCL runtime library: libOpenCL.so
    create_opencl_context (user_context: 0x7ffc331b64e0)
CL: platform 0 Intel(R) CPU Runtime for OpenCL(TM) Applications
    Got platform 'Intel(R) CPU Runtime for OpenCL(TM) Applications', about to create context (t=13546)
    device name: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
    device vendor: Intel(R) Corporation
    device profile: FULL_PROFILE
    global mem size: 47926 MB
    max mem alloc size: 11981 MB
    local mem size: 32768
    max compute units: 32
    max workgroup size: 8192
    max work item dimensions: 3
    max work item sizes: 8192x8192x8192x0
    clCreateContext -> 0x55bb00fc0128
    clCreateCommandQueue 0x55bb00f912e8
halide_cuda_initialize_kernels got compilation_cache mutex.
    clCreateProgramWithSource -> 0x55bb011493a8
    clBuildProgram 0x55bb011493a8 -D MAX_CONSTANT_BUFFER_SIZE=131072 -D MAX_CONSTANT_ARGS=480
Caching compiled kernel: 0x55bb011493a8 id 2 context 0x55bb00fc0128
    Time: 5.710537e+02 ms
CL: halide_opencl_device_malloc (user_context: 0x7ffc331b64e0, buf: 0x7ffc331b5b40)
    allocating buffer(0, 0x0, 0x0, 0, uint8, {0, 65536, 1})
    clCreateBuffer -> 65536 0x55bb0131c998 device_handle: 0x55bb00f3bf10
    Allocated device buffer 0x55bb00f3bf10 for buffer 0x7ffc331b5b40
CL: validate 0x55bb0131c998 offset: 0: asked for 65536, actual allocated 65536
    Time: 4.213800e-02 ms
CL: halide_opencl_run (user_context: 0x7ffc331b64e0, entry: _kernel_f0_s0_v12_v18___block_id_x, blocks: 4096x1x1, threads: 16x1x1, shmem: 0
    clCreateKernel _kernel_f0_s0_v12_v18___block_id_x -> Time: 1.361441e+00 ms
    clSetKernelArg 0 8 [0x55bb00f3bf10 ...] 1
Mapped dev handle is: 0x55bb0131c998
    clSetKernelArg 1 0 [nullptr]
    clEnqueueNDRangeKernel 4096x1x1, 16x1x1 -> CL_SUCCESS
    Releasing kernel 0x55bb01011748
    clReleaseKernel finished0x55bb01011748
    Time: 1.849298e+00 ms
CL: halide_opencl_device_malloc (user_context: 0x7ffc331b64e0, buf: 0x55baffd54c68)
    allocating buffer(0, 0x0, 0x7facf4120080, 1, uint8, {0, 768, 1}, {0, 1280, 768}, {0, 3, 983040})
    clCreateBuffer -> 2949120 0x55baff28cf78 device_handle: 0x55bb00978520
    Allocated device buffer 0x55bb00978520 for buffer 0x55baffd54c68
CL: validate 0x55baff28cf78 offset: 0: asked for 2949120, actual allocated 2949120
    Time: 6.049300e-02 ms
CL: halide_opencl_buffer_copy (user_context: 0x7ffc331b64e0, src: 0x55baffd54c68, dst: 0x55baffd54c68)
CL: validate 0x55baff28cf78 offset: 0: asked for 0, actual allocated 2949120
    from host to device, 0x7facf4120080 + 0 -> 0x55bb00978520 + 0, 2949120 bytes
    Time: 2.930249e+00 ms
CL: halide_opencl_device_malloc (user_context: 0x7ffc331b64e0, buf: 0x55bb0102b678)
    allocating buffer(0, 0x0, 0x7facbb8a0080, 0, uint8, {0, 768, 1}, {0, 1280, 768}, {0, 3, 983040})
    clCreateBuffer -> 2949120 0x55baff54eca8 device_handle: 0x55bb008eeab0
    Allocated device buffer 0x55bb008eeab0 for buffer 0x55bb0102b678
CL: validate 0x55baff54eca8 offset: 0: asked for 2949120, actual allocated 2949120
    Time: 2.716900e-02 ms
CL: halide_opencl_run (user_context: 0x7ffc331b64e0, entry: _kernel_f4_s0_v10_v15___block_id_y, blocks: 96x160x1, threads: 10x10x1, shmem: 300
    clCreateKernel _kernel_f4_s0_v10_v15___block_id_y -> Time: 1.656029e+00 ms
    clSetKernelArg 0 8 [0x55bb00978520 ...] 1
Mapped dev handle is: 0x55baff28cf78
    clSetKernelArg 1 8 [0x55bb00f3bf10 ...] 1
Mapped dev handle is: 0x55bb0131c998
    clSetKernelArg 2 8 [0x55bb008eeab0 ...] 1
Mapped dev handle is: 0x55baff54eca8
    clSetKernelArg 3 4 [0x50000000300 ...] 0
    clSetKernelArg 4 4 [0x500 ...] 0
    clSetKernelArg 5 4 [0x0 ...] 0
    clSetKernelArg 6 4 [0x30000000000 ...] 0
    clSetKernelArg 7 4 [0x300 ...] 0
    clSetKernelArg 8 4 [0xf000000000000 ...] 0
    clSetKernelArg 9 4 [0x1e0000000f0000 ...] 0
    clSetKernelArg 10 4 [0x10101001e0000 ...] 0
    clSetKernelArg 11 300 [nullptr]
    clEnqueueNDRangeKernel 96x160x1, 10x10x1 -> CL_SUCCESS
    Releasing kernel 0x55bb016fb848
    clReleaseKernel finished0x55bb016fb848
    Time: 5.408939e+00 ms
CL: halide_opencl_device_free (user_context: 0x7ffc331b64e0, buf: 0x7ffc331b5b40) cl_mem: 0x55bb0131c998
CL: validate 0x55bb0131c998 offset: 0: asked for 0, actual allocated 65536
    clReleaseMemObject 0x55bb0131c998
    Time: 8.309700e-02 ms
Exiting Pipeline f4
CL: halide_opencl_finalize_kernels (user_context: 0x7ffc331b64e0, state_ptr: 0x2
CL: halide_opencl_device_free (user_context: 0x0, buf: 0x55bb0102b678) cl_mem: 0x55baff54eca8
CL: validate 0x55baff54eca8 offset: 0: asked for 0, actual allocated 2949120
    clReleaseMemObject 0x55baff54eca8
    Time: 2.133770e-01 ms
CL: halide_opencl_device_free (user_context: 0x0, buf: 0x55baffd54c68) cl_mem: 0x55baff28cf78
CL: validate 0x55baff28cf78 offset: 0: asked for 0, actual allocated 2949120
    clReleaseMemObject 0x55baff28cf78
    Time: 1.516690e-01 ms
Target: x86-64-linux-avx-avx2-avx512-avx512_skylake-debug-f16c-fma-opencl-sse41
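The buffer descriptors in the trace can be decoded with a little arithmetic. The sketch below assumes each brace triple is `{min, extent, stride}` for one dimension, with the stride counted in elements; that reading is my assumption, though it agrees with the 2949120-byte and 65536-byte allocations that `clCreateBuffer` reports. The `Dim` struct and both helper names are hypothetical and are not Halide API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One dimension as printed in the trace, assumed to be {min, extent, stride}.
struct Dim {
    int32_t min, extent, stride;
};

// Offset, in elements, of one coordinate from the start of the buffer.
inline size_t element_offset(const std::vector<Dim> &dims,
                             const std::vector<int32_t> &coords) {
    assert(dims.size() == coords.size());
    size_t offset = 0;
    for (size_t d = 0; d < dims.size(); ++d) {
        assert(dims[d].stride >= 0);
        assert(coords[d] >= dims[d].min && coords[d] < dims[d].min + dims[d].extent);
        offset += static_cast<size_t>(coords[d] - dims[d].min) *
                  static_cast<size_t>(dims[d].stride);
    }
    return offset;
}

// Number of elements needed so that every coordinate is addressable:
// the offset of the last element plus one.
inline size_t elements_spanned(const std::vector<Dim> &dims) {
    size_t last = 0;
    for (const Dim &d : dims) {
        assert(d.extent > 0 && d.stride >= 0);
        last += static_cast<size_t>(d.extent - 1) * static_cast<size_t>(d.stride);
    }
    return last + 1;
}
```

With the image descriptor `{0, 768, 1}, {0, 1280, 768}, {0, 3, 983040}` this gives a planar 768x1280 image with three channels of one byte each, and with `{0, 65536, 1}` it gives the 16-bit-wide LUT.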
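4.4 Analyzing how Halide calls OpenCL
How does Halide realize a pipeline with OpenCL? The trace above shows that Halide implements this pipeline with two kernels, `_kernel_f0_s0_v12_v18___block_id_x` and `_kernel_f4_s0_v10_v15___block_id_y`, and that the calls follow the ordinary OpenCL host API sequence: clCreateContext and clCreateCommandQueue, then clCreateProgramWithSource and clBuildProgram, then clCreateBuffer and a host-to-device copy, then clCreateKernel, clSetKernelArg, and clEnqueueNDRangeKernel for each kernel, and finally clReleaseKernel and clReleaseMemObject.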
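The launch sizes of the two kernels can be related back to the schedule with simple arithmetic. The sketch below is my own reading of the trace rather than anything Halide documents: it assumes the first kernel is the LUT, split into 16-wide blocks, and the second is `curved` in 8x8 tiles, with `padded` staged per tile including a one-pixel border on each side. That border would explain the 10x10 threads and the 300 bytes of local memory. The `Launch` struct and the function names are hypothetical.

```cpp
#include <cassert>

// Hypothetical summary of one kernel launch as printed in the trace.
struct Launch {
    int blocks_x, blocks_y;
    int threads_x, threads_y;
    int local_bytes;
};

inline int ceil_div(int a, int b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

// 1D launch: `extent` elements split into blocks of `threads` work items.
inline Launch launch_1d(int extent, int threads) {
    return {ceil_div(extent, threads), 1, threads, 1, 0};
}

// 2D tiled launch whose input tile, grown by `halo` pixels on every side,
// is staged in local memory for all channels.
inline Launch launch_tiled(int width, int height, int tile, int halo,
                           int channels, int bytes_per_element) {
    int staged = tile + 2 * halo;
    return {ceil_div(width, tile), ceil_div(height, tile), staged, staged,
            staged * staged * channels * bytes_per_element};
}
```

Under these assumptions, `launch_1d(65536, 16)` reproduces the `4096x1x1` by `16x1x1` launch of the first kernel, and `launch_tiled(768, 1280, 8, 1, 3, 1)` reproduces the `96x160x1` by `10x10x1` launch with `shmem: 300` of the second.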