Fundamentals of Accelerating HALCON Algorithms (Multi-Core Parallelism / GPU)

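I. There are several ways to increase HALCON's processing speed:

1. Multithreading

2. Automatic Operator Parallelization (AOP)

3. Compute devices: use the GPU for acceleration. With a powerful graphics card, supported operators can run 5 to 10 times faster or more, although the gain depends on the operator, the image size, and the cost of transferring data to and from the GPU.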

II. Multithreading

1. The official example get_operator_info.hdev shows which operators support multithreading:

* Determine the multithreading information
get_multithreading_operators (TypeExclusive, TypeMutual, TypeReentrant, TypeIndependent)
* Expanding the procedure above shows that it is built on the get_operator_info operator
* get names of all operators of the library
get_operator_name ('', OperatorNames)
get_operator_info (OperatorNames[Index], 'parallelization', Information)

2. Official manuals

C:\Program Files\MVTec\HALCON-19.11-Progress\doc\pdf\manuals\programmers_guide.pdf

Chapter 2 Parallel Programming and HALCON

C:\Program Files\MVTec\HALCON-19.11-Progress\doc\pdf\reference\reference_hdevelop.pdf

Chapter 25 System --- 25.6 Multithreading

III. Multi-core parallelism

The official description of HALCON's multi-core performance is as follows.

1. Automatic Operator Parallelization (AOP)

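Multi-core and multi-processor computers significantly speed up machine vision systems. For more than eight years, HALCON has offered industry-proven operator parallelization that supports this speedup. Of course, not every vision operation benefits from parallelization. HALCON therefore decides automatically whether to parallelize, taking into account the specific algorithm, its input values, and the hardware.

On a multi-core computer, Parallel HALCON automatically distributes data, such as image data, across multiple threads, one per core. Users do not even need to modify existing HALCON programs to use this automatic partitioning, so they gain a significant speedup immediately.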

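2. Parallel programming

HALCON supports parallel programming, for example multithreaded programs. It is not only thread-safe but also reentrant, so multiple threads can call HALCON operators at the same time. With this property, a machine vision application can be split into independent parts that run in parallel on different processors.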
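As a minimal sketch of this idea in HDevelop, the two calls below are independent of each other and run in separate sub-threads via par_start; par_join waits for both. The image name 'fabrik' (a HALCON example image), the choice of filters, and their parameters are assumptions made for illustration only.

```hdevelop
* Sketch: run two independent operators in parallel sub-threads.
* 'fabrik' is assumed to be available in the HALCON image directory.
read_image (Image, 'fabrik')
* Each par_start call returns immediately and stores a thread ID.
par_start <ThreadMean> : mean_image (Image, ImageMean, 9, 9)
par_start <ThreadMedian> : median_image (Image, ImageMedian, 'circle', 3, 'mirrored')
* Wait until both sub-threads have finished before using the results.
par_join ([ThreadMean, ThreadMedian])
```

This is task parallelism at application level. It is separate from AOP, which splits the data of a single operator call.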

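When an operator runs on a quad-core computer, HALCON automatically splits the image into four parts, which are processed in parallel by four threads.

On a computer with two Quad-Core Intel Xeon E5345 processors (2.33 GHz), filtering a 1280×1024 image with the median_image operator (13×13 mask) yields speedup factors of 1 / 1.96 / 2.90 / 3.79 / 4.51 / 5.48 / 6.34 / 6.93 for one to eight CPU cores. Note: the maximum achievable speedup depends on the HALCON operator used and on the image size.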
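The speedup factors quoted above can be turned into a parallel efficiency (speedup divided by core count), which makes the diminishing returns easier to see. The tuple names below are illustrative; the numbers are the ones from the text.

```hdevelop
* Speedup factors for 1..8 cores, as quoted above for median_image.
Speedup := [1.0, 1.96, 2.90, 3.79, 4.51, 5.48, 6.34, 6.93]
Cores := [1, 2, 3, 4, 5, 6, 7, 8]
* Element-wise tuple division: efficiency = speedup / cores.
Efficiency := Speedup / Cores
* Efficiency[1] is 0.98 (two cores); Efficiency[7] is about 0.87 (eight cores).
```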

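3. AOP is enabled by default

(1) HALCON provides automatic operator parallelization (AOP) on the one hand and, on the other hand, the means to parallelize parts of an application manually. AOP splits the input data (for example, images) into several parts and processes them independently and in parallel. This is also called data parallelization.
AOP is enabled by default, i.e. this type of parallelization happens automatically, and in many cases you do not need to care about further data parallelization, at least for a single operator. For details on AOP, see programmers_guide.pdf, Section 2.1.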

From the official example query_system_parameters.hdev:
* Parallelization
get_system ('processor_num', ProcessorNum)
get_system ('thread_pool', ThreadPool)
get_system ('thread_num', ThreadNum)
* Automatic Operator Parallelization; the default is 'true'
get_system ('parallelize_operators', AOP)
* 'reentrant' controls whether HALCON runs in reentrant mode, i.e. whether
* operators may be called from several threads at once; the default is 'true'
get_system ('reentrant', Reentrant)
* Disable AOP on purpose to measure its effect on performance
*set_system('parallelize_operators','false')

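(2) HALCON also provides the optimize_aop operator, which tunes AOP for better performance.

By default (i.e. without optimize_aop), HALCON uses the maximum number of threads available for AOP, up to the number of processors. Depending on the size of the data passed to an operator and on its parameter set, however, parallelizing over the maximum number of threads can be excessive and inefficient. optimize_aop optimizes AOP with respect to the thread number and checks the given hardware for its potential for parallel processing of HALCON operators. It examines every operator that can be sped up by automatic parallelization on tuple, channel, or domain level (the partial level is not considered). Each examined operator is executed several times, both sequentially and in parallel, with a changing set of input parameter values and images. The latter helps to evaluate how the characteristics of an operator's input parameters (for example, the size of an input image) relate to its parallel efficiency. Depending on the operator and parameter settings, this may take up to several hours. For a correct optimization, it is essential that no other computation-intensive application runs on the computer at the same time, because this would strongly distort the time measurements of the hardware check and lead to wrong results.

For details, see the official example optimize_aop.hdev.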
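Before running a full optimize_aop pass, a quick manual check can show whether the maximum thread count is actually the fastest setting for a given operator and image size. The sketch below varies the 'thread_num' system parameter and times one operator; the image 'fabrik', the operator, and its parameters are placeholder assumptions, and a single run per setting is only a rough measurement.

```hdevelop
* Remember the current AOP thread count so it can be restored.
get_system ('thread_num', OldThreadNum)
get_system ('processor_num', NumCPUs)
read_image (Image, 'fabrik')
Times := []
for N := 1 to NumCPUs by 1
    * Limit AOP to N threads for the following operator calls.
    set_system ('thread_num', N)
    count_seconds (T0)
    median_image (Image, ImageMedian, 'circle', 6, 'mirrored')
    count_seconds (T1)
    * Runtime in milliseconds when using N threads.
    Times := [Times, (T1 - T0) * 1000]
endfor
* Restore the original setting.
set_system ('thread_num', OldThreadNum)
```

If the entries of Times stop decreasing well before N reaches NumCPUs, this suggests that the default thread count is more than this workload can use.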

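4. Finding the operators that support AOP

Automatic parallelization methods: to parallelize operators automatically, HALCON exploits data parallelism, i.e. the property that parts of an operator's input data can be processed independently of each other. Data parallelism can be found at four levels, which the official example get_operator_info.hdev lets you inspect:

(1) tuple level (2) channel level (3) domain level (4) internal data level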

* Determine the parallelization method of all parallelized operators
get_parallel_method_operators (SplitTuple, SplitChannel, SplitDomain, SplitPartial, None)
AutoParallel := [SplitTuple,SplitChannel,SplitDomain,SplitPartial]
AutoParallel := uniq(sort(AutoParallel))
* Expanding the procedure above shows that it is built on the get_operator_info operator
* get names of all operators of the library
get_operator_name ('', OperatorNames)
get_operator_info (OperatorNames[Index], 'parallel_method', Information)

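5. If you do not want to use AOP and prefer to implement the parallelization yourself, things get more complex: you need multithreading to split the image, process the parts, and merge the results again. This requires more expertise; see the official example simulate_aop.hdev and the manual parallel_programming.pdf.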

*set_system('parallelize_operators','false')
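A rough sketch of the split, process, and merge pattern described above, using HDevelop's par_start. It uses invert_image because a point operation needs no overlap between the parts; neighborhood filters would require overlapping borders, which this sketch does not handle. The image name and the two-way split are assumptions for illustration; simulate_aop.hdev remains the reference.

```hdevelop
* Switch AOP off so that only the manual split parallelizes the work.
set_system ('parallelize_operators', 'false')
read_image (Image, 'fabrik')
get_image_size (Image, Width, Height)
* Split the image into an upper and a lower part.
HeightTop := Height / 2
HeightBottom := Height - HeightTop
crop_part (Image, PartTop, 0, 0, Width, HeightTop)
crop_part (Image, PartBottom, HeightTop, 0, Width, HeightBottom)
* Process both parts in parallel sub-threads.
par_start <ThreadTop> : invert_image (PartTop, ResultTop)
par_start <ThreadBottom> : invert_image (PartBottom, ResultBottom)
par_join ([ThreadTop, ThreadBottom])
* Merge: place each part at its original row offset in a full-size image.
concat_obj (ResultTop, ResultBottom, Parts)
tile_images_offset (Parts, Result, [0, HeightTop], [0, 0], [-1, -1], [-1, -1], [-1, -1], [-1, -1], Width, Height)
* Re-enable AOP for the rest of the program.
set_system ('parallelize_operators', 'true')
```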

6. Official manuals

C:\Program Files\MVTec\HALCON-19.11-Progress\doc\pdf\manuals\programmers_guide.pdf

Chapter 2 Parallel Programming and HALCON

C:\Program Files\MVTec\HALCON-19.11-Progress\doc\pdf\reference\reference_hdevelop.pdf

Chapter 25 System --- 25.8 Parallelization
C:\Program Files\MVTec\HALCON-19.11-Progress\doc\pdf\manuals\parallel_programming.pdf

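IV. GPU

1. Using the GPU in HALCON can give a noticeable speedup.

To find out which graphics card your computer has, open the Windows Start menu, choose Run, enter dxdiag, and look at the Display tab.

The official example compute_devices.hdev demonstrates the basics. To obtain meaningful timings, first switch off HDevelop's automatic display updates with dev_update_off ().

From the official example compute_devices.hdev: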
* This example shows how to use compute devices with HALCON.
*
dev_update_off ()
dev_close_window ()
dev_open_window_fit_size (0, 0, 640, 480, -1, -1, WindowHandle)
set_display_font (WindowHandle, 16, 'mono', 'true', 'false')
*
* Get list of all available compute devices.
query_available_compute_devices (DeviceIdentifier)
*
* End example if no device could be found.
if (|DeviceIdentifier| == 0)
    return ()
endif
*
* Display basic information on detected devices.
disp_message (WindowHandle, 'Found ' + |DeviceIdentifier| + ' Compute Device(s):', 'window', 12, 12, 'black', 'true')
for Index := 0 to |DeviceIdentifier| - 1 by 1
    get_compute_device_info (DeviceIdentifier[Index], 'name', DeviceName)
    get_compute_device_info (DeviceIdentifier[Index], 'vendor', DeviceVendor)
    Message[Index] := 'Device #' + Index + ': ' + DeviceVendor + ' ' + DeviceName
endfor
disp_message (WindowHandle, Message, 'window', 42, 12, 'white', 'false')
disp_continue_message (WindowHandle, 'black', 'true')
stop ()

2. Operators for working with GPU devices:

query_available_compute_devices

get_compute_device_info

open_compute_device

init_compute_device

activate_compute_device

deactivate_compute_device
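The operators above are typically called in the order shown in the following sketch, which simply takes the first device reported by the system instead of matching a device name. The image name 'fabrik' and the choice of edges_sub_pix with these parameter values are assumptions for illustration.

```hdevelop
query_available_compute_devices (DeviceIdentifiers)
if (|DeviceIdentifiers| > 0)
    * Query the name of the device that will be used.
    get_compute_device_info (DeviceIdentifiers[0], 'name', DeviceName)
    * Open the first device and prepare the operator that will run on it.
    open_compute_device (DeviceIdentifiers[0], DeviceHandle)
    init_compute_device (DeviceHandle, 'edges_sub_pix')
    * From here on, supported operators run on the device.
    activate_compute_device (DeviceHandle)
    read_image (Image, 'fabrik')
    edges_sub_pix (Image, Edges, 'canny', 1.0, 20, 40)
    * Return to CPU processing.
    deactivate_compute_device (DeviceHandle)
endif
```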

3. The official example get_operator_info.hdev also shows which operators support GPU acceleration (OpenCL):

* Determine all operators that support OpenCL
get_opencl_operators (OpenCLSupport)
* Expanding the procedure above shows that it is built on the get_operator_info operator
get_operator_name ('', OperatorNames)
get_operator_info (OperatorNames[Index], 'compute_device', Information)

For example, HALCON 19.11 lists 82 operators that can be accelerated:

['abs_diff_image', 'abs_image', 'acos_image', 'add_image', 'affine_trans_image', 'affine_trans_image_size', 'area_center_gray', 'asin_image', 'atan2_image', 'atan_image', 'binocular_disparity_ms', 'binocular_distance_ms', 'binomial_filter', 'cfa_to_rgb', 'change_radial_distortion_image', 'convert_image_type', 'convol_image', 'cos_image', 'crop_domain', 'crop_part', 'crop_rectangle1', 'depth_from_focus', 'derivate_gauss', 'deviation_image', 'div_image', 'edges_image', 'edges_sub_pix', 'exp_image', 'find_ncc_model', 'find_ncc_models', 'gamma_image', 'gauss_filter', 'gauss_image', 'gray_closing_rect', 'gray_closing_shape', 'gray_dilation_rect', 'gray_dilation_shape', 'gray_erosion_rect', 'gray_erosion_shape', 'gray_histo', 'gray_opening_rect', 'gray_opening_shape', 'gray_projections', 'gray_range_rect', 'highpass_image', 'image_to_world_plane', 'invert_image', 'linear_trans_color', 'lines_gauss', 'log_image', 'lut_trans', 'map_image', 'max_image', 'mean_image', 'median_image', 'median_rect', 'min_image', 'mirror_image', 'mult_image', 'points_harris', 'polar_trans_image', 'polar_trans_image_ext', 'polar_trans_image_inv', 'pow_image', 'principal_comp', 'projective_trans_image', 'projective_trans_image_size', 'rgb1_to_gray', 'rgb3_to_gray', 'rotate_image', 'scale_image', 'sin_image', 'sobel_amp', 'sobel_dir', 'sqrt_image', 'sub_image', 'tan_image', 'texture_laws', 'trans_from_rgb', 'trans_to_rgb', 'zoom_image_factor', 'zoom_image_size']

4. Official manual

C:\Program Files\MVTec\HALCON-19.11-Progress\doc\pdf\reference\reference_hdevelop.pdf

Chapter 25 System --- 25.1 Compute Devices

V. Test example

* See the official examples optimize_aop.hdev, query_aop_info.hdev, and simulate_aop.hdev.
* Example: performance test of the edges_sub_pix operator
dev_update_off () // switch off HDevelop display updates so they do not distort the timings
dev_close_window ()
dev_open_window_fit_size (0, 0, 640, 480, -1, -1, WindowHandle)
set_display_font (WindowHandle, 16, 'mono', 'true', 'false')
get_system ('processor_num', NumCPUs)
get_system ('parallelize_operators', AOP)
* Read the image
read_image (Image, 'D:/hellowprld/2/1-.jpg')
* Convert color images to gray scale
count_channels (Image, Channels)
if (Channels == 3 or Channels == 4)
    rgb1_to_gray (Image, ImageGray)
else
    * Already a gray image: use it unchanged
    copy_image (Image, ImageGray)
endif
alpha := 5
low := 10
high := 20
* Test 1: AOP disabled, i.e. no parallel acceleration
set_system ('parallelize_operators', 'false')
get_system ('parallelize_operators', AOP)
count_seconds (T0)
edges_sub_pix (ImageGray, Edges1, 'canny', alpha, low, high)
count_seconds (T1)
Time0 := (T1 - T0) * 1000
stop ()
* Test 2: automatic parallel acceleration with AOP
* AOP is enabled by default in HALCON, i.e. 'parallelize_operators' is 'true'
set_system ('parallelize_operators', 'true')
count_seconds (T1)
edges_sub_pix (ImageGray, Edges1, 'canny', alpha, low, high)
count_seconds (T2)
Time1 := (T2 - T1) * 1000
stop ()
* Test 3: GPU acceleration; HALCON 19.11 has 82 operators that support it
* The data is first copied from the CPU to the GPU, processed there, and then
* copied back from the GPU to the CPU. Both transfers take time.
* GPU acceleration is therefore not necessarily faster than plain AOP;
* the result depends on the graphics card.
query_available_compute_devices (DeviceIdentifiers)
DeviceHandle := 0
for i := 0 to |DeviceIdentifiers| - 1 by 1
    get_compute_device_info (DeviceIdentifiers[i], 'name', DeviceName)
    * Open the GPU by its name; adjust the name to your own graphics card
    if (DeviceName == 'GeForce GT 630')
        open_compute_device (DeviceIdentifiers[i], DeviceHandle)
        break
    endif
endfor
if (DeviceHandle # 0)
    set_compute_device_param (DeviceHandle, 'asynchronous_execution', 'false')
    init_compute_device (DeviceHandle, 'edges_sub_pix')
    activate_compute_device (DeviceHandle)
    * Query the cache settings of the graphics card (only valid if a device was opened)
    get_compute_device_param (DeviceHandle, 'buffer_cache_capacity', GenParamValue0) // default: one third of the device memory
    get_compute_device_param (DeviceHandle, 'buffer_cache_used', GenParamValue1)
    get_compute_device_param (DeviceHandle, 'image_cache_capacity', GenParamValue2)
    get_compute_device_param (DeviceHandle, 'image_cache_used', GenParamValue3)
    *GenParamValue0 := GenParamValue0 / 3
    *set_compute_device_param (DeviceHandle, 'buffer_cache_capacity', GenParamValue0)
    *get_compute_device_param (DeviceHandle, 'buffer_cache_capacity', GenParamValue4)
endif
count_seconds (T3)
* If the device memory is insufficient, this call fails with
* error #4104 : Out of compute device memory
edges_sub_pix (ImageGray, Edges1, 'canny', alpha, low, high)
count_seconds (T4)
Time2 := (T4 - T3) * 1000
if (DeviceHandle # 0)
    deactivate_compute_device (DeviceHandle)
endif
stop ()
* Test 4: manual AOP optimization
set_system ('parallelize_operators', 'true')
get_system ('parallelize_operators', AOP)
* 4.1 - optimize the thread number with the 'threshold' model
optimize_aop ('edges_sub_pix', 'byte', 'no_file', ['file_mode','model','parameters'], ['nil','threshold','false'])
count_seconds (T5)
edges_sub_pix (ImageGray, Edges1, 'canny', alpha, low, high)
count_seconds (T6)
Time3 := (T6 - T5) * 1000
* 4.2 - optimize the thread number with the 'linear' model
optimize_aop ('edges_sub_pix', 'byte', 'no_file', ['file_mode','model','parameters'], ['nil','linear','false'])
count_seconds (T7)
edges_sub_pix (ImageGray, Edges1, 'canny', alpha, low, high)
count_seconds (T8)
Time4 := (T8 - T7) * 1000
stop ()
* 4.3 - optimize the thread number with the 'mlp' model
optimize_aop ('edges_sub_pix', 'byte', 'no_file', ['file_mode','model','parameters'], ['nil','mlp','false'])
count_seconds (T9)
edges_sub_pix (ImageGray, Edges1, 'canny', alpha, low, high)
count_seconds (T10)
Time5 := (T10 - T9) * 1000
stop ()
dev_clear_window ()
Message := 'edges_sub_pix runtimes:'
Message[1] := 'CPU only Time0 without AOP='+Time0+'ms,'
Message[2] := 'CPU only Time1 with AOP='+Time1+'ms,'
Message[3] := 'GPU use Time2='+Time2+'ms,'
Message[4] := 'optimize Time3 threshold='+Time3+'ms'
Message[5] := 'optimize Time4 linear='+Time4+'ms'
Message[6] := 'optimize Time5 mlp='+Time5+'ms'
disp_message (WindowHandle, Message, 'window', 12, 12, 'red', 'false')
stop ()

Performance test results for the edges_sub_pix operator:

Performance test results for the rotate_image operator:

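The conclusions are:

1. GPU acceleration first copies the data from the CPU to the GPU, processes it there, and then copies the result back from the GPU to the CPU. Both transfers take time.
2. Is GPU acceleration always faster than ordinary AOP? Not necessarily. The result depends on the performance of the graphics card and on how much of the runtime the transfers in point 1 consume.

3. With GPU acceleration, insufficient device memory causes the error "error #4104 : Out of compute device memory".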
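One possible way to deal with point 3 is to catch the exception and repeat the call on the CPU. The sketch below is an assumption about how such a fallback could look, not part of the original test program; the error code 4104 is taken from the message quoted above, and the image name and operator parameters are placeholders.

```hdevelop
read_image (Image, 'fabrik')
query_available_compute_devices (DeviceIdentifiers)
if (|DeviceIdentifiers| > 0)
    open_compute_device (DeviceIdentifiers[0], DeviceHandle)
    activate_compute_device (DeviceHandle)
    try
        edges_sub_pix (Image, Edges, 'canny', 1.0, 20, 40)
        deactivate_compute_device (DeviceHandle)
    catch (Exception)
        * Always return to the CPU first, then decide how to proceed.
        deactivate_compute_device (DeviceHandle)
        if (Exception[0] == 4104)
            * Out of compute device memory: repeat the call on the CPU.
            edges_sub_pix (Image, Edges, 'canny', 1.0, 20, 40)
        else
            * Pass on any other error unchanged.
            throw (Exception)
        endif
    endtry
else
    * No compute device available: run on the CPU.
    edges_sub_pix (Image, Edges, 'canny', 1.0, 20, 40)
endif
```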

The complete *.hdev project file is available for download on CSDN under the same title as this article.

---

See also the companion article "Fundamentals of Accelerating OpenCV Algorithms" (《关于实现OpenCV算法加速的基础知识》).
