TVM Optimization and C++ Deployment in Practice

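Importing a neural network model with TVM:
TVM supports models from PyTorch, TensorFlow, ONNX, Caffe and others. Since PyTorch is what I use most, here is one way to import a PyTorch model.
GitHub repository: https://github.com/leoluopy/autotvm_tutorial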

import torch
import tvm
from tvm import relay
from tvm.contrib import graph_executor


def relay_import_from_torch(model, direct_to_mod_param=False):
    # The model input is in NCHW order; TVM currently supports dynamic shapes
    input_shape = [1, 3, 544, 960]
    input_data = torch.randn(input_shape)
    # Run the model once on random data to record the tensor operations
    scripted_model = torch.jit.trace(model, input_data).eval()

    input_name = "input0"
    shape_list = [(input_name, input_shape)]
    # Import the model and its weights
    mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)
    if direct_to_mod_param:
        return mod, params

    # target = tvm.target.Target("llvm", host="llvm")
    # dev = tvm.cpu(0)
    # Set the target platform and device id; other platforms such as ARM GPUs or iPhone GPUs also work
    target = tvm.target.cuda()
    dev = tvm.device(str(target), 0)
    with tvm.transform.PassContext(opt_level=3):
        # Compile the model for the target platform; the result is kept in lib and can be exported later
        lib = relay.build(mod, target=target, params=params)
    # Initialize the graph_executor from the compiled lib; it is used for inference later
    tvm_model = graph_executor.GraphModule(lib["default"](dev))
    return tvm_model, dev

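This initializes the graph_executor needed for inference. The code is simple and can be taken from the GitHub repository. Here is another approach: export the compiled model as a .so file, then load that .so file for inference.
Exporting inference code for the target platform with TVM:
lib.export_library("centerFace_relay.so")
Note that no schedule parameter search has been done at this point. There is already some speedup over the original PyTorch interface, but the full potential has not been reached yet.
TVM Python inference interface in practice:
Here is the code. The .so file is the inference library exported above; it can also be the library produced by the search described later.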

import time

import cv2
import torch
import tvm
from tvm.contrib import graph_executor as runtime

# centerFacePreprocess, centerFacePostProcess and centerFaceWriteOut come from the repository
frame = cv2.imread("./ims/6.jpg")
target = tvm.target.cuda()
dev = tvm.device(str(target), 0)
lib = tvm.runtime.load_module("centerFace_relay.so")
tvm_centerPoseModel = runtime.GraphModule(lib["default"](dev))
input_tensor, img_h_new, img_w_new, scale_w, scale_h, raw_scale = centerFacePreprocess(frame)
tvm_centerPoseModel.set_input("input0", tvm.nd.array(input_tensor.astype("float32")))
for i in range(100):
    # Inference speed demo; the time stabilizes after several runs
    t0 = time.time()
    tvm_centerPoseModel.run()
    print("tvm inference cost: {}".format(time.time() - t0))
heatmap, scale, offset, lms = (
    torch.tensor(tvm_centerPoseModel.get_output(i).asnumpy()) for i in range(4)
)
dets, lms = centerFacePostProcess(heatmap, scale, offset, lms, img_h_new, img_w_new, scale_w, scale_h, raw_scale)
centerFaceWriteOut(dets, lms, frame)
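The timing loop above prints every run, including the first slow ones. As a minimal sketch of the same idea, the helper below separates warm-up runs from timed runs and reports mean and standard deviation in milliseconds. The name benchmark and its parameters are my own for illustration and are not part of TVM; the workload here is a stand-in, and in the setting above it would be tvm_centerPoseModel.run.

```python
import statistics
import time


def benchmark(fn, warmup=10, repeat=100):
    """Time fn() after a warm-up phase; return (mean_ms, std_ms, samples_ms)."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    samples_ms = []
    for _ in range(repeat):
        t0 = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - t0) * 1000.0)
    return statistics.mean(samples_ms), statistics.pstdev(samples_ms), samples_ms


if __name__ == "__main__":
    # Stand-in workload (hypothetical); replace with the model's run function
    mean_ms, std_ms, _ = benchmark(lambda: sum(range(10_000)), warmup=5, repeat=20)
    print("mean %.3f ms (std %.3f ms)" % (mean_ms, std_ms))
```

Discarding the warm-up samples matters because the first calls typically pay one-off costs, which is why the loop above only stabilizes after several runs.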

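A complete pipeline now works: import the model with TVM -> compile and export the .so library -> load the .so library -> run inference
The compile-and-export step produces a .so library; on Windows it produces a .dll. Compilation uses TVM's default schedule parameters, which already gives some speedup. In my tests, a CenterFace PyTorch model took about 12 ms to process an image of roughly 500,000 pixels [1080 Ti]; after default compilation, inference takes about 6 ms.
Beyond the default schedule parameters, better schedule parameters can be found by search. Under the same test conditions, CenterFace inference then takes 3.5 ms, close to twice as fast as the default build (6 ms to 3.5 ms, about 1.7x). The key point is that model accuracy is not affected!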
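To make the comparison concrete, here is the arithmetic on the three timings quoted above (12 ms, 6 ms, 3.5 ms), written as a small sketch. The dictionary keys are labels I chose for illustration.

```python
def speedup(baseline_ms, optimized_ms):
    """How many times faster the optimized run is than the baseline."""
    return baseline_ms / optimized_ms


# Timings quoted in the text, in milliseconds
timings_ms = {"pytorch": 12.0, "tvm_default": 6.0, "tvm_tuned": 3.5}

if __name__ == "__main__":
    print("%.2fx" % speedup(timings_ms["pytorch"], timings_ms["tvm_default"]))    # 2.00x
    print("%.2fx" % speedup(timings_ms["tvm_default"], timings_ms["tvm_tuned"]))  # 1.71x
    print("%.2fx" % speedup(timings_ms["pytorch"], timings_ms["tvm_tuned"]))      # 3.43x
```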
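The overall pipeline then becomes:
import the model with TVM -> search for the best schedule parameters -> compile and export the .so library -> load the .so library -> run inference
Searching for the best inference code with AutoTVM:
Python search code: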
import tvm
from tvm import autotvm, relay


def case_autotvm_relay_centerFace():
    # InitCenterFacePy wraps the PyTorch model-loading code
    model = InitCenterFacePy()
    # After the search finishes, TVM saves the results to the .log file
    log_file = "centerFace.log"
    dtype = "float32"
    # Initialize the tuner and the tuning options
    tuning_option = {
        "log_filename": log_file,
        "tuner": "xgb",
        # "n_trial": 1,
        "n_trial": 2000,
        "early_stopping": 600,
        "measure_option": autotvm.measure_option(
            builder=autotvm.LocalBuilder(timeout=10),
            runner=autotvm.LocalRunner(number=20, repeat=3, timeout=4, min_repeat_ms=150),
        ),
    }
    print("Extract tasks centerFace...")
    mod, params = relay_import_from_torch(model.module.cpu(), direct_to_mod_param=True)
    input_shape = [1, 3, 544, 960]
    target = tvm.target.cuda()
    tasks = autotvm.task.extract_from_program(
        mod["main"], target=target, params=params, ops=(relay.op.get("nn.conv2d"),)
    )
    # run tuning tasks
    print("Tuning...")
    tune_tasks(tasks, **tuning_option)
    # compile kernels with history best records
    # After the search finishes, measure the inference time
    profile_autvm_centerFace(mod, target, params, input_shape, dtype, log_file)
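The n_trial and early_stopping options above bound the search: at most n_trial configurations are measured, and the search stops early once no better configuration has been found for early_stopping consecutive trials. The following is a toy sketch of that stopping rule only. It is not AutoTVM's implementation; the function name, the candidate list and the cost function are all made up for illustration.

```python
import random


def tune_with_early_stopping(measure, candidates, n_trial, early_stopping):
    """Measure up to n_trial candidates; stop once the best cost has not
    improved for early_stopping consecutive trials.
    Returns (best_candidate, best_cost, trials_run)."""
    best_candidate, best_cost = None, float("inf")
    since_improvement = 0
    trials_run = 0
    for candidate in candidates[:n_trial]:
        cost = measure(candidate)
        trials_run += 1
        if cost < best_cost:
            best_candidate, best_cost = candidate, cost
            since_improvement = 0
        else:
            since_improvement += 1
            if since_improvement >= early_stopping:
                break
    return best_candidate, best_cost, trials_run


def toy_cost(tile):
    """Made-up cost: pretend a tile size of 16 is the fastest."""
    return abs(tile - 16) + 1.0


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical search space of tile sizes
    candidates = [rng.choice([1, 2, 4, 8, 16, 32, 64]) for _ in range(2000)]
    best, cost, used = tune_with_early_stopping(
        toy_cost, candidates, n_trial=2000, early_stopping=600
    )
    print("best tile:", best, "cost:", cost, "trials used:", used)
```

With the values used in the article (n_trial=2000, early_stopping=600), a task can finish well before 2000 measurements if the best result stops improving.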
Verifying inference time with TVM:
TVM provides timing utilities; the code is below.
import numpy as np
import tvm
from tvm import autotvm, relay
from tvm.contrib import graph_executor as runtime


def profile_autvm_centerFace(mod, target, params, input_shape, dtype, log_file):
    with autotvm.apply_history_best(log_file):
        print("Compile...")
        with tvm.transform.PassContext(opt_level=3):
            lib = relay.build_module.build(mod, target=target, params=params)
        # load parameters
        dev = tvm.device(str(target), 0)
        module = runtime.GraphModule(lib["default"](dev))
        data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
        module.set_input("input0", data_tvm)
        # evaluate
        print("Evaluate inference time cost...")
        ftimer = module.module.time_evaluator("run", dev, number=1, repeat=100)
        prof_res = np.array(ftimer().results) * 1000  # convert to millisecond
        print(
            "Mean inference time (std dev): %.2f ms (%.2f ms)"
            % (np.mean(prof_res), np.std(prof_res))
        )
        lib.export_library("centerFace_relay.so")
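TVM C++ inference interface in practice:
The Python side is now covered, and we have a dynamic library compiled for the target platform. Deploying a neural network involves more than inference; the surrounding code usually runs in scenarios with strict efficiency requirements, so C++ is generally the language used on the target platform. With the .so library in hand, how do we run inference? The code follows. [Two main parts; the complete code is in the git repository, or available via the author's Knowledge Planet (知识星球) group.]
Initialization: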
DLDevice dev{kDLGPU, 0};  // newer DLPack releases name this device type kDLCUDA
// On Windows the suffix should be dll
mod_factory = tvm::runtime::Module::LoadFromFile(lib_path, "so");
// Get the model instance gmod from the dynamic library
gmod = mod_factory.GetFunction("default")(dev);
// Get the packed functions: set input, get output, run
set_input = gmod.GetFunction("set_input");
get_output = gmod.GetFunction("get_output");
run = gmod.GetFunction("run");
// Use the C++ API
// Input and output buffers, allocated on the GPU device
x = tvm::runtime::NDArray::Empty({1, 3, 544, 960}, DLDataType{kDLFloat, 32, 1}, dev);
heatmap_gpu = tvm::runtime::NDArray::Empty({1, 1, 136, 240}, DLDataType{kDLFloat, 32, 1}, dev);
scale_gpu = tvm::runtime::NDArray::Empty({1, 2, 136, 240}, DLDataType{kDLFloat, 32, 1}, dev);
offset_gpu = tvm::runtime::NDArray::Empty({1, 2, 136, 240}, DLDataType{kDLFloat, 32, 1}, dev);
lms_gpu = tvm::runtime::NDArray::Empty({1, 10, 136, 240}, DLDataType{kDLFloat, 32, 1}, dev);
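Inference:
Worth noting: cv::dnn::blobFromImage is a really handy function. It builds the input memory block in NCHW layout, OpenCV has built-in OpenMP acceleration, and the function also works well on the Raspberry Pi and on various phones.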

        int h = frame.rows;
        int w = frame.cols;
        // Round up to a multiple of 32; divide by 32.0 so that ceil sees a fractional value
        float img_h_new = int(ceil(h / 32.0) * 32);
        float img_w_new = int(ceil(w / 32.0) * 32);
        float scale_h = img_h_new / float(h);
        float scale_w = img_w_new / float(w);
        cv::Mat input_tensor = cv::dnn::blobFromImage(frame, 1.0, cv::Size(img_w_new, img_h_new),
                                                      cv::Scalar(0, 0, 0), true, false, CV_32F);
        x.CopyFromBytes(input_tensor.data, 1 * 3 * 544 * 960 * sizeof(float));
        set_input("input0", x);
        timeval t0, t1;
        gettimeofday(&t0, NULL);
        run();
        gettimeofday(&t1, NULL);
        printf("inference cost: %f \n", t1.tv_sec - t0.tv_sec + (t1.tv_usec - t0.tv_usec) / 1000000.);
        get_output(0, heatmap_gpu);
        get_output(1, scale_gpu);
        get_output(2, offset_gpu);
        get_output(3, lms_gpu);
        // Copy the outputs back to the CPU for post-processing
        tvm::runtime::NDArray heatmap_cpu = heatmap_gpu.CopyTo(DLDevice{kDLCPU, 0});
        tvm::runtime::NDArray scale_cpu = scale_gpu.CopyTo(DLDevice{kDLCPU, 0});
        tvm::runtime::NDArray offset_cpu = offset_gpu.CopyTo(DLDevice{kDLCPU, 0});
        tvm::runtime::NDArray lms_cpu = lms_gpu.CopyTo(DLDevice{kDLCPU, 0});
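The preprocessing arithmetic in this step is easy to get subtly wrong, so here is a small sketch of it in Python: rounding the frame size up to a multiple of 32, deriving the scale factors, and computing the byte count passed to CopyFromBytes. The frame size 540x960 is a hypothetical example, and the helper names are mine. Note that with integer operands C++ evaluates h / 32 as truncating division, so the rounding only works when the division is done in floating point.

```python
import math


def round_up(value, multiple=32):
    """Smallest multiple of `multiple` that is >= value."""
    return int(math.ceil(value / multiple) * multiple)


def tensor_nbytes(shape, itemsize=4):
    """Byte size of a dense tensor; itemsize=4 corresponds to float32."""
    return math.prod(shape) * itemsize


if __name__ == "__main__":
    h, w = 540, 960  # hypothetical frame size
    img_h_new, img_w_new = round_up(h), round_up(w)
    print(img_h_new, img_w_new)             # 544 960
    print((h // 32) * 32)                   # 512, the result of truncating division
    scale_h, scale_w = img_h_new / h, img_w_new / w
    print(scale_h, scale_w)
    print(tensor_nbytes([1, 3, 544, 960]))  # 6266880
```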

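Deploying a convolutional neural network on the Raspberry Pi with TVM
This part describes how to compile a neural network with TVM, export it as a dynamic library, and finally deploy and run it on the Raspberry Pi (or a PC).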

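Environment setup
LLVM must be installed. The main runtime target is the CPU (the Raspberry Pi GPU is not used for now, its memory is rather small), so LLVM is required.
Installing the cross compiler: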
Cross Compiler
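What is a cross compiler? It is a compiler that runs on the PC and produces executables that run directly on the Raspberry Pi. In TVM, the cross compiler is used on the PC to compile and optimize the model and then generate a dynamic library for the Raspberry Pi (ARM architecture).
With this dynamic library, the TVM runtime on the Raspberry Pi can load it and execute the network's forward pass.
How to install it? The cross compiler needed is /usr/bin/arm-linux-gnueabihf-g++. On Ubuntu, simply run sudo apt-get install g++-arm-linux-gnueabihf. The name must be exact: the hf (hard-float) version is required.
After installation, running /usr/bin/arm-linux-gnueabihf-g++ -v prints the following: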
prototype@prototype-X299-UD4-Pro:~/$ /usr/bin/arm-linux-gnueabihf-g++ -v
Using built-in specs.
COLLECT_GCC=/usr/bin/arm-linux-gnueabihf-g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc-cross/arm-linux-gnueabihf/5/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: …/src/configure -v --with-pkgversion=‘Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.9’ --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libitm --disable-libquadmath --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-armhf-cross/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-armhf-cross --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-armhf-cross --with-arch-directory=arm --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libgcj --enable-objc-gc --enable-multiarch --enable-multilib --disable-sjlj-exceptions --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float=hard --with-mode=thumb --disable-werror --enable-multilib --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=arm-linux-gnueabihf --program-prefix=arm-linux-gnueabihf- --includedir=/usr/arm-linux-gnueabihf/include
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.9)
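Before exporting a library from Python it can be convenient to confirm that the cross compiler is actually on PATH. The sketch below does that with the standard library only; the helper names are mine, and -dumpmachine is the GCC option that prints the target triple.

```python
import shutil
import subprocess


def find_cross_compiler(name="arm-linux-gnueabihf-g++"):
    """Return the full path of the cross compiler if it is on PATH, else None."""
    return shutil.which(name)


def compiler_target(path):
    """Ask the compiler for its target triple via -dumpmachine."""
    result = subprocess.run(
        [path, "-dumpmachine"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    path = find_cross_compiler()
    if path is None:
        print("cross compiler not found; install g++-arm-linux-gnueabihf")
    else:
        print(path, "targets", compiler_target(path))
```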
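Raspberry Pi environment setup
Since the network is compiled with TVM on the PC, only the TVM runtime needs to be built on the Raspberry Pi (TVM consists of two parts, the compiler and the runtime, and the two can be separated).
The official commands are given below. Note that LLVM also needs to be installed on the Raspberry Pi; a prebuilt archive can be found on the official LLVM site. Extract it and add it to the environment variables: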

git clone --recursive https://github.com/dmlc/tvm
cd tvm
mkdir build
cp cmake/config.cmake build # edit config.cmake here to enable LLVM support
cd build
cmake ..
make runtime
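Building the TVM runtime on the Raspberry Pi does not take long.
Completing the deployment
Deploying the model with TVM's C++ API on the PC
The official documentation covers deployment through TVM's C++ API in some detail. Here TVM and OpenCV are used to read an image, load the dynamic library exported earlier, and run the neural network on that image.
The required headers are: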
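#include <iostream>  // assumed: the header name was lost in extraction; iostream is needed for cout below
#include <dlpack/dlpack.h>
#include <opencv4/opencv2/opencv.hpp>
#include <tvm/runtime/module.h>
#include <tvm/runtime/registry.h>
#include <tvm/runtime/packed_func.h>
#include <fstream>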
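Only the TVM runtime is actually needed here; dlpack is the structure that holds tensors. OpenCV is used to read the image, and fstream is used to read the JSON and parameter files: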
tvm::runtime::Module mod_dylib =
    tvm::runtime::Module::LoadFromFile("…/files/mobilenet.so");
// graph structure in JSON
std::ifstream json_in("…/files/mobilenet.json", std::ios::in);
std::string json_data((std::istreambuf_iterator<char>(json_in)), std::istreambuf_iterator<char>());
json_in.close();
// parameters in binary
std::ifstream params_in("…/files/mobilenet.params", std::ios::binary);
std::string params_data((std::istreambuf_iterator<char>(params_in)), std::istreambuf_iterator<char>());
params_in.close();
TVMByteArray params_arr;
params_arr.data = params_data.c_str();
params_arr.size = params_data.length();
After reading these files, use their contents to build TVM's graph runtime (Graph_runtime):
int dtype_code = kDLFloat;
int dtype_bits = 32;
int dtype_lanes = 1;
int device_type = kDLCPU;
int device_id = 0;
tvm::runtime::Module mod = (*tvm::runtime::Registry::Get("tvm.graph_runtime.create"))
    (json_data, mod_dylib, device_type, device_id);
Then use a TVM function to create the input tensor and allocate its memory:
DLTensor *x;
int in_ndim = 4;
int64_t in_shape[4] = {1, 3, 128, 128};
TVMArrayAlloc(in_shape, in_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &x);
DLTensor is a flexible structure that can hold tensors of various types. Once the tensor is created, the image data read by OpenCV must be copied into it:
// Again read the image paper.png
image = cv::imread("/home/prototype/CLionProjects/tvm-cpp/data/paper.png");
cv::cvtColor(image, frame, cv::COLOR_BGR2RGB);
cv::resize(frame, input, cv::Size(128,128));
float data[128 * 128 * 3];
// This function converts the OpenCV image data to CHW layout
Mat_to_CHW(data, input);
Note that OpenCV stores the image data in (128,128,3) order, i.e. HWC, so it has to be rearranged. The Mat_to_CHW function is:
void Mat_to_CHW(float *data, cv::Mat &frame)
{
    assert(data && !frame.empty());
    unsigned int volChl = 128 * 128;

    for (int c = 0; c < 3; ++c)
    {
        for (unsigned j = 0; j < volChl; ++j)
            data[c * volChl + j] = static_cast<float>(float(frame.data[j * 3 + c]) / 255.0);
    }
}
Don't forget to divide by 255.0: the PyTorch model expects input pixel values scaled to the 0-1 range.
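The index arithmetic in Mat_to_CHW is easier to check on a tiny image. Below is a minimal Python sketch of the same HWC to CHW rearrangement with the division by 255; it takes the shape as parameters instead of hard-coding 128x128, and the function name is mine.

```python
def hwc_to_chw(pixels, height, width, channels=3, scale=255.0):
    """Convert a flat HWC uint8 buffer to a flat CHW float list scaled to [0, 1]."""
    if len(pixels) != height * width * channels:
        raise ValueError("buffer size does not match the given shape")
    plane = height * width
    out = [0.0] * (channels * plane)
    for c in range(channels):
        for j in range(plane):
            # Source index walks pixel by pixel; destination index walks channel by channel
            out[c * plane + j] = pixels[j * channels + c] / scale
    return out


if __name__ == "__main__":
    # A 1x2 image with 3 channels: pixel 0 = (0, 51, 102), pixel 1 = (153, 204, 255)
    hwc = [0, 51, 102, 153, 204, 255]
    print(hwc_to_chw(hwc, height=1, width=2))
    # [0.0, 0.6, 0.2, 0.8, 0.4, 1.0]
```

In the output, the first two values are channel 0 of both pixels, the next two are channel 1, and the last two are channel 2.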
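After converting the OpenCV image data, copy it into the tensor created earlier:
// x is the tensor created earlier; data is the float buffer allocated above
memcpy(x->data, data, 3 * 128 * 128 * sizeof(float));
Then set the input (x) and output (y) of the graph: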
// get the function from the module (set input data)
tvm::runtime::PackedFunc set_input = mod.GetFunction("set_input");
set_input("0", x);
// get the function from the module (load parameters)
tvm::runtime::PackedFunc load_params = mod.GetFunction("load_params");
load_params(params_arr);
DLTensor* y;
int out_ndim = 2;
int64_t out_shape[2] = {1, 3};
TVMArrayAlloc(out_shape, out_ndim, dtype_code, dtype_bits, dtype_lanes, device_type, device_id, &y);
// get the function from the module (run it)
tvm::runtime::PackedFunc run = mod.GetFunction("run");
// get the function from the module (get output data)
tvm::runtime::PackedFunc get_output = mod.GetFunction("get_output");
Now it can be run:
run();
get_output(0, y);
// Print the output values
auto result = static_cast<float*>(y->data);
for (int i = 0; i < 3; i++)
    std::cout << result[i] << std::endl;
The final output is
13.8204
-7.31387
-6.8253
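As shown, the image is correctly recognized as paper (the first score, 13.8204, is by far the largest). This completes the deployment on the C++ side.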
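The three numbers printed above are raw scores, one per class. As a small sketch of how they turn into a prediction, the code below takes the argmax and, for readability, a softmax. The label list is hypothetical: the text only tells us the image is recognized as paper, so the names and order of the other two classes are placeholders.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    scores = [13.8204, -7.31387, -6.8253]     # values printed in the article
    labels = ["paper", "class_1", "class_2"]  # hypothetical label order
    idx = argmax(scores)
    print(idx, labels[idx])  # 0 paper
    print(["%.6f" % p for p in softmax(scores)])
```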
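Deploying on the Raspberry Pi
Deploying on the Raspberry Pi is also quite simple. The difference from the steps above is that the target must be set specifically for the Raspberry Pi:
target = tvm.target.arm_cpu('rasp3b')
Looking at the source of arm_cpu, rasp3b maps to -target=armv7l-linux-gnueabihf: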
trans_table = {
    "pixel2": ["-model=snapdragon835", "-target=arm64-linux-android -mattr=+neon"],
    "mate10": ["-model=kirin970", "-target=arm64-linux-android -mattr=+neon"],
    "mate10pro": ["-model=kirin970", "-target=arm64-linux-android -mattr=+neon"],
    "p20": ["-model=kirin970", "-target=arm64-linux-android -mattr=+neon"],
    "p20pro": ["-model=kirin970", "-target=arm64-linux-android -mattr=+neon"],
    "rasp3b": ["-model=bcm2837", "-target=armv7l-linux-gnueabihf -mattr=+neon"],
    "rk3399": ["-model=rk3399", "-target=aarch64-linux-gnu -mattr=+neon"],
    "pynq": ["-model=pynq", "-target=armv7a-linux-eabi -mattr=+neon"],
    "ultra96": ["-model=ultra96", "-target=aarch64-linux-gnu -mattr=+neon"],
}
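To see what each entry of this table encodes, the option strings can be split into key/value pairs. This is only an illustrative sketch; the parsing helper is mine and is not how TVM itself processes the table.

```python
def parse_target_options(option_strings):
    """Turn ['-model=bcm2837', '-target=... -mattr=+neon'] into a dict."""
    parsed = {}
    for group in option_strings:
        for token in group.split():
            key, _, value = token.lstrip("-").partition("=")
            parsed[key] = value
    return parsed


# Two entries copied from the table above
trans_table = {
    "rasp3b": ["-model=bcm2837", "-target=armv7l-linux-gnueabihf -mattr=+neon"],
    "rk3399": ["-model=rk3399", "-target=aarch64-linux-gnu -mattr=+neon"],
}

if __name__ == "__main__":
    print(parse_target_options(trans_table["rasp3b"]))
    # {'model': 'bcm2837', 'target': 'armv7l-linux-gnueabihf', 'mattr': '+neon'}
```

For rasp3b this shows a 32-bit hard-float ARM target with NEON enabled, which is why the hf cross compiler is the matching toolchain.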
One more change: when exporting the .so, pass cc="/usr/bin/arm-linux-gnueabihf-g++", where /usr/bin/arm-linux-gnueabihf-g++ is the cross compiler installed earlier.
path_lib = '…/tvm/deploy_lib.so'
lib.export_library(path_lib, cc="/usr/bin/arm-linux-gnueabihf-g++")
This produces the files the Raspberry Pi needs. Copy them to the Raspberry Pi, then deploy with the C++ code described above.

