【个人记录】torch转onnx对上TensorRT的grid_sample接口（4D/5D）进行加速

背景

双目里比较优秀的很多模型都有用到torch的grid_sample，但是其在TensorRT中没有接口。虽说4D的有替代方案，但是还是有性能损失。并且我最近测试了BGNET，感觉在实时方案中是最可靠的，在抖动中不会大量误匹配，但是其有双边滤波是升维的，变成了5D的grid_sample，让人头大。替代写法网上都没有，我自己照着实现了一下，结果是一致的，但是时间直接翻一倍，我人裂开，onnx部署时间和torch原来跑的时间一致就很难受。思来想去还是试试能不能怼个接口上去。

测试平台

Jetson Xavier NX

前情

经过一些复杂尝试，首先试了下torch2trt，感觉细节上容易对不上他的接口要求，要排查的太多了，放弃。
觉得在NX上编译源码TensorRT源码还是不太合适，在一顿搜索尝试之后确定了以下方案。

参考

他人github教程
英伟达官方说明
github开源的TensorRT接口
NVIDIA的trtexec文件夹

过程

首先将onnx内不支持的接口注册再导出，注册代码是复制的这里，注意要改算子名字的话就改我标的地方。

import torch
from my_model import my_model import typing
from torch.onnx import symbolic_helper_OPSET_VERSION = 11
_registered_ops: typing.AbstractSet[str] = set()def _reg(symbolic_fn: typing.Callable):name = "::%s" % symbolic_fn.__name__torch.onnx.register_custom_op_symbolic(name, symbolic_fn, _OPSET_VERSION)_registered_ops.add(name)def register():"""Register ONNX Runtime's built-in contrib ops.Should be run before torch.onnx.export()."""def grid_sampler(g, input, grid, mode, padding_mode, align_corners):# mode#   'bilinear'      : onnx::Constant[value={0}]#   'nearest'       : onnx::Constant[value={1}]#   'bicubic'       : onnx::Constant[value={2}]# padding_mode#   'zeros'         : onnx::Constant[value={0}]#   'border'        : onnx::Constant[value={1}]#   'reflection'    : onnx::Constant[value={2}]mode = symbolic_helper._maybe_get_const(mode, "i")padding_mode = symbolic_helper._maybe_get_const(padding_mode, "i")mode_str = ["bilinear", "nearest", "bicubic"][mode]padding_mode_str = ["zeros", "border", "reflection"][padding_mode]align_corners = int(symbolic_helper._maybe_get_const(align_corners, "b"))# From opset v13 onward, the output shape can be specified with# (N, C, H, W) (N, H_out, W_out, 2) => (N, C, H_out, W_out)# input_shape = input.type().sizes()# gird_shape = grid.type().sizes()# output_shape = input_shape[:2] + gird_shape[1:3]# g.op(...).setType(input.type().with_sizes(output_shape))return g.op(## op name, modify here. not sure whether "com.microsoft::" is required"com.microsoft::GridSamplePluginDynamic",  input,grid,mode_s=mode_str,padding_mode_s=padding_mode_str,align_corners_i=align_corners,)_reg(grid_sampler)@torch.no_grad()
def convert():register()# set cpudevice = "cuda"model = my_model (88, 'models.pth').to(device)model.eval()t1 = torch.rand(1, 1, 384, 640).to(device)t2 = torch.rand(1, 1, 384, 640).to(device)# Export the modeltorch.onnx.export(model,(t1, t2),'model.onnx',   # where to save the model (can be a file or file-like object)export_params=True,        # store the trained parameter weights inside the model fileopset_version=11,          # the ONNX version to export the model todo_constant_folding=True,  # whether to execute constant folding for optimizationinput_names = ['left', 'right'],   # the model's input namesoutput_names = ['output'])if __name__ == "__main__":convert()

因为我要使用的grid_sample是5D的，就算用最新的onnx导出也不支持，所以我还是用老版本的onnx导出。4D的我不知道直接用高版本导出会有啥问题哈。之后有机会试试CREStereo的。

随手用下onnx-simplifier，有可能可以消除一些if节点避免tensorrt报一些错

python3 -m onnxsim model.onnx model_sim.onnx

从这里下载开源的接口（mmcv也有，我感觉也可以找那边的用），我个人就用到grid_sample所以把其他都删了。删完之后这两个地方改一下。

按照他的markdown编译好之后就可以了。生成的库文件在build/lib里面，长这样
随后使用trtexec转换模型文件，链接上接口：

/usr/src/tensorrt/bin/trtexec --onnx=model.onnx --saveEngine=model.trt --fp16 \
--plugins=/home/ubuntu/Documents/amirstan_plugin/build/lib/libamirstan_plugin.so

这里我一开始的时候链接上但是接口还是对不上，改接口工程里面的名字好像没变换，想了想算了把onnx模型里面算子type改成GridSamplePluginDynamic了。

随后在自己之前工程上随手链上工程文件用c++跑一遍效果。
一开始尝试在CMakeLists里面直接链上，然后发现没有效果。但是trtexec是可以的呀，所以我瞅了眼trtexec的源码，他用的是dlopen函数，我也跟着用了。

#include <dlfcn.h>......string dll_path = "/home/ubuntu/Documents/amirstan_plugin/build/lib/libamirstan_plugin.so";void *handle = dlopen(dll_path.c_str(), RTLD_LAZY);if (NULL == handle){printf("dlopen error. msg:%s", dlerror());return -1;}......

同时在CMakeLists里面写上

target_link_libraries(main ${CMAKE_DL_LIBS} )

然后编译跑图，效果不太方便展示，看上去是和之前自己用别的方法实现的grid_sample是一样的。原先用别的方法实现的grid_sample跑模型大概12帧，现在17帧，提升明显。完事儿~

补充

python实现个人感觉可以参考这里，使用ctypes.CDLL(plugin_lib)来链接上库，其他一致。不确定哈，还没试过。

【个人记录】torch转onnx对上TensorRT的grid_sample接口（4D/5D）进行加速相关推荐

tensorrt，mmclas中的onnx转tensorrt
NVIDIA TensorRT | NVIDIA Developerhttps://developer.nvidia.cn/zh-cn/tensorrtTensorRT详细入门指北,如果你还不了解Te ...
1、pth转onnx模型、onnx转tensorrt模型、python中使用tensorrt进行加速推理（全网最全，不信你打我）
本文向所有亲们介绍在python当中配置tensorrt环境.使用tensorrt环境进行推理的教程,主要分为两大部分,第一部分环境配置,第二部分前向推理. 第一部分环境配置第一步:检查你的系统类 ...
pytorch模型转ONNX转TensorRT，模型转换和推理部署
一.pth模型转ONNX import os import sys import torch import numpy as npfrom feat.model import ResNet # 导入自 ...
用于ONNX的TensorRT后端
用于ONNX的TensorRT后端解析ONNX模型以使用TensorRT执行. 另请参阅TensorRT文档. 有关最近更改的列表,请参见changelog. 支持的TensorRT版本 Maste ...
onnx 测试_用于ONNX的TensorRT后端
用于ONNX的TensorRT后端解析ONNX模型以使用TensorRT执行. 另请参阅TensorRT文档. 有关最近更改的列表,请参见changelog. 支持的TensorRT版本 Maste ...
onnx转tensorrt 实战干货总结
目录查看tensorrt版本: input: dynamic input is missing dimensions in profile onnx转tensorrt方法安装tensorrt: o ...
torch转onnx模型
torch转onnx模型一.前言 onnx是开放神经网络交换格式,用于不同框架之间的迁移,推理方面比原生的torch快很多.本文以MobilenetV3做分类任务为例,实现模型转换. 二.使用步骤 ...
onnx 测试_pytorch onnx onnxruntime tensorrt踩坑各种问题
做了一个小测试,发现pytorch onnx tensorrt三个库的版本存在微妙的联系,在我之前的错误实验中,PyTorch==1.3.0/1.4.0:Onnx==1.6.0:tensorrt=7. ...
较为详细的记录总结TensorRT的python接口的使用，环境配置，模型转换和静态动态模型推理
先来一段摘抄自网上的TensorRT介绍: TensorRT是英伟达针对自家平台做的加速包,TensorRT主要做了这么两件事情,来提升模型的运行速度. TensorRT支持INT8和FP16的计算. ...

【个人记录】torch转onnx对上TensorRT的grid_sample接口（4D/5D）进行加速

背景

测试平台

前情

参考

过程

补充

【个人记录】torch转onnx对上TensorRT的grid_sample接口（4D/5D）进行加速相关推荐

最新文章

热门文章