This article uses JetPack 4.5.1. I did not use the newer JetPack 4.6/4.6.1 because those versions ship TensorRT >= 8, whose API differs from TensorRT 7 and has far fewer online references to follow, even though TensorRT 8's acceleration is considerably better than TensorRT 7's.

TensorRT
TensorRT is a high performance deep learning inference runtime for image classification, segmentation, and object detection neural networks. TensorRT is built on CUDA, NVIDIA’s parallel programming model, and enables you to optimize inference for all deep learning frameworks. It includes a deep learning inference optimizer and runtime that delivers low latency and high-throughput for deep learning inference applications.
JetPack 4.5.1 includes TensorRT 7.1.3

cuDNN
CUDA Deep Neural Network library provides high-performance primitives for deep learning frameworks. It provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.
JetPack 4.5.1 includes cuDNN 8.0

CUDA
CUDA Toolkit provides a comprehensive development environment for C and C++ developers building GPU-accelerated applications. The toolkit includes a compiler for NVIDIA GPUs, math libraries, and tools for debugging and optimizing the performance of your applications.
JetPack 4.5.1 includes CUDA 10.2

After flashing succeeds, do not install Archiconda (or at least do not make the conda environment the default); otherwise tensorrt and the other JetPack libraries cannot be used.
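If in doubt, a quick sanity check (a minimal sketch; run it in the system Python 3, outside any conda environment) is to import TensorRT directly:

# Sanity check: this import should succeed in the system Python 3 (not a conda env)
import tensorrt as trt
print(trt.__version__)   # expected: 7.1.3 on JetPack 4.5.1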

The network is Bubbliiiing's YOLOX code.

Exporting the network model to ONNX

import torch
from logs.two_asff.model import YoloBody        # replace with your own network model; the weights are loaded below

model_path = 'logs/two_asff/99.09.pth'
# Load the model
torch_model = YoloBody(num_classes=1, phi='shufflenet')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch_model.load_state_dict(torch.load(model_path, map_location=device))
# Set the model to inference mode
torch_model.eval()

batch_size = 1
# Input to the model
x = torch.randn(batch_size, 3, 416, 416, requires_grad=True)
torch_out = torch_model(x)

# Export the model
torch.onnx.export(torch_model,               # model being run
                  x,                          # model input (or a tuple for multiple inputs)
                  "shufflenetv2-yolox.onnx",  # where to save the model (can be a file or file-like object)
                  export_params=True,         # store the trained parameter weights inside the model file
                  opset_version=11,           # the ONNX version to export the model to
                  do_constant_folding=True,   # whether to execute constant folding for optimization
                  input_names=['images'],     # the model's input names
                  output_names=['output'],    # the model's output names
                  dynamic_axes={'images': {0: 'batch_size'},    # variable length axes; keys must match the names above
                                'output': {0: 'batch_size'}})

Note that input_names and output_names must match the input/output names used later when building and running the TensorRT engine.
Also be aware that the exported ONNX model is generated with (or contains) INT64 weights. When building the TensorRT engine you will see warnings that TensorRT does not natively support INT64 and will cast the ONNX INT64 values down to INT32. These warnings can be ignored.
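If you are unsure which names the exported file actually uses, a minimal sketch (assuming the export filename used above) that lists the graph's input and output names with the onnx package:

import onnx

# Hypothetical path: use whatever filename you passed to torch.onnx.export
model = onnx.load("shufflenetv2-yolox.onnx")
print("inputs :", [i.name for i in model.graph.input])    # should contain 'images'
print("outputs:", [o.name for o in model.graph.output])   # should contain 'output'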

Test whether the ONNX model was generated successfully. Note that the ONNX model may actually run slower than the PyTorch one.
(Be careful when running `pip3 install onnx`: it may also pull in numpy or other packages, and a version mismatch can cause a core dump and leave the environment unusable. You can instead generate and test the ONNX model on a PC.)

# Check that the ONNX model was exported successfully; this usually passes without problems
import onnx

# Load the ONNX model
model = onnx.load("../two_asff.onnx")
# Check that the IR is well formed
print(onnx.checker.check_model(model))
print('-' * 50)
# Print a human readable representation of the graph
print('Model :\n\n{}'.format(onnx.helper.printable_graph(model.graph)))

Run the ONNX model once and measure its speed.

import time
import cv2
import numpy as np
import onnxruntime as rt

def image_process(image_path):
    mean = np.array([[[0.485, 0.456, 0.406]]])      # mean and std used during training
    std = np.array([[[0.229, 0.224, 0.225]]])
    img = cv2.imread(image_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (416, 416))               # (416, 416, 3)
    image = img.astype(np.float32) / 255.0
    image = (image - mean) / std
    image = image.transpose((2, 0, 1))              # (3, 416, 416)
    image = image[np.newaxis, :, :, :]              # (1, 3, 416, 416)
    image = np.array(image, dtype=np.float32)
    return image

def onnx_runtime():
    imgdata = image_process('../img/1.jpg')

    p1 = time.time()
    sess = rt.InferenceSession('yolox-tiny.onnx')
    p2 = time.time()
    print("Session load time: {}".format(p2 - p1))

    input_name = sess.get_inputs()[0].name
    output_name = sess.get_outputs()[0].name

    t1 = time.time()
    pred_onnx = sess.run([output_name], {input_name: imgdata})
    t2 = time.time()
    print("Inference time with onnxruntime: {}".format(t2 - t1))
    # print("outputs:")
    # print(np.array(pred_onnx))

onnx_runtime()
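Besides timing, it can be worth checking that onnxruntime and PyTorch agree numerically. A minimal sketch, assuming `torch_model`, `x` and `torch_out` from the export script above, and that the first output tensor is the one you want to compare:

import numpy as np
import onnxruntime as rt

sess = rt.InferenceSession("shufflenetv2-yolox.onnx")   # hypothetical path from the export step
ort_inputs = {sess.get_inputs()[0].name: x.detach().numpy()}
ort_outs = sess.run(None, ort_inputs)

# torch_out may be a tuple/list of tensors for YOLOX-style heads; compare the first one here
torch_first = torch_out[0] if isinstance(torch_out, (tuple, list)) else torch_out
np.testing.assert_allclose(torch_first.detach().numpy(), ort_outs[0], rtol=1e-3, atol=1e-5)
print("PyTorch and onnxruntime outputs match")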

Next, generate the TensorRT model.
Below is the common.py file; just copy it as-is, it is used by the later scripts.

#
# Copyright 1993-2020 NVIDIA Corporation.  All rights reserved.
#
# NOTICE TO LICENSEE:
#
# This source code and/or documentation ("Licensed Deliverables") are
# subject to NVIDIA intellectual property rights under U.S. and
# international Copyright laws.
#
# These Licensed Deliverables contained herein is PROPRIETARY and
# CONFIDENTIAL to NVIDIA and is being provided under the terms and
# conditions of a form of NVIDIA software license agreement by and
# between NVIDIA and Licensee ("License Agreement") or electronically
# accepted by Licensee.  Notwithstanding any terms or conditions to
# the contrary in the License Agreement, reproduction or disclosure
# of the Licensed Deliverables to any third party without the express
# written consent of NVIDIA is prohibited.
#
# NOTWITHSTANDING ANY TERMS OR CONDITIONS TO THE CONTRARY IN THE
# LICENSE AGREEMENT, NVIDIA MAKES NO REPRESENTATION ABOUT THE
# SUITABILITY OF THESE LICENSED DELIVERABLES FOR ANY PURPOSE.  IT IS
# PROVIDED "AS IS" WITHOUT EXPRESS OR IMPLIED WARRANTY OF ANY KIND.
# NVIDIA DISCLAIMS ALL WARRANTIES WITH REGARD TO THESE LICENSED
# DELIVERABLES, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY,
# NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.
# NOTWITHSTANDING ANY TERMS OR CONDITIONS TO THE CONTRARY IN THE
# LICENSE AGREEMENT, IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY
# SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, OR ANY
# DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
# WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS
# ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE
# OF THESE LICENSED DELIVERABLES.
#
# U.S. Government End Users.  These Licensed Deliverables are a
# "commercial item" as that term is defined at 48 C.F.R. 2.101 (OCT
# 1995), consisting of "commercial computer software" and "commercial
# computer software documentation" as such terms are used in 48
# C.F.R. 12.212 (SEPT 1995) and is provided to the U.S. Government
# only as a commercial end item.  Consistent with 48 C.F.R.12.212 and
# 48 C.F.R. 227.7202-1 through 227.7202-4 (JUNE 1995), all
# U.S. Government End Users acquire the Licensed Deliverables with
# only those rights set forth herein.
#
# Any use of the Licensed Deliverables in individual and commercial
# software must include, in the user documentation and internal
# comments to the code, the above Disclaimer and U.S. Government End
# Users Notice.
#

from itertools import chain
import argparse
import os

import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

import tensorrt as trt

try:
    # Sometimes python2 does not understand FileNotFoundError
    FileNotFoundError
except NameError:
    FileNotFoundError = IOError

EXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

def GiB(val):
    return val * 1 << 30

def add_help(description):
    parser = argparse.ArgumentParser(description=description, formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    args, _ = parser.parse_known_args()

def find_sample_data(description="Runs a TensorRT Python sample", subfolder="", find_files=[]):
    '''
    Parses sample arguments.

    Args:
        description (str): Description of the sample.
        subfolder (str): The subfolder containing data relevant to this sample
        find_files (str): A list of filenames to find. Each filename will be replaced with an absolute path.

    Returns:
        str: Path of data directory.
    '''

    # Standard command-line arguments for all samples.
    kDEFAULT_DATA_ROOT = os.path.join(os.sep, "usr", "src", "tensorrt", "data")
    parser = argparse.ArgumentParser(description=description, formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument("-d", "--datadir", help="Location of the TensorRT sample data directory, and any additional data directories.", action="append", default=[kDEFAULT_DATA_ROOT])
    args, _ = parser.parse_known_args()

    def get_data_path(data_dir):
        # If the subfolder exists, append it to the path, otherwise use the provided path as-is.
        data_path = os.path.join(data_dir, subfolder)
        if not os.path.exists(data_path):
            print("WARNING: " + data_path + " does not exist. Trying " + data_dir + " instead.")
            data_path = data_dir
        # Make sure data directory exists.
        if not (os.path.exists(data_path)):
            print("WARNING: {:} does not exist. Please provide the correct data path with the -d option.".format(data_path))
        return data_path

    data_paths = [get_data_path(data_dir) for data_dir in args.datadir]
    return data_paths, locate_files(data_paths, find_files)

def locate_files(data_paths, filenames):
    """
    Locates the specified files in the specified data directories.
    If a file exists in multiple data directories, the first directory is used.

    Args:
        data_paths (List[str]): The data directories.
        filename (List[str]): The names of the files to find.

    Returns:
        List[str]: The absolute paths of the files.

    Raises:
        FileNotFoundError if a file could not be located.
    """
    found_files = [None] * len(filenames)
    for data_path in data_paths:
        # Find all requested files.
        for index, (found, filename) in enumerate(zip(found_files, filenames)):
            if not found:
                file_path = os.path.abspath(os.path.join(data_path, filename))
                if os.path.exists(file_path):
                    found_files[index] = file_path

    # Check that all files were found
    for f, filename in zip(found_files, filenames):
        if not f or not os.path.exists(f):
            raise FileNotFoundError("Could not find {:}. Searched in data paths: {:}".format(filename, data_paths))
    return found_files

# Simple helper data class that's a little nicer to use than a 2-tuple.
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

# Allocates all buffers required for an engine, i.e. host/device inputs/outputs.
def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream

# This function is generalized for multiple inputs/outputs.
# inputs and outputs are expected to be lists of HostDeviceMem objects.
def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]

# This function is generalized for multiple inputs/outputs for full dimension networks.
# inputs and outputs are expected to be lists of HostDeviceMem objects.
def do_inference_v2(context, bindings, inputs, outputs, stream):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]

The log.py file, a small helper I wrote for timing and printing; it is optional.

import time

def print_div(text, custom='', end='\n\n'):
    print('-' * 50, '\n')
    print(f"{text}{custom}", end=end)

class timer():
    def __init__(self, text):
        print_div(text, '...', end=' ')
        self.t_start = time.time()

    def end(self):
        t_cost = time.time() - self.t_start
        print('\033[35m', end='')
        print('Done ({:.5f}s)'.format(t_cost))
        print('\033[0m')

class logger():
    def __init__(self, text):
        print_div(text)
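For reference, the helpers are used like this (a small usage sketch):

from log import timer, logger

t = timer("Load TRT Engine")     # prints a divider and "Load TRT Engine..."
# ... do the work you want to time ...
t.end()                          # prints "Done (x.xxxxxs)"

logger("Result : apple")         # prints a divider and the message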

Generate the TRT model
build_engine.py

import tensorrt as trt
from log import timer, logger

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)

def build_engine(onnx_path, shape=[1, 416, 416, 3]):
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network(1) as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
        builder.max_workspace_size = (256 << 20)  # 256 MiB
        builder.fp16_mode = True                  # fp32_mode -> False
        with open(onnx_path, 'rb') as model:
            parser.parse(model.read())
        engine = builder.build_cuda_engine(network)
        return engine

def save_engine(engine, engine_path):
    buf = engine.serialize()
    with open(engine_path, 'wb') as f:
        f.write(buf)

def load_engine(trt_runtime, engine_path):
    with open(engine_path, 'rb') as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine

if __name__ == "__main__":
    onnx_path = '../two_asff.onnx'
    trt_path = 'two_asff_113.trt'
    input_shape = [1, 416, 416, 3]

    build_trt = timer('Parser ONNX & Build TensorRT Engine')
    engine = build_engine(onnx_path, input_shape)
    build_trt.end()

    save_trt = timer('Save TensorRT Engine')
    save_engine(engine, trt_path)
    save_trt.end()

Output:

--------------------------------------------------
[TensorRT] WARNING: onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
Parser ONNX & Build TensorRT Engine... Done (358.54099s)
--------------------------------------------------
Save TensorRT Engine... Done (0.35394s)

A second script for generating the TRT engine; using either one of the two is enough.
onnx2trt.py

import numpy as np
import pycuda.driver as cudadriver
import tensorrt as trt
import torch
import os
import time
import common

from PIL import Image
import cv2
import torchvision

def ONNX_build_engine(onnx_file_path, write_engine=True):
    # Build an engine by loading the ONNX file
    # :param onnx_file_path: path to the ONNX file
    # :return: engine
    G_LOGGER = trt.Logger(trt.Logger.WARNING)
    # 1. Required when using dynamic input
    explicit_batch = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    batch_size = 1  # maximum batch size supported during TRT inference

    with trt.Builder(G_LOGGER) as builder, builder.create_network(explicit_batch) as network, \
            trt.OnnxParser(network, G_LOGGER) as parser:
        builder.max_batch_size = batch_size
        config = builder.create_builder_config()
        config.max_workspace_size = common.GiB(2)
        config.set_flag(trt.BuilderFlag.FP16)

        print('Loading ONNX file from path {}...'.format(onnx_file_path))
        with open(onnx_file_path, 'rb') as model:
            print('Beginning ONNX file parsing')
            parser.parse(model.read())
        print('Completed parsing of ONNX file')
        print('Building an engine from file {}; this may take a while...'.format(onnx_file_path))

        # Key point: an optimization profile is needed for dynamic input,
        # giving the minimum, typical and maximum input shapes.
        # Call profile.set_shape once per input; the name must match the one used when exporting to ONNX
        # (e.g. 'images' in the export script above).
        # TensorRT 6 and later support dynamic input; each dynamic input needs a profile specifying
        # min, typical and max shapes, and anything outside that range raises an error.
        profile = builder.create_optimization_profile()
        profile.set_shape("input", (1, 3, 208, 208), (1, 3, 416, 416), (1, 3, 640, 640))
        config.add_optimization_profile(profile)

        engine = builder.build_engine(network, config)
        print("Completed creating Engine")

        # Save the engine file
        if write_engine:
            engine_file_path = 'two_asff_112.trt'
            with open(engine_file_path, "wb") as f:
                f.write(engine.serialize())
        return engine

onnx_file_path = r'../two_asff.onnx'
write_engine = True
engine = ONNX_build_engine(onnx_file_path, write_engine)

Running inference with the TRT engine
trt_inference.py
The inference time measured here is fairly slow, and I am not sure what causes it; a faster version is given further below.

import tensorrt as trt
from PIL import Image
import torchvision.transforms as T
import numpy as np

import common
from log import timer, logger

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)

def load_engine(trt_runtime, engine_path):
    with open(engine_path, 'rb') as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine

def load_data(path):
    trans = T.Compose([T.Resize(208), T.CenterCrop(208), T.ToTensor()])
    img = Image.open(path)
    img_tensor = trans(img).unsqueeze(0)
    return np.array(img_tensor)

# load trt engine
load_trt = timer("Load TRT Engine")
trt_path = 'two_asff_112.trt'
engine = load_engine(trt_runtime, trt_path)
load_trt.end()

# allocate buffers
inputs, outputs, bindings, stream = common.allocate_buffers(engine)

# load data
inputs[0].host = load_data('../img/1.jpg')

# inference
with engine.create_execution_context() as context:
    infer_trt = timer("TRT Inference")
    trt_outputs = common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
    infer_trt.end()

preds = trt_outputs[0]

# Get Labels
f = open('../model_data/apple_classes.txt')
t = [i.replace('\n', '') for i in f.readlines()]
logger(f"Result : {t[np.argmax(preds)]}")

The next script goes straight from ONNX to a TRT engine and times the inference, but it does not save or post-process the output. If you want to keep the output, use the script above and comment out everything before the engine-building step here, so that only inference runs, which is very fast.
I originally wanted to load my own PyTorch model into the same script for comparison, but it took too long; it is easier to just run the PyTorch model directly in the Python 3 environment.
pAt_complate.py

import pycuda.autoinit
import numpy as np
import pycuda.driver as cuda
import tensorrt as trt
import torch
import os
import time
from PIL import Image
import cv2
import torchvision
# from logs.two_asff.model import YoloBody

filename = '../img/1.jpg'
max_batch_size = 1
onnx_model_path = '../two_asff.onnx'
# model_path      = 'logs/two_asff/yolox/99.09.pth'

TRT_LOGGER = trt.Logger()  # This logger is required to build an engine

# net    = YoloBody(1, shufflenet)
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# net.load_state_dict(torch.load(model_path, map_location=device))
# net    = net.eval()

def get_img_np_nchw(filename):
    image = cv2.imread(filename)
    image_cv = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image_cv = cv2.resize(image_cv, (224, 224))
    miu = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    img_np = np.array(image_cv, dtype=float) / 255.
    r = (img_np[:, :, 0] - miu[0]) / std[0]
    g = (img_np[:, :, 1] - miu[1]) / std[1]
    b = (img_np[:, :, 2] - miu[2]) / std[2]
    img_np_t = np.array([r, g, b])
    img_np_nchw = np.expand_dims(img_np_t, axis=0)
    return img_np_nchw

class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        """Within this context, host_mem means the cpu memory and device_mem means the GPU memory"""
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream

def get_engine(max_batch_size=1, onnx_file_path="", engine_file_path="",
               fp16_mode=False, int8_mode=False, save_engine=False):
    """Attempts to load a serialized engine if available, otherwise builds a new TensorRT engine and saves it."""

    def build_engine(max_batch_size, save_engine):
        """Takes an ONNX file and creates a TensorRT engine to run inference with"""
        explicit_batch = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        with trt.Builder(TRT_LOGGER) as builder, \
                builder.create_network(explicit_batch) as network, \
                trt.OnnxParser(network, TRT_LOGGER) as parser:
            builder.max_workspace_size = 1 << 30  # Your workspace size
            builder.max_batch_size = max_batch_size
            # pdb.set_trace()
            builder.fp16_mode = fp16_mode  # Default: False
            builder.int8_mode = int8_mode  # Default: False
            if int8_mode:
                # To be updated
                raise NotImplementedError

            # Parse model file
            if not os.path.exists(onnx_file_path):
                quit('ONNX file {} not found'.format(onnx_file_path))
            print('Loading ONNX file from path {}...'.format(onnx_file_path))
            with open(onnx_file_path, 'rb') as model:
                print('Beginning ONNX file parsing')
                parser.parse(model.read())
            print('Completed parsing of ONNX file')
            print('Building an engine from file {}; this may take a while...'.format(onnx_file_path))

            last_layer = network.get_layer(network.num_layers - 1)
            network.mark_output(last_layer.get_output(0))

            engine = builder.build_cuda_engine(network)
            print("Completed creating Engine")

            if save_engine:
                with open(engine_file_path, "wb") as f:
                    f.write(engine.serialize())
            return engine

    if os.path.exists(engine_file_path):
        # If a serialized engine exists, load it instead of building a new one.
        print("Reading engine from file {}".format(engine_file_path))
        with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
            return runtime.deserialize_cuda_engine(f.read())
    else:
        return build_engine(max_batch_size, save_engine)

def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer data from CPU to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]

def postprocess_the_outputs(h_outputs, shape_of_output):
    h_outputs = h_outputs.reshape(*shape_of_output)
    return h_outputs

img_np_nchw = get_img_np_nchw(filename)
img_np_nchw = img_np_nchw.astype(dtype=np.float32)

# These two modes are dependent on hardware
fp16_mode = True
int8_mode = False
trt_engine_path = './model_fp16_{}_int8_{}.trt'.format(fp16_mode, int8_mode)
# Build an engine
engine = get_engine(max_batch_size, onnx_model_path, trt_engine_path, fp16_mode, int8_mode)
# Create the context for this engine
context = engine.create_execution_context()
# Allocate buffers for input and output
inputs, outputs, bindings, stream = allocate_buffers(engine)  # input, output: host; bindings

# Do inference
shape_of_output = (max_batch_size, 4056)
# Load data to the buffer
inputs[0].host = img_np_nchw.reshape(-1)
# inputs[1].host = ... for multiple inputs

t1 = time.time()
trt_outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)  # numpy data
t2 = time.time()
print("Inference time with the TensorRT engine: {}".format(t2 - t1))

feat = postprocess_the_outputs(trt_outputs[0], shape_of_output)
print('TensorRT ok')

# model = torchvision.models.resnet50(pretrained=True).cuda()
# resnet_model = model.eval()
# input_for_torch = torch.from_numpy(img_np_nchw).cuda()
# t3 = time.time()
# feat_2 = resnet_model(input_for_torch)
# t4 = time.time()
# feat_2 = feat_2.cpu().data.numpy()
# print('Pytorch ok!')
# mse = np.mean((feat - feat_2) ** 2)

print("Inference time with the TensorRT engine: {}".format(t2 - t1))
# print("Inference time with the PyTorch model: {}".format(t4 - t3))
# print('MSE Error = {}'.format(mse))
print('All completed!')
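Note that `shape_of_output = (max_batch_size, 4056)` is hard-coded above; if your exported model has a different output size, a small sketch using the TensorRT 7 binding API can show what the engine actually expects:

# Print every binding's name, shape and direction, so shape_of_output
# can be set to match the real output binding of your engine.
for binding in engine:
    print(binding,
          engine.get_binding_shape(binding),
          "input" if engine.binding_is_input(binding) else "output")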

Output:

Loading ONNX file from path ../two_asff.onnx...
Beginning ONNX file parsing
[TensorRT] WARNING: onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[TensorRT] WARNING: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
Completed parsing of ONNX file
Building an engine from file ../two_asff.onnx; this may take a while...
[TensorRT] ERROR: Tensor 790 is already set as an output
Completed creating Engine
Inference time with the TensorRT engine: 0.04271745681762695
TensorRT ok
All completed!

With FP16 precision it runs much faster than before.
