



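TensorRT, which we use all the time, along with toolkits such as DeepStream, TLT (Transfer Learning Toolkit), and Triton Inference Server, are all ready-to-use tools that NVIDIA provides, and they really are good. My only gripe is that they are not fully open source (go easy on me~).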


  • https://www.nvidia.com/en-us/on-demand/
  • https://www.nvidia.com/en-us/research/ai-playground/
  • https://developer.nvidia.com/transfer-learning-toolkit




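  • Autonomous driving, robotics
  • Big data, networking, visualization
  • Data science
  • Deep learning
  • GPU programming
  • Graphics and design
  • High-performance computing
  • Simulation, modeling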





TensorRT Quick Start Guide

We’ll walk you through the TensorRT Quick Start Guide. The newly-published TensorRT Quick Start Guide provides a quick introduction to new users starting out with TensorRT. It includes Jupyter notebooks and C++ examples of the most common TensorRT workflows and examples for using TensorRT with TensorFlow, PyTorch, and ONNX.






Accelerate Deep Learning Inference with TensorRT 8.0

TensorRT is an SDK for high-performance deep learning inference used in production to minimize latency and maximize throughput. The upcoming TensorRT 8.0 release provides features such as sparsity optimized for NVIDIA Ampere GPUs, quantization-aware training, and an enhanced compiler to accelerate transformer-based networks. Deep learning compilers need to have a robust method to import, optimize, and deploy models. New users can learn about the common workflow, while experienced users can learn more about new TensorRT 8.0 features.



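  • Supports QAT (quantization-aware training): models trained with QAT in other frameworks can be imported directly into TensorRT
  • On Ampere-architecture GPUs, sparse networks are supported, which can improve throughput by 50%
  • Better optimization for transformer-based networks such as BERT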


Introduction to TensorRT and Triton: A Walkthrough of Optimizing Your First Deep Learning Inference Model

NVIDIA TensorRT is a deep learning platform that optimizes neural network models and speeds up inference across GPU-accelerated platforms running in the data center and embedded devices. We’ll provide an overview of TensorRT, show how to optimize a PyTorch model, and demonstrate how to deploy this highly optimized model using NVIDIA Triton Inference Server. By the end of this workshop, developers will see the substantial benefits of integrating TensorRT and get started on optimizing their own deep learning models.


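I have a TensorRT! You have a Triton! Put them together and what do you get? Triton with TensorRT! The combination can fairly be called the strongest server-side inference solution in the open-source world.

Triton really is great to use. Its feature set is on par with other serving frameworks, and the supported backends include TensorRT, ONNX Runtime, libtorch, TensorFlow, PyTorch, OpenVINO, and more. It supports HTTP and gRPC, and you can add your own protocol (it is open source, after all). It also supports multiple GPUs, multiple model instances, and hot reloading of models.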
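
To make the HTTP side a little more concrete, here is a minimal sketch of what an inference request to Triton looks like over its HTTP/REST interface. It only builds the URL and JSON body and does not send anything. The server address, model name, and tensor names (my_trt_model, input, output) are made up for illustration, and I am assuming a single FP32 input.

```python
import json
from math import prod


def build_infer_request(server, model_name, input_name, shape, values, output_name):
    """Build the URL and JSON body for a Triton HTTP/REST inference call.

    Nothing is sent here; the result can be POSTed with any HTTP client.
    """
    if len(values) != prod(shape):
        raise ValueError("number of values does not match the shape")
    url = f"http://{server}/v2/models/{model_name}/infer"
    body = {
        "inputs": [
            {
                "name": input_name,
                "shape": list(shape),
                "datatype": "FP32",   # assumption: the model takes float32 input
                "data": list(values),  # flattened, row-major tensor contents
            }
        ],
        "outputs": [{"name": output_name}],
    }
    return url, json.dumps(body)


# Hypothetical server, model, and tensor names, for illustration only.
url, payload = build_infer_request(
    "localhost:8000", "my_trt_model", "input", [1, 4], [0.1, 0.2, 0.3, 0.4], "output"
)
print(url)
print(payload)
```

In practice the official Triton client libraries wrap this for you; the sketch is only meant to show that the wire format is plain JSON over HTTP.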







Quantization Aware Training in PyTorch with TensorRT 8.0

Quantization is used to improve latency and resource requirements of Deep Neural Networks during inference. Quantization Aware Training (QAT) improves accuracy of quantized networks by emulating quantization errors in the forward and backward passes during training. TensorRT 8.0 brings improved support for QAT with PyTorch, in conjunction with NVIDIA’s open-source pytorch-quantization toolkit. This session gives an overview of the improvements in QAT with TensorRT 8.0, and walks through an end-to-end usage example.
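
To get a feel for what "emulating quantization errors" means, here is a small sketch of the forward-pass half of the idea: quantize a value to INT8 and immediately dequantize it, so the rest of the network sees the rounding and clipping error during training. This is plain Python for illustration, not the pytorch-quantization toolkit API. I am assuming symmetric per-tensor quantization with a given amax (the largest magnitude to represent), and the weights are made-up numbers.

```python
def fake_quantize(values, amax, num_bits=8):
    """Quantize to signed integers and immediately dequantize.

    Symmetric, per-tensor scheme: the range [-amax, amax] is mapped onto
    [-qmax, qmax], where qmax is 127 for INT8.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = amax / qmax
    simulated = []
    for x in values:
        q = round(x / scale)            # snap to the nearest integer level
        q = max(-qmax, min(qmax, q))    # clip values that fall outside the range
        simulated.append(q * scale)     # back to floating point
    return simulated


# Made-up weights; the last one lies outside [-amax, amax] and gets clipped.
weights = [0.03, -0.51, 0.77, 1.60]
amax = 1.0
simulated = fake_quantize(weights, amax)
errors = [abs(w - s) for w, s in zip(weights, simulated)]
print(simulated)
print(errors)
```

Values inside the range pick up an error of at most half a quantization step, while values beyond amax are clipped. The backward pass is not shown here.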





Making the Most of Structured Sparsity in the NVIDIA Ampere Architecture

In this session, we’ll share details of Sparse Tensor Cores in the NVIDIA Ampere Architecture and the unique 2:4 sparse format they support. Learn how we’ve simplified maintaining accuracy when pruning all types of networks, including classification networks, language models, and GANs. Finally, find out how to accelerate your own workloads using Sparse Tensor Cores from start to finish with ASP and TensorRT 8.0 and cuSPARSELt.



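Some NVIDIA GPUs support sparse inference. When running BERT on an NVIDIA A100 GPU, the sparsified network is 50% faster than the original dense network. Do our own cards support it? Any GPU based on the Ampere architecture does (for example, the 30XX cards).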
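
The 2:4 sparse format mentioned in the session means that at most two values in every group of four consecutive values are non-zero. Below is a minimal sketch of magnitude-based 2:4 pruning on a flat list of weights. It is plain Python for illustration and not the ASP or cuSPARSELt API; the weights are made up, and real workflows typically fine-tune after pruning to recover accuracy.

```python
def prune_2_to_4(row):
    """Zero the two smallest-magnitude values in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Positions of the two largest magnitudes within this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


# Made-up dense weights: two groups of four.
dense = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.1]
sparse = prune_2_to_4(dense)
print(sparse)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Because the pattern is fixed (two out of four), the hardware only needs to store the kept values plus a small index per group, which is what makes it practical to accelerate.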

  • Exploiting NVIDIA Ampere Structured Sparsity with cuSPARSELt
  • How Sparsity Adds Umph to AI Inference


Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT

Prototyping and Debugging Deep Learning Inference Models Using TensorRT’s ONNX-Graphsurgeon and Polygraphy Tools

Deep learning researchers and engineers usually have to spend a significant amount of time debugging accuracy and performance of their deep learning inference models before deploying them. TensorRT recently open-sourced some more tools to assist with the development and debugging of deep neural networks for inference. ONNX GraphSurgeon is a tool that allows you to easily generate new ONNX graphs, or modify existing ones. This can be useful in scenarios like using custom implementations for parts of the ONNX graph, in place of those provided by TensorRT. Polygraphy is a toolkit designed to assist in running and debugging deep learning models in various frameworks. It includes a Python API and several command-line tools built using this API. These tools allow displaying information about models, such as network structure; determining which layers of a TensorRT network need to be run in a higher precision for accuracy; and comparing inference results across frameworks, among other features.





  • Inspect the structure of an ONNX model: polygraphy inspect model mymodel.onnx
  • Inspect the structure of an engine: polygraphy inspect model mytrt.trt --model-type engine
  • Inspect the TensorRT network that would be built from an ONNX model: polygraphy inspect model mymodel.onnx --display-as=trt --mode basic
  • Compare TensorRT and ONNX results:
    polygraphy run mymodel.onnx --onnxrt --save-outputs onnx_res.json
    polygraphy run mytrt.trt --model-type engine --trt --load-outputs onnx_res.json --abs 1e-4
  • Modify the structure of an ONNX model:
    polygraphy surgeon sanitize modele2-nms.onnx \
      --override-input-shapes input_name:[1,3,224,224] \
      -o modele2-nms-static-shape.onnx
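
The comparison step above boils down to checking, output by output, whether the largest absolute difference stays within a tolerance. Here is a rough sketch of that idea in plain Python. It is not Polygraphy's implementation, and the output name and numbers are made up.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two flat lists."""
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def outputs_match(ref_outputs, cand_outputs, atol=1e-4):
    """Return {output_name: True/False} for an absolute-tolerance check."""
    results = {}
    for name, ref in ref_outputs.items():
        if name not in cand_outputs:
            results[name] = False  # an output missing on one side counts as a mismatch
            continue
        results[name] = max_abs_diff(ref, cand_outputs[name]) <= atol
    return results


# Made-up outputs standing in for the ONNX Runtime and TensorRT results.
onnx_out = {"scores": [0.10000, 0.70000, 0.20000]}
trt_out = {"scores": [0.10003, 0.69998, 0.20001]}
report = outputs_match(onnx_out, trt_out, atol=1e-4)
print(report)
```

With a tolerance of 1e-4 these two sets of outputs count as matching; tighten it to 1e-5 and the same outputs are flagged as a mismatch, which is why the tolerance needs to fit the precision (FP32, FP16, INT8) you run the engine in.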


Achieve Best Inference Performance on NVIDIA GPUs by Combining TensorRT with TVM Compilation Using SageMaker Neo

Amazon SageMaker Neo allows customers to compile models from any framework for optimized inference on many compilation targets, including NVIDIA Jetson devices and T4 GPU instances. We’ll dive into the details of how Neo uses the open-source deep learning compiler TVM and NVIDIA TensorRT together to provide the best inference performance across popular deep learning model types.





New Features in TRTorch, a PyTorch/TorchScript Compiler Targeting NVIDIA GPUs Using TensorRT

We’ll cover new features of TRTorch, a compiler for PyTorch and TorchScript that optimizes deep learning models for inference on NVIDIA GPUs. Programs are internally optimized using TensorRT but maintain full compatibility with standard PyTorch or TorchScript code. This allows users to continue to feel like they’re writing PyTorch code in their inference applications while fully leveraging TensorRT. We’ll discuss new capabilities enabled in recent releases of TRTorch, including direct integration into PyTorch and post-training quantization.


  • PyTorch -> TorchScript -> TensorRT





Low-Latency, High-Throughput Inferencing for Transformer-Based Models

Transformer-based models provide state-of-the-art accuracy for many NLP tasks. Recent models contain a large number of parameters, which makes meeting low latency requirements challenging for online inferencing. We’ll cover highly optimized inferencing solutions for transformer-based models to tackle online and offline inferencing scenarios. We’ll demonstrate that low latency and high throughput can be achieved with the combination of NVIDIA hardware and software. We’ll briefly go over BERT inferencing with FasterTransformer, TensorRT, and MXNet, and also present performance data from the latest NVIDIA GPUs.




  • https://github.com/NVIDIA/FasterTransformer
  • https://github.com/NVIDIA/TensorRT/tree/master/demo/BERT

Inference with TensorFlow 2 Integrated with TensorRT Session

Learn how to run inference using TensorFlow 2 with TensorRT integrated and the performance this can offer. TensorFlow is a machine learning platform and TensorRT is an SDK for high-performance deep learning inference using NVIDIA GPUs. TensorFlow models are usually written in FP32 precision to work for both training and inference. TensorFlow-TensorRT integration automatically offloads portions of the TensorFlow graph to run with TensorRT using precisions FP16 or INT8 to improve inference throughput without sacrificing much accuracy. We’ll describe: how to use TensorFlow-TensorRT integration in TensorFlow 2; the dynamic shape feature we recently added to better handle TensorFlow graphs with unknown shapes; the lazy calibration mode we recently added to improve the workflow for inferencing with INT8 precision; some details on how TensorFlow-TensorRT works; and the performance benefits of using TensorFlow-TensorRT for inference.


Designing and Optimizing Deep Neural Networks for High-Throughput and Low-Latency Production Deployment

When integrating DNNs into applications the project teams need to consider much more than just model accuracy. Factors such as throughput affect the size and the cost of the infrastructure required to host the application. Similarly, latency of model response is important for a wide range of time-sensitive applications and a hard requirement when building safety-critical applications. We’ll discuss how to select efficient models that allow us to meet the throughput and latency requirements (including multitask DNNs) as well as key approaches for their further optimization, such as quantization-aware training, post-training quantization, pruning, distillation, and other forms of model compression. We’ll explain how those techniques interact with the GPU architecture. Finally, we’ll reprise key tools that can simplify the model optimization and deployment process, such as TensorRT or Triton Inference Server.

How to design and optimize production models for high throughput and low latency; the session covers TensorRT and Triton Inference Server.
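
As a back-of-the-envelope illustration of why throughput drives infrastructure size, here is a tiny sketch that turns a batch size and a per-batch latency into inferences per second, and then into a GPU count for a target load. All numbers are made up, and I am assuming batches run back to back on each GPU with no other overhead.

```python
import math


def throughput_per_gpu(batch_size, batch_latency_ms):
    """Inferences per second on one GPU, assuming back-to-back batches."""
    return batch_size * 1000.0 / batch_latency_ms


def gpus_needed(target_qps, per_gpu_qps):
    """Smallest number of GPUs whose combined throughput covers the target."""
    return math.ceil(target_qps / per_gpu_qps)


# Hypothetical baseline: batches of 8 that take 20 ms each.
baseline_qps = throughput_per_gpu(batch_size=8, batch_latency_ms=20.0)
baseline_gpus = gpus_needed(1000, baseline_qps)

# Hypothetical optimized model: same batch size, half the latency.
optimized_qps = throughput_per_gpu(batch_size=8, batch_latency_ms=10.0)
optimized_gpus = gpus_needed(1000, optimized_qps)

print(baseline_qps, baseline_gpus)    # 400.0 3
print(optimized_qps, optimized_gpus)  # 800.0 2
```

In this toy setup, halving the per-batch latency cuts the GPU count for 1000 requests per second from 3 to 2, which is the kind of trade-off the session is about.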










