起因

我在尝试一篇目标识别的论文的开源代码时，发现由于其NMS、LSTM等模块是使用c++编译的（为了弥补python速度的短板），对Pytorch要求不能使用超过0.3的版本。和FASTER RCNN开源代码很像，不过Faster-rcnn官方更新了其包含的c++文件，从而可以在pytorch大于1的版本上使用。我不想退回pytorch 0.3版本，于是尝试重写c++/cuda 扩展的模块

prerequisite

CUDA10.0
pytorch1.4+cu100
torchvision0.5.0+cu100
setuptools 52.0.0

方法

由于原始代码采用较老的方式（THC模块）实现扩展，这里尝试pytorch官方 tutorial的方法实现

https://pytorch.org/tutorials/advanced/cpp_extension.html#writing-a-c-extension

然而，这教程有一个很致命的问题，按照它的方法用不了，被坑了。所以…先看看作为参考吧。

概述

以NMS模块为例，文件树如下：

|project name
|---cuda
|   |---nms_kernel.cu
|   |---nms_kernel.h
|---nms_cuda.cpp
|---nms_cuda.h
|---build.py

其中nms_kernel.cu包括cuda核函数和调用它的函数A，nms_cuda.c又调用A函数，最后通过build.py编译扩展模块

##build.py

import os
import torch
from setuptools import setup, Extension
from torch.utils.cpp_extension import BuildExtension, CUDAExtension
from setuptools import setup, Extension
from torch.utils import cpp_extension
# Might have to export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}} ?
torch._C._GLIBCXX_USE_CXX11_ABI=0  //设置编译的参数
sources = []
headers = []
defines = []
inl = []
with_cuda = Falseif torch.cuda.is_available():print('Including CUDA code.')sources += ['src/cuda/nms_kernel.cu','src/nms_cuda.cpp']  //待编译的c与cu文件# sources += ['src/nms_cuda.cpp']headers += ['src/','src/cuda/'] //头文件目录defines += [('WITH_CUDA', None)]with_cuda = Trueinl +=['src/','src/cuda/']this_file = os.path.dirname(os.path.realpath(__file__))
print(this_file)
setup(name='ext.nms',  //拓展包名，不太重要ext_modules=[CUDAExtension('extnms',sources=sources,include_dirs= headers ,define_macros=defines,  extra_compile_flag=('-std=c++11'),//使用c11编译with_cuda=with_cuda)],cmdclass={'build_ext': BuildExtension})

考虑到有人是源码编译pytorch，那么就需要注意，编译pytorch时的-D GLIBCXX_USE_CXX11_ABI标志位，extention编译的标志位要和pytorch编译的一致，一般来说，pip安装不用考虑这个问题

cpp文件

该文件主要特点是：使用torch::Tensor代替原本的指针、数组类型（当然可以尝试指定数据类型，例如torch.IntTensor）
宏PYBIND11_MODULE中第一个参数即函数名，第二个是该函数指针地址，最后一个参数是模块名（用于import …）

#include pip
#include <math.h>
#include "cuda/nms_kernel.h"  //cuda核函数的头文件
#include <torch/extension.h>
#include <pybind11/pybind11.h>
namespace py = pybind11;    //用于c++到python的转换
extern THCState *state;
int ApplyNMSGPU(         //声明cuda函数，不要也可torch::Tensor keep_out,torch::Tensor boxes_dev,int boxes_num,torch::Tensor nms_overlap_thresh,int device_id);int nms_apply(         //需要转换陈python的函数，里面调用了cuda函数ApplyNMSGPUtorch::Tensor keep,torch::Tensor boxes_sorted,float nms_thresh
)
{torch::Tensor keep_data = keep;
torch::Tensor boxes_sorted_data = boxes_sorted;
int boxes_num = boxes_sorted.size(0);
int devId = boxes_sorted.get_device();return ApplyNMSGPU(keep_data, boxes_sorted_data, boxes_num, nms_thresh, devId);
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {m.def("nms_apply", &nms_apply, "nms");
}

cpp的头文件

#include <torch/extension.h>
int nms_apply(torch::Tensor keep,torch::Tensor boxes_sorted,float nms_thresh
)；

cuda文件

你只需要关注：
·数据类型：scalar_t、scalar_t*、size_t 他们都是模板数据类型，在后面确定具体是什么。
·宏AT_DISPATCH_ALL_TYPE，参数一：决定scalar_t是哪种具体数据类型，参数三：lambla函数，注意核函数的参数数据类型，对于torch:tensor的，需要.data<scalar_t>()，但是，如果是别的类型（例如指针）还用.data<scalar_t>()的话会报错：error: expression must have class type


//#include <ATen/ATen.h>
#include <vector>
#include <iostream>
#include <torch/extension.h>
#include <cuda.h>
#include <cuda_runtime.h>
#define CUDA_CHECK(condition) \/* Code block avoids redefinition of cudaError_t error */ \do { \cudaError_t error = condition; \if (error != cudaSuccess) { \std::cout << cudaGetErrorString(error) << std::endl; \} \} while (0)#define DIVUP(m,n) ((m) / (n) + ((m) % (n) > 0))
const long  threadsPerBlock = sizeof(unsigned long long) * 8;
/*
__device__ inline float devIoU(scalar_t*  a, scalar_t*  b) {float left = max(a[0], b[0]), right = min(a[2], b[2]);float top = max(a[1], b[1]), bottom = min(a[3], b[3]);float width = max(right - left + 1, 0.f), height = max(bottom - top + 1, 0.f);float interS = width * height;float Sa = (a[2] - a[0] + 1) * (a[3] - a[1] + 1);float Sb = (b[2] - b[0] + 1) * (b[3] - b[1] + 1);return interS / (Sa + Sb - interS);
}
*/
cxg script/
template <typename scalar_t>
__device__ __forceinline__ scalar_t devIoU( scalar_t*  a, scalar_t*  b) {float left = max(a[0], b[0]), right = min(a[2], b[2]);float top = max(a[1], b[1]), bottom = min(a[3], b[3]);float width = max(right - left + 1, 0.f), height = max(bottom - top + 1, 0.f);float interS = width * height;float Sa = (a[2] - a[0] + 1) * (a[3] - a[1] + 1);float Sb = (b[2] - b[0] + 1) * (b[3] - b[1] + 1);return interS / (Sa + Sb - interS);
}
/////cxg script//
template <typename scalar_t>
__global__ void nms_kernel(const size_t __restrict__ n_boxes,scalar_t__restrict__ nms_overlap_thresh,scalar_t* __restrict__ dev_boxes,unsigned long long* __restrict__ dev_mask) {const int row_start = blockIdx.y;const int col_start = blockIdx.x;// if (row_start > col_start) return;const int row_size =min(n_boxes - row_start * threadsPerBlock, threadsPerBlock);const int col_size =min(n_boxes - col_start * threadsPerBlock, threadsPerBlock);__shared__ scalar_t block_boxes[threadsPerBlock * 5];if (threadIdx.x < col_size) {block_boxes[threadIdx.x * 4 + 0] =dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 4 + 0];block_boxes[threadIdx.x * 4 + 1] =dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 4 + 1];block_boxes[threadIdx.x * 4 + 2] =dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 4 + 2];block_boxes[threadIdx.x * 4 + 3] =dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 4 + 3];}__syncthreads();if (threadIdx.x < row_size) {const int cur_box_idx = threadsPerBlock * row_start + threadIdx.x;scalar_t* cur_box = dev_boxes + cur_box_idx * 4;int i = 0;unsigned long long t = 0;int start = 0;if (row_start == col_start) {start = threadIdx.x + 1;}for (i = start; i < col_size; i++) {if (devIoU(cur_box, block_boxes + i * 4) > nms_overlap_thresh) {t |= 1ULL << i;}}const int col_blocks = DIVUP(n_boxes, threadsPerBlock);dev_mask[cur_box_idx * col_blocks + col_start] = t;}
}//
void _set_device(int device_id) {int current_device;CUDA_CHECK(cudaGetDevice(&current_device));if (current_device == device_id) {return;}// The call to cudaSetDevice must come before any calls to Get, which// may perform initialization using the GPU.CUDA_CHECK(cudaSetDevice(device_id));
}cxg script ///
int ApplyNMSGPU(torch::Tensor keep_out,torch::Tensor boxes_dev,int boxes_num,float nms_overlap_thresh,int device_id) {_set_device(device_id);unsigned long long* mask_dev = NULL;const auto col_blocks = DIVUP(boxes_num, threadsPerBlock);CUDA_CHECK(cudaMalloc(&mask_dev,boxes_num * col_blocks * sizeof(unsigned long long)));dim3 blocks(DIVUP(boxes_num, threadsPerBlock),DIVUP(boxes_num, threadsPerBlock));dim3 threads(threadsPerBlock);AT_DISPATCH_ALL_TYPES(boxes_dev.type(), "ApplyNMSGPU", ([&] {nms_kernel<scalar_t><<<blocks, threads>>>(boxes_num,nms_overlap_thresh,boxes_dev.data<scalar_t>(),mask_dev);}));std::vector<unsigned long long> mask_host(boxes_num * col_blocks);CUDA_CHECK(cudaMemcpy(&mask_host[0],mask_dev,sizeof(unsigned long long) * boxes_num * col_blocks,cudaMemcpyDeviceToHost));std::vector<unsigned long long> remv(col_blocks);memset(&remv[0], 0, sizeof(unsigned long long) * col_blocks);int num_to_keep = 0;for (int i = 0; i < boxes_num; i++) {int nblock = i / threadsPerBlock;int inblock = i % threadsPerBlock;if (!(remv[nblock] & (1ULL << inblock))) {keep_out[num_to_keep++] = i;auto p = &mask_host[0] + i * col_blocks;for (int j = nblock; j < col_blocks; j++) {remv[j] |= p[j];}}}CUDA_CHECK(cudaFree(mask_dev));return num_to_keep;
}/

cuda头文件

#include <torch/extension.h>
int ApplyNMSGPU(torch::Tensor keep_out,torch::Tensor boxes_dev,int boxes_num,float nms_overlap_thresh,int device_id);

编译：

命令：python build.py install
结果：你可以看到所有的nvcc gcc编译器的编译参数，确定是否正确

building 'extnms' extension
/usr/local/cuda/bin/nvcc -DWITH_CUDA -Isrc/ -Isrc/cuda/ -I/usr/local/lib/python3.8/dist-packages/torch/include -I/usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/dist-packages/torch/include/TH -I/usr/local/lib/python3.8/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.8 -c src/cuda/nms_kernel.cu -o build/temp.linux-x86_64-3.8/src/cuda/nms_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=extnms -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_61,code=sm_61 -std=c++11
/usr/local/lib/python3.8/dist-packages/torch/include/c10/core/TensorTypeSet.h(44): warning: integer conversion resulted in a change of signx86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DWITH_CUDA -Isrc/ -Isrc/cuda/ -I/usr/local/lib/python3.8/dist-packages/torch/include -I/usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/dist-packages/torch/include/TH -I/usr/local/lib/python3.8/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.8 -c src/nms_cuda.cpp -o build/temp.linux-x86_64-3.8/src/nms_cuda.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=extnms -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.8/src/cuda/nms_kernel.o build/temp.linux-x86_64-3.8/src/nms_cuda.o -L/usr/local/cuda/lib64 -lcudart -o build/lib.linux-x86_64-3.8/extnms.cpython-38-x86_64-linux-gnu.so
creating build/bdist.linux-x86_64/egg
copying build/lib.linux-x86_64-3.8/extnms.cpython-38-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
creating stub loader for extnms.cpython-38-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/extnms.py to extnms.cpython-38.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying ext.nms.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying ext.nms.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying ext.nms.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying ext.nms.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
__pycache__.extnms.cpython-38: module references __file__
creating 'dist/ext.nms-0.0.0-py3.8-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing ext.nms-0.0.0-py3.8-linux-x86_64.egg
removing '/usr/local/lib/python3.8/dist-packages/ext.nms-0.0.0-py3.8-linux-x86_64.egg' (and everything under it)
creating /usr/local/lib/python3.8/dist-packages/ext.nms-0.0.0-py3.8-linux-x86_64.egg
Extracting ext.nms-0.0.0-py3.8-linux-x86_64.egg to /usr/local/lib/python3.8/dist-packages
ext.nms 0.0.0 is already the active version in easy-install.pthInstalled /usr/local/lib/python3.8/dist-packages/ext.nms-0.0.0-py3.8-linux-x86_64.egg
Processing dependencies for ext.nms==0.0.0
Finished processing dependencies for ext.nms==0.0.0

调用

import pytorch
import extnms
…

小结

我哼多天卡在类似lltm_cpp.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC1ENS_14NMSAPPLYGPUXXXX这种错误上，这种错误就是编译时，库使用了未知的函数，在这个项目里，体现在必须使用.h文件，然鹅官方教程就是没有用.h文件，导致这个错误

运行setup.py时报错：

ValueError: bad marshal data (unknown type code)

solusion:

pip3 install --upgrade --force-reinstall setuptools

pytorch编写cuda/c++ extention 方法相关推荐

PyTorch 编写自定义数据集，数据加载器和转换
本文为 pytorch 官方教程https://pytorch.org/tutorials/beginner/data_loading_tutorial.html代码的注释 w3cschool 的翻译 ...
Ubuntu18.04+RTX3060显卡配置pytorch、cuda、cudnn和miniconda
0. 前言之前已经安装成功了,也发了篇博客梳理了整套流程如下. ubuntu18.04安装pytorch.cuda.cudnn和miniconda_Toblerone_Wind的博客-CSDN博客_ ...
TensorFlow/PyTorch和cuda等版本对应关系
参考: pytorch.显卡.显卡驱动.cuda版本是如何对应的:https://www.jianshu.com/p/ac70300b598b NVIDIA系列显卡做深度学习,需要在主机安装显卡驱动 ...
cuda11.2安装pytorch——torch.cuda.is_available()=false
这两天正在用服务器跑程序,但是发现运行速度极慢,查看正在运行的进程,也没看到自己的进程,但是程序又确确实实在运行,这就奇了怪了,一通查找,发现程序竟然是在CPU上运行,也就是 torch.cuda.i ...
加载dict_PyTorch 7.保存和加载pytorch模型的两种方法
众所周知,python的对象都可以通过torch.save和torch.load函数进行保存和加载(不知道?那你现在知道了(*^_^*)),比如: x1 = {"d":" ...
php 单例类 mysql pdo_PHP实战：PHP基于单例模式编写PDO类的方法
<PHP实战:PHP基于单例模式编写PDO类的方法>要点: 本文介绍了PHP实战:PHP基于单例模式编写PDO类的方法,希望对您有用.如果有疑问,可以联系我们. 一.单例模式简介简单的说 ...
在猜年龄的基础上编写登录、注册方法，并且把猜年龄游戏分函数处理
''' 在猜年龄的基础上编写登录.注册方法,并且把猜年龄游戏分函数处理,如 2. 登录函数 3. 注册函数 4. 猜年龄三次函数 5. 选择三次奖品函数 ''' import random def u ...
Delphi环境中编写调用DLL的方法和技巧
Delphi环境中编写调用DLL的方法和技巧第一章为什么要使用动态链接库(DLL) top 提起DLL您一定不会陌生,在Windows中有着大量的以DLL为后缀的文件,它们是保证Windows正常 ...
VS开发中的代码编写小技巧——避免重复代码编写的几种方法
原文:VS开发中的代码编写小技巧--避免重复代码编写的几种方法上一篇文章中程序员的幸福生活--有你的日子,每天都是情人节,收到了大家的很多好评.鼓励和祝福,非常感动,真诚的谢谢大家.也希望每个朋友都 ...

pytorch编写cuda/c++ extention 方法

起因