Background

While trying to run the open-source code of an object-detection paper, I found that its NMS, LSTM and other modules are compiled from C++ (to make up for Python's speed), and therefore require PyTorch no newer than 0.3. This is very similar to the Faster R-CNN open-source code, except that the Faster R-CNN maintainers updated their C++ files so that they work on PyTorch >= 1.0. I did not want to roll back to PyTorch 0.3, so I tried rewriting the C++/CUDA extension modules myself.

Prerequisites

CUDA10.0
pytorch1.4+cu100
torchvision0.5.0+cu100
setuptools 52.0.0
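
As a quick sanity check (not specific to this project), you can confirm that the interpreter actually sees these versions and that CUDA is visible:

import torch
import torchvision
import setuptools

print(torch.__version__)           # expect 1.4.0+cu100
print(torchvision.__version__)     # expect 0.5.0+cu100
print(torch.version.cuda)          # expect 10.0
print(torch.cuda.is_available())   # must be True to build and run the CUDA extension
print(setuptools.__version__)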

Method

Since the original code builds the extension in the older way (the THC module), here I try the approach from the official PyTorch tutorial:

https://pytorch.org/tutorials/advanced/cpp_extension.html#writing-a-c-extension

However, the tutorial has one fatal problem: following it to the letter does not work, and it cost me a lot of time. So treat it as a reference only.

Overview

Taking the NMS module as an example, the file tree is as follows:

|project name
|---build.py
|---src
|   |---nms_cuda.cpp
|   |---nms_cuda.h
|   |---cuda
|   |   |---nms_kernel.cu
|   |   |---nms_kernel.h

Here nms_kernel.cu contains the CUDA kernel plus a wrapper function A that launches it, nms_cuda.cpp in turn calls A, and finally build.py compiles everything into the extension module.

build.py

import os
import torch
from setuptools import setup, Extension
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

# Might have to export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}} ?
torch._C._GLIBCXX_USE_CXX11_ABI = 0  # ABI flag used for compilation (must match PyTorch's)

sources = []
headers = []
defines = []
inl = []
with_cuda = False

if torch.cuda.is_available():
    print('Including CUDA code.')
    sources += ['src/cuda/nms_kernel.cu', 'src/nms_cuda.cpp']  # the .cpp and .cu files to compile
    # sources += ['src/nms_cuda.cpp']
    headers += ['src/', 'src/cuda/']  # header directories
    defines += [('WITH_CUDA', None)]
    with_cuda = True
    inl += ['src/', 'src/cuda/']

this_file = os.path.dirname(os.path.realpath(__file__))
print(this_file)

setup(
    name='ext.nms',  # package name, not very important
    ext_modules=[
        CUDAExtension(
            'extnms',  # the module name used later for `import extnms`
            sources=sources,
            include_dirs=headers,
            define_macros=defines,
            extra_compile_args=['-std=c++11'],  # compile with C++11
            with_cuda=with_cuda,
        )
    ],
    cmdclass={'build_ext': BuildExtension},
)

If you built PyTorch from source, be careful about the -D_GLIBCXX_USE_CXX11_ABI flag that PyTorch itself was compiled with: the flag used for the extension must match it. For a pip-installed PyTorch you generally don't need to worry about this.
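
For reference, the ABI your installed PyTorch was compiled with can be read directly from the package; it should match the -D_GLIBCXX_USE_CXX11_ABI value that shows up in the build log further down:

import torch

print(torch._C._GLIBCXX_USE_CXX11_ABI)   # ABI flag PyTorch itself was built with
print(torch.compiled_with_cxx11_abi())   # the same information via the public helper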

cpp file

The main point of this file is that torch::Tensor replaces the original raw pointer/array types (you can also try a more specific tensor type, e.g. torch.IntTensor).
Inside PYBIND11_MODULE(TORCH_EXTENSION_NAME, m), the first argument of m.def is the name the function will have in Python, the second is the address of the C++ function, and the third is a short docstring. The module name used for `import ...` comes from TORCH_EXTENSION_NAME, which the build sets to the extension name (extnms here).

#include <math.h>
#include "cuda/nms_kernel.h"   // header for the CUDA kernel wrapper
#include <torch/extension.h>
#include <pybind11/pybind11.h>
namespace py = pybind11;       // handles the C++ <-> Python conversion

// extern THCState *state;     // leftover from the old THC-based code, no longer needed

// declaration of the CUDA wrapper; optional, since it already comes from nms_kernel.h
int ApplyNMSGPU(torch::Tensor keep_out,
                torch::Tensor boxes_dev,
                int boxes_num,
                float nms_overlap_thresh,
                int device_id);

// the function exposed to Python; it calls the CUDA wrapper ApplyNMSGPU
int nms_apply(torch::Tensor keep,
              torch::Tensor boxes_sorted,
              float nms_thresh)
{
  torch::Tensor keep_data = keep;
  torch::Tensor boxes_sorted_data = boxes_sorted;
  int boxes_num = boxes_sorted.size(0);
  int devId = boxes_sorted.get_device();
  return ApplyNMSGPU(keep_data, boxes_sorted_data, boxes_num, nms_thresh, devId);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("nms_apply", &nms_apply, "nms");
}

cpp header file

#include <torch/extension.h>

int nms_apply(torch::Tensor keep,
              torch::Tensor boxes_sorted,
              float nms_thresh);

cuda file

You only need to pay attention to:
· Data types: scalar_t and scalar_t* are template placeholder types whose concrete type is decided later (size_t is just the usual size type).
· The macro AT_DISPATCH_ALL_TYPES: its first argument decides which concrete type scalar_t becomes, and its third argument is a lambda that launches the kernel. Watch the types of the kernel arguments: a torch::Tensor needs .data<scalar_t>(), but calling .data<scalar_t>() on anything else (e.g. a raw pointer) fails with: error: expression must have class type


//#include <ATen/ATen.h>
#include <vector>
#include <iostream>
#include <cstring>
#include <torch/extension.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(condition)                                \
  /* Code block avoids redefinition of cudaError_t error */  \
  do {                                                       \
    cudaError_t error = condition;                           \
    if (error != cudaSuccess) {                              \
      std::cout << cudaGetErrorString(error) << std::endl;   \
    }                                                        \
  } while (0)

#define DIVUP(m, n) ((m) / (n) + ((m) % (n) > 0))

const long threadsPerBlock = sizeof(unsigned long long) * 8;

/*
__device__ inline float devIoU(scalar_t* a, scalar_t* b) {
  float left = max(a[0], b[0]), right = min(a[2], b[2]);
  float top = max(a[1], b[1]), bottom = min(a[3], b[3]);
  float width = max(right - left + 1, 0.f), height = max(bottom - top + 1, 0.f);
  float interS = width * height;
  float Sa = (a[2] - a[0] + 1) * (a[3] - a[1] + 1);
  float Sb = (b[2] - b[0] + 1) * (b[3] - b[1] + 1);
  return interS / (Sa + Sb - interS);
}
*/

// cxg script: templated IoU so that scalar_t is resolved by AT_DISPATCH_ALL_TYPES
template <typename scalar_t>
__device__ __forceinline__ scalar_t devIoU(scalar_t* a, scalar_t* b) {
  float left = max(a[0], b[0]), right = min(a[2], b[2]);
  float top = max(a[1], b[1]), bottom = min(a[3], b[3]);
  float width = max(right - left + 1, 0.f), height = max(bottom - top + 1, 0.f);
  float interS = width * height;
  float Sa = (a[2] - a[0] + 1) * (a[3] - a[1] + 1);
  float Sb = (b[2] - b[0] + 1) * (b[3] - b[1] + 1);
  return interS / (Sa + Sb - interS);
}

// cxg script: templated NMS kernel
template <typename scalar_t>
__global__ void nms_kernel(const size_t n_boxes,
                           const scalar_t nms_overlap_thresh,
                           scalar_t* __restrict__ dev_boxes,
                           unsigned long long* __restrict__ dev_mask) {
  const int row_start = blockIdx.y;
  const int col_start = blockIdx.x;
  // if (row_start > col_start) return;
  const int row_size =
      min(n_boxes - row_start * threadsPerBlock, threadsPerBlock);
  const int col_size =
      min(n_boxes - col_start * threadsPerBlock, threadsPerBlock);
  __shared__ scalar_t block_boxes[threadsPerBlock * 5];
  if (threadIdx.x < col_size) {
    block_boxes[threadIdx.x * 4 + 0] =
        dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 4 + 0];
    block_boxes[threadIdx.x * 4 + 1] =
        dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 4 + 1];
    block_boxes[threadIdx.x * 4 + 2] =
        dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 4 + 2];
    block_boxes[threadIdx.x * 4 + 3] =
        dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 4 + 3];
  }
  __syncthreads();
  if (threadIdx.x < row_size) {
    const int cur_box_idx = threadsPerBlock * row_start + threadIdx.x;
    scalar_t* cur_box = dev_boxes + cur_box_idx * 4;
    int i = 0;
    unsigned long long t = 0;
    int start = 0;
    if (row_start == col_start) {
      start = threadIdx.x + 1;
    }
    for (i = start; i < col_size; i++) {
      if (devIoU(cur_box, block_boxes + i * 4) > nms_overlap_thresh) {
        t |= 1ULL << i;
      }
    }
    const int col_blocks = DIVUP(n_boxes, threadsPerBlock);
    dev_mask[cur_box_idx * col_blocks + col_start] = t;
  }
}

void _set_device(int device_id) {
  int current_device;
  CUDA_CHECK(cudaGetDevice(&current_device));
  if (current_device == device_id) {
    return;
  }
  // The call to cudaSetDevice must come before any calls to Get, which
  // may perform initialization using the GPU.
  CUDA_CHECK(cudaSetDevice(device_id));
}

// cxg script: host-side wrapper called from nms_cuda.cpp
int ApplyNMSGPU(torch::Tensor keep_out,
                torch::Tensor boxes_dev,
                int boxes_num,
                float nms_overlap_thresh,
                int device_id) {
  _set_device(device_id);
  unsigned long long* mask_dev = NULL;
  const auto col_blocks = DIVUP(boxes_num, threadsPerBlock);
  CUDA_CHECK(cudaMalloc(&mask_dev,
                        boxes_num * col_blocks * sizeof(unsigned long long)));
  dim3 blocks(DIVUP(boxes_num, threadsPerBlock),
              DIVUP(boxes_num, threadsPerBlock));
  dim3 threads(threadsPerBlock);
  // the first argument decides what scalar_t becomes; the lambda launches the kernel
  AT_DISPATCH_ALL_TYPES(boxes_dev.type(), "ApplyNMSGPU", ([&] {
    nms_kernel<scalar_t><<<blocks, threads>>>(boxes_num,
                                              nms_overlap_thresh,
                                              boxes_dev.data<scalar_t>(),
                                              mask_dev);
  }));
  std::vector<unsigned long long> mask_host(boxes_num * col_blocks);
  CUDA_CHECK(cudaMemcpy(&mask_host[0],
                        mask_dev,
                        sizeof(unsigned long long) * boxes_num * col_blocks,
                        cudaMemcpyDeviceToHost));
  std::vector<unsigned long long> remv(col_blocks);
  memset(&remv[0], 0, sizeof(unsigned long long) * col_blocks);
  int num_to_keep = 0;
  for (int i = 0; i < boxes_num; i++) {
    int nblock = i / threadsPerBlock;
    int inblock = i % threadsPerBlock;
    if (!(remv[nblock] & (1ULL << inblock))) {
      keep_out[num_to_keep++] = i;
      auto p = &mask_host[0] + i * col_blocks;
      for (int j = nblock; j < col_blocks; j++) {
        remv[j] |= p[j];
      }
    }
  }
  CUDA_CHECK(cudaFree(mask_dev));
  return num_to_keep;
}

cuda header file

#include <torch/extension.h>

int ApplyNMSGPU(torch::Tensor keep_out,
                torch::Tensor boxes_dev,
                int boxes_num,
                float nms_overlap_thresh,
                int device_id);

Compilation

Command: python build.py install
Result: the output shows every nvcc and gcc compiler flag, so you can check whether they are what you expect:

building 'extnms' extension
/usr/local/cuda/bin/nvcc -DWITH_CUDA -Isrc/ -Isrc/cuda/ -I/usr/local/lib/python3.8/dist-packages/torch/include -I/usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/dist-packages/torch/include/TH -I/usr/local/lib/python3.8/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.8 -c src/cuda/nms_kernel.cu -o build/temp.linux-x86_64-3.8/src/cuda/nms_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=extnms -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_61,code=sm_61 -std=c++11
/usr/local/lib/python3.8/dist-packages/torch/include/c10/core/TensorTypeSet.h(44): warning: integer conversion resulted in a change of sign
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DWITH_CUDA -Isrc/ -Isrc/cuda/ -I/usr/local/lib/python3.8/dist-packages/torch/include -I/usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/dist-packages/torch/include/TH -I/usr/local/lib/python3.8/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.8 -c src/nms_cuda.cpp -o build/temp.linux-x86_64-3.8/src/nms_cuda.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=extnms -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.8/src/cuda/nms_kernel.o build/temp.linux-x86_64-3.8/src/nms_cuda.o -L/usr/local/cuda/lib64 -lcudart -o build/lib.linux-x86_64-3.8/extnms.cpython-38-x86_64-linux-gnu.so
creating build/bdist.linux-x86_64/egg
copying build/lib.linux-x86_64-3.8/extnms.cpython-38-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
creating stub loader for extnms.cpython-38-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/extnms.py to extnms.cpython-38.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying ext.nms.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying ext.nms.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying ext.nms.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying ext.nms.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
__pycache__.extnms.cpython-38: module references __file__
creating 'dist/ext.nms-0.0.0-py3.8-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing ext.nms-0.0.0-py3.8-linux-x86_64.egg
removing '/usr/local/lib/python3.8/dist-packages/ext.nms-0.0.0-py3.8-linux-x86_64.egg' (and everything under it)
creating /usr/local/lib/python3.8/dist-packages/ext.nms-0.0.0-py3.8-linux-x86_64.egg
Extracting ext.nms-0.0.0-py3.8-linux-x86_64.egg to /usr/local/lib/python3.8/dist-packages
ext.nms 0.0.0 is already the active version in easy-install.pth
Installed /usr/local/lib/python3.8/dist-packages/ext.nms-0.0.0-py3.8-linux-x86_64.egg
Processing dependencies for ext.nms==0.0.0
Finished processing dependencies for ext.nms==0.0.0

Usage

import torch
import extnms
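
A minimal sketch of how the compiled module might be called. The function name and argument order (keep, sorted boxes, IoU threshold) come from nms_cuda.cpp above; the box layout, the dtype/device of the keep tensor, and the sorting step are my assumptions about this interface:

import torch
import extnms

xy = torch.rand(100, 2) * 100
wh = torch.rand(100, 2) * 20
boxes = torch.cat([xy, xy + wh], dim=1).cuda()    # (x1, y1, x2, y2), on the GPU
scores = torch.rand(100).cuda()

order = scores.sort(descending=True)[1]
boxes_sorted = boxes[order].contiguous()          # the kernel assumes boxes sorted by score

keep = torch.zeros(boxes_sorted.size(0), dtype=torch.long)   # filled in by ApplyNMSGPU
num_kept = extnms.nms_apply(keep, boxes_sorted, 0.7)          # 0.7 = IoU threshold
keep_idx = order[keep[:num_kept]]                 # indices of the kept boxes in the original order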

Summary

I was stuck for many days on errors like lltm_cpp.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC1ENS_14NMSAPPLYGPUXXXX. This kind of error means that at load time the compiled library references a symbol that cannot be resolved (running the mangled name through c++filt tells you which function is missing). In this project, the cure was that the .h files must be used; the official tutorial does not use .h files at all, which is exactly what causes this error.

Error reported when running setup.py:

ValueError: bad marshal data (unknown type code)

Solution:

pip3 install --upgrade --force-reinstall setuptools
