pytorch编写cuda/c++ extention 方法
起因
我在尝试一篇目标识别的论文的开源代码时,发现由于其NMS、LSTM等模块是使用c++编译的(为了弥补python速度的短板),对Pytorch要求不能使用超过0.3的版本。和FASTER RCNN开源代码很像,不过Faster-rcnn官方更新了其包含的c++文件,从而可以在pytorch大于1的版本上使用。我不想退回pytorch 0.3版本,于是尝试重写c++/cuda 扩展的模块
prerequisite
CUDA10.0
pytorch1.4+cu100
torchvision0.5.0+cu100
setuptools 52.0.0
方法
由于原始代码采用较老的方式(THC模块)实现扩展,这里尝试pytorch官方 tutorial的方法实现
https://pytorch.org/tutorials/advanced/cpp_extension.html#writing-a-c-extension
然而,这教程有一个很致命的问题,按照它的方法用不了,被坑了。所以…先看看作为参考吧。
概述
以NMS模块为例,文件树如下:
|project name
|---cuda
| |---nms_kernel.cu
| |---nms_kernel.h
|---nms_cuda.cpp
|---nms_cuda.h
|---build.py
其中nms_kernel.cu包括cuda核函数和调用它的函数A,nms_cuda.c又调用A函数,最后通过build.py编译扩展模块
##build.py
import os
import torch
from setuptools import setup, Extension
from torch.utils.cpp_extension import BuildExtension, CUDAExtension
from setuptools import setup, Extension
from torch.utils import cpp_extension
# Might have to export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}} ?
torch._C._GLIBCXX_USE_CXX11_ABI=0 //设置编译的参数
sources = []
headers = []
defines = []
inl = []
with_cuda = Falseif torch.cuda.is_available():print('Including CUDA code.')sources += ['src/cuda/nms_kernel.cu','src/nms_cuda.cpp'] //待编译的c与cu文件# sources += ['src/nms_cuda.cpp']headers += ['src/','src/cuda/'] //头文件目录defines += [('WITH_CUDA', None)]with_cuda = Trueinl +=['src/','src/cuda/']this_file = os.path.dirname(os.path.realpath(__file__))
print(this_file)
setup(name='ext.nms', //拓展包名,不太重要ext_modules=[CUDAExtension('extnms',sources=sources,include_dirs= headers ,define_macros=defines, extra_compile_flag=('-std=c++11'),//使用c11编译with_cuda=with_cuda)],cmdclass={'build_ext': BuildExtension})
考虑到有人是源码编译pytorch,那么就需要注意,编译pytorch时的-D GLIBCXX_USE_CXX11_ABI
标志位,extention编译的标志位要和pytorch编译的一致,一般来说,pip安装不用考虑这个问题
cpp文件
该文件主要特点是:使用torch::Tensor代替原本的指针、数组类型(当然可以尝试指定数据类型,例如torch.IntTensor
)
宏PYBIND11_MODULE
中第一个参数即函数名,第二个是该函数指针地址,最后一个参数是模块名(用于import …)
#include pip
#include <math.h>
#include "cuda/nms_kernel.h" //cuda核函数的头文件
#include <torch/extension.h>
#include <pybind11/pybind11.h>
namespace py = pybind11; //用于c++到python的转换
extern THCState *state;
int ApplyNMSGPU( //声明cuda函数,不要也可torch::Tensor keep_out,torch::Tensor boxes_dev,int boxes_num,torch::Tensor nms_overlap_thresh,int device_id);int nms_apply( //需要转换陈python的函数,里面调用了cuda函数ApplyNMSGPUtorch::Tensor keep,torch::Tensor boxes_sorted,float nms_thresh
)
{torch::Tensor keep_data = keep;
torch::Tensor boxes_sorted_data = boxes_sorted;
int boxes_num = boxes_sorted.size(0);
int devId = boxes_sorted.get_device();return ApplyNMSGPU(keep_data, boxes_sorted_data, boxes_num, nms_thresh, devId);
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {m.def("nms_apply", &nms_apply, "nms");
}
cpp的头文件
#include <torch/extension.h>
int nms_apply(torch::Tensor keep,torch::Tensor boxes_sorted,float nms_thresh
);
cuda文件
你只需要关注:
·数据类型:scalar_t、scalar_t*、size_t
他们都是模板数据类型,在后面确定具体是什么。
·宏AT_DISPATCH_ALL_TYPE
,参数一:决定scalar_t是哪种具体数据类型,参数三:lambla函数,注意核函数的参数数据类型,对于torch:tensor的,需要.data<scalar_t>()
,但是,如果是别的类型(例如指针)还用.data<scalar_t>()
的话会报错:error: expression must have class type
//#include <ATen/ATen.h>
#include <vector>
#include <iostream>
#include <torch/extension.h>
#include <cuda.h>
#include <cuda_runtime.h>
#define CUDA_CHECK(condition) \/* Code block avoids redefinition of cudaError_t error */ \do { \cudaError_t error = condition; \if (error != cudaSuccess) { \std::cout << cudaGetErrorString(error) << std::endl; \} \} while (0)#define DIVUP(m,n) ((m) / (n) + ((m) % (n) > 0))
const long threadsPerBlock = sizeof(unsigned long long) * 8;
/*
__device__ inline float devIoU(scalar_t* a, scalar_t* b) {float left = max(a[0], b[0]), right = min(a[2], b[2]);float top = max(a[1], b[1]), bottom = min(a[3], b[3]);float width = max(right - left + 1, 0.f), height = max(bottom - top + 1, 0.f);float interS = width * height;float Sa = (a[2] - a[0] + 1) * (a[3] - a[1] + 1);float Sb = (b[2] - b[0] + 1) * (b[3] - b[1] + 1);return interS / (Sa + Sb - interS);
}
*/
cxg script/
template <typename scalar_t>
__device__ __forceinline__ scalar_t devIoU( scalar_t* a, scalar_t* b) {float left = max(a[0], b[0]), right = min(a[2], b[2]);float top = max(a[1], b[1]), bottom = min(a[3], b[3]);float width = max(right - left + 1, 0.f), height = max(bottom - top + 1, 0.f);float interS = width * height;float Sa = (a[2] - a[0] + 1) * (a[3] - a[1] + 1);float Sb = (b[2] - b[0] + 1) * (b[3] - b[1] + 1);return interS / (Sa + Sb - interS);
}
/////cxg script//
template <typename scalar_t>
__global__ void nms_kernel(const size_t __restrict__ n_boxes,scalar_t__restrict__ nms_overlap_thresh,scalar_t* __restrict__ dev_boxes,unsigned long long* __restrict__ dev_mask) {const int row_start = blockIdx.y;const int col_start = blockIdx.x;// if (row_start > col_start) return;const int row_size =min(n_boxes - row_start * threadsPerBlock, threadsPerBlock);const int col_size =min(n_boxes - col_start * threadsPerBlock, threadsPerBlock);__shared__ scalar_t block_boxes[threadsPerBlock * 5];if (threadIdx.x < col_size) {block_boxes[threadIdx.x * 4 + 0] =dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 4 + 0];block_boxes[threadIdx.x * 4 + 1] =dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 4 + 1];block_boxes[threadIdx.x * 4 + 2] =dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 4 + 2];block_boxes[threadIdx.x * 4 + 3] =dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 4 + 3];}__syncthreads();if (threadIdx.x < row_size) {const int cur_box_idx = threadsPerBlock * row_start + threadIdx.x;scalar_t* cur_box = dev_boxes + cur_box_idx * 4;int i = 0;unsigned long long t = 0;int start = 0;if (row_start == col_start) {start = threadIdx.x + 1;}for (i = start; i < col_size; i++) {if (devIoU(cur_box, block_boxes + i * 4) > nms_overlap_thresh) {t |= 1ULL << i;}}const int col_blocks = DIVUP(n_boxes, threadsPerBlock);dev_mask[cur_box_idx * col_blocks + col_start] = t;}
}//
void _set_device(int device_id) {int current_device;CUDA_CHECK(cudaGetDevice(¤t_device));if (current_device == device_id) {return;}// The call to cudaSetDevice must come before any calls to Get, which// may perform initialization using the GPU.CUDA_CHECK(cudaSetDevice(device_id));
}cxg script ///
int ApplyNMSGPU(torch::Tensor keep_out,torch::Tensor boxes_dev,int boxes_num,float nms_overlap_thresh,int device_id) {_set_device(device_id);unsigned long long* mask_dev = NULL;const auto col_blocks = DIVUP(boxes_num, threadsPerBlock);CUDA_CHECK(cudaMalloc(&mask_dev,boxes_num * col_blocks * sizeof(unsigned long long)));dim3 blocks(DIVUP(boxes_num, threadsPerBlock),DIVUP(boxes_num, threadsPerBlock));dim3 threads(threadsPerBlock);AT_DISPATCH_ALL_TYPES(boxes_dev.type(), "ApplyNMSGPU", ([&] {nms_kernel<scalar_t><<<blocks, threads>>>(boxes_num,nms_overlap_thresh,boxes_dev.data<scalar_t>(),mask_dev);}));std::vector<unsigned long long> mask_host(boxes_num * col_blocks);CUDA_CHECK(cudaMemcpy(&mask_host[0],mask_dev,sizeof(unsigned long long) * boxes_num * col_blocks,cudaMemcpyDeviceToHost));std::vector<unsigned long long> remv(col_blocks);memset(&remv[0], 0, sizeof(unsigned long long) * col_blocks);int num_to_keep = 0;for (int i = 0; i < boxes_num; i++) {int nblock = i / threadsPerBlock;int inblock = i % threadsPerBlock;if (!(remv[nblock] & (1ULL << inblock))) {keep_out[num_to_keep++] = i;auto p = &mask_host[0] + i * col_blocks;for (int j = nblock; j < col_blocks; j++) {remv[j] |= p[j];}}}CUDA_CHECK(cudaFree(mask_dev));return num_to_keep;
}/
cuda头文件
#include <torch/extension.h>
int ApplyNMSGPU(torch::Tensor keep_out,torch::Tensor boxes_dev,int boxes_num,float nms_overlap_thresh,int device_id);
编译:
命令:python build.py install
结果:你可以看到所有的nvcc gcc编译器的编译参数,确定是否正确
building 'extnms' extension
/usr/local/cuda/bin/nvcc -DWITH_CUDA -Isrc/ -Isrc/cuda/ -I/usr/local/lib/python3.8/dist-packages/torch/include -I/usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/dist-packages/torch/include/TH -I/usr/local/lib/python3.8/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.8 -c src/cuda/nms_kernel.cu -o build/temp.linux-x86_64-3.8/src/cuda/nms_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=extnms -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_61,code=sm_61 -std=c++11
/usr/local/lib/python3.8/dist-packages/torch/include/c10/core/TensorTypeSet.h(44): warning: integer conversion resulted in a change of signx86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DWITH_CUDA -Isrc/ -Isrc/cuda/ -I/usr/local/lib/python3.8/dist-packages/torch/include -I/usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/dist-packages/torch/include/TH -I/usr/local/lib/python3.8/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.8 -c src/nms_cuda.cpp -o build/temp.linux-x86_64-3.8/src/nms_cuda.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=extnms -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.8/src/cuda/nms_kernel.o build/temp.linux-x86_64-3.8/src/nms_cuda.o -L/usr/local/cuda/lib64 -lcudart -o build/lib.linux-x86_64-3.8/extnms.cpython-38-x86_64-linux-gnu.so
creating build/bdist.linux-x86_64/egg
copying build/lib.linux-x86_64-3.8/extnms.cpython-38-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
creating stub loader for extnms.cpython-38-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/extnms.py to extnms.cpython-38.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying ext.nms.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying ext.nms.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying ext.nms.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying ext.nms.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
__pycache__.extnms.cpython-38: module references __file__
creating 'dist/ext.nms-0.0.0-py3.8-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing ext.nms-0.0.0-py3.8-linux-x86_64.egg
removing '/usr/local/lib/python3.8/dist-packages/ext.nms-0.0.0-py3.8-linux-x86_64.egg' (and everything under it)
creating /usr/local/lib/python3.8/dist-packages/ext.nms-0.0.0-py3.8-linux-x86_64.egg
Extracting ext.nms-0.0.0-py3.8-linux-x86_64.egg to /usr/local/lib/python3.8/dist-packages
ext.nms 0.0.0 is already the active version in easy-install.pthInstalled /usr/local/lib/python3.8/dist-packages/ext.nms-0.0.0-py3.8-linux-x86_64.egg
Processing dependencies for ext.nms==0.0.0
Finished processing dependencies for ext.nms==0.0.0
调用
import pytorch
import extnms
…
小结
我哼多天卡在类似lltm_cpp.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC1ENS_14NMSAPPLYGPUXXXX
这种错误上,这种错误就是编译时,库使用了未知的函数,在这个项目里,体现在必须使用.h文件,然鹅官方教程就是没有用.h文件,导致这个错误
运行setup.py时报错:
ValueError: bad marshal data (unknown type code)
solusion:
pip3 install --upgrade --force-reinstall setuptools
pytorch编写cuda/c++ extention 方法相关推荐
- PyTorch 编写自定义数据集,数据加载器和转换
本文为 pytorch 官方教程https://pytorch.org/tutorials/beginner/data_loading_tutorial.html代码的注释 w3cschool 的翻译 ...
- Ubuntu18.04+RTX3060显卡配置pytorch、cuda、cudnn和miniconda
0. 前言 之前已经安装成功了,也发了篇博客梳理了整套流程如下. ubuntu18.04安装pytorch.cuda.cudnn和miniconda_Toblerone_Wind的博客-CSDN博客_ ...
- TensorFlow/PyTorch和cuda等版本对应关系
参考: pytorch.显卡.显卡驱动.cuda版本是如何对应的:https://www.jianshu.com/p/ac70300b598b NVIDIA系列显卡做深度学习,需要 在主机安装显卡驱动 ...
- cuda11.2安装pytorch——torch.cuda.is_available()=false
这两天正在用服务器跑程序,但是发现运行速度极慢,查看正在运行的进程,也没看到自己的进程,但是程序又确确实实在运行,这就奇了怪了,一通查找,发现程序竟然是在CPU上运行,也就是 torch.cuda.i ...
- 加载dict_PyTorch 7.保存和加载pytorch模型的两种方法
众所周知,python的对象都可以通过torch.save和torch.load函数进行保存和加载(不知道?那你现在知道了(*^_^*)),比如: x1 = {"d":" ...
- php 单例类 mysql pdo_PHP实战:PHP基于单例模式编写PDO类的方法
<PHP实战:PHP基于单例模式编写PDO类的方法>要点: 本文介绍了PHP实战:PHP基于单例模式编写PDO类的方法,希望对您有用.如果有疑问,可以联系我们. 一.单例模式简介 简单的说 ...
- 在猜年龄的基础上编写登录、注册方法,并且把猜年龄游戏分函数处理
''' 在猜年龄的基础上编写登录.注册方法,并且把猜年龄游戏分函数处理,如 2. 登录函数 3. 注册函数 4. 猜年龄三次函数 5. 选择三次奖品函数 ''' import random def u ...
- Delphi环境中编写调用DLL的方法和技巧
Delphi环境中编写调用DLL的方法和技巧 第一章 为什么要使用动态链接库(DLL) top 提起DLL您一定不会陌生,在Windows中有着大量的以DLL为后缀的文件,它们是保证Windows正常 ...
- VS开发中的代码编写小技巧——避免重复代码编写的几种方法
原文:VS开发中的代码编写小技巧--避免重复代码编写的几种方法 上一篇文章中程序员的幸福生活--有你的日子,每天都是情人节,收到了大家的很多好评.鼓励和祝福,非常感动,真诚的谢谢大家.也希望每个朋友都 ...
最新文章
- HALCON基于形变的模板匹配实现
- 如何提升蜘蛛的抓取频率?
- python装饰器 property_Python中@property装饰器的使用技巧性解析(代码示例)
- win10安装oracle 11g最新亲身经历操作记录
- Codeigniter 获取当前的控制器名称和方法名称
- java调用怎么调用方法区_Java中的方法调用有多昂贵
- 支付宝前端推出《Web前端开发入门手册》
- (43)System Verilog模块变量数据位宽扩展
- Python工作笔记003---正则中的re.I re.M_以及m.group和m.groups的解释
- 最常用的10种CSS BUG解决方法与技巧-浏览器兼容教程
- nginx fastcgi python_Nginx+FastCGI+Python
- excel格宽度转html像素,使列表框的列宽自动适应内容的宽度?(已有思路但需解决转换成像素问题)...
- WPF界面设计辅助工具--Blend
- Mac 给新人的入手指南
- react中请求网络图片加载不出来的问题 解决
- 傲慢与偏见之 - 因果倒置的锦上添花
- Html笔记——实现一组图片循环且首尾相连的滚动效果
- 3.2 0.96寸OLED显示屏的使用
- 【web前端】JavaScript总结
- fanuc机器人SCN_FANUC福尼斯以太网通讯.pdf