Building PyTorch 1.0.0 from source on Jetson TX2 (a true story of blood and tears)

This post uses an installation for Python 3.5 as the example. A few points before starting:

Key point 1: the platform, and how CUDA and cuDNN were installed

Jetson TX2 platform: JetPack 3.3, CUDA 9.0.252, cuDNN 7.1.5, TensorRT 4.0.2, Python 2.7/Python 3.5
Kernel: tegra-ubuntu 4.4.38-tegra aarch64
Linux release: Ubuntu 16.04
CMake 3.15.6 (a freshly flashed TX2 comes with CMake 3.5.1; while tinkering later I read that CMake 3.9.0 or newer is recommended, so I upgraded straight to a recent version)

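Building PyTorch from source uses CUDA and cuDNN, so first check whether the CUDA and cuDNN on your Jetson TX2 were installed through JetPack. If not, be careful! The Jetson TX2 CPU is ARM-based, so the CUDA and cuDNN you install must be ARM (aarch64) builds. For installing CUDA and cuDNN on the Jetson TX2, see: Jetson TX2 安装 cuda9.0 及 cudnn7 超详细（真实亲测） ("Installing CUDA 9.0 and cuDNN 7 on Jetson TX2, in full detail (tested for real)")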

Key point 2: downloading the PyTorch source

1. Different PyTorch versions require different CUDA versions

Cloning from PyTorch's GitHub gives you the latest PyTorch. At the time of writing (2020-01-14), running

git clone  http://github.com/pytorch/pytorch

gives PyTorch 1.4.0a0. Installing that version requires CUDA 9.2 or later, which does not match my platform. The correspondence between PyTorch and CUDA versions is listed on the PyTorch website:

PyTorch previous versions: https://pytorch.org/get-started/previous-versions/
PyTorch latest version: https://pytorch.org/get-started/locally/
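To see which side of that line a board falls on, a small sketch can read the installed CUDA version and compare it with a minimum. It assumes the toolkit lives at /usr/local/cuda and ships a version.txt containing text like "CUDA Version 9.0.252" (as CUDA 9.x does); the 9.2 minimum is the figure quoted above for PyTorch 1.4.0a0, and the helper names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: CUDA 9.x toolkits write a line such as "CUDA Version 9.0.252" here.
VERSION_FILE = Path("/usr/local/cuda/version.txt")


def parse_cuda_version(text):
    """Return (major, minor) from version.txt contents, or None if absent."""
    match = re.search(r"CUDA Version (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version, minimum):
    """True if a (major, minor) tuple is at least the required (major, minor)."""
    return version is not None and version >= minimum


if __name__ == "__main__":
    text = VERSION_FILE.read_text() if VERSION_FILE.exists() else ""
    installed = parse_cuda_version(text)
    print("installed CUDA:", installed)
    # 9.2 is the minimum the article quotes for the PyTorch 1.4.0a0 checkout.
    print("OK for PyTorch 1.4.0a0:", meets_minimum(installed, (9, 2)))
```

On the platform above (CUDA 9.0.252) this would report (9, 0) and False, which is why an older tag is needed.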

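If your platform matches mine, I recommend PyTorch 1.0.0. How do you download a specific version of PyTorch? I suggest reading this post: 如何下载自己想要版本的pytorch ("How to download the PyTorch version you want")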

2. The third_party folder in the PyTorch GitHub source holds only links, not files

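On PyTorch's GitHub page these folders appear to have content, but they are actually links to related subprojects (git submodules), so downloading the PyTorch source on its own does not bring the third-party libraries along. Keep this in mind when downloading PyTorch. I recommend this command: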

git clone --recursive --branch v1.0.0 http://github.com/pytorch/pytorch

Be sure to include --recursive, which recursively clones the git submodules.
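A quick way to confirm that --recursive did its job is to check that every submodule listed in .gitmodules was checked out into a non-empty directory. The sketch below does that; the default ./pytorch location and the helper names are assumptions, not part of PyTorch's tooling.

```python
import re
import sys
from pathlib import Path


def submodule_paths(gitmodules_text):
    """Extract every 'path = ...' entry from a .gitmodules file."""
    return re.findall(r"^\s*path\s*=\s*(\S+)\s*$", gitmodules_text, re.MULTILINE)


def empty_submodules(repo_root):
    """List submodule directories that are missing or contain no files."""
    root = Path(repo_root)
    gitmodules = root / ".gitmodules"
    if not gitmodules.exists():
        return []
    missing = []
    for rel in submodule_paths(gitmodules.read_text()):
        directory = root / rel
        if not directory.is_dir() or not any(directory.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    repo = sys.argv[1] if len(sys.argv) > 1 else "pytorch"
    for rel in empty_submodules(repo):
        print("empty submodule:", rel)
```

Any directory it prints is a submodule that was not fetched; git submodule update --init --recursive inside the checkout fills them in.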

Key point 3: turn off NCCL support, since NCCL is only available on desktop GPUs.

See https://devtalk.nvidia.com/default/topic/1042821/jetson-tx2/pytorch-install-broken/
If the build fails with the errors below, it is because NCCL was not turned off; how to turn it off is covered later.

nvlink error : entry function '_Z28ncclAllReduceLLKernel_sum_i88ncclColl' with max regcount of 80 calls function '_Z25ncclReduceScatter_max_u64P14CollectiveArgs' with regcount of 96
nvlink error : entry function '_Z29ncclAllReduceLLKernel_sum_i328ncclColl' with max regcount of 80 calls function '_Z25ncclReduceScatter_max_u64P14CollectiveArgs' with regcount of 96
nvlink error : entry function '_Z29ncclAllReduceLLKernel_sum_f168ncclColl' with max regcount of 80 calls function '_Z25ncclReduceScatter_max_u64P14CollectiveArgs' with regcount of 96
nvlink error : entry function '_Z29ncclAllReduceLLKernel_sum_u328ncclColl' with max regcount of 80 calls function '_Z25ncclReduceScatter_max_u64P14CollectiveArgs' with regcount of 96
nvlink error : entry function '_Z29ncclAllReduceLLKernel_sum_f328ncclColl' with max regcount of 80 calls function '_Z25ncclReduceScatter_max_u64P14CollectiveArgs' with regcount of 96
nvlink error : entry function '_Z29ncclAllReduceLLKernel_sum_u648ncclColl' with max regcount of 80 calls function '_Z25ncclReduceScatter_max_u64P14CollectiveArgs' with regcount of 96
nvlink error : entry function '_Z28ncclAllReduceLLKernel_sum_u88ncclColl' with max regcount of 80 calls function '_Z25ncclReduceScatter_max_u64P14CollectiveArgs' with regcount of 96
Makefile:83: recipe for target '/home/nvidia/jetson-reinforcement/build/pytorch/third_party/build/nccl/obj/collectives/device/devlink.o' failed
make[5]: *** [/home/nvidia/jetson-reinforcement/build/pytorch/third_party/build/nccl/obj/collectives/device/devlink.o] Error 255
Makefile:45: recipe for target 'devicelib' failed
make[4]: *** [devicelib] Error 2
Makefile:24: recipe for target 'src.build' failed
make[3]: *** [src.build] Error 2
CMakeFiles/nccl.dir/build.make:60: recipe for target 'lib/libnccl.so' failed
make[2]: *** [lib/libnccl.so] Error 2
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/nccl.dir/all' failed
make[1]: *** [CMakeFiles/nccl.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

So I downloaded PyTorch 1.0.0. The installation follows.

I. Determine which versions the python and pip commands use

A freshly flashed Jetson TX2 comes with both Python 2.7 and Python 3.5, so pay attention to which version the default python and pip commands belong to. For example, on my platform:

$ pip --version
Output:
pip 19.3.1 from /home/nvidia/.local/lib/python2.7/site-packages/pip (python 2.7)

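This means the default pip is bound to Python 2.7. To install PyTorch for Python 3.5, use the python3 and pip3 commands instead (you can also change which version the default python and pip point to; look up how on your own).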
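With several interpreters installed, the "(python X.Y)" suffix of pip --version is the quickest tell. A small sketch that runs both pip and pip3 and extracts that suffix; the helper names are hypothetical, and subprocess.run is called with stdout=PIPE rather than capture_output so it also works on the TX2's Python 3.5.

```python
import re
import subprocess


def pip_python_version(pip_version_output):
    """Return the interpreter version pip reports, e.g. '2.7', or None."""
    match = re.search(r"\(python (\d+\.\d+)\)", pip_version_output)
    return match.group(1) if match else None


def query(command):
    """Run e.g. ['pip3', '--version'] and return its stdout ('' if not found)."""
    try:
        result = subprocess.run(command, stdout=subprocess.PIPE, universal_newlines=True)
    except FileNotFoundError:
        return ""
    return result.stdout


if __name__ == "__main__":
    for pip in ("pip", "pip3"):
        print(pip, "->", pip_python_version(query([pip, "--version"])))
```

On the platform described above, pip should map to 2.7 and pip3 to 3.5.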

II. Install dependencies and required components

sudo apt install libopenblas-dev libatlas-dev liblapack-dev
sudo apt install liblapacke-dev checkinstall # For OpenCV
sudo apt-get install python3-pip
pip3 install --upgrade pip==9.0.1
sudo apt-get install python3-dev
sudo pip3 install numpy scipy
sudo pip3 install pyyaml
sudo pip3 install scikit-build
sudo apt-get -y install cmake
sudo apt install libffi-dev
sudo pip3 install cffi

Once these are installed, add the cuDNN lib and include paths:

gedit ~/.bashrc        # append the two export lines below to the end of the file
export CUDNN_LIB_DIR=/usr/lib/aarch64-linux-gnu
export CUDNN_INCLUDE_DIR=/usr/include
source ~/.bashrc       # reload so the current shell picks them up
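These exports only help if they point at real files. A small sketch can confirm that: it reads the version macros from cudnn.h (cuDNN 7 defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL there) and lists the libcudnn.so* files in CUDNN_LIB_DIR, falling back to the two paths above when the variables are unset. The helper names are hypothetical.

```python
import os
import re
from pathlib import Path


def cudnn_version(header_text):
    """Read CUDNN_MAJOR/MINOR/PATCHLEVEL #defines; None if any is missing."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"#define\s+%s\s+(\d+)" % name, header_text)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


def libcudnn_files(lib_dir):
    """Names of libcudnn shared objects in lib_dir (empty list if none)."""
    directory = Path(lib_dir)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.glob("libcudnn.so*"))


if __name__ == "__main__":
    include_dir = os.environ.get("CUDNN_INCLUDE_DIR", "/usr/include")
    lib_dir = os.environ.get("CUDNN_LIB_DIR", "/usr/lib/aarch64-linux-gnu")
    header = Path(include_dir, "cudnn.h")
    text = header.read_text() if header.exists() else ""
    print("cuDNN version from header:", cudnn_version(text))
    print("libcudnn files:", libcudnn_files(lib_dir))
```

With the JetPack 3.3 install above, the header should report (7, 1, 5).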

III. Download and modify the PyTorch 1.0.0 source

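Following the notes at the start of this post and the post 如何下载自己想要版本的pytorch ("How to download the PyTorch version you want"), download the PyTorch 1.0.0 source, then turn off NCCL.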

Note: NCCL must be turned off in the source code before compiling.

# sudo gedit /pytorch/CMakeLists.txt
#   > CMakeLists.txt: change NCCL to 'Off' on line 98
# sudo gedit /pytorch/setup.py
#   > setup.py: add USE_NCCL = False below line 200
# sudo gedit /pytorch/tools/setup_helpers/nccl.py
#   > nccl.py: change USE_SYSTEM_NCCL to 'False' on line 13
#              change NCCL to 'False' on line 78
# sudo gedit /pytorch/torch/csrc/cuda/nccl.h
#   > nccl.h: comment out the self-include on line 8
#             comment out all the code from line 21 to 28
# sudo gedit /pytorch/torch/csrc/distributed/c10d/ddp.cpp
#   > ddp.cpp: comment out the nccl.h include on line 6
#              comment out torch::cuda::nccl::reduce on line 163
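The nccl.h and ddp.cpp edits above boil down to "comment out these line numbers". A hedged sketch of a helper that does this with C++ // comments follows; the helper name is hypothetical, the line numbers come from the list above and only hold for the v1.0.0 tag, so check each file in an editor before running it. It writes a .bak copy first and skips lines that are already commented.

```python
import shutil
from pathlib import Path


def comment_out_lines(path, line_numbers, backup=True):
    """Prefix the given 1-based line numbers of a C/C++ file with '// '."""
    path = Path(path)
    if backup:
        shutil.copyfile(str(path), str(path) + ".bak")
    lines = path.read_text().splitlines(keepends=True)
    for number in line_numbers:
        index = number - 1
        if not 0 <= index < len(lines):
            raise IndexError("%s has no line %d" % (path, number))
        if not lines[index].lstrip().startswith("//"):
            lines[index] = "// " + lines[index]
    path.write_text("".join(lines))


if __name__ == "__main__":
    # Line numbers taken from the list above (PyTorch v1.0.0); verify before running.
    # Run from the root of the pytorch checkout.
    edits = {
        "torch/csrc/cuda/nccl.h": [8] + list(range(21, 29)),
        "torch/csrc/distributed/c10d/ddp.cpp": [6, 163],
    }
    for file_name, numbers in edits.items():
        if Path(file_name).exists():
            comment_out_lines(file_name, numbers)
```

If the reduce call on line 163 of ddp.cpp spans several lines, commenting only that one line will not compile, so that edit is safer done by hand.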

With the changes made, prepare the build by running the following commands:

cd pytorch
git submodule update --init --recursive   # if this command errors out, run  git init  first
sudo pip3 install -U setuptools
sudo pip3 install -r requirements.txt

IV. Build

First, switch the TX2 into its maximum-power mode so the build runs somewhat faster:

sudo nvpmodel -m 0         # switch to the maximum-performance mode (MAXN)
sudo ~/jetson_clocks.sh    # pin the clocks at maximum and force the fan to full speed

sudo pip3 install scikit-build --user
sudo ldconfig
export USE_NCCL=0
export USE_DISTRIBUTED=1
export USE_OPENCV=ON
export USE_CUDNN=1
export USE_CUDA=1
export ONNX_ML=1
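If you prefer not to rely on variables exported in the shell, the same switches can be handed to the build process explicitly. A minimal sketch, with hypothetical helper names; the flag values are copied from the export lines above, and build_wheel is defined but not called by default.

```python
import os
import subprocess
import sys

# The switches exported above; values copied from the article.
BUILD_FLAGS = {
    "USE_NCCL": "0",
    "USE_DISTRIBUTED": "1",
    "USE_OPENCV": "ON",
    "USE_CUDNN": "1",
    "USE_CUDA": "1",
    "ONNX_ML": "1",
}


def build_environment(base=None):
    """Copy of the base environment (default: os.environ) with the flags applied."""
    env = dict(os.environ if base is None else base)
    env.update(BUILD_FLAGS)
    return env


def build_wheel(source_dir="pytorch"):
    """Run 'setup.py bdist_wheel' in source_dir with the flags set; returns the exit code."""
    return subprocess.call([sys.executable, "setup.py", "bdist_wheel"],
                           cwd=source_dir, env=build_environment())


if __name__ == "__main__":
    for key in sorted(BUILD_FLAGS):
        print(key, "=", build_environment()[key])
```

Launch it with python3 so that sys.executable is the Python 3.5 interpreter the wheel should target. Because the flags are added inside the script, they survive even if the script itself is started with sudo.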

Then start the build:

sudo -E python3 setup.py bdist_wheel   # compiles and packages a wheel file into /pytorch/dist; -E keeps the exported USE_* variables, which sudo resets by default

When the long build finishes, run:

sudo -E DEBUG=1 python3 setup.py build develop
# If the TX2 runs out of memory at this step, first copy the wheel file out of /pytorch/dist,
# then run  sudo python3 setup.py clean  to remove the build output, cd to the directory you copied the wheel to, and install it with:
# sudo pip3 install torch-1.0.0a0-cp35-cp35m-linux_aarch64.whl
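Before running setup.py clean, it is worth confirming the wheel really is in dist/ and was built for Python 3.5 on aarch64. A small sketch, assuming the source tree is at ./pytorch and the wheel name follows the torch-<version>-cp35-cp35m-linux_aarch64.whl pattern shown above; find_wheels is a hypothetical helper.

```python
import sys
from pathlib import Path


def find_wheels(dist_dir, python_tag="cp35", platform_tag="linux_aarch64"):
    """Wheels in dist_dir built for the given CPython tag and platform."""
    pattern = "torch-*-%s-*-%s.whl" % (python_tag, platform_tag)
    return sorted(Path(dist_dir).glob(pattern))


if __name__ == "__main__":
    dist = sys.argv[1] if len(sys.argv) > 1 else "pytorch/dist"
    for wheel in find_wheels(dist):
        print(wheel)
```

An empty result means either the build did not finish or the wheel was built for a different interpreter (for example cp27 if plain python was used).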

After another equally long build, run the remaining installation commands:

sudo apt clean
sudo apt-get install libjpeg-dev zlib1g-dev
cd ~
git clone https://github.com/python-pillow/Pillow.git
cd Pillow/
sudo python3 setup.py install
sudo apt-get install python3-sklearn
sudo pip3 install pandas Cython scikit-image
sudo pip3 --no-cache-dir install torchvision
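Once everything is installed, a short check confirms that the build can actually use the GPU rather than just import. A sketch with a hypothetical helper name; it uses only torch.__version__, torch.version.cuda, torch.cuda.is_available() and torch.backends.cudnn.version(), and returns {'torch': None} when torch cannot be imported.

```python
def torch_build_report():
    """Summarise the installed torch build; returns {'torch': None} if import fails."""
    try:
        import torch
        import torch.backends.cudnn
    except ImportError:
        return {"torch": None}
    available = torch.cuda.is_available()
    report = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,
        "cuda_available": available,
        "cudnn": torch.backends.cudnn.version() if available else None,
    }
    if available:
        # A tiny GPU op proves kernels actually launch, not just that CUDA is detected.
        x = torch.ones(2, 2, device="cuda")
        report["gpu_sum"] = float((x + x).sum())
    return report


if __name__ == "__main__":
    for key, value in sorted(torch_build_report().items()):
        print(key, value)
```

Run it with python3; on the TX2 build above, cuda_available should be True and gpu_sum 8.0.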

Errors encountered during installation, and their fixes

error 1:

./caffe2/operators/quantized/int8_utils.h:4:38: fatal error: gemmlowp/public/gemmlowp.h: No such file or directory
compilation terminated.
Full output:

In file included from ../caffe2/operators/quantized/int8_concat_op.h:7:0,
                 from ../caffe2/operators/quantized/int8_concat_op.cc:1:
../caffe2/operators/quantized/int8_utils.h:4:38: fatal error: gemmlowp/public/gemmlowp.h: No such file or directory
compilation terminated.
[1669/2643] Building CXX object caffe2/CMakeFile...ators/rnn/recurrent_network_blob_fetcher_op.cc.o
ninja: build stopped: subcommand failed.
Failed to run 'bash ../tools/build_pytorch_libs.sh --use-cuda --use-nnpack --use-qnnpack caffe2'

Solution
The message shows that caffe2/operators/quantized/int8_utils.h cannot find gemmlowp/public/gemmlowp.h, even though the file does exist (under third_party/gemmlowp). The fix is to change the include in int8_utils.h from

#include <gemmlowp/public/gemmlowp.h>
to
#include "third_party/gemmlowp/public/gemmlowp.h"

Likewise, change the includes in caffe2/operators/quantized/int8_simd.h from

#include "gemmlowp/fixedpoint/fixedpoint.h"
#include "gemmlowp/public/gemmlowp.h"
to
#include "third_party/gemmlowp/fixedpoint/fixedpoint.h"
#include "third_party/gemmlowp/public/gemmlowp.h"
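Both header edits follow one rule: any include that starts with gemmlowp/ is redirected to third_party/gemmlowp/. A hedged sketch that applies the rule to the two headers named above (run from the pytorch root; the helper names are hypothetical). Already-patched includes no longer match the pattern, so running it twice is harmless.

```python
import re
from pathlib import Path

# Matches  #include <gemmlowp/...>  or  #include "gemmlowp/..."
GEMMLOWP_INCLUDE = re.compile(r'#include\s*[<"]gemmlowp/([^>"]+)[>"]')


def rewrite_gemmlowp_includes(source):
    """Point gemmlowp includes at third_party/gemmlowp, as described above."""
    return GEMMLOWP_INCLUDE.sub(r'#include "third_party/gemmlowp/\1"', source)


def patch_file(path):
    """Rewrite one header in place; returns True if anything changed."""
    path = Path(path)
    original = path.read_text()
    patched = rewrite_gemmlowp_includes(original)
    if patched != original:
        path.write_text(patched)
    return patched != original


if __name__ == "__main__":
    for header in ("caffe2/operators/quantized/int8_utils.h",
                   "caffe2/operators/quantized/int8_simd.h"):
        if Path(header).exists():
            print(header, "patched" if patch_file(header) else "unchanged")
```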

error 2:

libcudnn.so.7: error adding symbols: File in wrong format
Full output:

[ 58%] Linking CXX shared library ../lib/libcaffe2_gpu.so
/usr/local/cuda/lib64/
collect2: error: ld returned 1 exit status
caffe2/CMakeFiles/caffe2_gpu.dir/build.make:185448: recipe for target 'lib/libcaffe2_gpu.so' failed
make[2]: *** [lib/libcaffe2_gpu.so] Error 1
CMakeFiles/Makefile2:4400: recipe for target 'caffe2/CMakeFiles/caffe2_gpu.dir/all' failed
make[1]: *** [caffe2/CMakeFiles/caffe2_gpu.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 58%] Built target python_copy_files
Makefile:140: recipe for target 'all' failed
make: *** [all] Error 2
Failed to run 'bash ../tools/build_pytorch_libs.sh --use-cuda --use-nnpack --use-qnnpack caffe2'

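Solution
This happens because the cuDNN installed on the platform is not the ARM build. The Jetson TX2 is ARM-based, unlike a PC, where cuDNN is built for x86_64. The fix is therefore to install the ARM (aarch64) build of cuDNN; for how to do this, see:

Jetson TX2 安装 cuda9.0 及 cudnn7 超详细（真实亲测） ("Installing CUDA 9.0 and cuDNN 7 on Jetson TX2, in full detail (tested for real)")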
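To confirm this diagnosis before reinstalling, a short sketch can read the e_machine field of a library's ELF header, which records the architecture it was built for (62 = x86_64, 183 = aarch64, 40 = 32-bit ARM, per the ELF specification). The helper names are hypothetical.

```python
import struct
import sys

# e_machine values from the ELF specification
MACHINES = {62: "x86_64", 183: "aarch64", 40: "arm"}


def elf_machine(header):
    """Return the architecture named in an ELF header's e_machine field."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # EI_DATA (byte 5): 1 = little-endian, 2 = big-endian
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine sits after the 16-byte e_ident and the 2-byte e_type
    (machine,) = struct.unpack(byte_order + "H", header[18:20])
    return MACHINES.get(machine, "unknown (%d)" % machine)


def library_arch(path):
    """Read the first 20 bytes of a shared library (symlinks are followed)."""
    with open(path, "rb") as handle:
        return elf_machine(handle.read(20))


if __name__ == "__main__":
    for lib in sys.argv[1:]:
        print(lib, "->", library_arch(lib))
```

Saved as, say, check_arch.py (a hypothetical name) and run as python3 check_arch.py /usr/lib/aarch64-linux-gnu/libcudnn.so.7, a usable cuDNN should report aarch64; x86_64 means the desktop package was installed.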

error 3:

/third_party/onnx/onnx/onnx_pb.h:52:26: fatal error: onnx/onnx.pb.h: No such file or directory
compilation terminated.
Full output:

caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state.cc.o -MF caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state.cc.o.d -o caffe2/CMakeFiles/caffe2_pybind11_state.dir/python/pybind_state.cc.o -c ../caffe2/python/pybind_state.cc
In file included from ../caffe2/onnx/helper.h:4:0,
                 from ../caffe2/onnx/backend.h:5,
                 from ../caffe2/python/pybind_state.cc:19:
../third_party/onnx/onnx/onnx_pb.h:52:26: fatal error: onnx/onnx.pb.h: No such file or directory
compilation terminated.
[1735/2643] Building CXX object caffe2/CMakeFiles/caffe2.dir/share/contrib/depthwise/depthwise3x3_conv_op.cc.o
ninja: build stopped: subcommand failed.
Failed to run 'bash ../tools/build_pytorch_libs.sh --use-cuda --use-nnpack --use-qnnpack caffe2'

Solution
Reference: https://github.com/onnx/onnx/issues/1947

third_party/onnx/onnx/onnx_pb.h contains the following code:

#ifdef ONNX_ML
#include "onnx/onnx-ml.pb.h"
#else
#include "onnx/onnx.pb.h"
#endif

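However, onnx-ml.pb.h and onnx.pb.h are not shipped in third_party/onnx; they are generated during the build. Searching the pytorch tree turns up only onnx-ml.pb.h, so all that is needed is to define ONNX_ML:

In the current terminal, run
export ONNX_ML=1

and then rebuild.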
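The search described above can be scripted, so you can see which of the two generated headers your build actually produced before deciding whether ONNX_ML needs to be set. A small sketch; it defaults to a ./pytorch checkout (an assumption) and find_generated is a hypothetical helper.

```python
import os
import sys


def find_generated(root, names=("onnx.pb.h", "onnx-ml.pb.h")):
    """Map each header name to the paths under root where it was generated."""
    found = {name: [] for name in names}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in names:
            if name in filenames:
                found[name].append(os.path.join(dirpath, name))
    return found


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "pytorch"
    for name, paths in sorted(find_generated(root).items()):
        print(name, paths or "not found")
```

If only onnx-ml.pb.h turns up, as in the case above, exporting ONNX_ML=1 makes onnx_pb.h include the header that exists.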

error 4:

RuntimeError: cuda runtime error (7) : too many resources requested for launch at /home/nvidia/pytorch/aten/src/THCUNN/generic/SpatialUpSamplingBilinear.cu:66
Full output:

 self.nyu = h5py.File(self.data_path)
THCudaCheck FAIL file=/home/nvidia/pytorch/aten/src/THCUNN/generic/SpatialUpSamplingBilinear.cu line=66 error=7 : too many resources requested for launch
Traceback (most recent call last):
  File "test.py", line 96, in <module>
    output = model(input_var)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/nvidia/xiudan/zd_structure_paraformImgNet/fcrn.py", line 249, in forward
    ad1 = self._upsample_add(p1, c2)
  File "/media/nvidia/xiudan/zd_structure_paraformImgNet/fcrn.py", line 216, in _upsample_add
    return torch.nn.functional.interpolate(x, size=(H,W), mode='bilinear',align_corners=True) + y
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py", line 2447, in interpolate
    return torch._C._nn.upsample_bilinear2d(input, _output_size(2), align_corners)
RuntimeError: cuda runtime error (7) : too many resources requested for launch at /home/nvidia/pytorch/aten/src/THCUNN/generic/SpatialUpSamplingBilinear.cu:66

Solution
Reference: https://github.com/pytorch/pytorch/issues/8103#issuecomment-424343705
Make the following changes in "aten/src/THCUNN/generic/SpatialUpSamplingBilinear.cu":

Around line 62:
comment out THCState_getCurrentDeviceProperties(state)->maxThreadsPerBlock;
and set
const int num_threads = 512;

Around line 97:
comment out THCState_getCurrentDeviceProperties(state)->maxThreadsPerBlock;
and set
const int num_threads = 512;
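Both edits replace a device-derived thread count with a fixed 512, presumably so each block requests fewer resources than the TX2 can grant at its full maxThreadsPerBlock. A hedged sketch that makes the substitution; it assumes the original statements read const int num_threads = THCState_getCurrentDeviceProperties(state)->maxThreadsPerBlock; (possibly split across two lines) and reports how many it replaced, so you can fall back to editing by hand if it finds none. cap_num_threads is a hypothetical helper.

```python
import re
from pathlib import Path

# Assumed form of the original statements; \s* also spans a line break.
NUM_THREADS = re.compile(
    r"const int num_threads\s*=\s*"
    r"THCState_getCurrentDeviceProperties\(state\)->maxThreadsPerBlock;")


def cap_num_threads(source, threads=512):
    """Replace each device-derived num_threads with a fixed value; returns (text, count)."""
    replacement = "const int num_threads = %d; // was maxThreadsPerBlock" % threads
    return NUM_THREADS.subn(replacement, source)


if __name__ == "__main__":
    cu_file = Path("aten/src/THCUNN/generic/SpatialUpSamplingBilinear.cu")
    if cu_file.exists():
        patched, count = cap_num_threads(cu_file.read_text())
        if count:
            cu_file.write_text(patched)
        print("replaced", count, "occurrence(s)")  # the article edits two places
```

After patching, PyTorch has to be rebuilt for the change to take effect.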

I followed this guide for installation: https://gist.github.com/dusty-nv/ef2b372301c00c0a9d3203e42fd83426 using the install mode command “sudo python setup.py install”