Installing GPU drivers on a server OS: installing the NVIDIA driver on GPU servers
1. Check the OS release and kernel version
for i in xsgpu81 xsgpu82 xsgpu83 xsgpu84 xsgpu85; do qssh root@$i 'cat /etc/issue;uname -r';done
Ubuntu 16.04.2 LTS \n \l
4.4.0-62-generic
2. Download and install the NVIDIA driver
If the machine is not clean (GPU-related software was installed on it before), you may need to run service lightdm stop first.
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/375.26/NVIDIA-Linux-x86_64-375.26.run
root@xsgpu81:~# sudo sh NVIDIA-Linux-x86_64-375.26.run
Installer prompts: Accept, then OK three times.
3. Install CUDA
wget http://ogo0b6qe6.bkt.clouddn.com/cuda_8.0.61_375.26_linux.run
chmod +x cuda_8.0.61_375.26_linux.run
sudo sh cuda_8.0.61_375.26_linux.run --silent
echo 'export PATH=/usr/local/cuda-8.0/bin:$PATH' >> /root/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH' >> /root/.bashrc
source /root/.bashrc
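A quick way to confirm that the two directories added to /root/.bashrc are visible in the current shell is sketched below. path_contains is a hypothetical helper written for this note, not something shipped with CUDA; the install prefix is the one used in the commands above.

```shell
#!/bin/sh
# Sketch: check that a directory appears in a colon-separated path list.
# path_contains is a hypothetical helper, not part of CUDA.
path_contains() {
    # $1 = colon-separated list (e.g. "$PATH"), $2 = directory to look for
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

CUDA_HOME=/usr/local/cuda-8.0   # install prefix used in the commands above

if path_contains "$PATH" "$CUDA_HOME/bin"; then
    echo "PATH ok"
else
    echo "PATH is missing $CUDA_HOME/bin"
fi

if path_contains "${LD_LIBRARY_PATH:-}" "$CUDA_HOME/lib64"; then
    echo "LD_LIBRARY_PATH ok"
else
    echo "LD_LIBRARY_PATH is missing $CUDA_HOME/lib64"
fi
```

If either check reports a missing directory, open a new login shell or re-run source /root/.bashrc before testing.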
4. Copy the test binary to the machine and run it
qscp NVIDIA_CUDA-8.0_Samples/0_Simple/vectorAdd/vectorAdd root@xsgpu81:/root/
root@xsgpu81:~# ./vectorAdd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
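Manually deploying a mesos-agent node with GPU devices
Deploy mesos-agent and the other base services (boots-docker, consul, logbeat) on the GPU machine following the standard procedure.
Manual procedure:
Stop the mesos-agent service on the GPU machine: supervisorctl stop mesos-agent
Clean up the mesos-agent work_dir:
rm -rf "$(cat /home/qboxserver/mesos-agent/current/conf/mesos-agent/work_dir)"
Go to the mesos-agent configuration directory /home/qboxserver/mesos-agent/current/conf/mesos-agent and update the configuration:
Get the number and model of the GPUs on the machine with nvidia-smi -L; the number of GPUs listed is the total device count.
Write the device model to the attributes file: echo "NETWORK:BRIDGE;GPU_MODEL:$MODEL" > attributes
Add the isolation setting: echo "cgroups/devices,gpu/nvidia" > isolation
List the usable GPU device IDs: echo "0, 1, …, <total device count> - 1" > nvidia_gpu_devices
Add the GPU resource to resources: {"name":"gpus","type":"SCALAR","scalar":{"value": <total device count>}}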
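The four configuration values above all follow from the GPU model and the device count, so they can be generated rather than typed by hand. A minimal sketch follows; gpu_ids_csv and gpus_resource are hypothetical helper names, and COUNT and MODEL are example values that would come from nvidia-smi -L on a real node. The sketch only prints the values and does not write any file.

```shell
#!/bin/sh
# Sketch: derive the mesos-agent GPU settings from a device count and model.
# gpu_ids_csv and gpus_resource are hypothetical helper names.

# Print "0,1,...,N-1" for N devices.
gpu_ids_csv() {
    i=0
    out=""
    while [ "$i" -lt "$1" ]; do
        out="${out:+$out,}$i"
        i=$((i + 1))
    done
    echo "$out"
}

# Print the "gpus" resource entry for N devices.
gpus_resource() {
    printf '{"name":"gpus","type":"SCALAR","scalar":{"value":%d}}\n' "$1"
}

COUNT=4     # assumed example value; use the count from nvidia-smi -L
MODEL=P4    # assumed example value

echo "NETWORK:BRIDGE;GPU_MODEL:$MODEL"   # content for attributes
echo "cgroups/devices,gpu/nvidia"        # content for isolation
gpu_ids_csv "$COUNT"                     # content for nvidia_gpu_devices
gpus_resource "$COUNT"                   # entry to add to resources
```

Redirect each line into the matching file once the printed values look right.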
Go to /home/qboxserver/mesos-agent/current/libexec/mesos and replace the executor:
Keep the original executor: mv mesos-docker-executor mesos-docker-executor.cpp
Download the GPU executor:
wget http://ogo0b6qe6.bkt.clouddn.com/mesos-docker-executor-2017-11-18
mv mesos-docker-executor-2017-11-18 mesos-docker-executor; chown qboxserver:qboxserver mesos-docker-executor
cp mesos-docker-executor.go mesos-docker-executor
Install nvidia-docker-plugin
cd /home/qboxserver && mkdir nvidia-docker
cd /home/qboxserver/nvidia-docker
wget http://ogo0b6qe6.bkt.clouddn.com/nvidia-docker-plugin.2016-11-22-20-45-30.tar.gz
tar zxf nvidia-docker-plugin.2016-11-22-20-45-30.tar.gz
ln -s 2016-11-22-20-45-30 current
./current/bin/start.sh
curl -s http://localhost:3476/v1.0/gpu/info    # view GPU device info
Start the mesos-agent service.
Upgrading the GPU driver (trying to install the driver with apt-get)
apt-get purge 'nvidia*'
add-apt-repository ppa:graphics-drivers
apt-get update
apt-get install nvidia-
reboot
Install the matching cadvisor
cd /home/qboxserver/boots-cadvisor/current/bin && \
mv cadvisor cadvisor.bak && \
wget http://ogo0b6qe6.bkt.clouddn.com/cadvisor && \
chmod +x cadvisor && \
chown qboxserver:qboxserver cadvisor && \
./start.sh
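7 new GPU compute nodes brought online in the xs region
Version upgrade steps:
Some services hold the GPU; stop them before upgrading:
service lightdm stop (running on some machines, not on others)
Stop dockerd, nvidia-docker-plugin and boots-cadvisor.
Unload the old kernel modules:
modprobe -r nvidia nvidia_drm nvidia_uvm
Unloading sometimes fails. Run lsof | grep nvidia to see which process is still using the device, kill that process, and retry.
When lsmod | grep nvidia prints nothing, the old driver is fully unloaded and the installation can start.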
wget http://us.download.nvidia.com/tesla/396.44/NVIDIA-Linux-x86_64-396.44.run
sh NVIDIA-Linux-x86_64-396.44.run --silent
After it finishes:
Run nvidia-smi to check that the installation succeeded.
Reboot the machine.
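The "is the old driver really unloaded" check in the steps above (lsmod | grep nvidia printing nothing) can be wrapped so that the installer only runs when it passes. A minimal sketch, assuming lsmod-style text on stdin; nvidia_modules_in and driver_unloaded are hypothetical helper names, and the sample text stands in for real lsmod output.

```shell
#!/bin/sh
# Sketch of the pre-install check; both helpers are hypothetical names.

# Read lsmod-style text on stdin and print the names of loaded nvidia modules.
nvidia_modules_in() {
    awk '$1 ~ /^nvidia/ { print $1 }'
}

# Succeed only when no nvidia module remains in the lsmod output on stdin.
driver_unloaded() {
    [ -z "$(nvidia_modules_in)" ]
}

# Example input: lsmod output captured as text.
sample='nvidia_drm 53248 0
nvidia 11943936 1 nvidia_modeset
drm 360448 5 ast,ttm,drm_kms_helper,nvidia_drm'

if printf '%s\n' "$sample" | driver_unloaded; then
    echo "old driver unloaded, safe to install"
else
    echo "still loaded: $(printf '%s\n' "$sample" | nvidia_modules_in | xargs)"
fi
```

On a real node the same check would read from lsmod directly, for example: lsmod | driver_unloaded && sh NVIDIA-Linux-x86_64-396.44.run --silent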
Upgrade example:
1. Check the current version
root@xsgpu9:~# nvidia-smi
NVIDIA-SMI 375.26
2. Check which modules are loaded
root@xsgpu9:~# lsmod | grep -i nvidia
nvidia_drm 53248 0
nvidia_modeset 790528 1 nvidia_drm
nvidia 11943936 1 nvidia_modeset
drm_kms_helper 143360 2 ast,nvidia_drm
drm 360448 5 ast,ttm,drm_kms_helper,nvidia_drm
3. Unload the related modules
modprobe -r nvidia_drm nvidia_modeset nvidia
4. Download the new version
root@xsgpu9:~# wget http://us.download.nvidia.com/tesla/396.44/NVIDIA-Linux-x86_64-396.44.run
5. Install the new version
sh NVIDIA-Linux-x86_64-396.44.run --silent
6. Check the new version
nvidia-smi
| NVIDIA-SMI 396.44 Driver Version: 396.44 |
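When upgrading several machines, comparing the reported version with the expected one by eye is error-prone. A minimal sketch that extracts the version from the nvidia-smi banner line shown above; driver_version_from_banner is a hypothetical helper, and the banner string is the sample from this example rather than live output.

```shell
#!/bin/sh
# Sketch: pull the driver version out of the nvidia-smi banner line.
# driver_version_from_banner is a hypothetical helper name.
driver_version_from_banner() {
    # Reads nvidia-smi output on stdin, prints the first "Driver Version" value.
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

expected=396.44
banner='| NVIDIA-SMI 396.44 Driver Version: 396.44 |'   # sample line from above

actual=$(printf '%s\n' "$banner" | driver_version_from_banner)
if [ "$actual" = "$expected" ]; then
    echo "driver $actual ok"
else
    echo "expected $expected, got ${actual:-nothing}"
fi
```

On a real node, pipe nvidia-smi into the helper instead of the sample string.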
On xs311 the NVIDIA driver had been installed with apt-get. Command to remove it:
apt-get --purge remove 'nvidia-*'
dora.内部计算 (internal compute) --> dora.内部计算GPU (internal compute GPU), issue log:
root@jjh1569:/var/log# cat /home/qboxserver/mesos-agent/current/conf/mesos-agent/attributes
NETWORK:HOST
Changed to:
NETWORK:HOST;GPU_MODEL:QSV
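Then restart the dockerd and mesos-agent services.
mesos-agent failed to start.
The cause was a configuration mismatch: on startup mesos-agent tries to recover its previous state so that it can reconnect, and recovery fails when the configuration differs from the saved one.
Delete the work directory, /disk1/mesos: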
root@jjh1569:/var/log# cd /home/qboxserver/mesos-agent/current/conf/mesos-agent/
root@jjh1569:/home/qboxserver/mesos-agent/current/conf/mesos-agent# cat work_dir
/disk1/mesos
Then run:
rm -rf /disk1/mesos
root@jjh1569:/var/log# less syslog
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: I1024 18:55:11.627574 9662 slave.cpp:519] Agent resources: cpus(*):7; mem(*):12288; disk(*):445440; ports(*):[10000-20000]
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: I1024 18:55:11.627622 9662 slave.cpp:527] Agent attributes: [ NETWORK=HOST, GPU_MODEL=QSV ]
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: I1024 18:55:11.627645 9662 slave.cpp:532] Agent hostname: 10.20.78.29
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: I1024 18:55:11.630751 9660 state.cpp:57] Recovering state from '/disk1/mesos/meta'
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: Failed to perform recovery: Incompatible agent info detected.
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: Old agent info:
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: attributes {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: name: "NETWORK"
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: type: TEXT
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: text {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: value: "HOST"
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: }
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: }
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: New agent info:
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: attributes {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: name: "NETWORK"
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: type: TEXT
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: text {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: value: "HOST"
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: }
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: }
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: attributes {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: name: "GPU_MODEL"
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: type: TEXT
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: text {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: value: "QSV" # the extra part
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: }
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: }
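The failure in the log above has a stable signature, so it can be detected before deciding to clear the work directory. A minimal sketch, assuming the log text arrives on stdin; recovery_failed is a hypothetical helper, and the sketch deliberately only reports and does not delete anything.

```shell
#!/bin/sh
# Sketch: detect the "Incompatible agent info" recovery failure in a log.
# recovery_failed is a hypothetical helper name.
recovery_failed() {
    # Reads log text on stdin; succeeds if the recovery error is present.
    grep -q 'Failed to perform recovery: Incompatible agent info detected'
}

# Sample line copied from the syslog excerpt above.
sample='Oct 24 18:55:11 jjh1569 mesos-agent[9615]: Failed to perform recovery: Incompatible agent info detected.'

if printf '%s\n' "$sample" | recovery_failed; then
    echo "agent info changed: clear the work_dir, then restart mesos-agent"
else
    echo "no recovery failure found"
fi
```

Old matches stay in syslog after the problem is fixed, so on a real node it is safer to feed in only recent lines, for example tail -n 200 /var/log/syslog | recovery_failed.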
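Then modify attributes and resources (QSV is a custom GPU type and gpus is the number of GPUs; adjust both to match the machine).
Restart the dockerd and mesos-agent services again (if startup fails, delete the work_dir /disk1/mesos and restart mesos-agent).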
#!/bin/bash
if grep -q QSV /home/qboxserver/mesos-agent/current/conf/mesos-agent/attributes
then echo "QSV already exists"
else
sed -i "s/NETWORK:HOST/NETWORK:HOST;GPU_MODEL:QSV/g" /home/qboxserver/mesos-agent/current/conf/mesos-agent/attributes
fi
# Write /home/qboxserver/mesos-agent/current/conf/mesos-agent/resources (overwrite, since the heredoc is a complete JSON array)
cat << EOF > /home/qboxserver/mesos-agent/current/conf/mesos-agent/resources
[
  {
    "name": "cpus",
    "type": "SCALAR",
    "scalar": {
      "value": 7
    }
  },
  {
    "name": "mem",
    "type": "SCALAR",
    "scalar": {
      "value": 14336
    }
  },
  {
    "name": "disk",
    "type": "SCALAR",
    "scalar": {
      "value": 20480
    }
  },
  {
    "name": "ports",
    "type": "RANGES",
    "ranges": {
      "range": [
        {
          "begin": 10000,
          "end": 20000
        }
      ]
    }
  },
  {
    "name": "gpus",
    "type": "SCALAR",
    "scalar": {
      "value": 1
    }
  },
  {
    "name": "gpuset",
    "type": "SET",
    "set": {
      "item": ["0"]
    }
  }
]
EOF
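The gpus and gpuset entries in the resources file above must agree with each other and with the number of devices on the machine. A minimal sketch that generates both entries from one count, so they cannot drift apart; gpuset_items and gpu_entries are hypothetical helper names, and the count passed at the end is an example value.

```shell
#!/bin/sh
# Sketch: generate the "gpus" and "gpuset" resource entries from a GPU count.
# gpuset_items and gpu_entries are hypothetical helper names.

# Print the quoted id list, e.g. "0", "1" for a count of 2.
gpuset_items() {
    i=0
    out=""
    while [ "$i" -lt "$1" ]; do
        out="${out:+$out, }\"$i\""
        i=$((i + 1))
    done
    echo "$out"
}

# Print both JSON objects for the given count.
gpu_entries() {
    cat <<EOF
  {
    "name": "gpus",
    "type": "SCALAR",
    "scalar": { "value": $1 }
  },
  {
    "name": "gpuset",
    "type": "SET",
    "set": { "item": [$(gpuset_items "$1")] }
  }
EOF
}

gpu_entries 2   # example count; use the number of GPUs on the machine
```

The output is a fragment meant to sit inside the top-level array, after the ports entry.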
GPU plugin deployment script:
root@xs313:~# cat /tmp/gpu.sh
#!/bin/bash
# usage: deploys the GPU-related configuration on dora GPU machines
supervisorctl stop mesos-agent
supervisorctl stop boots-cadvisor
supervisorctl stop dockerd
# install the custom cadvisor
cd /home/qboxserver/boots-cadvisor/current/bin
mv cadvisor cadvisor.bak
wget http://ogo0b6qe6.bkt.clouddn.com/cadvisor
chmod +x cadvisor
chown qboxserver:qboxserver cadvisor
# install the custom mesos-docker-executor
cd /home/qboxserver/mesos-agent/current/libexec/mesos
wget http://ogo0b6qe6.bkt.clouddn.com/mesos-docker-executor-2018-09-10-15-05-00
mv mesos-docker-executor mesos-docker-executor.bak
mv mesos-docker-executor-2018-09-10-15-05-00 mesos-docker-executor
chown qboxserver:qboxserver mesos-docker-executor
chmod +x mesos-docker-executor
# mesos-agent parameters
# Part 1: modify attributes (MODEL is the 4th field of the first nvidia-smi -L line, e.g. P4 or K80)
MODEL=$(nvidia-smi -L | cut -d" " -f4 | xargs | cut -d" " -f1)
sed -i "s/NETWORK:HOST/NETWORK:HOST;GPU_MODEL:${MODEL}/g" /home/qboxserver/mesos-agent/current/conf/mesos-agent/attributes
nvidia-smi -L
# Part 2: add isolation
echo "cgroups/devices,gpu/nvidia" > /home/qboxserver/mesos-agent/current/conf/mesos-agent/isolation
# Part 3: add nvidia_gpu_devices
echo "0,1,2,3,4,5,6,7" > /home/qboxserver/mesos-agent/current/conf/mesos-agent/nvidia_gpu_devices
# Part 4: add resources (drop the last two lines of the file, the closing "}" and "]", then append the GPU entries)
for i in `seq 2`; do sed -i '$d' /home/qboxserver/mesos-agent/current/conf/mesos-agent/resources ; done
cat << EOF >> /home/qboxserver/mesos-agent/current/conf/mesos-agent/resources
  },
  {
    "name": "gpus",
    "type": "SCALAR",
    "scalar": {
      "value": 8
    }
  },
  {
    "name": "gpuset",
    "type": "SET",
    "set": {
      "item": ["0", "1", "2", "3", "4", "5", "6", "7"]
    }
  }
]
EOF
# install nvidia-docker-plugin
cd /home/qboxserver && mkdir nvidia-docker && cd nvidia-docker
wget http://ogo0b6qe6.bkt.clouddn.com/nvidia-docker-plugin.2016-11-22-20-45-30.tar.gz
tar zxf nvidia-docker-plugin.2016-11-22-20-45-30.tar.gz
ln -s 2016-11-22-20-45-30 current
./current/bin/start.sh
# finally, bring the node online
rm -rf "$(cat /home/qboxserver/mesos-agent/current/conf/mesos-agent/work_dir)"
supervisorctl start dockerd
supervisorctl start mesos-agent
supervisorctl start boots-cadvisor
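One weakness of the script above is that its sed on attributes is not idempotent: running it twice appends GPU_MODEL a second time. A minimal sketch of an idempotent update follows. set_gpu_model is a hypothetical helper; it assumes a single-line attributes file and a model name without "/" characters, and the demo works on a temporary file rather than the real configuration.

```shell
#!/bin/sh
# Sketch: set GPU_MODEL in an attributes file without duplicating it.
# set_gpu_model is a hypothetical helper; assumes a single-line file
# and a model name that contains no "/" characters.
set_gpu_model() {
    file=$1
    model=$2
    if grep -q 'GPU_MODEL:' "$file"; then
        # Replace the existing value instead of appending a second attribute.
        sed "s/GPU_MODEL:[^;]*/GPU_MODEL:$model/" "$file" > "$file.tmp"
    else
        # Append the attribute to the end of the line.
        sed "s/\$/;GPU_MODEL:$model/" "$file" > "$file.tmp"
    fi
    mv "$file.tmp" "$file"
}

# Demo on a temporary file, not on the real attributes file.
tmp=$(mktemp)
echo "NETWORK:HOST" > "$tmp"
set_gpu_model "$tmp" P4
set_gpu_model "$tmp" P4    # second run leaves the file unchanged
cat "$tmp"                 # prints NETWORK:HOST;GPU_MODEL:P4
rm -f "$tmp"
```

Because a changed attributes file still requires clearing the work_dir before mesos-agent restarts, this only removes the duplicate-attribute problem, not that step.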
Checking the NVIDIA GPU driver
dora currently uses two GPU types, K80 and P4. To check which one a machine has:
nvidia-smi -L
root@xs991:~# nvidia-smi -L
GPU 0: Tesla P4 (UUID: GPU-50850be7-c49e-4693-e20e-a677d2adeb82)
GPU 1: Tesla P4 (UUID: GPU-22e9fbe2-9170-4548-c301-579b786858b6)
GPU 2: Tesla P4 (UUID: GPU-c8132e0e-c8a4-defc-fea3-01b5c930667e)
GPU 3: Tesla P4 (UUID: GPU-762546f1-0b48-c963-954e-fa74b4f7e76f)
GPU 4: Tesla P4 (UUID: GPU-2fdb3d5e-dd66-1f6d-a814-5265df4fa1f4)
GPU 5: Tesla P4 (UUID: GPU-a4011f72-78c2-ab13-c6b8-3e58e9093773)
GPU 6: Tesla P4 (UUID: GPU-84d2bbd4-c3e0-d7ed-6628-5528878de6ea)
GPU 7: Tesla P4 (UUID: GPU-fa3933c0-3cb3-4e8c-a84a-75342a15cc24)
root@xs313:~# nvidia-smi -L
GPU 0: Tesla K80 (UUID: GPU-a457c419-bcfd-538b-d993-e443d28dcd24)
GPU 1: Tesla K80 (UUID: GPU-07f9795d-3917-b804-a6c5-621e27c239f8)
GPU 2: Tesla K80 (UUID: GPU-78197899-b007-1e74-29a8-3f27958e7d28)
GPU 3: Tesla K80 (UUID: GPU-d594f478-261b-e139-b87f-cf1d7b076f42)
GPU 4: Tesla K80 (UUID: GPU-8df7cf81-e51a-3a88-a4b8-6075d18a9365)
GPU 5: Tesla K80 (UUID: GPU-c9931f33-32c0-da73-aa8f-6109989b129c)
GPU 6: Tesla K80 (UUID: GPU-0830ceaa-f860-b717-67ac-e4e7fec25a26)
GPU 7: Tesla K80 (UUID: GPU-9b509b1c-a186-cf05-8aa3-4ba73aed1eb1)
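The listings above carry both values the mesos-agent configuration needs: the device count and the model. A minimal sketch that extracts them from nvidia-smi -L style text on stdin; gpu_count and gpu_model are hypothetical helpers, and the parsing assumes two-word names of the form "Tesla P4" as in the output above, so other naming schemes would need a different pattern.

```shell
#!/bin/sh
# Sketch: extract GPU count and model from "nvidia-smi -L" style output.
# gpu_count and gpu_model are hypothetical helper names.

# Count the lines that describe a GPU.
gpu_count() {
    grep -c '^GPU [0-9][0-9]*:'
}

# "GPU 0: Tesla P4 (UUID: ...)" -> "P4"; uses the first GPU only.
gpu_model() {
    sed -n 's/^GPU [0-9][0-9]*: [^ ]* \([^ ]*\) (UUID.*/\1/p' | head -n 1
}

# Sample lines copied from the xs313 listing above.
sample='GPU 0: Tesla K80 (UUID: GPU-a457c419-bcfd-538b-d993-e443d28dcd24)
GPU 1: Tesla K80 (UUID: GPU-07f9795d-3917-b804-a6c5-621e27c239f8)'

echo "count: $(printf '%s\n' "$sample" | gpu_count)"
echo "model: $(printf '%s\n' "$sample" | gpu_model)"
```

On a real node, replace the sample with the live command, for example nvidia-smi -L | gpu_count.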
There are two kinds of graphics hardware on these machines: NVIDIA cards and Intel integrated graphics.
root@xsgpu81:~# lspci | grep -i nvidia
04:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
05:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
08:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
09:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
84:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
85:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
88:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
89:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
qboxserver@jjh1569:~$ lspci | grep -i vga
00:13.0 Non-VGA unclassified device: Intel Corporation Sunrise Point-H Integrated Sensor Hub (rev 31)
07:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)