Installing gpu-manager

  • Overview
  • Preparation
  • Deploying gpu-manager
  • Deploying gpu-admission
  • Checking the results
  • References

Overview

gpu-manager is an open-source vGPU solution from Tencent. Its internals are not covered here; see the article GPUManager虚拟化方案 for the underlying design.

This guide mainly follows the installation tutorial 腾讯开源vgpu方案gpu-manager安装教程, with some of the configuration changed to work around problems encountered during installation. If installing from that article fails for you, the steps here may help.

Preparation

gpu-manager does not ship an NVIDIA container runtime, so the NVIDIA driver has to be installed in advance on every node that has a GPU. If something like gpu-operator was installed in the cluster before, uninstall it first; otherwise the driver installation will error out because the X server process is still holding the GPU (via kubelet). The details are not repeated here; see the following articles (a minimal pre-check is sketched after the links):
超全超详细的安装nvidia显卡驱动教程
Ubuntu安装nvidia驱动
解决centos下安装显卡驱动出现的unable to find the kernel source tree等关于内核版本问题
如何关闭X Server,以避免在更新nVidia驱动程序时出错?
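
A minimal pre-check before installing the driver, written as a sketch: it assumes a systemd-based distro, and assumes that gpu-operator, if present, was installed with Helm into a gpu-operator namespace (release name and namespace are assumptions; adjust to your setup):

# check whether an NVIDIA kernel module is already loaded
lsmod | grep -i nvidia
# remove a previously installed gpu-operator first (assumed Helm release name/namespace)
helm uninstall gpu-operator -n gpu-operator
# stop the graphical target so the X server does not hold the GPU during the driver install
sudo systemctl isolate multi-user.target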

After the driver is installed, reboot (I have not tested whether skipping the reboot works) and run the following commands to initialize the device nodes under /dev:

nvidia-smi
nvidia-modprobe -u -c=0

After running them, entries like the following should have been created under /dev:

[root@xxxxxx dev]# ls /dev|grep nvid
nvidia0
nvidia-caps
nvidiactl
nvidia-uvm
nvidia-uvm-tools

Otherwise, container initialization will fail with an error saying some /dev/xxx device cannot be found
(see: https://blog.csdn.net/JosephThatwho/article/details/107869332).

Deploying gpu-manager

In this cluster Docker uses the systemd cgroup driver, while gpu-manager defaults to cgroupfs, so the configuration has to be changed; switching the cgroup driver is only supported in relatively recent gpu-manager releases.
In addition, older gpu-manager versions are incompatible with newer clusters (this article uses Kubernetes v1.22.10). A quick way to confirm which cgroup driver your nodes use is shown below.
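
A sketch for checking the cgroup driver on a node (assuming Docker is the container runtime and the kubelet config sits at its default path):

docker info 2>/dev/null | grep -i "cgroup driver"
grep cgroupDriver /var/lib/kubelet/config.yaml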
Create gpu-manager.yaml with the following content:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-manager
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-manager-role
subjects:
- kind: ServiceAccount
  name: gpu-manager
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-manager-daemonset
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      name: gpu-manager-ds
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: gpu-manager-ds
    spec:
      serviceAccount: gpu-manager
      tolerations:
        # This toleration is deprecated. Kept here for backward compatibility
        # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        - key: CriticalAddonsOnly
          operator: Exists
        - key: tencent.com/vcuda-core
          operator: Exists
          effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      # only run node has gpu device
      nodeSelector:
        nvidia-device-enable: enable
      hostPID: true
      containers:
        - image: tkestack/gpu-manager:v1.1.5
          name: gpu-manager
          securityContext:
            privileged: true
          ports:
            - containerPort: 5678
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: vdriver
              mountPath: /etc/gpu-manager/vdriver
            - name: vmdata
              mountPath: /etc/gpu-manager/vm
            - name: log
              mountPath: /var/log/gpu-manager
            - name: checkpoint
              mountPath: /etc/gpu-manager/checkpoint
            - name: run-dir
              mountPath: /var/run
            - name: cgroup
              mountPath: /sys/fs/cgroup
              readOnly: true
            - name: usr-directory
              mountPath: /usr/local/host
              readOnly: true
            - name: kube-root
              mountPath: /root/.kube
              readOnly: true
          env:
            - name: LOG_LEVEL
              value: "4"
            - name: EXTRA_FLAGS
              value: "--cgroup-driver=systemd"
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
      volumes:
        - name: device-plugin
          hostPath:
            type: Directory
            path: /var/lib/kubelet/device-plugins
        - name: vmdata
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vm
        - name: vdriver
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vdriver
        - name: log
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/log
        - name: checkpoint
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/checkpoint
        # We have to mount the whole /var/run directory into container, because of bind mount docker.sock
        # inode change after host docker is restarted
        - name: run-dir
          hostPath:
            type: Directory
            path: /var/run
        - name: cgroup
          hostPath:
            type: Directory
            path: /sys/fs/cgroup
        # We have to mount /usr directory instead of specified library path, because of non-existing
        # problem for different distro
        - name: usr-directory
          hostPath:
            type: Directory
            path: /usr
        - name: kube-root
          hostPath:
            type: Directory
            path: /root/.kube

The main changes compared to the upstream manifest are:
Switched to a newer image (tkestack/gpu-manager:v1.1.5).
Removed --incluster-mode=true, because that option no longer exists in newer versions.
If --logtostderr is unset or set to true, the logs go to the container's stdout (kubectl logs); set it as you prefer.
Set --cgroup-driver=systemd (not needed if your driver is cgroupfs). A hedged sketch of how these flags are passed follows.
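
These flags reach gpu-manager through the EXTRA_FLAGS environment variable in the DaemonSet above; a minimal sketch (the exact set of supported flags depends on your gpu-manager version, and the values here are illustrative):

env:
  - name: LOG_LEVEL
    value: "4"
  - name: EXTRA_FLAGS
    value: "--cgroup-driver=systemd --logtostderr=false"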


The manifest creates a DaemonSet that only runs on nodes carrying a specific label.
So label every node whose GPUs should be schedulable, then apply the manifest:

kubectl label node <your-gpu-node-1> nvidia-device-enable=enable
kubectl label node <your-gpu-node-2> nvidia-device-enable=enable
...
kubectl apply -f gpu-manager.yaml

If everything is correct, a gpu-manager pod should be running on every labeled node.
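
A quick way to verify (the label selector matches the pod template of the DaemonSet above):

kubectl -n kube-system get pods -l name=gpu-manager-ds -o wide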

Deploying gpu-admission

Deploying gpu-admission exactly as in the tutorial above (https://www.jianshu.com/p/7d795bc226c7) works fine, but I made a few small changes.
Create gpu-admission.yaml as follows:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-admission
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-admission-as-kube-scheduler
subjects:
- kind: ServiceAccount
  name: gpu-admission
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:kube-scheduler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-admission-as-volume-scheduler
subjects:
- kind: ServiceAccount
  name: gpu-admission
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:volume-scheduler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-admission-as-daemon-set-controller
subjects:
- kind: ServiceAccount
  name: gpu-admission
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:controller:daemon-set-controller
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: scheduler
    tier: control-plane
    app: gpu-admission
  name: gpu-admission
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: scheduler
      tier: control-plane
  replicas: 1
  template:
    metadata:
      labels:
        component: scheduler
        tier: control-plane
        version: second
    spec:
      serviceAccountName: gpu-admission
      containers:
        - image: thomassong/gpu-admission:47d56ae9
          name: gpu-admission
          env:
            - name: LOG_LEVEL
              value: "4"
          ports:
            - containerPort: 3456
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true
      priority: 2000000000
      priorityClassName: system-cluster-critical
---
apiVersion: v1
kind: Service
metadata:
  name: gpu-admission
  namespace: kube-system
spec:
  ports:
    - port: 3456
      protocol: TCP
      targetPort: 3456
  selector:
    app: gpu-admission
  type: ClusterIP

I added a Service for this Deployment so that the later scheduler configuration does not have to reference the pod IP (see https://cloud.tencent.com/developer/article/1685122). Compared to the tutorial this means two extra steps:
Add an app: gpu-admission label to the Deployment's pod template, so the Service selector above can match it.
Create the Service.

kubectl create -f gpu-admission.yaml
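
After applying, it is worth checking that the Service actually selects the gpu-admission pod; if the endpoints list is empty, the app: gpu-admission label is missing from the pod template:

kubectl -n kube-system get pods -l app=gpu-admission -o wide
kubectl -n kube-system get endpoints gpu-admission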

Create /etc/kubernetes/scheduler-policy-config.json as follows:

{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "PodFitsHostPorts"},
    {"name": "PodFitsResources"},
    {"name": "NoDiskConflict"},
    {"name": "MatchNodeSelector"},
    {"name": "HostName"}
  ],
  "priorities": [
    {"name": "BalancedResourceAllocation", "weight": 1},
    {"name": "ServiceSpreadingPriority", "weight": 1}
  ],
  "extenders": [
    {
      "urlPrefix": "http://gpu-admission.kube-system:3456/scheduler",
      "apiVersion": "v1beta1",
      "filterVerb": "predicates",
      "enableHttps": false,
      "nodeCacheCapable": false
    }
  ],
  "hardPodAffinitySymmetricWeight": 10,
  "alwaysCheckAllPredicates": false
}
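
kube-scheduler will crash-loop if this file is malformed, so validating the JSON first can save a restart cycle (a sketch, assuming jq is available on the control-plane node):

jq . /etc/kubernetes/scheduler-policy-config.json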

The remaining steps are identical to the tutorial above (https://www.jianshu.com/p/7d795bc226c7).
Create /etc/kubernetes/scheduler-extender.yaml:

apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: "/etc/kubernetes/scheduler.conf"
algorithmSource:
  policy:
    file:
      path: "/etc/kubernetes/scheduler-policy-config.json"

Modify /etc/kubernetes/manifests/kube-scheduler.yaml; since it is a static pod manifest, kube-scheduler restarts automatically once the file is saved. The result looks like this:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=0.0.0.0
    - --feature-gates=TTLAfterFinished=true,ExpandCSIVolumes=true,CSIStorageCapacity=true,RotateKubeletServerCertificate=true
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --port=0
    - --config=/etc/kubernetes/scheduler-extender.yaml
    image: registry.cn-beijing.aliyuncs.com/kubesphereio/kube-scheduler:v1.22.10
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    startupProbe:
      failureThreshold: 24
      httpGet:
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
    - mountPath: /etc/localtime
      name: localtime
      readOnly: true
    - mountPath: /etc/kubernetes/scheduler-extender.yaml
      name: extender
      readOnly: true
    - mountPath: /etc/kubernetes/scheduler-policy-config.json
      name: extender-policy
      readOnly: true
  hostNetwork: true
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
  - hostPath:
      path: /etc/localtime
      type: File
    name: localtime
  - hostPath:
      path: /etc/kubernetes/scheduler-extender.yaml
      type: FileOrCreate
    name: extender
  - hostPath:
      path: /etc/kubernetes/scheduler-policy-config.json
      type: FileOrCreate
    name: extender-policy
status: {}

Compared to the stock manifest, three places are changed:
the startup command (the added --config=/etc/kubernetes/scheduler-extender.yaml flag);
the volumeMounts (the extender and extender-policy mounts);
the volumes (the corresponding hostPath entries).

If everything is fine, the scheduler pod is recreated automatically after the manifest is modified.

If it is not recreated, you can apply the manifest by hand, which makes the error visible.
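
A few commands that help when the scheduler does not come back up (the static pod name ends with your control-plane node name; <control-plane-node> below is a placeholder):

kubectl -n kube-system get pods -l component=kube-scheduler
kubectl -n kube-system logs kube-scheduler-<control-plane-node>
kubectl apply -f /etc/kubernetes/manifests/kube-scheduler.yaml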

Checking the results

At this point the cluster should have the following kinds of pods running normally: the gpu-manager DaemonSet pods on each GPU node, the gpu-admission pod, and the restarted kube-scheduler.
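
For example, to list them with the labels used in the manifests above:

kubectl -n kube-system get pods -l name=gpu-manager-ds
kubectl -n kube-system get pods -l app=gpu-admission
kubectl -n kube-system get pods -l component=kube-scheduler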

You can check whether a node now exposes vGPU resources:

kubectl describe node <your-gpu-node>
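
gpu-manager registers its virtual resources as tencent.com/vcuda-core and tencent.com/vcuda-memory, so they should appear under Capacity and Allocatable; a quick filter:

kubectl describe node <your-gpu-node> | grep vcuda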


You can deploy a test pod yourself (for example one running PyTorch); if it succeeds, the output should show that CUDA is available inside the container.
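
A minimal test pod sketch (the image name is a placeholder for whatever PyTorch/CUDA image you use; in gpu-manager 100 vcuda-core units correspond to one physical GPU and each vcuda-memory unit is 256MiB, so the values below request roughly half a GPU and 4GiB; keep requests equal to limits):

apiVersion: v1
kind: Pod
metadata:
  name: vcuda-test
spec:
  restartPolicy: Never
  containers:
    - name: pytorch
      image: <your-pytorch-cuda-image>   # placeholder image
      command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
      resources:
        requests:
          tencent.com/vcuda-core: 50
          tencent.com/vcuda-memory: 16
        limits:
          tencent.com/vcuda-core: 50
          tencent.com/vcuda-memory: 16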

(Figure: the amount of vGPU resources currently allocated.)

One more note: after this installation, nvidia-smi does not work inside containers. This does not seem to affect normal use, but if you need it, see https://github.com/tkestack/gpu-manager/issues/89

References

腾讯开源vgpu方案gpu-manager安装教程
GPUManager虚拟化方案
超全超详细的安装nvidia显卡驱动教程
解决centos下安装显卡驱动出现的unable to find the kernel source tree等关于内核版本问题
如何关闭X Server,以避免在更新nVidia驱动程序时出错?
https://github.com/tkestack/gpu-manager/issues/138
https://github.com/tkestack/gpu-manager/issues/151
https://github.com/tkestack/gpu-manager/issues/89
