

  • Test environment
  • Getting started
  • Error at step 2
  • Error at step 9
  • Error at step 11
  • Error at step 7
  • After a successful installation


  1. AIStationV3.0:
  2. GeForce RTX 3090: let me show off my pride and joy
  3. NF5280M5:





  1. Error message

  2. Cause of the error

    TASK [driver : nvidia gpu driver | install driver]


  3. Solution

    1. kernel-devel and kernel-headers must match the running kernel version; here both should be 3.10.0-1127
    2. The GPU driver must match the actual GPU model
    [root@node1 aistation]# cat /proc/version
    Linux version 3.10.0-1127.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Tue Mar 31 23:36:51 UTC 2020
    [root@node1 aistation]# rpm -qa | grep kernel
    kernel-devel-3.10.0-1127.el7.x86_64
    [root@node1 ~]# cd /home/packages/gpu_driver
    [root@node1 gpu_driver]# ll
    total 402564
    -rw-r--r--. 1 root root  86253524 Jul 14  2020 datacenter-gpu-manager-1.7.2-1.x86_64.rpm
    -rw-r--r--. 1 root root   1052036 Nov  3 13:57 nvidia-fabricmanager-450-450.80.02-1.x86_64.rpm
    -rw-r--r--. 1 root root    373768 Nov  3 13:57 nvidia-fabricmanager-devel-450-450.80.02-1.x86_64.rpm
    -rwxr-xr-x. 1 root root 183481072 Jan 13 15:09 NVIDIA-Linux-x86_64-450.80.02.run

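    The output above shows that the kernel packages match the running kernel. However, the driver offered on NVIDIA's website for the GeForce RTX 3090 is NVIDIA-Linux-x86_64-455.45.01.run, while the installer ships 450.80.02, so that is where the problem lies. The deployment documentation does not say the 3090 is supported, but never mind that: upload the 3090 driver, then rename it to the filename the system expects by default: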

    [root@node1 gpu_driver]# ll
    total 402564
    -rw-r--r--. 1 root root  86253524 Jul 14  2020 datacenter-gpu-manager-1.7.2-1.x86_64.rpm
    -rw-r--r--. 1 root root   1052036 Nov  3 13:57 nvidia-fabricmanager-450-450.80.02-1.x86_64.rpm
    -rw-r--r--. 1 root root    373768 Nov  3 13:57 nvidia-fabricmanager-devel-450-450.80.02-1.x86_64.rpm
    -rwxr-xr-x. 1 root root 141055124 Nov  3 13:57 NVIDIA-Linux-x86_64-450.80.02.run
    -rwxr-xr-x. 1 root root 183481072 Jan 13 15:09 NVIDIA-Linux-x86_64-455.45.01.run
    [root@node1 gpu_driver]# mv NVIDIA-Linux-x86_64-450.80.02.run NVIDIA-Linux-x86_64-450.80.02.run.bak
    [root@node1 gpu_driver]# mv NVIDIA-Linux-x86_64-455.45.01.run NVIDIA-Linux-x86_64-450.80.02.run
    [root@node1 gpu_driver]# ll
    total 402564
    -rw-r--r--. 1 root root  86253524 Jul 14  2020 datacenter-gpu-manager-1.7.2-1.x86_64.rpm
    -rw-r--r--. 1 root root   1052036 Nov  3 13:57 nvidia-fabricmanager-450-450.80.02-1.x86_64.rpm
    -rw-r--r--. 1 root root    373768 Nov  3 13:57 nvidia-fabricmanager-devel-450-450.80.02-1.x86_64.rpm
    -rwxr-xr-x. 1 root root 183481072 Jan 13 15:09 NVIDIA-Linux-x86_64-450.80.02.run
    -rwxr-xr-x. 1 root root 141055124 Nov  3 13:57 NVIDIA-Linux-x86_64-450.80.02.run.bak
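The rename above can be wrapped in a small guard so that the original installer is never overwritten by accident. This is only a sketch: `swap_driver` is a hypothetical helper name, not something shipped with AIStation, and it assumes both installers sit in the same directory.

```shell
#!/usr/bin/env bash
# Sketch: put a newer driver installer under the filename the deploy
# scripts expect, keeping the original as a .bak copy.
# swap_driver is a hypothetical helper, not part of AIStation.
swap_driver() {
    local dir="$1" expected="$2" replacement="$3"
    # Stop early if a file is missing or a backup already exists.
    [ -f "$dir/$expected" ] || { echo "missing: $dir/$expected" >&2; return 1; }
    [ -f "$dir/$replacement" ] || { echo "missing: $dir/$replacement" >&2; return 1; }
    [ ! -e "$dir/$expected.bak" ] || { echo "backup already exists" >&2; return 1; }
    mv "$dir/$expected" "$dir/$expected.bak" &&
        mv "$dir/$replacement" "$dir/$expected" &&
        chmod +x "$dir/$expected"
}

# Example call with the paths used in this post:
# swap_driver /home/packages/gpu_driver \
#     NVIDIA-Linux-x86_64-450.80.02.run NVIDIA-Linux-x86_64-455.45.01.run
```

Running it a second time fails on purpose, because the replacement file no longer exists under its own name.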


    [root@node1 gpu_driver]# nvidia-smi
    Wed Jan 13 21:31:16 2021
    | NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |   0  GeForce RTX 3090    Off  | 00000000:AF:00.0 Off |                  N/A |
    | 30%   16C    P8     7W / 350W |      0MiB / 24268MiB |      0%      Default |
    |                               |                      |                  N/A |
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |  No running processes found                                                 |
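The first point of the solution (kernel-devel and kernel-headers matching the running kernel) can also be checked by script rather than by eye. The sketch below assumes an RPM-based system such as the CentOS 7 host in this post; `kernel_pkg_matches` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: confirm that an installed kernel-devel / kernel-headers package
# matches the running kernel. kernel_pkg_matches is a hypothetical helper.
kernel_pkg_matches() {
    local pkg="$1" prefix="$2" running="$3"
    # e.g. kernel-devel-3.10.0-1127.el7.x86_64 -> 3.10.0-1127.el7.x86_64
    [ "${pkg#"$prefix"-}" = "$running" ]
}

# Only query the system when rpm is available (CentOS/RHEL assumed).
if command -v rpm >/dev/null 2>&1; then
    running="$(uname -r)"
    for name in kernel-devel kernel-headers; do
        found=no
        for pkg in $(rpm -q "$name" 2>/dev/null); do
            if kernel_pkg_matches "$pkg" "$name" "$running"; then
                found=yes
            fi
        done
        echo "$name matches $running: $found"
    done
fi
```

A `no` for either package means the driver's kernel module is unlikely to build against the running kernel.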


  1. Error message

    TASK [image : load kolla images] **************************************************************************************************************************************
    fatal: [node1]: FAILED! => {"ansible_job_id": "664877348652.39512", "changed": true, "cmd": "cd /opt/aistation/kolla_images && bash -x loadimages.sh eb5a7c0df3494817845d1fcd21133afa kolla_images_list inspur-kollaimages.tar.gz", "delta": "0:00:32.406170", "end": "2021-01-13 15:35:38.628654", "finished": 1, "msg": "non-zero return code", "rc": 1, "start": "2021-01-13 15:35:06.222484", "stderr": "+ '[' 4 -ne 4 ']'\n+ registryaddress=\n+ registry_admin_password=eb5a7c0df3494817845d1fcd21133afa\n+ images_list_file=kolla_images_list\n+ images_file=inspur-kollaimages.tar.gz\n+ echo 'start pushing image to docker registry'\n+ docker load\n++ cat kolla_images_list\n+ images=com.inspur/centos-source-mariadb:aistation.0.0.200\n+ echo eb5a7c0df3494817845d1fcd21133afa\n+ docker login -u admin --password-stdin\nError response from daemon: Get dial tcp connect: connection refused\n+ for image in '${images[@]}'\n+ newImageName=\n+ echo\n+ docker tag com.inspur/centos-source-mariadb:aistation.0.0.200\n+ docker push\nGet dial tcp connect: connection refused", "stderr_lines": ["+ '[' 4 -ne 4 ']'", "+ registryaddress=", "+ registry_admin_password=eb5a7c0df3494817845d1fcd21133afa", "+ images_list_file=kolla_images_list", "+ images_file=inspur-kollaimages.tar.gz", "+ echo 'start pushing image to docker registry'", "+ docker load", "++ cat kolla_images_list", "+ images=com.inspur/centos-source-mariadb:aistation.0.0.200", "+ echo eb5a7c0df3494817845d1fcd21133afa", "+ docker login -u admin --password-stdin", "Error response from daemon: Get dial tcp connect: connection refused", "+ for image in '${images[@]}'", "+ newImageName=", "+ echo", "+ docker tag com.inspur/centos-source-mariadb:aistation.0.0.200", "+ docker push", "Get dial tcp connect: connection refused"], "stdout": "start pushing image to docker registry\nLoaded image: com.inspur/centos-source-mariadb:aistation.0.0.200\n192.168.0.170:5000/com.inspur/centos-source-mariadb:aistation.0.0.200\nThe push refers to repository []", "stdout_lines": ["start pushing image to docker registry", "Loaded image: com.inspur/centos-source-mariadb:aistation.0.0.200", "", "The push refers to repository []"]}

    NO MORE HOSTS LEFT ****************************************************************************************************************************************************
    to retry, use: --limit @/home/deploy-script/common/kolla_mariadb/cluster.retry

    PLAY RECAP ************************************************************************************************************************************************************
    node1                      : ok=7    changed=5    unreachable=0    failed=1
  2. Solution


  1. Error message

    PLAY RECAP ************************************************************************************************************************************************************
    node1                      : ok=49   changed=37   unreachable=0    failed=0
    + sleep 5
    + kubectl get pod -n aistation
    Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods)
    [root@AIStationV3test aistation2.0]# kubectl get pod -A -o wide
    Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods)
    [root@AIStationV3test aistation2.0]# bash health-check.sh
    Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods)
    Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods)
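The installer trace above sleeps once for 5 seconds and then queries the API server a single time. When the API server is slow to respond, a bounded retry loop is a gentler way to wait. The following is a minimal sketch; `wait_for` is a hypothetical helper and the attempt count and delay are arbitrary choices.

```shell
#!/usr/bin/env bash
# Sketch: retry a command a fixed number of times with a pause between
# attempts. wait_for is a hypothetical helper, not part of the installer.
wait_for() {
    local attempts="$1" delay="$2"
    shift 2
    local i
    for ((i = 1; i <= attempts; i++)); do
        if "$@"; then
            return 0
        fi
        echo "attempt $i/$attempts failed: $*" >&2
        # No need to sleep after the final attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example: poll for up to about one minute, capping each request at 10s.
# wait_for 12 5 kubectl --request-timeout=10s get pod -n aistation
```

If the loop still ends in failure, the problem is not just slow startup, and the kubelet and API server logs are the next place to look.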



  1. Error message

    [root@AIStationV3test aistation2.0]# systemctl status kubelet
    ● kubelet.service - Kubernetes Kubelet Server
       Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
       Active: active (running) since Wed 2021-01-13 15:14:04 CST; 1h 12min ago
         Docs: https://github.com/GoogleCloudPlatform/kubernetes
     Main PID: 77402 (kubelet)
        Tasks: 0
       Memory: 21.9M
       CGroup: /system.slice/kubelet.service
               ‣ 77402 /usr/local/bin/kubelet --logtostderr=true --v=2 --address= --node-ip= --hostname-override=node1 --allow-privileged=true...

    Jan 13 16:26:06 node1 kubelet[77402]: W0113 16:26:06.298288   77402 container.go:523] Failed to update stats for container "/system.slice/docker-8b92c965f0879f6b5c4...
    Jan 13 16:26:06 node1 kubelet[77402]: E0113 16:26:06.707414   77402 fsHandler.go:118] failed to collect filesystem stats - rootDiskErr: could not stat "/v...f7ff6191d9
    Jan 13 16:26:07 node1 kubelet[77402]: I0113 16:26:07.264780   77402 kubelet.go:1932] SyncLoop (PLEG): "metrics-server-7c5c656d5d-dprj8_kube-system(35b2338c-556f-11e...
    Jan 13 16:26:07 node1 kubelet[77402]: E0113 16:26:07.266211   77402 pod_workers.go:190] Error syncing pod 35b2338c-556f-11eb-9e0d-b4055d088f2a ("metrics-server-7c5c...
    Jan 13 16:26:07 node1 kubelet[77402]: E0113 16:26:07.797865   77402 pod_workers.go:190] Error syncing pod 9d33ac32-5575-11eb-9e0d-b4055d088f2a ("alert-engine-5b6dff...
    Jan 13 16:26:08 node1 kubelet[77402]: E0113 16:26:08.797593   77402 pod_workers.go:190] Error syncing pod 1eede667-5576-11eb-9e0d-b4055d088f2a ("aistation-api-gatew...
    Jan 13 16:26:08 node1 kubelet[77402]: W0113 16:26:08.998309   77402 status_manager.go:485] Failed to get status for pod "ibase-service-74684b8c9f-sd9s8_ai...c9f-sd9s8)
    Jan 13 16:26:09 node1 kubelet[77402]: W0113 16:26:09.481292   77402 container.go:523] Failed to update stats for container "/system.slice/docker-e5af4a80ae2eb868188...
    Jan 13 16:26:09 node1 kubelet[77402]: W0113 16:26:09.818149   77402 container.go:523] Failed to update stats for container "/system.slice/docker-84ea2dd832b0f05adfc...
    Jan 13 16:26:09 node1 kubelet[77402]: E0113 16:26:09.985407   77402 kubelet_node_status.go:385] Error updating node status, will retry: error getting node...des node1)
    Hint: Some lines were ellipsized, use -l to show in full.


    [root@AIStationV3test aistation2.0]# systemctl restart kubelet
    [root@AIStationV3test aistation2.0]# systemctl status kubelet
    ● kubelet.service - Kubernetes Kubelet Server
       Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
       Active: active (running) since Wed 2021-01-13 16:26:25 CST; 1s ago
         Docs: https://github.com/GoogleCloudPlatform/kubernetes
      Process: 103130 ExecStartPre=/bin/mkdir -p /var/lib/kubelet/volume-plugins (code=exited, status=0/SUCCESS)
     Main PID: 103133 (kubelet)
        Tasks: 45
       Memory: 40.5M
       CGroup: /system.slice/kubelet.service
               └─103133 /usr/local/bin/kubelet --logtostderr=true --v=2 --address= --node-ip= --hostname-override=node1 --allow-privileged=tru...

    Jan 13 16:26:26 node1 kubelet[103133]: I0113 16:26:26.340248  103133 remote_image.go:50] parsed scheme: ""
    Jan 13 16:26:26 node1 kubelet[103133]: I0113 16:26:26.340263  103133 remote_image.go:50] scheme "" not registered, fallback to default scheme
    Jan 13 16:26:26 node1 kubelet[103133]: I0113 16:26:26.340389  103133 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{/var/run/dockersh...0  <nil>}]
    Jan 13 16:26:26 node1 kubelet[103133]: I0113 16:26:26.340440  103133 clientconn.go:796] ClientConn switching balancer to "pick_first"
    Jan 13 16:26:26 node1 kubelet[103133]: I0113 16:26:26.340447  103133 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{/var/run/dockersh...0  <nil>}]
    Jan 13 16:26:26 node1 kubelet[103133]: I0113 16:26:26.340482  103133 clientconn.go:796] ClientConn switching balancer to "pick_first"
    Jan 13 16:26:26 node1 kubelet[103133]: I0113 16:26:26.340571  103133 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc000d5d...CONNECTING
    Jan 13 16:26:26 node1 kubelet[103133]: I0113 16:26:26.340573  103133 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00029a...CONNECTING
    Jan 13 16:26:26 node1 kubelet[103133]: I0113 16:26:26.341816  103133 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc000d5d950, READY
    Jan 13 16:26:26 node1 kubelet[103133]: I0113 16:26:26.344265  103133 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00029acd0, READY
    Hint: Some lines were ellipsized, use -l to show in full.
    [root@AIStationV3test aistation2.0]# kubectl get nodes --show-labels
    Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
    [root@AIStationV3test install_config]# systemctl status kubelet -l
    ● kubelet.service - Kubernetes Kubelet Server
       Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
       Active: active (running) since Wed 2021-01-13 16:42:52 CST; 4min 2s ago
         Docs: https://github.com/GoogleCloudPlatform/kubernetes
      Process: 57554 ExecStartPre=/bin/mkdir -p /var/lib/kubelet/volume-plugins (code=exited, status=0/SUCCESS)
     Main PID: 57556 (kubelet)
        Tasks: 0
       Memory: 63.5M
       CGroup: /system.slice/kubelet.service
               ‣ 57556 /usr/local/bin/kubelet --logtostderr=true --v=2 --address= --node-ip= --hostname-override=node1 --allow-privileged=true --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --authentication-token-webhook --enforce-node-allocatable= --client-ca-file=/etc/kubernetes/ssl/ca.crt --rotate-certificates --pod-manifest-path=/etc/kubernetes/manifests --pod-infra-container-image= --node-status-update-frequency=10s --cgroup-driver=systemd --cgroups-per-qos=False --max-pods=110 --anonymous-auth=false --read-only-port=0 --fail-swap-on=True --runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice --cluster-dns= --cluster-domain=cluster.local --resolv-conf=/etc/resolv.conf --node-labels= --eviction-hard= --image-gc-high-threshold=100 --image-gc-low-threshold=99 --kube-reserved cpu=100m --system-reserved cpu=100m --registry-burst=110 --registry-qps=110 --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin --volume-plugin-dir=/var/lib/kubelet/volume-plugins

    Jan 13 16:46:54 node1 kubelet[57556]: E0113 16:46:54.255588   57556 eviction_manager.go:247] eviction manager: failed to get summary stats: failed to get node info: node "node1" not found
    Jan 13 16:46:54 node1 kubelet[57556]: E0113 16:46:54.268144   57556 kubelet.go:2246] node "node1" not found
    Jan 13 16:46:54 node1 kubelet[57556]: E0113 16:46:54.368355   57556 kubelet.go:2246] node "node1" not found
    Jan 13 16:46:54 node1 kubelet[57556]: E0113 16:46:54.468645   57556 kubelet.go:2246] node "node1" not found
    Jan 13 16:46:54 node1 kubelet[57556]: E0113 16:46:54.568872   57556 kubelet.go:2246] node "node1" not found
    Jan 13 16:46:54 node1 kubelet[57556]: E0113 16:46:54.669081   57556 kubelet.go:2246] node "node1" not found
    Jan 13 16:46:54 node1 kubelet[57556]: E0113 16:46:54.769283   57556 kubelet.go:2246] node "node1" not found
    Jan 13 16:46:54 node1 kubelet[57556]: E0113 16:46:54.869529   57556 kubelet.go:2246] node "node1" not found
    Jan 13 16:46:54 node1 kubelet[57556]: E0113 16:46:54.969820   57556 kubelet.go:2246] node "node1" not found
    Jan 13 16:46:55 node1 kubelet[57556]: E0113 16:46:55.069981   57556 kubelet.go:2246] node "node1" not found


    1. Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods)
    2. kubelet.go:2246] node "node1" not found
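Picking the key errors out of a long kubelet log by hand is tedious. One way to get a quick summary is to keep only the error-level lines and count identical messages. This is a sketch that assumes the log line layout shown in the output above (a severity letter followed by the date, time, and PID); `summarize_kubelet_errors` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: count identical error-level kubelet messages read from stdin.
# summarize_kubelet_errors is a hypothetical helper name.
summarize_kubelet_errors() {
    # Keep lines whose severity field starts with E (e.g. "E0113"),
    # drop everything up to and including the PID, then count duplicates.
    grep -E ' E[0-9]{4} [0-9:.]+ +[0-9]+ ' |
        sed -E 's/^.* E[0-9]{4} [0-9:.]+ +[0-9]+ //' |
        sort | uniq -c | sort -rn
}

# Example: summarize the last ten minutes of kubelet errors.
# journalctl -u kubelet --no-pager --since "10 minutes ago" | summarize_kubelet_errors
```

On the log above, the repeated `kubelet.go:2246] node "node1" not found` line would rise to the top of the summary.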


    2. Solution



