Troubleshooting a failed rook-ceph PV creation in Kubernetes
1. Symptom: the pods of a newly created StatefulSet failed to start because no PV could be provisioned for them
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  running "VolumeBinding" filter plugin for pod "mysql-0": pod has unbound immediate PersistentVolumeClaims
  Warning  FailedScheduling  <unknown>  default-scheduler  running "VolumeBinding" filter plugin for pod "mysql-0": pod has unbound immediate PersistentVolumeClaims
The PVC stayed unbound: no PV was being provisioned or mounted for it.
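Before digging into the cluster components, the stuck claim itself is worth inspecting. A hedged sketch — the namespace `ztw` is taken from later in this writeup, and the PVC name `data-mysql-0` is an assumption (the real name is the volumeClaimTemplate name plus the pod ordinal):

```shell
# Sketch only: "ztw" / "data-mysql-0" are assumed names, adjust to your cluster.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get pvc -n ztw                    # is the claim stuck in Pending?
  kubectl describe pvc data-mysql-0 -n ztw  # the event trail explains why binding stalls
  kubectl get storageclass                  # which provisioner is the claim waiting on?
else
  echo "kubectl not available; run this against the affected cluster"
fi
```

A Pending claim whose events mention the provisioner (here `rook-ceph.rbd.csi.ceph.com`) points the investigation at the CSI side rather than at the scheduler.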
Troubleshooting process:
1. First, check the state of the rook-ceph cluster
[root@master1 images]# kubectl exec -it -n rook-ceph rook-ceph-tools-6b4889fdfd-86dp5 /bin/bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
[root@rook-ceph-tools-6b4889fdfd-86dp5 /]# ceph -s
  cluster:
    id:     bb5107d5-d3f7-45df-9146-1148efa378b5
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum b,c,d (age 67m)
    mgr: a (active, since 7m)
    mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
    osd: 10 osds: 10 up (since 7h), 10 in (since 8d)
  task status:
    scrub status:
      mds.myfs-a: idle
      mds.myfs-b: idle
  data:
    pools:   4 pools, 97 pgs
    objects: 1.20k objects, 3.2 GiB
    usage:   19 GiB used, 1.9 TiB / 2.0 TiB avail
    pgs:     97 active+clean
  io:
    client: 852 B/s rd, 1 op/s rd, 0 op/s wr
Ceph reported HEALTH_OK, so the storage backend itself was fine.
2. Check the kube-system pods
[root@master1 images]# kubectl get po -n kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-578894d4cd-8wlg4 1/1 Running 0 8d
calico-node-5rnjk 1/1 Running 0 8d
calico-node-7rvj2 1/1 Running 0 8d
calico-node-p7hpq 1/1 Running 0 8d
calico-node-vgrlg 1/1 Running 0 8d
calico-node-zd2mn 1/1 Running 0 8d
coredns-66bff467f8-fj7td 1/1 Running 0 5d3h
coredns-66bff467f8-rmnzk 1/1 Running 0 8d
dashboard-metrics-scraper-6b4884c9d5-8gtnl 1/1 Running 0 8d
etcd-master1 1/1 Running 0 20m
etcd-master2 1/1 Running 0 20m
etcd-master3 1/1 Running 0 20m
kube-apiserver-master1 1/1 Running 0 8d
kube-apiserver-master2 1/1 Running 0 8d
kube-apiserver-master3 1/1 Running 0 8d
kube-controller-manager-master1 1/1 Running 63 8d
kube-controller-manager-master2 1/1 Running 64 8d
kube-controller-manager-master3 1/1 Running 64 8d
kube-proxy-6n7lz 1/1 Running 0 8d
kube-proxy-7nstv 1/1 Running 0 8d
kube-proxy-kxzhp 1/1 Running 0 8d
kube-proxy-tw9j4 1/1 Running 0 8d
kube-proxy-w4s47 1/1 Running 0 8d
kube-scheduler-master1 1/1 Running 63 32m
kube-scheduler-master2 1/1 Running 64 22m
kube-scheduler-master3 1/1 Running 74 22m
kubernetes-dashboard-6f77f7cfdb-kb6fx 1/1 Running 4 8d
metrics-server-584b5f4754-z58xl 1/1 Running 0 8d
traefik-5875c779f4-4z62m 1/1 Running 0 4d21h
The listing shows that kube-scheduler and kube-controller-manager on all three masters had restarted dozens of times.
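The restart counts can be traced through the container logs; the error lines that follow were obtained roughly like this (pod names taken from the listing above — a sketch, adjust to your node names):

```shell
# Sketch: pull leader-election errors from the restarted control-plane pods.
if command -v kubectl >/dev/null 2>&1; then
  # --previous shows the logs of the container instance that crashed/restarted
  kubectl logs -n kube-system kube-scheduler-master1 --previous | grep leaderelection
  kubectl logs -n kube-system kube-controller-manager-master1 --previous | grep leaderelection
else
  echo "kubectl not available; run on a control-plane node"
fi
```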
E0406 05:13:21.278261 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: request timed out
E0406 05:17:05.409616 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: leader changed
E0406 05:35:03.180503 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: request timed out
E0406 06:07:26.579433 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: request timed out
E0406 07:14:02.476881 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: leader changed
E0406 07:48:13.975004 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: request timed out
E0406 08:27:33.280699 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: leader changed
E0406 09:01:27.363775 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: request timed out
E0406 09:40:58.889225 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: leader changed
E0406 10:13:18.380376 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: request timed out
The scheduler logs point at etcd: requests were timing out and the etcd leader kept changing, i.e. etcd was going through frequent leader elections.
I therefore checked etcd directly and found it healthy; the preliminary conclusion was that network jitter had caused etcd to re-elect its leader repeatedly.
[root@master1 images]# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --write-out=table --endpoints=172.10.25.184:2379,172.10.25.185:2379,172.10.25.186:2379 endpoint status
+--------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 172.10.25.184:2379 | 3b85f750baf8d841 | 3.4.3 | 59 MB | false | false | 1886 | 3128844 | 3128844 | |
| 172.10.25.185:2379 | 5f95ee4c3d9d164 | 3.4.3 | 59 MB | false | false | 1886 | 3128845 | 3128845 | |
| 172.10.25.186:2379 | be2885dc23c5f563 | 3.4.3 | 59 MB | true | false | 1886 | 3128846 | 3128846 | |
+--------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
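Beyond `endpoint status`, two quick checks help confirm whether elections are still happening: per-endpoint health latency, and etcd's leader-change counter exposed via its Prometheus metrics. A hedged sketch — endpoints and certificate paths are copied from the command above, and the metric name `etcd_server_leader_changes_seen_total` is the one shipped by etcd 3.4; the /metrics endpoint may require the client certificate when client-cert auth is enabled:

```shell
# Sketch: verify etcd health and watch for ongoing leader changes.
ENDPOINTS=172.10.25.184:2379,172.10.25.185:2379,172.10.25.186:2379
if command -v etcdctl >/dev/null 2>&1; then
  # sustained high round-trip latency here hints at network jitter
  ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/peer.crt \
    --key=/etc/kubernetes/pki/etcd/peer.key \
    --endpoints="$ENDPOINTS" endpoint health
fi
if command -v curl >/dev/null 2>&1; then
  # a counter that keeps growing between samples means elections are ongoing
  curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt \
    --cert /etc/kubernetes/pki/etcd/peer.crt \
    --key /etc/kubernetes/pki/etcd/peer.key \
    https://172.10.25.184:2379/metrics | grep leader_changes_seen_total
fi
```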
Next, check the kubelet:
[root@master1 mysql]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Tue 2021-04-06 17:57:51 CST; 2min 28s ago
     Docs: https://kubernetes.io/docs/
 Main PID: 13723 (kubelet)
    Tasks: 30
   Memory: 88.4M
   CGroup: /system.slice/kubelet.service
           └─13723 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=systemd --network-plugin=cni --pod-infr...
Apr 06 18:00:12 master1 kubelet[13723]: I0406 18:00:12.930233 13723 operation_generator.go:181] scheme "" not registered, fallback to default scheme
Apr 06 18:00:12 master1 kubelet[13723]: I0406 18:00:12.930252 13723 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins_registry/rook-ceph.rbd.csi.ceph.com-reg.sock <nil> 0 <nil>}] <nil> <nil>}
Apr 06 18:00:12 master1 kubelet[13723]: I0406 18:00:12.930263 13723 clientconn.go:933] ClientConn switching balancer to "pick_first"
Apr 06 18:00:12 master1 kubelet[13723]: W0406 18:00:12.930363 13723 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/rook-ceph.rbd.csi.ceph.com-reg.sock <nil> 0 <nil>}. Err...
Apr 06 18:00:13 master1 kubelet[13723]: W0406 18:00:13.930508 13723 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/rook-ceph.cephfs.csi.ceph.com-reg.sock <nil> 0 <nil>}. ...
Apr 06 18:00:13 master1 kubelet[13723]: W0406 18:00:13.930641 13723 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/rook-ceph.rbd.csi.ceph.com-reg.sock <nil> 0 <nil>}. Err...
Apr 06 18:00:15 master1 kubelet[13723]: W0406 18:00:15.319378 13723 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/rook-ceph.cephfs.csi.ceph.com-reg.sock <nil> 0 <nil>}. ...
Apr 06 18:00:15 master1 kubelet[13723]: W0406 18:00:15.473849 13723 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/rook-ceph.rbd.csi.ceph.com-reg.sock <nil> 0 <nil>}. Err...
Apr 06 18:00:17 master1 kubelet[13723]: W0406 18:00:17.583410 13723 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/rook-ceph.cephfs.csi.ceph.com-reg.sock <nil> 0 <nil>}. ...
Apr 06 18:00:18 master1 kubelet[13723]: W0406 18:00:18.305165 13723 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/rook-ceph.rbd.csi.ceph.com-reg.sock <nil> 0 <nil>}. Err...
Hint: Some lines were ellipsized, use -l to show in full.
Here the kubelet was repeatedly failing to connect to the CSI plugin registration sockets (rook-ceph.rbd.csi.ceph.com and rook-ceph.cephfs.csi.ceph.com), so it could not attach or mount rook-ceph volumes — the cause of the failure surfaces at this point.
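The sockets the kubelet is dialing are created by the CSI node-plugin pods through the kubelet plugin-registration mechanism; if those pods are gone, the socket files under `plugins_registry` are stale or absent. A quick check on the affected node (a sketch; the directory path is the one visible in the kubelet logs above):

```shell
# Sketch: inspect the kubelet's CSI plugin registration sockets on this node.
REG_DIR=/var/lib/kubelet/plugins_registry
if [ -d "$REG_DIR" ]; then
  ls -l "$REG_DIR"   # expect one *-reg.sock per registered CSI driver
else
  echo "$REG_DIR not found; run this on a cluster node"
fi
```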
Checking the rook-ceph pods then confirmed it: the CSI plugin pods were missing entirely, which is why volume provisioning could not proceed.
[root@master1 mysql]# kubectl get po -n rook-ceph
NAME READY STATUS RESTARTS AGE
rook-ceph-crashcollector-master1-9ff9c4b7f-92zqn 1/1 Running 0 8d
rook-ceph-crashcollector-master2-6fd8fd857d-m4ngp 1/1 Running 0 8d
rook-ceph-crashcollector-master3-78869fc5b5-9rsrs 1/1 Running 0 5d
rook-ceph-crashcollector-node1-765b49998c-25wjd 1/1 Running 0 4d20h
rook-ceph-crashcollector-node2-5c9bf65fcd-rpv26 1/1 Running 0 7h31m
rook-ceph-mds-myfs-a-765f596697-cp7zs 1/1 Running 43 4d20h
rook-ceph-mds-myfs-b-5488556c97-kdr9m 1/1 Running 0 8d
rook-ceph-mgr-a-77b889cb6d-rqktg 1/1 Running 0 8d
rook-ceph-mon-b-5d747c4957-mgv2t 1/1 Running 0 8d
rook-ceph-mon-c-55c86765c7-7clf6 1/1 Running 0 5d
rook-ceph-mon-d-85d9bcd45b-lkjvs 1/1 Running 0 7d20h
rook-ceph-operator-6f9fc8c7dd-bk68g 1/1 Running 0 3d16h
rook-ceph-osd-0-65d88658cb-gkthq 1/1 Running 0 5d
rook-ceph-osd-1-7dc95f7cd7-46ppp 1/1 Running 0 8d
rook-ceph-osd-2-5894c6b9c8-88q98 1/1 Running 0 4d20h
rook-ceph-osd-3-8565f8b8cc-l4llh 1/1 Running 0 8d
rook-ceph-osd-4-6cf8449f54-qh2lq 1/1 Running 0 5d
rook-ceph-osd-5-c7b84d7b7-qqv5p 1/1 Running 0 5d
rook-ceph-osd-6-5485fdcfc5-hwdl2 1/1 Running 0 8d
rook-ceph-osd-7-b54c78b68-m88kh 1/1 Running 0 4d20h
rook-ceph-osd-8-74b4575bd8-cx2fb 1/1 Running 0 8d
rook-ceph-osd-9-6b8d6bb87f-7tgcj 1/1 Running 0 5d
rook-ceph-osd-prepare-master1-cbgzb 0/1 Completed 0 3d16h
rook-ceph-osd-prepare-master2-lmglb 0/1 Completed 0 3d16h
rook-ceph-osd-prepare-master3-j9fx5 0/1 Completed 0 3d16h
rook-ceph-osd-prepare-node1-cdmcc 0/1 Completed 0 3d16h
rook-ceph-tools-6b4889fdfd-86dp5 1/1 Running 0 4d20h
rook-discover-5grcs 1/1 Running 0 3d16h
rook-discover-7ltj8 1/1 Running 0 3d16h
rook-discover-bnrrw 1/1 Running 0 3d16h
rook-discover-m8lbb 1/1 Running 0 7h31m
rook-discover-zkdb5 1/1 Running 0 3d16h
Since the rook-ceph-operator is responsible for creating and reconciling all of the rook-ceph pods, including the CSI driver DaemonSets and provisioner Deployments, the next step was to restart the operator pod:
[root@master1 mysql]# kubectl delete po -n rook-ceph rook-ceph-operator-6f9fc8c7dd-bk68g
pod "rook-ceph-operator-6f9fc8c7dd-bk68g" deleted
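Deleting the pod is safe because the operator runs under a Deployment: the ReplicaSet recreates it, and the new operator re-reconciles the CephCluster and re-deploys the missing CSI drivers. Progress can be watched with something like (a sketch):

```shell
# Sketch: watch the operator come back, then the csi-* pods it recreates.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get po -n rook-ceph -l app=rook-ceph-operator
  kubectl get po -n rook-ceph | grep '^csi-'
else
  echo "kubectl not available"
fi
```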
After the restart, check rook-ceph again:
[root@master1 mysql]# kubectl get po -n rook-ceph
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-6hntb 0/3 ContainerCreating 0 1s
csi-cephfsplugin-6zdm8 0/3 ContainerCreating 0 1s
csi-cephfsplugin-84dhz 0/3 Pending 0 1s
csi-cephfsplugin-twkbn 0/3 ContainerCreating 0 1s
csi-cephfsplugin-xg4mg 0/3 ContainerCreating 0 1s
csi-rbdplugin-48zk8 0/3 ContainerCreating 0 1s
csi-rbdplugin-4tn8s 0/3 ContainerCreating 0 1s
csi-rbdplugin-6vrwq 0/3 ContainerCreating 0 1s
csi-rbdplugin-provisioner-b4d4bc45d-s2sfx 0/6 ContainerCreating 0 1s
csi-rbdplugin-provisioner-b4d4bc45d-shz27 0/6 ContainerCreating 0 1s
csi-rbdplugin-s4jlv 0/3 ContainerCreating 0 1s
csi-rbdplugin-sdvjt 0/3 ContainerCreating 0 2s
rook-ceph-crashcollector-master1-9ff9c4b7f-92zqn 1/1 Running 0 8d
rook-ceph-crashcollector-master2-6fd8fd857d-m4ngp 1/1 Running 0 8d
rook-ceph-crashcollector-master3-78869fc5b5-9rsrs 1/1 Running 0 5d
rook-ceph-crashcollector-node1-765b49998c-25wjd 1/1 Running 0 4d20h
rook-ceph-crashcollector-node2-5c9bf65fcd-rpv26 1/1 Running 0 7h32m
rook-ceph-detect-version-wkcpw 0/1 Terminating 0 6s
rook-ceph-mds-myfs-a-765f596697-cp7zs 1/1 Running 43 4d20h
rook-ceph-mds-myfs-b-5488556c97-kdr9m 1/1 Running 0 8d
rook-ceph-mgr-a-77b889cb6d-rqktg 1/1 Running 0 8d
rook-ceph-mon-b-5d747c4957-mgv2t 1/1 Running 0 8d
rook-ceph-mon-c-55c86765c7-7clf6 1/1 Running 0 5d
rook-ceph-mon-d-85d9bcd45b-lkjvs 1/1 Running 0 7d20h
rook-ceph-operator-6f9fc8c7dd-ktkw5 1/1 Running 0 12s
rook-ceph-osd-0-65d88658cb-gkthq 1/1 Running 0 5d
rook-ceph-osd-1-7dc95f7cd7-46ppp 1/1 Running 0 8d
rook-ceph-osd-2-5894c6b9c8-88q98 1/1 Running 0 4d20h
rook-ceph-osd-3-8565f8b8cc-l4llh 1/1 Running 0 8d
rook-ceph-osd-4-6cf8449f54-qh2lq 1/1 Running 0 5d
rook-ceph-osd-5-c7b84d7b7-qqv5p 1/1 Running 0 5d
rook-ceph-osd-6-5485fdcfc5-hwdl2 1/1 Running 0 8d
rook-ceph-osd-7-b54c78b68-m88kh 1/1 Running 0 4d20h
rook-ceph-osd-8-74b4575bd8-cx2fb 1/1 Running 0 8d
rook-ceph-osd-9-6b8d6bb87f-7tgcj 1/1 Running 0 5d
rook-ceph-osd-prepare-master1-cbgzb 0/1 Completed 0 3d16h
rook-ceph-osd-prepare-master2-lmglb 0/1 Completed 0 3d16h
rook-ceph-osd-prepare-master3-j9fx5 0/1 Completed 0 3d16h
rook-ceph-osd-prepare-node1-cdmcc 0/1 Completed 0 3d16h
rook-ceph-tools-6b4889fdfd-86dp5 1/1 Running 0 4d20h
rook-discover-5grcs 1/1 Running 0 3d16h
rook-discover-7ltj8 1/1 Running 0 3d16h
rook-discover-bnrrw 1/1 Running 0 3d16h
rook-discover-m8lbb 1/1 Running 0 7h32m
rook-discover-zkdb5 1/1 Running 0 3d16h
Once the operator had recreated the CSI plugin pods, all pending pods in the namespace came up and ran normally.
Then check the StatefulSet pods that had originally failed to start:
[root@master1 mysql]# kubectl get po -n ztw
NAME READY STATUS RESTARTS AGE
mysql-0 2/2 Running 0 54m
mysql-1 2/2 Running 0 27m
mysql-2 2/2 Running 0 27m
The pods now start and run normally.
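As a final verification it is worth confirming that the CSI drivers are registered cluster-wide and the claims are actually Bound; a hedged sketch (the namespace `ztw` is taken from the listing above):

```shell
# Sketch: confirm CSI registration and PVC binding after the fix.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get csidrivers          # expect rook-ceph.rbd.csi.ceph.com / rook-ceph.cephfs.csi.ceph.com
  kubectl get pvc -n ztw          # every claim should show STATUS Bound
  kubectl get pv | grep rook-ceph # backing volumes created by the rook CSI provisioner
else
  echo "kubectl not available"
fi
```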