Contents

  • Fault description
  • Troubleshooting
    • 1. Try restarting the Pod
    • 2. Check the Pod events
    • 3. Check the kubelet logs
    • 4. Check the PVC and PV objects
    • 5. Check the disk mount
  • Solution

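Fault description

A Pod abnormal-state alert was received from the internal environment:

[Alerting] Pod status alert
Pods in the cluster have been in an abnormal state for more than 1 minute
1. ti-inf/etcd-1 (Pending): 1.000
Details: http://xx.xx.xx.xx/grafana/d/default/alert-dashboard?tab=alert&viewPanel=19&orgId=1

Checking the abnormal Pods in the k8s cluster showed that they were data-component Pods.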

Troubleshooting

1. Try restarting the Pod

~]# kubectl delete pod etcd-1 -n ti-inf
After being recreated, the Pod was still in an abnormal state.

2. Check the Pod events

redis-server-2, another affected data-component Pod, is used as the example here.

~]# kubectl describe pod redis-server-2 -n ti-inf
Events:
  Type     Reason       Age                     From     Message
  ----     ------       ----                    ----     -------
  Normal   Scheduled    28m                     volcano  Successfully assigned ti-inf/redis-server-2 to x.x.x.x
  Warning  FailedMount  3m17s (x3599 over 28m)  kubelet  MountVolume.SetUp failed for volume "pvc-9d1c0e76-6d56-439d-8070-741d8846d569" : rpc error: code = Internal desc = stat /csi-data-dir/ti-database/pv: input/output error

The events show that the kubelet failed at the MountVolume.SetUp step. The reported error is an "input/output error" returned when the CSI driver ran stat on /csi-data-dir/ti-database/pv for the PVC's volume.
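When many Pods report FailedMount, it can help to pull the volume name and the underlying error out of the event message programmatically. The following is a minimal sketch, not part of the original investigation: the function name `parse_failed_mount` and the regular expression are illustrative assumptions, and the sample message is the one from the events above.

```python
import re

# Pattern for the kubelet FailedMount message shown in the events above.
# The pattern itself is an illustrative assumption, not a kubelet API.
MOUNT_FAILURE = re.compile(
    r'MountVolume\.SetUp failed for volume "(?P<volume>[^"]+)" : '
    r'rpc error: code = (?P<code>\w+) desc = (?P<desc>.+)'
)


def parse_failed_mount(message):
    """Return volume, gRPC code and description as a dict, or None if no match."""
    match = MOUNT_FAILURE.search(message)
    if match is None:
        return None
    return match.groupdict()


message = (
    'MountVolume.SetUp failed for volume "pvc-9d1c0e76-6d56-439d-8070-741d8846d569" : '
    'rpc error: code = Internal desc = stat /csi-data-dir/ti-database/pv: input/output error'
)
info = parse_failed_mount(message)
print(info["volume"])  # the PV name to look up with kubectl get pv
print(info["code"])    # the gRPC status code returned by the CSI driver
print(info["desc"])    # the underlying error, here a failed stat call
```

The description field is the part worth reading closely: here it points at a failed stat on the driver's data directory rather than at a Kubernetes-level problem.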

3. Check the kubelet logs

[root@VM-2-29-centos prometheus-db]# grep -i error /var/log/messages| tail -n 5
Jun 28 20:14:13 VM-2-29-centos kubelet: E0628 20:14:13.819828  793997 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-668750fa-cc0a-4105-96f3-7fa184db4ada podName: nodeName:}" failed. No retries permitted until 2022-06-28 20:14:14.319804053 +0800 CST m=+11760883.388055363 (durationBeforeRetry 500ms). Error: "MountVolume.SetUp failed for volume \"pvc-668750fa-cc0a-4105-96f3-7fa184db4ada\" (UniqueName: \"kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-668750fa-cc0a-4105-96f3-7fa184db4ada\") pod \"etcd-1\" (UID: \"1c99773c-3845-4141-ac30-1c3d26f1f30a\") : rpc error: code = Internal desc = stat /csi-data-dir/ti-database/pv: input/output error"
Jun 28 20:14:13 VM-2-29-centos kubelet: E0628 20:14:13.901519  793997 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-668750fa-cc0a-4105-96f3-7fa184db4ada podName:4c5d9bdf-498a-4456-9c6c-e6f7b456e693 nodeName:}" failed. No retries permitted until 2022-06-28 20:14:14.401482582 +0800 CST m=+11760883.469733942 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"data\" (UniqueName: \"kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-668750fa-cc0a-4105-96f3-7fa184db4ada\") pod \"4c5d9bdf-498a-4456-9c6c-e6f7b456e693\" (UID: \"4c5d9bdf-498a-4456-9c6c-e6f7b456e693\") : kubernetes.io/csi: mounter.TearDownAt failed: rpc error: code = Internal desc = stat /var/lib/kubelet/pods/4c5d9bdf-498a-4456-9c6c-e6f7b456e693/volumes/kubernetes.io~csi/pvc-668750fa-cc0a-4105-96f3-7fa184db4ada/mount: input/output error"
Jun 28 20:14:14 VM-2-29-centos kubelet: E0628 20:14:14.018249  793997 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-9d1c0e76-6d56-439d-8070-741d8846d569 podName: nodeName:}" failed. No retries permitted until 2022-06-28 20:14:14.518217097 +0800 CST m=+11760883.586468437 (durationBeforeRetry 500ms). Error: "MountVolume.SetUp failed for volume \"pvc-9d1c0e76-6d56-439d-8070-741d8846d569\" (UniqueName: \"kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-9d1c0e76-6d56-439d-8070-741d8846d569\") pod \"redis-server-2\" (UID: \"5550e257-2245-4401-bd9a-cf275ff94675\") : rpc error: code = Internal desc = stat /csi-data-dir/ti-database/pv: input/output error"
Jun 28 20:14:14 VM-2-29-centos kubelet: E0628 20:14:14.102735  793997 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-9d1c0e76-6d56-439d-8070-741d8846d569 podName:daea4ba4-b97c-46c6-866b-aa7cc29af0a8 nodeName:}" failed. No retries permitted until 2022-06-28 20:14:14.602692068 +0800 CST m=+11760883.670943428 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"data\" (UniqueName: \"kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-9d1c0e76-6d56-439d-8070-741d8846d569\") pod \"daea4ba4-b97c-46c6-866b-aa7cc29af0a8\" (UID: \"daea4ba4-b97c-46c6-866b-aa7cc29af0a8\") : kubernetes.io/csi: mounter.TearDownAt failed: rpc error: code = Internal desc = stat /var/lib/kubelet/pods/daea4ba4-b97c-46c6-866b-aa7cc29af0a8/volumes/kubernetes.io~csi/pvc-9d1c0e76-6d56-439d-8070-741d8846d569/mount: input/output error"

The logs show that I/O on the underlying disk is partially blocked: both MountVolume.SetUp and UnmountVolume.TearDown fail with "input/output error" for the affected volumes, and the kubelet retries every 500ms, which produces the flood of errors above.
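Because the kubelet retries every 500ms, the raw log is hard to read by eye. A rough sketch of how the failures could be grouped by PV and operation is shown below. The helper name `summarize_failures` and the regular expression are assumptions made for illustration, and the sample lines are trimmed to the part of the real log lines that the pattern needs.

```python
import re
from collections import Counter

# Matches the failed operation and the PV name embedded in UniqueName.
# Quotes inside the kubelet Error string are backslash-escaped in the log.
FAILED_OP = re.compile(
    r'(?P<op>MountVolume\.SetUp|UnmountVolume\.TearDown) failed for volume .*?'
    r'UniqueName: \\"[^"\\]*\^(?P<pv>[^"\\]+)\\"'
)

# Two log lines from above, trimmed to the "Error:" portion up to the pod name.
SAMPLE_LOG = "\n".join([
    r'Error: "MountVolume.SetUp failed for volume \"pvc-668750fa-cc0a-4105-96f3-7fa184db4ada\" '
    r'(UniqueName: \"kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-668750fa-cc0a-4105-96f3-7fa184db4ada\") '
    r'pod \"etcd-1\"',
    r'Error: "UnmountVolume.TearDown failed for volume \"data\" '
    r'(UniqueName: \"kubernetes.io/csi/loopdevice.csi.infra.ti.io^pvc-668750fa-cc0a-4105-96f3-7fa184db4ada\") '
    r'pod \"4c5d9bdf-498a-4456-9c6c-e6f7b456e693\"',
])


def summarize_failures(log_text):
    """Count failed volume operations per (PV name, operation)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED_OP.search(line)
        if match:
            counts[(match.group("pv"), match.group("op"))] += 1
    return counts


summary = summarize_failures(SAMPLE_LOG)
for (pv, op), count in sorted(summary.items()):
    print(pv, op, count)
```

Grouping by PV rather than by Pod makes it visible that the mount of the new Pod and the unmount of the old Pod fail on the same volume, which points at the storage rather than at either Pod.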

4. Check the PVC and PV objects

[root@VM-2-29-centos ~]# kubectl get pvc -nti-inf |grep redis
data-redis-server-0                  Bound    pvc-59fde781-e03e-4b26-b07c-7de93f608395   10Gi       RWO            csi-localpv-tidb   136d
data-redis-server-1                  Bound    pvc-6bf28ec2-40e1-4b52-8d54-b4ab0aa9f67a   10Gi       RWO            csi-localpv-tidb   136d
data-redis-server-2                  Bound    pvc-9d1c0e76-6d56-439d-8070-741d8846d569   10Gi       RWO            csi-localpv-tidb   136d
[root@VM-2-29-centos ~]#
[root@VM-2-29-centos ~]# kubectl get pv |grep redis
pvc-59fde781-e03e-4b26-b07c-7de93f608395   10Gi       RWO            Delete           Bound    ti-inf/data-redis-server-0                                                    csi-localpv-tidb            136d
pvc-6bf28ec2-40e1-4b52-8d54-b4ab0aa9f67a   10Gi       RWO            Delete           Bound    ti-inf/data-redis-server-1                                                    csi-localpv-tidb            136d
pvc-9d1c0e76-6d56-439d-8070-741d8846d569   10Gi       RWO            Delete           Bound    ti-inf/data-redis-server-2                                                    csi-localpv-tidb            136d

Both the PVC and the PV objects are normal (all Bound).

5. Check the disk mount

Use dmesg, which prints the kernel ring buffer, to look at kernel-level messages:

[Tue Jun 28 20:22:47 2022] buffer_io_error: 6 callbacks suppressed
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971392, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971393, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971394, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971395, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971396, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971397, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971398, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop0, logical block 20971399, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop4, logical block 20971392, async page read
[Tue Jun 28 20:22:47 2022] Buffer I/O error on dev loop4, logical block 20971393, async page read

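The disk backing the PVCs is /dev/vdc, on which an LVM logical volume was created, so the fault evidently lies in the logical volume. The loop devices in the kernel messages (loop0, loop4) are presumably the volumes that the loopdevice.csi.infra.ti.io driver provisions on top of it.

Listing the directory from a terminal on the node confirms that it can no longer be accessed:
~]# ls /data/ti-database
ls: cannot access /data/ti-database: Input/output error

Note: this matches the kernel messages above, which report Buffer I/O errors where asynchronous page reads of logical blocks (for example 20971393) failed.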
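To see at a glance which block devices are affected, the kernel messages can be tallied per device. This is a small sketch under stated assumptions: the function name `count_io_errors` and the regular expression are illustrative, and the sample is a subset of the dmesg lines above with the timestamps omitted.

```python
import re
from collections import Counter

# Matches the kernel's "Buffer I/O error" lines; the timestamp prefix is ignored.
BUFFER_IO_ERROR = re.compile(
    r'Buffer I/O error on dev (?P<dev>[^,\s]+), logical block (?P<block>\d+)'
)

# Subset of the dmesg output above, timestamps omitted.
SAMPLE_DMESG = """\
buffer_io_error: 6 callbacks suppressed
Buffer I/O error on dev loop0, logical block 20971392, async page read
Buffer I/O error on dev loop0, logical block 20971393, async page read
Buffer I/O error on dev loop0, logical block 20971394, async page read
Buffer I/O error on dev loop4, logical block 20971392, async page read
Buffer I/O error on dev loop4, logical block 20971393, async page read
"""


def count_io_errors(dmesg_text):
    """Count Buffer I/O error lines per block device."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = BUFFER_IO_ERROR.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


errors = count_io_errors(SAMPLE_DMESG)
for dev, count in sorted(errors.items()):
    print(dev, count)
```

Errors spread across several loop devices at the same moment suggest a shared cause underneath them, which is consistent with the conclusion that the logical volume itself has failed.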

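Solution

The platform's data components (etcd/redis/es) all run with 3 replicas and can tolerate a single point of failure. In addition, this logical volume was planned from the start to be used only by the data components, so other services are not affected. Recreating the LVM logical volume is therefore sufficient.

Detailed procedure:
1. Back up the mysql/etcd/es data
2. Unmount the logical volume
3. Remove the logical volume (LV) with lvremove
4. Remove the volume group (VG) with vgremove
5. Remove the physical volume with pvremove
   After these steps, lvdisplay, vgdisplay, and pvdisplay should no longer show the removed LVM objects.
6. Delete the PVs and PVCs on this node
7. Recreate the LVM logical volume and mount it
8. Create the PV and PVC objects and bind them to the Pods
9. Verify the Pod status
10. Check the cluster health of the redis and etcd components, and verify data consistency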
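Steps 2 to 5 and 7 above could be written out as a dry-run command plan before anything is executed on the node. The sketch below only builds and prints the command strings, it does not run them. The volume group name `vg_data`, the logical volume name `lv_data`, and the ext4 filesystem are hypothetical placeholders that do not come from the original environment. Only /dev/vdc and /data/ti-database appear in the investigation above, and treating /data/ti-database as the mount point is an assumption.

```python
def lvm_rebuild_plan(device, vg, lv, mount_point, fs_type="ext4"):
    """Build the ordered shell commands for tearing down and recreating an LV.

    Nothing is executed; the caller reviews the plan and runs it manually.
    """
    lv_path = f"/dev/{vg}/{lv}"
    return [
        f"umount {mount_point}",                 # step 2: unmount the logical volume
        f"lvremove -y {lv_path}",                # step 3: remove the LV
        f"vgremove {vg}",                        # step 4: remove the VG
        f"pvremove {device}",                    # step 5: remove the physical volume
        f"pvcreate {device}",                    # step 7: recreate the LVM stack
        f"vgcreate {vg} {device}",
        f"lvcreate -l 100%FREE -n {lv} {vg}",
        f"mkfs.{fs_type} {lv_path}",             # filesystem type is an assumption
        f"mount {lv_path} {mount_point}",
    ]


# vg_data and lv_data are hypothetical names used only for illustration.
plan = lvm_rebuild_plan(
    device="/dev/vdc",
    vg="vg_data",
    lv="lv_data",
    mount_point="/data/ti-database",
)
for command in plan:
    print(command)
```

Printing the plan first keeps the destructive commands reviewable, and the backup in step 1 should be confirmed before any of them is run.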

