Environment:

192.168.244.11  k8s-company01-master01
192.168.244.12  k8s-company01-master02
192.168.244.13  k8s-company01-master03
192.168.244.15  k8s-company01-lb
192.168.244.14  k8s-company01-worker001

Two or all three of the three masters go down

Once two or all three masters are down, the cluster itself is down, although the pods already running on the worker nodes keep running. The scenario here assumes the machines can be repaired and booted normally again; for the recovery procedure when the machines cannot be repaired, see: k8s集群恢复测试.

The test simulated here:

  • Shut down the two master machines 192.168.244.12 and 192.168.244.13
  • Get etcd on 192.168.244.11 working again on its own
  • Once 192.168.244.12 and 192.168.244.13 are back up, restore the whole cluster

Shut down machines 12 and 13 so that the cluster stops working

Before the shutdown, the cluster is healthy:

[root@k8s-company01-master01 ~]# kubectl get nodes
NAME                      STATUS   ROLES    AGE     VERSION
k8s-company01-master01    Ready    master   11m     v1.14.1
k8s-company01-master02    Ready    master   9m23s   v1.14.1
k8s-company01-master03    Ready    master   7m10s   v1.14.1
k8s-company01-worker001   Ready    <none>   13s     v1.14.1
[root@k8s-company01-master01 ~]#  kubectl -n kube-system get pod
NAME                                             READY   STATUS    RESTARTS   AGE
calico-kube-controllers-749f7c8df8-dqqkb         1/1     Running   1          5m6s
calico-kube-controllers-749f7c8df8-mdrnz         1/1     Running   1          5m6s
calico-kube-controllers-749f7c8df8-w89sk         1/1     Running   0          5m6s
calico-node-6r9jj                                1/1     Running   0          22s
calico-node-cnlqs                                1/1     Running   0          5m6s
calico-node-fb5dh                                1/1     Running   0          5m6s
calico-node-pmxrh                                1/1     Running   0          5m6s
calico-typha-646cdc958c-gd6xj                    1/1     Running   0          5m6s
coredns-56c9dc7946-hw4s8                         1/1     Running   1          11m
coredns-56c9dc7946-nr5zp                         1/1     Running   1          11m
etcd-k8s-company01-master01                      1/1     Running   0          10m
etcd-k8s-company01-master02                      1/1     Running   0          9m31s
etcd-k8s-company01-master03                      1/1     Running   0          7m18s
kube-apiserver-k8s-company01-master01            1/1     Running   0          10m
kube-apiserver-k8s-company01-master02            1/1     Running   0          9m31s
kube-apiserver-k8s-company01-master03            1/1     Running   0          6m12s
kube-controller-manager-k8s-company01-master01   1/1     Running   1          10m
kube-controller-manager-k8s-company01-master02   1/1     Running   0          9m31s
kube-controller-manager-k8s-company01-master03   1/1     Running   0          6m24s
kube-proxy-gnkxl                                 1/1     Running   0          7m19s
kube-proxy-jd82z                                 1/1     Running   0          11m
kube-proxy-rsswz                                 1/1     Running   0          9m32s
kube-proxy-tcx5s                                 1/1     Running   0          22s
kube-scheduler-k8s-company01-master01            1/1     Running   1          10m
kube-scheduler-k8s-company01-master02            1/1     Running   0          9m31s
kube-scheduler-k8s-company01-master03            1/1     Running   0          6m14s
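
How the two masters are taken offline does not matter for this test; the simplest sketch is to shut them down from their own shells (or power them off from the hypervisor):

# run on 192.168.244.12 and on 192.168.244.13
shutdown -h now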

After the shutdown:

[root@k8s-company01-master01 ~]# kubectl get nodes
Unable to connect to the server: unexpected EOF
[root@k8s-company01-master01 ~]#  kubectl -n kube-system get pod
Unable to connect to the server: http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=NO_ERROR, debug=""
[root@k8s-company01-master01 ~]# ETCDCTL_API=2 etcdctl  --endpoints https://192.168.244.11:2379,https://192.168.244.12:2379,https://192.168.244.13:2379  --cert-file=/etc/kubernetes/pki/etcd/server.crt --key-file=/etc/kubernetes/pki/etcd/server.key --ca-file=/etc/kubernetes/pki/etcd/ca.crt cluster-health
cluster may be unhealthy: failed to list members
Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://192.168.244.11:2379 exceeded header timeout
; error #1: dial tcp 192.168.244.13:2379: connect: no route to host
; error #2: client: endpoint https://192.168.244.12:2379 exceeded header timeout
error #0: client: endpoint https://192.168.244.11:2379 exceeded header timeout
error #1: dial tcp 192.168.244.13:2379: connect: no route to host
error #2: client: endpoint https://192.168.244.12:2379 exceeded header timeout

The cluster can no longer work, and etcd is unusable as well.

Start etcd on node 11 as a single-node cluster

The etcd configuration used by the cluster is the one in /etc/kubernetes/manifests/etcd.yaml (etcd runs as a static pod managed by the kubelet).
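
Before editing the manifest, it may be worth keeping a copy of the original outside the manifests directory, so the kubelet cannot mistake the backup for another static pod; a small sketch:

# keep the original manifest somewhere the kubelet does not watch
cp /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml.orig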

Add two flags to it so that etcd starts as a single-node cluster:

    - etcd
    - --advertise-client-urls=https://192.168.244.11:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://192.168.244.11:2380
    - --initial-cluster=k8s-company01-master01=https://192.168.244.11:2380
    - --initial-cluster-state=new      ## added (1)
    - --force-new-cluster              ## added (2)
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://192.168.244.11:2379
    - --listen-peer-urls=https://192.168.244.11:2380
    - --name=k8s-company01-master01
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt

Note the two newly added flags above (--initial-cluster-state=new and --force-new-cluster).

Once the manifest is saved, the kubelet notices the change and restarts the etcd static pod with the new arguments, so etcd comes up as a single-node cluster. With etcd recovered, the cluster works again:

[root@k8s-company01-master01 ~]#  kubectl get node
NAME                      STATUS     ROLES    AGE   VERSION
k8s-company01-master01    Ready      master   23m   v1.14.1
k8s-company01-master02    NotReady   master   21m   v1.14.1
k8s-company01-master03    NotReady   master   19m   v1.14.1
k8s-company01-worker001   Ready      <none>   12m   v1.14.1
[root@k8s-company01-master01 ~]#  kubectl -n kube-system get pod
NAME                                             READY   STATUS             RESTARTS   AGE
calico-kube-controllers-749f7c8df8-dqqkb         1/1     Running            3          20m
calico-kube-controllers-749f7c8df8-mdrnz         1/1     Running            1          20m
calico-kube-controllers-749f7c8df8-w89sk         1/1     Running            2          20m
calico-node-6r9jj                                1/1     Running            0          15m
calico-node-cnlqs                                1/1     Running            1          20m
calico-node-fb5dh                                1/1     Running            0          20m
calico-node-pmxrh                                1/1     Running            1          20m
calico-typha-646cdc958c-gd6xj                    1/1     Running            0          20m
coredns-56c9dc7946-hw4s8                         1/1     Running            7          26m
coredns-56c9dc7946-nr5zp                         1/1     Running            3          26m
etcd-k8s-company01-master01                      1/1     Running            0          3m36s
etcd-k8s-company01-master02                      0/1     CrashLoopBackOff   4          24m
etcd-k8s-company01-master03                      0/1     CrashLoopBackOff   4          22m
kube-apiserver-k8s-company01-master01            1/1     Running            7          25m
kube-apiserver-k8s-company01-master02            0/1     CrashLoopBackOff   3          24m
kube-apiserver-k8s-company01-master03            0/1     CrashLoopBackOff   3          21m
kube-controller-manager-k8s-company01-master01   1/1     Running            1          25m
kube-controller-manager-k8s-company01-master02   1/1     Running            1          24m
kube-controller-manager-k8s-company01-master03   1/1     Running            1          21m
kube-proxy-gnkxl                                 1/1     Running            1          22m
kube-proxy-jd82z                                 1/1     Running            0          26m
kube-proxy-rsswz                                 1/1     Running            1          24m
kube-proxy-tcx5s                                 1/1     Running            0          15m
kube-scheduler-k8s-company01-master01            1/1     Running            1          25m
kube-scheduler-k8s-company01-master02            1/1     Running            1          24m
kube-scheduler-k8s-company01-master03            1/1     Running            1          21m
[root@k8s-company01-master01 ~]# ETCDCTL_API=2 etcdctl  --endpoints https://192.168.244.11:2379,https://192.168.244.12:2379,https://192.168.244.13:2379  --cert-file=/etc/kubernetes/pki/etcd/server.crt --key-file=/etc/kubernetes/pki/etcd/server.key --ca-file=/etc/kubernetes/pki/etcd/ca.crt cluster-health
member eff3fafa1597fbf0 is healthy: got healthy result from https://192.168.244.11:2379
cluster is healthy

At this point the etcd cluster has only one member, and some pods still cannot run properly: the etcd pods of the other two masters are in CrashLoopBackOff.
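
The single remaining member can also be confirmed with the v3 API; a minimal check, assuming the same client certificates used above:

# run on 192.168.244.11; should list only the master01 member
ETCDCTL_API=3 etcdctl member list \
    --endpoints https://192.168.244.11:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key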

Restore the etcd cluster and bring the whole k8s cluster back to normal

Boot servers 12 and 13.

On server 11, add the etcd members for 12 and 13 back so that they again form one cluster. Before adding them, wipe the old etcd data on 12 and 13:

cd /var/lib/etcd
rm -rf member/
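
These two commands need to be run on both 192.168.244.12 and 192.168.244.13. A minimal sketch, assuming password-less root SSH from master01 to both machines:

# run on 192.168.244.11; wipes the stale etcd data on 12 and 13
for node in 192.168.244.12 192.168.244.13; do
    ssh root@$node 'rm -rf /var/lib/etcd/member'
done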

Add member 12 (all member-add operations are executed on 11):

[root@k8s-company01-master01 ~]# ETCDCTL_API=3 etcdctl member add etcd-k8s-company01-master02 --peer-urls="https://192.168.244.12:2380" --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --cacert=/etc/kubernetes/pki/etcd/ca.crt
Member 9ada83de146cad81 added to cluster 7b96e402e17890a5

ETCD_NAME="etcd-k8s-company01-master02"
ETCD_INITIAL_CLUSTER="etcd-k8s-company01-master02=https://192.168.244.12:2380,k8s-company01-master01=https://192.168.244.11:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.244.12:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

Output like the above means the member was added successfully. Next, restart the kubelet service on 192.168.244.12:

systemctl restart kubelet.service

Then check the etcd cluster health again from 11:

[root@k8s-company01-master01 ~]# ETCDCTL_API=2 etcdctl  --endpoints https://192.168.244.11:2379,https://192.168.244.12:2379,https://192.168.244.13:2379  --cert-file=/etc/kubernetes/pki/etcd/server.crt --key-file=/etc/kubernetes/pki/etcd/server.key --ca-file=/etc/kubernetes/pki/etcd/ca.crt cluster-health
member 9ada83de146cad81 is healthy: got healthy result from https://192.168.244.12:2379
member eff3fafa1597fbf0 is healthy: got healthy result from https://192.168.244.11:2379
cluster is healthy

The etcd cluster now has two members.

Continue by adding member 13 (after adding it, restart the kubelet service on 13):

[root@k8s-company01-master01 ~]# ETCDCTL_API=3 etcdctl member add etcd-k8s-company01-master03 --peer-urls="https://192.168.244.13:2380" --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --cacert=/etc/kubernetes/pki/etcd/ca.crt
Member efa2b7e4c407fb7a added to cluster 7b96e402e17890a5

ETCD_NAME="etcd-k8s-company01-master03"
ETCD_INITIAL_CLUSTER="k8s-company01-master02=https://192.168.244.12:2380,etcd-k8s-company01-master03=https://192.168.244.13:2380,k8s-company01-master01=https://192.168.244.11:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.244.13:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

After restarting the kubelet on 192.168.244.13, check the etcd cluster health once more from 11:

[root@k8s-company01-master01 ~]# ETCDCTL_API=2 etcdctl  --endpoints https://192.168.244.11:2379,https://192.168.244.12:2379,https://192.168.244.13:2379  --cert-file=/etc/kubernetes/pki/etcd/server.crt --key-file=/etc/kubernetes/pki/etcd/server.key --ca-file=/etc/kubernetes/pki/etcd/ca.crt cluster-health
member 9ada83de146cad81 is healthy: got healthy result from https://192.168.244.12:2379
member efa2b7e4c407fb7a is healthy: got healthy result from https://192.168.244.13:2379
member eff3fafa1597fbf0 is healthy: got healthy result from https://192.168.244.11:2379
cluster is healthy

The etcd cluster has recovered.

The whole cluster is back to normal:

[root@k8s-company01-master01 ~]#  kubectl get node
NAME                      STATUS   ROLES    AGE   VERSION
k8s-company01-master01    Ready    master   33m   v1.14.1
k8s-company01-master02    Ready    master   31m   v1.14.1
k8s-company01-master03    Ready    master   29m   v1.14.1
k8s-company01-worker001   Ready    <none>   22m   v1.14.1
[root@k8s-company01-master01 ~]#  kubectl -n kube-system get pod
NAME                                             READY   STATUS    RESTARTS   AGE
calico-kube-controllers-749f7c8df8-dqqkb         1/1     Running   3          27m
calico-kube-controllers-749f7c8df8-mdrnz         1/1     Running   1          27m
calico-kube-controllers-749f7c8df8-w89sk         1/1     Running   2          27m
calico-node-6r9jj                                1/1     Running   0          22m
calico-node-cnlqs                                1/1     Running   1          27m
calico-node-fb5dh                                1/1     Running   0          27m
calico-node-pmxrh                                1/1     Running   1          27m
calico-typha-646cdc958c-gd6xj                    1/1     Running   0          27m
coredns-56c9dc7946-hw4s8                         1/1     Running   7          33m
coredns-56c9dc7946-nr5zp                         1/1     Running   3          33m
etcd-k8s-company01-master01                      1/1     Running   0          10m
etcd-k8s-company01-master02                      1/1     Running   7          31m
etcd-k8s-company01-master03                      1/1     Running   7          29m
kube-apiserver-k8s-company01-master01            1/1     Running   7          32m
kube-apiserver-k8s-company01-master02            1/1     Running   6          31m
kube-apiserver-k8s-company01-master03            1/1     Running   7          28m
kube-controller-manager-k8s-company01-master01   1/1     Running   2          32m
kube-controller-manager-k8s-company01-master02   1/1     Running   1          31m
kube-controller-manager-k8s-company01-master03   1/1     Running   1          28m
kube-proxy-gnkxl                                 1/1     Running   1          29m
kube-proxy-jd82z                                 1/1     Running   0          33m
kube-proxy-rsswz                                 1/1     Running   1          31m
kube-proxy-tcx5s                                 1/1     Running   0          22m
kube-scheduler-k8s-company01-master01            1/1     Running   2          32m
kube-scheduler-k8s-company01-master02            1/1     Running   1          31m
kube-scheduler-k8s-company01-master03            1/1     Running   1          28m

If a member is added successfully but the etcd cluster still ends up down, just wait a moment: node 11 will automatically evict the member that failed to join, and the add can then simply be repeated.
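
If the stale member is not evicted automatically, it can also be removed by hand before repeating the add. A minimal sketch, where <MEMBER_ID> is a hypothetical placeholder for the ID reported by the member list command shown earlier:

# run on 192.168.244.11; <MEMBER_ID> is a placeholder for the stale member's ID
ETCDCTL_API=3 etcdctl member remove <MEMBER_ID> \
    --endpoints https://192.168.244.11:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key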
