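Nomad Cluster High-Availability Test

1. Setting up the Nomad cluster

This test uses three Ubuntu 18.04 virtual machines with the following IP addresses:

VM 1: 192.168.60.10
VM 2: 192.168.60.11
VM 3: 192.168.60.12

Each VM runs one Nomad server and one Nomad client, so every server is paired with a client, and one of the three servers acts as the leader.

To set up the cluster, follow the "Nomad Consul cluster setup" guide and simply leave out the Consul parts.

Create /etc/nomad.d/nomad.hcl on each of the three VMs: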

datacenter = "dc1"
data_dir = "/home/xxx/nomad/data"  # change this path to suit your machine

plugin "raw_exec" {
  config {
    enabled = true
  }
}

server {
  enabled = true
  bootstrap_expect = 2
  server_join {
    retry_join = ["192.168.60.10:4648", "192.168.60.11:4648", "192.168.60.12:4648"]
  }
}

client {
  enabled = true
  servers = ["192.168.60.10:4647"]  # value for VM 1; use 192.168.60.11:4647 on VM 2 and 192.168.60.12:4647 on VM 3
}

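bootstrap_expect = 2 is the number of servers the cluster waits for before it bootstraps and elects its first leader; the actual number of servers can be greater than or equal to it.
When the number of live servers drops below what is needed for a majority (2 of the 3 servers here), the cluster keeps waiting for the servers to recover; tasks that are already running are left as they are and nothing is rescheduled.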
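Note, however, that bootstrap_expect only affects the initial bootstrap: it does not decide which clients are assigned tasks, and any client that is ready and eligible can receive allocations.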
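To test high availability one server has to be shut down, so with three servers in total this test sets bootstrap_expect to 2 (no higher); otherwise the cluster may be unable to keep running.
The comments in nomad_test.hcl may need to be removed.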
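
The arithmetic behind shutting down one of three servers can be sketched as follows. This is a minimal illustration, assuming the Nomad servers form a Raft group that needs a majority of servers to make progress; the function names quorum and fault_tolerance are hypothetical helpers, not part of Nomad.

```shell
#!/bin/sh
# Majority arithmetic for a group of N voting servers (illustrative helpers).

# quorum N: smallest number of servers that forms a majority of N.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# fault_tolerance N: how many of N servers can be lost while a majority remains.
fault_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 5; do
    echo "servers=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

For the three-server cluster used here, the loop reports a quorum of 2 and one tolerated failure, which is why a single server can be stopped during the test.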

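On all three VMs, run sudo nomad agent -config /etc/nomad.d

When you modify the hcl configuration file, first delete the data directory or change the data_dir path.

Check the cluster state; the server on ubuntu1 is the leader: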

ubuntu1@ubuntu1:~$ nomad server members
Name             Address        Port  Status  Leader  Protocol  Build  Datacenter  Region
ubuntu1.global   192.168.60.10  4648  alive   true    2         1.1.3  dc1         global
ubuntu2.global   192.168.60.11  4648  alive   false   2         1.1.3  dc1         global
ubuntu3.global   192.168.60.12  4648  alive   false   2         1.1.3  dc1         global

ubuntu1@ubuntu1:~$ nomad node status
ID        DC    Name     Class   Drain  Eligibility  Status
f340253b  dc1   ubuntu1  <none>  false  eligible     ready
d7dc7cb2  dc1   ubuntu2  <none>  false  eligible     ready
a9470c7f  dc1   ubuntu3  <none>  false  eligible     ready
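
Since the leader is checked several times during this test, the lookup can be scripted. The following is a small sketch that assumes the column layout of nomad server members shown above; leader_of is a hypothetical helper name, not a Nomad command.

```shell
#!/bin/sh
# leader_of: read `nomad server members` output on stdin and print the Name
# of the row whose Leader column is "true".
# The Leader column is located from the header row, so the helper does not
# depend on a fixed column position.
leader_of() {
    awk 'NR == 1 {
             for (i = 1; i <= NF; i++) if ($i == "Leader") col = i
             next
         }
         col && $col == "true" { print $1 }'
}

# Usage on a cluster node (not executed here):
#   nomad server members | leader_of
```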

2. Testing driver = docker

Using the Docker http-echo image as an example, create httpecho.nomad:

job "docs" {
  datacenters = ["dc1"]

  group "example" {
    count = 1

    network {
      port "http" {
        static = "5678"
      }
    }

    restart {
      attempts = 1
      interval = "30m"
      delay    = "2s"
      mode     = "fail"
    }

    reschedule {
      attempts       = 15
      interval       = "1h"
      delay          = "5s"
      delay_function = "constant"
      unlimited      = false
    }

    task "server" {
      driver = "docker"

      config {
        image = "hashicorp/http-echo"
        ports = ["http"]
        args = [
          "-listen", ":5678",
          "-text", "hello world",
        ]
      }
    }
  }
}

Run nomad job run httpecho.nomad on ubuntu1. The result is:

ubuntu1@ubuntu1:~$ nomad job run httpecho.nomad
==> 2021-08-27T13:51:14+08:00: Monitoring evaluation "ac5b793f"
    2021-08-27T13:51:14+08:00: Evaluation triggered by job "docs"
==> 2021-08-27T13:51:15+08:00: Monitoring evaluation "ac5b793f"
    2021-08-27T13:51:15+08:00: Evaluation within deployment: "540742e4"
    2021-08-27T13:51:15+08:00: Allocation "c1c145f8" created: node "a9470c7f", group "example"
    2021-08-27T13:51:15+08:00: Evaluation status changed: "pending" -> "complete"
==> 2021-08-27T13:51:15+08:00: Evaluation "ac5b793f" finished with status "complete"
==> 2021-08-27T13:51:15+08:00: Monitoring deployment "540742e4"
  ✓ Deployment "540742e4" successful

    2021-08-27T13:51:42+08:00
    ID          = 540742e4
    Job ID      = docs
    Job Version = 4
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    example     1        1       1        0          2021-08-27T14:01:40+08:00

sudo docker ps shows that the job is running on the ubuntu1 client:

ubuntu1@ubuntu1:~$ sudo docker ps
CONTAINER ID   IMAGE                 COMMAND                  CREATED          STATUS          PORTS                                                      NAMES
0ea066a3b0bc   hashicorp/http-echo   "/http-echo -listen …"   53 seconds ago   Up 52 seconds   192.168.60.10:5678->5678/tcp, 192.168.60.10:5678->5678/udp   server-c1c145f8-f3f9-efcf-14b2-13651aa2c6e2

From ubuntu2 or ubuntu3, run http 192.168.60.10:5678 (if the http command is missing, install it with sudo apt-get install httpie):

ubuntu2@ubuntu2:~$ http 192.168.60.10:5678
HTTP/1.1 200 OK
Content-Length: 12
Content-Type: text/plain; charset=utf-8
Date: Fri, 27 Aug 2021 05:54:36 GMT
X-App-Name: http-echo
X-App-Version: 0.2.3

hello world

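Now close the terminal on ubuntu1 that is running sudo nomad agent -config /etc/nomad.d. Because that single agent process hosts both the server and the client, this stops both on ubuntu1. The status of ubuntu1 changes to left, and ubuntu3 becomes the leader.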

ubuntu2@ubuntu2:~$ nomad server members
Name             Address        Port  Status  Leader  Protocol  Build  Datacenter  Region
ubuntu1.global   192.168.60.10  4648  left    false   2         1.1.3  dc1         global
ubuntu2.global   192.168.60.11  4648  alive   false   2         1.1.3  dc1         global
ubuntu3.global   192.168.60.12  4648  alive   true    2         1.1.3  dc1         global

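Checking Docker on ubuntu1 shows that the http-echo container still exists and still serves requests, so shutting down the server does not affect the Docker container that the client started.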

ubuntu1@ubuntu1:~$ sudo docker ps
CONTAINER ID   IMAGE                 COMMAND                  CREATED          STATUS          PORTS                                                      NAMES
0ea066a3b0bc   hashicorp/http-echo   "/http-echo -listen …"   13 minutes ago   Up 13 minutes   192.168.60.10:5678->5678/tcp, 192.168.60.10:5678->5678/udp   server-c1c145f8-f3f9-efcf-14b2-13651aa2c6e2

However, looking at Docker on the other two VMs shows that ubuntu2 has also started an http-echo container, while ubuntu3 is unchanged:

ubuntu2@ubuntu2:~$ sudo docker ps
CONTAINER ID   IMAGE                 COMMAND                  CREATED         STATUS         PORTS                                                      NAMES
e7c228d09421   hashicorp/http-echo   "/http-echo -listen …"   2 minutes ago   Up 2 minutes   192.168.60.11:5678->5678/tcp, 192.168.60.11:5678->5678/udp   server-f87ef72a-3ee0-b5c8-6f1e-0469e66d77b1

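This shows that when the server paired with the client running a job is shut down, the job is started on another client. It also shows that being the leader has no necessary connection with being chosen to run the job.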
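
The replacement container may not appear the instant the agent is stopped, so a check against the new client may need to be repeated for a short while. The following is a generic polling sketch; retry is a hypothetical helper, and the attempt count and delay in the usage comment are arbitrary assumptions rather than values taken from this test.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY CMD...: run CMD until it exits 0 or ATTEMPTS runs have
# been used, sleeping DELAY seconds between runs.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$(( i + 1 ))
        # Sleep only when another attempt will follow.
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Usage (not executed here): poll the http-echo endpoint on ubuntu2.
#   retry 30 2 curl -fsS http://192.168.60.11:5678
```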

Next, start the server on ubuntu1 again with sudo nomad agent -config /etc/nomad.d

The http-echo container that was still running on ubuntu1 is now terminated, while ubuntu2 and ubuntu3 are unchanged.

ubuntu1@ubuntu1:~$ sudo docker ps
CONTAINER ID   IMAGE                 COMMAND                  CREATED         STATUS         PORTS                                                      NAMES

ubuntu2@ubuntu2:~$ sudo docker ps
CONTAINER ID   IMAGE                 COMMAND                  CREATED         STATUS         PORTS                                                      NAMES
e7c228d09421   hashicorp/http-echo   "/http-echo -listen …"   13 minutes ago   Up 13 minutes   192.168.60.11:5678->5678/tcp, 192.168.60.11:5678->5678/udp   server-f87ef72a-3ee0-b5c8-6f1e-0469e66d77b1

ubuntu3@ubuntu3:~$ sudo docker ps
CONTAINER ID   IMAGE                 COMMAND                  CREATED         STATUS         PORTS                                                      NAMES

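This shows that after the stopped server comes back, job placement stays as it is and does not return to the state before the shutdown; the surplus copy of the job is stopped.

3. Testing driver = raw_exec

Write a simple program, main.c: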

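#include <stdio.h>
#include <unistd.h>  /* sleep() */

int main(void)
{
    /* Sleep forever so the task keeps running until it is killed. */
    while (1) {
        sleep(10);
    }
    return 0;
}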

Compile it with gcc main.c -o main.out.

Copy the binary to /usr with sudo cp main.out /usr/. All three VMs must complete the steps above.

Create exec.nomad:

job "exec" {
  datacenters = ["dc1"]
  type = "batch"

  group "execg" {
    count = 1

    restart {
      attempts = 1
      interval = "30m"
      delay    = "0s"
      mode     = "fail"
    }

    reschedule {
      attempts       = 15
      interval       = "1h"
      delay          = "5s"
      delay_function = "constant"
      unlimited      = false
    }

    task "exect" {
      driver = "raw_exec"

      config {
        command = "/usr/main.out"
      }
    }
  }
}

Run nomad job run exec.nomad on ubuntu1. The result is:

ubuntu1@ubuntu1:~$ nomad job run exec.nomad
==> 2021-08-30T16:08:45+08:00: Monitoring evaluation "562ed36f"
    2021-08-30T16:08:45+08:00: Evaluation triggered by job "exec"
==> 2021-08-30T16:08:46+08:00: Monitoring evaluation "562ed36f"
    2021-08-30T16:08:46+08:00: Allocation "b798e296" created: node "d7dc7cb2", group "execg"
    2021-08-30T16:08:46+08:00: Evaluation status changed: "pending" -> "complete"
==> 2021-08-30T16:08:46+08:00: Evaluation "562ed36f" finished with status "complete"
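
The node ID in the Allocation line identifies where the task was placed. As a rough sketch, the two helpers below pull that ID out of the nomad job run output and look up the matching name in nomad node status output. Both alloc_node and node_name are hypothetical names, and the parsing assumes the output formats shown in this document.

```shell
#!/bin/sh
# alloc_node: read `nomad job run` output on stdin and print the node ID of
# every 'Allocation "..." created: node "..."' line.
alloc_node() {
    sed -n 's/.*Allocation "[^"]*" created: node "\([^"]*\)".*/\1/p'
}

# node_name ID: read `nomad node status` output on stdin and print the Name
# column (third field) of the row whose ID column equals ID.
node_name() {
    awk -v id="$1" 'NR > 1 && $1 == id { print $3 }'
}

# Usage on a cluster node (not executed here):
#   id=$(nomad job run exec.nomad | alloc_node)
#   nomad node status | node_name "$id"
```

With the output above, the ID d7dc7cb2 maps to ubuntu2 in the node list from section 1.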

ps -ef shows that the job is running on the ubuntu2 client:

ubuntu2@ubuntu2:~$ ps -ef
...
root      41559      2  0 16:07 ?        00:00:00 [kworker/u256:1-]
root      41599  40421  0 16:08 pts/0    00:00:00 /usr/bin/nomad logmon
root      41608  40421  0 16:08 ?        00:00:00 /usr/bin/nomad executor {"LogF
root      41616  41608  0 16:08 ?        00:00:00 /usr/main.out
ubuntu2   41633  40566  0 16:09 pts/3    00:00:00 ps -ef
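
Scanning the full ps -ef listing by eye becomes tedious when the check is repeated on each VM. A small filter such as the one below can pick out the task process. It is a sketch that assumes the default ps -ef column layout, where the command starts in the eighth field; task_pids is a hypothetical helper name.

```shell
#!/bin/sh
# task_pids CMD: read `ps -ef` output on stdin and print the PID (second
# field) of every process whose command's first word (eighth field) is CMD.
task_pids() {
    awk -v cmd="$1" '$8 == cmd { print $2 }'
}

# Usage on a VM (not executed here):
#   ps -ef | task_pids /usr/main.out
```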

Now close the terminal on ubuntu2 that is running sudo nomad agent -config /etc/nomad.d. The status of ubuntu2 changes to left.

ubuntu1@ubuntu1:~$ nomad server members
Name             Address        Port  Status  Leader  Protocol  Build  Datacenter  Region
ubuntu1.global   192.168.60.10  4648  alive   false   2         1.1.3  dc1         global
ubuntu2.global   192.168.60.11  4648  left    false   2         1.1.3  dc1         global
ubuntu3.global   192.168.60.12  4648  alive   true    2         1.1.3  dc1         global

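Checking the processes on ubuntu2 shows that the main.out process still exists, so shutting down the server does not affect the raw_exec task that the client started.

However, looking at the processes on the other two VMs shows that ubuntu1 has also started a main.out process, while ubuntu3 is unchanged: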

ubuntu1@ubuntu1:~$ ps -ef
root      85269  75952  0 16:12 pts/1    00:00:00 /usr/bin/nomad logmon
root      85278  75952  0 16:12 ?        00:00:00 /usr/bin/nomad executor {"LogF
root      85286  85278  0 16:12 ?        00:00:00 /usr/main.out
root      85460      2  0 16:12 ?        00:00:00 [kworker/u256:1-]
ubuntu1   85576  76826  0 16:13 pts/3    00:00:00 ps -ef

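This shows that when the server paired with the client running a task is shut down, the task is started on another client.

Next, start the server on ubuntu2 again with sudo nomad agent -config /etc/nomad.d

The main.out process that was still running on ubuntu2 is now terminated, while ubuntu1 and ubuntu3 are unchanged.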

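This shows that after the stopped server comes back, task placement stays as it is and does not return to the state before the shutdown; the surplus copy of the task is stopped.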