Background

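Slurm supports creating a federation of clusters and scheduling jobs among them in a peer-to-peer fashion. A job submitted to a federation receives a job ID that is unique across all clusters in the federation. The job is submitted to the local cluster (the cluster defined in slurm.conf) and is then replicated to the other clusters in the federation. Each cluster then independently tries to schedule the job according to its own scheduling policy, coordinating with the origin cluster (the cluster the job was submitted to).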

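Federated scheduling makes hybrid on-premises/cloud HPC scheduling possible and adds elasticity and capacity to an existing on-premises Slurm cluster. The on-premises and cloud Slurm clusters form a federation. Users submit jobs to the on-premises Slurm cluster as usual, and each job is also replicated to the cloud Slurm cluster. Every cluster tries to schedule the job and allocate resources for it. When a cluster succeeds, it notifies the origin cluster (the cluster the job was submitted to) that it has started the job, and the origin cluster tells the other clusters to cancel and remove their copies of the job, which are placed in the revoked state.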

Basic workflow

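1. The user logs in to the on-premises cluster (on-prem).

2. The user submits a job to the on-premises cluster.

3. Slurm copies the job to the Slurm cluster in the cloud (AWS).

4. If the on-premises cluster can run the job, it notifies the cloud cluster (AWS) to cancel its copy.

5. If the on-premises cluster cannot schedule the job but the cloud cluster can, the cloud cluster (AWS) starts the job and notifies the on-premises cluster (on-prem) to revoke its copy.

6. The commands sinfo --federation, squeue --federation, and sacct --federation show the state of all jobs across the federation.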

Verification setup

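1. Use ParallelCluster to create a Slurm cluster in one AWS Region, with the maximum and minimum node counts equal, to simulate the on-premises cluster (on-prem).

2. Enable the slurmdbd daemon and the accounting service on the on-premises cluster; Slurm multi-cluster operation depends on accounting.

3. Use ParallelCluster to create a Slurm cluster in another Region to act as the cloud cloudburst cluster (AWS).

4. Connect the two clusters with VPC peering to simulate a hybrid cloud; DNS hostname resolution must be configured.

5. Configure multi-cluster operation.

6. Configure the federation.

7. Submit test jobs to verify federated scheduling.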

Detailed configuration steps

1. Install ParallelCluster in a virtual environment, so that two ParallelCluster setups can be installed side by side

Install or upgrade pip and virtualenv

$ python3 -m pip install --upgrade pip
$ python3 -m pip install --user --upgrade virtualenv

Create the virtual environment

$ python3 -m virtualenv ~/.pcluster

Activate the virtual environment

$ source ~/.pcluster/bin/activate

Install ParallelCluster into the virtual environment

(.pcluster) a483e778a9b5:~ xinxx$ python3 -m pip install --upgrade aws-parallelcluster

Verify the ParallelCluster installation

(.pcluster) a483e778a9b5:~ xinxx$ pcluster version
2.10.0
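The pcluster configure -c and pcluster create commands used in the rest of this walkthrough belong to the ParallelCluster 2.x command line; 3.x replaced them with subcommands such as create-cluster and a YAML configuration file. The following is a minimal sketch for guarding a script against the wrong major version. The helper names pcluster_major and is_pcluster_v2 are made up for illustration and are not part of ParallelCluster.

```bash
#!/bin/sh
# Hypothetical helper: print the major component of a version string.
# Example: "2.10.0" -> "2"
pcluster_major() {
    printf '%s\n' "${1%%.*}"
}

# Hypothetical helper: succeed only when the version string is a 2.x release.
is_pcluster_v2() {
    [ "$(pcluster_major "$1")" = "2" ]
}

# Usage on a machine with ParallelCluster installed (not executed here):
#   is_pcluster_v2 "$(pcluster version)" \
#       || echo "this walkthrough assumes the ParallelCluster 2.x CLI" >&2
```

If a newer release gets installed by default, pinning the package (for example python3 -m pip install 'aws-parallelcluster<3') is one way to stay on the 2.x series.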

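2. Use ParallelCluster to create the simulated on-premises Slurm cluster (on-prem). Setting initial_queue_size equal to max_queue_size simulates a fixed-size on-premises cluster. Give pcluster a separate configuration file for each cluster to keep the on-premises cluster (named on-perm throughout this walkthrough) and the cloud (AWS) cluster apart.

Configure the on-perm cluster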

(.pcluster) a483e778a9b5:~ xinxx$ pcluster configure -c ~/.parallelcluster/pcluster-config-on-perm
INFO: Configuration file /Users/xinxx/.parallelcluster/pcluster-config-on-perm will be written.
Press CTRL-C to interrupt the procedure.

Allowed values for AWS Region ID:
1. cn-north-1
2. cn-northwest-1
AWS Region ID [cn-northwest-1]:
Allowed values for EC2 Key Pair Name:
1. xinxx-key-nx
EC2 Key Pair Name [xinxx-key-nx]:
Allowed values for Scheduler:
1. sge
2. torque
3. slurm
4. awsbatch
Scheduler [slurm]:
Allowed values for Operating System:
1. alinux
2. alinux2
3. centos7
4. centos8
5. ubuntu1604
6. ubuntu1804
Operating System [alinux2]:
Minimum cluster size (instances) [0]: 2
Maximum cluster size (instances) [10]: 2
Master instance type [t2.micro]:
Compute instance type [t2.micro]:
Automate VPC creation? (y/n) [n]: y
Allowed values for Network Configuration:
1. Master in a public subnet and compute fleet in a private subnet
2. Master and compute fleet in the same public subnet
Network Configuration [Master in a public subnet and compute fleet in a private subnet]: 2
Beginning VPC creation. Please do not leave the terminal until the creation is finalized
Creating CloudFormation stack...
Do not leave the terminal until the process has finished
Stack Name: parallelclusternetworking-pub-20201211120145
Status: parallelclusternetworking-pub-20201211120145 - CREATE_COMPLETE
The stack has been created
Configuration file written to /Users/xinxx/.parallelcluster/pcluster-config-on-perm
You can edit your configuration file or simply run 'pcluster create -c /Users/xinxx/.parallelcluster/pcluster-config-on-perm cluster-name' to create your cluster

Create the on-perm cluster

(.pcluster) a483e778a9b5:~ xinxx$ pcluster create on-perm -c /Users/xinxx/.parallelcluster/pcluster-config-on-perm
Beginning cluster creation for cluster: on-perm
Creating stack named: parallelcluster-on-perm
Status: parallelcluster-on-perm - CREATE_COMPLETE
MasterPublicIP: 52.82.115.178
ClusterUser: ec2-user
MasterPrivateIP: 10.0.3.94

3. Modify the on-premises cluster

Rename the cluster to on-perm

  • Edit /opt/slurm/etc/slurm.conf (for example with vi) and set the ClusterName parameter to on-perm

ClusterName=on-perm
  • Stop the Slurm controller

[root@ip-10-0-3-94 ]# systemctl stop slurmctld
  • Delete all files under /var/spool/slurm.state/

[root@ip-10-0-3-94 ]# rm -rf /var/spool/slurm.state/*
  • Start Slurm again

[root@ip-10-0-3-94 ]# systemctl start slurmctld
  • Check that the Slurm cluster is running

[root@ip-10-0-3-94 ]# lsid
Slurm 20.02.4, Feb 1 2020
Copyright SchedMD LLC, 2010-2017.

My cluster name is on-perm
My master name is ip-10-0-3-94

[root@ip-10-0-3-94 ]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute*  up    infinite      2 idle~ compute-st-t2micro-[1-2]
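The ClusterName edit in the first step above can also be scripted. This is a sketch only: set_cluster_name is a hypothetical helper, it assumes the file already contains a ClusterName= line and that the new name contains no "/" characters, and it is demonstrated on a scratch file rather than on /opt/slurm/etc/slurm.conf. Stopping slurmctld, clearing /var/spool/slurm.state/, and starting slurmctld again still have to be done as shown above.

```bash
#!/bin/sh
# Hypothetical helper: rewrite the ClusterName line of a slurm.conf-style file.
#   $1 = path to the config file, $2 = new cluster name
# The body runs in a subshell, so its variables do not leak to the caller.
set_cluster_name() (
    conf=$1
    name=$2
    tmp=$(mktemp) || exit 1
    # Replace the existing ClusterName= line; all other lines pass through.
    # Writing back with cat keeps the owner and mode of the original file.
    sed "s/^ClusterName=.*/ClusterName=${name}/" "$conf" > "$tmp" && cat "$tmp" > "$conf"
    rm -f "$tmp"
    # Succeed only if the file now carries the requested name.
    grep -q "^ClusterName=${name}\$" "$conf"
)

# Demonstration on a scratch file that mimics the default cluster name.
demo=$(mktemp)
printf '# CLUSTER SETTINGS\nClusterName=parallelcluster\n' > "$demo"
set_cluster_name "$demo" on-perm && grep '^ClusterName=' "$demo"
rm -f "$demo"
```

On the head node the call would be set_cluster_name /opt/slurm/etc/slurm.conf on-perm, run as root.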

Install SlurmDBD on the head node of the on-perm cluster to record accounting information; this service is required for multi-cluster operation

  • Install MariaDB. In this example the head node runs Amazon Linux 2; run the following commands as root

[root@ip-10-0-3-94 ~]# yum install -y mariadb mariadb-server
  • Start MariaDB

[root@ip-10-0-3-94 ]# systemctl start mariadb
[root@ip-10-0-3-94 ]# systemctl enable mariadb
  • Set the MariaDB root password

[root@ip-10-0-3-94 ]# mysqladmin -u root password <yourpassword>
  • Log in

[root@ip-10-0-3-94 ]# mysql -u root -p
Enter password:
Welcome to the MariaDB monitor. Commands end with ; or \g.
  • Create the database user and the database needed for Slurm accounting

MariaDB [(none)]> create user 'slurm'@'localhost' identified by '<password>';
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> grant all on slurm_acct_db.* TO 'slurm'@'localhost';
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> grant all on slurm_acct_db.* TO 'slurm'@'system0';
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> create database slurm_acct_db;
Query OK, 1 row affected (0.00 sec)

MariaDB [(none)]> quit
Bye

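Add the accounting parameters to the configuration file /opt/slurm/etc/slurm.conf. The JobComp settings store job completion records. This storage currently only supports direct access to MySQL, so JobCompHost is the host running MySQL (localhost in this example) and JobCompPass is the password of the MySQL user "slurm".

AccountingStorageHost is the hostname of the machine running the slurmdbd daemon. Note: the default configuration already contains JobCompType=jobcomp/none in its LOGGING section; that line must be commented out.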

# LOGGING
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#JobCompType=jobcomp/none

...

# JobComp
JobCompType=jobcomp/mysql
JobCompHost=localhost
JobCompPort=3306
JobCompPass=<your_mariadb_slurm_password>
JobCompUser=slurm
JobCompLoc=slurm_acct_db
#JobCompLoc=
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
#
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=ip-10-0-3-94
AccountingStoragePort=6819
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
DebugFlags=NO_CONF_HASH
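Before restarting slurmctld, it can help to confirm that the edited file really points accounting at slurmdbd and that no commented-out line is being mistaken for the active one. The sketch below is illustrative only: conf_value and uses_slurmdbd are hypothetical helpers, and they assume keys are written exactly as in the listing above (Slurm itself accepts other letter case) with no spaces around "=".

```bash
#!/bin/sh
# Hypothetical helper: print the value of a key from a slurm.conf-style file.
#   $1 = file, $2 = key
# Lines starting with "#" never match, so commented-out settings are ignored.
# If the key appears more than once, the last occurrence is printed.
conf_value() {
    sed -n "s/^$2=//p" "$1" | tail -n 1
}

# Hypothetical check: succeed only if accounting is routed through slurmdbd
# and a slurmdbd host is configured.
uses_slurmdbd() {
    [ "$(conf_value "$1" AccountingStorageType)" = "accounting_storage/slurmdbd" ] &&
    [ -n "$(conf_value "$1" AccountingStorageHost)" ]
}

# Usage on the head node (not executed here):
#   uses_slurmdbd /opt/slurm/etc/slurm.conf || echo "accounting is not set up" >&2
```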

Create the configuration file /opt/slurm/etc/slurmdbd.conf

# slurmDBD info
DbdHost=localhost
DbdPort=6819
SlurmUser=slurm
#MessageTimeout=60
DebugLevel=6
#DefaultQOS=normal
LogFile=/var/log/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
PluginDir=/opt/slurm/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
# Database info
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePort=3306
StoragePass=<your_mariadb_slurm_password>
StorageUser=slurm
StorageLoc=slurm_acct_db
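Because slurmdbd.conf holds the database password in clear text (StoragePass), the Slurm documentation recommends that the file be readable only by the user that runs slurmdbd. A minimal sketch of tightening the mode and confirming it follows; lock_down is a hypothetical helper name, and the ownership change in the usage comment assumes the slurm user and group from SlurmUser=slurm above.

```bash
#!/bin/sh
# Hypothetical helper: restrict a file to owner read/write (mode 600)
# and confirm the result from the permission string printed by ls.
lock_down() {
    chmod 600 "$1" && [ "$(ls -ld "$1" | cut -c1-10)" = "-rw-------" ]
}

# Usage on the head node, as root (not executed here):
#   chown slurm:slurm /opt/slurm/etc/slurmdbd.conf
#   lock_down /opt/slurm/etc/slurmdbd.conf
```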

Restart slurmctld and start slurmdbd

[root@ip-10-0-3-94 etc]# systemctl stop slurmctld
[root@ip-10-0-3-94 etc]# systemctl start slurmctld
[root@ip-10-0-3-94 etc]# /opt/slurm/sbin/slurmdbd

Check the accounting status (for example with sacctmgr show stats)

Internal DBD rollup last ran Sun Dec 13 03:25:33 2020 (1607829933)
    Last cycle: 44
    Max cycle: 44
    Total time: 44
    Total cycles: 1
    Mean cycle: 44

Remote Procedure Call statistics by message type
    SLURM_PERSIST_INIT ( 6500) count:9 ave_time:380 total_time:3423
    DBD_FINI ( 1401) count:9 ave_time:172 total_time:1552
    DBD_CLUSTER_TRES ( 1407) count:1 ave_time:640 total_time:640
    DBD_GET_JOBS_COND ( 1444) count:1 ave_time:526 total_time:526
    DBD_GET_ACCOUNTS ( 1409) count:1 ave_time:488 total_time:488
    DBD_GET_CLUSTERS ( 1412) count:1 ave_time:479 total_time:479

Remote Procedure Call statistics by user
    root ( 0) count:20 ave_time:302 total_time:6058
    slurm ( 990) count:2 ave_time:525 total_time:1050

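4. Create the cloud cluster

Configure pcluster with a new configuration file

  • Create a VPC manually. Because the two clusters will be joined, their CIDR ranges must not overlap; this example uses 10.100.0.0/16. Make sure DNS hostnames and DNS resolution are enabled for the VPC

  • Run pcluster configure -c ~/.parallelcluster/pcluster-config-aws and select the VPC created in the previous step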

(.pcluster) a483e778a9b5:~ xinxx$ pcluster configure -c .parallelcluster/pcluster-config-aws
INFO: Configuration file .parallelcluster/pcluster-config-aws will be written.
Press CTRL-C to interrupt the procedure.


Allowed values for AWS Region ID:
1. cn-north-1
2. cn-northwest-1
AWS Region ID [cn-northwest-1]: 1
Allowed values for EC2 Key Pair Name:
1. xin-key-bj
EC2 Key Pair Name [xin-key-bj]:
Allowed values for Scheduler:
1. sge
2. torque
3. slurm
4. awsbatch
Scheduler [slurm]:
Allowed values for Operating System:
1. alinux
2. alinux2
3. centos7
4. centos8
5. ubuntu1604
6. ubuntu1804
Operating System [alinux2]:
Minimum cluster size (instances) [0]:
Maximum cluster size (instances) [10]:
Master instance type [t2.micro]:
Compute instance type [t2.micro]:
Automate VPC creation? (y/n) [n]: n
Allowed values for VPC ID:
  # id                    name                                number_of_subnets
--- --------------------- ----------------------------------- -------------------
  1 vpc-022aa918fe6dbe46f ParalleCluster-cloud                0
  2 vpc-6e....a                                               2

VPC ID [vpc-022aa918fe6dbe46f]: 1
There are no qualified subnets. Starting automatic creation of subnets...
Allowed values for Network Configuration:
1. Master in a public subnet and compute fleet in a private subnet
2. Master and compute fleet in the same public subnet
Network Configuration [Master in a public subnet and compute fleet in a private subnet]:
Creating CloudFormation stack...
Do not leave the terminal until the process has finished
Stack Name: parallelclusternetworking-pubpriv-20201213061955
Status: parallelclusternetworking-pubpriv-20201213061955 - CREATE_COMPLETE
The stack has been created
Configuration file written to .parallelcluster/pcluster-config-aws
You can edit your configuration file or simply run 'pcluster create -c .parallelcluster/pcluster-config-aws cluster-name' to create your cluster
  • Create the cluster

(.pcluster) a483e778a9b5:~ xinxx$ pcluster create -c .parallelcluster/pcluster-config-aws aws
Beginning cluster creation for cluster: aws
Creating stack named: parallelcluster-aws
Status: parallelcluster-aws - CREATE_COMPLETE
ClusterUser: ec2-user
MasterPrivateIP: 10.100.0.216
  • Check the cluster

[ec2-user@ip-10-100-0-216 ~]$ lsid
Slurm 20.02.4, Feb 1 2020
Copyright SchedMD LLC, 2010-2017.

My cluster name is parallelcluster
My master name is ip-10-100-0-216
[ec2-user@ip-10-100-0-216 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute*  up    infinite     10 idle~ compute-dy-t2micro-[1-10]

Log in to the head node and set ClusterName=aws

  • On the head node, edit /opt/slurm/etc/slurm.conf

[root@ip-10-100-0-216 etc]# cat slurm.conf
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
# CLUSTER SETTINGS
ClusterName=aws
...
  • Delete /var/spool/slurm.state/* and restart the cluster

[root@ip-10-100-0-216 etc]# systemctl stop slurmctld
[root@ip-10-100-0-216 etc]# rm -rf /var/spool/slurm.state/*
[root@ip-10-100-0-216 etc]# ls /var/spool/slurm.state/
[root@ip-10-100-0-216 etc]# systemctl start slurmctld
[root@ip-10-100-0-216 etc]# lsid
Slurm 20.02.4, Feb 1 2020
Copyright SchedMD LLC, 2010-2017.

My cluster name is aws
My master name is ip-10-100-0-216

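5. Configure VPC peering: enable DNS resolution for both VPCs, update the route tables and security groups, and establish connectivity between the two sides

Create the VPC peering connection

Enable DNS resolution on both sides

Update the route tables on both sides so that hosts can be reached by their DNS hostnames

Update the inbound rules of the security groups of the on-premises cluster (on-perm) and the cloud cluster (AWS) so that they allow access from each other

  • On-premises (on-perm) head node

  • Cloud (AWS) cluster head node

  • Update the VPC route tables and confirm that the two sides can communicate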
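VPC peering cannot be established between VPCs whose CIDR blocks overlap, which is why the cloud VPC in step 4 was created with 10.100.0.0/16. The sketch below checks two IPv4 CIDR blocks for overlap before peering is attempted. The function names are made up for illustration, and the 10.0.0.0/16 block used for the on-premises side in the usage comment is an assumption (the source only shows the head node address 10.0.3.94, which would fall inside it).

```bash
#!/bin/sh
# Convert a dotted-quad IPv4 address to an integer.
# The body runs in a subshell so the IFS change stays local.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Succeed (exit 0) if two CIDR blocks overlap, fail (exit 1) otherwise.
#   $1, $2 = blocks such as 10.0.0.0/16
cidr_overlap() (
    ip1=${1%/*}; len1=${1#*/}
    ip2=${2%/*}; len2=${2#*/}
    # Two blocks overlap exactly when they agree on the shorter prefix.
    if [ "$len1" -lt "$len2" ]; then min=$len1; else min=$len2; fi
    if [ "$min" -eq 0 ]; then
        mask=0
    else
        mask=$(( (4294967295 << (32 - min)) & 4294967295 ))
    fi
    n1=$(( $(ip_to_int "$ip1") & mask ))
    n2=$(( $(ip_to_int "$ip2") & mask ))
    [ "$n1" -eq "$n2" ]
)

# Usage (assumed on-premises block vs. the cloud block from step 4):
#   cidr_overlap 10.0.0.0/16 10.100.0.0/16 && echo "overlap: peering will fail"
```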

6. Configure accounting on the cloud cluster (AWS)

Edit /opt/slurm/etc/slurm.conf and add the following

#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
#
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=ip-10-0-3-94
AccountingStoragePort=6819
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=

Restart the cluster

[root@ip-10-100-0-216 etc]# systemctl restart slurmctld

7. Log in to the head node of the on-premises cluster (on-perm) and register the clusters

Register the clusters. If sacctmgr reports that the cloud cluster (AWS) is already registered, ignore the message

[root@ip-10-0-3-94 log]# sacctmgr --immediate add cluster Name=on-perm
 Adding Cluster(s)
  Name = on-perm
[root@ip-10-0-3-94 log]# sacctmgr --immediate add cluster Name=aws

Check the multi-cluster status

[root@ip-10-0-3-94 etc]# sacctmgr show cluster format=cluster,controlhost,controlport
   Cluster     ControlHost  ControlPort
---------- --------------- ------------
       aws    10.100.0.216         6820
   on-perm       10.0.3.94         6820
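A cluster only shows a ControlHost once its slurmctld has connected to slurmdbd, so an empty ControlHost column usually points at a networking or accounting configuration problem. The following sketch automates the check on pipe-separated output; clusters_registered is a hypothetical helper, and the -n (no header) and -P (parsable output) options in the usage comment are used on the assumption that they behave as described in the sacctmgr man page.

```bash
#!/bin/sh
# Hypothetical check: read "cluster|controlhost|controlport" lines on stdin
# and succeed only if every cluster named as an argument has a ControlHost.
# The body runs in a subshell, so "exit" only ends the function.
clusters_registered() (
    listing=$(cat)
    for name in "$@"; do
        host=$(printf '%s\n' "$listing" |
            awk -F'|' -v c="$name" '$1 == c { print $2 }')
        if [ -z "$host" ]; then
            echo "cluster $name has no ControlHost yet" >&2
            exit 1
        fi
    done
)

# Usage on the on-premises head node (not executed here):
#   sacctmgr -n -P show cluster format=cluster,controlhost,controlport \
#       | clusters_registered on-perm aws
```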

Test multi-cluster job submission as the regular user ec2-user

  • Create a test script and make it executable

[ec2-user@ip-10-0-3-94 ~]$ vi host_batch
[ec2-user@ip-10-0-3-94 ~]$ chmod a+x host_batch
[ec2-user@ip-10-0-3-94 ~]$ cat host_batch
#!/bin/bash
#
#SBATCH --job-name=hostname_sleep_sample
#SBATCH --output=out_%j.txt
#
#SBATCH --nodes=1

srun hostname
sleep 60

Submit jobs to a specific cluster

  • Submit to the on-premises cluster (on-perm)

[ec2-user@ip-10-0-3-94 ~]$ sbatch host_batch
Submitted batch job 2
[ec2-user@ip-10-0-3-94 ~]$ squeue
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
    2   compute hostname ec2-user  R  0:05     1 compute-st-t2micro-1
[ec2-user@ip-10-0-3-94 ~]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2            hostname_+    compute                     1    RUNNING      0:0
2.batch           batch                                1    RUNNING      0:0
2.0            hostname                                1  COMPLETED      0:0
  • Submit to the cloud cluster (AWS) by naming the cluster with -M. Note that squeue and sacct also need the -M option to show how the job is running on the cloud cluster (AWS)

[ec2-user@ip-10-0-3-94 ~]$ sbatch -M aws host_batch
Submitted batch job 2 on cluster aws
[ec2-user@ip-10-0-3-94 ~]$ squeue
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
[ec2-user@ip-10-0-3-94 ~]$ squeue -M aws
CLUSTER: aws
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
    2   compute hostname ec2-user CF  0:10     1 compute-dy-t2micro-1
[ec2-user@ip-10-0-3-94 ~]$ sacct -M aws
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2            hostname_+    compute                     1    RUNNING      0:0

8. Create the federation

As root, run the following commands

[root@ip-10-0-3-94 ~]# sacctmgr add federation cloudburst clusters=on-perm,aws
 Adding Federation(s)
  cloudburst
 Settings
  Cluster = on-perm
  Cluster = aws
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
[root@ip-10-0-3-94 ~]# sacctmgr show federation
Federation    Cluster ID             Features     FedState
---------- ---------- -- -------------------- ------------
cloudburst        aws  2                            ACTIVE
cloudburst    on-perm  1                            ACTIVE

9. Submit test jobs and check how they run in the federation

Submit a large number of jobs

[ec2-user@ip-10-0-3-94 ~]$ sbatch host_batch
Submitted batch job 67108870

...
Submitted batch job 67108871
Submitted batch job 67108872
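The job IDs jump to values such as 67108870 once the federation is active because a federated job ID encodes the origin cluster. According to the Slurm federation documentation, bits 0-25 hold the local job ID and bits 26-31 hold the federation ID of the cluster the job was submitted to. A small sketch for decoding an ID follows; fed_decode is a hypothetical helper name.

```bash
#!/bin/sh
# Split a federated job ID into "<origin cluster ID> <local job ID>".
# Bits 26-31 carry the origin cluster's federation ID,
# bits 0-25 carry the job ID local to that cluster.
fed_decode() {
    echo "$(( $1 >> 26 )) $(( $1 & ((1 << 26) - 1) ))"
}

fed_decode 67108870   # prints "1 6"
```

Origin ID 1 corresponds to the on-perm cluster in the sacctmgr show federation listing. The origin stays 1 even for jobs that end up running on the aws cluster, since every job here was submitted on the on-premises head node.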

Check the job status; some jobs have already started running on the on-premises cluster

[ec2-user@ip-10-0-3-94 ~]$ squeue --federation
   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
67108872   compute hostname ec2-user CF  0:31     1 compute-dy-t2micro-1
67108873   compute hostname ec2-user CF  0:23     1 compute-dy-t2micro-3
67108874   compute hostname ec2-user CF  0:23     1 compute-dy-t2micro-4
67108875   compute hostname ec2-user CF  0:23     1 compute-dy-t2micro-5
67108876   compute hostname ec2-user CF  0:17     1 compute-dy-t2micro-6
67108877   compute hostname ec2-user CF  0:17     1 compute-dy-t2micro-7
67108878   compute hostname ec2-user CF  0:16     1 compute-dy-t2micro-8
67108869   compute hostname ec2-user  R  0:05     1 compute-dy-t2micro-2
67108870   compute hostname ec2-user  R  0:32     1 compute-st-t2micro-1
67108871   compute hostname ec2-user  R  0:32     1 compute-st-t2micro-2

Check the cluster status

[ec2-user@ip-10-0-3-94 ~]$ sinfo --federation
PARTITION CLUSTER AVAIL TIMELIMIT NODES STATE  NODELIST
compute*  aws     up    infinite      7 alloc# compute-dy-t2micro-[1,3-8]
compute*  aws     up    infinite      2 idle~  compute-dy-t2micro-[9-10]
compute*  on-perm up    infinite      2 alloc  compute-st-t2micro-[1-2]
compute*  aws     up    infinite      1 alloc  compute-dy-t2micro-2

Check the EC2 instances of the cloud cluster: instances have started launching and are joining the cluster as compute nodes

Check the final results. When on-premises resources are insufficient, jobs are allocated compute nodes on the cloud cluster and run there

[ec2-user@ip-10-0-3-94 ~]$ sacct --federation -o JobID,JobName,State,Cluster,NodeList
       JobID    JobName      State    Cluster        NodeList
------------ ---------- ---------- ---------- ---------------
2            hostname_+  COMPLETED    on-perm compute-st-t2m+
2.batch           batch  COMPLETED    on-perm compute-st-t2m+
2.0            hostname  COMPLETED    on-perm compute-st-t2m+
67108867     hostname_+  COMPLETED    on-perm compute-st-t2m+
67108867.ba+      batch  COMPLETED    on-perm compute-st-t2m+
67108867.0     hostname  COMPLETED    on-perm compute-st-t2m+
67108868     hostname_+  COMPLETED    on-perm compute-st-t2m+
67108868.ba+      batch  COMPLETED    on-perm compute-st-t2m+
67108868.0     hostname  COMPLETED    on-perm compute-st-t2m+
67108869     hostname_+  COMPLETED        aws compute-dy-t2m+
67108869.ba+      batch  COMPLETED        aws compute-dy-t2m+
67108869.0     hostname  COMPLETED        aws compute-dy-t2m+
67108870     hostname_+  COMPLETED    on-perm compute-st-t2m+
67108870.ba+      batch  COMPLETED    on-perm compute-st-t2m+
67108870.0     hostname  COMPLETED    on-perm compute-st-t2m+
67108871     hostname_+  COMPLETED    on-perm compute-st-t2m+
67108871.ba+      batch  COMPLETED    on-perm compute-st-t2m+
67108871.0     hostname  COMPLETED    on-perm compute-st-t2m+
67108872     hostname_+  COMPLETED        aws compute-dy-t2m+
67108872.ba+      batch  COMPLETED        aws compute-dy-t2m+
67108872.0     hostname  COMPLETED        aws compute-dy-t2m+
67108873     hostname_+  COMPLETED        aws compute-dy-t2m+
67108873.ba+      batch  COMPLETED        aws compute-dy-t2m+
67108873.0     hostname  COMPLETED        aws compute-dy-t2m+
67108874     hostname_+  COMPLETED        aws compute-dy-t2m+
67108874.ba+      batch  COMPLETED        aws compute-dy-t2m+
67108874.0     hostname  COMPLETED        aws compute-dy-t2m+
67108875     hostname_+  COMPLETED        aws compute-dy-t2m+
67108875.ba+      batch  COMPLETED        aws compute-dy-t2m+
67108875.0     hostname  COMPLETED        aws compute-dy-t2m+
67108876     hostname_+  COMPLETED        aws compute-dy-t2m+
67108876.ba+      batch  COMPLETED        aws compute-dy-t2m+
67108876.0     hostname  COMPLETED        aws compute-dy-t2m+
67108877     hostname_+  COMPLETED        aws compute-dy-t2m+
67108877.ba+      batch  COMPLETED        aws compute-dy-t2m+
67108877.0     hostname  COMPLETED        aws compute-dy-t2m+
67108878     hostname_+  COMPLETED        aws compute-dy-t2m+
67108878.ba+      batch  COMPLETED        aws compute-dy-t2m+
67108878.0     hostname  COMPLETED        aws compute-dy-t2m+
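To see at a glance how much work burst to the cloud, the accounting records can be tallied per cluster. The sketch below counts jobs from pipe-separated "jobid|cluster" lines; jobs_per_cluster is a hypothetical helper, and the sacct options in the usage comment (-n for no header, -P for parsable output, -X to list only job allocations and skip the .batch and .0 steps) are used on the assumption that they behave as described in the sacct man page.

```bash
#!/bin/sh
# Count jobs per cluster from "jobid|cluster" lines read on stdin.
# Prints one "<cluster> <count>" line per cluster, sorted by cluster name.
jobs_per_cluster() {
    awk -F'|' 'NF >= 2 { n[$2]++ } END { for (c in n) print c, n[c] }' | sort
}

# Usage on the on-premises head node (not executed here):
#   sacct --federation -n -P -X -o JobID,Cluster | jobs_per_cluster
```

Counting the job rows in the table above by hand gives 5 jobs on on-perm and 8 on aws.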


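About the authors

李沐

Principal Scientist, AWS

信欣

Senior Solutions Architect, AWS

Currently responsible for consulting on and designing cloud architectures based on AWS. Before joining AWS, worked at IBM, with more than ten years of experience in HPC product development and architecture design.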
