Validating distributed multi-node autonomous driving AI training with NVIDIA DGX systems on OpenShift


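Developing deep neural networks (DNNs) for autonomous vehicles is a demanding workload. This post validates that multi-node, multi-GPU distributed training on DGX systems runs in a DXC Robotic Drive environment.

A Robotic Drive platform is also used to drive the deep learning (DL) workloads. OpenShift 3.11 is currently deployed in many large, GPU-accelerated autonomous driving (AD) development and test environments. The approach shown here applies equally to newer OpenShift versions and can be transferred to other OpenShift-based clusters.

DXC Robotic Drive is a data-driven development platform for autonomous driving. It substantially reduces risk and accelerates the development, testing, and validation of ADAS/AD functions, supporting Level 2+ through Level 5 autonomy. It is the largest known exabyte-scale development solution, and it uses proven on-premises and cloud infrastructure, methods, tools, and accelerators to enable a highly automated AD development process.

The interoperability test environment is the Robotic Drive Innovation Lab, which runs OpenShift 3.11 and 4.3.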

DL workloads at scale

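Data parallelism is the most commonly used design pattern for scaling DL workloads. There are many references and established practices on how to accelerate vision and recurrent neural networks.

The DL model is instantiated multiple times, and the data is fed through these instances in parallel. The instances exchange gradients with each other so that they work together rather than independently.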
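The gradient exchange described above can be illustrated with a toy model. The following is a minimal sketch in plain Python, not code from the validated setup: the 1-D model, the function names, and the sample data are all assumptions chosen for illustration. It shows that averaging the gradients computed on equal-sized shards gives the same result as computing the gradient on the whole batch.

```python
# Toy illustration of data parallelism: each replica computes a gradient on its
# own shard of the batch, and the replicas then average their gradients.

def mse_gradient(w, xs, ys):
    """Gradient of the mean squared error of the 1-D model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_gradient(w, xs, ys, num_replicas):
    """Split the batch into equal shards, one per replica, and average the gradients."""
    if len(xs) % num_replicas != 0:
        raise ValueError("batch size must be divisible by the number of replicas")
    shard = len(xs) // num_replicas
    local_grads = [
        mse_gradient(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
        for r in range(num_replicas)
    ]
    # This averaging step stands in for the gradient exchange between replicas.
    return sum(local_grads) / num_replicas


# Hypothetical sample data, roughly following y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

full = mse_gradient(0.5, xs, ys)
sharded = data_parallel_gradient(0.5, xs, ys, num_replicas=4)
print(abs(full - sharded) < 1e-9)
```

Because every replica ends up with the same averaged gradient, all model instances apply the same update and stay in sync.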

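This is a classic computing pattern from the Message Passing Interface (MPI) framework in the high-performance computing (HPC) domain. Orchestrating these workloads with the help of the well-known MPI is therefore straightforward. MPI also scales easily across multiple nodes.

DL frameworks with multi-GPU support, such as PyTorch and TensorFlow, are a good choice at the start of any project, because they ensure that the workload can run on anything from a single-GPU workstation up to a large GPU cluster.

These frameworks also natively support data-parallel training using MPI, and workloads can be launched with MPI tools such as mpirun or mpiexec. Several implementations of the data-parallel pattern follow this approach, such as Horovod.

Red Hat OpenShift Container Platform (OCP) is a platform as a service built on Kubernetes, with Docker or CRI-O as the container runtime. OpenShift focuses on security and includes fixes for defects, security issues, and performance problems in upstream Kubernetes. Like Kubernetes, OpenShift allows you to deploy and manage clusters at scale, with support from Red Hat.

Kubernetes and OpenShift can readily handle MPI workloads. One integration exists in the form of the Kubeflow MPI Operator, which orchestrates the resources in the background and brings up the workload.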

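Figure 1 shows a DL workload using two DGX-1 systems. In this case, there are 16 individual processes. In Horovod, each process is given a unique ID called a rank to distinguish it: rank 0 to rank 15. All processes work in parallel on different parts of the input data and exchange their gradients to work together.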

Figure 1. DL workload using two DGX-1 systems.
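To make the rank numbering concrete, the following minimal sketch lists how 16 ranks could map onto two systems with eight GPUs each. It assumes that ranks are packed node by node, which is a common MPI placement but not something the figure guarantees; the function name rank_layout is hypothetical. In a real Horovod job, the corresponding values are reported by hvd.rank() and hvd.local_rank().

```python
def rank_layout(num_nodes, gpus_per_node):
    """Return (rank, node_index, local_rank) for every process.

    Assumes ranks fill one node completely before moving on to the next.
    """
    return [
        (rank, rank // gpus_per_node, rank % gpus_per_node)
        for rank in range(num_nodes * gpus_per_node)
    ]


layout = rank_layout(num_nodes=2, gpus_per_node=8)
for rank, node, local_rank in layout:
    print(f"rank {rank:2d} -> node {node}, local GPU {local_rank}")
```

The local rank is what a training script typically uses to pin each process to one GPU on its node.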

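For efficient communication between the individual towers (model replicas), the NVIDIA Collective Communications Library (NCCL) is used. NCCL implements multi-GPU and multi-node collective communication primitives that are performance-optimized for NVIDIA GPUs.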
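NCCL selects its own algorithms and transports internally, so the following is not NCCL's implementation. It is a minimal, pure-Python simulation of the classic ring all-reduce pattern, included only to illustrate what a collective primitive such as all-reduce does when ranks exchange gradients. The function name ring_allreduce and the list-based buffers are assumptions made for this sketch.

```python
def ring_allreduce(buffers):
    """Simulate a ring all-reduce (sum) over equally sized per-rank buffers.

    buffers[r] is the local data of rank r. Each buffer is split into as many
    chunks as there are ranks; every rank only talks to its right neighbor.
    """
    n = len(buffers)
    length = len(buffers[0])
    if length % n != 0:
        raise ValueError("buffer length must be divisible by the number of ranks")
    chunk = length // n
    data = [list(b) for b in buffers]

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after n - 1 steps, rank r holds the fully
    # reduced chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot all outgoing messages first so the step is simultaneous.
        sends = [(r, (r - step) % n, data[r][span((r - step) % n)]) for r in range(n)]
        for r, c, payload in sends:
            dst = (r + 1) % n
            data[dst][span(c)] = [a + b for a, b in zip(data[dst][span(c)], payload)]

    # Phase 2, all-gather: circulate the reduced chunks around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, data[r][span((r + 1 - step) % n)]) for r in range(n)]
        for r, c, payload in sends:
            data[(r + 1) % n][span(c)] = payload

    return data


# Hypothetical gradients held by three ranks.
result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)
```

Every rank ends with the element-wise sum of all buffers; dividing by the number of ranks would give the averaged gradient.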

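Low-latency POSIX storage for the training data is provided through Robotic Drive persistent volumes. Handling storage at scale is critical, but it is not the focus of this post.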

Installation steps

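The following steps describe how to install DGX systems in a DXC Robotic Drive environment running OpenShift v3.11.

Test system overview

OpenShift v3.11 requires at least one temporary bootstrap machine, three master nodes, and at least two compute nodes.

Because DL can be a data-intensive workload, the cluster needs a suitable network solution. The Robotic Drive Innovation Lab provides HPE FlexFabric 5945 32QSFP28 switches, which are used together with the Mellanox adapters of the DGX systems.

All cluster interconnect adapters of the DGX-1 systems are used in Ethernet mode and are bundled together using LACP bonding mode.

The following table summarizes the hardware and software configuration of the cluster.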

Table 1. Overview of HW/SW configuration.

Preparing the DGX-1 systems

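To install RHEL 7.7 on the DGX systems, use the installation instructions provided by NVIDIA. These steps also cover installing the DGX-specific software.

Connecting the DGX systems to the OpenShift cluster

Install the cluster by following the Red Hat documentation for OpenShift 3.11.

After the cluster is up and running, use the official playbooks provided by Red Hat to scale up the cluster and include the two DGX systems. These playbooks add the necessary libraries, configure the nodes, and add them to the cluster.

Figure 2 shows the start screen of the OCP dashboard, which allows you to interact with the OpenShift cluster. These interactions include monitoring resources, creating pods, and retrieving logs.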

Figure 2. Start screen of an OpenShift cluster.

Enabling GPUs in the cluster

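OpenShift and Kubernetes natively support standard resources such as CPU, memory, and available disk space. Other resources are handled through device plugins or operators. This setup uses the NVIDIA GPU device plugin for OpenShift 3.11.

In newer versions of Kubernetes and OpenShift (v1.13+ and v4.1+, respectively), the Operator Framework was introduced. Using this framework, the NVIDIA GPU Operator automates the deployment of components that previously had to be deployed manually. These components include the NVIDIA driver, the Kubernetes device plugin for GPUs, the NVIDIA container runtime, automatic node labeling, and DCGM-based monitoring.

For more information, see the device plugin and the NVIDIA GPU Operator.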

Figure 3. Multi-GPU, single-node workload example.

In the following examples, we show you how to trigger an MPI-based DL workload using the Horovod framework. For this test, we used an MPI-enabled Docker container with the required frameworks, such as the NVIDIA GPU Cloud (NGC) containers. NGC is a GPU-optimized software hub that simplifies AI and HPC workflows.

Start the workload, then wrap the command in an OpenShift YAML for orchestration.

To run a scalable ResNet-50 training with randomly generated data natively, run the following command:

    docker run --gpus all -it horovod/horovod:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6

The previous example makes use of all GPUs available on the system with the --gpus all flag. To allocate only a smaller number of GPUs, change that to --gpus 1. You can also specify individual GPUs by using their device IDs.

To start the workload in a container in interactive mode, a typical command looks like the following code example. In this example, eight GPUs are used:

    horovodrun -np 8 -H localhost:8 python pytorch_synthetic_benchmark.py

A corresponding YAML file is used to start this workload through OpenShift. It can be deployed straight from an OpenShift login node, an OpenShift master node, or the GUI, using either oc or kubectl:

    oc create -f horovod_example_8gpus.yaml

    kubectl create -f horovod_example_8gpus.yaml

This is the content of the YAML file used:

    apiVersion: v1
    kind: Pod
    metadata:
      namespace: managed-machine-learning
    spec:
      serviceAccount: tensorflow-sa
      restartPolicy: OnFailure
      containers:
      - name: horovod-test
        image: horovod/horovod:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6
        command: ["horovodrun", "-np", "8", "python", "pytorch_synthetic_benchmark.py"]
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "compute,utility"
        - name: NVIDIA_REQUIRE_CUDA
          value: "cuda>=9.0"
        resources:
          limits:
            nvidia.com/gpu: "8"
          requests:
            nvidia.com/gpu: "8"
        securityContext:
          privileged: true

To influence the scalability of the workload, set the following parameters accordingly.

The command section specifies the command to be run, which is the same as used in the Docker example. The number of processes spawned can be controlled via the -np 8 flag.

    command: ["horovodrun", "-np", "8", "python", "pytorch_synthetic_benchmark.py"]

Allocate the proper number of GPUs in the resources section. The number should correspond to the number of processes set in the horovodrun command.

    resources:
      limits:
        nvidia.com/gpu: "8"
      requests:
        nvidia.com/gpu: "8"
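Because the process count and the GPU allocation are set in two different places, they can drift apart when the file is edited. The following minimal sketch shows one way to check them against each other before submitting the pod. The helper names are hypothetical and not part of Horovod, OpenShift, or Kubernetes; the check works on plain Python values copied from the YAML.

```python
def processes_from_command(command):
    """Return the integer that follows the -np flag in a launcher argument list."""
    return int(command[command.index("-np") + 1])


def check_pod_sizing(command, gpu_limit):
    """Raise ValueError if the -np value and the nvidia.com/gpu limit disagree."""
    num_processes = processes_from_command(command)
    if num_processes != int(gpu_limit):
        raise ValueError(
            f"-np is {num_processes}, but nvidia.com/gpu is {gpu_limit}"
        )
    return num_processes


# Values taken from the pod definition in this post.
command = ["horovodrun", "-np", "8", "python", "pytorch_synthetic_benchmark.py"]
print(check_pod_sizing(command, gpu_limit="8"))
```

A check like this could run in a pre-submit script, so that a mismatch is caught before the pod is created.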

A pod is a group of one or more containers. Figure 4 shows the creation of pods using the web interface of OpenShift. As mentioned earlier, you can also use the CLI, with either oc or kubectl, to create a new pod.

Figure 4. Created pods inside the OpenShift dashboard.

There are several ways to retrieve the output or logs of pods running in an OpenShift cluster. One way is to access the log through the CLI by using either kubectl logs or oc logs:

    kubectl logs -n

    oc logs -n

Figure 5 shows the other possibility, using the OpenShift GUI.

Figure 5. Access to the pod output using the dashboard.

Multi-GPU, multi-node workload

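There are several ways to use the capabilities of multiple systems for the benefit of a single DL training job. The compute resources of multiple systems can be aggregated to accelerate training. This is especially beneficial for workloads such as DNN training in autonomous vehicle development, because the turnaround time of experiments is often a critical factor when such workloads run on large datasets.

Figure 1 shows two DGX systems working together in a single DL job. The DL workload uses 16 GPUs in total, eight on each DGX-1 system. Each worker performs its computation on one partition of the data.

All workers synchronize their gradients with their peers through NVIDIA NVLink, a communication protocol developed by NVIDIA that allows data and control code to be transferred between CPUs and GPUs or between GPUs, and through the network. This synchronization step is shown as dashed lines.

The same code base and container as before are reused. The only difference from the previous orchestration approach is the use of the MPI Operator, which simplifies the distribution and deployment of workloads across multiple DGX systems.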

MPI Operator

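There are several options for running multi-GPU, multi-node workloads on an OpenShift or Kubernetes cluster. One common framework is the MPI Operator, provided by the Kubeflow project. The MPI Operator handles the orchestration of the DL workload, as shown earlier.

Installing the MPI Operator introduces a new job type in the cluster: MPIJob. The job specification in the following code example shows the corresponding YAML file for launching such an MPIJob workload through OpenShift or Kubernetes. It can be deployed directly from an OpenShift login node, a master node, or the GUI.

Unlike the YAML file in the previous section, this example uses the NVIDIA TensorFlow Docker image and runs a synthetic ResNet-50 benchmark on the 16 GPUs of two DGX-1 systems. Using the TensorFlow Docker image provided through NGC lets you benefit from NVIDIA's continuous performance improvements. The MPI Operator takes care of the communication between the three pods that are created: one launcher pod and two worker pods.

As in the YAML definition for the Horovod distributed model, the number of processes passed to mpirun with -np 16 again corresponds to the number of requested GPUs. In this example, there are two worker pods with eight GPUs attached to each. You can easily scale this example by changing the number of worker pods and the number of spawned processes.

The job specification for the 16-GPU training job is as follows: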

    apiVersion: kubeflow.org/v1alpha2
    kind: MPIJob
    metadata:
      name: 16gpu-tensorflow-benchmark-imagenet
    spec:
      slotsPerWorker: 8
      cleanPodPolicy: Running
      mpiReplicaSpecs:
        Launcher:
          replicas: 1
          template:
            spec:
              containers:
              - image: nvcr.io/nvidia/tensorflow:19.10-py3
                name: tensorflow-benchmarks
                env:
                - name: NVIDIA_DRIVER_CAPABILITIES
                  value: compute,utility
                - name: NVIDIA_REQUIRE_CUDA
                  value: cuda>=9.0
                command:
                - mpirun
                - -np
                - "16"
                - -bind-to
                - none
                - -map-by
                - slot
                - -x
                - NCCL_DEBUG=INFO
                - -x
                - LD_LIBRARY_PATH
                - -x
                - PATH
                - -mca
                - pml
                - ob1
                - -mca
                - btl
                - ^openib
                - python
                - nvidia-examples/resnet50v1.5/main.py
                - --mode=training_benchmark
                - --batch_size=128
                - --num_iter=90
                - --iter_unit=epoch
                - --results_dir=/efs
        Worker:
          replicas: 2
          template:
            spec:
              containers:
              - image: nvcr.io/nvidia/tensorflow:19.10-py3
                name: tensorflow-benchmarks
                env:
                - name: NVIDIA_DRIVER_CAPABILITIES
                  value: compute,utility
                - name: NVIDIA_REQUIRE_CUDA
                  value: cuda>=9.0
                resources:
                  limits:
                    nvidia.com/gpu: 8
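When this job is scaled, three numbers have to stay consistent: the worker replicas, slotsPerWorker, and the value passed to -np. The following minimal sketch derives them from a target GPU count. It assumes one process per GPU and identical workers, as in this post; the function name mpijob_sizing and the returned dictionary keys are hypothetical helpers, not part of the MPI Operator.

```python
def mpijob_sizing(total_gpus, gpus_per_worker):
    """Derive worker replicas, slots per worker, and -np for one process per GPU."""
    if total_gpus <= 0 or gpus_per_worker <= 0:
        raise ValueError("GPU counts must be positive")
    if total_gpus % gpus_per_worker != 0:
        raise ValueError("total_gpus must be a multiple of gpus_per_worker")
    return {
        "worker_replicas": total_gpus // gpus_per_worker,
        "slots_per_worker": gpus_per_worker,
        "np": total_gpus,
    }


# The 16-GPU job on two DGX-1 systems with eight GPUs each.
print(mpijob_sizing(total_gpus=16, gpus_per_worker=8))
```

For the job in this post, the sketch yields two worker replicas, eight slots per worker, and 16 processes, which matches the specification.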

On Kubernetes, the YAML file looks like the one used in OpenShift. One important difference is that environment variables in OpenShift v3.11 must be set inside the spec of the pods.

    env:
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: compute,utility
    - name: NVIDIA_REQUIRE_CUDA
      value: cuda>=9.0

Aggregating cutoff resources

In large computing environments, there are often cutoffs, that is, resources that are left unused. The MPI Operator is of great use in such a case, as it can aggregate those cutoffs and avoid waste. This is a property that only the MPI Operator can deliver.

The following code example shows an exemplary two-GPU job that can aggregate resources from multiple systems. The job requests two GPUs for the workload. You can customize this example to fit almost every situation, for example three GPUs on three different nodes or four GPUs on two different nodes.

    apiVersion: kubeflow.org/v1alpha2
    kind: MPIJob
    metadata:
    spec:
      slotsPerWorker: 1
      cleanPodPolicy: Running
      mpiReplicaSpecs:
        Launcher:
          replicas: 1
          template:
            spec:
              containers:
              - image: mpioperator/tensorflow-benchmarks:latest
                env:
                - name: NVIDIA_DRIVER_CAPABILITIES
                  value: compute,utility
                - name: NVIDIA_REQUIRE_CUDA
                  value: cuda>=9.0
                command:
                - mpirun
                - -np
                - "2"
                - -bind-to
                - none
                - -map-by
                - slot
                - -x
                - NCCL_DEBUG=INFO
                - -x
                - LD_LIBRARY_PATH
                - -x
                - PATH
                - -mca
                - pml
                - ob1
                - -mca
                - btl
                - ^openib
                - -mca
                - plm_base_verbose
                - "100"
                - -mca
                - btl_base_verbose
                - "30"
                - python
                - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
                - --model=resnet101
                - --batch_size=64
                - --variable_update=horovod
                resources:
                  limits:
                    nvidia.com/gpu: 1
        Worker:
          replicas: 2
          template:
            spec:
              containers:
              - image: mpioperator/tensorflow-benchmarks:latest
                name: tensorflow-benchmarks
                env:
                - name: NVIDIA_DRIVER_CAPABILITIES
                  value: compute,utility
                - name: NVIDIA_REQUIRE_CUDA
                  value: cuda>=9.0
                resources:
                  limits:
                    nvidia.com/gpu: 1

Running this YAML file results in a log like the following. It can be collected using either the GUI or the CLI.

    oc logs -n

    kubectl logs -n

    tensorflow-benchmarks-gpu-v1a2-worker-1:26:156 [0] NCCL INFO Ring 01 : 1 -> 0 [send] via NET/Socket/0
    tensorflow-benchmarks-gpu-v1a2-worker-0:26:156 [0] NCCL INFO Ring 01 : 0 -> 1 [send] via NET/Socket/0
    tensorflow-benchmarks-gpu-v1a2-worker-0:26:156 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
    tensorflow-benchmarks-gpu-v1a2-worker-1:26:156 [0] NCCL INFO comm 0x7f887032e120 rank 1 nranks 2 cudaDev 0 nvmlDev 7 - Init COMPLETE
    tensorflow-benchmarks-gpu-v1a2-worker-0:26:156 [0] NCCL INFO comm 0x7f003434e940 rank 0 nranks 2 cudaDev 0 nvmlDev 3 - Init COMPLETE
    tensorflow-benchmarks-gpu-v1a2-worker-0:26:156 [0] NCCL INFO Launch mode Parallel
    Done warm up
    Step Img/sec total_loss
    Done warm up
    Step Img/sec total_loss
    1 images/sec: 147.6 +/- 0.0 (jitter = 0.0) 8.308
    1 images/sec: 147.8 +/- 0.0 (jitter = 0.0) 8.378
    10 images/sec: 159.1 +/- 2.3 (jitter = 4.7) 8.526
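If you want to track throughput across runs, the per-step lines of such a log can be extracted with a small script. The following is a minimal sketch that assumes the line layout shown in the log above; the regular expression and the function name parse_throughput are written for this post and are not part of the benchmark scripts.

```python
import re

# Matches lines such as: "10 images/sec: 159.1 +/- 2.3 (jitter = 4.7) 8.526"
STEP_LINE = re.compile(
    r"^\s*(\d+)\s+images/sec:\s+([\d.]+)\s+\+/-\s+([\d.]+)"
    r"\s+\(jitter = ([\d.]+)\)\s+([\d.]+)",
    re.MULTILINE,
)


def parse_throughput(log_text):
    """Return one record per benchmark step line found in the log text."""
    return [
        {
            "step": int(m.group(1)),
            "images_per_sec": float(m.group(2)),
            "total_loss": float(m.group(5)),
        }
        for m in STEP_LINE.finditer(log_text)
    ]


# Lines copied from the log output shown in this post.
log_text = """\
Done warm up
Step Img/sec total_loss
1 images/sec: 147.6 +/- 0.0 (jitter = 0.0) 8.308
1 images/sec: 147.8 +/- 0.0 (jitter = 0.0) 8.378
10 images/sec: 159.1 +/- 2.3 (jitter = 4.7) 8.526
"""

records = parse_throughput(log_text)
for record in records:
    print(record)
```

NCCL lines and warm-up messages are ignored because they do not match the pattern.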

Querying for the running pods inside the cluster shows that cutoff resources from two nodes are being used.

    $ oc get pods -o wide
    NAME                                            READY   STATUS    RESTARTS   AGE     IP               NODE
    tensorflow-benchmarks-gpu-v1a2-launcher-dqgsk   1/1     Running   0          3m45s   10.233.XXX.XXX   dgx01.dev.XXX
    tensorflow-benchmarks-gpu-v1a2-worker-0         1/1     Running   0          3m45s   10.233.XXX.XXX   dgx02.dev.XXX
    tensorflow-benchmarks-gpu-v1a2-worker-1         1/1     Running   0          3m45s   10.233.XXX.XXX   dgx01.dev.XXX
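The idea of combining leftover GPUs from several nodes into one job can be illustrated with a toy placement model. The following minimal sketch is not how the Kubernetes scheduler or the MPI Operator decide placement; it is a greedy illustration under the assumption that each worker pod needs a fixed number of GPUs on a single node. The function name place_workers and the node names are hypothetical.

```python
def place_workers(free_gpus, num_workers, slots_per_worker=1):
    """Greedy toy placement of worker pods onto nodes with leftover GPUs.

    free_gpus maps a node name to its number of unused GPUs.
    Returns a list with the chosen node name for each worker pod.
    """
    remaining = dict(free_gpus)
    placement = []
    for _ in range(num_workers):
        candidates = [n for n, free in remaining.items() if free >= slots_per_worker]
        if not candidates:
            raise RuntimeError("not enough free GPUs for the requested workers")
        # Use up the smallest leftovers first; break ties by node name.
        node = min(candidates, key=lambda n: (remaining[n], n))
        remaining[node] -= slots_per_worker
        placement.append(node)
    return placement


# Two nodes with one unused GPU each, and a two-GPU job with one slot per worker.
leftovers = {"dgx01": 1, "dgx02": 1}
print(place_workers(leftovers, num_workers=2))
```

Neither node could host a two-GPU pod on its own, but with one slot per worker the job fits across both.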

Docker network configuration

NCCL discovers the topology together with its peers. The following output was taken from one of the running pods. It shows that, besides the standard local adapter, there is only one additional connection configured.

    eth0      Link encap:Ethernet  HWaddr 0a:58:0a:81:07:b2
              inet addr:10.129.7.178  Bcast:10.129.7.255  Mask:255.255.254.0
              inet6 addr: fe80::419:d8ff:fe19:1bbd/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
              RX packets:21275 errors:0 dropped:0 overruns:0 frame:0
              TX packets:33993 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0
              RX bytes:11305584 (11.3 MB)  TX bytes:4496191 (4.4 MB)

    lo        Link encap:Local Loopback
              inet addr:127.0.0.1  Mask:255.0.0.0
              inet6 addr: ::1/128 Scope:Host
              UP LOOPBACK RUNNING  MTU:65536  Metric:1
              RX packets:28205726 errors:0 dropped:0 overruns:0 frame:0
              TX packets:28205726 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:12809324160 (12.8 GB)  TX bytes:12809324160 (12.8 GB)

Having a single NIC in the Docker container (here, eth0) is a starting point. A more sophisticated setup may use multiple Docker network adapters.

Get started with multi-GPU, multi-node training running on OpenShift

Adopt state-of-the-art DL workloads and deploy them at scale today. In this post, we showed you that data-parallel training making use of the MPI paradigm is highly flexible across different environments.

With this style, DL engineers working on a scalable DGX system cluster get the following benefits, regardless of their orchestration software:

· Scale beyond the limitations of a single node and enable DL in a larger cluster.

· Prevent resource cutoffs from turning into waste by aggregating leftover resources effectively.

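The Robotic Drive containerized compute platform on OpenShift orchestrates DL workloads at scale, including multi-GPU, multi-node jobs using NVIDIA DGX systems. Vision-based ML models in particular provide a solid technical foundation for state-of-the-art, complex driving behavior tasks such as motion planning. These tasks require a great deal of exploration and experimentation, which the Robotic Drive Innovation Lab is set up to address.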
