Overview

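The tf.distribute.Strategy API provides an abstraction for distributing your training across multiple processing units. The goal is to let users enable distributed training with their existing models and training code, making only minimal changes.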

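This tutorial uses tf.distribute.MirroredStrategy, which does in-graph replication with synchronous training on many GPUs on one machine. Essentially, it copies all of the model's variables to each processor. Then, it uses all-reduce to combine the gradients from all processors and applies the combined value to all copies of the model.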
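The replicate, all-reduce, apply cycle described above can be illustrated without TensorFlow. The following is only a toy simulation in plain Python, not how MirroredStrategy is implemented: the one-variable model, the helper names (local_gradient, all_reduce_mean, train_step), and the choice to average the gradients are all assumptions made for illustration.

```python
# Toy simulation of synchronous data-parallel training (no TensorFlow needed).
# Model: y = w * x with mean squared error loss. All names are illustrative.

def local_gradient(w, shard):
    """Gradient of the mean squared error on one replica's shard of the batch."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Combine per-replica values so that every replica receives the same result."""
    combined = sum(values) / len(values)
    return [combined] * len(values)

def train_step(replica_weights, shards, lr):
    # Each replica computes a gradient on its own slice of the global batch.
    grads = [local_gradient(w, s) for w, s in zip(replica_weights, shards)]
    # All-reduce: every replica ends up with the same combined gradient.
    reduced = all_reduce_mean(grads)
    # Every replica applies the identical update, so the copies stay in sync.
    return [w - lr * g for w, g in zip(replica_weights, reduced)]

# Global batch of (x, y) pairs generated from y = 3x, split across 2 replicas.
global_batch = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
shards = [global_batch[:2], global_batch[2:]]
weights = [0.0, 0.0]  # mirrored copies of the single variable w

for _ in range(50):
    weights = train_step(weights, shards, lr=0.05)

print(weights)  # both copies are equal and close to 3.0
```

Because every replica applies the same combined gradient, the mirrored copies never diverge, which is the property that makes the training synchronous.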

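MirroredStrategy is one of several distribution strategies available in TensorFlow core. You can read about more strategies in the distribution strategy guide.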

Keras API

This example uses the tf.keras API to build the model and training loop. For custom training loops, see the tf.distribute.Strategy with training loops tutorial.

Import dependencies

from __future__ import absolute_import, division, print_function, unicode_literals

# Import TensorFlow and TensorFlow Datasets
# (the "!pip" line is a notebook shell command)
try:
  !pip install -q tf-nightly
except Exception:
  pass

import tensorflow_datasets as tfds
import tensorflow as tf
tfds.disable_progress_bar()

import os

print(tf.__version__)
2.1.0-dev20191004

Download the dataset

Download the MNIST dataset and load it from TensorFlow Datasets. This returns a dataset in tf.data format.

Setting with_info to True includes the metadata for the entire dataset, which is being saved here to info. Among other things, this metadata object includes the number of train and test examples.

datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
mnist_train, mnist_test = datasets['train'], datasets['test']
Downloading and preparing dataset mnist (11.06 MiB) to /home/kbuilder/tensorflow_datasets/mnist/1.0.0...
/usr/lib/python3/dist-packages/urllib3/connectionpool.py:860: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
WARNING:tensorflow:From /home/kbuilder/.local/lib/python3.6/site-packages/tensorflow_datasets/core/file_format_adapter.py:209: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and:
`tf.data.TFRecordDataset(path)`
Dataset mnist downloaded and prepared to /home/kbuilder/tensorflow_datasets/mnist/1.0.0. Subsequent calls will reuse this data.

Define distribution strategy

Create a MirroredStrategy object. This will handle distribution, and provides a context manager (tf.distribute.MirroredStrategy.scope) to build your model inside.

strategy = tf.distribute.MirroredStrategy()
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
Number of devices: 1

Set up the input pipeline

When training a model with multiple GPUs, you can use the extra computing power effectively by increasing the batch size. In general, use the largest batch size that fits the GPU memory, and tune the learning rate accordingly.

# You can also do info.splits.total_num_examples to get the total
# number of examples in the dataset.
num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples

BUFFER_SIZE = 10000

BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
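To make the relationship between replica count, global batch size, and learning rate concrete, here is a small sketch in plain Python. The linear learning-rate scaling shown is a common heuristic and an assumption on our part; the tutorial only says to tune the learning rate accordingly, so treat the scaled value as a starting point. The helper names are hypothetical.

```python
def global_batch_size(per_replica_batch_size, num_replicas):
    """Global batch size: each replica processes one per-replica batch per step."""
    return per_replica_batch_size * num_replicas

def scaled_learning_rate(base_lr, num_replicas):
    """Linear scaling heuristic (an assumption, not prescribed by the tutorial)."""
    return base_lr * num_replicas

# Compare a few hypothetical replica counts.
for n in (1, 2, 4, 8):
    print('replicas={} global_batch={} lr={}'.format(
        n, global_batch_size(64, n), scaled_learning_rate(1e-3, n)))
```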

Pixel values, which are 0-255, have to be normalized to the 0-1 range. Define this scale in a function.

def scale(image, label):
  image = tf.cast(image, tf.float32)
  image /= 255
  return image, label

Apply this function to the training and test data, shuffle the training data, and batch it for training. Notice we are also keeping an in-memory cache of the training data to improve performance.

train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)
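Since the global batch size grows with the number of replicas, the number of training steps per epoch shrinks. The sketch below works this out in plain Python, assuming the standard MNIST split sizes (60,000 training and 10,000 test examples) and the default behaviour of batch(), which keeps the final partial batch. The steps_per_epoch helper is hypothetical.

```python
import math

def steps_per_epoch(num_examples, batch_size, drop_remainder=False):
    """Number of batches that one pass over the dataset yields."""
    if drop_remainder:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)

print(steps_per_epoch(60000, 64))    # 1 replica, training set  → 938
print(steps_per_epoch(10000, 64))    # 1 replica, test set      → 157
print(steps_per_epoch(60000, 256))   # 4 replicas, training set → 235
```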

Create the model

Create and compile the Keras model in the context of strategy.scope.

with strategy.scope():
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(10, activation='softmax')
  ])

  model.compile(loss='sparse_categorical_crossentropy',
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

Define the callbacks

The callbacks used here are:

  • TensorBoard: This callback writes a log for TensorBoard which allows you to visualize the graphs.

  • Model Checkpoint: This callback saves the model after every epoch.

  • Learning Rate Scheduler: Using this callback, you can schedule the learning rate to change after every epoch/batch.

For illustrative purposes, add a print callback to display the learning rate in the notebook.

# Define the checkpoint directory to store the checkpoints
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Function for decaying the learning rate.
# You can define any decay function you need.
def decay(epoch):
  if epoch < 3:
    return 1e-3
  elif epoch >= 3 and epoch < 7:
    return 1e-4
  else:
    return 1e-5

# Callback for printing the LR at the end of each epoch.
class PrintLR(tf.keras.callbacks.Callback):
  def on_epoch_end(self, epoch, logs=None):
    print('\nLearning rate for epoch {} is {}'.format(epoch + 1,
                                                      model.optimizer.lr.numpy()))

callbacks = [
    tf.keras.callbacks.TensorBoard(log_dir='./logs'),
    tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,
                                       save_weights_only=True),
    tf.keras.callbacks.LearningRateScheduler(decay),
    PrintLR()
]
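To see what schedule the decay function produces, it can be evaluated on its own. LearningRateScheduler passes a zero-based epoch index to the function; the run length of 12 epochs below is only an assumption chosen to show all three stages.

```python
def decay(epoch):
    """Step schedule from the tutorial: 1e-3, then 1e-4, then 1e-5."""
    if epoch < 3:
        return 1e-3
    elif epoch >= 3 and epoch < 7:
        return 1e-4
    else:
        return 1e-5

# Evaluate the schedule for a hypothetical 12-epoch run (epochs are 0-based).
schedule = [decay(epoch) for epoch in range(12)]
for epoch, lr in enumerate(schedule):
    print('epoch {:2d}: learning rate {}'.format(epoch + 1, lr))
```

The first three epochs train at 1e-3, the next four at 1e-4, and all later epochs at 1e-5.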

Train and evaluate

Now, train the model in the usual way, calling fit on the model and passing in the dataset created at the beginning of the tutorial. This step is the same whether you are distributing the training or not.
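As a rough end-to-end illustration of this step, the sketch below builds the same kind of model inside strategy.scope and then calls fit and evaluate. To stay self-contained it trains on random synthetic data instead of MNIST and omits the callbacks; the data, the two-epoch run, and the variable names are assumptions, so the reported accuracy is meaningless.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for MNIST so the sketch runs without a download
# (assumption: random pixels and labels, so accuracy will be near chance).
rng = np.random.default_rng(0)
images = rng.random((256, 28, 28, 1), dtype=np.float32)
labels = rng.integers(0, 10, size=256)

strategy = tf.distribute.MirroredStrategy()
batch_size = 64 * strategy.num_replicas_in_sync

train_dataset = (tf.data.Dataset.from_tensor_slices((images, labels))
                 .shuffle(256)
                 .batch(batch_size))
eval_dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(batch_size)

# Build and compile the model inside the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=['accuracy'])

# fit and evaluate are called the same way as for a non-distributed model.
history = model.fit(train_dataset, epochs=2, verbose=0)
eval_loss, eval_acc = model.evaluate(eval_dataset, verbose=0)
print('Eval loss: {}, Eval accuracy: {}'.format(eval_loss, eval_acc))
```

In the tutorial's setting, you would instead pass the MNIST train_dataset built earlier and add callbacks=callbacks to the fit call so that checkpoints, TensorBoard logs, and the learning rate schedule take effect.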
