Practical Advice for Building Deep Neural Networks

In our machine learning lab, we’ve accumulated tens of thousands of training hours across numerous high-powered machines. The computers weren’t the only ones to learn a lot in the process, though: we ourselves have made a lot of mistakes and fixed a lot of bugs.

Here we present some practical tips for training deep neural networks based on our experiences (rooted mainly in TensorFlow). Some of the suggestions may seem obvious to you, but they weren’t to one of us at some point. Other suggestions may not apply or might even be bad advice for your particular task: use discretion!

We acknowledge these are all well-known methods. We, too, stand on the shoulders of giants here! Our objective with this article is simply to summarize them at a high level for use in practice.

General Tips

  • Use the ADAM optimizer. It works really well. Prefer it to more traditional optimizers such as vanilla gradient descent. TensorFlow note: if saving and restoring weights, remember to set up the Saver after setting up the AdamOptimizer, because ADAM has state (namely, per-weight learning rates) that needs to be restored as well.
  • ReLU is the best nonlinearity (activation function). Kind of like how Sublime is the best text editor. But really, ReLUs are fast, simple, and, amazingly, they work, without diminishing gradients along the way. While sigmoid is a common textbook activation function, it does not propagate gradients well through DNNs.
  • Do NOT use an activation function at your output layer. This should be obvious, but it is an easy mistake to make if you build each layer with a shared function: be sure to turn off the activation function at the output.
  • DO add a bias in every layer. This is ML 101: a bias essentially translates a plane into a best-fitting position. In y=mx+b, b is the bias, allowing the line to move up or down into the “best fit” position.
  • Use variance-scaled initialization. In TensorFlow, this looks like tf.contrib.layers.variance_scaling_initializer(). In our experience, this generalizes/scales better than regular Gaussian, truncated normal, and Xavier. Roughly speaking, the variance scaling initializer adjusts the variance of the initial random weights based on the number of inputs or outputs at each layer (the default in TensorFlow is the number of inputs), thus helping signals propagate deeper into the network without extra “hacks” like clipping or batch normalization. Xavier is similar, except that the variance is nearly the same in all layers; but networks whose layers vary greatly in shape (common with convolutional networks) may not cope as well with the same variance in each layer.
  • Whiten (normalize) your input data. For training, subtract the mean of the data set, then divide by its standard deviation. The less your weights have to be stretched and pulled in every which direction, the faster and more easily your network will learn. Keeping the input data mean-centered with constant variance will help with this. You’ll have to perform the same normalization to each test input as well, so make sure your training set resembles real data.
  • Scale input data in a way that reasonably preserves its dynamic range. This is related to normalization but should happen before normalizing. For example, data x with an actual real-world range of [0, 140000000] can often be tamed with tanh(x) or tanh(x/C), where C is some constant that stretches the curve to fit more of the input range within the dynamic, sloping part of the tanh function. Especially in cases where your input data may be unbounded on one or both ends, the neural net will learn much better when its inputs fall within a bounded range like (0, 1).
  • Don’t bother decaying the learning rate (usually). Learning rate decay was more common with SGD, but ADAM takes care of this naturally. If you absolutely want to squeeze out every ounce of performance: decay the learning rate for a short time at the end of training; you’ll probably see a sudden, very small drop in error, then it will flatten out again.
  • If your convolution layer has 64 or 128 filters, that’s probably plenty. Especially for a deep network. Like, really, 128 is A LOT. If you already have a high number of filters, adding more probably won’t improve things.
  • Pooling is for transform invariance. Pooling essentially lets the network learn “the general idea” of “that part” of an image. Max pooling, for example, can help a convolutional network become robust against translation, rotation, and scaling of features in the image.
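To make the ADAM tip concrete: the optimizer keeps running moment estimates for every weight, and that state must be saved and restored along with the weights. Here is a simplified scalar sketch of the update rule (for intuition only — not TensorFlow's actual implementation):

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight. m and v are the per-weight
    running moment estimates that constitute the optimizer's state."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (running mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (running mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# m and v must be checkpointed along with w; restoring only w
# effectively restarts the optimizer from scratch.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    w, m, v = adam_step(w, grad=2.0, m=m, v=v, t=t)
```

If only w were restored from a checkpoint while m and v restarted at zero, the first updates after restoring would behave like the very start of training — which is why the Saver must be created after the AdamOptimizer, so its slot variables are included.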
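The gradient-propagation point about sigmoid versus ReLU can be seen with a quick back-of-the-envelope sketch of the two derivatives:

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)          # peaks at 0.25 (at x=0), vanishes for large |x|

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # full-strength gradient whenever the unit is active

# Through 10 stacked sigmoid layers, even the best-case gradient factor
# shrinks by 0.25 per layer; an active ReLU passes it through unchanged.
sigmoid_after_10 = sigmoid_grad(0.0) ** 10   # ~9.5e-7
relu_after_10 = relu_grad(1.0) ** 10         # 1.0
```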
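The idea behind variance scaling can be sketched with the standard deviations each scheme produces (matching TensorFlow's default of scaling by the number of inputs with factor 2.0, versus Xavier's dependence on both fan-in and fan-out):

```python
import math

def variance_scaling_std(fan_in, factor=2.0):
    # variance_scaling_initializer default: variance = factor / fan_in
    return math.sqrt(factor / fan_in)

def xavier_std(fan_in, fan_out):
    # Xavier/Glorot: variance = 2 / (fan_in + fan_out)
    return math.sqrt(2.0 / (fan_in + fan_out))

# A narrow layer gets proportionally larger initial weights, keeping the
# variance of its outputs roughly constant as signals propagate deeper.
narrow = variance_scaling_std(16)    # ~0.354
wide = variance_scaling_std(1024)    # ~0.044
```

The per-layer adaptation is the point: in a network whose layers differ greatly in shape, each layer gets an initialization matched to its own fan-in, rather than one compromise variance for all layers.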
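The whitening step — statistics computed on the training set only, then applied unchanged to test inputs — can be sketched in plain Python on a toy 1-D data set:

```python
def whiten(train, test):
    """Normalize with the TRAINING set's mean and standard deviation,
    then apply that same transform to the test inputs."""
    n = len(train)
    mean = sum(train) / n
    std = (sum((x - mean) ** 2 for x in train) / n) ** 0.5
    normalize = lambda xs: [(x - mean) / std for x in xs]
    return normalize(train), normalize(test)

train_n, test_n = whiten([2.0, 4.0, 6.0], [4.0, 8.0])
# train_n is now mean-centered with unit variance; test_n uses the
# same mean/std, so a test value equal to the training mean maps to 0.
```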
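As an illustration of taming a wide dynamic range with tanh (the readings and the constant C here are made up for the example; in practice C is tuned so typical values land on the sloped part of the curve):

```python
import math

C = 20_000_000.0  # hypothetical scaling constant for data in [0, 140000000]
readings = [0.0, 5_000_000.0, 50_000_000.0, 140_000_000.0]
squashed = [math.tanh(x / C) for x in readings]
# All outputs now lie in [0, 1), yet small values are not crushed to
# zero the way plain division by the maximum would crush them.
```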

Debugging a Neural Network

If your network isn’t learning (meaning: the loss/accuracy is not converging during training, or you’re not getting results you expect), try these tips:

  • Overfit! The first thing to do if your network isn’t learning is to overfit a single training point. Accuracy should be essentially 100% or 99.99%, with an error as close to 0 as possible. If your neural network can’t overfit a single data point, something is seriously wrong with the architecture, but it may be subtle. If you can overfit one data point but training on a larger set still does not converge, try the following suggestions.
  • Lower your learning rate. Your network will learn slower, but it may find its way into a minimum that it couldn’t get into before because its step size was too big. (Intuitively, think of stepping over a ditch on the side of the road, when you actually want to get into the lowest part of the ditch, where your error is the lowest.)
  • Raise your learning rate. This will speed up training, which helps tighten the feedback loop, meaning you’ll have an inkling sooner whether your network is working. While the network should converge sooner, its results probably won’t be great, and the “convergence” might actually jump around a lot. (With ADAM, we found ~0.001 to be pretty good in many experiments.)
  • Decrease (mini-)batch size. Reducing a batch size to 1 can give you more granular feedback related to the weight updates, which you should report with TensorBoard (or some other debugging/visualization tool).
  • Remove batch normalization. Along with decreasing batch size to 1, doing this can expose diminishing or exploding gradients. For weeks we had a network that wasn’t converging, and only when we removed batch normalization did we realize that the outputs were all NaN by the second iteration. Batch norm was putting a band-aid on something that needed a tourniquet. It has its place, but only after you know your network is bug-free.
  • Increase (mini-)batch size. A larger batch size—heck, the whole training set if you could—reduces variance in gradient updates, making each iteration more accurate. In other words, weight updates will be in the right direction. But! There’s an effective upper bound on its usefulness, as well as physical memory limits. Typically, we find this less useful than the previous two suggestions to reduce batch size to 1 and remove batch norm.
  • Check your reshaping. Drastic reshaping (like changing an image’s X,Y dimensions) can destroy spatial locality, making it harder for a network to learn since it must also learn the reshape. (Natural features become fragmented. The fact that natural features appear spatially local is why conv nets are so effective!) Be especially careful if reshaping with multiple images/channels; use numpy.stack() for proper alignment.
  • Scrutinize your loss function. If using a complex function, try simplifying it to something like L1 or L2. We’ve found L1 to be less sensitive to outliers, making less drastic adjustments when hitting a noisy batch or training point.
  • Scrutinize your visualizations, if applicable. Is your viz library (matplotlib, OpenCV, etc.) adjusting the scale of the values, or clipping them? Consider using a perceptually-uniform color scheme as well.
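The "overfit one point first" check can be demonstrated with the smallest possible model: a line fit to a single point by plain gradient descent. If even a loop like this can't drive the loss to ~0, the training code itself is suspect:

```python
# Fit y = w*x + b to ONE training point with plain gradient descent on
# squared error. Overfitting a single point should be trivially easy.
x, y = 3.0, 10.0
w, b, lr = 0.0, 0.0, 0.01

for _ in range(2000):
    err = (w * x + b) - y
    w -= lr * 2 * err * x   # d(err^2)/dw
    b -= lr * 2 * err       # d(err^2)/db

loss = ((w * x + b) - y) ** 2  # should be essentially 0
```

The same principle scales up: swap in your real network and a single real input, and demand near-zero loss before training on anything bigger.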
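The outlier-sensitivity difference between L1 and L2 comes straight from their gradients, which is easy to check directly:

```python
def l1_grad(err):
    # Gradient of |err|: magnitude-independent, so an outlier triggers
    # the same-size update as a small error.
    return 1.0 if err > 0 else -1.0

def l2_grad(err):
    # Gradient of err^2: grows linearly with the error, so a noisy
    # batch yields a proportionally drastic weight adjustment.
    return 2.0 * err

small_update = l2_grad(0.5)    # 1.0
outlier_update = l2_grad(50.0) # 100.0 — the L1 update would still be 1.0
```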

An Example Case Study

To help make the process described above more relatable, here are a few loss charts (via TensorBoard) for some actual regression experiments with a convolutional neural network that we built.

At first, the network was not learning at all:

We tried clipping the values, to prevent them from going out of bounds:

Huh. Look at how crazy the un-smoothed values are. Learning rate too high? We tried decaying the learning rate and training on just one input:


You can see where the first few changes to the learning rate occurred (at about steps 300 and 3000). Obviously, we decayed too quickly. So, giving it more time between decays, it did better:


You can see we decayed at steps 2000 and 5000. This was better, but still not great, because it didn’t go to 0.

Then we disabled LR decay and tried moving the values into a narrower range instead by putting the inputs through a tanh. While this obviously brought the error values below 1, we still couldn’t overfit the training set:


This is where we discovered, by removing batch normalization, that the network was quickly outputting NaN after one or two iterations. We left batch norm disabled and changed our initialization to variance scaling. These made all the difference! We were able to overfit our test set of just one or two inputs. While the chart on the bottom clips the Y axis, the initial error value was well above 5, showing a reduction in error by almost 4 orders of magnitude:

The top chart is heavily smoothed, but you can see that it overfit the test input extremely quickly, and the loss of the whole training set marched down below 0.01 over time. This was without decaying the learning rate. We then continued training after dropping the learning rate by one order of magnitude, and got even better results:

These results were much better! But what if we decayed the learning rate geometrically rather than splitting training into two parts?

By multiplying the learning rate by 0.9995 at each step, the results were not as good:

… presumably because the decay was too quick. A multiplier of 0.999995 did better, but the results were nearly equivalent to not decaying it at all. We concluded from this particular sequence of experiments that batch normalization was hiding exploding gradients caused by poor initialization, and that decaying the learning rate was not particularly helpful with the ADAM optimizer, except perhaps one deliberate decay at the end. Along with batch norm, clipping the values was just masking the real problem. We also tamed our high-variance input values by putting them through a tanh.
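The arithmetic behind those two multipliers makes the outcome unsurprising. After 10,000 steps (a hypothetical round number), starting from a base rate of 1e-3:

```python
def decayed_lr(base, mult, steps):
    # Effective learning rate after geometric decay: base * mult^steps
    return base * mult ** steps

fast = decayed_lr(1e-3, 0.9995, 10_000)    # ~6.7e-6: almost no learning rate left
slow = decayed_lr(1e-3, 0.999995, 10_000)  # ~9.5e-4: ~95% of the base rate remains
```

So 0.9995 effectively kills the learning rate long before training finishes, while 0.999995 is close to not decaying at all — consistent with what we observed.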

We hope you will find these basic tips useful as you become more familiar with building deep neural networks. Often, it’s just simple things that can make all the difference.
