Reposted from: Deep Learning for Computer Vision with Caffe and cuDNN | Parallel Forall

http://devblogs.nvidia.com/parallelforall/deep-learning-computer-vision-caffe-cudnn/

Tagged: Computer Vision, cuDNN, Deep Learning, Deep Neural Networks, Machine Learning

Deep learning models are making great strides in research papers and industrial deployments alike, but it’s helpful to have a guide and toolkit to join this frontier. This post serves to orient researchers, engineers, and machine learning practitioners on how to incorporate deep learning into their own work. This orientation pairs an introduction to model structure and learned features for general understanding with an overview of the Caffe deep learning framework for practical know-how. References highlight recent and historical research for perspective on current progress. The framework survey points out key elements of the Caffe architecture, reference models, and worked examples. Through collaboration with NVIDIA, drop-in integration of the cuDNN library accelerates Caffe models. Follow this post to join the active deep learning community around Caffe.

Automating Perception by Deep Learning

Deep learning is a branch of machine learning that is advancing the state of the art for perceptual problems like vision and speech recognition. We can pose these tasks as mapping concrete inputs such as image pixels or audio waveforms to abstract outputs like the identity of a face or a spoken word. The “depth” of deep learning models comes from composing functions into a series of transformations from input, through intermediate representations, and on to output. The overall composition gives a deep, layered model, in which each layer encodes progress from low-level details to high-level concepts. This yields a rich, hierarchical representation of the perceptual problem. Figure 1 shows the kinds of visual features captured in the intermediate layers of the model between the pixels and the output. A simple classifier can recognize a category from these learned features while a classifier on the raw pixels has a more complex decision to make.

Figure 1: Visualization of deep features by example. Each 3 x 3 array shows the nine image patches from a standard data set that maximize the response of a given feature from a low-level (left) and high-level (right) layer of the popular Zeiler-Fergus network [8]. Similarly rich features are found in concurrent work by Girshick et al. [4]. The low-level features capture color, simple shapes, and similar textures. The high-level features respond to parts like eyes and wheels, flowers in different colors, and text in various styles.

Why does depth make a difference? The keys are abstraction and invariance: constancy across change. What makes a million points of light a cat is not at all obvious when inspecting the light itself. The perception of the cat only becomes clear through parts like ears and a tail and textures like fur. Recognizing your face in the mirror every day is no simpler: there are infinite variations in expression, pose, and lighting, but your ability to recognize yourself is invariant across these fluctuations. At the level of pixels, the world is a “blooming, buzzing confusion,” as psychologist William James called the torrent of information assailing the senses. Depth is a scaffold for mapping the chaos of the input into an orderly representation in which the cat is still a cat whether white or black. Figure 2 compares the representations of a “shallow” and a “deep” layer from the hierarchy, projected into 2D for visualization.

Figure 2: Projection of low-level “shallow” features (left) and high-level “deep” features (right) from a vision model, where related objects group together in the deep representation. Points that are close in this visualization are close in the model representation. Each point represents the feature extracted from an image, and the color marks the general category of its contents. The model was trained on precise object classes like “espresso” and “chickadee” but learned features that group dogs, birds, and even animals as a whole despite their visual contrasts [2].

The strength of deep models is that they are not only powerful but learnable. The capacity to represent a function is not enough if all of its details cannot be described and engineered. The visual world is too vast and varied to fully describe by hand, so it has to be learned from data. We train a deep net by feeding it input and letting it compute layer by layer to generate output for comparison with the correct answer. After computing the error at the output, this error flows backward through the net by back-propagation. At each step backward, the model parameters are tuned in a direction that reduces the error. This process sweeps over the data, improving the model as it goes.

Convolutional Neural Networks (CNNs) are a particular type of deep model responsible for many exciting recent results in computer vision. Originally proposed in the 1980s by Kunihiko Fukushima as the Neocognitron [3] and then refined by Yann LeCun and collaborators as LeNet [7], CNNs gained fame through the success of LeNet on the challenging task of handwritten digit recognition in 1989 and the comprehensive 1998 journal paper that followed. It took a couple of decades for CNNs to produce another breakthrough in computer vision, beginning with AlexNet [6] in 2012, which won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

In a CNN, the key computation is the convolution of a feature detector with an input signal. Convolution with a collection of filters, like the learned filters in Figure 3, enriches the representation: at the first layer of a CNN the features go from individual pixels to simple primitives like horizontal and vertical lines, circles, and patches of color. In contrast to conventional single-channel image processing filters, these CNN filters are computed across all of the input channels. Convolutional filters are translation-invariant so they yield a high response wherever a feature is detected.

Figure 3: The first layer of learned convolutional filters in CaffeNet, the Caffe reference ImageNet model based on AlexNet by Krizhevsky et al. These filters are tuned to edges of different orientations, frequencies, phases, and colors. The filter outputs expand the dimensionality of the visual representation from the three color channels of the image to these 96 primitives. Deeper layers further enrich the representation.

Caffe: a Fast Open-Source Framework for Deep Learning

The Caffe framework from UC Berkeley is designed to let researchers create and explore CNNs and other Deep Neural Networks (DNNs) easily, while delivering the high speed needed for both experiments and industrial deployment [5]. Caffe provides state-of-the-art modeling for advancing and deploying deep learning in research and industry, with support for a wide variety of architectures and efficient implementations of prediction and learning.

Caffe models and optimization are defined by plain text schema for ease of experimentation. For instance, a convolutional layer for 20 filters of size 5 x 5 is defined using the following text:

layers {
  name: "conv1"
  type: CONVOLUTION
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 20
    kernel_size: 5
    stride: 1
  }
}

Every model layer is defined in this way. The LeNet tutorial included in the Caffe examples walks through defining and training Yann LeCun’s famous model for handwritten digit recognition [7]. It can reach 99% accuracy in less than a minute with GPU training.
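Training is usually launched from the command line, but it can also be driven from Python. The following is a rough sketch only, assuming the newer pycaffe interface in which solvers are exposed as caffe.SGDSolver and the compute mode is set at module level, and assuming the solver definition bundled with the MNIST example in the Caffe repository; adjust the path to your checkout.

import caffe

# Sketch: train the bundled LeNet example from Python (paths follow the Caffe repo layout).
caffe.set_mode_gpu()
solver = caffe.SGDSolver('examples/mnist/lenet_solver.prototxt')
solver.solve()  # runs the full training and testing schedule defined in the solver prototxt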

Here’s a first sip of Caffe coding that loads a model and classifies an image in Python.

import caffe

net = caffe.Classifier(model_definition, model_parameters)
net.set_phase_test()  # test = inference, train = learning
net.set_mode_gpu()    # gpu or cpu with the same model
scores = net.predict([image])

Caffe includes a general `caffe.Net` interface for working with any Caffe model. As a next step, check out the worked example of feature extraction and visualization.
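As a taste of what that example covers, the sketch below reads out intermediate activations after classification. It is a minimal sketch under some assumptions: the model and image paths are hypothetical placeholders following the Caffe repository layout, and 'fc7' is the name of CaffeNet's last hidden layer; net.blobs maps blob names to their activations.

import caffe

# Hypothetical paths to the bundled CaffeNet reference model and a sample image.
model_definition = 'models/bvlc_reference_caffenet/deploy.prototxt'
model_parameters = 'models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel'

net = caffe.Classifier(model_definition, model_parameters)
net.set_phase_test()
net.set_mode_gpu()

image = caffe.io.load_image('examples/images/cat.jpg')
scores = net.predict([image])  # forward pass, with preprocessing handled by predict()

# net.blobs maps blob names to activations, so features from an
# intermediate layer such as CaffeNet's fc7 can be read out directly.
fc7_features = net.blobs['fc7'].data.copy()
print(fc7_features.shape)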

The Caffe Layer Architecture

In Caffe, the code for a deep model follows its layered and compositional structure for modularity. The Net (class definition) has Layers (class definition), and the computations of the Net are delegated to the Layers. All deep model computations are framed as layer types like convolution, pooling, nonlinearities, and so on. By encapsulating the details of the operation, the layer provides modularity since it can be implemented once and then instantiated everywhere. To afford this encapsulation the layers must follow a common protocol:

  1. Setup and Reshape: the layer setup does one-time initialization like loading parameters while reshape handles the input-output configuration of the layer so that the user does not have to do dimension bookkeeping.
  2. Forward: compute the output given the input.
  3. Backward: given the gradient with respect to the output, compute the gradient with respect to the input and with respect to the parameters (if any).

The Layer class follows this protocol in its public method definitions, and each layer type is a derived class that declares these same methods, as in the cuDNN convolution class declaration. The actual implementations are split between .cpp and .cu files. For example, the CuDNNConvolutionLayer class Setup / Reshape methods are implemented in cudnn_conv_layer.cpp, and its GPU-mode Forward / Backward methods are implemented in cudnn_conv_layer.cu. Note that CuDNNConvolutionLayer is a strategy class for ConvolutionLayer: it does not declare the {Forward,Backward}_cpu variants, so it inherits the standard Caffe CPU implementations instead. GPU-mode CUDA code is optional for CPU-only layers or prototyping; in that case the GPU mode falls back to the CPU implementations, with automatic communication between the host and device as needed. All layer classes follow this .cpp + .cu arrangement.
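The same four-step protocol is also exposed to Python through Caffe's optional Python layer support (available when Caffe is built with that option enabled). The toy layer below, which simply doubles its input, is a minimal sketch of how Setup / Reshape / Forward / Backward map onto code; the class name and behavior are made up for illustration.

import caffe


class ScaleByTwoLayer(caffe.Layer):
    """Toy layer that doubles its input, following the layer protocol."""

    def setup(self, bottom, top):
        # One-time initialization and sanity checks.
        if len(bottom) != 1 or len(top) != 1:
            raise Exception("ScaleByTwoLayer expects exactly one bottom and one top blob.")

    def reshape(self, bottom, top):
        # The output has the same shape as the input; Caffe handles the bookkeeping.
        top[0].reshape(*bottom[0].data.shape)

    def forward(self, bottom, top):
        # Compute the output given the input.
        top[0].data[...] = 2.0 * bottom[0].data

    def backward(self, top, propagate_down, bottom):
        # Propagate the output gradient back to the input (this layer has no parameters).
        if propagate_down[0]:
            bottom[0].diff[...] = 2.0 * top[0].diff

In the model schema such a layer is declared with the Python layer type, naming its module and class, while the C++ layers that ship with Caffe follow the same protocol with Forward / Backward split into CPU- and GPU-mode variants.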

Once implemented, the layer needs to be included in the model schema. This schema is coded in the Protocol Buffer format in the caffe.proto master definition. To include a layer in the schema:

  1. Register its type in the LayerType enumeration like CONVOLUTION.
  2. Add a field to the layer configuration message to hold the layer’s configuration parameters, if any, like the convolution_param field.
  3. Define the actual layer parameter message, like ConvolutionParameter.

While this layer crash course identifies the main points, the Caffe development wiki has a full guide for layer development.

Drop-in Acceleration with cuDNN

Deep networks require intense computation, so Caffe has taken advantage of both GPU and CPU processing from the project’s beginning. A single machine with one or more GPUs can train state-of-the-art models quickly without the engineering overhead or cost of a CPU cluster. The bundled Caffe reference models and many experiments were learned and run over millions of iterations and images on NVIDIA GPUs. On a single K40 GPU, Caffe can classify over 60 million images a day with the ILSVRC12-winning AlexNet model [6] and the CaffeNet variant. This level of performance is crucial for exploring new models and tasks.

The new cuDNN library provides NVIDIA-tuned and tested implementations of the most computationally demanding routines needed for CNNs. cuDNN accelerates Caffe by 1.38x overall for training and evaluating the CaffeNet model, with layer-wise speedups of 1.2-3x, as shown in Table 1. The cuDNN paper preprint [1] details the computational approach of the library and its integration in Caffe. Caffe + cuDNN lets you define your models just as before, as plain text, while taking advantage of these computational speedups through drop-in integration. cuDNN or pure Caffe computation can be selected per layer to pick the fastest implementation for a given architecture. cuDNN integration is now included in the release candidate version of Caffe on its master branch. We are excited about these latest developments and are already using cuDNN to train models faster than before.

Table 1: Times (in milliseconds) for computing a single iteration by Caffe and Caffe + cuDNN for training and testing the CaffeNet model. These are average times over 200 training iterations for Forward, Backward, and Update and 1000 testing iterations for Testing. The benchmark was measured with CUDA 6.5, driver version 340.42, a K40c with ECC off and boost clock on, and data prefetching.
            Caffe      Caffe + cuDNN   Speedup
Training    1325 ms    960 ms          1.38
Testing     100 ms     66.7 ms         1.50

Brew Your Own Deep Neural Networks with Caffe and cuDNN

Here are some pointers to help you learn more and get started with Caffe. Sign up for the DIY Deep Learning with Caffe NVIDIA Webinar (Wednesday, December 3, 2014) for a hands-on tutorial for incorporating deep learning in your own work. To start exploring deep learning today, check out the Caffe project code with bundled examples and models on GitHub. Caffe is a popular framework with an active user and open-source development community of over 1,200 subscribers and over 600 forks on GitHub. It is used by universities, industry, and startups, and several participants in this year’s ImageNet Large Scale Visual Recognition Challenge built their submissions on the framework. Subscribe to the caffe-users mailing list for questions on installation, examples, and usage. Welcome to brewing deep networks with Caffe!

REFERENCES

[1] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. “cuDNN: Efficient primitives for deep learning”. arXiv preprint arXiv:1410.0759, 2014.

[2] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. “DeCAF: A deep convolutional activation feature for generic visual recognition”. ICML, 2014.

[3] K. Fukushima. “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position”. Biological Cybernetics, 1980.

[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik. “Rich feature hierarchies for accurate object detection and semantic segmentation”. CVPR, 2014.

[5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. “Caffe: Convolutional architecture for fast feature embedding”. arXiv preprint arXiv:1408.5093, 2014.

[6] A. Krizhevsky, I. Sutskever, and G. Hinton. “ImageNet classification with deep convolutional neural networks”. NIPS, 2012.

[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. “Gradient-based learning applied to document recognition”. Proceedings of the IEEE, 1998.

[8] M. Zeiler and R. Fergus. “Visualizing and understanding convolutional networks”. ECCV, 2014.
