How GPUs accelerate deep learning

Neural networks and deep learning are not recent methods. In fact, they are quite old. Perceptrons, the first neural networks, were created in 1958 by Frank Rosenblatt. Even the invention of the ubiquitous building blocks of deep learning architectures happened mostly near the end of the 20th century. For example, convolutional networks were introduced in 1989 in the landmark paper Backpropagation Applied to Handwritten Zip Code Recognition by Yann LeCun et al.

Why did the deep learning revolution have to wait decades?

One major reason was the computational cost. Even the smallest architectures can have dozens of layers and millions of parameters, so repeatedly calculating gradients during training is computationally expensive. On large enough datasets, training used to take days or even weeks. Nowadays, you can train a state-of-the-art model on your notebook in a few hours.

There were three major advances that brought deep learning from a research tool to a method present in almost all areas of our lives: backpropagation, stochastic gradient descent, and GPU computing. In this post, we are going to dive into the last of these and see that neural networks are actually embarrassingly parallel algorithms, which can be leveraged to reduce computational costs by orders of magnitude.

A big pile of linear algebra

Deep neural networks may seem complicated at first glance. However, if we zoom into them, we can see that their components are pretty simple in most cases. As the always brilliant xkcd puts it, a network is (mostly) a pile of linear algebra.

(Comic: xkcd)

During training, the most commonly used functions are basic linear algebra operations such as matrix multiplication and addition. The situation is simple: if you call a function a bazillion times, shaving even the tiniest amount of time off each call compounds into serious savings.

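To make this concrete, here is a minimal sketch in plain NumPy (the layer sizes are made up for illustration) of what a single fully connected layer computes: one matrix multiplication, one addition, and an elementwise activation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration.
batch_size, n_in, n_out = 32, 784, 128

x = rng.standard_normal((batch_size, n_in))   # input batch
W = rng.standard_normal((n_in, n_out))        # weight matrix
b = rng.standard_normal(n_out)                # bias vector

def relu(z):
    # Elementwise max(0, z).
    return np.maximum(z, 0.0)

# One dense layer: matrix multiply, add, elementwise nonlinearity.
h = relu(x @ W + b)
print(h.shape)  # (32, 128)
```

Everything in this snippet is either a linear algebra operation or an elementwise function, which is exactly the pattern the rest of this post exploits.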

Using GPUs does not just provide a small improvement here; it supercharges the entire process. To see how, let's take activations as an example.

Suppose that φ is an activation function such as ReLU or sigmoid. Applied to the output of the previous layer

x = (x₁, x₂, …, xₙ),

the result is

φ(x) = (φ(x₁), φ(x₂), …, φ(xₙ)).

(The same goes for multidimensional input such as images.)

This requires looping over the vector and calculating the value for each element. There are two ways to make this computation faster. First, we can calculate each φ(xᵢ) faster. Second, we can calculate the values φ(x₁), φ(x₂), …, φ(xₙ) simultaneously, in parallel. In fact, this is embarrassingly parallel, which means that the computation can be parallelized without any significant additional effort.

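As a rough illustration (plain NumPy, with an arbitrary vector size), the same elementwise ReLU can be computed either with an explicit Python loop or with a single vectorized call. Every element is independent, so the second form is the kind of work that can be spread across many processing units at once.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(1_000_000)  # output of a hypothetical previous layer

# Option 1: loop over the vector, one element at a time.
y_loop = np.empty_like(x)
for i in range(x.size):
    y_loop[i] = x[i] if x[i] > 0 else 0.0

# Option 2: a single vectorized call; each element is computed
# independently, so the work is trivially parallelizable.
y_vec = np.maximum(x, 0.0)

assert np.allclose(y_loop, y_vec)
```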

Over the years, the first option, making each individual operation faster, has become much more difficult. Processor clock speeds used to double almost every year, but this has plateaued recently: modern processor design has reached the point where packing more transistors into a chip runs into quantum-mechanical barriers.

However, calculating the values in parallel does not require faster processors, just more of them. This is how GPUs work, as we are going to see.

The principles of GPU computing

Graphics Processing Units, or GPUs for short, were developed to create and process images. Since the value of every pixel can be calculated independently of the others, it is better to have many weaker processors than a single very strong one doing the calculations sequentially.

This is the same situation we have with deep learning models: most operations can easily be decomposed into parts that can be completed independently.

(Image: NVIDIA Fermi architecture whitepaper)

To give you an analogy, consider a restaurant that has to produce French fries on a massive scale. To do this, workers must peel, slice, and fry the potatoes. Hiring people to peel the potatoes costs much more than purchasing kitchen robots capable of performing this task. Even if the robots are slower, you can buy many more of them for the same budget, so overall the process will be faster.

Modes of parallelism

When talking about parallel programming, one can classify computing architectures into four different classes. This classification was introduced by Michael J. Flynn in 1966 and has been in use ever since.

  1. Single Instruction, Single Data (SISD)
  2. Single Instruction, Multiple Data (SIMD)
  3. Multiple Instructions, Single Data (MISD)
  4. Multiple Instructions, Multiple Data (MIMD)

A multi-core processor is MIMD, while a GPU is SIMD. Deep learning is a problem for which SIMD is very well suited: when you calculate the activations, the exact same operation needs to be performed, just with different data for each call.

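As a loose illustration of the two styles that matter here (NumPy's vectorized operations run as optimized C loops on the CPU's own vector units, not on a GPU, but the idea is the same), SISD processes one element per instruction, while SIMD issues one operation over a whole array. The scale-and-shift operation below is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
a, b = 2.0, 1.0

# SISD style: one instruction applied to one data element at a time.
y_sisd = np.empty_like(x)
for i in range(x.size):
    y_sisd[i] = a * x[i] + b

# SIMD style: the same operation issued once over the whole array.
y_simd = a * x + b

assert np.allclose(y_sisd, y_simd)
```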

Latency vs throughput

To give a more detailed picture of what makes a GPU better than a CPU here, we need to take a look at latency and throughput. Latency is the time required to complete a single task, while throughput is the number of tasks completed per unit of time.

Simply put, a GPU can provide much better throughput, at the cost of latency. For embarrassingly parallel tasks such as matrix computations, this can offer an order of magnitude improvement in performance. However, it is not well suited for complex tasks, such as running an operating system.

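As a hedged sketch of the throughput difference (PyTorch, assuming a CUDA-capable GPU is available; the matrix size is arbitrary and the exact numbers depend entirely on your hardware), timing the same large matrix multiplication on the CPU and on the GPU usually shows the gap clearly.

```python
import time
import torch

n = 4096
a = torch.randn(n, n)
b = torch.randn(n, n)

# Time the matrix multiplication on the CPU.
t0 = time.perf_counter()
c_cpu = a @ b
cpu_time = time.perf_counter() - t0
print(f"CPU: {cpu_time:.3f} s")

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    _ = a_gpu @ b_gpu                 # warm-up, excludes one-time CUDA setup
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    c_gpu = a_gpu @ b_gpu
    torch.cuda.synchronize()          # GPU kernels launch asynchronously
    gpu_time = time.perf_counter() - t0
    print(f"GPU: {gpu_time:.3f} s")
```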

CPUs, on the other hand, are optimized for latency, not throughput. They can do much more than floating-point calculations.

General purpose GPU programming

In practice, general purpose GPU programming was not available for a long time. GPUs were restricted to doing graphics, and if you wanted to leverage their processing power, you needed to learn graphics programming interfaces such as OpenGL. This was not very practical, and the barrier to entry was high.

This was the case until 2007, when NVIDIA launched the CUDA framework, an extension of C that provides an API for GPU computing. This made GPU programming far more accessible. Fast forward a few years: modern deep learning frameworks use GPUs without us even having to think about it.

GPU computing for deep learning

So far, we have talked about how GPU computing can be used for deep learning, but we haven't seen the effects. The following table shows a benchmark from 2017. Although it was made three years ago, it still demonstrates the order-of-magnitude improvement in speed.

(Table: Benchmarking State-of-the-Art Deep Learning Software Tools)

How modern deep learning frameworks use GPUs

Programming directly in CUDA and writing kernels yourself is not the easiest thing to do. Thankfully, modern deep learning frameworks such as TensorFlow and PyTorch don't require you to do that. Behind the scenes, the computationally intensive parts are written in CUDA, using NVIDIA's deep learning library cuDNN. These are called from Python, so you don't need to use them directly at all. Python is really strong in this aspect: it can be combined with C easily, which gives you both power and ease of use.

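For instance, in PyTorch moving a model and a batch of data to the GPU is essentially a one-line change; the CUDA/cuDNN kernels are invoked behind the scenes. A minimal sketch, with made-up layer sizes:

```python
import torch
import torch.nn as nn

# A tiny model with made-up sizes; the Python code never mentions kernels.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
x = torch.randn(64, 784)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Moving the parameters and the data to the device is all we do explicitly;
# on a GPU, the matrix multiplications and activations run as CUDA kernels.
model = model.to(device)
x = x.to(device)

logits = model(x)
print(logits.shape, logits.device)
```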

This is similar to how NumPy works behind the scenes: it is blazing fast because its functions are written directly in C.

Do you need to build a deep learning rig?

If you want to train deep learning models on your own, you have several choices. First, you can build a GPU machine yourself; however, this can be a significant investment. Thankfully, you don't need to do that: cloud providers such as Amazon and Google offer remote GPU instances to work on. If you want to access resources for free, check out Google Colab, which offers free access to GPU instances.

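If you go the Colab (or any cloud GPU) route, a quick check like the sketch below (PyTorch assumed) confirms that a GPU is actually attached to your runtime before you start training.

```python
import torch

if torch.cuda.is_available():
    # Name of the attached GPU; the exact model varies by instance.
    print("GPU available:", torch.cuda.get_device_name(0))
else:
    print("No GPU found - in Colab, enable one under "
          "Runtime > Change runtime type > Hardware accelerator.")
```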

Conclusion

Deep learning is computationally very intensive. For decades, training neural networks was limited by hardware. Even relatively small models had to be trained for days, and training large architectures on huge datasets was impossible.

However, with the appearance of general purpose GPU programming, deep learning exploded. GPUs excel at parallel processing, and since these algorithms can be parallelized very efficiently, GPU computing can accelerate training and inference by several orders of magnitude.

This has opened the way for rapid growth. Now even relatively cheap, commercially available computers can train state-of-the-art models. Combined with amazing open source tools such as TensorFlow and PyTorch, people are building awesome things every day. This is truly a great time to be in the field.

Translated from: https://towardsdatascience.com/how-gpus-accelerate-deep-learning-3d9dec44667a
