Reposted from: https://timdettmers.wordpress.com/2014/08/14/which-gpu-for-deep-learning/

Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning
August 14, 2014 · Tim Dettmers · Uncategorized · CUDA, Deep Learning, GPU, Neural Networks

It is amazing again and again to see how much speedup you get when you use GPUs for deep learning: compared to CPUs, 20x speedups are typical, but on larger problems one can achieve 50x speedups. With GPUs you can try out new ideas, algorithms and experiments much faster than usual and get almost immediate feedback as to what works and what does not. If you do not have a GPU and you are serious about deep learning, you should definitely get one. But which one should you get? In this blog post I will guide you through the choices to the GPU that is best for you.
Having a fast GPU is very important when one begins to learn deep learning, as the rapid gain in practical experience is key to building the expertise with which you will be able to apply deep learning to new problems. Without this rapid feedback, it simply takes too much time to learn from one's mistakes, and it can be discouraging and frustrating to continue with deep learning.
With GPUs I quickly learned how to apply deep learning to a range of Kaggle competitions, and I managed to earn second place in the Partly Sunny with a Chance of Hashtags Kaggle competition, where the task was to predict weather ratings for a given tweet. In the competition I used a rather large two-layered deep neural network with rectified linear units and dropout for regularization, and this deep net barely fit into my 6GB of GPU memory. More details on my approach can be found here.
Should I get multiple GPUs?
Excited by what deep learning can do with GPUs, I plunged myself into multi-GPU territory by assembling a small GPU cluster with a 40Gbit/s InfiniBand interconnect. I was thrilled to see whether even better results could be obtained with multiple GPUs.
I quickly found that it is not only very difficult to parallelize neural networks on multiple GPUs efficiently, but also that the speedup was only mediocre for dense neural networks. Small neural networks could be parallelized rather efficiently using data parallelism, but larger neural networks like the one I used in the Partly Sunny with a Chance of Hashtags Kaggle competition received almost no speedup.
Setup in my main computer: you can see three GTX Titans and an InfiniBand card. Is this a good setup for doing deep learning?
However, using model parallelism, I was able to train neural networks that were much larger and that had almost 3 billion connections. But to leverage these connections one needs much larger data sets, which are uncommon outside of large companies – I found such data for these large models when I trained a language model on the entire Wikipedia corpus, but that's about it.
On the other hand, one advantage of multiple GPUs is that you can run multiple algorithms or experiments separately on each GPU. You gain no speedups, but you get more information about your performance by using different algorithms or parameters at once. This is highly useful if your main goal is to gain deep learning experience as quickly as possible, and it is also very useful for researchers who want to try multiple versions of a new algorithm at the same time.
If you use deep learning only occasionally, or you use rather small data sets (smaller than, say, 10-15GB) and mostly dense neural networks, then multiple GPUs are probably not for you. However, if you use convolutional neural networks a lot, then multiple GPUs might still make sense.
Alex Krizhevsky released his new updated code which can run convolutional neural networks on up to four GPUs. Convolutional neural networks – unlike dense neural networks – can be run very efficiently on multiple GPUs because their use of weight sharing makes data parallelism very efficient. On top of that, Alex Krizhevsky’s implementation utilizes model parallelism for the densely connected final layers of the network which gives additional gains in speed.
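To get a feel for why weight sharing makes data parallelism cheap, compare the gradient messages each GPU must exchange per update. The sketch below uses illustrative layer sizes of my own choosing, not the exact layers of any particular network:

```python
def gradient_sync_mb(num_params, bytes_per_param=4):
    """MB of gradients each GPU must exchange per update under data parallelism."""
    return num_params * bytes_per_param / 1024**2

# Convolutional layer: 64 filters of size 5x5 over 3 channels -> a tiny message
print(gradient_sync_mb(64 * 5 * 5 * 3))   # ~0.02 MB per update
# Dense layer: 4096x4096 weight matrix -> a 64 MB message on every single update
print(gradient_sync_mb(4096 * 4096))      # 64.0 MB per update
```

Because the shared convolutional weights are so small, synchronizing them across GPUs costs almost nothing compared to the dense layers, which is why model parallelism is used for those final layers instead.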
However, if you want to program similar networks yourself, be aware that programming efficient multi-GPU networks is a difficult undertaking which will consume much more time than programming a simple network on one GPU.
So overall, one can say that one GPU should be sufficient for almost any task and that additional GPUs convey benefits only under very specific circumstances (mainly for very, very large data sets).
So what kind of GPU should I get? NVIDIA or AMD?
NVIDIA's standard libraries made it very easy to establish the first deep learning libraries in CUDA, while there were no such powerful standard libraries for AMD's OpenCL. Right now, there are just no good deep learning libraries for AMD cards – so NVIDIA it is. Even if some OpenCL libraries became available in the future, I would stick with NVIDIA: the GPU computing (GPGPU) community is very large for CUDA and rather small for OpenCL. Thus, in the CUDA community, good open source solutions and solid advice for your programming are readily available.
Required memory size for simple neural networks
People often ask me if the GPUs with the largest memory are best for them, as this would enable them to run the largest neural networks. I thought like this when I bought my GPU, a GTX Titan with 6GB memory. And I also thought that this was a good choice when my neural network in the Partly Sunny with a Chance of Hashtags Kaggle competition barely fit into my GPU memory. But later I found out that my neural network implementation was very memory inefficient and much less memory would have been sufficient – so sometimes the underlying code is limiting rather than the GPU. Although standard deep learning libraries are efficient, these libraries are often optimized for speed rather than for memory efficiency.
There is an easy formula for calculating the memory requirements of a simple neural network. The formula below gives the memory requirement in GB for a standard implementation of a simple neural network with dropout and momentum/Nesterov/AdaGrad/RMSProp:

$$\text{Memory (GB)} = \frac{4 \cdot 3 \cdot \sum_{i=1}^{L} n_{i-1}\, n_i \;+\; 4 \cdot 3 \cdot b \sum_{i=0}^{L} n_i}{1024^3}$$

Memory formula: $n_i$ is the number of units in layer $i$ – the "units" of the first layer ($n_0$) are the dimensionality of the input – and $b$ is the batch size. In words this formula means: sum up the weight sizes and the input sizes each; multiply the input sizes by the batch size; multiply everything by 4 for bytes and by another 3 for the momentum and gradient matrices in the first term, and the dropout and error matrices in the second term; divide to get gigabytes.

For the Kaggle competition I used a 9000x4000x4000x32 network with a batch size of 128, which uses up roughly 0.6GB by this formula. So this fits even into a small GPU with 1.5GB of memory. However, the world looks quite different when you use convolutional neural networks.
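As a minimal sketch, this calculation can be written down in a few lines of Python (the function name and layer list are mine, for illustration only):

```python
def net_memory_gb(layer_sizes, batch_size, bytes_per_float=4):
    """Memory estimate for a simple dense net with dropout and momentum-style
    updates: 3 matrices per weight matrix (weights, gradients, momentum) plus
    3 matrices per set of activations (activations, dropout mask, errors)."""
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    units = sum(layer_sizes)  # includes the input layer
    total_bytes = bytes_per_float * 3 * (weights + batch_size * units)
    return total_bytes / 1024**3

# The Partly Sunny with a Chance of Hashtags net: 9000x4000x4000x32, batch size 128
print(net_memory_gb([9000, 4000, 4000, 32], 128))  # ~0.6 GB
```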
Understanding the memory requirements of convolutional nets
In the last update I made some corrections for the memory requirement of convolutional neural networks, but the topic was still unclear and lacked definite advice on the situation.
I sat down and tried to build a formula similar to the one for simple neural networks, but the many parameters that vary from one net to another, and the implementations that vary from one net to another, made such a formula too impractical – it would just contain too many variables and would be too large to give a quick overview of what memory size is needed. Instead I want to give a practical rule of thumb. But first let us dive into the question of why convolutional nets require so much memory.
There are generally two types of convolutional implementations: one uses Fourier transforms, the other operates directly on image patches.

$$(f * g)(x) \;=\; \int f(x - \tau)\, g(\tau)\, d\tau \qquad\Longleftrightarrow\qquad \widehat{f * g} \;=\; \hat{f} \cdot \hat{g}$$

Continuous convolution theorem with abuse of notation: the input $f$ represents an image or feature map, and the subtraction of the argument can be thought of as creating image patches of the kernel's width with respect to some $x$, which are then multiplied by the kernel $g$. The integration turns into a multiplication in the Fourier domain; here $\hat{f}$ denotes a Fourier-transformed function. For discrete "dimensions" ($x$) we have a sum instead of an integral – but the idea is the same.

The mathematical operation of convolution can be described by a simple matrix multiplication in the Fourier frequency domain. So one can perform a fast Fourier transform on the inputs and on each kernel and multiply them to obtain feature maps – the outputs of a convolutional layer. During the backward pass we do an inverse fast Fourier transform to receive gradients in the standard domain so that we can update the weights. Ideally, we store all these Fourier transforms in memory to save the time of allocating the memory during each pass. This can amount to a lot of extra memory and this is the chunk of memory that is added for the Fourier method for convolutional nets – holding all this memory is just required to make everything run smoothly.
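A small sketch of this idea in NumPy (the function name is mine; real GPU libraries batch this over many images and kernels and cache the transforms in memory, as described above):

```python
import numpy as np
from scipy.signal import convolve2d

def fft_conv2d(image, kernel):
    """2D convolution via the convolution theorem: FFT both arrays (zero-padded so
    the circular convolution equals the linear one), multiply pointwise, invert."""
    out_shape = (image.shape[0] + kernel.shape[0] - 1,
                 image.shape[1] + kernel.shape[1] - 1)
    f_image = np.fft.rfft2(image, out_shape)              # Fourier transform of the input
    f_kernel = np.fft.rfft2(kernel, out_shape)            # Fourier transform of the kernel
    return np.fft.irfft2(f_image * f_kernel, out_shape)   # back to the spatial domain

# Sanity check against a direct spatial-domain convolution
image, kernel = np.random.rand(32, 32), np.random.rand(5, 5)
assert np.allclose(fft_conv2d(image, kernel), convolve2d(image, kernel, mode='full'))
```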
The method that operates directly on image patches realigns memory for overlapping patches to allow contiguous memory access. This means that all memory addresses lie next to each other – there is no "skipping" of indices – and this allows much faster memory reads. One can imagine this operation as unrolling a square (cubic) kernel in the convolutional net into a single line, a single vector. Slow memory access is probably the thing that hurts an algorithm's performance the most, and this prefetching and aligning of memory makes the convolution run much faster. There is much more going on in the CUDA code for this approach to calculating convolutions, but prefetching of inputs or pixels seems to be the main reason for the increased memory usage.
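Below is a toy NumPy version of this patch-unrolling idea (often called "im2col"); the helper names are mine, and real implementations do this in CUDA over batches and channels, but even this sketch shows where the extra memory goes: the patch matrix holds roughly k² copies of every pixel.

```python
import numpy as np

def im2col(image, k):
    """Unroll every k x k patch of a 2D image into one row of a matrix, so that
    the convolution becomes a single matrix product over contiguous memory."""
    h, w = image.shape
    patches = [image[i:i + k, j:j + k].ravel()      # one patch -> one flat vector
               for i in range(h - k + 1)
               for j in range(w - k + 1)]
    return np.array(patches)                        # (num_patches, k*k): the extra memory

def conv2d_via_im2col(image, kernel):
    """Cross-correlation (the 'convolution' used in conv nets, without kernel flip)."""
    k = kernel.shape[0]
    out = im2col(image, k) @ kernel.ravel()
    return out.reshape(image.shape[0] - k + 1, image.shape[1] - k + 1)
```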
Another memory-related issue, which is general for any convolutional net implementation, is that convolutional nets have small weight matrices due to weight sharing, but many more units than densely connected, simple neural networks. The result is that the second term in the equation for simple neural networks above is much larger for convolutional neural networks, while the first term is smaller. This gives another intuition for where the memory requirements come from.
I hope this gives an idea about what is going on with memory in convolutional neural nets. Now let us have a look at what practical advice might look like.
Required memory size for convolutional nets
Now that we understand that implementations and architectures vary wildly for convolutional nets, we know that it might be best to look for other ways to gauge memory requirements.
In general, if you have enough memory, convolutional neural nets can be made nearly arbitrarily large for a given problem (dozens of convolutional layers). However, you will soon run into problems with overfitting, so that larger nets will not be any better than smaller ones. Therefore data set size and the number of label classes might serve well as a gauge of how big you can make your net and, in turn, how large your GPU memory needs to be.
One example here is the Kaggle plankton detection competition. At first I thought about entering the competition, as I might have a huge advantage through my 4 GPU system: I reasoned I might be able to train a very large convolutional net in a very short time – something others cannot do because they lack the hardware. However, due to the small data set (about 50×50 pixels, 2 color channels, 400k training images; about 100 classes) I quickly found that overfitting was an issue even for small nets that fit neatly into one GPU and are fast to train. So there was hardly a speed advantage to multiple GPUs and no advantage at all to having a large GPU memory.
If you look at the ImageNet competition, you have 1000 classes and over a million 250×250 images with three color channels – that's more than 250GB of data. Here Alex Krizhevsky's original convolutional net did not fit into a single GPU with 3GB of memory (but it did not use much more than 3GB). For ImageNet, 6GB will usually be quite sufficient for a competitive net. You can get slightly better results if you throw dozens of high-memory GPUs at the problem, but the improvement is only marginal compared to the resources you need for that.
I think it is very likely that the next breakthrough in deep learning will be made with a single GPU by researchers who try new recipes, rather than by researchers who use GPU clusters and try variations of the same recipe. So if you are a researcher you should not fear a small memory size – a faster GPU, or multiple smaller GPUs, will often give you a better experience than a single large GPU.
Right now, there are not many relevant data sets that are larger than ImageNet, and thus for most people a 6GB GPU memory should be plenty; if you want to invest in a smaller GPU and are unsure whether 3GB or 4GB is okay, a good rule might be to look at the size of your problems compared to ImageNet and judge from that.
Overall, I think memory size is overrated. You can nicely gain some speedups if you have very large memory, but these speedups are rather small. I would say that GPU clusters are nice to have, but that they cause more overhead than they accelerate progress; a single 12GB GPU will last you for 3-6 years; a 6GB GPU is plenty for now; a 4GB GPU is good but might be limiting on some problems; and a 3GB GPU will be fine for most research that looks into new architectures.
Fastest GPU for a given budget
Processing performance is most often measured in floating-point operations per second (FLOPS). This measure is often advertised in GPU computing and it is also the measure which determines which supercomputer enters the TOP500 list of the fastest supercomputers. However, this measure is misleading, as it measures processing power on problems that do not occur in practice.
It turns out that the most important practical measure for GPU performance is bandwidth in GB/s, which measures how much memory can be read and written per second. This is because almost all mathematical operations, such as dot product, sum, addition etcetera, are bandwidth bound, i.e. limited by the GB/s of the card rather than its FLOPS.
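To see what "bandwidth bound" means in practice, note that the runtime of an elementwise operation is essentially the number of bytes it moves divided by the card's GB/s. The sketch below uses approximate bandwidth figures (roughly 288 GB/s for a GTX Titan and around 25 GB/s for a typical CPU of that era):

```python
def elementwise_time_ms(num_elements, bandwidth_gb_s, bytes_per_element=4,
                        reads=2, writes=1):
    """Lower bound on the runtime of a bandwidth-bound elementwise op such as
    c = a + b: total bytes moved divided by the memory bandwidth."""
    bytes_moved = num_elements * bytes_per_element * (reads + writes)
    return bytes_moved / (bandwidth_gb_s * 1e9) * 1000

n = 4000 * 4000  # adding two 4000x4000 float32 matrices
print(elementwise_time_ms(n, 288))  # GPU (GTX Titan): ~0.7 ms
print(elementwise_time_ms(n, 25))   # CPU:             ~8 ms
```

The FLOPS of either device barely matter here; the memory traffic alone already determines the order-of-magnitude difference.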
Comparison of bandwidth for CPUs and GPUs over time: bandwidth is one of the main reasons why GPUs are faster for computing than CPUs.
To determine the fastest GPU for a given budget one can use this Wikipedia page and look at the bandwidth in GB/s; the listed prices are quite accurate for newer cards (700 and 900 series), but the 500 series is significantly cheaper than the listed prices – especially if you buy those cards via eBay. Sometimes cryptocurrency mining benchmarks are also a reliable gauge of performance (you could gauge the performance of the GTX 980 nicely before there were any deep learning benchmarks). But beware: some cryptocurrency mining benchmarks are compute bound and thus are uninformative for deep learning performance.
Another important factor to consider, however, is that the Maxwell and Fermi architectures (Maxwell: 900 series; Fermi: 400 and 500 series) are quite a bit faster than the Kepler architecture (600 and 700 series), so that, for example, the GTX 580 is faster than any GTX 600 series GPU. The new Maxwell GPUs are significantly faster than most Kepler GPUs, and you should prefer Maxwell if you have the money. If you cannot afford a GTX 980, a 3GB GTX 580 from eBay is an excellent choice that will suffice for most applications on medium-sized data sets (like Kaggle competition data sets). The 700 series is outdated, and only the GTX Titan is interesting as a cost-effective 6GB option. The GTX 970 might also be an option, but along with the GTX 580 there are some things you will need to consider.
Cheap but troubled
The GTX 970 is a special case you need to be aware of. It has a weird memory architecture which may cripple performance if more than 3.5GB is used, so it might be troublesome if you train large convolutional nets. The problem was dramatized quite a bit in the news, but it turned out that the whole issue was not as dramatic for deep learning as the original benchmarks suggested: if you do not go above 3.75GB, it will still be faster than a GTX 580.
To manage the memory problem, it would still be best to learn to extend libraries with your own software routines that alert you when you hit the performance-decreasing memory zone above 3.5GB – if you do this, then the GTX 970 is an excellent, very cost-efficient card that is just limited to 3.5GB. You might get the GTX 970 even cheaper on eBay, where disappointed gamers – who do not have control over the memory routines – may sell it cheaply. Other people have made it clear to me that they think a recommendation for the GTX 970 is crazy, but if you are short on money it is just better than no GPU, and other cheap GPUs like the GTX 580 have their troubles too.
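As a minimal sketch of such an alerting routine, assuming the pynvml bindings for the NVIDIA management library are installed (any other way of querying GPU memory would work just as well):

```python
import pynvml  # assumption: the nvidia-ml-py / pynvml package is installed

SLOW_ZONE_BYTES = 3.5 * 1024**3  # the GTX 970 slows down above ~3.5GB

def warn_if_in_slow_zone(device_index=0):
    """Print a warning when allocated GPU memory enters the slow >3.5GB segment."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    used = pynvml.nvmlDeviceGetMemoryInfo(handle).used  # bytes currently in use
    pynvml.nvmlShutdown()
    if used > SLOW_ZONE_BYTES:
        print("Warning: %.2fGB in use -- GTX 970 slow memory zone!" % (used / 1024**3))
    return used
```

Calling a check like this after allocating your network's buffers tells you whether you are still on the fast 3.5GB segment.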
The GTX 580 is now a bit dated and many GPU libraries no longer support it. You can still use libraries like cuda-convnet 1 and deepnet, which are both excellent libraries, but you should be aware that you cannot use (or cannot fully use) libraries like torch7.
If you do not like all these troubles, then the new 4GB GTX 960 is the cheapest GPU with no troubles, but it is neither as fast (it is on a par with a GTX 580) nor as cost efficient.
Conclusion
With all the information in this article you should be able to reason which GPU to choose by balancing the required memory size, the bandwidth in GB/s for speed, and the price of the GPU, and this reasoning will be solid for many years to come. But right now my recommendation is to get a GTX 980 if you have the money; a 3GB GTX 580 or a GTX 970 from eBay otherwise (decide which trouble is least significant for you); and if you need the memory, a 6GB GTX Titan from eBay will be good (Titan Black if you have the money).
TL;DR advice
Best GPU overall: GTX 980
Troubled but very cost efficient: GTX 580 3GB (lacks software support) or GTX 970 (has memory problem) eBay
Cheapest card with no troubles: GTX 960 4GB
I work with data sets > 250GB: GTX Titan eBay
I have no money: GTX 580 3GB eBay
I do Kaggle: GTX 980 or GTX 960 4GB
I am a researcher: 1-4x GTX 980
I am a researcher with data sets > 250GB: 1-4x GTX Titan
I want to build a GPU cluster: This is really complicated; I will write some advice about this soon, but you can get some ideas here
I started deep learning and I am serious about it: Start with one GTX 580 3GB and buy more GTX 580s as you feel the need for them; save money for Volta GPUs in 2016 Q2/Q3
Update 2014-09-28: Added emphasis for memory requirement of CNNs
Update 2015-02-23: Updated GPU recommendations and memory calculations
Update 2015-03-16: Updated GPU recommendations: GTX 970 and GTX 580
Acknowledgements
I want to thank Mat Kelcey for helping me to debug and test custom code for the GTX 970; I want to thank Sander Dieleman for making me aware of the shortcomings of my GPU memory advice for convolutional nets; I want to thank Hannes Bretschneider for pointing out software dependency problems for the GTX 580.
