How are multiple GPUs utilized in Caffe?

I want to know how Caffe utilizes multiple GPUs so that I can decide whether to upgrade to a new, more powerful card or just buy another of the same card and run them in SLI.
For example, am I better off buying one Titan X 12 GB, or two GTX 1080 8 GB cards?
If I go SLI with the 1080s, will my effective memory be doubled? I mean, can I run a network which takes 12 or more GB of VRAM using them, or am I left with only 8 GB?
Again, how is memory utilized in such scenarios?
What would happen if two different cards are installed (both NVIDIA)? Does Caffe utilize the available memory the same way? (Suppose one 980 and one 970.)

For example, am I better off buying one Titan X 12 GB, or two GTX 1080 8 GB cards? If I go SLI with the 1080s, will my effective memory be doubled? I mean, can I run a network which takes 12 or more GB of VRAM using them, or am I left with only 8 GB?
No. With two GPUs of 8 GB each the effective memory size is still 8 GB, but the effective batch size is doubled, which leads to more stable/faster training.
What would happen if two different cards are installed (both NVIDIA)? Does Caffe utilize the available memory the same way? (Suppose one 980 and one 970.)
I think you will be limited by the lower-end card and may have problems with drivers, so I don't recommend trying this configuration.
Also, from the documentation:
Current implementation has a "soft" assumption that the devices being
used are homogeneous. In practice, any devices of the same general
class should work together, but performance and total size is limited
by the smallest device being used. e.g. if you combine a TitanX and a
GTX980, performance will be limited by the 980. Mixing vastly
different levels of boards, e.g. Kepler and Fermi, is not supported.
Summing up: with a GPU that has lots of RAM you can train deeper models; with multiple GPUs you can train a single model faster, and you can also train separate models, one per GPU. I would choose the single GPU with more memory (Titan X), because deep networks nowadays are more memory-bound (e.g. ResNet-152 or some semantic segmentation networks), and more memory gives you the opportunity to run deeper networks and use a larger batch size. Otherwise, if you have a task that fits on a single GPU (GTX 1080), you can buy 2 or 4 of them just to make things faster.
Also, here is some info about multi-GPU support in Caffe:
The current implementation uses a tree reduction strategy. e.g. if
there are 4 GPUs in the system, 0:1, 2:3 will exchange gradients, then
0:2 (top of the tree) will exchange gradients, 0 will calculate
updated model, 0->2, and then 0->1, 2->3.
https://github.com/BVLC/caffe/blob/master/docs/multigpu.md
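To make the tree-reduction description above more concrete, here is a minimal CUDA sketch (not Caffe's actual implementation) of the pairwise gradient exchange for just two GPUs; the buffer names and sizes are illustrative:

```
// A minimal sketch (not Caffe's actual implementation) of the pairwise gradient
// exchange described above, shown for just two GPUs: GPU 1's gradients are copied
// to GPU 0, averaged there, and the result is copied back.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void average_into(float* dst, const float* src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = 0.5f * (dst[i] + src[i]);
}

int main() {
    const int n = 1 << 20;                  // number of gradient elements (illustrative)
    float *grad0, *recv, *grad1;

    cudaSetDevice(0);
    cudaMalloc((void**)&grad0, n * sizeof(float));
    cudaMalloc((void**)&recv,  n * sizeof(float));   // staging buffer on GPU 0
    cudaSetDevice(1);
    cudaMalloc((void**)&grad1, n * sizeof(float));

    // Allow GPU 0 to access GPU 1's memory directly where the hardware supports it;
    // cudaMemcpyPeer falls back to staging through the host otherwise.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);

    // Step 1: pull GPU 1's gradients over and average them into GPU 0's buffer.
    cudaMemcpyPeer(recv, 0, grad1, 1, n * sizeof(float));
    average_into<<<(n + 255) / 256, 256>>>(grad0, recv, n);

    // Step 2: after GPU 0 computes the model update, push the result back to GPU 1.
    cudaMemcpyPeer(grad1, 1, grad0, 0, n * sizeof(float));
    cudaDeviceSynchronize();

    printf("exchange done\n");
    return 0;
}
```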

I don't believe Caffe supports SLI mode. The two GPUs are treated as
separate cards.
When you run Caffe and add the '-gpu' flag (assuming you are using the
command line), you can specify which GPU to use (-gpu 0 or -gpu 1 for
example). You can also specify multiple GPUs (-gpu 0,1,3) including
using all GPUs (-gpu all).
When you execute using multiple GPUs, Caffe will execute the training
across all of the GPUs and then merge the training updates across the
models. This is effectively doubling (or more if you have more than 2
GPUs) the batch size for each iteration.
In my case, I started with an NVIDIA GTX 970 (4 GB card) and then
upgraded to an NVIDIA GTX Titan X (Maxwell version with 12 GB) because
my models were too large to fit in the GTX 970. I can run some of the
smaller models across both cards (even though they are not the same)
as long as the model will fully fit into the 4GB of the smaller card.
Using the standard ImageNet model, I could execute across both cards
and cut my training time in half.
If I recall correctly, other frameworks (TensorFlow and maybe the
Microsoft CNTK) support splitting a model among different nodes to
effectively increase the available GPU memory like what you are
describing. Although I haven't personally tried either one, I
understand you can define on a per-layer basis where the layer
executes.
Patrick

Perhaps a late answer, but Caffe supports GPU parallelism, which means you can indeed fully utilize both GPUs. I do, however, recommend getting two GPUs of equal memory size, since I don't think Caffe lets you select the batch size per GPU.
As for how memory is utilized: with multiple GPUs, each GPU gets a batch of the batch size specified in your train_val.prototxt. So if your batch size is, for example, 16 and you're using 2 GPUs, you'd have an effective batch size of 32.
Finally, I know that for things such as gaming, SLI seems to be much less effective and often much more problematic than having a single, more powerful GPU. So if you are planning on using the GPUs for more than only deep learning, I'd recommend you still go for the Titan X.

Related

How to extend tensorflow's GPU memory from system RAM

I want to train Google's object detection code with faster_rcnn with resnet101 using the MS COCO dataset. I used only 10,000 images for training. My graphics card is a GeForce 930M/PCIe/SSE2, NVIDIA driver version 384.90.
I have 8 GB of RAM, but TensorFlow shows only 1.96 GB of GPU memory.
Now, how can I extend my GPU's RAM? I want to use the full system memory.
You can train on the CPU to take advantage of the RAM on your machine. However, to run something on the GPU it has to be loaded to the GPU first. You can swap memory in and out, because not all the results are needed at any one step, but you pay with a very long training time, and I would rather advise you to reduce the batch size. Nevertheless, details about this process and an implementation can be found here: https://medium.com/@Synced/how-to-train-a-very-large-and-deep-model-on-one-gpu-7b7edfe2d072.

Is it fair to compare SSE/AVX units to GPU cores?

I have a presentation to make to people who have (almost) no clue of how a GPU works. I think saying that a GPU has a thousand cores where a CPU only has four to eight of them is nonsense. But I want to give my audience an element of comparison.
After a few months working with NVIDIA's Kepler and AMD's GCN architectures, I'm tempted to compare a GPU "core" to a CPU's SIMD ALU (I don't know if Intel has a name for that). Is it fair? After all, when looking at the assembly level, those programming models have much in common (at least with GCN; take a look at p. 2-6 of the ISA manual).
This article states that a Haswell processor can do 32 single-precision operations per cycle, but I suppose there is pipelining or other things happening to achieve that rate. In NVIDIA parlance, how many CUDA cores does this processor have? I would say 8 per CPU core for 32-bit operations, but this is just a guess based on the SIMD width.
Of course there are many other things to take into account when comparing CPU and GPU hardware, but this is not what I'm trying to do. I just have to explain how the thing is working.
PS: All pointers to CPU hardware documentations or CPU/GPU presentations are greatly appreciated !
EDIT:
Thanks for your answers; sadly I had to choose only one of them. I marked Igor's answer because it sticks closest to my initial question and gave me enough information to justify why this comparison shouldn't be taken too far, but CaptainObvious provided very good articles.
I'd be very cautious about making this kind of comparison. After all, even in the GPU world the term "core", depending on the context, describes really different capabilities: the new AMD GCN is quite different from the old VLIW4 one, which itself is quite different from the CUDA core one.
Besides that, you will bring more puzzlement than understanding to your audience if you make just one small comparison with the CPU and leave it at that. If I were you, I'd still go for a more detailed (it can still be quick) comparison. For instance, someone used to CPUs and with little knowledge of GPUs might wonder how a GPU can have so many registers even though they are so expensive (in the CPU world). An explanation of that question is given at the end of this post, as well as some more GPU vs CPU comparison.
This other article gives a nice comparison between these two kinds of processing units by explaining how GPUs work, how they evolved, and the differences with CPUs. It addresses topics like data flow and memory hierarchy, but also what kinds of applications a GPU is useful for. After all, the power a GPU can deliver is accessible (efficiently) only for some types of problems.
And personally, if I had to make a presentation about GPUs and had the possibility to make only one reference to CPUs, it would be this: presenting the problems a GPU can solve efficiently vs those a CPU can handle better.
As a bonus, even though it's not related directly to your presentation, here is an article that puts GPGPU in perspective, showing that some speedups claimed by some people are overrated (this is linked to my last point, btw :))
Very loosely speaking, it is not entirely unreasonable to say that a Haswell core has about 16 CUDA cores, but you definitely don't want to take that comparison too far. You may want to be cautious about making that statement directly in a presentation, but I've found it to be useful to think of a CUDA core as being somewhat related to a scalar FP unit.
It may help if I explain why Haswell can perform 32 single-precision operations per cycle.
8 single-precision operations execute in each AVX/AVX2 instruction. When writing code that will run on a Haswell CPU, you can use AVX and AVX2 instructions which operate on 256-bit vectors. These 256-bit vectors can represent 8 single-precision FP numbers, 8 integers (32-bit) or 4 double-precision FP numbers.
2 AVX/AVX2 instructions can execute in each core per cycle, although there are some restrictions on which instructions can be paired up.
A fused multiply add (FMA) instruction technically performs 2 single-precision operations. FMA instructions perform "fused" operations such as A = A * B + C, so there are arguably two operations per scalar operand: a multiplication and an addition.
This article explains the above points in more detail: http://www.realworldtech.com/haswell-cpu/4/
In the total accounting, a Haswell core can perform 8 * 2 * 2 single-precision operations per cycle. Since CUDA cores support FMA operations as well, you cannot count that factor of 2 when comparing CUDA cores to Haswell cores.
A Kepler CUDA core has one single-precision floating-point unit, so it can perform one floating-point operation per cycle: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf, http://www.realworldtech.com/kepler-brief/
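Putting those numbers together gives the back-of-the-envelope ratio quoted at the top of this answer (a rough comparison, not a precise equivalence):

$$ 8 \;\text{(SP lanes per AVX op)} \times 2 \;\text{(AVX pipes per cycle)} \times 2 \;\text{(FLOPs per FMA)} = 32 \;\text{SP FLOPs/cycle per Haswell core} $$

Since a Kepler CUDA core also performs one FMA (2 FLOPs) per cycle, dropping the shared FMA factor of 2 on both sides leaves roughly 8 × 2 = 16 CUDA cores per Haswell core.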
If I was putting together slides on this, I would have one section explaining how many FP operations Haswell can do per cycle: the three points above, plus you have multiple cores and possibly multiple processors. And, I'd have another section explaining how many FP operations a Kepler GPU can do per cycle: 192 per SMX, and you have multiple SMX units on the GPU.
PS.: I may be stating the obvious, but just to avoid confusion: the Haswell architecture also includes an integrated GPU, which has an altogether different architecture from the Haswell CPU.
I completely agree with CaptainObvious, especially that presenting the problems a GPU can solve efficiently vs those a CPU can handle better would be a good idea.
One way I like to compare CPUs and GPUs is by the number of operations/sec that they can perform. But of course, don't compare one CPU core to a multi-core GPU.
A Sandy Bridge core can perform 2 AVX ops/cycle, that is, crunch 8 double-precision numbers/cycle. Hence, a computer with 16 Sandy Bridge cores clocked at 2.6 GHz has a peak power of 333 Gflops.
A K20 compute module (GK110) has a peak of 1170 Gflops, that is, 3.5 times more. This is a fair comparison in my opinion, and it should be emphasized that the peak performance is much easier to reach on a CPU (some applications reach 80%-90% of peak) than on a GPU (the best cases I know of are less than 50% of peak).
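As a quick sanity check using only the figures quoted above:

$$ 16 \;\text{cores} \times 2.6 \;\text{GHz} \times 8 \;\text{DP FLOPs/cycle} \approx 333 \;\text{GFLOPS}, \qquad 1170 / 333 \approx 3.5 $$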
So to summarize, I would not go into architecture details, but rather state some sheer numbers with the perspective that the peak is often far from reach on GPUs.
It's fairer to compare the GPU to vectorized CPU units; however, if your audience has zero idea of how GPUs work, it seems fair to assume that they have a similarly limited knowledge of vectorized SSE instructions.
For audiences such as these, it's important to point out the high-level differences, like how blocks of "cores" on the GPU share a scheduler and register file.
I would refer to the GTC Kepler architecture overview for a better idea of what the Kepler architecture looks like.
This is also a reasonably graspable comparison between the two if you want to stick to the "gpu core" idea.

How does the speed of CUDA program scale with the number of blocks?

I am working on a Tesla C1060, which contains 240 processor cores with compute capability 1.3. Knowing that each group of 8 cores is controlled by a single multiprocessor, and that each block of threads is assigned to a single multiprocessor, I would expect that launching a grid of 30 blocks should take the same execution time as a single block. However, things don't scale that nicely, and I never got this nice scaling even with 8 threads per block. Going to the other extreme with 512 threads per block, I get approximately the same time as one block when the grid contains at most 5 blocks. This was disappointing when I compared the performance with implementing the same task parallelized with MPI on an 8-core CPU machine.
Can someone explain that to me?
By the way, the computer actually contains two of these Tesla cards, so does it distribute blocks between them automatically, or do I have to take further steps to ensure that both are fully exploited?
EDIT:
Regarding my last question, if I launch two independent MPI processes on the same computer, how can I make each work on a different graphics card?
EDIT2: Based on the request of Pedro, here is a plot depicting the total time on the vertical axis, normalized to 1, versus the number of parallel blocks. The number of threads/block = 512. The numbers are rough, since I observed quite large variance of the times for large numbers of blocks.
The speed is not a simple linear function of the number of blocks. It depends on a bunch of things, for example memory usage, the number of instructions executed in a block, etc.
If you want to do multi-GPU computing, you need to modify your code, otherwise you can only use one GPU card.
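For anyone who wants to measure this behaviour themselves, here is a rough CUDA sketch (hypothetical kernel and sizes) that times the same kernel for different grid sizes; on a 30-multiprocessor card like the C1060 you would expect roughly flat times up to about 30 blocks, then stepwise growth:

```
// Rough sketch (hypothetical kernel and sizes) for measuring how execution time
// changes with the number of blocks launched.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void busy_kernel(float* out, int iters) {
    float x = threadIdx.x * 0.001f + blockIdx.x;
    for (int i = 0; i < iters; ++i)          // artificial arithmetic load
        x = x * 1.0000001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main() {
    const int threads = 512, iters = 100000;
    const int block_counts[] = {1, 2, 5, 10, 30, 60, 120};

    float* out;
    cudaMalloc((void**)&out, 120 * threads * sizeof(float));  // room for the largest grid

    for (int b : block_counts) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        busy_kernel<<<b, threads>>>(out, iters);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%4d blocks: %8.3f ms\n", b, ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }

    cudaFree(out);
    return 0;
}
```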
It seems to me that you have simply taken a C program and compiled it in CUDA without much thought.
Dear friend, this is not the way to go. You have to design your code to take advantage of the fact that CUDA cards have a different internal architecture than regular CPUs. In particular, take the following into account:
memory access pattern - there is a number of memory systems in a GPU and each requires consideration on how to use it best
thread divergence problems - performance will only be good if most of your threads follow the same code path most of the time
If your system has 2 GPUs, you can use both to accelerate some (suitable) problems. The thing is that the memory areas of the two are split and not easily 'visible' to each other -- you have to design your algorithm to take this into account.
A typical C program written in pre-GPU era will often not be easily transplantable unless originally written with MPI in mind.
To make each CPU MPI process work with a different GPU card, you can use cudaSetDevice().
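A minimal sketch of that suggestion, assuming an MPI installation alongside the CUDA runtime; the round-robin rank-to-device mapping below is just one common convention:

```
// Each MPI rank picks its own GPU with cudaSetDevice().
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, device_count = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&device_count);

    // With 2 ranks and 2 Tesla cards: rank 0 -> GPU 0, rank 1 -> GPU 1.
    cudaSetDevice(rank % device_count);

    int device = -1;
    cudaGetDevice(&device);
    printf("MPI rank %d is using GPU %d of %d\n", rank, device, device_count);

    MPI_Finalize();
    return 0;
}
```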

CUDA: Differences between HtoD and DtoH bandwidth

Yet another bandwidth-related question. I expected the plots of device-to-host bandwidth and host-to-device bandwidth to be similar, but I see that there is a significant difference between the two. Considering both follow the same route, the effective bandwidth should be the same, shouldn't it? The testbed consists of a total of 12 Intel Westmere CPUs on two sockets, and 4 Tesla C2050 GPUs in 4 PCIe Gen2 Express slots, using the bandwidthTest program from the NVIDIA code samples.
What are the overheads of doing a cudaMemcpy from the host vs the device?
First, I would say those two curves are similar. I can honestly say that I've never seen symmetric PCI-e bandwidth on any system I have used -- and that includes both CUDA and graphics (OpenGL/D3D) tests, so I don't think it's something (especially this small difference) that should concern you.
As with your other PCI-e bandwidth question, the answer is similar -- the driver may use different strategies for different types and sizes of transfers, attempting to get the highest throughput possible.
Actual throughput depends on many factors, including the type of GPU, and especially on the host chipset in use.
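For reference, here is a rough sketch in the spirit of NVIDIA's bandwidthTest sample, timing one pinned-memory transfer in each direction with CUDA events (the real sample averages many transfers over a range of sizes; the 64 MiB size here is illustrative):

```
// Time one pinned-memory transfer in each direction with CUDA events.
#include <cuda_runtime.h>
#include <cstdio>

static float time_copy_ms(void* dst, const void* src, size_t bytes, cudaMemcpyKind kind) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, kind);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = 64 << 20;                  // 64 MiB transfer
    float *host, *dev;
    cudaMallocHost((void**)&host, bytes);           // pinned host memory for best PCIe throughput
    cudaMalloc((void**)&dev, bytes);

    float h2d_ms = time_copy_ms(dev, host, bytes, cudaMemcpyHostToDevice);
    float d2h_ms = time_copy_ms(host, dev, bytes, cudaMemcpyDeviceToHost);

    printf("HtoD: %.2f GB/s\n", bytes / (h2d_ms * 1e6));  // bytes / (ms * 1e6) = GB/s
    printf("DtoH: %.2f GB/s\n", bytes / (d2h_ms * 1e6));

    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```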

GTX 295 vs other nvidia cards for cuda development

What is the best NVIDIA video card for CUDA development? A single GTX 295 has 2 GPUs; is it possible to have 2 GTX 295s and use the 4 GPUs in my CUDA code?
Is it better to get two 480 cards rather than two 295s? Would a Fermi be better than both cards?
What is the best NVIDIA video card for CUDA development?
Whatever fits in your budget and suits your needs. I know this is a bit vague, but after all it really is as simple as that ;)
A single GTX 295 has 2 GPUs; is it possible to have 2 GTX 295s and use the 4 GPUs in my CUDA code?
Sure, it is. The only drawback is that the 2 GPUs on the GTX 295 share a single PCIe connection. Whether this is relevant for you depends on whether the application needs intensive communication with the host.
Is it better to get two 480 cards rather than two 295s? Would a Fermi be better than both cards?
From the point of view of raw peak performance, a GTX 295 (which is almost 2x a GTX 280, not counting the shared PCIe connection) is better than a 480. However, the GF10x-series architecture improved on many points compared to the GT200; for details see the "Fermi whitepaper" and the "Fermi Tuning Guide".
If you're planning to use double precision, the GF10x series has much improved double-precision support, but it's good to know that on GeForce cards this is capped at 1/8th of the single-precision performance (normally it's about half).
Therefore, I would suggest that unless you have a strong reason to get lots of GFlops (Folding@Home?) in the form of soon-to-be-outdated hardware, you get a GTX 480, or a 470 if you want to save ~25%.
Direct answer: I would go with one or maybe two GTX 480s. But I think my reasoning is a bit different from @bobince's or @pszilard's.
Background: I just made the same decision you're facing, but our situations may be vastly different.
I'm a statistics graduate student in a department with minimal funding for GPU computing resources; the campus does have one Fermi box hooked up to two nodes that I have access to, but these run Linux -- which I love -- and I really want to use Nsight to benchmark and tune my code, so I need Windows. So I decided to purchase a development box which I dual-boot: Ubuntu x64 for production runs, and Win 7 with VS 2010 (a battle which I'm presently fighting) and Nsight 1.5 for development. That said, back to the reason why I bought two GTX 480s (EVGA is awesome!!) and not two GTX 285s or 295s.
I've spent the past two years developing a couple of CUDA kernels. The trickiest part of the development, for me, is the memory management. I spent the better part of three months trying to squeeze a Cholesky decomposition & back substitution into 16 single-precision registers -- the maximum you can use before either the GTX 285 or the 295 incurs a 50% performance penalty (literally 3 weeks going from 17 to 16 registers). For me, the fact that all Fermi architectures have double the registers means that those three months would've gained me about a 10% improvement on a GTX 480 instead of 50% on a GTX 285, and hence were probably not worth my time -- in truth it's a bit more subtle than that, but you get the drift.
If you're fairly new to CUDA -- which you probably are since you're asking -- I would say 32 registers is HUGE. Second, I think the L1 cache of the Fermi architecture can directly translate to faster global memory accesses -- of course it does, but I haven't measured the impact directly yet. If you don't need the global memory as much, you can trade the bigger L1 cache for triple the shared memory -- which was also a tight squeeze for me as the matrix sizes increased.
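As an aside for readers hitting the same register-pressure wall: here is a small hypothetical sketch of the two standard knobs for capping per-thread register usage, the __launch_bounds__ qualifier and the nvcc -maxrregcount flag; the kernel body itself is illustrative only.

```
// Hypothetical example of capping per-thread register usage. The same effect can be
// had file-wide with the compiler flag: nvcc -maxrregcount=16 ...
#include <cuda_runtime.h>

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) tells the compiler
// which launch configuration it must support, pressuring it to keep register usage low
// enough for that occupancy (spilling to local memory if it has to).
__global__ void __launch_bounds__(256, 4)
scale_kernel(float* data, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= alpha;   // trivial body; the payoff is on register-heavy kernels
}

int main() {
    const int n = 1 << 20;
    float* d;
    cudaMalloc((void**)&d, n * sizeof(float));
    scale_kernel<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```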
Then I would agree with @pszilard that if you need double precision, Fermi is definitely the way to go -- though I'd still write your code in single precision first, tune it, and then migrate to double.
I don't think that concurrent kernel execution will matter for you -- it's really cool, the delays to kernel completion can be orders of magnitude less -- but you're probably going to focus on one kernel first, not parallel kernels. If you want to do streaming or parallel kernels, then you need Fermi -- the 285 / 295's simply can't do it.
And lastly, the drawback of going with the 295s is that you have to write two layers of parallelism: (1) one to distribute blocks (or kernels?) across the cards and (2) the GPU kernel itself. If you're just starting out, it's much easier to keep the parallelism in one place (on a single card) as opposed to fighting two battles at once.
PS. If you haven't written your kernels yet, you might consider getting only one card and waiting six months to see if the landscape changes again -- though I have no idea when the next cards are to be released.
PPS. I absolutely loved running my CUDA kernel on the GTX 480 which I had debugged / designed on a Tesla C1070 and instantly realizing a 2x speed improvement. Money well spent.
Is it possible to have 2 GTX 295s and use the 4 GPUs in my CUDA code?
Yes. Or quad, if you're totally insane.
Is it better to get two 480 cards rather than two 295s?
Arguable. The 295, as a dual-GPU card, has slightly more raw oomph, but the 480, as a 40nm-process card without the dual-GPU overhead, may use its resources better. Benchmarks vary. Of course, the Fermi 4xx range has more modern feature support (3D, DirectX, OpenCL, etc.).
But dual-295 is going to have seriously huge PSU and cooling requirements, and dual-480 runs almost as hot. Not to mention the expense. What are you working on that you think you're going to need this? Have you considered the more mainstream parts, e.g. the 460, which is generally considered to offer better price/performance than the troubled 470-480 (GF100) parts?