Strange results for CUDA SDK Bandwidth Test

I have a CUDA application that is data-movement bound (i.e. large memcopies from host to device with relatively little computation done in the kernel). On older GPUs I was compute-bound (e.g. on a Quadro FX 5800), but with the Fermi and Kepler architectures that is no longer the case (for my optimized code).
I just moved my code to a GTX 680 and was impressed with the increased compute performance, but perplexed that the bandwidth between the host and GPU seems to have dropped (relative to my Fermi M2070).
In short when I run the canned SDK bandwidth test I get ~5000 MB/sec on the GTX 680 versus ~5700 MB/sec on the M2070. I recognize that the GTX is "only a gamer card", but the specs for the GTX 680 seem more impressive than for the M2070, WITH THE EXCEPTION OF THE BUS WIDTH.
From wikipedia:
M2070: 102.4 GB/sec, GDDR3, 512 bit bus width
GTX 680: 192 GB/sec, GDDR5, 256 bit bus width
I'm running the canned test with "--wc --memory=pinned" to use write-combined memory.
The results I get with this canned test (with pinned, write-combined memory) are mirrored by the results I am getting with my optimized CUDA code.
Unfortunately, I can't run the test on the same machine (and just switch video cards), but I have tried the GTX 680 on older and newer machines and get the same "degraded" results (relative to what I get on the M2070).
Can anyone confirm that they are able to achieve higher-throughput memcopies with the M2070 Quadro than with the GTX 680? Doesn't the bandwidth spec take the bus width into consideration? The other possibility is that I'm not doing the memcopies correctly/optimally on the GTX 680, but in that case, is there a patch for the bandwidth test so that it will also show that I'm transferring data faster to the 680 than to the M2070?
Thanks.

As Robert Crovella has already commented, your bottleneck is the PCIe bandwidth, not the GPU memory bandwidth.
Your GTX 680 can potentially outperform the M2070 by a factor of two here, as it supports PCIe 3.0, which doubles the bandwidth over the PCIe 2.0 interface of the M2070. However, you need a motherboard that supports PCIe 3.0 for that.
The bus width of the GPU memory is not a concern in itself, even for programs that are GPU memory bandwidth bound. Nvidia managed to substantially increase the frequencies used on the memory bus of the GTX 680, which more than compensates for the reduced bus width relative to the M2070.
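If you want to sanity-check the PCIe link on a given machine without the SDK sample, a minimal sketch like the one below (my own simplification: a single 256 MB pinned transfer, no error checking) measures host-to-device throughput with cudaEvent timing. On a PCIe 2.0 x16 link this typically lands around 5-6 GB/s and roughly doubles on a PCIe 3.0 x16 link.

    // Minimal pinned host-to-device bandwidth check (a sketch, not the SDK bandwidthTest).
    // The 256 MB transfer size and single-shot timing are arbitrary simplifications.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 256u * 1024u * 1024u;

        void *d_buf = nullptr;
        void *h_pinned = nullptr;
        cudaMalloc(&d_buf, bytes);
        cudaMallocHost(&h_pinned, bytes);   // page-locked host memory, as with --memory=pinned

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("Pinned H2D: %.2f GB/s\n", (bytes / 1.0e9) / (ms / 1.0e3));

        cudaFreeHost(h_pinned);
        cudaFree(d_buf);
        return 0;
    }

Running this on both machines makes it easy to see whether the difference tracks the PCIe generation rather than anything about the GPU memory itself.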

Related

Maximum number of concurrent kernels & virtual code architecture

So I found this Wikipedia resource, which has a row labelled "Maximum number of resident grids per device (Concurrent Kernel Execution)", and for each compute capability it gives a number, which I assume to be the maximum number of concurrent kernels.
Now I am getting a GTX 1060 delivered, which according to this NVIDIA CUDA resource has compute capability 6.1. From what I have learned about CUDA so far, you can specify the virtual compute architecture of your code at compile time with the nvcc flag -arch=compute_XX.
So will my GPU be hardware constrained to 32 concurrent kernels or is it capable of 128 with the -arch=compute_60 flag?
According to table 13 in the NVIDIA CUDA programming guide, compute capability 6.1 devices have a maximum of 32 resident grids, i.e. 32 concurrent kernels.
Even if you use the -arch=compute_60 flag, you will be limited to the hardware limit of 32 concurrent kernels. Choosing particular architectures to compile for does not allow you to exceed the hardware limits of the machine.
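As an aside, to actually observe concurrent kernels you also have to launch them into separate non-default streams (and they must be small enough to share the GPU). A rough sketch, with a placeholder kernel and an arbitrary stream count:

    // Rough sketch: launch trivial kernels into separate streams so they *can* run
    // concurrently, up to the device's resident-grid limit (32 on compute capability 6.1).
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void busyKernel(float *out, int n) {
        // Placeholder work so each kernel stays resident long enough to overlap.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = (float)i;
            for (int k = 0; k < 10000; ++k) v = v * 1.0000001f + 0.0000001f;
            out[i] = v;
        }
    }

    int main() {
        const int numStreams = 8;   // illustrative; raising this beyond 32 gains nothing on CC 6.1
        const int n = 1 << 16;

        float *d_data = nullptr;
        cudaMalloc((void**)&d_data, (size_t)numStreams * n * sizeof(float));

        cudaStream_t streams[numStreams];
        for (int s = 0; s < numStreams; ++s) cudaStreamCreate(&streams[s]);

        for (int s = 0; s < numStreams; ++s)
            busyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_data + (size_t)s * n, n);

        cudaDeviceSynchronize();

        for (int s = 0; s < numStreams; ++s) cudaStreamDestroy(streams[s]);
        cudaFree(d_data);
        printf("done\n");
        return 0;
    }

The overlap (and the 32-grid ceiling) is easiest to confirm in a profiler timeline such as nvprof or Nsight Systems.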

What is the relation between compute units, SMXs, CUDA cores, etc.?

I'm quite confused with these terms... I understand that an nVidia GPU has some number of streaming multiprocessors (SMX), each consisting of a number of CUDA cores (streaming processors, SPs). However, I can't seem to figure out how this maps onto OpenCL compute units.
For example, my GeForce GTS 250 says it has 16 compute units. The official nVidia site says it has 128 CUDA cores. However, some papers say the compute unit itself is a core.
So which one is which? Also, which one of these executes an OpenCL workgroup? So far I thought a work group gets executed on a CUDA core. But the OpenCL spec says it gets executed on a compute unit (which should be an SMX then).
Honestly, WTF???
I would completely ignore the term 'core' when thinking about OpenCL, because different hardware vendors have different opinions about what it actually means (as you have already found out). Neither an SM nor a 'CUDA core' is directly comparable to a traditional CPU core.
For NVIDIA hardware, an SM is an OpenCL compute unit. Each work-group will therefore be assigned to an SM, although each SM is capable of running multiple work-groups concurrently.
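One way to see this mapping on your own machine is to print the SM count that the CUDA runtime reports and compare it with CL_DEVICE_MAX_COMPUTE_UNITS from the OpenCL runtime (e.g. via clinfo); on NVIDIA hardware the two numbers should match. A small sketch:

    // Sketch: print the SM count per CUDA device. On NVIDIA hardware this should
    // match the CL_DEVICE_MAX_COMPUTE_UNITS value reported through OpenCL.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);
        for (int dev = 0; dev < deviceCount; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            printf("Device %d: %s, %d SMs (OpenCL compute units)\n",
                   dev, prop.name, prop.multiProcessorCount);
        }
        return 0;
    }

For a GTS 250 this should report 16 SMs, which is where the "16 compute units" figure comes from (16 SMs x 8 CUDA cores per SM = 128 cores).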

Memory bandwidth test on Nvidia GPUs

I tried to use the code posted by NVIDIA to do a memory bandwidth test, but I got some surprising results.
The program used is here: https://developer.nvidia.com/content/how-optimize-data-transfers-cuda-cc
On a desktop (macOS):
Device: GeForce GT 650M
Transfer size (MB): 16
Pageable transfers
Host to Device bandwidth (GB/s): 4.053219
Device to Host bandwidth (GB/s): 5.707841
Pinned transfers
Host to Device bandwidth (GB/s): 6.346621
Device to Host bandwidth (GB/s): 6.493052
On a Linux server:
Device: Tesla K20c
Transfer size (MB): 16
Pageable transfers
Host to Device bandwidth (GB/s): 1.482011
Device to Host bandwidth (GB/s): 1.621912
Pinned transfers
Host to Device bandwidth (GB/s): 1.480442
Device to Host bandwidth (GB/s): 1.667752
BTW, I do not have root privileges.
I am not sure why it is lower on the Tesla device. Can anyone point out what the reason might be?
It is most likely that the GPU in your server isn't in a 16-lane PCI Express slot. I would expect a PCI-e v2.0 device like the K20c to be able to achieve between 4.5 and 5.5 GB/s peak throughput on a reasonably specified modern server (and probably about 6 GB/s on a desktop system with an integrated PCI-e controller). Your results look like you are hosting the GPU in a x16 slot with only 8 or even 4 active lanes.
There can also be other factors at work, like CPU-IOH affinity, which can increase the number of "hops" between the PCI-e bus hosting the GPU and the processor (and its memory) running the test. But providing further analysis would require more details about the configuration and hardware of the server, which is really beyond the scope of Stack Overflow.
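If you cannot get root access to inspect the slot wiring, you can still query the link the GPU actually negotiated from user space. A sketch using NVML (it ships with the driver, link with -lnvidia-ml, no root required), assuming a reasonably recent driver:

    // Sketch: report the PCIe generation and lane count each GPU is currently using.
    #include <cstdio>
    #include <nvml.h>

    int main() {
        if (nvmlInit() != NVML_SUCCESS) {
            printf("NVML init failed\n");
            return 1;
        }

        unsigned int count = 0;
        nvmlDeviceGetCount(&count);
        for (unsigned int i = 0; i < count; ++i) {
            nvmlDevice_t dev;
            nvmlDeviceGetHandleByIndex(i, &dev);

            char name[NVML_DEVICE_NAME_BUFFER_SIZE];
            nvmlDeviceGetName(dev, name, NVML_DEVICE_NAME_BUFFER_SIZE);

            unsigned int gen = 0, width = 0;
            nvmlDeviceGetCurrPcieLinkGeneration(dev, &gen);
            nvmlDeviceGetCurrPcieLinkWidth(dev, &width);
            printf("GPU %u (%s): PCIe gen %u, x%u link\n", i, name, gen, width);
        }

        nvmlShutdown();
        return 0;
    }

If this reports fewer than 16 active lanes (or a lower generation than expected), that would line up with the ~1.5 GB/s numbers above.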
A quick look at the Tesla K20c spec and the GT 650M spec can clarify things. The Tesla's PCIe interface is version 2.0, which is slower than the GT's PCIe 3.0 interface. So although the Tesla has more resources in terms of memory size and memory bus width, the slower PCIe link limits the host-device transfer bandwidth: the Tesla may be able to issue more memory operations than the GT, but the transfers stall on the PCIe interface.
Of course this may not be the only reason; for details I would look into the architectures of both cards, as I saw small differences (at least in the naming).
Edit #1: Referring to the comments below, apparently you can achieve PCIe 3.0 speeds on PCIe 2.0 boards. Check this.

PCI-e lane allocation on 2-GPU cards?

The data rate of cudaMemcpy operations is heavily influenced by the number of PCI-e 3.0 (or 2.0) lanes that are allocated to run from the CPU to GPU. I'm curious about how PCI-e lanes are used on Nvidia devices containing two GPUs.
Nvidia has a few products that have two GPUs on a single PCI-e device. For example:
The GTX 590 contains two Fermi GF110 GPUs
The GTX 690 contains two Kepler GK104 GPUs
As with many newer graphics cards, these devices mount in PCI-e x16 slots. For cards that contain only one GPU, the GPU can use all 16 PCI-e lanes.
If I have a device containing two GPUs (like the GTX 690), but I'm only running compute jobs on just one of the GPUs, can all 16 PCI-e lanes serve the one GPU that is being utilized?
To show this as ascii art...
[ GTX690 (2x GF110) ] ------16 PCI-e lanes ----- [ CPU ]
I'm not talking about the case where the CPU is connected to two cards that have one GPU each. (like the following diagram)
[ GTX670 (1x GK104) ] ------ PCI-e lanes ----- [ CPU ] ------ PCI-e lanes -----
[ GTX670 (1x GK104) ]
The GTX 690 uses a PLX PCIe Gen 3 bridge chip to connect the two GK104 GPUs to the host PCIe bus. There is a full x16 connection from the host to the PLX device, and from the PLX device to each GPU (the PLX device has a total of 48 lanes). Therefore, if you are only using one GPU, you can achieve approximately full x16 bandwidth to that GPU.
You can explore this with the bandwidthTest included in the CUDA samples. bandwidthTest will target a single GPU (of the two that are on the card; this is selectable via a command-line option), and you should see approximately full bandwidth depending on the system. If your system is Gen3 capable, you should see full PCIe x16 Gen 3 bandwidth (don't forget to use the --memory=pinned option), which will vary depending on the specific system but should be well north of 6 GB/s (probably in the 9-11 GB/s range). If your system is Gen2 capable, you should see something in the 4-6 GB/s range.
A similar statement can be made about the GTX 590, although it is a Gen2-only device and uses a different bridge chip. The results of bandwidthTest confirm that a full x16 logical path exists between the root port and either GPU. There is no free lunch, of course: you cannot get simultaneous full bandwidth to both GPUs; you are limited by the x16 slot.
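A quick way to see how the two GPUs on such a card appear to CUDA is to enumerate them with their PCI bus IDs; cudaSetDevice() then picks which of the two a context (and its memcopies) uses. A sketch (device numbering and bus IDs will vary by system):

    // Sketch: list CUDA devices with their PCI bus IDs. A dual-GPU board such as a
    // GTX 690 shows up as two separate devices sitting behind the on-board bridge.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);
        for (int dev = 0; dev < deviceCount; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            char busId[32];
            cudaDeviceGetPCIBusId(busId, (int)sizeof(busId), dev);
            printf("Device %d: %s at PCI %s\n", dev, prop.name, busId);
        }
        return 0;
    }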

How to estimate relative performance of CUDA gpus?

How can I estimate the CUDA performance of cards that I don't own, i.e. new cards?
For instance, I found an incomplete CUDA example, and the author wrote that it takes him 0.7 s on his GF 8600 GT. But on my Quadro it takes 1.7 s.
My question is: Is the code which I used to fill the gaps faulty or is the GF 8600 really twice as fast?
The kernel is memory bound, but my card has a higher memory bandwidth. I don't know what conclusions to draw from this.
Name                Quadro FX 580   GeForce 8600 GT
CUDA Cores          32              32
Core clock (MHz)    450             540
Memory clock (MHz)  400             700
Memory BW (GB/s)    25.6            22.4
Shader Clock (MHz)  ????            1180
Just want to provide you with some pointers that may be possible sources of error. Firstly, use cudaEvents to time your code, not the CUDA profiler, as cudaEvents is more accurate. Secondly, check what the author is measuring: is he only talking about the computation time, or is he also considering the time to transfer data to and from the GPU? Are you measuring the same time?
Thirdly, the CUDA architecture is changing quite fast. For example, for cards with cc 1.x it is suggested that we should use shared memory to get better performance; however, cards with cc 2.x have an L1 cache in each multiprocessor that makes global memory accesses quite fast. So you may also want to compare the architectures of the two cards and their compute capabilities.
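Regarding the first pointer above, a minimal cudaEvent timing pattern looks like the sketch below (the kernel and launch configuration are placeholders). Note that, recorded this way, it measures only the GPU work enqueued between the two events and not the host-device transfers, which is exactly the distinction to watch for.

    // Sketch of cudaEvent-based timing. myKernel and its launch configuration are
    // placeholders; only the work enqueued between the two events is measured.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void myKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *d_data = nullptr;
        cudaMalloc((void**)&d_data, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("Kernel time: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_data);
        return 0;
    }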