Memory bandwidth test on NVIDIA GPUs - CUDA

I tried to use the code posted by NVIDIA to do a memory bandwidth test, but I got some surprising results.
Program used is here : https://developer.nvidia.com/content/how-optimize-data-transfers-cuda-cc
On a desktop (with macOS):
Device: GeForce GT 650M
Transfer size (MB): 16
Pageable transfers
Host to Device bandwidth (GB/s): 4.053219
Device to Host bandwidth (GB/s): 5.707841
Pinned transfers
Host to Device bandwidth (GB/s): 6.346621
Device to Host bandwidth (GB/s): 6.493052
On a Linux server:
Device: Tesla K20c
Transfer size (MB): 16
Pageable transfers
Host to Device bandwidth (GB/s): 1.482011
Device to Host bandwidth (GB/s): 1.621912
Pinned transfers
Host to Device bandwidth (GB/s): 1.480442
Device to Host bandwidth (GB/s): 1.667752
BTW, I do not have root privileges.
I am not sure why it's lower on the Tesla device. Can anyone point out what the reason might be?
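For reference, the measurement in that post boils down to timing a cudaMemcpy with CUDA events. Here is a minimal sketch along those lines (not NVIDIA's exact code; only the pinned host-to-device case is shown):

// Minimal sketch: time a 16 MB host-to-device copy from a pinned buffer.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const unsigned int nElements = 4 * 1024 * 1024;   // 16 MB of floats
    const size_t bytes = nElements * sizeof(float);

    float *h_pinned, *d_a;
    cudaMallocHost(&h_pinned, bytes);                 // page-locked (pinned) host buffer
    cudaMalloc(&d_a, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_a, h_pinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host to Device bandwidth (GB/s): %f\n", bytes * 1e-6 / ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a);
    cudaFreeHost(h_pinned);
    return 0;
}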

It is most likely that the GPU in your server isn't in a 16-lane PCI Express slot. I would expect a PCIe v2.0 device like the K20c to achieve between 4.5 and 5.5 GB/s peak throughput on a reasonably specified modern server (and probably about 6 GB/s on a desktop system with an integrated PCIe controller). Your results look like you are hosting the GPU in an x16 slot with only 8 or even 4 active lanes.
There can also be other factors at work, like CPU-IOH affinity, which can increase the number of "hops" between the PCIe bus hosting the GPU and the processor and memory running the test. But providing further analysis would require more details about the configuration and hardware of the server, which is really beyond the scope of Stack Overflow.
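If you want to confirm the active link width without root access, you can query it through NVML. A rough sketch, assuming the NVML header and library are available (compile with -lnvidia-ml):

// Sketch: report the current PCIe generation and lane count of device 0 via NVML.
#include <stdio.h>
#include <nvml.h>

int main(void) {
    unsigned int gen = 0, width = 0;
    nvmlDevice_t dev;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);
    nvmlDeviceGetCurrPcieLinkGeneration(dev, &gen);
    nvmlDeviceGetCurrPcieLinkWidth(dev, &width);
    printf("PCIe link: Gen%u x%u\n", gen, width);
    nvmlShutdown();
    return 0;
}

nvidia-smi -q reports the same information under "GPU Link Info" if you'd rather not compile anything.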

A quick look at the Tesla K20c spec and the GT 650M spec clarifies things: the Tesla's PCIe interface is version 2.0, which is slower than the GT's PCIe 3.0 interface. Although the Tesla has more resources in terms of memory size and memory bus width, the host-device transfers in this test are limited by the PCIe interface. So the Tesla may be able to issue more memory operations than the GT, but the transfers stall on the slower PCIe link.
Of course this may not be the only reason, but for more detail I would look into the architectures of both cards, as I saw only small differences (at least in the naming).
Edit #1: Referring to the comments below, apparently you can achieve PCIe 3.0 speeds on PCIe 2.0 boards. Check this.

Related

How to access Managed memory simultaneously by CPU and GPU in compute capability 5.0?

Since simultaneous access to managed memory on devices of compute capability lower than 6.x is not possible (CUDA Toolkit Documentation), is there a way to simultaneously access managed memory by CPU and GPU with compute capability 5.0, or any method that can make the CPU access managed memory while a GPU kernel is running?
is there a way to simultaneously access managed memory by CPU and GPU with compute capability 5.0
No.
or any method that can make the CPU access managed memory while a GPU kernel is running.
Not on a compute capability 5.0 device.
You can have "simultaneous" CPU and GPU access to data using CUDA zero-copy techniques.
A full tutorial on both Unified Memory and pinned/mapped/zero-copy memory is well beyond the scope of what I can write in an answer here. Unified Memory has its own section in the programming guide. Both topics are covered extensively under the cuda tag on SO as well as in many other places on the web; most questions about them can be answered with a search.
In a nutshell, zero-copy memory on a 64-bit OS is allocated via a host-pinning API such as cudaHostAlloc(). The memory so allocated is host memory and always remains there, but it is accessible to the GPU. Access to this memory from the GPU occurs across the PCIe bus, so it is much slower than ordinary global memory access. The pointer returned by the allocation (on a 64-bit OS) is usable in both host and device code. You can study CUDA sample codes that use zero-copy techniques, such as simpleZeroCopy.
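As a rough illustration only (this is not the simpleZeroCopy sample; the kernel and names are mine), a zero-copy allocation and access could look like this:

// Sketch: mapped pinned ("zero-copy") memory. The buffer lives in host memory
// but is read and written directly by the kernel over PCIe.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;                       // each access crosses the PCIe bus
}

int main() {
    const int n = 1 << 20;
    int *h_data = nullptr, *d_data = nullptr;

    cudaSetDeviceFlags(cudaDeviceMapHost);         // allow mapped pinned allocations
    cudaHostAlloc(&h_data, n * sizeof(int), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_data[i] = i;

    // On 64-bit platforms with unified virtual addressing the host pointer can be
    // used directly in device code; cudaHostGetDevicePointer is the portable route.
    cudaHostGetDevicePointer(&d_data, h_data, 0);

    increment<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();                       // let the kernel finish before the host reads

    printf("h_data[0] = %d\n", h_data[0]);         // prints 1, no cudaMemcpy involved
    cudaFreeHost(h_data);
    return 0;
}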
By contrast, ordinary unified memory (UM) data is migrated to the processor that is using it. In the pre-Pascal UM regime, this migration is triggered by kernel launches and synchronizing operations, and simultaneous access by host and device is not possible. For Pascal and later devices in a proper post-Pascal UM regime (basically 64-bit Linux only, CUDA 8+), the data is migrated on demand, even during kernel execution, thus allowing a limited form of "simultaneous" access. Unified memory has various behavior modes, and some of them will cause a unified memory allocation to "decay" into a pinned/zero-copy host allocation under some circumstances.
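For comparison, a minimal managed-memory sketch (again illustrative only; the comments describe the pre-Pascal behavior that applies to a compute capability 5.0 device):

// Sketch: managed (unified) memory on a pre-Pascal device. The host must not
// touch the allocation between kernel launch and synchronization.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= f;
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;    // host access is fine before the launch

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);

    // On a compute capability 5.0 device the managed allocation is migrated to the
    // GPU at launch; reading data[] here, before synchronizing, is invalid (it
    // typically segfaults). Concurrent access only becomes legal on Pascal and later.
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);             // safe: the kernel has finished
    cudaFree(data);
    return 0;
}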

Peer-to-Peer CUDA transfers

I heard about peer-to-peer memory transfers and read something about them, but could not really understand how fast they are compared to standard PCI-E bus transfers.
I have a CUDA application which uses more than one GPU and I might be interested in P2P transfers. My question is: how fast is it compared to PCI-E? Can I use it often to have two devices communicate with each other?
A CUDA "peer" refers to another GPU that is capable of accessing data from the current GPU. All GPUs of compute capability 2.0 and greater have this feature.
Peer-to-peer memory copies involve using cudaMemcpy to copy memory over PCI-E, as shown below.
cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
Note that dst and src can be on different devices.
cudaDeviceEnablePeerAccess enables the user to launch a kernel that uses data from multiple devices. The memory accesses are still done over PCI-E and will have the same bottlenecks.
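To make the mechanics concrete, here is a rough sketch (device IDs 0 and 1 are assumed for illustration; this is not the sample code) of checking for, enabling, and using peer access:

// Sketch: check peer capability, enable it in both directions, then copy directly
// between two devices.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("peer access 0->1: %d, 1->0: %d\n", can01, can10);

    const size_t bytes = 64 * 1024 * 1024;
    float *d0 = nullptr, *d1 = nullptr;

    cudaSetDevice(0);
    cudaMalloc(&d0, bytes);
    if (can01) cudaDeviceEnablePeerAccess(1, 0);   // device 0 may now map device 1's memory

    cudaSetDevice(1);
    cudaMalloc(&d1, bytes);
    if (can10) cudaDeviceEnablePeerAccess(0, 0);   // device 1 may now map device 0's memory

    // Explicit device-to-device copy; with peer access enabled this goes over PCI-E
    // without being staged through host memory.
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}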
A good example of this is the simpleP2P code from the CUDA samples.

PCI-e lane allocation on 2-GPU cards?

The data rate of cudaMemcpy operations is heavily influenced by the number of PCI-e 3.0 (or 2.0) lanes that are allocated to run from the CPU to GPU. I'm curious about how PCI-e lanes are used on Nvidia devices containing two GPUs.
Nvidia has a few products that have two GPUs on a single PCI-e device. For example:
The GTX 590 contains two Fermi GF110 GPUs
The GTX 690 contains two Kepler GK104 GPUs
As with many newer graphics cards, these devices mount in PCI-e x16 slots. For cards that contain only one GPU, the GPU can use all 16 PCI-e lanes.
If I have a device containing two GPUs (like the GTX 690), but I'm only running compute jobs on just one of the GPUs, can all 16 PCI-e lanes serve the one GPU that is being utilized?
To show this as ascii art...
[ GTX690 (2x GK104) ] ------ 16 PCI-e lanes ----- [ CPU ]
I'm not talking about the case where the CPU is connected to two cards that have one GPU each. (like the following diagram)
[ GTX670 (1x GK104) ] ------ PCI-e lanes ----- [ CPU ] ------ PCI-e lanes ----- [ GTX670 (1x GK104) ]
The GTX 690 uses a PLX PCIe Gen 3 bridge chip to connect the two GK104 GPUs to the host PCIe bus. There is a full x16 connection from the host to the PLX device, and from the PLX device to each GPU (the PLX device has a total of 48 lanes). Therefore, if only one GPU is in use, you can achieve approximately full x16 bandwidth to that GPU.
You can explore this with the bandwidthTest that is included in the CUDA samples. bandwidthTest will target a single GPU (of the two that are on the card; this is selectable via a command-line option), and you should see approximately full bandwidth depending on the system. If your system is Gen 3 capable, you should see full PCIe x16 Gen 3 bandwidth (don't forget to use the --memory=pinned option), which will vary with the specific system but should be well north of 6 GB/s (probably in the 9-11 GB/s range). If your system is Gen 2 capable, you should see something in the 4-6 GB/s range.
A similar statement can be made about the GTX 590, although it is a Gen 2 only device and uses a different bridge chip. The results of bandwidthTest confirm that a full x16 logical path exists between the root port and either GPU. There is no free lunch, of course: you cannot get simultaneous full bandwidth to both GPUs, since you are limited by the single x16 slot.
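To see how the two GPUs on such a board are enumerated, here is a small sketch (not one of the shipped samples) that prints each device's name and PCI bus ID; on a GTX 690, two devices appear, both behind the bridge:

// Sketch: list CUDA devices with their PCI bus IDs.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        char busId[32];
        cudaGetDeviceProperties(&prop, dev);
        cudaDeviceGetPCIBusId(busId, (int)sizeof(busId), dev);
        printf("Device %d: %s, PCI bus ID %s\n", dev, prop.name, busId);
    }
    return 0;
}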

Strange results for CUDA SDK Bandwidth Test

I have a CUDA application that is data-movement bound (i.e. large memcopies from host to device with relatively little computation done in the kernel). On older GPUs I was compute-bound (e.g. the Quadro FX 5800), but with the Fermi and Kepler architectures that is no longer the case (for my optimized code).
I just moved my code to a GTX 680 and was impressed with the increased compute performance, but perplexed that the bandwidth between the host and the GPU seems to have dropped (relative to my Fermi M2070).
In short, when I run the canned SDK bandwidth test I get ~5000 MB/sec on the GTX 680 versus ~5700 MB/sec on the M2070. I recognize that the GTX is "only a gamer card", but the specs for the GTX 680 seem more impressive than those for the M2070, WITH THE EXCEPTION OF THE BUS WIDTH.
From wikipedia:
M2070: 102.4 GB/sec, GDDR3, 512 bit bus width
GTX 680: 192 GB/sec, GDDR5, 256 bit bus width
I'm running the canned test with "--wc --memory=pinned" to use write-combined memory.
The results I get with this test are mirrored by the results I am getting with my optimized CUDA code.
Unfortunately, I can't run the test on the same machine (and just switch video cards), but I have tried the GTX 680 on older and newer machines and get the same "degraded" results (relative to what I get on the M2070).
Can anyone confirm that they are able to achieve higher-throughput memcopies with the Tesla M2070 than with the GTX 680? Doesn't the bandwidth spec take the bus width into consideration? The other possibility is that I'm not doing the memcopies correctly/optimally on the GTX 680, but in that case, is there a patch for the bandwidth test so that it will also show that I'm transferring data faster to the 680 than to the M2070?
Thanks.
As Robert Crovella has already commented, your bottleneck is the PCIe bandwidth, not the GPU memory bandwidth.
Your GTX 680 can potentially outperform the M2070 by a factor of two here as it supports PCIe 3.0 which doubles the bandwidth over the PCIe 2.0 interface of the M2070. However you need a mainboard supporting PCIe 3.0 for that.
The bus width of the GPU memory is not a concern in itself, even for programs that are bound by GPU memory bandwidth. Nvidia substantially increased the frequency of the memory bus on the GTX 680, which more than compensates for the reduced bus width relative to the M2070.
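To make that concrete: peak memory bandwidth is the bus width (in bytes) times the effective data rate, and the GDDR5 on the GTX 680 runs at roughly 6 GT/s effective (about 1.5 GHz command clock, quad-pumped), so:

256 bit / 8 = 32 bytes per transfer
32 bytes x ~6.0 GT/s ≈ 192 GB/s

which matches the 192 GB/s figure quoted above despite the narrower bus.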

CUDA: Host to Device bandwidth greater than peak b/w of PCIe?

I had used the same plot, attached here, for another question. One can see that the peak bandwidth is more than 5.5 GB/s. I am using NVIDIA's bandwidth test program from the code samples to find the bandwidth from host to device and vice versa.
The system consists of a total of 12 Intel Westmere cores on two sockets and 4 Tesla C2050 GPUs in 4 PCIe Gen2 Express slots. Now the question is: since the peak bandwidth of PCIe x16 Gen2 is 4 GB/s in one direction, how come I am getting a higher bandwidth for host-to-device transfers?
I have in mind that each PCIe bus is connected to the CPU via an I/O Controller Hub, which in turn is connected to the CPU through QPI (which has much higher bandwidth).
The peak bandwidth of PCIe x16 Gen2 is 8GB/s in each direction. You are not exceeding the peak.
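For the arithmetic: PCIe Gen2 signals at 5 GT/s per lane and uses 8b/10b encoding, so:

5 GT/s x 8/10 = 4 Gbit/s = 0.5 GB/s per lane, per direction
0.5 GB/s x 16 lanes = 8 GB/s per direction

Measured host-to-device rates of around 5.5-6 GB/s are typical once protocol overhead is accounted for, so a peak of just over 5.5 GB/s is well within the spec.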