CUDA: Host to Device bandwidth greater than peak b/w of PCIe?

I have used the same plot, attached here, for another question. One can see that the peak bandwidth is more than 5.5 GB/s. I am using NVIDIA's bandwidthTest program from the code samples to measure the bandwidth from host to device and vice versa.
The system consists of a total of 12 Intel Westmere CPUs on two sockets and 4 Tesla C2050 GPUs with 4 PCIe Gen2 slots. Now the question is: since the peak bandwidth of PCIe x16 Gen2 is 4 GB/s in one direction, how come I am getting a much higher bandwidth for host-to-device transfers?
My understanding is that each PCIe slot is connected to an I/O Hub (IOH), which in turn is connected to the CPU through QPI (which has much higher bandwidth).

The peak bandwidth of PCIe x16 Gen2 is 8 GB/s in each direction, so you are not exceeding the peak: each Gen2 lane runs at 5 GT/s with 8b/10b encoding, i.e. 500 MB/s of payload per lane per direction, and 16 lanes therefore give 8 GB/s per direction.

Related

Peer-to-Peer CUDA transfers

I have heard about peer-to-peer memory transfers and read something about them, but could not really understand how fast they are compared to standard PCI-E bus transfers.
I have a CUDA application which uses more than one GPU and I might be interested in P2P transfers. My question is: how fast is it compared to PCI-E? Can I use it often to have two devices communicate with each other?
A CUDA "peer" refers to another GPU that is capable of accessing data from the current GPU. All GPUs with compute 2.0 and greater have this feature enabled.
Peer to peer memory copies involve using cudaMemcpy to copy memory over PCI-E as shown below.
cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
Note that dst and src can be on different devices.
cudaDeviceEnablePeerAccess enables the user to launch a kernel that uses data from multiple devices. The memory accesses are still done over PCI-E and will have the same bottlenecks.
A good example of this is the simpleP2P sample from the CUDA samples.
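As an illustration, here is a minimal sketch (not taken from the samples) of how peer access is typically enabled and used between devices 0 and 1; the buffer size and variable names are made up for the example:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int can01 = 0, can10 = 0;
    const size_t bytes = 1 << 20;                 /* 1 MB, arbitrary */

    /* Check whether the two GPUs can address each other's memory. */
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) {
        printf("P2P not supported between devices 0 and 1\n");
        return 1;
    }

    /* Enable peer access in both directions (the flags argument must be 0). */
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);

    /* Allocate one buffer on each device. */
    void *buf0, *buf1;
    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);
    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    /* Copy directly from device 0 to device 1; with peer access enabled
       this goes over PCI-E without being staged through host memory. */
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}

The copy is still limited by the PCI-E links between the two devices, as noted above; what peer access buys you is avoiding the extra trip through host memory.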

Memory bandwidth test on Nvidia GPUs

I tried to use the code posted by NVIDIA to do a memory bandwidth test, but I got some surprising results.
Program used is here : https://developer.nvidia.com/content/how-optimize-data-transfers-cuda-cc
On a desktop (with MacOS):
Device: GeForce GT 650M
Transfer size (MB): 16
Pageable transfers
Host to Device bandwidth (GB/s): 4.053219
Device to Host bandwidth (GB/s): 5.707841
Pinned transfers
Host to Device bandwidth (GB/s): 6.346621
Device to Host bandwidth (GB/s): 6.493052
On a Linux server:
Device: Tesla K20c
Transfer size (MB): 16
Pageable transfers
Host to Device bandwidth (GB/s): 1.482011
Device to Host bandwidth (GB/s): 1.621912
Pinned transfers
Host to Device bandwidth (GB/s): 1.480442
Device to Host bandwidth (GB/s): 1.667752
BTW, I do not have root privileges.
I am not sure why it is lower on the Tesla device. Can anyone point out what the reason might be?
It is most likely that the GPU in your server isn't in a 16-lane PCI express slot. I would expect a PCI-e v2.0 device like the K20c to be able to achieve between 4.5 and 5.5 GB/s peak throughput on a reasonably specified modern server (probably about 6 GB/s on a desktop system with an integrated PCI-e controller). Your results look like you are hosting the GPU in a 16x slot with only 8 or even 4 active lanes.
There can also be other factors at work, like CPU-IOH affinity, which can increase the number of "hops" between the PCI-e bus hosting the GPU and the processor and memory running the test. But providing further analysis would require more details about the configuration and hardware of the server, which is really beyond the scope of StackOverflow.
A quick look at the Tesla K20c spec and the GT 650M spec can clarify things. The PCIe interface of the Tesla is version 2.0, which is slower than the GT's PCIe 3.0 interface. So although the Tesla has more resources in terms of memory size and memory bus width, its PCIe interface limits the host-to-device bandwidth: the Tesla may be able to issue more memory transfers than the GT, but they stall because of the slower PCIe link.
Of course this may not be the only reason; for details I would explore the architectures of both cards, as I only saw a small difference (at least in the naming).
Edit #1: Per the comments below, you apparently can achieve PCIe 3.0 speeds on PCIe 2.0 boards. Check this
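For reference, here is a minimal sketch of the kind of measurement the linked program performs (this is not NVIDIA's code): it times a host-to-device copy from pageable and from pinned memory with CUDA events. The 16 MB transfer size mirrors the output above; the function name and everything else are made up for the sketch.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

/* Time a single host-to-device copy of 'bytes' bytes and return GB/s. */
static float htod_gbps(const void *host, void *dev, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (bytes / 1.0e9f) / (ms / 1000.0f);
}

int main() {
    const size_t bytes = 16 * 1024 * 1024;   /* 16 MB, as in the output above */
    void *dev, *pageable, *pinned;

    cudaMalloc(&dev, bytes);
    pageable = malloc(bytes);                /* ordinary pageable allocation */
    cudaMallocHost(&pinned, bytes);          /* page-locked (pinned) allocation */

    printf("Pageable H2D: %.2f GB/s\n", htod_gbps(pageable, dev, bytes));
    printf("Pinned   H2D: %.2f GB/s\n", htod_gbps(pinned, dev, bytes));

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(dev);
    return 0;
}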

PCI-e lane allocation on 2-GPU cards?

The data rate of cudaMemcpy operations is heavily influenced by the number of PCI-e 3.0 (or 2.0) lanes that are allocated to run from the CPU to GPU. I'm curious about how PCI-e lanes are used on Nvidia devices containing two GPUs.
Nvidia has a few products that have two GPUs on a single PCI-e device. For example:
The GTX 590 contains two Fermi GF110 GPUs
The GTX 690 contains two Kepler GK104 GPUs
As with many newer graphics cards, these devices mount in PCI-e x16 slots. For cards that contain only one GPU, that GPU can use all 16 PCI-e lanes.
If I have a device containing two GPUs (like the GTX 690), but I'm running compute jobs on only one of the GPUs, can all 16 PCI-e lanes serve the one GPU that is being utilized?
To show this as ASCII art...
[ GTX690 (2x GK104) ] ------ 16 PCI-e lanes ----- [ CPU ]
I'm not talking about the case where the CPU is connected to two cards that have one GPU each, like the following diagram:
[ GTX670 (1x GK104) ] ------ PCI-e lanes ----- [ CPU ] ------ PCI-e lanes ----- [ GTX670 (1x GK104) ]
The GTX 690 uses a PLX PCIe Gen 3 bridge chip to connect the two GK104 GPUs to the host PCIe bus. There is a full x16 connection from the host to the PLX device, and from the PLX device to each GPU (the PLX device has a total of 48 lanes). Therefore, if you are only using one GPU, you can achieve approximately full x16 bandwidth to that GPU.

You can explore this using the bandwidthTest included in the CUDA samples. bandwidthTest targets a single GPU (which of the two on the card is selectable via a command-line option), and you should see approximately full bandwidth depending on the system. If your system is Gen3 capable, you should see full PCIe x16 Gen 3 bandwidth (don't forget to use the --memory=pinned option), which will vary with the specific system but should be well north of 6 GB/s (probably in the 9-11 GB/s range). If your system is Gen2 capable, you should see something in the 4-6 GB/s range.

A similar statement can be made about the GTX 590, except that it is a Gen2-only device and uses a different bridge chip. The results of bandwidthTest confirm that a full x16 logical path exists between the root port and either GPU. There is no free lunch, of course: you cannot get simultaneous full bandwidth to both GPUs, since you are limited by the single x16 slot.
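As a quick sanity check, one can enumerate the CUDA devices and print their PCI addresses; on a dual-GPU board the two GPUs show up as separate devices behind the bridge. A minimal sketch (the output format is illustrative):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        /* The two GPUs of a GTX 690 appear as separate devices with
           distinct PCI bus/device IDs behind the PLX bridge. */
        printf("Device %d: %s (PCI %04x:%02x:%02x)\n",
               dev, prop.name,
               prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}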

CUDA: Differences between HtoD and DtoH bandwidth

Yet another bandwidth-related question. I expected the plot of device-to-host bandwidth and that of host-to-device to be similar, but I see that there is a significant difference between the two. Since both follow the same route, shouldn't the effective bandwidth be the same? The testbed consists of a total of 12 Intel Westmere CPUs on two sockets and 4 Tesla C2050 GPUs with 4 PCIe Gen2 slots, using the bandwidthTest program from the NVIDIA code samples.
What are the overheads of doing a cudaMemcpy from the host vs. from the device?
First, I would say those two curves are similar. I can honestly say that I've never seen symmetric PCI-e bandwidth on any system I have used -- and that includes both CUDA and graphics (OpenGL/D3D) tests, so I don't think it's something (especially this small difference) that should concern you.
As with your other PCI-e bandwidth question, the answer is similar -- the driver may use different strategies for different types and sizes of transfers, attempting to get the highest throughput possible.
Actual throughput depends on many factors, including the type of GPU, and especially on the host chipset in use.

cudaMemcpy too slow

I use cudaMemcpy() one time to copy exactly 1 GB of data to the device. This takes 5.9 s. The other way round it takes 5.1 s. Is this normal? Does the function itself have that much overhead before copying?
Theoretically the PCIe bus should provide a throughput of at least 4 GB/s.
There is no overlapping of memory transfers because the Tesla C870 simply does not support it. Any hints?
EDIT 2: my test program + updated timings; I hope it is not too much to read!
The cutCreateTimer() functions won't compile for me: 'error: identifier "cutCreateTimer" is undefined' - this could be related to the old CUDA version (2.0) installed on the machine.
#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

__host__ void time_int(int print){
    static struct timeval t1;    /* var for previous time stamp */
    static struct timeval t2;    /* var for current time stamp */
    double time;
    if(gettimeofday(&t2, 0) == -1) return;
    if(print != 0){
        time = (double) (t2.tv_sec - t1.tv_sec) + ((double) (t2.tv_usec - t1.tv_usec)) / 1000000.0;
        printf("%.3f s\n", time);    /* elapsed time since the previous call */
    }
    t1 = t2;
}

In main:

time_int(0);
void *x;
cudaMallocHost(&x, 1073741824);    /* 1 GB of pinned host memory */
void *y;
cudaMalloc(&y, 1073741824);        /* 1 GB of device memory */
time_int(1);
cudaMemcpy(y, x, 1073741824, cudaMemcpyHostToDevice);
time_int(1);
cudaMemcpy(x, y, 1073741824, cudaMemcpyDeviceToHost);
time_int(1);
Displayed timings are:
0.86 s allocation
0.197 s first copy
5.02 s second copy
The weird thing is: although it displays 0.197 s for the first copy, it takes much longer if I watch the program run.
Yes, this is normal. cudaMemcpy() does a lot of checking and work (if the host memory was allocated with ordinary malloc() or mmap()). It has to check that every page of the data is resident in memory and move the pages (one by one) to the driver.
You can use the cudaHostAlloc function or cudaMallocHost to allocate the memory instead of malloc. This allocates pinned memory, which is always resident in RAM and can be accessed directly by the GPU's DMA engine, making cudaMemcpy() faster. Citing from the first link:
Allocates count bytes of host memory that is page-locked and accessible to the device. The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cudaMemcpy().
The only limiting factor is that the total amount of pinned memory in a system is limited (no more than the RAM size; it is better to use no more than RAM minus 1 GB):
Allocating excessive amounts of pinned memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to allocate staging areas for data exchange between host and device.
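As a small illustration of the pageable vs. pinned distinction, an existing malloc'd buffer can also be page-locked in place with cudaHostRegister, which requires a newer toolkit (CUDA 4.0+) than the 2.0 version mentioned in the question; this is just a sketch, with the 1 GB size taken from the question and everything else assumed:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 1073741824;        /* 1 GB, as in the question */

    /* Ordinary pageable allocation... */
    void *host = malloc(bytes);
    void *dev  = NULL;
    cudaMalloc(&dev, bytes);

    /* ...page-locked in place so the GPU's DMA engine can access it directly.
       (cudaHostRegister needs CUDA 4.0 or later, i.e. newer than the 2.0
       toolkit mentioned in the question.) */
    cudaHostRegister(host, bytes, cudaHostRegisterDefault);

    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);   /* pinned-speed copy */

    cudaHostUnregister(host);
    cudaFree(dev);
    free(host);
    return 0;
}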
Assuming the transfers are timed accurately, 1.1 seconds for a transfer of 1 GB from pinned memory seems slow. Are you sure the PCIe slot is configured to the correct width? For full performance you want an x16 configuration. Some platforms provide two slots, one of which is configured as x16 and the other as x4, so if your machine has two slots you might want to try moving the card into the other slot. Other systems have two slots where you get x16 if only one slot is occupied, but two x8 slots if both are occupied. The BIOS setup may help in figuring out how the PCIe slots are configured.
The Tesla C870 is rather old technology, but if I recall correctly, transfer rates of around 2 GB/s from pinned memory should be possible with these parts, which used a first-generation PCIe interface. Current Fermi-class GPUs use a PCIe gen 2 interface and can achieve 5+ GB/s for transfers from pinned memory (for throughput measurements, 1 GB/s = 10^9 bytes/s).
Note that PCIe uses a packetized transport, and the packet overhead can be significant at the packet sizes supported by common chipsets, with newer chipsets typically supporting somewhat longer packets. One is unlikely to exceed 70% of the nominal per-direction maximum (4 GB/s for PCIe 1.0 x16, 8 GB/s for PCIe 2.0 x16), even for transfers from / to pinned host memory. Here is a white paper that explains the overhead issue and has a handy graph showing the utilization achievable with various packet sizes:
http://www.plxtech.com/files/pdf/technical/expresslane/Choosing_PCIe_Packet_Payload_Size.pdf
Other than a system that just is not configured properly, the best explanation for dreadful PCIe bandwidth is a mismatch between IOH/socket and the PCIe slot that the GPU is plugged into.
Most multi-socket Intel i7-class (Nehalem, Westmere) motherboards have one I/O hub per socket. Since the system memory is directly connected to each CPU, DMA accesses that are "local" (fetching memory from the CPU connected to the same IOH as the GPU doing the DMA access) are much faster than nonlocal ones (fetching memory from the CPU connected to the other IOH, a transaction that has to be satisfied via the QPI interconnect that links the two CPUs).
IMPORTANT NOTE: unfortunately it is common for SBIOS's to configure systems for interleaving, which causes contiguous memory allocations to be interleaved between the sockets. This mitigates performance cliffs from local/nonlocal access for the CPUs (one way to think of it: it makes all memory accesses equally bad for both sockets), but wreaks havoc with GPU access to the data since it causes every other page on a 2-socket system to be nonlocal.
Nehalem and Westmere class systems don't seem to suffer from this problem if the system only has one IOH.
(By the way, Sandy Bridge class processors take another step down this path by integrating the PCI Express support into the CPU, so with Sandy Bridge, multi-socket machines automatically have multiple IOH's.)
You can investigate this hypothesis either by running your test under a tool that pins it to a socket (numactl on Linux, if it's available) or by using platform-dependent code to steer the allocations and threads to a specific socket. You can learn a lot without getting fancy: just call a function with global effects at the beginning of main() to force everything onto one socket or the other, and see whether that has a big impact on your PCIe transfer performance (see the sketch below).
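Here is a minimal Linux-only sketch of such a "function with global effects", assuming, purely for illustration, that cores 0-5 belong to the socket whose IOH hosts the GPU; the core list has to be adapted to the real topology (numactl --hardware or lscpu will show it):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the whole process to one socket before any CUDA allocation, so that
   host buffers (via first-touch) and the bandwidth test both stay "local"
   to the IOH that hosts the GPU. Cores 0-5 are an assumed layout. */
static void pin_to_socket0(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu <= 5; ++cpu)
        CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");
}

int main(void) {
    pin_to_socket0();   /* global effect: later allocations and threads stay on socket 0 */
    /* ... run the usual pinned-memory bandwidth measurement here ... */
    return 0;
}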