I use cudaMemcpy() once to copy exactly 1 GB of data to the device. This takes 5.9 s. The other way round it takes 5.1 s. Is this normal? Does the function itself have that much overhead before copying?
Theoretically, the PCIe bus should give a throughput of at least 4 GB/s.
There are no overlapping memory transfers, because the Tesla C870 simply does not support them. Any hints?
EDIT 2: my test program + updated timings; I hope it is not too much to read!
The cutCreateTimer() functions won't compile for me: 'error: identifier "cutCreateTimer" is undefined'. This could be related to the old CUDA version (2.0) installed on the machine.
#include <stdio.h>      /* printf */
#include <sys/time.h>   /* gettimeofday */

__host__ void time_int(int print){
    static struct timeval t1; /* var for previous time stamp */
    static struct timeval t2; /* var of current time stamp */
    double time;
    if(gettimeofday(&t2, 0) == -1) return;
    if(print != 0){
        time = (double) (t2.tv_sec - t1.tv_sec) + ((double) (t2.tv_usec - t1.tv_usec)) / 1000000.0;
        printf("%.3f s\n", time); /* illustrative format; prints the elapsed seconds */
    }
    t1 = t2;
}
main:
    time_int(0);                                     /* set the initial time stamp */
    void *x;
    cudaMallocHost(&x, 1073741824);                  /* 1 GB of pinned host memory */
    void *y;
    cudaMalloc(&y, 1073741824);                      /* 1 GB of device memory */
    time_int(1);                                     /* prints allocation time */
    cudaMemcpy(y, x, 1073741824, cudaMemcpyHostToDevice);
    time_int(1);                                     /* prints host-to-device copy time */
    cudaMemcpy(x, y, 1073741824, cudaMemcpyDeviceToHost);
    time_int(1);                                     /* prints device-to-host copy time */
Displayed timings are:
0.86 s allocation
0.197 s first copy
5.02 s second copy
The weird thing is: although it displays 0.197 s for the first copy, it takes much longer when I watch the program run.
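Since cutCreateTimer() isn't available here, a minimal sketch of timing the same copy with CUDA events instead (the elapsed time comes back in milliseconds; x and y are the buffers from the snippet above):

cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                              /* time stamp before the copy */
cudaMemcpy(y, x, 1073741824, cudaMemcpyHostToDevice);
cudaEventRecord(stop, 0);                               /* time stamp after the copy */
cudaEventSynchronize(stop);                             /* wait until the copy has really finished */
cudaEventElapsedTime(&ms, start, stop);
printf("H2D copy: %.3f s\n", ms / 1000.0f);

cudaEventDestroy(start);
cudaEventDestroy(stop);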
Yes, this is normal. cudaMemcpy() does a lot of checking and work (if the host memory was allocated with ordinary malloc() or mmap()). It has to check that every page of the data is resident in memory and move the pages (one by one) to the driver.
You can use the cudaHostAlloc function or cudaMallocHost to allocate memory instead of malloc. They allocate pinned memory, which is always kept in RAM and can be accessed directly by the GPU's DMA engine (so cudaMemcpy() is faster). Citing from the first link:
Allocates count bytes of host memory that is page-locked and accessible to the device. The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cudaMemcpy().
The only limiting factor is that the total amount of pinned memory in the system is limited (it cannot exceed the RAM size; it is better to use no more than RAM minus 1 GB):
Allocating excessive amounts of pinned memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to allocate staging areas for data exchange between host and device.
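As a minimal sketch (a hypothetical 256 MB staging buffer; error checking omitted), the pinned path looks like this; cudaMallocHost(&h_buf, bytes) would be equivalent to the default-flags cudaHostAlloc call:

void *h_buf;
void *d_buf;
size_t bytes = 256 * 1024 * 1024;                        /* 256 MB, placeholder size */
cudaHostAlloc(&h_buf, bytes, cudaHostAllocDefault);      /* page-locked (pinned) host memory */
cudaMalloc(&d_buf, bytes);
/* ... fill h_buf on the CPU ... */
cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice); /* DMA straight out of pinned RAM */
cudaFree(d_buf);
cudaFreeHost(h_buf);                                     /* pinned memory is freed with cudaFreeHost */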
Assuming the transfers are timed accurately, 1.1 seconds for a transfer of 1 GB from pinned memory seems slow. Are you sure the PCIe slot is configured to the correct width? For full performance, you'd want a x16 configuration. Some platforms provide two slots, one of which is configured as a x16, the other as a x4. So if your machine has two slots, you might want to try moving the card into the other slot. Other systems have two slots where you get x16 if only one slot is occupied, but two slots of x8 if both are occupied. The BIOS setup may help in figuring out how the PCIe slots are configured.
The Tesla C870 is rather old technology, but if I recall correctly, transfer rates of around 2 GB/s from pinned memory should be possible with these parts, which used a 1st-generation PCIe interface. Current Fermi-class GPUs use a PCIe gen 2 interface and can achieve 5+ GB/s for transfers from pinned memory (for throughput measurements, 1 GB/s = 10^9 bytes/s).
Note that PCIe uses a packetized transport, and the packet overhead can be significant at the packet sizes supported by common chipsets, with newer chipsets typically supporting somewhat longer packets. One is unlikely to exceed 70% of the nominal per-direction maximum (4 GB/s for PCIe 1.0 x16, 8 GB/s for PCIe 2.0 x16), even for transfers from / to pinned host memory. Here is a white paper that explains the overhead issue and has a handy graph showing the utilization achievable with various packet sizes:
http://www.plxtech.com/files/pdf/technical/expresslane/Choosing_PCIe_Packet_Payload_Size.pdf
Other than a system that just is not configured properly, the best explanation for dreadful PCIe bandwidth is a mismatch between IOH/socket and the PCIe slot that the GPU is plugged into.
Most multi-socket Intel i7-class (Nehalem, Westmere) motherboards have one I/O hub per socket. Since the system memory is directly connected to each CPU, DMA accesses that are "local" (fetching memory from the CPU connected to the same IOH as the GPU doing the DMA access) are much faster than nonlocal ones (fetching memory from the CPU connected to the other IOH, a transaction that has to be satisfied via the QPI interconnect that links the two CPUs).
IMPORTANT NOTE: unfortunately it is common for SBIOS's to configure systems for interleaving, which causes contiguous memory allocations to be interleaved between the sockets. This mitigates performance cliffs from local/nonlocal access for the CPUs (one way to think of it: it makes all memory accesses equally bad for both sockets), but wreaks havoc with GPU access to the data since it causes every other page on a 2-socket system to be nonlocal.
Nehalem and Westmere class systems don't seem to suffer from this problem if the system only has one IOH.
(By the way, Sandy Bridge class processors take another step down this path by integrating the PCI Express support into the CPU, so with Sandy Bridge, multi-socket machines automatically have multiple IOH's.)
You can investigate this hypothesis by either running your test using a tool that pins it to a socket (numactl on Linux, if it's available) or by using platform-dependent code to steer the allocations and threads to run on a specific socket. You can learn a lot without getting fancy - just call a function with global effects at the beginning of main() to force everything onto one socket or another, and see if that has a big impact on your PCIe transfer performance.
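For example, here is a minimal sketch of the "force everything onto one socket" approach on Linux with libnuma (link with -lnuma; the node number 0 is an assumption for a 2-socket machine). The no-code equivalent is running the test under numactl --cpunodebind=0 --membind=0.

#include <numa.h>   /* libnuma */

/* at the very top of main(): */
if (numa_available() >= 0) {
    numa_run_on_node(0);     /* restrict this process's threads to node 0's CPUs */
    numa_set_preferred(0);   /* prefer memory allocations from node 0's RAM */
}
/* ... then allocate the pinned buffers and time cudaMemcpy as before ... */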
CUDA device memory can be allocated using cudaMalloc/cudaFree, sure. This is fine, but primitive.
I'm curious to know, is device memory virtualised in some way? Are there equivalent operations to mmap, and more importantly, mremap for device memory?
If device memory is virtualised, I expect these sorts of functions should exist. It seems modern GPU drivers implement paging when there is contention for limited video resources by multiple processes, which suggests it's virtualised in some way or another...
Does anyone know where I can read more about this?
Edit:
Okay, my question was a bit general. I've read the bits of the manual that talk about mapping system memory for device access. I was more interested in device-allocated memory however.
Specific questions:
- Is there any possible way to remap device memory? (i.e., to grow a device allocation)
- Is it possible to map device allocated memory to system memory?
- Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the PCI-E bus?
I have cases where the memory is used by the GPU 99% of the time; so it should be device-local, but it may be convenient to map device memory to system memory for occasional structured read-back without having to implement an awkward deep-copy.
Yes, unified memory exists, however I'm happy with explicit allocation, save for the odd moment when I want a sneaky read-back.
I've found the manual fairly light on detail in general.
CUDA comes with a fine CUDA C Programming Guide as its main manual, which has sections on Mapped Memory as well as Unified Memory Programming.
Responding to your additional posted questions, and following your cue to leave UM out of the consideration:
Is there any possible way to remap device memory? (i.e., to grow a device allocation)
There is no direct method. You would have to manually create a new allocation of the desired size, copy the old data to it, then free the old allocation. If you expect to do this a lot, and don't mind the significant overhead associated with it, you could take a look at thrust device vectors, which hide some of the manual labor and allow you to resize an allocation in a single vector-style .resize() operation. There's no magic, however: thrust is just a template library built on top of CUDA C (for the CUDA device backend), so it is going to do a sequence of cudaMalloc and cudaFree operations, just as you would do "manually".
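A minimal sketch of the thrust approach (growing past the current capacity triggers a fresh allocation plus a device-to-device copy under the hood):

#include <thrust/device_vector.h>

thrust::device_vector<float> d_vec(1 << 20);   /* initial device allocation: 1M floats */
/* pass thrust::raw_pointer_cast(d_vec.data()) to kernels that need a raw pointer */
d_vec.resize(2 * (1 << 20));                   /* "grow": new allocation, copy old data, free the old one */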
Is it possible to map device allocated memory to system memory?
Leaving aside UM, no. Device memory cannot be mapped into host address space.
Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the PCI-E bus?
No, host-mapped data is never duplicated in device memory, and apart from L2 caching, mapped data needed by the GPU will always be fetched across the PCI-E bus.
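For reference, a minimal sketch of setting up mapped pinned memory (sizes and names are placeholders; error checking omitted):

size_t n = 1 << 20;                                                    /* 1M floats, placeholder size */
float *h_ptr = NULL, *d_ptr = NULL;
cudaSetDeviceFlags(cudaDeviceMapHost);                                 /* must be set before the context is created */
cudaHostAlloc((void**)&h_ptr, n * sizeof(float), cudaHostAllocMapped); /* pinned + mapped host memory */
cudaHostGetDevicePointer((void**)&d_ptr, h_ptr, 0);                    /* device-visible alias of the same memory */
/* kernels may dereference d_ptr directly; every such access crosses the PCI-E bus */
cudaFreeHost(h_ptr);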
With the CPU and memory, it's simple.
A process has a large virtual address space, which is partially mapped into physical memory. When the current process attempts to access a page that is not in physical memory, the OS steps in, chooses a page to evict (e.g. round-robin), writes it out to disk, then reads the required page in from swap, and control is returned to the process. This is straightforward, because the process cannot continue without that page.
GPU kernels are a different story.
Let's consider a use case:
A high-priority [CPU] process, namely X, makes a kernel call (which is a blocking call). At this moment, it is reasonable for the OS to switch contexts and give the CPU to a different process, namely Z. For the sake of the example, let process Z also do something heavy with the GPU.
Now, what does the GPU driver do? Does it stop the kernel that belongs to the [higher-priority] X? Does it inform the OS that Z isn't prioritized enough to offload kernels of X? In general, what happens when two processes need GPU resources, but the available GPU memory is sufficient to serve only one of them at a time?
CUDA GPUs context-switch cooperatively at a coarse granularity (think "memcpy" or "kernel launch"). If there is enough memory for both contexts, the hardware is happy to cooperatively context switch between them at a slight performance cost. (But because it's cooperative, long-running kernels will interfere with other kernels' execution.)
Modern GPUs do support virtual memory (i.e. memory protection through address translation), but they do NOT support demand paging. That means every piece of memory accessible to the GPU (device memory and mapped pinned memory) must be physically present and mapped after allocation.
The Windows Display Driver Model (WDDM) introduced in Windows Vista does paging at a very coarse granularity. The driver is required to track which "memory objects" are needed to execute a given command buffer, and the OS ensures that they are present. The OS can swap them out when not needed. The wrinkle with CUDA is that since pointers can be stored, all memory objects associated with the CUDA address space must be resident in order to run a CUDA kernel. So the paging doesn't work as well for CUDA as it does for graphics applications, which WDDM was designed to run.
Does cublasSetMatrix work if used in a separate pthread?
I want to overlap tasks on the CPU with CPU-to-GPU data transfer. However, since the data is quite large, I'm trying to avoid allocating a large amount of pinned memory.
No, it won't. GPU contexts are tied to the threads that created them. If you run cublasSetMatrix or cudaMemcpy in another thread without doing anything else, that thread will create another context. Memory allocations are not portable between contexts; effectively, every context has its own virtual address space. The result is that you wind up with two GPU contexts, and the copy will fail.
The requirement for pinned memory comes from the CUDA driver. For overlapping copying and execution, the host memory involved in the copying must be in a physical address range that the GPU can access by DMA over the PCI-e bus. That is why pinning is required, otherwise the host memory could be in swap space or other virtual memory, and the DMA transaction would fail.
If you are worried about the amount of pinned host memory required for large problems, try using one or two smaller pinned buffers and executing multiple transfers, using the pinned memory as staging buffers. The performance won't be quite as good as using a single large pinned buffer and one big transfer, but you can still achieve useful kernel/copy overlap and hide a lot of PCI-e latency in the process.
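A minimal sketch of that staging idea, using two pinned buffers and two streams (the 32 MB chunk size and the names are placeholders; error checking omitted):

#include <string.h>
#include <cuda_runtime.h>

/* copy 'total' bytes from pageable host memory 'src' to device memory 'dst'
   through two small pinned staging buffers */
static void staged_h2d(void *dst, const void *src, size_t total)
{
    const size_t chunk = 32u * 1024u * 1024u;
    char *stage[2];
    cudaStream_t stream[2];
    cudaHostAlloc((void**)&stage[0], chunk, cudaHostAllocDefault);
    cudaHostAlloc((void**)&stage[1], chunk, cudaHostAllocDefault);
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    int i = 0;
    for (size_t off = 0; off < total; off += chunk, i ^= 1) {
        size_t n = (total - off < chunk) ? (total - off) : chunk;
        cudaStreamSynchronize(stream[i]);                    /* buffer i is no longer in flight */
        memcpy(stage[i], (const char *)src + off, n);        /* pageable -> pinned (CPU) */
        cudaMemcpyAsync((char *)dst + off, stage[i], n,
                        cudaMemcpyHostToDevice, stream[i]);  /* pinned -> device (DMA), overlapped */
    }
    cudaStreamSynchronize(stream[0]);
    cudaStreamSynchronize(stream[1]);
    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
    cudaFreeHost(stage[0]);
    cudaFreeHost(stage[1]);
}

The CPU-side memcpy into one staging buffer overlaps with the DMA of the previous chunk on the other stream, which is where the hidden PCI-e latency comes from.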
I am a beginner in parallel programming. I have a query which might seem silly, but I didn't get a definitive answer when I googled it.
In GPU computing there is a device, i.e. the GPU, and the host, i.e. the CPU. I wrote a simple hello world program which will allocate some memory on the GPU, pass two parameters (say src[] and dest[]) to the kernel, copy the src string, i.e. "Hello world", to the dest string, and get the dest string back from the GPU to the host.
Is the string "src" read by the GPU, or does the CPU write it to the GPU? Also, when we get the string back from the GPU, is the GPU writing to the CPU or the CPU reading from the GPU?
In transferring the data back and forth there can be four possibilities
1. CPU to GPU
- CPU writes to GPU
- GPU reads from CPU
2. GPU to CPU
- GPU writes to the CPU
- CPU reads from GPU
Can someone please explain which of these are possible and which are not?
In earlier versions of CUDA and corresponding hardware models, the GPU was more strictly a coprocessor owned by the CPU; the CPU wrote information to the GPU, and read the information back when the GPU was ready. At the lower level, this meant that really all four things were happening: the CPU wrote data to PCIe, the GPU read data from PCIe, the GPU then wrote data to PCIe, and the CPU read back the result. But transactions were initiated by the CPU.
More recently (CUDA 3? 4? maybe even beginning in 2?), some of these details are hidden from the application level, so that, effectively, GPU code can cause transfers to be initiated in much the same way as the CPU can. Consider unified virtual addressing, whereby programmers can access a unified virtual address space for CPU and GPU memory. When the GPU requests memory in the CPU space, this must initiate a transfer from the CPU, essentially reading from the CPU. The ability to put data onto the GPU from the CPU side is also retained. Basically, all ways are possible now, at the top level (at low levels, it's largely the same sort of protocol as always: both read from and write to the PCIe bus, but now, GPUs can initiate transactions as well).
Actually none of these.
Your CPU code initiates the copy of the data, but the data itself is transferred by the memory controller to the memory of the GPU through whatever bus you have on your system. Meanwhile, the CPU can process other data.
Similarly, when the GPU has finished running the kernels you launched, your CPU code initiates the copy of data, but meanwhile both GPU and CPU can handle other data or run other code.
The copies are called asynchronous or non-blocking. You can optionally do blocking copies, in which the CPU waits for the copy to be completed.
When launching asynchronous tasks, you usually register an "event", which is some kind of flag that you can check later on, to see if the task is finished or not.
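A minimal sketch of that pattern with the CUDA runtime (d_buf, h_pinned and bytes are placeholders; the host buffer must be pinned for cudaMemcpyAsync to actually overlap with CPU work):

cudaStream_t s;
cudaEvent_t done;
cudaStreamCreate(&s);
cudaEventCreate(&done);

cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, s); /* returns immediately */
cudaEventRecord(done, s);                                           /* the "flag" that follows the copy */

/* ... do other CPU work here ... */

while (cudaEventQuery(done) == cudaErrorNotReady) {
    /* copy still in flight: keep working or yield */
}

cudaEventDestroy(done);
cudaStreamDestroy(s);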
In OpenCL the host (CPU) exclusively controls all transfers of data between the host and the GPU. The host transfers data to the GPU using buffers, and reads data back from the GPU using buffers. For some systems and devices, the transfer isn't physically copying bytes, as the host and the GPU use the same physical memory. This is called zero copy.
I just found out in this forum http://devgurus.amd.com/thread/129897 that using CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR in clCreateBuffer allocates memory on the host and that it won't be copied to the device.
There may be a performance issue, but this is what I am looking for. Your comments please.
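A minimal sketch of that zero-copy setup (OpenCL host API; ctx, queue and bytes are placeholders, and CL_MEM_ALLOC_HOST_PTR is used on its own here, since adding CL_MEM_COPY_HOST_PTR additionally requires a non-NULL host pointer to copy from):

cl_int err;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                            bytes, NULL, &err);              /* backing store lives in host memory */
void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                             0, bytes, 0, NULL, NULL, &err); /* get a CPU pointer to it */
/* ... write the data through p directly, no separate host copy ... */
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);       /* hand it back to the device */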
I'm writing a server process that performs calculations on a GPU using CUDA. I want to queue up incoming requests until enough memory is available on the device to run the job, but I'm having a hard time figuring out how much memory I can allocate on the device. I have a pretty good estimate of how much memory a job requires (at least how much will be allocated from cudaMalloc()), but I get "device out of memory" long before I've allocated the total amount of global memory available.
Is there some kind of formula to compute, from the total global memory, the amount I can allocate? I can play with it until I get an estimate that works empirically, but I'm concerned my customers will deploy different cards at some point and my jerry-rigged numbers won't work very well.
The size of your GPU's DRAM is an upper bound on the amount of memory you can allocate through cudaMalloc, but there's no guarantee that the CUDA runtime can satisfy a request for all of it in a single large allocation, or even a series of small allocations.
The constraints of memory allocation vary depending on the details of the underlying driver model of the operating system. For example, if the GPU in question is the primary display device, then it's possible that the OS has also reserved some portion of the GPU's memory for graphics. Other implicit state the runtime uses (such as the heap) also consumes memory resources. It's also possible that the memory has become fragmented and no contiguous block large enough to satisfy the request exists.
The CUDART API function cudaMemGetInfo reports the free and total amount of memory available. As far as I know, there's no similar API call which can report the size of the largest satisfiable allocation request.
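A minimal sketch of using it, together with an empirical probe for the largest single allocation that currently succeeds (the probe is a workaround, not an official API; the 16 MB step is arbitrary):

size_t free_b = 0, total_b = 0;
cudaMemGetInfo(&free_b, &total_b);                 /* free and total device memory, in bytes */
printf("free: %zu MB, total: %zu MB\n", free_b >> 20, total_b >> 20);

/* step down from the free size until a cudaMalloc succeeds */
size_t step = 16u << 20;
size_t sz = free_b - (free_b % step);
void *p = NULL;
while (sz > 0 && cudaMalloc(&p, sz) != cudaSuccess) {
    p = NULL;
    cudaGetLastError();                            /* clear the error left by the failed attempt */
    sz -= step;
}
if (p) {
    printf("largest single block right now: %zu MB\n", sz >> 20);
    cudaFree(p);
}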