Pinned memory in Nvidia CUDA

I'm writing a matrix addition program for GPUs using streams, and therefore pinned memory. I allocated 3 matrices in pinned memory, but beyond a particular matrix dimension it shows API error 2: out of memory. My RAM is 4 GB, but I'm not able to use more than about 800 MB. Is there any way to control this upper limit?
My system config:
nVidia GeForce 9800 GTX
Intel Core 2 Quad
For streamed execution, the code looks as follows:
for (int i = 0; i < no_of_streams; i++)
{
    cudaMemcpyAsync(device_a + i*(n/no_of_streams), hAligned_on_host_a + i*(n/no_of_streams), nbytes/no_of_streams, cudaMemcpyHostToDevice, streams[i]);
    cudaMemcpyAsync(device_b + i*(n/no_of_streams), hAligned_on_host_b + i*(n/no_of_streams), nbytes/no_of_streams, cudaMemcpyHostToDevice, streams[i]);
    cudaMemcpyAsync(device_c + i*(n/no_of_streams), hAligned_on_host_c + i*(n/no_of_streams), nbytes/no_of_streams, cudaMemcpyHostToDevice, streams[i]);
    matrixAddition<<<blocks, threads, 0, streams[i]>>>(device_a + i*(n/no_of_streams), device_b + i*(n/no_of_streams), device_c + i*(n/no_of_streams));
    cudaMemcpyAsync(hAligned_on_host_a + i*(n/no_of_streams), device_a + i*(n/no_of_streams), nbytes/no_of_streams, cudaMemcpyDeviceToHost, streams[i]);
    cudaMemcpyAsync(hAligned_on_host_b + i*(n/no_of_streams), device_b + i*(n/no_of_streams), nbytes/no_of_streams, cudaMemcpyDeviceToHost, streams[i]);
    cudaMemcpyAsync(hAligned_on_host_c + i*(n/no_of_streams), device_c + i*(n/no_of_streams), nbytes/no_of_streams, cudaMemcpyDeviceToHost, streams[i]);
}
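The matrixAddition kernel is not shown in the question; a minimal sketch of an element-wise addition kernel consistent with this launch, assuming a 1D configuration and that the result goes into the third argument (both assumptions, not the asker's actual code), might look like:

__global__ void matrixAddition(const float *a, const float *b, float *c)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // A real kernel would also take the per-stream element count and bounds-check idx against it.
    c[idx] = a[idx] + b[idx];
}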

So, you haven't specified whether this happens after the cudaMalloc or the cudaHostAlloc function calls.
Pinned memory is a limited resource. Any memory defined as pinned must always stay resident in RAM. As such, that leaves less room in RAM for other system applications. This means you can't have 4 GB of pinned memory if you have 4 GB of RAM, or nothing else could run.
800MB might be a system imposed limit. Considering it's a quarter of your RAM, it might be a reasonable limit. It is also quite close to the size of your global memory. A failure on the card wouldn't translate to a failure on the host, so if it's complaining without having to run something like cudaGetLastError, it's probably a problem on the host.
Sorry, I don't know the specifics of increasing your pinned memory limit.
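One way to narrow down where the failure occurs is to check the return status of each allocation directly; here is a minimal sketch (the buffer size and pointer names are placeholders, not the asker's variables):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t nbytes = 256 * 1024 * 1024;   // placeholder size; raise it toward the failing dimension
    float *h_a = NULL, *d_a = NULL;

    // Host-side (pinned) allocation.
    cudaError_t err = cudaHostAlloc((void**)&h_a, nbytes, cudaHostAllocDefault);
    printf("cudaHostAlloc: %s\n", cudaGetErrorString(err));

    // Device-side allocation.
    err = cudaMalloc((void**)&d_a, nbytes);
    printf("cudaMalloc:    %s\n", cudaGetErrorString(err));

    cudaFree(d_a);
    cudaFreeHost(h_a);
    return 0;
}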

Related

Why does mxnet's GPU version cost more memory than the CPU version?

I made a very simple network using mxnet (two FC layers with a dimension of 512).
By changing ctx = mx.cpu() or ctx = mx.gpu(0), I run the same code on both CPU and GPU.
The memory cost of the GPU version is much bigger than that of the CPU version (I checked this using 'top' rather than 'nvidia-smi').
It seems strange: the GPU version already has memory on the GPU, so why does it still need more host memory?
(line 1 is CPU program / line 2 is GPU program)
It may be due to differences in how much time each process was running.
Looking at your screenshot, CPU process has 5:48.85 while GPU has 9:11.20 - so the GPU training was running almost double the time which could be the reason.
When running on the GPU you are loading a bunch of different lower-level libraries into memory (CUDA, cuDNN, etc.), which are allocated first in your RAM. If your network is very small, as in your current case, the overhead of loading the libraries into RAM will be higher than the cost of storing the network weights in RAM.
For any more sizable network, when running on CPU the amount of memory used by the weights will be significantly larger than the libraries loaded in memory.

CUDA malloc, mmap/mremap

CUDA device memory can be allocated using cudaMalloc/cudaFree, sure. This is fine, but primitive.
I'm curious to know, is device memory virtualised in some way? Are there equivalent operations to mmap, and more importantly, mremap for device memory?
If device memory is virtualised, I expect these sorts of functions should exist. It seems modern GPU drivers implement paging when there is contention for limited video resources by multiple processes, which suggests it's virtualised in some way or another...
Does anyone know where I can read more about this?
Edit:
Okay, my question was a bit general. I've read the bits of the manual that talk about mapping system memory for device access. I was more interested in device-allocated memory however.
Specific questions:
- Is there any possible way to remap device memory? (ie, to grow a device allocation)
- Is it possible to map device allocated memory to system memory?
- Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the pci-e bus?
I have cases where the memory is used by the GPU 99% of the time; so it should be device-local, but it may be convenient to map device memory to system memory for occasional structured read-back without having to implement an awkward deep-copy.
Yes, unified memory exists; however, I'm happy with explicit allocation, save for the odd moment when I want a sneaky read-back.
I've found the manual fairly light on detail in general.
CUDA comes with a fine CUDA C Programming Guide as its main manual, which has sections on Mapped Memory as well as Unified Memory Programming.
Responding to your additional posted questions, and following your cue to leave UM out of the consideration:
Is there any possible way to remap device memory? (ie, to grow a device allocation)
There is no direct method. You would have to manually create a new allocation of the desired size, copy the old data to it, then free the old allocation. If you expect to do this a lot, and don't mind the significant overhead associated with it, you could take a look at thrust device vectors, which hide some of the manual labor and allow you to resize an allocation in a single vector-style .resize() operation. There's no magic, however; thrust is just a template library built on top of CUDA C (for the CUDA device backend), so it is going to do a sequence of cudaMalloc and cudaFree operations, just as you would "manually".
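As a rough illustration of the manual approach described above (the function name and element type are placeholders, not an existing API), growing a device allocation might look like this:

// "Grow" a device allocation by hand: allocate a larger buffer, copy the old
// contents device-to-device, then release the old buffer.
#include <cuda_runtime.h>

cudaError_t growDeviceAllocation(float **d_buf, size_t old_count, size_t new_count)
{
    float *d_new = NULL;
    cudaError_t err = cudaMalloc((void**)&d_new, new_count * sizeof(float));
    if (err != cudaSuccess) return err;

    // Copy the existing data into the new allocation (device-to-device copy).
    err = cudaMemcpy(d_new, *d_buf, old_count * sizeof(float), cudaMemcpyDeviceToDevice);
    if (err != cudaSuccess) { cudaFree(d_new); return err; }

    cudaFree(*d_buf);   // release the old allocation
    *d_buf = d_new;
    return cudaSuccess;
}

A thrust::device_vector does essentially the same allocate/copy/free sequence internally when you call .resize() on it.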
Is it possible to map device allocated memory to system memory?
Leaving aside UM, no. Device memory cannot be mapped into host address space.
Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the pci-e bus?
No, host-mapped data is never duplicated in device memory, and apart from L2 caching, mapped data needed by the GPU will always be fetched across the PCI-E bus.
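For reference, using mapped pinned (zero-copy) memory looks roughly like the following sketch (the buffer name and size are placeholders):

// Allocate pinned host memory that is also mapped into the device address space.
#include <cuda_runtime.h>

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);          // must be set before the context is created

    float *h_data = NULL, *d_alias = NULL;
    size_t count = 1 << 20;                         // placeholder element count

    cudaHostAlloc((void**)&h_data, count * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_alias, h_data, 0);

    // d_alias can now be passed to a kernel; every access the kernel makes to it
    // travels over the PCI-E bus (no device-side copy of the data is kept).

    cudaFreeHost(h_data);
    return 0;
}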

Global memory details

This is a follow-up question to CUDA Global Memory, Where is it?, in reference to GSmith's response. These questions address the CC > 2.0 case.
When I look up the specs of my Nvidia card, it lists 2 GB of 'memory'. I've come to believe this is the 'Global' memory for this card. That is, this is GDDR3 memory that resides 'off-chip', but on the card. Would this be correct?
I don't see any case where the spec'd 'memory' is zero. Does one exist? That is, can I have a card with no off-chip memory, where all my texture, local, and constant memory actually resides in pinned and mapped host memory?
Can I extend my global memory usage by pinning more than 2 GB of host memory? Can I use all my off-chip global memory (2 GB) and add 1 GB more of pinned global memory? Or am I to understand that this card is only capable of a 2 GB address space at most, i.e., I can only access 2 GB of memory, whether unpinned, pinned, mapped, or any combination?
If the device is using pinned host memory (not mapped), do I need to Memcpy from device to host? That is, the memory is physically on the host side, and it is being used by the device, so they can both see it. Why do I need to copy it to the host when it is already there? It seems to be 'mapped' by default. (What mechanism is preventing this dual access?)
How does one go about mapping shared memory to global memory? (I'm not finding any mention of this in the docs.) Is this a 'mapped' arrangement, or do I still need to copy it from global to shared and back again? (Could this save me a copy step?)
It's recommended to ask one question per question.
When I look up the specs of my Nvidia card, it lists 2 GB of 'memory'. I've come to believe this is the 'Global' memory for this card. That is, this is GDDR3 memory that resides 'off-chip', but on the card. Would this be correct?
Yes.
I don't see any case where the spec'd 'memory' is zero. Does one exist? That is, can I have a card with no off-chip memory, where all my texture, local, and constant memory actually resides in pinned and mapped host memory?
The closest NVIDIA came to this idea was probably in the Ion 2 chipset. But there are no CUDA-capable NVIDIA discrete graphics cards with zero on-board off-chip memory.
Can I extend my global memory usage by pinning more than 2GB of host memory?
You can pin more than 2GB of host memory. However this does not extend global memory. It does enable a variety of things such as improved host-device transfer rates, overlapped copy and compute, and zero-copy access of host memory from the GPU, but this is not the same as what you use global memory for. Zero-copy techniques perhaps come the closest to extending global memory onto host memory (conceptually) but zero copy is very slow from a GPU standpoint.
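As a side note, host memory that was allocated with plain malloc can also be page-locked after the fact with cudaHostRegister; this is not mentioned in the answer above and is shown here only as a sketch (the buffer size is a placeholder):

// Pin an existing malloc'd buffer so transfers from it can use DMA directly.
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    size_t nbytes = 512 * 1024 * 1024;              // placeholder size
    void *h_buf = malloc(nbytes);

    // Page-lock the existing allocation. This does not add to the GPU's global memory;
    // it only makes host<->device copies of this buffer faster (and allows async copies).
    cudaHostRegister(h_buf, nbytes, cudaHostRegisterDefault);

    /* ... cudaMemcpy / cudaMemcpyAsync using h_buf ... */

    cudaHostUnregister(h_buf);
    free(h_buf);
    return 0;
}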
If the device is using pinned host memory (not mapped), do I need to Memcpy from dev to host?
Yes you still need to cudaMemcpy data back and forth.
That is, the mem is physically on the host side. And it is being used by the device
I don't know where this concept is coming from. Perhaps you are referring to zero-copy, but zero-copy is relatively slow compared to accessing data that is in global memory. It should be used judiciously in cases of small data sizes, and is by no means a straightforward way to provide a bulk increase to the effective size of the global memory on the card.
How does one go about mapping shared mem to global mem?
Shared memory is not automatically mapped to global memory. The methodology is to copy the data you need back and forth between shared and global memory.
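A typical instance of that copy-in/copy-out methodology looks roughly like the sketch below (the kernel name, tile size, and operation are illustrative placeholders):

// Stage data from global memory into shared memory, operate on it, then write it back.
// Launch with 256 threads per block so the tile matches blockDim.x.
__global__ void scaleWithSharedStaging(float *g_data, int n, float factor)
{
    __shared__ float tile[256];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        tile[threadIdx.x] = g_data[idx];            // global -> shared
    __syncthreads();

    if (idx < n)
        g_data[idx] = tile[threadIdx.x] * factor;   // shared -> global (after some work)
}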

How much memory can I actually allocate on a CUDA card

I'm writing a server process that performs calculations on a GPU using CUDA. I want to queue up incoming requests until enough memory is available on the device to run the job, but I'm having a hard time figuring out how much memory I can allocate on the device. I have a pretty good estimate of how much memory a job requires (at least how much will be allocated by cudaMalloc()), but I get device out of memory long before I've allocated the total amount of global memory available.
Is there some kind of formula to compute, from the total global memory, the amount I can allocate? I can play with it until I get an estimate that works empirically, but I'm concerned my customers will deploy different cards at some point and my jerry-rigged numbers won't work very well.
The size of your GPU's DRAM is an upper bound on the amount of memory you can allocate through cudaMalloc, but there's no guarantee that the CUDA runtime can satisfy a request for all of it in a single large allocation, or even a series of small allocations.
The constraints of memory allocation vary depending on the details of the underlying driver model of the operating system. For example, if the GPU in question is the primary display device, then it's possible that the OS has also reserved some portion of the GPU's memory for graphics. Other implicit state the runtime uses (such as the heap) also consumes memory resources. It's also possible that the memory has become fragmented and no contiguous block large enough to satisfy the request exists.
The CUDART API function cudaMemGetInfo reports the free and total amount of memory available. As far as I know, there's no similar API call which can report the size of the largest satisfiable allocation request.
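Querying the free and total device memory is straightforward; a minimal sketch:

// Report how much device memory is currently free versus the total installed.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    printf("free: %zu MB, total: %zu MB\n",
           free_bytes / (1024 * 1024), total_bytes / (1024 * 1024));
    // Note: a single cudaMalloc of "free_bytes" may still fail due to fragmentation
    // and driver/runtime reservations, as described above.
    return 0;
}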

cudaMemcpy too slow

I use cudaMemcpy() one time to copy exactly 1GB of data to the device. This takes 5.9s. The other way round it takes 5.1s. Is this normal? Does the function itself have so much overhead before copying?
Theoretically, there should be a throughput of at least 4 GB/s over the PCIe bus.
There are no memory transfers overlapping because the Tesla C870 just does not support it. Any hints?
EDIT 2: my test program + updated timings; I hope it is not too much to read!
The cutCreateTimer() functions won't compile for me: 'error: identifier "cutCreateTimer" is undefined'. This could be related to the old CUDA version (2.0) installed on the machine.
__host__ void time_int(int print){
    static struct timeval t1; /* var for previous time stamp */
    static struct timeval t2; /* var of current time stamp */
    double time;
    if(gettimeofday(&t2, 0) == -1) return;
    if(print != 0){
        time = (double) (t2.tv_sec - t1.tv_sec) + ((double) (t2.tv_usec - t1.tv_usec)) / 1000000.0;
        printf(...);
    }
    t1 = t2;
}
main:
    time_int(0);
    void *x;
    cudaMallocHost(&x, 1073741824);
    void *y;
    cudaMalloc(&y, 1073741824);
    time_int(1);
    cudaMemcpy(y, x, 1073741824, cudaMemcpyHostToDevice);
    time_int(1);
    cudaMemcpy(x, y, 1073741824, cudaMemcpyDeviceToHost);
    time_int(1);
Displayed timings are:
0.86 s allocation
0.197 s first copy
5.02 s second copy
The weird thing is: although it displays 0.197 s for the first copy, it takes much longer when I watch the program run.
Yes, this is normal. cudaMemcpy() does a lot of checking and work (if the host memory was allocated with the usual malloc() or mmap()). It has to check that every page of data is in memory and move the pages, one by one, to the driver.
You can use the cudaHostAlloc or cudaMallocHost functions to allocate memory instead of malloc. They allocate pinned memory, which always stays in RAM and can be accessed by the GPU's DMA directly (making cudaMemcpy() faster). Citing from the first link:
Allocates count bytes of host memory that is page-locked and accessible to the device. The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cudaMemcpy().
The only limiting factor is that the total amount of pinned memory in the system is limited (no more than the RAM size; it is better to use no more than RAM - 1 GB):
Allocating excessive amounts of pinned memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to allocate staging areas for data exchange between host and device.
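To see the difference in practice, one can time the same host-to-device transfer from a pageable and a pinned buffer; a rough sketch (the buffer size and event-based timing are illustrative, not the asker's setup):

// Compare host-to-device copy time from pageable (malloc) vs pinned (cudaMallocHost) memory.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

static float timedCopy(void *dst, const void *src, size_t nbytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(dst, src, nbytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    size_t nbytes = 256UL * 1024 * 1024;            // placeholder: 256 MB
    void *d = NULL, *h_pageable = NULL, *h_pinned = NULL;

    cudaMalloc(&d, nbytes);
    h_pageable = malloc(nbytes);
    cudaMallocHost(&h_pinned, nbytes);

    printf("pageable: %.2f ms\n", timedCopy(d, h_pageable, nbytes));
    printf("pinned:   %.2f ms\n", timedCopy(d, h_pinned, nbytes));

    cudaFree(d); free(h_pageable); cudaFreeHost(h_pinned);
    return 0;
}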
Assuming the transfers are timed accurately, 1.1 seconds for a transfer of 1 GB from pinned memory seems slow. Are you sure the PCIe slot is configured to the correct width? For full performance, you'd want a x16 configuration. Some platforms provide two slots, one of which is configured as x16 and the other as x4. So if your machine has two slots, you might want to try moving the card into the other slot. Other systems have two slots where you get x16 if only one slot is occupied, but two slots of x8 if both are occupied. The BIOS setup may help in figuring out how the PCIe slots are configured.
The Tesla C870 is rather old technology, but if I recall correctly, transfer rates of around 2 GB/s from pinned memory should be possible with these parts, which used a 1st-generation PCIe interface. Current Fermi-class GPUs use a PCIe gen 2 interface and can achieve 5+ GB/s for transfers from pinned memory (for throughput measurements, 1 GB/s = 10^9 bytes/s).
Note that PCIe uses a packetized transport, and the packet overhead can be significant at the packet sizes supported by common chipsets, with newer chipsets typically supporting somewhat longer packets. One is unlikely to exceed 70% of the nominal per-direction maximum (4 GB/s for PCIe 1.0 x16, 8 GB/s for PCIe 2.0 x16), even for transfers from / to pinned host memory. Here is a white paper that explains the overhead issue and has a handy graph showing the utilization achievable with various packet sizes:
http://www.plxtech.com/files/pdf/technical/expresslane/Choosing_PCIe_Packet_Payload_Size.pdf
Other than a system that just is not configured properly, the best explanation for dreadful PCIe bandwidth is a mismatch between IOH/socket and the PCIe slot that the GPU is plugged into.
Most multi-socket Intel i7-class (Nehalem, Westmere) motherboards have one I/O hub per socket. Since the system memory is directly connected to each CPU, DMA accesses that are "local" (fetching memory from the CPU connected to the same IOH as the GPU doing the DMA access) are much faster than nonlocal ones (fetching memory from the CPU connected to the other IOH, a transaction that has to be satisfied via the QPI interconnect that links the two CPUs).
IMPORTANT NOTE: unfortunately it is common for SBIOS's to configure systems for interleaving, which causes contiguous memory allocations to be interleaved between the sockets. This mitigates performance cliffs from local/nonlocal access for the CPUs (one way to think of it: it makes all memory accesses equally bad for both sockets), but wreaks havoc with GPU access to the data since it causes every other page on a 2-socket system to be nonlocal.
Nehalem and Westmere class systems don't seem to suffer from this problem if the system only has one IOH.
(By the way, Sandy Bridge class processors take another step down this path by integrating the PCI Express support into the CPU, so with Sandy Bridge, multi-socket machines automatically have multiple IOH's.)
You can investigate this hypothesis by either running your test using a tool that pins it to a socket (numactl on Linux, if it's available) or by using platform-dependent code to steer the allocations and threads to run on a specific socket. You can learn a lot without getting fancy - just call a function with global effects at the beginning of main() to force everything onto one socket or another, and see if that has a big impact on your PCIe transfer performance.