How much memory can I actually allocated on a cuda card - cuda

I'm writing a server process that performs calculations on a GPU using cuda. I want to queue up in-coming requests until enough memory is available on the device to run the job, but I'm having a hard time figuring out how much memory I can allocate on the the device. I have a pretty good estimate of how much memory a job requires, (at least how much will be allocated from cudaMalloc()), but I get device out of memory long before I've allocated the total amount of global memory available.
Is there some king of formula to compute from the total global memory the amount I can allocated? I can play with it until I get an estimate that works empirically, but I'm concerned my customers will deploy different cards at some point and my jerry-rigged numbers won't work very well.

The size of your GPU's DRAM is an upper bound on the amount of memory you can allocate through cudaMalloc, but there's no guarantee that the CUDA runtime can satisfy a request for all of it in a single large allocation, or even a series of small allocations.
The constraints of memory allocation vary depending on the details of the underlying driver model of the operating system. For example, if the GPU in question is the primary display device, then it's possible that the OS has also reserved some portion of the GPU's memory for graphics. Other implicit state the runtime uses (such as the heap) also consumes memory resources. It's also possible that the memory has become fragmented and no contiguous block large enough to satisfy the request exists.
The CUDART API function cudaMemGetInfo reports the free and total amount of memory available. As far as I know, there's no similar API call which can report the size of the largest satisfiable allocation request.

Related

When should texture memory be prefered over constant memory?

Does the use of data storage in constant memory provides any benefit over texture in the Pascal architecture if the data request frequency is very high among threads (every thread pick at least one data from a specific column)?
EDIT: This is a split version of this question to improve community searching
If the expectations for constant memory usage are satisfied, the use of constant memory is a good idea in the general case. It is allowing your code to take advantage of an additional cache mechanism provided by the GPU hardware, and in so doing putting less pressure on the usage of texture by other parts of your code.
Since the constant memory and its cache, as the texture and surface memory and it is own cache are defined by the hardware Compute Capability, the target hardware should be accounted. Thus the option by constant memory and texture memory is dependent of the access pattern and the cache use, as the cache availability.
The constant memory performance is related to data broadcast among threads in a warp, so the maximum performance is achieved if all threads request the very same data address and the data is already on the cache. Thus, if in the same warp there are request to multiple address, the service is splitted in multiple requests, since it can retrive a single address per operation. If the number of splitted requests due to data retrieval from multiple addresses is too high, the texture and surface memory performance may superior over constant memory in this specific situation.. This information is detailed in the Cuda Programming Guide:
The constant memory space resides in device memory and is cached in the constant cache mentioned in Compute Capability 2.x.
A request is then split into as many separate requests as there are
different memory addresses in the initial request, decreasing
throughput by a factor equal to the number of separate requests.
The resulting requests are then serviced at the throughput of the
constant cache in case of a cache hit, or at the throughput of device
memory otherwise.
The texture memory cache is more flexible than constant memory cache. It can take advantage of readings in the same warp of address that are close together in a 2D fashion. Despite of some advantages over constant memory, in general, the texture memory should be used if the data access pattern or the data size does not follow the constant memory requirements or to make use of texture memory cache. More detailed information can be found at:
The texture and surface memory spaces
reside in device memory and are cached in texture cache, so a texture
fetch or surface read costs one memory read from device memory only on
a cache miss, otherwise it just costs one read from texture cache. The
texture cache is optimized for 2D spatial locality, so threads of the
same warp that read texture or surface addresses that are close
together in 2D will achieve best performance. Also, it is designed for
streaming fetches with a constant latency; a cache hit reduces DRAM
bandwidth demand but not fetch latency.
Reading device memory through texture or surface fetching present some
benefits that can make it an advantageous alternative to reading
device memory from global or constant memory:
If the memory reads do not follow the access patterns that global or
constant memory reads must follow to get good performance, higher
bandwidth can be achieved providing that there is locality in the
texture fetches or surface reads;
Addressing calculations are
performed outside the kernel by dedicated units;
Packed data may be
broadcast to separate variables in a single operation;
8-bit and
16-bit integer input data may be optionally converted to 32 bit
floating-point values in the range [0.0, 1.0] or [-1.0, 1.0] (see
Texture Memory).
The developer should keep in mind that exploiting of the combination of texture memory with constant memory can be a real advantage over the preference for a single one, because it may allow to take advantage of the dedicated cache from both, since both caches have higher performance than over any data retrieved outside the cache (i.e. device memory).

Performance of cmem vs texture on Pascal [duplicate]

Does the use of data storage in constant memory provides any benefit over texture in the Pascal architecture if the data request frequency is very high among threads (every thread pick at least one data from a specific column)?
EDIT: This is a split version of this question to improve community searching
If the expectations for constant memory usage are satisfied, the use of constant memory is a good idea in the general case. It is allowing your code to take advantage of an additional cache mechanism provided by the GPU hardware, and in so doing putting less pressure on the usage of texture by other parts of your code.
Since the constant memory and its cache, as the texture and surface memory and it is own cache are defined by the hardware Compute Capability, the target hardware should be accounted. Thus the option by constant memory and texture memory is dependent of the access pattern and the cache use, as the cache availability.
The constant memory performance is related to data broadcast among threads in a warp, so the maximum performance is achieved if all threads request the very same data address and the data is already on the cache. Thus, if in the same warp there are request to multiple address, the service is splitted in multiple requests, since it can retrive a single address per operation. If the number of splitted requests due to data retrieval from multiple addresses is too high, the texture and surface memory performance may superior over constant memory in this specific situation.. This information is detailed in the Cuda Programming Guide:
The constant memory space resides in device memory and is cached in the constant cache mentioned in Compute Capability 2.x.
A request is then split into as many separate requests as there are
different memory addresses in the initial request, decreasing
throughput by a factor equal to the number of separate requests.
The resulting requests are then serviced at the throughput of the
constant cache in case of a cache hit, or at the throughput of device
memory otherwise.
The texture memory cache is more flexible than constant memory cache. It can take advantage of readings in the same warp of address that are close together in a 2D fashion. Despite of some advantages over constant memory, in general, the texture memory should be used if the data access pattern or the data size does not follow the constant memory requirements or to make use of texture memory cache. More detailed information can be found at:
The texture and surface memory spaces
reside in device memory and are cached in texture cache, so a texture
fetch or surface read costs one memory read from device memory only on
a cache miss, otherwise it just costs one read from texture cache. The
texture cache is optimized for 2D spatial locality, so threads of the
same warp that read texture or surface addresses that are close
together in 2D will achieve best performance. Also, it is designed for
streaming fetches with a constant latency; a cache hit reduces DRAM
bandwidth demand but not fetch latency.
Reading device memory through texture or surface fetching present some
benefits that can make it an advantageous alternative to reading
device memory from global or constant memory:
If the memory reads do not follow the access patterns that global or
constant memory reads must follow to get good performance, higher
bandwidth can be achieved providing that there is locality in the
texture fetches or surface reads;
Addressing calculations are
performed outside the kernel by dedicated units;
Packed data may be
broadcast to separate variables in a single operation;
8-bit and
16-bit integer input data may be optionally converted to 32 bit
floating-point values in the range [0.0, 1.0] or [-1.0, 1.0] (see
Texture Memory).
The developer should keep in mind that exploiting of the combination of texture memory with constant memory can be a real advantage over the preference for a single one, because it may allow to take advantage of the dedicated cache from both, since both caches have higher performance than over any data retrieved outside the cache (i.e. device memory).

CUDA malloc, mmap/mremap

CUDA device memory can be allocated using cudaMalloc/cudaFree, sure. This is fine, but primitive.
I'm curious to know, is device memory virtualised in some way? Are there equivalent operations to mmap, and more importantly, mremap for device memory?
If device memory is virtualised, I expect these sorts of functions should exist. It seems modern GPU drivers implement paging when there is contention for limited video resources by multiple processes, which suggests it's virtualised in some way or another...
Does anyone know where I can read more about this?
Edit:
Okay, my question was a bit general. I've read the bits of the manual that talk about mapping system memory for device access. I was more interested in device-allocated memory however.
Specific questions:
- Is there any possible way to remap device memory? (ie, to grow a device allocation)
- Is it possible to map device allocated memory to system memory?
- Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the pci-e bus?
I have cases where the memory is used by the GPU 99% of the time; so it should be device-local, but it may be convenient to map device memory to system memory for occasional structured read-back without having to implement an awkward deep-copy.
Yes, unified memory exists, however I'm happy with explicit allocation, save for the odd moment when I want a sneaky read-back.
I've found the manual fairly light on detail in general.
CUDA comes with a fine CUDA C Programming Guide as it's main manual which has sections on Mapped Memory as well as Unified Memory Programming.
Responding to your additional posted questions, and following your cue to leave UM out of the consideration:
Is there any possible way to remap device memory? (ie, to grow a device allocation)
There is no direct method. You would have to manually create a new allocation of the desired size, and copy the old data to it, then free the old allocation. If you expect to do this a lot, and don't mind the significant overhead associated with it, you could take a look at thrust device vectors which will hide some of the manual labor and allow you to resize an allocation in a single vector-style .resize() operation. There's no magic, however, so thrust is just a template library built on top of CUDA C (for the CUDA device backend) and so it is going to do a sequence of cudaMalloc and cudaFree operations, just as you would "manually".
Is it possible to map device allocated memory to system memory?
Leaving aside UM, no. Device memory cannot be mapped into host address space.
Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the pci-e bus?
no, host mapped data is never duplicated in device memory, and apart from L2 caching, mapped data needed by the GPU will always be fetched across the PCI-E bus

CUDA bandwidthTest to get attainable peak

I want to know how good my CUDA kernels are in terms of memory bandwidth utilisation. I run them on a Tesla K40c with ECC on. Is the result given by the bandwidthTest utility a good approximation to the attainable peak? Else, how would one go about writing a similar test to find the peak bandwidth?
I mean device memory bandwidth.
The source code for bandwidth test is included with the CUDA SDK so you can review it directly. The bandwidthTest example performs a test of the transfer time between the device and the host, the host and the device, and the device and the device (transferring memory on the card).
This is a real execution of a memory transfer but it takes advantage of several things:
Medium to large memory transfers. If you are doing tons of tiny
transfers you will pay a high penalty in overhead and this will
reduce your transfer rates.
Pinned memory. The bandwidthTest uses pinned memory so that the transfers can be as fast as possible. You may or may not have this option.
Sustained read/write of memory. As I recall, the bandwidthTest does a number of transfers that can be queued up. Any startup delays or anomalies will be smoothed out and it has the advantage of stringing together lots of transfers together in the queue. You may have to do transfer-work-work-transfer so you may end up with additional delays. Improvements in memory transfers from CUDA 5 may assist in mitigating this.
Doing real work with a kernel while performing memory transfers will likely result in a reduction of performance. However, you can reference the bandwidth test code and use it as a guide for improving your transfers. Consider pinned memory, asynchronous transfers, or the newer shared memory methods that do not require explicit transfer of data. Also keep in mind that bandwidthTest is only counting bulk transfers around memory and is not really taking a measure of things like shared memory.
The final performance will depend greatly on the kernel and the count and size of the memory transfers you are performing.

Understanding concurrency and GPU as a limited resource

With CPU and memory it's simple.
A process has a large virtual address space, which is partially mapped into physical memory. When the current process attempts to access a page that is not in physical memory, OS steps in, chooses a page to swap (e.g. with Round Robin), swaps it into disc, then reads the required page from the swap, and the control is returned back to the process. This is straightforward, because the process cannot continue without having that page.
GPU kernels is a different story.
Let's consider a usecase:
A high-priority [cpu] process, namely X, makes a call to kernel (which is a blocking call). At this moment, it is reasonable for OS to switch contexts and give the CPU to a different process, namely Z. For the sake of example, let the process Z also do something heavy with the GPU.
Now, what does the GPU driver do? Does it stop the kernel that belongs to [higher prioritized] X? Does it inform OS that Z isn't prioritized enough to offload kernels of X? In general, what happens when two processes need GPU resources, but the available GPU memory is sufficient to serve only one of them at a time?
CUDA GPUs context-switch cooperatively at a coarse granularity (think "memcpy" or "kernel launch"). If there is enough memory for both contexts, the hardware is happy to cooperatively context switch between them at a slight performance cost. (But because it's cooperative, long-running kernels will interfere with other kernels' execution.)
Modern GPUs do support virtual memory (i.e. memory protection through address translation), but they do NOT support demand paging. That means every piece of memory accessible to the GPU (device memory and mapped pinned memory) must be physically present and mapped after allocation.
The Windows Display Driver Model (WDDM) introduced in Windows Vista does paging at a very coarse granularity. The driver is required to track which "memory objects" are needed to execute a given command buffer, and the OS ensures that they are present. The OS can swap them out when not needed. The wrinkle with CUDA is that since pointers can be stored, all memory objects associated with the CUDA address space must be resident in order to run a CUDA kernel. So the paging doesn't work as well for CUDA as it does for graphics applications, which WDDM was designed to run.