CUDA Different Memory Allocations

I am developing a small application using CUDA.
I have a huge 2D array (it won't fit in shared memory) which threads in all blocks will constantly read from at random locations.
This 2D array is read-only.
Where should I allocate it: global memory, constant memory, or texture memory?

If your device's texture memory is big enough, it should go there. Texture memory is backed by a cache mechanism based on spatial locality: memory accesses are optimized when threads with consecutive identifiers read data elements stored in relatively close locations.
Moreover, this locality is implemented for 2D accesses. So when neighbouring threads each read a data element of an array stored in texture memory, you are in the case of consecutive 2D accesses and you take full advantage of the memory architecture.
Unfortunately, this cache is not that big, and with huge arrays you might not be able to make your data fit in it. In that case, you can't avoid using global memory.
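For reference, a minimal sketch of putting a read-only 2D array behind the texture cache with the texture-object API (CUDA 5.0 and later); hostData, width and height are placeholder names:

    // Copy the 2D array into a CUDA array and wrap it in a texture object.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t cuArray;
    cudaMallocArray(&cuArray, &desc, width, height);
    cudaMemcpy2DToArray(cuArray, 0, 0, hostData, width * sizeof(float),
                        width * sizeof(float), height, cudaMemcpyHostToDevice);

    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = cuArray;

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;      // return raw floats

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    // In the kernel: float v = tex2D<float>(tex, x, y);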

I agree with jHackTheRipper: a simple solution would be to use texture memory and then profile with the Compute Visual Profiler. Here's a good set of slides from NVIDIA about the different memory types for image convolution; they show that good shared memory usage combined with global reads was not much faster than using texture memory. In your case, you should get some coalesced reads from the texture cache that you wouldn't usually get when reading random locations in global memory.

If it's small enough to fit in constant or texture memory, I would just try all three.
One interesting option that you haven't listed is mapped memory on the host. You can allocate memory on the host that is accessible from the device without explicitly transferring it to device memory. Depending on how much of the array you need to access, it could be faster than copying it to global memory and reading it from there.
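A minimal sketch of that approach, assuming a kernel that takes a device pointer; nBytes, grid and block are placeholders:

    // Allocate host memory the device can read directly over PCI-E
    // (zero-copy / mapped memory).
    float *hPtr, *dPtr;
    cudaSetDeviceFlags(cudaDeviceMapHost);           // before any CUDA work
    cudaHostAlloc(&hPtr, nBytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&dPtr, hPtr, 0);        // device view of hPtr
    // ... fill hPtr on the host ...
    kernel<<<grid, block>>>(dPtr);                   // no cudaMemcpy needed

Each device-side read of dPtr crosses the PCI-E bus, so this wins only when the kernel touches a small fraction of the array.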

Related

global memory access for individual threads

I am writing a simplistic raytracer. The idea is that for every pixel there is a thread that traverses a certain structure (geometry) that resides in global memory.
I invoke my kernel like so:
trace<<<gridDim, blockDim>>>(width, height, frameBuffer, scene)
Where scene is a structure that was previously allocated with cudaMalloc. Every thread has to start traversing this structure starting from the same node, and chances are that many concurrent threads will attempt to read the same nodes many times. Does that mean that when such reads take place, it cripples the degree of parallelism?
Given that geometry is large, I would assume that replicating it is not an option. I mean the whole processing still happens fairly fast, but I was wondering whether it is something that has to be dealt with, or simply left flung to the breeze.
First of all, I think you have the wrong idea when you say that concurrent reads may or may not cripple the degree of parallelism; that is exactly what being parallel means: each thread reads concurrently. What you should ask instead is whether the extra memory accesses hurt performance when each thread basically wants the same thing, i.e. the same node.
According to the article here, memory accesses can be coalesced only if data locality is present, and only within a warp.
This means that if threads within a warp try to access memory locations near each other, the accesses can be coalesced. In your case, each thread tries to access the "same" node until it reaches a point where the threads branch off, so the memory accesses will be coalesced within the warp until then.
Efficient access to global memory from each thread depends on both your device architecture and your code. Arrays allocated in global memory are aligned to 256-byte memory segments by the CUDA driver. The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size. The device coalesces the global memory loads and stores issued by the threads of a warp into as few transactions as possible to minimize DRAM bandwidth. On devices with compute capability less than 2.0, a misaligned access reduces the effective bandwidth; this is not a serious issue on devices with compute capability 2.0 or higher. That said, pretty much regardless of your device generation, accessing global memory with large strides makes the effective bandwidth poor (Reference). I would assume that random access behaves similarly.
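To illustrate the stride point, a hedged sketch of two kernels doing the same copy where only the indexing differs:

    // Unit-stride indexing: neighbouring threads touch neighbouring
    // addresses, so a warp's loads coalesce into few transactions.
    __global__ void copyCoalesced(float *out, const float *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided indexing: each warp scatters across many memory
    // segments, so the effective bandwidth drops.
    __global__ void copyStrided(float *out, const float *in,
                                int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }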
If you are changing the structure while other threads are reading it, which I assume you might be (if it's a scene, you probably update it each frame?), then yes, it cripples performance and may cause undefined behaviour. This is called a race condition. You can use atomic operations to overcome this type of problem, since they guarantee that these races don't happen; see the sketch below.
You can also try stuffing the scene into shared memory, if it fits.
Finally, you can try using streams to increase concurrency; kernels launched in the same stream are implicitly serialized, which also gives you some synchronization between them.
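As an illustration of the atomics suggestion, a small sketch with placeholder names counter and flags:

    // An atomic read-modify-write avoids the lost-update race that a
    // plain 'counter[0] += 1' would have across concurrent threads.
    __global__ void countHits(int *counter, const int *flags, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && flags[i])
            atomicAdd(counter, 1);   // serialized by hardware, no race
    }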

When should texture memory be preferred over constant memory?

Does storing data in constant memory provide any benefit over texture memory on the Pascal architecture when the data request frequency among threads is very high (every thread picks at least one value from a specific column)?
EDIT: This is a split-off version of this question, to make it easier to find.
If the expectations for constant memory usage are satisfied, using constant memory is a good idea in the general case. It lets your code take advantage of an additional cache provided by the GPU hardware, and in doing so puts less pressure on the texture cache that other parts of your code may rely on.
Since constant memory and its cache, like texture/surface memory and its own cache, are defined by the hardware's compute capability, the target hardware must be taken into account. The choice between constant memory and texture memory therefore depends on the access pattern, on how the caches are used, and on which caches are available.
Constant memory performance is tied to data broadcast among the threads of a warp: maximum performance is achieved when all threads request the very same address and the data is already in the cache. If the threads of a warp request multiple addresses, the access is split into multiple requests, since the constant cache can serve only a single address per operation. If the number of split requests caused by reads from multiple addresses is too high, texture or surface memory may outperform constant memory. This is detailed in the CUDA Programming Guide:
The constant memory space resides in device memory and is cached in the constant cache mentioned in Compute Capability 2.x. A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests. The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.
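A sketch of the broadcast-versus-split behaviour described above; coeffs, k and the sizes are placeholder names:

    // Constant memory is fastest when all threads of a warp read the
    // same address (broadcast).
    __constant__ float coeffs[256];

    __global__ void broadcastRead(float *out, int n, int k) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = coeffs[k];          // one address: one request
            // out[i] = coeffs[i % 256]; // many addresses: serialized
    }
    // Host side: cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(coeffs));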
The texture cache is more flexible than the constant cache: it can take advantage of reads within the same warp to addresses that are close together in a 2D sense. Despite some advantages over constant memory, in general texture memory should be used when the data access pattern or the data size does not meet the constant memory requirements, or to make use of the texture cache. More detailed information can be found in:
The texture and surface memory spaces reside in device memory and are cached in texture cache, so a texture fetch or surface read costs one memory read from device memory only on a cache miss, otherwise it just costs one read from texture cache. The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture or surface addresses that are close together in 2D will achieve best performance. Also, it is designed for streaming fetches with a constant latency; a cache hit reduces DRAM bandwidth demand but not fetch latency.
Reading device memory through texture or surface fetching presents some benefits that can make it an advantageous alternative to reading device memory from global or constant memory:
- If the memory reads do not follow the access patterns that global or constant memory reads must follow to get good performance, higher bandwidth can be achieved provided that there is locality in the texture fetches or surface reads;
- Addressing calculations are performed outside the kernel by dedicated units;
- Packed data may be broadcast to separate variables in a single operation;
- 8-bit and 16-bit integer input data may be optionally converted to 32-bit floating-point values in the range [0.0, 1.0] or [-1.0, 1.0] (see Texture Memory).
The developer should keep in mind that combining texture memory with constant memory can be a real advantage over preferring a single one, because it allows the code to benefit from both dedicated caches, and both caches perform better than any access that has to go outside them (i.e. to device memory).
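A sketch of such combined use, assuming a texture object tex created elsewhere and a hypothetical per-launch parameter gain kept in constant memory:

    // Warp-uniform parameters go through the constant cache (broadcast),
    // 2D-local data goes through the texture cache.
    __constant__ float gain;   // set with cudaMemcpyToSymbol on the host

    __global__ void scaleImage(cudaTextureObject_t tex, float *out,
                               int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            out[y * width + x] = gain * tex2D<float>(tex, x, y);
    }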


cudaMalloc vs __device__ variables [duplicate]

I have a CUDA (v5.5) application that will need to use global memory. Ideally I would prefer to use constant memory, but I have exhausted constant memory and the overflow will have to be placed in global memory. I also have some variables that will need to be written to occasionally (after some reduction operations on the GPU) and I am placing this in global memory.
For reading, I will be accessing the global memory in a simple way. My kernel is called inside a for loop, and on each call of the kernel, every thread will access the exact same global memory addresses with no offsets. For writing, after each kernel call a reduction is performed on the GPU, and I have to write the results to global memory before the next iteration of my loop. There are far more reads from than writes to global memory in my application however.
My question is whether there are any advantages to using global memory declared in global (variable) scope over using dynamically allocated global memory? The amount of global memory that I need will change depending on the application, so dynamic allocation would be preferable for that reason. I know the upper limit on my global memory use however and I am more concerned with performance, so it is also possible that I could declare memory statically using a large fixed allocation that I am sure not to overflow. With performance in mind, is there any reason to prefer one form of global memory allocation over the other? Do they exist in the same physical place on the GPU and are they cached the same way, or is the cost of reading different for the two forms?
Global memory can be allocated statically (using __device__), dynamically (using device malloc or new) and via the CUDA runtime (e.g. using cudaMalloc).
All of the above methods allocate physically the same type of memory, i.e. memory carved out of the on-board (but not on-chip) DRAM subsystem. This memory has the same access, coalescing, and caching rules regardless of how it is allocated (and therefore has the same general performance considerations).
Since dynamic allocations take some non-zero time, there may be performance improvement for your code by doing the allocations once, at the beginning of your program, either using the static (i.e. __device__ ) method, or via the runtime API (i.e. cudaMalloc, etc.) This avoids taking the time to dynamically allocate memory during performance-sensitive areas of your code.
Also note that the 3 methods I outline, while having similar C/C++ -like access methods from device code, have differing access methods from the host. Statically allocated memory is accessed using the runtime API functions like cudaMemcpyToSymbol and cudaMemcpyFromSymbol, runtime API allocated memory is accessed via ordinary cudaMalloc / cudaMemcpy type functions, and dynamically allocated global memory (device new and malloc) is not directly accessible from the host.
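A compact sketch of the three styles and their host-side access paths; sizes and names are placeholders:

    // 1. Static allocation: host reaches it via the symbol APIs.
    __device__ float dStatic[256];

    // 3. Device-heap allocation: not directly accessible from the host.
    __global__ void deviceHeapDemo() {
        float *p = (float *)malloc(64 * sizeof(float));
        // ... use p within device code ...
        free(p);
    }

    int main() {
        float h[256] = {0};
        cudaMemcpyToSymbol(dStatic, h, sizeof(h));   // host <-> static

        // 2. Runtime-API allocation: ordinary cudaMemcpy.
        float *dDyn;
        cudaMalloc(&dDyn, sizeof(h));
        cudaMemcpy(dDyn, h, sizeof(h), cudaMemcpyHostToDevice);

        deviceHeapDemo<<<1, 1>>>();
        cudaDeviceSynchronize();
        cudaFree(dDyn);
        return 0;
    }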
First of all, you need to think about coalescing your memory accesses. You didn't mention which GPU you are using. On the latest GPUs, a coalesced memory read will give the same performance as constant memory. So always make your memory reads and writes as coalesced as possible.
Another option is texture memory (if the data size fits into it). Texture memory has its own caching mechanism; it used to be the workaround when global memory reads were non-coalesced, but the latest GPUs give almost the same performance for texture and global memory.
I don't think globally declared memory gives more performance than dynamically allocated global memory, since the coalescing issue still applies either way. Note that, like __constant__ variables and textures, __device__ variables declared at file scope do not need to be passed to kernels as arguments.
For memory optimizations, please see the memory optimization section in the CUDA C Best Practices Guide: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#memory-optimizations

CUDA malloc, mmap/mremap

CUDA device memory can be allocated using cudaMalloc/cudaFree, sure. This is fine, but primitive.
I'm curious to know, is device memory virtualised in some way? Are there equivalent operations to mmap, and more importantly, mremap for device memory?
If device memory is virtualised, I expect these sorts of functions should exist. It seems modern GPU drivers implement paging when there is contention for limited video resources by multiple processes, which suggests it's virtualised in some way or another...
Does anyone know where I can read more about this?
Edit:
Okay, my question was a bit general. I've read the bits of the manual that talk about mapping system memory for device access. I was more interested in device-allocated memory however.
Specific questions:
- Is there any possible way to remap device memory? (i.e., to grow a device allocation)
- Is it possible to map device-allocated memory to system memory?
- Is there some performance hazard in using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always be fetched across the PCI-E bus?
I have cases where the memory is used by the GPU 99% of the time; so it should be device-local, but it may be convenient to map device memory to system memory for occasional structured read-back without having to implement an awkward deep-copy.
Yes, unified memory exists, however I'm happy with explicit allocation, save for the odd moment when I want a sneaky read-back.
I've found the manual fairly light on detail in general.
CUDA comes with a fine CUDA C Programming Guide as its main manual, which has sections on Mapped Memory as well as Unified Memory Programming.
Responding to your additional posted questions, and following your cue to leave UM out of the consideration:
Is there any possible way to remap device memory? (i.e., to grow a device allocation)
There is no direct method. You would have to manually create a new allocation of the desired size, copy the old data to it, then free the old allocation. If you expect to do this a lot, and don't mind the significant overhead associated with it, you could take a look at thrust device vectors, which will hide some of the manual labor and let you resize an allocation with a single vector-style .resize() operation. There's no magic, however; thrust is a template library built on top of CUDA C (for the CUDA device backend), so under the hood it will do a sequence of cudaMalloc and cudaFree operations, just as you would "manually".
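The manual grow-by-copy pattern looks roughly like this; oldPtr, oldSize and newSize are placeholders:

    // There is no realloc in the CUDA runtime API, so "growing" an
    // allocation means allocate, copy, free.
    float *newPtr;
    cudaMalloc(&newPtr, newSize);
    cudaMemcpy(newPtr, oldPtr, oldSize, cudaMemcpyDeviceToDevice);
    cudaFree(oldPtr);
    oldPtr = newPtr;

    // Or let thrust do the same dance:
    // thrust::device_vector<float> v(oldCount);
    // v.resize(newCount);   // allocates, copies, frees under the hood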
Is it possible to map device allocated memory to system memory?
Leaving aside UM, no. Device memory cannot be mapped into host address space.
Is there some performance hazard using mapped pinned memory? Is the memory duplicated on the device as needed, or will it always fetch the memory across the pci-e bus?
No, host-mapped data is never duplicated in device memory, and apart from L2 caching, mapped data needed by the GPU will always be fetched across the PCI-E bus.
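For the occasional structured read-back mentioned in the question, copying just the field you need from a device-resident struct avoids both mapping and a deep copy; a sketch with placeholder names Node and dNode:

    // dNode is a Node* obtained from cudaMalloc; computing the address
    // of a member on the host is fine, only the copy touches the device.
    struct Node { int count; float payload[1024]; };

    int hCount;
    cudaMemcpy(&hCount, &dNode->count, sizeof(int),
               cudaMemcpyDeviceToHost);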