filtering an image, best practices - cuda

I have an input image (say a buffer of 1024 * 1024 pixels with RGBA color data).
For each pixel, I want to apply a filter that depends on its neighbors, e.g. a [-15, 15] window in the x and y directions.
My concern is that doing this with global memory means something like 31 * 31 global memory accesses per pixel, which would be a serious performance bottleneck. I'm also not sure about the behavior of multiple threads trying to read from the same memory location at the same time (maybe some of them fail to read, so garbage data in -> garbage data out).
This question applies to CUDA or OpenCL, as the concept should be the same.
I know that shared memory (per work group) or local memory (per thread) won't solve this on its own, as I can't read another thread's local memory or another group's shared memory (correct me if I misunderstand this concept).

Shared memory is a typical approach to this problem, although the stencil area (31*31) is quite large. A data re-use benefit can still be gained, however. Since adjacent pixel computations only extend the required region by one column, in a 16KB shared memory array of 32-bit RGBA pixels you could have enough data for at least 64 threads to cooperatively compute their pixel values out of a single shared memory load.
Regarding the concern about multiple threads reading the same location, there is no possibility for garbage data reads. Certainly there is a possibility for contention leading to a performance impact, but in fact with an orderly for-loop progression in the kernel, no threads will be reading the same location at the same time anyway. With appropriate data organization there will be good opportunity for coalesced reads from global memory and no bank conflicts in shared memory.
This type of problem is well-suited to GPU computing, e.g. with CUDA or OpenCL, and there are many examples of programs like this on SO.
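For concreteness, here is a minimal sketch of that tiled shared-memory approach, assuming a 16x16 thread block, a radius of 15, a simple box filter, and clamping at the image borders; these are illustrative choices, not requirements of the method.

#define RADIUS 15
#define TILE_W 16
#define SMEM_W (TILE_W + 2 * RADIUS)   // 46 x 46 uchar4 values, roughly 8.3 KB of shared memory

__global__ void boxFilterRGBA(const uchar4 *in, uchar4 *out, int width, int height)
{
    __shared__ uchar4 tile[SMEM_W][SMEM_W];

    // Cooperatively load the tile plus its halo: each thread loads several
    // pixels in a strided loop, clamping reads at the image border.
    for (int y = threadIdx.y; y < SMEM_W; y += TILE_W)
        for (int x = threadIdx.x; x < SMEM_W; x += TILE_W) {
            int sx = min(max((int)(blockIdx.x * TILE_W) + x - RADIUS, 0), width  - 1);
            int sy = min(max((int)(blockIdx.y * TILE_W) + y - RADIUS, 0), height - 1);
            tile[y][x] = in[sy * width + sx];
        }
    __syncthreads();

    int gx = blockIdx.x * TILE_W + threadIdx.x;
    int gy = blockIdx.y * TILE_W + threadIdx.y;
    if (gx >= width || gy >= height) return;

    // Each thread now reads its entire 31x31 neighborhood from shared memory.
    float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);
    for (int dy = 0; dy < 2 * RADIUS + 1; ++dy)
        for (int dx = 0; dx < 2 * RADIUS + 1; ++dx) {
            uchar4 p = tile[threadIdx.y + dy][threadIdx.x + dx];
            acc.x += p.x; acc.y += p.y; acc.z += p.z; acc.w += p.w;
        }
    const float n = (2 * RADIUS + 1) * (2 * RADIUS + 1);
    out[gy * width + gx] = make_uchar4((unsigned char)(acc.x / n), (unsigned char)(acc.y / n),
                                       (unsigned char)(acc.z / n), (unsigned char)(acc.w / n));
}

For a 1024 * 1024 image this would be launched as something like boxFilterRGBA<<<dim3(1024/TILE_W, 1024/TILE_W), dim3(TILE_W, TILE_W)>>>(d_in, d_out, 1024, 1024); each pixel belonging to a tile is read from global memory once per block instead of up to 31 * 31 times.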

Related

global memory access for individual threads

I am writing a simplistic raytracer. The idea is that for every pixel there is a thread that traverses a certain structure (geometry) that resides in global memory.
I invoke my kernel like so:
trace<<<gridDim, blockDim>>>(width, height, frameBuffer, scene)
Where scene is a structure that was previously allocated with cudaMalloc. Every thread has to start traversing this structure starting from the same node, and chances are that many concurrent threads will attempt to read the same nodes many times. Does that mean that when such reads take place, it cripples the degree of parallelism?
Given that geometry is large, I would assume that replicating it is not an option. I mean the whole processing still happens fairly fast, but I was wondering whether it is something that has to be dealt with, or simply left flung to the breeze.
First of all, I think you have the wrong idea when you ask whether concurrent reads may cripple the degree of parallelism, because that is exactly what it means to be parallel: each thread reads concurrently. Instead, you should be asking whether it affects performance through extra memory accesses when each thread basically wants the same thing, i.e. the same node.
According to the article here, memory accesses can be coalesced only within a warp, and only if data locality is present.
That means that if threads within a warp are trying to access memory locations near each other, the accesses can be coalesced. In your case each thread is trying to access the "same" node until it meets an endpoint where the threads branch.
This means the memory accesses will be coalesced within the warp till the threads branch off.
Efficient access to global memory from each thread depends on both your device architecture and your code. Arrays allocated in global memory are aligned to 256-byte memory segments by the CUDA driver. The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size. The device coalesces global memory loads and stores issued by the threads of a warp into as few transactions as possible to minimize DRAM bandwidth. A misaligned data access on devices with a compute capability of less than 2.0 affects the effective bandwidth of accessing data. This is not a serious issue when working with a device that has a compute capability of > 2.0. That being said, pretty much regardless of your device generation, when accessing global memory with large strides the effective bandwidth becomes poor (Reference). I would assume that the same behavior is likely for random access.
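If the device supports it, one cheap thing to try is marking the traversal data as read-only so the loads can be served by the read-only data cache. The Node layout, frame buffer type, and traversal loop below are placeholders; the point is only the const __restrict__ qualifier and __ldg() (compute capability 3.5+):

struct Node { int left, right; float splitPos; int axis; };   // hypothetical node layout

__global__ void trace(int width, int height, float4 *frameBuffer,
                      const Node * __restrict__ scene)        // read-only hint to the compiler
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Every thread starts at the root. While the threads of a warp are still
    // reading the same node, the warp is served by one cached/broadcast
    // transaction rather than 32 separate ones.
    int nodeIdx = 0;
    for (int depth = 0; depth < 64 && nodeIdx >= 0; ++depth) {
        int left  = __ldg(&scene[nodeIdx].left);    // explicit read-only loads (cc >= 3.5);
        int right = __ldg(&scene[nodeIdx].right);   // const __restrict__ alone often suffices
        // ... the ray/node intersection test would go here; pick a child (placeholder):
        nodeIdx = (left >= 0) ? left : right;
    }
    frameBuffer[y * width + x] = make_float4(0.f, 0.f, 0.f, 1.f);   // placeholder shading
}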
If you are changing the structure while it is being read, which I assume you might be (if it's a scene, you probably update it each frame?), then yes, it cripples performance and may cause undefined behaviour. This is called a race condition. You can use atomic operations to overcome this type of problem; atomic operations guarantee that such race conditions don't happen.
You can try stuffing the 'scene' into shared memory if you can fit it.
You can also try using streams to increase concurrency; streams also give you a form of synchronization, since operations issued to the same stream execute in order, as sketched below.
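A rough sketch of the per-frame stream idea, assuming the scene is animated on the host; Scene, updateSceneOnHost, the frame buffer type, and the launch configuration are placeholders, and trace is the kernel from the question:

void renderFrames(int numFrames, dim3 grid, dim3 block, int width, int height,
                  float4 *frameBuffer, Scene *d_scene, Scene *h_scene, size_t sceneBytes)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int frame = 0; frame < numFrames; ++frame) {
        updateSceneOnHost(h_scene);   // hypothetical CPU-side animation step
        // The copy and the kernel go to the same stream, so the kernel cannot start
        // before the updated scene has fully landed on the device, and the next
        // frame's copy cannot start before the kernel has finished reading it.
        cudaMemcpyAsync(d_scene, h_scene, sceneBytes, cudaMemcpyHostToDevice, stream);
        trace<<<grid, block, 0, stream>>>(width, height, frameBuffer, d_scene);
    }
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}

For the copy to be truly asynchronous, h_scene should be allocated as pinned host memory (cudaMallocHost).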

Performance of cmem vs texture on Pascal [duplicate]

Does the use of constant memory for data storage provide any benefit over texture memory on the Pascal architecture if the data request frequency is very high among threads (every thread picks at least one value from a specific column)?
EDIT: This is a split version of this question to improve community searching
If the expectations for constant memory usage are satisfied, the use of constant memory is a good idea in the general case. It allows your code to take advantage of an additional cache mechanism provided by the GPU hardware, and in doing so puts less pressure on the texture usage of other parts of your code.
Since constant memory and its cache, like texture and surface memory and their own cache, are defined by the hardware compute capability, the target hardware should be taken into account. The choice between constant memory and texture memory therefore depends on the access pattern, the cache usage, and the cache availability.
Constant memory performance relies on broadcasting data to the threads of a warp, so maximum performance is achieved when all threads request the very same address and the data is already in the cache. If threads in the same warp request multiple addresses, the access is split into multiple requests, since only a single address can be retrieved per operation. If the number of split requests caused by reads from multiple addresses is too high, texture and surface memory may outperform constant memory in this specific situation. This is detailed in the CUDA Programming Guide:
The constant memory space resides in device memory and is cached in the constant cache mentioned in Compute Capability 2.x. A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests. The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.
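As a rough illustration of the broadcast rule quoted above (the array size and kernel below are invented for the example):

__constant__ float coeff[256];   // filled from the host with cudaMemcpyToSymbol

__global__ void constantReads(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float fast = coeff[blockIdx.x % 256];    // one address for the whole warp: a single broadcast request
    float slow = coeff[threadIdx.x % 256];   // up to 32 addresses per warp: the request is split and serialized
    out[i] = fast + slow;
}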
The texture memory cache is more flexible than the constant memory cache: it can take advantage of reads, within the same warp, of addresses that are close together in a 2D sense. Despite some advantages over constant memory, in general texture memory should be used when the data access pattern or the data size does not fit the constant memory requirements, or to make use of the texture cache. More detailed information can be found in the CUDA Programming Guide:
The texture and surface memory spaces reside in device memory and are cached in texture cache, so a texture fetch or surface read costs one memory read from device memory only on a cache miss, otherwise it just costs one read from texture cache. The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture or surface addresses that are close together in 2D will achieve best performance. Also, it is designed for streaming fetches with a constant latency; a cache hit reduces DRAM bandwidth demand but not fetch latency.
Reading device memory through texture or surface fetching presents some benefits that can make it an advantageous alternative to reading device memory from global or constant memory:
If the memory reads do not follow the access patterns that global or constant memory reads must follow to get good performance, higher bandwidth can be achieved providing that there is locality in the texture fetches or surface reads;
Addressing calculations are performed outside the kernel by dedicated units;
Packed data may be broadcast to separate variables in a single operation;
8-bit and 16-bit integer input data may be optionally converted to 32-bit floating-point values in the range [0.0, 1.0] or [-1.0, 1.0] (see Texture Memory).
The developer should keep in mind that combining texture memory with constant memory can be a real advantage over preferring a single one, because it may allow the code to take advantage of the dedicated cache of each, and both caches give higher performance than any data retrieved from outside the caches (i.e. from device memory).
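As a sketch of how the texture path might look for this kind of data, using the texture object API (CUDA 5.0+); the pitched uchar4 image and the helper below are assumptions for the example:

__global__ void readPixels(cudaTextureObject_t tex, float4 *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = tex2D<float4>(tex, x + 0.5f, y + 0.5f);   // 2D-cached fetch
}

// Host side: d_image and pitch are assumed to come from cudaMallocPitch.
cudaTextureObject_t makeImageTexture(uchar4 *d_image, size_t pitch, int width, int height)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypePitch2D;
    resDesc.res.pitch2D.devPtr = d_image;
    resDesc.res.pitch2D.desc = cudaCreateChannelDesc<uchar4>();
    resDesc.res.pitch2D.width = width;
    resDesc.res.pitch2D.height = height;
    resDesc.res.pitch2D.pitchInBytes = pitch;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode = cudaFilterModePoint;
    texDesc.readMode = cudaReadModeNormalizedFloat;   // 8-bit channels arrive in the kernel as floats in [0.0, 1.0]

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;                                       // release later with cudaDestroyTextureObject(tex)
}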

Memory coalescing and transaction

After reading about the topic, I have 2 questions related to Global Memory coalescing access:
1- I read that one requirement for memory coalescing is that the words accessed by the threads must be 4, 8, or 16 bytes, but apparently this is valid only for devices with compute capability less than 1.3. Is that right? For later devices (>= 1.3), can a thread access even one or two bytes and still get a coalesced memory access?
2- Will it matter (mainly in time) if a (half) warp's global memory access generates a 128-byte instead of a 64-byte memory transaction because of word misalignment? And what about the extra data transferred, will it be discarded by the system?
Thank you
1) You can access the data any way you want on later devices, but the performance will still be poor if you request a data segment that is narrow, i.e. you will not achieve the full memory bandwidth of your GPU.
2) This again depends on the overall scheme of your code. Generally, the improvement in later versions of CUDA was that non-aligned reads/writes no longer resulted in abysmal performance, but instead in, e.g., 2 write commands being issued instead of one.
Think of it like putting people on a bus. If you can fill your whole crowd into a single bus with one destination, you get better efficiency than using two buses that are only half filled.
So yes, it will matter, but depending on whether you are memory or compute bound, it will matter differently.
Arranging your read/write patterns to utilize the full bandwidth has given me the last 20-30% of performance in many applications.
/Henrik
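A simple way to see the effect described above is to time an offset-copy kernel (a sketch, not part of the answer) for offset = 0 versus offset = 1 and compare the effective bandwidth; on older devices the misaligned case is dramatically slower, while on cached devices the penalty is smaller but the extra transactions are still there:

__global__ void offsetCopy(float *out, const float *in, int n, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - offset)
        out[i] = in[i + offset];   // offset != 0 shifts the warp off its 128-byte segment
}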

CUDA Compute Capability 2.0. Global memory access pattern

Since CUDA Compute Capability 2.0 (Fermi), global memory access works through a 768 KB L2 cache. It looks like the developer no longer needs to care about global memory banks. But global memory is still very slow, so the right access pattern is important. Now the point is to use/reuse L2 as much as possible. And my question is, how? I would be thankful for some detailed info on how L2 works and how I should organize and access global memory if I need, for example, a 100-200 element array per thread.
L2 cache helps in some ways, but it does not obviate the need for coalesced access of global memory. In a nutshell, coalesced access means that for a given read (or write) instruction, individual threads in a warp are reading (or writing) adjacent, contiguous locations in global memory, preferably that are aligned as a group on a 128-byte boundary. This will result in the most effective utilization of the available memory bandwidth.
In practice this is often not difficult to accomplish. For example:
int idx=threadIdx.x + (blockDim.x * blockIdx.x);
int mylocal = global_array[idx];
will give coalesced (read) access across all the threads in a warp, assuming global_array is allocated in an ordinary fashion using cudaMalloc in global memory. This type of access makes 100% usage of the available memory bandwidth.
A key takeaway is that memory transactions ordinarily occur in 128-byte blocks, which happens to be the size of a cache line. If you request even one of the bytes in a block, the entire block will be read (and stored in L2, normally). If you later read other data from that block, it will normally be serviced from L2, unless it has been evicted by other memory activity. This means that the following sequence:
int mylocal1 = global_array[0];
int mylocal2 = global_array[1];
int mylocal3 = global_array[31];
would all typically be serviced from a single 128-byte block. The first read for mylocal1 will trigger the 128 byte read. The second read for mylocal2 would normally be serviced from the cached value (in L2 or L1) not by triggering another read from memory. However, if the algorithm can be suitably modified, it's better to read all your data contiguously from multiple threads, as in the first example. This may be just a matter of clever organization of data, for example using Structures of Arrays rather than Arrays of structures.
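As a small, hypothetical illustration of that Structure-of-Arrays point:

struct PointAoS { float x, y, z; };        // array of structures: 12-byte stride between consecutive x values
struct PointsSoA { float *x, *y, *z; };    // structure of arrays: consecutive x values are adjacent

__global__ void aosRead(const PointAoS *pts, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = pts[i].x;    // threads of a warp read every 12th byte: strided, poorly coalesced
}

__global__ void soaRead(PointsSoA pts, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = pts.x[i];    // threads of a warp read adjacent floats: fully coalesced
}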
In many respects, this is similar to CPU cache behavior. The concept of a cache line is similar, along with the behavior of servicing requests from the cache.
Fermi L1 and L2 can support write-back and write-through. L1 is available on a per-SM basis, and is configurably split with shared memory to be either 16KB L1 (and 48KB shared memory) or 48KB L1 (and 16KB shared memory). L2 is unified across the device and is 768KB.
Some advice I would offer is to not assume that the L2 cache just fixes sloppy memory accesses. The GPU caches are much smaller than equivalent caches on CPUs, so it's easier to get into trouble there. A general piece of advice is simply to code as if the caches were not there. Rather than CPU oriented strategies like cache-blocking, it's usually better to focus your coding effort on generating coalesced accesses and then possibly make use of shared memory in some specific cases. Then for the inevitable cases where we can't make perfect memory accesses in all situations, we let the caches provide their benefit.
You can get more in-depth guidance by looking at some of the available NVIDIA webinars. For example, the Global Memory Usage & Strategy webinar (and slides ) or the CUDA Shared Memory & Cache webinar would be instructive for this topic. You may also want to read the Device Memory Access section of the CUDA C Programming Guide.

the latency of accessing shared memory

Which latency is longer between the two situations below?
1. The data is loaded into shared memory from global memory, and all the threads access the shared memory concurrently; the data may be the same for multiple accessing threads.
2. All the threads access global memory, but the data items are neighbors.
If you plan on accessing each value only once, then you won't gain anything from using shared memory.
Values in shared memory are only valid within a block, so one or more threads in each block will have to load the values from global memory. So you're not able to avoid the global memory accesses.
If you have a device of compute capability >= 2.0 (Fermi), values read from global memory are automatically cached in the L1 and L2 caches. L1 has the same latency as shared memory.
Latency is a fixed value that depends on which memory you're accessing. It doesn't change. Latency is always much lower for shared memory than for global memory.
I think what you might really be asking is what type of access would give you the best memory throughput. If you will be using each value only once, case (2) will give the best throughput. If you will be reusing values and have CC >= 2.0, letting L1 handle the caching is likely to give the best throughput. If you're reusing values on CC < 2.0, using shared memory will give the best throughput.
Case (1) may or may not cause bank conflicts but will give better throughput regardless, for values that are already stored in shared memory.
Case (2) describes the optimal access pattern for global memory.
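To make the two cases concrete, here is a minimal sketch; the block size of 256 and the particular reuse pattern in case (1) are assumptions for illustration:

__global__ void case2_directGlobal(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];   // each value used once: coalesced global reads are already optimal
}

__global__ void case1_viaShared(const float *in, float *out, int n)
{
    __shared__ float buf[256];                    // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;    // one global read per value, staged in shared memory
    __syncthreads();
    if (i < n)
        out[i] = buf[threadIdx.x] + buf[255 - threadIdx.x];   // each staged value is read by two threads
}

On compute capability >= 2.0, the reuse in case (1) would often also be served by L1/L2 without the explicit staging, which is the point made above.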
Perhaps I don't understand the difference between the two cases. But if I do:
The second is faster if your hardware architecture allows it, for example on a multicore machine with parallel registers. Notice also that in the second case, even from a pure software viewpoint, the data does not need to be made thread-safe out of fear of race conditions due to interleaving.
Think of it like this:
CASE 2:
You have a large table with five dinners, and you have five kids to eat them: no synchronization needed.
CASE 1:
You have, say, three tables with three dinners; so that two kids may have to eat from the same plate and thus may need to synchronize their movements so they don't hit each other. Synchronization means delay.