Global memory access for individual threads - CUDA

I am writing a simplistic raytracer. The idea is that for every pixel there is a thread that traverses a certain structure (geometry) that resides in global memory.
I invoke my kernel like so:
trace<<<gridDim, blockDim>>>(width, height, frameBuffer, scene);
Where scene is a structure that was previously allocated with cudaMalloc. Every thread has to start traversing this structure starting from the same node, and chances are that many concurrent threads will attempt to read the same nodes many times. Does that mean that when such reads take place, it cripples the degree of parallelism?
Given that the geometry is large, I would assume that replicating it is not an option. I mean, the whole thing still runs fairly fast, but I was wondering whether this is something that has to be dealt with, or can simply be left alone.
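For context, here is a minimal sketch of what such a kernel and node layout might look like; the Node fields and the traversal loop are hypothetical, not the actual code from the question:

    struct Node {
        float3 boundsMin, boundsMax;   // bounding box of this node
        int    left, right;            // child indices, -1 for "no child"
        int    firstPrim, primCount;   // primitives referenced by a leaf
    };

    __global__ void trace(int width, int height, float4* frameBuffer, const Node* scene)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        // Every thread starts at the same root node in global memory.
        int nodeIdx = 0;
        while (nodeIdx != -1) {
            Node n = scene[nodeIdx];   // many threads of a warp read the same node
            nodeIdx = n.left;          // placeholder: a real tracer picks a child per ray
        }
        frameBuffer[y * width + x] = make_float4(0.f, 0.f, 0.f, 1.f);
    }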

First of all, I think you have the wrong idea when you say that concurrent reads may or may not cripple the degree of parallelism, because that is exactly what being parallel means: each thread reads concurrently. What you should really be asking is whether performance suffers from the extra memory traffic when every thread wants essentially the same thing, i.e. the same node.
According to the article here, memory accesses can be coalesced only within a warp and only if data locality is present.
That means that if threads within a warp try to access memory locations near each other, those accesses can be coalesced. In your case, each thread is trying to access the "same" node until it reaches a point where the threads branch off.
So the memory accesses will be coalesced within the warp until the threads branch off.

Efficient access to global memory from each thread depends on both your device architecture and your code. Arrays allocated in global memory are aligned to 256-byte memory segments by the CUDA driver. The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size. The device coalesces the global memory loads and stores issued by the threads of a warp into as few transactions as possible to minimize DRAM bandwidth. Misaligned data accesses noticeably hurt the effective bandwidth on devices with compute capability less than 2.0; on devices with compute capability >= 2.0 this is much less of an issue. That said, pretty much regardless of your device generation, accessing global memory with large strides makes the effective bandwidth poor (Reference). I would assume that random access behaves similarly.
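A hedged illustration of the stride effect (the kernel and array names are made up for the example): with stride 1 each warp touches one contiguous segment, while a large stride spreads the warp's accesses over many segments and wastes most of each transaction.

    __global__ void strideCopy(float* out, const float* in, int stride, int n)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n)
            out[i] = in[i];   // stride == 1: fully coalesced; stride >= 32: one transaction per thread
    }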

If you are changing the structure while it is being read, which I assume you are (if it's a scene you probably render each frame?), then yes, it can hurt performance and may cause undefined behaviour. This is called a race condition. You can use atomic operations to overcome this type of problem; atomic operations guarantee that such race conditions don't happen.
You can try stuffing the 'scene' into shared memory, if it fits.
You can also try using streams to increase concurrency; streams also give you some synchronization between the kernels that run in the same stream.
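If only the top of the tree is hot, one possible form of the shared memory suggestion is to stage the first nodes once per block; a sketch under that assumption, reusing the hypothetical Node layout from above (the constant 64 is a placeholder that must fit your shared memory budget):

    #define CACHED_NODES 64   // placeholder; CACHED_NODES * sizeof(Node) must fit in shared memory

    __global__ void traceCached(int width, int height, float4* frameBuffer, const Node* scene)
    {
        __shared__ Node cache[CACHED_NODES];

        // Cooperative copy of the top of the tree, one node per thread at a time.
        int tid = threadIdx.y * blockDim.x + threadIdx.x;
        for (int i = tid; i < CACHED_NODES; i += blockDim.x * blockDim.y)
            cache[i] = scene[i];
        __syncthreads();   // all cached nodes are visible to the whole block from here on

        // ... traverse as before, reading nodes with index < CACHED_NODES from
        //     'cache' and falling back to 'scene' in global memory for the rest ...
    }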

Related

Memory coalescing and transaction

After reading about the topic, I have two questions about coalesced global memory access:
1- I read that one requirement for memory coalescing is that the words accessed by the threads must be 4, 8, or 16 bytes, but apparently this is only valid for devices with compute capability less than 1.3. Is that right? On later devices (>= 1.3), can a thread access even one or two bytes and still get a successfully coalesced memory access?
2- Will it matter (mainly in time) if a (half-)warp's global memory access generates a 128-byte instead of a 64-byte memory transaction because of word misalignment? And what about the extra data transferred, will it be discarded by the system?
Thank you
1) You can access the data any way you want on later devices, but the performance will still be poor if you request a data segment that is narrow, i.e. you will not achieve the full memory bandwidth of your GPU.
2) This again depends on the overall scheme of your code. Generally, the improvement in later versions of CUDA was that non-aligned reads/writes no longer resulted in abysmal performance, but instead resulted in, e.g., two write commands being issued instead of one.
Think of it like putting people on a bus. If you can fit your whole crowd into a single bus with one destination, you get better efficiency than using two buses that are only half filled.
So yes, it will matter, but depending on whether you are memory or compute bound, it will matter differently.
Arranging your read/write patterns to utilize the full bandwidth has given me the last 20-30% of performance in many applications.
/Henrik
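To make the bus analogy concrete, here is a hedged sketch (the kernel and names are made up): with offset == 0 each warp's 32 floats fall into aligned 128-byte segments, while a small nonzero offset makes them straddle a segment boundary, so an extra transaction, the second half-empty bus, is needed.

    __global__ void offsetCopy(float* out, const float* in, int offset, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
        if (i < n)
            out[i] = in[i];   // offset == 0: aligned; offset == 1: misaligned accesses
    }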

How is access to the same global memory address performed by threads from different kernels?

If many threads in a warp want to read the same address in global memory, this data is broadcast, is that right?
If many threads in a warp want to write to the same address in global memory, the writes are serialized, but it is not possible to predict the order, is that right?
But the first question: if many threads in different warps, in different blocks, want to write to the same address in global memory, what will the GPU do? Serialize all the accesses to this address? Is there any guarantee of data consistency?
With Hyper-Q it is possible to launch a lot of streams containing kernels. If I have a location in memory, and a number of threads in different kernels want to write to or read from this address, what will the GPU do? Serialize the accesses of all threads from the different kernels, or do nothing and let inconsistencies happen? Is there any guarantee of data consistency when multiple kernels are reading/writing the same address?
It's preferred that you ask one question per question.
If many threads in a warp want to read the same address in global memory, this data is broadcast, is that right?
Yes this is true for Fermi (CC2.0) and beyond.
If many threads in a warp want to write to the same address in global memory, the writes are serialized, but it is not possible to predict the order, is that right?
Correct. The order is undefined.
If many threads in different warps, in different blocks, want to write to the same address in global memory, what will the GPU do? Serialize all the accesses to this address?
If the accesses are simultaneous, they are serialized. Again, order is undefined.
Is there any guarantee of data consistency?
Not sure what you mean by data consistency. In any case, what else could the GPU do except serialize simultaneous writes? I'm surprised this is such a difficult concept, as there appears to me to be no obvious alternative.
If I have a location in memory, and a number of threads in different kernels want to write to or read from this address, what will the GPU do? Serialize the accesses of all threads from the different kernels, or do nothing and let inconsistencies happen? Is there any guarantee of data consistency when multiple kernels are reading/writing the same address?
It does not matter where simultaneous writes to global memory originate: the same warp, different warps, different blocks, or different kernels. Simultaneous writes are serialized, in an undefined order. Again, for "data consistency" I'd like to know what you mean by that. Simultaneous reads and writes will also produce undefined behavior: a read may return any value, including the initial value of the memory location or any of the values that were written.
The final result of simultaneous writes to any GPU memory location is undefined. If all simultaneous writes write the same value, then the final value in that location will reflect that. Otherwise, the final value will reflect one of the values that got written; which one is undefined. Beyond that, most of your questions and statements don't make sense to me (what do you mean by data consistency?). You should not expect anything rational from such programming behavior. The GPU should be programmed as a distributed, independent work machine, not a globally synchronous machine. Note that "undefined" also means that results may vary from one run of a kernel to the next, even if the input data is identical.
Simultaneous or nearly simultaneous reading and writing of global memory from different blocks (whether from the same or different kernels) is especially hazardous on Fermi (cc2.x) devices due to the independent non-coherent L1 caches that are interposed between the SMs (where the threadblocks execute) and the L2 cache (which is device-wide, and therefore coherent). Attempting to create synchronized behavior between threadblocks using global memory as a vehicle is difficult at best, and discouraged. It is suggested to consider ways to recast your algorithm to structure the work independently.
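When a well-defined result is needed despite many threads (from any warp, block or kernel) hitting the same address, the usual tool is an atomic read-modify-write, which serializes the updates in hardware; a minimal, hypothetical sketch:

    __global__ void countHits(const int* flags, int n, int* counter)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && flags[i])
            atomicAdd(counter, 1);   // updates are serialized, so the final count is exact
    }

The order of the updates is still undefined, but the final value of the counter is deterministic.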

The latency of accessing shared memory

Which latency is longer between the two situations below?
(1) The data is first loaded into shared memory from global memory, and then all the threads access the shared memory concurrently; the data may be the same for several of the accessing threads.
(2) All the threads access global memory directly, but the data they access are neighbors.
If you plan on accessing each value only once, then you won't gain anything from using shared memory.
Values in shared memory are only valid within a block, so one or more threads in each block will have to load the values from global memory. So you're not able to avoid the global memory accesses.
If you have a device of compute capability >= 2.0 (Fermi), values read from global memory are automatically cached in the L1 and L2 caches. L1 has the same latency as shared memory.
Latency is a fixed value that depends on which memory you're accessing. It doesn't change. Latency is always much lower for shared memory than for global memory.
I think what you might really be asking is what type of access would give you the best memory throughput. If you will be using each value only once, case (2) will give the best throughput. If you will be reusing values and have CC >= 2.0, letting L1 handle the caching is likely to give the best throughput. If you're reusing values on CC < 2.0, using shared memory will give the best throughput.
Case (1) may or may not cause bank conflicts but will give better throughput regardless, for values that are already stored in shared memory.
Case (2) describes the optimal access pattern for global memory.
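Hedged sketches of the two cases (the array names and the block size of 256 are assumptions): case (2) reads neighbouring global memory once per value, and case (1) stages a tile in shared memory first, which only pays off if the staged values are reused.

    // Case (2): each thread reads one neighbouring element directly from global memory.
    __global__ void scaleDirect(float* out, const float* in, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = a * in[i];
    }

    // Case (1): stage a tile in shared memory first (assumes blockDim.x == 256).
    __global__ void scaleStaged(float* out, const float* in, float a, int n)
    {
        __shared__ float tile[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tile[threadIdx.x] = in[i];       // one coalesced global load per tile
        __syncthreads();
        if (i < n) out[i] = a * tile[threadIdx.x];  // worthwhile only if values are reused
    }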
Perhaps I don't understand the difference between the two cases. But if I do:
The second is faster if your hardware architecture allows it, for example on a multicore machine with parallel registers. Notice also that in the second case, even from a pure software viewpoint, the data does not need to be made thread-safe against hazards such as race conditions due to interleaving.
Think of it like this:
CASE 2:
You have a large table with five dinners, and you have five kids to eat them: no synchronization needed.
CASE 1:
You have, say, three tables with three dinners; so that two kids may have to eat from the same plate and thus may need to synchronize their movements so they don't hit each other. Synchronization means delay.

How to adjust the CUDA number of blocks and threads to get optimal performance

I've tested empirically with several values of blocks and threads, and the execution time can be greatly reduced with specific values.
I don't see what the differences are between blocks and threads. I figure it may be that threads in a block have specific cache memory, but it's quite fuzzy for me. For the moment, I parallelize my functions into N parts, which are allocated to blocks/threads.
My goal would be to automatically adjust the number of blocks and threads according to the size of the memory that I have to use. Could that be possible? Thank you.
Hong Zhou's answer is good, so far. Here are some more details:
If you use shared memory, you might want to consider it first, because it is a very limited resource and kernels often have very specific needs that constrain the many variables controlling parallelism. You either have blocks with many threads sharing larger regions, or blocks with fewer threads sharing smaller regions (at constant occupancy).
If your code can live with as little as 16 KB of shared memory per multiprocessor, you might want to opt for the larger (48 KB) L1 cache by calling cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
Further, the L1 cache can be disabled for non-local global accesses using the compiler option -Xptxas=-dlcm=cg, to avoid cache pollution when the kernel accesses global memory carefully.
Before worrying about optimal performance based on occupancy, you might also want to check that device debugging support is turned off for CUDA >= 4.1 (or that appropriate optimization options are given; read my post in this thread for a suitable compiler configuration).
Now that we have a memory configuration and registers are actually used aggressively, we can analyze performance under varying occupancy:
The higher the occupancy (warps per multiprocessor) the less likely the multiprocessor will have to wait (for memory transactions or data dependencies) but the more threads must share the same L1 caches, shared memory area and register file (see CUDA Optimization Guide and also this presentation).
The ABI can generate code for a variable number of registers (more details can be found in the thread I cited). At some point, however, register spilling occurs. That is, register values get temporarily stored on the (relatively slow, off-chip) local memory stack.
Watching stall reasons, memory statistics and arithmetic throughput in the profiler while varying the launch bounds and parameters will help you find a suitable configuration.
It's theoretically possible to find optimal values from within an application; however, having the client code adjust optimally to both different devices and launch parameters can be nontrivial and will require recompilation, or different variants of the kernel to be deployed for every target device architecture.
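One way to experiment with the register/occupancy trade-off mentioned above is to give the compiler launch bounds; a hedged sketch with placeholder values:

    #define MAX_THREADS_PER_BLOCK 256   // placeholder values, tune with the profiler
    #define MIN_BLOCKS_PER_SM     4

    __global__ void
    __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_SM)
    scaleKernel(float* data, float a, int n)
    {
        // The compiler limits register usage so that at least MIN_BLOCKS_PER_SM
        // blocks of MAX_THREADS_PER_BLOCK threads can be resident per multiprocessor.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= a;
    }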
I believe automatically adjusting the block and thread sizes is a highly difficult problem. If it were easy, CUDA would most probably have this feature for you.
The reason is that the optimal configuration depends on the implementation and the kind of algorithm you are implementing. It requires profiling and experimenting to get the best performance.
Here are some limitations which you can consider.
Register usage in your kernel.
Occupancy of your current implementation.
Note: having more threads does not equate to better performance. Best performance is obtained by getting the right occupancy in your application and keeping the GPU cores busy all the time.
I have quite a good answer here; in a word, it is a difficult problem to compute the optimal distribution of blocks and threads.
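If your toolkit is recent enough to have the occupancy API (CUDA 6.5 and later), it can at least suggest an occupancy-maximizing block size automatically; a hedged host-side sketch, where scaleKernel, data, a and n are placeholders:

    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scaleKernel, 0, 0);

    int gridSize = (n + blockSize - 1) / blockSize;   // enough blocks to cover all n elements
    scaleKernel<<<gridSize, blockSize>>>(data, a, n);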

CUDA thread synchronization

I am a bit confused about synchronization.
Using __syncthreads you can synchronize threads within a block. Must __syncthreads be used only together with shared memory, or does using shared memory with __syncthreads simply give the best performance?
Generally, threads can only safely communicate with each other if they exist within the same thread block, right? So why don't we always use shared memory? Because it's not big enough? And if we don't use shared memory, how can we ensure that the results are right?
I have a program that sometimes runs fine (I get the results) and sometimes gives 'nan' results without my changing anything. Can that be a synchronization problem?
The use of __syncthreads does not involve shared memory; it only ensures synchronization within a block. But you do need to synchronize threads when you want them to share data through shared memory.
We don't always use shared memory because it is quite small, and because it can slow down your application when badly used, due to potential bank conflicts when shared memory is badly addressed. Moreover, recent architectures (from 2.0 on) implement shared memory in the same hardware area as the cache. For this reason, some seasoned CUDA developers recommend not using shared memory and relying on the cache mechanisms alone.
It can be. If you want to know whether it is a deadlock, try increasing the number of blocks you're using. If it is a deadlock, your GPU should freeze. If it is not, post your code; it will be easier for us to answer ;)
__syncthreads() and shared memory are independent ideas: you don't need one to use the other. The only requirement for using __syncthreads() that comes to mind is that all the threads must eventually arrive at that point in the code, otherwise your program will simply hang.
As for shared memory, yes, it's probably a matter of size that keeps it from being used all the time. From my understanding, shared memory is split amongst all blocks. For example, launching a kernel that uses 1 KB of shared memory per block with 100 blocks would require 100 KB, which exceeds what is available on the SM.
Although shared memory and __syncthreads() are independent concepts, they often go hand in hand. Otherwise, if threads operate independently, there is no need to use __syncthreads().
Two aspects restrict the use of shared memory: 1) the size of shared memory is limited, and 2) to achieve the best performance, you need to avoid bank conflicts when using shared memory.
It could be due to the lack of __syncthreads(). Sometimes, using shared memory without __syncthreads() can lead to unpredictable results.
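For reference, a minimal sketch of the usual shared memory plus __syncthreads() pattern, a block-wide sum with illustrative names and an assumed block size of 256; forgetting either __syncthreads() call is exactly the kind of thing that produces intermittent 'nan' or garbage results:

    __global__ void blockSum(const float* in, float* blockResults, int n)
    {
        __shared__ float s[256];                 // assumes blockDim.x == 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        s[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                         // the tile must be fully loaded before use

        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                s[threadIdx.x] += s[threadIdx.x + stride];
            __syncthreads();                     // each step needs all partial sums of the previous one
        }
        if (threadIdx.x == 0)
            blockResults[blockIdx.x] = s[0];
    }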