Bank conflict in CUDA when reading from the same location

I have a CUDA kernel where there is a point where each thread is reading the same value from the global memory. So something like:
__global__ void my_kernel(const float4 * key_pts)
{
    if (key_pts[blockIdx.x].x < 0) return;  // every thread in the block reads the same element
}
The kernel is configured as follows:
dim3 blocks(16, 16);
dim3 grid(2000);
my_kernel<<<grid, blocks, 0, stream>>>(key_pts);
My question is whether this will lead to some sort of bank conflict or sub-optimal access in CUDA. I must confess I do not understand this issue in detail yet.
I was thinking I could do something like the following in case we have sub-optimal access:
__global__ void my_kernel(const float4 * key_pts)
{
    __shared__ float x;
    if (threadIdx.x == 0 && threadIdx.y == 0)
        x = key_pts[blockIdx.x].x;  // a single thread loads the per-block value
    __syncthreads();
    if (x < 0) return;
}
Doing some timing, though, I do not see any difference between the two, but so far my tests are with limited data.

Bank conflicts apply to shared memory, not global memory.
Since all threads need (ultimately) the same value to make their decision, this won't yield sub-optimal access on global memory: there is a broadcast mechanism, so all threads in the same warp requesting the same location/value from global memory will retrieve it without any serialization or overhead. All threads in the warp can be serviced at the same time:
Note that threads can access any words in any order, including the same words.
Furthermore, assuming your GPU has a cache (cc2.0 or newer) the value retrieved from global memory for the first warp encountering this will likely be available in the cache for subsequent warps that hit this point.
I wouldn't expect much performance difference between the two cases.
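If you do want to be explicit about the read path, a hedged alternative (my own sketch, not part of the answer above; the kernel name is made up) is to route the load through the read-only data cache with __ldg(), available on compute capability 3.5 and later:
// Illustrative variant only: the per-block value is fetched through the
// read-only data cache (cc 3.5+). All threads in a warp request the same
// address, so the load is broadcast within the warp.
__global__ void my_kernel_ldg(const float4 * __restrict__ key_pts)
{
    float4 kp = __ldg(&key_pts[blockIdx.x]);
    if (kp.x < 0) return;
}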

Related

CUDA shared memory read/write order within a single thread

Shared memory writes are not automatically synchronized between threads in a block. But I don't know whether shared memory is guaranteed to be consistent from the point of view of the thread that wrote it.
For example, in this example:
__global__ void kernel()
{
    __shared__ int i, j;
    if(threadIdx.x == 0)
    {
        i = 10;
        j = i;
    }
    // #1
}
Is it guaranteed at #1 that, for thread 0, i=10 and j=10, or do I need some memory fence or introduce a local variable?
I'm going to assume that by
for thread 0
you mean, "the thread that passed the if-test". And for the sake of this discussion, I will assume there is only one of those.
Yes, it's guaranteed. Otherwise basic C++ compliance would be broken in CUDA.
Challenges in CUDA may arise in inter-thread communication or visibility. However, you don't have that in view in your question.
As an example, it is certainly not guaranteed that for some other thread, i will be visible as 10, without some sort of fence or barrier.
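To make that last point concrete, here is a minimal sketch (my own illustration, not from the question; the kernel and parameter names are hypothetical) of how other threads need a barrier before they can rely on seeing thread 0's writes:
__global__ void visibility_demo(int *out)   // launch with at least 2 threads per block
{
    __shared__ int i, j;
    if (threadIdx.x == 0)
    {
        i = 10;   // write by thread 0
        j = i;    // thread 0 is guaranteed to read back its own write: j == 10
    }
    __syncthreads();          // the barrier makes thread 0's writes visible to the whole block
    if (threadIdx.x == 1)
        out[0] = i + j;       // after the barrier this reliably writes 20
}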

Are needless write operations in multi-thread kernels in CUDA inefficient?

I have a kernel in my CUDA code where I want a bunch of threads to do a bunch of computations on some piece of shared memory (because it's much faster than doing so on global memory), and then write the result to global memory (so I can use it in later kernels). The kernel looks something like this:
__global__ void calc(float * globalmem)
{
    __shared__ float sharemem; //initialize shared memory
    sharemem = 0; //set it to initial value
    __syncthreads();
    //do various calculations on the shared memory
    //for example I use atomicAdd() to add each thread's
    //result to sharemem...
    __syncthreads();
    *globalmem = sharemem; //write shared memory to global memory
}
The fact that every single thread is writing the data out from shared to global memory, when I really only need to write it out once, feels fishy to me. I also get the same feeling from the fact that every thread initializes the shared memory to zero at the start of the code. Is there a faster way to do this than my current implementation?
At the warp level, there's probably not much performance difference between doing a redundant read or write vs. having a single thread do it.
However I would expect a possibly measurable performance difference by having multiple warps in a threadblock do the redundant read or write (vs. a single thread).
It should be sufficient to address these concerns by having a single thread do the read or write, rather than redundantly:
__global__ void calc(float * globalmem)
{
    __shared__ float sharemem; //initialize shared memory
    if (!threadIdx.x) sharemem = 0; //set it to initial value
    __syncthreads();
    //do various calculations on the shared memory
    //for example I use atomicAdd() to add each thread's
    //result to sharemem...
    __syncthreads();
    if (!threadIdx.x) *globalmem = sharemem; //write shared memory to global memory
}
Although you didn't ask about it, using atomics on shared memory within a threadblock can often be replaced (possibly for better performance) by a shared-memory reduction method, as sketched below.
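A minimal sketch of such a block-wide reduction, assuming a power-of-two block size; the kernel name and the per-thread placeholder value are made up for illustration:
__global__ void calc_reduce(float * globalmem)
{
    extern __shared__ float sdata[];          // one slot per thread instead of one contended location
    unsigned int tid = threadIdx.x;

    sdata[tid] = (float)tid;                  // placeholder for each thread's computed result
    __syncthreads();

    // pairwise tree reduction; assumes blockDim.x is a power of two
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        *globalmem = sdata[0];                // a single thread writes the block result
}
This would be launched with the dynamic shared-memory size supplied, e.g. calc_reduce<<<1, 256, 256 * sizeof(float)>>>(d_out);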

Declaring Variables in a CUDA kernel

Say you declare a new variable in a CUDA kernel and then use it in multiple threads, like:
__global__ void kernel(float* delt, float* deltb) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float a;
    a = delt[i] + deltb[i];
    a += 1;
}
and the kernel call looks something like below, with multiple threads and blocks:
int threads = 200;
uint3 blocks = make_uint3(200,1,1);
kernel<<<blocks,threads>>>(d_delt, d_deltb);
Is "a" stored on the stack?
Is a new "a" created for each thread when they are initialized?
Or will each thread independently access "a" at an unknown time, potentially messing up the algorithm?
Any variable (scalar or array) declared inside a kernel function without a __shared__ (or __device__) qualifier is local to each thread; that is, each thread has its own "copy" of that variable, and no data race among threads will occur.
The compiler chooses whether a local variable will reside in registers or in local memory (which physically lives in global memory), depending on the transformations and optimizations it performs.
Further details on which variables go to local memory can be found in the NVIDIA CUDA C Programming Guide, section 5.3.2.2.
None of the above. The CUDA compiler is smart enough and aggressive enough with optimisations that it can detect that a is unused, so the complete code can be optimised away. You can confirm this by compiling the kernel with -Xptxas=-v and looking at the resource counts, which should show essentially no registers and no local memory or heap usage.
In a less trivial example, a would probably be stored in a per thread register, or in per thread local memory, which is off-die DRAM.
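To actually see a nonzero resource count, here is a slightly less trivial variant (my own, hypothetical kernel and parameter names) in which a is consumed, so the compiler cannot eliminate it:
// Hypothetical variant in which 'a' is kept alive by a store to global memory.
// Compile with: nvcc -Xptxas=-v ... and the register count will be nonzero.
__global__ void kernel_used(float* delt, float* deltb, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float a;                    // per-thread local variable (register or local memory)
    a = delt[i] + deltb[i];
    a += 1;
    out[i] = a;                 // the store prevents 'a' from being optimised away
}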

CUDA shared memory broadcast and __syncthreads behavior

I am experiencing a strange issue, well at least to me it looks strange, and I was hoping someone might be able to shed some light on it. I have a CUDA kernel which relies on shared memory for fast local accesses. To the limits of my knowledge, if all the threads within a half-warp access the same shared memory bank then the value will be broadcast to the threads in the warp. Also, accesses from multiple warps to the same bank do not cause bank conflicts; they will just be serialized. Keeping this in mind, I have created a small kernel to test this out (after encountering issues in my original kernel). Here's the snippet:
#define NUM_VALUES 16
#define NUM_LOOPS 1024
__global__ void shared_memory_test(float *output)
{
    // Create some shared memory
    __shared__ int dm_delays[NUM_VALUES];
    // Loop over NUM_LOOPS
    float accumulator = 0;
    for(unsigned c = 0; c < NUM_LOOPS; c++)
    {
        // Force shared memory update
        for(int d = threadIdx.x; d < NUM_VALUES; d++)
            dm_delays[d] = c * d;
        // __syncthreads();
        for(int d = 0; d < NUM_VALUES; d++)
            accumulator += dm_delays[d];
    }
    // Store accumulated value to global memory
    for(unsigned d = 0; d < NUM_VALUES; d++)
        output[d] = accumulator;
}
I've run this with a block dimension of 16 (half a warp, not terribly efficient but it's just for testing purposes). All the threads should be addressing the same shared memory bank, so there should be no conflicts. However, the opposite seems to be true. I'm using Parallel Nsight on Visual Studio 2010 for this testing.
What is even more mysterious to me is the fact that if I uncomment the __syncthreads call in the outer loop then the number of bank conflicts increases dramatically.
Just some numbers to give you an idea (this is for a grid containing one block with 16 threads, so a single half-warp, NUM_VALUES = 16, NUM_LOOPS = 1024):
without __syncthreads: 4 bank conflicts
with __syncthreads: 4,096 bank conflicts
I'm running this on a GTX 670, compiled for compute capability 3.0.
Thank you in advance
UPDATE: It was pointed out that without __syncthreads the NUM_LOOPS reads in the outer loop were being optimised away by the compiler since the values of dm_delays never change. Now I get a constant 4,096 bank conflicts in both cases, which still doesn't play well with the broadcast behavior for shared memory.
Since the value of dm_delays does not change, this may be a case where the compiler optimizes away the 1024 reads to shared memory if the __syncthreads is not present. With the __syncthreads there, it may assume that the value could be changed by another thread, and so it reads the value over and over again.
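One way to test that hypothesis (my own suggestion, not part of the original exchange; the kernel name is made up) is to declare the shared array volatile, which forbids the compiler from caching the loads in registers, so every read really goes to shared memory whether or not the barrier is present:
#define NUM_VALUES 16
#define NUM_LOOPS 1024
__global__ void shared_memory_test_volatile(float *output)
{
    // 'volatile' forces every access to dm_delays to touch shared memory
    volatile __shared__ int dm_delays[NUM_VALUES];
    float accumulator = 0;
    for(unsigned c = 0; c < NUM_LOOPS; c++)
    {
        for(int d = threadIdx.x; d < NUM_VALUES; d++)
            dm_delays[d] = c * d;
        // __syncthreads();   // toggling this should no longer change the profiler counts
        for(int d = 0; d < NUM_VALUES; d++)
            accumulator += dm_delays[d];
    }
    for(unsigned d = 0; d < NUM_VALUES; d++)
        output[d] = accumulator;
}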

Why is global + shared faster than global alone

I need some help understanding the behavior of Ron Farber's code: http://www.drdobbs.com/parallel/cuda-supercomputing-for-the-masses-part/208801731?pgno=2
I'm not understanding how the use of shared mem is giving faster performance over the non-shared-memory version. i.e. If I add a few more index-calculation steps and add another Rd/Wr cycle to access the shared mem, how can this be faster than just using global mem alone? The same number of Rd/Wr cycles access global mem in either case. The data is still accessed only once per kernel instance. Data still goes in/out using global mem. The number of kernel instances is the same. The register count looks to be the same. How can adding more processing steps make it faster? (We are not subtracting any processing steps.) Essentially we are doing more work, and it is getting done faster.
Shared mem access is much faster than global, but it is not zero (or negative).
What am I missing?
The 'slow' code:
__global__ void reverseArrayBlock(int *d_out, int *d_in) {
    int inOffset = blockDim.x * blockIdx.x;
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int in = inOffset + threadIdx.x;
    int out = outOffset + (blockDim.x - 1 - threadIdx.x);
    d_out[out] = d_in[in];
}
The 'fast' code:
__global__ void reverseArrayBlock(int *d_out, int *d_in) {
    extern __shared__ int s_data[];
    int inOffset = blockDim.x * blockIdx.x;
    int in = inOffset + threadIdx.x;
    // Load one element per thread from device memory and store it
    // *in reversed order* into temporary shared memory
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];
    // Block until all threads in the block have written their data to shared mem
    __syncthreads();
    // write the data from shared memory in forward order,
    // but to the reversed block offset as before
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int out = outOffset + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}
Early CUDA-enabled devices (compute capability < 1.2) would not treat the d_out[out] write in your "slow" version as a coalesced write. Those devices would only coalesce memory accesses in the "nicest" case, where the i-th thread in a half warp accesses the i-th word. As a result, 16 memory transactions would be issued to service the d_out[out] write for every half warp, instead of just one memory transaction.
Starting with compute capability 1.2, the rules for memory coalescing in CUDA became much more relaxed. As a result, the d_out[out] write in the "slow" version would also get coalesced, and using shared memory as a scratch pad is no longer necessary.
The source of your code sample is the article "CUDA, Supercomputing for the Masses: Part 5", which was written in June 2008. CUDA-enabled devices with compute capability 1.2 only arrived on the market in 2009, so the writer of the article was clearly talking about devices with compute capability < 1.2.
For more details, see section F.3.2.1 in the NVIDIA CUDA C Programming Guide.
This is because the shared memory is closer to the computing units, so the latency and peak bandwidth will not be the bottleneck for this computation (at least in the case of matrix multiplication).
But most importantly, the top reason is that a lot of the numbers in the tile are being reused by many threads. So if you access them from global memory you are retrieving those numbers multiple times. Writing them once to shared memory eliminates that wasted bandwidth usage.
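To make the reuse argument concrete, here is a hedged sketch of the classic tiled matrix multiply (my own illustration, not code from the article or the question): every element loaded into a tile is fetched from global memory once but read TILE times from shared memory.
#define TILE 16
// Illustrative C = A * B for square N x N row-major matrices, N a multiple of TILE.
// Launch with dim3 grid(N/TILE, N/TILE) and dim3 block(TILE, TILE).
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t)
    {
        // Each element is fetched from global memory exactly once per tile...
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // ...and then reused TILE times from shared memory by the block.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * N + col] = acc;
}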
When looking at the global memory accesses, the slow code reads forwards and writes backwards. The fast code both reads and writes forwards. I think the fast code is faster because the cache hierarchy is optimized, in some way, for accessing global memory in ascending order (towards higher memory addresses).
CPUs do some speculative fetching, where they will fill cache lines from higher memory addresses before the data has been touched by the program. Maybe something similar happens on the GPU.