CUDA (JCUDA) shared memory (?) problems / undefined behaviour - cuda

Im working on my game project (tower defence) and Im trying to compute the distance between all criters and a tower with JCuda using the shared memory. For each tower I run 1 block with N threds, where N equals the number of critters on the map. Im computing distance between all criters and that tower for given block, and I store the smallest found distance so far in the block's shared memory. My current code looks like that:
extern "C"
__global__ void calcDistance(int** globalInputData, int size, int
critters, int** globalQueryData, int* globalOutputData) {
//shared memory
__shared__ float minimum[2];
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = blockIdx.y;
if (x < critters) {
int distance = 0;
//Calculate the distance between tower and criter
for (int i = 0; i < size; i++) {
int d = globalInputData[x][i] - globalQueryData[y][i];
distance += d * d;
}
if (x == 0) {
minimum[0] = distance;
minimum[1] = x;
}
__syncthreads();
if (distance < minimum[0]) {
minimum[0] = distance;
minimum[1] = x;
}
__syncthreads();
globalOutputData[y * 2] = minimum[0];
globalOutputData[y] = minimum[1];
}
}
The problem is if I rerun the code using the same input multiple times (I free all the memory on both host and device after each run) I get different output each time I the code gets executed for blocks (tower) number > 27... Im fairly sure it has something to do with the shared memory and the way im dealing with it, as rewriting the code to use global memory gives the same result whenever the code gets executed. Any ideas?

There is a memory race problem (so read-after-write correctness) in that kernel here:
if (distance < minimum[0]) {
minimum[0] = distance;
minimum[1] = x;
}
When executed, every thread in the block is going to try and simultaneously read and write the value of minimum. There are no guarantees what will happen when multiple threads in a warp try writing to the same shared memory location, and there are no guarantees what values that other warps in the same block will read when loading from a memory location to which is being written. Memory access is not atomic, and there is no locking or serialization which would ensure that code performed the type of reduction operation you seem to be trying to do.
A milder version of the same problem applies to the write back to global memory at the end of the kernel:
__syncthreads();
globalOutputData[y * 2] = minimum[0];
globalOutputData[y] = minimum[1];
The barrier before the writes ensures that the writes to minimum will be completed prior that a "final" (although inconsistent) value will be stored in minimum, but then every thread in the block will execute the write.
If your intention is to have each thread compute a distance, and then for the minimum of the distance values over the block to get written out to global memory, you will have to either use atomic memory operations (for shared memory this is supported on compute 1.2/1.3 and 2.x devices only), or write an explicit shared memory reduction. After that, only one thread should execute the write back to global memory.
Finally, you also have a potential synchronization correctness problem that could cause the kernel to hang. __syncthreads() (which maps to the PTX bar instruction) demands that every thread in the block arrive and execute the instruction prior to the kernel continuing. Having this sort of control flow:
if (x < critters) {
....
__syncthreads();
....
}
will cause the kernel to hang if some threads in the block can branch around the barrier and exit while others wait at the barrier. There should never be any branch divergence around a __syncthreads() call to ensure execution correctness of a kernel in CUDA.
So, in summary, back to the drawing board on at least three issues in the current code.

Related

how to sync threads in this cuda example

I have the following rough code outline:
run a loop, millions of times
in that loop, compute values 'I's - see example of such functions below
After all 'I's have been computed, compute other values 'V's
repeat the loop
Each computation of an I or V could involve up to 20ish mathematical operations, (e.g. I1 = A + B/C * D + 1/exp(V1) - E + F + V2 etc).
There are roughly:
50 'I's
10 'V's
10 values in each I and V, i.e. they are vectors of length 10
At first I tried running a simple loop in C, with kernel calls for each time step but this was really slow. It seems like I can get the code to run faster if the main loop is in a kernel that calls other kernels. However, I'm worried about kernel call overhead (maybe I shouldn't be) so I came up with something like the following, where each I and V loop independently, with syncing between the kernels as necessary.
For reference, the variables below are hardcoded as __device__ values, but eventually I will pass some values into specific kernels to make the system interesting.
__global__ void compute_IL1()
{
int id = threadIdx.x;
//n_t = 1e6;
for (int i = 0; i < n_t; i++){
IL1[id] = gl_1*(V1[id] - El_1);
//atomic, sync, event????,
}
}
__global__ void compute_IK1()
{
int id = threadIdx.x;
for (int i = 0; i < n_t; i++){
Ik1[id] = gk_1*powf(0.75*(1-H1[id]),4)*(V1[id]-Ek_1);
//atomic, sync, event?
}
}
__global__ void compute_V1()
{
int id = threadIdx.x;
for (int i = 0; i < n_t; i++){
//wait for IL1 and Ik1 and others, but how????
V1[id] = Ik1[id]+IL1[id] + ....
//trigger the I's again
}
}
//main function
compute_IL1<<<1,10,0,s0>>>();
compute_IK1<<<1,10,0,s1>>>();
//repeat this for many 50 - 70 more kernels (Is and Vs)
So the question is, how would I sync these kernels? Is an event approach best? Is there a better paradigm to use here?
There is no sane mechanism I can think of to have multiple resident kernels synchronize without resorting to hacky atomic tricks which may well not work reliably.
If you are running blocks with 10 threads and these kernels cannot execute concurrently for correctness reasons, you are (in the best possible case) using 1/64 of the computational capacity of your device. This problem as you have described it sounds completely Ill suited to a GPU.
So, I tried a couple of approaches.
A loop with a few kernel calls, where the last kernel call is dependent on the previous ones. This can be done with cudaStreamWaitEvent which can wait for multiple events. I found this on: http://cedric-augonnet.com/declaring-dependencies-with-cudastreamwaitevent/ . Unfortunately, the kernel calls were too expensive.
Global variables between concurrent streams. The logic was pretty simple, having one thread pause until a global variable equaled the loop variable, indicating that all threads could proceed. This was then followed by a sync-threads call. Unfortunately, this did not work well.
Ultimately, I think I've settled on a nested loop, where the outer loop represents time, and the inner loop indicates which of a set instructions to run, based on dependencies. I also launched the maximum number of threads per block (1024) and broke up the vectors that needed to be processed into warps. The rough psuedocode is:
run_main<<<1,1024>>>();
__global__ void run_main(){
int warp = threadIdx.x/32;
int id = threadIdx.x - warp*32;
if (id < 10){
for (int i = 0; i < n_t; i++){
for(int j = 0; j < n_j; j++){
switch (j){
case 0:
switch(warp){
case 0:
I1[id] = a + b + c*d ...
break;
case 1:
I2[id] = f*g/h
break;
}
break;
//These things depend on case 0 OR
//we've run out of space in the first pass
//32 cases max [0 ... 31]
case 1:
switch(warp){
case 0:
V1[ID] = I1*I2+ ...
break;
case 1:
V2[ID] = ...
//syncs across the block
__syncthreads();
This design is based on my impression that each set of 32 threads runs independently but should run the same code, otherwise things can slow done significantly.
So at the end, I'm running roughly 32*10 instructions simultaneously.
Where 32 is the number of warps, and it depends on how many different values I can compute at the same time (due to dependencies) and 10 is the # of elements in each vector. This is slowed down by any imbalances in the # of computations in each warp case, since all warps need to merge before moving onto the next step (due to the syncthreads call). I'm running different parameters (parameter sweep) on top of this, so I could potentially run 3 at a time in the block, multiplied by the # of streaming processors (or whatever the official name is) on the card.
One thing I need to change is that I'm currently testing on a video card that is attached to a monitor as well. Apparently Windows will kill the kernel if it lasts for more than 5 seconds, so I need to call the kernel in chunked time steps, like once every 1e5 time steps (in my case).

Memory access in CUDA kernel functions (simple example)

I am novice in GPU parallel computing and I'm trying to learn CUDA by looking at some examples in NVidia "CUDA by examples" book.
And I do not understand properly how thread access and change variables in such a simple example (dot product of two vectors).
The kernel function is defined as follows
__global__ void dot( float *a, float *b, float *c ) {
__shared__ float cache[threadsPerBlock];
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int cacheIndex = threadIdx.x;
float temp = 0;
while (tid < N) {
temp += a[tid] * b[tid];
tid += blockDim.x * gridDim.x;
}
// set the cache values
cache[cacheIndex] = temp;
I do not understand three things.
What is the sequence of execution of this function? Is there any sequence between threads? For example, the first are the thread from the first block, then threads from the second block come into play and so on (this is connected to the question why this is necessary to divide threads into blocks).
Do all threads have their own copy of the "temp" variable or not (if not, why is there no race condition?)
How is it operated? What exactly goes to the variable temp in the while loop? The array cache stores values of temp for different threads. How does the summation go on? It seems that temp already contains all sums necessary for dot product because variable tid goes from 0 to N-1 in the while loop.
Despite the code you provide is incomplete, here are some clarifications about what you are asking :
The kernel code will be executed by all the threads in all the blocks. The way to "split the jobs" is to make threads work only on one or a few elements.
For instance, if you have to treat 100 integers with a specific algorithm, you probably want 100 threads to treat 1 element each.
In CUDA the amount of blocks and threads is defined at the kernel launch on host side :
myKernel<<<grid, threads>>>(...);
Where grids and threads are dim3, which define the size on three dimensions.
There is no specific order in the execution of threads and blocks. As you can read there :
http://mc.stanford.edu/cgi-bin/images/3/34/Darve_cme343_cuda_3.pdf
On page 6 : "No specific order in which blocks are dispatched and executed".
Since the temp variable is defined in the kernel in no specific way, it is not distributed and each thread will have this value stored in a register.
This is equivalent of what is done on CPU side. So yes, this means each threads has its own "temp" variable.
The temp variable is updated in each iteration of the loop, using access to device arrays.
Again, this is equivalent of what is done on CPU side.
I think you should probably check if you are used enough to C/C++ programming on CPU side before going further into GPU programming. Meaning no offense, it seems you have a lack in several main topics.
Since CUDA allows you to drive your GPU with C code, the difficulty is not in the syntax, but in the specificities of the hardware.

CUDA shared memory broadcast and __syncthreads behavior

I am experiencing a strange issue, well at least to me it looks strange, and I was hoping someone might be able to shed some light on it. I have a CUDA kernel which relies on shared memory for fast local accesses. To the limits of my knowledge, if all the threads within a half-warp access the same shared memory bank then the value will be broadcast to the threads in the warp. Also, access from multiple warps to the same bank do not cause bank conflicts, they will just be serialized. Keeping this in mind, I have created a small kernel to test this out (after encountering issues in my original kernel). Here's the snippet:
#define NUM_VALUES 16
#define NUM_LOOPS 1024
__global__ void shared_memory_test(float *output)
{
// Create some shared memory
__shared__ int dm_delays[NUM_VALUES];
// Loop over NUM_LOOPS
float accumulator = 0;
for(unsigned c = 0; c < NUM_LOOPS; c++)
{
// Force shared memory update
for(int d = threadIdx.x; d < NUM_VALUES; d++)
dm_delays[d] = c * d;
// __syncthreads();
for(int d = 0; d < NUM_VALUES; d++)
accumulator += dm_delays[d];
}
// Store accumulated value to global memory
for(unsigned d = 0; d < NUM_VALUES; d++)
output[d] = accumulator;
}
I've run this with a block dimension of 16 (half a warp, not terribly efficient but it's just for testing purposes). All the threads should be addressing the same shared memory bank, so there should be no conflicts. However, the opposite seems to be true. I'm using Parallel Nsight on Visual Studio 2010 for this testing.
What is even more mysterious to me is the fact that if I uncomment the __syncthreads call in the outer loop then the number of bank conflicts increases dramatically.
Just some number to give you an idea (this is for a grid containing one block with 16 threads, so a single half-warp, NUM_VALUES = 16, NUM_LOOPS = 1024):
without __syncthreads: 4 bank conflicts
with __syncthreads : 4,096 bank conflicts
I'm running this on a GTX 670, set at compute_capability 3.0
Thank you in advance
UPDATE: It was pointed out that without __syncthreads the NUM_LOOPS reads in the outer loop were being optimised away by the compiler since the values of dm_delays never change. Now I get a constant 4,096 bank conflicts in both cases, which still doesn't play well with the broadcast behavior for shared memory.
Since the value of dm_delays does not change, this may be a case where the compiler optimizes away the 1024 reads to shared memory if the __syncthreads is not present. With the __syncthreads there, it may assume that the value could be changed by another thread, and so it reads the value over and over again.

Why is global + shared faster than global alone

I need some help understanding the behavior of Ron Farber's code: http://www.drdobbs.com/parallel/cuda-supercomputing-for-the-masses-part/208801731?pgno=2
I'm not understanding how the use of shared mem is giving faster performance over the non-shared memory version. i.e. If I add a few more index calculation steps and use add another Rd/Wr cycle to access the shared mem, how can this be faster than just using global mem alone? The same number or Rd/Wr cycles access global mem in either case. The data is still access only once per kernel instance. Data still goes in/out using global mem. The num of kernel instances is the same. The register count looks to be the same. How can adding more processing steps make it faster. (We are not subtracting any process steps.) Essentially we are doing more work, and it is getting done faster.
Shared mem access is much faster than global, but it is not zero, (or negative).
What am I missing?
The 'slow' code:
__global__ void reverseArrayBlock(int *d_out, int *d_in) {
int inOffset = blockDim.x * blockIdx.x;
int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
int in = inOffset + threadIdx.x;
int out = outOffset + (blockDim.x - 1 - threadIdx.x);
d_out[out] = d_in[in];
}
The 'fast' code:
__global__ void reverseArrayBlock(int *d_out, int *d_in) {
extern __shared__ int s_data[];
int inOffset = blockDim.x * blockIdx.x;
int in = inOffset + threadIdx.x;
// Load one element per thread from device memory and store it
// *in reversed order* into temporary shared memory
s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];
// Block until all threads in the block have written their data to shared mem
__syncthreads();
// write the data from shared memory in forward order,
// but to the reversed block offset as before
int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
int out = outOffset + threadIdx.x;
d_out[out] = s_data[threadIdx.x];
}
Early CUDA-enabled devices (compute capability < 1.2) would not treat the d_out[out] write in your "slow" version as a coalesced write. Those devices would only coalesce memory accesses in the "nicest" case where i-th thread in a half warp accesses i-th word. As a result, 16 memory transactions would be issued to service the d_out[out] write for every half warp, instead of just one memory transaction.
Starting with compute capability 1.2, the rules for memory coalescing in CUDA became much more relaxed. As a result, the d_out[out] write in the "slow" version would also get coalesced, and using shared memory as a scratch pad is no longer necessary.
The source of your code sample is article "CUDA, Supercomputing for the Masses: Part 5", which was written in June 2008. CUDA-enabled devices with compute capability 1.2 only arrived on the market 2009, so the writer of the article clearly talked about devices with compute capability < 1.2.
For more details, see section F.3.2.1 in the NVIDIA CUDA C Programming Guide.
this is because the shared memory is closer to the computing units, hence the latency and peak bandwidth will not be the bottleneck for this computation (at least in the case of matrix multiplication)
But most importantly, the top reason is that a lot of the numbers in the tile are being reused by many threads. So if you access from global you are retrieving those numbers multiple times. Writing them once to shared memory will eliminate that wasted bandwidth usage
When looking at the global memory accesses, the slow code reads forwards and writes backwards. The fast code both read and writes forwards. I think the fast code if faster because the cache hierarchy is optimized in, some way, for accessing the global memory in descending order (towards higher memory addresses).
CPUs do some speculative fetching, where they will fill cache lines from higher memory addresses before the data has been touched by the program. Maybe something similar happens on the GPU.

Copying whole global memory buffer many times to shared memory buffer

I have a buffer in global memory that I want to copy in shared memory for each block as to speed up my read-only access. Each thread in each block will use the whole buffer at different positions concurrently.
How does one do that?
I know the size of the buffer only at run time:
__global__ void foo( int *globalMemArray, int N )
{
extern __shared__ int s_array[];
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if( idx < N )
{
...?
}
}
The first point to make is that shared memory is limited to a maximum of either 16kb or 48kb per streaming multiprocessor (SM), depending on which GPU you are using and how it is configured, so unless your global memory buffer is very small, you will not be able to load all of it into shared memory at the same time.
The second point to make is that the contents of shared memory only has the scope and lifetime of the block it is associated with. Your sample kernel only has a single global memory argument, which makes me think that you are either under the misapprehension that the contents of a shared memory allocation can be preserved beyond the life span of the block that filled it, or that you intend to write the results of the block calculations back into same global memory array from which the input data was read. The first possibility is wrong and the second will result in memory races and inconsistant results. It is probably better to think of shared memory as a small, block scope L1 cache which is fully programmer managed than some sort of faster version of global memory.
With those points out of the way, a kernel which loaded sucessive segments of a large input array, processed them and then wrote some per thread final result back input global memory might look something like this:
template <int blocksize>
__global__ void foo( int *globalMemArray, int *globalMemOutput, int N )
{
__shared__ int s_array[blocksize];
int npasses = (N / blocksize) + (((N % blocksize) > 0) ? 1 : 0);
for(int pos = threadIdx.x; pos < (blocksize*npasses); pos += blocksize) {
if( pos < N ) {
s_array[threadIdx.x] = globalMemArray[pos];
}
__syncthreads();
// Calculations using partial buffer contents
.......
__syncthreads();
}
// write final per thread result to output
globalMemOutput[threadIdx.x + blockIdx.x*blockDim.x] = .....;
}
In this case I have specified the shared memory array size as a template parameter, because it isn't really necessary to dynamically allocate the shared memory array size at runtime, and the compiler has a better chance at performing optimizations when the shared memory array size is known at compile time (perhaps in the worst case there could be selection between different kernel instances done at run time).
The CUDA SDK contains a number of good example codes which demonstrate different ways that shared memory can be used in kernels to improve memory read and write performance. The matrix transpose, reduction and 3D finite difference method examples are all good models of shared memory usage. Each also has a good paper which discusses the optimization strategies behind the shared memory use in the codes. You would be well served by studying them until you understand how and why they work.