Is this a CUDA thread synchronization issue or something else?

I am very new to parallel programming and Stack Overflow. I am working on a matrix multiplication implementation using CUDA, with column-major float arrays as the matrix representation.
The algorithm I developed is a bit unusual and goes as follows. Given an n x m matrix A and an m x k matrix B, I launch n x k blocks with m threads in each block. Essentially, I launch one block for every entry in the resulting matrix, with each thread computing one of the multiplications for that entry. For example,
1 0 0   0 1 2
0 1 0 * 3 4 5
0 0 1   6 7 8
For the first entry in the resulting matrix, the threads of its block would compute
thread 0 computing 1 * 0
thread 1 computing 0 * 3
thread 2 computing 0 * 6
(row 0 of A paired with column 0 of B), with each thread adding its product into a zero-initialized result matrix.
Right now, I am not getting a correct answer. I am getting this over and over again:
0 0 2
0 0 5
0 0 8
My kernel function is below. Could this be a thread synchronization problem, or am I screwing up the array indexing or something?
/* @param d_A: column-major matrix
 * @param d_B: column-major matrix
 * @param d_result: zero-initialized matrix that the kernel writes to
 * @param dim_A: number of rows of A
 * @param dim_B: number of rows of B
 */
__global__ void dot(float *d_A, float *d_B, float *d_result, int dim_A, int dim_B) {
    int n = blockIdx.x;
    int k = blockIdx.y;
    int m = threadIdx.x;
    float a = d_A[(m * dim_A) + n];
    float b = d_B[(k * dim_B) + m];
    //d_result[(k * dim_A) + n] += (a * b);
    __syncthreads();
    float temp = d_result[(k * dim_A) + n];
    __syncthreads();
    temp = temp + (a * b);
    __syncthreads();
    d_result[(k * dim_A) + n] = temp;
    __syncthreads();
}

The whole idea of using __syncthreads() is wrong in this case. That API call has block scope:
__syncthreads();
float temp = d_result[(k * dim_A) + n];
__syncthreads();
temp = temp + (a * b);
__syncthreads();
d_result[(k * dim_A) + n] = temp;
__syncthreads();
The local variable float temp has thread scope, so putting a synchronization barrier around it is senseless.
The pointer d_result points to global memory, which is visible to all blocks, so a block-scoped barrier around it is equally senseless. Note that there is no barrier available yet (and maybe there never will be) that synchronizes threads across the whole grid.
Typically, __syncthreads() is needed when shared memory is used in the computation, and shared memory is what you want to use here. The CUDA C Programming Guide contains a matrix multiplication example that shows how to use shared memory and __syncthreads() properly.
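To make this concrete, here is a minimal sketch (my illustration, not the original answer's code) of how the one-block-per-output-entry scheme above could be made correct with a shared-memory reduction; it assumes the block size m is a power of two:

__global__ void dot(const float *d_A, const float *d_B, float *d_result,
                    int dim_A, int dim_B) {
    extern __shared__ float products[]; // one slot per thread
    int n = blockIdx.x;  // row of the output entry
    int k = blockIdx.y;  // column of the output entry
    int m = threadIdx.x; // which product this thread computes
    // Each thread writes its product to shared memory instead of
    // read-modify-writing global memory.
    products[m] = d_A[(m * dim_A) + n] * d_B[(k * dim_B) + m];
    __syncthreads();
    // Tree reduction within the block (assumes blockDim.x is a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (m < stride)
            products[m] += products[m + stride];
        __syncthreads();
    }
    // A single thread writes the fully reduced dot product.
    if (m == 0)
        d_result[(k * dim_A) + n] = products[0];
}

The launch would pass the dynamic shared-memory size as the third launch parameter, e.g. dot<<<dim3(n_rows, k_cols), m, m * sizeof(float)>>>(...), where n_rows, k_cols, and m name the matrix dimensions described above.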

Related

Is this coalesced access?

note "When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions".
but I have some questions.
__global__ void add(double *a, double *b){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    i = 3 * i;
    b[i] = a[i] + a[i + 1] + a[i + 2];
}
Can the three accesses (a[i], a[i + 1], a[i + 2]) be executed with only one instruction? (I mean, is this coalesced access?)
Or does coalescing only exist across the different threads of a warp (transverse), and not within a single thread?
I have read the similar question From non coalesced access to coalesced memory access CUDA, but I still don't understand: is this non-coalesced memory access?
2.
__global__ void add(double *a, double *b){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    b[i] = a[i] + a[i + 10] + a[i + 12]; // assuming no out-of-bounds indexing
}
This may be non-coalesced access, so I changed the code to:
__global__ void add(double *a, double *b){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    __shared__ double shareM[3 * BLOCK_SIZE];
    shareM[threadIdx.x] = a[i];
    shareM[threadIdx.x + 1] = a[i + 10];
    shareM[threadIdx.x + 2] = a[i + 12];
    b[i] = shareM[threadIdx.x] + shareM[threadIdx.x + 1] + shareM[threadIdx.x + 2];
}
I have read that coalesced access does not matter with shared memory. But does that mean the lines below are coalesced access within one thread?
shareM[threadIdx.x] = a[i];
shareM[threadIdx.x + 1] = a[i + 10];
shareM[threadIdx.x + 2] = a[i + 12];
Or does shared-memory coalesced access only exist across different threads, like the following example?
thread 0: shareM[0] = a[3]
thread 1: shareM[4] = a[23]
thread 2: shareM[7] = a[56]
3. I don't understand "coalesced access does not matter with shared memory".
Does it mean that loading data from global memory into local (or register) memory is slower than loading it from global memory into shared memory?
If so, why don't we use shared memory as a staging area (one 8-byte shared-memory slot per thread would be enough)?
Thank you.
Can the three accesses (a[i], a[i + 1], a[i + 2]) be executed with only one instruction? (I mean, is this coalesced access?)
When working with GPU kernels, I guess it's better to think of everything in a parallel way. Every instruction is executed by a group of 32 threads, a.k.a. a warp, so they are actually not just three accesses (the word "access" is also vague here; I assume you mean array accesses) but 32 x 3 = 96 accesses in total. A more correct way to say this is that they are three array accesses per thread.
According to [1-3], the coalesced accessing pattern is a behavior in terms of a warp:
When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads.
So, we need to consider each of these three array accesses separately. Let's rewrite the code as:
__global__ void add(double *a, double *b){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    i = 3 * i;
    double ai = a[i];      // <1>
    double ai1 = a[i + 1]; // <2>
    double ai2 = a[i + 2]; // <3>
    b[i] = ai + ai1 + ai2;
}
It is sufficient to consider only the first warp, with thread IDs ranging from 0 to 31.
<1>: Each thread in the warp allocates a double variable called ai in a register and wants to read a value from a at index i. Note that the original i ∈ [0, 31] is then multiplied by 3, so the warp is accessing a[0], a[3], ..., a[93]. Since a is a double array (every entry is 8 bytes), the warp needs 32 * 8 = 256 bytes in total; if those bytes were contiguous, that would be two 128-byte segments, servable by two 128-byte memory transactions. According to [4]:
If the size of the words accessed by each thread is more than 4 bytes, a memory request by a warp is first split into separate 128-byte memory requests that are issued independently: Two memory requests, one for each half-warp, if the size is 8 bytes, Four memory requests, one for each quarter-warp, if the size is 16 bytes.
So to load these 256 bytes from global memory into registers, the minimum number of memory requests is 2. If a could be accessed in that way, the access pattern would be coalescing. But apparently the pattern used in <1> is not; it looks like the diagram below:
[Diagram <1>: threads t0..t31 access a[0], a[3], ..., a[93], which are scattered across six 128-byte segments.]
The 32 threads of the warp access memory scattered across six 128-byte segments. In cached mode this needs at least six 128-byte memory transactions: 768 bytes are moved in total, but only 256 of them are useful. The bus utilization is about 1/3.
<2>: This is very similar to <1>, with an offset of 1 from the start:
[Diagram <2>: the same strided pattern shifted by one element: a[1], a[4], ..., a[94].]
<3>: This is very similar to <1>, with an offset of 2 from the start:
[Diagram <3>: the same strided pattern shifted by two elements: a[2], a[5], ..., a[95].]
I think you get the idea by now, and are probably thinking: how about loading all 768 bytes from global memory in one pass, since every byte is used exactly once? However, recall that each thread has its own private registers, and these registers cannot communicate with each other [5], so this cannot be done with registers alone. That is where shared memory comes in:
[Diagram: the threads of three warps access a[0]..a[95] contiguously, one element per thread, so every 128-byte segment is fetched exactly once with coalesced transactions.]
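As a concrete illustration (my sketch, not part of the original answer), a block can stage its 3 * blockDim.x consecutive doubles into shared memory with coalesced loads and then read them back in the strided pattern; BLOCK_SIZE is an assumed compile-time constant equal to the block size:

#define BLOCK_SIZE 128

__global__ void add(const double *a, double *b) {
    __shared__ double shareM[3 * BLOCK_SIZE];
    int i = 3 * (blockDim.x * blockIdx.x + threadIdx.x); // same i as before
    int base = 3 * blockDim.x * blockIdx.x;              // first element this block reads
    // Three coalesced loads: consecutive threads read consecutive doubles.
    for (int k = threadIdx.x; k < 3 * blockDim.x; k += blockDim.x)
        shareM[k] = a[base + k];
    __syncthreads();
    // The strided reads now hit fast shared memory instead of global memory.
    int j = 3 * threadIdx.x;
    b[i] = shareM[j] + shareM[j + 1] + shareM[j + 2];
}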
Does it mean that loading data from global memory into local (or register) memory is slower than loading it from global memory into shared memory? If so, why don't we use shared memory as a staging area (one 8-byte shared-memory slot per thread would be enough)?
AFAICT, you cannot directly transfer data from global memory to shared memory: a load passes through a register on its way into shared memory, so staging a single value there makes nothing faster. Shared memory pays off when threads need values that other threads loaded, as in the diagram above.
References:
[1]. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#maximize-memory-throughput
[2]. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses
[3]. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-3-0__examples-of-global-memory-accesses
[4]. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-3-0
[5]. I lied; there is a way to exchange values between the registers of different threads, using the __shfl intrinsics.
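For reference, here is a minimal sketch (my addition, not from the original answer) of the warp-shuffle approach mentioned in [5], using the __shfl_down_sync intrinsic (CUDA 9 and later) to sum one value per thread across a single warp:

__global__ void warp_sum(const float *in, float *out) {
    float v = in[threadIdx.x]; // assumes a single warp of 32 threads
    // Each step reads a value held in another thread's register.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    if (threadIdx.x == 0)
        *out = v; // lane 0 now holds the warp total
}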

What is the correlation between dimensional nature of threads and the dimensions of the data itself in CUDA?

I've read as a beginner that using a 2D block of threads is the simplest way to deal with a 2D dataset. I am trying to implement the following matrix operations in sequence:
Swap elements at odd and even positions of each row in the matrix
1 2          2 1
3 4 becomes  4 3
Reflect the elements of the matrix across the principal diagonal
2 1          2 4
4 3 becomes  1 3
To implement this, I wrote the following kernel:
__global__ void swap_and_reflect(float *d_input, float *d_output, int M, int N)
{
    int j = threadIdx.x;
    int i = threadIdx.y;
    for (int t = 0; t < M * N; t++)
        d_output[t] = d_input[t];
    float temp = 0.0;
    if (j % 2 == 0) {
        temp = d_output[j];
        d_output[j] = d_output[j+1];
        d_output[j+1] = temp;
    }
    __syncthreads(); // Wait for swap to complete
    if (i != j) {
        temp = d_output[i];
        d_output[i] = d_output[j];
        d_output[j] = temp;
    }
}
The reflection does not happen as expected. At this point, I find myself confusing the 2D structure of the executing threads with the 2D structure of the matrix itself.
Could you please correct my understanding of the multi-dimensional arrangement of threads and how it correlates to the dimensionality of the data itself? I believe this is the reason the reflection part is incorrect.
Any pointers/resources that could help me visualize/understand this correctly would be of immense help.
Thank you for reading.
The thread indices in your hypothetical 2x2 block are laid out in (x,y) pairs as
(0,0) (0,1)
(1,0) (1,1)
and the ordering is
thread ID   (x,y) pair
---------   ----------
    0         (0,0)
    1         (1,0)
    2         (0,1)
    3         (1,1)
You need to choose an ordering for your array in memory and then modify your kernel accordingly, for example:
if (i != j) {
    temp = d_output[i + 2*j];
    d_output[i + 2*j] = d_output[j + 2*i];
    d_output[j + 2*i] = temp;
}
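To make the thread-to-data mapping explicit, here is a minimal sketch (my illustration, not the answerer's code) of the full kernel for a single 2D block covering an M x N row-major matrix; it assumes N is even and the block exactly covers the matrix:

__global__ void swap_and_reflect(const float *d_input, float *d_output, int M, int N)
{
    int col = threadIdx.x; // x indexes columns
    int row = threadIdx.y; // y indexes rows
    d_output[row * N + col] = d_input[row * N + col]; // row-major linear index
    __syncthreads(); // finish the copy before modifying
    // Step 1: swap even/odd columns within each row; only the
    // even-column thread of each pair performs the swap.
    if (col % 2 == 0) {
        float t = d_output[row * N + col];
        d_output[row * N + col] = d_output[row * N + col + 1];
        d_output[row * N + col + 1] = t;
    }
    __syncthreads(); // wait for the swap before reflecting
    // Step 2: reflect across the principal diagonal; only the
    // upper-triangle threads swap, so each pair is swapped exactly once.
    if (row < col) {
        float t = d_output[row * N + col];
        d_output[row * N + col] = d_output[col * N + row];
        d_output[col * N + row] = t;
    }
}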

2D threads in CUDA

I'm trying to use 2D threads in CUDA. threadIdx.x and blockIdx.x work fine, but threadIdx.y and blockIdx.y don't: the .y ones are always 0.
Here is my code:
#include <cstdio>

#define N 16

__global__ void add(int* a) {
    int i = threadIdx.x;
    int j = threadIdx.y;
    a[i] = j;
}

int main(int argc, char **argv)
{
    int a[N];
    const int size = N * sizeof(int);
    int *da;
    cudaMalloc((void**)&da, size);
    add<<<1, N>>>(da);
    cudaMemcpy(a, da, size, cudaMemcpyDeviceToHost);
    printf("Thread indices:\n");
    for (int i = 0; i < N; i++)
    {
        printf("%d ", a[i]);
    }
    cudaFree(da);
    return 0;
}
The result for a[i] = j; or a[j] = j;
Thread indices:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
and for a[i] = i;
Thread indices:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
I tried using
#define M 4
#define N 4
...
int i = (blockDim.x * blockIdx.x) + threadIdx.x;
int j = (blockDim.y * blockIdx.y) + threadIdx.y;
...
add<<<M, N>>>(da);
...
and the result is the same: the .x values are fine but the .y ones are all 0. Can anyone help me fix this? Thanks.
You are confusing blocks and threads with dimensions.
add<<<M,N>>> is interpreted as add<<<dim3(M,1,1), dim3(N,1,1)>>>, where M is the number of blocks and N is the number of threads per block.
If you want an MxN grid of blocks, each with MxN threads, call add<<<dim3(M,N), dim3(M,N)>>>.
I would recommend the Udacity CUDA course for beginners; it is very beginner friendly.
I want M blocks with N threads per block.
Well then, add<<<M,N>>> is correct, but it is one-dimensional; there is no y to it. If you want to locate the thread, use this code:
int index = threadIdx.x + blockDim.x * blockIdx.x;
There is no y in it; the entire thing is 1D. Each block can only hold a limited number of threads (at most 1,024 on current GPUs), which is why threads and blocks are separated. There are a lot of nuances to it. I would recommend the Udacity course; it helped me a lot.
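For completeness, here is a minimal sketch (my example, not from the answer) of a genuinely 2D launch in which threadIdx.y takes nonzero values; a single 4x4 block writes its y index into a 4x4 array:

#include <cstdio>

#define W 4
#define H 4

__global__ void fill(int *a) {
    int x = threadIdx.x;
    int y = threadIdx.y;
    a[y * W + x] = y; // row-major store; y is nonzero for rows 1..3
}

int main() {
    int h[W * H];
    int *d;
    cudaMalloc((void**)&d, sizeof(h));
    dim3 block(W, H); // a single 2D block of 4x4 threads
    fill<<<1, block>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++)
            printf("%d ", h[y * W + x]);
        printf("\n");
    }
    cudaFree(d);
    return 0;
}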

cuda matrix multiplication by columns

I'm trying to do matrix multiplication in CUDA. My implementation is different from the CUDA example.
The CUDA example (from the CUDA samples) performs matrix multiplication by multiplying each value in the row of the first matrix by each value in the column of the second matrix, then summing the products and storing them in an output vector at the index of the row from the first matrix.
My implementation multiplies each value in the column of the first matrix by the single value of the row of the second matrix, where the row index equals the column index. It then has an output vector in global memory, each of whose indices gets updated.
The CUDA example implementation can have a single thread update each index in the output vector, whereas my implementation can have multiple threads updating each index.
The results that I get show only some of the updates. For example, if an index should receive 4 updates, only 1 or 2 of them take effect.
I think that the threads might be interfering with each other since they're all trying to write to the same indices of the vector in global memory. So maybe, while one thread is writing to an index, the other might not be able to insert its value and update the index?
Just wondering if this assessment makes sense.
For example. To multiply the following two matrices:
[3 0 0 2    [1    [a
 3 0 0 2  x  2  =  b
 3 0 0 0     3     c
 0 1 1 0]    4]    d]
The CUDA sample does the matrix multiplication in the following way, using 4 threads, where a, b, c, d are stored in global memory:
Thread 0: 3*1 + 0*2 + 0*3 + 2*4 = a
Thread 1: 3*1 + 0*2 + 0*3 + 2*4 = b
Thread 2: 3*1 + 0*2 + 0*3 + 0*4 = c
Thread 3: 0*1 + 1*2 + 1*3 + 0*4 = d
My implementation looks like this:
a = b = c = d = 0
Thread 0:
a += 3*1
b += 3*1
c += 3*1
d += 0*1
Thread 1:
a += 0*2
b += 0*2
c += 0*2
d += 1*2
Thread 2:
a += 0*3
b += 0*3
c += 0*3
d += 1*3
Thread 3:
a += 2*4
b += 2*4
c += 0*4
d += 0*4
So at one time all four threads could be trying to update one of the indices.
In order to fix this issue, I used atomicAdd to do the += operation. When a thread performs the operation a += 3*1 (for example), it does three things:
It gets the previous value of a
It updates the value by doing 3*1 + previous value of a
It then stores the new value into a
Using atomicAdd guarantees that these three steps happen without interruption from other threads. If atomicAdd is not used, thread 0 could read the previous value of a, and while thread 0 is updating the value, thread 1 could read the same previous value of a and perform its own update; one of the two += operations is then lost.
If a += 3*1 is used instead of atomicAdd(&a, 3*1), thread 1 can interfere and change the value between thread 0's read and its write. That creates a race condition.
atomicAdd is a += operation. You would use code like the following to perform it; note that the operand must live in global or shared memory (a thread-local variable needs no atomics, since no other thread can touch it):
__global__ void kernel(int *a){
    atomicAdd(a, 3*1); // atomically performs *a += 3*1
}
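As an illustration (my sketch, not the poster's actual code), the column-oriented scheme above, for an n x n column-major matrix A times a vector x, could look like this, with one thread per column and atomicAdd accumulating into a zero-initialized output y:

__global__ void matvec_by_columns(const float *A, const float *x,
                                  float *y, int n) {
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    if (col >= n) return;
    float xv = x[col];
    // Each thread scatters its column's contribution into every row of y;
    // atomicAdd keeps concurrent += updates from losing each other.
    for (int row = 0; row < n; row++)
        atomicAdd(&y[row], A[col * n + row] * xv);
}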

CUDA Warp Synchronization Problem

In generalizing a kernel that shifts the values of a 2D array one space to the right (wrapping around the row boundaries), I have come across a warp synchronization problem. The full code is included below.
The code is meant to work for an arbitrary array width, array height, number of thread blocks, and number of threads per block. When choosing a thread size of 33 (i.e. one more thread than a full warp), the 33rd thread doesn't synchronize when __syncthreads() is called. This causes problems with the output data. The problem is only present when there is more than one warp and the width of the array is greater than the number of threads (e.g. with width=35 and 34 threads).
The following is a downsized example of what happens (in reality the array would need to have more elements for the kernel to produce the error).
Initial array:
0 1 2 3 4
5 6 7 8 9
Expected Result:
4 0 1 2 3
9 5 6 7 8
Kernel Produces:
4 0 1 2 3
8 5 6 7 8
The first line is done correctly (for each block, if there is more than one), with all subsequent lines having the second-last value repeated. I have tested this on two different cards (an 8600GT and a GTX280) and get the same results. I would like to know if this is just a bug in my kernel, or a problem that can't be fixed by adjusting my code?
The full source file is included below.
Thank you.
#include <cstdio>
#include <cstdlib>

// A method to ensure all reads use the same logical layout.
inline __device__ __host__ int loc(int x, int y, int width)
{
    return y*width + x;
}

// Kernel to shift all items in a 2D array one position to the right (wrapping around rows).
__global__ void shiftRight(int* globalArray, int width, int height)
{
    int temp1 = 0; // temporary swap variables
    int temp2 = 0;
    int blockRange = 0; // the number of rows that a single block will shift

    if (height % gridDim.x == 0) // logic to account for awkward array sizes
        blockRange = height / gridDim.x;
    else
        blockRange = 1 + height / gridDim.x;

    int yStart = blockIdx.x * blockRange;
    int yEnd = yStart + blockRange; // the end condition for the y-loop
    yEnd = min(height, yEnd);       // make sure that the array doesn't go out of bounds

    for (int y = yStart; y < yEnd; ++y)
    {
        // Do the first read so the swap variables are loaded for the x-loop.
        temp1 = globalArray[loc(threadIdx.x, y, width)];
        // Each block shifts an entire row by itself, even if there are more columns than threads.
        for (int threadXOffset = threadIdx.x; threadXOffset < width; threadXOffset += blockDim.x)
        {
            // blockDim.x is added so that we store the next round of values;
            // this has to be done now, because the next operation will
            // overwrite one of these values.
            temp2 = globalArray[loc((threadXOffset + blockDim.x) % width, y, width)];
            __syncthreads(); // sync before the write to ensure all the values have been read
            globalArray[loc((threadXOffset + 1) % width, y, width)] = temp1;
            __syncthreads(); // sync after the write to ensure all the values have been written
            temp1 = temp2;   // swap the storage variables
        }
        if (threadIdx.x == 0 && y == 0)
            globalArray[loc(12, 2, width)] = globalArray[67];
    }
}

int main(int argc, char* argv[])
{
    // Set the parameters to be used.
    int width = 34;
    int height = 3;
    int threadsPerBlock = 33;
    int numBlocks = 1;
    int memSizeInBytes = width * height * sizeof(int);

    // Create the host data and assign each element of the array to equal its index.
    int* hostData = (int*) malloc(memSizeInBytes);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            hostData[loc(x, y, width)] = loc(x, y, width);

    // Create and allocate the device pointers.
    int* deviceData;
    cudaMalloc(&deviceData, memSizeInBytes);
    cudaMemset(deviceData, 0, memSizeInBytes);
    cudaMemcpy(deviceData, hostData, memSizeInBytes, cudaMemcpyHostToDevice);
    cudaThreadSynchronize();

    // Launch the kernel.
    shiftRight<<<numBlocks, threadsPerBlock>>>(deviceData, width, height);
    cudaThreadSynchronize();

    // Copy the device data to a host array.
    int* hostDeviceOutput = (int*) malloc(memSizeInBytes);
    cudaMemcpy(hostDeviceOutput, deviceData, memSizeInBytes, cudaMemcpyDeviceToHost);
    cudaFree(deviceData);

    // Print out the expected/desired device output.
    printf("---- Expected Device Output ----\n");
    printf("   | ");
    for (int x = 0; x < width; ++x)
        printf("%4d ", x);
    printf("\n---|-");
    for (int x = 0; x < width; ++x)
        printf("-----");
    for (int y = 0; y < height; ++y)
    {
        printf("\n%2d | ", y);
        for (int x = 0; x < width; ++x)
            printf("%4d ", hostData[loc((x - 1 + width) % width, y, width)]);
    }
    printf("\n\n");

    printf("---- Actual Device Output ----\n");
    printf("   | ");
    for (int x = 0; x < width; ++x)
        printf("%4d ", x);
    printf("\n---|-");
    for (int x = 0; x < width; ++x)
        printf("-----");
    for (int y = 0; y < height; ++y)
    {
        printf("\n%2d | ", y);
        for (int x = 0; x < width; ++x)
            printf("%4d ", hostDeviceOutput[loc(x, y, width)]);
    }
    printf("\n\n");
}
Because not all threads are executing the same number of loop iterations, synchronization is a problem! Every thread should hit the same sequence of __syncthreads() calls.
I would suggest transforming your innermost for loop into something like this:
for (int blockXOffset = 0; blockXOffset < width; blockXOffset += blockDim.x) {
    int threadXOffset = blockXOffset + threadIdx.x;
    bool isActive = (threadXOffset < width);
    if (isActive) temp2 = globalArray[loc((threadXOffset + blockDim.x) % width, y, width)];
    __syncthreads();
    if (isActive) globalArray[loc((threadXOffset + 1) % width, y, width)] = temp1;
    __syncthreads();
    temp1 = temp2;
}
From the Programming Guide:
__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.
In my example, not all threads were executing the same number of loop iterations, so the synchronization never happened correctly.