CUDA independent thread scheduling

Q1: The programming guide v11.6.0 states that the following code pattern is valid on Volta and later GPUs:
if (tid % warpSize < 16) {
    ...
    float swapped = __shfl_xor_sync(0xffffffff, val, 16);
    ...
} else {
    ...
    float swapped = __shfl_xor_sync(0xffffffff, val, 16);
    ...
}
Why so?
Suppose the if branch gets executed first: when threads 0~15 hit the __shfl_xor_sync statement, they become inactive, and threads 16~31 start executing instructions until they hit the same statement, at which point the first and second half-warps exchange val. Is my understanding correct?
If so, the programming guide also states that "if the target thread is inactive, the retrieved value is undefined" and that "threads can be inactive for a variety of reasons including ... having taken a different branch path than the branch path currently executed by the warp." Doesn't that mean both the if and else branches will get undefined values?
Q2: On GPUs with current implementation of independent thread scheduling (Volta~Ampere), when the if branch is executed, are inactive threads still doing NOOP? That is, should I still think of warp execution as lockstep?
Q3: Is synchronization (such as __shfl_sync, __ballot_sync) the only cause for statement interleaving (statements A and B from the if branch interleaved with X and Y from the else branch)? I'm curious how the current ITS differs from subwarp interleaving.

Q1:
Why so?
This is an exceptional case. The programming guide doesn't (as far as I know) give a complete enough description of the detailed behavior of __shfl_sync() to understand this case, although the statements it does give are correct. To get a detailed behavioral description of the instruction, I suggest looking at the PTX guide:
shfl.sync will cause executing thread to wait until all non-exited threads corresponding to membermask have executed shfl.sync with the same qualifiers and same membermask value before resuming execution.
Careful study of that statement may be sufficient for understanding. But we can unpack it a bit.
As already stated, this doesn't apply to compute capability less than 7.0. For those compute capabilities, all threads named in the member mask must participate in the exact line of code/instruction, and for any warp lane's result to be valid, the source lane must be named in the member mask and must not be excluded from participation due to forced divergence at that line of code.
I would describe __shfl_sync() as "exceptional" in the cc7.0+ case because it causes partial-warp execution to pause at that point of the instruction, and control/scheduling would then be given to other warp fragments. Those other warp fragments would be allowed to proceed (due to Volta ITS) until all threads named in the member mask have arrived at a __shfl_sync() statement that "matches", i.e. has the same member mask and qualifiers. Then the shuffle statement executes. Therefore, in spite of the enforced divergence at this point, the __shfl_sync() operation behaves as if the warp were sufficiently converged at that point to match the member mask.
I would describe that as "unusual" or "exceptional" behavior.
If so, the programming guide also states that "if the target thread is inactive, the retrieved value is undefined" and that "threads can be inactive for a variety of reasons including ... having taken a different branch path than the branch path currently executed by the warp."
In my view, the "if the target thread is inactive, the retrieved value is undefined" statement most directly applies to compute capability less than 7.0. It also applies to compute capability 7.0+ if there is no corresponding/matching shuffle statement elsewhere, that the thread scheduler can use to create an appropriate warp-wide (or member-mask wide) shuffle op. The provided code example only gives sensible results because there is a matching op both in the if portion and the else portion. If we made the else portion an empty statement, the code would not give interesting results for any thread in the warp.
Q2:
On GPUs with current implementation of independent thread scheduling (Volta~Ampere), when the if branch is executed, are inactive threads still doing NOOP? That is, should I still think of warp execution as lockstep?
If we consider the general case, I would suggest that the way to think about inactive threads is that they are inactive. You can call that a NOOP if you like. Warp execution at that point is not "lockstep" across the entire warp, because of the enforced divergence (in my view). I don't wish to argue the semantics here. If you feel an accurate description there is "lockstep execution given that some threads are executing the instruction and some aren't", that is ok. We have now seen, however, that for the specific case of the shuffle sync ops, the Volta+ thread scheduler works around the enforced divergence, combining ops from different execution paths, to satisfy the expectations for that particular instruction.
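As an illustration (my own sketch, not from the programming guide), you can observe which lanes the hardware currently considers active inside each branch with __activemask(); note that the reported mask may legitimately be a subset of the lanes you expect, since the scheduler is free to subdivide the warp further:
#include <cstdio>

__global__ void show_active_lanes(){
    int lane = threadIdx.x % warpSize;
    if (lane < 16) {
        // only the lower half-warp can be active on this path;
        // typically the mask prints as 0x0000ffff
        printf("lane %2d if-branch   activemask 0x%08x\n", lane, __activemask());
    } else {
        // only the upper half-warp can be active here; typically 0xffff0000
        printf("lane %2d else-branch activemask 0x%08x\n", lane, __activemask());
    }
}

int main(){
    show_active_lanes<<<1,32>>>();  // a single warp
    cudaDeviceSynchronize();
}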
Q3:
Is synchronization (such as __shfl_sync, __ballot_sync) the only cause for statement interleaving (statements A and B from the if branch interleaved with X and Y from the else branch)?
I don't believe so. Any time you have a conditional if-else construct that causes a division intra-warp, you have the possibility for interleaving. I define Volta+ interleaving (figure 12) as forward progress of one warp fragment, followed by forward progress of another warp fragment, perhaps with continued alternation, prior to reconvergence. This ability to alternate back and forth doesn't only apply to the sync ops. Atomics could be handled this way (that is a particular use-case for the Volta ITS model - e.g. use in a producer/consumer algorithm or for intra-warp negotiation of locks - referred to as "starvation free" in the previously linked article) and we could also imagine that a warp fragment could stall for any number of reasons (e.g. a data dependency, perhaps due to a load instruction) which prevents forward progress of that warp fragment "for a while". I believe the Volta ITS can handle a variety of possible latencies, by alternating forward progress scheduling from one warp fragment to another. This idea is covered in the paper in the introduction ("load-to-use"). Sorry, I won't be able to provide an extended discussion of the paper here.
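To make the atomics/lock use case concrete, here is a minimal sketch of my own (the kernel name, the lock variable, and the managed-memory setup are mine, not from the linked article) of intra-warp lock negotiation that depends on this forward-progress guarantee; build it for cc7.0+ (e.g. nvcc -arch=sm_70):
#include <cstdio>

__device__ int lock = 0;   // 0 = free, 1 = held

__global__ void warp_lock_demo(int *counter){
    // Every thread of the warp competes for the same lock. Pre-Volta, an
    // intra-warp spin loop like this can hang (the spinning fragment may be
    // scheduled indefinitely while the lock holder never runs); with
    // independent thread scheduling the scheduler can alternate between the
    // diverged fragments, so the holder eventually releases the lock and
    // every thread makes progress.
    bool done = false;
    while (!done) {
        if (atomicCAS(&lock, 0, 1) == 0) {   // try to acquire
            (*counter)++;                    // critical section
            __threadfence();                 // publish the update before releasing
            atomicExch(&lock, 0);            // release
            done = true;
        }
    }
}

int main(){
    int *counter;
    cudaMallocManaged(&counter, sizeof(int));
    *counter = 0;
    warp_lock_demo<<<1,32>>>(counter);
    cudaDeviceSynchronize();
    printf("counter = %d\n", *counter);      // expect 32
    cudaFree(counter);
}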
EDIT: Responding to a question in the comments, paraphrased "Under what circumstances can the scheduler use a subsequent shuffle op to satisfy the needs of a warp fragment that is waiting for shuffle op completion?"
First, let's notice that the PTX description above implies some sort of synchronization. The scheduler has halted execution of the warp fragment that encounters the shuffle op, waiting for other warp fragments to participate (somehow). This is a description of synchronization.
Second, the PTX description makes allowance for exited threads.
What does all this mean? The simplest description is just that a subsequent "matching" shuffle op can/will be "found by the scheduler", if it is possible, to satisfy the shuffle op. Let's consider some examples.
Test case 1: As given in the programming guide, we see expected results:
$ cat t1971.cu
#include <cstdio>
__global__ void k(){
    int tid = threadIdx.x;
    float swapped = 32;
    float val = threadIdx.x;
    if (tid % warpSize < 16) {
        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    } else {
        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    }
    printf("thread: %d, swp: %f\n", tid, swapped);
}
int main(){
    k<<<1,32>>>();
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_70 -o t1971 t1971.cu
$ ./t1971
thread: 0, swp: 16.000000
thread: 1, swp: 17.000000
thread: 2, swp: 18.000000
thread: 3, swp: 19.000000
thread: 4, swp: 20.000000
thread: 5, swp: 21.000000
thread: 6, swp: 22.000000
thread: 7, swp: 23.000000
thread: 8, swp: 24.000000
thread: 9, swp: 25.000000
thread: 10, swp: 26.000000
thread: 11, swp: 27.000000
thread: 12, swp: 28.000000
thread: 13, swp: 29.000000
thread: 14, swp: 30.000000
thread: 15, swp: 31.000000
thread: 16, swp: 0.000000
thread: 17, swp: 1.000000
thread: 18, swp: 2.000000
thread: 19, swp: 3.000000
thread: 20, swp: 4.000000
thread: 21, swp: 5.000000
thread: 22, swp: 6.000000
thread: 23, swp: 7.000000
thread: 24, swp: 8.000000
thread: 25, swp: 9.000000
thread: 26, swp: 10.000000
thread: 27, swp: 11.000000
thread: 28, swp: 12.000000
thread: 29, swp: 13.000000
thread: 30, swp: 14.000000
thread: 31, swp: 15.000000
$
Test case 2: Remove the body of the else clause. This still "works" because of the allowance for exited threads to satisfy the sync point, but the results do not match the previous case at all. None of the shuffle ops are "successful":
$ cat t1971.cu
#include <cstdio>
__global__ void k(){
    int tid = threadIdx.x;
    float swapped = 32;
    float val = threadIdx.x;
    if (tid % warpSize < 16) {
        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    } else {
        // swapped = __shfl_xor_sync(0xffffffff, val, 16);
    }
    printf("thread: %d, swp: %f\n", tid, swapped);
}
int main(){
    k<<<1,32>>>();
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_70 -o t1971 t1971.cu
$ ./t1971
thread: 16, swp: 32.000000
thread: 17, swp: 32.000000
thread: 18, swp: 32.000000
thread: 19, swp: 32.000000
thread: 20, swp: 32.000000
thread: 21, swp: 32.000000
thread: 22, swp: 32.000000
thread: 23, swp: 32.000000
thread: 24, swp: 32.000000
thread: 25, swp: 32.000000
thread: 26, swp: 32.000000
thread: 27, swp: 32.000000
thread: 28, swp: 32.000000
thread: 29, swp: 32.000000
thread: 30, swp: 32.000000
thread: 31, swp: 32.000000
thread: 0, swp: 0.000000
thread: 1, swp: 0.000000
thread: 2, swp: 0.000000
thread: 3, swp: 0.000000
thread: 4, swp: 0.000000
thread: 5, swp: 0.000000
thread: 6, swp: 0.000000
thread: 7, swp: 0.000000
thread: 8, swp: 0.000000
thread: 9, swp: 0.000000
thread: 10, swp: 0.000000
thread: 11, swp: 0.000000
thread: 12, swp: 0.000000
thread: 13, swp: 0.000000
thread: 14, swp: 0.000000
thread: 15, swp: 0.000000
$
Test case 3: Using test case 2, introduce a barrier to prevent threads from exiting. Now we see a hang on Volta, because the sync point associated with the shuffle op can never be satisfied:
$ cat t1971.cu
#include <cstdio>
__global__ void k(){
    int tid = threadIdx.x;
    float swapped = 32;
    float val = threadIdx.x;
    if (tid % warpSize < 16) {
        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    } else {
        // swapped = __shfl_xor_sync(0xffffffff, val, 16);
    }
    __syncwarp();
    printf("thread: %d, swp: %f\n", tid, swapped);
}
int main(){
    k<<<1,32>>>();
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_70 -o t1971 t1971.cu
$ ./t1971
<hang>
Test case 4: Start with test case 2, introduce an additional shuffle op after the conditional area. We see partially correct results in this case. The sync point for the warp fragment encountering the shuffle op in the conditional area is apparently satisfied by the remaining warp fragment encountering the shuffle op outside the conditional area. However, as we shall see, the explanation for the partially correct results is that one warp fragment is doing 2 shuffles, the other only 1. The one that does two shuffles (the lower fragment) has a second shuffle op whose sync point is satisfied due to the exiting thread condition, but whose results are "not correct" because the source lanes are not participating at that point; they have exited:
$ cat t1971.cu
#include <cstdio>
__global__ void k(){
    int tid = threadIdx.x;
    float swapped = 32;
    float val = threadIdx.x;
    if (tid % warpSize < 16) {
        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    } else {
        // swapped = __shfl_xor_sync(0xffffffff, val, 16);
    }
    swapped = __shfl_xor_sync(0xffffffff, val, 16);
    printf("thread: %d, swp: %f\n", tid, swapped);
}
int main(){
    k<<<1,32>>>();
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_70 -o t1971 t1971.cu
$ ./t1971
thread: 16, swp: 0.000000
thread: 17, swp: 1.000000
thread: 18, swp: 2.000000
thread: 19, swp: 3.000000
thread: 20, swp: 4.000000
thread: 21, swp: 5.000000
thread: 22, swp: 6.000000
thread: 23, swp: 7.000000
thread: 24, swp: 8.000000
thread: 25, swp: 9.000000
thread: 26, swp: 10.000000
thread: 27, swp: 11.000000
thread: 28, swp: 12.000000
thread: 29, swp: 13.000000
thread: 30, swp: 14.000000
thread: 31, swp: 15.000000
thread: 0, swp: 0.000000
thread: 1, swp: 0.000000
thread: 2, swp: 0.000000
thread: 3, swp: 0.000000
thread: 4, swp: 0.000000
thread: 5, swp: 0.000000
thread: 6, swp: 0.000000
thread: 7, swp: 0.000000
thread: 8, swp: 0.000000
thread: 9, swp: 0.000000
thread: 10, swp: 0.000000
thread: 11, swp: 0.000000
thread: 12, swp: 0.000000
thread: 13, swp: 0.000000
thread: 14, swp: 0.000000
thread: 15, swp: 0.000000
$
Test case 5: Start with test case 4 and introduce a synchronization at the end. Once again we observe a hang. The warp fragment (lower) that is doing 2 shuffle ops does not have its second shuffle-op sync point satisfied:
$ cat t1971.cu
#include <cstdio>
__global__ void k(){
    int tid = threadIdx.x;
    float swapped = 32;
    float val = threadIdx.x;
    if (tid % warpSize < 16) {
        swapped = __shfl_xor_sync(0xffffffff, val, 16);
    } else {
        // swapped = __shfl_xor_sync(0xffffffff, val, 16);
    }
    swapped = __shfl_xor_sync(0xffffffff, val, 16);
    printf("thread: %d, swp: %f\n", tid, swapped);
    __syncwarp();
}
int main(){
    k<<<1,32>>>();
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_70 -o t1971 t1971.cu
$ ./t1971
thread: 16, swp: 0.000000
thread: 17, swp: 1.000000
thread: 18, swp: 2.000000
thread: 19, swp: 3.000000
thread: 20, swp: 4.000000
thread: 21, swp: 5.000000
thread: 22, swp: 6.000000
thread: 23, swp: 7.000000
thread: 24, swp: 8.000000
thread: 25, swp: 9.000000
thread: 26, swp: 10.000000
thread: 27, swp: 11.000000
thread: 28, swp: 12.000000
thread: 29, swp: 13.000000
thread: 30, swp: 14.000000
thread: 31, swp: 15.000000
<hang>
The partial printout prior to the hang at this point is expected. It is an exercise left to the reader to explain:
Why do we see any printout at all?
Why is it the way it is (only the upper fragment, but apparently with correct shuffle results)?

Related

How can I find out which thread is getting executed on which core of the GPU?

I'm developing some simple programs in CUDA and I want to know which thread is getting executed on which core of the GPU. I'm using Visual Studio 2012 and I have an NVIDIA GeForce 610M graphics card.
Is it possible to do so? I've already searched a lot on Google, but all in vain.
EDIT:
I know this is really weird to ask, but I have been asked to do that by my college project guide.
Combining information from the PTX manual and a simple inline-PTX wrapper, the following functions should give you what you need:
static __device__ __inline__ uint32_t __mysmid(){
    uint32_t smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}
The above function will tell you which multiprocessor the (thread) code is executing on.
static __device__ __inline__ uint32_t __mywarpid(){
    uint32_t warpid;
    asm volatile("mov.u32 %0, %%warpid;" : "=r"(warpid));
    return warpid;
}
The above function will tell you which warp the (thread) code belongs to.
static __device__ __inline__ uint32_t __mylaneid(){
    uint32_t laneid;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(laneid));
    return laneid;
}
The above function will tell you which warp lane the (thread) code belongs to.
Note that in the case of dynamic parallelism (and possibly other scenarios such as debugging), this information is volatile and may change during program execution.
Refer to the programming guide for definition of terms like multiprocessor and warp.
Here is a fully-worked example:
$ cat t646.cu
#include <stdio.h>
#include <stdint.h>
static __device__ __inline__ uint32_t __mysmid(){
    uint32_t smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}
static __device__ __inline__ uint32_t __mywarpid(){
    uint32_t warpid;
    asm volatile("mov.u32 %0, %%warpid;" : "=r"(warpid));
    return warpid;
}
static __device__ __inline__ uint32_t __mylaneid(){
    uint32_t laneid;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(laneid));
    return laneid;
}
__global__ void mykernel(){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    printf("I am thread %d, my SM ID is %d, my warp ID is %d, and my warp lane is %d\n", idx, __mysmid(), __mywarpid(), __mylaneid());
}
int main(){
    mykernel<<<4,4>>>();
    cudaDeviceSynchronize();
    return 0;
}
$ nvcc -arch=sm_20 -o t646 t646.cu
$ ./t646
I am thread 0, my SM ID is 0, my warp ID is 0, and my warp lane is 0
I am thread 1, my SM ID is 0, my warp ID is 0, and my warp lane is 1
I am thread 2, my SM ID is 0, my warp ID is 0, and my warp lane is 2
I am thread 3, my SM ID is 0, my warp ID is 0, and my warp lane is 3
I am thread 8, my SM ID is 3, my warp ID is 0, and my warp lane is 0
I am thread 9, my SM ID is 3, my warp ID is 0, and my warp lane is 1
I am thread 10, my SM ID is 3, my warp ID is 0, and my warp lane is 2
I am thread 11, my SM ID is 3, my warp ID is 0, and my warp lane is 3
I am thread 12, my SM ID is 4, my warp ID is 0, and my warp lane is 0
I am thread 13, my SM ID is 4, my warp ID is 0, and my warp lane is 1
I am thread 14, my SM ID is 4, my warp ID is 0, and my warp lane is 2
I am thread 15, my SM ID is 4, my warp ID is 0, and my warp lane is 3
I am thread 4, my SM ID is 1, my warp ID is 0, and my warp lane is 0
I am thread 5, my SM ID is 1, my warp ID is 0, and my warp lane is 1
I am thread 6, my SM ID is 1, my warp ID is 0, and my warp lane is 2
I am thread 7, my SM ID is 1, my warp ID is 0, and my warp lane is 3
$
Note that the above output will vary depending on what kind of GPU you are running on. Don't expect your output to be exactly like the above.

Multiple Host threads launch CUDA kernels together

I have encountered a very strange situation. Here is our code:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
void initCuda(int g)
{
    cudaDeviceProp prop;
    if(cudaGetDeviceProperties(&prop, g) == cudaSuccess)
        printf("MP cnt: %d ,Concurrent Kernels:%d , AsyncEngineCount:%d , ThrdPerMP: %d\n",
               prop.multiProcessorCount,prop.concurrentKernels,prop.asyncEngineCount,192);
    cudaSetDevice(g);
}
__global__ void cudaJob(float *mem){
    unsigned int tid=threadIdx.x+blockIdx.x*blockDim.x;
    mem[tid]=-1e5;
    while(mem[tid]<1.0e5){
        mem[tid]=mem[tid]+1e-2;
    }
}
void wrapper(int n,int b){
    float** dmem=(float**)malloc(n*(sizeof(float*)));
    cudaStream_t* stream=(cudaStream_t*)malloc(sizeof(cudaStream_t)*n);
    dim3 grid=dim3(b,1,1);
    dim3 block=dim3(192,1,1);//2496/13=192
    for(int i=0;i<n;i++) {
        cudaMalloc((void**)&dmem[i],192*b*sizeof(float));
        cudaStreamCreate(&stream[i]);
    }
    for(int i=0;i<n;i++) cudaJob<<<grid,block,0,stream[i]>>>(dmem[i]);
    for(int i=0;i<n;i++) {
        cudaStreamDestroy(stream[i]);
        cudaFree(dmem[i]);
    }
    free(stream);
    free(dmem);
}
int main(int argc,char* argv[]){
    initCuda(0);
    int n=atoi(argv[1]);
    int nthreads=atoi(argv[2]);
    int b=atoi(argv[3]);
    float t1=omp_get_wtime();
    #pragma omp parallel num_threads(nthreads) firstprivate(nthreads,n,b)
    {
        #pragma omp barrier
        float time=omp_get_wtime();
        int id=omp_get_thread_num();
        wrapper(n,b);
        time=omp_get_wtime()-time;
        printf("Num Threads: %d, Time: %f\n",id,time);
    }
    printf("total: %f\n",omp_get_wtime()-t1);
    return 0;
}
So if we run ./main 1 8 1, there will be 8 threads and each of them will launch one kernel. However, sometimes the actual run times suggest that the kernels are not launched simultaneously:
MP cnt: 13 ,Concurrent Kernels:1 , AsyncEngineCount:2 , ThrdPerMP: 192
Num Threads: 0, Time: 3.788108
Num Threads: 6, Time: 6.661960
Num Threads: 7, Time: 9.535245
Num Threads: 2, Time: 12.408561
Num Threads: 5, Time: 12.410481
Num Threads: 1, Time: 12.411650
Num Threads: 4, Time: 12.412888
Num Threads: 3, Time: 12.414572
total: 12.414601
After some debugging we found that the problem may be caused by the cleanup of the memory and streams. If we comment out all the cudaFree, cudaStreamDestroy, and free calls, then the run times suggest that everything is concurrent:
MP cnt: 13 ,Concurrent Kernels:1 , AsyncEngineCount:2 , ThrdPerMP: 192
Num Threads: 7, Time: 3.805691
Num Threads: 1, Time: 3.806201
Num Threads: 3, Time: 3.806624
Num Threads: 2, Time: 3.806695
Num Threads: 6, Time: 3.807018
Num Threads: 5, Time: 3.807456
Num Threads: 0, Time: 3.807486
Num Threads: 4, Time: 3.807792
total: 3.807799
Finally, we found that if we add an OpenMP barrier right after the kernel launch calls, the cleanup does not cause any problem:
for(int i=0;i<n;i++) cudaJob<<<grid,block,0,stream[i]>>>(dmem[i]);
#pragma omp barrier
for(int i=0;i<n;i++) {
    cudaStreamDestroy(stream[i]);
    cudaFree(dmem[i]);
}
So, we think that when multiple host threads are trying to clean up the memory and streams on the device, they may compete with each other. But we are not sure.
Is that right? Can anyone help us remove the omp barrier? We don't think it should be necessary for our problem.
Yes, cudaMalloc, cudaFree, and cudaStreamCreate are all synchronous, which means they will tend to serialize activity by forcing any CUDA calls issued before them to complete before they execute.
The usual recommendation is to do all such allocations outside of time-critical code. Figure out how many allocations you need, allocate them up-front, then use (and perhaps re-use) them during your main processing loop, then free whatever is needed at the end.
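For example (a sketch only, using helper names I'm introducing: setup, launch_only, teardown), the question's wrapper() could be split so that only the launches and a stream synchronize sit inside the timed, multi-threaded region:
void setup(int n, int b, float **dmem, cudaStream_t *stream){
    // one-time allocations and stream creation, hoisted out of the timed region
    for (int i = 0; i < n; i++) {
        cudaMalloc((void**)&dmem[i], 192*b*sizeof(float));
        cudaStreamCreate(&stream[i]);
    }
}
void launch_only(int n, int b, float **dmem, cudaStream_t *stream){
    dim3 grid(b,1,1), block(192,1,1);
    for (int i = 0; i < n; i++)
        cudaJob<<<grid,block,0,stream[i]>>>(dmem[i]);
    for (int i = 0; i < n; i++)
        cudaStreamSynchronize(stream[i]);   // wait for completion without destroying anything
}
void teardown(int n, float **dmem, cudaStream_t *stream){
    // cleanup once, after timing is finished
    for (int i = 0; i < n; i++) {
        cudaStreamDestroy(stream[i]);
        cudaFree(dmem[i]);
    }
}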

CUDA racecheck, shared memory array and cudaDeviceSynchronize()

I recently discovered the racecheck tool of cuda-memcheck, available in CUDA 5.0 (cuda-memcheck --tool racecheck, see the NVIDIA doc). This tool can detect race conditions with shared memory in a CUDA kernel.
In debug mode, this tool does not detect anything, which is apparently normal. However, in release mode (-O3), I get errors depending on the parameters of the problem.
Here is an error example (initialization of shared memory on line 22, assignment on line 119):
========= ERROR: Potential WAW hazard detected at shared 0x0 in block (35, 0, 0) :
========= Write Thread (32, 0, 0) at 0x00000890 in ....h:119:void kernel_test3(Data*)
========= Write Thread (0, 0, 0) at 0x00000048 in ....h:22:void kernel_test3(Data*)
========= Current Value : 13, Incoming Value : 0
The first thing that surprised me is the thread ids. When I first encountered the error, each block contained 32 threads (ids 0 to 31). So why is there a problem with the thread id 32? I even added an extra check on threadIdx.x, but this changed nothing.
I use shared memory as a temporary buffer, and each thread deals with its own parameters of a multidimensional array, e.g. __shared__ float arr[SIZE_1][SIZE_2][NB_THREADS_PER_BLOCK]. I do not really understand how there could be any race conditions, since each thread deals with its own part of shared memory.
Reducing the grid size from 64 blocks to 32 blocks seemed to solve the issue (with 32 threads per block). I do not understand why.
In order to understand what was happening, I tested with some simpler kernels.
Let me show you an example of a kernel that creates that kind of error. Basically, this kernel uses SIZE_X*SIZE_Y*NTHREADS*sizeof(float) B of shared memory, and I can use 48KB of shared memory per SM.
test.cu
template <unsigned int NTHREADS>
__global__ void kernel_test()
{
    const int SIZE_X = 4;
    const int SIZE_Y = 4;
    __shared__ float tmp[SIZE_X][SIZE_Y][NTHREADS];
    for (unsigned int i = 0; i < SIZE_X; i++)
        for (unsigned int j = 0; j < SIZE_Y; j++)
            tmp[i][j][threadIdx.x] = threadIdx.x;
}
int main()
{
    const unsigned int NTHREADS = 32;
    //kernel_test<NTHREADS><<<32, NTHREADS>>>(); // ---> works fine
    kernel_test<NTHREADS><<<64, NTHREADS>>>();
    cudaDeviceSynchronize(); // ---> gives racecheck errors if NBLOCKS > 32
}
Compilation:
nvcc test.cu --ptxas-options=-v -o test
If we run the kernel:
cuda-memcheck --tool racecheck test
kernel_test<32><<<32, 32>>>(); : 32 blocks, 32 threads => does not lead to any apparent racecheck error.
kernel_test<32><<<64, 32>>>(); : 64 blocks, 32 threads => leads to WAW hazards (threadId.x = 32?!) and errors.
========= ERROR: Potential WAW hazard detected at shared 0x6 in block (57, 0, 0) :
========= Write Thread (0, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Write Thread (1, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Current Value : 0, Incoming Value : 128
========= INFO:(Identical data being written) Potential WAW hazard detected at shared 0x0 in block (47, 0, 0) :
========= Write Thread (32, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Write Thread (0, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Current Value : 0, Incoming Value : 0
So what am I missing here? Am I doing something wrong with shared memory? (I am still a beginner with this)
** UPDATE **
The problem seems to be coming from cudaDeviceSynchronize() when NBLOCKS > 32. Why is this happening?
For starters, cudaDeviceSynchronize() isn't the cause; your kernel is, but the kernel launch is an asynchronous call, so the error is only reported at your call to cudaDeviceSynchronize().
As for the kernel, your shared memory is of size SIZE_X*SIZE_Y*NTHREADS (which in the example translates to 512 elements per block). In your nested loops you index into it using [i*blockDim.x*SIZE_Y + j*blockDim.x + threadIdx.x] -- this is where your problem is.
To be more specific, your i and j values will range from [0, 4), your threadIdx.x from [0, 32), and your SIZE_{X | Y} values are 4.
When blockDim.x is 64, your maximum index used in the loop will be 991 (from 3*64*4 + 3*64 + 31). When your blockDim.x is 32, your maximum index will be 511.
Based on your code, you should get errors whenever your NBLOCKS exceeds your NTHREADS.
NOTE: I originally posted this to https://devtalk.nvidia.com/default/topic/527292/cuda-programming-and-performance/cuda-racecheck-shared-memory-array-and-cudadevicesynchronize-/
This was apparently a bug in NVIDIA drivers for Linux. The bug disappeared after the 313.18 release.

Can't get simple CUDA program to work

I'm trying the "hello world" program of CUDA programming: adding two vectors together. Here's the program I have tried:
#include <cuda.h>
#include <stdio.h>
#define SIZE 10
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}
int main()
{
    float A[SIZE], B[SIZE], C[SIZE];
    float *devPtrA, *devPtrB, *devPtrC;
    size_t memsize= SIZE * sizeof(float);
    for (int i=0; i< SIZE; i++) {
        A[i] = i;
        B[i] = i;
    }
    cudaMalloc(&devPtrA, memsize);
    cudaMalloc(&devPtrB, memsize);
    cudaMalloc(&devPtrC, memsize);
    cudaMemcpy(devPtrA, A, memsize, cudaMemcpyHostToDevice);
    cudaMemcpy(devPtrB, B, memsize, cudaMemcpyHostToDevice);
    vecAdd<<<1, SIZE>>>(devPtrA, devPtrB, devPtrC);
    cudaMemcpy(C, devPtrC, memsize, cudaMemcpyDeviceToHost);
    for (int i=0; i<SIZE; i++)
        printf("C[%d]: %f + %f => %f\n",i,A[i],B[i],C[i]);
    cudaFree(devPtrA);
    cudaFree(devPtrB);
    cudaFree(devPtrC);
}
Compiled with:
nvcc cuda.cu
Output is this:
C[0]: 0.000000 + 0.000000 => 0.000000
C[1]: 1.000000 + 1.000000 => 0.000000
C[2]: 2.000000 + 2.000000 => 0.000000
C[3]: 3.000000 + 3.000000 => 0.000000
C[4]: 4.000000 + 4.000000 => 0.000000
C[5]: 5.000000 + 5.000000 => 0.000000
C[6]: 6.000000 + 6.000000 => 0.000000
C[7]: 7.000000 + 7.000000 => 0.000000
C[8]: 8.000000 + 8.000000 => 366987238703104.000000
C[9]: 9.000000 + 9.000000 => 0.000000
Every time I run it, I get a different answer for C[8], but the results for all the other elements are always 0.000000.
The Ubuntu 11.04 system is a 64-bit Xeon server with 4 cores running the latest NVIDIA drivers (downloaded on Oct 4, 2012). The card is an EVGA GeForce GT 430 with 96 cores and 1GB of RAM.
What should I do to figure out what's going on?
It seems that your drivers are not initialized, but not checking the CUDA return codes is always bad practice; you should avoid that. Here is a simple function + macro that you can use for CUDA calls (quoted from CUDA by Example):
static void HandleError( cudaError_t err,
                         const char *file,
                         int line ) {
    if (err != cudaSuccess) {
        printf( "%s in %s at line %d\n", cudaGetErrorString( err ),
                file, line );
        exit( EXIT_FAILURE );
    }
}
#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))
Now start calling your functions like:
HANDLE_ERROR(cudaMemcpy(...));
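A kernel launch itself returns no error code, so (as a sketch of how I would apply the macro to the posted program) you can also check the launch and the subsequent synchronization:
HANDLE_ERROR(cudaMalloc(&devPtrA, memsize));
HANDLE_ERROR(cudaMemcpy(devPtrA, A, memsize, cudaMemcpyHostToDevice));
vecAdd<<<1, SIZE>>>(devPtrA, devPtrB, devPtrC);
HANDLE_ERROR(cudaGetLastError());        // catches launch/configuration errors
HANDLE_ERROR(cudaDeviceSynchronize());   // catches errors that occur during kernel execution
HANDLE_ERROR(cudaMemcpy(C, devPtrC, memsize, cudaMemcpyDeviceToHost));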
Most likely cause: the NVIDIA drivers weren't loaded. On a headless Linux system, X Windows isn't running, so the drivers aren't loaded at boot time.
Run nvidia-smi -a as root to load them and get a confirmation in the form of a report.
Although the drivers are now loaded, they still need to be initialized every time a CUDA program is run. Put the drivers into persistence mode with nvidia-smi -pm 1 so they remain initialized all the time. Add this to a boot script (such as rc.local) so it happens at every boot.

cuda programming problem

I'm very new to CUDA. I'm using CUDA on my Ubuntu 10.04 machine in device emulation mode.
I wrote a program to compute the square of an array, which is the following:
#include <stdio.h>
#include <cuda.h>
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x + threadIdx.x;
    if (idx<=N)
        a[idx] = a[idx] * a[idx];
}
int main(void)
{
    float *a_h, *a_d;
    const int N = 10;
    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size);
    cudaMalloc((void **) &a_d, size);
    for (int i=0; i<N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    square_array <<< 1,10>>> (a_d, N);
    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
    // Print results
    for (int i=0; i<N; i++) printf(" %f\n", a_h[i]);
    free(a_h);
    cudaFree(a_d);
    return 0;
}
When I run this code it shows no problem and gives me the proper output.
Now my problem is that when I use <<<2,5>>> or <<<5,2>>> the result is the same. What is happening on the GPU?
All I understand is that I just launch a CUDA kernel with 5 blocks containing 2 threads each.
Can anyone explain to me how the GPU handles this or implements the launch (kernel call)?
Now my real problem is that when I call the kernel with <<<1,10>>> it is OK; it shows the perfect result.
But when I call the kernel with <<<1,5>>> the result is the following:
0.000000
1.000000
4.000000
9.000000
16.000000
5.000000
6.000000
7.000000
8.000000
9.000000
Similarly, when I reduce or increase the second parameter in the kernel call it shows a different result; for example, when I change it to <<<1,4>>> it shows the following result:
0.000000
1.000000
4.000000
9.000000
4.000000
5.000000
6.000000
7.000000
8.000000
9.000000
Why is this result coming out?
Can anybody explain the working of the kernel launch call?
What does the blockDim type variable contain?
Please help me understand the concept of launching a kernel call and how it works.
I searched the programming guide but they didn't explain it very well.
The calculation of idx in your kernel code is incorrect. If you change it to:
int idx = blockDim.x * blockIdx.x + threadIdx.x;
You might find the results a little easier to understand.
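Putting that together, here is a corrected version of the posted program as a sketch (my own assembly of the pieces above): the global index uses blockDim.x, the launch is sized so blocks*threads covers all N elements, and I have also tightened the bounds check to idx < N, since idx <= N in the posted code can touch one element past the end of the array.
#include <stdio.h>
#include <cuda.h>
__global__ void square_array(float *a, int N)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;   // unique global index
    if (idx < N)                                       // guard against extra threads
        a[idx] = a[idx] * a[idx];
}
int main(void)
{
    const int N = 10;
    size_t size = N * sizeof(float);
    float *a_h = (float *)malloc(size);
    float *a_d;
    cudaMalloc((void **)&a_d, size);
    for (int i=0; i<N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    const int threads = 5;
    const int blocks  = (N + threads - 1) / threads;   // 2 blocks of 5 threads cover N = 10
    square_array<<<blocks, threads>>>(a_d, N);
    cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);
    for (int i=0; i<N; i++) printf(" %f\n", a_h[i]);
    free(a_h);
    cudaFree(a_d);
    return 0;
}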
EDIT: For any given kernel launch
square_array<<<gridDim,blockDim>>>(...)
in the GPU, the automatic variable blockDim will contain the x, y, and z components of the blockDim argument passed in the host-side kernel launch. Similarly, gridDim will contain the x and y components of the gridDim argument passed in the launch.
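As a quick illustration (my own, hypothetical example), printing these built-ins from a single thread makes their correspondence with the launch parameters visible:
#include <cstdio>
__global__ void show_config(){
    if (blockIdx.x == 0 && threadIdx.x == 0)
        printf("gridDim.x = %d, blockDim.x = %d\n", gridDim.x, blockDim.x);
}
int main(){
    show_config<<<2,5>>>();   // expect: gridDim.x = 2, blockDim.x = 5
    cudaDeviceSynchronize();
}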
Apart from what talonmies has said, you may need to do the following to have better performance in real-world applications.
if (idx < N) {
    float tmp = a[idx];
    a[idx] = tmp * tmp;
}
The way kernels are invoked in CUDA is like so:
kernel<<<numBlocks,numThreads>>>(Kernel arguments);
This means that there will be numBlocks blocks with numThreads threads running in each block. For example, if you call
kernel<<<1,5>>>(Kernel args);
then 1 block will run with 5 threads running in parallel. and if you call
kernel<<<2,5>>>(Kernel args);
then there will be 2 blocks with 5 threads running in each. Unless you alter your device code, the maximum dimension of the array that you are "squaring" is the product numBlocks*numThreads. This explains why not all of the values in your original array were squared.
I suggest you read through the CUDA_C_Programming_Guide.pdf that comes with the CUDA toolkit.