Threads of a CUDA kernel execute sequentially - cuda

I have two kernels that process some data sequentially (each launched with only one thread). I want to combine the two so that I have one kernel to launch with two threads. After doing so, I was expecting an execution time of max(kernel1, kernel2), but what I got was the sum of the two execution times. I narrowed the problem down to something like the code below.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include<iostream>
#include<string>
#include<vector>
#include<random>
#include<functional>
#include<algorithm>
#include<iterator>
__global__ void dummyKernel(const float *d_data_Re, const float *d_data_Im,
                            float *d_out_Re, float *d_out_Im, const int dataLen) {
    int i{ threadIdx.x };
    if (i == 0) {
        printf("Thread zero started \n");
        for (int j{}; j < 1000000; j++)
            d_out_Re[j%dataLen] = sqrtf(2) + d_data_Re[j%dataLen] * (j % 4 == 1);
        printf("Thread zero finished \n");
    }
    else if (i == 1) {
        printf("Thread one started \n");
        for (int j{}; j < 1000000; j++)
            d_out_Im[j%dataLen] = sqrtf(2) + d_data_Im[j%dataLen] * (j % 4 == 1);
        printf("Thread one finished \n");
    }
}

__global__ void dummyKernel2(const float *d_data_Re, const float *d_data_Im,
                             float *d_out_Re, float *d_out_Im, const int dataLen) {
    int i{ threadIdx.x };
    //if (i == 0) {
        printf("Thread zero started \n");
        for (int j{}; j < 1000000; j++)
            d_out_Re[j%dataLen] = sqrtf(2) + d_data_Re[j%dataLen] * (j % 4 == 1);
        printf("Thread zero finished \n");
    //}
    //else if (i == 1) {
    //    printf("Thread one started \n");
    //    for (int j{}; j < 1000000; j++)
    //        d_out_Im[j%dataLen] = sqrtf(2) + d_data_Im[j%dataLen] * (j % 4 == 1);
    //    printf("Thread one finished \n");
    //}
}
int main()
{
    cudaError_t cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
        return 1;
    }

    const int sizeOfFrame = 2 * 1024 * 1024;
    std::vector<float> data_re(sizeOfFrame), data_im;

    // random number generator
    std::uniform_real_distribution<float> distribution(0.0f, 2.0f); // values between 0 and 2
    std::mt19937 engine; // Mersenne twister MT19937
    auto generator = std::bind(distribution, engine);
    std::generate_n(data_re.begin(), sizeOfFrame, generator);
    std::copy(data_re.begin(), data_re.end(), std::back_inserter(data_im));

    float *d_data_re, *d_data_im;
    cudaMalloc(&d_data_re, sizeOfFrame * sizeof(float));
    cudaMalloc(&d_data_im, sizeOfFrame * sizeof(float));
    cudaMemcpy(d_data_re, data_re.data(), sizeOfFrame * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_data_im, data_im.data(), sizeOfFrame * sizeof(float), cudaMemcpyHostToDevice);

    float *d_pll_out_re, *d_pll_out_im;
    cudaMalloc(&d_pll_out_re, sizeOfFrame * sizeof(float));
    cudaMalloc(&d_pll_out_im, sizeOfFrame * sizeof(float));

    dummyKernel<<<1, 2>>>(d_data_re, d_data_im,
                          d_pll_out_re, d_pll_out_im, sizeOfFrame);
    cudaDeviceSynchronize();

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }
    return 0;
}
By the way, I got the code for the random number generator from an answer to this question. So, dummyKernel doesn't do anything useful; I just wanted a kernel that takes relatively long to finish. If you launch dummyKernel, the order of the output will be "Thread zero started", "Thread zero finished", "Thread one started", "Thread one finished". Sequential. But if you launch dummyKernel2, the order of the output will be "Thread zero started", "Thread zero started", "Thread zero finished", "Thread zero finished", and the execution time is almost half that of dummyKernel. I don't understand this behavior and the effect of the if-else I used.
OS: Windows 10, GTX 1050 Ti, CUDA Driver/Runtime version: 11.1/10.1.

Each CUDA multiprocessor has several execution units (for int, float, special functions, ...). These work as pipelines: a calculation takes several cycles to complete, but a new calculation can be inserted (scheduled) each cycle, so several calculations are in flight at different stages of the pipeline at the same time.
Groups of 32 threads (warps) within a block are issued the same instruction at the same time (in the same cycle, or often two cycles, depending on how many execution and datapath resources the architecture provides and the instruction needs), together with a bitfield (the active mask) stating for which threads the instruction should actually be executed. If some threads of a warp evaluated an if clause as false, they are temporarily deactivated; other threads may already have exited the kernel.
The effect is that if the threads of a warp diverge (branch differently), each execution path has to be run through by the whole warp, with the threads not on that path deactivated. This should be avoided for performance reasons, because the execution resources are reserved for all 32 lanes regardless. Threads in different warps have no such interdependency; the algorithm should be structured with this in mind.
With Volta, Independent Thread Scheduling was introduced: each thread has its own instruction counter (and manages a separate function call stack). The scheduler still issues groups of 32 threads (warps) with an active-mask bitfield, but it can now interleave the diverging paths. Instead of executing CCCIIIEEECCC as pre-Volta GPUs do (C = common instructions, I = if branch, E = else branch), it could execute CCCIEEIIECCC if that better suits the available execution units or hides memory latency. As a programmer one has to be careful, because it can no longer be assumed that the threads are converged, even when they execute the same instruction. That is why __syncwarp was introduced and all kinds of cooperation functions (e.g. the shuffle instructions) got a _sync variant. Nevertheless (although we cannot know for sure whether the threads diverged) one still has to program in a way that all 32 threads can work together when executed synchronously, especially for coalesced memory accesses. Putting __syncwarp after each possibly diverging instruction can help to ensure convergence (but do profile the performance).
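For illustration, here is a minimal sketch of that pattern (not from the original post; the kernel name and values are made up): a divergent if/else followed by a warp-level reduction, with __syncwarp() used to force re-convergence before the cooperative step.

__global__ void divergeThenCooperate(float *out)
{
    int lane = threadIdx.x % 32;
    float v;

    if (lane < 16)            // "I": if branch
        v = lane * 2.0f;
    else                      // "E": else branch
        v = lane * 3.0f;

    // On Volta and later the two paths may have been interleaved and the warp
    // may still be diverged here; force re-convergence before cooperating.
    __syncwarp();

    // Cooperative step: sum the values of the whole warp with the _sync shuffle.
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);

    if (lane == 0)            // one result per warp (assumes blockDim.x is a multiple of 32)
        out[blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32] = v;
}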
Independent Thread Scheduling is also the reason why __syncthreads must definitely be called correctly on the RTX 3080, with every thread of the block participating. A typical fix for the deadlock case you mentioned in the comment is to close the if clause, synchronize all the threads, and then open a new if clause with the same condition as the previous one.
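A minimal sketch of that correction pattern (the condition and buffer names are hypothetical, not from the original code): instead of calling __syncthreads() inside a conditional that not all threads of the block reach, the conditional is split so that every thread executes the barrier.

__global__ void splitConditionAroundBarrier(float *data, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    bool active = (tid < n);      // hypothetical per-thread condition

    // Wrong: putting __syncthreads() inside "if (active) { ... }" deadlocks
    // as soon as some threads of the block have active == false.

    if (active)
        data[tid] *= 2.0f;        // first part of the conditional work

    __syncthreads();              // every thread of the block participates

    if (active)                   // reopen with the same condition
        data[tid] += 1.0f;        // second part of the conditional work
}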

Related

CUDA shared vs global memory, possible speedup

I believe my CUDA application could potentially benefit from shared memory, in order to keep the data near the GPU cores. Right now, I have a single kernel to which I pass a pointer to a previously allocated chunk of device memory, and some constants. After the kernel has finished, the device memory includes the result, which is copied to host memory. This scheme works perfectly and is cross-checked with the same algorithm run on the CPU.
The docs make it quite clear that global memory is much slower and has higher access latency than shared memory, but either way to get the best performance you should make your threads coalesce and align any access. My GPU has Compute Capability 6.1 "Pascal", has 48 kiB of shared memory per thread block and 2 GiB DRAM. If I refactor my code to use shared memory, how do I make sure to avoid bank conflicts?
Shared memory is organized in 32 banks, so that 32 threads from the same block may each simultaneously access a different bank without having to wait. Let's say I take the kernel from above, launch a kernel configuration with one block and 32 threads in that block, and statically allocate 48 kiB of shared memory outside the kernel. Also, each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on. Given this, I would access those 32 shared memory locations with an offset of 48 kiB / 32 banks / sizeof(double), which equals 192:
__shared__ double cache[6144];

__global__ void kernel(double *buf_out, double a, double b, double c)
{
    for(...)
    {
        // Perform calculation on shared memory
        cache[threadIdx.x * 192] = ...
    }
    // Write result to global memory
    buf_out[threadIdx.x] = cache[threadIdx.x * 192];
}
My reasoning: while threadIdx.x runs from 0 to 31, the offset, together with cache being a double, makes sure that each thread accesses the first element of a different bank at the same time. I haven't gotten around to modifying and testing the code yet, but is this the right way to align access for the SM?
MWE added:
This is the naive CPU-to-CUDA port of the algorithm, using global memory only. Visual Profiler reports a kernel execution time of 10.3 seconds.
Environment: Win10, MSVC 2019, x64 Release Build, CUDA v11.2.
#include "cuda_runtime.h"
#include <iostream>
#include <stdio.h>
#define _USE_MATH_DEFINES
#include <math.h>
__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
double z, y, y_max;
unsigned int tid = threadIdx.x/* + blockIdx.x * blockDim.x*/;
double Z = tid * SCREEN_STEP_SIZE, Y = 0;
double temp = WAVE_NUMBER / SCREEN_DIST;
// Make sure the per-thread accumulator is zero before we begin
buf[tid] = 0;
for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
{
y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
{
buf[tid] += cos(temp * (Y * y + Z * z));
}
}
}
int main(void)
{
double *dev_mem;
double *buf = NULL;
cudaError_t cudaStatus;
unsigned int screen_elems = 1000;
if ((buf = (double*)malloc(screen_elems * sizeof(double))) == NULL)
{
printf("Could not allocate memory...");
return -1;
}
memset(buf, 0, screen_elems * sizeof(double));
if ((cudaStatus = cudaMalloc((void**)&dev_mem, screen_elems * sizeof(double))) != cudaSuccess)
{
printf("cudaMalloc failed with code %u", cudaStatus);
return cudaStatus;
}
kernel<<<1, 1000>>>(dev_mem, 1e-3, 5e-5, 50e-9, 10.0, 2 * M_PI / 5e-7);
cudaDeviceSynchronize();
if ((cudaStatus = cudaMemcpy(buf, dev_mem, screen_elems * sizeof(double), cudaMemcpyDeviceToHost)) != cudaSuccess)
{
printf("cudaMemcpy failed with code %u", cudaStatus);
return cudaStatus;
}
cudaFree(dev_mem);
cudaDeviceReset();
free(buf);
return 0;
}
The kernel below uses shared memory instead and takes approximately 10.6 seconds to execute, again measured in Visual Profiler:
__shared__ double cache[1000];

__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
                       double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
    double z, y, y_max;
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    double Z = tid * SCREEN_STEP_SIZE, Y = 0;
    double temp = WAVE_NUMBER / SCREEN_DIST;
    // Make sure the per-thread accumulator is zero before we begin
    cache[tid] = 0;
    for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
    {
        y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
        for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
        {
            cache[tid] += cos(temp * (Y * y + Z * z));
        }
    }
    buf[tid] = cache[tid];
}
The innermost line inside the loops is typically executed several million times, depending on the five constants passed to the kernel. So instead of thrashing the off-chip global memory, I expected the on-chip shared-memory version to be much faster, but apparently it is not - what am I missing?
Let's say... each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on.
In that case, it does not make sense to use shared memory. The whole point of shared memory is the sharing... among all threads in a block. Under your assumptions, you should keep your element in a register, not in shared memory. Indeed, in your "MWE Added" kernel - that's probably what you should do.
If your threads were to share information - then the pattern of this sharing would determine how best to utilize shared memory.
Also remember that if you don't read data repeatedly, or from multiple threads, it is much less likely that shared memory will help you - as you always have to read from global memory at least once and write to shared memory at least once to have your data in shared memory.
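As a sketch of that advice (an untested rewrite of the "MWE added" kernel above, not a measured variant): keep the per-thread accumulator in a local variable, which the compiler will normally place in a register, and write to global memory once at the end.

__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
                       double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    double Z = tid * SCREEN_STEP_SIZE, Y = 0;
    double temp = WAVE_NUMBER / SCREEN_DIST;
    double acc = 0.0;    // per-thread accumulator lives in a register, not in memory
    for (double z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
    {
        double y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
        for (double y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
            acc += cos(temp * (Y * y + Z * z));
    }
    buf[tid] = acc;      // single write to global memory per thread
}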

Can CUDA branch divergence help me in this case?

This is little more than a thought experiment right now, but I want to check my understanding of the CUDA execution model. Consider the following case:
I am running on a GPU with poor double-precision performance (a non-Tesla card).
I have a kernel that needs to calculate a value using double precision. That value is a constant for the rest of the runtime of the kernel, and it is also constant across a warp.
Is something like the following pseudocode advantageous?
// value that we use later in the kernel; this is constant across all threads
// in a warp
int constant_value;
// check to see if this is the first thread in a warp
enum { warp_size = 32 };
if (!(threadIdx.x & (warp_size - 1)))
{
    // only do the double-precision math in one thread
    constant_value = (int) round(double_precision_calculation());
}
// broadcast constant_value to all threads in the warp
constant_value = __shfl(constant_value, 0);
// go on to use constant_value as needed later in the kernel
The reason why I considered doing this is my (possibly wrong) understanding of how double-precision resources are made available on each multiprocessor. From what I understand, there are simply 1/32 as many double-precision ALUs as single-precision ones on recent Geforce cards. Does this mean that if the other threads in a warp diverge, I can work around this lack of resources, and still get decent performance, as long as the double-precision values that I want can be broadcast to all threads in a warp?
Does this mean that if the other threads in a warp diverge, I can work around this lack of resources, and still get decent performance, as long as the double-precision values that I want can be broadcast to all threads in a warp?
No, you can't.
An instruction issue always occurs at the warp level, even in a warp-diverged scenario. Since it is issued at the warp level, it will require/use/schedule enough execution resources for the warp, even for inactive threads.
Therefore a computation done on only one thread will still use the same resources/scheduling slot as a computation done on all 32 threads in the warp.
For example, a floating point multiply will require 32 instances of usage of a floating point ALU. The exact scheduling of this will vary based on the specific GPU, but you cannot reduce the 32 instance usage to a lower number through warp divergence or any other mechanism.
Based on a question in the comments, here's a worked example on CUDA 7.5, Fedora 20, GT640 (GK208 - has 1/24 ratio of DP to SP units):
$ cat t1241.cu
#include <stdio.h>
#include <time.h>
#include <sys/time.h>

#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

const int nTPB = 32;
const int nBLK = 1;
const int rows = 1048576;
const int nSD = 128;

typedef double mytype;

template <bool use_warp>
__global__ void mpy_k(const mytype * in, mytype * out){
  __shared__ mytype sdata[nTPB*nSD];
  int idx = threadIdx.x + blockDim.x*blockIdx.x;
  mytype accum = in[idx];
#pragma unroll 128
  for (int i = 0; i < rows; i++)
    if (use_warp)
      accum += accum*sdata[threadIdx.x+(i&(nSD-1))*nTPB];
    else
      if (threadIdx.x == 0)
        accum += accum*sdata[threadIdx.x+(i&(nSD-1))*nTPB];
  out[idx] = accum;
}

int main(){
  mytype *din, *dout;
  cudaMalloc(&din, nTPB*nBLK*rows*sizeof(mytype));
  cudaMalloc(&dout, nTPB*nBLK*sizeof(mytype));
  cudaMemset(din, 0, nTPB*nBLK*rows*sizeof(mytype));
  cudaMemset(dout, 0, nTPB*nBLK*sizeof(mytype));
  mpy_k<true><<<nBLK, nTPB>>>(din, dout); // warm-up
  cudaDeviceSynchronize();
  unsigned long long dt = dtime_usec(0);
  mpy_k<true><<<nBLK, nTPB>>>(din, dout);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  printf("full warp elapsed time: %f\n", dt/(float)USECPSEC);
  mpy_k<false><<<nBLK, nTPB>>>(din, dout); // warm-up
  cudaDeviceSynchronize();
  dt = dtime_usec(0);
  mpy_k<false><<<nBLK, nTPB>>>(din, dout);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  printf("one thread elapsed time: %f\n", dt/(float)USECPSEC);
  cudaError_t res = cudaGetLastError();
  if (res != cudaSuccess) printf("CUDA runtime failure %s\n", cudaGetErrorString(res));
  return 0;
}
$ nvcc -arch=sm_35 -o t1241 t1241.cu
$ CUDA_VISIBLE_DEVICES="1" ./t1241
full warp elapsed time: 0.034346
one thread elapsed time: 0.049174
$
As the results show, it is not faster to use just one thread in the warp for a floating-point multiply.

Shared memory, branching performance and register count

I came across some peculiar performance behaviour when trying out the CUDA shuffle instruction. The test kernel below is based on an image processing algorithm which adds input-dependent values to all neighbouring pixels within a square of side rad. The output for each block is accumulated in shared memory. If only one thread per warp adds its result to shared memory, the performance is poor (Option 1); if all threads add to shared memory instead (one thread adds the desired value, the rest just add 0), the execution time drops by a factor of 2-3 (Option 2).
#include <iostream>
#include "cuda_runtime.h"

#define warpSz 32
#define tileY 32
#define rad 32

__global__ void test(float *out, int pitch)
{
    // Set shared mem to 0
    __shared__ float tile[(warpSz + 2*rad) * (tileY + 2*rad)];
    for (int i = threadIdx.y*blockDim.x + threadIdx.x; i < (tileY+2*rad)*(warpSz+2*rad); i += blockDim.x*blockDim.y) {
        tile[i] = 0.0f;
    }
    __syncthreads();

    for (int row = threadIdx.y; row < tileY; row += blockDim.y) {
        // Loop over pixels in neighbourhood
        for (int i = 0; i < 2*rad+1; ++i) {
            float res = 0.0f;
            int rowStartIdx = (row+i)*(warpSz+2*rad);
            for (int j = 0; j < 2*rad+1; ++j) {
                res += float(threadIdx.x+row); // Substitute for real calculation

                // Option 1: one thread writes to shared mem
                if (threadIdx.x == 0) {
                    tile[rowStartIdx + j] += res;
                    res = 0.0f;
                }

                //// Option 2: all threads write to shared mem
                //float tmp = 0.0f;
                //if (threadIdx.x == 0) {
                //    tmp = res;
                //    res = 0.0f;
                //}
                //tile[rowStartIdx + threadIdx.x+j] += tmp;

                res = __shfl(res, (threadIdx.x+1) % warpSz);
            }
            res += float(threadIdx.x+row);
            tile[rowStartIdx + threadIdx.x+2*rad] += res;
            __syncthreads();
        }
    }

    // Add result back to global mem
    for (int row = threadIdx.y; row < tileY+2*rad; row += blockDim.y) {
        for (int col = threadIdx.x; col < warpSz+2*rad; col += warpSz) {
            int idx = (blockIdx.y*tileY + row)*pitch + blockIdx.x*warpSz + col;
            atomicAdd(out+idx, tile[row*(warpSz+2*rad) + col]);
        }
    }
}

int main(void)
{
    int2 dim = make_int2(512, 512);
    int pitchOut = (((dim.x+2*rad)+warpSz-1) / warpSz) * warpSz;
    int sizeOut = pitchOut*(dim.y+2*rad);
    dim3 gridDim((dim.x+warpSz-1)/warpSz, (dim.y+tileY-1)/tileY, 1);

    float *devOut;
    cudaMalloc((void**)&devOut, sizeOut*sizeof(float));

    cudaEvent_t start, stop;
    float elapsedTime;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaFree(0);

    cudaEventRecord(start, 0);
    test<<<gridDim, dim3(warpSz, 8)>>>(devOut, pitchOut);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedTime, start, stop);

    cudaFree(devOut);
    cudaDeviceReset();
    std::cout << "Elapsed time: " << elapsedTime << " ms.\n";
    std::cin.ignore();
}
Is this expected behaviour/can anyone explain why this happens?
One thing I have noted is that Option 1 uses only 15 registers, whereas Option 2 uses 37, which seems like a big difference to me.
Another is that the if-statement in the innermost loop is converted to explicit bra instructions in the PTX code for Option 1, whereas for Option 2 it is converted to two selp instructions. Could it be that the explicit branching is behind the 2-3 times slowdown, similar to what's suspected in this question?
There are two reasons why I am reluctant to go for Option 2. First, when profiling the original application it seems to be limited by shared memory bandwidth, which indicates that there is potential to increase performance by having fewer threads access it. Second, unless we use the volatile keyword, writes to shared memory can be optimised into registers. Since we are only interested in the contribution from the last thread to access each memory location (threadIdx.x == 0), and all the others add 0, this is not a problem as long as all changes temporarily held in registers are guaranteed to be written back to shared memory in the order they were issued. Is that the case, though? (So far, both options have produced exactly the same result.)
Any thoughts or ideas are much appreciated!
PS. I compile for compute capability 3.0. (However, the shuffle instruction is not necessary to demonstrate the behaviour and can be commented out.)

Dot Product in CUDA using atomic operations - getting wrong results

I am trying to implement the dot product in CUDA and compare the result with what MATLAB returns. My CUDA code (based on this tutorial) is the following:
#include <stdio.h>

#define N (2048 * 8)
#define THREADS_PER_BLOCK 512
#define num_t float

// The kernel - DOT PRODUCT
__global__ void dot(num_t *a, num_t *b, num_t *c)
{
    __shared__ num_t temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads(); // Synchronize!
    *c = 0.00;
    // Does it need to be tid==0 that
    // undertakes this task?
    if (0 == threadIdx.x) {
        num_t sum = 0.00;
        int i;
        for (i=0; i<THREADS_PER_BLOCK; i++)
            sum += temp[i];
        atomicAdd(c, sum);
        //WRONG: *c += sum;  This read-write operation must be atomic!
    }
}

// Initialize the vectors:
void init_vector(num_t *x)
{
    int i;
    for (i=0 ; i<N ; i++){
        x[i] = 0.001 * i;
    }
}

// MAIN
int main(void)
{
    num_t *a, *b, *c;
    num_t *dev_a, *dev_b, *dev_c;
    size_t size = N * sizeof(num_t);

    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, size);

    a = (num_t*)malloc(size);
    b = (num_t*)malloc(size);
    c = (num_t*)malloc(size);

    init_vector(a);
    init_vector(b);

    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

    dot<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, sizeof(num_t), cudaMemcpyDeviceToHost);

    printf("a = [\n");
    int i;
    for (i=0;i<10;i++){
        printf("%g\n",a[i]);
    }
    printf("...\n");
    for (i=N-10;i<N;i++){
        printf("%g\n",a[i]);
    }
    printf("]\n\n");
    printf("a*b = %g.\n", *c);

    free(a); free(b); free(c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
}
and I compile it with:
/usr/local/cuda-5.0/bin/nvcc -m64 -I/usr/local/cuda-5.0/include -gencode arch=compute_20,code=sm_20 -o multi_dot_product.o -c multi_dot_product.cu
g++ -m64 -o multi_dot_product multi_dot_product.o -L/usr/local/cuda-5.0/lib64 -lcudart
Information about my NVIDIA cards can be found at http://pastebin.com/8yTzXUuK. I tried to verify the result in MATLAB using the following simple code:
N = 2048 * 8;
a = zeros(N,1);
for i=1:N
a(i) = 0.001*(i-1);
end
dot_product = a'*a;
But as N increases, I'm getting significantly different results (for instance, for N=2048*32 CUDA returns 6.73066e+07 while MATLAB returns 9.3823e+07; for N=2048*64 CUDA gives 3.28033e+08 while MATLAB gives 7.5059e+08). I am inclined to believe that the discrepancy stems from the use of float in my C code, but if I replace it with double the compiler complains that atomicAdd does not support double parameters. How should I fix this problem?
Update: Also, for high values of N (e.g. 2048*64), I noticed that the result returned by CUDA changes at every run. This does not happen if N is low (e.g. 2048*8).
At the same time I have a more fundamental question: the variable temp is an array of size THREADS_PER_BLOCK and is shared between threads in the same block. Is it also shared between blocks, or does every block operate on a different copy of this variable? Should I think of the kernel dot as instructions to every block? Can someone elaborate on how exactly the jobs are split and how the variables are shared in this example?
Comment this line out of your kernel:
// *c = 0.00;
And add these lines to your host code, before the kernel call (after the cudaMalloc of dev_c):
num_t h_c = 0.0f;
cudaMemcpy(dev_c, &h_c, sizeof(num_t), cudaMemcpyHostToDevice);
And I believe you'll get results that match MATLAB, more or less.
The fact that you have this line in your kernel unprotected by any synchronization is messing you up. Every thread of every block, whenever it happens to execute, is zeroing out c as you have written it.
By the way, we can do significantly better with this operation in general by using a classical parallel reduction method. A basic (not optimized) illustration is here. If you combine that method with your usage of shared memory and a single atomicAdd at the end (one atomicAdd per block) you'll have a significantly improved implementation. Although it's not a dot product, this example combines those ideas.
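As a rough sketch of that combination (assuming THREADS_PER_BLOCK is a power of two, as in the question), the kernel could do a shared-memory tree reduction per block followed by a single atomicAdd per block:

__global__ void dot(num_t *a, num_t *b, num_t *c)
{
    __shared__ num_t temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();

    // Tree reduction in shared memory: log2(THREADS_PER_BLOCK) steps
    for (int stride = THREADS_PER_BLOCK / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            temp[threadIdx.x] += temp[threadIdx.x + stride];
        __syncthreads();
    }

    // One atomic per block instead of a serial loop in thread 0
    if (threadIdx.x == 0)
        atomicAdd(c, temp[0]);
}

As before, *c still has to be zeroed from the host before the launch.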
Edit: responding to a question below in the comments:
A kernel function is the set of instructions that all threads in the grid (all threads associated with a kernel launch, by definition) execute. However, it's reasonable to think of execution as being managed per threadblock, since the threads in a threadblock execute together to a large extent. Even within a threadblock, though, execution is not necessarily in perfect lockstep across all threads. Normally when we think of lockstep execution, we think of a warp, which is a group of 32 threads in a single threadblock. Therefore, since execution amongst warps within a block can be skewed, this hazard was present even for a single threadblock. However, if there were only one threadblock, we could have gotten rid of the hazard in your code using appropriate sync and control mechanisms like __syncthreads() and if (threadIdx.x == 0), etc. But those mechanisms are useless for the general case of controlling execution across multiple threadblocks. Multiple threadblocks can execute in any order. The only defined sync mechanism across an entire grid is the kernel launch itself. Therefore, to fix your issue, we had to zero out c prior to the kernel launch.

cudaDeviceReset for multiple gpu's

I am currently working on a GPU server which has 4 Tesla T10 GPUs. Since I keep testing kernels and frequently have to kill the processes with Ctrl-C, I added a few lines to the end of a simple device query program. The code is given below:
#include <stdio.h>

// Print device properties
void printDevProp(cudaDeviceProp devProp)
{
    printf("Major revision number: %d\n", devProp.major);
    printf("Minor revision number: %d\n", devProp.minor);
    printf("Name: %s\n", devProp.name);
    printf("Total global memory: %u\n", devProp.totalGlobalMem);
    printf("Total shared memory per block: %u\n", devProp.sharedMemPerBlock);
    printf("Total registers per block: %d\n", devProp.regsPerBlock);
    printf("Warp size: %d\n", devProp.warpSize);
    printf("Maximum memory pitch: %u\n", devProp.memPitch);
    printf("Maximum threads per block: %d\n", devProp.maxThreadsPerBlock);
    for (int i = 0; i < 3; ++i)
        printf("Maximum dimension %d of block: %d\n", i, devProp.maxThreadsDim[i]);
    for (int i = 0; i < 3; ++i)
        printf("Maximum dimension %d of grid: %d\n", i, devProp.maxGridSize[i]);
    printf("Clock rate: %d\n", devProp.clockRate);
    printf("Total constant memory: %u\n", devProp.totalConstMem);
    printf("Texture alignment: %u\n", devProp.textureAlignment);
    printf("Concurrent copy and execution: %s\n", (devProp.deviceOverlap ? "Yes" : "No"));
    printf("Number of multiprocessors: %d\n", devProp.multiProcessorCount);
    printf("Kernel execution timeout: %s\n", (devProp.kernelExecTimeoutEnabled ? "Yes" : "No"));
    return;
}

int main()
{
    // Number of CUDA devices
    int devCount;
    cudaGetDeviceCount(&devCount);
    printf("CUDA Device Query...\n");
    printf("There are %d CUDA devices.\n", devCount);

    // Iterate through devices
    for (int i = 0; i < devCount; ++i)
    {
        // Get device properties
        printf("\nCUDA Device #%d\n", i);
        cudaDeviceProp devProp;
        cudaGetDeviceProperties(&devProp, i);
        printDevProp(devProp);
    }

    printf("\nPress any key to exit...");
    char c;
    scanf("%c", &c);

    for (int i = 0; i < devCount; i++) {
        cudaSetDevice(i);
        cudaDeviceReset();
    }

    return 0;
}
My query is related to the for loop just before main() returns, in which I set each device one by one and then call cudaDeviceReset. Although this code doesn't produce any error, I get the strange feeling that I am not actually resetting all the devices; instead, the program seems to reset only the default device, i.e. device 0, each time. Can anyone tell me what I should do to reset each of the 4 devices?
Thanks
It looks like you can add a function to your GPU programs to catch the ctrl+c signal (SIGINT) and call the cudaDeviceReset() function for each device that was used by the program.
The example code to call a function when SIGINT is caught can be found here:
https://stackoverflow.com/a/482725
It seems like a good practice to include code like this for every GPU program you write, and I will do the same :-)
I don't have time to write up a full detailed answer, so read the other answer and its comments as well.
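A minimal sketch of that idea (the handler name and exit code are made up; error checking omitted): register a handler with signal() and reset every device inside it before exiting.

#include <csignal>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical handler: reset every device visible to this process, then exit.
void handle_sigint(int)
{
    int devCount = 0;
    cudaGetDeviceCount(&devCount);
    for (int i = 0; i < devCount; ++i) {
        cudaSetDevice(i);      // select device i in this process
        cudaDeviceReset();     // destroy its context
    }
    std::exit(130);            // 128 + SIGINT
}

int main()
{
    std::signal(SIGINT, handle_sigint);   // install the handler before launching kernels
    // ... device queries / kernel launches go here ...
    return 0;
}

Strictly speaking, CUDA calls are not async-signal-safe, so a more defensive variant (like the answer below) only sets a flag on the first Ctrl+C and performs the cleanup from the main loop.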
This is probably too late, but if you write a signal-handler function you can get rid of the memory leaks and reset the devices reliably:
// State variables for signal handling
extern int no_sigint;
int no_sigint = 1;
extern int interrupts;
int interrupts = 0;

/* Catches signal interrupts from Ctrl+c.
   If 1 signal is detected the simulation finishes the current frame and
   exits in a clean state. If Ctrl+c is pressed again it terminates the
   application without completing writes to files or calculations but
   deallocates all memory anyway. */
void
sigint_handler (int sig)
{
  if (sig == SIGINT)
    {
      interrupts += 1;
      std::cout << std::endl
                << "Aborting loop.. finishing frame."
                << std::endl;
      no_sigint = 0;

      if (interrupts >= 2)
        {
          std::cerr << std::endl
                    << "Multiple Interrupts issued: "
                    << "Clearing memory and Forcing immediate shutdown!"
                    << std::endl;

          // write a function to free dynamically allocated memory
          free_mem ();

          int devCount;
          cudaGetDeviceCount (&devCount);
          for (int i = 0; i < devCount; ++i)
            {
              cudaSetDevice (i);
              cudaDeviceReset ();
            }

          exit (9);
        }
    }
}

....

int main(){
    .....
    for (int simulation_step = 1; simulation_step < SIM_STEPS && no_sigint; ++simulation_step)
    {
        .... simulation code
    }

    free_mem();
    ... cuda device resets
    return 0;
}
If you use this code (you can even put the first snippet in an external header), it works. You get two levels of control over Ctrl+c: the first press stops your simulation and exits normally, but the application finishes rendering the current step, which is great for stopping gracefully and getting correct results; if you press Ctrl+c again, it closes the application, freeing all memory.
cudaDeviceReset is intended for destroying the resources associated with a given GPU context within the process in which it is run. One CUDA process can't reset or otherwise affect the context of another process. So when your modified device query calls cudaDeviceReset, it only releases the resources that it allocated itself, not those in use by any other process.