Threads hierarchy design in CUDA for my code - cuda

I want to convert my previous code in c++ to CUDA
for(int x=0 ; x < 100; x++)
{
for(int y=0 ; y < 100; y++)
{
for(int w=0 ; w < 100; w++)
{
for(int z=0 ; z < 100; z++)
{
........
}
}
}
}
these loops combine to make a new int value.
if I want to use CUDA I have to design threads hierarchy before building
the kernel code.
So How can I design the hierarchy ?
depend on every loop I think it will be like this:
100*100*100*100 = 100000000 thread .
Could you help me
Thanks
My CUDA spec:
CUDA Device #0
Major revision number: 1
Minor revision number: 1
Name: GeForce G 105M
Total global memory: 536870912
Total shared memory per block: 16384
Total registers per block: 8192
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 512
Maximum dimension 1 of block: 512
Maximum dimension 2 of block: 512
Maximum dimension 3 of block: 64
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Maximum dimension 3 of grid: 1
Clock rate: 1600000
Total constant memory: 65536
Texture alignment: 256
Concurrent copy and execution: No
Number of multiprocessors: 1
Kernel execution timeout: Yes

100000000 threads (or blocks) is not too many for a GPU.
Your GPU has compute capability 1.1, so it is limited to 65535 blocks in each of the first two grid dimensions (x and y). Since 100*100 = 10000, we could launch 10000 blocks in each of the first two grid dimensions, to cover your entire for-loop extent. This would launch one block per for-loop iteration (unique combination of x,y,z, and w) and assume that you would use the threads in a block to address the needs of your for-loop calculation code:
__global__ void mykernel(...){
int idx = blockIdx.x;
int idy = blockIdx.y;
int w = idx/100;
int z = idx%100;
int x = idy/100;
int y = idy%100;
int tx = threadIdx.x;
// (the body of your for-loop code here...
}
launch:
dim3 blocks(10000, 10000);
dim3 threads(...); // can use any number here up to 512 for your device
mykernel<<<blocks, threads>>>(...);
If instead, you wanted to assign one thread to each of the inner z iterations of your for-loop (might be useful/higher performance depending on what you are doing and your data organization) you could do something like this:
__global__ void mykernel(...){
int idx = blockIdx.x;
int idy = blockIdx.y;
int w = idx/100;
int x = idx%100;
int y = idy;
int z = threadIdx.x;
// (the body of your for-loop code here...
}
launch:
dim3 blocks(10000, 100);
dim3 threads(100);
mykernel<<<blocks, threads>>>(...);
All of the above assumes your for-loop iterations are independent. If your for-loop iterations are dependent on each other (dependent on the order of execution) then these simplistic answers won't work, and you have not provided enough information in your question to discuss a reasonable strategy.

Related

nvprof events "fb_subp0_read_sectors" and "fb_subp1_read_sectors" do not report correct results

I tried to count the number of DRAM (global memory) accesses for simple vector add kernel.
__global__ void AddVectors(const float* A, const float* B, float* C, int N)
{
int blockStartIndex = blockIdx.x * blockDim.x * N;
int threadStartIndex = blockStartIndex + threadIdx.x;
int threadEndIndex = threadStartIndex + ( N * blockDim.x );
int i;
for( i=threadStartIndex; i<threadEndIndex; i+=blockDim.x ){
C[i] = A[i] + B[i];
}
}
Grid Size = 180
Block size = 128
size of array = 180 * 128 * N floats where N is input parameter (elements per thread)
when N = 1, size of array = 180 * 128 * 1 floats = 90KB
All arrays A, B and C should be read from DRAM.
Therefore theoretically,
DRAM writes (C) = 2880 (32 byte accesses)
DRAM reads (A,B) = 2880 + 2880 = 5760 (32 byte accesses)
But when I used nvprof
DRAM writes = fb_subp0_write_sectors + fb_subp1_write_sectors = 1440 + 1440 = 2880 (32 byte accesses)
DRAM reads = fb_subp0_read_sectors + fb_subp1_read_sectors = 23 + 7 = 30 (32 byte accesses)
Now this is the problem. Theoretically there should be 5760 DRAM reads, but nvprof only reports 30, for me this looks impossible. Further more, if you double the size of the vector (N = 2), still the reported DRAM accesses remains at 30.
It would be great, if someone can shed some light.
I have disabled the L1 cache by using compiler option "-Xptxas -dlcm=cg"
Thanks,
Waruna
If you have done cudaMemcpy before the kernel launch to copy the source buffers from host to device, that gets the source buffers in L2 cache and hence the kernel doesn't see any misses from L2 for reads and you get less number of (fb_subp0_read_sectors + fb_subp1_read_sectors).
If you comment out cudaMemcpy before the kernel launch, you will see that the event values of fb_subp0_read_sectors and fb_subp1_read_sectors include the values you are expecting.

About the number of registers allocated per SM in CUDA

First Question.
The CUDA C Programming Guide is written like below.
The same on-chip memory is used for both L1 and shared memory: It can
be configured as 48 KB of shared memory and 16 KB of L1 cache or as 16
KB of shared memory and 48 KB of L1 cache
But, device query shows "Total number of registers available per block: 32768".
I use GTX580.(CC is 2.0)
The guide says default cache size is 16KB, but 32768 means 32768*4(byte) = 131072 Bytes = 128 KBytes. Actually, I don't know which is correct.
Second Question.
I set like below,
dim3 grid(32, 32); //blocks in a grid
dim3 block(16, 16); //threads in a block
kernel<<<grid,block>>>(...);
Then, the number of threads per a block is 256. => we need 256*N registers per a block.
N means the number of registers per a thread needed.
(256*N)*blocks is the number of registers per a SM.(not byte)
So, if default size is 16KB and threads/SM is MAX(1536), then N can't over 2. Because of "Maximum number of threads per multiprocessor: 1536".
16KB/4Bytes = 4096 registers, 4096/1536 = 2.66666...
In case of larger caches 48KB, N can't over 8.
48KB/4Bytes = 12288 registers, 12288/1536 = 8
Is that true? Actually I'm so confused.
Actually, My almost full code is here.
I think, the kernel is optimized when the block dimension is 16x16.
But, in case of 8x8, faster than 16x16 or similar.
I don't know the why.
the number of registers per a thread is 16, the shared memory is 80+16 bytes.
I had asked same question, but I couldn't get the exact solution.:
The result of an experiment different from CUDA Occupancy Calculator
#define WIDTH 512
#define HEIGHT 512
#define TILE_WIDTH 8
#define TILE_HEIGHT 8
#define CHANNELS 3
#define DEVICENUM 1
#define HEIGHTs HEIGHT/DEVICENUM
__global__ void PRINT_POLYGON( unsigned char *IMAGEin, int *MEMin, char a, char b, char c){
int Col = blockIdx.y*blockDim.y+ threadIdx.y; //Col is y coordinate
int Row = blockIdx.x*blockDim.x+ threadIdx.x; //Row is x coordinate
int tid_in_block = threadIdx.x + threadIdx.y*blockDim.x;
int bid_in_grid = blockIdx.x + blockIdx.y*gridDim.x;
int threads_per_block = blockDim.x * blockDim.y;
int tid_in_grid = tid_in_block + threads_per_block * bid_in_grid;
float result_a, result_b;
__shared__ int M[15];
for(int k = 0; k < 5; k++){
M[k] = MEMin[a*5+k];
M[k+5] = MEMin[b*5+k];
M[k+10] = MEMin[c*5+k];
}
int result_a_up = (M[11]-M[1])*(Row-M[0]) - (M[10]-M[0])*(Col-M[1]);
int result_b_up = (M[6] -M[1])*(M[0]-Row) - (M[5] -M[0])*(M[1]-Col);
int result_down = (M[11]-M[1])*(M[5]-M[0]) - (M[6]-M[1])*(M[10]-M[0]);
result_a = (float)result_a_up / (float)result_down;
result_b = (float)result_b_up / (float)result_down;
if((0 <= result_a && result_a <=1) && ((0 <= result_b && result_b <= 1)) && ((0 <= (result_a+result_b) && (result_a+result_b) <= 1))){
IMAGEin[tid_in_grid*CHANNELS] += M[2] + (M[7]-M[2])*result_a + (M[12]-M[2])*result_b; //Red Channel
IMAGEin[tid_in_grid*CHANNELS+1] += M[3] + (M[8]-M[3])*result_a + (M[13]-M[3])*result_b; //Green Channel
IMAGEin[tid_in_grid*CHANNELS+2] += M[4] + (M[9]-M[4])*result_a + (M[14]-M[4])*result_b; //Blue Channel
}
}
struct DataStruct {
int deviceID;
unsigned char IMAGE_SEG[WIDTH*HEIGHTs*CHANNELS];
};
void* routine( void *pvoidData ) {
DataStruct *data = (DataStruct*)pvoidData;
unsigned char *dev_IMAGE;
int *dev_MEM;
unsigned char *IMAGE_SEG = data->IMAGE_SEG;
HANDLE_ERROR(cudaSetDevice(5));
//initialize array
memset(IMAGE_SEG, 0, WIDTH*HEIGHTs*CHANNELS);
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
printf("Device %d Starting..\n", data->deviceID);
//Evaluate Time
cudaEvent_t start, stop;
cudaEventCreate( &start );
cudaEventCreate( &stop );
cudaEventRecord(start, 0);
HANDLE_ERROR( cudaMalloc( (void **)&dev_MEM, sizeof(int)*35) );
HANDLE_ERROR( cudaMalloc( (void **)&dev_IMAGE, sizeof(unsigned char)*WIDTH*HEIGHTs*CHANNELS) );
cudaMemcpy(dev_MEM, MEM, sizeof(int)*35, cudaMemcpyHostToDevice);
cudaMemset(dev_IMAGE, 0, sizeof(unsigned char)*WIDTH*HEIGHTs*CHANNELS);
dim3 grid(WIDTH/TILE_WIDTH, HEIGHTs/TILE_HEIGHT); //blocks in a grid
dim3 block(TILE_WIDTH, TILE_HEIGHT); //threads in a block
PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 0, 1, 2);
PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 0, 2, 3);
PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 0, 3, 4);
PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 0, 4, 5);
PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 3, 2, 4);
PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 2, 6, 4);
HANDLE_ERROR( cudaMemcpy( IMAGE_SEG, dev_IMAGE, sizeof(unsigned char)*WIDTH*HEIGHTs*CHANNELS, cudaMemcpyDeviceToHost ) );
HANDLE_ERROR( cudaFree( dev_MEM ) );
HANDLE_ERROR( cudaFree( dev_IMAGE ) );
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime( &elapsed_time_ms[data->deviceID], start, stop );
cudaEventDestroy(start);
cudaEventDestroy(stop);
elapsed_time_ms[DEVICENUM] += elapsed_time_ms[data->deviceID];
printf("Device %d Complete!\n", data->deviceID);
return 0;
}
The blockDim 8x8 is faster than 16x16 due to the increase in address divergence in your memory access when you increase the block size.
Metrics collected on GTX480 with 15 SMs.
metric 8x8 16x16
duration 161µs 114µs
issued_ipc 1.24 1.31
executed_ipc .88 .59
serialization 54.61% 28.74%
The number of instruction replays clues us in that we likely have bad memory access patterns.
achieved occupancy 88.32% 30.76%
0 warp schedulers issues 8.81% 7.98%
1 warp schedulers issues 2.36% 29.54%
2 warp schedulers issues 88.83% 52.44%
16x16 appears to keep the warp scheduler busy. However, it is keeping the schedulers busy re-issuing instructions.
l1 global load trans 524,407 332,007
l1 global store trans 401,224 209,139
l1 global load trans/request 3.56 2.25
l1 global store trans/request 16.33 8.51
The first priority is to reduce transactions per request. The Nsight VSE source view can display memory statistics per instruction. The primary issue in your kernel is the interleaved U8 load and stores for IMAGEin[] += value. At 16x16 this is resulting in 16.3 transactions per request but only 8.3 for 8x8 configuration.
Changing
IMAGEin[(i*HEIGHTs+j)*CHANNELS] += ...
to be consecutive increases performance of 16x16 by 3x. I imagine increasing channels to 4 and handling packing in the kernel will improve cache performance and memory throughput.
If you fix the number of memory transactions per request you will then likely have to look at execution dependencies and try to increase your ILP.
It is faster with block size of 8x8 because it is a lesser multiple of 32, as it is visible in the picture below, there are 32 CUDA cores bound together, with two different warp schedulers that actually schedule the same thing. So the same instruction is executed on these 32 cores in each execution cycle.
To better clarify this, in the first case (8x8) each block is made of two warps (64 threads) so it is finished within only two execution cycles, however, when you are using (16x16) as your block size, each takes 8 warps (256 threads), therefore taking 4 times more execution cycles resulting in a slower compound.
However, filling an SM with more warps is better in some cases, when memory access is high and each warp is likely to go into a memory stall (i.e. getting its operands from memory), then it will be replaced with another warp until the memory operation gets completed. Therefore resulting in more occupancy of the SM.
You should of course throw in the number of blocks per SM and number of SMs total in your calculations, for example, assigning more than 8 blocks to a single SM might reduce its occupancy, but probably in your case, you are not facing these issues, because 256 is generally a better number than 64, since it will balance your blocks among SMs whereas using 64 threads will result in more blocks getting executed in the same SM.
EDIT: This answer is based on my speculations, for a more scientific approach, see Greg Smiths answer.
Register pool is different from shared memory/cache, to the very bottom of their architecture!
Registers are made of Flip-flops and L1 cache are probably SRAM.
Just to get an idea, look at the picture below which represents FERMI architecture, then update your question to further specify the problem you are facing.
As a note, you can see how many registers and shared memory (smem) are taken by your functions by passing the option --ptxas-options = -v to nvcc.

Texture fetch slower than direct global access, chapter 7 from "Cuda by example" book

I am reading and testing the examples in the book "Cuda By example. An introduction to General Purpose GPU Programming".
When testing the examples in chapter 7, relative to texture memory, I realized that access to global memory via texture cache is much slower than direct access (My NVIDIA GPU is GeForceGTX 260, compute capability 1.3 and I am using NVDIA CUDA 4.2):
Time per frame with texture fetch (1D or 2D) for a 256*256 image: 93 ms
Time per frame not using texture (just direct global access) for 256*256: 8.5 ms
I have double checked the code several times and I have also been reading the "CUDA C Programming guide" and "CUDA C Best practices Guide" which come along with the SDK, and I do not really understand the problem.
As far as I understand, texture memory is just global memory with a specific access mechanism implementation to make it look like a cache (?). I am wondering whether coalesced access to global memory will make texture fetch slower, but I cannot be sure.
Does anybody have a similar problem?
(I found some links in NVIDIA forums for a similar problem, but the link is no longer available.)
The testing code looks this way, only including the relevant parts:
//#define TEXTURE
//#define TEXTURE2
#ifdef TEXTURE
// According to C programming guide, it should be static (3.2.10.1.1)
static texture<float> texConstSrc;
static texture<float> texIn;
static texture<float> texOut;
#endif
__global__ void copy_const_kernel( float *iptr
#ifdef TEXTURE2
){
#else
,const float *cptr ) {
#endif
// map from threadIdx/BlockIdx to pixel position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int offset = x + y * blockDim.x * gridDim.x;
#ifdef TEXTURE2
float c = tex1Dfetch(texConstSrc,offset);
#else
float c = cptr[offset];
#endif
if ( c != 0) iptr[offset] = c;
}
__global__ void blend_kernel( float *outSrc,
#ifdef TEXTURE
bool dstOut ) {
#else
const float *inSrc ) {
#endif
// map from threadIdx/BlockIdx to pixel position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int offset = x + y * blockDim.x * gridDim.x;
int left = offset - 1;
int right = offset + 1;
if (x == 0) left++;
if (x == SXRES-1) right--;
int top = offset - SYRES;
int bottom = offset + SYRES;
if (y == 0) top += SYRES;
if (y == SYRES-1) bottom -= SYRES;
#ifdef TEXTURE
float t, l, c, r, b;
if (dstOut) {
t = tex1Dfetch(texIn,top);
l = tex1Dfetch(texIn,left);
c = tex1Dfetch(texIn,offset);
r = tex1Dfetch(texIn,right);
b = tex1Dfetch(texIn,bottom);
} else {
t = tex1Dfetch(texOut,top);
l = tex1Dfetch(texOut,left);
c = tex1Dfetch(texOut,offset);
r = tex1Dfetch(texOut,right);
b = tex1Dfetch(texOut,bottom);
}
outSrc[offset] = c + SPEED * (t + b + r + l - 4 * c);
#else
outSrc[offset] = inSrc[offset] + SPEED * ( inSrc[top] +
inSrc[bottom] + inSrc[left] + inSrc[right] -
inSrc[offset]*4);
#endif
}
// globals needed by the update routine
struct DataBlock {
unsigned char *output_bitmap;
float *dev_inSrc;
float *dev_outSrc;
float *dev_constSrc;
cudaEvent_t start, stop;
float totalTime;
float frames;
unsigned size;
unsigned char *output_host;
};
void anim_gpu( DataBlock *d, int ticks ) {
checkCudaErrors( cudaEventRecord( d->start, 0 ) );
dim3 blocks(SXRES/16,SYRES/16);
dim3 threads(16,16);
#ifdef TEXTURE
volatile bool dstOut = true;
#endif
for (int i=0; i<90; i++) {
#ifdef TEXTURE
float *in, *out;
if (dstOut) {
in = d->dev_inSrc;
out = d->dev_outSrc;
} else {
out = d->dev_inSrc;
in = d->dev_outSrc;
}
#ifdef TEXTURE2
copy_const_kernel<<<blocks,threads>>>( in );
#else
copy_const_kernel<<<blocks,threads>>>( in,
d->dev_constSrc );
#endif
blend_kernel<<<blocks,threads>>>( out, dstOut );
dstOut = !dstOut;
#else
copy_const_kernel<<<blocks,threads>>>( d->dev_inSrc,
d->dev_constSrc );
blend_kernel<<<blocks,threads>>>( d->dev_outSrc,
d->dev_inSrc );
swap( d->dev_inSrc, d->dev_outSrc );
#endif
}
// Some stuff for the events
// ...
}
I have been testing the results with the nvvp (NVIDIA profiler)
The result are quite curious as they show that there are a lot of texture cache misses (which are probably the cause for the bad performance).
The result from the profiler show also information that is difficult to understand even using the guide "CUPTI_User_GUide):
text_cache_hit: Number of texture cache hits (they are accounted only for one SM according to 1.3 capability).
text_cache_miss: Number of texture cache miss (they are accounted only for one SM according to 1.3 capability).
The following are the results for an example of 256*256 without using texture cache (only relevant info is shown):
Name Duration(ns) Grid_Size Block_Size
"copy_const_kernel(...) 22688 16,16,1 16,16,1
"blend_kernel(...)" 51360 16,16,1 16,16,1
Following are the results using 1D texture cache:
Name Duration(ns) Grid_Size Block_Size tex_cache_hit tex_cache_miss
"copy_const_kernel(...)" 147392 16,16,1 16,16,1 0 1024
"blend_kernel(...)" 841728 16,16,1 16,16,1 79 5041
Following are the results using 2D texture cache:
Name Duration(ns) Grid_Size Block_Size tex_cache_hit tex_cache_miss
"copy_const_kernel(...)" 150880 16,16,1 16,16,1 0 1024
"blend_kernel(...)" 872832 16,16,1 16,16,1 2971 2149
These result show several interesting info:
There are no cache hits at all for the "copy const" function (although ideally the memory is "spatially located", in the sense that each thread accesses memory which is near to the memory acceded by other near threads). I guess that this is because the threads within this function do not access memory from other threads, which seems to be the way for the texture cache to be usable (being the "spatially located" concept quite confusing)
There are some cache hits in the 1D and a lot more in the 2D case for the function "blend_kernel". I guess that it is due to the fact that within that function, any thread access memory from their neighbours threads. I cannot understand why there are more in 2D than 1d.
The duration time is greater in the texture cases than in the no texture case (nearly about one order of magnitude). Perhaps related with the so many texture cache misses.
For the "copy_const" function there are 1024 total accesses for the SM and 5120 for the "blend kernel". The relation 5:1 is correct due to the fact that there are 5 fetches in "blend" and only 1 in "copy_const". Anyway, I cannot understand where all this 1024 come from: ideally, this event "text cache miss/hot" only accounts for one SM (I have 24 in my GeForceGTX 260) and it only accounts for warps ( 32 thread size). Therefore, I have 256 threads/32=8 warps per SM and 256 blocks/24 = 10 or 11 "iterations" per SM, so I would be expecting something like 80 or 88 fetches (more over, some other event like sm_cta_launched, which is the number of thread blocks per SM, which is supposed to be supported in my 1.3 device, is always 0...)

CUDA - Memory Limit - Vector Summation

I'm trying to learn CUDA and the following code works OK for the values N<= 16384, but fails for the greater values(Summation check at the end of the code fails, c values are always 0 for the index value of i>=16384).
#include<iostream>
#include"cuda_runtime.h"
#include"../cuda_be/book.h"
#define N (16384)
__global__ void add(int *a,int *b,int *c)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if(tid<N)
{
c[tid] = a[tid] + b[tid];
tid += blockDim.x * gridDim.x;
}
}
int main()
{
int a[N],b[N],c[N];
int *dev_a,*dev_b,*dev_c;
//allocate mem on gpu
HANDLE_ERROR(cudaMalloc((void**)&dev_a,N*sizeof(int)));
HANDLE_ERROR(cudaMalloc((void**)&dev_b,N*sizeof(int)));
HANDLE_ERROR(cudaMalloc((void**)&dev_c,N*sizeof(int)));
for(int i=0;i<N;i++)
{
a[i] = -i;
b[i] = i*i;
}
HANDLE_ERROR(cudaMemcpy(dev_a,a,N*sizeof(int),cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(dev_b,b,N*sizeof(int),cudaMemcpyHostToDevice));
system("PAUSE");
add<<<128,128>>>(dev_a,dev_b,dev_c);
//copy the array 'c' back from the gpu to the cpu
HANDLE_ERROR( cudaMemcpy(c,dev_c,N*sizeof(int),cudaMemcpyDeviceToHost));
system("PAUSE");
bool success = true;
for(int i=0;i<N;i++)
{
if((a[i] + b[i]) != c[i])
{
printf("Error in %d: %d + %d != %d\n",i,a[i],b[i],c[i]);
system("PAUSE");
success = false;
}
}
if(success) printf("We did it!\n");
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
return 0;
}
I think it's a shared memory related problem, but I can't come up with a good explanation(Possible lack of knowledge). Could you provide me an explanation and a workaround to run for the values of N greater than 16384. Here is the specs for my GPU:
General Info for device 0
Name: GeForce 9600M GT
Compute capability: 1.1
Clock rate: 1250000
Device copy overlap : Enabled
Kernel Execution timeout : Enabled
Mem info for device 0
Total global mem: 536870912
Total const mem: 65536
Max mem pitch: 2147483647
Texture Alignment 256
MP info about device 0
Multiproccessor count: 4
Shared mem per mp: 16384
Registers per mp: 8192
Threads in warp: 32
Max threads per block: 512
Max thread dimensions: (512,512,64)
Max grid dimensions: (65535,65535,1)
You probably intended to write
while(tid<N)
not
if(tid<N)
You aren't running out of shared memory, your vector arrays are being copied into your device's global memory. As you can see this has far more space available than the 196608 bytes (16384*4*3) you need.
The reason for your problem is that you are only performing one addition operation per thread so hence with this structure, the maximum dimension that your vectors can be is the block*thread parameters in your kernel launch as tera has pointed out. By correcting
if(tid<N)
to
while(tid<N)
in your code, each thread will perform its addition on multiple indexes and the whole array will be considered.
For more information about the memory hierarchy and the various different places memory can sit, you should read sections 2.3 and 5.3 of the CUDA_C_Programming_Guide.pdf provided with the CUDA toolkit.
Hope that helps.
If N is:
#define N (33 * 1024) //value defined in Cuda by Examples
The same code I found in Cuda by Example, but the value of N was different. I think that o value of N cant be 33 * 1024. I must change the parameters number of block and number of threads per blocks. Because:
add<<<128,128>>>(dev_a,dev_b,dev_c); //16384 threads
(128 * 128) < (33 * 1024) so we have a crash.

OpenCl equivalent of finding Consecutive indices in CUDA

In CUDA to cover multiple blocks, and thus incerase the range of indices for arrays we do some thing like this:
Host side Code:
dim3 dimgrid(9,1)// total 9 blocks will be launched
dim3 dimBlock(16,1)// each block is having 16 threads // total no. of threads in
// the grid is thus 16 x9= 144.
Device side code
...
...
idx=blockIdx.x*blockDim.x+threadIdx.x;// idx will range from 0 to 143
a[idx]=a[idx]*a[idx];
...
...
What is the equivalent in OpenCL for acheiving the above case ?
On the host, when you enqueue your kernel using clEnqueueNDRangeKernel, you have to specify the global and local work size. For instance:
size_t global_work_size[1] = { 144 }; // 16 * 9 == 144
size_t local_work_size[1] = { 16 };
clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
global_work_size, local_work_size,
0, NULL, NULL);
In your kernel, use:
size_t get_global_size(uint dim);
size_t get_global_id(uint dim);
size_t get_local_size(uint dim);
size_t get_local_id(uint dim);
to retrieve the global and local work sizes and indices respectively, where dim is 0 for x, 1 for y and 2 for z.
The equivalent of your idx will thus be simply size_t idx = get_global_id(0);
See the OpenCL Reference Pages.
Equivalences between CUDA and OpenCL are:
blockIdx.x*blockDim.x+threadIdx.x = get_global_id(0)
LocalSize = blockDim.x
GlobalSize = blockDim.x * gridDim.x