__global__ void add(int *a, int *b, int *c, int n)
{
size_t index = threadIdx.x + blockIdx.x * blockDim.x;
if(index < n)
c[index] = a[index] + b[index] ;
}
Hello, I am trying to remember why the if test is necessary. I remember it has something to do with the block size in this kernel. Is it just about array bounds?
What will happen to threads whose index is n or above?
I remember it has something to do with the block size in this kernel. Is it just about array bounds?
Yes. Unless the number of elements in the input and output arrays exactly matches the number of threads launched, out-of-bounds memory accesses would occur. In practice the two rarely match, and it is normal to round up the grid size so that there are at least as many threads as array elements. The alternative would be to run fewer threads than inputs, which would leave part of the inputs and outputs unprocessed, and that doesn't make a lot of sense.
What will happen to threads whose index is n or above?
Nothing. They branch around the memory access portion of the code and exit without touching memory. Without the test, those accesses would be out of bounds and could result in a runtime error.
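To make the rounding-up concrete, here is a minimal sketch of the kernel together with a host-side launch. The helper name launch_add and the block size of 256 are assumptions for illustration, not part of the original question, and n is assumed to be positive.

```cuda
#include <cuda_runtime.h>

// Same kernel as in the question, with the index expression spelled out.
__global__ void add(const int *a, const int *b, int *c, int n)
{
    size_t index = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
    if (index < (size_t)n)          // threads past the end of the arrays do nothing
        c[index] = a[index] + b[index];
}

// Hypothetical host helper (not from the original post). The grid size is
// rounded up so that blocks * threadsPerBlock >= n, which is exactly why
// the kernel needs the bounds check. Assumes n > 0.
cudaError_t launch_add(const int *d_a, const int *d_b, int *d_c, int n)
{
    const int threadsPerBlock = 256;                              // assumed block size
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    add<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
    return cudaGetLastError();                                    // reports launch errors
}
```

With n = 1000 this launches 4 blocks of 256 threads, that is 1024 threads, and the last 24 simply fail the test and exit.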
I'm having trouble here. I launch two kernels and check whether some value is the expected one (via a memcpy to the host). If it is, I stop; if it isn't, I launch the two kernels again.
The first kernel:
__global__ void aco_step(const KPDeviceData* data)
{
int obj = threadIdx.x;
int ant = blockIdx.x;
int id = threadIdx.x + blockIdx.x * blockDim.x;
*(data->added) = 1;
while(*(data->added) == 1)
{
*(data->added) = 0;
//check if obj fits
int fits = (data->obj_weights[obj] + data->weight[ant] <= data->max_weight);
fits = fits * !(getElement(data->selections, data->selections_pitch, ant, obj));
if(obj == 0)
printf("ant %d going..\n", ant);
__syncthreads();
...
The code goes on after this, but that printf never gets printed. The __syncthreads() is there just for debugging purposes.
The "added" variable was shared, but since shared memory is a PITA and usually introduces bugs, I just removed it for now. This "added" variable isn't the smartest approach, but it's faster than the alternative, which is checking on the host whether any element of an array has some value and then deciding whether to keep iterating.
The getElement function simply does the pitched-matrix address calculation to reach the right position and returns the element there:
int* el = (int*) ((char*)mat + row * pitch) + col;
return *el;
The obj_weights array has the right size, n*sizeof(int). So does the weight array, ants*sizeof(float). So they aren't out of bounds.
The kernel after this one has a printf right at the beginning, and that doesn't get printed either. After the printf it sets a variable in device memory, which is copied to the CPU after the kernel finishes, and the value isn't right when I print it in the CPU code. So I think this kernel is doing something illegal and the second one doesn't even get launched.
I'm testing some instances. When I launch 8 blocks of 512 threads, it runs OK; 32 blocks of 512 threads is also OK. But with 8 blocks of 1024 threads the kernel doesn't work, and neither does 32 blocks of 1024 threads.
Am I doing something wrong? Memory access? Am I launching too many threads?
edit: I tried removing the "added" variable and the while loop, so it should execute just once. It still doesn't work: nothing gets printed, even if the printf is right after the three initial lines, and the next kernel also doesn't print anything.
edit: another thing, I'm using a GTX 570, so the "Maximum number of threads per block" is 1024 according to http://en.wikipedia.org/wiki/CUDA. Maybe I'll just stick with a maximum of 512, or check how high I can set this value.
__syncthreads() inside conditional code is only allowed if the condition evaluates identically on all threads of a block.
In your case the condition suffers a race condition and is nondeterministic, so it most probably evaluates to different results for different threads.
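One way to make a loop condition evaluate identically on all threads of a block is to let the barrier itself combine the per-thread flags. The sketch below is a hypothetical stand-in for the kernel in the question (the names drain and run_drain and the "decrement until zero" work are invented for illustration); it relies on __syncthreads_or(), which is available on devices of compute capability 2.0 and later.

```cuda
#include <cuda_runtime.h>

// Hypothetical example: each thread keeps decrementing its own entry while it
// is positive, and the block stops once no thread changed anything.
__global__ void drain(int *values, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int keep_going;
    do {
        int changed = 0;                 // per-thread flag, private to the thread
        if (i < n && values[i] > 0) {
            values[i] -= 1;
            changed = 1;
        }
        // Barrier plus block-wide OR: every thread receives the same result,
        // so all threads of the block leave the loop in the same iteration.
        keep_going = __syncthreads_or(changed);
    } while (keep_going);
}

// Hypothetical host helper: launches the kernel and waits for it to finish.
cudaError_t run_drain(int *d_values, int n)
{
    const int threads = 128;                          // assumed block size
    const int blocks = (n + threads - 1) / threads;
    drain<<<blocks, threads>>>(d_values, n);
    cudaError_t err = cudaGetLastError();
    return err != cudaSuccess ? err : cudaDeviceSynchronize();
}
```

Note that threads with i >= n still take part in every barrier; only the memory access is guarded, not the synchronization.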
printf() output is only displayed after the kernel finishes successfully. In this case it doesn't, due to the problem mentioned above, so the output never shows up. You could have figured this out by checking the return codes of all CUDA function calls for errors.
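As a rough sketch of what checking the return codes can look like, the helper below tests both the launch itself and the execution of the kernel. The names cuda_ok, hello and run_checked are made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints a message and returns false on any CUDA error.
static bool cuda_ok(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Trivial kernel used only to have something to launch.
__global__ void hello(int *flag)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("kernel ran\n");
        *flag = 1;
    }
}

// Hypothetical wrapper showing the two places where errors surface.
bool run_checked(int blocks, int threads, int *d_flag)
{
    hello<<<blocks, threads>>>(d_flag);
    // Launch-time errors (for example an invalid block size) show up here.
    if (!cuda_ok(cudaGetLastError(), "kernel launch"))
        return false;
    // Errors raised while the kernel executes show up at the next
    // synchronizing call.
    return cuda_ok(cudaDeviceSynchronize(), "kernel execution");
}
```

Calling run_checked(8, 512, d_flag) should succeed on any recent device, while a block size above the device limit is rejected at launch and reported by cudaGetLastError() instead of failing silently.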
I am a novice in GPU parallel computing and I'm trying to learn CUDA by looking at some examples in NVIDIA's "CUDA by Example" book.
I do not properly understand how threads access and change variables in such a simple example (the dot product of two vectors).
The kernel function is defined as follows
__global__ void dot( float *a, float *b, float *c ) {
__shared__ float cache[threadsPerBlock];
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int cacheIndex = threadIdx.x;
float temp = 0;
while (tid < N) {
temp += a[tid] * b[tid];
tid += blockDim.x * gridDim.x;
}
// set the cache values
cache[cacheIndex] = temp;
I do not understand three things.
1. What is the sequence of execution of this function? Is there any ordering between threads? For example, do the threads from the first block run first, then the threads from the second block, and so on? (This is connected to the question of why it is necessary to divide threads into blocks.)
2. Do all threads have their own copy of the "temp" variable or not? (If not, why is there no race condition?)
3. How does it operate? What exactly goes into the variable temp in the while loop? The array cache stores the values of temp for different threads. How does the summation proceed? It seems that temp already contains all the sums necessary for the dot product, because the variable tid goes from 0 to N-1 in the while loop.
Although the code you provided is incomplete, here are some clarifications about what you are asking:
The kernel code will be executed by all the threads in all the blocks. The way to "split the jobs" is to make each thread work on only one or a few elements.
For instance, if you have to process 100 integers with a specific algorithm, you probably want 100 threads to process 1 element each.
In CUDA the number of blocks and threads is defined at the kernel launch on the host side:
myKernel<<<grid, threads>>>(...);
where grid and threads are dim3 values, which define the sizes in up to three dimensions.
There is no specific order in the execution of threads and blocks, as you can read here:
http://mc.stanford.edu/cgi-bin/images/3/34/Darve_cme343_cuda_3.pdf
On page 6 : "No specific order in which blocks are dispatched and executed".
Since the temp variable is declared in the kernel as an ordinary local variable (with no __shared__ or other qualifier), it is not shared between threads, and each thread will normally have its value stored in a register.
This is equivalent to what is done on the CPU side. So yes, this means each thread has its own "temp" variable.
The temp variable is updated in each iteration of the loop, using access to device arrays.
Again, this is equivalent of what is done on CPU side.
I think you should check that you are comfortable enough with C/C++ programming on the CPU side before going further into GPU programming. No offense meant, but you seem to be missing several basic topics.
Since CUDA allows you to drive your GPU with C code, the difficulty is not in the syntax, but in the specificities of the hardware.
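To illustrate the points about the private temp variable and the strided loop, here is a sketch of a complete dot product. The listing in the question is cut off after the cache write, so everything after that line is an assumption modeled on the usual shared-memory tree reduction; the output array partial, the parameter n (in place of the constant N), the value of threadsPerBlock and the wrapper dot_on_gpu are likewise chosen for this example. Error checks are omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

const int threadsPerBlock = 256;   // assumed value; must be a power of two here

__global__ void dot(const float *a, const float *b, float *partial, int n)
{
    __shared__ float cache[threadsPerBlock];   // one array per block
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;

    float temp = 0;                    // private to this thread
    while (tid < n) {                  // this thread only visits tid, tid + stride, ...
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x; // stride = total number of threads
    }
    cache[cacheIndex] = temp;          // publish this thread's partial sum
    __syncthreads();

    // Assumed continuation: tree reduction of the block's partial sums.
    for (int i = blockDim.x / 2; i > 0; i /= 2) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();               // outside the if, so every thread reaches it
    }
    if (cacheIndex == 0)
        partial[blockIdx.x] = cache[0];
}

// Hypothetical host wrapper: launches the kernel and adds up the block sums.
float dot_on_gpu(const std::vector<float> &a, const std::vector<float> &b)
{
    const int n = (int)a.size();
    const int blocks = 32;             // assumed grid size
    const size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_partial;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    dot<<<blocks, threadsPerBlock>>>(d_a, d_b, d_partial, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_partial);

    float sum = 0;
    for (float p : partial) sum += p;  // final sum over blocks on the host
    return sum;
}
```

So temp in a single thread never holds the whole dot product: each thread covers only every (blockDim.x * gridDim.x)-th element, the cache array combines the threads of one block, and the per-block results are combined at the end.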
I am trying to figure out how well the global memory write accesses of one of my kernels are coalesced, based on the "global store efficiency" value of NVidia's profiler (I am using CUDA 5 toolkit preview release, on a Fermi GPU).
As far as I understand, this value is the ratio of the number of requested memory transactions to the number of transactions actually performed, and therefore reflects whether accesses are all perfectly coalesced (100% efficiency) or not.
Now, for a thread block width of 32, and taking float values as input and output, the following test kernel gives 100% efficiency both for global load and for global store, as expected:
__global__ void dummyKernel(float*output,float* input,size_t pitch)
{
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
int offset = y*pitch+x;
float tmp = input[offset];
output[offset] = tmp;
}
What I don't understand is why, when I start adding useful code in between the input read and the output write, the global store efficiency begins to drop, even though I have not changed the memory write pattern or the thread block geometry. The global load efficiency stays at 100%, as I expect, though.
Could someone please shed some light on why this happens? I thought that, since all 32 threads in a given warp execute the output store instruction simultaneously (by definition) and use a "coalescing-friendly" pattern, I should still get 100% whatever I do before it. Obviously I must be misunderstanding something about either the meaning of global store efficiency or the conditions for global store coalescing.
Thx,
EDIT :
Here is an example: if I use this code (just adding a "round" operation on input), global store efficiency drops from 100% to 95%
__global__ void dummyKernel(float*output,float* input,size_t pitch)
{
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
int offset = y*pitch+x;
float tmp = round(input[offset]);
output[offset] = tmp;
}
I'm unsure if this is the case, but round probably converts its argument to a double, and if there is register spilling, each thread would access 8 bytes of memory, which would then be converted to the 4 bytes of tmp. Accessing 8 bytes per thread would reduce the coalescing to a half-warp.
However, I believe register spilling shouldn't happen since the number of local variables in your kernel is small. You could check with nvcc --ptxas-options=-v for the spill.
Ok, shame on me, I found the problem: I was profiling this simple test code in Debug mode, which gives completely wild numbers for most metrics. Re-profiling in Release mode gave me the expected result : 100% store efficiency in both cases.
I have a buffer in global memory that I want to copy into shared memory for each block so as to speed up my read-only access. Each thread in each block will use the whole buffer at different positions concurrently.
How does one do that?
I know the size of the buffer only at run time:
__global__ void foo( int *globalMemArray, int N )
{
extern __shared__ int s_array[];
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if( idx < N )
{
...?
}
}
The first point to make is that shared memory is limited to a maximum of either 16 KB or 48 KB per streaming multiprocessor (SM), depending on which GPU you are using and how it is configured, so unless your global memory buffer is very small, you will not be able to load all of it into shared memory at the same time.
The second point to make is that the contents of shared memory only have the scope and lifetime of the block they are associated with. Your sample kernel only has a single global memory argument, which makes me think that you are either under the misapprehension that the contents of a shared memory allocation can be preserved beyond the life span of the block that filled it, or that you intend to write the results of the block calculations back into the same global memory array from which the input data was read. The first possibility is wrong and the second will result in memory races and inconsistent results. It is probably better to think of shared memory as a small, block-scope L1 cache which is fully programmer managed than as some sort of faster version of global memory.
With those points out of the way, a kernel which loaded successive segments of a large input array, processed them and then wrote some per-thread final result back into global memory might look something like this:
template <int blocksize>
__global__ void foo( int *globalMemArray, int *globalMemOutput, int N )
{
__shared__ int s_array[blocksize];
int npasses = (N / blocksize) + (((N % blocksize) > 0) ? 1 : 0);
for(int pos = threadIdx.x; pos < (blocksize*npasses); pos += blocksize) {
if( pos < N ) {
s_array[threadIdx.x] = globalMemArray[pos];
}
__syncthreads();
// Calculations using partial buffer contents
.......
__syncthreads();
}
// write final per thread result to output
globalMemOutput[threadIdx.x + blockIdx.x*blockDim.x] = .....;
}
In this case I have specified the shared memory array size as a template parameter, because it isn't really necessary to dynamically allocate the shared memory array size at runtime, and the compiler has a better chance at performing optimizations when the shared memory array size is known at compile time (perhaps in the worst case there could be selection between different kernel instances done at run time).
The CUDA SDK contains a number of good example codes which demonstrate different ways that shared memory can be used in kernels to improve memory read and write performance. The matrix transpose, reduction and 3D finite difference method examples are all good models of shared memory usage. Each also has a good paper which discusses the optimization strategies behind the shared memory use in the codes. You would be well served by studying them until you understand how and why they work.
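For the case where the buffer really is small enough to fit, the run-time-sized variant asked about in the question can be sketched as below. This is only an illustration: the kernel name sum_rotations, the output array and the "sum the buffer starting at a different offset per thread" computation are invented, and the shared memory size is passed as the third launch parameter.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: every block copies the whole buffer into dynamically
// sized shared memory, then each thread reads all of it starting at its own
// offset. Assumes N * sizeof(int) fits in the shared memory of one block.
__global__ void sum_rotations(const int *globalMemArray, int *out,
                              int N, int numOut)
{
    extern __shared__ int s_array[];   // size is set at launch time

    // Cooperative load: the threads of the block split the copy between them.
    for (int i = threadIdx.x; i < N; i += blockDim.x)
        s_array[i] = globalMemArray[i];
    __syncthreads();                   // buffer is complete before anyone reads it

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < numOut) {
        int acc = 0;
        for (int k = 0; k < N; ++k)
            acc += s_array[(idx + k) % N];   // read-only access from shared memory
        out[idx] = acc;
    }
}

// Hypothetical host helper: the third launch parameter is the number of bytes
// of dynamic shared memory per block.
cudaError_t launch_sum_rotations(const int *d_buf, int *d_out, int N, int numOut)
{
    const int threads = 128;                           // assumed block size
    const int blocks = (numOut + threads - 1) / threads;
    const size_t sharedBytes = N * sizeof(int);
    sum_rotations<<<blocks, threads, sharedBytes>>>(d_buf, d_out, N, numOut);
    return cudaGetLastError();
}
```

The barrier sits outside the bounds check so that every thread of the block reaches it, including threads that produce no output.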
I have a short piece of code like this:
typedef struct {
double sX;
double sY;
double vX;
double vY;
int rX;
int rY;
int mass;
int species;
int boxnum;
} particle;
typedef struct {
double mX;
double mY;
double count;
int rotDir;
double cX;
double cY;
int superDir;
} box;
//....
int i;
for(i=0;i<PART_COUNT;i++) {
particles[i].boxnum = ((((int)(particles[i].sX+boxShiftX))/BOX_SIZE)%BWIDTH+BWIDTH*((((int)(particles[i].sY+boxShiftY))/BOX_SIZE)%BHEIGHT));
}
for(i=0;i<PART_COUNT;i++) {
//sum the momenta
boxnum = particles[i].boxnum;
boxes[boxnum].mX += particles[i].vX*particles[i].mass;
boxes[boxnum].mY += particles[i].vY*particles[i].mass;
boxes[boxnum].count++;
}
Now, I want to port this to CUDA. The first step is easy; spreading the calculation across a bunch of threads is no problem. The issue is the second. Since any two particles are equally likely to be in any same box, I'm not sure how I can partition it so as to avoid conflicts.
Number of particles is on the order of 10,000 to 10,000,000, and number of boxes is on the order of 1024 to 1048576.
Ideas?
You can try to use atomicAdd operations to modify your boxes array. Atomic operations on global memory are very slow, but at the same time it's nearly impossible to do any optimizations involving shared memory, for two reasons:
1. Under the assumption that the boxnum properties of the particles particles[0]..particles[n] aren't ordered and do not lie within any small range (on the order of a block size), you can't predict which boxes to load from global memory into shared memory. You would first have to collect all the box numbers.
2. If you try to collect all box numbers, you can't use an array with every possible box number as an index, since there are way too many boxes to fit into shared memory. So you'd have to collect indices with a queue (realized with an array, a pointer to the next free slot and atomic operations), but then you'd still have conflicts because the same box number could occur multiple times in your queue.
Conclusion: atomicAdd will give you at least correct behavior. Try it out and test the performance. If you aren't satisfied by the performance, think if there's another way to do the same computations that would profit from shared memory.
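A minimal sketch of the atomicAdd approach follows. Because the box fields are doubles, and a native atomicAdd on double only exists on devices of compute capability 6.0 and later, the sketch uses the well-known atomicCAS-based loop so that it also compiles for older targets; the helper name atomic_add_double, the kernel name sum_momenta and the launch helper are assumptions for this example.

```cuda
#include <cuda_runtime.h>

typedef struct {
    double sX; double sY; double vX; double vY;
    int rX; int rY; int mass; int species; int boxnum;
} particle;

typedef struct {
    double mX; double mY; double count;
    int rotDir; double cX; double cY; int superDir;
} box;

// Double-precision atomic add built on atomicCAS. On compute capability 6.0+
// the built-in atomicAdd(double*, double) could be used instead.
__device__ double atomic_add_double(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);          // retry if another thread got in between
    return __longlong_as_double(old);
}

// One thread per particle; conflicting updates to the same box are serialized
// by the atomics instead of being lost.
__global__ void sum_momenta(const particle *particles, box *boxes, int partCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < partCount) {
        int b = particles[i].boxnum;
        atomic_add_double(&boxes[b].mX, particles[i].vX * particles[i].mass);
        atomic_add_double(&boxes[b].mY, particles[i].vY * particles[i].mass);
        atomic_add_double(&boxes[b].count, 1.0);
    }
}

// Hypothetical host helper with an assumed block size of 256.
cudaError_t launch_sum_momenta(const particle *d_particles, box *d_boxes,
                               int partCount)
{
    const int threads = 256;
    const int blocks = (partCount + threads - 1) / threads;
    sum_momenta<<<blocks, threads>>>(d_particles, d_boxes, partCount);
    return cudaGetLastError();
}
```

The boxes array is assumed to be zeroed before the launch. Since floating-point addition is not associative and the order of the atomic updates is not fixed, results may differ in the last bits from run to run.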
As an alternative, you could launch a 2D grid of blocks.
blocks.x = numParticles / threadsPerBlock / repeatPerBlock.
blocks.y = numOfBoxes / 1024;
Each block performs atomic additions in shared memory if and only if boxnum lies between 1024 * blockIdx.y (inclusive) and 1024 * (blockIdx.y + 1) (exclusive).
This is followed by a reduction along blocks.x.
This may or may not be faster than atomicAdd on global memory as the data is read blocks.y number of times. This could however be fixed if the "particles" are sorted by boxnum in a sorting pass followed by a partitioning pass.
There may be several other ways to do it, but since the problem size varies by a large amount, you may end up having to write 2-3 different methods that are optimized for a given size range.
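The 2D-grid idea can be sketched roughly as follows. To keep the example short it makes several simplifying assumptions that are not in the answer: it only counts particles per box (integer atomics), it takes a plain array of box numbers instead of the particle struct, and it merges each block's tile into global memory with atomicAdd rather than running a separate reduction along blocks.x. All names are hypothetical.

```cuda
#include <cuda_runtime.h>

#define TILE 1024   // boxes handled per blockIdx.y, as in the description above

__global__ void count_per_box(const int *boxnum, int *counts,
                              int numParticles, int numBoxes)
{
    __shared__ int local[TILE];            // this block's private tile of boxes
    const int first = TILE * blockIdx.y;   // first box covered by this tile

    for (int j = threadIdx.x; j < TILE; j += blockDim.x)
        local[j] = 0;
    __syncthreads();

    // Grid-stride loop over the particles along x.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < numParticles;
         i += blockDim.x * gridDim.x) {
        int b = boxnum[i];
        if (b >= first && b < first + TILE)
            atomicAdd(&local[b - first], 1);   // shared-memory atomic
    }
    __syncthreads();

    // Merge step (assumption: global atomics instead of a reduction pass).
    for (int j = threadIdx.x; j < TILE; j += blockDim.x)
        if (first + j < numBoxes && local[j] != 0)
            atomicAdd(&counts[first + j], local[j]);
}

// Hypothetical host helper; blocksX plays the role of blocks.x above.
cudaError_t launch_count_per_box(const int *d_boxnum, int *d_counts,
                                 int numParticles, int numBoxes, int blocksX)
{
    const int threads = 256;                               // assumed block size
    dim3 grid(blocksX, (numBoxes + TILE - 1) / TILE);
    count_per_box<<<grid, threads>>>(d_boxnum, d_counts, numParticles, numBoxes);
    return cudaGetLastError();
}
```

As noted above, every tile along y reads the full particle list, so whether this beats plain global atomics depends on the problem size and would need to be measured.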