How does make_int2 work? - cuda

In the CUDA C Programming Guide there is a small paragraph about Built-in Vector Types. It says that this structure has 4 components and that they are accessible in a specific way, i.e. .x, .y, .z, .w. What are the 4 components? Can someone give an example?
Moreover, it says that the line int2 make_int2(int x, int y); constructs a vector with value (x, y). Does each of these variables have 4 components?
The reason I am trying to understand these things is that I am studying the following code:
/*1*/ int ny = num_ofElements_y_ofmyMatrix;
/*2*/ int nx = num_ofElements_x_ofmyMatrix;
/*3*/ int2 matrix_index_2d = make_int2( ( blockIdx.x * blockDim.x ) + threadIdx.x, ( blockIdx.y * blockDim.y ) + threadIdx.y );
/*4*/ int matrix_index_1d = ( nx * matrix_index_2d.y ) + matrix_index_2d.x;
/*5*/ if ( matrix_index_2d.x < nx && matrix_index_2d.y < ny )
/*6*/ {
/*7*/ float r = myMatrix[ matrix_index_1d ];
/*8*/ }
How does the indexing in lines 3 and 4 work? Subsequently, how does the access of the matrix myMatrix work?
UPDATE:
As far as the code snippet is concerned, usually when I access an array I use the following:
col = blockDim.x*blockIdx.x + threadIdx.x;
row = blockDim.y*blockIdx.y + threadIdx.y;
if (col < NUMCOLS && row < NUMROWS){...}
in order to have row-major access of an array in C++, e.g. myMatrix[row*NUMCOLS + col].
What is the connection with the type of indexing used in lines 3 and 4?

Not all the CUDA built-in vector types have 4 components. For example, double2 has 2 double components. double2 is indeed defined as
struct __device_builtin__ __builtin_align__(16) double2
{
double x, y;
};
and can be used to deal with complex, double precision numbers. According to the definition above, you can use a declaration like
double2 foo;
and then initialize the x and y components as
foo.x = 1.;
foo.y = 3.4;
As another example, float4 has 4 float components and can be used to deal with four-vectors in Minkowski space.
In your example, int2 has 2 integer components and the instruction
int2 foo_int = make_int2(3,1);
constructs a foo_int struct of type int2 and initializes the x and y components to 3 and 1 respectively.
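Regarding the update: the indexing in lines 3 and 4 is exactly the row-major col/row pattern you already use, just packed into one int2. A short illustrative sketch (not taken from your code) making the correspondence explicit, with .x playing the role of col and .y the role of row:
__global__ void read_matrix(const float* myMatrix, int nx, int ny)
{
    // line 3: pack (col, row) into one int2
    int2 matrix_index_2d = make_int2(blockIdx.x * blockDim.x + threadIdx.x,   // .x == col
                                     blockIdx.y * blockDim.y + threadIdx.y);  // .y == row

    // line 4: row-major flattening, identical to row * NUMCOLS + col
    int matrix_index_1d = nx * matrix_index_2d.y + matrix_index_2d.x;

    if (matrix_index_2d.x < nx && matrix_index_2d.y < ny)
    {
        float r = myMatrix[matrix_index_1d];
        (void)r;  // use r here
    }
}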

Related

Matrix Multiplication of matrix and its transpose in Cuda

I am relatively new to CUDA programming so there are some unsolved issues for which I hope I can get some hints in the right direction.
So the case is that I want to multiply a 2D array with its transpose, and to be precise I want to execute the operation A^T A.
I have already used the cuBLAS Dgemm function and now I am trying to do the same operation with a tiled algorithm, very similar to the one from the CUDA guide.
The case is that while the initial algorithm runs properly, I want to calculate only the upper triangular part of the product, hoping to achieve a better time for the operation, and I am not sure how to extract the tiles/blocks which will hold the respective elements.
So if you could enlighten me on this, or give any hint, I would be grateful, because I have been stuck on that for a while.
This is the code of the kernel
__shared__ double Ads1[TILE_WIDTH][TILE_WIDTH];
__shared__ double Ads2[TILE_WIDTH][TILE_WIDTH];

//block row and column
//we save in registers for faster access
int by = blockIdx.y;
int bx = blockIdx.x;
int ty = threadIdx.y;
int tx = threadIdx.x;

int row = by * TILE_WIDTH + ty;
int col = bx * TILE_WIDTH + tx;
double Rvalue = 0;

if (row >= width || col >= width) return;

//Each thread block computes one sub-matrix Rsub of result R
for (int i = 0; i < (int) ceil(((double) height / TILE_WIDTH)); ++i)
{
    Ads1[tx][ty] = Ad[(i * TILE_WIDTH + ty) * width + col];
    Ads2[tx][ty] = Ad[(i * TILE_WIDTH + tx) * width + row];
    __syncthreads();

    for (int j = 0; j < TILE_WIDTH; ++j)
    {
        if ((i * TILE_WIDTH + j) > height) break; //in order not to exceed the matrix's height
        Rvalue += Ads1[j][tx] * Ads2[ty][j];
    }
    __syncthreads();
}

Rd[row * width + col] = Rvalue;
You may want to use the batched dgemm API function described here, recursively dividing your output matrix into diagonal and corner blocks. You also want to balance the smallest block size against compute overhead, to avoid making very small invocations. Finally, note that matrix multiplication becomes memory bound at some stage, and on modern GPUs that threshold can be fairly large.
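If you prefer to restrict the work inside your own kernel rather than recursing with batched dgemm, here is a minimal, un-tiled sketch of the upper-triangle idea (illustrative names, not the approach above; Ad is height x width and Rd is width x width, both row-major). Each thread computes one element of R = A^T A only for col >= row and mirrors it, so thread blocks that fall entirely below the diagonal return immediately:

__global__ void ata_upper(const double* Ad, double* Rd, int width, int height)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // row of R
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // column of R

    if (row >= width || col >= width) return;
    if (col < row) return;                  // skip the strictly lower triangle

    // R[row][col] = sum_k A[k][row] * A[k][col]
    double sum = 0.0;
    for (int k = 0; k < height; ++k)
        sum += Ad[k * width + row] * Ad[k * width + col];

    Rd[row * width + col] = sum;
    Rd[col * width + row] = sum;            // fill the lower triangle by symmetry
}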

Why does this CUDA example kernel have a for loop?

I have been looking at the following example from the official CUDA website:
http://docs.nvidia.com/cuda/cuda-samples/index.html#simple-cufft
Download here: http://developer.download.nvidia.com/compute/DevZone/C/Projects/x64/simpleCUFFT.zip
It contains the following kernel:
// Complex pointwise multiplication
static __global__ void ComplexPointwiseMulAndScale(Complex *a, const Complex *b, int size, float scale)
{
const int numThreads = blockDim.x * gridDim.x;
const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
for (int i = threadID; i < size; i += numThreads)
{
a[i] = ComplexScale(ComplexMul(a[i], b[i]), scale);
}
}
My question is, why is there a for loop here? Doesn't CUDA simultaneously launch an array of threads? I removed the loop, replacing it with the following code, and it produced the same output.
// Complex pointwise multiplication
static __global__ void ComplexPointwiseMulAndScale(Complex *a, const Complex *b, int size, float scale)
{
const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
a[threadID] = ComplexScale(ComplexMul(a[threadID], b[threadID]), scale);
}
As this is an official example on the CUDA website, I imagine I must be missing something.
Your version is basically what happens when numThreads is equal to size (but only then).
What the official example does is the following: suppose numThreads is equal to 4 (for simplicity; usually it will be much larger), and consider the array positions (both for a and b):
a or b:                 x x x x x x x x
thread that works here: 0 1 2 3 0 1 2 3
Then the first thread will work on every array position whose index is divisible by 4, et cetera.
The problem with your version is that the caller of your function has to make sure there are at least as many threads as there are elements (size). For example, if you call your version with a 1-dimensional grid and both gridDim.x and blockDim.x equal to 2, but on vectors of length 8, then half of your vector isn't processed!
The official example works regardless - no matter how many threads the caller assigns to it, the entire vector will be processed.
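To make this concrete, here is a minimal self-contained sketch (hypothetical sizes, not part of the cuFFT sample): 2 blocks of 2 threads process 8 elements, so each thread handles indices i and i + 4, whereas the truncated version would leave a[4..7] untouched:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* a, int n, float s)
{
    const int stride = blockDim.x * gridDim.x;           // 4 threads in total
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        a[i] *= s;
}

int main()
{
    const int N = 8;
    float h[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    float* d;
    cudaMalloc(&d, N * sizeof(float));
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<2, 2>>>(d, N, 10.0f);    // only 4 threads for 8 elements

    cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i)
        printf("%g ", h[i]);         // all 8 values are scaled
    printf("\n");
    cudaFree(d);
    return 0;
}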

kernel using AoS is faster than using SoA

I have two versions of a kernel that performs the same task - filling a linked cell list. The difference between the two kernels is the datatype used to store the particle positions: the first one uses a float array (4 floats per particle, for 128-bit reads/writes), and the second uses an array of vec3f structures (a structure which holds 3 floats).
Doing some tests using nvprof, I found that the second kernel (which uses vec3f) ran faster than the first one:
Time(%)   Time     Calls   Avg      Min       Max      Name
 42.88    37.26s   2       18.63s   23.97us   37.26s   adentu_grid_cuda_filling_kernel(int*, int*, int*, float*, int, _vec3f, _vec3f, _vec3i)
 11.00     3.93s   2        1.97s   25.00us    3.93s   adentu_grid_cuda_filling_kernel(int*, int*, int*, _vec3f*, int, _vec3f, _vec3f, _vec3i)
The tests are done trying to fill a linked cell list using 256 and 512000 particles.
My question is, what happened here? I supposed that the float array should give better memory access due to coalescing, versus the vec3f structure array which has unaligned memory. Did I misunderstand something?
These are the kernels; the first one:
__global__ void adentu_grid_cuda_filling_kernel (int *head,
                                                 int *linked,
                                                 int *cellnAtoms,
                                                 float *pos,
                                                 int nAtoms,
                                                 vec3f origin,
                                                 vec3f h,
                                                 vec3i nCell)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= nAtoms)
        return;

    vec3i cell;
    vec3f _pos = (vec3f){(float)pos[idx*4+0], (float)pos[idx*4+1], (float)pos[idx*4+2]};

    cell.x = floor ((_pos.x - origin.x)/h.x);
    cell.y = floor ((_pos.y - origin.y)/h.y);
    cell.z = floor ((_pos.z - origin.z)/h.z);

    int c = nCell.x * nCell.y * cell.z + nCell.x * cell.y + cell.x;

    int i;
    if (atomicCAS (&head[c], -1, idx) != -1){
        i = head[c];
        while (atomicCAS (&linked[i], -1, idx) != -1)
            i = linked[i];
    }
    atomicAdd (&cellnAtoms[c], 1);
}
And this is the second kernel:
__global__ void adentu_grid_cuda_filling_kernel (int *head,
                                                 int *linked,
                                                 int *cellNAtoms,
                                                 vec3f *pos,
                                                 int nAtoms,
                                                 vec3f origin,
                                                 vec3f h,
                                                 vec3i nCell)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= nAtoms)
        return;

    vec3i cell;
    vec3f _pos = pos[idx];

    cell.x = floor ((_pos.x - origin.x)/h.x);
    cell.y = floor ((_pos.y - origin.y)/h.y);
    cell.z = floor ((_pos.z - origin.z)/h.z);

    int c = nCell.x * nCell.y * cell.z + nCell.x * cell.y + cell.x;

    int i;
    if (atomicCAS (&head[c], -1, idx) != -1){
        i = head[c];
        while (atomicCAS (&linked[i], -1, idx) != -1)
            i = linked[i];
    }
    atomicAdd (&cellNAtoms[c], 1);
}
This is the vec3f structure:
typedef struct _vec3f { float x, y, z; } vec3f;
This is not an example of AoS vs. SoA. Let's look at the important lines of code and the data structures implicit in them.
Your first, "SoA" or "slow" case:
vec3f _pos = (vec3f){(float)pos[idx*4+0], (float)pos[idx*4+1], (float)pos[idx*4+2]};
                                 ^                    ^                    ^
                                 |                    |                    |
                 These values are stored in *adjacent* memory locations
So an individual thread is accessing successively pos[idx*4] plus the 2 locations right after it. This is how a structure gets stored! What you're calling a structure of Arrays is in fact an array of structures, the way it is stored in memory. To have a valid "SoA" case, your code would need to look something like this:
vec3f _pos = (vec3f){(float)pos1[idx], (float)pos2[idx], (float)pos3[idx]};
                                ^
                                |
        Adjacent threads will read adjacent values for pos1, pos2, and pos3,
        leading to *coalesced* access.
Your "AoS" or "fast" doesn't really have a different storage format.
To my mind both of your approaches are actually AoS; the only difference is that the first approach is AoS with a structure of four elements while the second approach only uses three elements. This is why your second solution is preferable.
If you really want to have SoA in your first solution you would have to organize the pos array as follows:
vec3f _pos = (vec3f){(float)pos[idx], (float)pos[N + idx], (float)pos[2 * N + idx]};
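For completeness, a minimal sketch (hypothetical kernel and parameter names, reusing the vec3f/vec3i types from the question) of a genuinely SoA version of the position load: one array per coordinate, so that adjacent threads read adjacent floats and every load is coalesced:

__global__ void adentu_grid_cuda_filling_kernel_soa (int *head,
                                                     int *linked,
                                                     int *cellNAtoms,
                                                     const float *posx,
                                                     const float *posy,
                                                     const float *posz,
                                                     int nAtoms,
                                                     vec3f origin,
                                                     vec3f h,
                                                     vec3i nCell)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= nAtoms)
        return;

    // thread idx reads posx[idx] while thread idx+1 reads posx[idx+1]:
    // each of the three position loads is fully coalesced
    vec3i cell;
    cell.x = floor ((posx[idx] - origin.x)/h.x);
    cell.y = floor ((posy[idx] - origin.y)/h.y);
    cell.z = floor ((posz[idx] - origin.z)/h.z);

    int c = nCell.x * nCell.y * cell.z + nCell.x * cell.y + cell.x;

    int i;
    if (atomicCAS (&head[c], -1, idx) != -1){
        i = head[c];
        while (atomicCAS (&linked[i], -1, idx) != -1)
            i = linked[i];
    }
    atomicAdd (&cellNAtoms[c], 1);
}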

CUDA and Monte Carlo with local behavior defined

I have a question about a strange behavior in CUDA.
I am currently developing a Monte Carlo simulation of particle trajectories and I am doing the following thing.
The position p(n) of my particle at a given date t(n) depends on the position p(n-1) of my particle at the previous date t(n-1). Indeed, let's say the value v(n) is computed from the value p(n-1). Here is a simplified example of my code:
__device__ inline double calculateStep( double drift, double vol, double dt, double randomWalk, double S_t){
    return exp((drift - vol*vol*0.5)*dt + randomWalk*vol*sqrt(dt))*S_t;
}
__device__ double doSomethingWith(double v_n, ...) {
    ...
    return v_n*exp(t)*S;
}
__global__ void myMCsimulation( double* matrice, double* randomWalk, int nbreSimulation, int nPaths, double drift, ...) {
    double dt = T/nPaths;
    unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned int stride = blockDim.x*gridDim.x;
    unsigned int index = tid;
    double mydt = (index - nbreSimulation)/nbreSimulation*dt + dt;

    for ( index = tid; index < nbreSimulation*nPaths; index += stride) {
        if (index >= nbreSimulation)
        {
            double v_n = doSomethingWith(drift, dt, matrice[index - nbreSimulation]);
            matrice[index] = matrice[index - nbreSimulation] * calculateStep(drift, v_n, dt, randomWalk[index]);
        }
        ...
    }
The last code line:
matrice[index] = matrice[index - nbreSimulation] * calculateStep(drift, v_n, dt, randomWalk[index]);
enables me to fill in only the second row of the matrix matrice. I don't know why.
When I change the code line to:
matrice[index] = doSomethingWith(drift, dt, matrice[index - nbreSimulation]);
my matrix is filled in correctly and all my values are changed; I am then able to read back matrice[index - nbreSimulation].
I think this is a concurrent-access problem, but I am not sure; I tried __syncthreads() but it did not work.
Could someone please help on this point?
Many thanks
I have changed my code to the following and now it works perfectly.
if (index < nbreSimulation) {
    matrice[index] = S0;
    for (workingCol = 1; workingCol < nPaths; workingCol++) {
        previousMove = index;
        index = index + nbreSimulation;
        ................
        matrice[index] = calculateStep(drift, vol_int[index], dt, randomWalk[index], matrice[previousMove]);
    }
}
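For reference, here is a minimal self-contained sketch of this working pattern (parameter names such as S0, vol and T are illustrative, taken loosely from the snippets above rather than from the full code). Each thread owns one simulation and walks its path sequentially, so every read of the previous row is a value the same thread wrote in the previous step, and no cross-block synchronization is needed:

__global__ void mc_paths(double* matrice, const double* randomWalk,
                         int nbreSimulation, int nPaths,
                         double S0, double drift, double vol, double T)
{
    int sim = threadIdx.x + blockDim.x * blockIdx.x;
    if (sim >= nbreSimulation) return;

    double dt = T / nPaths;
    matrice[sim] = S0;                      // row 0: initial value of this path

    int prev = sim;
    for (int step = 1; step < nPaths; ++step) {
        int cur = prev + nbreSimulation;    // next row, same simulation (column)
        matrice[cur] = exp((drift - 0.5 * vol * vol) * dt
                           + randomWalk[cur] * vol * sqrt(dt)) * matrice[prev];
        prev = cur;
    }
}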
I have tried the following thing:
I have declared a shared variable (an array of doubles) which contains the value computed at each iteration:
__shared__ double mat[];
......
for ( index = tid; index < nbreSimulation*nPaths; index += stride) {
.....
mat[index] = computedValue;
......
}
Without success. Does anyone see the issue?

Cuda kernel seems to block, launch timed out and was terminated

I wrote a lib in CUDA that loads JPEG files, and a viewer that displays them.
Both parts make heavy use of CUDA, the sources are on SourceForge:
cuview & cujpeg
I store an image as RGB data in GPU memory and I have a function bitblt that copies a rectangular array of RGB data from one image into another.
The code worked fine on my last PC with a GTX580 and CUDA 3.x (I can't restore that setup any more).
Now I have a GTX680 and use CUDA 4.x.
The kernel looks like this; it worked fine on the GTX580 / CUDA 3.x:
__global__ void cujpeg_k_bitblt(CUJPEG* dd, CUJPEG* src, int sx, int sy, int tx, int ty, int w, int h)
{
    unsigned char* sb;
    unsigned char* s;
    unsigned char* db;
    unsigned char* d;
    int tid;
    int x, y;
    int xs, ys, xt, yt;
    int ws, wt;

    sb = src->dev_rgb;
    db = dd->dev_rgb;
    ws = src->stride;
    wt = dd->stride;

    for(tid = threadIdx.x + blockIdx.x * blockDim.x; tid < w * h; tid += blockDim.x * gridDim.x) {
        y = tid / w;
        x = tid - y * w;
        xs = x + sx;
        ys = y + sy;
        xt = x + tx;
        yt = y + ty;
        s = sb + (ys * ws + xs) * 3;
        d = db + (yt * wt + xt) * 3;
        d[0] = s[0];
        d[1] = s[1];
        d[2] = s[2];
    }
}
I wonder what this could be related to; maybe the higher values of several device properties on the GTX680 cause an overflow somewhere?
threads in warp       32
max threads per block 1024
max thread dim        1024 1024 64
max grid dim          2147483647 65535 65535
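(For reference, these values come from cudaDeviceProp; a minimal query sketch using only the standard runtime API:)

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("threads in warp       %d\n", prop.warpSize);
    printf("max threads per block %d\n", prop.maxThreadsPerBlock);
    printf("max thread dim        %d %d %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dim          %d %d %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}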
Any hints would be really appreciated.
I develop on Linux, use OpenSuSE 12.1.
Best regards,
Torsten.
Edit, 2012-08-22:
I use:
devdriver_4.0_linux_64_270.40.run
cudatools_4.0.13_linux_64.run
cudatoolkit_4.0.13_linux_64_suse11.2.run
Regarding the timing of that function bitblt:
On my last PC with CUDA 3.x and a GTX580 that function took a few milliseconds.
Now it times out after several seconds.
There are other kernels running; if I comment out the call to bitblt everything runs fine.
Also, using printf() I can see that all calls before bitblt are fine and after bitblt nothing is executed.
I can't really think that the kernel itself is the problem, but I don't know what can influence the behaviour I see.
Best regards,
Torsten.
OK, I found the problem. As the JPEG decoder is a library, I give the user some flexibility in decoding, so when calling CUDA kernels I don't have fixed parameters for grids / threads but use pre-initialised values that I set at initialisation and that the user can overwrite. I get these default values from the CUDA properties of the GPU being used, but I was not using the correct values: the property reports a grid dimension of 2147483647, but 65535 is the maximum value allowed for the launch.
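For completeness, a small sketch (hypothetical helper, not part of cujpeg) of deriving the block count from the amount of work and clamping it to the 65535 limit that the launch actually accepts; the grid-stride loop inside cujpeg_k_bitblt then covers any pixels beyond the clamped grid:

#include <algorithm>

// Hypothetical helper: compute the number of blocks for 'workItems' elements
// and clamp it so a 1D kernel launch never exceeds the 65535 grid limit.
int pick_block_count(int workItems, int threadsPerBlock)
{
    int blocks = (workItems + threadsPerBlock - 1) / threadsPerBlock;
    return std::min(blocks, 65535);
}

// usage (illustrative):
//   int threads = 256;
//   cujpeg_k_bitblt<<<pick_block_count(w * h, threads), threads>>>(dd, src, sx, sy, tx, ty, w, h);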