CUDA: Thread and Array Allocation - cuda

I have read many times about CUDA Thread/Blocks and Array, but still don't understand point: how and when CUDA starts to run multithread for kernel function. when host calling kernel function, or inside kernel function.
For example I have this example, It just simple transpose an array. (so, it just copy value from this array to another array).
__global__
void transpose(float* in, float* out, uint width) {
uint tx = blockIdx.x * blockDim.x + threadIdx.x;
uint ty = blockIdx.y * blockDim.y + threadIdx.y;
out[tx * width + ty] = in[ty * width + tx];
}
int main(int args, char** vargs) {
/*const int HEIGHT = 1024;
const int WIDTH = 1024;
const int SIZE = WIDTH * HEIGHT * sizeof(float);
dim3 bDim(16, 16);
dim3 gDim(WIDTH / bDim.x, HEIGHT / bDim.y);
float* M = (float*)malloc(SIZE);
for (int i = 0; i < HEIGHT * WIDTH; i++) { M[i] = i; }
float* Md = NULL;
cudaMalloc((void**)&Md, SIZE);
cudaMemcpy(Md,M, SIZE, cudaMemcpyHostToDevice);
float* Bd = NULL;
cudaMalloc((void**)&Bd, SIZE); */
transpose<<<gDim, bDim>>>(Md, Bd, WIDTH); // CALLING FUNCTION TRANSPOSE
cudaMemcpy(M,Bd, SIZE, cudaMemcpyDeviceToHost);
return 0;
}
(I have commented all lines that not important, just have the line calling function transpose)
I have understand all lines in function main except the line calling function tranpose. Does it true when I say: when we call function transpose<<<gDim, bDim>>>(Md, Bd, WIDTH), CUDA will automatically assign each elements of array into one thread (and block), and when we calling "one time" tranpose, CUDA will running gDim * bDim times tranpose on gDim * bDim threads.
This point makes me feel frustrated so much, because it doesn't like multithread in java, when I use :( Please tell me.
Thanks :)

Your understanding is in essence correct.
transpose is not a function, but a CUDA kernel. When you call a regular function, it only runs once. But when you launch a kernel a single time, CUDA will automatically run the code in the kernel many times. CUDA does this by starting many threads. Each thread runs the code in your kernel one time. The numbers inside the tripple brackets (<<< >>>) is called the kernel execution configuration. It determines how many threads will be launched by CUDA and specifies some relationships between the threads.
The number of threads that will be started is calculated by multiplying up all the values in the grid and block dimensions inside the triple brackets. For instance, the number of threads will be 1,048,576 (16 * 16 * 64 * 64) in your example.
Each thread can read some variables to find out which thread it is. Those are the blockIdx and threadIdx structures at the top of the kernel. The values reflect the ones in the kernel execution configuration. So, if you run your kernel with a grid configuration of 16 x 16 (the first dim3 in the triple brackets, you will get threads that, when they each read the x and y values in the blockIdx structure, will get all possible combinations of x and y between 0 and 15.
So, as you see, CUDA does not know anything about array elements or any other data structures that are specific to your kernel. It just deals with threads, thread indexes and block indexes. You then use those indexes to to determine what a given thread should do (in particular, which values in your application specific data it should work on).

Related

What dimension should the cuRAND initialization kernel have

I am working on a program in which there are two main kernels.
Due to the impact on performances, each kernel has its own dimensions. Thus I have 2 different block and grid sizes (whose values cannot be known at compile time).
Both kernels need to use the cuRAND library, so before a third kernel is launched to initialize the cuRAND state on the device.
My question comes when I need to choose the dimensions of this kernel.
Let's say I have for kernel 1 and 2:
block_size_1 = 256
grid_size_1 = 10
block_size_2 = 512
grid_size_2 = 2
For the cuRAND initialization kernel, should I use the largest sizes (10*512), or the highest number of threads (10*256)?
Pick the biggest kernel size, because that is the maximum number of cuRand generators that you'll use. You can easyly evaluate the size you need using something like
__host__ void fun(){
curandState * randState;
int myCurandSize = ((block_size1 * grid_size1) > (block_size2 * grid_size2))? Block_size1 * Grid_size1 : Block_size2 * Grid_size2);
error = cudaMalloc((void **)&randState, myCurandSize * sizeof(curandState));
if (error == cudaErrorMemoryAllocation){
cudaDeviceReset();
return 1;
}
setup_cuRand <<<1, myCurandSize>>> (randState, unsigned(time(NULL)));
//Don't forget to free the space
cudaFree(randState);
}
__global__ void setup_cuRand(curandState * state, unsigned long seed)
{
int id = threadIdx.x;
curand_init(seed, id, 0, &state[id]);
}
Edit: I was asumming that block_size * grid_size will not go over the maximum thread limit, otherwise, you can do the same but keeping aswell the grid and block dimension and launching that number of threads setup_curand<<<x, y>>>(...);

What is the difference between __ldg() intrinsic and a normal execution?

I am trying to explore '__ldg intrinsic'. I have gone through NVIDIA's documentation for this but didn't get any satisfactory answer over its use and implementations. Moreover with reference to THIS I tried implementing __ldg in a simple 1024*1024 matrix multiplication example.
#include<stdio.h>
#include<stdlib.h>
__global__ void matrix_mul(float * ad,float * bd,float * cd,int N)
{
float pvalue=0;
//find Row and Column corresponding to a data element for each thread
int Row = blockIdx.y * blockDim.y + threadIdx.y;
int Col = blockIdx.x * blockDim.x + threadIdx.x;
//calculate dot product of Row of First Matrix and Column of Second Matrix
for(int i=0;i< N;++i)
{
// I tried with executing this first:
float m=__ldg(&ad[Row * N+i]);
float n=__ldg(&bd[i * N + Col]);
//Then I executed this as a normal execution:
// float m = ad[Row * N+i];
// float n = bd[i * N + Col];
pvalue += m * n;
}
//store dot product at corresponding position in resultant Matrix
cd[Row * N + Col] = pvalue;
}
int main()
{
int N = 1024,i,j; //N == size of square matrix
float *a,*b;
float *ad,*bd,*cd,*c;
//open a file for outputting the result
FILE *f;
f=fopen("Parallel Multiply_ldg.txt","w");
size_t size=sizeof(float)* N * N;
//allocate host side memory
a=(float*)malloc(size);
b=(float*)malloc(size);
c=(float*)malloc(size);
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
{
a[i*N+j]=2.0; //(float)(i*N+j); //initializing each value with its own index
b[i*N+j]=1.0; //(float)(i*N+j); //random functions can be used alternatively
}
}
//allocate device memory
cudaMalloc(&ad,size);
//printf("\nAfter cudaMalloc for ad\n%s\n",cudaGetErrorString(cudaGetLastError()));
cudaMalloc(&bd,size);
//printf("\nAfter cudaMalloc bd\n%s\n",cudaGetErrorString(cudaGetLastError()));
cudaMalloc(&cd,size);
//printf("\nAfter cudaMalloc cd\n%s\n",cudaGetErrorString(cudaGetLastError()));
//copy value from host to device
cudaMemcpy(ad,a,size,cudaMemcpyHostToDevice);
cudaMemcpy(bd,b,size,cudaMemcpyHostToDevice);
printf("\nAfter HostToDevice Memcpy\n%s\n",cudaGetErrorString(cudaGetLastError()));
//calculate execution configuration
dim3 blocksize(16,16); //each block contains 16 * 16 (=256) threads
dim3 gridsize(N/16,N/16); //creating just sufficient no of blocks
//GPU timer code
float time;
cudaEvent_t start,stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start,0);
matrix_mul <<< gridsize, blocksize >>> (ad,bd,cd, N);
cudaDeviceSynchronize();
cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time,start,stop); //time taken in kernel call calculated
cudaEventDestroy(start);
cudaEventDestroy(stop);
//copy back results
cudaMemcpy(c,cd,sizeof(float)* N*N,cudaMemcpyDeviceToHost);
printf("\nAfter DeviceToHost Memcpy\n%s\n",cudaGetErrorString(cudaGetLastError()));
//output results in output_file
fprintf(f,"Array A was---\n");
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
fprintf(f,"%f ",a[i*N+j]);
fprintf(f,"\n");
}
fprintf(f,"\nArray B was---\n");
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
fprintf(f,"%f ",b[i*N+j]);
fprintf(f,"\n");
}
fprintf(f,"\nMultiplication of A and B gives C----\n");
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
fprintf(f,"%f ",c[i*N+j]); //if correctly computed, then all values must be N
fprintf(f,"\n");
}
printf("\nYou can see output in Parallel Mutiply.txt file in project directory");
printf("\n\nTime taken is %f (ms)\n",time);
fprintf(f,"\n\nTime taken is %f (ms)\n",time);
fclose(f);
cudaThreadExit();
//cudaFree(ad); cudaFree(bd); cudaFree (cd);
free(a);free(b);free(c);
//_getch();
return 1;
}
I commented that __ldg part in my kernel and executed by normal execution, and vice versa.
In both cases it gives me correct multiplication result. I am confused with the time difference I am getting between these executions, because its huge almost more than 100X!
In case of __ldg it gives me: Time taken is 0.014432 (ms)
And in case of normal execution without __ldg it gives me : Time taken is 36.858398 (ms)
Is this the exact way of using __ldg intrisic? What is the significance of __ldg intrinsic and what is the proper way of using it? Apparently what I did above in my code is wrong and naive. I am looking for explanation and example. Thanks in advance.
From the CUDA C Programming Guide
Global memory accesses for devices of compute capability 3.x are cached in L2 and for devices of compute capability 3.5, may also be cached in the read-only data cache described in the previous section; they are not cached in L1.
...
Data that is read-only for the entire lifetime of the kernel can also be cached in the read-only data cache described in the previous section by reading it using the __ldg() function (see Read-Only Data Cache Load Function). When the compiler detects that the read-only condition is satisfied for some data, it will use __ldg() to read it. The compiler might not always be able to detect that the read-only condition is satisfied for some data. Marking pointers used for loading such data with both the const and __restrict__ qualifiers increases the likelihood that the compiler will detect the read-only condition.
The read only cache accesses have a much lower latency than the global memory accesses. Because matrix multiplication accesses the same values from memory many times, caching in the read only cache gives a huge speedup (in memory bound applications).
In NVIDIA GPU there is a texture - images with special and not hard logic to work with images.
This texture memory is another type of memory available in GPU. In particularly constant, global and register file memory has not any relation to this texture memory.
Kepler GPUs and later add the ability to use this memory from "GPU texture pipeline".
But let's specify the difference between constant cache and read-only cache.
Constant Cache
Data loaded through the constant cache must be relatively small and must be accessed in such way that all threads of a warp should access the same location at any given time.
Read-only Cache or Texture Memory Cache
Cache can be much larger and can be accessed in a non-uniform pattern.
Read Only cache has granularity 32 bytes.
You can use this as "read-only cache" for your CUDA kernel.
1. Data stored in global memory can be cached in that place GPU Texture Memory
2. With doing that you give promise to the compiler that data is read-only for the
duration of a kernel execution in GPU.
There are two ways to achieve this.
A. Using an intrinsic function __ldg
Example: output[i] += __ldg(&input[j]);
B. Qualifying pointers to global memory
const float* __restrict__ input
output[idx] += input[idx];
Comparision:
The intrinsic __ldg is a better choice for deep compiler reasons.

Tips for optimizing X_transpose*X CUDA kernel

I am writing my first CUDA application and am writing all the kernels my self for practice.
In one portion I am simply calculating X_transpose * X.
I have been using cudaMallocPitch and cudaMemcpy2D, I first allocate enough space on the device for X and X_transpose*X. I copy X to the device, my kernel takes two inputs, the X matrix, then the space to write the X_transpose * X result.
Using the profiler the kernel originally took 104 seconds to execute on a matrix of size 5000x6000. I pad the matrix with zeros on the host so that it is a multiple of the block size to avoid checking the bounds of the matrix in the kernel. I use a block size of 32 by 32.
I made some changes to try to maximize coalesced reads/writes to global memory, this seemed to help significantly. Using the visual profiler to profile the release build of my code, the kernel now takes 4.27 seconds to execute.
I haven't done an accurate timing of my matlab execution(just the operation X'*X;), but it appears to be about 3 seconds. I was hoping I could get much better speedups than matlab using CUDA.
The nvidia visual profiler is unable to find any issues with my kernel, I was hoping the community here might have some suggestions as to how I can make it go faster.
The kernel code:
__global__ void XTXKernel(Matrix X, Matrix XTX) {
//find location in output matrix
int blockRow = blockIdx.y;
int blockCol = blockIdx.x;
int row = threadIdx.y;
int col = threadIdx.x;
Matrix XTXsub = GetSubMatrix(XTX, blockRow, blockCol);
float Cvalue = 0;
for(int m = 0; m < (X.paddedHeight / BLOCK_SIZE); ++m) {
//Get sub-matrix
Matrix Xsub = GetSubMatrix(X, m, blockCol);
Matrix XTsub = GetSubMatrix(X, m, blockRow);
__shared__ float Xs[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float XTs[BLOCK_SIZE][BLOCK_SIZE];
//Xs[row][col] = GetElement(Xsub, row, col);
//XTs[row][col] = GetElement(XTsub, col, row);
Xs[row][col] = *(float*)((char*)Xsub.data + row*Xsub.pitch) + col;
XTs[col][row] = *(float*)((char*)XTsub.data + row*XTsub.pitch) + col;
__syncthreads();
for(int e = 0; e < BLOCK_SIZE; ++e)
Cvalue += Xs[e][row] * XTs[col][e];
__syncthreads();
}
//write the result to the XTX matrix
//SetElement(XTXsub, row, col, Cvalue);
((float *)((char*)XTXsub.data + row*XTX.pitch) + col)[0] = Cvalue;
}
The definition of my Matrix structure:
struct Matrix {
matrixLocation location;
unsigned int width; //width of matrix(# cols)
unsigned int height; //height of matrix(# rows)
unsigned int paddedWidth; //zero padded width
unsigned int paddedHeight; //zero padded height
float* data; //pointer to linear array of data elements
size_t pitch; //pitch in bytes, the paddedHeight*sizeof(float) for host, device determines own pitch
size_t size; //total number of elements in the matrix
size_t paddedSize; //total number of elements counting zero padding
};
Thanks in advance for your suggestions.
EDIT: I forgot to mention, I am running the on a Kepler card, GTX 670 4GB.
Smaller block size like 16x16 or 8x8 may be faster. This slides also demos larger non-square size of block/shared mem may be faster for particular matrix size.
For shared mem allocation, add a dumy element on the leading dimension by using [BLOCK_SIZE][BLOCK_SIZE+1] to avoid the bank conflict.
Try to unroll the inner for loop by using #pragma unroll
On the other hand, You probably won't be much faster than matlab GPU code for large enough A'*A. Since the performance bottleneck of matlab is the invoking overhead rather than the kernel performance.
The cuBLAS routine culas_gemm() may have highest performance for matrix multiplication. You could compare yours with it.
MAGMA routine magma_gemm() has higher performance than cuBLAS in some cases. It's a open source project. You may also get some ideas from their code.

CUDA efficient division?

I would like to know if there is, by any chance an efficient way of dividing elements of an array. I am running with matrix values 10000x10000 and it a considerable amount of time in comparison with other kernels. Division are expensive operations, and I can't see how to improve it.
__global__ void division(int N, float* A, int* B){
int row = blockIdx.x * blockDim.x + threadIdx.x;
int col = blockIdx.y * blockDim.y + threadIdx.y;
if((row < N) && (col <= row) ){
if( B[row*N+col] >0 )
A[row*N+col] /= (float)B[row*N+col];
}
}
kernel launched with
int N = 10000;
int threads = 32
int blocks = (N+threads-1)/threads
dim3 t(threads,threads);
dim3 b(blocks, blocks);
division<<< b, t >>>(N, A, B);
cudaThreadSynchronize();
Option B:
__global__ void division(int N, float* A, int* B){
int k = blockIdx.x * blockDim.x + threadIdx.x;
int kmax = N*(N+1)/2
int i,j;
if(k< kmax){
row = (int)(sqrt(0.25+2.0*k)-0.5);
col = k - (row*(row+1))>>1;
if( B[row*N+col] >0 )
A[row*N+col] /= (float)B[row*N+col];
}
}
launched with
int threads =192;
int totalThreadsNeeded = (N*(N+1)/2;
int blocks = ( threads + (totalThreadsNeeded)-1 )/threads;
division<<<blocks, threads >>>(N, A, B);
Why is option B giving a wrong result even if the threadIds are the correct one? what is missing here?
Your basic problem is that you are launching an improbably huge grid (over 100 million threads for your 10000x10000 array example), and then because of the triangular nature of the access pattern in the kernel, fully half of those threads never do anything productive. So a enormous amount of GPU cycles are being wasted for no particularly good reason. Further, the access pattern you are using isn't allowing coalesced memory access, which is going to further reduce the performance of the threads which are actually doing useful work.
If I understand your problem correctly, the kernel is only performing element-wise division on a lower-triangle of a square array. If this is the case, it could be equally done using something like this:
__global__
void division(int N, float* A, int* B)
{
for(int row=blockIdx.x; row<N; row+=gridDim.x) {
for(int col=threadIdx.x; col<=row; col+=blockDim.x) {
int val = max(1,B[row*N+col]);
A[row*N+col] /= (float)val;
}
}
}
[disclaimer: written in browser, never compiled, never tested, use at own risk]
Here, a one dimension grid is used, with each block computing a row at a time. Threads in a block move along the row, so memory access is coalesced. In comments you mention your GPU is a Tesla C2050. That device only requires 112 blocks of 192 threads each to completely "fill" each of the 14 SM with a full complement of 8 blocks each and the maximum number of concurrent threads per SM. So the launch parameters could be something like:
int N = 10000;
int threads = 192;
int blocks = min(8*14, N);
division<<<blocks, threads>>>(N, A, B);
I would expect this to run considerably faster than your current approach. If numerical accuracy isn't that important, you can probably achieve further speed-up by replacing the division with an approximate reciprocal intrinsic and a floating point multiply.
Because threads are executed in groups of 32, called warps, you are paying for the division for all 32 threads in a warp if both if conditions are true for just one of the threads. If the condition is false for many threads, see if you can filter out the values for which the division is not needed in a separate kernel.
The int to float conversion may itself be slow. If so, you might be able to generate floats directly in your earlier step, and pass B in as an array of floats.
You may be able to generate inverted numbers in the earlier step, where you generate the B array. If so, you can use multiplication instead of division in this kernel. (a / b == a * 1 / b).
Depending on your algorithm, maybe you can get away with a lower precision division. There's an intrinsic, __fdividef(x, y), that you can try. There is also a compiler flag, -prec-div=false.
The very first thing to look at should be coalesced memory access. There is no reason for the non-coalesced pattern here, just exchange rows and columns for to avoid wasting a lot of memory bandwidth:
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
...
A[row*N+col] ...
Even if this is run on compute capability 2.0 or higher, the caches are not large enough to remedy this suboptimal pattern.

CUDA Dot Product

I'm trying to implement the classic dot-product kernel for double precision arrays with atomic computation of the final sum across the various blocks. I used the atomicAdd for double precision as stated in page 116 of the programming guide.Probably i'm doing something wrong.The partial sums across the threads in every block are computed correctly but afterwords the atomic operation doesn't seem to be working properly since every time i run my kernel with the same data,i receive different results. I'll be grateful if somebody could spot the mistake or provide an alternative solution!
Here is my kernel:
__global__ void cuda_dot_kernel(int *n,double *a, double *b, double *dot_res)
{
__shared__ double cache[threadsPerBlock]; //thread shared memory
int global_tid=threadIdx.x + blockIdx.x * blockDim.x;
int i=0,cacheIndex=0;
double temp = 0;
cacheIndex = threadIdx.x;
while (global_tid < (*n)) {
temp += a[global_tid] * b[global_tid];
global_tid += blockDim.x * gridDim.x;
}
cache[cacheIndex] = temp;
__syncthreads();
for (i=blockDim.x/2; i>0; i>>=1) {
if (threadIdx.x < i) {
cache[threadIdx.x] += cache[threadIdx.x + i];
}
__syncthreads();
}
__syncthreads();
if (cacheIndex==0) {
*dot_res=cuda_atomicAdd(dot_res,cache[0]);
}
}
And here is my device function atomicAdd:
__device__ double cuda_atomicAdd(double *address, double val)
{
double assumed,old=*address;
do {
assumed=old;
old= __longlong_as_double(atomicCAS((unsigned long long int*)address,
__double_as_longlong(assumed),
__double_as_longlong(val+assumed)));
}while (assumed!=old);
return old;
}
Getting a reduction right using ad hoc CUDA code can be tricky, so here's an alternative solution using a Thrust algorithm, which is included with the CUDA Toolkit:
#include <thrust/inner_product.h>
#include <thrust/device_ptr.h>
double do_dot_product(int n, double *a, double *b)
{
// wrap raw pointers to device memory with device_ptr
thrust::device_ptr<double> d_a(a), d_b(b);
// inner_product implements a mathematical dot product
return thrust::inner_product(d_a, d_a + n, d_b, 0.0);
}
You are using the cuda_atomicAdd function incorrectly. This section of your kernel:
if (cacheIndex==0) {
*dot_res=cuda_atomicAdd(dot_res,cache[0]);
}
is the culprit. Here, you atomically add to dot_res. then non atomically set dot_res with the result it returns. The return result from this function is the previous value of the location being atomically updated, and it supplied for "information" or local use of the caller only. You don't assign it to what you are atomically updated, that completely defeats the purpose of using atomic memory access in the first place. Do something like this instead:
if (cacheIndex==0) {
double result=cuda_atomicAdd(dot_res,cache[0]);
}
Did not checked your code that depth but here are some advices.
I would only advice using Thrust if you only use your GPU for such generic tasks, since if a complex problem will arise people have no idea to efficiently program parallel on the gpu.
Start a new parallel reduction kernel to summarize the dot product.
Since the data is already on the device you won't see a decrease in performance starting a new kernel.
Your kernel seems not to scale across the maximum number of possible blocks on the newest GPU. If it would and your kernel would be able to calculate the dot product of millions of values the performance would decrease dramatically because of the serialized atomic operation.
Beginner mistake: Is your input data and shared memory access range checked? Or are you sure the input data is always multiple of your block size? Else you will read garbage. Most of my wrong results were due to this fault.
optimise your parallel reduction. My Thesis or Optimisations Mark Harris
Untested, i just wrote it down in notepad:
/*
* #param inCount_s unsigned long long int Length of both input arrays
* #param inValues1_g double* First value array
* #param inValues2_g double* Second value array
* #param outDots_g double* Output dots of each block, length equals the number of blocks
*/
__global__ void dotProduct(const unsigned long long int inCount_s,
const double* inValuesA_g,
const double* inValuesB_g,
double* outDots_g)
{
//get unique block index in a possible 3D Grid
const unsigned long long int blockId = blockIdx.x //1D
+ blockIdx.y * gridDim.x //2D
+ gridDim.x * gridDim.y * blockIdx.z; //3D
//block dimension uses only x-coordinate
const unsigned long long int tId = blockId * blockDim.x + threadIdx.x;
/*
* shared value pair products array, where BLOCK_SIZE power of 2
*
* To improve performance increase its size by multiple of BLOCK_SIZE, so that each threads loads more then 1 element!
* (outDots_g length decreases by same factor, and you need to range check and initialize memory)
* -> see harris gpu optimisations / parallel reduction slides for more informations.
*/
__shared__ double dots_s[BLOCK_SIZE];
/*
* initialize shared memory array and calculate dot product of two values,
* shared memory always needs to be initialized, its never 0 by default, else garbage is read later!
*/
if(tId < inCount_s)
dots_s[threadIdx.x] = inValuesA_g[tId] * inValuesB_g[tId];
else
dots_s[threadIdx.x] = 0;
__syncthreads();
//do parallel reduction on shared memory array to sum up values
reductionAdd(dots_s, dots_s[0]) //see my thesis link
//output value
if(threadIdx.x == 0)
outDots_g[0] = dots_s[0];
//start new parallel reduction kernel to sum up outDots_g!
}
Edit: removed unnecessary points.