What dimension should the cuRAND initialization kernel have - cuda

I am working on a program in which there are two main kernels.
Due to the impact on performance, each kernel has its own dimensions. Thus I have 2 different block and grid sizes (whose values cannot be known at compile time).
Both kernels need to use the cuRAND library, so beforehand a third kernel is launched to initialize the cuRAND state on the device.
My question comes when I need to choose the dimensions of this kernel.
Let's say I have for kernel 1 and 2:
block_size_1 = 256
grid_size_1 = 10
block_size_2 = 512
grid_size_2 = 2
For the cuRAND initialization kernel, should I use the largest grid and block sizes combined (10 * 512), or the larger of the two kernels' actual thread counts (10 * 256)?

Pick the biggest kernel size, because that is the maximum number of cuRAND generators that you'll use. You can easily evaluate the size you need with something like:
__global__ void setup_cuRand(curandState * state, unsigned long seed)
{
    int id = threadIdx.x;
    curand_init(seed, id, 0, &state[id]);
}

__host__ void fun()
{
    curandState * randState;
    // one generator per thread of the bigger launch
    int myCurandSize = (block_size_1 * grid_size_1 > block_size_2 * grid_size_2) ?
                        block_size_1 * grid_size_1 : block_size_2 * grid_size_2;
    cudaError_t error = cudaMalloc((void **)&randState, myCurandSize * sizeof(curandState));
    if (error == cudaErrorMemoryAllocation){
        cudaDeviceReset();
        return;
    }
    setup_cuRand <<<1, myCurandSize>>> (randState, unsigned(time(NULL)));
    // ... launch the kernels that use the states here ...
    // Don't forget to free the space
    cudaFree(randState);
}
Edit: I was assuming that block_size * grid_size will not go over the maximum thread limit per block; otherwise, you can do the same but keep the grid and block dimensions as well and launch that number of threads: setup_cuRand<<<x, y>>>(...); (see the sketch below).
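If the total would exceed the per-block thread limit (1024 on current GPUs), the setup kernel can be given its own grid/block configuration. A minimal sketch under that assumption (init_rand_states and SETUP_BLOCK are hypothetical names, not from the original answer; n1 and n2 stand for the two kernels' total thread counts):
#include <curand_kernel.h>
#include <ctime>

__global__ void setup_cuRand(curandState * state, unsigned long seed, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n)   // guard the extra threads of the rounded-up last block
        curand_init(seed, id, 0, &state[id]);
}

// n1 = block_size_1 * grid_size_1, n2 = block_size_2 * grid_size_2
void init_rand_states(curandState ** randState, int n1, int n2)
{
    int n = (n1 > n2) ? n1 : n2;   // one generator per thread of the bigger launch
    cudaMalloc((void **)randState, n * sizeof(curandState));
    const int SETUP_BLOCK = 256;                           // arbitrary block size
    int setupGrid = (n + SETUP_BLOCK - 1) / SETUP_BLOCK;   // round up
    setup_cuRand<<<setupGrid, SETUP_BLOCK>>>(*randState, (unsigned long)time(NULL), n);
}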

Related

What is the exact behaviour of the CUDA execution?

Let's suppose we want to call a global function with the code that follows. Every single thread will have a curandState generator and an array of ints (both properly initialized) that we'll use in order to execute the following code sample:
#include <cstdlib>
#include <ctime>
#include <curand_kernel.h>

#define NUMTHREADS 200

__global__ void setup_cuRand(curandState * state, unsigned long seed);
__global__ void method(curandState * state, int * result);

int main(){
    int * result;
    curandState * randState;
    if (cudaMalloc(&randState, NUMTHREADS * sizeof(curandState)) == cudaErrorMemoryAllocation ||
        cudaMalloc(&result, NUMTHREADS * sizeof(int)) == cudaErrorMemoryAllocation){
        cudaDeviceReset();
        exit(-1);
    }
    setup_cuRand <<<1, NUMTHREADS>>> (randState, unsigned(time(NULL)));
    method <<<1, NUMTHREADS>>> (randState, result);
    return 0;
}
__global__ void setup_cuRand(curandState * state, unsigned long seed)
{
    int id = threadIdx.x;
    curand_init(seed, id, 0, &state[id]);
}
__global__ void method(curandState * state, int * result){
    curandState localState = state[threadIdx.x];
    int num = curand(&localState) % 100;
    if(num > 50)
        result[threadIdx.x] = threadIdx.x;
    else
        result[threadIdx.x] = -1;
}
What would our execution be? I mean, do the threads magically split into both branches and re-join later, or how does it work? Are all 1024 threads executing at once? I ask this last question because when I'm debugging on Visual Studio 2013 with the CUDA Debugger, as I step forward threadIdx.x always has a value like n*32, and until now I thought that 1024 threads could be executed at the same time, and now I'm doubtful.
The test is likely to be transformed into a predicate, which means conditional assignment of some value in your region of memory. Should your if be more complex, the threads of a warp would magically re-join after the second part of the if clause. Depending on the predicate for each thread of a warp, a branch might not even get visited.
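For illustration (a hand-written sketch, not actual compiler output; method_predicated is a hypothetical name), the if/else above is logically equivalent to one conditional assignment per thread, which can be executed with predication rather than a real branch:
#include <curand_kernel.h>

__global__ void method_predicated(curandState * state, int * result){
    curandState localState = state[threadIdx.x];
    int num = curand(&localState) % 100;
    // every thread writes exactly one value, selected by the predicate
    result[threadIdx.x] = (num > 50) ? (int)threadIdx.x : -1;
}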
When entering a breakpoint, the data will be shown for a specific thread/block id. Which thread/block is followed is given by the CUDA Debug Focus setting in NSIGHT for Visual Studio (While debugging with CUDA, enter the NSIGHT menu entry, and select Windows, then CUDA Debug Focus...) By default, thread 0,0,0 will be focused.
Threads are logically executed at the same time, but in practice you have fewer than 1024 CUDA cores per SM. The threads are organized into warps of 32, and warps are scheduled on different execution units by the instruction scheduler.
For 1024 threads, that is 32 warps; the first and last warp are not necessarily executed at precisely the same time.
See Memory Fence Functions in the CUDA documentation for more details, as well as Synchronization Functions.

How do i calculate the number of CUDA threads being launched?

I have a CUDA card with compute capability 3.5. If I have a call such as <<<2000,512>>>, what is the number of iterations that occur within the kernel? I thought it was (2000*512), but testing isn't proving this. I also want to confirm that the way I'm calculating the variable is correct.
The situation is, within the kernel I am incrementing a passed global memory number based on the thread number :
int thr = blockDim.x * blockIdx.x + threadIdx.x;
worknumber = globalnumber + thr;
So, when I return back to the CPU, I want to know exactly how many increments there were, so I can keep track and not repeat or skip numbers when I call the kernel again to process my next set of numbers.
Edit :
__global__ void allin(uint64_t *lkey, const unsigned char *d_patfile)
{
uint64_t kkey;
int tmp;
int thr = blockDim.x * blockIdx.x + threadIdx.x;
kkey = *lkey + thr;
if (thr > tmp) {
tmp = thr;
printf("%u \n", thr);
}
}
If you launch a kernel with the configuration <<<X,Y>>>, and you have not violated any rules of CUDA usage, then the number of threads launched will, in fact, be X*Y (or a suitable modification of that if we are talking about 2 or 3 dimensional threadblocks and/or grids, i.e. X.x*X.y*X.z*Y.x*Y.y*Y.z ).
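For example (an illustrative snippet of my own, not part of the original answer), the total for any dim3 configuration can be computed on the host:
#include <cstdio>

int main(){
    dim3 grid(2000);   // X in <<<X,Y>>>
    dim3 block(512);   // Y in <<<X,Y>>>
    // unspecified dimensions default to 1, so the same product covers 1D, 2D and 3D launches
    unsigned long long total = (unsigned long long)grid.x * grid.y * grid.z
                             * block.x * block.y * block.z;
    printf("threads launched: %llu\n", total);   // 1024000 for <<<2000,512>>>
    return 0;
}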
printf from a CUDA kernel has various limitations. Therefore, generating a large amount of printf output from a CUDA kernel is generally unwise and probably not useful for validating the number of threads launched in a large grid.
If you want to keep track of the number of threads that actually get launched, you could use a global variable and have each thread atomically update it. Something like this:
$ cat t848.cu
#include <stdio.h>
__device__ unsigned long long totThr = 0;
__global__ void mykernel(){
atomicAdd(&totThr, 1);
}
int main(){
mykernel<<<2000,512>>>();
unsigned long long total;
cudaMemcpyFromSymbol(&total, totThr, sizeof(unsigned long long));
printf("Total threads counted: %lu\n", total);
}
$ nvcc -o t848 t848.cu
$ cuda-memcheck ./t848
========= CUDA-MEMCHECK
Total threads counted: 1024000
========= ERROR SUMMARY: 0 errors
$
Note that atomic operations may be relatively slow. I wouldn't recommend making regular use of such code, for performance reasons. But if you want to convince yourself of the number of threads launched, it should give the correct answer.

What is the difference between __ldg() intrinsic and a normal execution?

I am trying to explore the __ldg intrinsic. I have gone through NVIDIA's documentation for it but didn't get any satisfactory answer about its use and implementation. Moreover, with reference to THIS, I tried implementing __ldg in a simple 1024*1024 matrix multiplication example.
#include<stdio.h>
#include<stdlib.h>
__global__ void matrix_mul(float * ad,float * bd,float * cd,int N)
{
float pvalue=0;
//find Row and Column corresponding to a data element for each thread
int Row = blockIdx.y * blockDim.y + threadIdx.y;
int Col = blockIdx.x * blockDim.x + threadIdx.x;
//calculate dot product of Row of First Matrix and Column of Second Matrix
for(int i=0;i< N;++i)
{
// I tried with executing this first:
float m=__ldg(&ad[Row * N+i]);
float n=__ldg(&bd[i * N + Col]);
//Then I executed this as a normal execution:
// float m = ad[Row * N+i];
// float n = bd[i * N + Col];
pvalue += m * n;
}
//store dot product at corresponding position in resultant Matrix
cd[Row * N + Col] = pvalue;
}
int main()
{
int N = 1024,i,j; //N == size of square matrix
float *a,*b;
float *ad,*bd,*cd,*c;
//open a file for outputting the result
FILE *f;
f=fopen("Parallel Multiply_ldg.txt","w");
size_t size=sizeof(float)* N * N;
//allocate host side memory
a=(float*)malloc(size);
b=(float*)malloc(size);
c=(float*)malloc(size);
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
{
a[i*N+j]=2.0; //(float)(i*N+j); //initializing each value with its own index
b[i*N+j]=1.0; //(float)(i*N+j); //random functions can be used alternatively
}
}
//allocate device memory
cudaMalloc(&ad,size);
//printf("\nAfter cudaMalloc for ad\n%s\n",cudaGetErrorString(cudaGetLastError()));
cudaMalloc(&bd,size);
//printf("\nAfter cudaMalloc bd\n%s\n",cudaGetErrorString(cudaGetLastError()));
cudaMalloc(&cd,size);
//printf("\nAfter cudaMalloc cd\n%s\n",cudaGetErrorString(cudaGetLastError()));
//copy value from host to device
cudaMemcpy(ad,a,size,cudaMemcpyHostToDevice);
cudaMemcpy(bd,b,size,cudaMemcpyHostToDevice);
printf("\nAfter HostToDevice Memcpy\n%s\n",cudaGetErrorString(cudaGetLastError()));
//calculate execution configuration
dim3 blocksize(16,16); //each block contains 16 * 16 (=256) threads
dim3 gridsize(N/16,N/16); //creating just sufficient no of blocks
//GPU timer code
float time;
cudaEvent_t start,stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start,0);
matrix_mul <<< gridsize, blocksize >>> (ad,bd,cd, N);
cudaDeviceSynchronize();
cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time,start,stop); //time taken in kernel call calculated
cudaEventDestroy(start);
cudaEventDestroy(stop);
//copy back results
cudaMemcpy(c,cd,sizeof(float)* N*N,cudaMemcpyDeviceToHost);
printf("\nAfter DeviceToHost Memcpy\n%s\n",cudaGetErrorString(cudaGetLastError()));
//output results in output_file
fprintf(f,"Array A was---\n");
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
fprintf(f,"%f ",a[i*N+j]);
fprintf(f,"\n");
}
fprintf(f,"\nArray B was---\n");
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
fprintf(f,"%f ",b[i*N+j]);
fprintf(f,"\n");
}
fprintf(f,"\nMultiplication of A and B gives C----\n");
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
fprintf(f,"%f ",c[i*N+j]); //if correctly computed, then all values must be N
fprintf(f,"\n");
}
printf("\nYou can see output in Parallel Mutiply.txt file in project directory");
printf("\n\nTime taken is %f (ms)\n",time);
fprintf(f,"\n\nTime taken is %f (ms)\n",time);
fclose(f);
cudaThreadExit();
//cudaFree(ad); cudaFree(bd); cudaFree (cd);
free(a);free(b);free(c);
//_getch();
return 1;
}
I commented out the __ldg part in my kernel and executed it as a normal execution, and vice versa.
In both cases it gives me the correct multiplication result. I am confused by the time difference between these executions, because it's huge, almost more than 100X!
In case of __ldg it gives me: Time taken is 0.014432 (ms)
And in case of normal execution without __ldg it gives me : Time taken is 36.858398 (ms)
Is this the right way of using the __ldg intrinsic? What is the significance of the __ldg intrinsic and what is the proper way of using it? Apparently what I did above in my code is wrong and naive. I am looking for an explanation and an example. Thanks in advance.
From the CUDA C Programming Guide
Global memory accesses for devices of compute capability 3.x are cached in L2 and for devices of compute capability 3.5, may also be cached in the read-only data cache described in the previous section; they are not cached in L1.
...
Data that is read-only for the entire lifetime of the kernel can also be cached in the read-only data cache described in the previous section by reading it using the __ldg() function (see Read-Only Data Cache Load Function). When the compiler detects that the read-only condition is satisfied for some data, it will use __ldg() to read it. The compiler might not always be able to detect that the read-only condition is satisfied for some data. Marking pointers used for loading such data with both the const and __restrict__ qualifiers increases the likelihood that the compiler will detect the read-only condition.
The read only cache accesses have a much lower latency than the global memory accesses. Because matrix multiplication accesses the same values from memory many times, caching in the read only cache gives a huge speedup (in memory bound applications).
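As the quoted passage suggests, the same effect can often be obtained without writing __ldg explicitly, by qualifying the read-only input pointers. A sketch of the question's kernel with that change (my own rewrite for illustration, not taken from the answers):
// The read-only inputs are marked const __restrict__, which lets the
// compiler emit __ldg (read-only cache) loads on its own.
__global__ void matrix_mul(const float * __restrict__ ad,
                           const float * __restrict__ bd,
                           float * __restrict__ cd, int N)
{
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    float pvalue = 0;
    for (int i = 0; i < N; ++i)
        pvalue += ad[Row * N + i] * bd[i * N + Col];
    cd[Row * N + Col] = pvalue;
}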
NVIDIA GPUs have texture hardware - a pipeline with special, fairly simple logic for working with images.
This texture memory is another type of memory available on the GPU. In particular, constant, global and register file memory have no relation to this texture memory.
Kepler GPUs and later add the ability to use this memory through the GPU texture pipeline.
But let's specify the difference between constant cache and read-only cache.
Constant Cache
Data loaded through the constant cache must be relatively small and must be accessed in such a way that all threads of a warp access the same location at any given time.
Read-only Cache or Texture Memory Cache
This cache can be much larger and can be accessed in a non-uniform pattern.
The read-only cache has a granularity of 32 bytes.
You can use this as a "read-only cache" for your CUDA kernel.
1. Data stored in global memory can be cached in this place, the GPU texture memory.
2. By doing so, you promise the compiler that the data is read-only for the duration of the kernel execution on the GPU.
There are two ways to achieve this.
A. Using an intrinsic function __ldg
Example: output[i] += __ldg(&input[j]);
B. Qualifying pointers to global memory
const float* __restrict__ input
output[idx] += input[idx];
Comparison:
The intrinsic __ldg is a better choice for deep compiler reasons.

Monitoring how thread blocks are allocated to SMs across execution time?

I am a beginner with CUDA profiling. I basically want to generate a timeline that shows each SM and the thread block that was assigned to it across execution time.
Something similar to this (a timeline of SMs vs. blocks over time; figure by Sreepathi Pai):
I have read about reading %smid register, but I don't know how to incorporate it with the code that I want to test, or how to relate that to thread blocks or time.
The full code is beyond the scope of this answer so this answer provides the building blocks for you to implement block trace.
Allocate a buffer 16 bytes * number of blocks. This can be done per launch or a larger buffer can be allocated and maintained for multiple launches.
Pass the pointer to the buffer either through a constant variable or as an additional kernel parameter.
Modify your global functions to accept the parameter and perform the code listed below. I recommend writing new global function wrappers and have the wrapper kernel call the old code. This makes it easier to handle kernels with multiple exit points.
Visualizing Data
On compute capability 2.x devices the timestamp function should be clock64. This clock is not synchronized across SMs. The recommended approach is to sort the times per SM and use the lowest time per SM as the time of the kernel launch. This will only be off by 100s of cycles from the real time, so for reasonable size kernels this drift is negligible.
Remove the smid from the lower 4 bits of the first 8-byte value. Clear the lower 4 bits of the end timestamp.
Allocate a device buffer equal to the number of blocks * 16 bytes. Each 16-byte record will store the start and end timestamp as well as a 5-bit smid packed into the start time.
static __device__ inline uint32_t __smid()
{
uint32_t smid;
asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
return smid;
}
// use globaltimer for compute capability >= 3.0 (kepler and maxwell)
// use clock64 for compute capability 2.x (fermi)
static __device__ inline uint64_t __timestamp()
{
uint64_t globaltime;
asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(globaltime) );
return globaltime;
}
__global__ void blocktime(uint64_t* pBlockTime)
{
// START TIMESTAMP
uint64_t startTime = __timestamp();
// flatBlockIdx should be adjusted to 1D, 2D, and 3D launches to minimize
// overhead. Reduce to uint32_t if launch index does not exceed 32-bit.
uint64_t flatBlockIdx = (blockIdx.z * gridDim.x * gridDim.y)
+ (blockIdx.y * gridDim.x)
+ blockIdx.x;
// reduce this based upon dimensions of block to minimize overhead
if (threadIdx.x == 0 && theradIdx.y == 0 && threadIdx.z == 0)
{
// Put the smid in the 4 lower bits. If the multiprocessor count exceeds
// 16 then increase to 5 bits. The lower 5 bits of globaltimer are
// junk anyway. If using clock64 and you want to improve precision then use
// the most significant 4-5 bits instead.
uint64_t smid = __smid();
uint64_t data = (startTime & ~0xFull) | smid;   // clear the low bits, pack in the smid
pBlockTime[flatBlockIdx * 2 + 0] = data;
}
// do work
// I would recommend changing your current __global__ function to be
// a __device__ function and calling it here. This will result
// in easier handling of kernels that have multiple exit points.
// END TIMESTAMP
// All threads in block will write out. This is not very efficient.
// Depending on the kernel this can be reduced to 1 thread or 1 thread per warp.
uint64_t endTime = __timestamp();
pBlockTime[flatBlockIdx * 2 + 1] = endTime;
}
__noinline__ __device__ uint get_smid(void)
{
uint ret;
asm("mov.u32 %0, %smid;" : "=r"(ret) );
return ret;
}
Source here.
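To tie the building blocks together, here is a minimal host-side sketch of my own (not from the answer): it allocates the per-block buffer, launches the blocktime kernel from above, and decodes each 16-byte record, with the smid sitting in the low 4 bits of the first value as described earlier. trace_blocks is a hypothetical helper name.
#include <cstdio>
#include <cstdlib>
#include <cstdint>

// assumes blocktime() from the answer above is defined in the same file
void trace_blocks(int numBlocks, int threadsPerBlock)
{
    uint64_t *dBuf;
    size_t bytes = (size_t)numBlocks * 2 * sizeof(uint64_t);   // 16 bytes per block
    cudaMalloc(&dBuf, bytes);

    blocktime<<<numBlocks, threadsPerBlock>>>(dBuf);

    uint64_t *hBuf = (uint64_t *)malloc(bytes);
    cudaMemcpy(hBuf, dBuf, bytes, cudaMemcpyDeviceToHost);

    for (int b = 0; b < numBlocks; ++b) {
        uint64_t rec0  = hBuf[b * 2 + 0];
        uint64_t smid  = rec0 & 0xF;              // smid packed in the low 4 bits
        uint64_t start = rec0 & ~0xFull;          // clear the packed smid
        uint64_t end   = hBuf[b * 2 + 1] & ~0xFull;
        printf("block %d: SM %llu, start %llu, end %llu\n", b,
               (unsigned long long)smid,
               (unsigned long long)start, (unsigned long long)end);
    }
    free(hBuf);
    cudaFree(dBuf);
}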

CUDA: Thread and Array Allocation

I have read many times about CUDA threads/blocks and arrays, but I still don't understand the point: how and when does CUDA start to run multiple threads for a kernel function - when the host calls the kernel function, or inside the kernel function?
For example, I have this example. It just does a simple transpose of an array (so it just copies values from one array to another array).
__global__
void transpose(float* in, float* out, uint width) {
uint tx = blockIdx.x * blockDim.x + threadIdx.x;
uint ty = blockIdx.y * blockDim.y + threadIdx.y;
out[tx * width + ty] = in[ty * width + tx];
}
int main(int args, char** vargs) {
/*const int HEIGHT = 1024;
const int WIDTH = 1024;
const int SIZE = WIDTH * HEIGHT * sizeof(float);
dim3 bDim(16, 16);
dim3 gDim(WIDTH / bDim.x, HEIGHT / bDim.y);
float* M = (float*)malloc(SIZE);
for (int i = 0; i < HEIGHT * WIDTH; i++) { M[i] = i; }
float* Md = NULL;
cudaMalloc((void**)&Md, SIZE);
cudaMemcpy(Md,M, SIZE, cudaMemcpyHostToDevice);
float* Bd = NULL;
cudaMalloc((void**)&Bd, SIZE); */
transpose<<<gDim, bDim>>>(Md, Bd, WIDTH); // CALLING FUNCTION TRANSPOSE
cudaMemcpy(M,Bd, SIZE, cudaMemcpyDeviceToHost);
return 0;
}
(I have commented out all the lines that are not important, keeping just the line that calls the transpose function.)
I understand all the lines in the main function except the line calling transpose. Is it true to say: when we call transpose<<<gDim, bDim>>>(Md, Bd, WIDTH), CUDA will automatically assign each element of the array to one thread (and block), and when we call transpose "one time", CUDA will run transpose gDim * bDim times on gDim * bDim threads?
This point frustrates me so much, because it is not like multithreading in Java when I use it :( Please tell me.
Thanks :)
Your understanding is in essence correct.
transpose is not a function, but a CUDA kernel. When you call a regular function, it only runs once. But when you launch a kernel a single time, CUDA will automatically run the code in the kernel many times. CUDA does this by starting many threads. Each thread runs the code in your kernel one time. The numbers inside the triple brackets (<<< >>>) are called the kernel execution configuration. It determines how many threads will be launched by CUDA and specifies some relationships between the threads.
The number of threads that will be started is calculated by multiplying up all the values in the grid and block dimensions inside the triple brackets. For instance, the number of threads will be 1,048,576 (16 * 16 * 64 * 64) in your example.
Each thread can read some variables to find out which thread it is. Those are the blockIdx and threadIdx structures at the top of the kernel. The values reflect the ones in the kernel execution configuration. So, if you run your kernel with a grid configuration of 16 x 16 (the first dim3 in the triple brackets), you will get threads that, when they each read the x and y values in the blockIdx structure, will get all possible combinations of x and y between 0 and 15.
So, as you see, CUDA does not know anything about array elements or any other data structures that are specific to your kernel. It just deals with threads, thread indexes and block indexes. You then use those indexes to determine what a given thread should do (in particular, which values in your application-specific data it should work on).
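To make the mapping concrete (an illustration of my own, not part of the original answer), the single launch transpose<<<gDim, bDim>>>(Md, Bd, WIDTH) behaves as if the kernel body ran once for every (blockIdx, threadIdx) combination, conceptually like the nested loops below, except that on the GPU these iterations run in parallel:
// Conceptual CPU equivalent of transpose<<<gDim, bDim>>>(in, out, width)
void transpose_cpu_equivalent(float * in, float * out, uint width,
                              dim3 gDim, dim3 bDim) {
    for (uint by = 0; by < gDim.y; ++by)
      for (uint bx = 0; bx < gDim.x; ++bx)          // every block in the grid
        for (uint tyy = 0; tyy < bDim.y; ++tyy)
          for (uint txx = 0; txx < bDim.x; ++txx) { // every thread in the block
            uint tx = bx * bDim.x + txx;            // same formulas as the kernel
            uint ty = by * bDim.y + tyy;
            out[tx * width + ty] = in[ty * width + tx];
          }
}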