Different running time for cublasSetMatrix on similar matrices - cuda

In the following code I'm using the function cublasSetMatrix for 3 random matrices of size 200x200. I measured the time of each of these calls in the code:
clock_t t1, t2, t3, t4;
int m = 200, n = 200;

float *bold1 = new float[m*n];
float *bold2 = new float[m*n];
float *bold3 = new float[m*n];

for (int i = 0; i < m; i++)
    for (int j = 0; j < n; j++)
    {
        bold1[i*n+j] = rand() % 10;
        bold2[i*n+j] = rand() % 10;
        bold3[i*n+j] = rand() % 10;
    }

float *dev_bold1, *dev_bold2, *dev_bold3;
cudaMalloc((void**)&dev_bold1, sizeof(float)*m*n);
cudaMalloc((void**)&dev_bold2, sizeof(float)*m*n);
cudaMalloc((void**)&dev_bold3, sizeof(float)*m*n);

t1 = clock();
cublasSetMatrix(m, n, sizeof(float), bold1, m, dev_bold1, m);
t2 = clock();
cublasSetMatrix(m, n, sizeof(float), bold2, m, dev_bold2, m);
t3 = clock();
cublasSetMatrix(m, n, sizeof(float), bold3, m, dev_bold3, m);
t4 = clock();

cout << double(t2-t1)/CLOCKS_PER_SEC << " - " << double(t3-t2)/CLOCKS_PER_SEC << " - " << double(t4-t3)/CLOCKS_PER_SEC;

delete[] bold1;
delete[] bold2;
delete[] bold3;
cudaFree(dev_bold1);
cudaFree(dev_bold2);
cudaFree(dev_bold3);
The output of this code is something like this:
0.121849 - 0.000131 - 0.000141
Every time I run the code, the time for cublasSetMatrix on the first matrix is much longer than for the other two, even though all three matrices have the same size and are filled with random numbers.
Can anyone help me find the reason for this result?

Usually the first CUDA API call in any CUDA program will incur some start-up overhead - the CUDA runtime requires time to initialize everything.
Whenever CUDA libraries are used, there will be some additional one-time start up overhead associated with initialization of the library. This overhead will often be observed to impact the timing of the first library call.
That seems to be what is happening here. If you place another cuBLAS API call before the first one you measure, the start-up cost is paid by that earlier call and you will no longer observe it on the first cublasSetMatrix() call.
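A minimal sketch of that idea applied to your code (the untimed warm-up lines are my own illustrative addition; any CUDA or cuBLAS call that forces initialization would do):

// Untimed warm-up: absorb the one-time CUDA/cuBLAS start-up cost here,
// before the timed region.
cublasHandle_t handle;
cublasCreate(&handle);                                          // initializes the cuBLAS library
cublasSetMatrix(m, n, sizeof(float), bold1, m, dev_bold1, m);   // untimed dummy transfer

t1 = clock();
cublasSetMatrix(m, n, sizeof(float), bold1, m, dev_bold1, m);   // now measured without start-up overhead
t2 = clock();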

Related

How do I run microbenchmarks in CUDA without the optimizer messing with me?

I've been trying to benchmark the modulus operation in CUDA against some custom modulus operations. Currently I use the following function:
__inline__ __device__ uint64_t modop(uint64_t& a, uint32_t& q) {
    uint64_t c;
    for (int j = 0; j < REPEAT; j++) {
        c = a % q;
    }
    return c;
}
My problem is that I believe the compiler is being clever and optimizing the loop away, as the function is unreasonably fast. I've tried clobbering my variables with asm volatile("" : "=l"(c)::"memory");, but for a it does nothing and for c it breaks everything.
What can I do to benchmark simple operations like this in CUDA?
There is in-kernel measurement via the clock() function, which returns a clock_t. You can also use its return value to block optimizations (for example, keep repeating the operation until the clock has advanced by at least 1000 cycles, then divide the number of repeats by the total cycles). The compiler cannot predict time.
clock_t a = clock();
clock_t q = clock();
int repeats = 0;
while (q - a < 1000)
{
    c = a % q;
    repeats++;
    q = clock();
}
perf = repeats / (float)(q - a - (latencyClock + latencyIncrement) * repeats);
Then, in the performance calculation, subtract the latency of clock() (and of the increment) multiplied by the number of repeats from the total elapsed cycles.
This also forces each thread to run the modulo on different values, as opposed to one compile-time-known value shared by all threads, so the kernel cannot be optimized away globally.
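A hedged, self-contained sketch of how that idea could look as a full kernel (the kernel name, array parameters and the 1000-cycle threshold are my own choices, not from the original code; divisors are assumed nonzero):

#include <cstdint>

// Each thread loops until at least 1000 cycles have elapsed on the per-SM
// counter, counting how many modulo operations fit into that window.
// Operands come from per-thread input arrays, so the compiler cannot fold
// them away at compile time.
__global__ void bench_mod(uint64_t *values, uint32_t *divisors,
                          float *cycles_per_op, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    uint64_t a = values[tid];
    uint32_t q = divisors[tid];
    uint64_t c = 0;

    clock_t start = clock();
    clock_t now   = start;
    int repeats = 0;
    while (now - start < 1000)
    {
        c = a % q;      // the operation being measured
        a ^= c;         // feed the result back so the modulo cannot be hoisted out of the loop
        repeats++;
        now = clock();
    }

    // Uncorrected cycles per operation; the clock()/xor/increment overhead
    // still has to be subtracted, as described above.
    cycles_per_op[tid] = (float)(now - start) / repeats;

    // Keep the result live so the whole loop cannot be removed.
    if (c == 0xDEADBEEF) values[tid] = a;
}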

Getting pointers to specific elements of a 1D contiguous array on the device

I am trying to use cuBLAS in C++ to rewrite a Python/TensorFlow script which operates on batches of input samples (of shape BxD; B: batch size, D: depth of the flattened 2D matrix).
For the first step, I decided to use CUBLAS cublasSgemmBatched to compute MatMul for batches of matrices.
I've found a couple of working sample codes, such as the one in the linked question,
but what I want is to allocate one big contiguous device array that stores batches of flattened, identically shaped matrices. I DO NOT want to store the batches separated from each other in device memory (as they are in the sample code provided in the linked StackOverflow question).
From what I can imagine, I somehow have to get a list of pointers to the starting element of each batch in device memory. Something like this:
float **device_batch_ptr;
cudaMalloc((void**)&device_batch_ptr, batch_size*sizeof(float *));
for(int i = 0 ; i < batch_size; i++ ) {
    // set device_batch_ptr[i] to the starting point of the i'th batch in the device memory array
}
Note that cublasSgemmBatched needs a float** in which each float* points to the starting element of one batch of a given input matrix.
Any advice and suggestions will be greatly appreciated.
If your arrays are in contiguous linear memory (device_array) then all you need to do is calculate the offsets using standard pointer arithmetic and store the device addresses in a host array which you then copy to the device. Something like:
float** device_batch_ptr;
float** h_device_batch_ptr = new float*[batch_size];

cudaMalloc((void**)&device_batch_ptr, batch_size*sizeof(float *));
size_t nelementsperarray = N * N;
for(int i = 0 ; i < batch_size; i++ ) {
    // set h_device_batch_ptr[i] to the starting point of the i'th batch within device_array
    h_device_batch_ptr[i] = device_array + i * nelementsperarray;
}
cudaMemcpy(device_batch_ptr, h_device_batch_ptr, batch_size*sizeof(float *),
           cudaMemcpyHostToDevice);
[Obviously never compiled or tested, use at own risk]
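For completeness, a hedged sketch of how such pointer arrays might then be passed to cublasSgemmBatched, assuming square NxN matrices in column-major layout and three pointer arrays d_Aarray, d_Barray, d_Carray built exactly as above (these names are mine, not from the original code):

// d_Aarray, d_Barray, d_Carray are device arrays of float*, each entry
// pointing into one contiguous buffer as constructed above.
cublasHandle_t handle;
cublasCreate(&handle);

const float alpha = 1.0f;
const float beta  = 0.0f;

// C[i] = alpha * A[i] * B[i] + beta * C[i] for i = 0 .. batch_size-1
cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                   N, N, N,
                   &alpha,
                   (const float**)d_Aarray, N,
                   (const float**)d_Barray, N,
                   &beta,
                   d_Carray, N,
                   batch_size);

cublasDestroy(handle);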

prefix sum using CUDA

I am having trouble understanding a CUDA code for the naive prefix sum.
This code is from https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html
In example 39-1 (naive scan), we have a code like this:
__global__ void scan(float *g_odata, float *g_idata, int n)
{
    extern __shared__ float temp[]; // allocated on invocation
    int thid = threadIdx.x;
    int pout = 0, pin = 1;

    // Load input into shared memory.
    // This is exclusive scan, so shift right by one
    // and set first element to 0
    temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2)
    {
        pout = 1 - pout; // swap double buffer indices
        pin = 1 - pout;
        if (thid >= offset)
            temp[pout*n+thid] += temp[pin*n+thid - offset];
        else
            temp[pout*n+thid] = temp[pin*n+thid];
        __syncthreads();
    }

    g_odata[thid] = temp[pout*n+thid]; // write output
}
My questions are
Why do we need to create a shared-memory temp?
Why do we need the "pout" and "pin" variables? What do they do? Since we only use one block and at most 1024 threads here, can we just use threadIdx.x to index the element in the block?
In CUDA, do we use one thread to do one add operation? Is it like each thread does what would be done in one iteration of a for loop (as when looping over threads or processors in OpenMP, with one thread per array element)?
My previous two questions may seem naive... I think the key is that I don't understand the relation between the above implementation and the following pseudocode:
for d = 1 to log2 n do
    for all k in parallel do
        if k >= 2^d then
            x[k] = x[k - 2^(d-1)] + x[k]
This is my first time using CUDA, so I'll appreciate it if anyone can answer my questions...
1 - It's faster to load the data into shared memory and do the calculations there rather than working directly on global memory. It's important to sync the threads after loading shared memory, hence the __syncthreads.
2 - These variables implement the double buffering in the algorithm; they simply toggle which half of the shared array is read and which is written:
temp[pout*n+thid] += temp[pin*n+thid - offset];
First iteration: pout = 1 and pin = 0. Second iteration: pout = 0 and pin = 1.
This offsets the output half of the buffer by n on odd iterations and the input half on even iterations. To come back to your question, you can't achieve the same thing with threadIdx.x alone, because it doesn't change within the loop.
3 & 4 - CUDA launches many threads to run the kernel, meaning each thread executes this code on its own element. If you compare the pseudocode with the CUDA code, the parallel loop over k is what has already been parallelized across threads; each thread then runs the loop over offsets inside the kernel to the end, waiting for the other threads at every __syncthreads, before writing its result to global memory.
Hope it helps.
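As a concrete illustration of how this kernel is launched - a hedged sketch, with the host-side names (d_odata, d_idata) being my own - note the single block, one thread per element, and a dynamic shared-memory allocation large enough for the double buffer:

// Launch the naive scan for n elements (n <= 1024, power of two):
// one block, one thread per element, and 2*n floats of dynamic shared
// memory for the double-buffered temp[] array.
int n = 1024;
scan<<<1, n, 2 * n * sizeof(float)>>>(d_odata, d_idata, n);
cudaDeviceSynchronize();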

What is the difference between __ldg() intrinsic and a normal execution?

I am trying to explore the __ldg intrinsic. I have gone through NVIDIA's documentation for it but didn't get any satisfactory answer about its use and implementation. Moreover, with reference to THIS, I tried using __ldg in a simple 1024x1024 matrix multiplication example.
#include<stdio.h>
#include<stdlib.h>

__global__ void matrix_mul(float * ad,float * bd,float * cd,int N)
{
    float pvalue=0;

    //find Row and Column corresponding to a data element for each thread
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;

    //calculate dot product of Row of First Matrix and Column of Second Matrix
    for(int i=0;i< N;++i)
    {
        // I tried with executing this first:
        float m=__ldg(&ad[Row * N+i]);
        float n=__ldg(&bd[i * N + Col]);

        //Then I executed this as a normal execution:
        // float m = ad[Row * N+i];
        // float n = bd[i * N + Col];

        pvalue += m * n;
    }

    //store dot product at corresponding position in resultant Matrix
    cd[Row * N + Col] = pvalue;
}

int main()
{
    int N = 1024,i,j;               //N == size of square matrix
    float *a,*b;
    float *ad,*bd,*cd,*c;

    //open a file for outputting the result
    FILE *f;
    f=fopen("Parallel Multiply_ldg.txt","w");

    size_t size=sizeof(float)* N * N;

    //allocate host side memory
    a=(float*)malloc(size);
    b=(float*)malloc(size);
    c=(float*)malloc(size);

    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
        {
            a[i*N+j]=2.0;   //(float)(i*N+j);   //initializing each value with its own index
            b[i*N+j]=1.0;   //(float)(i*N+j);   //random functions can be used alternatively
        }
    }

    //allocate device memory
    cudaMalloc(&ad,size);
    //printf("\nAfter cudaMalloc for ad\n%s\n",cudaGetErrorString(cudaGetLastError()));
    cudaMalloc(&bd,size);
    //printf("\nAfter cudaMalloc bd\n%s\n",cudaGetErrorString(cudaGetLastError()));
    cudaMalloc(&cd,size);
    //printf("\nAfter cudaMalloc cd\n%s\n",cudaGetErrorString(cudaGetLastError()));

    //copy value from host to device
    cudaMemcpy(ad,a,size,cudaMemcpyHostToDevice);
    cudaMemcpy(bd,b,size,cudaMemcpyHostToDevice);

    printf("\nAfter HostToDevice Memcpy\n%s\n",cudaGetErrorString(cudaGetLastError()));

    //calculate execution configuration
    dim3 blocksize(16,16);          //each block contains 16 * 16 (=256) threads
    dim3 gridsize(N/16,N/16);       //creating just sufficient no of blocks

    //GPU timer code
    float time;
    cudaEvent_t start,stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start,0);

    matrix_mul <<< gridsize, blocksize >>> (ad,bd,cd, N);

    cudaDeviceSynchronize();
    cudaEventRecord(stop,0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time,start,stop); //time taken in kernel call calculated
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    //copy back results
    cudaMemcpy(c,cd,sizeof(float)* N*N,cudaMemcpyDeviceToHost);
    printf("\nAfter DeviceToHost Memcpy\n%s\n",cudaGetErrorString(cudaGetLastError()));

    //output results in output_file
    fprintf(f,"Array A was---\n");
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            fprintf(f,"%f ",a[i*N+j]);
        fprintf(f,"\n");
    }
    fprintf(f,"\nArray B was---\n");
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            fprintf(f,"%f ",b[i*N+j]);
        fprintf(f,"\n");
    }
    fprintf(f,"\nMultiplication of A and B gives C----\n");
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            fprintf(f,"%f ",c[i*N+j]); //if correctly computed, then all values must be N
        fprintf(f,"\n");
    }
    printf("\nYou can see output in Parallel Mutiply.txt file in project directory");
    printf("\n\nTime taken is %f (ms)\n",time);
    fprintf(f,"\n\nTime taken is %f (ms)\n",time);
    fclose(f);

    cudaThreadExit();
    //cudaFree(ad); cudaFree(bd); cudaFree (cd);
    free(a);free(b);free(c);
    //_getch();
    return 1;
}
I commented out the __ldg part in my kernel and ran the normal version, and vice versa.
In both cases it gives me the correct multiplication result. What confuses me is the time difference between the two executions, because it is huge, more than 100x!
In case of __ldg it gives me: Time taken is 0.014432 (ms)
And in case of normal execution without __ldg it gives me : Time taken is 36.858398 (ms)
Is this the correct way of using the __ldg intrinsic? What is the significance of __ldg and what is the proper way of using it? Apparently what I did above in my code is wrong and naive. I am looking for an explanation and an example. Thanks in advance.
From the CUDA C Programming Guide
Global memory accesses for devices of compute capability 3.x are cached in L2 and for devices of compute capability 3.5, may also be cached in the read-only data cache described in the previous section; they are not cached in L1.
...
Data that is read-only for the entire lifetime of the kernel can also be cached in the read-only data cache described in the previous section by reading it using the __ldg() function (see Read-Only Data Cache Load Function). When the compiler detects that the read-only condition is satisfied for some data, it will use __ldg() to read it. The compiler might not always be able to detect that the read-only condition is satisfied for some data. Marking pointers used for loading such data with both the const and __restrict__ qualifiers increases the likelihood that the compiler will detect the read-only condition.
The read only cache accesses have a much lower latency than the global memory accesses. Because matrix multiplication accesses the same values from memory many times, caching in the read only cache gives a huge speedup (in memory bound applications).
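As a sketch of the second route the guide mentions (qualifying the pointers instead of calling __ldg() explicitly), your kernel could be written as below. This is an illustrative variant of the kernel above, not a guaranteed-faster version:

// Marking the input pointers const and __restrict__ tells the compiler the
// data is read-only and not aliased, so it may emit __ldg()-style loads itself.
__global__ void matrix_mul(const float* __restrict__ ad,
                           const float* __restrict__ bd,
                           float* __restrict__ cd, int N)
{
    float pvalue = 0;
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < N; ++i)
        pvalue += ad[Row * N + i] * bd[i * N + Col];   // ordinary loads; the compiler may route them through the read-only cache

    cd[Row * N + Col] = pvalue;
}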
NVIDIA GPUs have texture hardware - a pipeline with special (and fairly simple) logic for working with images.
Texture memory is another type of memory available on the GPU; in particular, constant memory, global memory and the register file have no relation to this texture memory.
Kepler GPUs and later add the ability to read global memory through this GPU texture pipeline.
But let's spell out the difference between the constant cache and the read-only cache.
Constant Cache
Data loaded through the constant cache must be relatively small and must be accessed in such a way that all threads of a warp access the same location at any given time.
Read-only Cache or Texture Memory Cache
This cache can be much larger and can be accessed in a non-uniform pattern.
The read-only cache has a granularity of 32 bytes.
You can use it as a "read-only cache" for your CUDA kernel:
1. Data stored in global memory can be cached in the GPU's texture memory.
2. By doing so, you promise the compiler that the data is read-only for the duration of the kernel execution on the GPU.
There are two ways to achieve this.
A. Using the intrinsic function __ldg
Example: output[i] += __ldg(&input[j]);
B. Qualifying pointers to global memory
const float* __restrict__ input
output[idx] += input[idx];
Comparison:
The intrinsic __ldg is a better choice for deep compiler reasons.

Using clock() function in CUDA

I have a simple kernel which I am timing using clock().
I got to know about this function in How to measure the inner kernel time in NVIDIA CUDA?
So I have used
clock_t start = clock(); (and similarly stop) to time it. On compilation, I get the following error:
tex1.cu(14): error: expression preceding parentheses of apparent call must have (pointer-to-) function type
Am I missing a header file, or a compiler option?
Also, I tried using CUDA timers (cudaEvent_t start, stop;), but the elapsed time I get is 0 ms. I create start and stop, record start, do some CUDA stuff, synchronize, record stop, synchronize on the stop event and measure the elapsed time. This part compiles fine but gives me an elapsed time of zero.
It is a simple kernel that I am using to test my understanding of texture memory.
The Kernel:
__global__ void magic(float *mean, int *clock){
    int i, tid = threadIdx.x + blockIdx.x * blockDim.x;
    float t, sum = 0.0;
    clock_t start = clock();
    if ( tid < dimy )
    {
        for(i = 0; i < dimx; i++){
            t = tex2D( input, i, tid );
            sum = sum + t*t;
        }
        clock_t stop = clock();
        clock[tid] = (int)(stop-start);
    }
}
In your kernel, don't name your kernel parameter clock, as this confuses the compiler: you now have both a variable named clock and a function named clock. Instead do this:
__global__ void magic(float *mean, int *myclock){
    ...
    myclock[tid] = (int)(stop-start);
}
If you make that change, the error about the expression preceding parentheses will go away.
It's odd that you answered "no" to the question about whether you have any other variables called clock or start, because you have both.
If you would like help with your usage of CUDA events, please post the actual code you are using for timing. Are you doing error checking on all CUDA API calls and kernel calls?
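For reference, a minimal, hedged sketch of the usual cudaEvent timing pattern with basic error checking (the kernel name, launch configuration and argument names are placeholders, not taken from your code); if the elapsed time still comes out as 0 ms, the error checks will usually show why:

// Event-based timing of a kernel, with error checking on launch and execution.
cudaError_t err;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
magic<<<grid, block>>>(d_mean, d_myclock);          // placeholder launch
err = cudaGetLastError();                           // catches launch/configuration errors
if (err != cudaSuccess) printf("launch error: %s\n", cudaGetErrorString(err));

cudaEventRecord(stop, 0);
err = cudaEventSynchronize(stop);                   // wait for the kernel and the stop event to finish
if (err != cudaSuccess) printf("execution error: %s\n", cudaGetErrorString(err));

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);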