Using clock() function in CUDA - cuda

I have a simple kernel which I am timing using clock().
I got to know about this function in How to measure the inner kernel time in NVIDIA CUDA?
So I have used
clock_t start = clock(); (and similarly stop) to time it. On compilation, I get the following error:
tex1.cu(14): error: expression preceding parentheses of apparent call must have (pointer-to-) function type
Am I missing a header file, or a compiler option?
Also, I tried using CUDA timers (cudaEvent_t start, stop;) but the elapsed time I get is 0 ms. I create start and stop, record start, do some CUDA stuff, synchronize, record stop, event synchronize and measure elapsed time. This part compiles fine but gives me elapsed time as zero.
It is a simple kernel that I am using to test my understanding of texture memory.
The Kernel:
__global__ void magic(float *mean, int *clock){
int i, tid = threadIdx.x + blockIdx.x * blockDim.x;
float t, sum=0.0;
clock_t start = clock();
if ( tid < dimy )
{
for(i=0;i<dimx; i++){
t = tex2D( input, i, tid );
sum = sum + t*t;
}
clock_t stop = clock();
clock[tid] = (int)(stop-start);
}
}

In your kernel, don't name the parameter clock. Inside the kernel that name then refers to your int * variable rather than to the clock() function, so clock() becomes an attempt to call a pointer, which is what the compiler is complaining about. Instead do this:
__global__ void magic(float *mean, int *myclock){
...
myclock[tid] = (int)(stop-start);
}
If you make that change, the error about the expression preceding parentheses will go away.
It's odd that you answered the question about whether you have any other variables called clock or start with no, because you have both.
If you would like help with your usage of cuda events, please post the actual code you are using for timing. Are you doing error checking on all cuda calls and kernel calls?

Related

Different running time for cublasSetMatrix on similar matrices

In the following code I'm using the function cublasSetMatrix for 3 random matrices of size 200x200. I measured the time taken by this function in the code:
clock_t t1,t2,t3,t4;
int m =200,n = 200;
float * bold1 = new float [m*n];
float * bold2 = new float [m*n];
float * bold3 = new float [m*n];
for (int i = 0; i< m; i++)
for(int j = 0; j <n;j++)
{
bold1[i*n+j]=rand()%10;
bold2[i*n+j]=rand()%10;
bold3[i*n+j]=rand()%10;
}
float * dev_bold1, * dev_bold2,*dev_bold3;
cudaMalloc ((void**)&dev_bold1,sizeof(float)*m*n);
cudaMalloc ((void**)&dev_bold2,sizeof(float)*m*n);
cudaMalloc ((void**)&dev_bold3,sizeof(float)*m*n);
t1=clock();
cublasSetMatrix(m,n,sizeof(float),bold1,m,dev_bold1,m);
t2 = clock();
cublasSetMatrix(m,n,sizeof(float),bold2,m,dev_bold2,m);
t3 = clock();
cublasSetMatrix(m,n,sizeof(float),bold3,m,dev_bold3,m);
t4 = clock();
cout<<double(t2-t1)/CLOCKS_PER_SEC<<" - "<<double(t3-t2)/CLOCKS_PER_SEC<<" - "<<double(t4-t3)/CLOCKS_PER_SEC;
delete []bold1;
delete []bold2;
delete []bold3;
cudaFree(dev_bold1);
cudaFree(dev_bold2);
cudaFree(dev_bold3);
The output of this code is something like this:
0.121849 - 0.000131 - 0.000141
Every time I run the code, applying cublasSetMatrix to the first matrix takes longer than it does for the other two, although all the matrices are the same size and are filled with random numbers.
Can anyone please help me to find out what is the reason of this result?
Usually the first CUDA API call in any CUDA program will incur some start-up overhead - the CUDA runtime requires time to initialize everything.
Whenever CUDA libraries are used, there will be some additional one-time start up overhead associated with initialization of the library. This overhead will often be observed to impact the timing of the first library call.
That seems to be what is happening here. If you place another cuBLAS API call before the first one you are measuring, the start-up overhead moves to that earlier call, and you no longer measure it on the cublasSetMatrix() call.
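A minimal sketch of that warm-up idea, assuming that cudaFree(0), cublasCreate() and one untimed cublasSetMatrix() call are enough to absorb the one-time costs on your system. The helper name timeSetMatrix and the use of std::chrono are illustrative choices, not part of the original code.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: host wall-clock time of one cublasSetMatrix call, in ms.
static double timeSetMatrix(int m, int n, const float *host, float *dev)
{
    auto t0 = std::chrono::steady_clock::now();
    cublasSetMatrix(m, n, sizeof(float), host, m, dev, m);
    cudaDeviceSynchronize();  // make sure the copy has finished before stopping the timer
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const int m = 200, n = 200;
    std::vector<float> host(m * n, 1.0f);
    float *dev = nullptr;

    // Warm-up (assumption: these calls pay the one-time initialization costs).
    cudaFree(0);              // forces creation of the CUDA context
    cublasHandle_t handle;
    cublasCreate(&handle);    // initializes the cuBLAS library

    cudaMalloc((void **)&dev, sizeof(float) * m * n);
    cublasSetMatrix(m, n, sizeof(float), host.data(), m, dev, m);  // first call, not timed

    // The timed calls should now take roughly the same time as each other.
    for (int i = 0; i < 3; ++i)
        printf("call %d: %.3f ms\n", i, timeSetMatrix(m, n, host.data(), dev));

    cudaFree(dev);
    cublasDestroy(handle);
    return 0;
}
```

Build with something like nvcc warmup.cu -lcublas (the file name is arbitrary). The absolute numbers will vary from run to run and from machine to machine.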

What are the possibilities to profile particular __device__ function within CUDA kernel? [duplicate]

I want to measure the time spent inside a GPU kernel. How can I measure it in NVIDIA CUDA?
e.g.
__global__ void kernelSample()
{
// some code here
// get start time
// some code here
// get stop time
// some code here
}
You can do something like this:
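__global__ void kernelSample(int *runtime)
{
// global thread index, used to store one result per thread
int tidx = blockIdx.x * blockDim.x + threadIdx.x;
// ....
clock_t start_time = clock();
//some code here
clock_t stop_time = clock();
// ....
runtime[tidx] = (int)(stop_time - start_time);
}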
This gives the number of clock cycles between the two calls. Be a little careful though: the timer will overflow after a couple of seconds, so you should be sure that the duration of the code between successive calls is quite short. You should also be aware that the compiler and assembler do perform instruction re-ordering, so you might want to check that the clock calls don't wind up next to each other in the SASS output (use cuobjdump to check).
Try this; it measures the time between two events in milliseconds.
cudaEvent_t start, stop;
float elapsedTime;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start,0);
//Do kernel activity here
cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start,stop);
printf("Elapsed time : %f ms\n" ,elapsedTime);
cudaEventDestroy(start);
cudaEventDestroy(stop);

cudaThreadSynchronize & performance

A few days ago I was comparing the performance of my own very simple replace implementation against the Thrust implementation of the same algorithm. I found a difference of one order of magnitude (!) in favor of Thrust, so I stepped through their code with the debugger to discover where the magic happens.
Surprisingly, once I got rid of all the functor stuff and got down to the nitty-gritty, my very straightforward implementation turned out to be very similar to theirs. I saw that Thrust has a clever way to decide both block_size and grid_size (by the way, how exactly does that work?), so I took their settings and ran my code again, since the two were so similar. I gained some microseconds, but the situation was almost the same. Finally, just to try, I removed the cudaThreadSynchronize() after my kernel and BINGO! I closed the gap (and more) and gained a whole order of magnitude in execution time. Checking my array's values, I saw exactly what I expected, so the execution was correct.
The questions now are: when can I get rid of cudaThreadSynchronize (and similar calls)? Why does it cause such a huge overhead? I see that Thrust itself doesn't synchronize at the end (synchronize_if_enabled(const char* message) is a NOP if the macro __THRUST_SYNCHRONOUS isn't defined, and it isn't).
Details & code follow.
// my replace code
template <typename T>
__global__ void replaceSimple(T* dev, const int n, const T oldval, const T newval)
{
const int gridSize = blockDim.x * gridDim.x;
int index = blockIdx.x * blockDim.x + threadIdx.x;
while(index < n)
{
if(dev[index] == oldval)
dev[index] = newval;
index += gridSize;
}
}
// replace invocation - not in main because of cpp - cu separation
template <typename T>
void callReplaceSimple(T* dev, const int n, const T oldval, const T newval)
{
replaceSimple<<<30,768,0>>>(dev,n,oldval,newval);
cudaThreadSynchronize();
}
// thrust replace invocation
template <typename T>
void callReplace(thrust::device_vector<T>& dev, const T oldval, const T newval)
{
thrust::replace(dev.begin(), dev.end(), oldval, newval);
}
Param details: arrays: n=10,000,000 elements set to 2, oldval=2, newval=3
Time to execute thrust callReplace (thrust): 0.057 ms
Time to execute callReplaceSimple with sync: 0.662 ms
Time to execute callReplaceSimple without sync: 0.011 ms
I used CUDA 5.0 with Thrust included, my card is a GeForce GTX 570 and I have a quadcore Q9550 2.83 GHz with 2 GB RAM.
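Kernel launches are asynchronous. If you remove the cudaThreadSynchronize() call, you only measure the time it takes to launch the kernel, not the time until the kernel completes. That is why the figure without synchronization (0.011 ms) looks so small. Note that cudaThreadSynchronize() is deprecated; cudaDeviceSynchronize() is its replacement.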
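The difference can be made visible with a small sketch that reads a host timer twice: once right after the launch and once after synchronizing. It assumes an int array and a simplified, non-templated copy of the replaceSimple kernel from the question; the msSince helper is hypothetical.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Simplified, non-templated version of the grid-stride replace kernel.
__global__ void replaceSimple(int *dev, int n, int oldval, int newval)
{
    const int gridSize = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridSize)
        if (dev[i] == oldval)
            dev[i] = newval;
}

// Hypothetical helper: milliseconds elapsed on the host since t0.
static double msSince(std::chrono::steady_clock::time_point t0)
{
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const int n = 10000000;
    int *dev = nullptr;
    cudaMalloc((void **)&dev, n * sizeof(int));
    cudaMemset(dev, 0, n * sizeof(int));

    // Warm-up launch (0 -> 2) so one-time start-up costs are not timed.
    replaceSimple<<<30, 768>>>(dev, n, 0, 2);
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    replaceSimple<<<30, 768>>>(dev, n, 2, 3);
    double launchMs = msSince(t0);  // control returns here while the kernel may still be running
    cudaDeviceSynchronize();        // block the host until the kernel has finished
    double totalMs = msSince(t0);   // launch plus execution

    printf("launch only:      %.3f ms\n", launchMs);
    printf("launch + execute: %.3f ms\n", totalMs);

    cudaFree(dev);
    return 0;
}
```

The first figure corresponds to timing without synchronization and the second to timing with it. When the result is consumed later by a blocking call such as cudaMemcpy, the explicit synchronization is not needed for correctness, only for timing.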

Kernel Launch Failure

I'm operating on a Linux system and a Tesla C2075 machine. I am launching a kernel that is a modified version of the reduction kernel. My aim is to find the mean and a step by step averaged version(time_avg) of a large data set (result). See code below.
Size of "result" and "time_avg" is same and equal to "nsamps". "time_avg" contains successive averaged sets of the array result. So, first half contains averages of every two non-overlapping samples, the quarter after that has averages of every four non-overlapping samples, the next eighth of 8 samples and so on.
__global__ void timeavg_mean(float *result, unsigned int *nsamps, float *time_avg, float *mean) {
__shared__ float temp[1024];
int ltid = threadIdx.x, gtid = blockIdx.x*blockDim.x + threadIdx.x, stride;
int start = 0, index;
unsigned int npts = *nsamps;
printf("here here\n");
// Store chunk of memory=2*blockDim.x (which is to be reduced) into shared memory
if ( (2*gtid) < npts ){
temp[2*ltid] = result[2*gtid];
temp[2*ltid+1] = result[2*gtid + 1];
}
for (stride=1; stride<blockDim.x; stride>>=1) {
__syncthreads();
if (ltid % (stride*2) == 0){
if ( (2*gtid) < npts ){
temp[2*ltid] += temp[2*ltid + stride];
index = (int)(start + gtid/stride);
time_avg[index] = (float)( temp[2*ltid]/(2.0*stride) );
}
}
start += npts/(2*stride);
}
__syncthreads();
if (ltid == 0)
{
atomicAdd(mean, temp[0]);
}
__syncthreads();
printf("%f\n", *mean);
}
Launch configuration is 40 blocks, 512 threads. Data set is ~40k samples.
In my main code, I call cudaGetLastError() after the kernel call and it returns no error. Memory allocations and memory copies return no errors. If I write cudaDeviceSynchronize() (or a cudaMemcpy to check for the value of mean) after the kernel call, the program hangs completely after the kernel call. If I remove it, program runs and exits. In neither case, do I get the outputs "here here" or the mean value printed. I understand that unless the kernel executes successfully, the printf's won't print.
Has this got to do with __syncthreads() in a recursion? All threads will go till the same depth so I think that checks out.
What is the problem here?
Thank you!
A kernel call is asynchronous. If the kernel starts successfully, your host code will continue to run and you will see no error. Errors that happen during the kernel run appear only after you do an explicit synchronization or call a function that causes an implicit synchronization.
If your host hangs on synchronization, then your kernel probably didn't finish running: it is either stuck in an infinite loop or waiting on some __syncthreads() or other synchronization primitive.
Your code seems to contain an infinite loop: for (stride=1; stride<blockDim.x; stride>>=1). You probably want to shift the stride left not right: stride<<=1.
You mentioned recursion but your code contains only one __global__ function, there are no recursive calls.
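The two places where errors can surface can be sketched as follows. The faulty kernel is a hypothetical example that deliberately writes through a null pointer; it is not the kernel from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that faults at run time by writing through a null pointer.
__global__ void faulty(int *p)
{
    p[threadIdx.x] = 1;
}

int main()
{
    faulty<<<1, 32>>>(nullptr);

    // Reports problems with the launch itself (e.g. an invalid configuration).
    cudaError_t launchErr = cudaGetLastError();

    // Waits for the kernel and reports errors raised while it was running.
    cudaError_t runErr = cudaDeviceSynchronize();

    printf("after launch: %s\n", cudaGetErrorString(launchErr));
    printf("after sync:   %s\n", cudaGetErrorString(runErr));
    return 0;
}
```

The launch check typically reports success here and the fault only shows up at the synchronization point. A kernel that never terminates raises no error at all: the synchronizing call simply blocks, which matches the hang described in the question.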
Your kernel has an infinite loop. Replace the for loop with
for (stride=1; stride<blockDim.x; stride<<=1) {

How to measure the execution time of every block when using CUDA?

clock() is not accurate enough.
Use CUDA events to measure the time of kernels or other CUDA operations (memcpy etc.):
// Prepare
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
// Start record
cudaEventRecord(start, 0);
// Do something on GPU
MyKernel<<<dimGrid, dimBlock>>>(input_data, output_data);
// Stop event
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop); // that's our time!
// Clean up:
cudaEventDestroy(start);
cudaEventDestroy(stop);
See CUDA Programming Guide, section 3.2.7.6
How about using the clock() function in every CUDA thread to record start and end times? Store them in an array in such a way that you can work out from the array indices which thread started and stopped at which time, like the following:
__global__ void kclock(unsigned int *ts) {
unsigned int start_time = 0, stop_time = 0;
start_time = clock();
// Code we need to measure should go here.
stop_time = clock();
ts[(blockIdx.x * blockDim.x + threadIdx.x) * 2] = start_time;
ts[(blockIdx.x * blockDim.x + threadIdx.x) * 2 + 1] = stop_time;
}
Then use this array to find the minimum start time and the maximum stop time for the block you are considering. For example, you can calculate the range of indices in the time array that corresponds to block (0, 0) and use min/max over that range to calculate the execution time.
I think long long int clock64() is what you are looking for?
See the CUDA Programming Guide, C Language Extensions, section B.11.
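The two suggestions above can be combined in a short sketch: each thread records clock64() values, and the host takes the minimum start and maximum stop per block. The kernel name kclock64, the placeholder loop and the launch configuration are assumptions made for illustration.

```cuda
#include <algorithm>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread stores its start and stop cycle counts at ts[2*gid] and ts[2*gid+1].
__global__ void kclock64(long long *ts, float *out)
{
    const int gid = blockIdx.x * blockDim.x + threadIdx.x;
    const long long start = clock64();

    // Placeholder work standing in for the code to be measured.
    float x = (float)gid;
    for (int i = 0; i < 1000; ++i)
        x = x * 1.0001f + 1.0f;
    out[gid] = x;  // store the result so the loop is not optimized away

    const long long stop = clock64();
    ts[2 * gid]     = start;
    ts[2 * gid + 1] = stop;
}

int main()
{
    const int blocks = 4, threads = 128, total = blocks * threads;
    long long *dTs = nullptr;
    float *dOut = nullptr;
    cudaMalloc((void **)&dTs, 2 * total * sizeof(long long));
    cudaMalloc((void **)&dOut, total * sizeof(float));

    kclock64<<<blocks, threads>>>(dTs, dOut);

    // cudaMemcpy waits for the kernel to finish before copying.
    std::vector<long long> ts(2 * total);
    cudaMemcpy(ts.data(), dTs, 2 * total * sizeof(long long), cudaMemcpyDeviceToHost);

    for (int b = 0; b < blocks; ++b) {
        long long first = ts[2 * (b * threads)];
        long long last  = ts[2 * (b * threads) + 1];
        for (int t = 1; t < threads; ++t) {
            const int idx = b * threads + t;
            first = std::min(first, ts[2 * idx]);      // earliest start in the block
            last  = std::max(last,  ts[2 * idx + 1]);  // latest stop in the block
        }
        printf("block %d: %lld cycles\n", b, last - first);
    }

    cudaFree(dTs);
    cudaFree(dOut);
    return 0;
}
```

The counter is per multiprocessor, so differences are only meaningful between threads of the same block, which is why the reduction is done block by block. The result is in clock cycles, not seconds, and the compiler may still move the clock64() reads relative to the measured code, so treat the numbers as approximate.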