How to measure the inner kernel time in NVIDIA CUDA? - cuda

I want to measure time inside a GPU kernel. How can I measure it in NVIDIA CUDA?
e.g.
__global__ void kernelSample()
{
    some code here
    get start time
    some code here
    get stop time
    some code here
}

You can do something like this:
__global__ void kernelSample(int *runtime)
{
    int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    // ....
    clock_t start_time = clock();
    // some code here
    clock_t stop_time = clock();
    // ....
    runtime[tidx] = (int)(stop_time - start_time);
}
This gives the number of clock cycles between the two calls. Be a little careful though: the counter will overflow after a couple of seconds, so make sure the duration of the code between successive calls is quite short. You should also be aware that the compiler and assembler perform instruction reordering, so you may want to check that the clock calls don't wind up next to each other in the SASS output (use cuobjdump to check).
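For reference, here is a minimal host-side sketch of reading those counters back and converting cycles to time. The single-block, 256-thread launch is an illustrative assumption; cudaDeviceProp::clockRate is reported in kHz, i.e. cycles per millisecond.
// Minimal host-side sketch (assumes kernelSample from above).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int nthreads = 256;            // illustrative launch size
    int *d_runtime, h_runtime[nthreads];
    cudaMalloc(&d_runtime, nthreads * sizeof(int));

    kernelSample<<<1, nthreads>>>(d_runtime);
    cudaDeviceSynchronize();
    cudaMemcpy(h_runtime, d_runtime, nthreads * sizeof(int),
               cudaMemcpyDeviceToHost);

    // clockRate is in kHz, so cycles / clockRate gives milliseconds.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("thread 0: %d cycles (~%.3f ms)\n",
           h_runtime[0], (float)h_runtime[0] / prop.clockRate);

    cudaFree(d_runtime);
    return 0;
}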

Try this; it measures the time between two events in milliseconds.
cudaEvent_t start, stop;
float elapsedTime;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
// Do kernel activity here
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);
printf("Elapsed time : %f ms\n", elapsedTime);

Related

What are the possibilities to profile particular __device__ function within CUDA kernel? [duplicate]


Using clock() function in CUDA

I have a simple kernel which I am timing using clock().
I got to know about this function in How to measure the inner kernel time in NVIDIA CUDA?
So I have used
clock_t start = clock(); (and similarly stop) to time it. On compilation, I get the following error:
tex1.cu(14): error: expression preceding parentheses of apparent call must have (pointer-to-) function type
Am I missing a header file, or a compiler option?
Also, I tried using CUDA timers (cudaEvent_t start, stop;) but the elapsed time I get is 0 ms. I create start and stop, record start, do some CUDA stuff, synchronize, record stop, event synchronize and measure elapsed time. This part compiles fine but gives me elapsed time as zero.
It is a simple kernel that I am using to test my understanding of texture memory.
The Kernel:
__global__ void magic(float *mean, int *clock){
    int i, tid = threadIdx.x + blockIdx.x * blockDim.x;
    float t, sum = 0.0;
    clock_t start = clock();
    if ( tid < dimy )
    {
        for(i = 0; i < dimx; i++){
            t = tex2D( input, i, tid );
            sum = sum + t*t;
        }
        clock_t stop = clock();
        clock[tid] = (int)(stop - start);
    }
}
In your kernel, don't name your kernel parameter clock: this confuses the compiler, because you now have a variable named clock shadowing the function named clock. Instead do this:
__global__ void magic(float *mean, int *myclock){
    ...
    myclock[tid] = (int)(stop - start);
}
If you make that change, the error about the expression preceding parentheses will go away.
It's odd that, when asked whether you had any other variables named clock or start, you answered no, because you have both.
If you would like help with your usage of CUDA events, please post the actual code you are using for timing. Are you doing error checking on all CUDA API calls and kernel launches?
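For reference, a common pattern for that kind of error checking looks roughly like this; the macro name is illustrative, not part of the CUDA API.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Report and abort on any failed CUDA runtime call.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Kernel launches return no status, so check them separately:
//   magic<<<grid, block>>>(d_mean, d_myclock);
//   CUDA_CHECK(cudaGetLastError());       // launch-time errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // execution-time errors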

Equivalent of usleep() in CUDA kernel?

I'd like to call something like usleep() inside a CUDA kernel. The basic goal is to make all GPU cores sleep or busy-wait for a number of milliseconds; it's part of some sanity checks that I want to do for a CUDA application. My attempt at doing this is below:
#include <unistd.h>
#include <stdio.h>
#include <cuda.h>
#include <sys/time.h>
__global__ void gpu_uSleep(useconds_t wait_time_in_ms)
{
    usleep(wait_time_in_ms);
}
int main(void)
{
    //input parameters -- arbitrary
    // TODO: set these exactly for full occupancy
    int m = 16;
    int n = 16;
    int block1D = 16;
    dim3 block(block1D, block1D);
    dim3 grid(m/block1D, n/block1D);
    useconds_t wait_time_in_ms = 1000;
    //execute the kernel
    gpu_uSleep<<< grid, block >>>(wait_time_in_ms);
    cudaDeviceSynchronize();
    return 0;
}
I get the following error when I try to compile this using NVCC:
error: calling a host function("usleep") from a __device__/__global__
function("gpu_uSleep") is not allowed
Clearly, I'm not allowed to use a host function such as usleep() inside a kernel. What would be a good alternative to this?
You can spin on clock() or clock64(). The CUDA SDK concurrentKernels sample does the following:
__global__ void clock_block(clock_t *d_o, clock_t clock_count)
{
    clock_t start_clock = clock();
    clock_t clock_offset = 0;
    while (clock_offset < clock_count)
    {
        clock_offset = clock() - start_clock;
    }
    d_o[0] = clock_offset;
}
I recommend using clock64(). clock() and clock64() count cycles, so you will have to query the clock frequency using cudaGetDeviceProperties(). The frequency can change dynamically, so it is hard to guarantee an accurate spin loop.
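A hedged sketch of driving clock_block from the host that way; the one-thread launch on device 0 and the 1000 ms wait are assumptions, and clockRate (kHz) equals cycles per millisecond:
// Spin for roughly wait_ms milliseconds.
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int wait_ms = 1000;
clock_t clock_count = (clock_t)prop.clockRate * wait_ms;

clock_t *d_o;
cudaMalloc(&d_o, sizeof(clock_t));
clock_block<<<1, 1>>>(d_o, clock_count);
cudaDeviceSynchronize();
cudaFree(d_o);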
You can busy wait with a loop that reads clock().
To wait for at least 10,000 clock cycles:
clock_t start = clock();
clock_t now;
for (;;) {
    now = clock();
    clock_t cycles = now > start ? now - start : now + (0xffffffff - start);
    if (cycles >= 10000) {
        break;
    }
}
// Store "now" in global memory here to prevent the
// compiler from optimizing away the entire loop.
*global_now = now;
Note: this is untested. The code that handles overflows was borrowed from this answer by @Pedro. See his answer and section B.10 of the CUDA C Programming Guide 4.2 for details on how clock() works. There is also a clock64() function.
With recent versions of CUDA, and a device with Compute Capability 7.0 or later (Volta, Turing, Ampere etc.), you can use the __nanosleep() primitive:
void __nanosleep(unsigned ns);
which obviates the need for busy-sleeping as suggested in older answers.
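A minimal sketch of a sleep kernel built on it; the millisecond slicing is an assumption (__nanosleep may sleep for less than requested, so looping is the usual pattern):
// Requires compute capability 7.0+ (compile with -arch=sm_70 or newer).
__global__ void gpu_sleep(unsigned int wait_ms)
{
    // __nanosleep takes nanoseconds and sleeps *approximately* that
    // long, so issue it in ~1 ms slices until the total is reached.
    for (unsigned int i = 0; i < wait_ms; ++i)
        __nanosleep(1000000U);
}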
The best way to "sleep the cores" is to have the kernel return to the CPU, and then have the CPU launch a second kernel (or the same kernel again). This keeps the GPU from having to spin and overheat.
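A sketch of that approach, with hypothetical kernels work_part1/work_part2 standing in for the two halves of the real work:
// Host-side pause between two kernel launches; nothing spins on the GPU.
work_part1<<<grid, block>>>(args);
cudaDeviceSynchronize();      // wait for the first half to finish
usleep(1000 * wait_time_ms);  // sleep on the host (usleep takes microseconds)
work_part2<<<grid, block>>>(args);
cudaDeviceSynchronize();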

CUDA GPU slower than CPU

I am having trouble figuring out why my CUDA code runs slower than my CPU code.
My desktop configuration is an i7 2600S with a GeForce 560 Ti,
and my code is as follows:
int** kernel_shiftSeam(int **MCEnergyMat, int **newE, int *seam, int width, int height, int direction)
{
    //time measurement
    float elapsed_time_ms = 0;
    cudaEvent_t start, stop;
    //threads per block
    dim3 threads(16,16);
    //blocks
    dim3 blocks((width+threads.x-1)/threads.x, (height+threads.y-1)/threads.y);
    int *device_Seam;
    int *host_Seam;
    int seamSize;
    if(direction == 1)
    {
        seamSize = height*sizeof(int);
        host_Seam = (int*)malloc(seamSize);
        for(int i=0;i<height;i++)
            host_Seam[i] = seam[i];
    }
    else
    {
        seamSize = width*sizeof(int);
        host_Seam = (int*)malloc(seamSize);
        for(int i=0;i<width;i++)
            host_Seam[i] = seam[i];
    }
    cudaMalloc((void**)&device_Seam, seamSize);
    cudaMemcpy(device_Seam, host_Seam, seamSize, cudaMemcpyHostToDevice);
    global_host_MC = MCEnergyMat;
    new_host_MC = newE;
    //copy host array to device
    cudaMemcpy(global_MC, global_MC2, sizeof(int*)*width, cudaMemcpyHostToDevice);
    for(int i=0;i<width;i++)
        cudaMemcpy(global_MC2[i], global_host_MC[i], sizeof(int)*height, cudaMemcpyHostToDevice);
    cudaMemcpy(new_MC, new_MC2, sizeof(int*)*width, cudaMemcpyHostToDevice);
    for(int i=0;i<width;i++)
        cudaMemcpy(new_MC2[i], new_host_MC[i], sizeof(int)*height, cudaMemcpyHostToDevice);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    //do some operations on the 2d matrix
    gpu_shiftSeam<<< blocks,threads >>>(global_MC, new_MC, device_Seam, width, height);
    //measure end time for gpu calculations
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_ms, start, stop);
    execTime += elapsed_time_ms;
    //copy the data back to host (RESULT)
    for(int i=0;i<width;i++)
    {
        cudaMemcpy(newE[i], new_MC2[i], sizeof(int)*height, cudaMemcpyDeviceToHost);
    }
    return newE;
}
I looped it 800 times and got the following results:
GPU
Computation Time (the gpu_shiftseam part) : 1176ms
Total program run time: 22s
CPU
Computation Time (same operation as gpu_shiftseam but on host) : 12522ms
Total program run time: 12s
Apparently the GPU computation time is way shorter than the one on the CPU, but for some reason the total program run time for the GPU is a lot longer. Does anyone know why? Is it because the number of threads/blocks I am assigning is incorrect, or is the "slowness" coming from allocating memory on the device?
Thanks a lot!
In my experience, memory accesses are the #1 reason for slowness.
Profile your array copies to see how much time is being spent. If it is a considerable amount, perhaps try optimizing your code. Instead of copying inside a for-loop, see if you can copy sizeof(int) * height * width bytes directly. Reducing the number of times you call memcpy should help.
cudaMemcpy(global_MC, global_MC2, sizeof(int*)*width, cudaMemcpyHostToDevice);
cudaMemcpy(global_MC2, global_host_MC, sizeof(int)*height*width,cudaMemcpyHostToDevice);
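That single copy assumes the matrix occupies one contiguous allocation rather than an array of row pointers. A hedged sketch of the flattened layout (the variable names are illustrative):
// Flatten the width x height matrix so one cudaMemcpy replaces the
// per-row loop; element (i, j) lives at index i * height + j.
int *h_flat = (int*)malloc(sizeof(int) * width * height);
for (int i = 0; i < width; i++)
    memcpy(h_flat + i * height, global_host_MC[i], sizeof(int) * height);

int *d_flat;
cudaMalloc((void**)&d_flat, sizeof(int) * width * height);
cudaMemcpy(d_flat, h_flat, sizeof(int) * width * height,
           cudaMemcpyHostToDevice);
// Inside the kernel, read element (i, j) as d_flat[i * height + j].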
I had a similar experience and found that cudaMalloc was the bottleneck while cudaMemcpy wasn't. On my device, I remember that a 16 MB allocation took 160 ms. CUDA memory allocation, however, can be done before the actual computation, for example in a separate function call. The allocation time can thus be excluded from the overall performance measure (e.g., the speedup), although I would still include the cudaMemcpy operations in the speedup calculation.
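A sketch of that split, with illustrative function names; the point is only that cudaMalloc/cudaFree happen once, outside the timed loop:
int *d_buf = NULL;

void setup(size_t bytes)      // once, before the timed loop
{
    cudaMalloc((void**)&d_buf, bytes);
}

void run_iteration(const int *h_in, size_t bytes)   // timed part
{
    cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice);
    // ... launch the kernel and copy results back ...
}

void teardown(void)           // once, after the timed loop
{
    cudaFree(d_buf);
}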

How to measure the execution time of every block when using CUDA?

clock() is not accurate enough.
Use CUDA events to measure the time of kernels or CUDA operations (memcpy etc.):
// Prepare
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
// Start record
cudaEventRecord(start, 0);
// Do something on GPU
MyKernel<<<dimGrid, dimBlock>>>(input_data, output_data);
// Stop event
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop); // that's our time!
// Clean up:
cudaEventDestroy(start);
cudaEventDestroy(stop);
See CUDA Programming Guide, section 3.2.7.6
How about using the clock() function in every CUDA thread to record start and stop times, storing them in an array so that you can figure out which thread started/stopped at which time based on the array indices, like the following:
__global__ void kclock(unsigned int *ts) {
    unsigned int start_time = 0, stop_time = 0;
    start_time = clock();
    // Code we need to measure should go here.
    stop_time = clock();
    ts[(blockIdx.x * blockDim.x + threadIdx.x) * 2] = start_time;
    ts[(blockIdx.x * blockDim.x + threadIdx.x) * 2 + 1] = stop_time;
}
Then use this array to figure out the minimal start time and maximal stop time for the block you are considering. For example, you can compute the range of indices in the time array that corresponds to block (0, 0) and use min/max to calculate its execution time.
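For instance, a host-side sketch of that reduction; threads_per_block and the two-entries-per-thread layout match the kernel above, and since a block runs on a single multiprocessor, its clock() values are comparable:
// Execution time of block b, in clock cycles.
unsigned int block_time(const unsigned int *ts, int b, int threads_per_block)
{
    unsigned int min_start = ~0u, max_stop = 0;
    for (int t = 0; t < threads_per_block; t++) {
        int idx = (b * threads_per_block + t) * 2;
        if (ts[idx] < min_start)     min_start = ts[idx];
        if (ts[idx + 1] > max_stop)  max_stop  = ts[idx + 1];
    }
    return max_stop - min_start;   // cycles, not wall-clock time
}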
I think long long int clock64() is what you are looking for.
See the CUDA Programming Guide, C Language Extensions, section B.11.
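A sketch of the same timestamp kernel using 64-bit counters, which avoids the 32-bit clock() wrap-around:
__global__ void kclock64(long long int *ts)
{
    long long int start_time = clock64();
    // Code we need to measure should go here.
    long long int stop_time = clock64();
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    ts[tid * 2]     = start_time;
    ts[tid * 2 + 1] = stop_time;
}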