Strategies for timing CUDA Kernels: Pros and Cons? - cuda

When timing CUDA kernels, the following doesn't work because the kernel doesn't block the CPU program execution while it executes:
start timer
kernel<<<g,b>>>();
end timer
I've seen three basic ways of (successfully) timing CUDA kernels:
(1) Two CUDA eventRecords.
float responseTime; //result will be in milliseconds
cudaEvent_t start; cudaEventCreate(&start); cudaEventRecord(start); cudaEventSynchronize(start);
cudaEvent_t stop; cudaEventCreate(&stop);
kernel<<<g,b>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);
cudaEventElapsedTime(&responseTime, start, stop); //responseTime = elapsed time
(2) One CUDA eventRecord.
float start = read_timer(); //helper function on CPU, in milliseconds
cudaEvent_t stop; cudaEventCreate(&stop);
kernel<<<g,b>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);
float responseTime = read_timer() - start;
(3) cudaDeviceSynchronize instead of an eventRecord. (Probably only useful when programming in a single stream.)
float start = read_timer(); //helper function on CPU, in milliseconds
kernel<<<g,b>>>();
cudaDeviceSynchronize();
float responseTime = read_timer() - start;
I experimentally verified that these three strategies produce the same timing result.
Questions:
What are the tradeoffs of these strategies? Any hidden details here?
Aside from timing many kernels in multiple streams, are there any advantages to using two event records and the cudaEventElapsedTime() function?
You can probably use your imagination to figure out what read_timer() does. Nevertheless, it can't hurt to provide an example implementation:
double read_timer(){
    struct timeval start;
    gettimeofday( &start, NULL ); //you need to include <sys/time.h>
    return (double)((start.tv_sec) + 1.0e-6 * (start.tv_usec)) * 1000; //milliseconds
}

You seem to have ruled out most of the differences yourself, by saying they all produce the same result for the relatively simple case you have shown (probably not exactly true, but I understand what you mean), and by asking "Aside from timing (complex sequences) ...", where the first case is clearly better.
One possible difference would be portability between Windows and Linux. I believe your example read_timer function is Linux-oriented. You could probably craft a read_timer function that is "portable", but the CUDA event system (method 1) is portable as-is.
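For what it's worth, here is one possible "portable" read_timer, a minimal sketch using C++11 <chrono> instead of gettimeofday() (the static time-zero reference is my own choice; only differences of its return values are meaningful, which is all the examples above need):
double read_timer() {
    //portable millisecond timer; you need to include <chrono>
    using clock = std::chrono::high_resolution_clock;
    static const clock::time_point t0 = clock::now(); //first call defines time zero
    return std::chrono::duration<double, std::milli>(clock::now() - t0).count();
}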

Option (1) uses cudaEventRecord to time the CPU. This is highly inefficient, and I would discourage using cudaEventRecord for this purpose. cudaEventRecord can be used to time the GPU push-buffer time to execute a kernel as follows:
float responseTime; //result will be in milliseconds
cudaEvent_t start;
cudaEvent_t stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
kernel<<<g,b>>>();
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&responseTime, start, stop); //responseTime = elapsed time
The code needs to be changed slightly if you submit multiple items of work to multiple streams. I would recommend reading the answer to "Difference in time reported by NVVP and counters".
Options (2) and (3) are similar for the given example; option (2) can be more flexible.
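For reference, a hedged sketch of what the multi-stream variant of method (1) might look like; the stream handles and kernelA/kernelB below are placeholders of my own, not from the question:
cudaStream_t s0, s1;
cudaStreamCreate(&s0); cudaStreamCreate(&s1);
cudaEvent_t start0, stop0, start1, stop1;
cudaEventCreate(&start0); cudaEventCreate(&stop0);
cudaEventCreate(&start1); cudaEventCreate(&stop1);
cudaEventRecord(start0, s0); //events are recorded into a specific stream
kernelA<<<g, b, 0, s0>>>();
cudaEventRecord(stop0, s0);
cudaEventRecord(start1, s1);
kernelB<<<g, b, 0, s1>>>();
cudaEventRecord(stop1, s1);
cudaEventSynchronize(stop0); cudaEventSynchronize(stop1);
float ms0, ms1;
cudaEventElapsedTime(&ms0, start0, stop0); //time of kernelA in stream s0
cudaEventElapsedTime(&ms1, start1, stop1); //time of kernelB in stream s1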

Related

Is there some in-code profiling of CUDA program

In the OpenCL world there is the function clGetEventProfilingInfo, which returns all profiling info of an event, like queued, submitted, start, and end times in nanoseconds. It is quite convenient because I'm able to printf that info whenever I want.
For example, with PyOpenCL it is possible to write code like this:
profile = event.profile
print("%gs + %gs" % (1e-9*(profile.end - profile.start), 1e-9*(profile.start - profile.queued)))
which is quite informative for my task.
Is it possible to get such information in code, instead of using an external profiling tool like nvprof and company?
For quick, lightweight timing, you may want to have a look at the cudaEvent API.
Excerpt from the link above:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
cudaEventRecord(start);
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop);
cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Elapsed time: %f ms\n", milliseconds);
If you want a more full-featured profiling library, you should look at CUPTI.
There is no tool other than nvprof that can collect profiling data so far. However, you can harness nvprof from your code; take a look at this NVIDIA document.
You can use cuProfilerStart() and cuProfilerStop() to probe just a part of your code.
They are inside cuda_profiler_api.h.
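As a minimal sketch of scoping the profiler to one region (shown with the runtime-API names cudaProfilerStart()/cudaProfilerStop() from cuda_profiler_api.h; the driver-API cuProfilerStart()/cuProfilerStop() work the same way, and the nvprof flag shown defers collection until the start call):
#include <cuda_profiler_api.h>
__global__ void kernel() { } //placeholder kernel, just for illustration
int main() {
    kernel<<<1, 1>>>(); //warm-up / setup you do not want profiled
    cudaDeviceSynchronize();
    cudaProfilerStart(); //profiling window begins here
    kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    cudaProfilerStop(); //and ends here
    return 0;
}
//run with, e.g.: nvprof --profile-from-start off ./app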

How to measure overhead of a kernel launch in CUDA

I want to measure the overhead of a kernel launch in CUDA.
I understand that there are various parameters which affect this overhead. I am interested in the following:
number of threads created
size of data being copied
I am doing this mainly to measure the advantage of using managed memory, which was introduced in CUDA 6.0. I will update this question with the code I develop and with input from the comments. Thanks!
How to measure the overhead of a kernel launch in CUDA is dealt with in Section 6.1.1 of the "CUDA Handbook" by N. Wilt. The basic idea is to launch an empty kernel. Here is a sample code snippet:
#include <stdio.h>
__global__ void EmptyKernel() { }
int main() {
    const int N = 100000;
    float time, cumulative_time = 0.f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    for (int i = 0; i < N; i++) {
        cudaEventRecord(start, 0);
        EmptyKernel<<<1,1>>>();
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&time, start, stop);
        cumulative_time = cumulative_time + time;
    }
    printf("Kernel launch overhead time: %3.5f ms \n", cumulative_time / N);
    return 0;
}
On my laptop GeForce GT540M card, the kernel launch overhead is 0.00245ms.
If you want to check the dependence of this time on the number of threads launched, just change the kernel launch configuration <<<*,*>>>, as in the sketch below. It appears that the timing does not change significantly with the number of threads launched, which is consistent with the book's statement that most of that time is spent in the driver.
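For example, a hypothetical variation of the loop above that sweeps the block size (a drop-in for the same main(), reusing start, stop and time; the sweep range is my own choice):
for (int threads = 32; threads <= 1024; threads *= 2) {
    float per_launch = 0.f;
    for (int i = 0; i < N; i++) {
        cudaEventRecord(start, 0);
        EmptyKernel<<<1, threads>>>();
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&time, start, stop);
        per_launch += time;
    }
    printf("threads per block = %4d  launch overhead = %3.5f ms\n", threads, per_launch / N);
}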
Perhaps you will also be interested in these test results from the University of Virginia:
Memory transfer overhead: http://www.cs.virginia.edu/~mwb7w/cuda_support/memory_transfer_overhead.html
Kernel launch overhead: http://www.cs.virginia.edu/~mwb7w/cuda_support/kernel_overhead.html
They were measured in a way similar to JackOLantern's proposal.

Do CUDA events time cudaMalloc and cudaMemcpy executions?

I am using the following code to time calls to cudaMalloc(). I am curious: do CUDA events only time our kernels, or do they also time the "in-built kernels"? In other words, is the following method for timing cudaMalloc() valid?
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
for (int t = 0; t < 100; t++) {
    float* test;
    cudaMalloc((void**)&test, 3000000 * sizeof(float));
    cudaFree(test);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
printf("time elapsed on the GPU: %f ms", elapsedTime/100);
cu(da)EventRecord() does nothing more than submit a command to the GPU that tells the GPU to write a timestamp when the GPU processes the command. The timestamp is just an onboard high-resolution counter. So CUDA events are most useful when used as an asynchronous mechanism for timing on-GPU events, like how long a specific kernel takes to run. CUDA memory management mostly happens on the CPU, so CUDA events are not ideal for timing CUDA allocation and free operations.
In short: You're better off using CPU-based timing like gettimeofday().
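A minimal sketch of that CPU-side alternative, assuming C++11 <chrono> (the loop mirrors the question's code; cudaDeviceSynchronize() is added so no asynchronous work is left pending when the clock is read):
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
int main() {
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int t = 0; t < 100; t++) {
        float* test;
        cudaMalloc((void**)&test, 3000000 * sizeof(float));
        cudaFree(test);
    }
    cudaDeviceSynchronize(); //make sure nothing is still pending
    auto t1 = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("average cudaMalloc+cudaFree time: %f ms\n", ms / 100);
    return 0;
}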

Time to be used in calculating bandwidth

I am trying to find the effective bandwidth used by my code against the GeForce 8800 GTX maximum of 86 GB/s. I am not sure what time to use, though. Currently I am using the difference between calling the kernel with my instructions and calling the kernel with no instructions. Is this the correct approach? (The formula I use is: effective BW = (bytes read + bytes written) / time.)
Also, I get a really bad kernel call overhead (close to 1 sec). Is there a way to get rid of it?
You can time your kernel fairly precisely with CUDA events.
//declare the events
cudaEvent_t start;
cudaEvent_t stop;
float kernel_time;
//create events before you use them
cudaEventCreate(&start);
cudaEventCreate(&stop);
//put events and kernel launches in the stream/queue
cudaEventRecord(start,0);
myKernel <<< config >>>( );
cudaEventRecord(stop,0);
//wait until the stop event is recorded
cudaEventSynchronize(stop);
//and get the elapsed time
cudaEventElapsedTime(&kernel_time,start,stop);
//cleanup
cudaEventDestroy(start);
cudaEventDestroy(stop);
Effective bandwidth in GB/s = ( (Br + Bw) / 10^9 ) / Time
Br = number of bytes read by kernel from DRAM
Bw = number of bytes written by kernel in DRAM
Time = time taken by kernel.
For example, suppose you test the effective bandwidth of copying a 2048x2048 matrix of floats (4 bytes each) from one location to another in the GPU's DRAM. The formula would be:
Bandwidth in GB/s = ( (2048 x 2048 x 4 x 2) / 10^9 ) / time-taken-by-kernel
here:
2048x2048 (matrix elements)
4 (each element has 4 bytes)
2 (one for the read and one for the write)
/10^9 to convert bytes into GB.
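In code, continuing the event-timing snippet above and using its kernel_time (a sketch; the byte counts match the 2048x2048 example):
double bytes   = 2048.0 * 2048.0 * sizeof(float) * 2.0; //Br + Bw: each element read once and written once
double seconds = kernel_time / 1000.0; //cudaEventElapsedTime reports milliseconds
double bandwidth_GBps = (bytes / 1e9) / seconds;
printf("Effective bandwidth: %.2f GB/s\n", bandwidth_GBps);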

Error recording time with CudaEvent

I am using the cudaEvent methods to find the time my kernel takes to execute. Here is the code, as given in the manual:
cudaEvent_t start,stop;
float time=0;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start,0);
subsampler<<<gridSize,blockSize>>>(img_redd,img_greend,img_blued,img_height,img_width,final_device_r,final_device_g,final_device_b);
cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time,start,stop);
Now when I run this and try to see the time, it comes out as something like 52428800.0000 (the values differ but are of this order). I know it is in milliseconds, but this is still a huge number, especially since the program execution doesn't take more than a minute. Can someone point out why this is happening? I really need to find out how much time the kernel takes to execute.
You should check the return values of each of those CUDA calls. At the very least call cudaGetLastError() at the end to check everything was successful.
If you get an error during the kernel execution then try running your app with cuda-memcheck, especially if you have an Unspecified Launch Failure, to check for illegal memory accesses.
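For example, a minimal error-checking sketch (the CUDA_CHECK macro name is my own, not from any particular library):
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA error %s at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(1); \
        } \
    } while (0)
//usage around the timing code from the question:
//  CUDA_CHECK(cudaEventRecord(start, 0));
//  subsampler<<<gridSize, blockSize>>>(...);
//  CUDA_CHECK(cudaGetLastError()); //catches launch-configuration errors
//  CUDA_CHECK(cudaEventRecord(stop, 0));
//  CUDA_CHECK(cudaEventSynchronize(stop));
//  CUDA_CHECK(cudaEventElapsedTime(&time, start, stop));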