How to measure the overhead of a kernel launch in CUDA

I want to measure the overhead of a kernel launch in CUDA.
I understand that there are various parameters which affect this overhead. I am interested in the following:
number of threads created
size of data being copied
I am doing this mainly to measure the advantage of using managed memory, which was introduced in CUDA 6.0. I will update this question with the code I develop based on the comments. Thanks!

Measuring the overhead of a kernel launch in CUDA is dealt with in Section 6.1.1 of "The CUDA Handbook" by N. Wilt. The basic idea is to launch an empty kernel. Here is a sample code snippet:
#include <stdio.h>

__global__ void EmptyKernel() { }

int main() {
    const int N = 100000;

    float time, cumulative_time = 0.f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < N; i++) {
        cudaEventRecord(start, 0);
        EmptyKernel<<<1,1>>>();
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&time, start, stop);
        cumulative_time = cumulative_time + time;
    }

    printf("Kernel launch overhead time: %3.5f ms \n", cumulative_time / N);
    return 0;
}
On my laptop's GeForce GT540M, the kernel launch overhead is 0.00245 ms.
If you want to check how this time depends on the number of threads launched, just change the kernel launch configuration <<<*,*>>>. The timing does not change significantly with the number of threads launched, which is consistent with the book's statement that most of that time is spent in the driver.
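For example, a minimal sketch of such a sweep (a hypothetical variation of the snippet above; averaging over many launches, as above, would give cleaner numbers):
#include <stdio.h>

__global__ void EmptyKernel() { }

int main() {
    float time;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    // Sweep the number of threads per block; the launch time should stay roughly flat.
    for (int threads = 32; threads <= 1024; threads *= 2) {
        cudaEventRecord(start, 0);
        EmptyKernel<<<1, threads>>>();
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&time, start, stop);
        printf("%4d threads: %3.5f ms\n", threads, time);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}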

You may also be interested in these test results from the University of Virginia:
Memory transfer overhead: http://www.cs.virginia.edu/~mwb7w/cuda_support/memory_transfer_overhead.html
Kernel launch overhead: http://www.cs.virginia.edu/~mwb7w/cuda_support/kernel_overhead.html
They were measured in a similar way to JackOLantern's proposal.

Related

Is there some in-code profiling of a CUDA program?

In the OpenCL world there is the function clGetEventProfilingInfo, which returns all profiling info for an event, such as its queued, submitted, start, and end times in nanoseconds. It is quite convenient because I'm able to printf that info whenever I want.
For example, with PyOpenCL it is possible to write code like this:
profile = event.profile
print("%gs + %gs" % (1e-9*(profile.end - profile.start), 1e-9*(profile.start - profile.queued)))
which is quite informative for my task.
Is it possible to get such information in code instead of using external profiling tools like nvprof and company?
For quick, lightweight timing, you may want to have a look at the cudaEvent API.
Excerpt from the link above:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
cudaEventRecord(start);
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop);
cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Elapsed time: %f ms\n", milliseconds);
If you want a more full-featured profiling library, you should look at CUPTI.
So far there is no tool other than nvprof that can collect profiling data. However, you can harness nvprof from your code. Take a look at this NVIDIA document.
You can use cuProfilerStart() and cuProfilerStop() (declared in cudaProfiler.h), or their runtime API counterparts cudaProfilerStart() and cudaProfilerStop() from cuda_profiler_api.h, to profile just a part of your code.
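A minimal sketch of how the region-based calls are typically used (this assumes the runtime API names; MyKernel is just a stand-in, and the program is run under nvprof --profile-from-start off so only the bracketed region is profiled):
#include <stdio.h>
#include <cuda_profiler_api.h>

__global__ void MyKernel() { }

int main() {
    MyKernel<<<1, 1>>>();            // warm-up launch, outside the profiled region
    cudaDeviceSynchronize();

    cudaProfilerStart();             // profiling of the region of interest starts here
    MyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    cudaProfilerStop();              // and ends here
    return 0;
}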

How to avoid a memcpy if the number of blocks depends on a device variable?

I am computing a number, X, on the device. Now I need to launch a kernel with X threads. I can set the blockSize to 1024. Is there a way to set the number of blocks to ceil(X / 1024) without performing a memcpy?
I see two possibilities:
Use dynamic parallelism (if feasible). Rather than copying the result back to determine the execution parameters of the next launch, just have the device perform the next launch itself (see the sketch after this list).
Use zero-copy or managed memory. In that case the GPU writes directly to CPU memory over the PCI-e bus, rather than requiring an explicit memory transfer.
Of those options, dynamic parallelism and managed memory require hardware features which are not available on all GPUs. Zero-copy memory is supported by all GPUs with compute capability >= 1.1, which in practice is just about every CUDA-compatible device ever made.
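A minimal sketch of the dynamic parallelism route (assuming compute capability >= 3.5 and compilation with -rdc=true; the kernel names and the computed value are illustrative):
#include <stdio.h>

__global__ void child(int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // ... work on element i ...
    }
}

__global__ void parent() {
    if (threadIdx.x == 0) {
        int X = 100000;                      // stand-in for the value computed on the device
        int blocks = (X + 1023) / 1024;      // ceil(X / 1024) without any host round trip
        child<<<blocks, 1024>>>(X);          // launched directly from the device
    }
}

int main() {
    parent<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
Something like nvcc -arch=sm_35 -rdc=true -lcudadevrt is needed to compile a device-side launch like this.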
Here is an example of using managed memory, as outlined by @talonmies, that allows kernel1 to determine the number of blocks for kernel2 without an explicit memcpy:
#include <stdio.h>
#include <cuda.h>

__device__ __managed__ int kernel2_blocks;

__global__ void kernel1() {
    if (threadIdx.x == 0) {
        kernel2_blocks = 42;   // computed on the device, visible to the host after sync
    }
}

__global__ void kernel2() {
    if (threadIdx.x == 0) {
        printf("block: %d\n", blockIdx.x);
    }
}

int main() {
    kernel1<<<1, 1024>>>();
    cudaDeviceSynchronize();               // makes the managed variable coherent on the host

    kernel2<<<kernel2_blocks, 1024>>>();   // launch configuration read directly from managed memory
    cudaDeviceSynchronize();
    return 0;
}
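For completeness, here is a similar sketch using zero-copy (mapped pinned) memory instead of managed memory; the variable names are illustrative:
#include <stdio.h>

__global__ void kernel1(int *blocks) {
    if (threadIdx.x == 0) {
        *blocks = 42;                        // the GPU writes straight into mapped host memory
    }
}

__global__ void kernel2() {
    if (threadIdx.x == 0) {
        printf("block: %d\n", blockIdx.x);
    }
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);   // must precede context creation
    int *h_blocks, *d_blocks;
    cudaHostAlloc((void **)&h_blocks, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_blocks, h_blocks, 0);

    kernel1<<<1, 1024>>>(d_blocks);
    cudaDeviceSynchronize();                 // after this, *h_blocks is valid on the host

    kernel2<<<*h_blocks, 1024>>>();
    cudaDeviceSynchronize();
    return 0;
}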

Can different calls to a kernel share memory?

In my code, I need to call a CUDA kernel to parallelize some matrix computation. However, this computation must be done iteratively ~60,000 times (the kernel is called inside a 60,000-iteration for loop).
That means that if I do a cudaMalloc/cudaMemcpy for every single call to the kernel, most of the time will be spent on memory allocation and transfer and I will get a significant slowdown.
Is there a way to, say, allocate a piece of memory before the for loop, use that memory in each iteration of the kernel, and then after the for loop copy that memory back from device to host?
Thanks.
Yes, you can do exactly what you describe:
int *h_data, *d_data;
cudaMalloc((void **)&d_data, DSIZE*sizeof(int));
h_data = (int *)malloc(DSIZE*sizeof(int));
// fill up h_data[] with data
cudaMemcpy(d_data, h_data, DSIZE*sizeof(int), cudaMemcpyHostToDevice);

for (int i = 0; i < 60000; i++)
    my_kernel<<<grid_dim, block_dim>>>(d_data);   // reuse the same device allocation every iteration

cudaMemcpy(h_data, d_data, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
...

Strategies for timing CUDA Kernels: Pros and Cons?

When timing CUDA kernels, the following doesn't work because a kernel launch is asynchronous and doesn't block CPU program execution while the kernel executes:
start timer
kernel<<<g,b>>>();
end timer
I've seen three basic ways of (successfully) timing CUDA kernels:
(1) Two CUDA eventRecords.
float responseTime; //result will be in milliseconds
cudaEvent_t start; cudaEventCreate(&start); cudaEventRecord(start); cudaEventSynchronize(start);
cudaEvent_t stop; cudaEventCreate(&stop);
kernel<<<g,b>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);
cudaEventElapsedTime(&responseTime, start, stop); //responseTime = elapsed time
(2) One CUDA eventRecord.
float start = read_timer(); //helper function on CPU, in milliseconds
cudaEvent_t stop; cudaEventCreate(&stop);
kernel<<<g,b>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);
float responseTime = read_timer() - start;
(3) cudaDeviceSynchronize instead of eventRecord. (Probably only useful when programming with a single stream.)
float start = read_timer(); //helper function on CPU, in milliseconds
kernel<<<g,b>>>();
cudaDeviceSynchronize();
float responseTime = read_timer() - start;
I experimentally verified that these three strategies produce the same timing result.
Questions:
What are the tradeoffs of these strategies? Any hidden details here?
Aside from timing many kernels in multiple streams, are there any advantages of using two event records and the cudaEventElapsedTime() function?
You can probably use your imagination to figure out what read_timer() does. Nevertheless, it can't hurt to provide an example implementation:
double read_timer() {
    struct timeval start;
    gettimeofday(&start, NULL);   // you need to include <sys/time.h>
    return (double)((start.tv_sec) + 1.0e-6 * (start.tv_usec)) * 1000;   // milliseconds
}
You seem to have ruled out most of the differences by saying they all produce the same result for the relatively simple case you have shown (probably not exactly true, but I understand what you mean), and "Aside from timing (complex sequences) ..." where the first case is clearly better.
One possible difference would be portability between Windows and Linux. I believe your example read_timer function is Linux-oriented. You could probably craft a read_timer function that is "portable", but the CUDA event system (method 1) is portable as-is.
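For example, a portable read_timer could be sketched with std::chrono (C++11), which behaves the same on Windows and Linux:
#include <chrono>

// Portable millisecond wall-clock timer, same role as the gettimeofday() version above.
double read_timer() {
    using namespace std::chrono;
    return duration<double, std::milli>(steady_clock::now().time_since_epoch()).count();
}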
Option (1) as written synchronizes on the start event before launching the kernel, so cudaEventRecord ends up timing CPU-side launch activity as well. This is highly inefficient and I would discourage using cudaEventRecord for that purpose. cudaEventRecord can instead be used to time the GPU push-buffer time to execute a kernel, as follows:
float responseTime; //result will be in milliseconds
cudaEvent_t start;
cudaEvent_t stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
kernel<<<g,b>>>();
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&responseTime, start, stop); //responseTime = elapsed time
The code needs to be changed slightly if you submit multiple items of work to multiple streams. I would recommend reading the answer to "Difference in time reported by NVVP and counters".
Options (2) and (3) are similar for the given example. Option (2) can be more flexible.

Do CUDA events time cudaMalloc and cudaMemcpy executions?

I am using the following code to time calls to cudaMalloc(). I am curious: do CUDA events only time our kernels, or do they also time the "in-built kernels"? In other words, is the following method for timing cudaMalloc() valid?
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (int t = 0; t < 100; t++) {
    float *test;
    cudaMalloc((void **)&test, 3000000 * sizeof(float));
    cudaFree(test);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
printf("time elapsed on the GPU: %f ms", elapsedTime / 100);
cu(da)EventRecord() does nothing more than submit a command to the GPU that tells the GPU to write a timestamp when the GPU processes the command. The timestamp is just an onboard high-resolution counter. So CUDA events are most useful when used as an asynchronous mechanism for timing on-GPU events, like how long a specific kernel takes to run. CUDA memory management mostly happens on the CPU, so CUDA events are not ideal for timing CUDA allocation and free operations.
In short: You're better off using CPU-based timing like gettimeofday().
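For example, a minimal sketch of that CPU-based alternative, using gettimeofday() and the same allocation pattern as above:
#include <stdio.h>
#include <sys/time.h>

int main() {
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int t = 0; t < 100; t++) {
        float *test;
        cudaMalloc((void **)&test, 3000000 * sizeof(float));
        cudaFree(test);
    }
    gettimeofday(&t1, NULL);
    double ms = (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_usec - t0.tv_usec) / 1000.0;
    printf("time elapsed on the CPU: %f ms\n", ms / 100);
    return 0;
}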