In the OpenCL world there is a function, clGetEventProfilingInfo, which returns all the profiling info of an event (queued, submitted, start, and end times) in nanoseconds. It is quite convenient because I can printf that info whenever I want.
For example, with PyOpenCL it is possible to write code like this:
profile = event.profile
print("%gs + %gs" % (1e-9*(profile.end - profile.start), 1e-9*(profile.start - profile.queued)))
which is quite informative for my task.
Is it possible to get such information in code, instead of using an external profiling tool like nvprof and company?
For quick, lightweight timing, you may want to have a look at the cudaEvent API.
Excerpt from the link above:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

cudaEventRecord(start);                              // timestamp recorded just before the kernel
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop);                               // timestamp recorded just after the kernel

cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

cudaEventSynchronize(stop);                          // wait until the stop event has completed
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);    // elapsed GPU time between the two events
printf("Elapsed time: %f ms\n", milliseconds);
If you want a more full-featured profiling library, you should look at CUPTI.
So far there is no tool other than nvprof that can collect profiling data. However, you can harness nvprof from within your code. Take a look at this NVIDIA document.
You can use cuProfilerStart() and cuProfilerStop() to probe just a part of your code.
They are declared in cuda_profiler_api.h.
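As a rough illustration (my own sketch, not from the document above), a region of interest can be bracketed using the runtime-API counterparts cudaProfilerStart()/cudaProfilerStop() from cuda_profiler_api.h; the kernel name my_kernel and the launch configuration are placeholders:

#include <cuda_profiler_api.h>

// ... set up inputs, warm up, etc. ...
cudaProfilerStart();                          // begin collecting profiler data here
my_kernel<<<grid_dim, block_dim>>>(d_data);   // placeholder kernel and configuration
cudaDeviceSynchronize();
cudaProfilerStop();                           // stop collecting profiler data

The profiler then has to be told not to profile from the start of the application (for nvprof this is typically done with --profile-from-start off) so that only the bracketed region is captured.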
Related
I want to measure the overhead of a kernel launch in CUDA.
I understand that there are various parameters which affect this overhead. I am interested in the following:
number of threads created
size of data being copied
I am doing this mainly to measure the advantage of using managed memory, which was introduced in CUDA 6.0. I will update this question with the code I develop, based on the comments. Thanks!
How to measure the overhead of a kernel launch in CUDA is dealt with in Section 6.1.1 of The CUDA Handbook by N. Wilt. The basic idea is to launch an empty kernel. Here is a sample code snippet:
#include <stdio.h>

__global__ void EmptyKernel() { }

int main() {

    const int N = 100000;

    float time, cumulative_time = 0.f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < N; i++) {
        cudaEventRecord(start, 0);
        EmptyKernel<<<1,1>>>();              // the kernel does nothing, so only the launch is timed
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&time, start, stop);
        cumulative_time = cumulative_time + time;
    }

    printf("Kernel launch overhead time: %3.5f ms \n", cumulative_time / N);
    return 0;
}
On my laptop's GeForce GT540M card, the kernel launch overhead is 0.00245 ms.
If you want to check the dependence of this time on the number of threads launched, then just change the kernel launch configuration <<<*,*>>>. It appears that the timing does not change significantly with the number of threads launched, which is consistent with the book's statement that most of that time is spent in the driver.
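For instance, the measurement loop above could be wrapped in a sweep over the block size to see this for yourself; this is just a sketch reusing the EmptyKernel, events, and variables already shown:

// Sketch: repeat the timing loop for several block sizes
for (int threads = 32; threads <= 1024; threads *= 2) {
    float cumulative_time = 0.f;
    for (int i = 0; i < N; i++) {
        cudaEventRecord(start, 0);
        EmptyKernel<<<1, threads>>>();       // same empty kernel, varying threads per block
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&time, start, stop);
        cumulative_time += time;
    }
    printf("%4d threads: %3.5f ms\n", threads, cumulative_time / N);
}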
You may also be interested in these test results from the University of Virginia:
Memory transfer overhead: http://www.cs.virginia.edu/~mwb7w/cuda_support/memory_transfer_overhead.html
Kernel launch overhead: http://www.cs.virginia.edu/~mwb7w/cuda_support/kernel_overhead.html
They were measured in a similar way to JackOLantern's proposal.
In my code, I need to call a CUDA kernel to parallelize some matrix computation. However, this computation must be done iteratively ~60,000 times (the kernel is called inside a 60,000-iteration for loop).
That means that if I do cudaMalloc/cudaMemcpy on every single call to the kernel, most of the time will be spent doing memory allocation and transfer, and I get a significant slowdown.
Is there a way to, say, allocate a piece of memory before the for loop, use that memory in each iteration of the kernel, and then copy that memory back from device to host after the for loop?
Thanks.
Yes, you can do exactly what you describe:
int *h_data, *d_data;
cudaMalloc((void **)&d_data, DSIZE*sizeof(int));    // allocate device memory once, before the loop
h_data = (int *)malloc(DSIZE*sizeof(int));          // allocate host memory
// fill up h_data[] with data
cudaMemcpy(d_data, h_data, DSIZE*sizeof(int), cudaMemcpyHostToDevice);   // copy to device once
for (int i = 0; i < 60000; i++)
    my_kernel<<<grid_dim, block_dim>>>(d_data);     // reuse the same device buffer on every iteration
cudaMemcpy(h_data, d_data, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);   // copy back once, after the loop
...
When timing CUDA kernels, the following doesn't work because the kernel doesn't block the CPU program execution while it executes:
start timer
kernel<<<g,b>>>();
end timer
I've seen three basic ways of (successfully) timing CUDA kernels:
(1) Two CUDA eventRecords.
float responseTime; //result will be in milliseconds
cudaEvent_t start; cudaEventCreate(&start); cudaEventRecord(start); cudaEventSynchronize(start);
cudaEvent_t stop; cudaEventCreate(&stop);
kernel<<<g,b>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);
cudaEventElapsedTime(&responseTime, start, stop); //responseTime = elapsed time
(2) One CUDA eventRecord.
float start = read_timer(); //helper function on CPU, in milliseconds
cudaEvent_t stop; cudaEventCreate(&stop);
kernel<<<g,b>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);
float responseTime = read_timer() - start;
(3) deviceSynchronize instead of eventRecord. (Probably only useful when programming in a single stream.)
float start = read_timer(); //helper function on CPU, in milliseconds
kernel<<<g,b>>>();
cudaDeviceSynchronize();
float responseTime = read_timer() - start;
I experimentally verified that these three strategies produce the same timing result.
Questions:
What are the tradeoffs of these strategies? Any hidden details here?
Aside from timing many kernels in multiple streams, are there any advantages of using two event records and the cudaEventElapsedTime() function?
You can probably use your imagination to figure out what read_timer() does. Nevertheless, it can't hurt to provide an example implementation:
double read_timer(){
    struct timeval start;
    gettimeofday( &start, NULL ); //you need to include <sys/time.h>
    return (double)((start.tv_sec) + 1.0e-6 * (start.tv_usec))*1000; //milliseconds
}
You seem to have ruled out most of the differences yourself by saying that they all produce the same result for the relatively simple case you have shown (probably not exactly true, but I understand what you mean), and by the "Aside from timing (complex sequences) ..." remark, where the first case is clearly better.
One possible difference would be portability between Windows and Linux. I believe your example read_timer function is Linux-oriented. You could probably craft a read_timer function that is "portable", but the CUDA event system (method 1) is portable as-is.
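For what it's worth, a portable read_timer could be sketched on top of C++11 <chrono> (this is my own illustration, assuming a C++11 compiler, not code from the question):

#include <chrono>

// Portable millisecond timer based on a monotonic clock
double read_timer() {
    using namespace std::chrono;
    static const steady_clock::time_point t0 = steady_clock::now();
    return duration_cast<duration<double, std::milli>>(steady_clock::now() - t0).count();
}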
Option (1) uses cudaEventRecord to time the CPU. This is highly inefficient and I would discourage using cudaEventRecord for this purpose. cudaEventRecord can be used to time the GPU (push buffer) time to execute a kernel, as follows:
float responseTime; //result will be in milliseconds
cudaEvent_t start;
cudaEvent_t stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
kernel<<<g,b>>>();
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&responseTime, start, stop); //responseTime = elapsed time
The code needs to be changed slightly if you submit multiple items of work to multiple streams. I would recommend reading the answer to Difference in time reported by NVVP and counters.
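As a rough illustration of that change (my own sketch, with placeholder kernels kernelA and kernelB and the same placeholder launch configuration g, b), each stream gets its own pair of events:

cudaStream_t s0, s1;
cudaEvent_t start0, stop0, start1, stop1;
cudaStreamCreate(&s0);    cudaStreamCreate(&s1);
cudaEventCreate(&start0); cudaEventCreate(&stop0);
cudaEventCreate(&start1); cudaEventCreate(&stop1);

cudaEventRecord(start0, s0);
kernelA<<<g, b, 0, s0>>>();   // placeholder kernel launched in stream s0
cudaEventRecord(stop0, s0);

cudaEventRecord(start1, s1);
kernelB<<<g, b, 0, s1>>>();   // placeholder kernel launched in stream s1
cudaEventRecord(stop1, s1);

cudaEventSynchronize(stop0);
cudaEventSynchronize(stop1);

float ms0, ms1;
cudaEventElapsedTime(&ms0, start0, stop0);  // GPU time of the work in stream s0
cudaEventElapsedTime(&ms1, start1, stop1);  // GPU time of the work in stream s1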
Options (2) and (3) are similar for the given example. Option (2) can be more flexible.
What is the meaning of the "missing configuration" error in CUDA?
The code below is a thread function. When I run this code, the error obtained is 1, which implies a missing configuration error. What is the mistake in this code?
void* run(void *args)
{
    cudaError_t error;
    Matrix *matrix = (Matrix*)args;
    int scalar = 2;
    dim3 dimGrid(1,1,1);
    dim3 dimBlock(1024,1,1);
    cudaEvent_t start, stop;
    cudaSetDevice(0);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start,0);
    for(int i = 0; i < matrix->number; i++)
    {
        syntheticKernel<<<dimGrid,dimBlock>>>();
        cudaThreadSynchronize();
    }
    cudaEventRecord(stop,0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&matrix->time, start, stop);
    error = cudaGetLastError();
    assert(error != 0);
    printf("%d\n", error);
}
Can you add more detail about your program, please? The CUDA API routines each return a status code; you should check the status of each API call to catch and decode the first reported error.
One point to check is that you have not called any CUDA API routines before you fork the pthreads. Creating a CUDA context (which is automatic for most, but not all, CUDA API routines) before you fork the threads will cause problems. Check this, and if it's not the problem add more details to your question and check the return value of all API calls.
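A common way to check the status of every call, as suggested above, is a small wrapper macro; a minimal sketch (my own, not from the question) might look like this:

#include <stdio.h>
#include <stdlib.h>

// Minimal error-checking wrapper: prints the failing call's location and exits
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                    cudaGetErrorString(err), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Usage:
// CUDA_CHECK(cudaSetDevice(0));
// CUDA_CHECK(cudaEventCreate(&start));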
Why are you launching a single block in the grid? This configuration seems suspicious:
dim3 dimGrid(1,1,1);
dim3 dimBlock(1024,1,1);
Try increasing the grid size and putting fewer threads in each block. But your main problem is probably about contexts, as Tom suggests.
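For example, a launch configuration is more commonly derived from the problem size; here is a small sketch (the problem size n is a placeholder, and syntheticKernel is reused from your code):

int n = 1 << 20;                        // problem size (placeholder)
dim3 dimBlock(256, 1, 1);               // a moderate number of threads per block
dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x, 1, 1);   // enough blocks to cover n elements
syntheticKernel<<<dimGrid, dimBlock>>>();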
I am using the following code to time calls to cudaMalloc(). I am curious: do CUDA events only time our kernels, or do they also time the "in-built kernels"? In other words, is the following method for timing cudaMalloc() valid?
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (int t = 0; t < 100; t++) {
    float* test;
    cudaMalloc((void**)&test, 3000000 * sizeof(float));
    cudaFree(test);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
printf("time elapsed on the GPU: %f ms", elapsedTime/100);
cu(da)EventRecord() does nothing more than submit a command to the GPU that tells the GPU to write a timestamp when the GPU processes the command. The timestamp is just an onboard high-resolution counter. So CUDA events are most useful when used as an asynchronous mechanism for timing on-GPU events, like how long a specific kernel takes to run. CUDA memory management mostly happens on the CPU, so CUDA events are not ideal for timing CUDA allocation and free operations.
In short: You're better off using CPU-based timing like gettimeofday().
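For example, the loop above could be timed on the host side instead, with gettimeofday() much like the read_timer shown earlier (a sketch, not a definitive benchmark):

#include <sys/time.h>

// Wall-clock timing of cudaMalloc/cudaFree on the CPU side
struct timeval t0, t1;
gettimeofday(&t0, NULL);
for (int t = 0; t < 100; t++) {
    float *test;
    cudaMalloc((void **)&test, 3000000 * sizeof(float));
    cudaFree(test);
}
gettimeofday(&t1, NULL);
double elapsed_ms = (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_usec - t0.tv_usec) / 1000.0;
printf("average cudaMalloc+cudaFree time: %f ms\n", elapsed_ms / 100);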