I am using cudaEvent calls to measure the time my kernel takes to execute. Here is the code as given in the manual:
cudaEvent_t start,stop;
float time=0;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start,0);
subsampler<<<gridSize,blockSize>>>(img_redd,img_greend,img_blued,img_height,img_width,final_device_r,final_device_g,final_device_b);
cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time,start,stop);
Now when I run this and look at the time, it comes out as something like 52428800.0000 (the values differ but are of this order). I know it is in milliseconds, but this is still a huge number, especially since the whole program doesn't take more than a minute to run. Can someone point out why this is happening? I really need to find out how much time the kernel takes to execute.
You should check the return values of each of those CUDA calls. At the very least call cudaGetLastError() at the end to check everything was successful.
If you get an error during the kernel execution then try running your app with cuda-memcheck, especially if you have an Unspecified Launch Failure, to check for illegal memory accesses.
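For example, here is a minimal sketch of the same timing code with error checking wrapped around each call. The CHECK macro is my own placeholder (any equivalent helper will do), the kernel launch uses the arguments from the question, and you need <cstdio> and <cstdlib> for fprintf/exit:
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

cudaEvent_t start, stop;
float time = 0;
CHECK(cudaEventCreate(&start));
CHECK(cudaEventCreate(&stop));
CHECK(cudaEventRecord(start, 0));
subsampler<<<gridSize, blockSize>>>(img_redd, img_greend, img_blued, img_height, img_width,
                                    final_device_r, final_device_g, final_device_b);
CHECK(cudaGetLastError());          // catches launch errors (bad configuration, etc.)
CHECK(cudaEventRecord(stop, 0));
CHECK(cudaEventSynchronize(stop));  // also surfaces errors hit while the kernel ran
CHECK(cudaEventElapsedTime(&time, start, stop));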
I am new to CUDA and got a little confused with cudaEvent. I now have a code sample that goes as follows:
float elapsedTime;
cudaEvent_t start, stop;
CUDA_ERR_CHECK(cudaEventCreate(&start));
CUDA_ERR_CHECK(cudaEventCreate(&stop));
CUDA_ERR_CHECK(cudaEventRecord(start));
// Kernel functions go here ...
CUDA_ERR_CHECK(cudaEventRecord(stop));
CUDA_ERR_CHECK(cudaEventSynchronize(stop));
CUDA_ERR_CHECK(cudaEventElapsedTime(&elapsedTime, start, stop));
CUDA_ERR_CHECK(cudaDeviceSynchronize());
I have two questions regarding this code:
1. Is the last cudaDeviceSynchronize necessary? According to the documentation, cudaEventSynchronize will "wait until the completion of all device work preceding the most recent call to cudaEventRecord()". So given that we have already called cudaEventSynchronize(stop), do we need to call cudaDeviceSynchronize once again?
2. How different is the above code from the following implementation:
#include <chrono>
auto tic = std::chrono::system_clock::now();
// Kernel functions go here ...
CUDA_ERR_CHECK(cudaDeviceSynchronize());
auto toc = std::chrono::system_clock::now();
float elapsedTime = std::chrono::duration_cast<std::chrono::milliseconds>(toc - tic).count() * 1.0;
Just to flesh out the comments so that this question has an answer and falls off the unanswered queue:
No, the cudaDeviceSynchronize() call is not necessary. In fact, in many cases where asynchronous API calls are being used in multiple streams, it is incorrect to use a global scope synchronization call, because you will break the features of event timers which allow accurate timing of operations in streams.
They are completely different. One uses host-side timing, the other uses device-driver timing. In the simplest cases the times measured by both will be comparable. However, in the host-side version, if you put a host CPU operation that consumes a significant amount of time inside the timed section, your measurement will not reflect the GPU time used whenever the GPU operations take less time than the host operations.
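To make the difference concrete, here is a small self-contained sketch; the dummy kernel and the busy host loop are my own additions, not from the question. The event-based number covers only the kernel, while the chrono number also absorbs whatever the host does inside the timed region:
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    auto tic = std::chrono::steady_clock::now();
    cudaEventRecord(start);                         // timestamped when the GPU reaches it
    dummyKernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaEventRecord(stop);                          // timestamped when the kernel has finished

    double sink = 0.0;                              // unrelated host work inside the timed region
    for (int i = 0; i < 50000000; ++i) sink += i * 0.5;

    cudaEventSynchronize(stop);
    auto toc = std::chrono::steady_clock::now();

    float gpuMs = 0.0f;
    cudaEventElapsedTime(&gpuMs, start, stop);      // kernel time only
    float hostMs = std::chrono::duration<float, std::milli>(toc - tic).count();
    printf("event timing: %.3f ms, host timing: %.3f ms (sink=%g)\n", gpuMs, hostMs, sink);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}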
I am struggling with a simulation loop. Three kernels are launched in every cycle, and the size of the next time step is computed by the second kernel.
while (time < end)
{
    kernel_Flux<<<>>>(...);
    kernel_Timestep<<<>>>(d_timestep);
    cudaMemcpy(&h_timestep, d_timestep, sizeof(float), ...);
    kernel_Integrate<<<>>>(d_timestep);
    time += h_timestep;
}
I only need to copy back a single float. What would be the most efficient way to avoid unnecessary synchronization?
Thank you in advance. :-)
In CUDA, all operations issued to the default stream are synchronized, so in the code you've posted the kernels will run one after another. From what I can see, kernel_Integrate() depends on the result of kernel_Timestep(), so there is no way of avoiding that synchronization. However, if kernel_Flux() and kernel_Timestep() work on independent data, you could try to execute them in parallel in two different streams.
If you care a lot about the iteration time, you can probably set up a new stream dedicated to copying h_timestep back to the host (you need to use cudaMemcpyAsync in this case). Then use something like speculative execution, where your loop proceeds before you know the time step. To do so you will have to set up the GPU memory buffers for the next several iterations, which you can probably do with a circular buffer. You also need to use cudaEventRecord and cudaStreamWaitEvent to synchronize the different streams, so that a later iteration is allowed to proceed only once the time step corresponding to the buffer it is about to overwrite has been calculated (that is, the memcpy stream has done its job); otherwise you will lose the state at that iteration.
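A rough sketch of that pattern follows; the stream/event names and the pinned h_timestep allocation are my own, and the circular-buffer bookkeeping needed for full speculative execution is left out:
cudaStream_t computeStream, copyStream;
cudaEvent_t timestepReady;
cudaStreamCreate(&computeStream);
cudaStreamCreate(&copyStream);
cudaEventCreate(&timestepReady);

float *h_timestep;
cudaMallocHost(&h_timestep, sizeof(float));   // pinned host memory so the async copy can overlap

while (time < end)
{
    kernel_Flux<<<gridSize, blockSize, 0, computeStream>>>(...);
    kernel_Timestep<<<gridSize, blockSize, 0, computeStream>>>(d_timestep);
    cudaEventRecord(timestepReady, computeStream);      // d_timestep is valid past this point

    cudaStreamWaitEvent(copyStream, timestepReady, 0);  // copy stream waits only for that event
    cudaMemcpyAsync(h_timestep, d_timestep, sizeof(float),
                    cudaMemcpyDeviceToHost, copyStream);

    kernel_Integrate<<<gridSize, blockSize, 0, computeStream>>>(d_timestep);  // overlaps the copy

    cudaStreamSynchronize(copyStream);                  // wait for the copy, not the whole device
    time += *h_timestep;
}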
Another potential solution, which I haven't tried but I suspect would work, is to make use of dynamic parallelism. If your cards support that, you can probably put the whole loop in GPU.
EDIT:
Sorry, I just realized you have a third kernel. Your delay due to synchronization may be because you are not using cudaMemcpyAsync: it is very likely that the third kernel will run longer than the memcpy, so you should be able to proceed without any delay. The only synchronization needed is then after each iteration.
The ideal solution would be to move everything to the GPU. However, I cannot do so, because I need to launch CUDPP compact after every few iterations, and it supports neither CUDA streams nor dynamic parallelism. I know that the Thrust 1.8 library has a copy_if method, which does the same thing and works with dynamic parallelism, but the problem is that it does not compile with separate compilation on.
To sum up, now I use the following code:
while (time < end)
{
    kernel_Flux<<<gs, bs, 0, stream1>>>();
    kernel_Timestep<<<gs, bs, 0, stream1>>>(d_timestep);
    cudaEventRecord(event, stream1);
    cudaStreamWaitEvent(stream2, event, 0);
    cudaMemcpyAsync(&h_timestep, d_timestep, sizeof(float), ..., stream2);
    kernel_Integrate<<<>>>(d_timestep);
    cudaStreamSynchronize(stream2);
    time += h_timestep;
}
When timing CUDA kernels, the following doesn't work because a kernel launch doesn't block the CPU program while the kernel executes:
start timer
kernel<<<g,b>>>();
end timer
I've seen three basic ways of (successfully) timing CUDA kernels:
(1) Two CUDA eventRecords.
float responseTime; //result will be in milliseconds
cudaEvent_t start; cudaEventCreate(&start); cudaEventRecord(start); cudaEventSynchronize(start);
cudaEvent_t stop; cudaEventCreate(&stop);
kernel<<<g,b>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);
cudaEventElapsedTime(&responseTime, start, stop); //responseTime = elapsed time
(2) One CUDA eventRecord.
float start = read_timer(); //helper function on CPU, in milliseconds
cudaEvent_t stop; cudaEventCreate(&stop);
kernel<<<g,b>>>();
cudaEventRecord(stop); cudaEventSynchronize(stop);
float responseTime = read_timer() - start;
(3) deviceSynchronize instead of eventRecord. (Probably only useful when programming in a single stream.)
float start = read_timer(); //helper function on CPU, in milliseconds
kernel<<<g,b>>>();
cudaDeviceSynchronize();
float responseTime = read_timer() - start;
I experimentally verified that these three strategies produce the same timing result.
Questions:
What are the tradeoffs of these strategies? Any hidden details here?
Aside from timing many kernels in multiple streams, are there any advantages to using two event records and the cudaEventElapsedTime() function?
You can probably use your imagination to figure out what read_timer() does. Nevertheless, it can't hurt to provide an example implementation:
double read_timer(){
    struct timeval start;
    gettimeofday( &start, NULL ); //you need to include <sys/time.h>
    return (double)((start.tv_sec) + 1.0e-6 * (start.tv_usec))*1000; //milliseconds
}
You seem to have ruled out most of the differences by saying they all produce the same result for the relatively simple case you have shown (probably not exactly true, but I understand what you mean), and "Aside from timing (complex sequences) ..." where the first case is clearly better.
One possible difference would be portability between Windows and Linux. I believe your example read_timer function is Linux-oriented. You could probably craft a read_timer function that is "portable", but the CUDA event system (method 1) is portable as-is.
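For what it's worth, a portable read_timer can be sketched with std::chrono (this is my own variant, not from the question):
#include <chrono>

// Portable millisecond timestamp; only differences between two calls are
// meaningful, since the clock's epoch is arbitrary.
double read_timer()
{
    using namespace std::chrono;
    return duration<double, std::milli>(steady_clock::now().time_since_epoch()).count();
}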
Option (1) uses cudaEventRecord to time the CPU. This is highly inefficient and I would discourage using cudaEventRecord for this purpose. cudaEventRecord can be used to time the GPU push-buffer time to execute the kernel, as follows:
float responseTime; //result will be in milliseconds
cudaEvent_t start;
cudaEvent_t stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
kernel<<<g,b>>>();
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&responseTime, start, stop); //responseTime = elapsed time
The code needs to be changed slightly if you submit multiple items of work to multiple streams. I would recommend reading the answer to "Difference in time reported by NVVP and counters".
Options (2) and (3) are similar for the given example. Option (2) can be more flexible.
I have two CUDA kernel functions like this:
a<<<BLK_SIZE,THR_SIZE>>>(params,...);
b<<<BLK_SIZE,THR_SIZE>>>(params,...);
After kernel a starts, I want to wait until a finishes and then start kernel b.
So I inserted cudaThreadSynchronize() between a and b, like this:
a<<<BLK_SIZE,THR_SIZE>>>(params,...);
err = cudaThreadSynchronize();
if (err != cudaSuccess)
    printf("cudaThreadSynchronize error: %s\n", cudaGetErrorString(err));
b<<<BLK_SIZE,THR_SIZE>>>(params,...);
But cudaThreadSynchronize() returns the error "the launch timed out and was terminated".
How can I fix it?
A simple code explanation:
mmap(sequence file);
mmap(reference file);
cudaMemcpy(seq_cuda, sequence);
cudaMemcpy(ref_cuda,reference);
kernel<<<>>>(params); //find short sequence in reference
cudaThreadSynchronize();
kernel<<<>>>(params);
cudaMemcpy(result, result_cuda);
report result
In the kernel function there is a big for loop which contains some if-else branches for the pattern-matching algorithm, to reduce the number of comparisons.
This launch error means that something went wrong when your first kernel was launched, or maybe even before that. To work your way out of this, check the return value of every CUDA runtime call for errors. Also, do a cudaThreadSynchronize() followed by an error check after every kernel call. This should help you find the first place where the error occurs.
If it is indeed a launch failure, you will need to investigate the execution configuration and the code of the kernel to find the cause of your error.
Finally, note that it is highly unlikely that your action of adding in a cudaThreadSynchronize caused this error. I say this because the way you have worded the query points to the cudaThreadSynchronize as the culprit. All this call did was to catch your existing error earlier.
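As an illustration of the per-kernel checking suggested above (kernel, params, BLK_SIZE and THR_SIZE are the placeholders from the question's outline):
kernel<<<BLK_SIZE, THR_SIZE>>>(params);
cudaError_t err = cudaGetLastError();      // errors from the launch itself (bad configuration, ...)
if (err != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(err));

err = cudaThreadSynchronize();             // errors hit while the kernel was running
if (err != cudaSuccess)
    printf("execution error: %s\n", cudaGetErrorString(err));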
Say I want to time a memory fetch from device global memory:
cudaMemcpy(...cudaMemcpyHostToDevice);
cudaThreadSynchronize();
time1 ...
kernel_call();
cudaThreadSynchronize();
time2 ...
cudaMemcpy(...cudaMemcpyDeviceToHost);
cudaThreadSynchronize();
time3 ...
I don't understand why time3 and time2 always give the same result. My kernel does take a long time to get the result ready for fetching, but shouldn't cudaThreadSynchronize() block all operations until kernel_call is done? Also, fetching from device memory to host memory should take a while, at least a noticeable amount. Thanks.
The best way to monitor the execution time is to use the CUDA_PROFILE_LOG=1 environment variable, and to set in the CUDA_PROFILE_CONFIG file the values timestamp, gpustarttimestamp, gpuendtimestamp. After running your CUDA program with those environment variables, a local .cuda_log file should be created, listing the timings of memcopies and kernel executions to the microsecond level. Clean and not invasive.
I don't know if this is the critical point here, but I noticed the following:
If you look through the NVIDIA code samples (I don't know where exactly), you will find something like a "warm-up" function, which is called before the critical kernel that is to be measured.
Why?
Because the NVIDIA driver dynamically optimizes the way it drives the GPU during the first access (in your case, before time1) every time a program is executed. There is a lot of overhead in that first access. This wasn't clear to me for a long time: when I did 10 runs, the first run was always far slower than the rest. Now I know why.
Solution: just call a dummy warm-up function that accesses the GPU before the real, timed execution of your code begins.
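A minimal warm-up sketch (the empty kernel is just a placeholder of mine; any trivial access to the GPU has the same effect):
__global__ void warmupKernel() { }          // does no work; only forces context/driver initialization

void warmup()
{
    warmupKernel<<<1, 1>>>();
    cudaThreadSynchronize();                // make sure initialization has finished
}
Call warmup() once before taking time1, so the one-time driver overhead does not end up in your measurements.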