Why does cuFFT performance suffer with overlapping inputs?

I'm experimenting with using cuFFT's callback feature to perform input format conversion on the fly (for instance, calculating FFTs of 8-bit integer input data without first doing an explicit conversion of the input buffer to float). In many of my applications, I need to calculate overlapped FFTs on an input buffer, as described in this previous SO question. Typically, adjacent FFTs might overlap by 1/4 to 1/8 of the FFT length.
cuFFT, with its FFTW-like interface, explicitly supports this via the idist parameter of the cufftPlanMany() function. Specifically, if I want to calculate FFTs of size 32768 with an overlap of 4096 samples between consecutive inputs, I would set idist = 32768 - 4096. This does work properly in the sense that it yields the correct output.
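For concreteness, a plan set up this way would look roughly like the following sketch (the sizes mirror the test described below; the actual code is in the gist linked below):

// Sketch: batched real-to-complex plan with overlapped inputs.
// idist = nfft - overlap tells cuFFT where each transform's input starts.
int nfft = 32768, overlap = 4096, batch = 1024;
int n[1] = { nfft };
int inembed[1] = { nfft };
int onembed[1] = { nfft / 2 + 1 };
cufftHandle plan;
cufftPlanMany(&plan, 1, n,
              inembed, 1, nfft - overlap,   // istride = 1, idist = nfft - overlap
              onembed, 1, nfft / 2 + 1,     // contiguous complex outputs
              CUFFT_R2C, batch);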
However, I'm seeing strange performance degradation when using cuFFT in this way. I have devised a test that implements this format conversion and overlap in two different ways:
Explicitly tell cuFFT about the overlapping nature of the input: set idist = nfft - overlap as I described above. Install a load callback function that just does the conversion from int8_t to float as needed on the buffer index provided to the callback.
Don't tell cuFFT about the overlapping nature of the input; lie to it and set idist = nfft. Then, let the callback function handle the overlapping by calculating the correct index that should be read for each FFT input.
A test program implementing both of these approaches with timing and equivalence tests is available in this GitHub gist. I didn't reproduce it all here for brevity. The program calculates a batch of 1024 32768-point FFTs that overlap by 4096 samples; the input data type is 8-bit integers. When I run it on my machine (with a Geforce GTX 660 GPU, using CUDA 8.0 RC on Ubuntu 16.04), I get the following result:
executing method 1...done in 32.523 msec
executing method 2...done in 26.3281 msec
Method 2 is noticeably faster, which I would not expect. Look at the implementations of the callback functions:
Method 1:
template <typename T>
__device__ cufftReal convert_callback(void * inbuf, size_t fft_index,
                                      void *, void *)
{
    return (cufftReal)(((const T *) inbuf)[fft_index]);
}
Method 2:
template <typename T>
__device__ cufftReal convert_and_overlap_callback(void *inbuf,
                                                  size_t fft_index, void *, void *)
{
    // fft_index is the index of the sample that we need, not taking
    // the overlap into account. Convert it to the appropriate sample
    // index, considering the overlap structure. First, grab the FFT
    // parameters from constant memory.
    int nfft = overlap_params.nfft;
    int overlap = overlap_params.overlap;

    // Calculate which FFT in the batch that we're reading data for. This
    // tells us how much overlap we need to account for. Just use integer
    // arithmetic here for speed, knowing that this would cause a problem
    // if we did a batch larger than 2Gsamples long.
    int fft_index_int = fft_index;
    int fft_batch_index = fft_index_int / nfft;

    // For each transform past the first one, we need to slide "overlap"
    // samples back in the input buffer when fetching the sample.
    fft_index_int -= fft_batch_index * overlap;

    // Cast the input pointer to the appropriate type and convert to a float.
    return (cufftReal) (((const T *) inbuf)[fft_index_int]);
}
Method 2 has a significantly more complex callback function, one that even involves integer division by a non-compile time value! I would expect this to be much slower than method 1, but I'm seeing the opposite. Is there a good explanation for this? Is it possible that cuFFT structures its processing much differently when the input overlaps, thus resulting in the degraded performance?
It seems like I should be able to achieve performance that is quite a bit faster than method 2 if the index calculations could be removed from the callback (but that would require the overlapping to be specified to cuFFT).
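For reference, the load callbacks are attached to the plans roughly like this (a sketch using the standard cuFFT callback API; the full code, including the overlap_params setup, is in the gist). Note that callbacks require linking against the static cuFFT library with relocatable device code.

// Sketch: registering the int8_t -> float load callback on a plan (cufftXt.h).
__device__ cufftCallbackLoadR d_convert_cb = convert_callback<int8_t>;

void attach_load_callback(cufftHandle plan)
{
    cufftCallbackLoadR h_cb;
    cudaMemcpyFromSymbol(&h_cb, d_convert_cb, sizeof(h_cb));
    cufftXtSetCallback(plan, (void **)&h_cb, CUFFT_CB_LD_REAL, NULL);
}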
Edit: After running my test program under nvvp, I can see that cuFFT definitely seems to be structuring its computations differently. It's hard to make sense of the kernel symbol names, but the kernel invocations break down like this:
Method 1:
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex14packR2C_kernelIjfEEvNS_19spRealComplexR2C_stIT_T0_EE: 3.72 msec
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>: 7.71 msec
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>: 12.75 msec (yes, it gets invoked twice)
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelTexIjfL9fftAxii_t1EEEvP7ComplexIT0_EjT_15coordDivisors_tIS6_E7coord_tIS6_ESA_S6_S3_: 7.49 msec
Method 2:
spRadix0128C::kernel1MemCallback<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, L1, ALL, WRITEBACK>: 5.15 msec
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>: 12.88 msec
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelTexIjfL9fftAxii_t1EEEvP7ComplexIT0_EjT_15coordDivisors_tIS6_E7coord_tIS6_ESA_S6_S3_: 7.51 msec
Interestingly, it looks like cuFFT invokes two kernels to actually compute the FFTs using method 1 (when cuFFT knows about the overlapping), but with method 2 (where it doesn't know that the FFTs are overlapped), it does the job with just one. For the kernels that are used in both cases, it does seem to use the same grid parameters between methods 1 and 2.
I don't see why it should have to use a different implementation here, especially since the input stride istride == 1. It should just use a different base address when fetching data at the transform input; the rest of the algorithm should be exactly the same, I think.
Edit 2: I'm seeing some even more bizarre behavior. I realized by accident that if I fail to destroy the cuFFT handles appropriately, I see differences in measured performance. For example, I modified the test program to skip destruction of the cuFFT handles and then executed the tests in a different sequence: method 1, method 2, then method 2 and method 1 again. I got the following results:
executing method 1...done in 31.5662 msec
executing method 2...done in 17.6484 msec
executing method 2...done in 17.7506 msec
executing method 1...done in 20.2447 msec
So the performance seems to change depending upon whether there are other cuFFT plans in existence when creating a plan for the test case! Using the profiler, I see that the structure of the kernel launches doesn't change between the two cases; the kernels just all seem to execute faster. I have no reasonable explanation for this effect either.

If you specify non-standard strides (whether for the batch or the transform), cuFFT uses a different path internally.
Regarding edit 2:
This is likely GPU Boost adjusting clocks on the GPU. cuFFT plans do not have an impact on one another.
Ways to get more stable results:
run a warmup kernel (anything that fully loads the GPU is fine) before your actual problem (see the sketch after this list)
increase the batch size
run the test several times and take the average
lock the GPU clocks (not really possible on GeForce; Tesla cards can do it)
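For example, a warmup plus averaged timing harness might look like this sketch (hedged: warmup_kernel is just a stand-in for any kernel that keeps the GPU busy, and the timed kernel stands in for whichever method is under test):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical busy-work kernel used only to load the GPU and raise its clocks.
__global__ void warmup_kernel(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = sinf(buf[i]) * cosf(buf[i]);
}

int main() {
    const int n = 1 << 22;
    float *buf;
    cudaMalloc(&buf, n * sizeof(float));

    // Warm up so GPU Boost clocks settle before timing anything.
    for (int i = 0; i < 10; ++i)
        warmup_kernel<<<(n + 255) / 256, 256>>>(buf, n);
    cudaDeviceSynchronize();

    // Time several repetitions with CUDA events and report the average.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    const int runs = 10;
    float total_ms = 0.0f;
    for (int r = 0; r < runs; ++r) {
        cudaEventRecord(start);
        warmup_kernel<<<(n + 255) / 256, 256>>>(buf, n);  // stand-in for the real test
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        total_ms += ms;
    }
    printf("average: %.3f msec over %d runs\n", total_ms / runs, runs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(buf);
    return 0;
}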

At the suggestion of @llukas, I filed a bug report with NVIDIA regarding the issue (https://partners.nvidia.com/bug/viewbug/1821802 if you're registered as a developer). They acknowledged the poorer performance with overlapped plans. They actually indicated that the kernel configuration used in both cases is suboptimal and they plan to improve that eventually. No ETA was given, but it will likely not be in the next release (8.0 was just released last week). Finally, they said that as of CUDA 8.0, there is no workaround to make cuFFT use a more efficient method with strided inputs.

Related

What is the fastest way to update a single float value to the GPU to access it in a CUDA kernel?

I have an OpenGL particle simulation, where the position of each particle is calculated in a CUDA kernel. Most memory resides within GPU memory, but there is a single float value that I have to update from the CPU each frame.
At the moment I use cudaMemcpyAsync() to copy the float value to the GPU, but (at least from what I can tell) this slows down performance quite a bit. I tried using nvprof to see which calls take the longest, with these results:
Calls   Avg        Min        Max        Name
  477   2.9740us   2.8160us   4.5440us   simulation(float3*, float*, float3*, float*)
  477   89.033us   18.600us   283.00us   cudaLaunchKernel
  477   47.819us   10.200us   120.70us   cudaMemcpyAsync
I think I can't really do much about the kernel launch itself, but of the calls that happen every frame, cudaMemcpyAsync() seems to be taking the longest.
I have also tried to use pinned memory and cudaHostGetDevicePointer() as described here, however for some reason this increases the kernel launch times even more, more than making up for the time saved by not needing the memcpy call.
I guess there has to be a better/faster way to update my single float variable to the GPU?
The easiest way is to add an extra parameter to the simulation kernel as a plain float value rather than a pointer to float, so that the data travels directly in the kernel launch parameter structure that CUDA sends to the GPU when you launch the kernel. Then you avoid that data copy command altogether. (I'm assuming CUDA packs the whole kernel parameter descriptor into a single copy command, because the kernel parameter space is limited to a few kB or less.)
simulation(fooPointer,
           barPointer,
           fooBarPointer,
           floatVariable);
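The matching kernel signature then simply takes the float by value; a minimal sketch using the hypothetical pointer names from the call above:

// Sketch: the float arrives inside the kernel launch parameters, so no
// separate cudaMemcpyAsync is needed for it.
__global__ void simulation(float3 *fooPointer, float *barPointer,
                           float3 *fooBarPointer, float floatVariable)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // ... use floatVariable directly in the per-particle update ...
}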
Or, try double buffering between data update and rendering, or between data update and compute, so that the simulation image lags the simulation calculation by 1-2 frames (and per-frame latency gets worse) but "frames per second" increases.
If it's not an interactive simulation, hiding compute/render/data latencies with double or triple buffering should work.
If you are after minimizing per-frame timing (quicker response to user input into the simulation?) then you should append the float variable to the end of an array (or whatever structure) you already send to and use in the simulation. If you already have a 1 MB+ float buffer to send to the GPU, then appending 4 bytes (one float) to the end of it should not make much difference, and you can access it from there. One copy operation should be faster than two copy operations of the same total size.
If you are literally sending just 4 bytes to the GPU each frame (with a simple function to generate that data), then (as 3Dave said in comments) you can try adding an extra kernel function to update the value on the GPU, and just pay the overhead of a kernel launch command instead of both the copy command overhead and the data copy overhead. On the plus side, that extra kernel overhead might be hidden if there is a "graph" of kernels running each frame automatically, without enqueueing all of them again and again.
Here,
https://devblogs.nvidia.com/cuda-graphs/
The part
We are going to create a simple code which mimics this pattern. We will then use this to demonstrate the overheads involved with the standard launch mechanism and show how to introduce a CUDA Graph comprising the multiple kernels, which can be launched from the application in a single operation.
cudaGraphLaunch(instance, stream);
They say the per-kernel launch overhead with this "graph" feature is only 3-4 microseconds when there are many (20) kernels in the algorithm.
Since graphs support other commands too, you can try running the copy and compute parts in parallel CUDA streams within a graph, and switch their inputs with double buffering, so everything stays within CUDA's context before the output is sent to rendering.
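A minimal stream-capture sketch (hypothetical kernels and names; only the graph API calls are the real CUDA ones):

#include <cuda_runtime.h>

// Hypothetical per-frame kernels, standing in for the value update and the
// particle simulation step.
__global__ void update_value(float *v, float newVal) { *v = newVal; }
__global__ void simulate(float *v) { /* ... reads *v ... */ }

// Capture the per-frame work once, then relaunch the whole graph each frame
// with a single cudaGraphLaunch call. (A per-frame value change would need
// cudaGraphExecKernelNodeSetParams or a re-capture.)
void run_frames(float *d_value, int frames)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaGraphExec_t instance;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    update_value<<<1, 1, 0, stream>>>(d_value, 1.0f);
    simulate<<<256, 256, 0, stream>>>(d_value);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);

    for (int f = 0; f < frames; ++f) {
        cudaGraphLaunch(instance, stream);
        cudaStreamSynchronize(stream);
    }

    cudaGraphExecDestroy(instance);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}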
(Maybe) you don't even have to change the data mechanism at all. Just try sending the float's binary representation as the pointer value itself, and in the kernel only read the pointer value (not the data it points to) and convert it back to a float. I don't know if CUDA returns an error for this, as long as you don't try to dereference the (bogus) pointer address that the float data represents in the kernel.
simulation(fooPointer,
           barPointer,
           fooBarPointer,
           toPtr(floatData) // <----- float to 64/32 bit pointer value
);
and in kernel
float val = fromPtrToFloat(parameter4); // converts pointer itself, not the data
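toPtr() and fromPtrToFloat() are not existing CUDA functions; if you wanted to try this, they could be sketched as plain bit reinterpretations, something like:

#include <cstdint>
#include <cstring>

// Sketch: pack the float's bit pattern into a pointer-sized value on the host...
static void *toPtr(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));   // reinterpret the float's bits safely
    return (void *)(uintptr_t)bits;
}

// ...and unpack it inside the kernel (memcpy is usable in device code).
__device__ float fromPtrToFloat(void *p)
{
    uint32_t bits = (uint32_t)(uintptr_t)p;
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}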
But this may not be a preferred practice if you can simply use "value" type parameters.

How to perform basic operations (+ - * /) on GPU and store the result on it

I have the following line of code; gamma is a CPU variable that I will later need to copy to the GPU. gamma_x and delta are also stored on the CPU. Is there any way that I can execute the following line and store its result directly on the GPU? So basically, keep gamma, gamma_x and delta on the GPU and get the output of the following line on the GPU. It would speed up my code a lot for the lines after.
I tried with magma_dcopy but so far I couldn't find a way to make it work, because the output of magma_ddot is a CPU double.
gamma = -(gamma_x[i+1] + magma_ddot(i,&d_gamma_x[1],1,&(d_l2)[1],1, queue))/delta;
The very short answer is no, you can't do this, or at least not if you use magma_ddot.
However, magma_ddot is itself only a very thin wrapper around cublasDdot, and the cublas function fully supports having the result of the operation stored in GPU memory rather than returned to the host.
In theory you could do something like this:
// before the apparent loop you have not shown us:
double* dotresult;
cudaMalloc(&dotresult, sizeof(double));

for (int i=....) {
    // ...
    // magma_ddot(i,&d_gamma_x[1],1,&(d_l2)[1],1, queue);
    cublasSetPointerMode( queue->cublas_handle(), CUBLAS_POINTER_MODE_DEVICE);
    cublasDdot(queue->cublas_handle(), i, &d_gamma_x[1], 1, &(d_l2)[1], 1, dotresult); // result argument must be a device pointer in this mode
    cudaDeviceSynchronize();
    cublasSetPointerMode( queue->cublas_handle(), CUBLAS_POINTER_MODE_HOST);

    // Now dotresult holds the magma_ddot result in device memory
    // ...
}
Note that this might make Magma blow up depending on how you are using it, because Magma uses CUBLAS internally, and how CUBLAS state and asynchronous operations are handled inside Magma is completely undocumented. Having said that, if you are careful, it should be OK.
To then execute your calculation, either write a very simple kernel and launch it with one thread, or perhaps use a simple thrust call with a lambda expression, depending on your preference. I leave that as an exercise to the reader.
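As for the "very simple kernel launched with one thread", a hedged sketch could look like this (dotresult and d_gamma_x come from the code above; d_delta and d_gamma are hypothetical device copies of the question's delta and gamma):

// One-thread kernel that finishes the update entirely on the GPU.
__global__ void compute_gamma(const double *dotresult,  // device result of cublasDdot
                              const double *d_gamma_x,  // device array, as in the question
                              const double *d_delta,    // hypothetical device copy of delta
                              double *d_gamma,          // gamma, kept on the GPU
                              int i)
{
    *d_gamma = -(d_gamma_x[i + 1] + *dotresult) / *d_delta;
}

// Launched with a single thread right after the cublasDdot call:
// compute_gamma<<<1, 1>>>(dotresult, d_gamma_x, d_delta, d_gamma, i);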

Code running perfectly on host, put in a kernel, fails for mysterious reasons

I have to port a pre-existing “host-only” backpropagation implementation to CUDA. I think the nature of the algorithm doesn’t matter here, so I won’t give much explanation about the way it works. What I think does matter, though, is that it uses 3-dimensional arrays, all three dimensions of which are dynamically allocated.
I use VS2010 with CUDA 5.0, and my device has compute capability 2.1. The original host-only code can be downloaded here
→ http://files.getwebb.org/view-cre62u4d.html
Main points of the code:
patterns from adult.data are loaded into memory, using the Data structure, present in “pattern.h”.
several multi-dimensional arrays are allocated
the algorithm is run over the patterns, using the arrays allocated just before.
If you want to try to run the code, don’t forget to modify the PATH constant at the beginning of kernel.cu. I also advise you to use “2” layers, “5” neurons, and a learning rate of “0.00001”. As you can see, this works perfectly: the “MSE” is improving. For those who have no clue what this algorithm does, let’s simply say that it learns how to predict a target value based on 14 variables present in the patterns. The “MSE” decreases, meaning that the algorithm makes fewer mistakes after each “epoch”.
I spent a really long time trying to run this code on the device, and I’m still unsuccessful. My last attempt was done by simply copying the code that initializes the arrays and runs the algorithm into one big kernel, which failed again. This code can be downloaded there
→ http://files.getwebb.org/view-cre62u4c.html
To be precise, here are the differences with the original host-only code:
f() and fder(), which are used by the algorithm, become device functions.
parameters are hardcoded: 2 layers, 5 neurons, and a learning rate of 0.00001
the “w” array is initialized using a fixed value (0.5), not rand() anymore
a Data structure is allocated in device memory, and the data are sent to device memory after they have been loaded from adult.data into host memory
I think I made the minimal amount of modifications needed to make the code run in a kernel. The “kernel_check_learningData” kernel shows some information about the patterns loaded in device memory, proving that the following code, which sends the patterns from the host to the device, did work:
Data data;
Data* dev_data;
int* dev_t;
double* dev_x;
...
input_adult(PathFile, &data);
...
cudaMalloc((void**)&dev_data, sizeof(Data));
cudaMalloc((void**)&dev_t, data.N * sizeof(int));
cudaMalloc((void**)&dev_x, data.N * data.n * sizeof(double));
// Filling the device with t and x's data.
cudaMemcpy(dev_t, data.t, data.N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_x, data.x, data.N * data.n * sizeof(double), cudaMemcpyHostToDevice);
// Updating t and x pointers into devices Data structure.
cudaMemcpy(&dev_data->t, &dev_t, sizeof(int*), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->x, &dev_x, sizeof(double*), cudaMemcpyHostToDevice);
// Copying N and n.
cudaMemcpy(&dev_data->N, &data.N, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->n, &data.n, sizeof(int), cudaMemcpyHostToDevice);
It apparently fails at the beginning of the forward phase, when reading the “w” array. I can’t find any explanation for that.
I see two possibilities:
the code sending the patterns into device memory is bugged, despite the fact that it seems to work properly, and provokes a bug much further on, when the forward phase begins.
the CUDA API is not behaving like it should!
I’ve been desperately searching for my mistake for a very long time, so I wondered if the community could provide me with some help.
Thanks.
Here's the problem in your code, and why it works in 64 bit machine mode but not 32 bit machine mode.
In your backpropagation kernel, in the forward path, you have a sequence of code like this:
/*
 * for layer = 0
 */
for (i = 0; i < N[0]; i++) {          // for all neurons i of layer 0
    a[0][i] = x[ data->n * pat + i];  // a[0][i] = input i
}
In 32 bit machine mode (Win32 project, --machine 32 is being passed to nvcc), the failure occurs on the iteration i=7 when the write of a[0][7] occurs; this write is out of bounds. At this point, a[0][7] is intended to hold a double value, but for some reason the indexing is placing us out of bounds.
By the way, you can verify this by simply opening a command prompt in the directory where your executable is built, and running the command:
cuda-memcheck test_bp
assuming test_bp.exe is the name of your executable. cuda-memcheck conveniently identifies that there is an out of bounds write occurring, and even identifies the line of source that it is occurring on.
So why is this out of bounds? Let's take a look earlier in the kernel code where a[0][] is allocated:
a[0] = (double *)malloc( N[0] * sizeof(double *) );
                                              ^ oops!!
a[0][] is intended to hold double data but you're allocating pointer storage.
As it turns out, in a 64 bit machine the two types of storage are the same size, so it ends up working. But in a 32-bit machine, a double pointer is 4 bytes whereas double data is 8 bytes. So, in a 32-bit machine, when we index through this array taking data strides of 8 bytes, we eventually run off the end of the array.
Elsewhere in the kernel code you are allocating storage for the other "layers" of a like this:
a[layer] = (double *)malloc( N[layer] * sizeof(double) );
which is correct. I see that the original "host-only" code seems to contain this error as well, so there may be a latent defect in that code too.
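For clarity, the fix for layer 0 is simply to allocate double data there as well:

a[0] = (double *)malloc( N[0] * sizeof(double) );  // double data, not pointer storage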
You will still need to address the kernel running time in some fashion to avoid the Windows TDR event, if you want to run on a Windows WDDM device. And as I already pointed out, this code makes no attempt to use the parallel capability of the machine.

how can a __global__ function RETURN a value or BREAK out like C/C++ does

Recently I've been doing string comparison jobs on CUDA, and I wonder how a __global__ function can return a value when it finds the exact string that I'm looking for.
I mean, I need the __global__ function, which contains a great number of threads, to find a certain string among a big string pool simultaneously, and I hope that once the exact string is caught, the __global__ function can stop all the threads and return to the main function, telling me "he did it"!
I'm using CUDA C. How can I possibly achieve this?
There is no way in CUDA (or on NVIDIA GPUs) for one thread to interrupt execution of all running threads. You can't have immediate exit of the kernel as soon as a result is found, it's just not possible today.
But you can have all threads exit as soon as possible after one thread finds a result. Here's a model of how you would do that.
__global__ void kernel(volatile bool *found, ...)
{
    while (!(*found) && workLeftToDo()) {
        bool iFoundIt = do_some_work(...); // see notes below
        if (iFoundIt) *found = true;
    }
}
Some notes on this.
Note the use of volatile. This is important.
Make sure you initialize found (which must be a device pointer) to false before launching the kernel! A host-side sketch follows these notes.
Threads will not exit instantly when another thread updates found. They will exit only the next time they return to the top of the while loop.
How you implement do_some_work matters. If it is too much work (or too variable), then the delay to exit after a result is found will be long (or variable). If it is too little work, then your threads will be spending most of their time checking found rather than doing useful work.
do_some_work is also responsible for allocating tasks (i.e. computing/incrementing indices), and how you do that is problem specific.
If the number of blocks you launch is much larger than the maximum occupancy of the kernel on the present GPU, and a match is not found in the first running "wave" of thread blocks, then this kernel (and the one below) can deadlock. If a match is found in the first wave, then later blocks will only run after found == true, which means they will launch, then exit immediately. The solution is to launch only as many blocks as can be resident simultaneously (aka "maximal launch"), and update your task allocation accordingly.
If the number of tasks is relatively small, you can replace the while with an if and run just enough threads to cover the number of tasks. Then there is no chance for deadlock (but the first part of the previous point applies).
workLeftToDo() is problem-specific, but it would return false when there is no work left to do, so that we don't deadlock in the case that no match is found.
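For completeness, host-side setup for the flag might look like the following sketch (numBlocks, threadsPerBlock and the extra kernel arguments are placeholders):

// Sketch: allocate and clear the device flag, launch, then read it back.
volatile bool *d_found;
cudaMalloc((void **)&d_found, sizeof(bool));
cudaMemset((void *)d_found, 0, sizeof(bool));   // found = false before launch

kernel<<<numBlocks, threadsPerBlock>>>(d_found /*, ... */);

bool h_found = false;
cudaMemcpy(&h_found, (const void *)d_found, sizeof(bool), cudaMemcpyDeviceToHost);
if (h_found) printf("he did it!\n");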
Now, the above may result in excessive partition camping (all threads banging on the same memory), especially on older architectures without L1 cache. So you might want to write a slightly more complicated version, using a shared status per block.
__global__ void kernel(volatile bool *found, ...)
{
    volatile __shared__ bool someoneFoundIt;

    // initialize shared status
    if (threadIdx.x == 0) someoneFoundIt = *found;
    __syncthreads();

    while (!someoneFoundIt && workLeftToDo()) {
        bool iFoundIt = do_some_work(...);

        // if I found it, tell everyone they can exit
        if (iFoundIt) { someoneFoundIt = true; *found = true; }

        // if someone in another block found it, tell
        // everyone in my block they can exit
        if (threadIdx.x == 0 && *found) someoneFoundIt = true;

        __syncthreads();
    }
}
This way, one thread per block polls the global variable, and only threads that find a match ever write to it, so global memory traffic is minimized.
Aside: __global__ functions are void because it's difficult to define how to return values from 1000s of threads into a single CPU thread. It is trivial for the user to contrive a return array in device or zero-copy memory which suits his purpose, but difficult to make a generic mechanism.
Disclaimer: Code written in browser, untested, unverified.
If you feel adventurous, an alternative approach to stopping kernel execution would be to just execute
// (write result to memory here)
__threadfence();
asm("trap;");
if an answer is found.
This doesn't require polling memory, but is inferior to the solution that Mark Harris suggested in that it makes the kernel exit with an error condition. This may mask actual errors (so be sure to write out your results in a way that clearly allows to tell a successful execution from an error), and it may cause other hiccups or decrease overall performance as the driver treats this as an exception.
If you look for a safe and simple solution, go with Mark Harris' suggestion instead.
The __global__ function doesn't really contain a great number of threads the way you think it does. It is simply a kernel, a function that runs on the device, that is called by passing parameters that specify the thread model. The model that CUDA employs is a 2D grid of blocks and then a 3D thread model inside each block of the grid.
With the type of problem you have, it is not really necessary to use anything besides a 1D grid with 1D blocks of threads, because it doesn't really make sense to split the string pool into 2D like other problems (e.g. matrix multiplication).
I'll walk through a simple example of, say, 100 strings in the string pool that you want checked in a parallelized fashion instead of sequentially.
// main
// Should cudaMalloc and cudaMemcpy to the device before this code
dim3 dimGrid(10, 1);   // 1D grid with 10 blocks
dim3 dimBlocks(10, 1); // 1D blocks with 10 threads each
fun<<<dimGrid, dimBlocks>>>(d_strings, d_stringToMatch, d_answerIdx); // hypothetical device pointers from the cudaMalloc/cudaMemcpy step above
// cudaMemcpy answerIdx back to an integer on the host

// kernel (not positive on these types as my CUDA is very rusty)
__global__ void fun(char *strings[], char *stringToMatch, int *answerIdx)
{
    int idx = blockIdx.x * 10 + threadIdx.x;
    // Obviously use whatever function you've been using for string comparison;
    // I'm just using == for example's sake
    if (strings[idx] == stringToMatch)
    {
        *answerIdx = idx;
    }
}
This is obviously not the most efficient approach and is most likely not the exact way to pass parameters and work with memory in CUDA, but I hope it gets the point across of splitting the workload, and that 'global' functions get executed on many different cores, so you can't really tell them all to stop. There may be a way I'm not familiar with, but the speedup you will get just by dividing the workload onto the device (in a sensible fashion, of course) will already give you tremendous performance improvements. To get a sense of the thread model, I highly recommend reading up on the documentation on NVIDIA's site for CUDA. It will help tremendously and teach you the best way to set up the grid and blocks for optimal performance.

Timing CUDA kernels

Hi everyone, I'm currently working on timing some of my CUDA code. I was able to time it using events. My kernel ran for 19 ms. Somehow I find this doubtful, because when I ran a sequential implementation of this, it took around 5000 ms. I know the code should run faster, but should it be this much faster?
I'm using wrapper functions to call CUDA kernels in my cpp program. Am I supposed to be calling them there or in the .cu file? Thanks!
The obvious way to check if your program is working would be to compare the output to that of your CPU based implementation. If you get the same output, it is working by definition, right? :)
If your program is experimental in such a way that it doesn't really produce any verifiable output then there is a good chance that the compiler has optimized out some (or all) of your code. The compiler will remove code that does not contribute to output data. This can cause, for instance, that the entire contents of a kernel is removed if the final statement that stores the calculated value is commented out.
As to your speedup. 5000ms / 19ms = 263x, which is an unlikely increase, even for algorithms that map perfectly to the GPU architecture.
Well, if you wrote your CUDA code right, yes, it could be that much faster. Think about it. You moved the code from sequential execution on a single processor to parallel execution on hundreds of processors, depending on your GPU model. My $179 mid range card has 480 cores. Some available now have 1500 cores. It is very possible to get 100x perf jumps with CUDA, particularly if your kernel is much more compute-bound than memory bound.
That said, make sure you are measuring what you think you are measuring. Kernel launches are asynchronous with respect to the host thread, whether or not you use explicit streams, so you need to call cudaDeviceSynchronize() or have your host code wait on an event recorded after the kernel before stopping a host-side timer; otherwise the host-side measurement will not correctly reflect the kernel time. You can also use CUDA events to measure elapsed time on the GPU within a given stream. See section 5.1.2 of the CUDA Best Practices Guide in the NVidia GPU Computing SDK 4.2.
In my own code, I use the clock() function to get precise timings. For convenience, I have the macros
enum {
    tid_this = 0,
    tid_that,
    tid_count
};

__device__ float cuda_timers[ tid_count ];

#ifdef USETIMERS
    #define TIMER_TIC clock_t tic; if ( threadIdx.x == 0 ) tic = clock();
    #define TIMER_TOC(tid) clock_t toc = clock(); if ( threadIdx.x == 0 ) atomicAdd( &cuda_timers[tid] , ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) );
#else
    #define TIMER_TIC
    #define TIMER_TOC(tid)
#endif
These can then be used to instrument the device code as follows:
__global__ void mykernel ( ... ) {

    /* Start the timer. */
    TIMER_TIC

    /* Do stuff. */
    ...

    /* Stop the timer and store the result in the "tid_this" counter. */
    TIMER_TOC( tid_this );

}
You can then read the cuda_timers in the host code (see the sketch after the notes below).
A few notes:
The timers work on a per-block basis, i.e. if you have 100 blocks executing the same kernel, the sum of all their times will be stored.
The timers count the number of clock ticks. To convert to milliseconds, divide the tick count by your device's clock rate in kHz (cudaDeviceProp::clockRate is reported in kHz); dividing by the rate in GHz gives nanoseconds.
The timers can slow down your code a bit, which is why I wrapped them in the #ifdef USETIMERS so you can switch them off easily.
Although clock() returns integer values of type clock_t, I store the accumulated values as float, otherwise the values will wrap around for kernels that take longer than a few seconds (accumulated over all blocks).
The selection ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) is necessary in case the clock counter wraps around.
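As an illustration of reading the timers back on the host (a sketch; since cudaDeviceProp::clockRate is reported in kHz, dividing the tick count by it yields milliseconds):

// Sketch: copy the accumulated tick counts to the host and convert to ms.
float host_timers[tid_count];
cudaMemcpyFromSymbol(host_timers, cuda_timers, sizeof(host_timers));

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
for (int t = 0; t < tid_count; ++t) {
    float ms = host_timers[t] / (float)prop.clockRate;  // clockRate is in kHz
    printf("timer %d: %.3f ms (summed over all blocks)\n", t, ms);
}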