cuFFT don't see what my kernel did - cuda

I have a very strange issue using cuFFT. I tried to prepare my input data with a kernel that apply a Hanning window. Everything seems fine. here the issue: cuFFT run on the data WITHOUT the hanning window applied. I don't understand why.
I tried the following:
test1:
- I run the kernel to apply window
- get back the data to the host and check the values: all is OK. the window is applied
- copy back the values to the device
- run the fft: no luck, it eat non windowed data!
test2:
- I don't use a kernel, I apply the window with CPU
- I run the fft: it works. it eat the windowed data
Is there any rational explanation to this?
Is there some kind of cache involved here ?
NOTE: I use the same device memory pointer in my kernel and in cuFFT

It appears that I ran my kernel applyWindow only on the first 512 samples of my data. This explain why it did nothing. Running the kernel with the right settings did the job on the entire buffer.
int nbBlock = inputSizeInSamples / deviceProp.maxThreadsPerBlock;
applyWindow <<<nbBlock, deviceProp.maxThreadsPerBlock >>>( ... )

Related

What is the fastest way to update a single float value to the GPU to access it in a CUDA kernel?

I have a opengl particle simulation, where the position of each particle is calculated in a CUDA kernel. Most memory resides within the GPU memory, but there is a single float value, I have to update from the CPU each frame.
At the moment I use cudaMemcpyAsync() to copy the float value to the GPU, but (at least from what I can tell), this slows down the performance quite a bit. I tried to use nvproof to see, which calls take the longest, with these results:
Calls Avg Min Max Name
477 2.9740us 2.8160us 4.5440us simulation(float3*, float*, float3*, float*)
477 89.033us 18.600us 283.00us cudaLaunchKernel
477 47.819us 10.200us 120.70us cudaMemcpyAsync
I think I can't really do much about the kernel launch itself, but from the calls, that happen every frame cudaMemcpyAsync() seems to be taking the longest.
I have also tried to use pinned memory and cudaHostGetDevicePointer() as described here, however for some reason this increases the kernel launch times even more, making more than up for the time saved for not needing the memcopy function.
I guess there has to be a better/faster way to update my single float variable to the GPU?
Easiest way is, that you can add an extra parameter to the simulation kernel function as a value of simple float but not as a pointer to float so that the data goes directly by the kernel launch parameters structure that CUDA sends to GPU when you launch the kernel. Then you evade that data copy command altogether. (I'm assuming CUDA packs whole function parameter descriptor data of kernel into a single copy command because kernel parameter descriptor space is limited by a few kBs or less).
simulation(fooPointer,
barPointer,
fooBarPointer,
floatVariable
);
Or, try double buffering between data update and rendering or between data update and compute so that simulation image follows the simulation calculation by 1-2 frames behind (and per-frame time gets worse) but "frames per second" increases.
If its not an interactive simulation, hiding compute/render/data latencies by double or triple buffering should work.
If you are after minimizing per-frame timing (quicker response to a user-input into simulation?) then you should embed the float variable to the end of an array that you already send/use in simulation or whatever structure you are using. If you already have a 1MB+ float buffer to send to GPU, then appending 4B(float) to end of it should not make much difference then you can access it from there. 1 copy operation should be faster than 2 copy operations with same total size.
If you are literally sending just 4B to GPU at each frame (with a simple function to generate that data), then (as 3Dave said in comments) you can try adding an extra kernel function to update the value in the GPU and just have the overhead of kernel launch command instead of both copy command overhead and data copy overhead. On a positive side, that extra kernel overhead might be hidden if there is a "graph" of kernels running for each frame automatically without enqueueing all of them again and again.
Here,
https://devblogs.nvidia.com/cuda-graphs/
The part
We are going to create a simple code which mimics this pattern. We will then use this to demonstrate the overheads involved with the standard launch mechanism and show how to introduce a CUDA Graph comprising the multiple kernels, which can be launched from the application in a single operation.
cudaGraphLaunch(instance, stream);
They say per-kernel launch overhead in this "graph" feature is only 3-4 microseconds when there are many(20) kernels in the algorithm.
Since graph supports other commands too, you can try both copy and compute parts in parallel cuda-streams within a graph and switch their inputs with double buffering so all CUDA things can stay within CUDA's context before sending output to rendering.
(Maybe)You don't even have to change the data mechanism at all. Just try sending data of float as binary representation into the pointer value and only read the pointer value (not data value) from kernel and convert it back to float. I don't know if CUDA returns an error for this if you don't try reaching the (wrong) pointer address that the float data represents, in the kernel.
simulation(fooPointer,
barPointer,
fooBarPointer,
toPtr(floatData) // <----- float to 64/32 bit pointer value
);
and in kernel
float val = fromPtrToFloat(parameter4); // converts pointer itself, not the data
But this may not be a preferred practice if you can simply use "value" type parameters.

How to perform basic operations (+ - * /) on GPU and store the result on it

I have the following code line, gamma is a CPU variable, that after i will need to copy to GPU. gamma_x and delta are also stored on CPU. Is there any way that i can execute the following line and store its result directly on GPU? So basically, host gamma, gamma_x and delta on GPU and get the output of the following line on GPU. It would speed up my code a lot for the lines after.
I tried with magma_dcopy but so far i couldn't find a way to make it working because the output of magma_ddot is CPU double.
gamma = -(gamma_x[i+1] + magma_ddot(i,&d_gamma_x[1],1,&(d_l2)[1],1, queue))/delta;
The very short answer is no, you can't do this, or least not if you use magma_ddot.
However, magma_ddot is itself a only very thin wrapper around cublasDdot, and the cublas function fully supports having the result of the operation stored in GPU memory rather than returned to the host.
In theory you could do something like this:
// before the apparent loop you have not shown us:
double* dotresult;
cudaMalloc(&dotresult, sizeof(double));
for (int i=....) {
// ...
// magma_ddot(i,&d_gamma_x[1],1,&(d_l2)[1],1, queue);
cublasSetPointerMode( queue->cublas_handle(), CUBLAS_POINTER_MODE_DEVICE);
cublasDdot(queue->cublas_handle(), i, &d_gamma_x[1], 1, &(d_l2)[1], 1, &dotresult);
cudaDeviceSynchronize();
cublasSetPointerMode( queue->cublas_handle(), CUBLAS_POINTER_MODE_HOST);
// Now dotresult holds the magma_ddot result in device memory
// ...
}
Note that might make Magma blow up depending on how you are using it, because Magma uses CUBLAS internally and how CUBLAS state and asynchronous operations are handled inside Magma are completely undocumented. Having said that, if you are careful, it should be OK.
To then execute your calculation, either write a very simple kernel and launch it with one thread, or perhaps use a simple thrust call with a lambda expression, depending on your preference. I leave that as an exercise to the reader.

Simulation loop GPU utilization

I am struggling with utilizing a simulation loop.
There are 3 kernel launched in every cycle.
The next time step size is computed by the second kernel.
while (time < end)
{
kernel_Flux<<<>>>(...);
kernel_Timestep<<<>>>(d_timestep);
memcpy(&h_timestep, d_timestep, sizeof(float), ...);
kernel_Integrate<<<>>>(d_timestep);
time += h_timestep;
}
I only need copy back a single float. What would be the most efficient way to avoid unnecessary synchronizations?
Thank you in advance. :-)
In CUDA all operations running from the default stream are synchronized. So in the code you've posted kernels will run one after another. From what I can see the kernel kernel_integrate() depends from the result of the kernel kernel_Timestep(), so no way of avoiding synchronization. Anyway if the kernels kernel_Flux() and kernel_Timestep() work on independent data, you can try to execute them in parallel, in two different streams.
If you care about the iteration time a lot, you can probably setup a new stream dedicated to the memcpy of the h_timestep out (you need to use cudaMemcpyAsync in this case). Then use something like speculative execution, where your loop proceeds before you figure out the time. To do so you will have to setup the GPU memory buffers for the next several iterations. You can probably do this by using a circular buffer. You also need to use cudaEventRecord and cudaStreamWaitEvent to synchronize the different streams, such that a next iteration is allowed to proceed only if the time corresponds to the buffer you are about to overwrite, has been calculated (the memcpy stream has done the job), because otherwise you will lose the state at that iteration.
Another potential solution, which I haven't tried but I suspect would work, is to make use of dynamic parallelism. If your cards support that, you can probably put the whole loop in GPU.
EDIT:
Sorry, I just realized that you have the third kernel. Your delay due to synchronization may be because you are not doing cudaMemcpyAsync? It's very likely that the third kernel will run longer than the memcpy. You should be able to proceed without any delay. The only synchronization needed is after each iteration.
The ideal solution would be to move everything to the GPU. However, I cannot do so, because I need to launch CUDPP compact after every few iterations, and it does not support CUDA streams nor dynamics parallelism. I know that the Thrust 1.8 library has copy_if method, which does the same, and it is working with dynamic parallelism. The problem is it does not compile with separate compilation on.
To sum up, now I use the following code:
while (time < end)
{
kernel_Flux<<<gs,bs, 0, stream1>>>();
kernel_Timestep<<<gs,bs, 0, stream1>>>(d_timestep);
cudaEventRecord(event, stream1);
cudaStreamWaitEvent(mStream2, event, 0);
memcpyasync(&h_timestep, d_timestep, sizeof(float), ..., stream2);
kernel_Integrate<<<>>>(d_timestep);
cudaStreamSynchronize(stream2);
time += h_timestep;
}

Code running perfectly on host, put in a kernel, fails for mysterious reasons

I have to port a pre-existing “host-only” backpropagation implementation to CUDA. I think the nature of the algorithm doesn’t matter here, so I won’t give much explanation about the way it works. What I think matter though, is that it uses 3-dimensional arrays, whose all three dimensions are dynamically allocated.
I use VS2010, with CUDA 5.0. And my device is a 2.1. The original host-only code can be downloaded here
→ http://files.getwebb.org/view-cre62u4d.html
Main points of the code:
patterns from adult.data are loaded into memory, using the Data structure, present in “pattern.h”.
several multi-dimensional arrays are allocated
the algorithm is ran over the patterns, using the arrays allocated just before.
If you want to try to run the code don’t forget to modify the PATH constant at the beginning of kernel.cu. I also advise you to use “2” layers, “5” neurons, and a learning rate of “0.00001”. As you can see, this work perfectly. The “MSE” is improving. For those who have no clue about what does this algorithms, let’s simply say that it learns how to predict a target value, based on 14 variables present in the patterns. The “MSE” decrease, meaning that the algorithm makes less mistakes after each “epoch”.
I spent a really long time trying to run this code on the device. And I’m still unsuccessful. Last attempt was done by simply copying the code initializing the arrays and running the algorithm into a big kernel. Which failed again. This code can be downloaded there
→ http://files.getwebb.org/view-cre62u4c.html
To be precise, here are the differences with the original host-only code:
f() and fder(), which are used by the algorithm, become device
functions.
parameters are hardcoded: 2 layers, 5 neurons, and a learning rate of
0.00001
the “w” array is initialized using a fixed value (0.5), not rand()
anymore
a Data structure is allocated in device’s memory, and the data are
sent in device’s memory after they have been loaded from adult.data
in host’s memory
I think I did the minimal amount of modifications needed to make the code run in a kernel. The “kernel_check_learningData” kernel, show some informations about the patterns loaded in device’s memory, proving the following code, sending the patterns from the host to the device, did work:
Data data;
Data* dev_data;
int* dev_t;
double* dev_x;
...
input_adult(PathFile, &data);
...
cudaMalloc((void**)&dev_data, sizeof(Data));
cudaMalloc((void**)&dev_t, data.N * sizeof(int));
cudaMalloc((void**)&dev_x, data.N * data.n * sizeof(double));
// Filling the device with t and x's data.
cudaMemcpy(dev_t, data.t, data.N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_x, data.x, data.N * data.n * sizeof(double), cudaMemcpyHostToDevice);
// Updating t and x pointers into devices Data structure.
cudaMemcpy(&dev_data->t, &dev_t, sizeof(int*), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->x, &dev_x, sizeof(double*), cudaMemcpyHostToDevice);
// Copying N and n.
cudaMemcpy(&dev_data->N, &data.N, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->n, &data.n, sizeof(int), cudaMemcpyHostToDevice);
It apparently fails at the beginning of the forward phase, when reading the “w” array. I can’t find any explanation for that.
I see two possibilities:
the code sending the patterns into device's memory is bugged, despite the fact it seems to work properly, and provoke a bug way further, when beginning the forward phase.
the CUDA API is not behaving like it should!
I’m desperately searching for my mistake for a very long time. So I wondered if the community could provide me with some help.
Thanks.
Here's the problem in your code, and why it works in 64 bit machine mode but not 32 bit machine mode.
In your backpropagation kernel, in the forward path, you have a sequence of code like this:
/*
* for layer = 0
*/
for (i = 0; i < N[0]; i++) { // for all neurons i of layer 0
a[0][i] = x[ data->n * pat + i]; // a[0][i] = input i
}
In 32 bit machine mode (Win32 project, --machine 32 is being passed to nvcc), the failure occurs on the iteration i=7 when the write of a[0][7] occurs; this write is out of bounds. At this point, a[0][7] is intended to hold a double value, but for some reason the indexing is placing us out of bounds.
By the way, you can verify this by simply opening a command prompt in the directory where your executable is built, and running the command:
cuda-memcheck test_bp
assuming test_bp.exe is the name of your executable. cuda-memcheck conveniently identifies that there is an out of bounds write occurring, and even identifies the line of source that it is occurring on.
So why is this out of bounds? Let's take a look earlier in the kernel code where a[0][] is allocated:
a[0] = (double *)malloc( N[0] * sizeof(double *) );
^ oops!!
a[0][] is intended to hold double data but you're allocating pointer storage.
As it turns out, in a 64 bit machine the two types of storage are the same size, so it ends up working. But in a 32-bit machine, a double pointer is 4 bytes whereas double data is 8 bytes. So, in a 32-bit machine, when we index through this array taking data strides of 8 bytes, we eventually run off the end of the array.
Elsewhere in the kernel code you are allocating storage for the other "layers" of a like this:
a[layer] = (double *)malloc( N[layer] * sizeof(double) );
which is correct. I see that the original "host-only" code seems to contain this error as well. There may be a latent defect in that code as well.
You will still need to address the kernel running time to avoid the windows TDR event, in some fashion, if you want to run on a windows wddm device. And as I already pointed out, this code makes no attempt to use the parallel capability of the machine.

CUDA samples matrixMul error

I am very new to cuda and started reading about parallel programming and cuda just a few weeks ago. After I installed the cuda toolkit, I was browsing the sdk samples (which come with the installation of the toolkit) and wanted to try some of them out. I started with matrixMul from 0_Simple folder. This program executes fine (I am using Visual Studio 2010).
Now I want to change the size of the matrices and try with a bigger one (for example 960X960 or 1024x1024). In this case, something crashes (I get black screen, and then the message: display driver stopped responding and has recovered).
I am changing this two lines in the code (from main function):
dim3 dimsA(8*4*block_size, 8*4*block_size, 1);
dim3 dimsB(8*4*block_size, 8*4*block_size, 1);
before they were:
dim3 dimsA(5*2*block_size, 5*2*block_size, 1);
dim3 dimsB(5*2*block_size, 5*2*block_size, 1);
Can someone point to me what I am doing wrong. and should I alter something else in this example for it to work properly. Thx!
Edit: like some of you suggested, i changed the timeout value (0 somehow did not work for me, I set the timeout to 60), so my driver does not crash, but I get huge list of errors, like:
... ... ...
Error! Matrix[409598]=6.40005159, ref=6.39999986 error term is > 1e-5
Error! Matrix[409599]=6.40005159, ref=6.39999986 error term is > 1e-5
Does this got something to do with the allocation of the memory. Should I make changes there and what could they be?
Your new problem is actually just the strict tolerances provided in the NVidia example. Your kernel is running correctly. It's just complaining that accumluated error is greater than the limit that they had set for this example. This is just because you're doing a lot more math operations which are all accumulating error. If you look at the numbers it's giving you, you're only off of the reference answer by about 0.00005, which is not unusual after a lot of single-precision floating-point math. The reason you're getting these errors now and not with the default matrix sizes is that the original matricies were smaller and thus required a lot less operations to multiply. Matrix multiplication of N x N matricies requires on the order of N^3 operations, so the number of operations required increases much faster than the size of the matrix and the accumulated error would increase in proportion with the number of operations.
If you look near the end of the runTest() function, there's a call to computeGold() which computes the reference answer on your CPU. There should then be a call to something like shrCompareL2fe that compares the results. The last parameter to this is a tolerance. If you increase the size of this tolerance (say, to 1e-3 or 1e-4 instead of 1e-5,) you should eliminate these error messages. Note that there may be a couple of these calls. The version of the SDK examples that I have has an optional CUBLAS implementation, so it has a comparison for that against the gold, too. The one right after the print statement that says "Comparing CUDA matrixMul & Host results" is the one you'd want to change.
I'd advise looking at the indexing used in the kernel (matrixMulCUDA) a bit closer - it sounds like you're writing to unallocated memory.
More specifically, is the only thing that you changed the dimsA and dimsB variables? Inside the kernel they use the thread and block index to access the data - did you also increase the data size accordingly? There is no bounds checking going on in the kernel, so if you just change the kernel launch configuration, but not the data, then odds are you're writing past your data into some other memory
Have you disabled Timeout Detection and Recovery (TDR) in Windows? It is entirely possible that your code is running fine but that the larger matricies caused the kernel execution to exceed Windows' timeout, which causes Windows to assume the card is locked up, so it resets the card and gives you a message identical to the one you describe. Even if that is not your problem here, you definitely want to disable that before doing any serious CUDA work in Windows. The timeout is quite short by default, since normal graphics rendering should take small fractions of a second per frame.
See this post on the NVidia forums that describes TDR and how to turn it off:
WDDM TDR - NVidia devtalk forum
In particular, you probably want to set the key HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrLevel to 0 (Detection Disabled).
Alternatively, you can increase the timeout period by setting
HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrDelay. It defaults to 2 and is specified in seconds. Personally, I have found that TDR is always annoying when doing work in CUDA, so I just turn it off entirely. IIRC, you need to restart your system for any TDR-related changes to take effect.