I am very new to CUDA and started reading about parallel programming and CUDA just a few weeks ago. After I installed the CUDA toolkit, I browsed the SDK samples (which come with the installation of the toolkit) and wanted to try some of them out. I started with matrixMul from the 0_Simple folder. This program executes fine (I am using Visual Studio 2010).
Now I want to change the size of the matrices and try bigger ones (for example 960x960 or 1024x1024). In this case, something crashes: I get a black screen, and then the message "display driver stopped responding and has recovered".
I am changing these two lines in the code (in the main function):
dim3 dimsA(8*4*block_size, 8*4*block_size, 1);
dim3 dimsB(8*4*block_size, 8*4*block_size, 1);
before they were:
dim3 dimsA(5*2*block_size, 5*2*block_size, 1);
dim3 dimsB(5*2*block_size, 5*2*block_size, 1);
Can someone point out what I am doing wrong, and should I alter something else in this example for it to work properly? Thanks!
Edit: As some of you suggested, I changed the timeout value (0 somehow did not work for me, so I set the timeout to 60). Now my driver does not crash, but I get a huge list of errors, like:
... ... ...
Error! Matrix[409598]=6.40005159, ref=6.39999986 error term is > 1e-5
Error! Matrix[409599]=6.40005159, ref=6.39999986 error term is > 1e-5
Does this have something to do with memory allocation? Should I make changes there, and what could they be?
Your new problem is actually just the strict tolerance used in the NVIDIA example. Your kernel is running correctly; it is just complaining that the accumulated error is greater than the limit they set for this example. You are simply doing many more math operations, all of which accumulate error. If you look at the numbers it's giving you, you're only off from the reference answer by about 0.00005, which is not unusual after a lot of single-precision floating-point math. The reason you're getting these errors now and not with the default matrix sizes is that the original matrices were smaller and thus required far fewer operations to multiply. Matrix multiplication of N x N matrices requires on the order of N^3 operations, so the number of operations grows much faster than the size of the matrix, and the accumulated error grows in proportion to the number of operations.
If you look near the end of the runTest() function, there's a call to computeGold() which computes the reference answer on your CPU. There should then be a call to something like shrCompareL2fe that compares the results. The last parameter to this is a tolerance. If you increase this tolerance (say, to 1e-3 or 1e-4 instead of 1e-5), you should eliminate these error messages. Note that there may be more than one of these calls. The version of the SDK examples that I have includes an optional CUBLAS implementation, so it compares that against the gold results as well. The one right after the print statement that says "Comparing CUDA matrixMul & Host results" is the one you'd want to change.
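For illustration, the edit could look roughly like this (the helper name, argument names, and exact signature vary between SDK versions, so treat it as a sketch of the idea rather than the literal line from the sample):

// sketch: loosen the comparison tolerance from 1e-5 to, say, 1e-4
shrBOOL res = shrCompareL2fe(reference, h_C, size_C, 1.0e-4f);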
I'd advise looking a bit more closely at the indexing used in the kernel (matrixMulCUDA); it sounds like you're writing to unallocated memory.
More specifically, is the only thing you changed the dimsA and dimsB variables? Inside the kernel, the thread and block indices are used to access the data; did you also increase the data size accordingly? There is no bounds checking in the kernel, so if you change only the kernel launch configuration but not the data, odds are you're writing past your data into some other memory.
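For illustration only, here is a naive, untiled sketch (not the SDK's shared-memory kernel, and the parameter names are made up) showing what a bounds check would look like, so that a mismatch between the launch configuration and the allocations cannot write outside the matrices:

__global__ void matrixMulNaive(float *C, const float *A, const float *B,
                               int wA, int wB, int hA)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= hA || col >= wB) return;   // guard against extra threads

    float sum = 0.0f;
    for (int k = 0; k < wA; ++k)
        sum += A[row * wA + k] * B[k * wB + col];
    C[row * wB + col] = sum;
}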
Have you disabled Timeout Detection and Recovery (TDR) in Windows? It is entirely possible that your code is running fine but that the larger matrices caused the kernel execution to exceed Windows' timeout, which causes Windows to assume the card is locked up, so it resets the card and gives you a message identical to the one you describe. Even if that is not your problem here, you definitely want to disable TDR before doing any serious CUDA work on Windows. The timeout is quite short by default, since normal graphics rendering should take small fractions of a second per frame.
See this post on the NVidia forums that describes TDR and how to turn it off:
WDDM TDR - NVidia devtalk forum
In particular, you probably want to set the key HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrLevel to 0 (Detection Disabled).
Alternatively, you can increase the timeout period by setting
HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrDelay. It defaults to 2 and is specified in seconds. Personally, I have found that TDR is always annoying when doing work in CUDA, so I just turn it off entirely. IIRC, you need to restart your system for any TDR-related changes to take effect.
Do __shfl_xx_sync() instructions, where only some lanes participate, need an additional __syncwarp() instruction, or is setting a mask enough?
I cannot provide a working minimal example, as the code is very long and confidential, and the error appeared only in certain run/build configurations.
The code looks basically like the following:
if (threadIdx.x >= 30) {
    temp.x = __shfl_up_sync(0xC0000000, x, 1);   // only lanes 30 and 31 are active here
    temp.y = __shfl_up_sync(0xC0000000, y, 1);
}
// __syncwarp();
__shfl_up_sync(0xffffffff, w, 1, 32);            // all 32 lanes participate
Release builds worked fine; with debug builds, lanes 30 and 31 waited (according to the debugger and the SASS) at a different sync instruction than the other lanes.
When I introduced __syncwarp(), the debug builds also ran through. But is this problem now definitely fixed?
I am using a mask in the first two shuffle instructions indicating that only lanes 30 and 31 participate. What happens if the scheduler executes lanes 0 to 29 first and they reach the second shuffle instruction (the one in which all lanes participate)? Those lanes would then wait there for lanes 30 and 31, which only afterwards reach the upper shuffle instructions. Can the two shuffles be distinguished?
If the __syncwarp() is needed: why would it behave differently than the shuffle instruction with mask 0xffffffff itself?
Because it is of a different type (a shuffle sync instead of a plain sync)? Or did the program only work this way by accident?
(The __syncwarp() intrinsic is probably useful here anyway (for performance reasons), as the threads converge at that point.)
If __syncwarp() is not enough: How to make sure the kernel does not hang? Is there generally another recommended way than __syncwarp()?
I am running this on a Turing RTX 2060 Mobile (and debugging with Visual Studio).
No, you should not need a __syncwarp() here. CUDA went from e.g. __shfl_up() to __shfl_up_sync() to avoid this. I think the problem is that you are trying to shuffle up data from a thread that is not participating in the call, i.e. thread 30 is trying to get data from thread 29, so thread 29 has to participate.
Threads may only read data from another thread which is actively participating in the __shfl_sync() command. If the target thread is inactive, the retrieved value is undefined.
from the docs. This explanation is still somewhat unsatisfactory, though, as you seem to get a deadlock instead of an undefined value. But maybe that is intended behavior for a debug build?
That being said, I'm not quite sure how to do this elegantly, because just including thread 29 in the conditional and the mask will only shift the problem to thread 29 trying to get data from thread 28. In the examples given in the documentation, the intrinsic is always executed by all threads and the results are then used conditionally.
My best guess is that you want thread 29 to participate, but with a delta of 0. I have not found anything saying that delta needs to be the same across threads.
You might also want to use __ballot_sync() to retrieve the mask, as shown in Listing 3 of this blog post, to avoid bugs from manually specifying a mask that has to be changed whenever the conditional changes.
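As a minimal sketch of that pattern (the kernel name and data layout are made up, this is not the original code): every lane executes the shuffle with the full mask, so no lane can end up waiting on an inactive source lane, and only the lanes that need the result actually use it.

__global__ void shuffleSketch(float *out, const float *in)
{
    int lane = threadIdx.x & 31;
    float x = in[threadIdx.x];

    // all 32 lanes participate in the shuffle
    float up = __shfl_up_sync(0xffffffffu, x, 1);

    // only lanes 30 and 31 consume the shuffled value
    out[threadIdx.x] = (lane >= 30) ? up : x;
}

If the branch really cannot be avoided, __ballot_sync(0xffffffff, predicate) computes the participation mask from the same predicate as the conditional, so the two cannot silently drift apart.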
I have an OpenGL particle simulation where the position of each particle is calculated in a CUDA kernel. Most memory resides in GPU memory, but there is a single float value I have to update from the CPU each frame.
At the moment I use cudaMemcpyAsync() to copy the float value to the GPU, but (at least from what I can tell) this slows down performance quite a bit. I used nvprof to see which calls take the longest, with these results:
Calls Avg Min Max Name
477 2.9740us 2.8160us 4.5440us simulation(float3*, float*, float3*, float*)
477 89.033us 18.600us 283.00us cudaLaunchKernel
477 47.819us 10.200us 120.70us cudaMemcpyAsync
I don't think I can do much about the kernel launch itself, but of the calls that happen every frame, cudaMemcpyAsync() seems to take the longest.
I have also tried using pinned memory and cudaHostGetDevicePointer() as described here, but for some reason this increases the kernel launch times even more, more than making up for the time saved by not needing the memcpy.
I guess there has to be a better/faster way to update my single float variable to the GPU?
The easiest way is to add an extra parameter to the simulation kernel and declare it as a plain float rather than a pointer to float, so the value travels in the kernel launch parameter block that CUDA sends to the GPU when you launch the kernel. Then you avoid the separate copy command altogether. (I'm assuming CUDA packs the whole kernel parameter block into a single copy, since kernel parameter space is limited to a few kB or less.)
simulation(fooPointer,
           barPointer,
           fooBarPointer,
           floatVariable   // plain float, passed by value
           );
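For illustration, the kernel signature would change so that the last argument is a plain float (these parameter names are guesses based on the profiler output simulation(float3*, float*, float3*, float*)):

// sketch: the per-frame value arrives by value in the launch parameter block,
// so no separate cudaMemcpyAsync() is needed; assumes the launch exactly
// covers the particle count
__global__ void simulation(float3 *pos, float *mass, float3 *vel, float value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    pos[i].x += vel[i].x * value;
    pos[i].y += vel[i].y * value;
    pos[i].z += vel[i].z * value;
}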
Or, try double buffering between data update and rendering, or between data update and compute, so that the rendered image lags the simulation calculation by 1-2 frames (per-frame latency gets worse) but "frames per second" increases.
If it's not an interactive simulation, hiding compute/render/copy latencies with double or triple buffering should work.
If you are after minimizing per-frame time (quicker response to user input into the simulation?), then you should embed the float variable at the end of an array that you already send to the simulation, or of whatever structure you are using. If you already send a 1 MB+ float buffer to the GPU, then appending 4 B (one float) to the end of it should not make much difference, and you can read the value from there. One copy operation should be faster than two copy operations of the same total size.
If you are literally sending just 4 B to the GPU each frame (with a simple function generating that data), then (as 3Dave said in the comments) you can try adding an extra kernel that updates the value on the GPU, and pay only the overhead of a kernel launch instead of both the copy command overhead and the data copy overhead. On the positive side, that extra kernel's overhead might be hidden if there is a "graph" of kernels running each frame automatically, without enqueueing all of them again and again.
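A sketch of that extra kernel (the names here are hypothetical):

// one-thread kernel that writes the new value, replacing the per-frame memcpy
__global__ void updateParam(float *d_param, float newValue)
{
    *d_param = newValue;
}

// per frame, on the same stream as the simulation kernel:
// updateParam<<<1, 1, 0, stream>>>(d_param, hostValue);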
Here,
https://devblogs.nvidia.com/cuda-graphs/
The part
We are going to create a simple code which mimics this pattern. We will then use this to demonstrate the overheads involved with the standard launch mechanism and show how to introduce a CUDA Graph comprising the multiple kernels, which can be launched from the application in a single operation.
cudaGraphLaunch(instance, stream);
They say the per-kernel launch overhead with this "graph" feature is only 3-4 microseconds when there are many (20) kernels in the algorithm.
Since graphs support other commands too, you can run the copy and compute parts in parallel CUDA streams within a graph and switch their inputs with double buffering, so all the CUDA work stays within CUDA's context before the output is handed to rendering.
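A rough sketch of the capture-and-replay pattern from that post (the kernel, stream, and buffer names are placeholders, and the exact cudaGraphInstantiate signature differs between CUDA versions):

cudaGraph_t graph;
cudaGraphExec_t instance;

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
for (int k = 0; k < 20; ++k)
    shortKernel<<<blocks, threads, 0, stream>>>(d_data);
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);

// each frame: one call replays the whole captured sequence
cudaGraphLaunch(instance, stream);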
(Maybe) you don't even have to change the data mechanism at all. Just try sending the float's binary representation as the pointer value itself, and in the kernel only read the pointer value (not the data it would point to) and convert it back to a float. I don't know if CUDA returns an error for this as long as you never actually dereference the (bogus) address that the float data represents in the kernel.
simulation(fooPointer,
barPointer,
fooBarPointer,
toPtr(floatData) // <----- float to 64/32 bit pointer value
);
and in the kernel:
float val = fromPtrToFloat(parameter4); // converts pointer itself, not the data
But this may not be a preferred practice when you can simply use a "value" type parameter.
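If you do try it anyway, a sketch of those two hypothetical helpers could look like this (the "pointer" is never dereferenced, its bits just carry the float):

#include <cstring>   // memcpy
#include <cstdint>   // uintptr_t

// host side
void* toPtr(float v)
{
    unsigned int bits;
    std::memcpy(&bits, &v, sizeof bits);
    return (void*)(uintptr_t)bits;
}

// device side
__device__ float fromPtrToFloat(const void* p)
{
    return __uint_as_float((unsigned int)(uintptr_t)p);
}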
I'm using the vDSP framework for a real-time audio application based on FFT computation.
After having lots of problems trying to figure out why the algorithm was producing incorrect results, I found the following comment in the official vDSP FFT sample code (DemonstrateFFT.c, lines 242, 416, 548):
/* Zero the signal before timing because repeated FFTs on non-zero
data can cause abnormalities such as infinities, NaNs, and
subnormal numbers.
*/
To reproduce the error, just comment out line 247 (so the signal is not zeroed) and add something similar to the following line at line 273 (just after the vDSP_fft_zrip call):
if (isnan(Observed.realp[0])) printf("Iteration %lu: NaN\n",i); // it would work with any of the components of Observed
It is interesting to observe that reducing N (i.e. increasing the number of FFTs per unit of time) makes the zrip algorithm fail sooner, which kind of makes sense since the comment warns about performing repeated FFTs.
The behavior is also observed with the vDSP_fft_zrop algorithm.
I'm really wondering what the point is of performing FFTs on "zero data" as advised in the comment. Either I'm missing something important, or the vDSP framework is simply not suited for real-time audio processing.
Normal 16 and 24-bit "real time" audio samples will not see this issue.
But benchmarks can create bigger and smaller numbers that exceed the range of double-precision floats when iterated enough times, and this happens with many functions, not just FFTs. Try iterating exp() fed back into itself; that will blow up even faster. It's a problem one encounters using any finite-precision computer arithmetic (not just the ARM and x86 CPUs that vDSP uses).
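A quick way to see the same effect outside vDSP (plain host code, nothing framework-specific): feeding exp() its own output exceeds the single-precision range within a handful of iterations.

#include <cmath>
#include <cstdio>

int main()
{
    float x = 1.0f;
    for (int i = 0; i < 6; ++i) {
        x = std::exp(x);
        std::printf("iteration %d: x = %g\n", i, x);   // reaches inf by iteration 3
    }
    return 0;
}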
I have a very strange issue using cuFFT. I prepared my input data with a kernel that applies a Hanning window. Everything seems fine, yet here is the issue: cuFFT runs on the data WITHOUT the Hanning window applied. I don't understand why.
I tried the following:
test1:
- I run the kernel to apply the window
- I copy the data back to the host and check the values: all is OK, the window is applied
- I copy the values back to the device
- I run the FFT: no luck, it eats the non-windowed data!
test2:
- I don't use a kernel; I apply the window on the CPU
- I run the FFT: it works, it eats the windowed data
Is there any rational explanation for this?
Is there some kind of cache involved here?
NOTE: I use the same device memory pointer in my kernel and in cuFFT
It appears that I ran my applyWindow kernel only on the first 512 samples of my data, which explains why it seemed to do nothing. Running the kernel with the right launch configuration did the job on the entire buffer:
int nbBlock = inputSizeInSamples / deviceProp.maxThreadsPerBlock;
applyWindow<<<nbBlock, deviceProp.maxThreadsPerBlock>>>( ... );
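One caveat worth adding: if inputSizeInSamples is not an exact multiple of maxThreadsPerBlock, the integer division above leaves a tail of samples unprocessed. A common pattern (a sketch, reusing the same variable names) rounds the block count up and bounds-checks inside the kernel:

int threads = deviceProp.maxThreadsPerBlock;
int nbBlock = (inputSizeInSamples + threads - 1) / threads;   // round up
applyWindow<<<nbBlock, threads>>>( ... );   // kernel should check idx < inputSizeInSamples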
I replaced
if((nMark >> tempOffset) & 1){nDuplicate++;}
else{nMark = (nMark | (1 << tempOffset));}
with
nDuplicate += ((nMark >> tempOffset) & 1);
nMark = (nMark | (1 << tempOffset));
This replacement turns out to be 5 ms slower on a GT 520 graphics card.
Could you tell me why? Or do you have any ideas to help me improve it?
The native instruction set for the GPU deals with small conditions very efficiently via predication. Additionally, the ISET instruction converts a condition code register into an integer with the value 0 or 1, which naturally fits with your conditional increment.
My guess is that the key difference between the first and second formulations is that you've effectively hidden the fact that it's an if/else.
To tell for sure, you can use cuobjdump to look at the microcode generated for the two cases: specify --keep to nvcc and use cuobjdump on the .cubin file to see the disassembled microcode.
Shot in the dark, but in the latter implementation you are now always incrementing/re-assigning the nDuplicate variable, whereas previously you did not touch it when the if test was false. I'm guessing the overhead comes from that, but you don't describe your test data set, so I don't know whether that was already the case.
Does your program exhibit significant branch divergence? If you're running e.g. 100 warps and only 5 have divergent behavior, and they run in 5 SMs, you would only see 21 time cycles instead of the expected 20: a 5% increase that could easily be outweighed by doing 2x the work in each thread to avoid the rare divergence.
Barring that, the 520 is a fairly modern graphics card and might incorporate modern SIMT scheduling techniques, e.g. dynamic warp formation and thread block compaction, to hide SIMT stalls. You might look into its architectural features (specs), or write a simple benchmark that generates n-way branch divergence and measure the slowdown.
Barring that, check where your variables live. Does making them shared affect performance/results? Since the second version always accesses all variables while the first can avoid accessing nDuplicate, slow (uncoalesced global?) memory accesses could explain it.
Just some things to think about.
For low-level optimization, it is often helpful to look at the low-level assembly (SASS) of the kernel directly. You can do this with the cuobjdump tool distributed as part of the CUDA Toolkit. Basic usage is to compile with -keep in nvcc then do:
cuobjdump -sass mykernel.cubin
Then you can see the exact sequence of instructions and compare them. I'm not sure why version 1 would be faster than version 2 of the code, but the SASS listings might give you a clue.