I'm trying to implement a separable convolution filter using CUDA as part of a bigger application I'm working on. My code has multiple CUDA kernels which are called one after the other (each performing one stage of the process). The problem is I keep getting this weird error and I'm not sure what exactly it means or what is causing it. I also can't find anything about it on the Internet except a couple of Stack Overflow questions related to OpenGL and CUDA interoperability (which isn't relevant here; I'm not using OpenGL at all).
Can someone please explain to me why such an error may occur?
Thanks.
I am somewhat familiar with the CUDA visual profiler and the occupancy spreadsheet, although I am probably not leveraging them as well as I could. Profiling & optimizing CUDA code is not like profiling & optimizing code that runs on a CPU. So I am hoping to learn from your experiences about how to get the most out of my code.
There was a post recently looking for the fastest possible code to identify self numbers, and I provided a CUDA implementation. I'm not satisfied that this code is as fast as it can be, but I'm at a loss as to how to figure out both what the right questions are and what tools I can get the answers from.
How do you identify ways to make your CUDA kernels perform faster?
If you're developing on Linux then the CUDA Visual Profiler gives you a whole load of information; knowing what to do with it can be a little tricky. On Windows you can also use the CUDA Visual Profiler, or (on Vista/7/2008) you can use Nexus, which integrates nicely with Visual Studio and gives you combined host and GPU profile information.
Once you've got the data, you need to know how to interpret it. The Advanced CUDA C presentation from GTC has some useful tips. The main things to look out for are:
Optimal memory accesses: you need to know what you expect your code to do and then look for exceptions. So if you are always loading floats, and each thread loads a different float from an array, then you would expect to see only 64-byte loads (on current h/w). Any other loads are inefficient. The profiling information will probably improve in future h/w.
Minimise serialization: the "warp serialize" counter indicates that you have shared memory bank conflicts or constant memory serialization; the presentation goes into more detail about this and what to do about it, as does the SDK (e.g. the reduction sample)
Overlap I/O and compute: this is where Nexus really shines (you can get the same info manually using cudaEvents). If you have a large amount of data transfer you want to overlap the compute and the I/O; see the sketch after this list.
Execution configuration: the occupancy calculator can help with this, but simple methods like commenting out the compute to measure expected vs. measured bandwidth are really useful (and vice versa for compute throughput)
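To make the overlap point concrete, here is a minimal sketch (my own, not from the original answer) that splits the work across two streams with pinned host memory and cudaMemcpyAsync, timed with cudaEvents; the kernel `process` and the sizes are placeholders:

    // Sketch: overlap H2D copies, kernel work, and D2H copies using two streams.
    // Assumptions: placeholder kernel `process`, N divisible by the stream count.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void process(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    int main() {
        const int N = 1 << 20, NSTREAMS = 2, CHUNK = N / NSTREAMS;
        float *h, *d;
        cudaMallocHost((void**)&h, N * sizeof(float));   // pinned memory, needed for async copies
        cudaMalloc((void**)&d, N * sizeof(float));
        for (int i = 0; i < N; ++i) h[i] = 1.0f;

        cudaStream_t s[NSTREAMS];
        for (int i = 0; i < NSTREAMS; ++i) cudaStreamCreate(&s[i]);

        cudaEvent_t start, stop;                         // cudaEvents to time the pipeline
        cudaEventCreate(&start); cudaEventCreate(&stop);
        cudaEventRecord(start);

        for (int i = 0; i < NSTREAMS; ++i) {
            size_t off = (size_t)i * CHUNK;
            cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                            cudaMemcpyHostToDevice, s[i]);
            process<<<(CHUNK + 255) / 256, 256, 0, s[i]>>>(d + off, CHUNK);
            cudaMemcpyAsync(h + off, d + off, CHUNK * sizeof(float),
                            cudaMemcpyDeviceToHost, s[i]);
        }

        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms; cudaEventElapsedTime(&ms, start, stop);
        printf("pipeline time: %.3f ms\n", ms);

        for (int i = 0; i < NSTREAMS; ++i) cudaStreamDestroy(s[i]);
        cudaFreeHost(h); cudaFree(d);
        return 0;
    }

Comparing the timed result against a version that uses a single stream is a quick way to see whether the copies and the compute actually overlap on your hardware.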
This is just a start; check out the GTC presentation and the other webinars on the NVIDIA website.
If you are using Windows... Check Nexus:
http://developer.nvidia.com/object/nexus.html
The CUDA profiler is rather crude and doesn't provide a lot of useful information. The only way to seriously micro-optimize your code (assuming you have already chosen the best possible algorithm) is to have a deep understanding of the GPU architecture, particularly with regard to using shared memory, external memory access patterns, register usage, thread occupancy, warps, etc.
Maybe you could post your kernel code here and get some feedback?
The NVIDIA CUDA developer forum is also a good place to go for help with this kind of problem.
I hung back because I'm no CUDA expert, and the other answers are pretty good IF the code is already pretty near optimal. In my experience, that's a big IF, and there's no harm in verifying it.
To verify it, you need to find out if the code is for sure not doing anything it doesn't really have to do. Here are ways I can see to verify that:
Run the same code on a vanilla CPU, and either take stackshots of it, or use a profiler such as OProfile or RotateRight/Zoom that can give you equivalent information.
Run it on the CUDA processor and do the same thing, if possible.
What you're looking for are lines of code that have high occupancy on the call stack, as shown by the fraction of stack samples containing them. Those are your "bottlenecks". It does not take a very large number of samples to locate them.
CUDA 10 added runtime API calls for putting streams (= queues) into "capture mode", so that instead of executing, the work launched into them is recorded into a "graph". These graphs can then be instantiated and actually executed, or they can be cloned.
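For concreteness, here is a minimal sketch of what that API looks like (my own illustration, using the CUDA 10-era signatures; the kernel and the sizes are placeholders):

    // Sketch: record work into a graph via stream capture, then replay it with a
    // single launch call per iteration.
    #include <cuda_runtime.h>

    __global__ void myKernel(float *d, int n) {          // placeholder kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] += 1.0f;
    }

    int main() {
        const int n = 1 << 20, iterations = 100;
        float *d_data;
        cudaMalloc((void**)&d_data, n * sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        cudaGraph_t graph;
        cudaGraphExec_t graphExec;

        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        myKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);   // recorded, not executed
        cudaStreamEndCapture(stream, &graph);

        cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
        for (int i = 0; i < iterations; ++i)
            cudaGraphLaunch(graphExec, stream);          // one CPU call replays the captured work
        cudaStreamSynchronize(stream);

        cudaGraphExecDestroy(graphExec);
        cudaGraphDestroy(graph);
        cudaFree(d_data);
        return 0;
    }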
But what is the rationale behind this feature? Isn't it unlikely to execute the same "graph" twice? After all, even if you do run the "same code", at least the data is different, i.e. the parameters the kernels take likely change. Or - am I missing something?
PS - I skimmed this slide deck, but still didn't get it.
My experience with graphs is indeed that they are not so mutable. You can change the parameters with 'cudaGraphHostNodeSetParams', but in order for the change of parameters to take effect, I had to rebuild the graph executable with 'cudaGraphInstantiate'. This call takes so long that any gain from using graphs is lost (in my case). Setting the parameters only worked for me when I built the graph manually. When getting the graph through stream capture, I was not able to set the parameters of the nodes, as you do not have the node pointers. You would think the call 'cudaGraphGetNodes' on a stream-captured graph would return you the nodes, but the node pointer returned was NULL for me even though the 'numNodes' variable had the correct number. The documentation explicitly mentions this as a possibility but fails to explain why.
Task graphs are quite mutable.
There are API calls for changing/setting the parameters of task graph nodes of various kinds, so one can use a task graph as a template: instead of enqueueing the individual nodes before every execution, one changes the parameters of each node before every execution (and perhaps not all nodes actually need their parameters changed).
For example, see the documentation for cudaGraphHostNodeGetParams and cudaGraphHostNodeSetParams.
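As a rough sketch of the "template" idea (my own illustration, not from the answer; the kernel `step` and the sizes are placeholders), a kernel node can be added manually and its arguments patched on the instantiated graph with cudaGraphExecKernelNodeSetParams, without re-instantiating:

    // Sketch: build a one-node graph by hand, then patch the kernel's parameters on
    // the instantiated graph (no cudaGraphInstantiate per change).
    #include <cuda_runtime.h>

    __global__ void step(float *d, int n) {              // placeholder kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 0.5f;
    }

    int main() {
        int n = 1 << 20;
        float *d_buf;
        cudaMalloc((void**)&d_buf, n * sizeof(float));

        void *args[] = { &d_buf, &n };
        cudaKernelNodeParams p = {};
        p.func = (void*)step;
        p.gridDim = dim3((n + 255) / 256);
        p.blockDim = dim3(256);
        p.sharedMemBytes = 0;
        p.kernelParams = args;

        cudaGraph_t graph;
        cudaGraphExec_t graphExec;
        cudaGraphNode_t node;

        cudaGraphCreate(&graph, 0);
        cudaGraphAddKernelNode(&node, graph, nullptr, 0, &p);
        cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
        cudaGraphLaunch(graphExec, 0);

        // Next "iteration": change an argument and patch the executable graph in place.
        n = 1 << 19;
        p.gridDim = dim3((n + 255) / 256);
        cudaGraphExecKernelNodeSetParams(graphExec, node, &p);
        cudaGraphLaunch(graphExec, 0);
        cudaDeviceSynchronize();

        cudaGraphExecDestroy(graphExec);
        cudaGraphDestroy(graph);
        cudaFree(d_buf);
        return 0;
    }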
Another useful feature is concurrent kernel execution. In manual mode, one can add nodes to the graph with dependencies, and the runtime will exploit the available concurrency automatically using multiple streams. The feature itself is not new, but making it automatic is useful for certain applications.
When training a deep learning model it is common to re-run the same set of kernels in the same order but with updated data. Also, I would expect CUDA to perform optimizations by knowing statically what the next kernels will be. We can imagine that CUDA could prefetch more instructions or adapt its scheduling strategy when it knows the whole graph.
CUDA Graphs tries to solve the problem that, in the presence of many small kernel invocations, you see considerable time spent on the CPU dispatching work for the GPU (launch overhead).
It allows you to trade resources (time, memory, etc.) to construct a graph of kernels that you can launch with a single invocation from the CPU instead of doing multiple invocations. If you don't have enough invocations, or your algorithm is different each time, then it won't be worth it to build a graph.
This works really well for anything iterative that uses the same computation underneath (e.g., algorithms that need to converge to something) and it's pretty prominent in a lot of applications that are great for GPUs (e.g., think of the Jacobi method).
You are not going to see great results if you have an algorithm that you invoke once, or if your kernels are big; in that case the CPU invocation overhead is not your bottleneck. A succinct explanation of when you need it can be found in the Getting Started with CUDA Graphs post.
Where task-graph-based paradigms shine, though, is when you define your program as tasks with dependencies between them. You give a lot of flexibility to the driver / scheduler / hardware to do the scheduling itself without much fine-tuning on the developer's part. There's a reason why we have been spending years exploring the ideas of dataflow programming in HPC.
I cannot even achieve overlapping memcpy and kernel execution with the simpleStreams example in the CUDA SDK, let alone in my own programs. These threads argue it is a problem with the WDDM driver in Windows:
Why it is not possible to overlap memHtoD with GPU kernel with GTX 590,
CUDA kernels not launching before CudaDeviceSynchronize
Time between Kernel Launch and Kernel Execution
and suggest to:
flush the WDDM queue with cudaEventQuery() (or a similar query call). (Does not work.)
submit work to the streams in a breadth-first manner. (Does not work; see the sketch after this list for what breadth-first issue order means.)
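For anyone unfamiliar with the term, here is a rough sketch of breadth-first submission (my own illustration; `kernel`, the chunking, and the buffers are placeholders): each phase is issued across all streams before moving on, instead of issuing copy/kernel/copy per stream in turn.

    // Breadth-first issue order (sketch). Assumes pinned host buffer h, device buffer d,
    // nStreams streams in stream[], a chunk size CHUNK, and a placeholder kernel `kernel`.
    for (int i = 0; i < nStreams; ++i)                   // phase 1: all H2D copies
        cudaMemcpyAsync(d + i * CHUNK, h + i * CHUNK, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, stream[i]);
    for (int i = 0; i < nStreams; ++i)                   // phase 2: all kernels
        kernel<<<(CHUNK + 255) / 256, 256, 0, stream[i]>>>(d + i * CHUNK, CHUNK);
    for (int i = 0; i < nStreams; ++i)                   // phase 3: all D2H copies
        cudaMemcpyAsync(h + i * CHUNK, d + i * CHUNK, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[i]);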
This thread argues it is a bug in Fermi:
How can I overlap memory transfers and kernel execution in a CUDA application?
This thread:
http://blog.icare3d.org/2010/04/tesla-compute-drivers.html
proposes a solution to mitigate the problems with WDDM on Windows. However, it only works for a Tesla card and it requires an additional video card to drive the display, since the proposed drivers are compute-only drivers.
However, none of these threads provide a real solution. I would appreciate it, if NVIDIA could comment on this problem and come up with a solution, since apparently a lot of people are experiencing this problem.
TL;DR: The issue is caused by the WDDM TDR delay option in Nsight Monitor! When set to false, the issue appears. Instead, if you set the TDR delay value to a very high number, and the "enabled" option to true, the issue goes away.
Read below for other (older) steps I followed until I came to the solution above, and some other possible causes.
I just recently was able to mostly solve this problem! It is specific to Windows and Aero, I think. Please try these steps and post your results to help others! I have tried it on a GTX 650 and a GT 640.
Before you do anything, consider using both the onboard GPU (for the display) and the discrete GPU (for computations), because there are verified issues with the NVIDIA driver for Windows! When you use the onboard GPU, said drivers don't get fully loaded, so many bugs are evaded. Also, system responsiveness is maintained while working!
Make sure your concurrency problem is not related to other issues like old drivers (including BIOS), wrong code, an incapable device, etc.
Go to Computer > Properties
Select Advanced system settings on the left side
Go to the Advanced tab
Under Performance, click Settings
In the Visual Effects tab, select the "Adjust for best performance" option.
This will disable Aero and almost all visual effects. If this configuration works, you can try enabling the visual-effect boxes one by one until you find the precise one that causes problems!
Alternatively, you can:
Right-click on the desktop and select Personalize
Select a theme from the basic themes that doesn't have Aero.
This will also work like the above, but with more visual options enabled. For my two devices, this setting also works, so I kept it.
Please, when you try these solutions, come back here and post your findings!
For me, it solved the problem for most cases (a tiled DGEMM I have written), but NOTE THAT I still can't run "simpleStreams" properly and achieve concurrency...
UPDATE: The problem is fully solved with a new Windows installation! The previous steps improved the behavior for some cases, but a fresh install solved all the problems!
I will try to find a less radical way of solving this problem; maybe restoring just the registry will be enough.
I am working on a GPU trace emulation tool on Windows as part of my research work in grad school. To be specific, I am working on CUDA runtime trace emulation.
I use simple DLL injection with MS Detours to intercept the CUDA runtime APIs. I store the API calls and their parameters in a trace file. I run into some problems while trying to emulate the API calls from my trace file (I use the word playback to denote this action).
A typical trace file begins by making calls to __cudaRegisterFatBinary and __cudaRegisterFunction. This is followed by a call to cudaMalloc.
What I did:
1) I came across the famous GPUOcelot and found the cubin structure that NVIDIA is currently using. I use it to save the address parameter of __cudaRegisterFatBinary in intercept mode, and during playback I pass the pointer to __cudaRegisterFatBinary after repopulating the structure in memory.
2) In __cudaRegisterFunction I am not sure what the parameters hostFunction, deviceFunction and deviceName refer to. I mean, I don't understand how I could populate them while playing back from my trace file. I am just saving the pointers from the original execution and using them to imitate the call. But there is no way of knowing whether the call goes through fine, since it does not have a return value.
3) cudaMalloc, following these two entry-point functions, returns CUDA error code 11, which is "invalid value" according to the NVIDIA documentation. I have no idea why this should be the case. I am assuming that something is wrong with the previous two function calls. I also have a feeling that something is wrong with the implicit primary context creation by the CUDA runtime. Can someone give me some insights into CUDA runtime execution and point me to what I might be missing?
I know it's a ton of information without any useful code. I don't know which part of the code to post here; I will do it when people start taking interest in my question and ask me specific things about my project. Initially I am just hoping that I am missing something big and high-level that one of you can spot.
I greatly appreciate your time and interest!
Sounds very interesting overall. Your "Error: CUDA invalid value" could be related to the params of __cudaRegisterFunction. The param 'deviceName' sounds like it identifies which GPU (card?) to use. Check the CUDA SDK; there are many demos that enumerate the GPUs on the system, so perhaps those values are valid for 'deviceName'. As for 'hostFunction' and 'deviceFunction', these sound like either function IDs or perhaps function pointers. Also, you can call 'cudaGetLastError()' to test whether the function call was successful (it returns 'cudaSuccess' if everything is OK... take a look at the error-logging macros in the SDK). Good luck!
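To make the error-checking suggestion concrete, here is a minimal sketch in the spirit of the SDK's error-logging macros (my own version, not the exact SDK code):

    // Simple CUDA runtime error-checking helper, similar in spirit to the SDK macros.
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    #define CUDA_CHECK(call)                                                      \
        do {                                                                      \
            cudaError_t err__ = (call);                                           \
            if (err__ != cudaSuccess) {                                           \
                fprintf(stderr, "CUDA error %d (%s) at %s:%d\n",                  \
                        (int)err__, cudaGetErrorString(err__), __FILE__, __LINE__); \
                exit(EXIT_FAILURE);                                               \
            }                                                                     \
        } while (0)

    // Usage during playback, e.g. right after replaying the registration calls:
    //   float *devPtr = nullptr;
    //   CUDA_CHECK(cudaMalloc((void**)&devPtr, 1024 * sizeof(float)));
    //   CUDA_CHECK(cudaGetLastError());   // catches errors from preceding calls/launches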
I have posted my problem in the CUDA forums, but I'm not sure if it's appropriate to post a link here for more ideas, in case the two forums have significantly different audiences. The link is here. I apologize for any inconvenience and appreciate any comments on this question, as I haven't heard back yet on some specifics of a particular CUDA memory access and management problem. Thanks in advance.
I'm not sure if this is relevant without seeing more of your code, but where is CudaObj's destructor being called from?
you said:
However, if I do it this way, I run into errors exiting the application in the line of CudaObj's destructor where cudaFree() is called. This causes the memory cleanup code after CUDA context's cleanup code not being executed due to the error, leaving behind a mess of memory leaks.
After your description of how you moved the CUDA setup into the beginning of thread2's main function: if you're calling the destructor on CudaObj from a different thread, then doing the cudaFree cleanup will fail for the same reason that you had to move the CUDA initialization into thread 2. It sounds like you know this already, but the CUDA context is specific to a single thread in your process. Cleaning up from a different thread is not supported according to the documentation, though I've never tried it myself.
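As a rough sketch of what I mean (my own illustration; this 'CudaObj' is just a stand-in, since I haven't seen your class), keep the allocation, the use, and the cleanup all inside the thread that owns the context:

    // Sketch: create, use, and destroy the CUDA resources all within thread2,
    // so cudaMalloc and cudaFree run against the same context.
    #include <cuda_runtime.h>
    #include <thread>

    struct CudaObj {                                  // stand-in for the real class
        float *d_buf = nullptr;
        explicit CudaObj(size_t n) { cudaMalloc((void**)&d_buf, n * sizeof(float)); }
        ~CudaObj()                 { cudaFree(d_buf); }   // must run in the owning thread
    };

    void thread2Main() {
        CudaObj obj(1 << 20);                         // context is created/used in this thread
        // ... launch kernels, copy data, etc. ...
    }   // destructor (and cudaFree) runs here, still inside thread2

    int main() {
        std::thread t(thread2Main);                   // all CUDA work and cleanup stay in t
        t.join();
        return 0;
    }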
Hope this helps