I am trying to estimate the effect of restricting register usage on the achieved occupancy of an application. While running my experiments, I tried to restrict the number of registers of the cdpBezierTessellation application found in the NVIDIA samples and got an error.
Flag added to nvcc: -maxrregcount 16
Error: nvlink error : entry function '_Z21computeBezierLinesCDPP10BezierLinei' with max regcount of 16 calls function 'cudaMalloc' with regcount of 18
I don't understand exactly why this is happening. Can anyone help me with this?
As commenters have said, the linker error message is very clear in telling you what is happening. You are compiling your kernel (computeBezierLinesCDP()) and telling the compiler that it may use a maximum of 16 registers, but at the link step (which comes after compilation) the linker finds that one of the functions you call from within the kernel (cudaMalloc()) uses 18 registers. That is a constraint the linker is clearly unable to satisfy!
Since you cannot reduce the number of registers used by cudaMalloc() (it is a pre-compiled library routine), you need to increase your register limit.
If you really need to constrain your kernel to 16 registers then you would need to avoid calling cudaMalloc() (and any other routine that uses more registers). You may be able to avoid allocating memory from within your kernel by pre-allocating from the host.
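For example, here is a minimal sketch of pre-allocating from the host; the names and sizes are illustrative and not taken from the cdpBezierTessellation sample:

// bezier_prealloc_sketch.cu -- sketch only: a host-side pool replaces the
// device-side cudaMalloc(), so the kernel needs no in-kernel allocation.
#include <cuda_runtime.h>

__global__ void computeLines(float2 *vertexPool, int vertsPerLine, int nLines)
{
    int line = blockIdx.x * blockDim.x + threadIdx.x;
    if (line >= nLines) return;
    float2 *myVerts = vertexPool + line * vertsPerLine;   // this line's slice of the pool
    for (int v = 0; v < vertsPerLine; ++v)
        myVerts[v] = make_float2((float)v, (float)line);  // placeholder "tessellation"
}

int main()
{
    const int nLines = 256, vertsPerLine = 32;
    float2 *vertexPool = 0;
    cudaMalloc(&vertexPool, sizeof(float2) * nLines * vertsPerLine); // allocated once, on the host
    computeLines<<<(nLines + 127) / 128, 128>>>(vertexPool, vertsPerLine, nLines);
    cudaDeviceSynchronize();
    cudaFree(vertexPool);
    return 0;
}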
Short Version
I have a kernel that launches a lot of blocks, and I know that there are illegal memory reads happening for blockIdx.y = 312. Running it under cuda-gdb results in sequential execution of blocks, 16 at a time, and it takes very long for execution to reach this block index, even with a conditional breakpoint.
Is there any way to change the order in which thread blocks are scheduled when running under cuda-gdb? If not, is there any other debugging strategy that I might have missed?
Longer Version
I have a baseline convolution CUDA kernel that scales with problem size by launching more blocks. There is a bug for input images with dimensions of the order of 10_000 x 10_000. Running it under cuda-memcheck, I see the following.
...
========= Invalid __global__ read of size 4
========= at 0x00000150 in convolution_kernel_sharedmem(float*, float*, float*)
========= by thread (30,31,0) in block (0,312,0)
...
All illegal accesses appear to be happening for blockIdx.y = 312. Upon running it with cuda-gdb, 16 blocks are launched at a time, starting from (0, 0, 0). I have set a conditional breakpoint at the kernel to stop at the desired block index, but it is taking a very long time to get there.
Is there any way to change the order in which thread blocks are scheduled on the device? If not, is there any alternative debugging strategy that I might have missed?
P.S.: I know that I can use grid-strided loops instead of launching this many blocks, but I would like to know what is wrong with this particular implementation.
Is there any way to change the order in which thread blocks are scheduled when running under cuda-gdb?
There is no way to change the threadblock scheduling order unless you want to rewrite the code and take control of threadblock scheduling yourself. Note that the linked example is not exactly a redefinition of the threadblock scheduling order, but it has all the necessary ingredients; a rough sketch of the idea follows. In practice I don't see a lot of people wanting to do this level of refactoring, but I mention it for completeness.
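The sketch below is not the code from the link and its names are made up: each block pulls a "logical" y-index from a host-supplied order array via an atomic counter, so whichever block runs first processes the suspect index (here 312).

#include <cstdio>

__device__ unsigned int nextSlot = 0;

__global__ void reorderedKernel(const int *blockOrder, int *out)
{
    __shared__ int logicalY;
    if (threadIdx.x == 0)
        logicalY = blockOrder[atomicAdd(&nextSlot, 1u)];   // fetch next logical y-index
    __syncthreads();
    // Use logicalY wherever the original kernel used blockIdx.y.
    if (threadIdx.x == 0)
        out[logicalY] = logicalY;
}

int main()
{
    const int nBlocks = 1024;
    int h_order[nBlocks];
    h_order[0] = 312;                                   // process the suspect index first
    for (int i = 1, y = 0; i < nBlocks; ++y)
        if (y != 312) h_order[i++] = y;                 // remaining indices in natural order
    int *d_order, *d_out;
    cudaMalloc(&d_order, nBlocks * sizeof(int));
    cudaMalloc(&d_out, nBlocks * sizeof(int));
    cudaMemcpy(d_order, h_order, nBlocks * sizeof(int), cudaMemcpyHostToDevice);
    reorderedKernel<<<nBlocks, 256>>>(d_order, d_out);
    cudaDeviceSynchronize();
    return 0;
}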
If not, is there any other debugging strategy that I might have missed?
The method described here can localize your error to a specific line of kernel code. From there you can use, e.g., a conditional printf to identify an illegal index calculation, etc. Note that for that method there is no need to compile your code with debug switches, but you do need to compile with -lineinfo.
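For instance, once the offending line is known, a printf conditioned on the failing block and thread reported by cuda-memcheck keeps the output manageable. This is only a sketch; the kernel body and variable names are illustrative, not your actual code:

#include <cstdio>

// Print the index calculation only for the failing block/thread that
// cuda-memcheck reported, so you are not flooded with output.
__global__ void convolution_kernel_debug(const float *in, float *out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int idx = row * width + col;                      // suspect index calculation
    if (blockIdx.y == 312 && threadIdx.x == 30 && threadIdx.y == 31)
        printf("block(%d,%d) thread(%d,%d): row=%d col=%d idx=%d limit=%d\n",
               blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y,
               row, col, idx, width * height);
    if (row < height && col < width)
        out[idx] = in[idx];
}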
This training topic provides a longer treatment of CUDA debugging.
I've got a C++ CUDA toolkit v9.2 application that works fine when built with -O, but if I build with -g -G, I get CUDA error 7 at runtime:
too many resources requested for launch
I understand from here that this means:
the number of registers available on the multiprocessor is being exceeded. Reduce the number of threads per block to solve the problem.
I'd rather not reduce the threads per block, since it works when optimized. What might I do so that debug builds use fewer registers, more in line with the optimized build? How can I track down where the extra register use is coming from in my application?
As also mentioned in the comments above, debug builds typically require more resources due to various reasons.
You can use the --maxrregcount option or the __launch_bounds__ qualifier to set a limit on how many registers the compiler is allowed to use. Do note that turning this knob really just means trading one resource for another. Forcing the compiler to use fewer registers will generally mean it has to spill more. More spills will generally mean increased local memory requirements. In extreme cases, you may run into another limit there…
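For reference, the two knobs look like this; the register counts, block sizes, and kernel name below are just examples:

// Per-kernel: __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
// tells the compiler the intended launch configuration so it caps register
// usage accordingly (the second argument is optional).
__global__ void __launch_bounds__(256, 2) scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Whole-compilation-unit: limit every kernel in the file to 32 registers.
//   nvcc -maxrregcount 32 myfile.cu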
I've been writing some basic CUDA Fortran code. I would like to be able to determine the amount of shared memory my program uses per thread block (for occupancy calculation). I have been compiling with -Mcuda=ptxinfo in the hope of finding this information. The compilation output ends with
ptxas info : Function properties for device_procedures_main_kernel_
432 bytes stack frame, 1128 bytes spill stores, 604 bytes spill loads
ptxas info : Used 63 registers, 96 bytes smem, 320 bytes cmem[0]
which is the only place in the output that smem is mentioned. There is one array in the global subroutine main_kernel with the shared attribute. If I remove the shared attribute then I get
ptxas info : Function properties for device_procedures_main_kernel_
432 bytes stack frame, 1124 bytes spill stores, 532 bytes spill loads
ptxas info : Used 63 registers, 320 bytes cmem[0]
The smem has disappeared. It seems that only shared memory in main_kernel is being counted: device subroutines in my code use variables with the shared attribute, but these don't appear to be mentioned in the output. For example, the device subroutine evalfuncs includes shared variable declarations, but the relevant output is
ptxas info : Function properties for device_procedures_evalfuncs_
504 bytes stack frame, 1140 bytes spill stores, 508 bytes spill loads
Do all variables with the shared attribute need to be declared in a global subroutine?
Do all variables with the shared attribute need to be declared in a global subroutine?
No.
You haven't shown example code or your compile command, nor have you identified the version of the PGI compiler tools you are using. However, the most likely explanation I can think of for what you are seeing is that, as of PGI 14.x, the default CUDA compile option is to generate relocatable device code. This is documented in section 2.2.3 of the current PGI release notes:
2.2.3. Relocatable Device Code
An rdc option is available for the -ta=tesla and -Mcuda flags that specifies to generate
relocatable device code. Starting in PGI 14.1 on Linux and in PGI 14.2 on Windows, the default
code generation and linking mode for Tesla-target OpenACC and CUDA Fortran is rdc,
relocatable device code.
You can disable the default and enable the old behavior and non-relocatable code by specifying
any of the following: -ta=tesla:nordc, -Mcuda=nordc, or by specifying any 1.x compute
capability or any Radeon target.
So the specific option to enable (or disable) this is:
-Mcuda=(no)rdc
(note that -Mcuda=rdc is the default, if you don't specify this option)
CUDA Fortran separates Fortran host code from device code. For the device code, the CUDA Fortran compiler does a CUDA Fortran->CUDA C conversion, and passes the auto-generated CUDA C code to the CUDA C compiler. Therefore, the behavior and expectations of switches like rdc and ptxinfo are derived from the behavior of the underlying equivalent CUDA compiler options (-rdc=true and -Xptxas -v, respectively).
When CUDA device code is compiled without the rdc option, the compiler will normally try to inline device (sub)routines that are called from a kernel, into the main kernel code. Therefore, when the compiler is generating the ptxinfo, it can determine all resource requirements (e.g. shared memory, registers, etc.) when it is compiling (ptx assembly) the kernel code.
When the rdc option is specified, however, the compiler may (depending on some other switches and function attributes) leave the device subroutines as separately callable routines with their own entry point (i.e. not inlined). In that scenario, when the device compiler is compiling the kernel code, the call to the device subroutine just looks like a call instruction, and the compiler (at that point) has no visibility into the resource usage requirements of the device subroutine. This does not mean that there is an underlying flaw in the compile sequence. It simply means that the ptxinfo mechanism cannot accurately roll up the resource requirements of the kernel and all of its called subroutines at that point in time.
In rdc mode, the ptxinfo output also does not report the total amount of shared memory used by a device subroutine when it is compiling that subroutine.
If you turn off the rdc mode:
-Mcuda=nordc
I believe you will see an accurate accounting of the shared memory used by a kernel plus all of its called subroutines, with a couple of caveats: one is that the compiler is able to successfully inline your called subroutines (pretty likely, and the accounting should still work even if it can't); another is that the kernel and all of its called subroutines are in the same file (i.e. translation unit). If you have kernels that call device subroutines in different translation units, then the rdc option is the only way to make it work.
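For reference, here is a toy CUDA C analogue; the Fortran code goes through the same device compiler, so the behavior is parallel. The file and routine names are made up:

// rollup.cu -- a kernel whose called __device__ routine declares shared memory.
//
//   nvcc -Xptxas -v rollup.cu             whole-program mode: helper() is inlined,
//                                         so its smem shows up in mainKernel's ptxas line
//   nvcc -rdc=true -Xptxas -v rollup.cu   relocatable mode: helper() may be kept as a
//                                         separate entry and not rolled into that line
__device__ float helper(const float *in, int tid)
{
    __shared__ float tile[32];            // shared memory declared in a device subroutine
    tile[threadIdx.x % 32] = in[tid];
    __syncthreads();
    return tile[(threadIdx.x + 1) % 32];
}

__global__ void mainKernel(const float *in, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = helper(in, tid);
}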
Shared memory will still be appropriately allocated for your code at runtime, regardless (assuming you have not violated the total amount of shared memory available). You can also get an accurate reading of the shared memory used by a kernel by profiling your code, using a profiler such as nvvp or nvprof.
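For example, nvprof's per-launch trace reports both static and dynamic shared memory (the SSMem and DSMem columns) for each kernel; ./myapp is a placeholder for your executable:

nvprof --print-gpu-trace ./myapp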
If this explanation doesn't describe what you are seeing, I would suggest providing a complete sample code, as well as the exact compile command you are using, plus the version of PGI tools you are using. (I think it's a good suggestion for future questions as well.)
I am using a Quadro K2000M card, CUDA capability 3.0, CUDA driver 5.5, runtime 5.0, programming with Visual Studio 2010. My GPU algorithm runs many parallel breadth-first searches (BFS) of a tree (which is constant). The threads are independent except for reading from a constant array and the tree. In each thread there can be some malloc/free operations, following the BFS algorithm with queues (no recursion). There are N threads; the number of tree leaf nodes is also N. I used 256 threads per block and (N+256-1)/256 blocks per grid.
Now the problem is that the program works for up to about N=100000 threads but fails for more than that. It also works on the CPU, or on the GPU thread by thread. When N is large (e.g. >100000), the kernel crashes and then cudaMemcpy from device to host also fails. I tried Nsight, but it is too slow.
Now I set cudaDeviceSetLimit(cudaLimitMallocHeapSize, 268435456); I also tried larger values, up to 1G; cudaDeviceSetLimit succeeded but the problem remains.
Does anyone know some common reason for the above problem? Or any hints for further debugging? I tried putting in some printf's, but there is a ton of output. Moreover, once a thread crashes, all remaining printf output is discarded. Thus it is hard to identify the problem.
"CUDA Driver 5.5, runtime 5.0" -- that seems odd.
You might be running into a Windows TDR event. Based on your description, I would check that first. If, as you increase the number of threads, the kernel begins to take more than about 2 seconds to execute, you may hit the Windows timeout.
You should also add proper CUDA error checking to your code, for all kernel calls and CUDA API calls. A Windows TDR event will be more easily evident based on the error codes you receive. Or the error codes may steer you in another direction.
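A common pattern for this kind of checking is a small macro wrapped around every API call, plus a check after each kernel launch. This is only a sketch, and the names in the usage comments (myKernel, grid, block, dst, src, bytes) are placeholders:

#include <cstdio>
#include <cstdlib>

// Wrap every CUDA API call; print the error, file, and line, then abort.
#define CUDA_CHECK(call)                                                      \
    do {                                                                      \
        cudaError_t err_ = (call);                                            \
        if (err_ != cudaSuccess) {                                            \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                       \
                    cudaGetErrorString(err_), __FILE__, __LINE__);            \
            exit(EXIT_FAILURE);                                               \
        }                                                                     \
    } while (0)

// Usage:
//   myKernel<<<grid, block>>>(args);
//   CUDA_CHECK(cudaGetLastError());        // catches launch errors (e.g. after a TDR reset)
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches errors during kernel execution
//   CUDA_CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost));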
Finally, I would run your code with cuda-memcheck in both the passing and failing cases, looking for out-of-bounds accesses in the kernel or other issues.
I'm writing a program that requires the following kernel launch:
dim3 blocks(16,16,16); //grid dimensions
dim3 threads(32,32); //block dimensions
get_gaussian_responses<<<blocks,threads>>>(pDeviceIntegral,itgStepSize,pScaleSpace);
I forgot to free the pScaleSpace array at the end of the program, and then ran the program through the CUDA profiler, which runs it 15 times in succession, using up a lot of memory / causing a lot of fragmentation. Now whenever I run the program, the kernel doesn't even launch. If I look at the list of function calls recorded by the profiler, the kernel is not there. I realize this is a pretty stupid error, but I don't know what I can do at this point to get the program to run again. I have restarted my computer, but that did not help. If I reduce the dimensions of the kernel, it runs fine, but the current dimensions are well within the allowed maximum for my card.
Max threads per block: 1024
Max grid dimensions: 65535,65535,65535
Any suggestions appreciated, thanks in advance!
Try launching with a smaller number of threads. If that works, it means that each of your threads is doing a lot of work or using a lot of memory, so the maximum number of threads cannot practically be launched by CUDA on your hardware.
You may have to make your CUDA code more efficient to be able to launch more threads. You could try slicing your kernel into smaller pieces if it has complex logic inside it. Or get more powerful hardware.
If you compile your code like this:
nvcc -Xptxas="-v" [other compiler options]
the assembler will report the amount of per-thread local memory (stack frame and spills) that the code requires. This can be a useful diagnostic to see what the memory footprint of the kernel is. There is also an API call, cudaThreadSetLimit, which can be used to control the amount of per-thread heap memory which a kernel will try to consume during execution.
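A minimal sketch of that control is below. cudaThreadSetLimit is the old spelling; later toolkits rename it cudaDeviceSetLimit, which is what is shown here, and the heap size is just an example:

#include <cuda_runtime.h>

int main()
{
    // Set the device heap that in-kernel malloc()/free() draw from, before
    // launching any kernel that uses it; then read back what was actually granted.
    size_t heapBytes = 256u * 1024u * 1024u;                  // illustrative: 256 MB
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapBytes);
    cudaDeviceGetLimit(&heapBytes, cudaLimitMallocHeapSize);
    return 0;
}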
Recent toolkits ship with a utility called cuda-memcheck, which provides valgrind-like analysis of kernel memory access, including buffer overflows and illegal memory usage. It might be that your code is overflowing a buffer somewhere and overwriting other parts of GPU memory, leaving the card in a parlous state.
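A typical invocation is simply the following, where ./myapp stands in for your executable; compiling with -lineinfo additionally lets the tool attribute errors to source lines:

cuda-memcheck ./myapp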
I got it! nVidia NSight 2.0 - which supposedly supports CUDA 4 - changed my CUDA_INC_PATH to use CUDA 3.2. No wonder it wouldn't let me allocate 1024 threads per block. All relief and jubilation aside, that is a really stupid and annoying bug considering I already had CUDA 4.0 RC2 installed.