I have a CUDA class, let's call it A, defined in a header file. I have written a test kernel which creates an instance of class A, which compiles fine and produces the expected result.
In addition, I have my main CUDA kernel, which also compiles fine and produces the expected result. However, when I add code to my main kernel to instantiate an instance of class A, the nvcc compiler fails with a segmentation fault.
Update:
To clarify, the segmentation fault happens during compilation, not when running the kernel. The line I am using to compile is:
`nvcc --cubin -arch compute_20 -code sm_20 -I<My include dir> --keep kernel.cu`
where <My include dir> is the path to my local directory containing some utility header files.
My question is: before spending a lot of time isolating a minimal example exhibiting the behaviour (not trivial, due to a relatively large code base), has anyone encountered a similar issue? Would it be possible for the nvcc compiler to fail and die if the kernel is either too long or uses too many registers?
If an issue such as register count can affect the compiler this way, then I will need to rethink how to implement my kernel to use fewer resources. This would also mean that trimming things down to a minimal example will likely make the problem disappear. However, if this is not even a possibility, I don't want to waste time on a dead end, but will instead try to cut things down to a minimal example and file a bug report with NVIDIA.
Update:
As per the suggestion of @njuffa, I reran the compilation with the -v flag enabled. The output ends with the following:
#$ ptxas -arch=sm_20 -m64 -v "/path/to/kernel_ptx/kernel.ptx" -o "kernel.cubin"
Segmentation fault
# --error 0x8b --
This suggests the problem is due to the ptxas program, which is failing to generate a CUDA binary from the ptx file.
This would appear to have been a genuine bug of some sort in the CUDA 5.0 ptxas assembler. It was reported to NVIDIA and we can assume that it was fixed sometime during the more than three years since the question was asked and this answer added.
[This answer has been assembled from comments and added as a community wiki entry to get this question off the unanswered question list.]
Related
I've recently gotten my head around how NVCC compiles CUDA device code for different compute architectures.
From my understanding, when using NVCC's -gencode option, "arch" is the minimum compute architecture required by the programmer's application, and also the minimum device compute architecture that NVCC's JIT compiler will compile PTX code for.
I also understand that the "code" parameter of -gencode is the compute architecture which NVCC completely compiles the application for, such that no JIT compilation is necessary.
After inspection of various CUDA project Makefiles, I've noticed the following occur regularly:
-gencode arch=compute_20,code=sm_20
-gencode arch=compute_20,code=sm_21
-gencode arch=compute_21,code=sm_21
and after some reading, I found that multiple device architectures could be compiled for in a single binary file - in this case sm_20, sm_21.
My questions are: why are so many arch/code pairs necessary? Are all of the "arch" values in the above actually used?
What is the difference between that and, say:
-arch compute_20
-code sm_20
-code sm_21
Is the earliest virtual architecture in the "arch" fields selected automatically, or is there some other obscure behaviour?
Is there any other compilation and runtime behaviour I should be aware of?
I've read the manual, http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-compilation and I'm still not clear regarding what happens at compilation or runtime.
Roughly speaking, the code compilation flow goes like this:
CUDA C/C++ device code source --> PTX --> SASS
The virtual architecture (e.g. compute_20, whatever is specified by -arch compute...) determines what type of PTX code will be generated. The additional switches (e.g. -code sm_21) determine what type of SASS code will be generated. SASS is actually executable object code for a GPU (machine language). An executable can contain multiple versions of SASS and/or PTX, and there is a runtime loader mechanism that will pick appropriate versions based on the GPU actually being used.
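If you want to verify what actually got embedded in a particular executable, cuobjdump can list the embedded images; a minimal sketch, assuming your executable is called a.out:
cuobjdump --list-elf a.out
cuobjdump --list-ptx a.out
The first command lists the SASS (cubin) images and the second the PTX images, each named after its target architecture.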
As you point out, one of the handy features of GPU operation is JIT-compile. JIT-compile will be done by the GPU driver (does not require the CUDA toolkit to be installed) anytime a suitable PTX code is available but a suitable SASS code is not. The definition of a "suitable PTX" code is one which is numerically equal to or lower than the GPU architecture being targeted for running the code. To pick an example, specifying arch=compute_30,code=compute_30 would tell nvcc to embed cc3.0 PTX code in the executable. This PTX code could be used to generate SASS code for any future architecture that the GPU driver supports. Currently this would include architectures like Pascal, Volta, Turing, etc. assuming the GPU driver supports those architectures.
One advantage of including multiple virtual architectures (i.e. multiple versions of PTX), then, is that you have executable compatibility with a wider variety of target GPU devices (although some devices may trigger a JIT-compile to create the necessary SASS).
One advantage of including multiple "real GPU targets" (i.e. multiple SASS versions) is that you can avoid the JIT-compile step, when one of those target devices is present.
If you specify a bad set of options, it's possible to create an executable that won't run (correctly) on a particular GPU.
One possible disadvantage of specifying a lot of these options is code size bloat. Another possible disadvantage is compile time, which will generally be longer as you specify more options.
It's also possible to create executables that contain no PTX, which may be of interest to those trying to obscure their IP.
Creating PTX suitable for JIT should be done by specifying a virtual architecture for the code switch.
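For example, here is a sketch of a build line that embeds SASS for two real architectures plus PTX for JIT on future ones (app.cu is a hypothetical file name):
nvcc -gencode arch=compute_20,code=sm_20 \
     -gencode arch=compute_20,code=sm_21 \
     -gencode arch=compute_20,code=compute_20 \
     -o app app.cu
The code=compute_20 entry is what embeds the PTX; the two code=sm_2x entries embed pre-compiled SASS so those devices skip the JIT step.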
The purpose of multiple -arch flags is to use the __CUDA_ARCH__ macro for conditional compilation (i.e., using #ifdef) of differently-optimized code paths.
See here: http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#virtual-architecture-identification-macro
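As a minimal sketch of that mechanism (file and kernel names are made up; device-side printf requires cc 2.0 or later):
#include <cstdio>

__global__ void which_path()
{
#if __CUDA_ARCH__ >= 300
    // taken when this compilation pass targets cc 3.0 or newer
    printf("compiled for compute capability >= 3.0\n");
#else
    // taken on the cc 2.x compilation pass
    printf("compiled for compute capability < 3.0\n");
#endif
}

int main()
{
    which_path<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
Compiled with, say, -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30, each compilation pass sees a different __CUDA_ARCH__ value, and the runtime loader picks the SASS matching the actual device.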
I am working with Capabilities 3.5, CUDA 5 and VS 2010 (and obviously Windows).
I am interested in reading the compiled code to better understand the implications of my C code changes.
What configuration do I need in VS to compile the code for readability (is setting the compilation to PTX enough?)?
What tool do I need to reverse engineer the generated PTX to be able to read it?
In general, to create a ptx version of a particular .cu file, the command is:
nvcc -ptx mycode.cu
which will generate a mycode.ptx file containing the ptx code corresponding to the file you used. It's probably instructive to use the -src-in-ptx option as well:
nvcc -ptx -src-in-ptx mycode.cu
which will intersperse the lines of source code with the lines of ptx they correspond to.
To comprehend ptx, start with the documentation.
Note that the compiler may generate ptx code that doesn't correspond to the source code very well, or is otherwise confusing, due to optimizations. You may wish (perhaps to gain insight) to compile some test cases using the -G switch as well, to see how the non-optimized version compares.
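If it helps to have something concrete to experiment with, a minimal, hypothetical mycode.cu for the commands above could be as simple as:
// scale n elements by factor, one element per thread
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;
}
Compiling this once normally and once with -G makes the effect of the optimizer on the generated ptx quite visible.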
Since the windows environment may vary from machine to machine, I think it's easier if you just look at the path your particular version of msvc++ is using to invoke nvcc (look at the console output from one of your projects when you compile it) and prepend the commands I give above with that path. I'm not sure there's much utility in trying to build this directly into Visual Studio, unless you have a specific need to compile from ptx to an executable. There are also a few sample codes that have to do with ptx in some fashion.
Also note for completeness that ptx is not actually what's executed by the device (but it is generally pretty close). It is an intermediate code that can be re-targeted to devices within a family by nvcc or by a portion of the compiler that also lives in the GPU driver. To see the actual code executed by the device, we use the executable instead of the source code, and the tool to extract the machine assembly code is:
cuobjdump -sass mycode.exe
Similar caveats about prepending an appropriate path, if needed. I would start with the ptx. I think for what you want to do, it's enough.
I'm writing a CUDA program but I'm getting the obnoxious warning:
Warning: Cannot tell what pointer points to, assuming global memory space
This is coming from nvcc, and I can't disable it.
Is there any way to filter out warning from third-party tools (like nvcc)?
I'm asking for a way to filter out of the output window log errors/warnings coming from custom build tools.
I had the same annoying warnings; I found help on this thread: link.
You can either remove the -G flag on the nvcc command line,
or
change the compute_10,sm_10 to compute_20,sm_20 in the Cuda C/C++ options of your project if you're using Visual Studio.
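For reference, the command-line equivalent of those project settings would be something along these lines (kernel.cu is a hypothetical file name; the important parts are the cc2.0 target and the absence of -G):
nvcc -gencode arch=compute_20,code=sm_20 -c kernel.cu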
We have been developing our code in linux, but would like to compile a windows executable. The old non-gpu version compiles just fine with mingw in windows, so I was hoping I'd be able to do the same with the CUDA version.
The strategy is to compile kernel code with nvcc in visual studio, and the rest with gcc in mingw.
So far, we easily compiled the .cu file (with the kernel and kernel launches) in visual studio. However, we still can't compile the C code in mingw. The C code contains CUDA API calls such as cudaMalloc and CUDA types such as cudaEvent_t, so we must include cuda.h and cuda_runtime.h. But gcc gives warnings and errors for these headers, for example:
../include/host_defines.h:57:0: warning: "__cdecl" redefined
and
../include/vector_functions.h:127:14: error: 'short_4' has no member named 'x'
Any ideas on how we can include these headers and compile the c portion of the code?
If you are really desperate, there might be a way. nvcc is really just a frontend for a bunch of compilers. It invokes g++ a lot to strip comments, separate device and host code, handle name mangling, link stuff back together, etc. (use --verbose to get the details).
My idea is as follows: You should be able to compile the host code with mingw while compiling the device code to a fatbin on a linux machine (as I guess the device binary is host-machine independent). Afterwards link both parts of the code back together with mingw or use the driver API to load the fatbin dynamically. Disclaimer: Did not test!
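To sketch the driver-API half of that idea (equally untested; kernel.fatbin and mykernel are hypothetical names, with the fatbin produced separately via something like nvcc -fatbin kernel.cu):
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* load the device code built on the linux machine */
    if (cuModuleLoad(&mod, "kernel.fatbin") != CUDA_SUCCESS) {
        fprintf(stderr, "failed to load kernel.fatbin\n");
        return 1;
    }
    /* declare the kernel extern "C" (or look it up by its
       mangled name) so cuModuleGetFunction can find it */
    cuModuleGetFunction(&fn, mod, "mykernel");

    /* launch a parameterless kernel on a 1x1x1 grid and block */
    cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, NULL, NULL, NULL);
    cuCtxSynchronize();

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
The host side here only needs cuda.h and the driver library (nvcuda.dll on windows), so in principle mingw's gcc can compile and link it without ever seeing the runtime-API headers that cause the errors above.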
As far as I know, it is impossible to use CUDA without MSVC. So, you need MSVC to make nvcc work, and you can compile CPU code with mingw and link everything together.
According to http://forums.nvidia.com/index.php?showtopic=30743
"There are no current plans to support mingw."
You might want to take a look at how the cycles renderer handles this, look at https://developer.blender.org/diffusion/B/browse/master/extern/cuew/ and
https://developer.blender.org/diffusion/B/browse/master/intern/cycles/device/device_cuda.cpp
I know it's not an automagic trick but it might help you get started.
I have a CUDA project. It consists of several .cpp files that contain my application logic and one .cu file that contains multiple kernels plus a __host__ function that invokes them.
Now I would like to determine the number of registers used by my kernel(s). My normal compiler call looks like this:
nvcc -arch compute_20 -link src/kernel.cu obj/..obj obj/..obj .. -o bin/..exe -l glew32 ...
Adding the "-Xptxas -v" compiler flag to this call unfortunately has no effect. The compiler still produces the same textual output as before. The compiled .exe also works the same way as before, with one exception: my framerate jumps to 1800fps, up from 80fps.
I had the same problem; here is my solution:
Compile the .cu files into a device-only .ptx file (this discards the host code):
nvcc -ptx *.cu
Compile the .ptx file:
ptxas -v *.ptx
The second step will show you the number of registers used by each kernel and the amount of shared memory used.
Convert the compute_20 to sm_20 in your compiler call. That should fix it.
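That is, something along these lines (file name hypothetical) should make ptxas report the per-kernel register and shared memory usage during the normal build:
nvcc -arch sm_20 -Xptxas -v -c kernel.cu
The reason is that -arch compute_20 alone stops at PTX generation, so ptxas never runs and its -v output never appears; a real sm_XX target forces the ptxas step.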
When using "-Xptxas -v" together with "-arch compute_XX", we cannot get the verbose information (register count, etc.). If we want to see the verbose output without losing the chance of assigning the GPU architecture (-arch, -code) ahead of time, we can do the following: nvcc -arch compute_XX *.cu -keep, then ptxas -v *.ptx. But we will be left with many intermediate files. Certainly, kogut's answer is to the point.
When you compile with
nvcc --ptxas-options=-v
you may also want to control your compiler's verbose output defaults.
For example, in Visual Studio go to:
Tools -> Options -> Projects and Solutions -> Build and Run
and set the build output verbosity to Normal.
Not exactly what you were looking for, but you can use the CUDA visual profiler shipped with the nvidia gpu computing sdk. Besides much other useful information, it shows the number of registers used by each kernel in your application.