Run CUDA SASS instructions [duplicate]

With cuobjdump, SASS can be generated from a cubin file using
cuobjdump -sass <input file>. But is there any way to convert the SASS back to a cubin?

There are no "assemblers" provided as part of the official NVIDIA CUDA toolchain. The NVIDIA toolchain can take CUDA C/C++, or PTX, and convert it to a cubin or other executable format.
However there are some community-developed assemblers:
Perhaps the most recent (and probably the only one worth considering at this time) is maxas.
There was also an older one, asfermi, developed for the Fermi generation of CUDA GPUs. I don't think it has been updated or maintained.

I would like to add that, depending on the architecture (Maxwell/Kepler, etc.), you can use a community-developed assembler/disassembler to convert the SASS back to a cubin. Here are some:
Maxas: https://github.com/NervanaSystems/maxas
KeplerAs: https://github.com/PAA-NCIC/PPoPP2017_artifact/tree/master/KeplerAs

Related

GPGPUsim PTX extraction

Just as the title says, I'm learning how to use GPGPUsim. When I read the "PTX extraction" section of the manual, I found that it says "In CUDA version 4.0 and later, the fat cubin file used to extract the ptx and sass is not available any more.", which confuses me. How should I understand this, and what happened in CUDA version 4.0 and later?
Thank you anyway :)
When CUDA 4.0 was released (in 2011!), the device toolchain was switched to a fully ELF-based object model. Prior to that, a plain text file was used, with encoded binary sections for the emitted SASS code and plain text for the PTX. As a result, extracting PTX or SASS from an ELF CUDA object requires a utility, cuobjdump, to access the requisite code.
Thus the pre/post CUDA 4.0 distinction.
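For example, given a post-4.0 CUDA object or executable (the file name here is just a placeholder), something like the following dumps the embedded PTX and SASS:

    cuobjdump -ptx kernel.o
    cuobjdump -sass kernel.o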

Can I use cuda without using nvcc on my host code?

I'm writing a single-header library that executes a CUDA kernel. I was wondering if there is a way to get around the <<<>>> syntax, or to get C source output from nvcc?
You can avoid the host language extensions by using the CUDA driver API instead. It is a little more verbose and you will require a little more boilerplate code to manage the context, but it is not too difficult.
Conventionally, you would compile to PTX or a binary payload to load at runtime; however, NVIDIA now also ships an experimental JIT CUDA C compiler library, libNVVM, which you could try if you want JIT compilation from source.
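As a rough sketch of the driver API route (the module file name and kernel name below are hypothetical, and error checking of the CUresult return codes is omitted for brevity), loading a precompiled PTX or cubin and launching a kernel looks something like this:

    #include <cuda.h>

    int main(void)
    {
        CUdevice dev;
        CUcontext ctx;
        CUmodule mod;
        CUfunction kern;

        // Initialize the driver API and create a context on device 0.
        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);

        // Load a module built offline (e.g. with nvcc -ptx kernel.cu) and look up
        // the kernel by name (declared extern "C" in the .cu file to avoid mangling).
        cuModuleLoad(&mod, "kernel.ptx");
        cuModuleGetFunction(&kern, mod, "my_kernel");

        // Set up kernel arguments and launch one block of 128 threads.
        int n = 128;
        void *args[] = { &n };
        cuLaunchKernel(kern, 1, 1, 1, 128, 1, 1, 0, NULL, args, NULL);
        cuCtxSynchronize();

        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
        return 0;
    }

The host side compiles with an ordinary C/C++ compiler and links against the driver library (-lcuda), so nvcc is only needed offline to produce the PTX or cubin payload.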

How to detect NVIDIA CUDA Architecture [duplicate]

I've recently gotten my head around how NVCC compiles CUDA device code for different compute architectures.
From my understanding, when using NVCC's -gencode option, "arch" is the minimum compute architecture required by the programmer's application, and also the minimum device compute architecture that NVCC's JIT compiler will compile PTX code for.
I also understand that the "code" parameter of -gencode is the compute architecture which NVCC completely compiles the application for, such that no JIT compilation is necessary.
After inspection of various CUDA project Makefiles, I've noticed the following occur regularly:
-gencode arch=compute_20,code=sm_20
-gencode arch=compute_20,code=sm_21
-gencode arch=compute_21,code=sm_21
and after some reading I found that multiple device architectures can be compiled for in a single binary file, in this case sm_20 and sm_21.
My questions are: why are so many arch/code pairs necessary, and are all values of "arch" in the above actually used?
What is the difference between that and, say:
-arch compute_20
-code sm_20
-code sm_21
Is the earliest virtual architecture in the "arch" fields selected automatically, or is there some other obscure behaviour?
Is there any other compilation and runtime behaviour I should be aware of?
I've read the manual (http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-compilation) and I'm still not clear about what happens at compile time or at runtime.
Roughly speaking, the code compilation flow goes like this:
CUDA C/C++ device code source --> PTX --> SASS
The virtual architecture (e.g. compute_20, whatever is specified by -arch compute...) determines what type of PTX code will be generated. The additional switches (e.g. -code sm_21) determine what type of SASS code will be generated. SASS is actually executable object code for a GPU (machine language). An executable can contain multiple versions of SASS and/or PTX, and there is a runtime loader mechanism that will pick appropriate versions based on the GPU actually being used.
As you point out, one of the handy features of GPU operation is JIT-compile. JIT-compile will be done by the GPU driver (does not require the CUDA toolkit to be installed) anytime a suitable PTX code is available but a suitable SASS code is not. The definition of a "suitable PTX" code is one which is numerically equal to or lower than the GPU architecture being targeted for running the code. To pick an example, specifying arch=compute_30,code=compute_30 would tell nvcc to embed cc3.0 PTX code in the executable. This PTX code could be used to generate SASS code for any future architecture that the GPU driver supports. Currently this would include architectures like Pascal, Volta, Turing, etc. assuming the GPU driver supports those architectures.
One advantage of including multiple virtual architectures (i.e. multiple versions of PTX), then, is that you have executable compatibility with a wider variety of target GPU devices (although some devices may trigger a JIT-compile to create the necessary SASS).
One advantage of including multiple "real GPU targets" (i.e. multiple SASS versions) is that you can avoid the JIT-compile step, when one of those target devices is present.
If you specify a bad set of options, it's possible to create an executable that won't run (correctly) on a particular GPU.
One possible disadvantage of specifying a lot of these options is code size bloat. Another possible disadvantage is compile time, which will generally be longer as you specify more options.
It's also possible to create executables that contain no PTX, which may be of interest to those trying to obscure their IP.
Creating PTX suitable for JIT should be done by specifying a virtual architecture for the code switch.
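As an illustrative example (the file names and the cc3.0 target are just placeholders), the following embeds both sm_30 SASS and cc3.0 PTX, so that a cc3.0 device runs the SASS directly while newer devices can JIT-compile the PTX:

    nvcc -gencode arch=compute_30,code=sm_30 -gencode arch=compute_30,code=compute_30 -o app kernel.cu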
The purpose of multiple -arch flags is to allow use of the __CUDA_ARCH__ macro for conditional compilation (i.e., using #ifdef) of differently optimized code paths.
See here: http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#virtual-architecture-identification-macro
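A minimal sketch of that pattern (the kernel below is made up purely for illustration) might look like:

    __device__ float load_elem(const float *p)
    {
    #if __CUDA_ARCH__ >= 350
        // When compiling for cc3.5 or newer, route the read through the
        // read-only data cache.
        return __ldg(p);
    #else
        // Older real architectures: plain global load.
        return *p;
    #endif
    }

    __global__ void copy_kernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = load_elem(in + i);
    }

Each real architecture named in the -gencode options gets its own SASS, compiled with __CUDA_ARCH__ set to the corresponding value.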

cuda to llvm bitcode conversion

I want to convert CUDA code to LLVM bitcode so I can instrument it. I have tried gpuocelot, which compiles PTX into CPU-executable code. However, I couldn't get LLVM bitcode from it, so I can't instrument it. There have been activities trying to get CUDA supported in LLVM. Can anyone provide a robust solution to convert CUDA to workable LLVM bitcode? Thanks.
NVIDIA's nvcc actually uses LLVM IR in one of its compilation steps. They might have changed it a little bit (I haven't seen the details). They have explained it here:
https://developer.nvidia.com/cuda-llvm-compiler
You should be able to use Clang to compile CUDA (mixed-mode) to LLVM IR now. Check this page out. Note that this support is still experimental. Feel free to report bugs to the LLVM community.
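As a rough sketch (the file name and the sm_35 target below are placeholders, and depending on your setup you may also need --cuda-path to point at a CUDA toolkit installation), a small kernel such as:

    // axpy.cu - minimal kernel used only to demonstrate emitting LLVM IR
    __global__ void axpy(float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        y[i] = a * x[i] + y[i];
    }

can be lowered to device-side LLVM IR with something like:

    clang++ -x cuda --cuda-device-only --cuda-gpu-arch=sm_35 -S -emit-llvm axpy.cu -o axpy.ll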

Does CUDA use an interpreter or a compiler?

This is a bit of a silly question, but I'm wondering: does CUDA use an interpreter or a compiler?
I'm wondering because I'm not quite sure how CUDA manages to get source code to run on two cards with different compute capabilities.
From Wikipedia:
Programmers use 'C for CUDA' (C with Nvidia extensions and certain restrictions), compiled through a PathScale Open64 C compiler.
So, your answer is: it uses a compiler.
And to touch on the reason it can run on multiple cards (source):
CUDA C/C++ provides an abstraction, it's a means for you to express how you want your program to execute. The compiler generates PTX code which is also not hardware specific. At runtime the PTX is compiled for a specific target GPU - this is the responsibility of the driver which is updated every time a new GPU is released.
These official documents CUDA C Programming Guide and The CUDA Compiler Driver (NVCC) explain all the details about the compilation process.
From the second document:
nvcc mimics the behavior of the GNU compiler gcc: it accepts a range of conventional compiler options, such as for defining macros and include/library paths, and for steering the compilation process.
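For example, a hypothetical invocation (all names here are placeholders) mixes familiar gcc-style options with the CUDA source:

    nvcc -O2 -DUSE_DOUBLE -Iinclude -Llib -lmylib -o app kernel.cu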
This is not limited to CUDA: shaders in DirectX or OpenGL are also compiled to some kind of bytecode and converted to native code by the underlying driver.