Check if SIMD machine code is generated for LLVM IR - llvm-clang

I have a C++ program that uses the LLVM libraries to generate an LLVM IR module, then compiles and executes it.
The code uses vector types, and I want to check whether it translates to SIMD instructions correctly on my architecture.
How do I find this out? Is there a way to see the assembly code that is generated from this IR?

You're probably looking for some combination of -emit-llvm, which outputs IR instead of native assembly, and -S, which outputs assembly instead of object files.
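For example, assuming your program writes the module out to a file such as module.ll and you are on an x86-64 host (both the file name and the target are just assumptions here), you can lower it with llc and scan the result for vector registers, or go through clang for an ordinary source file:

clang -O3 -S -emit-llvm vec.cpp -o vec.ll      # IR instead of native assembly
clang -O3 -S vec.cpp -o vec.s                  # assembly instead of an object file
llc -O3 -mcpu=native module.ll -o module.s     # lower IR your own program dumped
grep -E 'xmm|ymm|zmm' module.s                 # on x86-64, SIMD shows up as SSE/AVX register use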

Related

CUDA compile multiple .cu files to one file

I am porting some calculations from C# to CUDA.
There are many C# classes I want to port, and for each class I create a .cu and a .cuh file in my CUDA project.
All the classes are related, and all of them are used in the calculations.
I need to preserve the structure of my C# code, because otherwise it would be very easy to make mistakes.
P.S. If I put all the code in one file, everything works as expected, but reading the code or fixing issues becomes a real pain.
I want to compile the CUDA project and use it from C# via the ManagedCuda library.
I can compile a test CUDA project with one .cu file to a .ptx file, load it in C# via ManagedCuda, and call a function from it.
But when I compile my real project with multiple .cu files, I end up with a separate .ptx file for each .cu file in the project, and on top of that I am not able to load these .ptx files via ManagedCuda; I get the following error:
ErrorInvalidPtx: This indicates that a PTX JIT compilation failed.
This error is expected, because the .ptx files cross-reference each other and only make sense if they are loaded together.
My goal is to compile my CUDA project to one file, but at the same time I do not want to be limited to the specific video card I have. For that I need to use PTX (or a cubin with PTX included), so that the PTX is compiled for the specific device at the moment you load it.
I tried setting Generate Relocatable Device Code to Yes (-rdc=true) and compiling to both PTX and cubin; the result is the same, several independent files, one per .cu file.
The very short answer is no, you can't do that. The toolchain cannot merge PTX code at the compilation phase.
If you produce multiple PTX files, you will need to use the runtime JIT linker facilities of CUDA (the cuLink* driver API) to produce a single module which can be loaded into your context. I have no idea whether ManagedCuda supports that or not.
Edit: it appears that ManagedCuda does support runtime linking (see here).
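For reference, the driver-level flow looks roughly like this. This is a minimal untested sketch with error checking omitted; a.ptx and b.ptx are placeholder file names, and the .cu files still need to be built with -rdc=true so the cross-file references survive into the PTX:

#include <cuda.h>

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUlinkState link;
    CUmodule mod;
    void *cubin;
    size_t cubinSize;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* JIT-link the separately generated PTX files into one device image */
    cuLinkCreate(0, NULL, NULL, &link);
    cuLinkAddFile(link, CU_JIT_INPUT_PTX, "a.ptx", 0, NULL, NULL);
    cuLinkAddFile(link, CU_JIT_INPUT_PTX, "b.ptx", 0, NULL, NULL);
    cuLinkComplete(link, &cubin, &cubinSize);

    /* The linked image loads like any other module */
    cuModuleLoadData(&mod, cubin);

    cuLinkDestroy(link);
    cuCtxDestroy(ctx);
    return 0;
}

If ManagedCuda's runtime linking works for you, it is wrapping this same cuLink* API, so the shape of the calls should carry over.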

Can I use CUDA without using nvcc on my host code?

I'm writing a single-header library that executes a CUDA kernel. I was wondering if there is a way to get around the <<<>>> syntax, or get C source output from nvcc?
You can avoid the host language extensions by using the CUDA driver API instead. It is a little more verbose and requires a little more boilerplate code to manage the context, but it is not too difficult.
Conventionally, you would compile to PTX or a binary payload to load at runtime; however, NVIDIA now also ships an experimental JIT CUDA C compiler library, libNVVM, which you could try if you want JIT compilation from source.
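A minimal sketch of that driver API flow, assuming the kernel was compiled separately to kernel.ptx and declared extern "C" as my_kernel (both names are placeholders; error checking omitted):

#include <cuda.h>

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;
    CUdeviceptr d_data;
    int n = 1024;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* The PTX is JIT-compiled for whatever device owns the context */
    cuModuleLoad(&mod, "kernel.ptx");
    cuModuleGetFunction(&fn, mod, "my_kernel");

    cuMemAlloc(&d_data, n * sizeof(float));

    /* Kernel arguments are passed as an array of pointers to each value,
       replacing the <<<grid, block>>> launch syntax */
    void *args[] = { &d_data, &n };
    cuLaunchKernel(fn,
                   (n + 255) / 256, 1, 1,   /* grid dimensions  */
                   256, 1, 1,               /* block dimensions */
                   0, NULL,                 /* shared memory bytes, stream */
                   args, NULL);
    cuCtxSynchronize();

    cuMemFree(d_data);
    cuCtxDestroy(ctx);
    return 0;
}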

Can the PGI compilers output the generated CUDA code to a file

I would like the generated CUDA code to be saved in a file for examination. Is this possible with OpenAcc and PGI compilers?
You should be able to pass -ta=nvidia,keepgpu,keepptx to any of the PGI GPU compilers, which will retain the intermediate code emitted by the toolchain during the build.
Also refer to the command line help, e.g.:
pgcc -help
Note that PGI compilers have recently moved to a more integrated toolchain, which eliminates the generation of CUDA C intermediate source files, so the above approach still works but gives you intermediate files that are not C code (they are LLVM IR and PTX). If you want CUDA C intermediate code, you can also add the nollvm option:
-ta=nvidia,keepgpu,keepptx,nollvm
The "kept" files will generally have .gpu and .h extensions for llvm/CUDA C code, and .ptx extension for PTX.

Does CUDA use an interpreter or a compiler?

This is a bit of a silly question, but I'm wondering if CUDA uses an interpreter or a compiler?
I'm wondering because I'm not quite sure how CUDA manages to get source code to run on two cards with different compute capabilities.
From Wikipedia:
Programmers use 'C for CUDA' (C with Nvidia extensions and certain restrictions), compiled through a PathScale Open64 C compiler.
So, your answer is: it uses a compiler.
And to touch on the reason it can run on multiple cards (source):
CUDA C/C++ provides an abstraction; it's a means for you to express how you want your program to execute. The compiler generates PTX code, which is also not hardware specific. At runtime the PTX is compiled for a specific target GPU - this is the responsibility of the driver, which is updated every time a new GPU is released.
These official documents CUDA C Programming Guide and The CUDA Compiler Driver (NVCC) explain all the details about the compilation process.
From the second document:
nvcc mimics the behavior of the GNU compiler gcc: it accepts a range of conventional compiler options, such as for defining macros and include/library paths, and for steering the compilation process.
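As a concrete illustration of how one build can cover several cards, nvcc can embed native cubins for the architectures you know about alongside PTX that the driver JIT-compiles for anything newer (the architectures and file name below are only examples):

nvcc -gencode arch=compute_52,code=sm_52 \
     -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_70,code=compute_70 \
     app.cu -o app

The last -gencode line is what keeps PTX in the fat binary for forward compatibility with future GPUs.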
This is not limited to CUDA: shaders in DirectX or OpenGL are also compiled to some kind of bytecode and converted to native code by the underlying driver.

CUDA with MinGW - updated

We have been developing our code in linux, but would like to compile a windows executable. The old non-gpu version compiles just fine with mingw in windows, so I was hoping I'd be able to do the same with the CUDA version.
The strategy is to compile kernel code with nvcc in visual studio, and the rest with gcc in mingw.
So far, we easily compiled the .cu file (with the kernel and kernel launches) in Visual Studio. However, we still can't compile the C code in MinGW. The C code contains CUDA API calls such as cudaMalloc and CUDA types such as cudaEvent_t, so we must include cuda.h and cuda_runtime.h. But gcc gives warnings and errors for these headers, for example:
../include/host_defines.h:57:0: warning: "__cdecl" redefined
and
../include/vector_functions.h:127:14: error: 'short_4' has no member named 'x'
Any ideas on how we can include these headers and compile the C portion of the code?
If you are really desperate, there might be a way. nvcc is really just a frontend for a bunch of compilers. It invokes g++ a lot to strip comments, separate device and host code, handle name mangling, link stuff back together, etc. (use --verbose to get the details).
My idea is as follows: you should be able to compile the host code with MinGW while compiling the device code to a fatbin on a Linux machine (as I guess the device binary is host-machine independent). Afterwards, link both parts of the code back together with MinGW, or use the driver API to load the fatbin dynamically, as in the sketch below. Disclaimer: did not test!
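To make that second half concrete (equally untested): something like nvcc -fatbin -o kernel.fatbin kernel.cu on the Linux box builds just the device image, and the MinGW-built host side could then pull it in through the driver API. The file and function names are placeholders:

#include <cuda.h>

/* Assumes a context is already current on the calling thread;
   cuModuleLoad accepts cubin, PTX or fatbin files. */
CUmodule load_device_code(const char *path)
{
    CUmodule mod;
    cuModuleLoad(&mod, path);   /* e.g. "kernel.fatbin" */
    return mod;
}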
As far as I know, it is impossible to use CUDA on Windows without MSVC. So you need MSVC to make nvcc work, and you can compile the CPU code with MinGW and link everything together.
According to http://forums.nvidia.com/index.php?showtopic=30743
"There are no current plans to support mingw."
You might want to take a look at how the Cycles renderer handles this; see https://developer.blender.org/diffusion/B/browse/master/extern/cuew/ and
https://developer.blender.org/diffusion/B/browse/master/intern/cycles/device/device_cuda.cpp
I know it's not an automagic trick, but it might help you get started.