CUDA with MinGW - updated

We have been developing our code on Linux, but would like to compile a Windows executable. The old non-GPU version compiles just fine with MinGW on Windows, so I was hoping to do the same with the CUDA version.
The strategy is to compile the kernel code with nvcc in Visual Studio, and the rest with gcc in MinGW.
So far, we have compiled the .cu file (with the kernel and kernel launches) in Visual Studio without trouble. However, we still can't compile the C code in MinGW. The C code contains CUDA API calls such as cudaMalloc and CUDA types such as cudaEvent_t, so we must include cuda.h and cuda_runtime.h. gcc gives warnings and errors for these headers, for example:
../include/host_defines.h:57:0: warning: "__cdecl" redefined
and
../include/vector_functions.h:127:14: error: 'short_4' has no member named 'x'
Any ideas on how we can include these headers and compile the c portion of the code?

If you are really desperate there might be a way. nvcc is really just a front end for a bunch of compilers. It invokes g++ a lot to strip comments, separate device and host code, handle name mangling, link stuff back together, etc. (use --verbose to get the details).
My idea is as follows: you should be able to compile the host code with MinGW while compiling the device code to a fatbin on a Linux machine (as I guess the device binary is independent of the host machine). Afterwards, link both parts of the code back together with MinGW, or use the driver API to load the fatbin dynamically. Disclaimer: I did not test this!

As far as I know, it is impossible to use CUDA on Windows without MSVC. So, you need MSVC to make nvcc work, and you can compile the CPU code with MinGW and link everything together.
According to http://forums.nvidia.com/index.php?showtopic=30743
"There are no current plans to support mingw."

You might want to take a look at how the cycles renderer handles this, look at https://developer.blender.org/diffusion/B/browse/master/extern/cuew/ and
https://developer.blender.org/diffusion/B/browse/master/intern/cycles/device/device_cuda.cpp
I know it's not an automagic trick but it might help you get started.
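The idea behind a wrapper like cuew, as far as it applies here, is to never include the CUDA headers at all: the driver library is opened at run time and the needed entry points are looked up by name. Below is a minimal sketch of that idea in plain C, not cuew's actual code. The helper names (lib_open, lib_sym, cuda_device_count) are made up for illustration, and the prototypes are written by hand on the assumption that CUresult can be treated as an int.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#define CU_LIB "nvcuda.dll"
#define CU_API __stdcall
static void *lib_open(const char *name) { return (void *)LoadLibraryA(name); }
static void *lib_sym(void *lib, const char *sym)
{
    return (void *)GetProcAddress((HMODULE)lib, sym);
}
#else
#include <dlfcn.h>
#define CU_LIB "libcuda.so.1"
#define CU_API
static void *lib_open(const char *name) { return dlopen(name, RTLD_NOW); }
static void *lib_sym(void *lib, const char *sym) { return dlsym(lib, sym); }
#endif

/* Hand-written prototypes (assumption: CUresult is an enum, int stands in). */
typedef int (CU_API *cuInit_fn)(unsigned int flags);
typedef int (CU_API *cuDeviceGetCount_fn)(int *count);

/* Returns the number of CUDA devices, or -1 if no usable driver is found. */
int cuda_device_count(void)
{
    void *lib = lib_open(CU_LIB); /* kept open for the life of the process */
    cuInit_fn init = NULL;
    cuDeviceGetCount_fn get_count = NULL;
    int n = 0;

    if (lib == NULL)
        return -1;

    /* Look the entry points up by name instead of linking against them. */
    *(void **)&init = lib_sym(lib, "cuInit");
    *(void **)&get_count = lib_sym(lib, "cuDeviceGetCount");
    if (init == NULL || get_count == NULL)
        return -1;

    /* 0 is CUDA_SUCCESS; anything else is treated as "no CUDA here". */
    if (init(0) != 0 || get_count(&n) != 0)
        return -1;

    assert(n >= 0);
    return n;
}
```

Because nothing from the toolkit is included or linked, a file like this should build with MinGW gcc as-is (on older Linux systems, add -ldl). A machine without an NVIDIA driver simply gets -1 back instead of a load-time failure.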

Related

CLion code completion for CUDA missing some functions

I'm using CLion with the CUDA toolkit on Windows 11 with the MSVC compiler. It works and compiles fine, but the code completion is missing a lot of items like cudaMalloc and cudaFree. It does include some items though, like cudaMemAttachGlobal.
I think it's because I haven't included any headers, but nvcc doesn't require explicit inclusion of headers, and the default CMake settings in CLion compiles and runs my .cu files just fine.
Is there anything extra I'm supposed to do to get CLion code completion to look at the entire available API from nvcc?
EDIT: The above description was with the CUDA toolkit on Windows with MSVC. I have now tried it with the CUDA toolkit installed per the NVIDIA installation instructions on Fedora 35, and the symptoms are exactly the same. The completion items are only macros, no actual functions. I looked through cuda_runtime_api.h and the signature for cudaFree is
extern __host__ __cudart_builtin__ cudaError_t CUDARTAPI cudaFree(void *devPtr);
Update:
It seems that if I press Ctrl+Space, the code completion menu works perfectly and is able to complete both cudaMalloc and cudaFree, and anything else. If I don't press Ctrl+Space and just let it show the menu, it still shows the menu but only has macros in it.
[Screenshots: completion menu without Ctrl+Space and with Ctrl+Space]
Original:
This seems to be a bug in either CLion or whatever subroutine (maybe CMake) it calls to get the code completions from header files; I tried this on Fedora Linux and observed the exact same behavior.
In contrast, VS Code has an Nsight plugin developed by NVIDIA, which is able to code-complete functions like cudaMallocManaged and cudaFree with no problems.

CUDA compile multiple .cu files to one file

I am porting some calculations from C# to CUDA.
There are many classes in C# that I want to port; for each C# class I create a .cu and a .cuh file in my CUDA project.
All the classes are related, and all of them are used in the calculations.
I need to preserve the structure of my C# code, because otherwise it would be very easy to introduce errors.
P.S. If I put all the code in one file, everything works as expected, but reading the code or fixing issues becomes a real pain.
I want to compile the CUDA project and use it from C# via the ManagedCuda library.
I can compile a test CUDA project with one .cu file to a .ptx file, load it in C# via ManagedCuda, and call a function from it.
But when I compile my real project with multiple .cu files, I get a separate .ptx file for each .cu file in the project. Moreover, I am not able to load these .ptx files via ManagedCuda; I get the following error:
ErrorInvalidPtx: This indicates that a PTX JIT compilation failed.
This error is expected, because the PTX files cross-reference each other and only make sense when loaded together.
My goal is to compile my CUDA project to one file, but at the same time I do not want to be limited to the specific video card that I have. For this I need to use PTX (or a cubin with PTX included); the PTX will be compiled for the specific device at the moment it is loaded.
I tried setting Generate Relocatable Device Code to Yes (-rdc=true) and compiling to PTX and to cubin. The result is the same: I get several independent files, one per .cu file.
The very short answer is no, you can't do that. The toolchain cannot merge PTX code at the compilation phase.
If you produce multiple PTX files, you will need to use the JIT linker facilities of the CUDA runtime to produce a module which can be loaded into your context. I have no idea whether Managed CUDA supports that or not.
Edit to add that it appears that Managed CUDA does support runtime linking.
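For reference, the JIT linker mentioned above is exposed by the driver API as the cuLink* family of functions. The following is a rough C sketch of linking several PTX files into one loadable module. It is not ManagedCuda code, and the C# side would go through whatever that library exposes for this. To keep the sketch buildable without the CUDA toolkit, the driver is loaded dynamically and the prototypes are hand-declared (assumptions: CUresult and the JIT enums are treated as int, opaque handles as void *, and CU_JIT_INPUT_PTX has the value 1). The function name link_ptx_files and the helper names are hypothetical.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#define CU_LIB "nvcuda.dll"
#define CU_API __stdcall
static void *lib_open(const char *n) { return (void *)LoadLibraryA(n); }
static void *lib_sym(void *l, const char *s)
{
    return (void *)GetProcAddress((HMODULE)l, s);
}
#else
#include <dlfcn.h>
#define CU_LIB "libcuda.so.1"
#define CU_API
static void *lib_open(const char *n) { return dlopen(n, RTLD_NOW); }
static void *lib_sym(void *l, const char *s) { return dlsym(l, s); }
#endif

#define CU_JIT_INPUT_PTX 1 /* assumed value of the CUjitInputType enumerator */

static int (CU_API *p_init)(unsigned int);
static int (CU_API *p_dev_get)(int *, int);
static int (CU_API *p_ctx_retain)(void **, int);
static int (CU_API *p_ctx_set)(void *);
static int (CU_API *p_link_create)(unsigned int, int *, void **, void **);
static int (CU_API *p_link_add_file)(void *, int, const char *,
                                     unsigned int, int *, void **);
static int (CU_API *p_link_complete)(void *, void **, size_t *);
static int (CU_API *p_link_destroy)(void *);
static int (CU_API *p_module_load_data)(void **, const void *);

#define LOAD(ptr, name) ((*(void **)&(ptr) = lib_sym(lib, name)) != NULL)

/* Links `count` PTX files into one module; returns 0 on success, -1 on error. */
int link_ptx_files(const char **paths, int count, void **module_out)
{
    void *lib = lib_open(CU_LIB);
    void *ctx = NULL, *link = NULL, *image = NULL;
    size_t image_size = 0;
    int dev = 0, i, rc = 0;

    if (lib == NULL)
        return -1;
    if (!(LOAD(p_init, "cuInit") && LOAD(p_dev_get, "cuDeviceGet") &&
          LOAD(p_ctx_retain, "cuDevicePrimaryCtxRetain") &&
          LOAD(p_ctx_set, "cuCtxSetCurrent") &&
          LOAD(p_link_create, "cuLinkCreate_v2") &&
          LOAD(p_link_add_file, "cuLinkAddFile_v2") &&
          LOAD(p_link_complete, "cuLinkComplete") &&
          LOAD(p_link_destroy, "cuLinkDestroy") &&
          LOAD(p_module_load_data, "cuModuleLoadData")))
        return -1;

    /* The linker needs a current context (never released in this sketch). */
    if (p_init(0) || p_dev_get(&dev, 0) ||
        p_ctx_retain(&ctx, dev) || p_ctx_set(ctx))
        return -1;
    if (p_link_create(0, NULL, NULL, &link))
        return -1;

    for (i = 0; i < count && rc == 0; ++i)
        rc = p_link_add_file(link, CU_JIT_INPUT_PTX, paths[i], 0, NULL, NULL);
    if (rc == 0)
        rc = p_link_complete(link, &image, &image_size);
    /* The linked image is owned by the link state: load it before destroying. */
    if (rc == 0)
        rc = p_module_load_data(module_out, image);
    p_link_destroy(link);
    return rc ? -1 : 0;
}
```

For the cross-file references to resolve at link time, the PTX files would need to be generated with relocatable device code enabled (-rdc=true), as already tried in the question. With cuda.h available, the same calls can be made directly instead of through function pointers.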

Can I use cuda without using nvcc on my host code?

I'm writing a single header library that executes a cuda kernel. I was wondering if there is a way to get around the <<<>>> syntax, or get C source output from nvcc?
You can avoid the host language extensions by using the CUDA driver API instead. It is a little more verbose and you will require a little more boilerplate code to manage the context, but it is not too difficult.
Conventionally, you would compile to PTX or a binary payload to load at runtime; however, NVIDIA now also ships an experimental JIT CUDA C compiler library, NVRTC, which you could try if you want JIT from source.
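To make the driver API route concrete, here is a rough C sketch of what replaces a kernel<<<grid, block>>>(...) launch: load a module from a PTX, cubin, or fatbin file, look the kernel up by name, and call cuLaunchKernel. So that the sketch builds with any C compiler and without the toolkit, the driver is loaded dynamically and the prototypes are hand-declared (assumptions: CUresult is treated as int and the opaque handles as void *). The name launch_from_module and the helpers are hypothetical, and the primary context is retained but never released, which real code would want to fix.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#define CU_LIB "nvcuda.dll"
#define CU_API __stdcall
static void *lib_open(const char *n) { return (void *)LoadLibraryA(n); }
static void *lib_sym(void *l, const char *s)
{
    return (void *)GetProcAddress((HMODULE)l, s);
}
#else
#include <dlfcn.h>
#define CU_LIB "libcuda.so.1"
#define CU_API
static void *lib_open(const char *n) { return dlopen(n, RTLD_NOW); }
static void *lib_sym(void *l, const char *s) { return dlsym(l, s); }
#endif

static int (CU_API *p_init)(unsigned int);
static int (CU_API *p_dev_get)(int *, int);
static int (CU_API *p_ctx_retain)(void **, int);
static int (CU_API *p_ctx_set)(void *);
static int (CU_API *p_module_load)(void **, const char *);
static int (CU_API *p_get_function)(void **, void *, const char *);
static int (CU_API *p_launch)(void *, unsigned, unsigned, unsigned,
                              unsigned, unsigned, unsigned, unsigned,
                              void *, void **, void **);
static int (CU_API *p_ctx_sync)(void);

#define LOAD(ptr, name) ((*(void **)&(ptr) = lib_sym(lib, name)) != NULL)

/* Launches `kernel` from the module file at `path` on a 1-D grid.
   `params` is an array of pointers to the kernel arguments (may be NULL
   for a kernel without arguments). Returns 0 on success, -1 on error. */
int launch_from_module(const char *path, const char *kernel,
                       unsigned grid_x, unsigned block_x, void **params)
{
    void *lib = lib_open(CU_LIB);
    void *ctx = NULL, *mod = NULL, *func = NULL;
    int dev = 0;

    if (lib == NULL)
        return -1;
    if (!(LOAD(p_init, "cuInit") && LOAD(p_dev_get, "cuDeviceGet") &&
          LOAD(p_ctx_retain, "cuDevicePrimaryCtxRetain") &&
          LOAD(p_ctx_set, "cuCtxSetCurrent") &&
          LOAD(p_module_load, "cuModuleLoad") &&
          LOAD(p_get_function, "cuModuleGetFunction") &&
          LOAD(p_launch, "cuLaunchKernel") &&
          LOAD(p_ctx_sync, "cuCtxSynchronize")))
        return -1;

    /* Context management that the runtime API normally hides. */
    if (p_init(0) || p_dev_get(&dev, 0) ||
        p_ctx_retain(&ctx, dev) || p_ctx_set(ctx))
        return -1;

    if (p_module_load(&mod, path) || p_get_function(&func, mod, kernel))
        return -1;

    /* Counterpart of kernel<<<grid_x, block_x>>>(...): default stream,
       no dynamic shared memory. */
    if (p_launch(func, grid_x, 1, 1, block_x, 1, 1, 0, NULL, params, NULL))
        return -1;
    return p_ctx_sync() ? -1 : 0;
}
```

The kernel has to be findable by name, so it is usually declared extern "C" in the .cu file to avoid C++ name mangling. For a hypothetical kernel taking (float *data, int n), params would be something like { &d_data, &n }, with d_data a device pointer obtained from the driver API's memory allocation calls.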

calling cuda c kernel from fortran 90

I am planning to call a typical matrix multiply CUDA C kernel from a Fortran program. I am referring to the following link: http://www-irma.u-strasbg.fr/irmawiki/index.php/Call_CUDA_from_Fortran . I would be glad if any other resources are available on this. I intend to avoid PGI CUDA Fortran as I do not have the compiler. In the link above I cannot make out what the CUDA.F90 file should be. I assume the last code given in the link is that of main.F90. Kindly help.
Perhaps you need to re-read the very first line of that page you linked to. Those instructions are relying on a set of external ISO C bindings for the CUDA API. That is where the CUDA.F90 file you are asking about comes from. You will need to download and build the FortCUDA bindings to use the instructions on that wiki page.
Edited to add that given your last question was about compilation in Nsight Visual Studio Edition, it would seem that you are running on a Windows platform. You should know that you can't use gcc to build CUDA applications on Windows platforms. The supplied CUDA libraries will only work with either the Microsoft toolchain or (possibly) Intel's compilers in certain cases.

How can I read the PTX?

I am working with compute capability 3.5, CUDA 5 and VS 2010 (and obviously Windows).
I am interested in reading the compiled code to understand better the implication of my C code changes.
What configuration do I need in VS to compile the code for readability (is setting the compilation to PTX enough?)?
What tool do I need to reverse engineer the generated PTX to be able to read it?
In general, to create a ptx version of a particular .cu file, the command is:
nvcc -ptx mycode.cu
which will generate a mycode.ptx file containing the ptx code corresponding to the file you used. It's probably instructive to use the -src-in-ptx option as well:
nvcc -ptx -src-in-ptx mycode.cu
This intersperses the lines of source code with the lines of ptx they correspond to.
To comprehend ptx, start with the PTX documentation.
Note that the compiler may generate ptx code that doesn't correspond to the source code very well, or is otherwise confusing, due to optimizations. You may wish (perhaps to gain insight) to compile some test cases using the -G switch as well, to see how the non-optimized version compares.
Since the windows environment may vary from machine to machine, I think it's easier if you just look at the path your particular version of msvc++ is using to invoke nvcc (look at the console output from one of your projects when you compile it) and prepend the commands I give above with that path. I'm not sure there's much utility in trying to build this directly into Visual Studio, unless you have a specific need to compile from ptx to an executable. There are also a few sample codes that have to do with ptx in some fashion.
Also note for completeness that ptx is not actually what's executed by the device (but it is generally pretty close). It is an intermediate code that can be retargeted to devices within a family by nvcc or by a portion of the compiler that also lives in the GPU driver. To see the actual code executed by the device, we use the executable instead of the source code, and the tool to extract the machine assembly code is:
cuobjdump -sass mycode.exe
Similar caveats about prepending an appropriate path, if needed. I would start with the ptx. I think for what you want to do, it's enough.