Can I compile a CUDA program without having a CUDA device?

Is it possible to compile a CUDA program without having a CUDA-capable device on the same node, using only the NVIDIA CUDA Toolkit...?

The answer to your question is YES.
The nvcc compiler driver does not depend on the physical presence of a device, so you can compile CUDA code even without a CUDA-capable GPU. Be warned, however, that, as remarked by Robert Crovella, the CUDA driver library libcuda.so (cuda.lib on Windows) ships with the NVIDIA driver and not with the CUDA Toolkit installer. This means that code using the driver API (whose entry points are prefixed with cu; see Appendix H of the CUDA C Programming Guide) will need a forced installation of a "recent" driver even though no NVIDIA GPU is present; running the driver installer separately with the --help command-line switch lists the options for such an installation.
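For illustration, here is a minimal sketch (not from the original answer; the file name and build command are assumptions) of a program that uses only cu-prefixed driver API entry points. It compiles against the toolkit headers alone, but linking requires libcuda.so, which only a driver installation provides:

```cuda
// driver_api_check.cu — illustrative sketch of a driver API program.
// Assumed build command: nvcc driver_api_check.cu -o driver_api_check -lcuda
// The -lcuda library (libcuda.so) comes from the NVIDIA driver, not the toolkit.
#include <cuda.h>
#include <cstdio>

int main() {
    // cuInit is a driver API entry point (cu prefix); on a machine without a
    // usable GPU it fails at run time, but compiling and linking still work.
    CUresult res = cuInit(0);
    if (res != CUDA_SUCCESS) {
        const char* name = nullptr;
        cuGetErrorName(res, &name);
        std::printf("cuInit failed: %s\n", name ? name : "unknown error");
        return 1;
    }
    int driverVersion = 0;
    cuDriverGetVersion(&driverVersion);
    std::printf("Installed driver supports CUDA %d.%d\n",
                driverVersion / 1000, (driverVersion % 100) / 10);
    return 0;
}
```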
Following the same rationale, you can compile CUDA code for one architecture when your node hosts a GPU of a different architecture. For example, you can compile code for a GeForce GT 540M (compute capability 2.1) on a machine hosting a GT 210 (compute capability 1.2).
Of course, in both cases (no GPU, or a GPU of a different architecture), you will not be able to successfully run the code.
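As a concrete sketch of compiling with no device present at all (the file name and the sm_21 target are only illustrative, and assume a toolkit old enough to still support that architecture):

```cuda
// saxpy.cu — trivial kernel used only to show that compilation needs no GPU.
// Assumed build command on a GPU-less node:
//   nvcc -arch=sm_21 -c saxpy.cu -o saxpy.o
// Compilation and linking succeed; only launching the kernel needs a device.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
```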
In the early versions of CUDA it was possible to compile code in an emulation mode and run the compiled code on a CPU, but device emulation has been deprecated for some time. If you don't have a CUDA-capable device but want to run CUDA code, you can try gpuocelot (though I don't have any experience with it).

Related

NVIDIA Ampere GPU Architecture Compatibility

Can anyone please help me understand the NVIDIA 30-series devices with the Ampere architecture and which CUDA versions are compatible with them?
From the thread below and from all over the net I understand that support for Ampere was added in CUDA Toolkit v11:
https://forums.developer.nvidia.com/t/can-rtx-3080-support-cuda-10-1/155849
What I don't understand is how that makes sense together with this:
https://docs.nvidia.com/cuda/ampere-compatibility-guide/index.html
Section "1.3.1. Applications Built Using CUDA Toolkit 10.2 or Earlier"
So 🤷‍♂️ is it possible or not with CUDA 10.1?
Thank you very much 🙏
Note the sentence
CUDA applications built using CUDA Toolkit versions 2.1 through 10.2 are compatible with NVIDIA Ampere architecture based GPUs as long as they are built to include PTX versions
(emphasis mine)
Plus the explanation in the section above it:
When a CUDA application launches a kernel on a GPU, the CUDA Runtime determines the compute capability of the GPU in the system and uses this information to find the best matching cubin or PTX version of the kernel. If a cubin compatible with that GPU is present in the binary, the cubin is used as-is for execution. Otherwise, the CUDA Runtime first generates compatible cubin by JIT-compiling the PTX and then the cubin is used for the execution. If neither compatible cubin nor PTX is available, kernel launch results in a failure.
In effect: the CUDA toolkit remains ABI-compatible from version 2.1 through 11, so an application built with an old version will continue to load at runtime. The CUDA runtime will then detect that your kernels were built for an architecture that is not compatible with Ampere, take the embedded PTX, and JIT-compile a new version at runtime.
As noted in the comments, only a current driver is required on the production system for this to work.
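For a build to be forward-compatible in this way, the fat binary has to carry PTX in addition to (or instead of) cubins. A hedged sketch of the relevant nvcc flags, using Volta (compute capability 7.0) purely as an example target:

```cuda
// kernel.cu — the build flags matter more here than the kernel body.
// Assumed build with a pre-Ampere toolkit:
//   nvcc -gencode arch=compute_70,code=sm_70 \
//        -gencode arch=compute_70,code=compute_70 \
//        kernel.cu -o app
// code=sm_70      -> embeds a cubin for compute capability 7.0 devices
// code=compute_70 -> embeds PTX that the runtime can JIT-compile for later
//                    architectures such as Ampere (compute capability 8.x)
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}
```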

Nsight Compute says: "Profiling is not supported on this device" - why?

I have a machine with an NVIDIA GTX 1050 Ti GPU (compute capability 6.1), and I am trying to profile a kernel in a program I built with CUDA 11.4. My OS distribution is Devuan GNU/Linux 4 Chimaera (~= Debian 11 Bullseye).
Nsight Compute starts my program and shows me API call after API call, but when I get to the first kernel launch, it gives me an error message in the Details column of the API call listing:
Error: Profiling is not supported on this device
Why? What's wrong with my device? Is it a permissions issue?
tl;dr: Nsight Compute no longer supports Pascal GPUs.
Nsight Compute used to support Pascal-microarchitecture GPUs (compute capability 6.x) up until version 2019.5.1. Beginning with the 2020 releases, Nsight Compute dropped support for Pascal.
If you're wondering why, no reason or justification was given to my knowledge (see also the quote below). This is especially puzzling, or annoying, given the short period of time between the release of the post-Pascal GPUs and this dropping of support (as little as 1.5 years if you look at the consumer GTX cards).
On the other hand, you may still use the NVIDIA Visual Profiler tool with Pascal cards, so they did not throw you entirely under the bus. And you can also download and use Nsight Compute 2019.5.1.
To quote an NVIDIA moderator's statement on the matter on the NVIDIA developer forums:
Pascal support was deprecated, then dropped from Nsight Compute after Nsight Compute 2019.5.1. The profiling tools that support Pascal in the CUDA Toolkit 11.1 and later are nvprof and visual profiler.

Can a CUDA application built and running on a Jetson TX2 run on a Jetson Xavier?

I have a CUDA application that was built with CUDA Toolkit 9.0 and runs fine on a Jetson TX2 board.
I now have a Jetson Xavier board, flashed with JetPack 4, which installs CUDA Toolkit 10.0 (only 10.0 is available).
What do I need to do if I want to run the same application on the Xavier?
NVIDIA's documentation suggests that as long as I specify the correct target hardware when running nvcc, I should be able to run on future hardware thanks to JIT compilation. But does this hold across different versions of the CUDA Toolkit (9 vs. 10)?
In theory (and note that I don't have access to a Xavier board to test any of this), you should be able to run a cross-compiled CUDA 9 application (and that might mean getting both the ARM and the GPU architecture settings right) on a CUDA 10 host.
What you will need to make sure of is that you either statically link or copy all the CUDA runtime API library components your application requires onto the Xavier board. Note that there is still an outside chance that those libraries might lack the GPU and ARM features needed to run correctly on a Xavier system, or that you will hit more subtle issues such as libc incompatibility. That you will have to test for yourself.
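As a rough sketch of what that could look like (the command and flags here are assumptions, not a tested recipe; host-compiler and ARM cross-toolchain settings are omitted), you would target the TX2's sm_62, keep PTX so the Xavier's newer architecture can JIT it, and link the runtime statically:

```cuda
// app.cu — illustrative only; the link and -gencode choices are the point.
// Assumed build targeting the TX2 (sm_62), with PTX retained for JIT on
// newer boards and the CUDA runtime linked statically so the target does
// not need a matching toolkit installed:
//   nvcc --cudart=static \
//        -gencode arch=compute_62,code=sm_62 \
//        -gencode arch=compute_62,code=compute_62 \
//        app.cu -o app
__global__ void fill(float* out, float value, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}
```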

cudart_static - when is it necessary?

Since newer drivers ship with the CUDA runtime (I can choose 9.1 or 9.2 on the drivers download page), my question is: should my library (which uses a CUDA kernel internally) be shipped with -lcudart_static?
I had issues launching kernels compiled with 9.2 on systems that had 9.1 CUDA drivers. What's the most 'compatible' way of ensuring my library will run everywhere a recent CUDA driver is installed? (I'm already compiling for a virtual architecture.)
Since newer drivers ship with the CUDA runtime (I can choose 9.1 or 9.2 in the drivers download page)
No, that's incorrect. That choice on the drivers download page reflects the fact that each CUDA version has a minimum required driver version associated with it. It does not mean that the driver ships with the CUDA runtime (stated another way, the driver does not install libcudart.so on Linux and never has; with some careful experimentation on a clean install, you can prove this to yourself).
Some additional comments:
-lcudart_static is actually the default for current/recent versions of nvcc. You can discover this by reading the nvcc manual. Therefore, by default, your executable, when compiled/built with nvcc, should already be statically linked to the CUDA runtime library corresponding to the version of nvcc you are using for compilation. The reason you might need to specify this (or something like it) is if you are building an application with, for example, the GNU toolchain on Linux rather than with nvcc.
The purpose of static linking to the CUDA runtime library is, as you surmise, so that an application can be built in such a way that it does not need an installation of the CUDA toolkit to run properly. It only needs a machine with a proper GPU driver install.
The most compatible way to ensure that an application will run on a range of machines with a range of GPU driver installs is to compile your application using the oldest CUDA toolkit required to meet the needs of the earliest GPU driver in the range you intend to cover. Again, you can refer to the table here.
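A small sketch (the file name and build command are illustrative) that makes the distinction visible at run time: the runtime version your binary carries versus the highest CUDA version the installed driver supports, which is exactly what that compatibility table relates:

```cuda
// versions.cu — prints the statically linked runtime version and the
// maximum CUDA version the installed GPU driver supports.
// Assumed build: nvcc --cudart=static versions.cu -o versions
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int runtimeVersion = 0, driverVersion = 0;
    // Version of the CUDA runtime compiled/linked into this binary.
    cudaRuntimeGetVersion(&runtimeVersion);
    // Highest CUDA version supported by the installed GPU driver.
    cudaDriverGetVersion(&driverVersion);
    std::printf("runtime: %d.%d, driver supports up to: %d.%d\n",
                runtimeVersion / 1000, (runtimeVersion % 100) / 10,
                driverVersion / 1000, (driverVersion % 100) / 10);
    // If the driver's supported version is lower than the runtime's,
    // kernel launches fail (e.g. with cudaErrorInsufficientDriver).
    return 0;
}
```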

Loading a PTX programmatically returns error 209 when run against a device with compute capability 5.0

I am trying to use the ptxjit sample from the CUDA SDK as the basis for instrumenting the interaction with the GPU device.
I've managed to successfully compile the instrumentation code, and to control the device to load and execute a PTX module, with a GeForce GT 440 that has compute capability 2.0.
When compiling the same instrumentation code on a laptop (using Bumblebee to control the discrete GPU) with a GeForce 830M that has compute capability 5.0, the code compiles but gives me error 209 (CUDA_ERROR_NO_BINARY_FOR_GPU).
I've tried compiling the kernel to be compatible with compute capability 5.0, but had no success; I still get the same error.
Any ideas?
In the end the problem was with the driver. It seems that it affects only the functions used for PTX code loading on GPUs with compute capability 5.0.
I removed all the NVIDIA driver packages that had been updated recently and installed the driver and OpenGL libraries that come with the CUDA SDK. The driver version shipped with SDK 7.5 is 352.39; with this driver, both the original ptxjit sample and the modified one executed perfectly, just as on the other systems.
I don't have any GPU with compute capability 3.0 to test whether the same problem would appear, and I haven't updated my desktop to the 367.44 driver to see whether it would break the ptxjit sample.
For now, the solution is to keep the driver that comes with the CUDA SDK and turn off updates from the NVIDIA repository.
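For reference, a skeletal driver API loading sequence of the kind the ptxjit sample builds on (the file name, build command, and usage are assumptions); error 209 (CUDA_ERROR_NO_BINARY_FOR_GPU) surfaces from the module-load step when no compatible binary can be produced for the device:

```cuda
// load_ptx.cu — skeletal driver API PTX load, illustrative only.
// Assumed build: nvcc load_ptx.cu -o load_ptx -lcuda
// Usage: ./load_ptx kernel.ptx   (PTX produced e.g. by `nvcc -ptx kernel.cu`)
#include <cuda.h>
#include <cstdio>

// CHECK prints the symbolic name of a failing driver API call, e.g.
// CUDA_ERROR_NO_BINARY_FOR_GPU, whose numeric value is 209.
#define CHECK(call)                                            \
    do {                                                       \
        CUresult r_ = (call);                                  \
        if (r_ != CUDA_SUCCESS) {                              \
            const char* name = nullptr;                        \
            cuGetErrorName(r_, &name);                         \
            std::printf("%s failed: %s (%d)\n", #call,         \
                        name ? name : "?", (int)r_);           \
            return 1;                                          \
        }                                                      \
    } while (0)

int main(int argc, char** argv) {
    if (argc < 2) { std::printf("usage: %s kernel.ptx\n", argv[0]); return 1; }

    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctx, 0, dev));
    // The PTX is JIT-compiled for the current device here; an incompatible
    // driver/device combination reports CUDA_ERROR_NO_BINARY_FOR_GPU (209).
    CHECK(cuModuleLoad(&mod, argv[1]));
    std::printf("PTX module loaded and JIT-compiled successfully\n");

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```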