I would like to know how the allocation of memory in CUDA is implemented under Ubuntu Linux. In other words, how does cudaMalloc() work internally under Ubuntu Linux? What system calls does this function use?
CUDA is proprietary. It's likely that the CUDA driver implementation is the same as, or similar to, OpenCL's. But while the OpenCL specification is open, the implementation doesn't have to be, and NVIDIA's OpenCL driver isn't open.
It's possible that the implementation is as simple as the driver submitting an allocation command that is handled entirely on the hardware side, with the kernel driver communicating with the system to achieve unified virtual addressing and to determine which memory resides in VRAM. The technically interesting part on the software side is probably avoiding the allocation, or deferring it.
Looking into pocl can give you some idea of how things might look, though NVIDIA's implementation may be very different.
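Since the question asks specifically about system calls, one practical approach is to run a minimal allocation program under strace and watch what the closed-source userland driver actually does. The program below is just a hypothetical probe, not NVIDIA's implementation; on a typical Ubuntu install you would see libcuda.so open /dev/nvidiactl and /dev/nvidia0 and drive the kernel module through ioctl() and mmap() calls, rather than any dedicated allocation syscall.

```cuda
// probe.cu - minimal program for tracing cudaMalloc's system calls.
// Build:  nvcc probe.cu -o probe
// Trace:  strace -f -e trace=openat,ioctl,mmap ./probe
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void *d = nullptr;
    cudaError_t err = cudaMalloc(&d, 1 << 20);   // 1 MiB device allocation
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(d);
    return 0;
}
```

The interesting traffic is the ioctl() calls on the /dev/nvidia* file descriptors: that is the channel between the userland driver and the proprietary kernel module, so the device-side allocation never surfaces as an ordinary brk/mmap of host memory.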
Is it possible to run a CUDA program on a virtual machine without having a physical NVidia GPU card on the host machine?
PCIe passthrough is only viable if the host machine has an NVIDIA card, and that's not available here.
One possible option for running CUDA programs without a GPU installed is to use an emulator/simulator (e.g. http://gpgpu-sim.org/ ), but these simulators are usually limited.
I would appreciate a clear answer on that matter.
Thanks!
You can't run any modern version of CUDA (e.g. 6.0 or newer) unless you have actual GPU hardware available on the machine or virtual machine.
The various simulators and other methods all depend on very old versions of CUDA.
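If you want to confirm from inside a virtual machine whether any CUDA device is actually visible, a small probe makes the failure explicit (a sketch; the exact error string varies between CUDA versions):

```cuda
// probe_device.cu - report whether any CUDA device is visible.
// Build: nvcc probe_device.cu -o probe_device
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        // Typical result in a VM without passthrough:
        // "no CUDA-capable device is detected"
        printf("No usable CUDA device: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%d CUDA device(s) visible\n", n);
    return 0;
}
```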
Is it possible to compile a CUDA program without having a CUDA capable device on the same node, using only NVIDIA CUDA Toolkit...?
The answer to your question is YES.
The nvcc compiler driver does not depend on the physical presence of a device, so you can compile CUDA codes even without a CUDA-capable GPU. Be warned, however, that, as remarked by Robert Crovella, the CUDA driver library libcuda.so (cuda.lib on Windows) comes with the NVIDIA driver and not with the CUDA toolkit installer. This means that codes using the driver API (whose entry points are prefixed with cu; see Appendix H of the CUDA C Programming Guide) will need the driver installed; a forced installation of a "recent" driver without an NVIDIA GPU present is possible by running the driver installer separately (see its --help command line switch for the relevant options).
Following the same rationale, you can compile CUDA codes for an architecture when your node hosts a GPU of a different architecture. For example, you can compile a code for a GeForce GT 540M (compute capability 2.1) on a machine hosting a GT 210 (compute capability 1.2).
Of course, in both cases (no GPU, or a GPU of a different architecture), you will not be able to successfully run the code.
In the early versions of CUDA it was possible to compile code in an emulation mode and run it on a CPU, but device emulation has been deprecated for some time. If you don't have a CUDA-capable device but want to run CUDA codes, you can try gpuocelot (though I don't have any experience with it).
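To make the compile-without-a-device point concrete: the sketch below builds on any machine with only the toolkit installed, because nvcc never touches the hardware, and you can target an architecture you don't own with -arch. Only at run time does the runtime API report that no (matching) device is present. The file name and -arch value are just examples.

```cuda
// kernel.cu - compiles on any machine with the CUDA toolkit, GPU or not.
// Build for a specific architecture, e.g.:  nvcc -arch=sm_21 kernel.cu -o kernel
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    float *d = nullptr;
    // On a node without a suitable GPU, this is where things fail,
    // not at compile time.
    if (cudaMalloc(&d, 256 * sizeof(float)) != cudaSuccess) {
        printf("Compiled fine, but no usable device at run time.\n");
        return 1;
    }
    scale<<<1, 256>>>(d, 2.0f, 256);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```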
First of all, sorry for any spelling mistakes; I'm not a native English speaker.
I'm trying to use IBM Platform MPI v9.1.2 with CUDA 5.5 on Windows 7 to pass messages through GPUs, using CUDA-aware MPI as this post describes: http://devblogs.nvidia.com/parallelforall/introduction-cuda-aware-mpi/
My GPUs (NVIDIA Tesla C2075) support UVA (Unified Virtual Addressing), so it should work when I call MPI_Ssend and MPI_Recv with device pointers, but it doesn't: the program crashes.
I've installed only the IBM package, but I couldn't find anything about any required configuration. Does anyone know anything about this and could help me?
Thank you so much.
You'll need to switch to Linux.
IBM indicates (here and here) that Platform MPI's GPUDirect support is Linux-only:
Hardware requirements
...
In addition, Platform MPI supports GPU-Direct 2.0 on Linux.
CUDA-Aware MPI depends on GPUDirect, as indicated in the blog you linked.
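For reference, this is the pattern that works on Linux with a CUDA-aware MPI build (a sketch; the buffer size, tag, and build flags are arbitrary). The point is that the device pointer goes straight into MPI_Ssend/MPI_Recv with no intermediate cudaMemcpy, which is exactly what fails without GPUDirect support:

```cuda
// cuda_aware.cu - sketch of passing device pointers to MPI (Linux + GPUDirect).
// Build (example): nvcc cuda_aware.cu -o cuda_aware -I$MPI_ROOT/include -L$MPI_ROOT/lib -lmpi
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));

    if (rank == 0) {
        // With UVA + GPUDirect, the MPI library detects that d_buf is a
        // device pointer and stages the transfer itself.
        MPI_Ssend(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```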
What exactly does the NVIDIA CUDA driver do, from the perspective of using CUDA?
The driver passes the kernel code, with the execution configuration (#threads, #blocks)...
and what else?
I saw a post saying that the driver should be aware of the number of available SMs. But isn't that unnecessary? Once the kernel is handed to the GPU, the GPU's scheduler just needs to spread the work across the available SMs...
The GPU isn't a fully autonomous device; it requires a lot of help from the host driver to do even the simplest things. As I understand it, the driver contains at least:
JIT compiler/optimizer (PTX assembly code can be compiled by the driver at runtime, the driver will also recompile code to match the execution architecture of the device if required and possible)
Device memory management
Host memory management (DMA transfer buffers, pinned and mapped host memory, unified addressing model)
Context and runtime support (so code/heap/stack/printf buffer memory management), dynamic symbol management, streams, etc
Kernel "grid level" scheduler (includes managing multiple simultaneous kernels on architectures that support it)
Compute mode management
Display driver interop (for DirectX and OpenGL resource sharing)
That probably represents the bare minimum that is required to get some userland device code onto a GPU and running via the host side APIs.
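The JIT point in the list above can be made concrete with the driver API: cuModuleLoad() hands the driver PTX at run time, and the driver compiles it for whatever device is actually present. A minimal sketch (the file and kernel names are hypothetical; the PTX would come from an earlier nvcc -ptx step):

```cuda
// jit_demo.cpp - driver API sketch: the driver JIT-compiles PTX at load time.
// Generate PTX first:  nvcc -ptx kernel.cu -o kernel.ptx
// Build this loader:   nvcc jit_demo.cpp -o jit_demo -lcuda
#include <cstdio>
#include <cuda.h>

int main() {
    cuInit(0);
    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // The driver compiles the PTX for this device's architecture right here;
    // this is the "JIT compiler/optimizer" responsibility from the list above.
    CUmodule mod;
    if (cuModuleLoad(&mod, "kernel.ptx") != CUDA_SUCCESS) {
        fprintf(stderr, "JIT load failed\n");
        return 1;
    }

    CUfunction fn;
    cuModuleGetFunction(&fn, mod, "scale");   // "scale" is a hypothetical kernel name
    // ... set up arguments and launch with cuLaunchKernel(fn, ...) as usual ...

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```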
Is there any way I can test the CUDA samples and codes on a computer with no NVIDIA graphics card? I am using Windows and the latest version of CUDA.
There are several possibilities:
Use an older version of CUDA, which has a built-in emulator (2.3 has it for sure). The emulator is far from good, and you won't have the features of the latest CUDA releases.
Use OpenCL; it can run on CPUs (though not with the NVIDIA SDK; you'll have to install either the AMD or Intel OpenCL implementation, and AMD's works fine on Intel CPUs, by the way). In my experience, OpenCL is usually slightly slower than CUDA.
There is a Windows branch of the Ocelot emulator: http://code.google.com/p/gpuocelot/. I haven't tried it, though.
However, I would recommend buying a CUDA-capable card; the 8xxx or 9xxx series is fine and really cheap. Emulation will let you pick up some basic GPGPU programming skills, but it's useless once you write a real-world application, since it doesn't let you debug or tune performance.