I'm reading the HSA spec, and it says that a user mode application can submit its jobs into GPU queues directly, without any OS interaction. I think this must be because the application can talk to the GPU driver directly and therefore doesn't need to incur any OS kernel calls.
So my question is: to take a very simple example, when we call cudaMalloc() in a CUDA application, does it incur any OS kernel calls?
The entire premise of this question is flawed. "Submitting a job" and allocating memory are not the same thing. Even a user space process running on the host CPU which calls malloc will (most of the time) end up making a kernel call, as the standard library acquires or releases physical memory for its heap, normally via sbrk or mmap.
So yes, cudaMalloc results in an OS kernel call - if you run strace you will see the GPU driver invoking ioctl to issue commands to the GPU MMU/TLB. But so does running malloc in host code, and so, undoubtedly, would running malloc on a theoretical HSA platform as well.
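To see this for yourself, here is a minimal sketch (the file name and buffer size are arbitrary) that you can compile with nvcc and run under strace to watch the ioctl traffic triggered by cudaMalloc:

    // minimal_alloc.cu -- compile with: nvcc minimal_alloc.cu -o minimal_alloc
    // Run as: strace -f ./minimal_alloc 2>&1 | grep ioctl
    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        void *d_ptr = NULL;
        // The runtime call goes through the user mode driver, which in turn
        // talks to the kernel mode driver via ioctl (context creation on the
        // first CUDA call generates additional ioctls as well).
        cudaError_t err = cudaMalloc(&d_ptr, 1 << 20);   // 1 MiB device buffer
        printf("cudaMalloc: %s\n", cudaGetErrorString(err));
        cudaFree(d_ptr);
        return 0;
    }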
Related
I am profiling a CUDA application and dumping the logs to a file, say target.prof.
My application uses multiple threads to dispatch kernels, and I want to observe the API calls from just one of those threads.
I tried using nvprof -i target.prof --print-api-trace, but this does not print the thread_id.
When I open this file with the visual profiler, I can see which API calls were launched from which thread. How can I access the same information using the command line profiler?
Edit: View in the visual profiler
Are those kernels being launched by GPU threads or by CPU threads? If CPU threads, then use the option --cpu-thread-tracing on.
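For example (a sketch; the application name is hypothetical and the flag is only available in reasonably recent nvprof releases), collection and inspection could look like:

    # collect a profile with CPU thread tracing enabled
    nvprof --cpu-thread-tracing on -o target.prof ./my_app

    # then inspect the API trace from the saved file
    nvprof -i target.prof --print-api-trace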
How can I turn an already allocated host memory buffer into page-locked memory using the CUDA driver API? Is there any equivalent procedure to achieve the same behaviour as the CUDA runtime's cudaHostRegister?
cuMemHostRegister is what you are looking for. It is what cudaHostRegister calls under the hood to perform the same operation.
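For illustration, here is a minimal sketch of that flow (error handling is reduced to asserts, the buffer size is arbitrary, and whether CU_MEMHOSTREGISTER_PORTABLE or other flags are appropriate depends on your use case; older toolkits may also require the address and size to be page aligned):

    // pin_host.cpp -- compile with: nvcc pin_host.cpp -lcuda
    #include <cuda.h>
    #include <cassert>
    #include <cstdlib>

    int main()
    {
        assert(cuInit(0) == CUDA_SUCCESS);

        CUdevice dev;
        CUcontext ctx;
        assert(cuDeviceGet(&dev, 0) == CUDA_SUCCESS);
        assert(cuCtxCreate(&ctx, 0, dev) == CUDA_SUCCESS);

        const size_t bytes = 1 << 20;
        void *buf = malloc(bytes);          // ordinary pageable host allocation

        // Page-lock (pin) the existing buffer so the GPU can DMA to/from it.
        assert(cuMemHostRegister(buf, bytes, CU_MEMHOSTREGISTER_PORTABLE) == CUDA_SUCCESS);

        // ... use buf with cuMemcpyHtoD / cuMemcpyDtoH, async copies, etc. ...

        assert(cuMemHostUnregister(buf) == CUDA_SUCCESS);
        free(buf);
        cuCtxDestroy(ctx);
        return 0;
    }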
On the host, I have a wrapper library which JIT compiles a new module. On the device, I have a daemon kernel with its threads waiting for custom commands (like, run this function) from the wrapper library. Are the functions inside the other module callable from the daemon kernel's threads?
Not directly, without host intervention, no. This may change in the future.
Since you already have a synchronization mechanism between the host and the device for your "daemon" kernel (presumably, to pass your custom commands to it), it seems like the kernel could easily send messages to the host as well. The host could be polling for those messages and dispatch whatever separate functions you want.
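As a hedged sketch of that idea, the exchange could be built on zero-copy (mapped, pinned) host memory used as a one-word mailbox; the kernel name and the request code below are hypothetical, and a real daemon would loop rather than exit after one request:

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void daemon_kernel(volatile int *mailbox)
    {
        // ... daemon loop waiting for custom commands would go here ...

        // When a thread needs a function from the other module executed, it
        // posts a request for the host instead of calling the function itself.
        *mailbox = 1;               // hypothetical "run function X" request code
        __threadfence_system();     // make the write visible to the host
    }

    int main()
    {
        int *h_mailbox = NULL, *d_mailbox = NULL;

        cudaSetDeviceFlags(cudaDeviceMapHost);
        cudaHostAlloc((void **)&h_mailbox, sizeof(int), cudaHostAllocMapped);
        *h_mailbox = 0;
        cudaHostGetDevicePointer((void **)&d_mailbox, h_mailbox, 0);

        daemon_kernel<<<1, 1>>>(d_mailbox);

        // Host side: poll the mailbox, then dispatch the requested function
        // (e.g. launch a kernel from the JIT compiled module) on the daemon's
        // behalf.
        while (*((volatile int *)h_mailbox) == 0) { /* spin or sleep */ }
        printf("daemon requested a host-side dispatch\n");

        cudaDeviceSynchronize();
        cudaFreeHost(h_mailbox);
        return 0;
    }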
I've noticed that if two users try to run CUDA programs at the same time, it tends to lock up either the card or the driver (or both?). We need to either reset the card or reboot the machine to restore normal behavior.
Is there a way to get a lock on the GPU so other programs can't interfere while it's running?
Edit
OS is Ubuntu 11.10 running on a server. While there is no X Windows running, the card is used to display the text system console. There are multiple users.
If you are running on Linux, or on Windows with the TCC driver, you can put the GPU into compute exclusive mode using the nvidia-smi utility.
Compute exclusive mode makes the driver refuse a context establishment request if another process already holds a context on that GPU. Any process trying to run on a busy compute exclusive GPU will receive a no device available error and fail.
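For example (a sketch; this requires root, and the GPU index and mode names depend on your driver version - older drivers used numeric mode values instead):

    # put GPU 0 into exclusive mode (one compute process at a time)
    nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

    # revert to the default (shared) mode
    nvidia-smi -i 0 -c DEFAULT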
You can use something like Task Spooler to queue the programs and run them one at a time.
We use the TORQUE Resource Manager, but it's harder to configure than ts. With TORQUE you can have multiple queues (e.g. one for CUDA jobs, two for CPU jobs) and assign a different job to each GPU.
Scenario:
I have two machines, a client and a server, connected with Infiniband. The server machine has an NVIDIA Fermi GPU, but the client machine has no GPU. I have an application running on the GPU machine that uses the GPU for some calculations. The result data on the GPU is never used by the server machine, but is instead sent directly to the client machine without any processing. Right now I'm doing a cudaMemcpy to get the data from the GPU to the server's system memory, then sending it off to the client over a socket. I'm using SDP to enable RDMA for this communication.
Question:
Is it possible for me to take advantage of NVIDIA's GPUDirect technology to get rid of the cudaMemcpy call in this situation? I believe I have the GPUDirect drivers correctly installed, but I don't know how to initiate the data transfer without first copying it to the host.
My guess is that it isn't possible to use SDP in conjunction with GPUDirect, but is there some other way to initiate an RDMA data transfer from the server machine's GPU to the client machine?
Bonus: If someone has a simple way to test whether I have the GPUDirect dependencies correctly installed, that would be helpful as well!
Yes, it is possible with supporting networking hardware. See the GPUDirect RDMA documentation.
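As a hedged sketch of what that looks like in code - assuming a GPU, HCA, and driver stack that support GPUDirect RDMA (for example a Mellanox adapter with the nv_peer_mem kernel module loaded; see the documentation for the exact requirements), and with the protection domain setup omitted - the buffer returned by cudaMalloc is registered directly with the verbs API so the HCA can DMA out of GPU memory without a staging cudaMemcpy:

    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>

    // Register a GPU buffer for RDMA. "pd" is an already created ibv_pd.
    struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t bytes)
    {
        void *d_buf = NULL;
        cudaMalloc(&d_buf, bytes);                 // device memory

        // With GPUDirect RDMA support, ibv_reg_mr accepts the device pointer
        // directly; RDMA reads/writes can then be posted against this MR.
        return ibv_reg_mr(pd, d_buf, bytes,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_READ |
                          IBV_ACCESS_REMOTE_WRITE);
    }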