I'm reading the HSA spec, and it says that a user mode application can submit its jobs into GPU queues directly, without any OS interaction. I think this must be because the application can talk to the GPU driver directly and therefore doesn't need to incur any OS kernel calls.
So my question is: to take a very simple example, when we call cudaMalloc() in a CUDA application, does it incur any OS kernel calls?
The entire premise of this question is flawed. "Submitting a job" and allocating memory are not the same thing. Even a user space process running on the host CPU which calls malloc will (most of the time) end up making a kernel call, as the standard library acquires or releases physical memory for its heap, normally via sbrk or mmap.
So yes, cudaMalloc results in an OS kernel call - if you run strace you will see the GPU driver invoking ioctl to issue commands to the GPU MMU/TLB. But so does running malloc in host code, and so, undoubtedly, would running malloc on a theoretical HSA platform as well.
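To see this for yourself, here is a minimal sketch (the file name and buffer size are arbitrary) that you can compile with nvcc and run under strace to watch the ioctl traffic triggered by cudaMalloc:

    // minimal_alloc.cu -- compile with: nvcc minimal_alloc.cu -o minimal_alloc
    // Run as: strace -f ./minimal_alloc 2>&1 | grep ioctl
    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        void *d_ptr = NULL;
        // The runtime call goes through the user mode driver, which in turn
        // talks to the kernel mode driver via ioctl (context creation on the
        // first CUDA call generates additional ioctls as well).
        cudaError_t err = cudaMalloc(&d_ptr, 1 << 20);   // 1 MiB device buffer
        printf("cudaMalloc: %s\n", cudaGetErrorString(err));
        cudaFree(d_ptr);
        return 0;
    }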
Related
I am profiling a CUDA application and dumping the logs to a file, say target.prof.
My application uses multiple threads to dispatch kernels, and I want to observe the API calls from just one of those threads.
I tried using nvprof -i target.prof --print-api-trace, but this does not print the thread_id.
When I open this file with the visual profiler, I can see which API calls were launched from which thread. How can I access the same information using the command line profiler?
Edit: View in the visual profiler
Are those kernels being launched by GPU threads or by CPU threads? If CPU threads, then use the option --cpu-thread-tracing on.
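For example (a sketch; the application name is hypothetical and the flag is only available in reasonably recent nvprof releases), collection and inspection could look like:

    # collect a profile with CPU thread tracing enabled
    nvprof --cpu-thread-tracing on -o target.prof ./my_app

    # then inspect the API trace from the saved file
    nvprof -i target.prof --print-api-trace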
How can I turn an already allocated host memory buffer into page-locked memory using the CUDA driver API? Is there any equivalent procedure to achieve the same behaviour as the CUDA runtime's cudaHostRegister?
cuMemHostRegister is what you are looking for. It is what cudaHostRegister calls under the hood to perform the same operation.
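For illustration, here is a minimal sketch of that flow (error handling is reduced to asserts, the buffer size is arbitrary, and whether CU_MEMHOSTREGISTER_PORTABLE or other flags are appropriate depends on your use case; older toolkits may also require the address and size to be page aligned):

    // pin_host.cpp -- compile with: nvcc pin_host.cpp -lcuda
    #include <cuda.h>
    #include <cassert>
    #include <cstdlib>

    int main()
    {
        assert(cuInit(0) == CUDA_SUCCESS);

        CUdevice dev;
        CUcontext ctx;
        assert(cuDeviceGet(&dev, 0) == CUDA_SUCCESS);
        assert(cuCtxCreate(&ctx, 0, dev) == CUDA_SUCCESS);

        const size_t bytes = 1 << 20;
        void *buf = malloc(bytes);          // ordinary pageable host allocation

        // Page-lock (pin) the existing buffer so the GPU can DMA to/from it.
        assert(cuMemHostRegister(buf, bytes, CU_MEMHOSTREGISTER_PORTABLE) == CUDA_SUCCESS);

        // ... use buf with cuMemcpyHtoD / cuMemcpyDtoH, async copies, etc. ...

        assert(cuMemHostUnregister(buf) == CUDA_SUCCESS);
        free(buf);
        cuCtxDestroy(ctx);
        return 0;
    }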
On the host, I have a wrapper library which JIT compiles a new module. On the device, I have a daemon kernel with its threads waiting for custom commands (like, run this function) from the wrapper library. Are the functions inside the other module callable from the daemon kernel's threads?
Not directly, without host intervention, no. This may change in the future.
Since you already have a synchronization mechanism between the host and the device for your "daemon" kernel (presumably, to pass your custom commands to it), it seems like the kernel could easily send messages to the host as well. The host could be polling for those messages and dispatch whatever separate functions you want.
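As a hedged sketch of that idea, the exchange could be built on zero-copy (mapped, pinned) host memory used as a one-word mailbox; the kernel name and the request code below are hypothetical, and a real daemon would loop rather than exit after one request:

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void daemon_kernel(volatile int *mailbox)
    {
        // ... daemon loop waiting for custom commands would go here ...

        // When a thread needs a function from the other module executed, it
        // posts a request for the host instead of calling the function itself.
        *mailbox = 1;               // hypothetical "run function X" request code
        __threadfence_system();     // make the write visible to the host
    }

    int main()
    {
        int *h_mailbox = NULL, *d_mailbox = NULL;

        cudaSetDeviceFlags(cudaDeviceMapHost);
        cudaHostAlloc((void **)&h_mailbox, sizeof(int), cudaHostAllocMapped);
        *h_mailbox = 0;
        cudaHostGetDevicePointer((void **)&d_mailbox, h_mailbox, 0);

        daemon_kernel<<<1, 1>>>(d_mailbox);

        // Host side: poll the mailbox, then dispatch the requested function
        // (e.g. launch a kernel from the JIT compiled module) on the daemon's
        // behalf.
        while (*((volatile int *)h_mailbox) == 0) { /* spin or sleep */ }
        printf("daemon requested a host-side dispatch\n");

        cudaDeviceSynchronize();
        cudaFreeHost(h_mailbox);
        return 0;
    }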
I've noticed that if two users try to run CUDA programs at the same time, it tends to lock up either the card or the driver (or both?). We need to either reset the card or reboot the machine to restore normal behavior.
Is there a way to get a lock on the GPU so other programs can't interfere while it's running?
Edit
OS is Ubuntu 11.10 running on a server. While there is no X Windows running, the card is used to display the text system console. There are multiple users.
If you are running on Linux, or on Windows with the TCC driver, you can put the GPU into compute exclusive mode using the nvidia-smi utility.
Compute exclusive mode makes the driver refuse a context establishment request if another process already holds a context on that GPU. Any process trying to run on a busy compute exclusive GPU will receive a no device available error and fail.
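For example (a sketch; this requires root, and the GPU index and mode names depend on your driver version - older drivers used numeric mode values instead):

    # put GPU 0 into exclusive mode (one compute process at a time)
    nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

    # revert to the default (shared) mode
    nvidia-smi -i 0 -c DEFAULT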
You can use something like Task Spooler to queue the programs and run them one at a time.
We use the TORQUE Resource Manager, but it's harder to configure than ts. With TORQUE you can have multiple queues (e.g. one for CUDA jobs, two for CPU jobs) and assign a different job to each GPU.
Scenario:
I have two machines, a client and a server, connected with Infiniband. The server machine has an NVIDIA Fermi GPU, but the client machine has no GPU. I have an application running on the GPU machine that uses the GPU for some calculations. The result data on the GPU is never used by the server machine, but is instead sent directly to the client machine without any processing. Right now I'm doing a cudaMemcpy to get the data from the GPU to the server's system memory, then sending it off to the client over a socket. I'm using SDP to enable RDMA for this communication.
Question:
Is it possible for me to take advantage of NVIDIA's GPUDirect technology to get rid of the cudaMemcpy call in this situation? I believe I have the GPUDirect drivers correctly installed, but I don't know how to initiate the data transfer without first copying it to the host.
My guess is that it isn't possible to use SDP in conjunction with GPUDirect, but is there some other way to initiate an RDMA data transfer from the server machine's GPU to the client machine?
Bonus: If someone has a simple way to test whether I have the GPUDirect dependencies correctly installed, that would be helpful as well!
Yes, it is possible with supporting networking hardware. See the GPUDirect RDMA documentation.
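As a hedged sketch of what that looks like in code - assuming a GPU, HCA, and driver stack that support GPUDirect RDMA (for example a Mellanox adapter with the nv_peer_mem kernel module loaded; see the documentation for the exact requirements), and with the protection domain setup omitted - the buffer returned by cudaMalloc is registered directly with the verbs API so the HCA can DMA out of GPU memory without a staging cudaMemcpy:

    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>

    // Register a GPU buffer for RDMA. "pd" is an already created ibv_pd.
    struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t bytes)
    {
        void *d_buf = NULL;
        cudaMalloc(&d_buf, bytes);                 // device memory

        // With GPUDirect RDMA support, ibv_reg_mr accepts the device pointer
        // directly; RDMA reads/writes can then be posted against this MR.
        return ibv_reg_mr(pd, d_buf, bytes,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_READ |
                          IBV_ACCESS_REMOTE_WRITE);
    }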