I am profiling a CUDA application and dumping the output to a file, say target.prof.
My application uses multiple threads to dispatch kernels, and I want to observe the API calls from just one of those threads.
I tried using nvprof -i target.prof --print-api-trace, but this does not print the thread ID.
When I open this file with the visual profiler, I can see which API calls were launched from which thread. How can I access the same information using the command line profiler?
Edit: View in the visual profiler
Are GPU threads launching those kernels, or CPU threads? If CPU threads, then use the option --cpu-thread-tracing on.
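For illustration, here is a minimal sketch of the scenario being profiled (my own example, not from the question): two CPU threads each dispatch a kernel, so an API trace contains calls from more than one host thread. The program name, file name and profiling commands in the trailing comment are assumptions; check nvprof --help on your toolkit version for the exact flags.

    // Two host threads each make runtime API calls and launch a kernel.
    #include <thread>
    #include <cuda_runtime.h>

    __global__ void busy(int *out, int v) { out[threadIdx.x] = v; }

    void worker(int v) {
        int *d = nullptr;
        cudaMalloc(&d, 32 * sizeof(int));   // runtime API call made from this CPU thread
        busy<<<1, 32>>>(d, v);              // kernel dispatched from this CPU thread
        cudaDeviceSynchronize();
        cudaFree(d);
    }

    int main() {
        std::thread t0(worker, 0), t1(worker, 1);
        t0.join();
        t1.join();
        return 0;
    }

    // Possible profiling commands (hedged; exact flags depend on the nvprof version):
    //   nvprof --cpu-thread-tracing on -o target.prof ./app
    //   nvprof -i target.prof --print-api-trace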
I am executing ARM machine code on a certain board using qemu, and I need to get the list of target machine instructions executed by qemu (opcode name, register addresses, immediates), the PCs, and the values of the registers after each control flow instruction.
How can I achieve that?
I am using PM2 to run my Node.js application.
When starting it in cluster mode with "pm2 start server -i 0", PM2 will automatically spawn as many workers as you have CPU cores.
What is the ideal number of workers to run and why?
Beware of the context switch
When running multiple processes on your machine, try to make sure each CPU core will be kept busy by a single application thread at a time. As a general rule, you should look to spawn N-1 application processes, where N is the number of available CPU cores. That way, each process is guaranteed to get a good slice of one core, and there’s one spare for the kernel scheduler to run other server tasks on. Additionally, try to make sure the server will be running little or no work other than your Node.js application, so processes don’t fight for CPU.
We made a mistake where we deployed two busy node.js applications to our servers, both apps spawning N-1 processes each. The applications’ processes started vehemently competing for CPU, resulting in CPU load and usage increasing dramatically. Even though we were running these on beefy 8-core servers, we were paying a noticeable penalty due to context switching. Context switching is the behaviour whereby the CPU suspends one task in order to work on another. When context switching, the kernel must suspend all state for one process while it loads and executes state for another. After simply reducing the number of processes the applications spawned such that they each shared an equal number of cores, load dropped significantly:
https://engineering.gosquared.com/optimising-nginx-node-js-and-networking-for-heavy-workloads
On the host, I have a wrapper library which JIT compiles a new module. On the device, I have a daemon kernel with its threads waiting for custom commands (like, run this function) from the wrapper library. Are the functions inside the other module callable from the daemon kernel's threads?
Not directly, without host intervention, no. This may change in the future.
Since you already have a synchronization mechanism between the host and the device for your "daemon" kernel (presumably, to pass your custom commands to it), it seems like the kernel could easily send messages to the host as well. The host could be polling for those messages and dispatch whatever separate functions you want.
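As a concrete illustration of that polling idea, here is a hedged sketch (my own; the mailbox protocol, names and single-request flow are assumptions rather than the asker's actual design): the persistent "daemon" kernel posts a request word into mapped pinned host memory, the host polls it, launches whatever separately compiled function was requested, and clears the word to acknowledge.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void daemon_kernel(volatile int *mailbox) {
        if (threadIdx.x == 0) {
            *mailbox = 42;                 // "please run function 42 on my behalf"
            __threadfence_system();        // make the write visible to the host
            while (*mailbox != 0) { }      // spin until the host acknowledges
        }
    }

    int main() {
        cudaSetDeviceFlags(cudaDeviceMapHost);      // allow mapped (zero-copy) host memory

        int *mailbox_h = nullptr, *mailbox_d = nullptr;
        cudaHostAlloc((void **)&mailbox_h, sizeof(int), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&mailbox_d, mailbox_h, 0);
        *mailbox_h = 0;

        cudaStream_t s;
        cudaStreamCreate(&s);
        daemon_kernel<<<1, 1, 0, s>>>(mailbox_d);   // runs concurrently with the host loop

        volatile int *poll = mailbox_h;
        while (*poll == 0) { }                      // host-side dispatcher waits for a request
        printf("daemon asked for function %d; host would launch it here\n", *poll);
        // ... launch the JIT-compiled module's function from the host ...
        *poll = 0;                                  // acknowledge so the daemon continues

        cudaStreamSynchronize(s);
        cudaFreeHost(mailbox_h);
        return 0;
    }

Error checking is omitted; the point is only the shape of the host/device handshake.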
I'm reading the HSA spec and it says the user-mode application can submit its jobs into GPU queues directly without any OS interaction. I think this must be because the application can talk with the GPU driver directly and therefore doesn't need to incur any OS kernel calls.
So my question is: for a very simple example, in a CUDA application, when we call cudaMalloc(), does it incur any OS kernel calls?
The entire premise of this question is flawed. "Submitting a job" and allocating memory are not the same thing. Even a user-space process running on the host CPU which calls malloc will (at least some of the time) result in a kernel call, as the standard library acquires or releases memory for its heap, normally via sbrk or mmap.
So yes, cudaMalloc results in an OS kernel call - if you run strace you will see the GPU driver invoking ioctl to issue commands to the GPU MMU/TLB. But so does running malloc in host code, and so, undoubtedly, would running malloc on a theoretical HSA platform as well.
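To see this point in practice, here is a tiny probe (my own example; the program name and exact strace options are assumptions) that does nothing but one device allocation, so you can watch the user-space call descend into the OS kernel via the driver:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        void *p = nullptr;
        cudaError_t err = cudaMalloc(&p, 1 << 20);   // 1 MiB device allocation
        printf("cudaMalloc: %s\n", cudaGetErrorString(err));
        cudaFree(p);
        return 0;
    }

    // Run it under strace, e.g.  strace -f -e trace=ioctl,mmap ./probe
    // The ioctl calls to the GPU driver (plus the usual mmap/brk traffic from
    // host-side malloc) show up in the trace.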
I've noticed that if two users try to run CUDA programs at the same time, it tends to lock up either the card or the driver (or both?). We need to either reset the card or reboot the machine to restore normal behavior.
Is there a way to get a lock on the GPU so other programs can't interfere while it's running?
Edit
OS is Ubuntu 11.10 running on a server. While there is no X Windows running, the card is used to display the text system console. There are multiple users.
If you are running on either Linux or Windows with the TCC driver, you can put the GPU into compute exclusive mode using the nvidia-smi utility.
Compute exclusive mode makes the driver refuse a context establishment request if another process already holds a context on that GPU. Any process trying to run on a busy compute exclusive GPU will receive a no device available error and fail.
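As a hedged sketch of what this looks like from the application side (my own example; the nvidia-smi syntax in the comment and the exact error code can vary across driver generations), a program can read the device's compute mode and cope with the failure it gets when another process already owns an exclusive-mode GPU:

    // Administrator side (roughly):  nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("compute mode: %d (0 = default, non-zero = exclusive/prohibited variants)\n",
               prop.computeMode);

        // On a busy exclusive-mode GPU, establishing a context fails; the first
        // real runtime call reports it, typically as cudaErrorDevicesUnavailable.
        void *p = nullptr;
        cudaError_t err = cudaMalloc(&p, 1 << 20);
        if (err != cudaSuccess) {
            printf("could not get the GPU: %s\n", cudaGetErrorString(err));
            return 1;    // presumably another user's process holds the device
        }
        cudaFree(p);
        return 0;
    }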
You can use something like Task Spooler to queue the programs and run them one at a time.
We use TORQUE Resource Manager, but it's harder to configure than ts. With TORQUE you can have multiple queues (e.g., one for CUDA jobs, two for CPU jobs) and assign a different job to each GPU.