Is there a way to obtain the program counter value for a thread in CUDA? By modifying device driver or Compiler? - cuda

I would like to implement a runtime that allows a thread in GPU to obtain the value of the program counter. I have searched in Cuda Programming Guide and PTX guide as well.
With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity.
May I know where the value of the program counter is stored? In what register assigned to that thread? Is there a way that allows the thread to obtain the value of the program counter?
If a thread cannot do it, then can we modify the driver, or compiler, or using inline assembly to achieve that?

May I know where the value of the program counter is stored? In what
register assigned to that thread?
The PTX virtual machine instruction only exposes a limited number of special registers.
At the time of writing, that doesn't include the program counter or the stack frame, as far as I am aware. This appears to be true whether the GPU architecture is using a block wide PC or an independent per-thread PC.
Is there a way that allows the thread to obtain the value of the
program counter?
Not as far as I am aware.

Related

How is stack frame managed within a thread in Cuda?

Suppose we have a kernel that invokes some functions, for instance:
__device__ int fib(int n) {
if (n == 0 || n == 1) {
return n;
} else {
int x = fib(n-1);
int y = fib(n-2);
return x + y;
}
return -1;
}
__global__ void fib_kernel(int* n, int *ret) {
*ret = fib(*n);
}
The kernel fib_kernel will invoke the function fib(), which internally will invoke two fib() functions. Suppose the GPU has 80 SMs, we launch exactly 80 threads to do the computation, and pass in n as 10. I am aware that there will be a ton of duplicated computations which violates the idea of data parallelism, but I would like to better understand the stack management of the thread.
According to the Documentation of Cuda PTX, it states the following:
the GPU maintains execution state per thread, including a program counter and call stack
The stack locates in local memory. As the threads executing the kernel, do they behave just like the calling convention in CPU? In other words, is it true that for each thread, the corresponding stack will grow and shrink dynamically?
The stack of each thread is private, which is not accessible by other threads. Is there a way that I can manually instrument the compiler/driver, so that the stack is allocated in global memory, no longer in local memory?
Is there a way that allows threads to obtain the current program counter, frame pointer values? I think they are stored in some specific registers, but PTX documentation does not provide a way to access those. May I know what I have to modify (e.g. the driver or the compiler) to be able to obtain those registers?
If we increase the input to fib(n) to be 10000, it is likely to cause stack overflow, is there a way to deal with it? The answer to question 2 might be able to address this. Any other thoughts would be appreciated.
You'll get a somewhat better idea of how these things work if you study the generated SASS code from a few examples.
As the threads executing the kernel, do they behave just like the calling convention in CPU? In other words, is it true that for each thread, the corresponding stack will grow and shrink dynamically?
The CUDA compiler will aggressively inline functions when it can. When it can't, it builds a stack-like structure in local memory. However the GPU instructions I'm aware of don't include explicit stack management (e.g. push and pop, for example) so the "stack" is "built by the compiler" with the use of registers that hold a (local) address and LD/ST instructions to move data to/from the "stack" space. In that sense, the actual stack does/can dynamically change in size, however the maximum allowable stack space is limited. Each thread has its own stack, using the definition of "stack" given here.
Is there a way that I can manually instrument the compiler/driver, so that the stack is allocated in global memory, no longer in local memory?
Practically, no. The NVIDIA compiler that generates instructions has a front-end and a back-end that is closed source. If you want to modify an open-source compiler for the GPUs it might be possible, but at the moment there are no widely recognized tool chains that I am aware of that don't use the closed-source back end (ptxas or its driver equivalent). The GPU driver is also largley closed source. There aren't any exposed controls that would affect the location of the stack, either.
May I know what I have to modify (e.g. the driver or the compiler) to be able to obtain those registers?
There is no published register for the instruction pointer/program counter. Therefore its impossible to state what modifications would be needed.
If we increase the input to fib(n) to be 10000, it is likely to cause stack overflow, is there a way to deal with it?
As I mentioned, the maximum stack-space per thread is limited, so your observation is correct, eventually a stack could grow to exceed the available space (and this is a possible hazard for recursion in CUDA device code). The provided mechanism to address this is to increase the per-thread local memory size (since the stack exists in the logical local space).

how exactly does CUDA handle a memory access?

i would like to know how CUDA hardware/run-time system handles the following case.
If a warp (warp1 in the following) instruction involves access to global memory (load/store); the run-time system schedules the next ready warp for execution.
When the new warp is executed,
Will the "memory access" of warp1 be conducted in parallel, i.e. while the new warp is running ?
Will the run time system put warp1 into a memory access waiting queue; once the memory request is completed, the warp is then moved into the runnable queue?
Will the instruction pointer related to warp1 execution be incremented automatically and in parallel to the new warp execution, to annotate that the memory request is completed?
For instance, consider this pseudo code output=input+array[i]; where output and input are both scalar variables mapped into registers, whereas array is saved in the global memory.
To run the above instruction, we need to load the value of array[i] into a (temporary) register before updating output; i.e the above instruction can be translated into 2 macro assembly instructions load reg, reg=&array[i], output_register=input_register+reg.
I would like to know how the hardware and runtime system handle the execution of the above 2 macro assembly instructions, given that load can't return immediately
I am not sure I understand your questions correctly, so I'll just try to answer them as I read them:
Yes, while a memory transaction is in flight further independent instructions will continue to be issued. There isn't necessarily a switch to a different warp though - while instructions from other warps will always be independent, the following instructions from the same warp might be independent as well and the same warp may keep running (i.e. further instructions may be issued from the same warp).
No. As explained under 1. the warp can and will continue executing instructions until either the result of the load is needed by dependent instruction, or a memory fence / barrier instruction requires it to wait for the effect of the store being visible to other threads.
This can go as far as issuing further (independent) load or store instructions, so that multiple memory transactions can be in flight for the same warp at the same time. So the status of a warp after issuing a load/store doesn't change fundamentally and it is not halted until necessary.
The instruction pointer will always be incremented automatically (there is no situation where you ever do this manually, nor are there instructions allowing to do so). However, as 2. implies, this doesn't necessarily indicate that the memory access has been performed - there is separate hardware to track progress of memory accesses.
Please note that the hardware implementation is completely undocumented by Nvidia. You might find some indications of possible implementations if you search through Nvidia's patent applications.
GPUs up to the Fermi generation (compute capability 2.x) tracked outstanding memory transaction completely in hardware. While undocumented by Nvidia, the common mechanism to track (memory) transactions in flight is scoreboarding.
GPUs from newer generations starting with Kepler (compute capability 3.x) use some assistance in the form of control words embedded in the shader assembly code. While again undocumented, Scott Gray has reversed engineered these for his Maxas Maxwell assembler. He found that (amongst other things) the control words contain barrier instructions for tracking memory transactions and was kind enough to document his findings on his Control-Codes wiki page.

Is it possible to set a limit on the number of cores to be used in Cuda programming for a given code?

Assume I have Nvidia K40, and for some reason, I want my code only uses portion of the Cuda cores(i.e instead of using all 2880 only use 400 cores for examples), is it possible?is it logical to do this either?
In addition, is there any way to see how many cores are being using by GPU when I run my code? In other words, can we check during execution, how many cores are being used by the code, report likes "task manger" in Windows or top in Linux?
It is possible, but the concept in a way goes against fundamental best practices for cuda. Not to say it couldn't be useful for something. For example if you want to run multiple kernels on the same GPU and for some reason want to allocate some number of Streaming Multiprocessors to each kernel. Maybe this could be beneficial for L1 caching of a kernel that does not have perfect memory access patterns (I still think for 99% of cases manual shared memory methods would be better).
How you could do this, would be to access the ptx identifiers %nsmid and %smid and put a conditional on the original launching of the kernels. You would have to only have 1 block per Streaming Multiprocessor (SM) and then return each kernel based on which kernel you want on which SM's.
I would warn that this method should be reserved for very experienced cuda programmers, and only done as a last resort for performance. Also, as mentioned in my comment, I remember reading that a threadblock could migrate from one SM to another, so behavior would have to be measured before implementation and could be hardware and cuda version dependent. However, since you asked and since I do believe it is possible (though not recommended), here are some resources to accomplish what you mention.
PTS register for SM index and number of SMs...
http://docs.nvidia.com/cuda/parallel-thread-execution/#identifiers
and how to use it in a cuda kernel without writing ptx directly...
https://gist.github.com/allanmac/4751080
Not sure, whether it works with the K40, but for newer Ampere GPUs there is the MIG Multi-Instance-GPU feature to partition GPUs.
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
I don't know such methods, but would like to get to know.
As to question 2, I suppose sometimes this can be useful. When you have complicated execution graphs, many kernels, some of which can be executed in parallel, you want to load GPU fully, most effectively. But it seems on its own GPU can occupy all SMs with single blocks of one kernel. I.e. if you have a kernel with 30-blocks grid and 30 SMs, this kernel can occupy entire GPU. I believe I saw such effect. Really this kernel will be faster (maybe 1.5x against 4 256-threads blocks per SM), but this will not be effective when you have another work.
GPU can't know whether we are going to run another kernel after this one with 30 blocks or not - whether it will be more effective to spread it onto all SMs or not. So some manual way to say this should exist
As to question 3, I suppose GPU profiling tools should show this, Visual Profiler and newer Parallel Nsight and Nsight Compute. But I didn't try. This will not be Task manager, but a statistics for kernels that were executed by your program instead.
As to possibility to move thread blocks between SMs when necessary,
#ChristianSarofeen, I can't find mentions that this is possible. Quite the countrary,
Each CUDA block is executed by one streaming multiprocessor (SM) and
cannot be migrated to other SMs in GPU (except during preemption,
debugging, or CUDA dynamic parallelism).
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Although starting from some architecture there is such thing as preemption. As I remember NVidia advertised it in the following way. Let's say you made a game that run some heavy kernels (say for graphics rendering). And then something unusual happened. You need to execute some not so heavy kernel as fast as possible. With preemption you can unload somehow running kernels and execute this high priority one. This increases execution time (of this high pr. kernel) a lot.
I also found such thing:
CUDA Graphs present a new model for work submission in CUDA. A graph
is a series of operations, such as kernel launches, connected by
dependencies, which is defined separately from its execution. This
allows a graph to be defined once and then launched repeatedly.
Separating out the definition of a graph from its execution enables a
number of optimizations: first, CPU launch costs are reduced compared
to streams, because much of the setup is done in advance; second,
presenting the whole workflow to CUDA enables optimizations which
might not be possible with the piecewise work submission mechanism of
streams.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
I do not believe kernels invocation take a lot of time (of course in case of a stream of kernels and if you don't await for results in between). If you call several kernels, it seems possible to send all necessary data for all kernels while the first kernel is executing on GPU. So I believe NVidia means that it runs several kernels in parallel and perform some smart load-balancing between SMs.

CUDA: it is possible for a kernel to return a break to CPU?

I'm writing a C program using CUDA parallelization, and I was wondering if it is possible for a kernel to return a break to CPU.
My program essentially do a for loop and inside that loop I take several parallel actions; at the start of each iteration I have to take a control over a variable (measuring the improvement of the just done iteration) which resides on the GPU.
My desire is that the control over that variable returns a break to CPU in order to exit the for loop(I take the control using a trivial kernel <<<1,1>>>).
I've tried copying back that variable on the CPU and take the control on the CPU but, as I feared, it sloows down the overall execution.
Any advice?
There is presently no way for any running code on a CUDA capable GPU to preempt running code on the host CPU. So although it isn't at all obvious what you are asking about, I'm fairly certain the answer is no just because there is no host side preempt or interrupt mechanism available available in device code.
There is no connection betweeen CPU code and GPU code.
All what you can do while working with CUDA is:
From CPU side allocate memory in GPU
Copy data to GPU
Launch executioning of prewritten instructions for GPU (GPU is black box for CPU)
Read data back.
So thinking about these steps in a loop, all what is left to you is to check result and break the loop if need to.

How to uninitialise CUDA?

CUDA implicitly initialises when the first CUDA runtime function is called.
I'm timing the runtime of my code and repeating 100 times via a loop (for([100 times]) {[Time CUDA code and log]}), which also needs to take into account the initialisation time for CUDA at each iteration. Thus I need to uninitialise CUDA after every iteration - how to do this?
I've tried using cudaDeviceReset(), but seems not to have uninitialised CUDA.
Many thanks.
cudaDeviceReset is the canonical way to destroy a context in the runtime API (and calling cudaFree(0) is the canonical way to create a context). Those are the only levels of "re-initialization" available to a running process. There are other per-process events which happen when a process loads the CUDA driver and runtime libraries and connects to the kernel driver, but there is no way I am aware of to make those happen programatically short of forking a new process.
But I really doubt you want or should be needing to account for this sort of setup time when calculating performance metrics anyway.