CUDA Shared Memory not exclusive to block while debugging - cuda

Basically, I am having a difficult time understanding exactly what is going wrong here.
Shared memory does not appear to be behaving in a block exclusive manner while debugging. When running the code normally, nothing is printed. However, if I attempt to debug it, shared memory is shared between blocks and the print statement is reached.
This is an example. Obviously this isn't terribly useful code, but it reproduces the issue on my system. Am I doing something wrong? Is this a bug or expected behavior from the debugger?
__global__
void test()
{
    __shared__ int result[1];
    if (blockIdx.x == 0 && blockIdx.y == 0 && blockIdx.z == 0)
        result[0] = 4444;
    else
    {
        if (result[0] == 4444)
            printf("This should never print if shared memory is unique\n");
    }
}
And to launch it:
test<<<dim3(8,8,1), dim3(8,8,1)>>>();
It is also entirely possible that I have completely misunderstood shared memory.
Thanks for the help.
Other Information:
I am using a GTX 460. Compute_20 and sm_20 are set for the project. I am writing the code in Visual Studio 2010 using nsight 3.0 preview.

There is a subtle but important difference between
"shared memory is shared between blocks and the print statement is reached"
and
"shared memory is re-used by successive blocks and the print statement is reached".
You are assuming the former, but the latter is what is really happening.
Your code, with the exception of the first block, is reading from uninitialised memory. That, in itself, is undefined behaviour. C++ (and CUDA) don't guarantee that statically declared shared memory holds any particular value when it comes into scope, or that it is cleared when it goes out of scope. You can't expect that result won't hold 4444, especially when it is probably stored in the same shared scratch space as a previous block which may have set it to 4444.
The entire premise of the code and this question is flawed, and you should draw no conclusions from the result you see other than that undefined behaviour is undefined.
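For comparison, here is a minimal sketch of the same experiment with the undefined read removed: every block gives result[0] a defined value and synchronises before any thread reads it. The kernel name test_initialised, the hits counter and the host helper count_leaks are hypothetical names introduced only for this illustration; they are not part of the original question.

```cuda
#include <cuda_runtime.h>

__global__ void test_initialised(int *hits)
{
    __shared__ int result[1];
    const bool first_block = (blockIdx.x == 0 && blockIdx.y == 0 && blockIdx.z == 0);

    // One thread per block writes a defined value before anyone reads it.
    if (threadIdx.x == 0 && threadIdx.y == 0)
        result[0] = first_block ? 4444 : 0;
    __syncthreads();  // make that write visible to the whole block

    // Count threads outside block (0,0,0) that still observe 4444.
    if (!first_block && result[0] == 4444)
        atomicAdd(hits, 1);
}

// Hypothetical host helper: runs the kernel with the launch shape from the
// question and returns the number of counted threads.
int count_leaks()
{
    int *d_hits = nullptr;
    int h_hits = -1;
    cudaMalloc(&d_hits, sizeof(int));
    cudaMemset(d_hits, 0, sizeof(int));
    test_initialised<<<dim3(8, 8, 1), dim3(8, 8, 1)>>>(d_hits);
    cudaMemcpy(&h_hits, d_hits, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_hits);
    return h_hits;
}
```

Once each block initialises its own copy, no block depends on whatever a previous occupant of the same scratch space left behind, so the counter should stay at zero with or without the debugger.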

Related

undefined behavior std::vector

#include <iostream>
#include <string>
#include <vector>
int main()
{
    std::string name;
    std::vector<double> v(5, 1);
    std::cout<<v.capacity()<<std::endl;
    v[1000000]= 10.;
    std::cout<<v[1000000]<<std::endl;
    std::cout<<v.capacity()<<std::endl;
    return 0;
}
Is this code undefined behavior? It seems that no allocation is made on the fly, so I am wondering how the program is able to handle the item assignment. I am using macOS Monterey and this prints "10" as "expected".
std::vector<double> v(5, 1);
std::cout<<v.capacity()<<std::endl;
v[1000000]= 10.;
From your question, I'm pretty sure you know this is undefined behavior. Your question is really "why doesn't this crash?"
Undefined behavior means anything can happen. Usually what happens is the app seems to work. The vector will ask std::allocator for 40 bytes of memory (5 doubles of 8 bytes each), and the allocator will ask the operating system for a large chunk of memory, hand 40 bytes of it to the vector, and keep the rest of the large chunk for whatever asks for memory next. Then your code assigns 10 to the place in (virtual) memory that's ~8MB past the start of the memory allocated to the vector.
One possibility from there is that that address in memory is not currently allocated in your process. The OS will detect this and usually it will crash your app. Or it might just give that memory to your app so that it keeps running. So you never really know what will happen.
Another option is that the address happens to already be allocated in your process, either coincidentally by the allocator or coincidentally by something else, in which case the write succeeds in writing 10 to that place in memory. Hopefully that wasn't important, like networking code, in which case the next network call could send the contents of your memory, including your passwords, to whatever server it happened to be talking to at the time. Or maybe that memory isn't used for anything at the moment, and nothing breaks. Or maybe it works 998 times in a row, and then on the 999th time it erases your files. Unlikely, but possible.
So yes, most of the time "undefined behavior works anyway". But don't do it, you will regret it.
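If the goal is to have the out-of-range write detected instead of silently "working", the bounds-checked accessor std::vector::at is one option: it throws std::out_of_range when the index is not below size(). A minimal sketch follows; the helper name write_is_rejected is hypothetical and exists only for this illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: tries a bounds-checked write and reports whether
// the vector rejected the index.
bool write_is_rejected(std::vector<double>& v, std::size_t i, double value)
{
    try {
        v.at(i) = value;   // throws std::out_of_range when i >= v.size()
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```

With std::vector<double> v(5, 1), a write to index 1000000 is rejected, while a write to index 1 goes through as usual. Unlike operator[], this has defined behavior for any index.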

How is stack frame managed within a thread in Cuda?

Suppose we have a kernel that invokes some functions, for instance:
__device__ int fib(int n) {
    if (n == 0 || n == 1) {
        return n;
    } else {
        int x = fib(n-1);
        int y = fib(n-2);
        return x + y;
    }
    return -1;
}
__global__ void fib_kernel(int* n, int *ret) {
    *ret = fib(*n);
}
The kernel fib_kernel will invoke the function fib(), which internally will invoke two fib() functions. Suppose the GPU has 80 SMs, we launch exactly 80 threads to do the computation, and pass in n as 10. I am aware that there will be a ton of duplicated computations which violates the idea of data parallelism, but I would like to better understand the stack management of the thread.
According to the Documentation of Cuda PTX, it states the following:
the GPU maintains execution state per thread, including a program counter and call stack
1. The stack is located in local memory. As the threads execute the kernel, do they behave just like the calling convention on a CPU? In other words, is it true that for each thread, the corresponding stack will grow and shrink dynamically?
2. The stack of each thread is private and is not accessible by other threads. Is there a way that I can manually instrument the compiler/driver, so that the stack is allocated in global memory, no longer in local memory?
3. Is there a way that allows threads to obtain the current program counter and frame pointer values? I think they are stored in some specific registers, but the PTX documentation does not provide a way to access those. May I know what I have to modify (e.g. the driver or the compiler) to be able to obtain those registers?
4. If we increase the input to fib(n) to be 10000, it is likely to cause stack overflow. Is there a way to deal with it? The answer to question 2 might be able to address this. Any other thoughts would be appreciated.
You'll get a somewhat better idea of how these things work if you study the generated SASS code from a few examples.
As the threads executing the kernel, do they behave just like the calling convention in CPU? In other words, is it true that for each thread, the corresponding stack will grow and shrink dynamically?
The CUDA compiler will aggressively inline functions when it can. When it can't, it builds a stack-like structure in local memory. However, the GPU instructions I'm aware of don't include explicit stack management (e.g. push and pop), so the "stack" is "built by the compiler" with the use of registers that hold a (local) address and LD/ST instructions to move data to/from the "stack" space. In that sense, the actual stack does/can dynamically change in size; however, the maximum allowable stack space is limited. Each thread has its own stack, using the definition of "stack" given here.
Is there a way that I can manually instrument the compiler/driver, so that the stack is allocated in global memory, no longer in local memory?
Practically, no. The NVIDIA compiler that generates instructions has a front-end and a back-end that are closed source. If you want to modify an open-source compiler for the GPUs it might be possible, but at the moment there are no widely recognized tool chains that I am aware of that don't use the closed-source back end (ptxas or its driver equivalent). The GPU driver is also largely closed source. There aren't any exposed controls that would affect the location of the stack, either.
May I know what I have to modify (e.g. the driver or the compiler) to be able to obtain those registers?
There is no published register for the instruction pointer/program counter. Therefore it's impossible to state what modifications would be needed.
If we increase the input to fib(n) to be 10000, it is likely to cause stack overflow, is there a way to deal with it?
As I mentioned, the maximum stack-space per thread is limited, so your observation is correct, eventually a stack could grow to exceed the available space (and this is a possible hazard for recursion in CUDA device code). The provided mechanism to address this is to increase the per-thread local memory size (since the stack exists in the logical local space).
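One runtime call that fits this description is cudaDeviceSetLimit with cudaLimitStackSize, which sets the per-thread stack size in bytes for subsequent kernel launches. The sketch below is only an illustration: the helper name raise_stack_limit is hypothetical, and the size to request for a given recursion depth is something you would have to determine yourself.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: raises the per-thread stack limit to at least
// requested_bytes and returns the limit the runtime reports afterwards.
size_t raise_stack_limit(size_t requested_bytes)
{
    size_t current = 0;
    cudaDeviceGetLimit(&current, cudaLimitStackSize);

    // Only raise the limit; never shrink what is already configured.
    if (requested_bytes > current)
        cudaDeviceSetLimit(cudaLimitStackSize, requested_bytes);

    cudaDeviceGetLimit(&current, cudaLimitStackSize);
    return current;
}
```

This has to be called on the host before launching the recursive kernel. The limit applies to every thread of a launch, so a large value multiplies across all resident threads, and very deep recursion such as fib(10000) may still not fit.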

Memory space of kernel arguments in CUDA __global__ function

In a CUDA function like the following:
__global__ void Kernel(int value) {
    value += 1;
    ...
}
void Host() {
    Kernel<<<10, 10>>>(123);
}
Is the memory space of value inside Kernel device (global), shared, or local?
If one thread modifies it, will that modification become visible to other threads? Or is the variable located on the stack of each thread, as with variables defined inside the function?
Is the memory space of value inside Kernel device (global), shared, or local?
It is in the logical local space. Kernel parameters start out in a particular bank of __constant__ memory as part of the kernel launch process. However for most actual usage, the parameter will first be copied to a thread-local register, which is part of the logical local space. Even for SASS instructions that are not LD but can refer to the __constant__ memory, the usage is effectively local, per-thread, just like registers are local, per-thread.
If one thread modifies it, will that modification become visible to other threads?
Modifications in one thread will not be visible to other threads. If you modify it, the modification will be performed (first) on its value in a thread-local register.
Or is the variable located on the stack of each thread, as with variables defined inside the function?
The stack is in the logical local space for a thread, so I'm not sure what is the purpose of that question. A stack from one thread is not shared with another thread. The only way such a variable would show up on the stack in my experience is if it were used as part of a function call process (i.e. not the thread itself as it is initially spawned by the kernel launch process, but a function call originating from that thread).
Also, variables defined inside a function (e.g. local variables) do not necessarily show up on the stack either. This will mostly be a function of compiler decisions. They could be in registers, they could appear (e.g. due to a spill) in actual device memory (but still in the logical local space) or they could be in the stack, at some point, perhaps as part of a function call.
This should be mostly verifiable using the CUDA binary utilities.
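The per-thread behaviour described above can also be observed directly. In this minimal sketch each thread adds its own index to the by-value parameter and stores what it sees; the kernel name bump and the host helper run_bump are hypothetical names used only for illustration.

```cuda
#include <cuda_runtime.h>

__global__ void bump(int value, int *out)
{
    value += threadIdx.x;       // modifies this thread's private copy only
    out[threadIdx.x] = value;   // record what this thread ended up with
}

// Hypothetical host helper: launches one block of n_threads threads and
// copies each thread's view of the parameter back to the host.
void run_bump(int value, int *host_out, int n_threads)
{
    int *d_out = nullptr;
    cudaMalloc(&d_out, n_threads * sizeof(int));
    bump<<<1, n_threads>>>(value, d_out);
    cudaMemcpy(host_out, d_out, n_threads * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}
```

Every thread starts from the value passed at launch, so thread i stores value + i regardless of what the other threads did to their copies.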

Why does printf() work within a kernel, but using std::cout doesn't?

I have been exploring the field of parallel programming and have written basic kernels in Cuda and SYCL. I have encountered a situation where I had to print inside the kernel and I noticed that std::cout inside the kernel does not work whereas printf works. For example, consider the following SYCL Codes -
This works -
void print(float*A, size_t N){
    buffer<float, 1> Buffer{A, {N}};
    queue Queue((intel_selector()));
    Queue.submit([&Buffer, N](handler& Handler){
        auto accessor = Buffer.get_access<access::mode::read>(Handler);
        Handler.parallel_for<dummyClass>(range<1>{N}, [accessor](id<1>idx){
            printf("%f", accessor[idx[0]]);
        });
    });
}
whereas if I replace the printf with std::cout<<accessor[idx[0]] it raises a compile time error saying - Accessing non-const global variable is not allowed within SYCL device code.
A similar thing happens with CUDA kernels.
This got me thinking that what may be the difference between printf and std::cout which causes such behavior.
Also, suppose I wanted to implement a custom print function to be called from the GPU. How should I do it?
TIA
This got me thinking that what may be the difference between printf and std::cout which causes such behavior.
Yes, there is a difference. The printf() which runs in your kernel is not the standard C library printf(). A different call is made, to an on-device function (the code of which is closed, if it exists at all in CUDA C). That function uses a hardware mechanism on NVIDIA GPUs - a buffer for kernel threads to print into, which gets sent back over to the host side, and the CUDA driver then forwards it to the standard output file descriptor of the process which launched the kernel.
std::cout does not get this sort of a compiler-assisted replacement/hijacking - and its code is simply irrelevant on the GPU.
A while ago, I implemented an std::cout-like mechanism for use in GPU kernels; see this answer of mine here on SO for more information and links. But - I decided I don't really like it, and its compilation is rather expensive, so instead, I adapted a printf()-family implementation for the GPU, which is now part of the cuda-kat library (development branch).
That means I've had to answer your second question for myself:
If I wanted to implement a custom print function to be called from the GPU, how should I do it?
Unless you have access to undisclosed NVIDIA internals - the only way to do this is to use printf() calls instead of C standard library or system calls on the host side. You essentially need to modularize your entire stream over the low-level primitive I/O facilities. It is far from trivial.
In SYCL you cannot use std::cout for output on code not running on the host for similar reasons to those listed in the answer for CUDA code.
This means if you are running kernel code on the "device" (e.g. a GPU) then you need to use the stream class. There is more information about this in the SYCL developer guide section called Logging.
There is no __device__ version of std::cout, so only printf can be used in device code.
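On the CUDA side, a minimal sketch of device-side printf looks like the following. The kernel name hello and the host helper launch_hello are hypothetical; the point is only that the output sits in the device-side buffer until the host flushes it, which a synchronisation call such as cudaDeviceSynchronize triggers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void hello()
{
    // Device-side printf: the text goes into a buffer, not straight to stdout.
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

// Hypothetical host helper: launches the kernel and waits for it, which
// also gives the runtime a chance to flush the printf buffer to stdout.
cudaError_t launch_hello()
{
    hello<<<2, 2>>>();
    return cudaDeviceSynchronize();
}
```

The order in which lines from different threads appear is not something to rely on.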

Can my kernel code tell how much shared memory it has available?

Is it possible for running device-side CUDA code to know how much (static and/or dynamic) shared memory is allocated to each block of the running kernel's grid?
On the host side, you know how much shared memory a launched kernel had (or will have), since you set that value yourself; but what about the device side? It's easy to compile in the upper limit to that size, but that information is not available (unless passed explicitly) to the device. Is there an on-GPU mechanism for obtaining it? The CUDA C Programming Guide doesn't seem to discuss this issue (in or outside of the section on shared memory).
TL;DR: Yes. Use the function below.
It is possible: That information is available to the kernel code in special registers: %dynamic_smem_size and %total_smem_size.
Typically, when we write kernel code, we don't need to be aware of specific registers (special or otherwise) - we write C/C++ code. Even when we do use these registers, the CUDA compiler hides this from us through functions or structures which hold their values. For example, when we use the value threadIdx.x, we are actually accessing the special register %tid.x, which is set differently for every thread in the block. You can see these registers "in action" when you look at compiled PTX code. ArrayFire have written a nice blog post with some worked examples: Demystifying PTX code.
But if the CUDA compiler "hides" register use from us, how can we go behind that curtain and actually insist on using them, accessing them with those %-prefixed names? Well, here's how:
__forceinline__ __device__ unsigned dynamic_smem_size()
{
    unsigned ret;
    asm volatile ("mov.u32 %0, %dynamic_smem_size;" : "=r"(ret));
    return ret;
}
and a similar function for %total_smem_size. This function makes the compiler add an explicit PTX instruction, just like asm can be used for host code to emit CPU assembly instructions directly. This function should always be inlined, so when you assign
x = dynamic_smem_size();
you actually just assign the value of the special register to x.
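Putting it together, here is a sketch of both accessors plus a small kernel that reports them. The kernel report_smem and the host helper query_smem are hypothetical names for this illustration. The register names are written with the doubled %% escape that NVIDIA's inline PTX guide uses for literal percent signs; that is a choice made for this sketch, not a change to the function above.

```cuda
#include <cuda_runtime.h>

__forceinline__ __device__ unsigned dynamic_smem_size()
{
    unsigned ret;
    asm volatile ("mov.u32 %0, %%dynamic_smem_size;" : "=r"(ret));
    return ret;
}

__forceinline__ __device__ unsigned total_smem_size()
{
    unsigned ret;
    asm volatile ("mov.u32 %0, %%total_smem_size;" : "=r"(ret));
    return ret;
}

__global__ void report_smem(unsigned *out)
{
    if (threadIdx.x == 0) {
        out[0] = dynamic_smem_size();  // bytes requested at launch
        out[1] = total_smem_size();    // static + dynamic, as allocated
    }
}

// Hypothetical host helper: launches with dynamic_bytes of dynamic shared
// memory and copies the two register values back into result.
void query_smem(unsigned dynamic_bytes, unsigned result[2])
{
    unsigned *d_out = nullptr;
    cudaMalloc(&d_out, 2 * sizeof(unsigned));
    report_smem<<<1, 32, dynamic_bytes>>>(d_out);
    cudaMemcpy(result, d_out, 2 * sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}
```

The dynamic size should match the third launch parameter, while the total may be rounded up to the hardware's allocation granularity, so it is safer to treat it as a lower bound check than as an exact sum.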