While I've been writing CUDA kernels for a while now, I've not used dynamic parallelism (DP) yet. I've come up against a task for which I think it might fit; however, the way I would like to be able to use DP is:
If a block figures out it needs more threads to complete its work, it spawns them; it imparts to its spawned threads "what it knows" - essentially, the contents of its shared memory, of which each block of spawned threads gets a copy in its own shared memory; the spawned threads then use what their parent block "knew" to figure out what they need to continue doing, and do it.
AFAICT, though, this "inheritance" of shared memory does not happen. Is global memory (and constant memory via kernel arguments) the only way the "parent" DP kernel block can impart information to its "child" blocks?
There is no mechanism of the type you are envisaging. A parent thread cannot share either its local memory or its block's shared memory with a child kernel launched via dynamic parallelism. The only options are to use global or dynamically allocated heap memory when it is necessary for a parent thread to pass transient data to a child kernel.
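To illustrate that workaround, here is a minimal sketch (the State type, the kernel names, and the launch configuration are made up for the example; it assumes a device that supports dynamic parallelism and compilation with relocatable device code, -rdc=true): each parent block spills its shared-memory "knowledge" into its own slice of a global staging buffer, and each child block copies it back into its own shared memory.

struct State { int data[32]; };                  // whatever a parent block "knows"

__global__ void child(const State* inherited)
{
    __shared__ State s;
    if (threadIdx.x == 0)
        s = *inherited;                          // each child block re-materialises the parent's state
    __syncthreads();
    /* ... continue working from s ... */
}

__global__ void parent(State* staging)           // one State per parent block, cudaMalloc'ed by the host
{
    __shared__ State s;
    /* ... fill s cooperatively ... */
    __syncthreads();                             // make sure s is complete

    if (threadIdx.x == 0) {
        staging[blockIdx.x] = s;                 // spill the block's "knowledge" to global memory
        child<<<4, 32>>>(&staging[blockIdx.x]);  // child blocks read it back from there
    }
}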
I have a warp which writes some data to shared memory - with no overwrites - and soon after reads from shared memory. While there may be other warps in my block, they're not going to touch any part of that shared memory or write to anywhere my warp of interest reads from.
Now, I recall that despite warps executing in lockstep, we are not guaranteed that the shared memory reads following the shared memory writes will return the respective values supposedly written earlier by the warp. (This could theoretically be due to instruction reordering or - as @RobertCrovella points out - the compiler optimizing a shared memory access away.)
So, we need to resort to some explicit synchronization. Obviously, the block-level __syncthreads() works. Here is what it does:
__syncthreads() is used to coordinate communication between the threads of the same block. When some threads within a block access the same addresses in shared or global memory, there are potential read-after-write, write-after-read, or write-after-write hazards for some of these memory accesses. These data hazards can be avoided by synchronizing threads in-between these accesses.
That's too powerful for my needs:
It applies to global memory also, not just shared memory.
It performs inter-warp synchronization; I only need intra-warp.
It prevents all types of hazards: R-after-W, W-after-R, and W-after-W; I only need R-after-W.
It also works for cases of multiple threads performing writes to the same location in shared memory; in my case all shared memory writes are disjoint.
On the other hand, something like __threadfence_block() does not seem to suffice. Is there anything "in-between" those two levels of strength?
Notes:
Related question: CUDA __syncthreads() usage within a warp.
If you're going to suggest I use shuffling instead, then, yes, that's sometimes possible - but not if you want to have array access to the data, i.e. dynamically decide which element of the shared data you're going to read. That would probably spill into local memory, which seems scary to me.
I was thinking maybe volatile could be useful to me, but I'm not sure if using it would do what I want.
If you have an answer that assumes the compute capability is at least XX.YY, that's useful enough.
If I understand @RobertCrovella correctly, this fragment of code should be safe from the hazard:
/* ... */
volatile MyType* ptr = get_some_shared_mem();
ptr[lane::index()] = foo();
auto other_lane_index = bar(); // returns a value within 0..31
auto other_lane_value = ptr[other_lane_index];
/* ... */
because of the use of volatile. (And assuming bar() doesn't introduce hazards of its own.)
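For completeness, here is a minimal sketch (assuming CUDA 9 or later, a fully converged warp, and the same hypothetical helpers as above) of the warp-level barrier that seems to sit in between the two levels of strength: __syncwarp() synchronizes the lanes in its mask and guarantees memory ordering among them, which would cover the read-after-write case without a block-wide barrier or volatile.

/* ... */
MyType* ptr = get_some_shared_mem();
ptr[lane::index()] = foo();                    // each lane writes its own, disjoint slot
__syncwarp();                                  // warp-level barrier with memory ordering
auto other_lane_index = bar();                 // returns a value within 0..31
auto other_lane_value = ptr[other_lane_index]; // may read a slot written by another lane
/* ... */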
Theoretical question about CUDA and GPU parallel calculations.
As I understand it, a kernel is code (a function) which is executed by the GPU.
Each kernel has (is executed by) a grid, which consists of blocks, and blocks have threads.
So each kernel (its code) is executed by potentially thousands of threads.
I have a question about shared memory and synchronization in kernel code.
Could you justify the necessity of synchronization in kernel code that uses shared memory?
How does synchronization affect processing efficiency?
CW answer to get this off the unanswered list:
Could you justify the necessity of synchronization in kernel code that uses shared memory?
__syncthreads() is frequently found in kernels that use shared memory, after the shared memory load, to prevent race conditions. Since the shared memory is usually loaded cooperatively (by all threads in the block), it's necessary to make sure that all threads have completed the loading operation before any thread begins to use the loaded data for further processing.
__syncthreads() is documented here.
Note that it only synchronizes threads within a given block, not grid-wide.
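For example, a minimal sketch of that pattern (the kernel name and tile size are illustrative, and it assumes blockDim.x == 256):

__global__ void reverse_block(const float* in, float* out)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[i];                   // cooperative load: each thread writes one element
    __syncthreads();                             // wait until the whole tile has been written

    // only now is it safe to read elements written by *other* threads
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}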
Is it possible to share a cudaMalloc'ed GPU buffer between different contexts (CPU threads) which use the same GPU? Each context allocates an input buffer which needs to be filled up by a pre-processing kernel that will use the entire GPU and then distribute the output to them.
This scenario is ideal for avoiding multiple data transfers to and from the GPUs. The application is a beamformer, which will combine multiple antenna signals and generate multiple beams, where each beam will be processed by a different GPU context. The entire processing pipeline for the beams is already in place; I just need to add the beamforming part. Having each thread generate its own beam would duplicate the input data, so I'd like to avoid that (also, it's much more efficient to generate multiple beams in one go).
Each CUDA context has its own virtual memory space, therefore you cannot use a pointer from one context inside another context.
That being said, as of CUDA 4.0 by default there is one context created per process and not per thread. If you have multiple threads running with the same CUDA context, sharing device pointers between threads should work without problems.
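As a rough illustration (the buffer size, kernel, and launch configuration are made up), with the runtime API all host threads of a process share the primary context, so a device pointer allocated on one thread can be used from another:

#include <cuda_runtime.h>
#include <thread>

__global__ void consume(float* buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 0.0f;                   // placeholder work on the shared buffer
}

int main()
{
    float* d_buf = nullptr;
    cudaMalloc(&d_buf, 1024 * sizeof(float));   // allocated on the main thread

    std::thread worker([d_buf] {
        // same process, same (primary) context: the pointer is valid here too
        consume<<<4, 256>>>(d_buf, 1024);
        cudaDeviceSynchronize();
    });
    worker.join();

    cudaFree(d_buf);
    return 0;
}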
I don't think multiple threads can run with the same CUDA context. I have done the experiment: a parent CPU thread creates a context and then forks a child thread. The child thread launches a kernel using the context created by the parent thread (via cuCtxPushCurrent(ctx)). The program just hangs there.
I read some CUDA documentation that refers to local memory. (It is mostly the early documentation.) The device-properties reports a local-mem size (per thread). What does 'local' memory mean? What is 'local' memory? Where is 'local' memory? How do I access 'local' mem? It is __device__ memory, no?
The device-properties also reports: global, shared, & constant mem size.
Are these statements correct:
Global memory is __device__ memory. It has grid scope, and a lifetime of the grid (kernel).
Constant memory is __device__ __constant__ memory. It has grid scope & a lifetime of the grid (kernel).
Shared memory is __device__ __shared__ memory. It has single block scope & a lifetime of that block (of threads).
I'm thinking shared mem is SM memory, i.e. memory that only that single SM has direct access to, and a resource that is rather limited. Isn't an SM assigned a bunch of blocks at a time? Does this mean an SM can interleave the execution of different blocks (or not)? i.e. run block A's threads until they stall, then run block B's threads until they stall, then swap back to block A's threads again. OR does the SM run a set of threads for block A until they stall, then swap in another set of block A's threads, continuing this way until block A is exhausted, and only then begin work on block B?
I ask because of shared memory. If a single SM is swapping code in from 2 different blocks, then how does the SM quickly swap in/out the shared memory chunks?
(I'm thinking the latter scenario is true, and there is no swapping in/out of shared memory space. Block A runs until completion, then block B starts execution. Note: block A could belong to a different kernel than block B.)
From the CUDA C Programming Guide section 5.3.2.2, we see that local memory is used in several circumstances:
When each thread has some arrays but their size is not known at compile time (so they might not fit in the registers)
When the size of the arrays are known at compile time, and this size is too big for register memory (this can also happen with big structs)
When the kernel has already used up all the register memory (so if we have filled the registers with n ints, the n+1th int will go into local memory) - this last case is register spilling, and it should be avoided, because:
"Local" memory actually lives in the global memory space, which means reads and writes to it are comparatively slow compared to register and shared memory. You'll access local memory every time you use some variable, array, etc in the kernel that doesn't fit in the registers, isn't shared memory, and wasn't passed as global memory. You don't have to do anything explicit to use it - in fact you should try to minimize its use, since registers and shared memory are much faster.
Edit:
Re: shared memory, you cannot have two blocks exchanging shared memory or looking at each other's shared memory. Since the order of execution of blocks is not guaranteed, if you tried to do this you might tie up an SM for hours waiting for another block to get executed. Similarly, two kernels running on the device at the same time can't see each other's memory UNLESS it is global memory, and even then you're playing with fire (of race conditions). As far as I am aware, blocks/kernels can't really send "messages" to each other. Your scenario doesn't really make sense, since the order of execution for the blocks will be different every time and it's bad practice to stall a block waiting for another.
I am new to CUDA programming, and I am mostly working with per-block shared memory for performance reasons. The way my program is structured right now, I use one kernel to load the shared memory and another kernel to read the pre-loaded shared memory. But, as I understand it, shared memory cannot persist between two different kernels.
I have two solutions in mind; I am not sure about the first one, and the second might be slow.
First solution: Instead of using two kernels, I use one kernel. After loading the shared memory, the kernel may wait for an input from the host, perform the operation and then return the value to the host. I am not sure whether a kernel can wait for a signal from the host.
Second solution: After loading the shared memory, copy the shared memory contents into global memory. When the next kernel is launched, copy the values from global memory back into shared memory and then perform the operation.
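A rough sketch of what I mean by the second solution (the kernel names, the 256-thread block size, and the trivial "operation" are just placeholders):

__global__ void load_kernel(const float* in, float* staging)
{
    __shared__ float smem[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    smem[threadIdx.x] = in[i];               // the "loading" work (stands in for something more involved)
    __syncthreads();
    staging[i] = smem[threadIdx.x];          // save the block's shared memory to global memory
}

__global__ void compute_kernel(const float* staging, float* out)
{
    __shared__ float smem[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    smem[threadIdx.x] = staging[i];          // restore from global memory
    __syncthreads();
    out[i] = smem[threadIdx.x];              // perform the actual operation here
}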
Please comment on the feasibility of the two solutions.
I would use a variation of your proposed first solution: as you already suspected, you can't wait for host input in a kernel - but you can synchronise the threads of your kernel at a point. Just call __syncthreads() in your kernel after loading your data into shared memory.
I don't really understand your second solution: why would you copy data to shared memory just to copy it back to global memory in the first kernel? Or would this first kernel also compute something? In that case I guess it will not help to store the preliminary results in shared memory first; I would rather store them directly in global memory (however, this might depend on the algorithm).