CUDA - Which stream will the child grid be in?

If I use dynamic parallelism, which stream will the child grid run in?
For example, I have one kernel called A, and another kernel called B. B is launched by A.
If kernel A is running in stream_A, and I do not specify a stream for kernel B, which stream will kernel B run in? Is it the default stream, or will it inherit the stream that A is running in?

Based on the documentation, as well as my own experiments, in general there is no (ordering) relationship between streams on the host and streams on the device.
We can consider two cases: a child kernel launched into a device-created stream, and a child kernel launched into the device NULL stream (i.e. with no stream specified, as in your question).
The first case, device-created streams, is, I would say, covered explicitly in the documentation.
The second case is perhaps a bit less clear, even though the documentation does mention the device NULL stream. We can write a simple test case to sort this out for us:
$ cat cdp.cu
#include <cstdio>
#include <cassert>
#include <unistd.h>

const unsigned long long dt = 10000000000ULL;

__device__ void delay(){
    unsigned long long start = clock64();
    while (start+dt > clock64());
}

__global__ void child(){
    delay();
}

__global__ void parent(){
    child<<<1,1>>>();
    cudaError_t err = cudaGetLastError();
    assert(err == cudaSuccess);
    err = cudaDeviceSynchronize();
    assert(err == cudaSuccess);
}

__global__ void pk(){
    printf("hello\n");
}

int main(){
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    parent<<<1,1, 0, s1>>>();
    sleep(1);
    pk<<<1,1,0,s2>>>();
    cudaDeviceSynchronize();
}
$ nvcc -o cdp cdp.cu -rdc=true -lcudadevrt -lineinfo
$ time ./cdp
hello
real 0m7.276s
user 0m4.193s
sys 0m2.061s
$
(Ubuntu 18.04, CUDA 11.4, V100)
When I run the above code, after launching the cdp executable the console remains idle for approximately 1 second, then prints "hello", then remains idle for another 6-7 seconds, and then the application exits. (Please don't expect the same behavior on a WDDM GPU.)
If the NULL stream on the device were the "same as" or "inherited from" the NULL stream on the host, then we would expect that the "hello" printout would not appear until immediately before application exit. The pk kernel would not be allowed to run until the child kernel had completed, if host NULL stream semantics were involved. Therefore we must conclude that the device NULL stream is not the same as the host NULL stream.
We can use a little bit of logic to convince ourselves that this is also not exactly the same as inheriting the host (created) stream used for the parent kernel launch. We can read in the documentation that the use of the NULL stream for child kernel launches within the same threadblock will have the NULL stream behavior. However this could not be true if the device NULL stream were simply the same as/inherited from the host created stream. According to the documentation, if I have two child kernel launches from the same threadblock, one into the NULL stream, and another into a device created stream, then we would not expect these to be able to run concurrently - that is the behavior of the NULL stream. But if the NULL stream were simply the inherited host stream, a created stream, there is no reason to expect that they could not run concurrently.
So we are left with the conclusion that the device NULL stream is not the host NULL stream, and the device NULL stream is also not a host created stream. These statements seem consistent with the documentation to me.
If you would like a clarification in the documentation, the usual suggestion is to file a bug.
Rather than worry about peculiarities of NULL stream behavior, the advice I give when teaching CUDA is that if you are concerned about complex concurrency scenarios, do all your work with created streams. Leave the NULL stream behind. Anything you wish to do can be done purely with created streams. Do as you wish, of course.
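For completeness, here is a minimal sketch (mine, not from the question) of what "created streams only" looks like on the device side: the child is launched into a device-created stream rather than the device NULL stream. Note that streams created in device code must be created with the cudaStreamNonBlocking flag.
__global__ void child(){ /* ... */ }

__global__ void parent(){
    // device-created stream; the device runtime requires the non-blocking flag here
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    child<<<1, 1, 0, s>>>();    // fire-and-forget launch into the created stream
    cudaStreamDestroy(s);       // safe to call immediately; resources are released once the stream's work completes
}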

Related

Failed to test nestedReduce2.cu from book Professional CUDA C Programming

I am reading the book Professional CUDA C Programming. I've downloaded the source code from Wiley; the file being tested is chapter03/nestedReduce2.cu. The file can also be found on github.
I built the .cu file with its Makefile as well as with this simple command:
nvcc -o nestedReduce2 ./nestedReduce2.cu -rdc=true
The output was like:
./nestedReduce2 starting reduction at device 0: Quadro RTX 4000 array 1048576 grid 2048 block 512
cpu reduce elapsed 0.000858 sec cpu_sum: 1048576
gpu Neighbored elapsed 0.000404 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
gpu nested elapsed 0.044057 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
gpu nestedNosyn elapsed 0.019464 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
gpu nested2 elapsed 0.001051 sec gpu_sum: 946688 <<<grid 2048 block 512>>>
Test failed!
How to solve this problem? Is there some update for CUDA recursive programming since the last update of the book?
I don't have that book and have never read it, so I can't make any statements about what is in it; my response is directed to the code posted on the github site and nothing else.
Concerning the kernel in question:
__global__ void gpuRecursiveReduce2(int *g_idata, int *g_odata, int iStride,
                                    int const iDim)
{
    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * iDim;

    // stop condition
    if (iStride == 1 && threadIdx.x == 0)
    {
        g_odata[blockIdx.x] = idata[0] + idata[1];
        return;
    }

    // in place reduction
    idata[threadIdx.x] += idata[threadIdx.x + iStride];

    // nested invocation to generate child grids
    if(threadIdx.x == 0 && blockIdx.x == 0)
    {
        gpuRecursiveReduce2<<<gridDim.x, iStride / 2>>>(g_idata, g_odata,
                                                        iStride / 2, iDim);
    }
}
I believe it should be fairly evident that, for correctness, the child kernel launch:
gpuRecursiveReduce2<<<gridDim.x, iStride / 2>>>(g_idata, g_odata,
                                                iStride / 2, iDim);
should not be allowed to execute until the preceding parent reduction:
// in place reduction
idata[threadIdx.x] += idata[threadIdx.x + iStride];
is complete. Both operations potentially span up to half of the entire dataset, and therefore depend on the results of multiple blocks being complete, for correctness.
On my V100 GPU (CUDA 11.4), the code gives the expected result. However as OP has demonstrated, it may not give the expected result in all scenarios.
In order to be confident of correct results, we would need something like a grid-wide sync between the parent reduction step and the child kernel execution, for each sweep phase (except the last, since there is only 1 thread per block in that case, and all blocks terminate before reaching the child kernel launch).
Unfortunately, the cooperative groups grid-wide sync is not supported with CUDA dynamic parallelism (CDP).
The other grid-wide sync formally provided by CUDA is the kernel launch boundary. Therefore:
How to solve this problem?
my suggestion would be to dispense with CDP launches and use a set of (non-recursive) kernel launches driven by a for-loop in host code. For someone at the level of study indicated here, this should be a trivial refactoring; a rough sketch of its shape follows.
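As a minimal sketch only (the kernel name reduceSweep and the host-side setup are illustrative, not the book's code), each sweep becomes its own kernel launch, and the kernel launch boundary supplies the grid-wide sync between sweeps:
#include <cstdio>
#include <cuda_runtime.h>

// one sweep per launch; same indexing as the book's kernel, but no child launch
__global__ void reduceSweep(int *g_idata, int *g_odata, int iStride, int const iDim)
{
    int *idata = g_idata + blockIdx.x * iDim;
    if (iStride == 1 && threadIdx.x == 0)
    {
        g_odata[blockIdx.x] = idata[0] + idata[1];
        return;
    }
    idata[threadIdx.x] += idata[threadIdx.x + iStride];
}

int main()
{
    const int size = 1 << 20, blockSize = 512, grid = size / blockSize;
    int *d_idata, *d_odata, *h_idata = new int[size], *h_odata = new int[grid];
    for (int i = 0; i < size; i++) h_idata[i] = 1;   // placeholder input: all ones
    cudaMalloc((void**)&d_idata, size * sizeof(int));
    cudaMalloc((void**)&d_odata, grid * sizeof(int));
    cudaMemcpy(d_idata, h_idata, size * sizeof(int), cudaMemcpyHostToDevice);

    // successive launches into the same stream serialize,
    // providing the grid-wide sync between sweeps
    for (int stride = blockSize / 2; stride >= 1; stride /= 2)
        reduceSweep<<<grid, stride>>>(d_idata, d_odata, stride, blockSize);

    cudaMemcpy(h_odata, d_odata, grid * sizeof(int), cudaMemcpyDeviceToHost);
    int sum = 0;
    for (int i = 0; i < grid; i++) sum += h_odata[i];
    printf("gpu sum: %d (expected %d)\n", sum, size);
    return 0;
}
Since no child kernels are launched, this version no longer needs -rdc=true or -lcudadevrt to compile.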
Additional discussion:
In particular, we could surmise that a case where the GPU is "smaller" (i.e. fewer SMs) and the grid size is "larger" might be a problem. This might give rise to a situation where child kernel blocks are executing prior to the completion of some parent kernel blocks.
Coupled with this, a question might be asked: "is there any characteristic of null stream behavior (e.g. synchronization) between the parent kernel null stream and the child kernel null stream that would (or should) have created the desired ordering?" The answer is no. You can refer to the documentation, where null stream behavior of CDP kernels is discussed.
In my view it is clear that the child kernel NULL stream does not synchronize with the parent kernel null stream. As an additional thought experiment, we should keep in mind that the documentation states that a parent kernel is not considered complete until all child kernels are complete. Coupled with that, if we assumed null stream synchronizing behavior between parent and child, it would immediately give rise to deadlock. So we reject that hypothesis.
For additional inspection, we can derive a test case to convince ourselves that a parent kernel null stream and child kernel null stream do not interact:
$ cat t2099.cu
#include <iostream>

__global__ void child(int *d, int val){
    *d = val;
}

__global__ void parent(int *d, int val){
    *d = val;
    if (blockIdx.x == 1048577) child<<<1,1>>>(d, 1);
}

int main(){
    int *d;
    cudaMallocManaged(&d, sizeof(d[0]));
    parent<<<2*1048576, 1>>>(d, 0);
    cudaDeviceSynchronize();
    std::cout << d[0] << std::endl;
}
$ nvcc -o t2099 t2099.cu -rdc=true
$ ./t2099
0
$
In the above simplified test case, we are launching a parent kernel of ~2M blocks, where all parent kernel blocks set a variable to zero, and the child kernel launched from a single block picked arbitrarily sets the variable to 1.
If there were parent/child synchronization, we would expect the variable to be 1 at conclusion. Since it is 0, we conclude that there is no synchronization between parent and child kernel. The child kernel (block) somehow "intermixed" with the execution of the parent kernel blocks. (the "intermixing" is not in any way guaranteed by CUDA, but we could surmise that one reason the block scheduler might choose to intermix is because the parent kernel block is not complete until its child kernel block is complete. Therefore, from a throughput perspective, it might be advantageous to make forward progress on the child kernel, in the midst of the parent kernel.)
This discussion and experiment help to reinforce the idea that the presented code needs/requires a grid-wide sync for correctness, and neither the code itself nor the CDP mechanism provide any guarantee of that.
(for completeness, the test case I presented is not guaranteed to produce 0, and it may not produce 0 if you run it on your machine. The fact that it does produce 0 in at least one test setup - mine - is sufficient for the argument. In my test case, if I change the number of blocks launched to 1048578, then the output changes from 0 to 1.)

Writing from Device to Host and notifying the host

Using CUDA 5 with VS 2012 and capability 3.5 (Titan and K20).
At particular stages of my kernel execution, I want to send a generated data chunk to the host memory and notify the host that the data is ready, so the host will operate on it.
I cannot wait until the end of the kernel execution to read the data back from the device, because:
The data is no longer relevant to the device once it is calculated, so there is no point keeping it to the end.
The data size is too large to fit on the device memory and wait until the end.
The host should not have to wait until the end of the kernel execution to start processing the data.
Could you point me to the path I have to take, and the possible CUDA concepts and functions I have to use, to achieve my requirements? Put simply, how can I write to the host and notify the host that a chunk of data is ready for host processing?
N.B. Each thread does not share any generated data with any other thread; they run independently. So, as far as I know (and please correct me if I am wrong), the concept of blocks, threads and warps does not affect the question. Or in other words, if they aid the answer, I am free to alter their combination.
Below is a sample code that shows what I am trying to do:
#pragma once
#include <conio.h>
#include <cstdio>
#include <cuda_runtime_api.h>

__global__ void Kernel(size_t length, float* hResult)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // Processing multiple data chunks
    for(int i = 0; i < length; i++)
    {
        // Once this is assigned, I don't need it on the device anymore.
        hResult[i + (tid * length)] = i * 100;
    }
}

void main()
{
    size_t length = 10;
    size_t threads = 2;
    float* hResult;

    // An array that will hold all data from all threads
    cudaMallocHost((void**)&hResult, threads * length * sizeof(float));
    Kernel<<<threads,1>>>(length, hResult);

    // I DO NOT want to wait to the end and block to get the data
    cudaError_t error = cudaDeviceSynchronize();
    if (error != cudaSuccess) { throw error; }

    for(int i = 0; i < threads * length; i++)
    {
        printf("%f\n", hResult[i]);
    }
    cudaFreeHost(hResult);
    system("pause");
}
Here is one possible approach. At a high level, on the device:
You'll need to write the data to either device global memory (allocated previously with cudaMalloc) or else directly to host memory (allocated previously with cudaHostAlloc). This memory should be accessed via a volatile pointer.
You may wish to do all the data writing to this region from a single threadblock, to be sure that all the data is written prior to the following steps
You'll then want to issue a __threadfence() call (if you're using device global memory) or a __threadfence_system() call (if using host memory) prior to the following steps
Next you'll write to a special location in device global memory or host memory, let's call it the mailbox location, with a specific value indicating the data is ready. This location should also be accessed with a volatile pointer.
Optionally issue another __threadfence() or __threadfence_system() call
For device memory usage on the receiving end, again, both regions (payload and "mailbox") should be accessed using a volatile pointer.
On the host:
Before launching the kernel, the host will need to set the mailbox location to a default value.
After launching the kernel, the host thread will need to "poll" the mailbox location, looking for the specific value indicating data is ready
Once the specific value is seen, indicating that the data is ready, the host can consume the data
Optionally, if you want to repeat this process, the host can reset the mailbox location to the default value. The device can check for this default value before updating the data block with new data.
Both the mailbox location and the payload region should be accessed by the host thread using a volatile pointer.
Note that even with the above process, there is still an implied device-wide synchronization needed, if the data is being generated/created from multiple threadblocks. The only straightforward device-wide synchronization available is the kernel launch (or completion of the kernel, specifically). Copying the data from a single threadblock simply moves the requirement for device-wide sync out of this particular sequence (to somewhere before this sequence).
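For illustration only, here is a minimal sketch of that pattern, assuming a single producing threadblock, pinned mapped host memory, and a 64-bit UVA system where the host pointers can be passed directly to the kernel (otherwise cudaHostGetDevicePointer would be needed); all names are illustrative:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void producer(volatile float *payload, volatile int *mailbox, int n)
{
    // a single block writes the whole payload
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        payload[i] = (float)i;
    __syncthreads();
    if (threadIdx.x == 0) {
        __threadfence_system();   // make the payload visible to the host first
        *mailbox = 1;             // then raise the "data ready" flag
    }
}

int main()
{
    const int n = 1024;
    float *h_payload;
    int   *h_mailbox;
    cudaHostAlloc((void**)&h_payload, n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void**)&h_mailbox, sizeof(int), cudaHostAllocMapped);
    *h_mailbox = 0;                                   // default value, set before launch
    producer<<<1, 256>>>(h_payload, h_mailbox, n);
    while (*((volatile int *)h_mailbox) != 1);        // host polls the mailbox
    printf("last element: %f\n", h_payload[n - 1]);   // consume the data
    cudaDeviceSynchronize();
    return 0;
}
Note that the spin-wait keeps a host CPU core busy; in practice you might poll less aggressively or dedicate a separate host thread to it.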
The reasons you give don't really suggest to me that the code could not be refactored to create the data on a kernel-launch by kernel-launch basis, which would neatly solve these issues and eliminate the need for the above process as well.
EDIT: responding to a question in the comments.
It's difficult to be more specific about how to refactor the code to deliver one data chunk per kernel call, without a specific example.
Let's take an image processing case, where I have a video sequence of 30 frames stored in global memory. The kernel will process each frame according to some algorithm, then make the processed data available to the host.
In your proposal, after the kernel is done processing a frame, it can signal to the host that the data is ready, and go on to process the next frame. The problem is, if the frame is processed by multiple threadblocks, there's no easy way to know when all threadblocks are done processing that frame. A device-wide synchronization barrier might be what is needed, but it doesn't exist conveniently, except via the kernel call mechanism. However, presumably inside such a kernel we might have a sequence like this:
while (more_frames)
process a frame
signal host
increment frame pointer
In a refactored approach, we would move the loop outside the kernel, to host code:
while (more_frames)
call kernel to process frame
consume frame
increment frame pointer
By doing this, the kernel launch boundary provides the explicit synchronization needed to know when the frame processing is complete, and the data can be consumed.
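For illustration, a hedged sketch of that refactored host loop (the frame kernel and sizes are placeholders I made up, not the asker's code); the blocking copy after each launch provides the "this frame is done" guarantee across all threadblocks:
#include <cstdio>
#include <cuda_runtime.h>

// placeholder per-frame kernel: invert each pixel value
__global__ void processFrame(const unsigned char *in, unsigned char *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 255 - in[i];
}

int main()
{
    const int numFrames = 30, frameElems = 1920 * 1080;
    unsigned char *d_in, *d_out, *h_out;
    cudaMalloc((void**)&d_in,  (size_t)numFrames * frameElems);
    cudaMalloc((void**)&d_out, frameElems);
    cudaMallocHost((void**)&h_out, frameElems);
    // ... fill d_in with the 30 frames ...
    for (int f = 0; f < numFrames; ++f) {
        processFrame<<<(frameElems + 255) / 256, 256>>>(d_in + (size_t)f * frameElems, d_out, frameElems);
        cudaMemcpy(h_out, d_out, frameElems, cudaMemcpyDeviceToHost);  // blocks until this frame is done
        // consume h_out on the host here
    }
    return 0;
}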

Is this way of allocating a device object "correct"?

So I asked a question before about how to allocate an object on the device directly instead of the "normal" way:
Allocate on host
Copy to device
Copy dynamically allocated fields to device one by one
The main reason I want it to be allocated directly on the device is that I don't want to copy each dynamically allocated field inside one by one manually.
Anyway, so I think I have actually found a way to do this, and I would like to see some input from more experienced CUDA programmers (like Robert Crovella).
Let's see the code first:
#include <cstdio>   // for printf

class Particle
{
public:
    int *data;
    __device__ Particle()
    {
        data = new int[10];
        for (int i=0; i<10; i++)
        {
            data[i] = i*2;
        }
    }
};

__global__ void test(Particle **result)
{
    Particle *p = new Particle();
    result[0] = p; // store memory location
}

__global__ void test2(Particle *p)
{
    for (int i=0; i<10; i++)
        printf("%d\n", p->data[i]);
}

int main() {
    // initialise and allocate an object on device
    Particle **d_p_addr;
    cudaMalloc((void**)&d_p_addr, sizeof(Particle*));
    test<<<1,1>>>(d_p_addr);

    // copy pointer to host memory
    Particle **p_addr = new Particle*[1];
    cudaMemcpy(p_addr, d_p_addr, sizeof(Particle*), cudaMemcpyDeviceToHost);

    // test:
    test2<<<1,1>>>(p_addr[0]);
    cudaDeviceSynchronize();
    printf("Done!\n");
}
As you can see, what I do is:
Call a kernel that initialises an object on the device and stores its pointer in an output parameter
Copy the pointer to the allocated object from device memory to host memory
Now you can pass that pointer to another kernel just fine !
This code actually works, but I'm not sure if there are drawbacks.
Cheers
EDIT: as pointed out by Robert, there was no point in creating a pointer on host first, so I removed that part from the code.
Yes, you can do that.
You are allocating an object on the device, and passing a pointer to it from one kernel to the next. Since a characteristic of device malloc/new is that allocations persist for the lifetime of the context (not just the kernel), the allocations do not disappear at the end of the kernel. This is basically standard C++ behavior, but I thought it might be worth repeating. The pointer(s) that you are passing from one kernel to the next are therefore valid in any subsequent device code in the context of your program.
There is a wrinkle you might want to be aware of, however. Pointers returned by dynamic allocations done on the device (such as via new or malloc in device code) are not usable for transferring data from device to host, at least in the present incarnation of cuda (cuda 5.0 and earlier). The reasons for this are somewhat arcane (translation: I can't adequately explain it) but it's instructive to think about the fact that dynamic allocations come out of the device heap, a region that is logically separate from the region of global memory that runtime API functions like cudaMalloc and cudaMemcpy use. An oblique indication of this is given here:
Memory reserved for the device heap is in addition to memory allocated through host-side CUDA API calls such as cudaMalloc().
If you want to prove this wrinkle to yourself, try adding the following seemingly innocuous code after your second kernel call:
Particle *q;
q = (Particle *)malloc(sizeof(Particle));
cudaMemcpy(q, p_addr[0], sizeof(Particle), cudaMemcpyDeviceToHost);
If you then check the API error value returned from that cudaMemcpy operation, you will observe the error.
As an unrelated comment, your use of the pointer *p is a little freaky, in my book, and the compiler warning given about it is an indication of the weirdness. It's not technically illegal, since you're not actually doing anything meaningful with that pointer (you immediately replace it in your kernel 1), but nevertheless it's weird because you're passing a pointer to a kernel that you haven't properly cudaMalloc'ed. In the context of what you're demonstrating, it's completely unnecessary, and your first parameter to kernel 1 could be eliminated and replaced with a local variable, eliminating the weirdness and compiler warning.
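If you ever do need the contents of such an object on the host, one workaround consistent with the above (a sketch that continues your host code; the gather kernel and staging names are made up) is to have a kernel copy the data from the device-heap allocation into a buffer allocated with cudaMalloc, which cudaMemcpy can then access:
// hypothetical helper: copy the device-heap allocated array into cudaMalloc'ed memory
__global__ void gather(Particle *p, int *staging)
{
    for (int i = 0; i < 10; i++)
        staging[i] = p->data[i];
}

// host side, after the pointer copy shown above:
int *d_staging, h_data[10];
cudaMalloc((void**)&d_staging, 10 * sizeof(int));
gather<<<1,1>>>(p_addr[0], d_staging);
cudaMemcpy(h_data, d_staging, 10 * sizeof(int), cudaMemcpyDeviceToHost);  // this copy is legal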

JCuda Pinned Memory Example

JCuda + GeForce GT640 question:
I'm trying to reduce the latency associated with copying memory from Device to Host after the result has been computed by the GPU. Doing the simple Vector Add program I found that the bulk of the latency is indeed copying the result buffer back to the Host side. The transfer latency of the source buffers to the Device side is negligible ~.30ms while copying the result back is on the order of 20ms.
I did the research and found that a better alternative to copying out the results is to use pinned memory. From what I learned, this memory is allocated on the host side, but the kernel has direct access to it over PCI-e, in turn yielding higher speed than copying the result in bulk after the computation. I'm using the following example, but the results aren't what I expect.
Kernel: {Simple Example to illustrate point, Launching 1 block 1 thread only}
extern "C"
__global__ void add(int* test)
{
    test[0]=1; test[1]=2; test[2]=3; test[3]=4; test[4]=5;
}
Java:
import java.io.*;
import jcuda.*;
import jcuda.runtime.*;
import jcuda.driver.*;
import static jcuda.runtime.cudaMemcpyKind.*;
import static jcuda.driver.JCudaDriver.*;

public class JCudaTest
{
    public static void main(String args[])
    {
        // Initialize the driver and create a context for the first device.
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        // Load the ptx file.
        CUmodule module = new CUmodule();
        JCudaDriver.cuModuleLoad(module, "JCudaKernel.ptx");

        // Obtain a function pointer to the kernel function.
        CUfunction function = new CUfunction();
        JCudaDriver.cuModuleGetFunction(function, module, "add");

        Pointer P = new Pointer();
        JCudaDriver.cuMemAllocHost(P, 5*Sizeof.INT);
        Pointer kernelParameters = Pointer.to(P);

        // Call the kernel function with 1 block, 1 thread:
        JCudaDriver.cuLaunchKernel(function, 1, 1, 1, 1, 1, 1, 0, null, kernelParameters, null);

        int [] T = new int[5];
        JCuda.cudaMemcpy(Pointer.to(T), P, 5*Sizeof.INT, cudaMemcpyHostToHost);

        // Print the results:
        for(int i=0; i<5; i++)
            System.out.println(T[i]);
    }
}
1.) Build the Kernel:
root@NVS295-CUDA:~/JCUDA/MySamples# nvcc -ptx JCudaKernel.cu
root@NVS295-CUDA:~/JCUDA/MySamples# ls -lrt | grep ptx
-rw-r--r-- 1 root root 3295 Mar 27 17:46 JCudaKernel.ptx
2.) Build the Java:
root@NVS295-CUDA:~/JCUDA/MySamples# javac -cp "../JCuda-All-0.5.0-bin-linux-x86/*:." JCudaTest.java
3.) Run the code:
root@NVS295-CUDA:~/JCUDA/MySamples# java -cp "../JCuda-All-0.5.0-bin-linux-x86/*:." JCudaTest
0
0
0
0
0
Expecting:
1
2
3
4
5
Note: I'm using JCuda0.5.0 for x86 if that matters.
Please let me know what I'm doing wrong and thanks in advance:
Ilir
The problem here is that the device may not access host memory directly.
Admittedly, the documentation sounds misleading here:
cuMemAllocHost
Allocates bytesize bytes of host memory that is page-locked and accessible to the device...
This sounds like a clear statement. However, "accessible" here does not mean that the memory may be used directly as a kernel parameter in all cases. This is only possible on devices that support Unified Addressing. For all other devices, it is necessary to obtain a device pointer that corresponds to the allocated host pointer, with cuMemHostGetDevicePointer.
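For reference, the corresponding driver API sequence in C looks roughly like this (the JCuda driver API calls map onto it essentially one-to-one); this is only a hedged sketch, assuming the module/function setup from your code, and it has not been tested against your exact configuration:
CUdeviceptr dptr;
void *hptr;
// page-locked host memory, explicitly requested as device-mapped
// (on some setups the context must be created with the CU_CTX_MAP_HOST flag)
cuMemHostAlloc(&hptr, 5 * sizeof(int), CU_MEMHOSTALLOC_DEVICEMAP);
cuMemHostGetDevicePointer(&dptr, hptr, 0);       // device-side alias of the host allocation
void *kernelParams[] = { &dptr };                // pass the *device* pointer to the kernel
cuLaunchKernel(function, 1,1,1, 1,1,1, 0, NULL, kernelParams, NULL);
cuCtxSynchronize();                              // after this, the results are visible through hptr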
The key point of page-locked host memory is that the data transfer between the host and device is faster. An example of how this memory may be used in JCuda can be seen in the JCudaBandwidthTest sample (this is for the runtime API, but for the driver API, it works analogously).
EDIT:
Note that the new Unified Memory feature of CUDA 6 actually supports what you originally intended to do: With cudaMallocManaged you can allocate memory that is directly accessible to the host and the device (in the sense that it can, for example, be passed to a kernel, written by the device, and afterwards read by the host without additional effort). Unfortunately, this concept does not map very well to Java, because the memory is still managed by CUDA - and this memory can not replace the memory that is, for example, used by the Java VM for a float[] array or so. But at least it should be possible to create a ByteBuffer from the memory that was allocated with cudaMallocManaged, so that you may access this memory, for example, as a FloatBuffer.

CUDA Kernels Randomly Fail, but only when I use certain transcendental functions

I've been working on a CUDA program that randomly crashes with an unspecified launch failure, fairly frequently. Through careful debugging, I localized which kernel was failing, and furthermore that the failure occurred only if certain transcendental functions were called from within the CUDA kernel (e.g. sinf() or atanhf()).
This led me to write a much simpler program (see below) to confirm that these transcendental functions really were causing an issue, and it looks like that is indeed the case. When I compile and run the code below, which just makes repeated calls to a kernel that uses tanh and atanh, sometimes the program works, and sometimes it prints Error with Kernel along with a message from the driver that says:
NVRM: XiD (0000:01:00): 13, 0002 000000 000050c0 00000368 00000000 0000080
With regards to frequency, it probably crashes 50% of the time that I run the executable.
From what I've read online, it sounds like XiD 13 is analogous to a host-based seg fault. However, given the array indexing, I can't see how that could be the case. Furthermore the program doesn't crash if I replace the transcendental functions in the kernel with other functions (e.g. repeated floating point subtraction and addition). That is, I don't get the XiD error message, and the program ultimately returns the correct value of atanh(0.7).
I'm running cuda-5.0 on Ubuntu 11.10 x64 Desktop. Driver version is 304.54, and I'm using a GeForce 9800 GTX.
I'm inclined to say that this is a hardware issue or a driver bug. What's strange is that the example applications from nvidia work fine, perhaps because they do not use the affected transcendental functions.
The final bit of potentially important information is that if I run either my main project, or this test program under cuda-memcheck, it reports no errors, and never crashes. Honestly, I'd just run my project under cuda-memcheck, but the performance hit makes it impractical.
Thanks in advance for any help/insight here. If any one has a 9800 GTX and would be willing to run this code to see if it works, it would be greatly appreciated.
#include <iostream>
#include <stdlib.h>
using namespace std;

__global__ void test_trans (float *a, int length) {
    if ((threadIdx.x + blockDim.x*blockIdx.x) < length) {
        float temp=0.7;
        for (int i=0;i<100;i++) {
            temp=atanh(temp);
            temp=tanh(temp);
        }
        a[threadIdx.x + blockDim.x*blockIdx.x] = atanh(temp);
    }
}

int main () {
    float *array_dev;
    float *array_host;
    unsigned int size=10000000;
    if (cudaSuccess != cudaMalloc ((void**)&array_dev, size*sizeof(float)) ) {
        cerr << "Error with memory Allocation\n"; exit (-1);
    }
    array_host = new float [size];
    for (int i=0;i<10;i++) {
        test_trans <<< size/512+1, 512 >>> (array_dev, size);
        if (cudaSuccess != cudaDeviceSynchronize()) {
            cerr << "Error with kernel\n"; exit (-1);
        }
    }
    cudaMemcpy (array_host, array_dev, sizeof(float)*size, cudaMemcpyDeviceToHost);
    cout << array_host[size-1] << "\n";
}
Edit: I dropped this project for a few months, but yesterday upon updating to driver version 319.23, I'm no longer having this problem. I think the issue I described must have been a bug that was fixed. Hope this helps.
The asker determined that this was a temporary issue fixed by a newer driver release. See the edit to the original question.