When a CUDA stream is created on a specific device via:
int device = 1; // Example number
cudaSetDevice(device);
cudaStream_t cudaStream;
cudaStreamCreate(&cudaStream); // Similar with cudaStreamCreateWithFlags
Do I have to set the device again before deleting it, or can I just call?
// cudaSetDevice(device); // Is this needed ???
cudaStreamDestroy(cudaStream);
No, you don't have to. As for CUDA documentation:
"A kernel launch will fail if it is issued to a stream that is not associated to the current device
A memory copy will succeed even if it is issued to a stream that is not associated to the current device.
cudaEventRecord() will fail if the input event and input stream are associated to different devices.
cudaEventElapsedTime() will fail if the two input events are associated to different devices.
cudaEventSynchronize() and cudaEventQuery() will succeed even if the input event is associated to a device that is different from the current device.
cudaStreamWaitEvent() will succeed even if the input stream and input event are associated to different devices. cudaStreamWaitEvent() can therefore be used to synchronize multiple devices with each other."
So there is no mention of problems regarding cudaStreamDestroy(), and I tested it using two GPU devices, setting the first device, creating a stream on it, then setting the second device and creating another stream on it. After that, I ran a kernel on the second device, then I destroyed the second stream and first stream with no error, and without the need to set the first device again.
cudaStatus = cudaSetDevice(0);
cudaStream_t s0;
cudaStreamCreate(&s0);
cudaStatus = cudaSetDevice(1);
cudaStream_t s1;
cudaStreamCreate(&s1);
addKernel<<<1, size, 1, s1>>>(dev_c, dev_a, dev_b);
cudaDeviceSynchronize();
cudaStatus = cudaStreamDestroy(s1);
fprintf(stderr, "%s\n", cudaGetErrorString(cudaStatus));
cudaStatus = cudaStreamDestroy(s0);
fprintf(stderr, "%s\n", cudaGetErrorString(cudaStatus));
This is the result of cudaStatus.
Related
If I use dynamic parallelism, which stream will child grid be running in?
For example, I have one kernel called A, and another kernel called B. B is launched by A.
If kernel A is running in stream_A, and also if I does not specify the stream ID for kernel B, which stream will the kernel be running in? Is it the default stream, or it will inherit the stream A is running in?
Based on what I see in the documentation, as well as my own experiments, what I observe is that in general there is no (ordering) relationship between streams on the host and streams on the device.
We can consider two cases:
First case, for created streams, I would say this is covered explicitly in the documentation.
In the second case, although the documentation mentions the device NULL stream, that is perhaps a bit unclear. We can write a simple test case to sort this out for us:
$ cat cdp.cu
#include <cstdio>
#include <cassert>
#include <unistd.h>
const unsigned long long dt = 10000000000ULL;
__device__ void delay(){
unsigned long long start = clock64();
while (start+dt > clock64());
}
__global__ void child(){
delay();
}
__global__ void parent(){
child<<<1,1>>>();
cudaError_t err = cudaGetLastError();
assert(err == cudaSuccess);
err = cudaDeviceSynchronize();
assert(err == cudaSuccess);
}
__global__ void pk(){
printf("hello\n");
}
int main(){
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
parent<<<1,1, 0, s1>>>();
sleep(1);
pk<<<1,1,0,s2>>>();
cudaDeviceSynchronize();
}
$ nvcc -o cdp cdp.cu -rdc=true -lcudadevrt -lineinfo
$ time ./cdp
hello
real 0m7.276s
user 0m4.193s
sys 0m2.061s
$
(Ubuntu 18.04, CUDA 11.4, V100)
When I run the above code, I observe that after launching the cdp executable, the console remains idle for approximately 1 second. After that the console prints "hello", and then the console remains idle for 6-7 seconds, and then the application exits. (Please don't expect the same behavior in a WDDM GPU.)
If the NULL stream on the device were the "same as" or "inherited from" the NULL stream on the host, then we would expect that the "hello" printout would not appear until immediately before application exit. The pk kernel would not be allowed to run until the child kernel had completed, if host NULL stream semantics were involved. Therefore we must conclude that the device NULL stream is not the same as the host NULL stream.
We can use a little bit of logic to convince ourselves that this is also not exactly the same as inheriting the host (created) stream used for the parent kernel launch. We can read in the documentation that the use of the NULL stream for child kernel launches within the same threadblock will have the NULL stream behavior. However this could not be true if the device NULL stream were simply the same as/inherited from the host created stream. According to the documentation, if I have two child kernel launches from the same threadblock, one into the NULL stream, and another into a device created stream, then we would not expect these to be able to run concurrently - that is the behavior of the NULL stream. But if the NULL stream were simply the inherited host stream, a created stream, there is no reason to expect that they could not run concurrently.
So we are left with the conclusion that the device NULL stream is not the host NULL stream, and the device NULL stream is also not a host created stream. These statements seem consistent with the documentation to me.
If you would like a clarification in the documentation, the usual suggestion is to file a bug.
Rather than worry about peculiarities of NULL stream behavior, the advice I give when teaching CUDA is that if you are concerned about complex concurrency scenarios, do all your work with created streams. Leave the NULL stream behind. Anything you wish to do can be done purely with created streams. Do as you wish, of course.
I'm trying to do out-of-core between GPU memory and CPU memory. For example, I have blocks of data each is 1GB, and I need to process 1000 of such blocks in order, each is done by a kernel launch. Assume the processing must be done one by one, because the n'th kernel launch needs to use the result produced by the (n-1)'th kernel, which is stored in the (n-1)'th block, except the first kernel launch. So I'm thinking of using a circular buffer on GPU to store the most recent 5 blocks, and use events to synchronize between the data stream and the task stream. The data stream prepares the data and the task stream launches the kernels. The code is illustrated as the following.
const int N_CBUF = 5, N_TASK = 1000;
// Each pointer points to a data block of 1GB
float* d_cir_buf[N_CBUF];
float* h_data_blocks[N_TASK];
// The data stream for transfering data from host to device.
// The task stream for launching kernels to process the data.
cudaStream_t s_data, s_task;
// The data events for the completion of each data transfer.
// The task events for the completion of each kernel execution.
cudaEvent_t e_data[N_TASK], e_task[N_TASK];
// ... code for creating the streams and events.
for (int i = 0; i < N_TASK; i++) {
// Data transfer should not overwritten the data needed by the kernels.
if (i >= N_CBUF) {
cudaStreamWaitEvent(s_data, e_task[i-N_CBUF+1]);
}
cudaMemcpyAsync(d_cir_buf[i % N_CBUF], h_data_blocks[i], ..., cudaMemcpyHostToDevice, s_data);
cudaEventRecord(e_data[i], s_data);
cudaStreamWaitEvent(s_task, e_data[i]);
// Pass the current and the last data block to the kernel.
my_kernel<<<..., s_task>>>(d_cir_buf[i % N_CBUF],
i == 0 ? 0 : d_cir_buf[(i+N_CBUF-1)%N_CBUF]);
cudaEventRecord(e_task[i], s_task);
}
I'm wondering if this is even a valid idea, or is there anything completely wrong? Also, the CUDA programming guide mentioned that if there is memcpy from two different host memory address to the same device address, then there will be no concurrent execution, does this matter in my case? In particular, if the memory for d_cir_buf is allocated as a whole big block and then split into 5 pieces, would that count as "the same memory address in device", causing concurrency to fail? Also, in my case the (n+5)'th data transfer will go to the same address as the n'th data transfer, however, given the synchronization required, there won't be two such transfers to execute at the same time. So is this OK?
I have the feeling that your problem is best suited to double buffering:
two streams
upload data1 in stream1
run kernel on data1 in stream1
upload data2 in stream2
run kernel on data2 in stream2
... And so on
Kernel in stream2 can overlap with data transfers in strezm 1 and vice versa
Using CUDA 5 with VS 2012 and capability 3.5 (Titan and K20).
At particular stages of my kernel execution, I want to send a generated data chunk to the host memory and notify the host that the data is ready, so the host will operate on it.
I cannot wait until the end of the kernel execution to read the data back from the device, because:
The data is no longer relevant to the device once it is calculated, so there is no point keeping it to the end.
The data size is too large to fit on the device memory and wait until the end.
The host should not have to wait until the end of the kernel execution to start processing the data.
Could you point me to the path I have to take and the possible cuda concepts and functions I have to use to achieve my requirements? Put simply, how can I write to the host and notify the host that a chunk data is ready for host processing?
N.B. Each thread does not share any generated data with any other thread, they run independently. So, as far as I know (and please correct me if I am wrong), the concept of blocks, threads and warps do not affect the question. Or in other words, if they aid the answer, I am free to alter their combination.
Below is a sample code that shows that I am trying to do:
#pragma once
#include <conio.h>
#include <cstdio>
#include <cuda_runtime_api.h>
__global__ void Kernel(size_t length, float* hResult)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
// Processing multiple data chunks
for(int i = 0;i < length;i++)
{
// Once this is assigned, I don't need it on the device anymore.
hResult[i + (tid * length)] = i * 100;
}
}
void main()
{
size_t length = 10;
size_t threads = 2;
float* hResult;
// An array that will hold all data from all threads
cudaMallocHost((void**)&hResult, threads * length * sizeof(float));
Kernel<<<threads,1>>>(length, hResult);
// I DO NOT want to wait to the end and block to get the data
cudaError_t error = cudaDeviceSynchronize();
if (error != cudaSuccess) { throw error; }
for(int i = 0;i < threads * length;i++)
{
printf("%f\n", hResult[i]);;
}
cudaFreeHost(hResult);
system("pause");
}
Here is one possible approach. At a high level, on the device:
You'll need to write the data to either device global memory (allocated previously with cudaMalloc) or else directly to host memory (allocated previously with cudaHostAlloc). This memory should be accessed via a volatile pointer.
You may wish to do all the data writing to this region from a single threadblock, to be sure that all the data is written prior to the following steps
You'll then want to issue a threadfence() (if you're using device global memory) or threadfence_system() call (if using host memory) prior to the following steps
Next you'll write to a special location in device global memory or host memory, let's call it the mailbox location, with a specific value indicating the data is ready. This location should also be accessed with a volatile pointer.
Optionally issue another threadfence or threadfence_system call
for device memory usage on the receiving end, again both regions (payload and "mailbox") should be accessed using a volatile pointer.
On the host:
Before launching the kernel, the host will need to set the mailbox location to a default value.
After launching the kernel, the host thread will need to "poll" the mailbox location, looking for the specific value indicating data is ready
Once the specific value is seen, indicating that the data is ready, the host can consume the data
Optionally, if you want to repeat this process, the host can reset the mailbox location to the default value. The device can check for this default value before updating the data block with new data.
Both the mailbox location and the payload region should be accessed by the host thread using a volatile pointer.
Note that even with the above process, there is still an implied device-wide synchronization needed, if the data is being generated/created from multiple threadblocks. The only straightforward device-wide synchronization available is the kernel launch (or completion of the kernel, specifically). Copying the data from a single threadblock simply moves the requirement for device-wide sync out of this particular sequence (to somewhere before this sequence).
The reasons you give don't really suggest to me that the code could not be refactored to create the data on a kernel-launch by kernel-launch basis, which would neatly solve these issues and eliminate the need for the above process as well.
EDIT: responding to a question in the comments.
It's difficult to be more specific about how to refactor the code to deliver one data chunk per kernel call, without a specific example.
Let's take an image processing case, where I have a video sequence of 30 frames stored in global memory. The kernel will process each frame according to some algorithm, then make the processed data available to the host.
In your proposal, after the kernel is done processing a frame, it can signal to the host that the data is ready, and go on to process the next frame. The problem is, if the frame is processed by multiple threadblocks, there's no easy way to know when all threadblocks are done processing that frame. A device-wide synchronization barrier might be what is needed, but it doesn't exist conveniently, except via the kernel call mechanism. However, presumably inside such a kernel we might have a sequence like this:
while (more_frames)
process a frame
signal host
increment frame pointer
In a refactored approach, we would move the loop outside the kernel, to host code:
while (more_frames)
call kernel to process frame
consume frame
increment frame pointer
By doing this, the kernel marks the explicit synchronization needed to know when the frame processing is complete, and the data can be consumed.
SO I asked a question before about how to allocate an object on the device directly instead of the "normal":
Allocate on host
Copy to device
Copy dynamically allocated fields to device one by one
The main reason I want it to be allocated directly on the device is that I don't want to copy each dynamically allocated field inside one by one manually.
Anyway, so I think I have actually found a way to do this, and I would like to see some input from more experienced CUDA programmers (like Robert Crovella).
Let's see the code first:
class Particle
{
public:
int *data;
__device__ Particle()
{
data = new int[10];
for (int i=0; i<10; i++)
{
data[i] = i*2;
}
}
};
__global__ void test(Particle **result)
{
Particle *p = new Particle();
result[0] = p; // store memory location
}
__global__ void test2(Particle *p)
{
for (int i=0; i<10; i++)
printf("%d\n", p->data[i]);
}
int main() {
// initialise and allocate an object on device
Particle **d_p_addr;
cudaMalloc((void**)&d_p_addr, sizeof(Particle*));
test<<<1,1>>>(d_p_addr);
// copy pointer to host memory
Particle **p_addr = new Particle*[1];
cudaMemcpy(p_addr, d_p_addr, sizeof(Particle*), cudaMemcpyDeviceToHost);
// test:
test2<<<1,1>>>(p_addr[0]);
cudaDeviceSynchronize();
printf("Done!\n");
}
As you can see, what I do is:
Call a kernel that initialises an object on the device and stores its pointer an output parameter
Copy the pointer to the allocated object from device memory to host memory
Now you can pass that pointer to another kernel just fine !
This code actually works, but I'm not sure if there are drawbacks.
Cheers
EDIT: as pointed out by Robert, there was no point of creating a pointer on host first, so I removed that part from the code.
Yes, you can do that.
You are allocating an object on the device, and passing a pointer to it from one kernel to the next. Since a characteristic of device malloc/new is that allocations persist for the lifetime of the context (not just the kernel), the allocations do not disappear at the end of the kernel. This is basically standard C++ behavior, but I thought it might be worth repeating. The pointer(s) that you are passing from one kernel to the next are therefore valid in any subsequent device code in the context of your program.
There is a wrinkle you might want to be aware of, however. Pointers returned by dynamic allocations done on the device (such as via new or malloc in device code) are not usable for transferring data from device to host, at least in the present incarnation of cuda (cuda 5.0 and earlier). The reasons for this are somewhat arcane (translation: I can't adequately explain it) but it's instructive to think about the fact that dynamic allocations come out of the device heap, a region that is logically separate from the region of global memory that runtime API functions like cudaMalloc and cudaMemcpy use. An oblique indication of this is given here:
Memory reserved for the device heap is in addition to memory allocated through host-side CUDA API calls such as cudaMalloc().
If you want to prove this wrinkle to yourself, try adding the following seemingly innocuous code after your second kernel call:
Particle *q;
q = (Particle *)malloc(sizeof(Particle));
cudaMemcpy(q, p_addr[0], sizeof(Particle), cudaMemcpyDeviceToHost);
If you then check the API error value returned from that cudaMemcpy operation, you will observe the error.
As an unrelated comment, your use of the pointer *p is a little freaky, in my book, and the compiler warning given about it is an indication of the wierdness. It's not technically illegal, since you're not actually doing anything meaningful with that pointer (you immediately replace it in your kernel 1) but nevertheless it's wierd because you're passing a pointer to a kernel that you haven't properly cudaMalloc'ed. In the context of what you're demonstrating, it's completely unnecessary, and your first parameter to kernel 1 could be eliminated and replaced with a local variable, eliminating the wierdness and compiler warning.
My intention is to use n host threads to create n streams concurrently on a NVidia Tesla C2050. The kernel is a simple vector multiplication...I am dividing the data equally amongst n streams, and each stream would have concurrent execution/data transfer going on.
The data is floating point, I am sometimes getting CPU/GPU sums as equal, and sometimes they are wide apart...I guess this could be attributed to loss of synchronization constructs on my code, for my case, but also I don't think any synch constructs between streams is necessary, because I want every CPU to have a unique stream to control, and I do not care about asynchronous data copy and kernel execution within a thread.
Following is the code each thread runs:
//every thread would run this method in conjunction
static CUT_THREADPROC solverThread(TGPUplan *plan)
{
//Allocate memory
cutilSafeCall( cudaMalloc((void**)&plan->d_Data, plan->dataN * sizeof(float)) );
//Copy input data from CPU
cutilSafeCall( cudaMemcpyAsync((void *)plan->d_Data, (void *)plan->h_Data, plan->dataN * sizeof(float), cudaMemcpyHostToDevice, plan->stream) );
//to make cudaMemcpyAsync blocking
cudaStreamSynchronize( plan->stream );
//launch
launch_simpleKernel( plan->d_Data, BLOCK_N, THREAD_N, plan->stream);
cutilCheckMsg("simpleKernel() execution failed.\n");
cudaStreamSynchronize(plan->stream);
//Read back GPU results
cutilSafeCall( cudaMemcpyAsync(plan->h_Data, plan->d_Data, plan->dataN * sizeof(float), cudaMemcpyDeviceToHost, plan->stream) );
//to make the cudaMemcpyAsync blocking...
cudaStreamSynchronize(plan->stream);
cutilSafeCall( cudaFree(plan->d_Data) );
CUT_THREADEND;
}
And creation of multiple threads and calling the above function:
for(i = 0; i < nkernels; i++)
threadID[i] = cutStartThread((CUT_THREADROUTINE)solverThread, &plan[i]);
printf("main(): waiting for GPU results...\n");
cutWaitForThreads(threadID, nkernels);
I took this strategy from one of the CUDA Code SDK samples. As I've said before, this code work sometimes, and other time it gives wayward results. I need help with fixing this code...
first off I am not an expert by any stretch of the imagination, just from my experience.
I don't see why this needs multiple host threads. It seems like you're managing one device and passing it multiple streams. The way I've seen this done (pseudocode)
{
create a handle
allocate an array of streams equal to the number of streams you want
for(int n=0;n<NUM_STREAMS;n++)
{
cudaStreamCreate(&streamArray[n]);
}
}
From there you can just pass the streams in your array to the various asynchronous calls (cudaMemcpyAsync(), kernel streams, etc.) and the device manages the rest. I've had weird scalability issues with multiple streams (don't try to make 10k streams, I run into problems around 4-8 on a GTX460), so don't be surprised if you run into those. Best of luck,
John
My bet is that
BLOCK_N, THREAD_N
, don't cover the exact size of the array you are passing.
Please provide the code for initializing the streams and the size of those buffers.
As a side note, Streams are useful for overlapping computation with memory transfer. Synching the stream after each async call is not useful at all.