producer/comsumer model and concurrent kernels - cuda

I'm writing a cuda program that can be interpreted as producer/consumer model.
There are two kernels,
one produces a data on the device memory,
and the other kernel the produced data.
The number of comsuming threads are set two a multiple of 32 which is the warp size.
and each warp waits utill 32 data have been produced.
I've got some problem here.
If the consumer kernel is loaded later than the producer,
the program doesn't halt.
The program runs indefinately sometimes even though consumer is loaded first.
What I'm asking is that is there a nice implementation model of producer/consumer in CUDA?
Can anybody give me a direction or reference?
here is the skeleton of my code.
**kernel1**:
while LOOP_COUNT
compute something
if SOME CONDITION
atomically increment PRODUCE_COUNT
write data into DATA
atomically increment PRODUCER_DONE
**kernel2**:
while FOREVER
CURRENT=0
if FINISHED CONDITION
return
if PRODUCER_DONE==TOTAL_PRODUCER && CONSUME_COUNT==PRODUCE_COUNT
return
if (MY_WARP+1)*32+(CONSUME_WARPS*32*CURRENT)-1 < PRODUCE_COUNT
process the data
if SOME CONDITION
set FINISHED CONDITION true
increment CURRENT
else if PRODUCUER_DONE==TOTAL_PRODUCER
if currnet*32*CONSUME_WARPS+THREAD_INDEX < PRODUCE_COUNT
process the data
if SOME CONDITION
set FINISHED CONDITION true
increment CURRENT

Since you did not provide an actual code, it is hard to check where is the bug. Usually the sceleton is correct, but the problem lies in details.
One of possible issues that I can think of:
By default, in CUDA there is no guarantee that global memory writes by one kernel will be visible by another kernel, with an exception of atomic operations. It can happen then that your first kernel increments PRODUCER_DONE, but there is still no data in DATA.
Fortunately, you are given the intristic function __threadfence() which halts the execution of the current thread, until the data is visible. You should put it before atomically incrementing PRODUCER_DONE. Check out chapter B.5 in the CUDA Programming Guide.
Another issue that may or may not appear:
From the point of view of kernel2, the compiler may deduct that PRODUCE_COUNT, once read, it never changes. The compiler may optimise the code so that, once loaded in register it reuses its value, instead of querying the global memory every time. Solution? Use volatile, or read the value using another atomic operation.
(Edit)
Third issue:
I forgot about one more problem. On pre-Fermi cards (GeForce before 400-series) you can run only a single kernel at a time. So, if you schedule the producer to run after the consumer, the system will wait for consumer-kernel to end before producer-kernel starts its execution. If you want both to run at the same time, put both into a single kernel and have an if-branch based on some block index.

Related

Trying to understand in CUDA why is zero copy faster if it also travel through PCIe?

It is said that zero copy should be used in situations where “read and/or write exactly once” constraint is met. That's fine.
I have understood this, but my question is why is zero copy fast in first place ? After all whether we use explicit transfer via cudamemcpy or zero copy , in both case data has to travel through pci express bus. Or there exist any other path ( i.e copy happen's directly in GPU register by passing device RAM ?
Considered purely from a data-transfer-rate perspective, I know of no reason why the data transfer rate for moving data between host and device via PCIE should be any different when comparing moving that data using a zero-copy method vs. moving it using cudaMemcpy.
However, both operations have overheads associated with them. The primary overhead I can think of for zero-copy comes with pinning of the host memory. This has a noticeable time overhead (e.g. when compared to allocating the same amount of data using e.g. malloc or new). The primary overhead that comes to mind with cudaMemcpy is a per-transfer overhead of at least a few microseconds that is associated with the setup costs of using the underlying DMA engine that does the transfer.
Another difference is in accessibility to the data. pinned/zero-copy data is simultaneously accessible between host and device, and this can be useful for some kinds of communication patterns that would otherwise be more complicated with cudaMemcpyAsync for example.
Here are two fairly simple design patterns where it may make sense to use zero-copy rather than cudaMemcpy.
When you have a large amount of data and you're not sure what will be needed. Suppose we have a large table of data, say 1GB, and the GPU kernel will need access to it. Suppose, also that the kernel design is such that only one or a few locations in the table are needed for each kernel call, and we don't know a-priori which locations those will be. We could use cudaMemcpy to transfer the entire 1GB to the GPU. This would certainly work, but it would take a possibly non-trivial amount of time (e.g. ~0.1s). Suppose also that we don't know what location was updated, and after the kernel call we need access to the modified data on the host. Another transfer would be needed. Using pinned/zero-copy methods here would mostly eliminate the costs associated with moving the data, and since our kernel is only accessing a few locations, the cost for the kernel to do so using zero-copy is far less than 0.1s.
When you need to check status of a search or convergence algorithm. Suppose that we have an algorithm that consists of a loop that is calling a kernel in each loop iteration. The kernel is doing some kind of search or convergence type algorithm, and so we need a "stopping condition" test. This might be as simple as a boolean value, that we communicate back to the host from the kernel activity, to indicate whether we have reached the stopping point or not. If the stopping point is reached, the loop terminates. Otherwise the loop continues with the next kernel launch. There may even be "two-way" communication here. For example, the host code might be setting the boolean value to false. The kernel might set it to true if iteration needs to continue, but the kernel does not ever set the flag to false. Therefore if continuation is needed, the host code sets the flag to false and calls the kernel again. We could realize this with cudaMemcpy:
bool *d_continue;
cudaMalloc(&d_continue, sizeof(bool));
bool h_continue = true;
while (h_continue){
h_continue = false;
cudaMemcpy(d_continue, &h_continue, sizeof(bool), cudaMemcpyHostToDevice);
my_search_kernel<<<...>>>(..., d_continue);
cudaMemcpy(&h_continue, d_continue, sizeof(bool), cudaMemcpyDeviceToHost);
}
The above pattern should be workable, but even though we are only transferring a small amount of data (1 byte), the cudaMemcpy operations will each take ~5 microseconds. If this were a performance concern, we could almost certainly reduce the time cost with:
bool *z_continue;
cudaHostAlloc(&z_continue, sizeof(bool), ...);
*z_continue = true;
while (*z_continue){
*z_continue = false;
my_search_kernel<<<...>>>(..., z_continue);
cudaDeviceSynchronize();
}
For example, assume that you wrote a cuda-accelerated editor algorithm to fix spelling errors for books. If a 2MB text data has only 5 bytes of error, it would need to edit only 5 bytes of it. So it doesn't need to copy whole array from GPU VRAM to system RAM. Here, zero-copy version would access only the page that owns the 5 byte word. Without zero-copy, it would need to copy whole 2MB text. Copying 2MB would take more time than copying 5 bytes (or just the page that owns those bytes) so it would reduce books/second throughput.
Another example, there could be a sparse path-tracing algorithm to add shiny surfaces for few small objects of a game scene. Result may need just to update 10-100 pixels instead of 1920x1080 pixels. Zero copy would work better again.
Maybe sparse-matrix-multiplication would work better with zero-copy. If 8192x8192 matrices are multiplied but only 3-5 elements are non-zero, then zero-copy could still make difference when writing results.

How to avoid executing both branches of conditional in CUDA program if it is known that the condition is the same for all threads in a warp?

It is my understanding that if I have CUDA code of the form:
if (condition) {
// do x
}
else {
//do y
}
Then due to the SIMT execution of threads in a warp, the execution of the conditional will be serialized and all threads will be required to run both the x and y sections of the code. The exception to this is if the branches are big, in which case the compiler will insert a check using __any to avoid unnecessarily running code.
However, if I already know ahead of time that all threads in a warp will have the same value of condition, then this __any operation is unnecessary, merely serving to slow down my code.
I am wondering if there exists any way to instruct the compiler not to include this voting operation, but instead to assume that the evaluation of the condition is the same for all threads in the warp, and to run only the corresponding block of code?
Then due to the SIMT execution of threads in a warp, the execution of the conditional will be serialized and all threads will be required to run both the x and y sections of the code
That only happens if the conditional doesn't evaluate uniformly within a warp
The exception to this is if the branches are big, in which case the compiler will insert a check using __any to avoid unnecessarily running code.
That is completely incorrect. That compiler doesn't do that, and it is trivial to disassemble literally any code emitted by any version of CUDA compiler NVIDIA has ever released to confirm that. There is predicated execution, but that is significantly different to what you describe.
However, if I already know ahead of time that all threads in a warp will have the same value of condition, then this __any operation is unnecessary, merely serving to slow down my code.
It isn't only unnecessary, it is nonexistent.
I am wondering if there exists any way to instruct the compiler not to include this voting operation, but instead to assume that the evaluation of the condition is the same for all threads in the warp, and to run only the corresponding block of code?
No because what you want is the default behaviour.

Is there a way to avoid redundant computations by warps in a thread block on partially overlapped arrays

I want to design a CUDA kernel that has thread blocks where warps read their own 1-D arrays. Suppose that a thread block with two warps takes two arrays {1,2,3,4} and {2,4,6,8}. Then each of warps would perform some computations by reading their own arrays. The computation is done based on per-array element basis. This means that the thread block would have redundant computations for the elements 2 and 4 in the arrays.
Here is my question: How I can avoid such redundant computations?
Precisely, I want to make a warp skip the computation of an element once the element has been already touched by other warps, otherwise the computation goes normally because any warps never touched the element before.
Using a hash table on the shared memory dedicated into a thread block may be considered. But I worry about performance degradation due to hash table accesses whenever a warp access elements of an array.
Any idea or comments?
In parallel computation on many-core co-processors, it is desired to perform arithmetic operations on an independent set of data, i.e. to eliminate any sort of dependency on the set of vectors which you provide to threads/warps. In this way, the computations can run in parallel. If you want to keep track of elements that you have been previously computed (in this case, 2 and 4 which are common in the two input arrays), you have to serialize and create branches which in turn diminishes the computing performance.
In conclusion, you need to check if it is possible to eliminate redundancy at the input level by reducing the input vectors to those with different components. If not, skipping redundant computations of repeated components may not necessarily improve the performance, since the computations are performed in batch.
Let us try to understand what happens on the hardware level.
First of all, computation in CUDA happens via warps. A warp is a group of 32 threads which are synchronized to each other. An instruction is executed on all the threads of a corresponding warp at the same time instance. So technically it's not threads but the warps that finally executes on the hardware.
Now, let us suppose that somehow you are able to keep track of which elements does not need computation so you put a condition in the kernel like
...
if(computationNeeded){
compute();
}
else{
...
}
...
Now let us assume that there are 5 threads in a particular warp for which "computationNeeded" is false and the don't need computation. But according to our definition, all threads executes the same instruction. So even if you put these conditions all threads have to execute both if and else blocks.
Exception:
It will be faster if all the threads of a particular block execute either if or else condition. But its exceptional case for almost all the real-world algorithms.
Suggestion
Put a pre-processing step either on CPU or GPU, which eliminate the redundant elements from input data. Also, check if this step is worth.
Precisely, I want to make a warp skip the computation of an element once the element has been already touched by other warps, otherwise the computation goes normally because any warps never touched the element before.
Using a hash table on the shared memory dedicated into a thread block
may be considered
If you want inter-warp communication between ALL the warps of the grid then it is only possible via global memory not shared memory.

Is there a way to query the ordering of two cuda events?

Suppose we recorded two cuda events A and B by calling to cudaEventRecord, then before we do any synchronization, is there any way to judge whether A will necessarily happen before or after B? For example, if I have these code:
kernelA<<<1,1>>>(...);
cudaEventRecord(A, 0);
kernelB<<<1,1>>>(...);
cudaEventRecord(B, 0);
Then B should happen after A for sure, but how would I know this given the two handles? In another way, how would I write a function like this:
bool judge_order(cudaEvent_t A, cudaEvent_t B) {...}
Such that it returns true if A would happen before B.
The question arises when I want to make a memory manager in order to effectively reuse the memory that are already used in the previous kernel launches.
Everything in CUDA is scheduled on streams. This includes kernel execution, memory transfer, and events. By default everything operates on stream 0.
Each stream is processed stricly linear. I.e. in your example it is guaranteed that kernelA has been completed before eventA is processed. By querying the status of an event you can tell if it has been processd without waiting for it.
Separate streams however can be processed in any order. If you would use a separate stream for each of your kernels/events then no particular processing order is guaranteed.
All of this is much better explained in the CUDA programming guide.

What is a "tight loop"?

I've heard that phrase a lot. What does it mean?
An example would help.
From Wiktionary:
(computing) In assembly languages, a loop which contains few instructions and iterates many times.
(computing) Such a loop which heavily uses I/O or processing resources, failing to adequately share them with other programs running in the operating system.
For case 1 it is probably like
for (unsigned int i = 0; i < 0xffffffff; ++ i) {}
I think the phrase is generally used to designate a loop which iterates many times, and which can have a serious effect on the program's performance - that is, it can use a lot of CPU cycles. Usually you would hear this phrase in a discussion of optimization.
For examples, I think of gaming, where a loop might need to process every pixel on the screen, or scientific app, where a loop is processing entries in giant arrays of data points.
A tight loop is one which is CPU cache-friendly. It is a loop which fits in the instruction cache, which does no branching, and which effectively hides memory fetch latency for data being processed.
There's a good example of a tight loop (~ infinite loop) in the video Jon Skeet and Tony the Pony.
The example is:
while(text.IndexOf(" ") != -1) text = text.Replace(" ", " ");
which produces a tight loop because IndexOf ignores a Unicode zero-width character (thus finds two adjacent spaces) but Replace does not ignore them (thus not replacing any adjacent spaces).
There are already good definitions in the other answers, so I don't mention them again.
SandeepJ's answer is the correct one in the context of Network appliances (for an example, see Wikipedia entry on middlebox) that deal with packets. I would like to add that the thread/task running the tight loop tries to remain scheduled on a single CPU and not get context switched out.
According to Webster's dictionary, "A loop of code that executes without releasing any resources to other programs or the operating system."
http://www.websters-online-dictionary.org/ti/tight+loop.html
From experience, I've noticed that if ever you're trying to do a loop that runs indefinitely, for instance something like:
while(true)
{
//do some processing
}
Such a loop will most likely always be resource intensive. If you check the CPU and Memory usage by the process with this loop, you will find that it will have shot up. Such is the idea some people call a "tight loop".
Many programming environments nowadays don't expose the programmer to the conditions that need a tight-loop e.g. Java web-services run in a container that calls your code and you need to minimise/eliminate loops within a servlet implementation. Systems like Node.js handle the tight-loop and again you should minimise/eliminate loops in your own code. They're used in cases where you have complete control of program execution e.g. OS or real-time/embedded environments. In the case of an OS you can think of the CPU being in an idle state as the amount of time it spends in the tight-loop because the tight-loop is where it performs checks to see if other processes need to run or if there are queues that need serviced so if there are no processes that need to run and queues are empty then the CPU is just whizzing round the tight-loop and this produces a notional indication of how 'not busy' the CPU is. A tight-loop should be designed as just performing checks so it just becomes like a big list of if .. then statements so in Assembly it will boil down to a COMPARE operand and then a branch so it is very efficient. When all the checks result in not branching, the tight-loop can execute millions of times per second. Operating/embedded Systems usually have some detection for CPU hogging to handle cases where some process has not relinquished control of the CPU - checks for this type of occurrence could be performed in the tight-loop. Ultimately you need to understand that a program needs to have a loop at some point otherwise you couldn't do anything useful with a CPU so if you never see the need for a loop it's because your environment handles all that for you. A CPU keeps executing instructions until there's nothing left to execute or you get a crash so a program like an OS must have a tight-loop to actually function otherwise it would run for just microseconds.