For CUDA, is there a guarantee that the ternary operator can avoid branch divergence?

I have read a lot of threads about CUDA branch divergence, telling me that using the ternary operator is better than if/else statements because the ternary operator doesn't result in branch divergence.
I wonder, for the following code:
foo = (a > b) ? (bar(a)) : (b);
Where bar is another function or some more complicated statements, is it still true that there is no branch divergence?

I don't know what sources you consulted, but with the CUDA toolchain there is no noticeable performance difference between the use of the ternary operator and the equivalent if-then-else sequence in most cases. Where such differences are observed, they are due to second-order effects in code generation, and in my experience the code based on the if-then-else sequence may well be faster. In essence, ternary operators and tightly localized branching are treated in much the same way. There is no guarantee that a ternary operator will not be translated into machine code containing a branch.
The GPU hardware offers multiple mechanisms that help avoid branches, and the CUDA compiler makes good use of these mechanisms to minimize branches. One is predication, which can be applied to pretty much any instruction. The other is support for select-type instructions, which are essentially the hardware equivalent of the ternary operator. The compiler uses if-conversion to translate short branches into branch-less code sequences. Often it chooses a combination of predicated code and a uniform branch. In cases of non-divergent control flow (all threads in a warp take the same branch), the uniform branch skips over the predicated code section.
Except in cases of extreme performance optimization, CUDA can (and should) be written in natural idioms that are clear and appropriate to the task at hand, using either if-then-else sequences or ternary operators as you see fit. The compiler will take care of the rest.
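As a concrete illustration (my own sketch, not from the original discussion; the kernel name selectDemo and its parameters are made up), here are both spellings of the same selection. With current toolchains one would typically expect both to compile to a SEL or predicated sequence rather than a divergent branch, although, as said above, neither form guarantees branch-free machine code.
__global__ void selectDemo(const int *a, const int *b, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // if-then-else form
    int x;
    if (a[i] > b[i])
        x = a[i];
    else
        x = b[i];

    // ternary form of the same selection
    int y = (a[i] > b[i]) ? a[i] : b[i];

    // keep both results live so neither is optimized away
    out[i] = x + y;
}
Comparing the SASS of the two variants (e.g. with cuobjdump -sass) is the only reliable way to see what the compiler actually did for a given toolkit and target architecture.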

(I would like to add this as a comment on @njuffa's answer, but my reputation is not enough.)
I found a performance difference between the two styles in my program.
The if-clause style takes 4.78 ms:
// fin is {0-4}, range_limit = 5
if (fin >= range_limit) {
    res_set = res_set ^ 1;
    fx_ref = fx + (fxw * (float)DEPTH_M_H / (float)DEPTH_BLOCK_X);
    fin = 0;
}
// then branch for next loop iteration.
// nvvp reports this assembly:
#!P1 LOP32I.XOR R48, R48, 0x1;
#!P1 FMUL.FTZ R39, R7, 14;
#!P1 MOV R0, RZ;
MOV R40, R48;
{ #!P1 FFMA.FTZ R6, R39, c[0x2][0x0], R5;
{ #!P0 BRA `(.L_35); } // the predicate is also used for the loop's branch
And the ternary style takes 4.46 ms:
res_set = (fin < range_limit) ? res_set : (res_set ^ 1);
fx_ref  = (fin < range_limit) ? fx_ref  : fx + (fxw * (float)DEPTH_M_H / (float)DEPTH_BLOCK_X);
fin     = (fin < range_limit) ? fin : 0;
// comments mark the source line that nvvp attributes each instruction to
ISETP.GE.AND P2, PT, R34.reuse, c[0x0][0x160], PT; //res_set
FADD.FTZ R27, -R25, R4;
ISETP.LT.AND P0, PT, R34, c[0x0][0x160], PT; //fx_ref
IADD32I R10, R10, 0x1;
SHL R0, R9, 0x2;
SEL R4, R4, R27, P1;
ISETP.LT.AND P1, PT, R10, 0x5, PT;
IADD R33, R0, R26;
{ SEL R0, RZ, 0x1, !P2;
STS [R33], R58; }
{ FADD.FTZ R3, R3, 74.75;
STS [R33+0x8], R29; }
{ #!P0 FMUL.FTZ R28, R4, 14; //fx_ref
STS [R33+0x10], R30; }
{ IADD32I R24, R24, 0x1;
STS [R33+0x18], R31; }
{ LOP.XOR R9, R0, R9; //res_set
STS [R33+0x20], R32; }
{ SEL R0, R34, RZ, P0; //fin
STS [R33+0x28], R36; }
{ #!P0 FFMA.FTZ R2, R28, c[0x2][0x0], R3; //fx_ref
The interleaved lines are from the next loop iteration's calculation.
I think that when many instructions share the same predicate value, the ternary style may provide more opportunity for ILP optimization.

Related

arm cortex-m33 (trustzone, silabs efm32pg22) - assembler hardfaults accessing GPIO or almost any peripherals areas, any hint?

I am lost with this code, trying to configure the Silicon Labs EFM32PG22 bare-metal on their devkit, accessed through the onboard J-Link from Segger Embedded Studio (a great, fast IDE). I have a blink hello-world example in C working from their Simplicity Studio, but I was trying to achieve the same thing I did easily in pure assembler on the Microchip PIC32CM MC00 or SAMD21G17D, with only clocks and startup configured through the GUI in MPLAB X. Here, in the Segger IDE, there is no easy way to configure startup/clocks, or I haven't found it yet. At the hardware level, the registers of these Cortex parts differ by manufacturer; in C/C++ there is some (not cheap) unification via CMSIS, but I only want to know the minimum needed to get raw GPIO working after clock/startup.
The Segger project is a generic Cortex-M project for the specific EFM32PG22, i.e. a Cortex-M33 with TrustZone security. I probably don't know what is locked or switched off, or which state the MCU is in (privileged or non-privileged); there are two register address mappings, but nothing works. As soon as I try to store to, or even load from, the GPIO configuration registers (or the SMU registers, just to query something), it throws a hard fault exception. All of this is with the Segger IDE debugger over the onboard J-Link. Kindly, what am I doing wrong, what is missing here?
In C, I have only this code:
extern void blink(void);

int main(void)
{
    blink();
}
In blink.s I have this:
;#https://github.com/hubmartin/ARM-cortex-M-bare-metal-assembler-examples/blob/master/02%20-%20Bare%20metal%20blinking%20LED/main.S
;#https://sites.google.com/site/hubmartin/arm/arm-cortex-bare-metal-assembly/02---arm-cortex-bare-metal-assembly-blinking-led
;#https://mecrisp-stellaris-folkdoc.sourceforge.io/projects/blink-f0disco-gdbtui/doc/readme.html
;#https://microcontrollerslab.com/use-gpio-pins-tm4c123g-tiva-launchpad/
;#!!! ENABLE GPIO CLOCK SOURCE ON EFM32 !!!
;#https://community.silabs.com/s/share/a5U1M000000knsWUAQ/hello-world-part-2-create-firmware-to-blink-the-led?language=en_US
;#EFM32 GPIO
;#https://www.silabs.com/documents/public/application-notes/an0012-efm32-gpio.pdf
;# ARM thumb2 ISA
;#https://www.engr.scu.edu/~dlewis/book3/docs/ARM_and_Thumb-2_Instruction_Set.pdf
;#https://sciencezero.4hv.org/index.php?title=ARM:_Cortex-M3_Thumb-2_instruction_set
;#!!! https://stackoverflow.com/questions/48561243/gnu-arm-assembler-changes-orr-into-movw
;#segger assembler
;#https://studio.segger.com/segger/UM20006_Assembler.pdf
;#https://www.segger.com/doc/UM20006_Assembler.html
;#!!! unfortunatelly, we dont know here yet how to include ASM SFR defines, nor for MPLAB ARM (Harmony) !!!
;##include <xc.h>
;##include "definitions.h"
.cpu cortex-m33
.thumb
.text
.section .text.startup.main,"ax",%progbits
.balign 2
.p2align 2,,3
.global blink
//.arch armv8-m.base
.arch armv6-m
.syntax unified
.code 16
.thumb_func
.fpu softvfp
.type blink, %function
//!!! here we have manually entered GPIO PORT defines for PIC32CM
.equ SYSCFG_BASE_ADDRESS, 0x50078000
.equ SMU_BASE_ADDRESS, 0x54008000
//.equ SMU_BASE_ADDRESS, 0x5400C000
.equ CMU_BASE_ADDRESS, 0x50008000
.equ GPIO_BASE_ADDRESS, 0x5003C000 // this differs totally from both "special" infineon and microchip "standard?" cortex devices !!!
.equ DELAY, 40000
// Vector table
.word 0x20001000 // Vector #0 - Stack pointer init value (0x20000000 is RAM address and 0x1000 is 4kB size, stack grows "downwards")
.word blink // Vector #1 - Reset vector - where the code begins
// Vector #3..#n - I don't use Systick and another interrupts right now
// so it is not necessary to define them and code can start here
blink:
LDR r0, =(SYSCFG_BASE_ADDRESS + 0x200) // SYSCFG SYSCFG_CTRL
LDR r1, =0                             // 0 = disable address fault exceptions
ldr r1, [r0]                           // load the value at [R0] into R1
LDR r0, =(CMU_BASE_ADDRESS)            // CMU CMU_SYSCLKCTRL PCLKPRESC + CLKSEL
LDR r1, =0b10000000001                 // FSRCO 20MHz + PCLK = HCLK/2 = 10MHz
STR r1, [r0, 0x70]                     // store R1 to [R0 + 0x70] (CMU_SYSCLKCTRL)
LDR r0, =(CMU_BASE_ADDRESS)            // CMU CMU_CLKEN0
LDR r1, [r0, 0x64]
LDR r2, =(1 << 25)                     // GPIO CLK EN
orrs r1, r2                            // !!! HORROR !!! -- orr is not possible in thumb2 ?? only orrs !! (width suffix)
STR r1, [r0, 0x64]                     // store R1 back to [R0 + 0x64] (CMU_CLKEN0)
LDR r1, [r0, 0x68]
LDR r2, =(1 << 14)                     // SMU CLK EN
orrs r1, r2                            // !!! HORROR !!! -- orr is not possible in thumb2 ?? only orrs !! (width suffix)
STR r1, [r0, 0x68]                     // store R1 back to [R0 + 0x68]
//LDR r0, =(SMU_BASE_ADDRESS) // SMU SMU_LOCK
//LDR r1, =11325013 // SMU UNLOCK CODE
//STR r1, [r0, 0x08] //Store R0 value to r1
ldr r0, =(SMU_BASE_ADDRESS) // SMU reading values, detection - AGAIN, HARD FAULTS !!!!!!!
ldr r1, [r0, 0x04]
ldr r1, [r0, 0x20]
ldr r1, [r0, 0x40]
//LDR r0, =(GPIO_BASE_ADDRESS + 0x300) // GPIO UNLOCK
//LDR r1, =0xA534
//STR r1, [r0] // Store R0 value to r1
//!! THIS BELOW IS OLD FOR SAMD , WE STILL SIMPLY CANT ENABLE GPIO !!!!
// Enable PORTA pin 4 as output
LDR r0, =(GPIO_BASE_ADDRESS)           // DIR PORTA
LDR r1, =0b00000000000001000000000000000000
STR r1, [r0, 0x04]                     // store R1 to [R0 + 0x04] (port A DIR)
LDR R2, =1
loop:
// Write high to pin PA04
LDR r0, =GPIO_BASE_ADDRESS // OUT PORTA
LDR r1, =0b10000 // PORT_PA04
STR r1, [r0, 0x10] // Store R1 value to address pointed by R0
// Dummy counter to slow down my loop
LDR R0, =0
LDR R1, =DELAY
loop0:
ADD R0, R2
cmp R0, R1
bne loop0
// Write low to PA04
LDR r0, =GPIO_BASE_ADDRESS // OUT PORTA
LDR r1, =0b00000
STR r1, [r0, 0x10] // Store R1 value to address pointed by R0
// Dummy counter to slow down my loop
LDR R0, =0
LDR R1, =DELAY
loop1:
ADD R0, R2
cmp R0, R1
bne loop1
b loop
UPDATE: Well, now I tried it again in Simplicity Studio, placing the blink() call after the pregenerated system init:
extern void blink(void);

int main(void)
{
    // Initialize Silicon Labs device, system, service(s) and protocol stack(s).
    // Note that if the kernel is present, processing task(s) will be created by
    // this call.
    sl_system_init();

    blink();
}
having this code in blink.s, and this way it works and blinks:
.cpu cortex-m33
.thumb
.text
.section .text.startup.main,"ax",%progbits
.balign 2
.p2align 2,,3
.global blink
//.arch armv8-m.base
.arch armv6-m
.syntax unified
.code 16
.thumb_func
.fpu softvfp
.type blink, %function
/*
//!!! here we have manually entered GPIO PORT defines for PIC32CM
.equ SYSCFG_BASE_ADDRESS, 0x50078000
.equ SMU_BASE_ADDRESS, 0x54008000
//.equ SMU_BASE_ADDRESS, 0x5400C000
.equ CMU_BASE_ADDRESS, 0x50008000
*/
.equ GPIO_BASE_ADDRESS, 0x5003C000 // this differs totally from both "special" infineon and microchip "standard?" cortex devices !!!
.equ DELAY, 400000
// Vector table
.word 0x20001000 // Vector #0 - Stack pointer init value (0x20000000 is RAM address and 0x1000 is 4kB size, stack grows "downwards")
.word blink // Vector #1 - Reset vector - where the code begins
// Vector #3..#n - I don't use Systick and another interrupts right now
// so it is not necessary to define them and code can start here
blink:
// Enable PORTA pin 4 as output
LDR r0, =(GPIO_BASE_ADDRESS) // DIR PORTA
LDR r1, =0b00000000000001000000000000000000
STR r1, [r0, 0x04]
loop:
// Write high to pin PA04
LDR r0, =GPIO_BASE_ADDRESS // OUT PORTA
LDR r1, =0b10000 // PORT_PA04
STR r1, [r0, 0x10]
// Dummy counter to slow down my loop
LDR R0, =0
LDR R1, =DELAY
loop0:
ADD R0, R2
cmp R0, R1
bne loop0
// Write low to PA04
LDR r0, =GPIO_BASE_ADDRESS // OUT PORTA
LDR r1, =0b00000
STR r1, [r0, 0x10]
// Dummy counter to slow down my loop
LDR R0, =0
LDR R1, =DELAY
loop1:
ADD R0, R2
cmp R0, R1
bne loop1
b loop
... so NOW I am just curious: what is missing in the pure assembly code to bring that Cortex-M33 into some "easy" state, just ignoring TrustZone, so it can be used similarly to, say, a plain Cortex-M3?
Can anybody help? I have been digging deep into the datasheet/reference manual, but no luck so far ...
https://www.silabs.com/documents/public/reference-manuals/efm32pg22-rm.pdf
UPDATE AGAIN: I will try to figure it out ... by traversing the system init C code it is clear what is going on; there are also some chip errata workarounds, but I never touched the DCDC while initializing, and this may be the culprit ...
void sl_platform_init(void)
{
  CHIP_Init();
  sl_device_init_nvic();
  sl_board_preinit();
  sl_device_init_dcdc();
  sl_device_init_hfxo();
  sl_device_init_lfxo();
  sl_device_init_clocks();
  sl_device_init_emu();
  sl_board_init();
}
Well, okay, manufacturer-specific code generation for MCU startup IS a really important and useful thing. MCUs from different manufacturers differ so much at the register level (even though all are "Cortex-M" core based) that it is not worth trying to configure them manually in assembly if there is enough flash available, and there mostly IS. So far I have had no luck getting the Segger/Keil/IAR "generic" ARM/Cortex IDEs to do this properly on specific parts, so using the manufacturer-specific IDE to (mostly) graphically configure startup clocks and peripherals IS CRUCIAL, or at least it is by far the easiest way (I know, quite an expensive observation after all the assembly attempts). After that, it is easy to make even a pure assembly "blink" hello-world test called as an extern C function.
You may ask why I am still considering assembly when there are at least the CMSIS "platform abstraction layer" C headers on ARM. They do not really help with abstraction, since the devices are still very different; you only get register symbol #defines, typedefs and enums to work with easily from C. But I am trying to compare C-compiled code with handwritten assembly for a specific purpose that needs a heavily optimized algorithm designed from scratch, and it is often easier to think it through directly in assembly than to rely on the very complexly described C compiler optimizations. Each compiler has its own LONG document describing how its optimizations work, and at this level C is simply too abstract and a moving target, all the more so when you target different MCU architectures (think ARM Cortex-M, PIC32/MIPS, and/or even PIC16/18 + PIC24, AVR, MSP430 ...). A general algorithm can be described in shared pseudo-assembly to stay as close to the hardware as possible, without knowing all the optimization quirks of each architecture's C compiler(s); there are often several different C compilers per architecture, too.
So yes, you can compare C-compiler-generated code with handwritten assembly, and I have already tried such an assembly blink on MANY very different architectures. In each case I used the manufacturer-specific IDE to generate the startup in C, using all the GUI configuration and code generation down to an always-compilable empty C project, and of course the code size of such generated startups varies widely. The most advanced MCUs are really very complex, mostly in clock configuration, pin function configuration and then the various peripherals. Some similarities exist only within a single manufacturer, to some extent: MCUs from one manufacturer often share a similar approach, obviously. So the final solution is to have the startup generated and then switch to assembly immediately; this is feasible. Sure, with small flash it is still possible to optimize even the startup code, but that mostly matters on the smallest 8-bit parts, where startup IS quite easy anyway, or the generated code is also small, obviously.

How does warp divergence manifest in SASS?

When different threads in a warp execute divergent code, the divergent branches are serialized, and inactive threads are "disabled."
If the divergent paths contain a small number of instructions, such that branch predication is used, it's pretty clear what "disabled" means (threads are turned on/off by the predicate), and it's also clearly visible in the SASS dump.
If the divergent execution paths contain larger numbers of instructions (the exact number depends on some compiler heuristics), branch instructions are inserted to potentially skip one execution path or the other. This makes sense: if one long branch is seldom taken, or not taken by any thread in a certain warp, it's advantageous to allow the warp to skip those instructions (rather than being forced to execute both paths in all cases, as with predication).
My question is: How are inactive threads "disabled" in the case of divergence with branches? The slide on page 2, lower left of this presentation seems to indicate that branches are taken based on a condition and threads that do not participate are switched off via predicates attached to the instructions at the branch targets. However, this is not the behavior I observe in SASS.
Here's a minimal compilable sample:
#include <stdio.h>
__global__ void nonpredicated( int* a, int iter )
{
    if( a[threadIdx.x] == 0 )
        // Make the number of divergent instructions unknown at
        // compile time so the compiler is forced to create branches
        for( int i = 0; i < iter; i++ )
        {
            a[threadIdx.x] += 5;
            a[threadIdx.x] *= 5;
        }
    else
        for( int i = 0; i < iter; i++ )
        {
            a[threadIdx.x] += 2;
            a[threadIdx.x] *= 2;
        }
}

int main(){}
Here's the SASS dump showing that the branch instructions are predicated, but the code at the branch targets is not predicated. Are the threads that did not take the branch switched off implicitly during execution of those branch targets, in some way that is not directly visible in the SASS? I often see terminology like "active mask" alluded to in various CUDA documents, but I'm wondering how this manifests in SASS, if it is a separate mechanism from predication.
Additionally, for pre-Volta architectures, the program counter is shared per-warp, so the idea of a predicated branch instruction is confusing to me. Why would you attach a per-thread predicate to an instruction that might change something (the program counter) that is shared by all threads in the warp?
code for sm_20
Function : _Z13nonpredicatedPii
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ S2R R0, SR_TID.X; /* 0x2c00000084001c04 */
/*0010*/ MOV32I R3, 0x4; /* 0x180000001000dde2 */
/*0018*/ IMAD.U32.U32 R2.CC, R0, R3, c[0x0][0x20]; /* 0x2007800080009c03 */
/*0020*/ IMAD.U32.U32.HI.X R3, R0, R3, c[0x0][0x24]; /* 0x208680009000dc43 */
/*0028*/ LD.E R0, [R2]; /* 0x8400000000201c85 */
/*0030*/ ISETP.EQ.AND P0, PT, R0, RZ, PT; /* 0x190e0000fc01dc23 */
/*0038*/ #P0 BRA 0xd0; /* 0x40000002400001e7 */
/*0040*/ MOV R4, c[0x0][0x28]; /* 0x28004000a0011de4 */
/*0048*/ ISETP.LT.AND P0, PT, R4, 0x1, PT; /* 0x188ec0000441dc23 */
/*0050*/ MOV R4, RZ; /* 0x28000000fc011de4 */
/*0058*/ #P0 EXIT; /* 0x80000000000001e7 */
/*0060*/ NOP; /* 0x4000000000001de4 */
/*0068*/ NOP; /* 0x4000000000001de4 */
/*0070*/ NOP; /* 0x4000000000001de4 */
/*0078*/ NOP; /* 0x4000000000001de4 */
/*0080*/ IADD R4, R4, 0x1; /* 0x4800c00004411c03 */
/*0088*/ IADD R0, R0, 0x2; /* 0x4800c00008001c03 */
/*0090*/ ISETP.LT.AND P0, PT, R4, c[0x0][0x28], PT; /* 0x188e4000a041dc23 */
/*0098*/ SHL R0, R0, 0x1; /* 0x6000c00004001c03 */
/*00a0*/ #P0 BRA 0x80; /* 0x4003ffff600001e7 */
/*00a8*/ ST.E [R2], R0; /* 0x9400000000201c85 */
/*00b0*/ BRA 0x128; /* 0x40000001c0001de7 */
/*00b8*/ NOP; /* 0x4000000000001de4 */
/*00c0*/ NOP; /* 0x4000000000001de4 */
/*00c8*/ NOP; /* 0x4000000000001de4 */
/*00d0*/ MOV R0, c[0x0][0x28]; /* 0x28004000a0001de4 */
/*00d8*/ MOV R4, RZ; /* 0x28000000fc011de4 */
/*00e0*/ ISETP.LT.AND P0, PT, R0, 0x1, PT; /* 0x188ec0000401dc23 */
/*00e8*/ MOV R0, RZ; /* 0x28000000fc001de4 */
/*00f0*/ #P0 EXIT; /* 0x80000000000001e7 */
/*00f8*/ MOV32I R5, 0x19; /* 0x1800000064015de2 */
/*0100*/ IADD R0, R0, 0x1; /* 0x4800c00004001c03 */
/*0108*/ IMAD R4, R4, 0x5, R5; /* 0x200ac00014411ca3 */
/*0110*/ ISETP.LT.AND P0, PT, R0, c[0x0][0x28], PT; /* 0x188e4000a001dc23 */
/*0118*/ #P0 BRA 0x100; /* 0x4003ffff800001e7 */
/*0120*/ ST.E [R2], R4; /* 0x9400000000211c85 */
/*0128*/ EXIT; /* 0x8000000000001de7 */
.....................................
Are the threads that did not take the branch switched off implicitly during execution of those branch targets, in some way that is not directly visible in the SASS?
Yes.
There is a warp execution or "active" mask which is separate from the formal concept of predication as defined in the PTX ISA manual.
Predicated execution may allow instructions to be executed (or not) for a particular thread on an instruction-by-instruction basis. The compiler may also emit predicated instructions to enact a conditional jump or branch.
However, the GPU also maintains a warp active mask. When the machine observes that thread execution within a warp has diverged (for example at the point of a predicated branch, or perhaps any predicated instruction), it will set the active mask accordingly. This process isn't really "visible" at the SASS level. AFAIK the low-level execution process for a diverged warp (not via predication) isn't well specified, so questions around how long the warp stays diverged and the exact mechanism for re-synchronization aren't well specified either, and AFAIK can be affected by compiler choices, on some architectures. This is one recent discussion (note particularly the remarks by @njuffa).
Why would you attach a per-thread predicate to an instruction that might change something (the program counter) that is shared by all threads in the warp?
This is how you perform a conditional jump or branch. Since all execution is lock-step, if we are going to execute a particular instruction (regardless of mask status or predication status) the PC had better point to that instruction. However, the GPU can perform instruction replay to handle different cases, as needed at execution time.
A few other notes:
a mention of the "active mask" is here:
The scheduler dispatches all 32 lanes of the warp to the execution units with an active mask. Non-active threads execute through the pipe.
some NVIDIA tools allow for inspection of the active mask.
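As a side note (my own sketch, not from the material above; the kernel name showActiveMask is made up), the __activemask() intrinsic, available since CUDA 9.0, lets you observe the warp's active mask from device code inside a divergent region:
#include <cstdio>

__global__ void showActiveMask()
{
    unsigned mask;
    if (threadIdx.x < 16) {
        mask = __activemask();   // mask of threads active on this side of the branch
        if (threadIdx.x == 0)
            printf("lower half: active mask = 0x%08x\n", mask);
    } else {
        mask = __activemask();
        if (threadIdx.x == 16)
            printf("upper half: active mask = 0x%08x\n", mask);
    }
}

int main()
{
    showActiveMask<<<1, 32>>>();   // a single warp
    cudaDeviceSynchronize();
    return 0;
}
On a pre-Volta device one would typically expect masks like 0x0000ffff and 0xffff0000 here, but the exact values depend on the architecture and on how the scheduler reconverges the warp.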

IADD.X GPU instruction

When looking into the SASS output generated for the NVIDIA Fermi architecture, I noticed the instruction IADD.X. From the NVIDIA documentation, IADD means integer add, but I don't understand what IADD.X means. Can somebody please help? Does it mean an integer addition with an extended number of bits?
The instruction snippet is:
IADD.X R5, R3, c[0x0][0x24]; /* 0x4800400090315c43 */
Yes, the .X stands for eXtended precision. You will see IADD.X used together with IADD.CC, where the latter adds the less significant bits, and produces a carry flag (thus the .CC), and this carry flag is then incorporated into addition of the more significant bits performed by IADD.X.
Since NVIDIA GPUs are basically 32-bit processors with 64-bit addressing capability, a frequent use of this idiom is in address (pointer) arithmetic. The use of 64-bit integer types, such as long long int or uint64_t will likewise lead to the use of these instructions.
Here is a worked example of a kernel doing 64-bit integer addition. This CUDA code was compiled for compute capability 3.5 with CUDA 7.5, and the machine code dumped with cuobjdump --dump-sass.
__global__ void addint64 (long long int a, long long int b, long long int *res)
{
    *res = a + b;
}
MOV R1, c[0x0][0x44];
MOV R2, c[0x0][0x148]; // b[31:0]
MOV R0, c[0x0][0x14c]; // b[63:32]
IADD R4.CC, R2, c[0x0][0x140]; // tmp[31:0] = b[31:0] + a[31:0]; carry-out
MOV R2, c[0x0][0x150]; // res[31:0]
MOV R3, c[0x0][0x154]; // res[63:32]
IADD.X R5, R0, c[0x0][0x144]; // tmp[63:32] = b[63:32] + a[63:32] + carry-in
ST.E.64 [R2], R4; // [res] = tmp[63:0]
EXIT
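If you want to reproduce this carry-chain idiom explicitly from CUDA C++ (my own sketch, not part of the answer; the function name add64 is made up), the PTX instructions add.cc.u32 and addc.u32 can be emitted via inline assembly; ptxas typically maps such a pair to IADD.CC / IADD.X style machine instructions:
__device__ unsigned long long add64(unsigned int alo, unsigned int ahi,
                                    unsigned int blo, unsigned int bhi)
{
    unsigned int rlo, rhi;
    // Both instructions live in one asm statement so the carry flag
    // produced by add.cc is still valid when addc consumes it.
    asm("add.cc.u32 %0, %2, %3;\n\t"   // low halves, sets the carry flag
        "addc.u32   %1, %4, %5;"       // high halves plus carry-in
        : "=r"(rlo), "=r"(rhi)
        : "r"(alo), "r"(blo), "r"(ahi), "r"(bhi));
    return ((unsigned long long)rhi << 32) | rlo;
}
In ordinary code there is no need for this, of course; writing the addition on long long int operands, as in the kernel above, lets the compiler produce the same instruction pair.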

CUDA mutex, why deadlock?

I am trying to implement an atomic-based mutex.
I succeeded, but I have one question about warps / deadlock.
This code works well.
bool blocked = true;
while(blocked) {
    if(0 == atomicCAS(&mLock, 0, 1)) {
        index = mSize++;
        doCriticJob();
        atomicExch(&mLock, 0);
        blocked = false;
    }
}
But this one doesn't...
while(true) {
    if(0 == atomicCAS(&mLock, 0, 1)) {
        index = mSize++;
        doCriticJob();
        atomicExch(&mLock, 0);
        break;
    }
}
I think it's about where the loop is exited. In the first one, the exit happens where the condition is checked; in the second one it happens at the end of the if, so the thread waits for the other threads in the warp to finish the loop, but the other threads are waiting for the first thread as well... But I think I am wrong, so please explain it to me :).
Thanks !
There are other questions here on mutexes. You might want to look at some of them. Search on "cuda critical section", for example.
Assuming that one will work and one won't because it seemed to work for your test case is dangerous. Managing mutexes or critical sections, especially when the negotiation is amongst threads in the same warp is notoriously difficult and fragile. The general advice is to avoid it. As discussed elsewhere, if you must use mutexes or critical sections, have a single thread in the threadblock negotiate for any thread that needs it, then control behavior within the threadblock using intra-threadblock synchronization mechanisms, such as __syncthreads().
This question (IMO) can't really be answered without looking at the way the compiler has ordered the various paths of execution. Therefore we need to look at the SASS code (the machine code). You can use the CUDA binary utilities to do this, and will probably want to refer to both the PTX reference and the SASS reference. This also means that you need complete code, not just the snippets you've provided.
Here's my code for analysis:
$ cat t830.cu
#include <stdio.h>
__device__ int mLock = 0;
__device__ void doCriticJob(){
}
__global__ void kernel1(){
    int index = 0;
    int mSize = 1;
    while(true) {
        if(0 == atomicCAS(&mLock, 0, 1)) {
            index = mSize++;
            doCriticJob();
            atomicExch(&mLock, 0);
            break;
        }
    }
}
__global__ void kernel2(){
    int index = 0;
    int mSize = 1;
    bool blocked = true;
    while(blocked) {
        if(0 == atomicCAS(&mLock, 0, 1)) {
            index = mSize++;
            doCriticJob();
            atomicExch(&mLock, 0);
            blocked = false;
        }
    }
}
int main(){
    kernel2<<<4,128>>>();
    cudaDeviceSynchronize();
}
kernel1 is my representation of your deadlock code, and kernel2 is my representation of your "working" code. When I compile this on linux under CUDA 7 and run on a cc2.0 device (Quadro5000), if I call kernel1 the code will deadlock, and if I call kernel2 (as is shown) it doesn't.
I use cuobjdump -sass to dump the machine code:
$ cuobjdump -sass ./t830
Fatbin elf code:
================
arch = sm_20
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
code for sm_20
Fatbin elf code:
================
arch = sm_20
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
code for sm_20
Function : _Z7kernel1v
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ MOV32I R4, 0x1; /* 0x1800000004011de2 */
/*0010*/ SSY 0x48; /* 0x60000000c0000007 */
/*0018*/ MOV R2, c[0xe][0x0]; /* 0x2800780000009de4 */
/*0020*/ MOV R3, c[0xe][0x4]; /* 0x280078001000dde4 */
/*0028*/ ATOM.E.CAS R0, [R2], RZ, R4; /* 0x54080000002fdd25 */
/*0030*/ ISETP.NE.AND P0, PT, R0, RZ, PT; /* 0x1a8e0000fc01dc23 */
/*0038*/ #P0 BRA 0x18; /* 0x4003ffff600001e7 */
/*0040*/ NOP.S; /* 0x4000000000001df4 */
/*0048*/ ATOM.E.EXCH RZ, [R2], RZ; /* 0x547ff800002fdd05 */
/*0050*/ EXIT; /* 0x8000000000001de7 */
............................
Function : _Z7kernel2v
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ MOV32I R0, 0x1; /* 0x1800000004001de2 */
/*0010*/ MOV32I R3, 0x1; /* 0x180000000400dde2 */
/*0018*/ MOV R4, c[0xe][0x0]; /* 0x2800780000011de4 */
/*0020*/ MOV R5, c[0xe][0x4]; /* 0x2800780010015de4 */
/*0028*/ ATOM.E.CAS R2, [R4], RZ, R3; /* 0x54061000004fdd25 */
/*0030*/ ISETP.NE.AND P1, PT, R2, RZ, PT; /* 0x1a8e0000fc23dc23 */
/*0038*/ #!P1 MOV R0, RZ; /* 0x28000000fc0025e4 */
/*0040*/ #!P1 ATOM.E.EXCH RZ, [R4], RZ; /* 0x547ff800004fe505 */
/*0048*/ LOP.AND R2, R0, 0xff; /* 0x6800c003fc009c03 */
/*0050*/ I2I.S32.S16 R2, R2; /* 0x1c00000008a09e84 */
/*0058*/ ISETP.NE.AND P0, PT, R2, RZ, PT; /* 0x1a8e0000fc21dc23 */
/*0060*/ #P0 BRA 0x18; /* 0x4003fffec00001e7 */
/*0068*/ EXIT; /* 0x8000000000001de7 */
............................
Fatbin ptx code:
================
arch = sm_20
code version = [4,2]
producer = cuda
host = linux
compile_size = 64bit
compressed
$
Considering a single warp, with either code, all threads must acquire the lock (via atomicCAS) once, in order for the code to complete successfully. With either code, only one thread in a warp can acquire the lock at any given time, and in order for other threads in the warp to (later) acquire the lock, that thread must have an opportunity to release it (via atomicExch).
The key difference between these realizations then, lies in how the compiler scheduled the atomicExch instruction with respect to conditional branches.
Let's consider the "deadlock" code (kernel1). In this case, the ATOM.E.EXCH instruction does not occur until after the one (and only) conditional branch (#P0 BRA 0x18;) instruction. A conditional branch in CUDA code represents a possible point of warp divergence, and execution after warp divergence is, to some degree, unspecified and up to the specifics of the machine. But given this uncertainty, it's possible that the thread that acquired the lock will wait for the other threads to complete their branches, before executing the atomicExch instruction, which means that the other threads will not have a chance to acquire the lock, and we have deadlock.
If we then compare that to the "working" code, we see that once the ATOM.E.CAS instruction is issued, there are no conditional branches in between that point and the point at which the ATOM.E.EXCH instruction is issued, thus releasing the lock just acquired. Since each thread that acquires the lock (via ATOM.E.CAS) will release it (via ATOM.E.EXCH) before any conditional branching occurs, there isn't any possibility (given this code realization) for the kind of deadlock witnessed previously (with kernel1) to occur.
(#P0 is a form of predication, and you can read about it in the PTX reference here to understand how it can lead to conditional branching.)
NOTE: I consider both of these codes to be dangerous, and possibly flawed. Even though the current tests don't seem to uncover a problem with the "working" code, I think it's possible that a future CUDA compiler might choose to schedule things differently, and break that code. It's even possible that compiling for a different machine architecture might produce different code here. I consider a mechanism like this to be more robust, which avoids intra-warp contention entirely. Even such a mechanism, however, can lead to inter-threadblock deadlocks. Any mutex must be used under specific programming and usage limitations.
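To make the recommended alternative concrete, here is a rough sketch (mine, not the code behind the link; gLock, ticket and blockNegotiatedCritical are made-up names) of the "one thread per block negotiates" pattern: only thread 0 of each block ever spins on the lock, so there is no intra-warp contention at all.
__device__ int gLock = 0;   // assumed global spin lock, 0 = free

__global__ void blockNegotiatedCritical(int *counter)
{
    __shared__ int ticket;
    if (threadIdx.x == 0) {
        while (atomicCAS(&gLock, 0, 1) != 0) { }   // only one thread per block contends
        ticket = atomicAdd(counter, 1);            // the critical-section work
        __threadfence();                           // make updates visible before release
        atomicExch(&gLock, 0);                     // release the lock
    }
    __syncthreads();                               // the rest of the block waits here
    // all threads in the block can now use 'ticket'
}
Even this form is still an inter-block lock and therefore subject to the usage limitations mentioned above.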

The concept of branch (taken, not taken, diverged) in CUDA

In Nsight Visual Studio Edition there is a graph presenting the statistics of "taken", "not taken" and "diverged" branches. I am confused about the difference between "not taken" and "diverged".
For example
kernel()
{
    if(tid % 32 != 31)
    {...}
    else
    {...}
}
In my opinion, when tid % 32 == 31 for a thread in a warp, divergence will happen, but what is "not taken"?
From the Nsight Visual Studio Edition User Guide:
Not Taken / Taken Total: number of executed branch instructions with a uniform control flow decision; that is all active threads of a warp either take or not take the branch.
Diverged: Total number of executed branch instruction for which the conditional resulted in different outcomes across the threads of the warp. All code paths with at least one participating thread get executed sequentially. Lower numbers are better, however, check the Flow Control Efficiency to understand the impact of control flow on the device utilization.
Now, let us consider the following simple code, which perhaps is what you are currently considering in your tests:
#include <thrust/device_vector.h>

__global__ void test_divergence(int* d_output) {
    int tid = threadIdx.x;
    if(tid % 32 != 31)
        d_output[tid] = tid;
    else
        d_output[tid] = 30000;
}

int main() {
    const int N = 32;
    thrust::device_vector<int> d_vec(N,0);
    test_divergence<<<2,32>>>(thrust::raw_pointer_cast(d_vec.data()));
    return 0;
}
The Branch Statistics graph produced by Nsight is reported below. As you can see, Taken is equal to 100%, since all the threads bump into the if statement. The surprising result is that there are no Diverged branches. This can be explained by looking at the disassembled code of the kernel function (compiled for a compute capability of 2.1):
MOV R1, c[0x1][0x100];
S2R R0, SR_TID.X;
SHR R2, R0, 0x1f;
IMAD.U32.U32.HI R2, R2, 0x20, R0;
LOP.AND R2, R2, -0x20;
ISUB R2, R0, R2;
ISETP.EQ.AND P0, PT, R2, 0x1f, PT;
ISCADD R2, R0, c[0x0][0x20], 0x2;
SEL R0, R0, 0x7530, !P0;
ST [R2], R0;
EXIT;
As you can see, the compiler is able to optimize the code so that no branching is present, except the uniform one due to the EXIT instruction, as pointed out by Greg Smith in the comment below.
EDIT: A MORE COMPLEX EXAMPLE FOLLOWING GREG SMITH'S COMMENT
I'm now considering the following more complex example
/**************************/
/* TEST DIVERGENCE KERNEL */
/**************************/
__global__ void testDivergence(float *a, float *b)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < 16) a[tid] = tid + 1;
    else          b[tid] = tid + 2;
}

/********/
/* MAIN */
/********/
int main() {
    const int N = 64;
    float* d_a; cudaMalloc((void**)&d_a,N*sizeof(float));
    float* d_b; cudaMalloc((void**)&d_b,N*sizeof(float));
    testDivergence<<<2,32>>>(d_a, d_b);
    return 0;
}
This is the Branch Statistics graph
while this is the disassembled code
MOV R1, c[0x1][0x100];
S2R R0, SR_CTAID.X; R0 = blockIdx.x
S2R R2, SR_TID.X; R2 = threadIdx.x
IMAD R0, R0, c[0x0][0x8], R2; R0 = threadIdx.x + blockIdx.x * blockDim.x
ISETP.LT.AND P0, PT, R0, 0x10, PT; Checks if R0 < 16 and puts the result in predicate register P0
/*0028*/ #P0 BRA.U 0x58; If P0 = true, jumps to address 0x58
#!P0 IADD R2, R0, 0x2; If P0 = false, R2 = R0 + 2
#!P0 ISCADD R0, R0, c[0x0][0x24], 0x2; If P0 = false, calculates address to store b[tid] in global memory
#!P0 I2F.F32.S32 R2, R2; "
#!P0 ST [R0], R2; "
/*0050*/ #!P0 BRA.U 0x78; If P0 = false, jumps to address 0x78
/*0058*/ #P0 IADD R2, R0, 0x1; R2 = R0 + 1
#P0 ISCADD R0, R0, c[0x0][0x20], 0x2;
#P0 I2F.F32.S32 R2, R2;
#P0 ST [R0], R2;
/*0078*/ EXIT;
As can be seen, now we have two BRA instructions in the disassembled code. From the graph above, each warp bumps into 3 branches (one for the EXIT and two for the BRAs). Both warps have 1 taken branch, since all their threads uniformly bump into the EXIT instruction. The first warp has 2 not-taken branches, since the two BRA paths are not followed uniformly across the warp's threads. The second warp has 1 not-taken branch and 1 taken branch, since all of its threads uniformly follow one of the two BRAs. I would say that, again, Diverged is equal to zero because the instructions in the two branches are exactly the same, although performed on different operands.
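For completeness, here is a small sketch (mine, not from the original answer; the kernel name expectDivergence is made up) of a case where one would expect the Diverged counter to become non-zero: odd and even lanes of every warp take different paths, and the loop makes the two paths genuinely different, so they cannot be reduced to a single SEL.
__global__ void expectDivergence(float *a, int iter)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid % 2 == 0) {
        for (int i = 0; i < iter; i++)   // longer, data-dependent path
            a[tid] = a[tid] * 0.5f + 1.0f;
    } else {
        a[tid] = -a[tid];                // short path
    }
}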