I have been writing this program in assembly language that encrypts or decrypts a string of text. At the end it should simply output the encoded message, but instead I am getting a massive number of random characters. Does anyone have any idea what's going on here?
.ORIG x3000
;CLEAR REGISTERS
AGAIN AND R0, R0, 0 ;CLEAR R0
AND R1, R1, 0 ;CLEAR R1
AND R2, R2, 0 ;CLEAR R2
AND R3, R3, 0 ;CLEAR R3
AND R4, R4, 0 ;CLEAR R4
AND R5, R5, 0 ;CLEAR R5
AND R6, R6, 0 ;CLEAR R6
;ENCRYPT/DECRYPT PROMPT
LEA R0, PROMPT_E ;LOADS PROMPT_E INTO R0
PUTS ;PRINTS R0
GETC ;GETS INPUT
OUT ;ECHO TO SCREEN
STI R0, MEMX3100 ;X3100 <- R0
;KEY PROMPT
LEA R0, PROMPT_K ;LOADS ADDRESS OF PROMPT_K INTO R0
PUTS ;PRINTS R0
GETC ;GETS INPUT
OUT ;ECHO TO SCREEN
STI R0, CYPHERKEY ;X3101 <- R0
;MESSAGE PROMPT
LD R6, MEMX3102 ;R6 <- MEMX3102
LEA R0, PROMPT_M ;LOADS ADDRESS OF PROMPT_M INTO R0
PUTS ;PRINTS R0
LOOP1 GETC ;GETS INPUT
OUT ;ECHO TO SCREEN
ADD R1, R0, #-10 ;R1 <- R0-10
BRZ NEXT ;BRANCH NEXT IF ENTER
STR R0, R6, #0 ;X3102 <- R0
ADD R6, R6, #1 ;INCREMENT COUNT
LD R2, NUM21 ;R2 <- -12546
ADD R5, R6, R2 ;R5 - R2
STI R5, MEMX4000 ;MEMX4000 <- R5
LD R1, NUM20 ;R1 <- NUM20
ADD R1, R6, R1 ;CHECK FOR 20
BRN LOOP1 ;CREATES WHILE LOOP
;Function choose
NEXT LDI R6, MEMX3100 ;R6 <- X3100
LD R1, NUM68 ;R1 <- -68
ADD R1, R6, R1 ;CHECKS FOR D INPUT
BRZ DECRYPT
;ENCRYPT FUNCTION(DEFAULT)
LD R4, MEMX3102 ;R4 <- X3102
LOOP2 LDR R1, R4, #0 ;R1 <- MEM[R4+0]
LDI R5, ASCII ;R5 <- ASCII
ADD R1, R1, R5 ;STRIPS ASCII
AND R6, R1, #1 ;R6 <- R1 AND #1
BRZ LSBOI ;BRANCH IF LSB = 0
ADD R1, R1, #-1 ;R1 <- R1-1
BRNZP KEYLOAD ;BRANCH TO KEYLOAD
LSBOI ADD R1, R1, #1 ;R1 <- R1+1
KEYLOAD LDI R2, CYPHERKEY ;R2 <- CYPHERKEY
ADD R1, R1, R2 ;R1 <- R1+R2
STR R1, R4, #21 ;MEM[R4+21] <- R1
ADD R4, R4, #1 ;R4 <- R4 + 1
LD R5, MEMX4000 ;R5 <- COUNT
NOT R5, R5 ;NOT R5
ADD R5, R5, R4 ;CHECK FOR NEGATIVE
BRN LOOP2 ;LOOP
BRNZP NEXT2 ;BRANCH WHEN DONE
;DECRYPT FUNCTION
DECRYPT LD R4, MEMX3102 ;R4 <- X3102
LOOP3 LDR R1, R4, #0 ;R1 <- MEM[R4+0]
LDI R5, ASCII ;R5 <- ASCII
ADD R1, R1, R5 ;STRIPS ASCII
LDI R2, CYPHERKEY ;R2 <- CYPHERKEY
NOT R2, R2 ;R2 <- NOT R2
ADD R1, R1, R2 ;R1 <- R1 - CYPHERKEY
AND R6, R1, #1 ;R6 <- R1 AND #1
BRZ LSBOI2 ;BRANCH IF LSB = 0
ADD R1, R1, #-1 ;R1 <- R1-1
BRNZP NEXTTASK1 ;BRANCH TO KEYLOAD
LSBOI2 ADD R1, R1, #1 ;R1 <- R1+1
NEXTTASK1 STR R1, R4, #21 ;MEM[R4+21] <- R1
ADD R4, R4, #1 ;R4 <- R4 + 1
LD R5, MEMX4000 ;R5 <- COUNT
NOT R5, R5 ;NOT R5
ADD R5, R5, R4 ;CHECK FOR NEGATIVE
BRN LOOP3 ;LOOP
;OUTPUT
NEXT2 LD R4, MEMX3102 ;R4 <- X3102
LOOP4 LDR R0, R4, #21 ;R0 <- [R4+21]
OUT ;PRINT R0
ADD R4, R4, #1 ;R4 <- R4+1
LD R5, MEMX4000 ;R5 <- COUNT
NOT R5, R5 ;NOT R5
ADD R5, R5, R4 ;CHECK FOR NEGATIVE
BRN LOOP4
HALT
MEMX4000 .FILL X4000
ASCII .FILL #-30
NUM21 .FILL #-12546
NUM20 .FILL #-12566
MEMX3102 .FILL X3102
CYPHERKEY .FILL X3101
MEMX3100 .FILL X3100
NUM68 .FILL #-68
NUM32 .FILL #-32
PROMPT_E .STRINGZ "\nTYPE E TO ENCRYPT OR TYPE D TO DECRYPT (UPPER CASE): "
PROMPT_K .STRINGZ "\nENTER THE ENCRYPTION KEY (A SINGLE DIGIT FROM 1 TO 9) "
PROMPT_M .STRINGZ "\nINPUT A MESSAGE OF NO MORE THAN 20 CHARACTERS THEN PRESS <ENTER> "
.END
There are a number of different things going on in your program. Here are some of the things I've found:
The encoding loop runs more times than the number of characters entered
The encryption key is stored and used in its ASCII form
Characters from the user are stored in the middle of the PROMPT_M text
The encoding loop cycles thousands of times
The encoding loop doesn't change any of the stored characters at location x3102
The output routine doesn't loop, so it only outputs one char
From what I've seen, your program takes a non-ASCII char from the user, adds it to the ASCII form of the encryption key, and then stores the result hundreds of times, at every memory location from x3102 plus an offset of 21 onward. When your output routine runs, it pulls the value stored at x3117 and outputs that one char, then halts the program.
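As far as it can be read from the listing, the intended cipher seems to be: toggle the low bit of each character, then add the key (and undo those steps in reverse order to decrypt). The following is a minimal Python sketch of that intent, not a translation of the LC-3 program. It assumes the key digit is converted from its ASCII code to a number before use, and the function names are hypothetical.

```python
def key_value(key_char: str) -> int:
    """Convert the typed key digit ('1'..'9') from ASCII to its numeric value."""
    return ord(key_char) - ord("0")


def encrypt(message: str, key_char: str) -> str:
    """Toggle the low bit of every character, then add the numeric key."""
    key = key_value(key_char)
    return "".join(chr((ord(ch) ^ 1) + key) for ch in message)


def decrypt(message: str, key_char: str) -> str:
    """Undo encrypt(): subtract the numeric key, then toggle the low bit."""
    key = key_value(key_char)
    return "".join(chr((ord(ch) - key) ^ 1) for ch in message)


if __name__ == "__main__":
    secret = encrypt("HELLO", "3")
    print(secret)
    print(decrypt(secret, "3"))
```

Note that both loops run exactly once per character of the message, which is the property the LC-3 loops above fail to maintain.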
I'm trying to order a collection by characters and numbers in Laravel. I have tried different methods, but nothing seems to work the way I want it to.
The numbers I'm trying to order can always be different. The example in the code below uses the following characters and numbers:
M1,
M2,
M3,
M10,
M11,
R1,
R2,
R10,
R11
The names (characters plus numbers) are stored in a MySQL database in a column with the VARCHAR datatype.
Tried this:
$items = Item::where('item_id', $itemId)->orderBy('name', 'ASC')->get();
// M1, M10, M11, M2, M3, R1, R10, R11, R2
$items = Item::where('item_id', $itemId)->orderByRaw('LENGTH(name)', 'asc')->orderBy('name', 'ASC')->get();
// M1, M2, M3, R1, R2, M10, M11, R10, R11
$items = Item::where('item_id', $itemId)->orderByRaw('CAST(name as unsigned)')->orderBy('name', 'ASC')->get();
// M1, M10, M11, M2, M3, R1, R10, R11, R2
What I'm trying to achieve is the following order:
M1,
M2,
M10,
M11,
R1,
R2,
R10,
R11
Is this even possible?
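Fetch the items first, then sort the resulting collection with sortBy() and the SORT_NATURAL flag. The sort then runs in PHP on the collection rather than in the database query.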
$items = Item::where('item_id', $itemId)->get()->sortBy('name', SORT_NATURAL);
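To illustrate what a natural sort does with these names, here is a small Python sketch. It is an approximation of the idea only, not PHP's actual SORT_NATURAL implementation, and `natural_key` is a hypothetical helper name. Each name is split into text and digit runs, and the digit runs are compared as numbers.

```python
import re


def natural_key(name: str):
    """Split 'M10' into ['M', 10, ''] so that digit runs compare as numbers."""
    parts = re.split(r"(\d+)", name)
    return [int(part) if part.isdigit() else part for part in parts]


if __name__ == "__main__":
    names = ["M1", "M10", "M11", "M2", "M3", "R1", "R10", "R11", "R2"]
    print(sorted(names))                   # plain string order
    print(sorted(names, key=natural_key))  # natural order
```

Because `re.split` with a capturing group always alternates text and digits, the keys compare text with text and number with number at every position.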
I am trying to read input from the user and print it.
First I print a prompt, then the user enters a value, and then I want to print that value.
.data
params_sys5: .space 8
params_sys3: .space 8
prompt_msg_LBound: .asciiz "Enter lower bound for x,y\n"
prompt_msg_LBound_val: .asciiz "Lower bound for x,y = %d\n"
xyL: .word64 0
prompt_msg_UBound: .asciiz "Enter upper bound for x,y\n"
prompt_msg_UBound_val: .asciiz "Upper bound for x,y = %d\n"
xyU: .word64 0
prompt_msg_UBoundZ: .asciiz "Enter upper bound for z\n"
prompt_msg_UBoundZ_val: .asciiz "Lower bound for z = %d\n"
zU: .word64 0
prompt_msgAns: .asciiz "x = %d, y = %d, z = %d\n"
.word64 0
.word64 0
.word64 0
xyL_Len: .word64 0
xyU_Len: .word64 0
zU_Len: .word64 0
xyL_text: .space 32
xyU_text: .space 32
zU_text: .space 32
ZeroCode: .word64 0x30 ;Ascii '0'
.text
main: daddi r4, r0, prompt_msg_LBound
jal print_string
daddi r8, r0, xyL_text ;r8 = xyL_text
daddi r14, r0, params_sys3
daddi r9, r0, 32
jal read_keyboard_input
sd r1, xyL_Len(r0) ;save first number length
ld r10, xyL_Len(r0) ;n = r10 = length of xyL_text
daddi r17, r0, xyL_text
jal convert_string_to_integer ;r17 = &source string,r10 = string length,returns computed number in r11
sd r11, xyL(r0)
daddi r4, r0, prompt_msg_LBound_val
jal print_string
end: syscall 0
print_string: sw $a0, params_sys5(r0)
daddi r14, r0, params_sys5
syscall 5
jr r31
read_keyboard_input: sd r0, 0(r14) ;read from keyboard
sd r8, 8(r14) ;destination address
sd r9, 16(r14) ;destination size
syscall 3
jr r31
convert_string_to_integer: daddi r13, r0, 1 ;r13 = constant 1
daddi r20, r0, 10 ;r20 = constant 10
movz r11, r0, r0 ;x1 = r11 = 0
ld r19, ZeroCode(r0)
For1: beq r10, r0, EndFor1
dmultu r11, r20 ;lo = x * 10
mflo r11 ;x = r11 = lo = r11 * 10
movz r16, r0, r0 ;r16 = 0
lbu r16, 0(r17) ;r16 = text[i]
dsub r16, r16, r19 ;r16 = text[i] - '0'
dadd r11, r11, r16 ;x = x + text[i] - '0'
dsub r10, r10, r13 ;n--
dadd r17, r17, r13 ;i++
b For1
EndFor1: jr r31
I'm trying to get the first number, the lower bound of x,y.
For example, I type the number 5, so in the end xyL holds 5, but the printed output is:
Enter lower bound for x,y
Lower bound for x,y = 0
How do I print the entered value, and after that do the same with the next string?
Thanks.
Edit:=======================================================================
I changed the .data section by adding another .space 8 entry to hold the address of the format string. Now, instead of jumping to print_string to print the value, I call syscall 5 directly. For example:
prompt_msg_LBound: .asciiz "Enter lower bound for x,y\n"
prompt_msg_LBound_val: .asciiz "Lower bound for x,y = %d\n"
LBound_val_addr: .space 8
xyL: .space 8
and in the .code section:
sd r11, xyL(r0)
daddi r5, r0, prompt_msg_LBound_val
sd r5, LBound_val_addr(r0)
daddi r14 ,r0, LBound_val_addr
syscall 5
But I would still like to use print_string to print the string prompt_msg_LBound_val together with the value the user entered.
How can I do that?
The print_string sample function in the manual is not meant to be used with placeholders, only with plain strings.
If you add placeholders to the format string, SYSCALL 5 reads the value of each placeholder from the memory that follows the format string's address. In this case it reads and displays the value 0, which just happens to be what is in memory there.
See the printf() example from the manual (slightly updated and annotated) for how to use placeholders:
.data
format_str: .asciiz "%dth of %s:\n%s version %i.%i.%i is being tested!"
s1: .asciiz "February"
s2: .asciiz "EduMIPS64"
fs_addr: .space 4 ; Will store the address of the format string
.word 10 ; The literal value 10.
s1_addr: .space 4 ; Will store the address of the string "February"
s2_addr: .space 4 ; Will store the address of the string "EduMIPS64"
.word 1 ; The literal value 1.
.word 2 ; The literal value 2.
.word 6 ; The literal value 6.
test:
.code
daddi r5, r0, format_str
sw r5, fs_addr(r0)
daddi r2, r0, s1
daddi r3, r0, s2
sd r2, s1_addr(r0)
sd r3, s2_addr(r0)
daddi r14, r0, fs_addr
syscall 5
syscall 0
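The parameter layout can be pictured with a toy Python model. This is only an illustration of the idea, not EduMIPS64's implementation: the `syscall5` function, the dictionary-based memory, the 8-byte slot size, and the choice to read unset memory as 0 are all assumptions made for the sketch.

```python
import re


def syscall5(memory: dict, strings: dict, r14: int, slot: int = 8) -> str:
    """Toy model of the printf-style call.

    memory[r14] holds the address of the format string; each placeholder
    takes its value from the next slot after it. Unset memory reads as 0.
    """
    fmt = strings[memory[r14]]
    addr = r14

    def fill(match):
        nonlocal addr
        addr += slot
        value = memory.get(addr, 0)
        if match.group() == "%s":
            return strings[value]  # the slot holds the address of a string
        return str(value)          # %d / %i: the slot holds the number itself

    return re.sub(r"%[dis]", fill, fmt)


if __name__ == "__main__":
    strings = {100: "Lower bound for x,y = %d\n"}
    # Nothing stored after the format string's address: the placeholder reads 0.
    print(syscall5({0: 100}, strings, r14=0), end="")
    # Value stored in the slot right after the address: the placeholder reads 5.
    print(syscall5({0: 100, 8: 5}, strings, r14=0), end="")
```

In this picture, printing the user's number means making sure the converted value sits in the slot directly after the format string's address before the call is made.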
Hopefully this is a simple question, but I cannot for the life of me figure out how to do a bit shift in binary. This is being done in the LC-3 environment. I just need to know how to arithmetically divide by two, that is, shift to the right. I know going left is simple, by just adding the binary value to itself, but I have tried the opposite for a right shift (subtracting from itself, NOTing and then subtracting, etc.) without success. Any help would be much appreciated.
Or if you have a better way to move x00A0 to x000A, that would also be fantastic. Thanks!
This is an older post, but I ran into the same issue, so I figured I would post what I've found.
When you do a bit shift to the right you're normally halving the binary number (dividing by 2), but that can be a challenge in the LC-3. This is the code I wrote to perform a bit shift to the right.
; Bit shift to the right
.ORIG x3000
MAIN
LD R3, VALUE
AND R5, R5, #0 ; Resetting our counter (it ends up holding the shifted value)
B_RIGHT_LOOP
ADD R3, R3, #-2 ; Subtract 2 from the value stored in R3
BRn BR_END ; Exit the loop as soon as the number in R3 has gone negative
ADD R5, R5, #1 ; Add 1 to the bit counter
BR B_RIGHT_LOOP ; Start the loop over again
BR_END
ST R5, ANSWER ; Store the shifted value into the ANSWER variable
HALT ; Stop the program
; Variables
VALUE .FILL x3BBC ; Value is the number we want to do a bit-shift to the right
ANSWER .FILL x0000
.END
Keep in mind that with this code the rightmost bit, bit [0], is lost. Also, this code doesn't work if the number we are trying to shift to the right is negative. So if bit [15] is set, this code won't work.
Example:
VALUE .FILL x8000 ; binary value = 1000 0000 0000 0000
; and values higher than x8000
; won't work because their 15th
; bit is set
This should at least get you going on the right track.
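The counting loop above can be modelled in a few lines of Python to check what it computes. This is a sketch under the same restriction as the LC-3 code (non-negative 16-bit values only), and `shift_right_by_counting` is a hypothetical name.

```python
def shift_right_by_counting(value: int) -> int:
    """Count how often 2 can be subtracted before the value goes negative."""
    if not 0 <= value <= 0x7FFF:
        raise ValueError("only non-negative 16-bit values are handled")
    count = 0
    value -= 2              # ADD R3, R3, #-2
    while value >= 0:       # BRn BR_END leaves the loop once negative
        count += 1          # ADD R5, R5, #1
        value -= 2
    return count


if __name__ == "__main__":
    print(hex(shift_right_by_counting(0x3BBC)))
    # Moving x00A0 to x000A takes four single-bit shifts.
    shifted = 0x00A0
    for _ in range(4):
        shifted = shift_right_by_counting(shifted)
    print(hex(shifted))
```

The number of loop iterations grows with the value itself (about value / 2), so this approach is simple but slow for large inputs.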
.ORIG x3000
BR main
;»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»
; UL7AAjr
; shift right register R0
; used registers R1, R2, R3, R4, R5
;»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»
shift_right
AND R4, R4, #0 ; R4 - counter = 15 times
ADD R4, R4, #15
AND R1, R1, #0 ; R1 - temp result
LEA R2, _sr_masks ; R2 - masks pointer
_sr_loop
LDR R3, R2, #0 ; load mask into R3
AND R5, R0, R3 ; check bit in R0
BRZ _sr_zero ; go sr_zero if bit is zero
LDR R3, R2, #1 ; R3 = next mask (the current bit moved one place right)
ADD R1, R1, R3 ; add mask to temp result
_sr_zero
ADD R2, R2, #1 ; next mask address
ADD R4, R4, #-1 ; all bits done?
BRNP _sr_loop
AND R0, R0, #0 ; R0 = R1
ADD R0, R0, R1
RET
_sr_masks
.FILL x8000
.FILL x4000
.FILL x2000
.FILL x1000
.FILL x0800
.FILL x0400
.FILL x0200
.FILL x0100
.FILL x0080
.FILL x0040
.FILL x0020
.FILL x0010
.FILL x0008
.FILL x0004
.FILL x0002
.FILL x0001
;»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»»
main
LD R0, data
JSR shift_right
HALT
data .FILL xFFFF
.END
; right shift R0 by 1 bit with sign extension
; Algorithm: build the result bit by bit, starting from the MSB
.ORIG x3000
AND R1, R1, #0 ; r1 = 0
ADD R2, R1, #14 ; r2 = 14
ADD R0, R0, #0 ; r0 = r0
BRzp LOOP
ADD R1, R1, #-1 ; r1 = xffff
LOOP ADD R1, R1, R1 ; r1 << 1
ADD R0, R0, R0 ; r0 << 1
BRzp MSB0
ADD R1, R1, #1 ; r1++
MSB0 ADD R2, R2, #-1 ; cnt--
BRp LOOP
ADD R0, R1, #0 ; r0 = r1
HALT
.END
; right shift R0 by 1 bit with sign extension
; Algorithm: left-rotate 14 times, then fix up the sign bits
.ORIG x3000
LD R1, CNT
ADD R2, R0, #0
LOOP ADD R0, R0, R0 ; r0 << 1
BRzp NEXTBIT
ADD R0, R0, #1
NEXTBIT ADD R1, R1, #-1
BRp LOOP
LD R3, MASK
AND R0, R0, R3
ADD R2, R2, #0
BRzp DONE
NOT R3, R3
ADD R0, R0, R3
DONE HALT
MASK .FILL x3FFF
CNT .FILL 14
.END
; right shift R0 by 1 bit with sign extension
; Algorithm: look-up table and auto-stop
.ORIG x3000
AND R1, R1, #0 ; r1 = 0
LEA R2, TABLE ; r2 = table[]
AND R0, R0, #-2
LOOP BRzp MSB0
LDR R3, R2, #0 ; r3 = table[r2]
ADD R1, R1, R3 ; r1 += r3
MSB0 ADD R2, R2, #1 ; r2++
ADD R0, R0, R0 ; r0 << 1
BRnp LOOP
ADD R0, R1, #0 ; r0 = r1
HALT
TABLE
.FILL xC000
.FILL x2000
.FILL x1000
.FILL x0800
.FILL x0400
.FILL x0200
.FILL x0100
.FILL x0080
.FILL x0040
.FILL x0020
.FILL x0010
.FILL x0008
.FILL x0004
.FILL x0002
.FILL x0001
.END
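The look-up table variant directly above can be checked with a Python model that mimics 16-bit registers. This is a sketch, not a cycle-accurate simulation: `asr1_table` is a hypothetical name, and the `while` loop stands in for the LC-3 loop that stops as soon as R0 becomes zero.

```python
# Same 15 entries as TABLE above: xC000, x2000, x1000, ..., x0001.
TABLE = [0xC000] + [1 << k for k in range(13, -1, -1)]


def asr1_table(r0: int) -> int:
    """Arithmetic right shift by one bit of a 16-bit value, table driven."""
    r0 &= 0xFFFE                       # AND R0, R0, #-2 drops bit 0
    r1 = 0
    index = 0
    while r0 != 0:                     # auto-stop once no bits are left
        if r0 & 0x8000:                # MSB set: add this bit's shifted weight
            r1 = (r1 + TABLE[index]) & 0xFFFF
        index += 1
        r0 = (r0 << 1) & 0xFFFF        # ADD R0, R0, R0
    return r1


if __name__ == "__main__":
    for value in (0x00A0, 0x8000, 0xFFFF, 0x3BBC):
        print(hex(value), "->", hex(asr1_table(value)))
```

The first table entry is xC000 rather than x4000 because a set sign bit must appear in both bit 15 and bit 14 of the result.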
I think I hit a CUDA bug. Can someone confirm or comment on the code (see below)?
The code (attached) produces different results depending on the "BUG" define. With BUG=0 the result is 8 (correct), while with BUG=1 it is 4 (which is wrong). The only difference in the code is here:
#if BUG
unsigned int na=threadIdx.x, nb=threadIdx.y, nc=threadIdx.z;
#else
unsigned int na=0, nb=0, nc=0;
#endif
I launch only ONE thread, so na==nb==nc==0 in both cases, and I also check this with the statements:
assert( na==0 && nb==0 && nc==0 );
printf("INITIAL VALUES: %u %u %u\n",na,nb,nc);
Here is my compilation & run:
nvcc -arch=sm_21 -DBUG=0 -o bug0 bug.cu
nvcc -arch=sm_21 -DBUG=1 -o bug1 bug.cu
./bug0
./bug1
nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2012 NVIDIA
Corporation Built on Fri_Sep_21_17:28:58_PDT_2012 Cuda compilation
tools, release 5.0, V0.2.1221
nvcc runs with g++-4.6
Finally here is the test code:
/* Compilation & run
nvcc -arch=sm_21 -DBUG=0 -o bug0 bug.cu
nvcc -arch=sm_21 -DBUG=1 -o bug1 bug.cu
./bug0
./bug1
*/
#include <stdio.h>
#include <assert.h>
__global__
void b(unsigned int *res)
{
#if BUG
unsigned int na=threadIdx.x, nb=threadIdx.y, nc=threadIdx.z;
#else
unsigned int na=0, nb=0, nc=0;
#endif
assert( na==0 && nb==0 && nc==0 );
printf("INITIAL VALUES: %u %u %u\n",na,nb,nc);
unsigned int &iter=*res, na_max=2, nb_max=2, nc_max=2;
iter=0;
while(true)
{
printf("a-iter=%u %u %u %u\n",iter,na,nb,nc);
if( na>=na_max )
{
na = 0;
nb += blockDim.y;
printf("b-iter=%u %u %u %u\n",iter,na,nb,nc);
if( nb>=nb_max )
{
printf("c-iter=%u %u %u %u\n",iter,na,nb,nc);
nb = 0;
nc += blockDim.z;
if( nc>=nc_max )
break; // end of loop
}
else
printf("c-else\n");
}
else
printf("b-else\n");
printf("result %u %u %u\n",na,nb,nc);
iter++;
na += blockDim.x;
}
}
int main(void)
{
unsigned int res, *d_res;
cudaMalloc(&d_res,sizeof(unsigned int));
b<<<1,1>>>(d_res);
cudaMemcpy(&res, d_res, sizeof(unsigned int), cudaMemcpyDeviceToHost);
cudaFree(d_res);
printf("There are %u combinations (correct is 8)\n",res);
return 0;
}
This appears to be an assembler bug. If I take a simplified version of your example:
template<int bug>
__global__
void b(unsigned int *res)
{
unsigned int na, nb, nc;
switch(bug) {
case 1:
na=threadIdx.x;
nb=threadIdx.y;
nc=threadIdx.z;
break;
default:
na = nb = nc = 0;
break;
}
unsigned int &iter=*res, na_max=2, nb_max=2, nc_max=2;
iter=0;
while(true)
{
if( na>=na_max )
{
na = 0;
nb += blockDim.y;
if( nb>=nb_max )
{
nb = 0;
nc += blockDim.z;
if( nc>=nc_max ) break;
}
}
iter++;
na += blockDim.x;
}
}
and instantiate both versions, the PTX emitted appears to be the same with the exception of the use of tid.{xyz} in the version with bug=1 (on the right):
.visible .entry _Z1bILi0EEvPj( .visible .entry _Z1bILi1EEvPj(
.param .u64 _Z1bILi0EEvPj_param_0 .param .u64 _Z1bILi1EEvPj_param_0
) )
{ {
.reg .pred %p<4>; .reg .pred %p<4>;
.reg .s32 %r<28>; .reg .s32 %r<28>;
.reg .s64 %rd<3>; .reg .s64 %rd<3>;
ld.param.u64 %rd2, [_Z1bILi0EEvPj_param_0]; ld.param.u64 %rd2, [_Z1bILi1EEvPj_param_0];
cvta.to.global.u64 %rd1, %rd2; cvta.to.global.u64 %rd1, %rd2;
mov.u32 %r26, 0; .loc 2 11 1
.loc 2 22 1 mov.u32 %r27, %tid.x;
st.global.u32 [%rd1], %r26; .loc 2 12 1
.loc 2 33 1 mov.u32 %r25, %tid.y;
mov.u32 %r1, %ntid.z; .loc 2 13 1
.loc 2 28 1 mov.u32 %r26, %tid.z;
mov.u32 %r2, %ntid.y; mov.u32 %r24, 0;
.loc 2 39 1 .loc 2 22 1
mov.u32 %r3, %ntid.x; st.global.u32 [%rd1], %r24;
mov.u32 %r27, %r26; .loc 2 33 1
mov.u32 %r25, %r26; mov.u32 %r4, %ntid.z;
mov.u32 %r24, %r26; .loc 2 28 1
mov.u32 %r5, %ntid.y;
BB0_1: .loc 2 39 1
.loc 2 25 1 mov.u32 %r6, %ntid.x;
setp.lt.u32 %p1, %r27, 2;
@%p1 bra BB0_4; BB1_1:
.loc 2 25 1
.loc 2 28 1 setp.lt.u32 %p1, %r27, 2;
add.s32 %r25, %r2, %r25; @%p1 bra BB1_4;
.loc 2 30 1
setp.lt.u32 %p2, %r25, 2; .loc 2 28 1
mov.u32 %r27, 0; add.s32 %r25, %r5, %r25;
.loc 2 30 1 .loc 2 30 1
@%p2 bra BB0_4; setp.lt.u32 %p2, %r25, 2;
mov.u32 %r27, 0;
.loc 2 33 1 .loc 2 30 1
add.s32 %r26, %r1, %r26; @%p2 bra BB1_4;
.loc 2 34 1
setp.gt.u32 %p3, %r26, 1; .loc 2 33 1
mov.u32 %r27, 0; add.s32 %r26, %r4, %r26;
mov.u32 %r25, %r27; .loc 2 34 1
.loc 2 34 1 setp.gt.u32 %p3, %r26, 1;
@%p3 bra BB0_5; mov.u32 %r27, 0;
mov.u32 %r25, %r27;
BB0_4: .loc 2 34 1
.loc 2 38 1 @%p3 bra BB1_5;
add.s32 %r24, %r24, 1;
st.global.u32 [%rd1], %r24; BB1_4:
.loc 2 39 1 .loc 2 38 1
add.s32 %r27, %r3, %r27; add.s32 %r24, %r24, 1;
bra.uni BB0_1; st.global.u32 [%rd1], %r24;
.loc 2 39 1
BB0_5: add.s32 %r27, %r6, %r27;
.loc 2 41 2 bra.uni BB1_1;
ret;
} BB1_5:
.loc 2 41 2
ret;
}
The assembler output is another story, however (again bug=0 on the left and bug=1 on the right):
/*0008*/ MOV R1, c [0x0] [0x44]; MOV R1, c [0x0] [0x44];
/*0010*/ MOV R6, c [0x0] [0x140]; MOV R6, c [0x0] [0x140];
/*0018*/ MOV R7, c [0x0] [0x144]; MOV R7, c [0x0] [0x144];
/*0020*/ S2R R0, SR_Tid_X; MOV R0, RZ;
/*0028*/ MOV R4, RZ; MOV R2, RZ;
/*0030*/ S2R R3, SR_Tid_Z; MOV R3, RZ;
/*0038*/ ST.E [R6], RZ; MOV R4, RZ;
/*0048*/ S2R R2, SR_Tid_Y; ST.E [R6], RZ;
/*0050*/ ISETP.LT.U32.AND P0, pt, R0, 0x2, pt; ISETP.LT.U32.AND P0, pt, R2, 0x2, pt;
/*0058*/ SSY 0xd0; @P0 BRA 0xb0;
/*0060*/ @P0 BRA 0xc0; IADD R3, R3, c [0x0] [0x2c];
/*0068*/ IADD R2, R2, c [0x0] [0x2c]; MOV R2, RZ;
/*0070*/ MOV R0, RZ; ISETP.LT.U32.AND P0, pt, R3, 0x2, pt;
/*0078*/ ISETP.LT.U32.AND P0, pt, R2, 0x2, pt; @P0 BRA 0xb0;
/*0088*/ SSY 0xa0; IADD R0, R0, c [0x0] [0x30];
/*0090*/ @P0 BRA 0xc0; MOV R2, RZ;
/*0098*/ IADD.S R3, R3, c [0x0] [0x30]; ISETP.GT.U32.AND P0, pt, R0, 0x1, pt;
/*00a0*/ ISETP.GT.U32.AND P0, pt, R3, 0x1, pt; MOV R3, RZ;
/*00a8*/ MOV R0, RZ; @P0 EXIT;
/*00b0*/ MOV R2, RZ; IADD R4, R4, 0x1;
/*00b8*/ @P0 EXIT; IADD R2, R2, c [0x0] [0x28];
/*00c8*/ IADD.S R4, R4, 0x1; ST.E [R6], R4;
/*00d0*/ ST.E [R6], R4; BRA 0x50;
/*00d8*/ IADD R0, R0, c [0x0] [0x28]; BRA 0xd8;
/*00e0*/ BRA 0x50; NOP CC.T;
/*00e8*/ BRA 0xe8; NOP CC.T;
/*00f0*/ NOP CC.T; NOP CC.T;
/*00f8*/ NOP CC.T; NOP CC.T;
The code on the right lacks two SSY instructions, and running it causes the kernel to sit in an infinite loop, which would be consistent with some kind of SIMT correctness problem, such as undetected branch divergence or divergence around a synchronisation barrier. What is really interesting is that it hangs even when running only a single thread in a single block.
I would suggest filing a bug report on the NVIDIA registered developer site if I were you.
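For reference, the value the kernel should produce can be confirmed with a plain sequential model of the loop. The Python sketch below mirrors the kernel's control flow for a single thread, assuming threadIdx is (0, 0, 0) and blockDim is (1, 1, 1) as in the `<<<1,1>>>` launch. The function name and its parameters are hypothetical.

```python
def count_iterations(start=(0, 0, 0), block_dim=(1, 1, 1), limits=(2, 2, 2)) -> int:
    """Sequential model of the kernel's while(true) loop for one thread."""
    na, nb, nc = start
    iterations = 0
    while True:
        if na >= limits[0]:
            na = 0
            nb += block_dim[1]
            if nb >= limits[1]:
                nb = 0
                nc += block_dim[2]
                if nc >= limits[2]:
                    break          # end of loop
        iterations += 1
        na += block_dim[0]
    return iterations


if __name__ == "__main__":
    print(count_iterations())  # 2 * 2 * 2 combinations
```

Since the starting values are all zero in both builds, any correct compilation of either variant should agree with this count of 8.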
I have a shader that looks like this:
void main( in float2 pos : TEXCOORD0,
in uniform sampler2D data : TEXUNIT0,
in uniform sampler2D palette : TEXUNIT1,
in uniform float c,
in uniform float th0,
in uniform float th1,
in uniform float th2,
in uniform float4 BackGroundColor,
out float4 color : COLOR
)
{
const float4 dataValue = tex2D( data, pos );
const float vValue = dataValue.x;
const float tValue = dataValue.y;
color = BackGroundColor;
if ( tValue <= th2 )
{
if ( tValue < th1 )
{
const float vRealValue = abs( vValue - 0.5 );
if ( vRealValue > th0 )
{
// determine value and color
const float power = ( c > 0.0 ) ? vValue : ( 1.0 - vValue );
color = tex2D( palette, float2( power, 0.0 ) );
}
}
else
{
color = float4( 0.0, tValue, 0.0, 1.0 );
}
}
}
and I am compiling it like this:
cgc -profile arbfp1 -strict -O3 -q sh.cg -o sh.asm
Different versions of the Cg compiler produce different output.
cgc version 2.2.0006 compiles the shader into assembly code with 18 instructions:
!!ARBfp1.0
PARAM c[6] = { program.local[0..4],{ 0, 1, 0.5 } };
TEMP R0;
TEMP R1;
TEMP R2;
TEX R0.xy, fragment.texcoord[0], texture[0], 2D;
ADD R0.z, -R0.x, c[5].y;
CMP R0.z, -c[0].x, R0.x, R0;
MOV R0.w, c[5].x;
TEX R1, R0.zwzw, texture[1], 2D;
SLT R0.z, R0.y, c[2].x;
ADD R0.x, R0, -c[5].z;
ABS R0.w, R0.x;
SGE R0.x, c[3], R0.y;
MUL R2.x, R0, R0.z;
SLT R0.w, c[1].x, R0;
ABS R2.y, R0.z;
MUL R0.z, R2.x, R0.w;
CMP R0.w, -R2.y, c[5].x, c[5].y;
CMP R1, -R0.z, R1, c[4];
MUL R2.x, R0, R0.w;
MOV R0.xzw, c[5].xyxy;
CMP result.color, -R2.x, R0, R1;
END
# 18 instructions, 3 R-regs
cgc version 3.0.0016 compiles the shader into assembly code with 23 instructions:
!!ARBfp1.0
PARAM c[6] = { program.local[0..4], { 0, 1, 0.5 } };
TEMP R0;
TEMP R1;
TEMP R2;
TEX R0.xy, fragment.texcoord[0], texture[0], 2D;
ADD R1.y, R0.x, -c[5].z;
MOV R1.z, c[0].x;
ABS R1.y, R1;
SLT R1.z, c[5].x, R1;
SLT R1.x, R0.y, c[2];
SGE R0.z, c[3].x, R0.y;
MUL R0.w, R0.z, R1.x;
SLT R1.y, c[1].x, R1;
MUL R0.w, R0, R1.y;
ABS R1.z, R1;
CMP R1.y, -R1.z, c[5].x, c[5];
MUL R1.y, R0.w, R1;
ADD R1.z, -R0.x, c[5].y;
CMP R1.z, -R1.y, R1, R0.x;
ABS R0.x, R1;
CMP R0.x, -R0, c[5], c[5].y;
MOV R1.w, c[5].x;
TEX R1, R1.zwzw, texture[1], 2D;
CMP R1, -R0.w, R1, c[4];
MUL R2.x, R0.z, R0;
MOV R0.xzw, c[5].xyxy;
CMP result.color, -R2.x, R0, R1;
END
# 23 instructions, 3 R-regs
The strange thing is that the optimization level for Cg 3.0 doesn't seem to influence anything.
Can someone explain what is going on? Why is the optimization not working, and why is the shader longer when compiled with Cg 3.0?
Note that I removed the comments from the compiled shaders.
This might not be a real answer to the problem, but it may give some more insight. I inspected the generated assembly code a bit and converted it back to high-level code. I tried to compress it as much as possible and to remove all copies and temporaries that follow implicitly from the high-level operations. I used b variables as temporary bools and f variables as temporary floats. The first one (with the 2.2 version) is:
power = ( c > 0.0 ) ? vValue : ( 1.0 - vValue );
R1 = tex2D( palette, float2( power, 0.0 ) );
vRealValue = abs( vValue - 0.5 );
b1 = ( tValue < th1 );
b2 = ( tValue <= th2 );
b3 = b1;
b1 = b1 && b2 && ( vRealValue > th0 );
R1 = b1 ? R1 : BackGroundColor;
color = ( b2 && !b3 ) ? float4( 0.0, tValue, 0.0, 1.0 ) : R1;
and the second (with 3.0) is:
vRealValue = abs( vValue - 0.5 );
f0 = c;
b0 = ( 0 < f0 );
b1 = ( tValue < th1 );
b2 = ( tValue <= th2 );
b4 = b1 && b2 && ( vRealValue > th0 );
b0 = b0;
b3 = b1;
power = ( b4 && !b0 ) ? ( 1.0 - vValue ) : vValue;
R1 = tex2D( palette, float2( power, 0.0 ) );
R1 = b4 ? R1 : BackGroundColor;
color = ( b2 && !b3 ) ? float4( 0.0, tValue, 0.0, 1.0 ) : R1;
Most parts are essentially the same, but the second program does some unnecessary operations. It copies the c variable into a temporary instead of using it directly. It also swaps vValue and 1-vValue in the power computation, so it needs to negate b0 (resulting in one more CMP), whereas the first one does not use a temporary at all (it uses CMP directly instead of SLT and CMP). It also uses b4 in this computation, which is completely unnecessary, because when b4 is false the result of the texture access is irrelevant anyway. This results in one more && (implemented with MUL). There is also the unnecessary copy from b1 to b3 (in the first program it is necessary, but not in the second), and the extremely useless copy from b0 into itself (which is disguised as an ABS, but as the value comes from an SLT, it can only be 0.0 or 1.0, so the ABS degenerates to a MOV).
So the second program is quite similar to the first one, with just some additional but IMHO completely useless instructions. The optimizer seems to have done a worse job than in the previous(!) version. As the Cg compiler is an nVidia product (and not from some other not-to-be-named graphics company), this behaviour is really strange.
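The claim that both reconstructions compute the same result as the original shader can be checked by brute force. The Python sketch below transcribes the original shader logic and the two reconstructions above, and compares them over a small grid of inputs. The `palette` function is a hypothetical stand-in for the texture lookup, and the grid values are arbitrary.

```python
from itertools import product


def palette(power):
    """Hypothetical stand-in for tex2D(palette, float2(power, 0.0))."""
    return ("palette", power)


def original(v, t, c, th0, th1, th2, bg):
    color = bg
    if t <= th2:
        if t < th1:
            if abs(v - 0.5) > th0:
                color = palette(v if c > 0.0 else 1.0 - v)
        else:
            color = (0.0, t, 0.0, 1.0)
    return color


def cgc_2_2(v, t, c, th0, th1, th2, bg):
    r1 = palette(v if c > 0.0 else 1.0 - v)   # lookup done unconditionally
    b1 = t < th1
    b2 = t <= th2
    b3 = b1
    b1 = b1 and b2 and abs(v - 0.5) > th0
    r1 = r1 if b1 else bg
    return (0.0, t, 0.0, 1.0) if (b2 and not b3) else r1


def cgc_3_0(v, t, c, th0, th1, th2, bg):
    b0 = 0 < c
    b1 = t < th1
    b2 = t <= th2
    b4 = b1 and b2 and abs(v - 0.5) > th0
    b3 = b1
    r1 = palette((1.0 - v) if (b4 and not b0) else v)
    r1 = r1 if b4 else bg
    return (0.0, t, 0.0, 1.0) if (b2 and not b3) else r1


def all_agree() -> bool:
    grid = [0.0, 0.25, 0.5, 0.75, 1.0]
    bg = (0.1, 0.2, 0.3, 1.0)
    for v, t, c, th0, th1, th2 in product(grid, grid, [-1.0, 0.0, 1.0], grid, grid, grid):
        args = (v, t, c, th0, th1, th2, bg)
        if not (original(*args) == cgc_2_2(*args) == cgc_3_0(*args)):
            return False
    return True


if __name__ == "__main__":
    print(all_agree())
```

The check only covers the sampled grid, so it supports the reading above rather than proving equivalence for all inputs.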