Function call with more than 4 registers in ARM assembly

I am trying to pass r0-r5 into the function check. However, only registers r0-r3 appear to be copied across. In my main function I have this code:
push {lr}
mov r0, #1
mov r1, #2
mov r2, #3
mov r3, #4
mov r4, #5
mov r5, #6
bl check
pop {lr}
bx lr
Inside my check function I have this code. It is in a separate file; I am not sure whether that matters.
m: .asciz "%d, %d ~ (%d, %d, %d)\n"
...
push {lr}
ldr r0, =m
bl printf
pop {lr}
bx lr
The output for this is 2, 3 ~ (4, 33772, 1994545180). I am trying to learn assembly, so could you please explain the answer? From some googling I know I need to use the stack, but I am not sure how to use it and would like to learn how. Thanks in advance.

You could just try it and see what the compiler does:
void check ( unsigned int, unsigned int, unsigned int, unsigned int, unsigned int );
void call_check ( void )
{
check(1,2,3,4,5);
}
arm-linux-gnueabi-gcc -c -O2 check.c -o check.o
arm-linux-gnueabi-objdump -D check.o
00000000 <call_check>:
0: e52de004 push {lr} ; (str lr, [sp, #-4]!)
4: e3a03005 mov r3, #5
8: e24dd00c sub sp, sp, #12
c: e58d3000 str r3, [sp]
10: e3a00001 mov r0, #1
14: e3a01002 mov r1, #2
18: e3a02003 mov r2, #3
1c: e3a03004 mov r3, #4
20: ebfffffe bl 0 <check>
24: e28dd00c add sp, sp, #12
28: e8bd8000 ldmfd sp!, {pc}
Of course this could be hand-optimized and still work just fine. The additional 12 bytes taken off the stack pointer are probably there for alignment: together with the 4-byte push they move sp by 16 bytes, and the EABI wants the stack kept on an 8-byte (2-word, 64-bit) boundary at function calls. I don't know for certain. Other than that, you can see that you naturally need to save the link register, since you are calling another function. r0-r3 are obvious, and then per the EABI the first thing on the stack is the fifth word's worth of parameters.
Likewise, for your check function you can simply let the compiler get you started. If you look at your code, r0 comes in as your first parameter, and then you trash it by overwriting it with the first parameter for printf (the format string). printf needs 6 parameters passed in: the format string plus five values. You need to move everything over by one: the first parameter to check is the second parameter to printf, the second to check is the third to printf, and so on. So the code has to do that shift, and two of printf's parameters now end up on the stack.
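As a starting point for that, here is a minimal C sketch of a five-argument check that forwards its arguments to printf. The int return value is an addition for this sketch only (the original check is void); the format string is the one from the question.

```c
#include <stdio.h>

/* Five-argument version of check. Under the ARM EABI, a..d arrive in r0-r3
   and e arrives on the stack. Returning printf's result is an addition for
   this sketch only, so the call is easy to test. */
int check(int a, int b, int c, int d, int e)
{
    /* printf takes six arguments here: the format string goes in r0, a..c
       shift over into r1-r3, and d and e are passed on the stack. */
    return printf("%d, %d ~ (%d, %d, %d)\n", a, b, c, d, e);
}
```

Compiling this with arm-linux-gnueabi-gcc and disassembling it, as was done for call_check, should show the register shuffle and the two stack stores the compiler emits before the call to printf.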

Related

Why does moving a call function 1 byte earlier causes it to malfunction? What to modify to resolve it?

I am trying to create an inline codecave to modify a very old DOS game (ROTK2). However, far calls cannot simply be repositioned: moving the machine code for the function call 1 byte earlier causes it to malfunction. What parameters do I need to readjust to correct the problem?
To be more specific about the task, I need to create 3 bytes of space to be able to add an assembly line. Hence I need to reposition code (codecaving may not be an option due to lack of resources in the script, so I have to find the 3 bytes inline).
There are too many lines of code in the whole program, which I do not fully understand, but I am using the DOSBox debugger with breakpoints to pinpoint the relevant lines:
This part of the code prints out the year, month and season at the current point of time in the game. From address 4EE8, the two pushes of ax move the cursor to position (2,4) on the screen. The year is located at [0x44]; the month number at [0x46] is loaded into bl and then turned into a reference to the month's text, which is stored somewhere else. Then the string is printed out, and the process is repeated for the season.
...
00004EDF 0E push cs
00004EE0 E80500 call 0x4ee8
00004EE3 0E push cs
00004EE4 E85101 call 0x5038
00004EE7 CB retf
00004EE8 B80400 mov ax,0x4
00004EEB 50 push ax
00004EEC B80200 mov ax,0x2
00004EEF 50 push ax
00004EF0 9A3404EF03 call 0x3ef:0x434
00004EF5 83C404 add sp,byte +0x4
00004EF8 FF364400 push word [0x44]
00004EFC 8A1E4600 mov bl,[0x46]
00004F00 2AFF sub bh,bh
00004F02 D1E3 shl bx,1
00004F04 FFB71C36 push word [bx+0x361c]
00004F08 B8B834 mov ax,0x34b8
00004F0B 50 push ax
00004F0C 9AE806EF03 call 0x3ef:0x6e8
00004F11 83C406 add sp,byte +0x6
00004F14 B80C00 mov ax,0xc
00004F17 50 push ax
00004F18 B80500 mov ax,0x5
00004F1B 50 push ax
00004F1C 9A3404EF03 call 0x3ef:0x434
00004F21 83C404 add sp,byte +0x4
00004F24 A04600 mov al,[0x46]
00004F27 B103 mov cl,0x3
00004F29 2AE4 sub ah,ah
00004F2B F6F1 div cl
00004F2D 8AD8 mov bl,al
00004F2F 2AFF sub bh,bh
00004F31 D1E3 shl bx,1
00004F33 FFB7D634 push word [bx+0x34d6]
00004F37 B8CC34 mov ax,0x34cc
00004F3A 50 push ax
00004F3B 9AE806EF03 call 0x3ef:0x6e8
00004F40 83C404 add sp,byte +0x4
00004F43 CB retf
....
I can make the code more "concise" by doing this without breaking the code:
...
00004EDF 0E push cs
00004EE0 E80500 call 0x4ee8
00004EE3 0E push cs
00004EE4 E85101 call 0x5038
00004EE7 CB retf
00004EE8 B80400 mov ax,0x4
00004EEB 50 push ax
00004EEC D1E8 shr ax,1
00004EEE 50 push ax; just nice the previous number was double
00004EEF 90 nop
00004EF0 9A3404EF03 call 0x3ef:0x434
00004EF5 83C404 add sp,byte +0x4
00004EF8 FF364400 push word [0x44]
00004EFC 8A1E4600 mov bl,[0x46]
00004F00 2AFF sub bh,bh
00004F02 D1E3 shl bx,1
00004F04 FFB71C36 push word [bx+0x361c]
00004F08 B8B834 mov ax,0x34b8
00004F0B 50 push ax
00004F0C 9AE806EF03 call 0x3ef:0x6e8
00004F11 83C406 add sp,byte +0x6
00004F14 B80C00 mov ax,0xc
00004F17 50 push ax
00004F18 B80500 mov ax,0x5
00004F1B 50 push ax
00004F1C 9A3404EF03 call 0x3ef:0x434
00004F21 83C404 add sp,byte +0x4
00004F24 A04600 mov al,[0x46]
00004F27 B103 mov cl,0x3
00004F29 2AE4 sub ah,ah
00004F2B F6F1 div cl
00004F2D 8AD8 mov bl,al
00004F2F 2AFF sub bh,bh
00004F31 D1E3 shl bx,1
00004F33 FFB7D634 push word [bx+0x34d6]
00004F37 B8CC34 mov ax,0x34cc
00004F3A 50 push ax
00004F3B 9AE806EF03 call 0x3ef:0x6e8
00004F40 83C404 add sp,byte +0x4
00004F43 CB retf
....
But once I shift the function call one byte earlier (instead of the nop followed by call 0x3ef:0x434, I put call 0x3ef:0x434 first and then the nop), everything ceases to function.
with nop first: dosbox logger output
07D2:0000039F nop EAX:00000002 EBX:00000054 ECX:00000000 EDX:000003CF ESI:00003431 EDI:00000107 EBP:0000E424 ESP:0000E416 DS:32C3 ES:A000 FS:0000 GS:0000 SS:32C3 CF:0 ZF:0 SF:0 OF:0 AF:1 PF:0 IF:1
07D2:000003A0 call 070C:0434 EAX:00000002 EBX:00000054 ECX:00000000 EDX:000003CF ESI:00003431 EDI:00000107 EBP:0000E424 ESP:0000E416 DS:32C3 ES:A000 FS:0000 GS:0000 SS:32C3 CF:0 ZF:0 SF:0 OF:0 AF:1 PF:0 IF:1
without nop:
07D2:0000039F call 20EF:0434 EAX:00000002 EBX:00000054 ECX:00000000 EDX:000003CF ESI:00003431 EDI:00000107 EBP:0000E424 ESP:0000E416 DS:32C3 ES:A000 FS:0000 GS:0000 SS:32C3 CF:0 ZF:0 SF:0 OF:0 AF:1 PF:0 IF:1
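I have been reading about assembly, bp, sp, ss and so on, but I am still stuck. Hence my question: how do I reposition function calls without corrupting the code?
DOS doesn't load programs at a constant address, so far calls have relocation entries in the executable header. At load time, the load segment is added to the 16-bit word at each recorded offset. If you move a far call while modifying an executable, its relocation entry must be changed to point at the new position of the segment word; otherwise the fixup lands on the wrong bytes, which is why your log shows call 20EF:0434 instead of call 070C:0434.
Also, the moved call instruction may be the destination of a call or jmp instruction elsewhere, which would then need changing too.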
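The effect of a stale relocation entry can be reproduced with a small C sketch. The function names are hypothetical, and two values are assumptions inferred from the listings rather than read from the executable: the load segment delta 0x031D (0x070C - 0x03EF in the log) and a fixup pointing at the segment word of the original call.

```c
#include <stddef.h>
#include <stdint.h>

/* Apply one MZ-style segment fixup: add the load segment to the 16-bit
   little-endian word stored at image[offset]. */
void apply_fixup(uint8_t *image, size_t offset, uint16_t load_seg)
{
    uint16_t word = (uint16_t)(image[offset] | (image[offset + 1] << 8));
    word = (uint16_t)(word + load_seg);
    image[offset] = (uint8_t)(word & 0xFF);
    image[offset + 1] = (uint8_t)(word >> 8);
}

/* Read the segment operand of a far call (opcode 9A, offset word, segment
   word) whose opcode byte is at image[pos]. */
uint16_t far_call_segment(const uint8_t *image, size_t pos)
{
    return (uint16_t)(image[pos + 3] | (image[pos + 4] << 8));
}

/* Segment the far call ends up with after loading.
   shifted == 0: nop, then call (working layout);
   shifted != 0: call, then nop (call moved one byte earlier). */
uint16_t loaded_segment(int shifted)
{
    uint8_t nop_first[6]  = { 0x90, 0x9A, 0x34, 0x04, 0xEF, 0x03 };
    uint8_t call_first[6] = { 0x9A, 0x34, 0x04, 0xEF, 0x03, 0x90 };
    uint8_t *image = shifted ? call_first : nop_first;

    /* The relocation entry was not updated, so it still targets byte 4,
       where the segment word used to be. 0x031D is an assumed load segment
       inferred from the log. */
    apply_fixup(image, 4, 0x031D);
    return far_call_segment(image, shifted ? 0 : 1);
}
```

With these assumptions, loaded_segment(0) gives 0x070C and loaded_segment(1) gives 0x20EF, matching the two debugger logs: the fixup was applied to the bytes 03 90 instead of EF 03.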

Two functions/subroutines in ARM assembly language

I am stuck with an exercise of ARM.
The following program should calculate the result of 2((x-1)^2 + 1) but there is a mistake in the program that leads it into an infinite loop.
I think that I still don't understand completely subroutines and for this reason I am not seeing where the mistake is.
_start:
mov r0, #4
bl g
mov r7, #1
swi #0
f:
mul r1, r0, r0
add r0, r1, #1
mov pc, lr
g:
sub r0, r0, #1
bl f
add r0, r0, r0
mov pc, lr
The infinite loop starts in subroutine g: in the line of mov pc, lr and instead of returning to _start it goes to the previous line add r0, r0, r0 and then again to the last line of subroutine g:.
So I guess that the problem is the last line of subroutine g: but I can't find the way to return to _start without using mov pc, lr. I mean, this should be the command used when we have a branch with link.
Also, in this case r0 = 4, so the result of the program should be 20.
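This is because you don't save lr on the stack prior to calling f, so the initial return address is lost: bl f overwrites lr with the address of add r0, r0, r0. If you only have one level of subroutine calls, using lr without saving it is fine, but if you have more than one, you need to preserve the previous value of lr. In g, that means pushing lr on entry (push {lr}) and returning with pop {pc} instead of mov pc, lr.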
For example, when compiling this C example using Compiler Explorer with ARM gcc 4.56.4 (Linux), and options -mthumb -O0,
void f()
{
}
void g()
{
f();
}
void start()
{
g();
}
The generated code will be:
f():
push {r7, lr}
add r7, sp, #0
mov sp, r7
pop {r7, pc}
g():
push {r7, lr}
add r7, sp, #0
bl f()
mov sp, r7
pop {r7, pc}
start():
push {r7, lr}
add r7, sp, #0
bl g()
mov sp, r7
pop {r7, pc}
If you were running this on bare metal, not under Linux, you'd need your stack pointer to be initialized to a correct value.
Assuming you are running from RAM on a bare-metal system/simulator, you could set up a minimal stack of 128 bytes:
.text
.balign 8
_start:
adr r0, . + 128 // set top of stack at _start + 128
mov sp, r0
...
But it looks like you're writing a Linux executable that exits with the swi/r7=1 exit system call. In that case don't do that: Linux has already set up a valid stack for you, and pointing sp into your code section would make your program crash when it tries to write to the stack.
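To see why the missing save of lr produces the loop, here is a toy C model of the program from the question. It is only a sketch: the instruction numbering, the run function and the save_lr switch are illustrative assumptions, not part of any real toolchain.

```c
/* Toy model of the program: pc indexes the instructions below, and lr is a
   single register that every bl overwrites. Returns r0 at exit, or -1 if
   the step limit is hit, i.e. the program is stuck in a loop. */
int run(int save_lr)
{
    int r0 = 0, r1 = 0, lr = 0, pc = 0;
    int stack[4], sp = 0;

    for (int steps = 0; steps < 40; steps++) {
        switch (pc) {
        case 0: r0 = 4; pc = 1; break;            /* _start: mov r0, #4 */
        case 1: lr = 2; pc = 6; break;            /* bl g */
        case 2: return r0;                        /* exit system call */
        case 3: r1 = r0 * r0; pc = 4; break;      /* f: mul r1, r0, r0 */
        case 4: r0 = r1 + 1; pc = 5; break;       /* add r0, r1, #1 */
        case 5: pc = lr; break;                   /* mov pc, lr */
        case 6:                                   /* g: */
            if (save_lr) stack[sp++] = lr;        /* push {lr} (the fix) */
            r0 = r0 - 1; pc = 7; break;           /* sub r0, r0, #1 */
        case 7: lr = 8; pc = 3; break;            /* bl f overwrites lr */
        case 8: r0 = r0 + r0; pc = 9; break;      /* add r0, r0, r0 */
        case 9:                                   /* pop {pc} or mov pc, lr */
            pc = save_lr ? stack[--sp] : lr; break;
        }
    }
    return -1;
}
```

In this model run(0) never reaches the exit, because the last instruction of g jumps to the address that bl f left in lr, while run(1) returns 20, the expected value of 2((x-1)^2 + 1) for x = 4.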

Using Functions in the LC3 Assembly Language

I need to know how to write a simple function in LC3 and use it in the main program.
It's just a matter of creating a label and then jumping to it with JSR, which saves the return address in R7. Once the subroutine is done, RET returns to the main code.
.orig x3000
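AND R0, R0, #0 ; clear R0
JSR FUNCTION ; call the subroutine; the return address goes into R7
OUT ; TRAP x21: print the character in R0 (ASCII 10 is a newline)
HALT ; TRAP x25
FUNCTION
ADD R0, R0, #10 ; R0 = R0 + 10, so R0 now holds 10
RET ; return to the main code (jumps to the address in R7)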
.end

LLVM use of carry and zero flags

I'm starting to read LLVM docs and IR documentation.
In common architectures, an asm cmp instruction "result" value is -at least- 3 bits long, let's say the first bit is the SIGN flag, the second bit is the CARRY flag and the third bit is the ZERO flag.
Question 1)
Why is the result of the IR icmp instruction only an i1? (You can choose only one flag.)
Why doesn't IR define, let's call it, an icmp2 instruction returning an i3 holding the SIGN, CARRY and ZERO flags?
This i3 value could be acted upon with a switch instruction, or maybe a specific br2 instruction, like:
%result = icmp2 i32 %a, i32 %b
br2 i3 %result onzero label %EQUAL, onsign label %A_LT_B
#here %a GT %b
Question 2)
Does this make sense? Could this br2 instruction help create new optimizations, e.g. remove all jmps? Is it necessary, or are the performance gains negligible?
The reason I'm asking this, besides not being an expert in LLVM, is that in my first tests I was expecting LLVM to perform some kind of optimization to avoid making the comparison twice and also to avoid all branches by using asm conditional-move instructions.
My Tests:
I've compiled with clang-LLVM this:
#include <stdlib.h>
#include <inttypes.h>
typedef int32_t i32;
i32 compare (i32 a, i32 b){
// return (a - b) & 1;
if (a>b) return 1;
if (a<b) return -1;
return 0;
}
int main(int argc, char** args){
i32 n,i;
i32 a,b,avg;
srand(0); //fixed seed
for (i=0;i<500;i++){
for (n=0;n<1e6;n++){
a=rand();
b=rand();
avg+=compare(a,b);
}
}
return avg;
}
Output asm is:
...
mov r15d, -1
...
.LBB1_2: # Parent Loop BB1_1 Depth=1
# => This Inner Loop Header: Depth=2
call rand
mov r12d, eax
call rand
mov ecx, 1
cmp r12d, eax
jg .LBB1_4
# BB#3: # in Loop: Header=BB1_2 Depth=2
mov ecx, 0
cmovl ecx, r15d
.LBB1_4: # %compare.exit
# in Loop: Header=BB1_2 Depth=2
add ebx, ecx
...
I expected (all jmps removed in the inner loop):
mov r15d, -1
mov r13d, 1 # HAND CODED
call rand
mov r12d, eax
call rand
xor ecx,ecx # HAND CODED
cmp r12d, eax
cmovl ecx, r15d # HAND CODED
cmovg ecx, r13d # HAND CODED
add ebx, ecx
Performance difference (1s) seems to be negligible (on a VM under VirtualBox):
LLVM generated asm: 12.53s
handcoded asm: 11.53s
diff: 1s, in 500 millions iterations
Question 3)
Are my performance measurements correct? Here's the makefile and the full handcoded.compare.s
makefile:
CC=clang -mllvm --x86-asm-syntax=intel
all:
$(CC) -S -O3 compare.c
$(CC) compare.s -o compare.test
$(CC) handcoded.compare.s -o handcoded.compare.test
echo `time ./compare.test`
echo `time ./handcoded.compare.test`
echo `time ./compare.test`
echo `time ./handcoded.compare.test`
hand coded (fixed) asm:
.text
.file "handcoded.compare.c"
.globl compare
.align 16, 0x90
.type compare,#function
compare: # #compare
.cfi_startproc
# BB#0:
mov eax, 1
cmp edi, esi
jg .LBB0_2
# BB#1:
xor ecx, ecx
cmp edi, esi
mov eax, -1
cmovge eax, ecx
.LBB0_2:
ret
.Ltmp0:
.size compare, .Ltmp0-compare
.cfi_endproc
.globl main
.align 16, 0x90
.type main,#function
main: # #main
.cfi_startproc
# BB#0:
push rbp
.Ltmp1:
.cfi_def_cfa_offset 16
push r15
.Ltmp2:
.cfi_def_cfa_offset 24
push r14
.Ltmp3:
.cfi_def_cfa_offset 32
push r12
.Ltmp4:
.cfi_def_cfa_offset 40
push rbx
.Ltmp5:
.cfi_def_cfa_offset 48
.Ltmp6:
.cfi_offset rbx, -48
.Ltmp7:
.cfi_offset r12, -40
.Ltmp8:
.cfi_offset r14, -32
.Ltmp9:
.cfi_offset r15, -24
.Ltmp10:
.cfi_offset rbp, -16
xor r14d, r14d
xor edi, edi
call srand
mov r15d, -1
mov r13d, 1 # HAND CODED
# implicit-def: EBX
.align 16, 0x90
.LBB1_1: # %.preheader
# =>This Loop Header: Depth=1
# Child Loop BB1_2 Depth 2
mov ebp, 1000000
.align 16, 0x90
.LBB1_2: # Parent Loop BB1_1 Depth=1
# => This Inner Loop Header: Depth=2
call rand
mov r12d, eax
call rand
xor ecx,ecx #hand coded
cmp r12d, eax
cmovl ecx, r15d #hand coded
cmovg ecx, r13d #hand coded
add ebx, ecx
.LBB1_3:
dec ebp
jne .LBB1_2
# BB#5: # in Loop: Header=BB1_1 Depth=1
inc r14d
cmp r14d, 500
jne .LBB1_1
# BB#6:
mov eax, ebx
pop rbx
pop r12
pop r14
pop r15
pop rbp
ret
.Ltmp11:
.size main, .Ltmp11-main
.cfi_endproc
.ident "Debian clang version 3.5.0-1~exp1 (trunk) (based on LLVM 3.5.0)"
.section ".note.GNU-stack","",#progbits
Question 1: LLVM IR is machine independent. Some machines might not even have a carry flag, or even a zero flag or sign flag. The return value is i1 which suffices to indicate TRUE or FALSE. You can set the comparison condition like 'eq' and then check the result to see if the two operands are equal or not, etc.
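As a rough illustration of that point, a three-way result can still be built from single true/false comparisons without referring to any machine flags. The sketch below is in C rather than IR, and compare3 is a hypothetical name; each relational operator yields 0 or 1, much as each icmp yields an i1.

```c
/* Three-way comparison built from two boolean comparisons.
   Each relational operator yields 0 or 1, so the difference is
   1 when a > b, -1 when a < b, and 0 when they are equal. */
int compare3(int a, int b)
{
    return (a > b) - (a < b);
}
```

How the two comparisons are lowered (one cmp feeding several flag-reading instructions, conditional moves, or branches) is then left to the target-specific back end.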
Question 2: LLVM IR does not care about optimization initially. The main goal is to generate a Static Single Assignment (SSA) based representation of instructions. Optimization happens in later passes, some of which are machine independent and some machine dependent. Your br2 idea assumes that the machine supports those 3 flags, which might be a wrong assumption.
Question 3: I am not sure what you are trying to do here. Can you explain more?

Sum function in x86 assembly - no output

I am trying to write a simple sum function in x86 assembly, to which I am passing 3 and 8 as arguments. However, the code doesn't print the sum. I'd appreciate any help in spotting the errors. I'm using NASM.
section .text
global _start
_sum:
push ebp
mov ebp, esp
push edi
push esi ;prologue ends
mov eax, [ebp+8]
add eax, [ebp+12]
pop esi ;epilogue begins
pop edi
mov esp, ebp
pop ebp
ret 8
_start:
push 8
push 3
call _sum
mov edx, 1
mov ecx, eax
mov ebx, 1 ;stdout
mov eax, 4 ;write
int 0x80
mov ebx, 0
mov eax, 1 ;exit
int 0x80
To me, this looks like Linux assembler. From this page, in the Examples section, subsection int 0x80, it looks like ecx expects the address of the string:
_start:
movl $4, %eax # use the write syscall
movl $1, %ebx # write to stdout
movl $msg, %ecx # use string "Hello World"
movl $12, %edx # write 12 characters
int $0x80 # make syscall
So you'll have to get a spare chunk of memory, convert your result to a string, and then call write with the address of that string in ecx and its length in edx. Since write takes an explicit length, the string does not need to be null-terminated.
For an example of how to convert an integer to a string, see Printing an Int (or Int to String). You'll have to store each digit in a string instead of printing it; then you can print the whole string.
Sorry, I have not programmed in assembly in years, so I cannot give you a more detailed answer, but hope that this will be enough to point you in the right direction.
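The conversion step can be sketched in C first and then translated to assembly by hand. This is only an illustration: uint_to_dec is a hypothetical helper name, and it assumes a 32-bit unsigned int, so at most 10 digits.

```c
#include <stddef.h>

/* Writes the decimal digits of value into buf (no terminator) and returns
   the number of characters written. buf must hold at least 10 bytes,
   assuming a 32-bit unsigned int. */
size_t uint_to_dec(unsigned int value, char *buf)
{
    char tmp[10];
    size_t n = 0;

    /* Repeated division by 10 produces the digits in reverse order. */
    do {
        tmp[n++] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);

    /* Copy them into buf most significant digit first. */
    for (size_t i = 0; i < n; i++)
        buf[i] = tmp[n - 1 - i];
    return n;
}
```

In the assembly version, the address of the buffer would go in ecx and the returned length in edx before the write system call; for the sum of 3 and 8 that is the two characters "11".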