Using Functions in the LC3 Assembly Language - function

I need to know how to write a simple function in LC3 and use it in the main program.

It's just a matter of creating a label and then jumping to it with JSR, which saves the return address in R7. Once the subroutine is done, RET returns to the main code.
.orig x3000
AND R0, R0, #0 ; clear R0
JSR FUNCTION
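OUT ; TRAP x21: print the character in R0 (ASCII 10, a newline)
HALT ; TRAP x25
FUNCTION
ADD R0, R0, #10 ; R0 = R0 + 10 (R0 was cleared, so R0 is now 10)
RET ; return to the main code (jumps to the address JSR saved in R7)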
.end

Related

arm cortex-m33 (trustzone, silabs efm32pg22) - assembler hardfaults accessing GPIO or almost any peripherals areas, any hint?

I am lost with this code. I am trying to configure the Silicon Labs EFM32PG22 bare-metal on their devkit, accessed through the on-board J-Link from Segger Studio (a great, fast IDE). I have a blink hello-world example in C working in their Simplicity Studio, but I wanted to achieve the same thing I had done easily in pure assembler on the Microchip PIC32 MC00 and SAMD21G17D, where only the clocks and startup were configured through the GUI in MPLAB X. In the Segger IDE there is no easy way to configure startup and clocks, or I have not found it yet. At the hardware level, the registers of these Cortex beasts differ by manufacturer; in C/C++ there is some (not cheap) unification through CMSIS, but I only want to know the minimum needed to get raw GPIO working after clock/startup configuration.
The Segger project is a generic Cortex-M project for the specific EFM32PG22, so a Cortex-M33 with TrustZone security. I probably don't know everything that is locked or switched off, or what state the MCU is in (privileged or unprivileged). There are two sets of register mappings, but nothing works: as soon as I try to store to, or even load from, the GPIO config registers (or the SMU registers, to query something), a hard fault exception is thrown. All of this is done with the Segger IDE debugger over the on-board J-Link. Kindly, what am I doing wrong, and what is missing here?
In C, I have only this code:
extern void blink(void);
int main ( void )
{
blink();
}
In blink.s I have this:
;#https://github.com/hubmartin/ARM-cortex-M-bare-metal-assembler-examples/blob/master/02%20-%20Bare%20metal%20blinking%20LED/main.S
;#https://sites.google.com/site/hubmartin/arm/arm-cortex-bare-metal-assembly/02---arm-cortex-bare-metal-assembly-blinking-led
;#https://mecrisp-stellaris-folkdoc.sourceforge.io/projects/blink-f0disco-gdbtui/doc/readme.html
;#https://microcontrollerslab.com/use-gpio-pins-tm4c123g-tiva-launchpad/
;#!!! ENABLE GPIO CLOCK SOURCE ON EFM32 !!!
;#https://community.silabs.com/s/share/a5U1M000000knsWUAQ/hello-world-part-2-create-firmware-to-blink-the-led?language=en_US
;#EFM32 GPIO
;#https://www.silabs.com/documents/public/application-notes/an0012-efm32-gpio.pdf
;# ARM thumb2 ISA
;#https://www.engr.scu.edu/~dlewis/book3/docs/ARM_and_Thumb-2_Instruction_Set.pdf
;#https://sciencezero.4hv.org/index.php?title=ARM:_Cortex-M3_Thumb-2_instruction_set
;#!!! https://stackoverflow.com/questions/48561243/gnu-arm-assembler-changes-orr-into-movw
;#segger assembler
;#https://studio.segger.com/segger/UM20006_Assembler.pdf
;#https://www.segger.com/doc/UM20006_Assembler.html
;#!!! unfortunatelly, we dont know here yet how to include ASM SFR defines, nor for MPLAB ARM (Harmony) !!!
;##include <xc.h>
;##include "definitions.h"
.cpu cortex-m33
.thumb
.text
.section .text.startup.main,"ax",%progbits
.balign 2
.p2align 2,,3
.global blink
//.arch armv8-m.base
.arch armv6-m
.syntax unified
.code 16
.thumb_func
.fpu softvfp
.type blink, %function
//!!! here we have manually entered GPIO PORT defines for PIC32CM
.equ SYSCFG_BASE_ADDRESS, 0x50078000
.equ SMU_BASE_ADDRESS, 0x54008000
//.equ SMU_BASE_ADDRESS, 0x5400C000
.equ CMU_BASE_ADDRESS, 0x50008000
.equ GPIO_BASE_ADDRESS, 0x5003C000 // this differs totally from both "special" infineon and microchip "standard?" cortex devices !!!
.equ DELAY, 40000
// Vector table
.word 0x20001000 // Vector #0 - Stack pointer init value (0x20000000 is RAM address and 0x1000 is 4kB size, stack grows "downwards")
.word blink // Vector #1 - Reset vector - where the code begins
// Vector #3..#n - I don't use Systick and another interrupts right now
// so it is not necessary to define them and code can start here
blink:
LDR r0, =(SYSCFG_BASE_ADDRESS + 0x200) // SYSCFG SYSCFG_CTRL
LDR r1, =0 // 0 diable address faults exceptions
ldr r1, [r0] // Store R0 value to r1
LDR r0, =(CMU_BASE_ADDRESS) // CMU CMU_SYSCLKCTRL PCLKPRESC + CLKSEL
LDR r1, =0b10000000001 // FSRCO 20MHz + PCLK = HCLK/2 = 10MHz
STR r1, [r0, 0x70] // Store R0 value to r1
LDR r0, =(CMU_BASE_ADDRESS) // CMU CMU_CLKEN0
LDR r1, [r0, 0x64]
LDR r2, =(1 << 25) // GPIO CLK EN
orrs r1, r2 // !!! HORROR !!! -- orr is not possible in thumb2 ?? only orrs !! (width suffix)
STR r1, [r0, 0x64] // Store R0 value to r1
LDR r1, [r0, 0x68]
LDR r2, =(1 << 14) // SMU CLK EN
orrs r1, r2 // !!! HORROR !!! -- orr is not possible in thumb2 ?? only orrs !! (width suffix)
STR r1, [r0, 0x68] // Store R0 value to r1
//LDR r0, =(SMU_BASE_ADDRESS) // SMU SMU_LOCK
//LDR r1, =11325013 // SMU UNLOCK CODE
//STR r1, [r0, 0x08] //Store R0 value to r1
ldr r0, =(SMU_BASE_ADDRESS) // SMU reading values, detection - AGAIN, HARD FAULTS !!!!!!!
ldr r1, [r0, 0x04]
ldr r1, [r0, 0x20]
ldr r1, [r0, 0x40]
//LDR r0, =(GPIO_BASE_ADDRESS + 0x300) // GPIO UNLOCK
//LDR r1, =0xA534
//STR r1, [r0] // Store R0 value to r1
//!! THIS BELOW IS OLD FOR SAMD , WE STILL SIMPLY CANT ENABLE GPIO !!!!
// Enable PORTA pin 4 as output
LDR r0, =(GPIO_BASE_ADDRESS) // DIR PORTA
LDR r1, =0b00000000000001000000000000000000
STR r1, [r0, 0x04] // Store R0 value to r1
LDR R2, =1
loop:
// Write high to pin PA04
LDR r0, =GPIO_BASE_ADDRESS // OUT PORTA
LDR r1, =0b10000 // PORT_PA04
STR r1, [r0, 0x10] // Store R1 value to address pointed by R0
// Dummy counter to slow down my loop
LDR R0, =0
LDR R1, =DELAY
loop0:
ADD R0, R2
cmp R0, R1
bne loop0
// Write low to PA04
LDR r0, =GPIO_BASE_ADDRESS // OUT PORTA
LDR r1, =0b00000
STR r1, [r0, 0x10] // Store R1 value to address pointed by R0
// Dummy counter to slow down my loop
LDR R0, =0
LDR R1, =DELAY
loop1:
ADD R0, R2
cmp R0, R1
bne loop1
b loop
UPDATE: well, now I tried it again in SimplicityStudio, placing blink() call after pregenerated system init:
extern void blink(void);
int main(void)
{
// Initialize Silicon Labs device, system, service(s) and protocol stack(s).
// Note that if the kernel is present, processing task(s) will be created by
// this call.
sl_system_init();
blink();
}
with this code in blink.s. Here it works and the LED blinks:
.cpu cortex-m33
.thumb
.text
.section .text.startup.main,"ax",%progbits
.balign 2
.p2align 2,,3
.global blink
//.arch armv8-m.base
.arch armv6-m
.syntax unified
.code 16
.thumb_func
.fpu softvfp
.type blink, %function
/*
//!!! here we have manually entered GPIO PORT defines for PIC32CM
.equ SYSCFG_BASE_ADDRESS, 0x50078000
.equ SMU_BASE_ADDRESS, 0x54008000
//.equ SMU_BASE_ADDRESS, 0x5400C000
.equ CMU_BASE_ADDRESS, 0x50008000
*/
.equ GPIO_BASE_ADDRESS, 0x5003C000 // this differs totally from both "special" infineon and microchip "standard?" cortex devices !!!
.equ DELAY, 400000
// Vector table
.word 0x20001000 // Vector #0 - Stack pointer init value (0x20000000 is RAM address and 0x1000 is 4kB size, stack grows "downwards")
.word blink // Vector #1 - Reset vector - where the code begins
// Vector #3..#n - I don't use Systick and another interrupts right now
// so it is not necessary to define them and code can start here
blink:
// Enable PORTA pin 4 as output
LDR r0, =(GPIO_BASE_ADDRESS) // DIR PORTA
LDR r1, =0b00000000000001000000000000000000
STR r1, [r0, 0x04]
loop:
// Write high to pin PA04
LDR r0, =GPIO_BASE_ADDRESS // OUT PORTA
LDR r1, =0b10000 // PORT_PA04
STR r1, [r0, 0x10]
// Dummy counter to slow down my loop
LDR R0, =0
LDR R1, =DELAY
loop0:
ADD R0, R2
cmp R0, R1
bne loop0
// Write low to PA04
LDR r0, =GPIO_BASE_ADDRESS // OUT PORTA
LDR r1, =0b00000
STR r1, [r0, 0x10]
// Dummy counter to slow down my loop
LDR R0, =0
LDR R1, =DELAY
loop1:
ADD R0, R2
cmp R0, R1
bne loop1
b loop
So now I am just curious: what is missing in the pure assembly code to bring the Cortex-M33 into some "easy" state, ignoring TrustZone, so that it can be used much like, say, a plain Cortex-M3?
Can anybody help? I am digging deeply into this datasheet/reference manual, but no luck so far:
https://www.silabs.com/documents/public/reference-manuals/efm32pg22-rm.pdf
UPDATE AGAIN: I will try to figure it out. Traversing the system init C code makes it clear what is going on. There are also some chip errata workarounds, but I never touched the DCDC while initializing, which may be the culprit:
void sl_platform_init(void)
{
CHIP_Init();
sl_device_init_nvic();
sl_board_preinit();
sl_device_init_dcdc();
sl_device_init_hfxo();
sl_device_init_lfxo();
sl_device_init_clocks();
sl_device_init_emu();
sl_board_init();
}
Well, okay: manufacturer-specific code generation for MCU startup really is an important and useful thing. MCUs from different manufacturers differ so much at the register level (even though they are all Cortex-M based) that it is not worth configuring them manually in assembly if enough flash is available, and it mostly is. So far I have had no luck doing this properly for a specific part with the "generic" ARM/Cortex IDEs (Segger, Keil, IAR), so using the manufacturer's IDE to configure startup, clocks and peripherals (mostly graphically) is crucial, or at least by far the easiest way. (I know, quite an expensive observation after all the assembly attempts.) After that, it is easy to write even a pure-assembly "blink" hello-world test and call it as an extern C function.
You may ask why I still consider assembly when there are at least the CMSIS "platform abstraction layer" C headers on ARM. (They do not really help with abstraction, since the devices are still very different; you only get register symbols as #defines, typedefs and enums that make things easier in C.) I am trying to compare C-compiled code with handwritten assembly for a specific purpose that needs an algorithm optimized from scratch, and it is often easier to design it directly in assembly than to rely on elaborately described C compiler optimizations. Each compiler has its own long document on how its optimizations work, and at this level C is simply too abstract and too much of a moving target, all the more so when you write for several MCU architectures (think ARM Cortex-M, PIC32/MIPS, or even PIC16/18, PIC24, AVR, MSP430 ...). A general algorithm can be described in shared pseudo-assembly, as close to the hardware as possible, without knowing all the optimization quirks of each architecture's C compilers, of which there are often several.
So you can compare C-compiler-generated code with handwritten assembly. I have already tried such an assembly blink on many very different architectures, in each case using the manufacturer's IDE to generate the startup in C, with all the GUI configuration and code generation, down to an empty C project that always compiles; the output code size of such generated startups varies widely, of course. The most advanced MCUs are really complex, mostly in clock configuration and pin function configuration, and then in the various peripherals too. Similarities exist only within a single manufacturer, to some extent, so MCUs from one manufacturer often share a similar approach. The final solution is therefore to have the startup generated and then switch to assembly immediately; this is feasible. With a small flash it is possible to further optimize even the startup code, but that mostly matters on the smallest 8-bit parts, where startup is quite easy anyway or the generated code is small as well.

Two functions/subroutines in ARM assembly language

I am stuck on an ARM exercise.
The following program should calculate the result of 2((x-1)^2 + 1), but there is a mistake in the program that leads it into an infinite loop.
I think that I still don't completely understand subroutines, and for this reason I can't see where the mistake is.
_start:
mov r0, #4
bl g
mov r7, #1
swi #0
f:
mul r1, r0, r0
add r0, r1, #1
mov pc, lr
g:
sub r0, r0, #1
bl f
add r0, r0, r0
mov pc, lr
The infinite loop starts in subroutine g: in the line of mov pc, lr and instead of returning to _start it goes to the previous line add r0, r0, r0 and then again to the last line of subroutine g:.
So I guess that the problem is the last line of subroutine g: but I can't find the way to return to _start without using mov pc, lr. I mean, this should be the command used when we have a branch with link.
Also, in this case r0 = 4, so the result of the program should be 20.
This is because you don't save lr on the stack before calling f, so the initial return address is lost: bl f overwrites lr with the address of add r0, r0, r0. If you only have one level of subroutine calls, using lr without saving it is fine, but if you have more than one, you need to preserve the previous value of lr.
For example, when compiling this C example using Compiler Explorer with ARM gcc 4.56.4 (Linux), and options -mthumb -O0,
void f()
{
}
void g()
{
f();
}
void start()
{
g();
}
The generated code will be:
f():
push {r7, lr}
add r7, sp, #0
mov sp, r7
pop {r7, pc}
g():
push {r7, lr}
add r7, sp, #0
bl f()
mov sp, r7
pop {r7, pc}
start():
push {r7, lr}
add r7, sp, #0
bl g()
mov sp, r7
pop {r7, pc}
If you were running this on bare metal, not under Linux, you'd need your stack pointer to be initialized to a correct value.
Assuming you are running from RAM on a bare-metal system/simulator, you could set up a minimal stack of 128 bytes:
.text
.balign 8
_start:
adr r0, . + 128 // set top of stack at _start + 128
mov sp, r0
...
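But it looks like you're writing a Linux executable that exits with the swi/r7=1 exit system call. In that case, don't set sp yourself: Linux has already set up a valid stack for the process, and pointing sp into the code section as above would make your program crash when it tries to write to the stack.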
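To see the effect without a debugger, here is a toy C model of the program from the question. It is only a sketch, not real ARM emulation: each `case` stands for one instruction, and `run_program`, `save_lr` and the four-entry stack are names made up for this illustration. With `save_lr` set, g behaves as if it started with push {lr} and ended with pop {pc}, which is the fix described above.

```c
#include <assert.h>

/* Toy model of the program in the question: one instruction per pc value.
   save_lr = 0 reproduces the original g; save_lr = 1 models a g that
   pushes lr on entry and pops it into pc on exit (the assumed fix).
   Returns 1 and stores r0 in *result if the exit call is reached,
   or 0 if max_steps instructions run without reaching it. */
int run_program(int save_lr, int max_steps, unsigned *result)
{
    unsigned r0 = 0, r1 = 0;          /* unsigned: wraps like a register */
    int lr = 0, pc = 0;
    int stack[4], sp = 0;

    for (int step = 0; step < max_steps; step++) {
        switch (pc) {
        case 0: r0 = 4;        pc = 1; break;  /* _start: mov r0, #4  */
        case 1: lr = 2;        pc = 6; break;  /* bl g                */
        case 2: *result = r0;  return 1;       /* mov r7, #1 ; swi #0 */
        case 3: r1 = r0 * r0;  pc = 4; break;  /* f: mul r1, r0, r0   */
        case 4: r0 = r1 + 1;   pc = 5; break;  /* add r0, r1, #1      */
        case 5: pc = lr;               break;  /* mov pc, lr          */
        case 6:                                /* g: [push {lr}] sub r0, r0, #1 */
            if (save_lr) { assert(sp < 4); stack[sp++] = lr; }
            r0 = r0 - 1;       pc = 7; break;
        case 7: lr = 8;        pc = 3; break;  /* bl f (overwrites lr) */
        case 8: r0 = r0 + r0;  pc = 9; break;  /* add r0, r0, r0      */
        case 9:                                /* mov pc, lr  or  pop {pc} */
            pc = save_lr ? stack[--sp] : lr;
            break;
        default: return 0;
        }
    }
    return 0;   /* step limit hit: the program never reached the exit call */
}
```

Calling run_program(0, 100, &r) returns 0, because the model keeps bouncing between add r0, r0, r0 and mov pc, lr, exactly as described in the question. Calling run_program(1, 100, &r) returns 1 with r equal to 20.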

MASM how to make desired function call

I'd like to know how to do the following.
I have an array whose numbers I have to sum (easy),
but the twist is that I have to use a function call for it
that gets its parameters through specific registers. How do I implement that?
In this case, the function needs to get the array (offset) through ESI and its length through ECX.
Please educate me.
EDIT:
In the meantime I've conjured up this. No idea if it works, as my MASM compiling just broke itself for no reason.
.data
intarray DWORD 10000h,20000h,30000h,40000h
.code
szummer proc uses esi ecx,
ptrArray:PTR DWORD, ;points to the array
szArray: Dword ;array size
mov esi, ptrArray ;address of the array
mov ecx, szArray ;szize
mov eax, 0 ;set to 0
AS1:
add eax, [esi] ;add each int to sum
add esi, 4 ;point to next int
loop AS1 ;reapet for array size
ret;
szummer endp
main proc
mov ecx, OFFSET intarray
mov esi, LENGHTOF intarray
INVOKE ArraySum,ecx,esi
invoke ExitProcess,0
main endp
end main
The MASM directive INVOKE works only with the calling conventions C (cdecl), STDCALL, BASIC, FORTRAN and PASCAL. All of these conventions pass the arguments on the stack. Thus, you can't use INVOKE for passing the arguments in registers. You can use the Assembly instruction CALL instead. Your program - slightly modified ;-) - with MASM32 library included (because of "ExitProcess"):
INCLUDE \masm32\include\masm32rt.inc
.DATA
intarray DWORD 10000h,20000h,30000h,40000h
.CODE
szummer proc uses esi ecx
mov eax, 0 ;set to 0
AS1:
add eax, [esi] ;add each int to sum
add esi, 4 ;point to next int
loop AS1 ;repeat ECX times (ECX must not be 0 on entry)
ret;
szummer endp
main proc
mov esi, OFFSET intarray
mov ecx, LENGTHOF intarray
call szummer
invoke ExitProcess,0
main ENDP
END main
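As a cross-check of the value that should end up in EAX, here is a minimal C counterpart of szummer. The function and parameter names are made up for this sketch: `values` plays the role of ESI and `count` the role of ECX. One difference is worth knowing: the C loop simply does nothing for a count of 0, whereas LOOP with ECX = 0 would wrap around and iterate 2^32 times.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* C counterpart of szummer (illustrative names): `values` corresponds to
   ESI, `count` to ECX, and the return value to EAX. */
uint32_t sum_array(const uint32_t *values, size_t count)
{
    assert(values != NULL || count == 0);

    uint32_t sum = 0;                    /* mov eax, 0               */
    for (size_t i = 0; i < count; i++) {
        sum += values[i];                /* add eax, [esi]; add esi, 4 */
    }
    return sum;                          /* result is left in EAX    */
}
```

For the array in the answer (10000h, 20000h, 30000h, 40000h) the sum is A0000h.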

Function call with more than 4 registers ARM assembly

I am trying to pass r0-r5 into the function check. However, only the registers r0-r3 seem to be passed through. In my main function I have this code.
push {lr}
mov r0, #1
mov r1, #2
mov r2, #3
mov r3, #4
mov r4, #5
mov r5, #6
bl check
pop {lr}
bx lr
Inside my check function I have this code. It is in a separate file, in case that matters.
m: .asciz "%d, %d ~ (%d, %d, %d)\n"
...
push {lr}
ldr r0, =m
bl printf
pop {lr}
bx lr
The output for this is 2, 3 ~ (4, 33772, 1994545180). I am trying to learn assembly, so can you please explain the answer? From some googling I know I need to use the stack, but I am not sure how to use it and would like to learn how. Thanks in advance.
You could just try it and see what the compiler does:
void check ( unsigned int, unsigned int, unsigned int, unsigned int, unsigned int );
void call_check ( void )
{
check(1,2,3,4,5);
}
arm-linux-gnueabi-gcc -c -O2 check.c -o check.o
arm-linux-gnueabi-objdump -D check.o
00000000 <call_check>:
0: e52de004 push {lr} ; (str lr, [sp, #-4]!)
4: e3a03005 mov r3, #5
8: e24dd00c sub sp, sp, #12
c: e58d3000 str r3, [sp]
10: e3a00001 mov r0, #1
14: e3a01002 mov r1, #2
18: e3a02003 mov r2, #3
1c: e3a03004 mov r3, #4
20: ebfffffe bl 0 <check>
24: e28dd00c add sp, sp, #12
28: e8bd8000 ldmfd sp!, {pc}
Of course this could be hand-optimized and still work just fine. The additional 12-byte adjustment of the stack pointer is probably there to keep the stack aligned: the EABI requires 8-byte alignment at public interfaces, and the 4 bytes for lr plus 12 give 16. Other than that, you can see that you need to save the link register, since you are calling another function. r0-r3 are obvious, and per the EABI the first thing on the stack is the fifth word of parameters.
Likewise, for your check function you can simply let the compiler get you started. If you look at your code, r0 comes in as your first parameter, and then you trash it by loading the first parameter for printf into it. printf needs six parameters: the format string plus five values. You have to move everything over by one: the first parameter to check is the second parameter to printf, the second to check is the third to printf, and so on. So the code has to do that shift, and two of the printf parameters now end up on the stack.
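A minimal C version of check to feed to the compiler could look like the sketch below. It assumes check takes five integers (matching the five %d in the format string) and that the format string ends with a newline; returning printf's result is an addition made here only so the call can be verified.

```c
#include <stdio.h>

/* Format string from the question; the trailing newline is an assumption. */
static const char m[] = "%d, %d ~ (%d, %d, %d)\n";

/* Hypothetical C version of check. On entry a..d arrive in r0-r3 and e on
   the stack. For the printf call, m takes the first slot (r0), a..c move
   to r1-r3, and d and e are passed on the stack. */
int check(int a, int b, int c, int d, int e)
{
    return printf(m, a, b, c, d, e);   /* number of characters printed */
}
```

Compiling this with the same arm-linux-gnueabi-gcc -c -O2 and objdump -D commands as above shows the register shuffle and the stack stores that the handwritten check has to reproduce.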

For CUDA, is there a guarantee that Ternary Operator can avoid branch divergence?

I have read a lot of threads about CUDA branch divergence, telling me that using ternary operator is better than if/else statements, because ternary operator doesn't result in branch divergence.
I wonder, for the following code:
foo = (a > b) ? (bar(a)) : (b);
where bar is another function or some more complicated statements, is it still true that there is no branch divergence?
I don't know what sources you consulted, but with the CUDA toolchain there is no noticeable performance difference between the use of the ternary operator and the equivalent if-then-else sequence in most cases. In the case where such differences are noticed, they are due to second order effects in the code generation, and the code based on if-then-else sequence may well be faster in my experience. In essence, ternary operators and tightly localized branching are treated in much the same way. There can be no guarantees that a ternary operator may not be translated into machine code containing a branch.
The GPU hardware offers multiple mechanisms that help avoid branches, and the CUDA compiler makes good use of these mechanisms to minimize branches. One is predication, which can be applied to pretty much any instruction. The other is support for select-type instructions, which are essentially the hardware equivalent of the ternary operator. The compiler uses if-conversion to translate short branches into branch-less code sequences. Often, it chooses a combination of predicated code and a uniform branch. In cases of non-divergent control flow (all threads in a warp take the same branch) the uniform branch skips over the predicated code section.
Except in cases of extreme performance optimization, CUDA can (and should) be written in natural idioms that are clear and appropriate to the task at hand, using either if-then-else sequences or ternary operators as you see fit. The compiler will take care of the rest.
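For reference, these are the two source forms being compared, written as plain C so they compile on the host. This is only a sketch: `bar` is a made-up stand-in for "another function", and in a CUDA build both functions would be marked `__device__`, after which the generated machine code can be inspected with cuobjdump -sass to see whether a branch, predication or a select instruction was emitted.

```c
/* Hypothetical stand-in for the "bar" of the question. */
static int bar(int a)
{
    return 2 * a + 1;
}

/* Ternary form, as in the question. */
int select_ternary(int a, int b)
{
    return (a > b) ? bar(a) : b;
}

/* Equivalent if-then-else form. */
int select_if_else(int a, int b)
{
    int foo;
    if (a > b) {
        foo = bar(a);
    } else {
        foo = b;
    }
    return foo;
}
```

Both functions have identical semantics, so the compiler is free to generate the same code for either one; which form ends up faster is a code generation detail, not a language guarantee.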
(I would like to add this as a comment on @njuffa's answer, but I don't have enough reputation.)
I found a performance difference between them in my program.
The if-clause style costs 4.78ms:
// fin is {0-4}, range_limit = 5
if(fin >= range_limit){
res_set = res_set^1;
fx_ref = fx + (fxw*(float)DEPTH_M_H/(float)DEPTH_BLOCK_X);
fin = 0;
}
// then branch for next loop iteration.
// nvvp report these assemblies.
@!P1 LOP32I.XOR R48, R48, 0x1;
@!P1 FMUL.FTZ R39, R7, 14;
@!P1 MOV R0, RZ;
MOV R40, R48;
{ @!P1 FFMA.FTZ R6, R39, c[0x2][0x0], R5;
@!P0 BRA `(.L_35); } // the predicate is also used for the loop's branch
And the ternary style costs 4.46ms:
res_set = (fin < range_limit) ? res_set: (res_set ^1);
fx_ref = (fin < range_limit) ? fx_ref : fx + (fxw*(float)DEPTH_M_H/(float)DEPTH_BLOCK_X) ;
fin = (fin < range_limit) ? fin:0;
//comments are where nvvp mark the instructions are for the particular code line
ISETP.GE.AND P2, PT, R34.reuse, c[0x0][0x160], PT; //res_set
FADD.FTZ R27, -R25, R4;
ISETP.LT.AND P0, PT, R34, c[0x0][0x160], PT; //fx_ref
IADD32I R10, R10, 0x1;
SHL R0, R9, 0x2;
SEL R4, R4, R27, P1;
ISETP.LT.AND P1, PT, R10, 0x5, PT;
IADD R33, R0, R26;
{ SEL R0, RZ, 0x1, !P2;
STS [R33], R58; }
{ FADD.FTZ R3, R3, 74.75;
STS [R33+0x8], R29; }
{ @!P0 FMUL.FTZ R28, R4, 14; //fx_ref
STS [R33+0x10], R30; }
{ IADD32I R24, R24, 0x1;
STS [R33+0x18], R31; }
{ LOP.XOR R9, R0, R9; //res_set
STS [R33+0x20], R32; }
{ SEL R0, R34, RZ, P0; //fin
STS [R33+0x28], R36; }
{ @!P0 FFMA.FTZ R2, R28, c[0x2][0x0], R3; //fx_ref
The interleaved lines are from the calculation for the next loop iteration.
I think that when many instructions share the same predicate value, the ternary style may provide more opportunity for ILP optimization.