Is there a way to run threads in USER mode on Azure RTOS (ThreadX)?

I have been experimenting with Azure RTOS (ThreadX) and am trying to port it to a Cortex-R5 based system. After looking at the port files, it appears that the OS runs threads in Supervisor (SVC) mode.
For example, when the function _tx_thread_stack_build builds the stack for a thread, the initial value it stores for the CPSR has its mode bits set to SVC mode. This value is later loaded into the CPSR before jumping to the thread entry function.
The following snippet from _tx_thread_stack_build (see the file tx_thread_stack_build.S) stores the initial CPSR value on the thread's stack.
.global _tx_thread_stack_build
.type _tx_thread_stack_build,function
_tx_thread_stack_build:
@ Stack Bottom: (higher memory address)
@
...
MRS r1, CPSR @ Pickup CPSR
BIC r1, r1, #CPSR_MASK @ Mask mode bits of CPSR
ORR r3, r1, #SVC_MODE @ Build CPSR, SVC mode, interrupts enabled
STR r3, [r2, #4] @ Store initial CPSR
...
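To make the mode-bit manipulation concrete, here is a minimal Python sketch of what the BIC/ORR pair computes. The mode encodings are the architectural ARM values. MODE_MASK = 0x1F is an assumption made for illustration (the port's actual CPSR_MASK may also clear interrupt-disable bits), and build_initial_cpsr and mode_name are hypothetical helper names, not part of ThreadX.

```python
# ARM processor mode encodings (CPSR bits [4:0]).
USR_MODE = 0x10
FIQ_MODE = 0x11
IRQ_MODE = 0x12
SVC_MODE = 0x13
SYS_MODE = 0x1F

# Assumption: clear only the five mode bits. The port's CPSR_MASK may differ.
MODE_MASK = 0x1F

MODE_NAMES = {
    USR_MODE: "USR",
    FIQ_MODE: "FIQ",
    IRQ_MODE: "IRQ",
    SVC_MODE: "SVC",
    SYS_MODE: "SYS",
}


def build_initial_cpsr(current_cpsr, mode):
    """Mirror of 'BIC r1, r1, #mask' followed by 'ORR r3, r1, #mode'."""
    cleared = current_cpsr & ~MODE_MASK & 0xFFFFFFFF
    return cleared | mode


def mode_name(cpsr):
    """Decode the mode field of a CPSR value."""
    return MODE_NAMES.get(cpsr & MODE_MASK, "other")


# Example: the same starting CPSR rebuilt for SVC mode and for USR mode.
svc_cpsr = build_initial_cpsr(0x600001D2, SVC_MODE)
usr_cpsr = build_initial_cpsr(0x600001D2, USR_MODE)
print(hex(svc_cpsr), mode_name(svc_cpsr))
print(hex(usr_cpsr), mode_name(usr_cpsr))
```

Changing the stored mode bits is only the bit-level part of the story; the rest of the port would also have to stop assuming SVC mode.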
As another example, the code in tx_thread_context_restore.s switches from IRQ mode to SVC mode to save the context of the thread being switched out, which indicates that the OS assumes the thread was running in SVC mode.
The following snippet from that file saves the context of the thread being switched out.
LDMIA sp!, {r3, r10, r12, lr} ; Recover temporarily saved registers
MOV r1, lr ; Save lr (point of interrupt)
MOV r2, #SVC_MODE ; Build SVC mode CPSR
MSR CPSR_c, r2 ; Enter SVC mode
STR r1, [sp, #-4]! ; Save point of interrupt
STMDB sp!, {r4-r12, lr} ; Save upper half of registers
MOV r4, r3 ; Save SPSR in r4
MOV r2, #IRQ_MODE ; Build IRQ mode CPSR
MSR CPSR_c, r2 ; Enter IRQ mode
LDMIA sp!, {r0-r3} ; Recover r0-r3
MOV r5, #SVC_MODE ; Build SVC mode CPSR
MSR CPSR_c, r5 ; Enter SVC mode
STMDB sp!, {r0-r3} ; Save r0-r3 on thread's stack
This leads to my question: is there a way to run threads in USER mode? In most operating systems, threads run in USER mode while the kernel and its services run in SVC mode, which does not seem to be the case with Azure RTOS.

This is by design. ThreadX is a small monolithic kernel in which application code is tightly integrated with the kernel and runs in the same address space and processor mode. This gives better performance and a smaller footprint. You can also use ThreadX Modules, which use the available MPU or MMU to separate kernel and user code into different modes and provide additional protection, at the cost of a small performance and footprint penalty.


Which GPU execution dependencies have fixed latency (causing 'Wait' stalls)?

With recent NVIDIA micro-architectures, there's a new (?) taxonomy of warp stall reasons / warp scheduler states. One of these is:
Wait : Warp was stalled waiting on a fixed latency execution dependency.
As @GregSmith explains, fixed-latency instructions are: "Math, bitwise [and] register movement". But what are fixed-latency "execution dependencies"? Are these just "waiting for somebody else's fixed-latency instruction to conclude before we can issue it ourselves"?
Execution dependencies are dependencies that need to be resolved before the next instruction can be issued. These include register operands and predicates. The WAIT stall reason will be issued between instructions that have fixed latency. The compiler can choose to add additional waits between instructions to the same pipeline if the pipeline issue frequency is not 1 warp per cycle (e.g. FMA and ALU pipe can issue every other cycle on GV100 - GA100).
EXAMPLE 1 - No dependencies - compiler added waits
IADD R0, R1, R2; # R0 = R1 + R2
// stall = wait for 1 additional cycle
IADD R4, R5, R6; # R4 = R5 + R6
// stall = wait for 1 additional cycle
IADD R8, R9, R10; # R8 = R9 + R10
If the compiler did not add wait cycles then the stall reason would be math_throttle. This can also show up if the warp is ready to issue the instruction (all dependencies resolved) and another warp is issuing an instruction to the target pipeline.
EXAMPLE 2 - Wait stalls due to read after write dependency
IADD R0, R1, R2; # R0 = R1 + R2
// stall - wait for fixed number of cycles to clear read after write
IADD R0, R0, R3; # R0 += R3
// stall - wait for fixed number of cycles to clear read after write
IADD R0, R0, R4; # R0 += R4
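The two examples can be contrasted with a toy issue model. This is only a sketch under stated assumptions: the latency of 4 cycles and the issue interval of 2 cycles are placeholder numbers chosen for illustration, not measured hardware values, and schedule is a hypothetical helper that models a single warp issuing to one pipe in program order.

```python
def schedule(instrs, latency=4, issue_interval=2):
    """Return the issue cycle of each instruction for one warp.

    instrs: list of (dest_register, source_registers) tuples.
    latency: assumed fixed cycles until a result can be read.
    issue_interval: assumed minimum cycles between issues to the same pipe.
    """
    ready_at = {}    # register -> first cycle its new value is readable
    issue_cycles = []
    next_issue = 0   # earliest cycle the pipe accepts another instruction
    for dest, sources in instrs:
        # A source that was never written by this sequence is ready at cycle 0.
        dep_ready = max((ready_at.get(r, 0) for r in sources), default=0)
        cycle = max(next_issue, dep_ready)
        issue_cycles.append(cycle)
        ready_at[dest] = cycle + latency
        next_issue = cycle + issue_interval
    return issue_cycles


# EXAMPLE 1: independent adds, limited only by the pipe issue interval.
independent = [("R0", ("R1", "R2")), ("R4", ("R5", "R6")), ("R8", ("R9", "R10"))]
# EXAMPLE 2: each add reads the previous result (read after write).
dependent = [("R0", ("R1", "R2")), ("R0", ("R0", "R3")), ("R0", ("R0", "R4"))]

print(schedule(independent))
print(schedule(dependent))
```

In this model the gaps between issue cycles are the cycles that would be reported as wait stalls: short ones from the issue interval in the first case, longer ones from the fixed result latency in the second.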

Combined format of SASS instructions

I haven't seen a CUDA document that describes the combined form of SASS instructions. For example, I know what IADD and IMAD are, but
IMAD.IADD R8, R8, 0x1, R7 ;
is not clear. Which operand belongs to which opcode? How is it executed? Moreover, are we dealing with one ADD and one MAD, meaning two ADDs and one MUL? Or is it considered a single MAD, meaning one ADD and one MUL?
What about IMAD.MOV.U32 R5, RZ, RZ, 0x0 ;? How is that interpreted?
The Volta and Turing architectures have two primary execution pipes.
FMA pipe is responsible for FFMA, FMUL, FADD, FSWZADD, and IMAD instructions.
ALU pipe is responsible for integer (except IMAD), bit manipulation, logical, and data movement instructions.
The ALU pipe executes MOV and IADD3.
The FMA pipe executes IMAD including variants IMAD.IADD and IMAD.MOV.
Using IMAD to emulate IADD and MOV allows the compiler to explicitly schedule instructions to FMA pipe instead of the ALU pipe.
What's clear from compiler output is that the compiler is emulating binary integer add and raw moves with IMAD, which generalizes both. The suffix is just the disassembler being helpful: it matches the pattern and tells you the operation is semantically equivalent to a simpler one. The IMAD.* sequences cleverly use RZ (the zero register), 0x0 and 0x1 to accomplish this. When the disassembler sees such a pattern, it adds the .MOV op suffix to say, "Hey, this is just a simple move."
E.g.
IMAD.IADD R8, R8, 0x1, R7
is:
R8 = 1*R8 + R7 = R8 + R7
IADD R8, R8, R7
(If IADD existed.)
Similarly for the MOV case, you see that it's using RZ. It's emulating the following.
MOV R5, 0x0
There is a MOV op in Volta, but I almost never see it.
(There's also a left-shift-by-K version IMAD.SHL I think, which uses a multiplier of 2^K where K is the shift amount.)
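The equivalences described above can be checked with a small Python model of a 32-bit multiply-add. This is a sketch of the arithmetic only, not of the hardware; the function names are hypothetical, and the shift variant follows the answer's own tentative description.

```python
MASK32 = 0xFFFFFFFF
RZ = 0  # the zero register always reads as 0


def imad(a, b, c):
    """32-bit integer multiply-add: (a * b + c) modulo 2**32."""
    return (a * b + c) & MASK32


def imad_iadd(a, c):
    """IMAD.IADD Rd, Ra, 0x1, Rc  ->  Rd = Ra * 1 + Rc."""
    return imad(a, 1, c)


def imad_mov(value):
    """IMAD.MOV Rd, RZ, RZ, value  ->  Rd = 0 * 0 + value."""
    return imad(RZ, RZ, value)


def imad_shl(a, k):
    """Shift-left form: multiply by 2**k and add zero."""
    return imad(a, 1 << k, RZ)


print(imad_iadd(5, 7))      # behaves like an add
print(imad_mov(0x0))        # behaves like a move of an immediate
print(imad_shl(3, 4))       # behaves like 3 << 4
```

In every case there is a single multiply-add; the suffix only names the simpler operation that the particular operand choice reduces to.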

When #GP is raised from v8086 mode does the processor push an error code on the ring0 stack?

More broadly the question really is - when an exception is generated in v8086 mode that is propagated to a protected-mode interrupt/trap gate, does an error code get pushed onto the stack after the return address is pushed for those exceptions with an error code?
Say for instance I am running in V8086 mode (CPL=3, VM=1, PE=1) with an IOPL of 0. I would expect that the privileged instruction HLT should raise a #GP exception. NASM code could look something like:
bits 32
xor ebx, ebx ; EBX=0
push ebx ; Real mode GS=0
push ebx ; Real mode FS=0
push ebx ; Real mode DS=0
push ebx ; Real mode ES=0
push V86_STACK_SEG
push V86_STACK_OFS ; v8086 stack SS:SP (grows down from SS:SP)
push dword 1<<EFLAGS_VM_BIT | 1<<EFLAGS_BIT1
; Set VM Bit, IF bit is off, DF=0(forward direction),
; IOPL=0, Reserved bit (bit 1) always 1. Everything
; else 0. These flags will be loaded in the v8086 mode
; during the IRET. We don't want interrupts enabled
; because we don't have a proper v86 monitor
; GPF handler to process them.
push V86_CS_SEG ; Real Mode CS (segment)
push v86_mode_entry ; Entry point (offset)
iret ; Transfer control to v8086 mode and our real mode code
bits 16
v86_mode_entry:
hlt ; This should raise a #GP exception
When the protected-mode #GP exception handler starts running I want to know if an error code is pushed on the stack after CS:EIP.
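For reference, the EFLAGS image pushed before the IRET can be reproduced in a few lines of Python. The bit positions are the architectural ones (bit 1 reserved, IF at bit 9, IOPL at bits 12-13, VM at bit 17); v86_entry_eflags and describe_eflags are hypothetical helper names used only for this sketch.

```python
EFLAGS_BIT1 = 1       # reserved, always 1
EFLAGS_IF_BIT = 9     # interrupt enable flag
EFLAGS_IOPL_SHIFT = 12  # IOPL occupies bits 12-13
EFLAGS_VM_BIT = 17    # virtual-8086 mode flag


def v86_entry_eflags(iopl=0, interrupts=False):
    """Build the EFLAGS image used to enter v8086 mode via IRET."""
    value = (1 << EFLAGS_VM_BIT) | (1 << EFLAGS_BIT1)
    value |= (iopl & 0x3) << EFLAGS_IOPL_SHIFT
    if interrupts:
        value |= 1 << EFLAGS_IF_BIT
    return value


def describe_eflags(eflags):
    """Decode the fields that matter for the v8086 test."""
    return {
        "vm": bool(eflags & (1 << EFLAGS_VM_BIT)),
        "iopl": (eflags >> EFLAGS_IOPL_SHIFT) & 0x3,
        "interrupts": bool(eflags & (1 << EFLAGS_IF_BIT)),
    }


flags = v86_entry_eflags()  # same value as 1<<EFLAGS_VM_BIT | 1<<EFLAGS_BIT1
print(hex(flags), describe_eflags(flags))
```

With IOPL=0 and VM=1, HLT is expected to fault into the protected-mode #GP handler, which is the situation the question asks about.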
One may say RTFM but the Intel documentation is the source of confusion.
Reason for the Question
Intel documents the exceptions and error codes in the Intel® 64 and IA-32 Architectures Software Developer’s Manual Vol 3A Table 6-2:
From the table #DF, #TS, #NP, #SS, #GP, #PF, and #AC have error codes. Intel documents that in real-address mode error codes are not pushed on the stack, but it seems to be suggested that in all other legacy modes (16/32-bit protected mode and v8086 mode) and long mode (64-bit and 16/32-bit compatibility modes) that an error code is pushed.
In Volume 2A in the instruction set reference for INT n/INTO/INT3/INT1—Call to Interrupt Procedure it says in the pseudo-code of those instructions the state REAL_ADDRESS_MODE has these items pushed:
Push(CS);
Push(IP);
(* No error codes are pushed in real-address mode*)
CS ← IDT(Descriptor (vector_number « 2), selector));
EIP ← IDT(Descriptor (vector_number « 2), offset)); (* 16 bit offset AND 0000FFFFH *)
Intel has gone out of their way to make it quite clear in real-address mode - error codes don't apply.
In the instruction set reference for INT n/INTO/INT3/INT1 (Call to Interrupt Procedure), the pseudo-code defines the mechanics of the INTER-PRIVILEGE-LEVEL-INTERRUPT and INTRA-PRIVILEGE-LEVEL-INTERRUPT states. Although the gate size (16/32/64-bit) determines the width of the data pushed (including the width of the error code), the error code is pushed (if applicable) and is documented specifically with:
Push(ErrorCode); (* If needed, #-bytes *)
Where # is 2 (16-bit gate), 4 (32-bit gate), or 8 (64-bit gate).
The exception: The one place where an error code isn't documented as being pushed is in the state INTERRUPT-FROM-VIRTUAL-8086-MODE. A snippet of the relevant pseudo-code:
IF IDT gate is 32-bit
THEN
IF new stack does not have room for 40 bytes (error code pushed)
or 36 bytes (no error code pushed)
THEN #SS(error_code(NewSS,0,EXT)); FI;
(* idt operand to error_code is 0 because selector is used *)
ELSE (* IDT gate is 16-bit)
IF new stack does not have room for 20 bytes (error code pushed)
or 18 bytes (no error code pushed)
THEN #SS(error_code(NewSS,0,EXT)); FI;
(* idt operand to error_code is 0 because selector is used *)
FI;
IF instruction pointer from IDT gate is not within new code-segment limits
THEN #GP(EXT); FI; (* Error code contains NULL selector *)
tempEFLAGS ← EFLAGS;
VM ← 0;
TF ← 0;
RF ← 0;
NT ← 0;
IF service through interrupt gate
THEN IF = 0; FI;
TempSS ← SS;
TempESP ← ESP;
SS ← NewSS;
ESP ← NewESP;
(* Following pushes are 16 bits for 16-bit IDT gates and 32 bits for 32-bit IDT gates;
Segment selector pushes in 32-bit mode are padded to two words *)
Push(GS);
Push(FS);
Push(DS);
Push(ES);
Push(TempSS);
Push(TempESP);
Push(TempEFlags);
Push(CS);
Push(EIP);
GS ← 0; (* Segment registers made NULL, invalid for use in protected mode *)
FS ← 0;
DS ← 0;
ES ← 0;
CS ← Gate(CS); (* Segment descriptor information also loaded *)
CS(RPL) ← 0;
CPL ← 0;
IF IDT gate is 32-bit
THEN
EIP ← Gate(instruction pointer);
ELSE (* IDT gate is 16-bit *)
EIP ← Gate(instruction pointer) AND 0000FFFFH;
FI;
(* Start execution of new routine in Protected Mode *)
What is notably absent is any mention of the error code after Push(EIP); and before starting execution in protected mode. Of interest is that a check is done for enough stack space for the case of an error code and no error code. With a 32-bit interrupt/trap gate the size is either 40 with an error code or 36 without. This is the reason for the question [1].
Footnotes
[1] I had never paid close attention to the newer Intel documentation over the years and was unaware what the documentation was saying with regards to v8086 mode. My v8086 monitors and protected mode interrupt handlers have always been written to take into account exceptions with error codes and those without. I didn't notice the problems in the documentation until this past week when someone approached me about a discussion where this was mentioned in passing (but not explained).
TL;DR: The pseudo-code in the Intel instruction set reference is incorrect. If an exception in v8086 mode causes a protected mode interrupt/trap gate to execute an exception handler then an error code will be pushed if the exception is one of those with an error code. #GP has an error code and it will be pushed on the ring 0 stack before transferring control to your #GP handler. You must manually remove it prior to doing an IRET.
The answer is that an exception in Virtual 8086 mode (v8086 or v86) that is processed by a protected mode handler (through an interrupt or trap gate) will have the error code pushed for those exceptions that use one (including #GP). The pseudo-code should have been:
Push(CS);
Push(EIP);
Push(ErrorCode); (* If needed *)
Section 6.4.1, Call and Return Operation for Interrupt or Exception Handling Procedures, in Intel® 64 and IA-32 Architectures Software Developer’s Manual Vol 1 documents inter-privilege (a privilege level change) and intra-privilege (privilege level remains the same) transitions as having this rule applied:
Pushes an error code on the new stack (if appropriate).
IMHO it would have probably been worded better as:
Pushes an error code on the new stack (if applicable to the exception).
v8086 mode is a special mode of protected-mode running at Privilege Level 3. These rules still apply since exceptions transition the processor from ring 3 to ring 0 (inter privilege level change) to handle interrupts via an interrupt/trap gate.
Related Real-Address Mode Documentation Inconsistencies
On the original 8086 processors the only exceptions were 0 through 4 (inclusive). That included #DE, #DB, NMI interrupt, #BP, and #OF. The rest were documented as reserved [1] by Intel up to and including exception 31. None of the exceptions on the 8086 had error codes so this was never an issue. This changed on the 286 and later processors where exceptions with error codes were introduced.
In Intel® 64 and IA-32 Architectures Software Developer’s Manual Vol 1 section 6.4.3, Intel says this about Real-address mode on later processors (286+)
6.4.3 Interrupt and Exception Handling in Real-Address Mode
When operating in real-address mode, the processor responds to an interrupt
or exception with an implicit far call to an interrupt or exception
handler. The processor uses the interrupt or exception vector as an
index into an interrupt table. The interrupt table contains
instruction pointers to the interrupt and exception handler
procedures.
The processor saves the state of the EFLAGS register, the
EIP register, the CS register, and an optional error code on the stack
before switching to the handler procedure.
A return from the interrupt
or exception handler is carried out with the IRET instruction.
See Chapter 20, “8086 Emulation,” in the Intel® 64 and IA-32 Architectures
Software Developer’s Manual, Volume 3B, for more information on
handling interrupts and exceptions in real-address mode.
The important part is the claim that "an optional error code" is pushed. This is in fact not true: an error code is not pushed in real-address mode for exceptions that would normally have one pushed in other operating modes. This section does say to see Chapter 20, “8086 Emulation” of Volume 3B. In Chapter 20, Section 20.1.4 Interrupt and Exception Handling says this:
The processor performs the following actions to make an implicit call
to the selected handler:
1. Pushes the current values of the CS and EIP registers onto the stack. (Only the 16 least-significant bits of the EIP register are pushed.)
2. Pushes the low-order 16 bits of the EFLAGS register onto the stack.
3. Clears the IF flag in the EFLAGS register to disable interrupts.
4. Clears the TF, RF, and AC flags in the EFLAGS register.
5. Transfers program control to the location specified in the interrupt vector table.
An IRET instruction at the end of the handler procedure reverses these steps to return program control to the interrupted program. Exceptions do not return error codes in real-address mode.
This part of the documentation is correct. The 5 steps do not include pushing an error code. This is consistent with the pseudo-code in the instruction set reference for INT n/INTO/INT3/INT1—Call to Interrupt Procedure which has this documented for the state REAL_ADDRESS_MODE:
Push(CS);
Push(IP);
(* No error codes are pushed in real-address mode*)
CS ← IDT(Descriptor (vector_number « 2), selector));
EIP ← IDT(Descriptor (vector_number « 2), offset)); (* 16 bit offset AND 0000FFFFH *)
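The real-address mode lookup in that pseudo-code (vector number shifted left by 2) can be sketched as follows. This assumes the interrupt vector table sits at linear address 0 with 4-byte entries laid out as a 16-bit offset followed by a 16-bit segment; the function names and the sample handler address are hypothetical.

```python
import struct


def ivt_entry_address(vector):
    """Each real-mode vector occupies 4 bytes: address = vector << 2."""
    return vector << 2


def read_ivt_entry(memory, vector):
    """Return (segment, offset) stored for a vector in a memory image."""
    offset, segment = struct.unpack_from("<HH", memory, ivt_entry_address(vector))
    return segment, offset


def linear_address(segment, offset):
    """Real-mode segment:offset to linear address."""
    return (segment << 4) + offset


# Build a 1 KiB table and install a made-up handler for vector 13 (0x0d).
memory = bytearray(256 * 4)
struct.pack_into("<HH", memory, ivt_entry_address(13), 0x1234, 0xF000)

segment, offset = read_ivt_entry(memory, 13)
print(hex(ivt_entry_address(13)), hex(segment), hex(offset),
      hex(linear_address(segment, offset)))
```

Nothing in this path has a slot for an error code, which matches the comment in the pseudo-code.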
Footnotes
[1] Although Intel reserved the unused exceptions up to and including interrupt 31 on the original 8086, IBM made a poor design decision in mapping the external interrupts of its PIC (interrupt controller) to interrupts 8 through 15 (inclusive) and placing BIOS calls in the reserved space as well. This caused problems on IBM systems with a 286+ processor, where the master PIC's external interrupts overlapped with the exceptions that Intel added. For instance #GP and IRQ5 share the same interrupt number 13 (0x0d) in real-address mode.
16-bit and 32-bit protected mode OSes generally move the master PIC's base address from interrupt 8 to a location greater than interrupt 31, outside the reserved interrupts, to avoid this problem.
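The overlap is easy to see by mapping IRQ numbers to vectors. In this sketch the default bases (0x08 for the master PIC, 0x70 for the slave) are the PC/AT BIOS defaults; the remapped bases 0x20 and 0x28 are a common convention assumed here for illustration, and the helper names are hypothetical.

```python
RESERVED_VECTORS = range(0, 32)  # vectors Intel reserves for exceptions


def irq_to_vector(irq, master_base=0x08, slave_base=0x70):
    """Map an IRQ line (0-15) to its interrupt vector."""
    if 0 <= irq < 8:
        return master_base + irq
    if 8 <= irq < 16:
        return slave_base + (irq - 8)
    raise ValueError("IRQ must be in the range 0-15")


def colliding_irqs(master_base, slave_base):
    """List the IRQs whose vectors fall inside the reserved exception range."""
    return [irq for irq in range(16)
            if irq_to_vector(irq, master_base, slave_base) in RESERVED_VECTORS]


print(hex(irq_to_vector(5)))          # same vector as #GP
print(colliding_irqs(0x08, 0x70))     # BIOS default layout
print(colliding_irqs(0x20, 0x28))     # after remapping above vector 31
```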
This is a continuation of the other answer as the post limit was exceeded.
Example Demonstrating #GP and #UD generated in v8086 mode
The following code is not meant as a primer on entering v8086 mode or writing a proper v8086 monitor (#GP handler). Information on entering v8086 mode can be found in another of my Stackoverflow answers. That answer discusses the mechanisms of getting into v8086 mode. The following code is based on that answer, but includes a TSS, and an Interrupt descriptor table that only handles #UD (exception 6) and #GP (exception 13). I chose #UD because it is an exception without an error code, and I chose #GP because it is an exception with an error code.
Most of the code is support code to print to the display in real and protected mode. The idea behind this example is simply to execute the instruction UD2 in v8086 mode and issue a privileged HLT instruction. I am entering v8086 mode with an IOPL of 0 so HLT causes a #GP exception that is handled by a protected mode GPF handler. #GP has an error code, and #UD does not. To find out if an error code is pushed, an exception handler only needs to subtract the current ESP from the address of the bottom of the stack. I use a 32-bit gate, so with an error code the exception stack frame should be 40 bytes (0x28), and without it should be 36 (0x24).
With an error code GS, FS, DS, ES, USER_SS, USER_ESP, EFLAGS, CS, EIP, Error code are pushed. Each is 32-bits wide (4 bytes). 10*4=40.
Without an error code GS, FS, DS, ES, USER_SS, USER_ESP, EFLAGS, CS, EIP are pushed. Each is 32-bits wide (4 bytes). 9*4=36.
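The arithmetic above, and the way the handlers recover the frame size from ESP, can be sketched in Python. The item list follows the push order described in the answer; EXC_STACK matches the value used in the test code, and the function names are hypothetical.

```python
# Items the CPU pushes when an exception interrupts v8086 mode.
V86_FRAME_ITEMS = ["GS", "FS", "DS", "ES", "USER_SS", "USER_ESP",
                   "EFLAGS", "CS", "EIP"]

EXC_STACK = 0x70000   # bottom of the ring 0 exception stack in the test code
PUSHA_BYTES = 8 * 4   # eight 32-bit registers saved by PUSHA


def v86_frame_size(gate_bits, has_error_code):
    """Size in bytes of the exception frame for a 16-bit or 32-bit gate."""
    slot = gate_bits // 8
    items = len(V86_FRAME_ITEMS) + (1 if has_error_code else 0)
    return items * slot


def frame_size_from_esp(esp_after_pusha, stack_bottom=EXC_STACK):
    """What the handlers compute: (EXC_STACK - 8*4) - ESP."""
    return (stack_bottom - PUSHA_BYTES) - esp_after_pusha


# ESP as a #GP handler would see it after its own PUSHA.
esp_gp = EXC_STACK - v86_frame_size(32, True) - PUSHA_BYTES
print(hex(v86_frame_size(32, True)), hex(v86_frame_size(32, False)))
print(hex(frame_size_from_esp(esp_gp)))
```

The 16-bit gate sizes produced by the same function (20 and 18 bytes) agree with the stack-space checks in Intel's pseudo-code.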
The code in v8086 mode does this for the test:
; v8086 code entry point
v86_mode_entry:
ud2 ; Cause a #UD exception (no error code pushed)
mov dword [vidmem_ptr], 0xb8000+80*2
; Advance current video ptr to second line
hlt ; Cause a #GP exception (error code pushed)
; End of the test - enter infinite loop since we didn't provide a way for
; the v8086 process to be terminated. We can't do a HLT at ring 3.
.endloop:
jmp $
There are two protected mode exception handlers reached through a 32-bit interrupt gate. Although long they ultimately do one thing - print out (in hex) the size of the exception stack frame as it appeared right after control reached the exception handler. Because the exception handler uses pusha to save all the general purpose registers, 32 bytes (8*4) is subtracted from the total amount.
; #UD Invalid Opcode v8086 exception handler
exc_invopcode:
pusha ; Save all general purpose registers
mov eax, DATA32_SEL ; Setup the segment registers with kernel data selector
mov ds, eax
mov es, eax
cld ; DF=0 forward string movement
test dword [esp+efrm_noerr.user_flags], 1<<EFLAGS_VM_BIT
; Is the VM (v8086) set in the EFLAGS of the code
; that was interrupted?
jnz .isvm ; If set then proceed with processing the exception
mov esi, exc_not_vm ; Otherwise print msg we weren't interrupting v8086 code
mov ah, ATTR_BWHITE_ON_RED
call print_string_pm ; Print message to console
.endloop:
hlt
jmp .endloop ; Infinite HLT loop
.isvm:
mov esi, exc_msg_ud
mov ah, ATTR_BWHITE_ON_MAGENTA
call print_string_pm ; Print that we are a #UD exception
; The difference between the bottom of the kernel stack and the ESP
; value (accounting for the extra 8 pushes by PUSHA) is the original
; exception stack frame size. Without an error code this should print 0x24.
mov eax, EXC_STACK-8*4
sub eax, esp ; EAX = size of exception stack frame without
; registers pushed by PUSHA
mov edi, tmp_hex_str ; EDI = address of buffer to store converted integer
mov esi, edi ; ESI = copy of address for call to print_string_pm
call dword_to_hex_pm ; Convert EAX to HEX string
mov ah, ATTR_BWHITE_ON_MAGENTA
call print_string_pm ; Print size of frame in HEX
add word [esp+efrm_noerr.user_eip], 2
; A UD2 instruction is encoded as 2 bytes so update
; the real mode instruction pointer to point to
; next instruction so that the test can continue
; rather than repeatedly throwing #UD exceptions
popa ; Restore all general purpose registers
iret
; #GP v8086 General Protection Fault handler
exc_gpf:
pusha ; Save all general purpose registers
mov eax, DATA32_SEL ; Setup the segment registers with kernel data selector
mov ds, eax
mov es, eax
cld ; DF=0 forward string movement
test dword [esp+efrm_err.user_flags], 1<<EFLAGS_VM_BIT
; Is the VM (v8086) set in the EFLAGS of the code
; that was interrupted?
jnz .isvm ; If set then proceed with processing the exception
mov esi, exc_not_vm ; Otherwise print msg we weren't interrupting v8086 code
mov ah, ATTR_BWHITE_ON_RED
call print_string_pm ; Print message to console
.endloop:
hlt
jmp .endloop ; Infinite HLT loop
.isvm:
mov esi, exc_msg_gp
mov ah, ATTR_BWHITE_ON_MAGENTA
call print_string_pm ; Print that we are a #GP exception
; The difference between the bottom of the kernel stack and the ESP
; value (accounting for the extra 8 pushes by PUSHA) is the original
; exception stack frame size. With an error code this should print 0x28.
mov eax, EXC_STACK-8*4
sub eax, esp ; EAX = size of exception stack frame without
; registers pushed by PUSHA
mov edi, tmp_hex_str ; EDI = address of buffer to store converted integer
mov esi, edi ; ESI = copy of address for call to print_string_pm
call dword_to_hex_pm ; Convert EAX to HEX string
mov ah, ATTR_BWHITE_ON_MAGENTA
call print_string_pm ; Print size of frame in HEX
inc word [esp+efrm_err.user_eip]
; A HLT instruction is encoded as 1 byte so update
; the real mode instruction pointer to point to
; next instruction so that the test can continue
; rather than repeatedly throwing #GP exceptions
popa ; Restore all general purpose registers
add esp, 4 ; Remove the error code
iret
There is some hard-coded trickery to adjust the CS:IP when returning to v8086 mode so that we don't loop forever faulting on the same instruction. A UD2 instruction is 2 bytes, so we add 2; for HLT we add 1 to the v8086 IP before returning. These exception handlers only do something useful when the exception comes from v8086 mode; otherwise they print an error message and halt. Do not treat this code as a template for your own exception and interrupt handlers: it was written specifically for this test and isn't generic.
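The handlers address the saved state through two frame layouts that differ only by the error code slot. A short Python sketch of those layouts shows why the same field sits 4 bytes higher in the #GP handler; the field names mirror the structures in the test code, while frame_offsets is a hypothetical helper.

```python
PUSHA_FIELDS = ["edi", "esi", "ebp", "esp", "ebx", "edx", "ecx", "eax"]
CPU_FIELDS = ["user_eip", "user_cs", "user_flags", "user_esp", "user_ss",
              "vm_es", "vm_ds", "vm_fs", "vm_gs"]


def frame_offsets(with_error_code):
    """Offsets from ESP (after PUSHA) of every 32-bit field in the frame."""
    fields = PUSHA_FIELDS + (["errno"] if with_error_code else []) + CPU_FIELDS
    return {name: 4 * index for index, name in enumerate(fields)}


err = frame_offsets(True)      # layout seen by the #GP handler
noerr = frame_offsets(False)   # layout seen by the #UD handler

print(err["user_eip"], err["user_flags"])
print(noerr["user_eip"], noerr["user_flags"])
```

Because POPA restores only the eight general purpose registers, the error code is still on top of the stack afterwards, which is why the #GP handler adjusts ESP by 4 before its IRET.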
The following code can be run in an emulator or booted on real hardware using the bootloader test harness in this Stackoverflow answer:
stage2.asm:
VIDEO_TEXT_ADDR EQU 0xb8000 ; Hard code beginning of text video memory
ATTR_BWHITE_ON_MAGENTA EQU 0x5f ; Bright White on magenta attribute
ATTR_BWHITE_ON_RED EQU 0x4f ; Bright White on red attribute
PM_MODE_STACK EQU 0x80000 ; Protected mode stack below EBDA
EXC_STACK EQU 0x70000 ; Kernel Stack for interrupt/exception handling
V86_STACK_SEG EQU 0x0000 ; v8086 stack SS
V86_STACK_OFS EQU 0x7c00 ; v8086 stack SP
V86_CS_SEG EQU 0x0000 ; v8086 code segment CS
EFLAGS_VM_BIT EQU 17 ; EFLAGS VM bit
EFLAGS_BIT1 EQU 1 ; EFLAGS bit 1 (reserved, always 1)
EFLAGS_IF_BIT EQU 9 ; EFLAGS IF bit
TSS_IO_BITMAP_SIZE EQU 0x400/8 ; IO Bitmap for 0x400 IO ports
; Size 0 disables IO port bitmap (no permission)
ORG_ADDR EQU 0x7e00 ; Origin point of stage2 (test code)
; Macro to build a GDT descriptor entry
%define MAKE_GDT_DESC(base, limit, access, flags) \
(((base & 0x00FFFFFF) << 16) | \
((base & 0xFF000000) << 32) | \
(limit & 0x0000FFFF) | \
((limit & 0x000F0000) << 32) | \
((access & 0xFF) << 40) | \
((flags & 0x0F) << 52))
; Macro to build a IDT descriptor entry
%define MAKE_IDT_DESC(offset, selector, access) \
((offset & 0x0000FFFF) | \
((offset & 0xFFFF0000) << 32) | \
((selector & 0x0000FFFF) << 16) | \
((access & 0xFF) << 40))
; Macro to convert an address to an absolute offset
%define ABS_ADDR(label) \
(ORG_ADDR + (label - $$))
; Structure representing exception frame WITH an error code
; including registers pushed by a PUSHA
struc efrm_err
; General purpose registers pushed by PUSHA
.edi: resd 1
.esi: resd 1
.ebp: resd 1
.esp: resd 1
.ebx: resd 1
.edx: resd 1
.ecx: resd 1
.eax: resd 1
; Items pushed by the CPU when an exception occurred
.errno: resd 1
.user_eip: resd 1
.user_cs: resd 1
.user_flags: resd 1
.user_esp: resd 1
.user_ss: resd 1
.vm_es: resd 1
.vm_ds: resd 1
.vm_fs: resd 1
.vm_gs: resd 1
EFRAME_ERROR_SIZE equ $-$$
endstruc
; Structure representing exception frame WITHOUT an error code
; including registers pushed by a PUSHA
struc efrm_noerr
; General purpose registers pushed by PUSHA
.edi: resd 1
.esi: resd 1
.ebp: resd 1
.esp: resd 1
.ebx: resd 1
.edx: resd 1
.ecx: resd 1
.eax: resd 1
; Items pushed by the CPU when an exception occurred
.user_eip: resd 1
.user_cs: resd 1
.user_flags: resd 1
.user_esp: resd 1
.user_ss: resd 1
.vm_es: resd 1
.vm_ds: resd 1
.vm_fs: resd 1
.vm_gs: resd 1
EFRAME_NOERROR_SIZE equ $-$$
endstruc
bits 16
ORG ORG_ADDR
start:
xor ax, ax ; DS=SS=ES=0
mov ds, ax
mov ss, ax ; Stack at 0x0000:0x7c00
mov sp, 0x7c00
cld ; Set string instructions to use forward movement
; No enabling A20 as we don't require it
lgdt [gdtr] ; Load our GDT
lidt [idtr] ; Install interrupt table
mov eax, cr0
or eax, 1
mov cr0, eax ; Set protected mode flag
jmp CODE32_SEL:start32 ; FAR JMP to set CS
; v8086 code entry point
v86_mode_entry:
ud2 ; Cause a #UD exception (no error code pushed)
mov dword [vidmem_ptr], 0xb8000+80*2
; Advance current video ptr to second line
hlt ; Cause a #GP exception (error code pushed)
; End of the test - enter infinite loop since we didn't provide a way for
; the v8086 process to be terminated. We can't do a HLT at ring 3.
.endloop:
jmp $
; 32-bit protected mode entry point
bits 32
start32:
mov ax, DATA32_SEL ; Setup the segment registers with data selector
mov ds, ax
mov es, ax
mov ss, ax
mov esp, PM_MODE_STACK ; Set protected mode stack pointer
mov fs, ax ; Not currently using FS and GS
mov gs, ax
mov ecx, BSS_SIZE_D ; Zero out BSS section a DWORD at a time
mov edi, bss_start
xor eax, eax
rep stosd
; Set iomap_base in tss with the offset of the iomap relative to beginning of the tss
mov word [tss_entry.iomap_base], tss_entry.iomap-tss_entry
mov dword [tss_entry.esp0], EXC_STACK
mov dword [tss_entry.ss0], DATA32_SEL
mov eax, TSS32_SEL
ltr ax ; Load default TSS (used for exceptions, interrupts, etc)
xor ebx, ebx ; EBX=0
push ebx ; Real mode GS=0
push ebx ; Real mode FS=0
push ebx ; Real mode DS=0
push ebx ; Real mode ES=0
push V86_STACK_SEG
push V86_STACK_OFS ; v8086 stack SS:SP (grows down from SS:SP)
push dword 1<<EFLAGS_VM_BIT | 1<<EFLAGS_BIT1
; Set VM Bit, IF bit is off, DF=0(forward direction),
; IOPL=0, Reserved bit (bit 1) always 1. Everything
; else 0. These flags will be loaded in the v8086 mode
; during the IRET. We don't want interrupts enabled
; because we don't have a proper v86 monitor
; GPF handler to process them.
push V86_CS_SEG ; Real Mode CS (segment)
push v86_mode_entry ; Entry point (offset)
iret ; Transfer control to v8086 mode and our real mode code
; Function: print_string_pm
; Display a string to the console on display page 0 in protected mode.
; Very basic. Doesn't update hardware cursor, doesn't handle scrolling,
; LF, CR, TAB.
;
; Inputs: ESI = Offset of address to print
; AH = Attribute of string to print
; Clobbers: None
; Returns: None
print_string_pm:
push edi
push esi
push eax
mov edi, [vidmem_ptr] ; Start from video address stored at vidmem_ptr
jmp .getchar
.outchar:
stosw ; Output character to video display
.getchar:
lodsb ; Load next character from string
test al, al ; Is character NUL?
jne .outchar ; If not, go back and output character
mov [vidmem_ptr], edi ; Update global video pointer
pop eax
pop esi
pop edi
ret
; Function: dword_to_hex_pm
; Convert a 32-bit value to its equivalent HEXadecimal string
;
; Inputs: EDI = Offset of buffer for converted string (at least 8 bytes)
; EAX = 32-bit value to convert to HEX
; Clobbers: None
; Returns: None
dword_to_hex_pm:
push edx ; Save all registers we use
push ecx
push edi
mov ecx, 8 ; Process 8 nibbles (4 bits each)
.nibble_loop:
rol eax, 4 ; Rotate the high nibble to the low nibble of EAX
mov edx, eax ; Save copy of rotated value to continue conversion
and edx, 0x0f ; Mask off everything but the lower nibble
movzx edx, byte [.hex_lookup_tbl+edx]
mov [edi], dl ; Convert nibble to HEX character using lookup table
inc edi ; Continue with the next nibble
dec ecx
jnz .nibble_loop ; Continue with next nibble if we haven't processed all
pop edi ; Restore all the registers we clobbered
pop ecx
pop edx
ret
.hex_lookup_tbl: db "0123456789abcdef"
; #UD Invalid Opcode v8086 exception handler
exc_invopcode:
pusha ; Save all general purpose registers
mov eax, DATA32_SEL ; Setup the segment registers with kernel data selector
mov ds, eax
mov es, eax
cld ; DF=0 forward string movement
test dword [esp+efrm_noerr.user_flags], 1<<EFLAGS_VM_BIT
; Is the VM (v8086) set in the EFLAGS of the code
; that was interrupted?
jnz .isvm ; If set then proceed with processing the exception
mov esi, exc_not_vm ; Otherwise print msg we weren't interrupting v8086 code
mov ah, ATTR_BWHITE_ON_RED
call print_string_pm ; Print message to console
.endloop:
hlt
jmp .endloop ; Infinite HLT loop
.isvm:
mov esi, exc_msg_ud
mov ah, ATTR_BWHITE_ON_MAGENTA
call print_string_pm ; Print that we are a #UD exception
; The difference between the bottom of the kernel stack and the ESP
; value (accounting for the extra 8 pushes by PUSHA) is the original
; exception stack frame size. Without an error code this should print 0x24.
mov eax, EXC_STACK-8*4
sub eax, esp ; EAX = size of exception stack frame without
; registers pushed by PUSHA
mov edi, tmp_hex_str ; EDI = address of buffer to store converted integer
mov esi, edi ; ESI = copy of address for call to print_string_pm
call dword_to_hex_pm ; Convert EAX to HEX string
mov ah, ATTR_BWHITE_ON_MAGENTA
call print_string_pm ; Print size of frame in HEX
add word [esp+efrm_noerr.user_eip], 2
; A UD2 instruction is encoded as 2 bytes so update
; the real mode instruction pointer to point to
; next instruction so that the test can continue
; rather than repeatedly throwing #UD exceptions
popa ; Restore all general purpose registers
iret
; #GP v8086 General Protection Fault handler
exc_gpf:
pusha ; Save all general purpose registers
mov eax, DATA32_SEL ; Setup the segment registers with kernel data selector
mov ds, eax
mov es, eax
cld ; DF=0 forward string movement
test dword [esp+efrm_err.user_flags], 1<<EFLAGS_VM_BIT
; Is the VM (v8086) set in the EFLAGS of the code
; that was interrupted?
jnz .isvm ; If set then proceed with processing the exception
mov esi, exc_not_vm ; Otherwise print msg we weren't interrupting v8086 code
mov ah, ATTR_BWHITE_ON_RED
call print_string_pm ; Print message to console
.endloop:
hlt
jmp .endloop ; Infinite HLT loop
.isvm:
mov esi, exc_msg_gp
mov ah, ATTR_BWHITE_ON_MAGENTA
call print_string_pm ; Print that we are a #GP exception
; The difference between the bottom of the kernel stack and the ESP
; value (accounting for the extra 8 pushes by PUSHA) is the original
; exception stack frame size. With an error code this should print 0x28.
mov eax, EXC_STACK-8*4
sub eax, esp ; EAX = size of exception stack frame without
; registers pushed by PUSHA
mov edi, tmp_hex_str ; EDI = address of buffer to store converted integer
mov esi, edi ; ESI = copy of address for call to print_string_pm
call dword_to_hex_pm ; Convert EAX to HEX string
mov ah, ATTR_BWHITE_ON_MAGENTA
call print_string_pm ; Print size of frame in HEX
inc word [esp+efrm_err.user_eip]
; A HLT instruction is encoded as 1 byte so update
; the real mode instruction pointer to point to
; next instruction so that the test can continue
; rather than repeatedly throwing #GP exceptions
popa ; Restore all general purpose registers
add esp, 4 ; Remove the error code
iret
; Data section
align 4
vidmem_ptr: dd VIDEO_TEXT_ADDR ; Start console output in upper left of display
tmp_hex_str: TIMES 9 db 0 ; String to store 32-bit value converted HEX + NUL byte
exc_msg_ud:
db "#UD frame size: 0x", 0
exc_msg_gp:
db "#GP frame size: 0x", 0
exc_not_vm:
db "Not a v8086 exception", 0
align 4
gdt_start:
dq MAKE_GDT_DESC(0, 0, 0, 0) ; null descriptor
gdt32_code:
dq MAKE_GDT_DESC(0, 0x000fffff, 10011010b, 1100b)
; 32-bit code, 4kb gran, limit 0xffffffff bytes, base=0
gdt32_data:
dq MAKE_GDT_DESC(0, 0x000fffff, 10010010b, 1100b)
; 32-bit data, 4kb gran, limit 0xffffffff bytes, base=0
gdt32_tss:
dq MAKE_GDT_DESC(tss_entry, TSS_SIZE-1, 10001001b, 0000b)
; 32-bit TSS, 1b gran, available, DPL=0
end_of_gdt:
CODE32_SEL equ gdt32_code - gdt_start
DATA32_SEL equ gdt32_data - gdt_start
TSS32_SEL equ gdt32_tss - gdt_start
gdtr:
dw end_of_gdt - gdt_start - 1
; limit (Size of GDT - 1)
dd gdt_start ; base of GDT
align 4
; Create an IDT which handles #UD and #GPF. All other exceptions set to 0
; so that they triple fault. No external interrupts supported.
idt_start:
TIMES 6 dq 0
dq MAKE_IDT_DESC(ABS_ADDR(exc_invopcode), CODE32_SEL, 10001110b) ; 6
TIMES 6 dq 0
dq MAKE_IDT_DESC(ABS_ADDR(exc_gpf), CODE32_SEL, 10001110b) ; D
TIMES 18 dq 0
end_of_idt:
align 4
idtr:
dw end_of_idt - idt_start - 1
; limit (Size of IDT - 1)
dd idt_start ; base of IDT
; Data section above bootloader acts like a BSS section
align 4
ABSOLUTE ABS_ADDR($) ; Convert location counter to absolute address
bss_start:
; Task State Structure (TSS)
tss_entry:
.back_link: resd 1
.esp0: resd 1 ; Kernel stack pointer used on ring transitions
.ss0: resd 1 ; Kernel stack segment used on ring transitions
.esp1: resd 1
.ss1: resd 1
.esp2: resd 1
.ss2: resd 1
.cr3: resd 1
.eip: resd 1
.eflags: resd 1
.eax: resd 1
.ecx: resd 1
.edx: resd 1
.ebx: resd 1
.esp: resd 1
.ebp: resd 1
.esi: resd 1
.edi: resd 1
.es: resd 1
.cs: resd 1
.ss: resd 1
.ds: resd 1
.fs: resd 1
.gs: resd 1
.ldt: resd 1
.trap: resw 1
.iomap_base:resw 1 ; IOPB offset
.iomap: resb TSS_IO_BITMAP_SIZE ; IO bitmap (IOPB) size 8192 (8*8192=65536) representing
; all ports. An IO bitmap size of 0 would fault all IO
; port access if IOPL < CPL (CPL=3 with v8086)
%if TSS_IO_BITMAP_SIZE > 0
.iomap_pad: resb 1 ; Padding byte that has to be filled with 0xff
; To deal with issues on some CPUs when using an IOPB
%endif
TSS_SIZE EQU $-tss_entry
bss_end:
BSS_SIZE_B EQU bss_end-bss_start; BSS size in bytes
BSS_SIZE_D EQU (BSS_SIZE_B+3)/4 ; BSS size in dwords
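As an aside, the descriptor values in the GDT above can be checked by hand. The sketch below is Python rather than NASM, and pack_gdt_desc is a hypothetical helper: it assumes MAKE_GDT_DESC takes (base, limit, access, flags) in the order its uses above suggest, and packs those fields into the standard 8-byte x86 segment descriptor layout.

```python
def pack_gdt_desc(base, limit, access, flags):
    """Pack fields into a 64-bit x86 segment descriptor (hypothetical helper)."""
    desc = limit & 0xFFFF                    # limit bits 15:0  -> descriptor bits 15:0
    desc |= (base & 0xFFFFFF) << 16          # base bits 23:0   -> bits 39:16
    desc |= (access & 0xFF) << 40            # access byte      -> bits 47:40
    desc |= ((limit >> 16) & 0xF) << 48      # limit bits 19:16 -> bits 51:48
    desc |= (flags & 0xF) << 52              # G, D/B, L, AVL   -> bits 55:52
    desc |= ((base >> 24) & 0xFF) << 56      # base bits 31:24  -> bits 63:56
    return desc

# Same arguments as the gdt32_code and gdt32_data entries
code_desc = pack_gdt_desc(0, 0x000FFFFF, 0b10011010, 0b1100)
data_desc = pack_gdt_desc(0, 0x000FFFFF, 0b10010010, 0b1100)
print(hex(code_desc))
print(hex(data_desc))
```

With a base of 0 and a 20-bit limit of 0xfffff at 4KiB granularity, these are the familiar flat 4GiB code and data descriptors.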
Acquire the bpb.inc file and boot.asm from the test harness. Assemble to a disk image with:
nasm -f bin stage2.asm -o stage2.bin
nasm -f bin boot.asm -o disk.img
stage2.bin has to be assembled first as it is embedded as binary by boot.asm. The result should be a 1.44MiB floppy disk image called disk.img. If run in QEMU with:
qemu-system-i386 -fda disk.img
The result should be similar to:
UD frame size should be 0x00000024 (36 = size of exception frame without error code)
GP frame size should be 0x00000028 (40 = size of exception frame with error code)
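The two expected values follow from what the CPU pushes when an exception interrupts v8086 code and switches to the ring 0 stack: GS, FS, DS, ES, SS, ESP, EFLAGS, CS and EIP, plus an error code in the #GP case. A small Python sketch of that arithmetic (an illustration only, not part of the test program):

```python
# Dwords pushed by the CPU when an exception arrives from v8086 mode,
# listed from first pushed to last pushed.
V86_FRAME = ["GS", "FS", "DS", "ES", "SS", "ESP", "EFLAGS", "CS", "EIP"]

def frame_size(has_error_code):
    """Size in bytes of the exception frame, excluding the PUSHA registers."""
    dwords = len(V86_FRAME) + (1 if has_error_code else 0)
    return dwords * 4

print(f"#UD frame size: 0x{frame_size(False):08x}")
print(f"#GP frame size: 0x{frame_size(True):08x}")
```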

Porting from 32 to 64-bit by just changing all the register names from eXX to rXX makes factorial return 0?

How fortunate it is for all of us learning the art of computer programming to have access to a community such as Stack Overflow! I have made the decision to take up the task of learning how to program computers and I am doing so by the knowledge of an e-book called 'Programming From the Ground Up', which teaches the reader how to create programs in the assembly language within the GNU/Linux environment.
My progress in the book has come to the point of creating a program which computes the factorial of the integer 4 with a function, which I have made and done without any error caused by the assembler of GCC or caused by running the program. However, the function in my program does not return the right answer! The factorial of 4 is 24, but the program returns a value of 0! Rightly speaking, I do not know why this is!
Here is the code for your consideration:
.section .data
.section .text
.globl _start
.globl factorial
_start:
push $4 #this is the function argument
call factorial #the function is called
add $4, %rsp #the stack is restored to its original
#state before the function was called
mov %rax, %rbx #this instruction will move the result
#computed by the function into the rbx
#register and will serve as the return
#value
mov $1, %rax #1 must be placed inside this register for
#the exit system call
int $0x80 #exit interrupt
.type factorial, #function #defines the code below as being a function
factorial: #function label
push %rbp #saves the base-pointer
mov %rsp, %rbp #moves the stack-pointer into the base-
#pointer register so that data in the stack
#can be referenced as indexes of the base-
#pointer
mov $1, %rax #the rax register will contain the product
#of the factorial
mov 8(%rbp), %rcx #moves the function argument into %rcx
start_loop: #the process loop begins
cmp $1, %rcx #this is the exit condition for the loop
je loop_exit #if the value in %rcx reaches 1, exit loop
imul %rcx, %rax #multiply the current integer of the
#factorial by the value stored in %rax
dec %rcx #reduce the factorial integer by 1
jmp start_loop #unconditional jump to the start of loop
loop_exit: #the loop exit begins
mov %rbp, %rsp #restore the stack-pointer
pop %rbp #remove the saved base-pointer from stack
ret #return
TL;DR: the factorial of the return address overflowed %rax, leaving 0, because you ported wrong.
Porting 32-bit code to 64-bit is not as simple as changing all the register names. That might get it to assemble, but as you found even this simple program behaves differently. In x86-64, push %reg and call both push 64-bit values, and modify rsp by 8. You would see this if you single-stepped your code with a debugger. (See the bottom of the x86 tag wiki for info using gdb for asm.)
You're following a book that uses 32-bit examples, so you should probably just build them as 32-bit executables instead of trying to port them to 64-bit before you know how.
Your sys_exit() using the 32-bit int 0x80 ABI still works (What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?), but you will run into trouble with system calls if you try to pass 64-bit pointers. Use the 64-bit ABI.
You will also run into problems if you want to call any library functions, because the standard function-calling convention is different, too. See Why parameters stored in registers and not on the stack in x86-64 Assembly?, and the 64-bit ABI link, and other calling-convention docs in the x86 tag wiki.
But you're not doing any of that, so the problem with your program simply comes down to not accounting for the doubled "stack width" in x86-64. Your factorial function reads the return address as its argument.
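To see why, here is a small Python model of the stack (a sketch, not real machine state; the Stack64 class and all addresses are made up for illustration). Every push and call moves rsp by 8, so after the function prologue the argument sits at 16(%rbp), not 8(%rbp):

```python
class Stack64:
    """Toy model of a 64-bit stack: address -> value."""
    def __init__(self, rsp):
        self.rsp = rsp
        self.mem = {}

    def push(self, value):
        # A 64-bit push decrements rsp by 8, then stores the value
        self.rsp -= 8
        self.mem[self.rsp] = value

RETURN_ADDR = 0x4000C7   # assumed address of the instruction after the call
OLD_RBP = 0              # placeholder for the caller's rbp

s = Stack64(0x7FFF0000)  # illustrative starting stack pointer
s.push(4)                # push $4
s.push(RETURN_ADDR)      # call factorial
s.push(OLD_RBP)          # push %rbp
rbp = s.rsp              # mov %rsp, %rbp

print(hex(s.mem[rbp + 8]))   # what 8(%rbp) loads: the return address
print(s.mem[rbp + 16])       # what 16(%rbp) loads: the argument
```

In a straight 64-bit port of the book's stack-args convention, the load would therefore be mov 16(%rbp), %rcx and the caller's cleanup would be add $8, %rsp.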
Here's your code, commented to explain what it actually does:
push $4 # rsp-=8. (rsp) = qword 4
# non-standard calling convention with args on the stack.
call factorial # rsp-=8. (rsp) = return address. RIP=factorial
add $4, %rsp # misalign the stack, so it's pointing to the top half of the 4 you pushed earlier.
# if this was in a function that wanted to return, you'd be screwed.
mov %rax, %rbx # copy return value to first arg of system call
mov $1, %rax #eax = __NR_EXIT from asm/unistd_32.h, wasting 2 bytes vs. mov $1, %eax
int $0x80 # 32-bit ABI system call, eax=call number, ebx=first arg. sys_exit(factorial(4))
So the caller is sort of fine (for the non-standard 64-bit calling convention you've invented that passes all args on the stack). You might as well omit the add to %rsp entirely, since you're about to exit without touching the stack any further.
.type factorial, #function #defines the code below as being a function
factorial: #function label
push %rbp #rsp-=8, (rsp) = rbp
mov %rsp, %rbp # make a traditional stack frame
mov $1, %rax #retval = 1. (Wasting 2 bytes vs. the exactly equivalent mov $1, %eax)
mov 8(%rbp), %rcx #load the return address into %rcx
... and calculate the factorial
For static executables (and dynamically linked executables that aren't ASLR enabled with PIE), _start is normally at 0x4000c0, so the return address your function loads is just a few bytes above that. Your program will still run nearly instantaneously on a modern CPU, because 0x4000c0 * 3c latency of imul is still only about 12.6 million core clock cycles. On a 4GHz CPU, that's about 3 milliseconds of CPU time.
If you'd made a position-independent executable by linking with gcc foo.o on a recent distro, _start would have an address like 0x5555555545a0, and your function would have taken ~70368 seconds to run on a 4GHz CPU with 3-cycle imul latency.
4194496! is a product that includes many even numbers, so it contains far more than 64 factors of 2 and its binary representation has many trailing zeros. The low 64 bits, which are all that fit in %rax, will be zero by the time you're done multiplying by every number from 0x4000c0 down to 1.
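How quickly this happens can be checked with a short Python sketch (an illustration, not the original assembly). It mimics the 64-bit wraparound of imul by masking to 64 bits and looks for the smallest n whose factorial leaves zero in the register:

```python
MASK64 = (1 << 64) - 1

def factorial_mod64(n):
    """n! truncated to 64 bits, as left in %rax by repeated imul."""
    acc = 1
    for k in range(2, n + 1):
        acc = (acc * k) & MASK64
    return acc

# Find the first n for which the truncated factorial is exactly zero
n = 1
while factorial_mod64(n) != 0:
    n += 1
print(n)   # 66
```

The number of factors of 2 in n! is n minus the number of 1 bits in n, so 66! is the first factorial divisible by 2^64. A starting value in the millions is far past that point.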
The exit status of a Linux process is only the low 8 bits of the integer you pass to sys_exit() (because the wstatus is only a 32-bit int and includes other stuff, like what signal ended the process. See wait4(2)). So even with small args, it doesn't take much for the status to wrap.
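A quick Python sketch of that truncation (exit_status is a hypothetical helper, not a kernel API) shows which small factorials survive as an exit status:

```python
import math

def exit_status(value):
    """Low 8 bits of the value passed to sys_exit, as seen by the parent."""
    return value & 0xFF

# Columns: n, n!, and the status the shell would report
for n in range(4, 8):
    print(n, math.factorial(n), exit_status(math.factorial(n)))
```

4! and 5! come through intact (24 and 120), but 6! = 720 is already reported as 208.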

What is the cause of undefined ARM Exceptions?

One question is: when an undefined instruction exception happens, do we need to get the currently executing instruction from R14_SVC or R14_UNDEF? Currently I am working on a problem where an undefined instruction exception happened. On checking R14_SVC, I found the instructions were as below:
0x46BFD73C cmp r0, #0x0
0x46BFD740 beq 0x46BFD75C
0x46BFD744 ldr r0,0x46BFE358
So my assumption is that the undefined instruction exception happened while executing the instruction beq 0x46BFD75C.
One thing that puzzles me is that I checked r14_undef and the instruction was different:
0x46bfd4b8 bx r14
0x46bfd4bC mov r0, 0x01
0x46bfd4c0 bx r14
Which one caused the undefined instruction exception?
All of your answers are in the ARM ARM, the ARM Architecture Reference Manual. Go to infocenter.arm.com and, under reference manuals, find the architecture family you are interested in. The non-Cortex-M series all handle these exceptions the same way:
When an Undefined Instruction exception occurs, the following actions are performed:
R14_und = address of next instruction after the Undefined instruction
SPSR_und = CPSR
CPSR[4:0] = 0b11011 /* Enter Undefined Instruction mode */
CPSR[5] = 0 /* Execute in ARM state */
/* CPSR[6] is unchanged */
CPSR[7] = 1 /* Disable normal interrupts */
/* CPSR[8] is unchanged */
CPSR[9] = CP15_reg1_EEbit
/* Endianness on exception entry */
if high vectors configured then
PC = 0xFFFF0004
else
PC = 0x00000004
R14_und points at the next instruction AFTER the undefined instruction, so R14_und is the register to look at, not R14_svc. You have to examine SPSR_und to determine what state the processor was in (ARM or Thumb) to know whether you need to subtract 4 or 2 from R14_und and whether you need to fetch 4 or 2 bytes. Unfortunately, on a newer architecture that supports Thumb-2 you may have to fetch 4 bytes even in Thumb state and try to figure out what happened. Because the instruction length is variable, it is quite possible to be in a situation where it is impossible to determine what happened. If you are not using Thumb-2 instructions then it is deterministic.
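That rule can be sketched in Python (an illustration only; undefined_instruction_address is a hypothetical helper and the register values below are made up). It uses the T bit, bit 5 of SPSR_und, to pick between the ARM and Thumb adjustment described in this answer:

```python
T_BIT = 1 << 5   # CPSR/SPSR bit 5: set when the CPU was in Thumb state

def undefined_instruction_address(r14_und, spsr_und):
    """Return (address, bytes_to_fetch) for the instruction that trapped."""
    size = 2 if spsr_und & T_BIT else 4
    return r14_und - size, size

# Illustrative values: mode bits 0b10011 (SVC), T bit clear (ARM state).
addr, size = undefined_instruction_address(0x46BFD4BC, 0x00000013)
print(hex(addr), size)
```

For a 32-bit Thumb-2 encoding this only gives a starting point; check the ARM ARM for your architecture version for the exact offset.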