How can RISC-V SYSTEM instructions be implemented as traps?

I am currently studying the RISC-V specifications, user-level ISA version 2.2 and Privileged Architecture version 1.10. In Chapter 2 of the RISC-V specification, it is mentioned that "[...] though a simple implementation might cover the eight SCALL/SBREAK/CSRR* instructions with a single SYSTEM hardware instruction that always traps [...]"
However, when I look at the privileged specification, MRET, which is required to return from a trap, is also a SYSTEM instruction. Right now I am confused about how much of the Machine-level ISA is required: is it possible to omit all M-level CSRs and use a software handler for all SYSTEM instructions, as the specification suggests? If so, how does one pass information such as the return address and trap cause? Is it done through the regular registers x1-x31?
Alternatively, is it enough to implement only the following M-level CSRs, if I am aiming for a simple embedded core with only M-level privilege?
mvendorid
marchid
mimpid
mhartid
misa
mscratch
mepc
mcause
Finally, how many of these CSRs can be omitted?

ECALL/EBREAK instructions are traps anyway. CSR instructions need to be carefully parsed to make sure they specify existing registers accessed in allowed modes, which sounds like a job for your favorite sparse matrix, whether PLA or if/then.
You could emulate all SYSTEM instructions, but, as you see, you need to be able to access information inside the hardware that is not part of the normal ISA. This implies that you need to add "instruction extensions."
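As a rough illustration of what that emulation path can look like when the handful of CSRs from your list (mepc, mcause, mscratch) do exist in hardware, here is a minimal C sketch; emulate_system_insn() is a made-up helper, not anything from the spec:

    /* A minimal sketch, assuming a GCC-style RISC-V toolchain and that the
       mcause/mepc CSRs from the list above exist in hardware; the helper
       emulate_system_insn() is hypothetical. */
    #include <stdint.h>

    #define MCAUSE_ILLEGAL_INSN 2u            /* "illegal instruction" cause code */

    extern void emulate_system_insn(uint32_t insn);   /* hypothetical helper */

    void machine_trap_handler(void)           /* called from a small asm stub */
    {
        uint32_t cause, epc;
        __asm__ volatile ("csrr %0, mcause" : "=r"(cause));
        __asm__ volatile ("csrr %0, mepc"   : "=r"(epc));

        if (cause == MCAUSE_ILLEGAL_INSN) {
            uint32_t insn = *(uint32_t *)(uintptr_t)epc;   /* fetch faulting word */
            if ((insn & 0x7f) == 0x73) {      /* opcode 0x73 == SYSTEM            */
                emulate_system_insn(insn);    /* decode CSR*, ECALL/EBREAK, ...   */
                __asm__ volatile ("csrw mepc, %0" :: "r"(epc + 4));
            }
        }
        /* the asm stub ends with MRET (or whatever return mechanism exists) */
    }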
I would also recommend making the SYSTEM instructions atomic, meaning that exceptions should be masked or avoided within each emulated instruction.
Since I am not a very trusting person, I would create a new mode that enables those instruction extensions, letting you, for example, read the exception address directly from the hardware and fetch instructions from a protected area of memory. Interrupts would be disabled automatically. The mode would be exited by branching to epc+4 or to the illegal instruction handler. I would not want anything outside the RISC-V spec to be available even in M-mode, just to be safe.
In my experience, it is better to say "I do everything," than it is to explain to each customer, or worse, have a competitor explain to your customers, what it is that you do not do. But perhaps someone who knows the CSRs better could help; it is not something I do.

Related

When to use Vectored-Interrupt vs. Non-vectored Interrupt?

Why would you choose a vectored interrupt over a non-vectored interrupt?
I know the differences between them, but I'm not sure when you would use one over the other, or which devices use each.
Thank you so much.
If the hardware supports vectored interrupts, there is no reason not to use them. This is more a question of implementation cost (vector tables and prioritisation logic) vs software cost (reading status registers and looking up the correct vector).
As hardware has become cheaper over time, it makes sense to have dedicated logic provide the correct vector address: this improves interrupt latency in typical real-world implementations, because the processor can start executing the actual handler code sooner.
Where hardware supports both, the non-vectored mode may be for legacy compatibility, or for the unusual case where only one interrupt is required (possibly saving one or two cycles of latency).
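To illustrate the software-cost side of that trade-off, here is a rough C sketch; the PENDING register address, the dispatch table, and the handler names are all made up:

    /* Hypothetical interrupt controller: one PENDING status register and a
       software dispatch table.  Addresses and names are invented for the sketch. */
    #include <stdint.h>

    #define IRQ_PENDING (*(volatile uint32_t *)0x40001000u)   /* hypothetical */

    typedef void (*irq_handler_t)(void);
    static irq_handler_t irq_table[32];        /* filled in by driver init code */

    /* Non-vectored: one common entry point; software must find the source. */
    void common_irq_entry(void)
    {
        uint32_t pending = IRQ_PENDING;
        for (int i = 0; i < 32; i++) {
            if ((pending & (1u << i)) && irq_table[i]) {
                irq_table[i]();                /* extra loads/branches add latency */
            }
        }
    }

    /* Vectored: hardware jumps straight here, so real work starts immediately. */
    void uart_irq_handler(void)                /* placed directly in the vector table */
    {
        /* ... actual handler code ... */
    }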

Precise exception

I was going through the book The Design and Implementation of the FreeBSD operating system and I came across this:
This ability to restart an instruction is called a precise exception.
The CPU may implement restarting by saving enough state when an
instruction begins that the state can be restored when a fault is
discovered. Alternatively, instructions could delay any modifications
or side effects until after any faults would be discovered so that the
instruction execution does not need to back up before restarting.
I could not understand what "modifications or side effects" refers to in the passage. Can anyone elaborate?
That description from the FreeBSD book is very OS-centric. I even disagree with its definition, "This ability to restart an instruction is called a precise exception": you don't restart an instruction after a power-failure exception. So rather than try to figure out what McKusick meant, I'm going to suggest going elsewhere for a better description of exceptions.
BTW, I prefer Smith's definition:
An interrupt is precise if the saved process state corresponds with the sequential model of program execution, where one instruction completes before the next begins.
https://lagunita.stanford.edu/c4x/Engineering/CS316/asset/smith.precise_exceptions.pdf
Almost all modern processors support precise exceptions, so the hard work is already done. What an OS must do is register a trap handler for the exceptions that the hardware takes; for example, there will be a page fault handler, a floating-point handler, and so on. To figure out what is necessary for these handlers you would have to read the processor's theory of operations.
Despite what seems like gritty systems detail, that is fairly high level. It doesn't say anything about what the hardware is doing and so the FreeBSD description is shorthanding a lot.
To really understand precise exceptions, you need to read about them in the context of pipelining, out-of-order and superscalar execution, and so on, in a computer architecture book. I'd recommend Computer Architecture: A Quantitative Approach, 6th ed. There is a section, "Dealing with Exceptions" (p. C-38), that presents a taxonomy of the different types of exceptions; the FreeBSD description covers only some of them. It then gets into how each exception type is handled by the pipeline.
Also, The Linux Programming Interface has three long chapters on the POSIX signal interface. I know it's not FreeBSD, but it covers what an application sees when, for example, a floating-point exception is taken and a SIGFPE signal is sent to the process.
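To make the "delay any modifications or side effects" sentence from the quote concrete, here is a toy software model (nothing here is a real ISA): a post-increment load commits neither the destination register nor the address bump until the memory access is known not to fault, so restarting it never requires undoing anything.

    /* Toy software model of the "delay any modifications or side effects"
       strategy; nothing here is a real ISA.  mem_ok() and mem_load() are
       hypothetical helpers that stand in for the MMU/memory system. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { uint32_t regs[32]; uint32_t pc; } cpu_t;

    extern bool     mem_ok(uint32_t addr);      /* would this load fault?   */
    extern uint32_t mem_load(uint32_t addr);

    /* One "instruction": load r[d] = mem[r[a]], then r[a] += 4 (post-increment) */
    bool exec_load_postinc(cpu_t *cpu, int d, int a)
    {
        uint32_t addr = cpu->regs[a];

        if (!mem_ok(addr))       /* fault discovered...                        */
            return false;        /* ...before any architectural state changed,
                                    so the instruction can simply be re-run    */

        cpu->regs[d] = mem_load(addr);   /* commit the load                    */
        cpu->regs[a] = addr + 4;         /* commit the side effect only now    */
        cpu->pc += 4;
        return true;
    }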

Where is hardware exception handling entry / exit code stored

I know this question seems very generic, as the answer can depend on the platform,
but I understand that with procedure/function calls, the assembly code that pushes the return address, local variables, etc. onto the stack can be part of either the caller or the callee.
When a hardware exception or interrupt occurs, though, the program counter gets the address of the exception handler via the exception table, but where is the actual code that stores the state, return address, etc.? Or is this done automatically at the hardware level for interrupts and exceptions?
Thanks in advance
Since you are asking about ARM and you tagged microcontroller, you might be talking about the ARM7TDMI but are probably talking about one of the Cortex-M cores. These work differently from the full-sized ARM architecture. The architecture reference manual associated with these cores (ARMv6-M or ARMv7-M, depending on the core) documents that the hardware conforms to the ABI, plus some extras for an interrupt: the return address (PC), the xPSR, and registers r0-r3, r12 and LR are all pushed onto the stack, which is unusual for an architecture to do in hardware. R14 (the link register), instead of getting the return address, is loaded with an invalid address of a specific pattern (EXC_RETURN, 0xFFFFFFFx), which is all part of the architecture. Unlike other processor IP, the address spaces on the Cortex-Ms are encouraged or dictated by ARM; that is why RAM usually starts at 0x20000000 on these parts and flash sits below that, with some exceptions where vendors place RAM in the "executable" range, pretending to be Harvard when it is really modified Harvard. This ties in with the link-register return address: depending on the manual, they either gloss over it or go into detail about what the patterns you find there mean.
Likewise, the layout of the vector table is spelled out: something like the first 16 entries are system/ARM exceptions, and device interrupts follow after that, where there can be up to 128 or 256 possible interrupts. But you have to look at the chip vendor's documentation (not ARM's) to see how many they exposed and what is tied to what. If you are not using those interrupts you don't have to leave a huge hole in your flash for vectors; just use that flash for your program (so long as you ensure you are never going to fire that exception or interrupt).
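To tie both points together, here is a minimal sketch of a Cortex-M style vector table and a plain C handler; it assumes a GCC toolchain and a linker script providing _estack and an .isr_vector section, and UART0_IRQHandler is just a made-up name:

    /* A minimal sketch, assuming a GCC toolchain and a linker script that
       provides _estack and places the .isr_vector section where the core
       fetches its vectors (address 0, or wherever VTOR points). */
    #include <stdint.h>

    extern uint32_t _estack;              /* initial stack pointer, from the linker script */

    void Reset_Handler(void) { for (;;) { } }   /* placeholder reset handler */

    /* Because the hardware stacks r0-r3, r12, lr, pc and xPSR itself and
       follows the AAPCS, a handler can be a perfectly ordinary C function.  */
    void UART0_IRQHandler(void)           /* hypothetical device interrupt */
    {
        /* ... actual handler code ... */
    }

    __attribute__((section(".isr_vector")))
    const void * const vector_table[] = {
        &_estack,                         /* entry 0: initial SP                       */
        Reset_Handler,                    /* entry 1: reset                            */
        /* entries 2..15: NMI, HardFault, ..., SysTick (system exceptions)             */
        /* entry 16 onward: device interrupts such as UART0_IRQHandler; trailing       */
        /* entries you never use can be dropped and the flash reused for program code  */
    };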
For function calls, which occur at well defined (synchronous) locations in the program, the compiler generates executable instructions to manage the return address, registers and local variables. These instructions are integrated with your function code. The details are hardware and compiler specific.
For a hardware exception or interrupt, which can occur at any location (asynchronous) in the program, managing the return address and registers is all done in hardware. The details are hardware specific.
Think about how a hardware exception/interrupt can occur at any point during the execution of a program. And then consider that if a hardware exception/interrupt required special instructions integrated into the executable code then those special instructions would have to be repeated everywhere throughout the program. That doesn't make sense. Hardware exception/interrupt management is handled in hardware.
The "code" isn't software at all; by definition the CPU has to do it itself internally because interrupts happen asynchronously. (Or for synchronous exceptions caused by instructions being executed, then the internal handling of that instruction is what effectively triggers it).
So it's microcode or hardwired logic inside the CPU that generates the stores of a return address on an exception, and does any other stuff that the architecture defines as happening as part of taking an exception / interrupt.
You might as well ask where the code is that pushes a return address when the call instruction executes: on x86, for example, the call instruction pushes return info onto the stack instead of overwriting a link register (the way most RISCs do).

cudaMemcpy D2D flag - semantics w.r.t. multiple devices, and is it necessary?

I've not had the need before to memcpy data between 2 GPUs. Now, I'm guessing I'm going to do it with cudaMemcpy() and the cudaMemcpyDeviceToDevice flag, but:
is the cudaMemcpyDeviceToDevice flag used both for copying data within a single device's memory space and between the memory spaces of all devices?
If it is,
How are pointers to memory on different devices distinguished? Is it using the specifics of the Unified Virtual Address Space mechanism?
And if that's the case, then
Why even have the H2D, D2H, D2D flags at all for cudaMemcpy? Doesn't it need to check which device it needs to address anyway?
Can't we implement a flag-free version of cudaMemcpy using cuPointerGetAttribute() from the CUDA low-level driver API?
For devices with UVA in effect, you can use the mechanism you describe. This doc section may be of interest (both the one describing device-to-device transfers as well as the subsequent section on UVA implications). Otherwise there is a cudaMemcpyPeer() API available, which has somewhat different semantics.
How are pointers to memory on different devices distinguished? Is it using the specifics of the Unified Virtual Address Space mechanism?
Yes, see the previously referenced doc sections.
Why even have the H2D, D2H, D2D flags at all for cudaMemcpy? Doesn't it need to check which device it needs to address anyway?
cudaMemcpyDefault is the transfer flag that was added when UVA first appeared, to enable the use of generically-flagged transfers, where the direction is inferred by the runtime upon inspection of the supplied pointers.
Can't we implement a flag-free version of cudaMemcpy using cuPointerGetAttribute() from the CUDA low-level driver API?
I'm assuming the generically-flagged method described above would meet whatever needs you have (or perhaps I'm not understanding this question).
Such discussions could engender the question, "Why would I ever use anything but cudaMemcpyDefault?"
One possible reason I can think of to use an explicit flag would be that the runtime API will do explicit error checking if you supply an explicit flag. If you're sure that a given invocation of cudaMemcpy would always be in a H2D transfer direction, for example, then explicitly using cudaMemcpyHostToDevice will cause the runtime API to throw an error if the supplied pointers do not match the indicated direction. Whether you attach any value to such a concept is probably a matter of opinion.
As a matter of lesser importance (IMO), code that uses the explicit flags does not depend on UVA being available, but such execution scenarios are "disappearing" with newer environments.
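For what it's worth, a trimmed sketch of the generically-flagged approach between two devices might look like this (error handling mostly omitted, and at least two UVA-capable devices assumed):

    /* A trimmed sketch of the generically-flagged approach between two GPUs;
       error handling is mostly omitted and at least two devices are assumed. */
    #include <cuda_runtime.h>
    #include <stddef.h>

    int copy_between_gpus(size_t bytes)
    {
        void *src = NULL, *dst = NULL;

        cudaSetDevice(0);
        cudaMalloc(&src, bytes);          /* allocation lives on device 0 */

        cudaSetDevice(1);
        cudaMalloc(&dst, bytes);          /* allocation lives on device 1 */

        /* With UVA, each pointer encodes which device owns it, so the runtime
           can resolve this as a device-to-device (peer) transfer on its own.  */
        cudaError_t err = cudaMemcpy(dst, src, bytes, cudaMemcpyDefault);

        /* The explicit-flag variant, cudaMemcpyDeviceToDevice, would also work
           here and additionally lets the runtime sanity-check the direction.   */

        cudaFree(src);
        cudaFree(dst);
        return err == cudaSuccess ? 0 : -1;
    }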

No LR and SPSR for EL0 in AArch64

In AArch64 there are four exception levels, EL0-EL3. The ARM documentation mentions that there are four stack pointers (SP_EL0/1/2/3) but only three exception link registers (ELR_EL1/2/3) and only three saved program status registers (SPSR_EL1/2/3).
Why are ELR_EL0 and SPSR_EL0 not required?
P.S. Sorry if this is a silly question. I am new to ARM architecture.
By design, exceptions cannot target EL0, so if it can't ever take an exception then it has no use for the machinery to return from one.
To expand on the reasoning a bit (glossing over the optional and more special-purpose higher exception levels), the basic design is that EL1 is where privileged system code runs, and EL0 is where unprivileged user code runs. Thus EL0 is by necessity far more restricted in what it can do, and wouldn't be very useful for handling architectural exceptions, i.e. low-level things requiring detailed knowledge of the system. Only privileged software (typically the OS kernel) should have access to the full hardware and software state necessary to decide whether handling that basic hardware exception means e.g. going and quietly paging something in from swap, versus delivering a "software exception"-type signal to the offending task to tell it off for doing something bad.
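To see the asymmetry in code: a handler like the following can only ever run at EL1 (GCC-style AArch64 toolchain assumed), and there is simply no EL0 counterpart of these registers for it to read, because EL0 never takes the exception in the first place.

    /* A tiny sketch, assuming a GCC-style AArch64 toolchain, of what an EL1
       exception handler does with the banked return machinery.  This code can
       only run at EL1; EL0 never takes exceptions, so there is nothing like
       ELR_EL0 or SPSR_EL0 for it to read. */
    #include <stdint.h>

    uint64_t faulting_pc_of_el0_task(void)
    {
        uint64_t elr, spsr;
        __asm__ volatile ("mrs %0, elr_el1"  : "=r"(elr));   /* where to return to  */
        __asm__ volatile ("mrs %0, spsr_el1" : "=r"(spsr));  /* saved PSTATE of EL0 */
        (void)spsr;               /* an OS would stash this for the eventual ERET */
        return elr;
    }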