What would happen if I were to divide by 0 when interrupts are disabled? - exception

When the IF flag is cleared (and the interrupt table is not ready), all maskable interrupts are disabled.
Questions are:
What happens if I trigger an exception (e.g. div by zero)?
What happens if a non-maskable interrupt arrives while the interrupt table is not ready? What will the CPU do?

Setting IF to 0 (with cli, popf, iret, etc.) only inhibits external interrupts.
To quote the documentation (Intel® 64 and IA-32 Architectures
Software Developer’s Manual Volume 2A, Chapter 3.2, for the cli instruction):
Clearing the IF flag causes the processor to ignore maskable external interrupts. The IF flag and the CLI and STI instruction have no effect on the generation of exceptions and NMI interrupts.
If your interrupt table (IDT or IVT) is insufficiently configured, you're likely to get a double fault that cascades into a triple fault on an instruction that raises a #DE divide exception (interrupt 0).
NMIs don't normally happen without some prior set up to enable something that will raise one. (e.g. enabling a CPU performance counter event.)
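A toy model in C may make the rule from the manual concrete. This is only a sketch: the enum and the function name cpu_attempts_delivery are made up for illustration, and "attempts delivery" says nothing about whether the IDT entry it finds is usable.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: the three event classes discussed above. */
enum event_kind {
    EVENT_MASKABLE_IRQ,  /* external interrupt, e.g. from the PIC or APIC */
    EVENT_NMI,           /* non-maskable interrupt */
    EVENT_EXCEPTION      /* raised by an instruction, e.g. #DE */
};

/* Hypothetical helper: would the CPU try to deliver this event,
 * given the current value of the IF flag? */
bool cpu_attempts_delivery(enum event_kind kind, bool if_flag)
{
    switch (kind) {
    case EVENT_MASKABLE_IRQ:
        return if_flag;      /* the only class that IF controls */
    case EVENT_NMI:
    case EVENT_EXCEPTION:
        return true;         /* unaffected by cli / sti */
    }
    assert(!"unknown event kind");
    return false;
}
```

With if_flag false, only EVENT_MASKABLE_IRQ is held back. A #DE still goes looking for its handler, which is where the double and triple fault come from if the table isn't set up.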

Related

Are there any CPU-state bits indicating being in an exception/interrupt handler in x86 and x86-64?

Are there any CPU-state bits indicating being in an exception/interrupt handler in x86 and x86-64? In other words, can we tell whether the main thread or exception handler is currently executed based only on the CPU registers' state?
No, there's no bit in the CPU itself (e.g. a control register) that means "we're in an exception or interrupt handler".
But there is hidden state indicating that you're in an NMI (Non-Maskable Interrupt) handler. Since you can't block them by disabling interrupts, and unblockable arbitrary nesting of NMIs would be inconvenient, another NMI won't get delivered until you run an iret. That means any iret: if an exception (like #DE div by 0) happens during an NMI handler and that exception handler itself returns with iret, NMIs are unblocked even if you're not done handling the NMI. See "The x86 NMI iret problem" on LWN.
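The hidden NMI state can be pictured with a small C model. It is a sketch, not kernel code: the struct and function names are invented, and it assumes that one NMI arriving while blocked is held and delivered after the next iret.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the hidden "NMIs blocked" state; all names are hypothetical. */
struct cpu_model {
    bool nmi_blocked;     /* set when an NMI is delivered */
    bool nmi_pending;     /* assumption: one NMI can be held while blocked */
    int  nmis_delivered;  /* how many times the NMI handler was entered */
};

/* An NMI arrives from outside the core. */
void model_raise_nmi(struct cpu_model *cpu)
{
    assert(cpu);
    if (cpu->nmi_blocked) {
        cpu->nmi_pending = true;   /* held until the next iret */
    } else {
        cpu->nmi_blocked = true;
        cpu->nmis_delivered++;
    }
}

/* Any iret clears the blocked state, whichever handler executes it. */
void model_iret(struct cpu_model *cpu)
{
    assert(cpu);
    cpu->nmi_blocked = false;
    if (cpu->nmi_pending) {        /* the held NMI is delivered now */
        cpu->nmi_pending = false;
        cpu->nmi_blocked = true;
        cpu->nmis_delivered++;
    }
}
```

If model_iret is called by a nested #DE handler while the first NMI handler is still running, the second NMI is delivered on top of the first, which is the situation the LWN article is about.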
For normal interrupts, you can disable interrupts (cli) if you don't want another interrupt to be delivered while this one is being handled.
However, the interrupt controller (logically outside the CPU core, but actually part of modern CPUs) may need to be told when you're done handling an external interrupt. (Not a software-interrupt or exception). https://wiki.osdev.org/IDT_problems#I_can_only_receive_one_IRQ shows the outb instructions needed to keep the legacy PIC happy. (I don't know if this applies to more modern ways of doing interrupts, like MSI-X message-signalled interrupts.
That part of the OSdev wiki page might be specific to toy OSes that let the BIOS emulate legacy IBM-PC stuff.) But either way, that's only for external interrupts like PS/2 keyboard controller, hard drive DMA complete, or whatever (not exceptions), so it's unrelated to your "Are Linux system calls executed inside an exception handler?" question.
The lack of exception-state means there's no special instruction you have to run to "acknowledge" an exception before calling schedule() from what was an interrupt handler. All you have to do is make sure interrupts are enabled or not when they should or shouldn't be. (sti / cli, or pushf / popf to save/restore the old interrupt state.) And of course that your software data structures remain consistent and appropriate for what you're doing. But there isn't anything you have to do specifically to keep the CPU happy.
It's not like with user-space where a signal handler should tell the OS it's done instead of just jumping somewhere and running indefinitely. (In Linux, a signal handler can modify the main-thread program-counter so sigreturn(2) resumes execution somewhere other than where you were when it was delivered.) If POSIX or Linux signals were the (mental) model you were wondering about for interrupts/exceptions, no, it's not like that.
There is an interrupt-priority mechanism (CR8 in x86-64, or the LAPIC TPR (Task Priority Register)), but it does not automatically get set when the CPU delivers an interrupt. You can set it once (e.g. if you have a lot of high-priority interrupts to process on this core) and it persists across interrupts. (How is CR8 register used to prioritize interrupts in an x86-64 CPU?).
It's just a filter on what interrupt-numbers can get delivered to this core when interrupts are enabled (sti, IF=1 bit in RFLAGS). Apparently Windows makes some use of it, or did back in 2007, but Linux doesn't (or didn't).
It's not like you have to tell the CPU / LAPIC that you're done with this interrupt so it's ok for it to deliver another interrupt of this or lower priority.
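As a rough illustration of CR8 as a filter, here is a toy C predicate. The function name is made up, and it assumes the usual rule that an external interrupt's priority class is the high four bits of its vector and must be greater than the CR8 value to be delivered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model (hypothetical helper): would an external interrupt with this
 * vector be delivered to the core? cr8 holds a priority class, 0..15. */
bool model_irq_deliverable(uint8_t vector, uint8_t cr8, bool if_flag)
{
    assert(cr8 <= 15);
    uint8_t priority_class = (uint8_t)(vector >> 4);  /* high nibble */
    return if_flag && priority_class > cr8;
}
```

Nothing in this predicate changes when an interrupt is delivered: cr8 stays at whatever software last wrote, which is the point made above.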

RISC-V return from exception handler with compressed instructions

I see the standard way of exiting a RISC-V exception handler is to update mepc to mepc+4 before mret.
But won't this cause a problem if the next instruction is only 2 bytes long in compressed instruction mode?
In compressed instruction mode there is a mix of 4-byte and 2-byte instructions. If you don't update mepc and just mret, then you keep getting the same exception. But always adding 4 to the trapped mepc seems like a bug with mixed compressed instructions.
Am I missing something?
I see the standard way of exiting a RISC-V exception handler is to update mepc to mepc+4 before mret.
These are not serious exception handlers; they are illustrative only — to show catching of exceptions at all, and returning back to the interrupted code without having done the actual exception processing needed for the given situation. Thus, the easiest thing to do to prevent an infinite loop is to simply skip past the offending instruction.
One of the few places where we advance the pc in order to return to the code that caused the exception is in handling an ecall. As far as I know there is no compressed (16-bit) ecall instruction.
Many resumable exceptions need to rerun the instruction that caused the exception — loads & stores that cause page faults (available in both 32-bit and 16-bit form), for example, need to re-execute once the page tables have been fixed (the page read in from disc and mapped into the user's address space).
Many other exceptions are not generally resumable.
However, emulation of an instruction requires knowing its size, as is the case with ecall. If you choose to emulate, for example, misaligned memory accesses, you will indeed have to make a decision as to the size of the instruction, as emulating it means resuming past it. Also note that RISC-V supports 16-bit, 32-bit, 48-bit, 64-bit and longer instructions, so an exception handler that is going to emulate instructions will need to be able to decode their length (only the instructions chosen for emulation, though).
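A minimal sketch of such a length decoder in C, based on the length encoding in the low bits of the first 16-bit parcel. The function names are made up for illustration, and encodings longer than 64 bits are deliberately left undecoded.

```c
#include <assert.h>
#include <stdint.h>

/* Length in bytes of the instruction whose first (lowest-addressed) 16-bit
 * parcel is given, or 0 for encodings longer than 64 bits, which this
 * sketch does not decode. */
unsigned riscv_insn_length(uint16_t parcel)
{
    if ((parcel & 0x03) != 0x03) return 2;  /* bits [1:0] != 11        */
    if ((parcel & 0x1c) != 0x1c) return 4;  /* bits [4:2] != 111       */
    if ((parcel & 0x3f) == 0x1f) return 6;  /* bits [5:0] == 011111    */
    if ((parcel & 0x7f) == 0x3f) return 8;  /* bits [6:0] == 0111111   */
    return 0;
}

/* Hypothetical helper: the value to write to mepc in order to resume
 * after the trapping instruction instead of re-executing it. */
uint64_t riscv_next_pc(uint64_t mepc, uint16_t parcel)
{
    unsigned len = riscv_insn_length(parcel);
    assert(len != 0);
    return mepc + len;
}
```

A handler would read the 16-bit parcel at mepc and call riscv_next_pc rather than always adding 4. For ecall (0x00000073) the result is mepc+4, and for any compressed instruction it is mepc+2.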
The other thing to add is that the sample exception handlers you may be looking at are designed to work without the compressed instruction set, and since RVC is optional that is a reasonable design choice (though ideally, of course, would be clearly stated).

How are pending exceptions managed by the RISC-V specification?

I'm working with the RISC-V specification and have a problem with pending interrupts/exceptions. I'm reading version 1.10 of volume II, published on May 7, 2017.
In section 3.1.14, describing the registers mip and mie it is said that:
Multiple simultaneous interrupts and traps at the same privilege level are handled in the following decreasing priority order: external interrupts, software interrupts, timer interrupts, then finally any synchronous traps.
Up until that point I thought that exceptions, e.g. a misaligned instruction fetch exception on a JAL/JALR instruction, would be handled immediately by a trap because
a) there is no way to continue executing your stream of instructions and
b) there is no description of how an exception could be pending, i.e. there are no concepts described by the specification that could manage state for exceptions (for example registers like mip but for exceptions).
However, the paragraph cited above indicates something different.
My questions are:
Are there pending exceptions in RISC-V?
If yes, how is it possible that the exception still can be handled after an interrupt was handled and isn't forgotten?
In my opinion there are pending exceptions in RISC-V, exactly for the reason you stated. It is a matter of semantics: if two events occur simultaneously and one is deferred, it must be pending. One must cater for the possibility of an asynchronous event (interrupt) occurring simultaneously with a trap, and (by section 3.1.14) the asynchronous event has priority. Depending on the implementation, one does not necessarily need to save any state in this case: after the interrupt is handled, the instruction that triggers a trap is re-fetched, and duly leads to an exception. In my view section 3.1.14 describes the serialization of asynchronous events.
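The priority order can be sketched as a small C function. This is a simplified machine-mode-only model: the enum and function names are invented, and global_ie stands in for mstatus.MIE. The bit positions are the machine-level MSIP, MTIP and MEIP bits of mip/mie.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MIP_MSIP (1u << 3)    /* machine software interrupt pending */
#define MIP_MTIP (1u << 7)    /* machine timer interrupt pending    */
#define MIP_MEIP (1u << 11)   /* machine external interrupt pending */

enum trap_choice {
    TRAP_NONE, TRAP_EXTERNAL, TRAP_SOFTWARE, TRAP_TIMER, TRAP_SYNCHRONOUS
};

/* Hypothetical helper: which trap is taken at this instruction?
 * sync_exception is true if the instruction itself would fault. */
enum trap_choice select_trap(uint32_t mip, uint32_t mie,
                             bool global_ie, bool sync_exception)
{
    uint32_t ready = global_ie ? (mip & mie) : 0;
    if (ready & MIP_MEIP) return TRAP_EXTERNAL;
    if (ready & MIP_MSIP) return TRAP_SOFTWARE;
    if (ready & MIP_MTIP) return TRAP_TIMER;
    if (sync_exception)   return TRAP_SYNCHRONOUS;  /* lowest priority */
    return TRAP_NONE;
}
```

No state is kept for the synchronous exception. If a timer interrupt wins, the handler returns to the same pc, the instruction is fetched again, and a second call with the timer bit cleared yields TRAP_SYNCHRONOUS.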

Understanding hardware interrupts and exceptions at processor and hardware level

After a lot of reading about interrupt handling etcetera, I still can't figure out the full process of interrupt handling from the very beginning.
For example:
A division by zero.
The CPU fetches the instruction to divide a number by zero and sends it to the ALU.
Assuming the ALU started the division or ran some checks before starting it:
How is the exception signaled to the CPU?
How does the CPU know which exception has occurred from only a one-bit signal? Is there a register that it reads after it gets interrupted to find out?
2. How does my application catch the exception?
Do I need to write some function to catch a specific SIGNAL or something else? And when I write an exception handling routine like
Try {}
Catch {}
and an exception occurs, how can I know which exception was thrown and handle it well?
The most important part that bugs me is, for example, when an interrupt is signaled from the keyboard to the PIC, the PIC in turn signals to the CPU that an interrupt occurred by changing the INT wire.
But how does the CPU know which device needs to be served?
What does the CPU do when its INTR pin turns on?
Does it have a routine that checks some register holding the value of the interrupt (set by the PIC when it turns on the INT wire)?
Please don't ban the post, it's really important for me to understand this topic. I read and researched for a couple of weeks but cannot connect the dots in my head.
Thanks.
There are typically several things associated with interrupts other than just a pin. Normally, on more recent microcontrollers, there is an interrupt vector table placed in memory that holds the address of each interrupt handler, and a register that flags the interrupt event.
When an event that is handled by an interrupt occurs, a specific flag is set. Depending on priorities and the current state of the CPU, the time until the context switch may vary: for example, a low-priority interrupt flagged during a higher-priority interrupt will have to wait until the high-priority interrupt is finished. If nesting is possible, then higher-priority interrupts may interrupt lower-priority interrupts.
In the particular case of exceptions like dividing by 0, which indeed would be detected by the ALU, the CPU may or may not offer a dedicated interrupt that is called on events like this. For other types of exceptions an interrupt might not be available and the CPU would just act accordingly, for example by rebooting.
In conclusion, the interrupt events would occur in the following manner:
The interrupt event occurs and the corresponding flag in the register is set.
When the time comes, the CPU will switch context to the interrupt handler function.
At the end of the handler the interrupt flag is cleared and the CPU is ready to flag the interrupt again when the next event comes.
Deciding between interrupts arriving at the same time or different priority interrupts varies with different hardware.
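The flag register plus vector table described above can be sketched in C roughly as follows. Everything here is hypothetical (the struct, the function names, the rule that the lowest source number has the highest priority, and the sample keyboard handler). Real hardware does the dispatch step itself.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_VECTORS 8

typedef void (*irq_handler_fn)(void);

struct irq_controller {
    uint8_t        pending;              /* one flag bit per source   */
    irq_handler_fn vector[NUM_VECTORS];  /* handler address per source */
};

int keyboard_events;                     /* state touched by the sample handler */
void keyboard_handler(void) { keyboard_events++; }

/* A device raises its line: only the flag is set at this point. */
void irq_raise(struct irq_controller *ic, unsigned source)
{
    assert(source < NUM_VECTORS);
    ic->pending |= (uint8_t)(1u << source);
}

/* Service the highest-priority pending source (lowest number wins in this
 * sketch). Returns the source serviced, or -1 if nothing was pending. */
int irq_dispatch(struct irq_controller *ic)
{
    for (unsigned s = 0; s < NUM_VECTORS; s++) {
        if (ic->pending & (1u << s)) {
            if (ic->vector[s])
                ic->vector[s]();                 /* run the handler   */
            ic->pending &= (uint8_t)~(1u << s);  /* then clear the flag */
            return (int)s;
        }
    }
    return -1;
}
```

The index into vector[] is what tells the CPU which device to serve. A lower-priority source that was flagged in the meantime simply stays set in pending until a later irq_dispatch call reaches it.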
It may be simplest to understand interrupts if one starts with the way they work on the Z80 in its simplest interrupt mode. That processor checks the state of a pin called /INT at a certain point during each instruction; if the pin is asserted and an "interrupt enabled" flag is set, then when it is time to fetch the next instruction the processor won't advance the program counter or read a byte from memory, but will instead disable the "interrupt enabled" flag and "pretend" that it read an "RST 38h" instruction. That instruction behaves like a single-byte "CALL 0038h" instruction, pushing the program counter and transferring control to that address.
Code at 0038h can then poll various peripherals to see if they need any service, use an "ei" instruction to turn the "interrupt enabled" flag back on, and perform a "ret". If no peripheral still has an immediate need for service at that point, code can then resume with whatever it was doing before the interrupt occurred. To prevent problems if the interrupt line is still asserted when the "ret" is executed, some special logic will ensure that the interrupt line will be ignored during that instruction (or any other instruction which immediately follows "ei"). If another peripheral has developed a need for service while the interrupt handler was running, the system will return to the original code, notice the state of /INT while it processes the first instruction after returning, and then restart the sequence with the RST 38h.
In the simple Z80 approach, there is only one kind of interrupt; any peripheral can assert /INT, and if any peripheral does so the Z80 will need to ask every peripheral if it wants attention. In more advanced systems, it's possible to have many different interrupts, so that when a peripheral needs service control can be dispatched to a routine which is designed to handle just that peripheral. The same general principles still apply, however: an interrupt effectively inserts a "call" instruction into whatever the processor was doing, does something to ensure that the processor will be able to service whatever needed attention without continuously interrupting that process [on the Z80, it simply disables interrupts, but systems with multiple interrupt sources can leave higher-priority sources enabled while servicing lower ones], and then returns to whatever the processor had been doing while re-enabling interrupts.
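The "inserted call" can be written out as a small C model of the acceptance step. It is a sketch with invented names, covering only the simple mode described above and ignoring timing and the one-instruction delay after "ei".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the relevant Z80 state; names are hypothetical. */
struct z80_model {
    uint16_t pc;
    uint16_t sp;
    bool     iff;          /* the "interrupt enabled" flag */
    uint8_t  mem[65536];
};

/* Called at an instruction boundary with the state of the interrupt pin.
 * Returns true if the interrupt was taken. */
bool z80_accept_interrupt(struct z80_model *cpu, bool int_line_asserted)
{
    assert(cpu);
    if (!int_line_asserted || !cpu->iff)
        return false;
    cpu->iff = false;                                 /* no re-entry until "ei" */
    cpu->mem[--cpu->sp] = (uint8_t)(cpu->pc >> 8);    /* push return address,  */
    cpu->mem[--cpu->sp] = (uint8_t)(cpu->pc & 0xff);  /* high byte first       */
    cpu->pc = 0x0038;                                 /* as if RST 38h was read */
    return true;
}
```

After this step the handler at 0038h runs like any other subroutine. Its final "ret" pops the saved address, so the interrupted code continues as if nothing had happened.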

Do spin locks always require a memory barrier? Is spinning on a memory barrier expensive?

I wrote some lock-free code that works fine with local
reads, under most conditions.
Does local spinning on a memory read necessarily imply I
have to ALWAYS insert a memory barrier before the spinning
read?
(To validate this, I managed to produce a reader/writer
combination which results in a reader never seeing the
written value, under certain very specific
conditions--dedicated CPU, process attached to CPU,
optimizer turned all the way up, no other work done in the
loop--so the arrows do point in that direction, but I'm not
entirely sure about the cost of spinning through a memory
barrier.)
What is the cost of spinning through a memory barrier if
there is nothing to be flushed in the cache's store buffer?
i.e., all the process is doing (in C) is
while ( 1 ) {
    __sync_synchronize();
    v = value;
    if ( v != 0 ) {
        ... something ...
    }
}
Am I correct to assume that it's free and it won't encumber
the memory bus with any traffic?
Another way to put this is to ask: does a memory barrier do
anything more than: flush the store buffer, apply the
invalidations to it, and prevent the compiler from
reordering reads/writes across its location?
Disassembling, __sync_synchronize() appears to translate into:
lock orl
From the Intel manual (similarly nebulous for the neophyte):
Volume 3A: System Programming Guide, Part 1 -- 8.1.2
Bus Locking
Intel 64 and IA-32 processors provide a LOCK# signal that
is asserted automatically during certain critical memory
operations to lock the system bus or equivalent link.
While this output signal is asserted, requests from other
processors or bus agents for control of the bus are
blocked.
[...]
For the P6 and more recent processor families, if the
memory area being accessed is cached internally in the
processor, the LOCK# signal is generally not asserted;
instead, locking is only applied to the processor’s caches
(see Section 8.1.4, “Effects of a LOCK Operation on
Internal Processor Caches”).
My translation: "when you say LOCK, this would be expensive, but we're
only doing it where necessary."
@BlankXavier:
I did test that if the writer does not explicitly push out the write from the store buffer and it is the only process running on that CPU, the reader may never see the effect of the writer (I can reproduce it with a test program, but as I mentioned above, it happens only with a specific test, with specific compilation options and dedicated core assignments--my algorithm works fine, it's only when I got curious about how this works and wrote the explicit test that I realized it could potentially have a problem down the road).
I think by default simple writes are WB writes (Write Back), which means they don't get flushed out immediately, but reads will take their most recent value (I think they call that "store forwarding"). So I use a CAS instruction for the writer. I discovered in the Intel manual all these different types of write implementations (UC, WC, WT, WB, WP), Intel vol 3A chap 11-10, still learning about them.
My uncertainty is on the reader's side: I understand from McKenney's paper that there is also an invalidation queue, a queue of incoming invalidations from the bus into the cache. I'm not sure how this part works. In particular, you seem to imply that looping through a normal read (i.e., non-LOCK'ed, without a barrier, and using volatile only to ensure the optimizer leaves the read once compiled) will check into the "invalidation queue" every time (if such a thing exists). If a simple read is not good enough (i.e. could read an old cache line which still appears valid pending a queued invalidation (that sounds a bit incoherent to me too, but how do invalidation queues work then?)), then an atomic read would be necessary and my question is: in this case, will this have any impact on the bus? (I think probably not.)
I'm still reading my way through the Intel manual and while I see a great discussion of store forwarding, I haven't found a good discussion of invalidation queues. I've decided to convert my C code into ASM and experiment, I think this is the best way to really get a feel for how this works.
The "xchg reg,[mem]" instruction will signal its lock intention over the LOCK pin of the core. This signal weaves its way past other cores and caches down to the bus-mastering buses (PCI variants etc) which will finish what they are doing and eventually the LOCKA (acknowledge) pin will signal the CPU that the xchg may complete. Then the LOCK signal is shut off. This sequence can take a long time (hundreds of CPU cycles or more) to complete. Afterwards the appropriate cache lines of the other cores will have been invalidated and you will have a known state, i.e. one that has been synchronized between the cores.
The xchg instruction is all that is necessary to implement an atomic lock. If the lock itself is successful you have access to the resource that you have defined the lock to control access to. Such a resource could be a memory area, a file, a device, a function or what have you. Still, it is always up to the programmer to write code that uses this resource when it's been locked and doesn't when it hasn't. Typically the code sequence following a successful lock should be made as short as possible such that other code will be hindered as little as possible from acquiring access to the resource.
Keep in mind that if the lock wasn't successful you need to try again by issuing a new xchg.
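In C the same exchange-and-retry pattern can be sketched with C11 atomic_flag. The struct and function names are made up for illustration. On x86, compilers typically implement the test-and-set as an xchg, but that is a property of the compiler and target, not something this code guarantees.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal spinlock built on an atomic exchange. */
struct spinlock { atomic_flag held; };

/* One attempt: atomically set the flag and look at its previous value.
 * Returns true if the lock was free and is now ours. */
bool spin_trylock(struct spinlock *l)
{
    assert(l);
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

void spin_lock(struct spinlock *l)
{
    while (!spin_trylock(l))
        ;   /* lock was already held: issue the exchange again */
}

void spin_unlock(struct spinlock *l)
{
    assert(l);
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A lock would be declared as struct spinlock l = { ATOMIC_FLAG_INIT }; and the code between spin_lock and spin_unlock kept as short as possible.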
"Lock free" is an appealing concept but it requires the elimination of shared resources. If your application has two or more cores simultaneously reading from and writing to a common memory address "lock free" is not an option.
I may well not properly have understood the question, but...
If you're spinning, one problem is the compiler optimizing your spin away. Volatile solves this.
The memory barrier, if you have one, will be issued by the writer to the spin lock, not the reader. The writer doesn't actually have to use one - doing so ensures the write is pushed out immediately, but it'll go out pretty soon anyway.
The barrier also prevents the thread executing that code from reordering memory accesses across its location, which is its other cost.
Keep in mind that barriers typically are used to order sets of memory accesses, so your code could very likely also need barriers in other places. For example, it wouldn't be uncommon for the barrier requirement to look like this instead:
while ( 1 ) {
    v = pShared->value;
    __acquire_barrier() ;
    if ( v != 0 ) {
        foo( pShared->something ) ;
    }
}
This barrier would prevent loads and stores in the if block (ie: pShared->something) from executing before the value load is complete. A typical example is that you have some "producer" that used a store of v != 0 to flag that some other memory (pShared->something) is in some other expected state, as in:
pShared->something = 1 ; // was 0
__release_barrier() ;
pShared->value = 1 ; // was 0
In this typical producer consumer scenario, you'll almost always need paired barriers, one for the store that flags that the auxiliary memory is visible (so that the effects of the value store aren't seen before the something store), and one barrier for the consumer (so that the something load isn't started before the value load is complete).
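The paired barriers can be expressed portably with C11 <stdatomic.h>. This is a sketch rather than the code from the question: the struct layout and the names publish and try_consume are assumptions, and the release store and acquire load stand in for the __release_barrier() and __acquire_barrier() placeholders above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct shared {
    int        something;  /* payload: plain, non-atomic data       */
    atomic_int value;      /* flag: nonzero once the payload is set */
};

/* Producer: write the payload, then publish the flag with release
 * ordering so the payload store cannot become visible after it. */
void publish(struct shared *p, int payload)
{
    assert(p);
    p->something = payload;
    atomic_store_explicit(&p->value, 1, memory_order_release);
}

/* Consumer: one polling step. Returns true and copies the payload once
 * the flag is seen. The acquire load keeps the payload read after it. */
bool try_consume(struct shared *p, int *out)
{
    assert(p && out);
    if (atomic_load_explicit(&p->value, memory_order_acquire) == 0)
        return false;
    *out = p->something;
    return true;
}
```

A spinning reader would call try_consume in a loop. Using the atomic type also stops the compiler from hoisting the load out of the loop, so volatile is not needed for that purpose.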
Those barriers are also platform specific. For example, on PowerPC (using the xlC compiler), you'd use __isync() and __lwsync() for the consumer and producer respectively. What barriers are required may also depend on the mechanism that you use for the store and load of value. If you've used an atomic intrinsic that results in an Intel LOCK prefix (perhaps implicit), then this will introduce an implicit barrier, so you may not need anything. Additionally, you'll likely also need judicious use of volatile (or preferably an atomic implementation that does so under the covers) in order to get the compiler to do what you want.