What is a latent error - an example? - terminology

Quoting a document dealing with the taxonomy of threats etc.:
An error is detected if its presence is indicated by an error message
or error signal. Errors that are present but not detected are latent
errors.
Please note that this is not the same as a dormant fault, which is a defect in the code that is activated by certain events and produces an error when it is activated.
Also, a latent error is an error caused by a fault but not causing a failure. I guess it is common in multi-layer applications, yet I cannot think of an example. But there is one more thing I do not understand: eventually it has to cause a failure, otherwise it would never be discovered at all, don't you think?

Good example of a latent error
In 2005 a Boeing 777-2H6ER aircraft with the registration 9M-MRG, serial number 28414, operating as Malaysia Airlines Flight 124 flying from Perth to Kuala Lumpur experienced an ADIRU (air data inertial reference unit) fault resulting in uncommanded manoeuvres by the aircraft acting on false indications.
In that incident the incorrect data impacted all planes of movement while the aircraft was climbing through 38,000 feet (11,600 m). The aircraft pitched up and climbed to around 41,000 feet (12,500 m), with the stall warning activated. The pilots recovered the aircraft with the autopilot disengaged and requested a return to Perth. During the return to Perth, both the left and right autopilots were briefly activated by the crew, but in both instances the aircraft pitched down and banked to the right.
The aircraft was flown manually for the remainder of the flight and landed safely in Perth. There were no injuries and no damage to the aircraft. The ATSB (Australian Transport Safety Bureau) found that the main probable cause of this incident was a latent software error which allowed the ADIRU to use data from a failed accelerometer. The US Federal Aviation Administration issued Emergency Airworthiness Directive (AD) 2005-18-51 requiring all 777 operators to install upgraded software to resolve the error.
Source: https://en.wikipedia.org/wiki/Malaysia_Airlines_Flight_370#Aircraft

I did some reading:
The terms "active" and "latent" as applied to errors were coined by James Reason.(1,2) Latent errors (or latent conditions) refer to less apparent failures of organisation or design that contributed to the occurrence of errors or allowed them to cause harm to workers. For instance, whereas the active failure in a particular adverse event may have been a mistake in programming a logic controller, a latent error might be that the institution uses multiple different software code, making programming errors more likely. Thus, latent errors are quite literally "accidents waiting to happen."
Latent errors are sometimes referred to as errors at the "blunt end," referring to the many layers of the safety management system that affect the person carrying out the task. Active failures, in contrast, are sometimes referred to as errors at the "sharp end," or the personnel involved in the performance of the task.
So, applying the above to software, to me it means:
Error signal - defect manifesting itself into fault of some sort
Latent error - root cause with side effects; side effects are considered detected errors
I guess your example (search in my app is not case sensitive but should be) qualifies quite well to be called a "latent error". Its active failure could be something like "search results are cluttered with irrelevant stuff".

Related

Precise exception

I was going through the book The Design and Implementation of the FreeBSD operating system and I came across this:
This ability to restart an instruction is called a precise exception.
The CPU may implement restarting by saving enough state when an
instruction begins that the state can be restored when a fault is
discovered. Alternatively, instructions could delay any modifications
or side effects until after any faults would be discovered so that the
instruction execution does not need to back up before restarting.
I could not understand what
modifications or side effects
refers to in the passage. Can anyone elaborate?
That description from the FreeBSD book is very OS-centric. I even disagree with its definition, "This ability to restart an instruction is called a precise exception." You don't restart an instruction after a power failure exception. So rather than try to figure out what McKusick meant, I'm gonna suggest going elsewhere for a better description of exceptions.
BTW, I prefer Smith's definition:
An interrupt is precise if the saved process state corresponds with
the sequential model of program execution where one instruction completes before the next begins.
https://lagunita.stanford.edu/c4x/Engineering/CS316/asset/smith.precise_exceptions.pdf
Almost all modern processors support precise exceptions. So the hard work is already done. What an OS must do is register a trap handler for the exceptions that the hardware takes. For example, there will be a page fault handler, floating point, .... To figure out what is necessary for these handlers you'd have to read the processor theory of operations.
Despite what seems like gritty systems detail, that is fairly high level. It doesn't say anything about what the hardware is doing and so the FreeBSD description is shorthanding a lot.
To really understand precise exceptions, you need to read about that in the context of pipelining, out of order, superscalar, ... in a computer architecture book. I'd recommend Computer Architecture A Quantitative Approach 6th ed. There's a section Dealing With Exceptions p C-38 that presents a taxonomy of the different types of exceptions. The FreeBSD description is only describing some exceptions. It then gets into how each exception type is handled by the pipeline.
Also, The Linux Programming Interface has 3 long chapters on the POSIX signal interface. I know it's not FreeBSD but it covers what an application will see when, for example, a floating point exception is taken and a SIGFPE signal is sent to the process.
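To make the application-side view concrete, here is a minimal sketch of catching SIGFPE with the POSIX sigaction() call. The function and variable names are hypothetical, and the signal is meant to be delivered with raise() for demonstration, since returning from a handler for a hardware-generated SIGFPE has undefined behaviour.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Flag set by the handler; sig_atomic_t is safe to write from a handler. */
static volatile sig_atomic_t got_sigfpe = 0;

/* Handler: only records that the signal arrived. */
static void on_sigfpe(int signo)
{
    (void)signo;
    got_sigfpe = 1;
}

/* Hypothetical helper: installs the handler, returns 0 on success. */
int install_sigfpe_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigfpe;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGFPE, &sa, NULL);
}

/* Hypothetical helper: reports whether SIGFPE has been seen so far. */
int sigfpe_seen(void)
{
    return got_sigfpe;
}
```

Calling install_sigfpe_handler() and then raise(SIGFPE) should leave sigfpe_seen() returning 1. A real handler for a hardware fault would normally log and terminate, or jump out with siglongjmp(), rather than return.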

No LR and SPSR for EL0 in Aarch64

In AArch64, there are 4 exception levels, viz. EL0-3. The ARM site mentions there are 4 stack pointers (SP_EL0/1/2/3) but only 3 exception link registers (ELR_EL1/2/3) and only 3 saved program status registers (SPSR_EL1/2/3).
Why are ELR_EL0 and SPSR_EL0 not required?
P.S. Sorry if this is a silly question. I am new to ARM architecture.
By design, exceptions cannot target EL0, so if it can't ever take an exception then it has no use for the machinery to be able to return from one.
To expand on the reasoning a bit (glossing over the optional and more special-purpose higher exception levels), the basic design is that EL1 is where privileged system code runs, and EL0 is where unprivileged user code runs. Thus EL0 is by necessity far more restricted in what it can do, and wouldn't be very useful for handling architectural exceptions, i.e. low-level things requiring detailed knowledge of the system. Only privileged software (typically the OS kernel) should have access to the full hardware and software state necessary to decide whether handling that basic hardware exception means e.g. going and quietly paging something in from swap, versus delivering a "software exception"-type signal to the offending task to tell it off for doing something bad.

An example of dormant fault?

I have been thinking about the dormant fault and cannot figure out an example. By definition, a dormant fault is a fault (a defect in the code) that does not cause an error and thus does not cause a failure. Can anyone give me an example? The only thing that crossed my mind was unused buggy code.
Thanks
Dormant faults are much more common than one might think. Most programmers have experienced moments of thinking "What was I thinking? How could that ever run correctly?", even though the code didn't show erroneous behaviour. A classic case is faulty corner-case handling, e.g. on failed memory allocation:
char *foo = malloc(42);
strcpy( foo, "BarBaz" );
The above code will work fine in most situations and pass tests just fine; however, when malloc fails due to memory exhaustion, it will fail miserably. The fault is there, but dormant.
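For contrast, a minimal sketch of the same snippet with the dormant fault removed, assuming the caller is able to cope with a NULL result. The function name make_label is hypothetical and not part of the original answer.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap copy of "BarBaz", or NULL if the
 * allocation fails. The caller owns the result and must free() it. */
char *make_label(void)
{
    char *foo = malloc(42);
    if (foo == NULL) {
        /* The corner case the original snippet silently ignores. */
        return NULL;
    }
    strcpy(foo, "BarBaz");   /* 7 bytes including the terminator fit in 42 */
    return foo;
}
```

The fault is gone only if every caller actually checks the return value; otherwise it has merely moved one level up.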
Dormant faults are simply ones that don't get revealed until you send the right input [edit: or circumstances] to the system.
A classic example is from Therac-25. The race condition caused by an unlikely set of keys on input didn't occur until technicians became "fluent" with using the system. They memorized the key strokes for common treatments, which means they could enter them very quickly.
Some other ones that come to my mind:
Y2K bugs were all dormant faults, until the year 2000 came around...
Photoshop 7 still runs OK on my Windows 7 machine, yet it thinks my 1TB disks are full. An explanation is that the datatype used to hold free space was not designed to account for such high amounts of free space, and there's an overflow causing the free space to appear insufficient.
Transferring a file larger than 32MB with TFTP (the block counter can only go to 65535 in 16 bits) can reveal a dormant bug in a lot of old implementations.
In this last set of examples, one could argue that there was no specification requiring these systems to support such instances, and so they're not really faults. But that gets into completeness of specifications.
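The TFTP limit can be worked out in a few lines. This is only a sketch: it assumes the classic 512-byte data block of RFC 1350, and the function names are hypothetical.

```c
#include <stdint.h>

/* Classic TFTP data block size in bytes (assumption: no blksize option). */
#define TFTP_BLOCK_SIZE 512u

/* Largest payload addressable before a 16-bit block counter wraps. */
uint32_t tftp_max_bytes_before_wrap(void)
{
    return (uint32_t)UINT16_MAX * TFTP_BLOCK_SIZE;   /* 65535 * 512 */
}

/* Next block number as a naive implementation would compute it. */
uint16_t tftp_next_block(uint16_t block)
{
    return (uint16_t)(block + 1u);   /* 65535 wraps around to 0 */
}
```

Here tftp_max_bytes_before_wrap() gives 33,553,920 bytes, just under 32 MiB, and tftp_next_block(65535) gives 0. The fault stays dormant until a transfer needs more blocks than that and the implementation mishandles the wrapped counter.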

Page fault: shortage of page or access violation?

It is known that accessing a page which is not present in memory leads to a page fault, but can writing to a read-only page also cause a page fault? How can the exception handler tell the two types of page fault apart?
You read the exception error code that the CPU places on the stack before invoking your page fault handler. This error code contains 5 bits, of which you're interested in these 4:
P=0: The fault was caused by a non-present page.
P=1: The fault was caused by a page-level protection violation.
W/R=0: The access causing the fault was a read.
W/R=1: The access causing the fault was a write.
U/S=0: The access causing the fault originated when the processor
was executing in supervisor mode.
U/S=1: The access causing the fault originated when the processor
was executing in user mode.
I/D=0: The fault was not caused by an instruction fetch.
I/D=1: The fault was caused by an instruction fetch.
If you get P=0, the page isn't present.
If you get P=1, the privileges are insufficient to access the page. U/S tells you whether the access came from the kernel or an application. I/D tells you whether the access was an instruction fetch or a data read/write. W/R tells you whether it was a read or a write that could not be done.
This is described in the Interrupt 14—Page-Fault Exception (#PF) section of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3: System Programming Guide.
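The flags listed above can be decoded along these lines. This is a minimal sketch: the bit positions are those given in the Intel manual (P is bit 0, W/R bit 1, U/S bit 2, I/D bit 4, with bit 3 being the reserved-bit flag), while the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions of the x86 page-fault error code. */
#define PF_P  (1u << 0)  /* 0 = non-present page, 1 = protection violation */
#define PF_WR (1u << 1)  /* 0 = read, 1 = write */
#define PF_US (1u << 2)  /* 0 = supervisor mode, 1 = user mode */
#define PF_ID (1u << 4)  /* 1 = instruction fetch */

/* Hypothetical decoded view of the error code. */
struct pf_info {
    bool protection_violation;  /* false means the page was not present */
    bool write;
    bool user_mode;
    bool instruction_fetch;
};

struct pf_info pf_decode(uint32_t error_code)
{
    struct pf_info info;
    info.protection_violation = (error_code & PF_P) != 0;
    info.write                = (error_code & PF_WR) != 0;
    info.user_mode            = (error_code & PF_US) != 0;
    info.instruction_fetch    = (error_code & PF_ID) != 0;
    return info;
}
```

For instance, an error code of 0x7 decodes as a user-mode write that hit a present page, which is the "writing a read-only page" case from the question.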
Alex's answer is perfectly correct, however you also need to combine that information with some information of your own (i.e. by looking at the memory manager data). For example some operating systems don't allocate pages backing memory until they're referenced for the first time, so if you get a read or a write to a page which is not present you may find that the reason it is not present is that you haven't allocated it yet and you should allocate it and continue from the exception. Similarly a write to a read only page can occur as part of a copy-on-write mechanism (a number of systems do this, most notably posix style systems when performing fork()), so you detect the write to a read only page, check the memory manager tables and see the page should be copied, copy the page, update the page tables and continue.
I've found that usually the only flag from the list Alex mentions that is interesting is the one that says whether it was a read or a write. Beyond that you need to check everything else from the MM tables anyway.
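A rough sketch of how the error code and the memory manager data might be combined. The vm_region record and the pf_resolve function are purely hypothetical stand-ins for whatever tables a real kernel keeps, and the decision logic is simplified to the two cases described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PF_P  (1u << 0)  /* set: page was present (protection fault) */
#define PF_WR (1u << 1)  /* set: faulting access was a write */

/* Hypothetical memory-manager record for the region containing the
 * faulting address; real kernels keep far richer state. */
struct vm_region {
    bool writable;       /* region may legitimately be written */
    bool copy_on_write;  /* pages are shared read-only until first write */
};

enum pf_action {
    PF_DEMAND_ALLOCATE,  /* back the page now and retry the access */
    PF_COPY_PAGE,        /* copy-on-write: copy, remap writable, retry */
    PF_SIGNAL_TASK       /* genuine violation, e.g. deliver SIGSEGV */
};

/* r is NULL when the address belongs to no known region. */
enum pf_action pf_resolve(uint32_t error_code, const struct vm_region *r)
{
    bool present = (error_code & PF_P) != 0;
    bool write   = (error_code & PF_WR) != 0;

    if (r == NULL)
        return PF_SIGNAL_TASK;       /* address not mapped at all */
    if (write && !r->writable)
        return PF_SIGNAL_TASK;       /* write to a truly read-only region */
    if (!present)
        return PF_DEMAND_ALLOCATE;   /* lazily allocated page, first touch */
    if (write && r->copy_on_write)
        return PF_COPY_PAGE;
    return PF_SIGNAL_TASK;
}
```

Note that only the present and write bits are consulted; everything else comes from the region record, which matches the observation that the MM tables have to be checked anyway.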
Trying to write to a read-only page will usually cause a segmentation fault (SIGSEGV).
http://en.wikipedia.org/wiki/Segmentation_fault
I think it's called an access violation exception (memory access violation) in x86 parlance.

Real time system exception handling

As always, after some research I was unable to find anything of real value. My question is: how does one go about handling exceptions in a real-time system, where program failure is generally not acceptable (e.g. a nuclear reactor or a heart monitor)?
OK, since everyone got lost on the second piece of this, which had NOTHING to do with the main question: I had it in there to show how I normally escape code blocks.
Exception handling in real-time/embedded systems has several layers. Not just the language supported options, but also MMU, CPU exceptions and one of my favorites: watchdogs.
Language exceptions (C/C++)
- not often used, because it is hard to prove that all exceptions are handled at the right level. Also it is pretty hard to determine which thread/process should be responsible. Instead, programming by contract is preferred.
Programming style:
- i.e. programming by contract. Additional constraints: MISRA C / MISRA C++. Compliance can be checked to ensure that all possible cases are somehow handled (i.e. no if without else).
Hardware support:
- MMU : use of multiple processes which are protected against each other. This allows
- watchdog
- CPU exceptions
- multi core: use of multiple cores to separate critical processes from the rest. This also allows voting mechanisms (you want this and more for your nuclear reactor).
- multi-system
Most important is to define a strategy. Depending on the other nonfunctional requirements (safety, reliability, security), a strategy needs to be thought out. It can range from graceful degradation to a partial system reboot.
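As an illustration of the watchdog layer mentioned in the list above, here is a minimal sketch of one common pattern: the hardware watchdog is only serviced when every monitored task has checked in. All names are hypothetical, three tasks are assumed, and a counter stands in for the real hardware kick.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: three monitored tasks, one bit each (bits 0..2). */
#define ALL_TASKS_MASK 0x7u

static uint32_t alive_bits;   /* set by tasks, cleared on each kick */
static unsigned kick_count;   /* stands in for the hardware kick */

/* Called by each task once per cycle to prove it is still running. */
void task_checkin(unsigned task_id)
{
    alive_bits |= (1u << task_id);
}

/* Called periodically by a supervisor. Kicks only if all tasks are alive;
 * otherwise the (simulated) hardware timer would expire and reset. */
bool watchdog_service(void)
{
    if ((alive_bits & ALL_TASKS_MASK) != ALL_TASKS_MASK) {
        return false;
    }
    alive_bits = 0;
    kick_count++;
    return true;
}

unsigned watchdog_kicks(void)
{
    return kick_count;
}
```

A single hung task is then enough to withhold the kick, so the reset (or whatever degraded mode the strategy prescribes) follows without any task having to detect its own failure.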
In a 'real-time', 'nuclear reactor' type system, chances are that exception handling lets the system do the next best thing instead of failing.
Let's say that we have a heart monitor. If it isn't receiving a signal, that might trigger an exception. In that case, the heart monitor might handle the exception by waiting a few seconds and trying again.
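The heart monitor case might look roughly like this. It is a sketch with a bounded number of attempts, since a real-time system cannot retry forever; the callback type, the fake sensor, and all names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sensor callback: returns true and fills *sample on success. */
typedef bool (*read_fn)(void *ctx, int *sample);

/* Try up to max_attempts times; wait_fn (may be NULL) runs between tries. */
bool read_with_retry(read_fn read_sample, void *ctx, int max_attempts,
                     void (*wait_fn)(void), int *sample)
{
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        if (read_sample(ctx, sample))
            return true;
        if (wait_fn != NULL && attempt + 1 < max_attempts)
            wait_fn();
    }
    return false;   /* caller must escalate, e.g. raise an alarm */
}

/* Stand-in sensor for demonstration: fails a set number of times. */
struct fake_sensor {
    int failures_left;
    int value;
};

bool fake_sensor_read(void *ctx, int *sample)
{
    struct fake_sensor *s = ctx;
    if (s->failures_left > 0) {
        s->failures_left--;
        return false;
    }
    *sample = s->value;
    return true;
}
```

With a fake sensor that fails twice, read_with_retry(fake_sensor_read, &s, 3, NULL, &v) succeeds on the third attempt; with one that keeps failing, it returns false and the decision moves to a higher level.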
In a nuclear reactor, getting to a certain temperature might trigger an exception. In that case, the handling might shut off various parts of the reactor to start to cool it down, and then start them back up when it gets to a reasonable temperature.
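The shut-down-then-restart behaviour can be sketched as a small state machine with two thresholds, so the system does not flap on and off around a single temperature. The names and limits are hypothetical, and a real reactor controller would of course be far more involved.

```c
/* Hypothetical operating modes. */
enum reactor_mode { MODE_RUNNING, MODE_COOLING };

/* Hypothetical limits in degrees Celsius; resume_c must be below trip_c. */
struct thermal_limits {
    int trip_c;     /* at or above this, start cooling down */
    int resume_c;   /* at or below this, normal operation may resume */
};

/* Computes the next mode from the current mode and measured temperature. */
enum reactor_mode next_mode(enum reactor_mode cur, int temp_c,
                            const struct thermal_limits *lim)
{
    if (cur == MODE_RUNNING && temp_c >= lim->trip_c)
        return MODE_COOLING;
    if (cur == MODE_COOLING && temp_c <= lim->resume_c)
        return MODE_RUNNING;
    return cur;   /* between the thresholds nothing changes */
}
```

With limits of 350 and 300, a reading of 350 while running switches to cooling, a reading of 320 while cooling stays in cooling, and only 300 or below switches back.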
Exceptions are meant to have a lower-level system say that it doesn't know what to do, and to have a higher level system handle it. Like in the nuclear reactor, the system that measures temperature probably doesn't know how to turn on parts of the reactor, so it triggers an exception so that some higher-level system can handle it.
A critical system is like any other system, except it's specified more clearly, passes through more testing phases, and is generally designed to fail safe.
Regarding your form, yes, it's pretty bad. I do mind the lack of {} very much; it's been said so often that this is just plain bad style, and it leads to confusion when adding new code.