I am trying to generate a run time error such as divide by zero in ARM Cortex M3. I don't know why when I generate divide by zero error system works correctly. However value seems "Infinity"
Does ARM gcc compilers handle these kind of UsageFault errors? I did not implement hardware exception handler yet like Usage Fault, Bus Fault or Mem Manage.
Depending on the architecture the behaviour is different. ARMv6-M doesn't include a divide instruction so it's the software the one to manage this situation (or the compiler, from the C/C++ point of view, it is UB).
On Cortex M3 (ARMv7-M) things are different, there is an UsageFault exception to manage DIVBY0 situations.
In contrast to x86, no exception is thrown for arm if an integer division by zero takes place. There is simply returned 0 as the result
Edit: This only applies to the Cortex-A series. As Jose noted, there is a control register for integer division in the Cortex-M series, as in the case of Floating-point division described in the following. See the link in his answer.
For floating point operations, the Floating-point Control Register (FPSCR for aarch32 or FPCR for aarch64) is decisive for whether an exception is thrown. If the corresponding bit is set there, an exception is thrown, otherwise only a flag in the Floating-Point Status Register (FPSCR in aarch32 or FPSR in aarch64) is set which then indicates the error. This registers can be set via msr and read via mrs.
If no exception is thrown, there are the following rules:
infinity divided infinity is NaN
zero divided zero is NaN
Anything other divided infinity is ±zero
Anything other divided zero in ±infinity (sign according to the dividend,
this is the case you got in your screenshot)
infinity divided anything other is ±infinity
zero divided anything other is ±zero
See the pseudocode of FDIV in ARM a64 instruction set architecture.
References:
FPCR and FPSR in aarch64
FPSCR in aarch32
ARM a64 instruction set architecture
Related
I need an invalid opcode with x86 (not x64!) that's exactly one byte in length to overwrite some code in a foreign process. Currently I'm using INT3 (0xCC) but it would be nicer to trap an invalid opcode separately since the foreign process contains a lot of valid INT3.
According to http://ref.x86asm.net/coder32.html, there aren't any in 32-bit mode guaranteed to #UD. Anything that wasn't nailed down has been reused as building material for new extensions.
The ones that exist in 64-bit mode are reserved and not guaranteed to fault on future CPUs; only ud2 is truly guaranteed future-proof. Assuming x86-64 lasts long enough, likely some vendor will make use of that 64-bit-only coding space and stop wasting code-size to also cater to increasingly obsolete 32-bit mode.
If you don't need #UD, you can raise #GP(0) with some privileged instructions in user-space, assuming you're never going to be running in kernel mode.
F4 hlt will always #GP(0) in user-space, not enabled by IOPL, only true CPL=0. (Or #UD if used with a lock prefix). Even if it somehow gets executed in a kernel context, it just stops and waits for the next interrupt, so typically no effect on correctness unless executed with interrupts disabled. (In which case you're stuck until the next NMI).
A similar but worse option is FB sti. But it can execute successfully in a program that's used Linux iopl(), like an X11 server. Unless interrupts were supposed to be disabled, though, that's still not going to lock up your system, it just won't trigger the exception you were looking for. (Unlike cli which could get that CPU stuck, or in al, dx which could do wild IO and even be allowed by ioperm not just iopl, depending on what value is in DX.)
Depending what comes next in memory, 9A callf ptr16:32 might fault on trying to load an invalid value into CS. That value would come from the 2 bytes of machine code 5 and 6 bytes after this one (i.e. after a 32-bit new EIP, since ptr16:32 is stored little-endian). Unlike call rel32 or whatever, it may fault before actually pushing anything and overwriting the current CS:EIP. (But if not, in theory your debugger could simulate popping that far-return address back into CS:EIP after catching the fault.)
Just to be clear, I'm suggesting overwriting a byte with 9A, and leaving the later bytes of machine code unmodified, after checking that the bytes that would be the new CS value are in fact invalid. e.g. by making sure a far call to that address segfaults. Or if this is near the end of a page, and the next is unmapped, it can #PF.
The F0 lock prefix faults with #UD if used on things other than a memory-destination RMW operation, so it can also work if later context would decode as any other instruction. But you can't always use it; you need to check that you aren't creating a valid atomic RMW instruction. e.g. if the ModRM byte was 00 or 01, replacing the opcode with a lock prefix creates a memory-destination add.
#ecm points out that f1 on some CPUs is icebp / int1, but on other CPUs where it isn't, it's undefined but doesn't raise #UD. (http://ref.x86asm.net/coder32.html#xF1)
If the following byte is 0, D4 00 aam 0 is guaranteed to #DE (divide exception). But any other value does immediate 8-bit division of AL.
Depending what byte comes next, CD int n can be used. But not for all following bytes, e.g. int 0x80 won't fault under Linux (unless your kernel is built without CONFIG_IA32_EMULATION). And you might not want some of the other random interrupt numbers. e.g. CD 03 int 3 is pretty much like CC int3.
This question is about adapting to the change in semantics from lock step to independent program counters. Essentially, what can I change calls like int __all(int predicate); into for volta.
For example, int __all_sync(unsigned mask, int predicate);
with semantics:
Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for all of them.
The docs assume that the caller knows which threads are active and can therefore populate mask accurately.
a mask must be passed that specifies the threads participating in the call
I don't know which threads are active. This is in a function that is inlined into various places in user code. That makes one of the following attractive:
__all_sync(UINT32_MAX, predicate);
__all_sync(__activemask(), predicate);
The first is analogous to a case declared illegal at https://forums.developer.nvidia.com/t/what-does-mask-mean-in-warp-shuffle-functions-shfl-sync/67697, quoting from there:
For example, this is illegal (will result in undefined behavior for warp 0):
if (threadIdx.x > 3) __shfl_down_sync(0xFFFFFFFF, v, offset, 8);
The second choice, this time quoting from __activemask() vs __ballot_sync()
The __activemask() operation has no such reconvergence behavior. It simply reports the threads that are currently converged. If some threads are diverged, for whatever reason, they will not be reported in the return value.
The operating semantics appear to be:
There is a warp of N threads
M (M <= N) threads are enabled by compile time control flow
D (D subset of M) threads are converged, as a runtime property
__activemask returns which threads happen to be converged
That suggests synchronising threads then using activemask,
__syncwarp();
__all_sync(__activemask(), predicate);
An nvidia blog post says that is also undefined, https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/
Calling the new __syncwarp() primitive at line 10 before __ballot(), as illustrated in Listing 11, does not fix the problem either. This is again implicit warp-synchronous programming. It assumes that threads in the same warp that are once synchronized will stay synchronized until the next thread-divergent branch. Although it is often true, it is not guaranteed in the CUDA programming model.
That marks the end of my ideas. That same blog concludes with some guidance on choosing a value for mask:
Don’t just use FULL_MASK (i.e. 0xffffffff for 32 threads) as the mask value. If not all threads in the warp can reach the primitive according to the program logic, then using FULL_MASK may cause the program to hang.
Don’t just use __activemask() as the mask value. __activemask() tells you what threads happen to be convergent when the function is called, which can be different from what you want to be in the collective operation.
Do analyze the program logic and understand the membership requirements. Compute the mask ahead based on your program logic.
However, I can't compute what the mask should be. It depends on the control flow at the call site that the code containing __all_sync was inlined into, which I don't know. I don't want to change every function to take an unsigned mask parameter.
How do I retrieve semantically correct behaviour without that global transform?
TL;DR: In summary, the correct programming approach will most likely be to do the thing you stated you don't want to do.
Longer:
This blog specifically suggests an opportunistic method for handling an unknown thread mask: precede the desired operation with __activemask() and use that for the desired operation. To wit (excerpting verbatim from the blog):
int mask = __match_any_sync(__activemask(), (unsigned long long)ptr);
That should be perfectly legal.
You might ask "what about item 2 mentioned at the end of the blog?" I think if you read that carefully and taking into account the previous usage I just excerpted, it's suggesting "don't just use __activemask()" if you intend something different. That reading seems evident from the full text there. That doesn't abrogate the legality of the previous construct.
You might ask "what about incidental or enforced divergence along the way?" (i.e. during the processing of my function which is called from elsewhwere)
I think you have only 2 options:
grab the value of __activemask() at entry to the function. Use it later when you call the sync operation you desire. That is your best guess as to the intent of the calling environment. CUDA doesn't guarantee that this will be correct, however this should certainly be legal if you don't have enforced divergence at the point of your sync function call.
Make the intent of the calling environment clear - add a mask parameter to your function and rewrite the code everywhere (which you've stated you don't want to do).
There is no way to deduce the intent of the calling environment from within your function, if you permit the possibility of warp divergence prior to entry to your function, which obscures the calling environment intent. To be clear, CUDA with the Volta execution model permits the possibility of warp divergence at any time. Therefore, the correct approach is to rewrite the code to make the intent at the call site explicit, rather than trying to deduce it from within the called function.
I know MIPS would get wrong epc register value when it happens at branch delay, and epc = fault_address - 4.
But now, I often get the wrong EPC value which is even NOT in .text segment such as 0xb6000000, what's wrong with the case??
Thanks for your advance..
The CPU does not know anything about the boundaries of the .text region in your program. It simply implements a 2^32 byte address space.
It is possible for an incorrectly programmed jump to go to any address within the 2^32 byte address space. The jump instruction itself will not cause any sort of exception - in fact the MIPS32® Architecture for Programmers Volume II: The MIPS32® Instruction Set explicitly states that jump (J, JR, JALR) instructions do not trigger any exceptions.
When the processor starts executing from the destination of an incorrectly programmed jump, in presumably uninitialized memory, what happens next depends on the contents of that memory. If uninitialized memory is filled with "random" data, that data will be interpreted as instructions which the processor will execute until an illegal instruction is found, or until an instruction triggers some other exception.
I am very new to cuda and started reading about parallel programming and cuda just a few weeks ago. After I installed the cuda toolkit, I was browsing the sdk samples (which come with the installation of the toolkit) and wanted to try some of them out. I started with matrixMul from 0_Simple folder. This program executes fine (I am using Visual Studio 2010).
Now I want to change the size of the matrices and try with a bigger one (for example 960X960 or 1024x1024). In this case, something crashes (I get black screen, and then the message: display driver stopped responding and has recovered).
I am changing this two lines in the code (from main function):
dim3 dimsA(8*4*block_size, 8*4*block_size, 1);
dim3 dimsB(8*4*block_size, 8*4*block_size, 1);
before they were:
dim3 dimsA(5*2*block_size, 5*2*block_size, 1);
dim3 dimsB(5*2*block_size, 5*2*block_size, 1);
Can someone point to me what I am doing wrong. and should I alter something else in this example for it to work properly. Thx!
Edit: like some of you suggested, i changed the timeout value (0 somehow did not work for me, I set the timeout to 60), so my driver does not crash, but I get huge list of errors, like:
... ... ...
Error! Matrix[409598]=6.40005159, ref=6.39999986 error term is > 1e-5
Error! Matrix[409599]=6.40005159, ref=6.39999986 error term is > 1e-5
Does this got something to do with the allocation of the memory. Should I make changes there and what could they be?
Your new problem is actually just the strict tolerances provided in the NVidia example. Your kernel is running correctly. It's just complaining that accumluated error is greater than the limit that they had set for this example. This is just because you're doing a lot more math operations which are all accumulating error. If you look at the numbers it's giving you, you're only off of the reference answer by about 0.00005, which is not unusual after a lot of single-precision floating-point math. The reason you're getting these errors now and not with the default matrix sizes is that the original matricies were smaller and thus required a lot less operations to multiply. Matrix multiplication of N x N matricies requires on the order of N^3 operations, so the number of operations required increases much faster than the size of the matrix and the accumulated error would increase in proportion with the number of operations.
If you look near the end of the runTest() function, there's a call to computeGold() which computes the reference answer on your CPU. There should then be a call to something like shrCompareL2fe that compares the results. The last parameter to this is a tolerance. If you increase the size of this tolerance (say, to 1e-3 or 1e-4 instead of 1e-5,) you should eliminate these error messages. Note that there may be a couple of these calls. The version of the SDK examples that I have has an optional CUBLAS implementation, so it has a comparison for that against the gold, too. The one right after the print statement that says "Comparing CUDA matrixMul & Host results" is the one you'd want to change.
I'd advise looking at the indexing used in the kernel (matrixMulCUDA) a bit closer - it sounds like you're writing to unallocated memory.
More specifically, is the only thing that you changed the dimsA and dimsB variables? Inside the kernel they use the thread and block index to access the data - did you also increase the data size accordingly? There is no bounds checking going on in the kernel, so if you just change the kernel launch configuration, but not the data, then odds are you're writing past your data into some other memory
Have you disabled Timeout Detection and Recovery (TDR) in Windows? It is entirely possible that your code is running fine but that the larger matricies caused the kernel execution to exceed Windows' timeout, which causes Windows to assume the card is locked up, so it resets the card and gives you a message identical to the one you describe. Even if that is not your problem here, you definitely want to disable that before doing any serious CUDA work in Windows. The timeout is quite short by default, since normal graphics rendering should take small fractions of a second per frame.
See this post on the NVidia forums that describes TDR and how to turn it off:
WDDM TDR - NVidia devtalk forum
In particular, you probably want to set the key HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrLevel to 0 (Detection Disabled).
Alternatively, you can increase the timeout period by setting
HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrDelay. It defaults to 2 and is specified in seconds. Personally, I have found that TDR is always annoying when doing work in CUDA, so I just turn it off entirely. IIRC, you need to restart your system for any TDR-related changes to take effect.
Basically I want to know how do you differentiate an error from an exception. In some programming languages accessing a non existent file throws an error and in others its an exception. How do you know if some thing is an error or an exception?
Like anything else - you either test it or read the documentation. It can be an "Error" or an "Exception" based on the language.
Eg.
C:
Crashes and gives a divide by zero error.
Ruby:
>> 6 / 0
ZeroDivisionError: divided by 0
from (irb):1:in `/'
from (irb):1
(ZeroDivisionError is actually an exception.)
Java:
Code:
int x = 6 / 0;
Output:
Exception in thread "main" java.lang.ArithmeticException: / by zero
It depends on the language :
some languages don't have exceptions
some languages don't use exceptions for everything.
For example, in PHP :
There are exceptions
But divide by 0 doesn't cause an exception to be thrown : is only raises a warning -- that doesn't stop the execution of the script.
The following portion of code :
echo 10 / 0;
echo "hello, world!";
Would give this result :
Warning: Division by zero in /.../temp.php on line 5
hello, world!
The terms error and exception are commonly used as jargon terms, with meanings that vary depending upon the programming ecosystem in which they are used.
Conditions
This response follows the lead of Common Lisp, and adopts the term condition as a nonjudgmental way of referring to an "interesting situation" in a program.
What makes a program condition "interesting"? Let's consider the division-by-zero case for real numbers. In the overwhelming majority of cases in which one real is divided by another, the result is another plain ordinary well-behaved real number. These are the "routine" or "uninteresting" cases. However, in the case that the divisor is zero then, mathematically speaking, the result is undefined. The program is now in an "interesting" or "exceptional" condition.
It becomes even more complicated once we take the mathematical ideal of a real number and model it, say, as an IEEE-format floating point number. If we divide 1.0 / 0.0, the IEEE standard (mostly) says that the result is in fact another floating point number, the quiet NaN Infinity. Since the result no longer behaves in the same way as a plain old real number, the program condition is once again "interesting" or "exceptional".
Classifying Conditions
The question is: what should we do when we run into an interesting condition? The answer is dependent upon the context. When classifying program conditions, the following questions are useful:
How likely is it that the condition will occur: certain, probable, unlikely, impossible?
How is the condition detected: program malfunction, distinguished value, signal/handler (aka exception handling), program termination?
How should the condition be handled: ignore it, perform some special action, terminate the program?
The answers to these questions yield 4 x 4 x 3 = 48 distinct cases -- and surely more could be distinguished by further criteria. This brings us to the heart of the matter. We have more than two cases but only two labels, error and exception, to apply to them. Needless to say, there are many possible ways to divide the 48+ cases into two groups.
For example, one could say that anything involving program malfunction is an error, anything else is an exception. Or that anything involving a language's built-in exception handling facilities is an exception, anything else is an error. The possibilities are legion.
Examples
End-Of-File
When reading and processing a stream of characters, hitting the end-of-file is certain. In C, this event is detected by means of a distinguished return value from an I/O function, a so-called error return value. Thus, one speaks of an EOF error.
Division-By-Zero
When dividing two user-entered numbers in a simple calculator program, we want to give a meaningful result even if the user enters a divisor of zero. In some C environments, division-by-zero results in a signal (SIGFPE) that must be fielded by a signal handler. Signals are sometimes called exceptions in the C community and, confusingly, sometimes called program error signals. In other C environments, IEEE floating-point rules apply and the division-by-zero would result in a NaN value. The C environment would be blissfully unaware of that value, considering it to be neither an exception nor an error.
Runtime Load Failure
Programs frequently load their program code dynamically at run-time (e.g. classes, DLLs). This might fail due to a missing file. C offers no standard way to detect or recover from this case. The program would be terminated involuntarily, and one often speaks of this situation as a fatal exception. In Java, this would be termed a linkage error.
Java's Throwable Hierarchy
Java's exception-handling system divides the so-called Throwable class hierarchy into two main groups. Subclasses of Error are meant to represent conditions from which recovery is impossible. Subclasses of Exception are meant for recoverable conditions are are further subdivided into checked exceptions (for probable conditions) and unchecked exceptions (for unlikely conditions). Unfortunately, the boundaries between these categories are poorly defined and you will often find instances of throwables whose semantics suggest that they belong in a different category.
Be Wary Of Jargon
These examples show that the meanings of error and exception are murky at best. One must treat error and exception as jargon, whose meaning is determined by the context of discussion.
Of greater value are distinguishing characteristics of program conditions. What is the likelihood of the condition occurring? How is the condition detected? What action should be taken when the condition is detected? In any discussion that demands clarity, one is better suited to answer these questions directly rather than relying upon jargon terminology.
Exceptions should indicate exceptional activity, so if you reach a point in your code for which you've done your best to avoid divide by zero, then throwing an exception (if you are able to in your language) is the right way.
If it's routine logic to check for divide by zero (like for a calculator app) then you should check for that in your code before it has the chance to raise an exception. In that case, it's an error (in user input) and should be handled as such.
(Stole this idea either from The Pragmatic Programmer or Code Complete; can't remember which.)