What term is rounded in the CUDA __fsqrt rounding intrinsics?

I need the square root of a float in CUDA device code. It's hard to say whether speed matters more than accuracy in my use case.
The __sqrtf CUDA intrinsic seems like the natural choice.
But then I saw the various __fsqrt CUDA intrinsics with rounding suffixes.
What is rounded in these intrinsics: the argument "x" or the return value? Or do I misunderstand the meaning of rounding here?
My testing suggests neither is rounded! I wrote a kernel that evaluates:
__fsqrt_rn(42 * 42 + 0.1)
and the return value is always 42.0011902, which matches the square root of 42 * 42 + 0.1. So what is being rounded?

It's a rounding mode for the result. Input arguments are not "rounded" before they are injected into the arithmetic flow.
the "rn" rounding "direction" is "round-to-nearest"
It means that at whatever precision the interim result is being calculated to, that result will be rounded to the nearest available representation. In the case of a float final result, it will be rounded to the nearest available float representation.
Let's revisit your example. When I put your problem into the Windows 10 calculator, the result I get is 42.001190459319126303634970957554. The way we get from that "correct result at arbitrary precision" to a 32-bit floating point "rn" result is to take the two 32-bit floating point numbers, the one that is closest but numerically higher and the one that is closest but numerically lower, and of those two, select the one that is closest. That is apparently 42.0011902.
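If it helps to see this concretely, below is a minimal sketch (the kernel name is mine, not from the question) that computes the same square root with each of the four rounding variants. Only the result can differ between them, and at most in its last bit; the input is never touched:
#include <cstdio>

__global__ void sqrt_rounding_demo(float x)
{
    // Same input, four different rounding modes applied to the result.
    printf("rn (to nearest)  : %.9g\n", __fsqrt_rn(x));
    printf("rz (toward zero) : %.9g\n", __fsqrt_rz(x));
    printf("ru (toward +inf) : %.9g\n", __fsqrt_ru(x));
    printf("rd (toward -inf) : %.9g\n", __fsqrt_rd(x));
}

int main()
{
    sqrt_rounding_demo<<<1, 1>>>(42.0f * 42.0f + 0.1f);
    cudaDeviceSynchronize();
    return 0;
}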

Related

Incorrect data from MariaDB POLYGON SELECT

Server: MariaDB 10.4.17
INSERTing a POLYGON with 14 digits to the right of the decimal point, then SELECTing the same data, returns a POLYGON with 15 digits to the right of the decimal point, which is more data than actually exists, and the excess precision is incorrect.
INSERTing a 0-padded POLYGON with 15 digits to the right of the decimal point, then SELECTing the same data, returns a POLYGON with 15 digits to the right of the decimal point, however the SELECTed data is incorrect in the last digit and is not the 0 used for right-padding.
Because the table data is incorrect, the various Geometry functions like ST_Contains() produce incorrect results. This appears to be some sort of floating point type of error, but I'm not sure how to work around it.
Is there any way to make MariaDB save, use, and return the same data it was given?
Example:
INSERT INTO `Area`
(`Name`, `Coords`)
VALUES ('Test ', GeomFromText('POLYGON((
-76.123527198020080 43.010597920077250,
-76.128263410842290 43.016193091211520,
-76.130763247573610 43.033194256815040,
-76.140676208063910 43.033514863935440,
-76.13626333248750 43.008550330099250,
-76.123527198020080 43.010597920077250))'));
SELECT Coords FROM `Area` WHERE `Name` = 'Test';
POLYGON ((
-76.123527198020085 43.010597920077252,
-76.128263410842294 43.01619309121152,
-76.130763247573611 43.033194256815037,
-76.140676208063908 43.033514863935437,
-76.136263332487502 43.008550330099247,
-76.123527198020085 43.010597920077252
))
Edit:
As per @Michael-Entin, the floating point error was a dead end and could not be responsible for the size of the errors I was getting.
Update:
The problem was "me". I had accidentally used MBRContains() in one of the queries instead of ST_Contains().
MBRContains uses the "Minimum Bounding Rectangle" that will contain the polygon, not the actual POLYGON coordinates.
Using MBRContains had caused the area to be significantly larger than expected, and appeared to be a processing error, which it was not.
ST_Contains() is slower but respects all the POLYGON edges and yields correct results.
Thanks to @Michael-Entin for noticing that the floating point error couldn't account for the magnitude of the error I was experiencing. This information pointed me in the right direction.
I think the precision you have is reaching the limit of 64-bit floating point, and what you get is really the nearest floating point value representable by the CPU.
The code below prints the input value without any modification, and then the very next double floating point values decremented and incremented by smallest possible amounts:
#include <cmath>
#include <iomanip>
#include <iostream>
using namespace std;

int main() {
    const double f = -76.123527198020080;
    cout << setprecision(17) << f << endl
         << nextafter(f, -INFINITY) << endl
         << nextafter(f, INFINITY) << endl;
}
The results I get
-76.123527198020085
-76.123527198020099
-76.123527198020071
As you see, -76.123527198020085 is the nearest value to your coordinate -76.123527198020080, and its closest possible neighbors are -76.123527198020099 (even further), and -76.123527198020071 (also slightly further, but to a different direction).
So I don't think there is any way to keep the precision you want. Nor should there be a practical reason to keep such precision (the difference is less than a micron, i.e. 1e-6 of a meter).
What you should be looking at is how exactly ST_Contains does not meet your expectations. Geometric libraries usually do snapping with a tolerance distance that is slightly larger than the numeric precision of the coordinates, and this should ideally ensure that such minor differences in the input values don't affect the outcome of the function.
Most floating point hardware will be in base 2.
If we try and decompose the absolute value of -76.128263410842290 in base 2 it's:
64 (2^6) + 8 (2^3) + 4 (2^2) + 0.125 (2^-3) + ...
So we can write this number in base two as the sequence of bits 1001100.001...
Bad luck: in base 2, this number would require an infinite sequence of such bits.
The sequence begins with:
1001100.001000001101010111011110111100101101011101001110111000...
But floats have limited precision: the significand only has 53 bits in IEEE double precision, including the bits BEFORE the fraction separator.
Since 7 of those bits sit before the fraction separator here, the least significant bit (the unit of least precision) represents 2^-46...
1001100.001000001101010111011110111100101101011101001110111000...
1001100.00100000110101011101111011110010110101110101
Notice that the floating point value has been rounded up (to the nearest float).
Let's multiply 2^-46 by an appropriate power of five, 5^46/5^46: it is 5^46/10^46.
It means that its DECIMAL representation ends exactly 46 places after the DECIMAL point, or a bit less if the trailing bits of float significand are zero (not the case here, trailing bit is 1).
So potentially, the fraction part of those floating point numbers has about 46 digits, not just the 14 or 15 you seem to assume.
If we turn this floating point value back to decimal, we indeed get:
-76.12826341084229397893068380653858184814453125
-76.128263410842290
See, it's indeed slightly greater than your initial input here, because the float was rounded upward.
If you ask to print 15 decimal places AFTER the fraction separator, you get a rounded result.
-76.128263410842294
In this float number, the last bit 2^-46 has the decimal value
0.0000000000000142108547152020037174224853515625
where 142108547152020037174224853515625 is 5^46, you can do the math.
The immediately adjacent floating point values differ in this last bit (we can add or subtract it):
1001100.00100000110101011101111011110010110101110100
1001100.00100000110101011101111011110010110101110101
1001100.00100000110101011101111011110010110101110110
It means that the immediate floating point neighbours are about +/- 1.42e-14 away...
This means that you cannot trust the 14th digit after the decimal point; double precision does not have that resolution!
No surprise that the nearest float sometimes falls up to 7e-15 off your specified input (half the resolution, thanks to the round-to-nearest rule).
Remember, float precision is RELATIVE: if we consume bits to the left of the fraction separator, we reduce the precision of the fraction part (the point is literally floating).
This is very basic knowledge scientists should acquire before using floating point.
I hope those examples help as a very restricted introduction.
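For anyone who wants to reproduce these expansions, here is a small host-side sketch (assuming an IEEE-754 double and a standard C library) that prints the exact decimal value actually stored, plus its two nearest representable neighbours, with 46 digits after the decimal point:
#include <cmath>
#include <cstdio>

int main()
{
    // The stored double is the nearest representable value, not the decimal literal.
    const double f = -76.128263410842290;

    printf("%.46f\n", f);                        // the exact stored value (the long expansion quoted above)
    printf("%.46f\n", nextafter(f, -INFINITY));  // nearest representable neighbour below
    printf("%.46f\n", nextafter(f, INFINITY));   // nearest representable neighbour above
    return 0;
}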

Expr for float values in TCL

Calculating float values
tclsh
% expr 0.2+0.2
0.4
% expr 0.2+0.1
0.30000000000000004
%
Why not 0.3?
Am I missing something?
Thanks in advance.
Neither 0.1 nor 0.2 has an exact representation in IEEE double precision binary floating point arithmetic (which Tcl uses internally for expressions involving fractional values, as there's good hardware support for them). This means that the values you are computing with are never exactly what you think they are; instead, they're both very slightly more (as it happens; they could also have been slightly less in general). When you add 0.2+ε1 and 0.1+ε2, it can happen that ε1+ε2 adds up to more than the threshold past which the result rounds to the next exactly representable value above 0.3 (itself another imprecisely represented value). This is what you have observed. It's also inherent in the way floating point mathematics works in a vast array of languages; only integer arithmetic (or fractional arithmetic capable of being expressed as exact multiples of some power of 2, e.g., 0.5, 0.25, 0.125) is guaranteed to be exact.
The only interesting thing of note here is that Tcl 8.5 and 8.6 prefer to render floating point numbers with the minimal number of digits required to get the exact value back when re-parsed. If you want to get a fixed number of digits (e.g., 8) try using format when converting:
format %.8f [expr 0.2+0.1]
This behavior exists in almost all programming languages, e.g. Ruby, Python, etc.
The suggestion here is to avoid storing numbers in floating point; use integers whenever possible. The bottom line is: do not compare floating point values for exact equality.
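To underline that this is not a Tcl quirk, here is the same effect in a tiny C-style sketch; the equality test against 0.3 fails for exactly the reason described above:
#include <cstdio>

int main()
{
    printf("%.17g\n", 0.1 + 0.2);                                 // prints 0.30000000000000004
    printf("%s\n", (0.1 + 0.2 == 0.3) ? "equal" : "not equal");   // prints "not equal"
    return 0;
}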

Why does division by zero in IEEE754 standard results in Infinite value?

I'm just curious, why in IEEE-754 any non zero float number divided by zero results in infinite value? It's a nonsense from the mathematical perspective. So I think that correct result for this operation is NaN.
The function f(x) = 1/x is not defined at x=0 if x is a real number. For example, the function sqrt is not defined for any negative number, and sqrt(-1.0f) in IEEE-754 produces a NaN value. But 1.0f/0 is Inf.
But for some reason this is not the case in IEEE-754. There must be a reason for this, maybe some optimization or compatibility reasons.
So what's the point?
It's a nonsense from the mathematical perspective.
Yes. No. Sort of.
The thing is: Floating-point numbers are approximations. You want to use a wide range of exponents and a limited number of digits and get results which are not completely wrong. :)
The idea behind IEEE-754 is that every operation could trigger "traps" which indicate possible problems. They are
Illegal (senseless operation like sqrt of negative number)
Overflow (too big)
Underflow (too small)
Division by zero (The thing you do not like)
Inexact (This operation may give you wrong results because you are losing precision)
Now many people like scientists and engineers do not want to be bothered with writing trap routines. So Kahan, the inventor of IEEE-754, decided that every operation should also return a sensible default value if no trap routines exist.
They are
NaN for illegal values
signed infinities for Overflow
signed zeroes for Underflow
NaN for indeterminate results (0/0) and infinities for x/0 with x != 0
normal operation result for Inexact
The thing is that in 99% of all cases zeroes are caused by underflow, and therefore in 99% of all cases Infinity is "correct", even if wrong from a mathematical perspective.
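A quick way to see those default results without installing any trap handler is a small sketch like the one below (plain host code, assuming IEEE-754 semantics):
#include <cmath>
#include <cstdio>

int main()
{
    printf("sqrt(-1.0) = %g\n", sqrt(-1.0));   // NaN: the "illegal"/invalid case
    printf(" 1.0 / 0.0 = %g\n",  1.0 / 0.0);   // +Inf: division by zero
    printf("-1.0 / 0.0 = %g\n", -1.0 / 0.0);   // -Inf
    printf(" 1.0 /-0.0 = %g\n",  1.0 / -0.0);  // -Inf: the zero's sign decides the direction
    printf(" 0.0 / 0.0 = %g\n",  0.0 / 0.0);   // NaN: indeterminate
    return 0;
}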
I'm not sure why you would believe this to be nonsense.
The simplistic definition of a / b, at least for non-zero b, is the unique number of bs that has to be subtracted from a before you get to zero.
Expanding that to the case where b can be zero, the number that has to be subtracted from any non-zero number to get to zero is indeed infinite, because you'll never get to zero.
Another way to look at it is to talk in terms of limits. As a positive number n approaches zero, the expression 1 / n approaches "infinity". You'll notice I've quoted that word because I'm a firm believer in not propagating the delusion that infinity is actually a concrete number :-)
NaN is reserved for situations where the number cannot be represented (even approximately) by any other value (including the infinities), it is considered distinct from all those other values.
For example, with 0 / 0 (using our simplistic definition above), any number of bs can be subtracted from a and you still reach 0. Hence the result is indeterminate - it could be 1, 7, 42, 3.14159 or any other value.
Similarly things like the square root of a negative number, which has no value in the real plane used by IEEE754 (you have to go to the complex plane for that), cannot be represented.
In mathematics, division by zero is undefined because zero has no sign, therefore two results are equally possible, and exclusive: negative infinity or positive infinity (but not both).
In (most) computing, 0.0 has a sign. Therefore we know what direction we are approaching from, and what sign infinity would have. This is especially true when 0.0 represents a non-zero value too small to be expressed by the system, as is frequently the case.
The only time NaN would be appropriate is if the system knows with certainty that the denominator is truly, exactly zero. And it can't unless there is a special way to designate that, which would add overhead.
NOTE:
I re-wrote this following a valuable comment from @Cubic.
I think the correct answer to this has to come from calculus and the notion of limits. Consider the limit of f(x)/g(x) as x->0 under the assumption that g(0) == 0. There are two broad cases that are interesting here:
If f(0) != 0, then the limit as x->0 is either plus or minus infinity, or it's undefined. If g(x) takes both signs in the neighborhood of x==0, then the limit is undefined (left and right limits don't agree). If g(x) has only one sign near 0, however, the limit will be defined and be either positive or negative infinity. More on this later.
If f(0) == 0 as well, then the limit can be anything, including positive infinity, negative infinity, a finite number, or undefined.
In the second case, generally speaking, you cannot say anything at all. Arguably, in the second case NaN is the only viable answer.
Now in the first case, why choose one particular sign when either is possible or it might be undefined? As a practical matter, it gives you more flexibility in cases where you do know something about the sign of the denominator, at relatively little cost in the cases where you don't. You may have a formula, for example, where you know analytically that g(x) >= 0 for all x, say, for example, g(x) = x*x. In that case the limit is defined and it's infinity with sign equal to the sign of f(0). You might want to take advantage of that as a convenience in your code. In other cases, where you don't know anything about the sign of g, you cannot generally take advantage of it, but the cost here is just that you need to trap for a few extra cases - positive and negative infinity - in addition to NaN if you want to fully error check your code. There is some price there, but it's not large compared to the flexibility gained in other cases.
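A tiny sketch of that last point (values chosen by me for illustration): when the denominator is something like x*x, whose sign is known to be non-negative, the signed-infinity convention gives a directly usable answer, whereas the sign of 1/x simply follows the sign of x:
#include <cstdio>

int main()
{
    double x = -1e-200;                        // tiny negative value
    printf("1/x     = %g\n", 1.0 / x);         // -1e+200: the sign follows x
    printf("1/(x*x) = %g\n", 1.0 / (x * x));   // +Inf: x*x underflows to +0, so the direction is known
    return 0;
}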
Why worry about general functions when the question was about "simple division"? One common reason is that if you're computing your numerator and denominator through other arithmetic operations, you accumulate round-off errors. The presence of those errors can be abstracted into the general formula format shown above. For example f(x) = x + e, where x is the analytically correct, exact answer, e represents the error from round-off, and f(x) is the floating point number that you actually have on the machine at execution.

CUDA, float precision

I am using CUDA 4.0 on a GeForce GTX 580 (Fermi). I have numbers as small as 7.721155e-43. I want to multiply them with each other just once, or better said, I want to calculate 7.721155e-43 * 7.721155e-43.
My experience showed me I can't do it just straightforwardly. Could you please give me a suggestion? Do I need to use double precision? How?
The magnitude of the smallest normal IEEE single-precision number is about 1.18e-38; the smallest denormal gets you down to about 1.40e-45. As a consequence, an operand of magnitude 7.72e-43 will comprise only about 9 non-zero bits, which in itself may already be a problem, even before you get to the multiplication (whose result will underflow to zero in single precision). So you may also want to look at any up-stream computation that produces these tiny numbers.
If these small numbers are intermediate terms in a mathematical expression, rewriting that expression into a mathematically equivalent one that does not involve tiny intermediates would be one way of addressing the issue. Or you could scale some operands by factors that are powers of two (so as to not incur additional round-off due to the scaling). For example, scale by 2^24 = 16777216.
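As a sketch of the scaling idea (the kernel name is mine, and the number comes from the question): scalbnf multiplies by an exact power of two, so it adds no rounding error of its own. The catch is that it has to be applied upstream, before values drop into the denormal range, and the extra factor has to be remembered when interpreting later results:
#include <cstdio>

__global__ void scale_demo(float tiny)
{
    float scaled = scalbnf(tiny, 24);   // tiny * 2^24, exact (no extra rounding)
    printf("tiny   = %g\n", tiny);      // about 7.72e-43, a denormal with only ~9 significant bits
    printf("scaled = %g\n", scaled);    // about 1.3e-35, back in the normal range
    // Every product computed with 'scaled' carries an extra factor of 2^24 per scaled
    // operand; divide it back out only once the result is large enough to represent.
}

int main()
{
    scale_demo<<<1, 1>>>(7.721155e-43f);
    cudaDeviceSynchronize();
    return 0;
}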
Lastly, you can switch part of the computation to double precision. To do so, simply introduce temporary variables of type double, perform the computation on them, then convert the final result back to float:
float r, f = 7.721155e-43f;
double d, t;
d = (double)f; // explicit cast is not necessary, since converting to wider type
t = d * d;
[... more intermediate computation, leaving result in 't' ...]
r = (float)t; // since conversion is to narrower type, cast will avoid warnings
In statistics we often have to work with likelihoods that end up being very small numbers and the standard technique is to use logs for everything. Then multiplication on a log scale is just addition. All intermediate numbers are stored as logs. Indeed it can take a bit of getting used to - but the alternative will often fail even when doing relatively modest computations. In R (for my convenience!) which uses doubles and prints 7 significant figures by default btw:
> 7.721155e-43 * 7.721155e-43
[1] 5.961623e-85
> exp(log(7.721155e-43) + log(7.721155e-43))
[1] 5.961623e-85

When I'm multiplying a float using multu, should I ignore the result in the LO register?

In our project, we take two floats from the user, store them in integer registers, and treat them as IEEE 754 single-precision floats, manipulating the bits by masking. So after I multiply the 23 fraction bits, should I take into account the result placed in the LO register if I want to return a single-precision float (32 bits) as the product?
First off, I hope you mean 24 bits of value, since you'll need to include the implicit mantissa bit in your multiplication.
Second, if you want your multiplication to be correctly rounded, as in IEEE-754, you will (sometimes) need the low part of the multiply in order to deliver the correctly rounded result.
On the other hand, if you don't need to implement correct rounding, and you left-shift your fraction bits before multiplication, you will be able to ignore the low word of the result.
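To make that concrete, here is a small host-side sketch (variable names and values are mine; a 64-bit multiply stands in for the HI/LO pair that multu produces) showing where the low half of the 48-bit significand product ends up when the significands are kept right-aligned:
#include <cstdint>
#include <cstdio>

int main()
{
    uint32_t a = 0x00C00001u;               // a 24-bit significand with the implicit leading bit set
    uint32_t b = 0x00800001u;               // another 24-bit significand

    uint64_t prod = (uint64_t)a * b;                 // full 48-bit product
    uint32_t hi = (uint32_t)(prod >> 32);            // what multu would leave in HI
    uint32_t lo = (uint32_t)(prod & 0xFFFFFFFFu);    // what multu would leave in LO

    // Only 24 bits of this product fit in the result's significand; the bits below
    // them form the guard/round/sticky information, and with right-aligned operands
    // those bits all land in LO. Ignoring LO therefore breaks round-to-nearest in
    // the borderline cases.
    printf("product = 0x%012llx (HI = 0x%08x, LO = 0x%08x)\n",
           (unsigned long long)prod, hi, lo);
    return 0;
}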