Compact yet exact floating-point number formatting

Compact yet exact floating-point number formatting - json

Fixing a Lua JSON implementation, I replaced "%.14g" for number formatting with "%.17g" to counter loss of precision: Formatting 17 digits using "%.17g" always suffices to allow converting an IEEE 754 double (the default Lua number type) to a string representation which can be converted back to the exact number. However, this may result in ugly number formatting: ("%.17g"):format(0.1) and ("%.17g"):format(0.3) result in "0.10000000000000001" and "0.29999999999999999" respectively due to float inaccuracies. Clearly %g doesn't always use the shortest representation of a number in case there are two decimal representations that get rounded to the same float.
I'm currently considering the following hack for cleaner (more compact) number formatting: Use %.17g only if %.16g does not produce an exact representation, e.g.:
local function number_to_string(num)
local compact = ("%.16g"):format(num)
if tonumber(compact) == num then
return compact
end
return ("%.17g"):format(num)
end
using this, 0.1 and 0.3 are converted to their more compact representations:
> number_to_string(0.1)
0.1
> number_to_string(0.3)
0.3
tostring produces compact number representations, but does not preserve precision; besides, it seems the reference manual does not guarantee that it uses any particular format:
> tostring(0.123456789101112)
0.12345678910111
> ("%.17g"):format(0.123456789101112)
0.123456789101112
Question(s):
Does this work in general, or will there be Lua numbers where a more compact representation exists but this fails to find it?
Is there a more efficient way of finding the compact number representation in Lua?

will there be Lua numbers where a more compact representation exists but this fails to find it?
Yes.
Denormalized numbers might have much less than 17 significant digits:
string.format("%.17g", 2^-1074) -- 4.9406564584124654e-324
string.format("%.16g", 2^-1074) -- 4.940656458412465e-324
....
string.format("%.2g", 2^-1074) -- 4.9e-324
string.format("%.1g", 2^-1074) -- 5e-324
Single-digit-approximation gives the same number:
5e-324 == 2^-1074 -- true

Related

What is precision of MYSQL RAND() function? and how to generate huge random numbers?

What is precision of MYSQL RAND() function?
I can't find it on the official page: MYSQL RAND() function is told to return floating-point number, unfortunately it's precision is not stated in a clear way. It can be a single-precision floating-point data, or double-precision, or any other kind of data.
What I would like to know exactly is - what is the maximum integer range [0,N] in which I can generate random integer numbers with FLOOR(RAND()*N) such that there won't be any "skips" and any number from 0 to N can be generated?
Another thing which I would like to know:
How to generate numbers, which are bigger than N in MySQL?

As written in the MySQL docs the precision is system dependent. So there is not the one answer to your question.
https://dev.mysql.com/doc/internals/en/floating-point-types.html
Since MySQL uses the machine-dependent binary representation of float and double to store values in the database, we have to care about these. Today, most systems use the IEEE standard 754 for binary floating-point arithmetic. It describes a representation for single precision numbers as 1 bit for sign, 8 bits for biased exponent and 23 bits for fraction and for double precision numbers as 1-bit sign, 11-bit biased exponent and 52-bit fraction. However, we can not rely on the fact that every system uses this representation. Luckily, the ISO C standard requires the standard C library to have a header float.h that describes some details of the floating point representation on a machine. The comment above describes the value DBL_DIG. There is an equivalent value FLT_DIG for the C data type float.
At the end I have no clue why the precision of a random number is important in any case. I cannot see any use case

Why would you use a string in JSON to represent a decimal number

Some APIs, like the paypal API use a string type in JSON to represent a decimal number. So "7.47" instead of 7.47.
Why/when would this be a good idea over using the json number value type? AFAIK the number value type allows for infinite precision as well as scientific notation.

The main reason to transfer numeric values in JSON as strings is to eliminate any loss of precision or ambiguity in transfer.
It's true that the JSON spec does not specify a precision for numeric values. This does not mean that JSON numbers have infinite precision. It means that numeric precision is not specified, which means JSON implementations are free to choose whatever numeric precision is convenient to their implementation or goals. It is this variability that can be a pain if your application has specific precision requirements.
Loss of precision generally isn't apparent in the JSON encoding of the numeric value (1.7 is nice and succinct) but manifests in the JSON parsing and intermediate representations on the receiving end. A JSON parsing function would quite reasonably parse 1.7 into an IEEE double precision floating point number. However, finite length / finite precision decimal representations will always run into numbers whose decimal expansions cannot be represented as a finite sequence of digits:
Irrational numbers (like pi and e)
1.7 has a finite representation in base 10 notation, but in binary (base 2) notation, 1.7 cannot be encoded exactly. Even with a near infinite number of binary digits, you'll only get closer to 1.7, but you'll never get to 1.7 exactly.
So, parsing 1.7 into an in-memory floating point number, then printing out the number will likely return something like 1.69 - not 1.7.
Consumers of the JSON 1.7 value could use more sophisticated techniques to parse and retain the value in memory, such as using a fixed-point data type or a "string int" data type with arbitrary precision, but this will not entirely eliminate the specter of loss of precision in conversion for some numbers. And the reality is, very few JSON parsers bother with such extreme measures, as the benefits for most situations are low and the memory and CPU costs are high.
So if you are wanting to send a precise numeric value to a consumer and you don't want automatic conversion of the value into the typical internal numeric representation, your best bet is to ship the numeric value out as a string and tell the consumer exactly how that string should be processed if and when numeric operations need to be performed on it.
For example: In some JSON producers (JRuby, for one), BigInteger values automatically output to JSON as strings, largely because the range and precision of BigInteger is so much larger than the IEEE double precision float. Reducing the BigInteger value to double in order to output as a JSON numeric will often lose significant digits.
Also, the JSON spec (http://www.json.org/) explicitly states that NaNs and Infinities (INFs) are invalid for JSON numeric values. If you need to express these fringe elements, you cannot use JSON number. You have to use a string or object structure.
Finally, there is another aspect which can lead to choosing to send numeric data as strings: control of display formatting. Leading zeros and trailing zeros are insignificant to the numeric value. If you send JSON number value 2.10 or 004, after conversion to internal numeric form they will be displayed as 2.1 and 4.
If you are sending data that will be directly displayed to the user, you probably want your money figures to line up nicely on the screen, decimal aligned. One way to do that is to make the client responsible for formatting the data for display. Another way to do it is to have the server format the data for display. Simpler for the client to display stuff on screen perhaps, but this can make extracting the numeric value from the string difficult if the client also needs to make computations on the values.

I'll be a bit contrarian and say that 7.47 is perfectly safe in JSON, even for financial amounts, and that "7.47" isn't any safer.
First, let me address some misconceptions from this thread:
So, parsing 1.7 into an in-memory floating point number, then printing out the number will likely return something like 1.69 - not 1.7.
That is not true, especially in the context of IEEE 754 double precision format that was mentioned in that answer. 1.7 converts into an exact double 1.6999999999999999555910790149937383830547332763671875 and when that value is "printed" for display, it will always be 1.7, and never 1.69, 1.699999999999 or 1.70000000001. It is 1.7 "exactly".
Learn more here.
7.47 may actually be 7.4699999923423423423 when converted to float
7.47 already is a float, with an exact double value 7.46999999999999975131004248396493494510650634765625. It will not be "converted" to any other float.
a simple system that simply truncates the extra digits off will result in 7.46 and now you've lost a penny somewhere
IEEE rounds, not truncates. And it would not convert to any other number than 7.47 in the first place.
is the JSON number actually a float? As I understand it's a language independent number, and you could parse a JSON number straight into a java BigDecimal or other arbitrary precision format in any language if so inclined.
It is recommended that JSON numbers are interpreted as doubles (IEEE 754 double-precision format). I haven't seen a parser that wouldn't be doing that.
And no, BigDecimal(7.47) is not the right way to do it – it will actually create a BigDecimal representing the exact double of 7.47, which is 7.46999999999999975131004248396493494510650634765625. To get the expected behavior, BigDecimal("7.47") should be used.
Overall, I don't see any fundamental issue with {"price": 7.47}. It will be converted into a double on virtually all platforms, and the semantics of IEEE 754 guarantee that it will be "printed" as 7.47 exactly and always.
Of course floating point rounding errors can happen on further calculations with that value, see e.g. 0.1 + 0.2 == 0.30000000000000004, but I don't see how strings in JSON make this better. If "7.47" arrives as a string and should be part of some calculation, it will need to be converted to some numeric data type anyway, probably float :).
It's worth noting that strings also have disadvantages, e.g., they cannot be passed to Intl.NumberFormat, they are not a "pure" data type, e.g., the dot is a formatting decision.
I'm not strongly against strings, they seem fine to me as well but I don't see anything wrong on {"price": 7.47} either.

The reason I'm doing it is that the SoftwareAG parser tries to "guess" the java type from the value it receives.
So when it receives
"jackpot":{
"growth":200,
"percentage":66.67
}
The first value (growth) will become a java.lang.Long and the second (percentage) will become a java.lang.Double
Now when the second object in this jackpot-array has this
"jackpot":{
"growth":50.50,
"percentage":65
}
I have a problem.
When I exchange these values as Strings, I have complete control and can cast/convert the values to whatever I want.

Summarized Version
Just quoting from #dthorpe's answer, as I think this is the most important point:
Also, the JSON spec (http://www.json.org/) explicitly states that NaNs and Infinities (INFs) are invalid for JSON numeric values. If you need to express these fringe elements, you cannot use JSON number. You have to use a string or object structure.

I18N is another reason NOT to use String for decimal numbers
In tens of countries, such as Germany and France, comma (,) is the decimal separator and dot (.) is the thousands separator. See the list on Wikipedia.
If your JSON document carries decimal numbers as string, you're relying on all possible API consumers using the same number format conversion (which is a step after the JSON parsing). There's the risk of incorrect conversion due to inverted use of comma and dot as separators.
If you use number for decimal numbers that risk is averted.

Is there error propagation when serializing floating point values to strings?

Say I have a float (or double) in my favorite language. Say that in memory this value is stored according to IEEE 754, say that I serialize this value in XML or JSON or plain text using base 10. When serializing and de-serializing this value will I lose precision of my number? When should I care about this precision loss?
Would converting the number to base64 prevent the loss of precision?

It depends on the binary-to-decimal conversion function that you use. Assuming this function is not botched (it has no reason to be):
Either it converts to a fixed precision. Old-fashioned languages such as C offer this kind of conversion to decimal. In this case, you should use a format with 17 significant decimal digits. A common format is D.DDDDDDDDDDDDDDDDEXXX where D and X are decimal digits, and there are 16 digits after the dot. This would be specified as %.16e in C-like languages. Converting back such a decimal value to the nearest double produces the same double that was originally printed.
Or convert it to the shortest decimal representation that converts back to the same double. This is what some modern programming languages (e.g. Java) offer by default as printing function. In this case, the property that parsing back the decimal representation will return the original double is automatic.
In either case loss of accuracy should not happen. This is not because you get the exact decimal representation of the original binary64 number with either method 1. or 2. above: in the general case, you don't. Such an exact representation always exists (because 10 is a multiple of 2), but can be up to ~750 digits long for a binary64 number.
What you get with method 1. or 2. above is a decimal number that is closer to the original binary64 number than to any other binary64 number. This means that the opposite conversion, from decimal to binary64, will “round back” to the original.
This is where the “non-botched” assumption is necessary: in order for the successive conversions to return to the original number they must respectively produce the closest decimal to the binary64 number passed and the closest binary64 to the decimal number passed. In these conditions, and with the appropriate number of decimal digits for the first conversion, the round-trip is lossless.
I should point out that (non-botched) conversions to and from decimal are expensive operations. Unless human-readability of the result is important for you, you should consider a simpler format to convert to. The C99-style hexadecimal representation for floating-point numbers is a good compromise between conversion cost and readability. It is not the most compact but it contains only printable characters.

The approach of converting to the shortest form which converts back the same is dangerous (the "round-trip" string formatting mode in .NET uses such an approach, and is buggy as a result). There is probably no reason not to have a decimal-to-binary conversion method yield a result which is more than 0.75lsb from the exact specified numerical value, guaranteeing that a conversion will always yield a perfectly-rounded numerical value is expensive and in most cases not particularly helpful. It would be better to ensure that the precise arithmetic value of the decimal expression will be less than 0.25lsb from the double value to be represented. If a that's less than 0.25lsb away from a double is fed to a routine which returns a double within 0.75lsb of it, the latter routine can be guaranteed to yield the same double as was given to the former.
The approach of simply finding the shortest form that yields the same double assumes that any string representation will always be parsed the same way, even if the value represented falls almost exactly halfway between two adjacent double values. Since obtaining a perfectly-rounded result could require reading an arbitrary number of digits (e.g. 1125899906842624.125000...1 should round up to 1125899906842624.25) few implementations are apt to bother; if an implementation is going to ignore digits beyond a certain point, even when that might yield a result that was e.g. more than .056lsb way from the correct one, it shouldn't be trusted to be accurate to 0.50000lsb in any case.

How does the computer converts binary number into its decimal equivalent in 2's complement

(My question is related to 2's complement only)
Suppose I give you this binary number 11111110 which is stored as two's complement on a machine and I want you to find its decimal equivalent. Some may say it is -2 while some may say it is 254 as they don't know whether its signed or unsigned. (I know it is a signed number so I took its complement and added 1 which gave me 2, so answer is -2. But if I didn't knew the sign, I would have said 254).
In short, how does the computer converts such binary representation which is stored in 2's complement into its decimal equivalent without making mistakes?
Does the computer knows about its sign? (if yes then where is this information stored?)

Technically you can not convert a binary represented number into a decimal, because computers do not have any storage facility to represent decimal numbers.
Practically this might sound absurd, since we are always dealing with numbers in decimal representation. But these decimal representations are never actually stored in decimal. Only thing a computer does is converting a number into decimal representation when displaying it. And this conversion is related to program construction and library design.
I'll give a small example on C language. In C you have signed and unsigned integer variables. When you are writing a program these variables are used to store numbers in memory. Who knows about their signs? The compiler. Assembly languages have signed and unsigned operations. Compiler keeps track of the sign of all variables and generate appropriate code for signed and unsigned case. So your program works with signed or unsigned integers perfectly when it is compiled.
Assume you used a printf sentence to print an integer variable and you used %d format converter to print the value in decimal representation. This conversion will be handled by printf function defined in standart input output library of C. The function reads the variable from memory, converts the binary representation to decimal representation by using a simple base conversion algorithm. But the target of the algorithm is a char sequence, not an integer. So this algorithm does two things, it both converts binary to decimal representation; and it converts bits to char values (or ASCII codes to be more precise). printf should know the sign of the number to carry on the conversion successfully and this information is again supplied by the compiler constructs placed at compile time. By using these constructs printf could check whether the integer is signed or unsigned and use the appropriate conversion method.
Other programming languages follow similar paths. In essence numbers are always stored in binary. The signed or unsigned representation is known by compiler/interpreter and thus is a common knowledge. The decimal conversion is only carried on for cosmetic reasons and the target of conversion is a char sequence or a string.

This is because when you specify that you want to use a signed number the computer will interpret the first bit [1]1111110 as the sign of the number made from the rest of the bits 1[1111110]. So 1 means - and 0 means +;
a "char" can store numbers from -127 to 127;
a "unsigned char" can store numbers from 0 to 255;

Why do most languages not allow binary numbers?

Why do most computer programming languages not allow binary numbers to be used like decimal or hexadecimal?
In VB.NET you could write a hexadecimal number like &H4
In C you could write a hexadecimal number like 0x04
Why not allow binary numbers?
&B010101
0y1010
Bonus Points!... What languages do allow binary numbers?
Edit
Wow! - So the majority think it's because of brevity and poor old "waves" thinks it's due to the technical aspects of the binary representation.

Because hexadecimal (and rarely octal) literals are more compact and people using them usually can convert between hexadecimal and binary faster than deciphering a binary number.
Python 2.6+ allows binary literals, and so do Ruby and Java 7, where you can use the underscore to make byte boundaries obvious. For example, the hexadedecimal value 0x1b2a can now be written as 0b00011011_00101010.

In C++0x with user defined literals binary numbers will be supported, I'm not sure if it will be part of the standard but at the worst you'll be able to enable it yourself
int operator "" _B(int i);
assert( 1010_B == 10);

In order for a bit representation to be meaningful, you need to know how to interpret it.
You would need to specify what the type of binary number you're using (signed/unsigned, twos-compliment, ones-compliment, signed-magnitude).
The only languages I've ever used that properly support binary numbers are hardware description languages (Verilog, VHDL, and the like). They all have strict (and often confusing) definitions of how numbers entered in binary are treated.

See perldoc perlnumber:
NAME
perlnumber - semantics of numbers and numeric operations in Perl
SYNOPSIS
$n = 1234; # decimal integer
$n = 0b1110011; # binary integer
$n = 01234; # octal integer
$n = 0x1234; # hexadecimal integer
$n = 12.34e-56; # exponential notation
$n = "-12.34e56"; # number specified as a string
$n = "1234"; # number specified as a string

Slightly off-topic, but newer versions of GCC added a C extension that allows binary literals. So if you only ever compile with GCC, you can use them. Documenation is here.

Common Lisp allows binary numbers, using #b... (bits going from highest-to-lowest power of 2). Most of the time, it's at least as convenient to use hexadecimal numbers, though (by using #x...), as it's fairly easy to convert between hexadecimal and binary numbers in your head.

Hex and octal are just shorter ways to write binary. Would you really want a 64-character long constant defined in your code?

Common wisdom holds that long strings of binary digits, eg 32 bits for an int, are too difficult for people to conveniently parse and manipulate. Hex is generally considered easier, though I've not used either enough to have developed a preference.
Ruby which, as already mentioned, attempts to resolve this by allowing _ to be liberally inserted in the literal , allowing, for example:
irb(main):005:0> 1111_0111_1111_1111_0011_1100
=> 111101111111111100111100

D supports binary literals using the syntax 0[bB][01]+, e.g. 0b1001. It also allows embedded _ characters in numeric literals to allow them to be read more easily.

Java 7 now has support for binary literals. So you can simply write 0b110101. There is not much documentation on this feature. The only reference I could find is here.

While C only have native support for 8, 10 or 16 as base, it is actually not that hard to write a pre-processor macro that makes writing 8 bit binary numbers quite simple and readable:
#define BIN(d7,d6,d5,d4, d3,d2,d1,d0) \
( \
((d7)<<7) + ((d6)<<6) + ((d5)<<5) + ((d4)<<4) + \
((d3)<<3) + ((d2)<<2) + ((d1)<<1) + ((d0)<<0) \
)
int my_mask = BIN(1,1,1,0, 0,0,0,0);
This can also be used for C++.

for the record, and to answer this:
Bonus Points!... What languages do allow binary numbers?
Specman (aka e) allows binary numbers. Though to be honest, it's not quite a general purpose language.

Every language should support binary literals. I go nuts not having them!
Bonus Points!... What languages do allow binary numbers?
Icon allows literals in any base from 2 to 16, and possibly up to 36 (my memory grows dim).

It seems the from a readability and usability standpoint, the hex representation is a better way of defining binary numbers. The fact that they don't add it is probably more of user need that a technology limitation.

I expect that the language designers just didn't see enough of a need to add binary numbers. The average coder can parse hex just as well as binary when handling flags or bit masks. It's great that some languages support binary as a representation, but I think on average it would be little used. Although binary -- if available in C, C++, Java, C#, would probably be used more than octal!

In Smalltalk it's like 2r1010. You can use any base up to 36 or so.

Hex is just less verbose, and can express anything a binary number can.
Ruby has nice support for binary numbers, if you really want it. 0b11011, etc.

In Pop-11 you can use a prefix made of number (2 to 32) + colon to indicate the base, e.g.
2:11111111 = 255
3:11111111 = 3280
16:11111111 = 286331153
31:11111111 = 28429701248
32:11111111 = 35468117025

Forth has always allowed numbers of any base to be used (up to size limit of the CPU of course). Want to use binary: 2 BASE ! octal: 8 BASE ! etc. Want to work with time? 60 BASE ! These examples are all entered from base set to 10 decimal. To change base you must represent the base desired from the current number base. If in binary and you want to switch back to decimal then 1010 BASE ! will work. Most Forth implementations have 'words' to shift to common bases, e.g. DECIMAL, HEX, OCTAL, and BINARY.

Although it's not direct, most languages can also parse a string. Java can convert "10101000" into an int with a method.
Not that this is efficient or anything... Just saying it's there. If it were done in a static initialization block, it might even be done at compile time depending on the compiler.
If you're any good at binary, even with a short number it's pretty straight forward to see 0x3c as 4 ones followed by 2 zeros, whereas even that short a number in binary would be 0b111100 which might make your eyes hurt before you were certain of the number of ones.
0xff9f is exactly 4+4+1 ones, 2 zeros and 5 ones (on sight the bitmask is obvious). Trying to count out 0b1111111110011111 is much more irritating.
I think the issue may be that language designers are always heavily invested in hex/octal/binary/whatever and just think this way. If you are less experienced, I can totally see how these conversions wouldn't be as obvious.
Hey, that reminds me of something I came up with while thinking about base conversions. A sequence--I didn't think anyone could figure out the "Next Number", but one guy actually did, so it is solvable. Give it a try:
10
11
12
13
14
15
16
21
23
31
111
?
Edit:
By the way, this sequence can be created by feeding sequential numbers into single built-in function in most languages (Java for sure).

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008