Does JSON have any numeric limits, or is it basically "anything goes" with respect to numbers, leaving it up to the receiving party to validate and error-check the numeric values? For example:
[12e+10000, 9223372036854775808]
The above would be out-of-bounds values in a language that only supports float64 and int64, so how is this usually handled? Or does JSON only specify the grammar (for example, 0|[1-9]\d* for an integer), and it's up to the processor to determine what to do with it?
The JSON standard does not set any limits on the magnitude or precision of a number; it is concerned only with the syntax. The RFC version is explicit that the standard permits an implementation to set numeric limits; it imposes no minimum limits.
An implementation that sets a small limit on number magnitude or precision will create interoperability issues; an implementation that outputs numbers outside the range of IEEE 754 binary64 may also cause them. Many processors allow numbers to be transmitted as strings, leaving it to the receiver to interpret the string as a number (perhaps with the aid of a schema); this is also the only way to transmit infinities and NaN values.
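To make this concrete, here is a minimal Java sketch (standard library only) of what happens to the two values above in a binary64/int64 consumer, and how keeping the lexical form avoids the loss:

    import java.math.BigDecimal;
    import java.math.BigInteger;

    public class JsonNumberLimits {
        public static void main(String[] args) {
            // The two lexical forms from the question, exactly as they appear in the JSON text.
            String big = "12e+10000";
            String overLong = "9223372036854775808"; // Long.MAX_VALUE + 1

            // A binary64 consumer overflows to Infinity:
            System.out.println(Double.parseDouble(big));   // Infinity
            // An int64 consumer cannot hold the second value at all:
            // Long.parseLong(overLong) would throw NumberFormatException.

            // A consumer that keeps the lexical form preserves both exactly:
            System.out.println(new BigDecimal(big));       // 1.2E+10001
            System.out.println(new BigInteger(overLong));  // 9223372036854775808
        }
    }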
SUMMARY
some support for JSON was added to XSLT 3.0 + XPath/XQuery 3.1
unfortunately, JSON number types are handled as IEEE double, subjecting the data to loss of numeric precision
I am considering writing a set of custom functions based on Java BigDecimal instead of IEEE double
Q: In order to support numeric precision beyond that offered by IEEE double, is it reasonable for me to consider cloning the JSON support in saxon 9.8 HE and building a set of customized functions which use BigDecimal instead of IEEE double?
DETAIL
I need to perform a number of transformations of JSON data.
XSLT 3.0 + XPath 3.1 + XQuery 3.1 have some support for JSON through json-to-xml + parse-json.
https://www.w3.org/TR/xpath-functions-31/#json-functions
https://www.saxonica.com/papers/xmlprague-2016mhk.pdf
I have hit a significant snag related to treatment of numeric data types.
My JSON data includes numeric values that exceed the precision of IEEE double-floats. In Java, my numeric values need to be processed using BigDecimal.
https://www.w3.org/TR/xpath-functions-31/#json-to-xml-mapping
states
Information may however be lost if (a) JSON numbers are not exactly representable as double-precision floating point ...
In addition, I have taken a look at the Saxon 9.8 HE reference implementation source for ./ma/json/JsonParser.java and confirmed that the private method parseNumericLiteral() returns a primitive double.
I am considering cloning the Saxon 9.8 HE JSON support code and using it as the basis for a set of customized functions which use Java BigDecimal instead of double, in order to retain numeric precision through the transformations ...
Q: In order to support numeric precision beyond that offered by IEEE double, is it reasonable for me to consider cloning the JSON support in saxon 9.8 HE and building a set of customized functions which use BigDecimal instead of IEEE double?
Q: Are you aware of any unforeseen issues which I may encounter?
The XML data model defines decimal numbers as having any finite precision.
https://www.w3.org/TR/xmlschema-2/#decimal
The JSON data model defines numbers as having any finite precision.
https://www.rfc-editor.org/rfc/rfc7159#page-6
Not surprisingly, both warn of potential interoperability issues with extended-precision numeric values.
Q: What was the rationale for explicitly defining the JSON number type in XPath/XQuery as IEEE double?
THE END
This is what the RFC says:
This specification allows implementations to set limits on the range
and precision of numbers accepted. Since software that implements
IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is
generally available and widely used, good interoperability can be
achieved by implementations that expect no more precision or range
than these provide, in the sense that implementations will
approximate JSON numbers within the expected precision. A JSON
number such as 1E400 or 3.141592653589793238462643383279 may indicate
potential interoperability problems, since it suggests that the
software that created it expects receiving software to have greater
capabilities for numeric magnitude and precision than is widely
available.
That, to my mind, is a pretty clear warning: it says that although the JSON grammar allows arbitrary precision in numeric values, you can't rely on JSON consumers to retain that precision, and it follows that if you want to convey high-precision numeric values, it would be better to convey them as strings.
The rules for fn:json-to-xml and fn:xml-to-json need to be read carefully:
The fn:json-to-xml function creates an element whose string value is
lexically the same as the JSON representation of the number. The
fn:xml-to-json function generates a JSON representation that is the
result of casting the (typed or untyped) value of the node to
xs:double and then casting the result to xs:string. Leading and
trailing whitespace is accepted. Since JSON does not impose limits on
the range or precision of numbers, these rules mean that conversion
from JSON to XML will always succeed, and will retain full precision
in the lexical representation unless the data model implementation is
one that reconstructs the string value from the typed value. In the
reverse direction, conversion from XML to JSON may fail if the value
is infinity or NaN, or if the string value is such that casting to
xs:double produces positive or negative infinity.
Although I probably wrote these words, I'm not sure I recall the exact rationale for why the decision was made this way, but it does suggest that the matter received careful thought. I suspect the thinking was that when you consume JSON, you should try to preserve all the information that is present in the input, but when you generate JSON, you should try to generate something that will be acceptable to all consumers. (The famous maxim about being liberal in what you accept and conservative in what you produce.)
Your analysis of the Saxon source isn't quite correct. You say:
the private method parseNumericLiteral() returns a primitive double.
which is true enough; but the original lexical representation is retained, and when the parser communicates the value to a JsonReceiver, it passes both the Java double and the string representation, so the JsonReceiver has access to both (which is needed for a correct implementation of fn:json-to-xml).
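The general pattern is easy to imitate. Below is a minimal sketch of the idea; NumberReceiver and the parseNumericLiteral signature here are invented stand-ins for illustration, not Saxon's actual API:

    import java.math.BigDecimal;

    public class LexicalPreservingDemo {
        // Invented callback interface: the parser supplies both forms of each number.
        interface NumberReceiver {
            void number(double approximation, String lexicalForm);
        }

        static void parseNumericLiteral(String token, NumberReceiver receiver) {
            // The double may lose precision; the string never does.
            receiver.number(Double.parseDouble(token), token);
        }

        public static void main(String[] args) {
            parseNumericLiteral("3.141592653589793238462643383279", (d, s) -> {
                System.out.println("as double:     " + d);                 // 3.141592653589793
                System.out.println("lexical form:  " + s);                 // every digit intact
                System.out.println("as BigDecimal: " + new BigDecimal(s)); // full precision
            });
        }
    }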
Some APIs, like the PayPal API, use a string type in JSON to represent a decimal number. So "7.47" instead of 7.47.
Why/when would this be a good idea over using the json number value type? AFAIK the number value type allows for infinite precision as well as scientific notation.
The main reason to transfer numeric values in JSON as strings is to eliminate any loss of precision or ambiguity in transfer.
It's true that the JSON spec does not specify a precision for numeric values. This does not mean that JSON numbers have infinite precision. It means that numeric precision is not specified, which means JSON implementations are free to choose whatever numeric precision is convenient to their implementation or goals. It is this variability that can be a pain if your application has specific precision requirements.
Loss of precision generally isn't apparent in the JSON encoding of the numeric value (1.7 is nice and succinct) but manifests in the JSON parsing and intermediate representations on the receiving end. A JSON parsing function would quite reasonably parse 1.7 into an IEEE double-precision floating-point number. However, finite-length / finite-precision representations will always run into numbers whose expansion in the representation's base cannot be written as a finite sequence of digits:
Irrational numbers (like pi and e), which have no finite expansion in any base.
Numbers like 1.7, which have a finite representation in base 10 notation but not in binary (base 2) notation: even with a near-infinite number of binary digits, you only get closer to 1.7; you never reach 1.7 exactly.
So, parsing 1.7 into an in-memory floating point number, then printing out the number will likely return something like 1.69 - not 1.7.
Consumers of the JSON 1.7 value could use more sophisticated techniques to parse and retain the value in memory, such as using a fixed-point data type or a "string int" data type with arbitrary precision, but this will not entirely eliminate the specter of loss of precision in conversion for some numbers. And the reality is, very few JSON parsers bother with such extreme measures, as the benefits for most situations are low and the memory and CPU costs are high.
So if you are wanting to send a precise numeric value to a consumer and you don't want automatic conversion of the value into the typical internal numeric representation, your best bet is to ship the numeric value out as a string and tell the consumer exactly how that string should be processed if and when numeric operations need to be performed on it.
For example: In some JSON producers (JRuby, for one), BigInteger values automatically output to JSON as strings, largely because the range and precision of BigInteger is so much larger than the IEEE double precision float. Reducing the BigInteger value to double in order to output as a JSON numeric will often lose significant digits.
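A plain-Java illustration of the same effect (the identifier value below is made up):

    import java.math.BigInteger;

    public class BigIntegerPrecision {
        public static void main(String[] args) {
            // An integer wider than the 53-bit significand of binary64:
            BigInteger id = new BigInteger("123456789012345678901234567890");

            // Emitting it as a JSON number forces a binary64 approximation...
            double approx = id.doubleValue();

            // ...whose low-order digits no longer match the original:
            System.out.println(id);               // exact
            System.out.printf("%.0f%n", approx);  // approximation; the tail digits are wrong

            // Emitting it as a JSON string keeps every digit:
            System.out.println("\"" + id + "\"");
        }
    }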
Also, the JSON spec (http://www.json.org/) explicitly states that NaNs and Infinities (INFs) are invalid for JSON numeric values. If you need to express these fringe elements, you cannot use JSON number. You have to use a string or object structure.
Finally, there is another aspect which can lead to choosing to send numeric data as strings: control of display formatting. Leading zeros and trailing zeros are insignificant to the numeric value. If you send JSON number value 2.10 or 004, after conversion to internal numeric form they will be displayed as 2.1 and 4.
If you are sending data that will be directly displayed to the user, you probably want your money figures to line up nicely on the screen, decimal aligned. One way to do that is to make the client responsible for formatting the data for display. Another way to do it is to have the server format the data for display. Simpler for the client to display stuff on screen perhaps, but this can make extracting the numeric value from the string difficult if the client also needs to make computations on the values.
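In Java, for instance, the trailing zero survives only when the value is handled as a string, or as a decimal type constructed from one:

    import java.math.BigDecimal;

    public class TrailingZeros {
        public static void main(String[] args) {
            // A JSON number 2.10, parsed as a double, forgets its formatting:
            System.out.println(Double.parseDouble("2.10"));  // 2.1

            // Parsed from the string "2.10" into BigDecimal, the scale survives:
            System.out.println(new BigDecimal("2.10"));      // 2.10
        }
    }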
I'll be a bit contrarian and say that 7.47 is perfectly safe in JSON, even for financial amounts, and that "7.47" isn't any safer.
First, let me address some misconceptions from this thread:
So, parsing 1.7 into an in-memory floating point number, then printing out the number will likely return something like 1.69 - not 1.7.
That is not true, especially in the context of IEEE 754 double precision format that was mentioned in that answer. 1.7 converts into an exact double 1.6999999999999999555910790149937383830547332763671875 and when that value is "printed" for display, it will always be 1.7, and never 1.69, 1.699999999999 or 1.70000000001. It is 1.7 "exactly".
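Both halves of that claim can be checked in Java: BigDecimal(double) exposes the exact binary64 value, while the default formatting produces the shortest string that round-trips:

    import java.math.BigDecimal;

    public class ExactDouble {
        public static void main(String[] args) {
            // Shortest round-trip formatting always prints 1.7, never 1.69...:
            System.out.println(1.7);  // 1.7

            // The exact value of the underlying binary64, via BigDecimal(double):
            System.out.println(new BigDecimal(1.7));
            // 1.6999999999999999555910790149937383830547332763671875
        }
    }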
7.47 may actually be 7.4699999923423423423 when converted to float
7.47 already is a float, with an exact double value 7.46999999999999975131004248396493494510650634765625. It will not be "converted" to any other float.
a simple system that simply truncates the extra digits off will result in 7.46 and now you've lost a penny somewhere
IEEE rounds, not truncates. And it would not convert to any other number than 7.47 in the first place.
is the JSON number actually a float? As I understand it, it's a language-independent number, and you could parse a JSON number straight into a Java BigDecimal or other arbitrary-precision format in any language if so inclined.
It is recommended that JSON numbers are interpreted as doubles (IEEE 754 double-precision format). I haven't seen a parser that wouldn't be doing that.
And no, BigDecimal(7.47) is not the right way to do it – it will actually create a BigDecimal representing the exact double of 7.47, which is 7.46999999999999975131004248396493494510650634765625. To get the expected behavior, BigDecimal("7.47") should be used.
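In code:

    import java.math.BigDecimal;

    public class BigDecimalConstruction {
        public static void main(String[] args) {
            // Wrong: goes through the binary64 value first.
            System.out.println(new BigDecimal(7.47));
            // 7.46999999999999975131004248396493494510650634765625

            // Right: parses the decimal digits directly.
            System.out.println(new BigDecimal("7.47"));  // 7.47
        }
    }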
Overall, I don't see any fundamental issue with {"price": 7.47}. It will be converted into a double on virtually all platforms, and the semantics of IEEE 754 guarantee that it will be "printed" as 7.47 exactly and always.
Of course floating point rounding errors can happen on further calculations with that value, see e.g. 0.1 + 0.2 == 0.30000000000000004, but I don't see how strings in JSON make this better. If "7.47" arrives as a string and should be part of some calculation, it will need to be converted to some numeric data type anyway, probably float :).
It's worth noting that strings also have disadvantages: they cannot be passed to Intl.NumberFormat, and they are not a "pure" data type (the dot, for example, is a formatting decision).
I'm not strongly against strings; they seem fine to me as well, but I don't see anything wrong with {"price": 7.47} either.
The reason I'm doing it is that the SoftwareAG parser tries to "guess" the java type from the value it receives.
So when it receives
"jackpot":{
"growth":200,
"percentage":66.67
}
The first value (growth) will become a java.lang.Long and the second (percentage) will become a java.lang.Double
Now, when the second object in this jackpot array looks like this:
"jackpot":{
"growth":50.50,
"percentage":65
}
I have a problem.
When I exchange these values as Strings, I have complete control and can cast/convert the values to whatever I want.
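The same guess-the-type behaviour is easy to reproduce with a common parser such as Jackson (used here purely as an illustration; it is an assumption that it mirrors the SoftwareAG parser's inference):

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class TypeGuessing {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();

            JsonNode a = mapper.readTree("{\"growth\":200,\"percentage\":66.67}");
            JsonNode b = mapper.readTree("{\"growth\":50.50,\"percentage\":65}");

            // The inferred Java type follows the lexical form, not the field name:
            System.out.println(a.get("growth").numberValue().getClass()); // class java.lang.Integer
            System.out.println(b.get("growth").numberValue().getClass()); // class java.lang.Double
        }
    }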
Summarized Version
Just quoting from #dthorpe's answer, as I think this is the most important point:
Also, the JSON spec (http://www.json.org/) explicitly states that NaNs and Infinities (INFs) are invalid for JSON numeric values. If you need to express these fringe elements, you cannot use JSON number. You have to use a string or object structure.
I18N is another reason NOT to use String for decimal numbers
In dozens of countries, such as Germany and France, the comma (,) is the decimal separator and the dot (.) is the thousands separator. See the list on Wikipedia.
If your JSON document carries decimal numbers as string, you're relying on all possible API consumers using the same number format conversion (which is a step after the JSON parsing). There's the risk of incorrect conversion due to inverted use of comma and dot as separators.
If you use number for decimal numbers that risk is averted.
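A short Java sketch of that risk: the same wire string parses without error in two locales, yielding different numbers:

    import java.text.NumberFormat;
    import java.util.Locale;

    public class LocaleDecimalHazard {
        public static void main(String[] args) throws Exception {
            String wire = "7,47";  // a price formatted with a German decimal comma

            // A German consumer reads the intended value:
            System.out.println(NumberFormat.getInstance(Locale.GERMANY).parse(wire)); // 7.47

            // A US consumer leniently treats the comma as a thousands separator:
            System.out.println(NumberFormat.getInstance(Locale.US).parse(wire));      // 747
        }
    }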
Say I have a float (or double) in my favorite language, and say that in memory this value is stored according to IEEE 754. Now say that I serialize this value into XML or JSON or plain text using base 10. When serializing and deserializing this value, will I lose precision of my number? When should I care about this precision loss?
Would converting the number to base64 prevent the loss of precision?
It depends on the binary-to-decimal conversion function that you use. Assuming this function is not botched (it has no reason to be):
Either it converts to a fixed precision. Old-fashioned languages such as C offer this kind of conversion to decimal. In this case, you should use a format with 17 significant decimal digits. A common format is D.DDDDDDDDDDDDDDDDEXXX where D and X are decimal digits, and there are 16 digits after the dot. This would be specified as %.16e in C-like languages. Converting back such a decimal value to the nearest double produces the same double that was originally printed.
Or it converts to the shortest decimal representation that converts back to the same double. This is what some modern programming languages (e.g. Java) offer by default as their printing function. In this case, the property that parsing the decimal representation back will return the original double is automatic.
In either case loss of accuracy should not happen. This is not because you get the exact decimal representation of the original binary64 number with either method 1. or 2. above: in the general case, you don't. Such an exact representation always exists (because 10 is a multiple of 2), but can be up to ~750 digits long for a binary64 number.
What you get with method 1. or 2. above is a decimal number that is closer to the original binary64 number than to any other binary64 number. This means that the opposite conversion, from decimal to binary64, will “round back” to the original.
This is where the “non-botched” assumption is necessary: in order for the successive conversions to return to the original number they must respectively produce the closest decimal to the binary64 number passed and the closest binary64 to the decimal number passed. In these conditions, and with the appropriate number of decimal digits for the first conversion, the round-trip is lossless.
I should point out that (non-botched) conversions to and from decimal are expensive operations. Unless human-readability of the result is important for you, you should consider a simpler format to convert to. The C99-style hexadecimal representation for floating-point numbers is a good compromise between conversion cost and readability. It is not the most compact but it contains only printable characters.
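All three options above can be verified in a few lines of Java (Locale.ROOT is used so the %.16e output is locale-independent):

    import java.util.Locale;

    public class RoundTrip {
        public static void main(String[] args) {
            double x = 1.0 / 3.0;

            // Option 2: Double.toString emits the shortest decimal form
            // that parses back to the same double.
            String shortest = Double.toString(x);
            System.out.println(shortest + " round-trips: "
                    + (Double.parseDouble(shortest) == x));  // true

            // Option 1: 17 significant decimal digits (%.16e) also round-trips.
            String fixed = String.format(Locale.ROOT, "%.16e", x);
            System.out.println(fixed + " round-trips: "
                    + (Double.parseDouble(fixed) == x));     // true

            // The C99-style hex form is exact and cheap to convert:
            String hex = Double.toHexString(x);              // 0x1.5555555555555p-2
            System.out.println(hex + " round-trips: "
                    + (Double.parseDouble(hex) == x));       // true
        }
    }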
The approach of converting to the shortest form which converts back the same is dangerous (the "round-trip" string formatting mode in .NET uses such an approach, and is buggy as a result). There is probably no reason for a decimal-to-binary conversion method to yield a result which is more than 0.75lsb from the exact specified numerical value, but guaranteeing that a conversion will always yield a perfectly-rounded numerical value is expensive and in most cases not particularly helpful. It would be better to ensure that the precise arithmetic value of the decimal expression will be less than 0.25lsb from the double value to be represented. If a representation that's less than 0.25lsb away from a double is fed to a routine which returns a double within 0.75lsb of it, the latter routine can be guaranteed to yield the same double as was given to the former.
The approach of simply finding the shortest form that yields the same double assumes that any string representation will always be parsed the same way, even if the value represented falls almost exactly halfway between two adjacent double values. Since obtaining a perfectly-rounded result could require reading an arbitrary number of digits (e.g. 1125899906842624.125000...1 should round up to 1125899906842624.25), few implementations are apt to bother; and if an implementation is going to ignore digits beyond a certain point, even when that might yield a result that was, e.g., more than 0.56lsb away from the correct one, it shouldn't be trusted to be accurate to 0.50000lsb in any case.
How is number represented in JSON internally and how many bytes of data does it take to store a JSON number?
I can't find any info specifying this internal detail.
According to the ECMA standard (PDF), §8:
A number is represented in base 10 with no superfluous leading zero. It may have a preceding minus sign (U+002D). It may have a fractional part prefixed by a decimal point (U+002E). It may have an exponent of ten, prefixed by e (U+0065) or E (U+0045) and optionally + (U+002B) or – (U+002D). The digits are the code points U+0030 through U+0039.
So, pretty much text, except that (later on the page) NaN and Infinity aren't acceptable values.
BSON, however, has int32, int64, and double types that are a bit more traditional.
JSON is a data interchange format. It is just text. There is no "internal" representation of JSON, unless you are referring to how your particular system encodes and stores text data.
The number of bytes it takes to store a JSON number would be the length of the number, in characters, multiplied by the number of bytes required to store a character in your particular system.
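For example, in Java:

    import java.nio.charset.StandardCharsets;

    public class JsonNumberSize {
        public static void main(String[] args) {
            String number = "3.141592653589793238462643383279";
            // JSON is just text: the storage cost is the character count
            // times the bytes per character (1 for ASCII digits in UTF-8).
            System.out.println(number.getBytes(StandardCharsets.UTF_8).length); // 32
        }
    }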
Is it specified anywhere how big JSON integers can be? I'm guessing that they're limited to normal (32 bit) ints, but I can't find anywhere that that's written down. I need to encode identifiers that are longs in Java, so I presume I need to store those as strings in JSON so as not to risk overflow.
A JSON number is not limited by the spec.
Since JSON is an abstract format that is not exclusively targeted at JavaScript, the actual target environment determines the boundaries of what can be interpreted.
It's also worth noting that there are no "JSON integers"; they are a subset of the "Number" datatype.
RFC 7159: The JavaScript Object Notation (JSON) Data Interchange Format
This specification allows implementations to set limits on the range
and precision of numbers accepted. Since software that implements
IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is
generally available and widely used, good interoperability can be
achieved by implementations that expect no more precision or range
than these provide, in the sense that implementations will
approximate JSON numbers within the expected precision. A JSON
number such as 1E400 or 3.141592653589793238462643383279 may indicate
potential interoperability problems, since it suggests that the
software that created it expects receiving software to have greater
capabilities for numeric magnitude and precision than is widely
available.
I just did the following empirical test using Chrome (v.23 on Mac) Console:
> var j = JSON.parse("[999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999]")
undefined
> j[0]
1e+228
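From the Java side, the corresponding hazard for long identifiers is any value wider than the 53-bit significand of binary64; the first such integer is 2^53 + 1:

    public class LongIds {
        public static void main(String[] args) {
            long id = 9007199254740993L;  // 2^53 + 1, the first long a double cannot hold

            // A binary64 consumer (e.g. a JavaScript client) sees the wrong id:
            System.out.println((long) (double) id);  // 9007199254740992

            // Shipping it as a JSON string sidesteps the problem:
            System.out.println("{\"id\":\"" + id + "\"}");
        }
    }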
If the JSON is produced by Java and sent over HTTP, the number will be serialized to a string in any case, so the issue can arise only on the JavaScript side.
From ECMAScript Language Specification 4.3.19:
4.3.19 Number value
primitive value corresponding to a double-precision 64-bit binary
format IEEE 754 value
NOTE A Number value is a member of the Number type and is a direct
representation of a number.
This is the format described in the Wikipedia article on the double-precision floating-point format.