Having read Joel on Encoding like a good boy, I find myself perplexed by the workings of Foundation's JSONDecoder, whose init and decode methods take no encoding parameter. Looking through the docs, I see the instance property dataDecodingStrategy; perhaps that is where the encoding-guessing magic happens...?
Am I missing something here? Shouldn't JSONDecoder need to know the encoding of the data it receives? I realize that the JSON standard requires this data to be UTF-8 encoded, but can JSONDecoder be making that assumption? I'm confused.
RFC 8259 (from 2017) requires that
JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8.
The older RFC 7159 (from 2014) and RFC 7158 (from 2013) only stated that
JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32).
And RFC 4627 (from 2006, the oldest one that I could find):
JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

Since the first two characters of a JSON text will always be ASCII characters, it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.
JSONDecoder (which uses JSONSerialization under the hood) is able to decode UTF-8, UTF-16, and UTF-32, both little-endian and big-endian. Example:
let data = "[1, 2, 3]".data(using: .utf16LittleEndian)!
print(data as NSData) // <5b003100 2c002000 32002c00 20003300 5d00>
let a = try! JSONDecoder().decode([Int].self, from: data)
print(a) // [1, 2, 3]
Since a valid JSON text must start with "[" or "{", the encoding can be determined unambiguously from the first bytes of the data.
I did not find this documented though, and one probably should not rely on it. A future implementation of JSONDecoder might support only the newer standard and require UTF-8.
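For illustration only, here is a rough sketch of the RFC 4627 byte-pattern trick quoted above. It is written in Java purely to show the logic; detectJsonCharset is a hypothetical helper, and this is not what JSONDecoder/JSONSerialization actually does under the hood.

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Guess the encoding of a JSON document from the pattern of null bytes in
// its first four octets (RFC 4627 section 3). The UTF-32 charsets are not in
// StandardCharsets but are available on common JDKs via Charset.forName.
static Charset detectJsonCharset(byte[] b) {
    if (b.length >= 4 && b[0] == 0 && b[1] == 0 && b[2] == 0) return Charset.forName("UTF-32BE");
    if (b.length >= 4 && b[1] == 0 && b[2] == 0 && b[3] == 0) return Charset.forName("UTF-32LE");
    if (b.length >= 2 && b[0] == 0) return StandardCharsets.UTF_16BE;
    if (b.length >= 2 && b[1] == 0) return StandardCharsets.UTF_16LE;
    return StandardCharsets.UTF_8;
}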
Related
RFC 7515 Section 3 mentions:
In both serializations, the JWS Protected Header, JWS Payload, and JWS Signature are base64url encoded, since JSON lacks a way to directly represent arbitrary octet sequences.
Why can't arbitrary octet sequences be represented directly in JSON?
JSON is by definition UTF-8 text, so there is no way to (usefully) represent a byte sequence that is not valid UTF-8.
For example, you cannot encode the bytes \x80 \x80.
(You could set up mutual agreement on both sides for additional semantics beyond what JSON supports, and encode the bytes as, for example, \\x80\\x80; but then your format is no longer strictly JSON. In that case, to actually emit UTF-8, you would have to spell out the UTF-8 encoding of U+0080 twice! Base64 is simply a better convention here, because it is more compact and avoids any confusion between characters and bytes.)
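This is exactly what base64url gives you. A minimal Java sketch, reusing the \x80 \x80 bytes from above (the variable names are just for illustration):

import java.util.Base64;

byte[] raw = { (byte) 0x80, (byte) 0x80 };   // not valid UTF-8, so not representable directly in JSON text
String b64url = Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
System.out.println(b64url);                  // gIA
// The receiver decodes it back to the original bytes:
byte[] roundTripped = Base64.getUrlDecoder().decode(b64url);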
To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
ECMA-404: The JSON Data Interchange Format
I believe that there is no need to encode this character at all, so it could be represented directly as "𝄞". However, should one wish to encode it, it must, per spec, be encoded as "\uD834\uDD1E", not (as would seem reasonable) as "\u1d11e". Why is this?
One of the key architectural features of JSON is that JSON-encoded objects are valid Javascript literals that can be evaluated using the eval function, for example. Unfortunately, older Javascript implementations only support 16-bit Unicode escape sequences with four hex characters in string literals, so there's no other way than to use UTF-16 surrogates in escape sequences for code points above 0xFFFF in a portable way. (The \u{...} syntax that allows arbitrary code points was only introduced in ECMAScript 6.)
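For example, here is the arithmetic behind that surrogate pair, as a small Java sketch of the standard UTF-16 encoding of a supplementary code point; the resulting values match the escape sequence from the ECMA-404 quote above.

int cp = 0x1D11E;                  // U+1D11E MUSICAL SYMBOL G CLEF
int v = cp - 0x10000;              // 0x0D11E
int high = 0xD800 + (v >> 10);     // 0xD834 (high surrogate)
int low  = 0xDC00 + (v & 0x3FF);   // 0xDD1E (low surrogate)
String escaped = String.format("\\u%04X\\u%04X", high, low);
System.out.println(escaped);       // the twelve-character sequence from the ECMA-404 quote
// Character.toChars(0x1D11E) returns the same two UTF-16 code units.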
But as you mentioned, there's no need to use escape sequences if your application supports Unicode JSON text. Simply encode the characters directly in the respective Unicode format.
I use Jackson to parse JSON data, and I have run into a problem handling a \uXXXX escape.
The data I receive looks like this:
{"UID":"here_\ud83d\udc3b"}
After I call ObjectMapper.readValue(jsonContent, UserId.class); to convert the JSON into a UserId instance, the UID property is not literally "here_\ud83d\udc3b": Jackson converts \ud83d\udc3b into two chars holding the corresponding Unicode value.
My question is: is it possible to make Jackson skip this "Unicode transformation" and keep the literal value "\ud83d\udc3b" as it is?
No. JSON parsers are required to handle Unicode escapes to produce underlying Unicode characters.
When writing, on the other hand, some characters may also be encoded using similar Unicode escapes.
So if you need to use escaping, you need to re-encode such values yourself.
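For example, a minimal sketch of doing that re-encoding by hand (plain Java, not a Jackson API; escapeNonAscii is a hypothetical helper name). It escapes each UTF-16 code unit separately, so a surrogate pair comes back out as the original two escapes:

// Re-escape every non-ASCII UTF-16 code unit after serialization.
static String escapeNonAscii(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c < 0x80) {
            sb.append(c);
        } else {
            sb.append(String.format("\\u%04x", (int) c));
        }
    }
    return sb.toString();
}

// escapeNonAscii("here_🐻") yields "here_" followed by the two escapes from the question.

(If I recall correctly, recent Jackson versions also offer a JsonGenerator ESCAPE_NON_ASCII feature that does roughly this when writing.)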
I have a comment section with a submission form that any of my members can post to.
If a member posts in English, I receive an email update and the comment is posted without any problem. But if they post in another language, Thai for example, then every word (for instance สวัสดี) shows up as ??????.
I don't know why. I checked my php.ini file and the encoding is set to UTF-8, the MySQL collation is set to UTF-8 as well, and I made sure the meta tags in the .html/.php files declare UTF-8, but the problem remains.
Any suggestions on what else I might have missed configuring?
Make sure you are using multibyte-safe string functions, or you might be losing your UTF-8 encoding.
From the PHP mbstring manual:
While there are many languages in which every necessary character can be represented by a one-to-one mapping to an 8-bit value, there are also several languages which require so many characters for written communication that they cannot be contained within the range a mere byte can code (a byte is made up of eight bits; each bit can contain only two distinct values, one or zero; because of this, a byte can only represent 256 unique values, two to the power of eight). Multibyte character encoding schemes were developed to express more than 256 characters in the regular bytewise coding system.

When you manipulate (trim, split, splice, etc.) strings encoded in a multibyte encoding, you need to use special functions since two or more consecutive bytes may represent a single character in such encoding schemes. Otherwise, if you apply a non-multibyte-aware string function to the string, it probably fails to detect the beginning or ending of the multibyte character and ends up with a corrupted garbage string that most likely loses its original meaning.

mbstring provides multibyte specific string functions that help you deal with multibyte encodings in PHP. In addition to that, mbstring handles character encoding conversion between the possible encoding pairs. mbstring is designed to handle Unicode-based encodings such as UTF-8 and UCS-2 and many single-byte encodings for convenience.
I just found out what was causing the problem:
In php.ini, the line mbstring.internal_encoding was set to something else, so I set it to UTF-8 and, like magic, everything now works!
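For reference, the relevant php.ini line ends up looking like this:

mbstring.internal_encoding = UTF-8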
I have a string which gets serialized to JSON in Javascript, and then deserialized to Java.
It looks like if the string contains a degree symbol, then I get a problem.
I could use some help in figuring out who to blame:
is it the Spidermonkey 1.8 implementation? (this has a JSON implementation built-in)
is it Google gson?
is it me for not doing something properly?
Here's what happens in JSDB:
js>s='15\u00f8C'
15°C
js>JSON.stringify(s)
"15°C"
I would have expected "15\u00f8C", which leads me to believe that Spidermonkey's JSON implementation isn't doing the right thing... except that the JSON homepage's syntax description (is that the spec?) says that a char can be
any-Unicode-character-except-"-or-\-or-control-character
so maybe it passes the string along as-is without encoding it as \u00f8... in which case I would think the problem is with the gson library.
Can anyone help?
I suppose my workaround is to use either a different JSON library, or manually escape strings myself after calling JSON.stringify() -- but if this is a bug then I'd like to file a bug report.
This is not a bug in either implementation. There is no requirement to escape U+00B0. To quote the RFC:
2.5. Strings
The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
Any character may be escaped.
Escaping everything inflates the size of the data: every code point can be represented in four or fewer bytes in any Unicode transformation format, whereas an escape sequence takes six or twelve bytes.
It is more likely that you have a text transcoding bug somewhere in your code and escaping everything in the ASCII subset masks the problem. It is a requirement of the JSON spec that all data use a Unicode encoding.
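For instance, here is a hypothetical Java illustration of that kind of transcoding bug: UTF-8 bytes read back with the wrong charset turn the degree sign into mojibake, while pure-ASCII (escaped) text would survive unharmed and mask the bug.

import java.nio.charset.StandardCharsets;

byte[] utf8 = "15°C".getBytes(StandardCharsets.UTF_8);
String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
System.out.println(wrong);   // 15Â°C  (the 0xC2 lead byte of the UTF-8 sequence decoded as a separate Latin-1 character)
String right = new String(utf8, StandardCharsets.UTF_8);
System.out.println(right);   // 15°C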
hmm, well here's a workaround anyway:
function JSON_stringify(s, emit_unicode)
{
    var json = JSON.stringify(s);
    return emit_unicode ? json : json.replace(/[\u007f-\uffff]/g,
        function(c) {
            return '\\u' + ('0000' + c.charCodeAt(0).toString(16)).slice(-4);
        }
    );
}
test case:
js>s='15\u00f8C 3\u0111';
15°C 3◄
js>JSON_stringify(s, true)
"15°C 3◄"
js>JSON_stringify(s, false)
"15\u00f8C 3\u0111"
This is SUPER late and probably not relevant anymore, but if anyone stumbles upon this answer, I believe I know the cause.
So the JSON encoded string is perfectly valid with the degree symbol in it, as the other answer mentions. The problem is most likely in the character encoding you are reading/writing with. Depending on how you are using Gson, you are probably passing it a java.io.Reader instance. Any time you create a Reader from an InputStream, you need to specify the character encoding, either by name or as a java.nio.charset.Charset instance (it's usually best to use java.nio.charset.StandardCharsets.UTF_8). If you don't specify a Charset, Java will use your platform default encoding, which on Windows is usually CP-1252.
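A sketch of what that looks like, assuming Gson 2.x, an InputStream named in from wherever the JSON arrives, and a hypothetical mapped class MyPayload:

import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import com.google.gson.Gson;

// Always name the charset explicitly; new InputStreamReader(in) alone would
// silently fall back to the platform default encoding.
try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
    MyPayload payload = new Gson().fromJson(reader, MyPayload.class);
}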