RFC 7515 Section 3 mentions:
In both serializations, the JWS Protected Header, JWS Payload, and JWS Signature are base64url encoded, since JSON lacks a way to directly represent arbitrary octet sequences.
Why can't JSON directly represent arbitrary octet sequences?
JSON text is by definition Unicode (UTF-8 for interchange), so there is no way to (usefully) represent a byte sequence that is not valid UTF-8.
For example, you cannot encode the bytes \x80 \x80.
(You could agree on additional semantics on both sides, beyond what JSON supports, and write the bytes as, for example, \\x80\\x80; but then your format is no longer strictly JSON. To actually encode them as UTF-8, you'd have to spell out the UTF-8 encoding of U+0080 twice! Base64 is simply a better convention because it's more compact and avoids any confusion between characters and bytes.)
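To make this concrete, here is a minimal Python sketch (not taken from the RFC) showing that the octets \x80 \x80 are rejected as UTF-8 but survive a round trip once they are base64url encoded. The "payload" key is a made-up name for illustration only; dropping the "=" padding mirrors what JWS does.

```python
import base64
import json

raw = b"\x80\x80"  # two octets that are not valid UTF-8 on their own

# Decoding the raw octets as UTF-8 fails, so they cannot sit directly in JSON text.
try:
    raw.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# base64url maps the octets to plain ASCII; the trailing "=" padding is dropped.
encoded = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
document = json.dumps({"payload": encoded})  # "payload" is a hypothetical key

# The receiver restores the padding before decoding.
received = json.loads(document)["payload"]
padded = received + "=" * (-len(received) % 4)
restored = base64.urlsafe_b64decode(padded)

print(decodes_as_utf8, encoded, restored == raw)  # False gIA True
```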
Related
Having read Joel on Encoding like a good boy, I find myself perplexed by the workings of Foundation's JSONDecoder: neither its init nor its decode methods take an encoding value. Looking through the docs, I see the instance variable dataDecodingStrategy, which is perhaps where the encoding-guessing magic happens...?
Am I missing something here? Shouldn't JSONDecoder need to know the encoding of the data it receives? I realize that the JSON standard requires this data to be UTF-8 encoded, but can JSONDecoder be making that assumption? I'm confused.
RFC 8259 (from 2017) requires that
JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8.
The older RFC 7159 (from 2014) and RFC 7158 (from 2013) only stated that
JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default
encoding is UTF-8, and JSON texts that are encoded in UTF-8 are
interoperable in the sense that they will be read successfully by the
maximum number of implementations; there are many implementations
that cannot successfully read texts in other encodings (such as
UTF-16 and UTF-32).
And RFC 4627 (from 2006, the oldest one that I could find):
JSON text SHALL be encoded in Unicode. The default encoding is
UTF-8.
Since the first two characters of a JSON text will always be ASCII
characters, it is possible to determine whether an octet
stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
at the pattern of nulls in the first four octets.
JSONDecoder (which uses JSONSerialization under the hood) is able to decode UTF-8, UTF-16, and UTF-32, both little-endian and big-endian. Example:
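import Foundation

// Encode a JSON array as UTF-16 little-endian instead of UTF-8.
let data = "[1, 2, 3]".data(using: .utf16LittleEndian)!
print(data as NSData) // <5b003100 2c002000 32002c00 20003300 5d00>
// No encoding is passed in; the decoder works it out from the data.
let a = try! JSONDecoder().decode([Int].self, from: data)
print(a) // [1, 2, 3]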
Since a valid JSON text (as defined by RFC 4627) must start with "[" or "{", the encoding can be determined unambiguously from the first bytes of the data.
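The null-pattern rule quoted from RFC 4627 can be sketched in a few lines of Python. This is only an illustration of the table in the RFC, not how JSONDecoder is implemented; the helper name detect_json_encoding is my own, and the sketch ignores byte order marks.

```python
import json

# Null patterns from RFC 4627 section 3 (True marks a zero octet).
NULL_PATTERNS = {
    (True, True, True, False): "utf-32-be",   # 00 00 00 xx
    (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
    (False, True, True, True): "utf-32-le",   # xx 00 00 00
    (False, True, False, True): "utf-16-le",  # xx 00 xx 00
}

def detect_json_encoding(data: bytes) -> str:
    """Guess the encoding from the first four octets; fall back to UTF-8."""
    pattern = tuple(octet == 0 for octet in data[:4])
    return NULL_PATTERNS.get(pattern, "utf-8")

# Same input as the Swift example: a JSON array encoded as UTF-16 little-endian.
data = "[1, 2, 3]".encode("utf-16-le")
encoding = detect_json_encoding(data)
values = json.loads(data.decode(encoding))
print(encoding, values)  # utf-16-le [1, 2, 3]
```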
I did not find this documented though, and one probably should not rely on it. A future implementation of JSONDecoder might support only the newer standard and require UTF-8.
When trying to download a video on Vevo via Inspect Element, I discovered that this was impossible, even though the content wasn't DRM-protected. The video tag refers to a file that I can't trace or find using Ctrl+I (Firefox Developer Edition), even while it is playing in the browser. Instead of /folder/video it says data:folder/video. How does this data: work?
A quick Google search and our friend Wikipedia says:
The data URI scheme is a uniform resource identifier (URI) scheme that provides a way to include data in-line in web pages as if they were external resources. It is a form of file literal or here document. This technique allows normally separate elements such as images and style sheets to be fetched in a single Hypertext Transfer Protocol (HTTP) request, which may be more efficient than multiple HTTP requests.
Syntax
The scheme followed by a colon (data:).
An optional media type. The media type part may include one or more parameters, in the format attribute=value, separated by semicolons. A common media type parameter is charset, specifying the character set of the media type, where the value is from the IANA list of character set names. If one is not specified, the media type of the data URI is assumed to be text/plain;charset=US-ASCII.
An optional base64 extension base64, separated from the preceding part by a semicolon. When present, this indicates that the data content of the URI is binary data, encoded in ASCII format using the Base64 scheme for binary-to-text encoding. The base64 extension is distinguished from any media type parameters by virtue of not having a =value component and by coming after any media type parameters.
The data, separated from the preceding part by a comma. The data is a sequence of zero or more octets represented as characters. The comma is required in a data URI, even when the data part has zero length. The characters permitted within the data part include ASCII upper and lowercase letters, digits, and many ASCII punctuation and special characters. Note that this may include characters, such as colon, semicolon, and comma which are delimiters in the URI components preceding the data part. Other octets must be percent-encoded. If the data is Base64-encoded, then the data part may contain only valid Base64 characters. Note that Base64-encoded data: URIs use the standard Base64 character set (with + and / as characters 62 and 63) rather than the so-called "URL-safe Base64" character set
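As a rough illustration of the syntax described above, here is a small Python sketch that splits a data: URI into its media type and decoded octets. The function name parse_data_uri is hypothetical, and the sketch skips edge cases (for example, a charset parameter given without a media type).

```python
import base64
from typing import Tuple
from urllib.parse import unquote_to_bytes

def parse_data_uri(uri: str) -> Tuple[str, bytes]:
    """Split a data: URI into (media type with parameters, decoded octets)."""
    if not uri.startswith("data:"):
        raise ValueError("not a data URI")
    header, comma, payload = uri[len("data:"):].partition(",")
    if not comma:
        raise ValueError("the comma is required, even when the data is empty")
    # The base64 marker comes last and has no =value part.
    is_base64 = header.endswith(";base64")
    if is_base64:
        header = header[: -len(";base64")]
    media_type = header or "text/plain;charset=US-ASCII"
    if is_base64:
        data = base64.b64decode(payload)
    else:
        data = unquote_to_bytes(payload)  # undo percent-encoding
    return media_type, data

print(parse_data_uri("data:text/plain;base64,SGVsbG8="))
print(parse_data_uri("data:,Hello%2C%20World"))
```

The first call uses the base64 form; the second relies on the default media type and percent-encoding for the comma and the space.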
Postamble for future readers
Elm allows literal C:\Users\myuser in strings
This is consistent with the JSON spec
My problem was unrelated to this, but several layers of escaping convoluted the problem. Future lesson: fully producing a minimal working example would have found the error!
Original question
I have a Clojure backend that talks to an Elm frontend. I hit a bump when decoding JSON values in Elm.
\U below means the literal characters backslash and U, as if read from a text file. "\\U" is the same string as written in Clojure and Elm source, where \ must be escaped. Note the enclosing double quotes.
Problem: encoding \U
The literal string \U, escaped as "\\U", is not accepted by the Elm string decoder.
A blog post suggests that to obtain the literal string \U, this should be encoded in source code as "\\\\U", "escaping the unicode escape".
The literal string I want to send to the client is C:\Users\myuser. I prefer to send valid JSON from the server to the client.
Clojure standard library behavior
clojure.data.json does not do anything special for strings containing the literal \U. The example below shows that \U and \m are treated equally: the backslash is escaped, and the following character is left as it is.
project.core> (clojure.data.json/write-str "C:\\Users\\myuser")
"\"C:\\\\Users\\\\myuser\""
Manual workaround
A temporary workaround is to manually escape the strings I need:
(defn escape-backslash-u [s]
  (clojure.string/replace s "\\U" "\\\\U"))
Concrete questions
Is clojure.data.json/write-str behaving correctly? As I understand the documentation, the output should be valid Unicode.
Are other JSON libraries behaving similarly?
Is Elm's Json.Decode behaving correctly by rejecting the literal string \U?
Solution progress
A friendly Clojurians Slack user pointed to the JSON standard specification, specifically sections 7. Strings and 8.2. Unicode characters.
I think you may be on the wrong track here.
The string you gave as an example, "C:\\Users\\myuser" is completely unproblematic, it does not contain any Unicode escape sequences. It is a string containing the ASCII characters ‘C’, ‘:’, ‘\’, ‘U’, and so on. The backslash is the escape character in Clojure strings so it needs to be escaped itself to represent a literal backslash.
In any case the string "C:\\Users\\myuser" can be serialized with (clojure.data.json/write-str "C:\\Users\\myuser"), and, as you know, this gives "\"C:\\\\Users\\\\myuser\"". All of this seems perfectly straightforward and sound.
Printing "\"C:\\\\Users\\\\myuser\"" results in the original string "C:\\Users\\myuser" being printed. That string is accepted as valid by JSONLint, again as expected.
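The layers of escaping are easier to see when each one is checked separately. Here is a minimal sketch in Python rather than Clojure (the json module stands in for clojure.data.json, which is an assumption about equivalent behavior only for this simple string):

```python
import json

# Source-code escaping: the value itself contains single backslashes.
path = "C:\\Users\\myuser"
print(len(path))        # 15 characters: C:\Users\myuser

# JSON escaping: every backslash in the value is doubled on the wire.
wire = json.dumps(path)
print(wire)             # "C:\\Users\\myuser"

# Decoding the JSON text gives back the original value.
print(json.loads(wire) == path)  # True

# A lone backslash followed by U is not a valid JSON escape, so a
# conforming decoder rejects JSON text that was not escaped properly.
try:
    json.loads('"C:\\Users"')    # the JSON text is "C:\Users"
    rejected = False
except json.JSONDecodeError:
    rejected = True
print(rejected)         # True
```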
I understood it as Elm being unable to decode \"C:\\\\User... to "C:\\User... because it interprets \u as the start of an escape sequence.
I tried Elm here with the following code:
import Html exposing (text)

main =
    text "\"c:\\\\user\\\\foo\"" -- from clojure.data.json/write-str
which in turn compiles/runs to
"c:\\user\\foo"
which looks fine to me.
Are you sure there is nothing else going on (middleware, transport)?
To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
ECMA-404: The JSON Data Interchange Format
I believe that there is no need to encode this character at all, so it could be represented directly as "𝄞". However, should one wish to encode it, it must, per spec, be encoded as "\uD834\uDD1E", not (as would seem reasonable) as "\u1d11e". Why is this?
One of the key architectural features of JSON is that JSON-encoded objects are valid JavaScript literals that can be evaluated using the eval function, for example. Unfortunately, older JavaScript implementations only support 16-bit Unicode escape sequences with four hex characters in string literals, so there's no portable way to escape code points above 0xFFFF other than UTF-16 surrogates. (The \u{...} syntax that allows arbitrary code points was only introduced in ECMAScript 6.)
But as you mentioned, there's no need to use escape sequences if your application supports Unicode JSON text. Simply encode the characters directly in the respective Unicode format.
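For reference, the surrogate pair arithmetic is short. The following Python sketch derives the twelve-character escape for U+1D11E and checks it against Python's json module; the helper name to_surrogate_escape is my own.

```python
import json

def to_surrogate_escape(code_point: int) -> str:
    """Return the \\uXXXX\\uXXXX escape for a code point outside the BMP."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000            # 20 bits remain
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return "\\u{:04X}\\u{:04X}".format(high, low)

escaped = to_surrogate_escape(0x1D11E)
print(escaped)                                          # \uD834\uDD1E

# A JSON parser turns the escape back into the single character.
print(json.loads('"' + escaped + '"') == "\U0001D11E")  # True

# The unescaped form is equally valid JSON.
print(json.loads('"\U0001D11E"') == "\U0001D11E")       # True
```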
I use Jackson to parse JSON data. Now I have a problem with handling \uXXXX escapes.
The data I got here is like
{"UID":"here_\ud83d\udc3b"}
After I use ObjectMapper.readValue(jsonContent, UserId.class); to convert the JSON to an instance of UserId, the UID property is not literally "here_\ud83d\udc3b". Jackson converts \ud83d\udc3b into the two chars that make up the Unicode value.
My question is: is it possible to make Jackson skip this "Unicode transformation" and keep the literal value "\ud83d\udc3b" as it is?
No. JSON parsers are required to handle Unicode escapes to produce underlying Unicode characters.
When writing, on the other hand, some characters may also be encoded using similar Unicode escapes.
So if you need to use escaping, you need to re-encode such values yourself.
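The same round trip can be sketched with Python's json module (used here only as a stand-in; this is not Jackson's API). Parsing resolves the escapes, and getting the escaped text back means re-encoding the value yourself:

```python
import json

json_content = '{"UID":"here_\\ud83d\\udc3b"}'  # escapes as they appear on the wire

# Parsing resolves the surrogate pair into one character, U+1F43B.
uid = json.loads(json_content)["UID"]
print(uid == "here_\U0001F43B")   # True

# To get the escaped spelling back, re-encode the value and strip the quotes.
literal = json.dumps(uid, ensure_ascii=True)[1:-1]
print(literal)                    # here_\ud83d\udc3b
```

Note that re-encoding produces whatever escape style the writer chooses, so it is only guaranteed to be equivalent to the original text, not byte-for-byte identical in general.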