How to make Jackson treat \uXXXX as plain text? - json

I use Jackson to parse JSON data, and I have a problem handling a \uXXXX escape.
The data I get looks like this:
{"UID":"here_\ud83d\udc3b"}
After I use ObjectMapper.readValue(jsonContent, UserId.class); to convert the JSON to an instance of UserId, the UID property is not literally "here_\ud83d\udc3b". Jackson converts \ud83d\udc3b into the two chars of the corresponding Unicode character.
My question is: is it possible to have Jackson skip this "Unicode transformation" and keep the literal value "\ud83d\udc3b" as it is?

No. JSON parsers are required to handle Unicode escapes and produce the underlying Unicode characters.
When writing, on the other hand, some characters may also be encoded using similar Unicode escapes.
So if you need the escaped form, you have to re-encode such values yourself.
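If what you actually need is the escaped form on output, Jackson can re-escape everything outside ASCII when writing. A minimal sketch (Jackson 2.x; JsonGenerator.Feature.ESCAPE_NON_ASCII is the relevant switch):

import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.ObjectMapper;

public class EscapeOnWrite {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Reading always resolves \uXXXX escapes into real characters.
        String uid = mapper.readValue("\"here_\\ud83d\\udc3b\"", String.class);
        System.out.println(uid.length()); // 7: "here_" plus a two-char surrogate pair

        // On output, non-ASCII characters can be escaped back to \uXXXX form.
        mapper.getFactory().configure(JsonGenerator.Feature.ESCAPE_NON_ASCII, true);
        System.out.println(mapper.writeValueAsString(uid)); // "here_\uD83D\uDC3B"
    }
}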

Related

Raw string field value in JSON file

In my JSON file, one of the fields has to carry the content of another file (a string).
The string has CRLFs, single/double quotes, tabs.
Is there a way to consider my whole string as a raw string so I don't have to escape special characters?
Is there an equivalent in JSON to the string raw delimiter in C++?
In C++, I would just put the whole file content inside: R"( ... )"
Put simply, no, there isn't. Depending on what parser you use, it may have a feature that allows this, and/or there may be a variant of JSON that allows this (examples of variants include JSONP and JSON-C, though I'm not aware of one that specifically allows the features you are looking for), but the JSON standard that is ubiquitous on the web does not support multiline strings or unescaped special characters.
A workaround for the lack of raw string support in JSON is to Base64 encode your string before adding it to your JSON.
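For example, in Java (a minimal sketch, assuming the raw content is UTF-8):

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class RawStringInJson {
    public static void main(String[] args) {
        String raw = "line one\r\nline two\t\"quoted\"";
        // Base64 output uses only A-Z, a-z, 0-9, +, / and =, so it needs no JSON escaping.
        String encoded = Base64.getEncoder().encodeToString(raw.getBytes(StandardCharsets.UTF_8));
        System.out.println("{\"content\":\"" + encoded + "\"}");

        // The consumer decodes it back to the original string.
        String decoded = new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
        System.out.println(decoded.equals(raw)); // true
    }
}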

Save JSON (Japanese content) in database

I used Laravel to save JSON into the database in the following format:
[{"s":"0.000","e":"","t":"\u672c\u65e5\u306f\u3054\u6765\u5854\u304f\u3060\u3055\u3044\u307e\u3057\u3066"},{"s":"0.001","e":"28.000","t":""},{"s":"0.002","e":"29.000","t":"\u3069\u3046\u305e\u6771\u4eac\u306e\u4eca\u3092\u3054\u3086\u3063\u304f\u308a\u304a\u697d\u3057\u307f\u304f\u3060\u3055\u3044\u307e\u305b\u30022"}]
The Japanese becomes "\u3069\u3046\u305e\u..."
Can you tell me how to convert it back to Japanese?
"\u672c\u65e5\u306f\u3054\u6765\u5854\u304f\u3060\u3055\u3044\u307e\u3057\u3066" represents the same string as "本日はご来塔くださいまして". It just depends on the JSON stringifier whether or not it escapes certain characters. Since you're using laravel, I'm assuming you are generating the JSON in PHP. If you are using json_encode you can use the JSON_UNESCAPED_UNICODE option to get JSON with the Japanese UTF-8 characters instead of the escape sequences.
Either way when you parse the JSON you will get the same string. When you display that sting make sure you interpret it as UTF-8.
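For example, on the consuming side (a minimal Jackson sketch in Java; any conforming parser behaves the same way):

import com.fasterxml.jackson.databind.ObjectMapper;

public class DecodeEscaped {
    public static void main(String[] args) throws Exception {
        // The escape sequences are resolved during parsing.
        String s = new ObjectMapper().readValue("\"\\u672c\\u65e5\\u306f\"", String.class);
        System.out.println(s); // 本日は
    }
}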

Clojure: Escaping unicode `\U` in JSON encoding

Postamble for future readers
Elm allows literal C:\Users\myuser in strings
This is consistent with the JSON spec
My problem was unrelated to this, but several layers of escaping convoluted the problem. Future lesson: fully producing a minimal working example would have found the error!
Original question
I have a Clojure backend that talks to an Elm frontend. I hit a bump when decoding JSON values in Elm.
Below, \U means the literal characters backslash and U, as if read from a text file. "\\U" is the same string as written in Clojure or Elm source (\ must be escaped there). Note the enclosing quotes.
Problem: encoding \U
The literal string \U, escaped as "\\U", is not accepted by the Elm string decoder.
A blog post suggests that to obtain the literal string \U, it should be encoded in source code as "\\\\U", "escaping the unicode escape".
The literal string I want to send to the client is C:\Users\myuser. I prefer to send valid JSON from the server to the client.
Clojure standard library behavior
clojure.data.json does not do anything special for strings containing the literal \U. The example below shows that \U and \m are treated equally: the backslash is escaped, and the following character is left as-is.
project.core> (clojure.data.json/write-str "C:\\Users\\myuser")
"\"C:\\\\Users\\\\myuser\""
Manual workaround
A temporary workaround is manually escaping the strings I need:
(defn escape-backslash-u [s]
  (clojure.string/replace s "\\U" "\\\\U"))
Concrete questions
Is clojure.data.json/write-str behaving correctly? As I understand the documentation, the output should be valid Unicode.
Are other JSON libraries behaving similarly?
Is Elm's Json.Decode behaving correctly by rejecting the literal string \U?
Solution progress
A friendly Clojurians Slack user pointed to the JSON standard specification, specifically sections 7. Strings and 8.2. Unicode characters.
I think you may be on the wrong track here.
The string you gave as an example, "C:\\Users\\myuser", is completely unproblematic; it does not contain any Unicode escape sequences. It is a string containing the ASCII characters ‘C’, ‘:’, ‘\’, ‘U’, and so on. The backslash is the escape character in Clojure strings, so it needs to be escaped itself to represent a literal backslash.
In any case the string "C:\\Users\\myuser" can be serialized with (clojure.data.json/write-str "C:\\Users\\myuser"), and, as you know, this gives "\"C:\\\\Users\\\\myuser\"". All of this seems perfectly straightforward and sound.
Printing "\"C:\\\\Users\\\\myuser\"" results in the original string "C:\\Users\\myuser" being printed. That string is accepted as valid by JSONLint, again as expected.
I understood it as Elm being unable to decode \"C:\\\\User... to "C:\\User... because it interprets \u as the start of an escape sequence.
I tried Elm here with the following code:
import Html exposing (text)
main =
    text "\"c:\\\\user\\\\foo\"" -- from clojure.data.json/write-str
which in turn compiles/runs to
"c:\\user\\foo"
which looks fine to me.
Are you sure there is nothing else going on (middleware, transport)?

Why does JSON encode UTF-16 surrogate pairs instead of Unicode code points directly?

To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
ECMA-404: The JSON Data Interchange Format
I believe that there is no need to encode this character at all, so it could be represented directly as "𝄞". However, should one wish to encode it, it must, per spec, be encoded as "\uD834\uDD1E", not (as would seem reasonable) as "\u1d11e". Why is this?
One of the key architectural features of JSON is that JSON-encoded objects are valid Javascript literals that can be evaluated using the eval function, for example. Unfortunately, older Javascript implementations only support 16-bit Unicode escape sequences with four hex characters in string literals, so there's no other way than to use UTF-16 surrogates in escape sequences for code points above 0xFFFF in a portable way. (The \u{...} syntax that allows arbitrary code points was only introduced in ECMAScript 6.)
But as you mentioned, there's no need to use escape sequences if your application supports Unicode JSON text. Simply encode the characters directly in the respective Unicode format.
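The surrogate-pair arithmetic itself is straightforward; a small Java sketch for U+1D11E:

public class SurrogatePair {
    public static void main(String[] args) {
        int cp = 0x1D11E;                      // G clef, outside the BMP
        int v = cp - 0x10000;                  // a 20-bit value
        int high = 0xD800 + (v >> 10);         // top 10 bits    -> 0xD834
        int low  = 0xDC00 + (v & 0x3FF);       // bottom 10 bits -> 0xDD1E
        System.out.printf("\\u%04X\\u%04X%n", high, low); // \uD834\uDD1E

        // The standard library performs the same computation:
        char[] units = Character.toChars(cp);
        System.out.println(units[0] == high && units[1] == low); // true
    }
}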

JSON and escaping characters

I have a string which gets serialized to JSON in Javascript, and then deserialized to Java.
It looks like if the string contains a degree symbol, then I get a problem.
I could use some help in figuring out who to blame:
is it the Spidermonkey 1.8 implementation? (this has a JSON implementation built-in)
is it Google gson?
is it me for not doing something properly?
Here's what happens in JSDB:
js>s='15\u00f8C'
15øC
js>JSON.stringify(s)
"15øC"
I would have expected "15\u00f8C", which leads me to believe that Spidermonkey's JSON implementation isn't doing the right thing... except that the JSON homepage's syntax description (is that the spec?) says that a char can be
any-Unicode-character-
except-"-or-\-or-
control-character
so maybe it passes the string along as-is without encoding it as \u00f8... in which case I would think the problem is with the gson library.
Can anyone help?
I suppose my workaround is to use either a different JSON library, or manually escape strings myself after calling JSON.stringify() -- but if this is a bug then I'd like to file a bug report.
This is not a bug in either implementation. There is no requirement to escape U+00B0. To quote the RFC:
2.5. Strings
The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
Any character may be escaped.
Escaping everything inflates the size of the data: every code point can be represented in four or fewer bytes in all Unicode transformation formats, whereas escaping them all takes six or twelve bytes per character.
It is more likely that you have a text transcoding bug somewhere in your code and escaping everything in the ASCII subset masks the problem. It is a requirement of the JSON spec that all data use a Unicode encoding.
hmm, well here's a workaround anyway:
function JSON_stringify(s, emit_unicode)
{
    var json = JSON.stringify(s);
    // Re-escape everything outside printable ASCII as \uXXXX, zero-padded to four hex digits.
    return emit_unicode ? json : json.replace(/[\u007f-\uffff]/g,
        function(c) {
            return '\\u' + ('0000' + c.charCodeAt(0).toString(16)).slice(-4);
        }
    );
}
test case:
js>s='15\u00f8C 3\u0111';
15øC 3đ
js>JSON_stringify(s, true)
"15øC 3đ"
js>JSON_stringify(s, false)
"15\u00f8C 3\u0111"
This is SUPER late and probably not relevant anymore, but if anyone stumbles upon this answer, I believe I know the cause.
So the JSON-encoded string is perfectly valid with the degree symbol in it, as the other answer mentions. The problem is most likely in the character encoding that you are reading/writing with. Depending on how you are using Gson, you are probably passing it a java.io.Reader instance. Any time you create a Reader from an InputStream, you need to specify the character encoding via a java.nio.charset.Charset instance (it's usually best to use java.nio.charset.StandardCharsets.UTF_8). If you don't specify a Charset, Java will use your platform's default encoding, which on Windows is usually CP-1252.