JSON feed in UTF-8 without byte order marker - json

I have a WCF application written in C# that delivers my data in JSON or XML, depending on what the user asks for in the query string. Here is a quick snippet of my code that delivers the data:
Encoding utf8 = new System.Text.UTF8Encoding(false);
return WebOperationContext.Current.CreateTextResponse(data, "application/json", utf8);
When I deliver the data using the above method, the special characters are all messed up, so Chávez looks like ChÃ¡vez. On the other hand, if I create the utf8 variable above with the BOM or use the predefined Encoding.UTF8, the special characters work fine. But then some of my consumers complain that their code throws an exception when consuming my API. This of course happens because of the BOM in the feed. Is there a way for me to correctly display the special characters without the BOM in the feed?

It looks like the output is correct, but whatever you are using to display it expects ANSI-encoded text. ChÃ¡vez is what you get when you encode Chávez in UTF-8 and interpret the result as if it were Latin-1.
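To make the effect concrete, here is a minimal sketch in Node.js (not the WCF code from the question) that encodes the name as BOM-less UTF-8 and then decodes the same bytes both ways. The helper name hasUtf8Bom is made up for this illustration.

```javascript
// UTF-8 bytes of "Chávez", as a BOM-less encoder would emit them.
// The "á" is written as \u00e1 so the example does not depend on the file encoding.
const utf8Bytes = Buffer.from('Ch\u00e1vez', 'utf8');

// Correct: the consumer decodes the bytes as UTF-8.
const decodedAsUtf8 = utf8Bytes.toString('utf8');

// Wrong: the consumer treats every byte as one Latin-1 character.
const decodedAsLatin1 = utf8Bytes.toString('latin1');

// Hypothetical helper: a UTF-8 BOM is the byte sequence EF BB BF at the start.
function hasUtf8Bom(bytes) {
  return bytes.length >= 3 &&
    bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf;
}

console.log(decodedAsUtf8);         // Chávez
console.log(decodedAsLatin1);       // ChÃ¡vez
console.log(hasUtf8Bom(utf8Bytes)); // false
```

If the bytes on the wire look like the first case, the feed itself is fine and the fix probably belongs in whatever tool is displaying it.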

Related

Save json(content is japanese) in database

I used Laravel to save JSON into the database in the following format:
[{"s":"0.000","e":"","t":"\u672c\u65e5\u306f\u3054\u6765\u5854\u304f\u3060\u3055\u3044\u307e\u3057\u3066"},{"s":"0.001","e":"28.000","t":""},{"s":"0.002","e":"29.000","t":"\u3069\u3046\u305e\u6771\u4eac\u306e\u4eca\u3092\u3054\u3086\u3063\u304f\u308a\u304a\u697d\u3057\u307f\u304f\u3060\u3055\u3044\u307e\u305b\u30022"}]
The Japanese text becomes "\u3069\u3046\u305e\u..."
Can you tell me how to convert it back to Japanese?
"\u672c\u65e5\u306f\u3054\u6765\u5854\u304f\u3060\u3055\u3044\u307e\u3057\u3066" represents the same string as "本日はご来塔くださいまして". It just depends on the JSON stringifier whether or not it escapes certain characters. Since you're using laravel, I'm assuming you are generating the JSON in PHP. If you are using json_encode you can use the JSON_UNESCAPED_UNICODE option to get JSON with the Japanese UTF-8 characters instead of the escape sequences.
Either way, when you parse the JSON you will get the same string. When you display that string, make sure you interpret it as UTF-8.
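The equivalence is easy to check with any JSON parser. The sketch below uses JavaScript rather than PHP, purely as an illustration; the variable names are made up.

```javascript
// The same three characters, once as \u escapes and once as literal text.
const escapedJson = '"\\u672c\\u65e5\\u306f"';
const unescapedJson = '"本日は"';

const fromEscaped = JSON.parse(escapedJson);
const fromUnescaped = JSON.parse(unescapedJson);

console.log(fromEscaped === fromUnescaped); // true
console.log(fromEscaped.length);            // 3

// JSON.stringify in JavaScript leaves these characters unescaped by default,
// which is comparable to what JSON_UNESCAPED_UNICODE asks json_encode to do.
console.log(JSON.stringify(fromEscaped) === unescapedJson); // true
```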

Problems with bad characters in UTF8 JSON strings with Delphi XE

We have a Delphi XE application that exchanges JSON data with a cloud application that is written in Node.
Generally everything works fine, but every once in a while we get bad (unknown) characters in some of our strings, and we are having problems tracking it down. These characters are rendered as diamonds and have a character code of 65533.
We perform a REST POST call to get our data from the cloud and we get it as a JSON object that includes some metadata and an array of records. We use DBXJson's TJsonObject for the JSON parsing in the form of
jsv := TJSONObject.parseJsonValue(s)
where s is the data we got from our Post call.
From this we use a TJsonArray to get the records, traverse it, and use JSonValue.ToString to retrieve the string values.
The data is stored in DBIsam using VarChar fields.
Any ideas how to detect and prevent "bad" characters that appear or where else this could happen?
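Character code 65533 is U+FFFD, the Unicode replacement character, which decoders emit when they meet bytes that are not valid in the expected encoding. One common way this happens (an assumption here, not a confirmed cause for this application) is decoding UTF-8 chunk by chunk, so that a multi-byte character is cut in half. A minimal Node.js sketch of that effect, with a made-up helper for counting the bad characters:

```javascript
// "Köln" in UTF-8 is 4B C3 B6 6C 6E; "ö" takes two bytes (C3 B6).
const bytes = Buffer.from('K\u00f6ln', 'utf8');

// Simulate a network read that ends in the middle of "ö".
const chunkA = bytes.subarray(0, 2);
const chunkB = bytes.subarray(2);

// Decoding each chunk on its own damages the split character.
const decodedPerChunk = chunkA.toString('utf8') + chunkB.toString('utf8');

// Joining the bytes first and decoding once keeps it intact.
const decodedOnce = Buffer.concat([chunkA, chunkB]).toString('utf8');

// Hypothetical helper: count U+FFFD occurrences to flag damaged strings.
function countReplacementChars(text) {
  return (text.match(/\uFFFD/g) || []).length;
}

console.log(countReplacementChars(decodedPerChunk)); // 2
console.log(countReplacementChars(decodedOnce));     // 0
```

A check like countReplacementChars at each hop (sender, receiver, before storage) can help narrow down where the damage is introduced.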

Text encoding problems in JSON.stringified() object

I have a index.html with a which sends a text to a PHP code. This PHP sends it again by POST (curl) to a Node.js server, inserted in a JSON message (utf8-encoded)
//Node.js server file (app.js) -- gets the json and shows it in a <script> to save it in client JS
render(index, {json:{string:"mystring"}})
//Template to render (index.ejs)
var data = <%=JSON.stringify(json)%>;
This is so that I can pass the variables in the JSON to data. The JSON is much bigger than shown here; I wrote only the part which creates a bug: the string contained here triggers an "Invalid character" JS error. What should I do? Which encoding/decoding/escaping should I use?
I have UTF-8 everywhere, and all my other strings work, even with German or Arabic characters. In this particular case, it is the "mystring" below that breaks the app:
If I remove the characters in the red circles, it works.
Here is the string as it is in the JSON I receive:
"Otto\nTheater-, Konzert- und Gpb\n\u2028\u2028Rhoasse\u00dfe 20\u2028\n51065 K\u00f6ln\n\nTelefon: 0000-000000-0\u2028\nTelefax: 0000-000000\n\nE-Mail: address#mail.com\u2028"
Because it is user-entered text, I must handle these kinds of characters. I don't have access to the PHP part of the code, only to the Node.js and client JS. How can I find and remove/convert those chars in JS?
<%- JSON.stringify(data).replace(/[\u0000\u00ad\u0600-\u0604\u070f\u17b4\u17b5\u200c-\u200f\u2028-\u202f\u2060-\u206f\ufeff\ufff0-\uffff]/g, "\\n") %>;
I ended up replacing invalid Unicode characters (which are valid for JSON but not in JS code) with line breaks. This solves the problem.
JSON is commonly thought to be a subset of JavaScript, but it isn't quite. Due to an unfortunate oversight, the raw characters U+2028 and U+2029 are permitted in JSON string literals, but not in JavaScript string literals (ES2019 later lifted this restriction, but older engines still reject them). In JavaScript, they are treated as line terminators, so having one in a string literal is a syntax error.
Consequently this:
var data = <%=JSON.stringify(json)%>;
isn't safe. You can make it so by manually replacing them with string-literal-escaped versions:
JSON.stringify(json).replace(/\u2028/g, '\\u2028').replace(/\u2029/g, '\\u2029')
Typically it's best to avoid this kind of problem, and keep code and data strictly separated, by dropping the JSON data into an HTML data- attribute. It can then be read out of the DOM from the client-side script and passed through JSON.parse. Then the only kind of escaping you have to worry about is normal HTML-escaping, which hopefully your templating language does by default.
The other characters in your answer are actually okay for JS string literals, except for the control characters, which JSON also escapes.
It may well make sense to remove some of these characters anyway, as an input filtering step. It's unusual and almost always undesirable to have cruft like U+2028 in your data. You could consider filtering out the characters unsuitable for use in markup which include U+2028/9 and other bad things like bidi overrides that can mess up your page rendering.
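As a rough sketch of the data- attribute approach, assuming the markup is built as a string on the Node side: the helper names escapeHtmlAttr and renderDataHolder and the element id "payload" are invented for this example, and a templating engine that HTML-escapes by default would replace the first helper.

```javascript
// Escape the characters that matter inside a double-quoted HTML attribute.
function escapeHtmlAttr(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical helper: put the JSON into a data- attribute instead of a <script>.
function renderDataHolder(payload) {
  return '<div id="payload" data-json="' +
    escapeHtmlAttr(JSON.stringify(payload)) + '"></div>';
}

const html = renderDataHolder({ string: 'Otto\u2028 "K\u00f6ln" <b>' });
console.log(html.includes('&quot;')); // true

// Client side (needs a DOM, so shown as a comment only):
//   var data = JSON.parse(document.getElementById('payload').dataset.json);
```

Because the JSON never becomes JavaScript source, U+2028 and U+2029 need no special handling; the browser un-escapes the attribute and JSON.parse accepts the raw characters.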

JSON and escaping characters

I have a string which gets serialized to JSON in Javascript, and then deserialized to Java.
It looks like if the string contains a degree symbol, then I get a problem.
I could use some help in figuring out who to blame:
is it the Spidermonkey 1.8 implementation? (this has a JSON implementation built-in)
is it Google gson?
is it me for not doing something properly?
Here's what happens in JSDB:
js>s='15\u00f8C'
15°C
js>JSON.stringify(s)
"15°C"
I would have expected "15\u00f8C", which leads me to believe that Spidermonkey's JSON implementation isn't doing the right thing... except that the JSON homepage's syntax description (is that the spec?) says that a char can be
any-Unicode-character-
except-"-or-\-or-
control-character"
so maybe it passes the string along as-is without encoding it as \u00f8... in which case I would think the problem is with the gson library.
Can anyone help?
I suppose my workaround is to use either a different JSON library, or manually escape strings myself after calling JSON.stringify() -- but if this is a bug then I'd like to file a bug report.
This is not a bug in either implementation. There is no requirement to escape U+00B0. To quote the RFC:
2.5. Strings
The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
Any character may be escaped.
Escaping everything inflates the size of the data (all code points can be represented in four or fewer bytes in all Unicode transformation formats, whereas escaping them all makes them six or twelve bytes).
It is more likely that you have a text transcoding bug somewhere in your code, and escaping everything into the ASCII subset masks the problem. It is a requirement of the JSON spec that all data use a Unicode encoding.
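The size difference and the equivalence of both forms can be checked with a short sketch like the following (Node.js, UTF-8 assumed as the transport encoding; the variable names are illustrative only):

```javascript
// The degree sign (U+00B0) written raw and written as a \u escape.
const rawJson = '"15\u00b0C"';
const escapedJson = '"15\\u00b0C"';

// Both forms parse to the same string.
console.log(JSON.parse(rawJson) === JSON.parse(escapedJson)); // true

// Raw: 2 bytes for the degree sign in UTF-8. Escaped: 6 ASCII bytes.
console.log(Buffer.byteLength(rawJson, 'utf8'));     // 7
console.log(Buffer.byteLength(escapedJson, 'utf8')); // 11

// A code point outside the BMP needs a surrogate pair when escaped:
// 4 bytes raw in UTF-8 versus 12 bytes as two \u escapes.
const rawAstral = '"\u{1F600}"';
const escapedAstral = '"\\ud83d\\ude00"';
console.log(Buffer.byteLength(rawAstral, 'utf8'));     // 6
console.log(Buffer.byteLength(escapedAstral, 'utf8')); // 14
```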
hmm, well here's a workaround anyway:
function JSON_stringify(s, emit_unicode)
{
    var json = JSON.stringify(s);
    return emit_unicode ? json : json.replace(/[\u007f-\uffff]/g,
        function(c) {
            return '\\u'+('0000'+c.charCodeAt(0).toString(16)).slice(-4);
        }
    );
}
test case:
js>s='15\u00f8C 3\u0111';
15°C 3◄
js>JSON_stringify(s, true)
"15°C 3◄"
js>JSON_stringify(s, false)
"15\u00f8C 3\u0111"
This is SUPER late and probably not relevant anymore, but if anyone stumbles upon this answer, I believe I know the cause.
So the JSON encoded string is perfectly valid with the degree symbol in it, as the other answer mentions. The problem is most likely in the character encoding that you are reading/writing with. Depending on how you are using Gson, you are probably passing it a java.io.Reader instance. Any time you are creating a Reader from an InputStream, you need to specify the character encoding, or java.nio.charset.Charset instance (it's usually best to use java.nio.charset.StandardCharsets.UTF_8). If you don't specify a Charset, Java will use your platform default encoding, which on Windows is usually CP-1252.

Querying Chinese addresses in Googlemap API geocoding

I'm following the demo code from the phpsqlgeocode.html article.
In the db, I inserted some Chinese addresses, which are UTF-8 encoded. I found that after urlencode is applied to the Chinese address, the address in the request is wrong. Like this one:
http://maps.google.com.tw/maps/geo?output=csv&key=ABQIAAAAfG3KxFZXjEslq8VNxMBpKRR08snBovzCxLQZ9DWwpnzxH-ROPxSAS9Q36m-6OOy0qlwTL6Ht9qp87w&q=%3F%3F%3F%3F%3F%3F%3F%3F%3F132%3F
Then it outputs 200,5,59.3266963,18.2733433 (I can't query this through PHP, but through the browser instead).
This address is actually located in Taichung, Taiwan, but the result turns out to be in Sweden, Europe. But when I paste the Chinese address (such as 台中市西屯區智惠街131巷56號58號60號) into the URL, the result turns out to be fine!
How do I make sure it sends out the original Chinese address? How do I avoid urlencode()? I found that removing urlencode() doesn't change anything.
(I've changed the MAPS_HOST from maps.google.com to maps.google.com.tw.)
(I'm sure my key is right, and geocoding English addresses works fine.)
q=%3F%3F%3F%3F%3F%3F%3F%3F%3F132%3F
decodes to:
?????????132?
so something has corrupted the string already before URL-encoding. This could happen if you try to convert the Chinese characters into an encoding that doesn't support Chinese characters, such as Latin-1.
You need to ensure that you're using UTF-8 consistently through your application. In particular you will need to ensure the tables in the database are stored using a UTF-8 character set (and, in MySQL terms, a matching UTF-8 collation). The default character set for MySQL before version 8.0 is otherwise Latin-1. You'll also want to ensure your connection to the database uses UTF-8 by calling mysql_set_charset('utf8'); note that MySQL spells the name utf8, without a hyphen.
(I am guessing from your question that you're using PHP.)
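The difference between a healthy and a corrupted query string is visible before the request is even sent. The sketch below is JavaScript, not the PHP from the question, and toLatin1Lossy is a made-up stand-in for whatever lossy conversion happens in the real pipeline.

```javascript
// Hypothetical stand-in for a lossy conversion to Latin-1:
// anything outside U+0000..U+00FF is replaced by "?".
function toLatin1Lossy(text) {
  return Array.from(text, ch => (ch.codePointAt(0) <= 0xff ? ch : '?')).join('');
}

const address = '台中市';

// Intact UTF-8 text: three percent-encoded bytes per character.
console.log(encodeURIComponent(address));
// %E5%8F%B0%E4%B8%AD%E5%B8%82

// Text damaged before encoding: only question marks are left to encode.
console.log(encodeURIComponent(toLatin1Lossy(address)));
// %3F%3F%3F

// The query from the question decodes to question marks, so the damage
// happened before URL-encoding.
console.log(decodeURIComponent('%3F%3F%3F%3F%3F%3F%3F%3F%3F132%3F'));
// ?????????132?
```

Logging the encoded value on the server and checking for runs of %3F is a quick way to tell whether the string was already damaged when it came out of the database.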