AS3: Conversion to GBK charset - actionscript-3

Using Flex (and HTTPService), I am loading data from a URL; the data is encoded with the GBK charset. A good example of such a URL is this one.
A browser detects that the data is in the GBK charset and correctly displays the text, with Chinese characters where they appear. Flex, however, holds the data in a different charset, and it ends up looking like this:
({"q":"tes","p":false,"bs":"","s":["ÌØ˹À­","ÌØÊâ·ûºÅ","test","ÌØÊâÉí·Ý","tesco","ÌØ˹À­Æû³µ","ÌØÊÓÍø","ÌØÊâ·ûºÅͼ°¸´óȫ","testin","ÌØ˹À­Æ󳵼۸ñ"]});
I need to convert the text into the same character string that the browser displays.
What I am already doing is converting through a ByteArray; the best result so far comes from using "iso-8859-1":
var convert:String;
var byte:ByteArray = new ByteArray();
// Write the received string back out as raw bytes, then re-read those bytes as GBK.
byte.writeMultiByte(event.result as String, "iso-8859-1");
byte.position = 0;
convert = byte.readMultiByte(byte.bytesAvailable, "gbk");
This produces the following string, which is very close to the browser result but not identical:
({"q":"tes","p":false,"bs":"","s":["特?拉","特殊符号","test","特殊身份","tesco","特?拉汽车","特视网","特殊符号?案大?","testin","特?拉????]});
Some characters are still replaced by "?" marks. And when I copy the browser result into Flex and print it, it displays correctly, so it is not a matter of unsupported characters in the Flash trace output or anything like that.
Interesting fact: Notepad++ gives the same close-but-not-quite result as the ByteArray approach in Flex. Also, in NP++, when converting the correct/expected string from gbk to iso-8859-1, I get a slightly different string than the one Flex receives from the URL:
({"q":"tes","p":false,"bs":"","s":["ÌØ˹À­","ÌØÊâ·ûºÅ","test","ÌØÊâÉí·Ý","tesco","ÌØ˹À­Æû³µ","ÌØÊÓÍø","ÌØÊâ·ûºÅͼ°¸´óÈ«","testin","ÌØ˹À­Æû³µ¼Û¸ñ"]});
It seems to me that this is the string Flex should be receiving for the ByteArray approach to produce the correct result (the one visible in browsers). So I see three possible causes:
Something is happening to the data coming from the URL to Flex, causing it to be slightly different (unlikely)
The received charset is not actually iso-8859-1, but another similar charset
I don't have a complete grasp of the difference between encoding and charset, so maybe this keeps me from understanding the problem.
Any help/idea would be greatly appreciated.
Thank you.

I managed to find the problem and the solution; I hope this helps anyone else in the future.
It turns out that HTTPService automatically converts the result into a String, which may collapse some byte pairs into single characters. That is why I was getting the first result (see above) instead of the third one. What I needed was to get the result in binary form; HTTPService does not have that resultFormat, but URLLoader does.
Replace HTTPService with URLLoader
Set the dataFormat property of the URLLoader to URLLoaderDataFormat.BINARY
After loading, the data property returns a ByteArray. Tracing this byte array (or converting it into a String) displays the same result that HTTPService was getting, which is still wrong; however, the byte array actually holds the correct data byte for byte (its length property will be a bit larger than the length of the converted string).
So you can read the string from this ByteArray using the "gbk" charset:
byteArray.readMultiByte(byteArray.length, "gbk");
This returns the correct string, the same one the browser displays.
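For anyone who wants to see the byte-level mechanics outside of Flash, here is a small Python sketch of the same idea (illustration only, not part of the Flex solution; the sample word 特斯拉 is reconstructed from the first garbled entry above):
raw = "特斯拉".encode("gbk")                       # the bytes the server actually sends
print(raw.decode("gbk"))                           # correct text; what readMultiByte(..., "gbk") does
mojibake = raw.decode("latin-1")                   # the same kind of garbage shown in the question
print(mojibake.encode("latin-1").decode("gbk"))    # round-trips back, because latin-1 maps every byte
# A String conversion through a charset that does not map every byte (presumably
# what HTTPService's automatic conversion did here) loses bytes, and the lost
# characters come back as "?".
The URLLoader/BINARY route avoids that intermediate String conversion entirely, which is why reading the raw bytes as GBK returns the correct text.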

Related

json parsing error for large data in browser

I am trying to implement export-to-Excel functionality for the data in an HTML table (5000+ rows). I am using json2.js to convert the client-side data into a JSON string called jsonToExport.
The value of this variable is fine for a smaller number of records, and it decodes fine (I checked in the browser in debug mode).
But for a large dataset of 5000+ records, the JSON parsing/decoding fails. I can see the encoded string, but the decoded value shows:
jsonToExport: unable to decode
I experimented with the data and found that the error appears whenever the data exceeds a particular size, for example when I increase the column sizes; replacing large columns with small ones makes it work again. So in effect it is not an issue with the format of the encoded JSON string missing anything, since every combination of columns works as long as the number of columns is limited.
It is definitely unable to decode/parse and then pass the JSON string in the request once it is above a particular size limit.
Is there an issue with json2.js, which (I think) does the parsing?
I also tried json3.min.js and received the same error.
Unless you need to support old browsers like IE 7, you no longer need an antiquated library to parse JSON; it's built in: JSON.parse(jsonString).

JSON feed in UTF-8 without byte order marker

I have a WCF application written in C# that delivers my data in JSON or XML, depending on what the user asks for in the query string. Here is a quick snippet of the code that delivers the data:
Encoding utf8 = new System.Text.UTF8Encoding(false); // false = do not emit a BOM
return WebOperationContext.Current.CreateTextResponse(data, "application/json", utf8);
When I deliver the data using the above method, the special characters are all messed up, so Chávez looks like ChÃ¡vez. On the other hand, if I create the utf8 variable above with the BOM, or use Encoding.UTF8, the special characters work fine. But then some of my consumers complain that their code throws exceptions when consuming my API. This, of course, happens because of the BOM in the feed. Is there a way for me to deliver the special characters correctly without the BOM in the feed?
It looks like the output is correct, but whatever you are using to display it expects ANSI-encoded text. ChÃ¡vez is what you get when you encode Chávez in UTF-8 and interpret the result as if it were Latin-1.
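To make that concrete, the round trip is easy to reproduce in a few lines of Python (illustration only, not WCF code):
name = "Chávez"
utf8_bytes = name.encode("utf-8")     # the bytes the service sends, no BOM required
print(utf8_bytes.decode("latin-1"))   # ChÃ¡vez: the bytes misread as ANSI/Latin-1
print(utf8_bytes.decode("utf-8"))     # Chávez: the same bytes read as UTF-8
In other words, the BOM-less output is already correct; whatever displays it just needs to read it as UTF-8.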

encoding issues between python and mysql

I have a weird encoding problem from my PyQt app to my mysql database.
I mean weird in the sense that it works in one case and not the other ones, even though I seem to be doing the exact same thing for all.
My process is the following:
I have some QFocusOutTextEdit elements in which I write text possibly containing accents and stuff (é,à,è,...)
I get the text written with:
text = self.ui.text_area.toPlainText()
text = text.toUtf8()
Then, to insert it into my database, I do:
text = str(text).decode('unicode_escape').encode('iso8859-1').decode('utf8')
I also set the character set of my database, the specific tables and the specific columns of the table to utf8.
It works for one of my text areas, but for the others it puts weird characters into my DB instead.
Any hint on this is appreciated!
RESOLVED:
Sorry for the disturbance; apparently some fields in my database weren't up to date, and this was somehow blocking the encoding process.
You are doing a lot of encoding, decoding, and re-encoding, which is hard to follow even if you know what all of it means. You should try to simplify this down to just working natively with Unicode strings. In Python 3 that means str (normal strings), and in Python 2 that means unicode (u"this kind of string").
Arrange for your connection to the MySQL database to use Unicode on input and output. If you use something high-level like SQLAlchemy, you probably don't need to do anything. If you use MySQLdb directly, make sure you pass charset="UTF8" (which implies use_unicode) to the connect() method.
Then make sure the value you are getting from PyQt is a unicode value. I don't know PyQt. Check the type of self.ui.text_area or self.ui.text_area.toPlainText(). Hopefully it is already a Unicode string. If yes, you're all set. If not, it's a byte string, probably encoded in UTF-8, so you can decode it with .decode('utf8'), which will give you a Unicode object.
Once your code is dealing with all Unicode objects and no more encoded byte strings, you don't need to do any kind of encoding or decoding anymore. Just pass the strings directly from PyQT to MySQL.
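As a concrete sketch of that advice, using the MySQLdb driver mentioned above (connection details, table, and column names are placeholders, not from the question):
import MySQLdb

# charset="utf8" makes the driver exchange UTF-8 with the server and return
# unicode objects, so no manual encode/decode chain is needed.
conn = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                       db="mydb", charset="utf8")
cur = conn.cursor()

text = u"éàè and other accented text"    # a plain Unicode value, e.g. from the widget
cur.execute("INSERT INTO notes (body) VALUES (%s)", (text,))
conn.commit()

cur.execute("SELECT body FROM notes")
print(cur.fetchone()[0])                 # comes back as Unicode as well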

Saving a canvas drawn image to a mysql database

I was wondering about the best way to tackle this. I'm trying to save a user-drawn image on an HTML5 canvas to my database so I can retrieve it later.
I got as far as creating the base64 data string for the image with the following code, hooked up to a simple button click handler:
var image_data = $("#drawing_canvas").get(0).toDataURL('image/png');
I was planning on saving that data to my database and then retrieving it later on with something like this:
var myImage = new Image();
myImage.src = imgData;
ctx.drawImage(myImage, 0, 0);
However, these base64 'strings' seem to contain a lot of data. Is there a better way to save these images to my database? I have yet to figure out a way to save the actual image as a .png. I could get it to open as a PNG in a new browser tab, but that's where I'm stuck at the moment.
Or would it be fine to store these base64 data strings in my database (in, I suppose, a 'text' column)?
Thanks in advance.
You want to use a BLOB type. Here's what the MySQL docs say about it:
BLOB values are treated as binary strings (byte strings). They have no character set, and sorting and comparison are based on the numeric values of the bytes in column values. TEXT values are treated as nonbinary strings (character strings). They have a character set, and values are sorted and compared based on the collation of the character set.
http://dev.mysql.com/doc/refman/5.0/en/blob.html
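If you do store the image in a BLOB, the data URL first has to be turned back into raw PNG bytes on the server side; here is a rough sketch of that step in Python (the function name and the table/column in the comment are made up for the example):
import base64

def data_url_to_png_bytes(data_url):
    # A canvas data URL looks like "data:image/png;base64,iVBOR..."; only the
    # part after the first comma is the base64-encoded image.
    b64_payload = data_url.split(",", 1)[1]
    return base64.b64decode(b64_payload)

# The returned bytes can then go into a BLOB column with a parameterized query,
# e.g. cur.execute("INSERT INTO drawings (image) VALUES (%s)", (png_bytes,))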

JSON and escaping characters

I have a string which gets serialized to JSON in Javascript, and then deserialized to Java.
It looks like if the string contains a degree symbol, then I get a problem.
I could use some help in figuring out who to blame:
is it the Spidermonkey 1.8 implementation? (this has a JSON implementation built-in)
is it Google gson?
is it me for not doing something properly?
Here's what happens in JSDB:
js>s='15\u00f8C'
15°C
js>JSON.stringify(s)
"15°C"
I would have expected "15\u00f8C", which leads me to believe that Spidermonkey's JSON implementation isn't doing the right thing... except that the JSON homepage's syntax description (is that the spec?) says that a char can be
any-Unicode-character-except-"-or-\-or-control-character
so maybe it passes the string along as-is without encoding it as \u00f8... in which case I would think the problem is with the gson library.
Can anyone help?
I suppose my workaround is to use either a different JSON library, or manually escape strings myself after calling JSON.stringify() -- but if this is a bug then I'd like to file a bug report.
This is not a bug in either implementation. There is no requirement to escape U+00B0. To quote the RFC:
2.5. Strings

The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

Any character may be escaped.
Escaping everything inflates the size of the data: all code points can be represented in four or fewer bytes in every Unicode transformation format, whereas escaping them all makes them six or twelve bytes each.
It is more likely that you have a text transcoding bug somewhere in your code, and escaping everything down to the ASCII subset merely masks the problem. It is a requirement of the JSON spec that all data use a Unicode encoding.
hmm, well here's a workaround anyway:
function JSON_stringify(s, emit_unicode)
{
    var json = JSON.stringify(s);
    return emit_unicode ? json : json.replace(/[\u007f-\uffff]/g,
        function(c) {
            return '\\u' + ('0000' + c.charCodeAt(0).toString(16)).slice(-4);
        }
    );
}
test case:
js>s='15\u00f8C 3\u0111';
15°C 3◄
js>JSON_stringify(s, true)
"15°C 3◄"
js>JSON_stringify(s, false)
"15\u00f8C 3\u0111"
This is SUPER late and probably not relevant anymore, but if anyone stumbles upon this answer, I believe I know the cause.
So the JSON-encoded string is perfectly valid with the degree symbol in it, as the other answer mentions. The problem is most likely in the character encoding you are reading and writing with. Depending on how you are using Gson, you are probably passing it a java.io.Reader instance. Any time you create a Reader from an InputStream, you need to specify the character encoding or a java.nio.charset.Charset instance (it's usually best to use java.nio.charset.StandardCharsets.UTF_8). If you don't specify a Charset, Java will use your platform default encoding, which on Windows is usually CP-1252.
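The pitfall is not specific to Java. Here is a self-contained Python sketch of the same idea (the file name is made up), where opening a file without an explicit encoding plays the role of building an InputStreamReader without a Charset:
import json

# Write a UTF-8 JSON file containing a degree symbol.
with open("data.json", "w", encoding="utf-8") as f:
    json.dump({"temp": "15\u00b0C"}, f, ensure_ascii=False)

# Reading it back without encoding="utf-8" would use the platform default
# (often CP-1252 on Windows), which is exactly the trap described above;
# passing the encoding explicitly is the equivalent of StandardCharsets.UTF_8.
with open("data.json", encoding="utf-8") as f:
    print(json.load(f)["temp"])          # 15°C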