How to set locale properly in AS3

What is the proper way of setting the locale in ActionScript, so that functions like String.toLocaleUpperCase() and String.toLocaleLowerCase() work as expected?

Here's an interesting line from the documentation:
While this method is intended to handle the conversion in a locale-specific way, the ActionScript 3.0 implementation does not produce a different result from the toUpperCase() method.
With that in mind, here is what the documentation for toUpperCase() has to say:
This method converts all characters (not simply a-z) for which Unicode uppercase equivalents exist.
These case mappings are defined in the Unicode Character Database specification.
In summary, I don't think there is actually a way to set a locale.
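A quick sanity check is the Turkish dotted/dotless i, the classic locale-sensitive case mapping: a locale-aware implementation under Turkish rules would lowercase "I" to "ı" (U+0131), but AS3 just applies the plain Unicode mapping, so the two methods agree:

var s:String = "I";
trace(s.toLocaleLowerCase()); // "i" - plain Unicode mapping, not Turkish "ı"
trace(s.toLowerCase());       // "i" - identical result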

Related

Strict JSON parsing in Go

The encoding/json package exposes a forgiving parser: every absent property is simply set to its zero value. Is there a better way to make a field required than using bulky switch statements that check every field for its default value? A related problem is that not all zero values are nil. Is there another way to distinguish between an unset field and, e.g., an explicit 0, other than using pointers so one can check against nil?
You may look at what is available for implementing so-called "JSON schema validation". You may start with this search, which yields github.com/juju/gojsonschema among others; while I have no idea about its quality, it's used as part of Ubuntu's Juju cloud orchestration solution, so I'd expect it to be battle-tested. Still, caveat emptor.
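As for the pointer approach the question mentions, it is less bulky than it sounds when the validation is collected in one place. A minimal sketch (the Config type and field names are made up for illustration):

package main

import (
    "encoding/json"
    "errors"
    "fmt"
)

// Pointer fields distinguish "absent" from the zero value: a missing key
// leaves the pointer nil, while an explicit 0 or "" yields a non-nil
// pointer to the zero value.
type Config struct {
    Name *string `json:"name"`
    Port *int    `json:"port"`
}

func parseConfig(data []byte) (*Config, error) {
    var c Config
    if err := json.Unmarshal(data, &c); err != nil {
        return nil, err
    }
    if c.Name == nil {
        return nil, errors.New("missing required field: name")
    }
    if c.Port == nil {
        return nil, errors.New("missing required field: port")
    }
    return &c, nil
}

func main() {
    // Port present but zero: passes, because the pointer is non-nil.
    if _, err := parseConfig([]byte(`{"name": "srv", "port": 0}`)); err != nil {
        fmt.Println("unexpected:", err)
    }
    // Port absent entirely: caught as a missing required field.
    if _, err := parseConfig([]byte(`{"name": "srv"}`)); err != nil {
        fmt.Println(err)
    }
}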

encoding issues between python and mysql

I have a weird encoding problem from my PyQt app to my mysql database.
I mean weird in the sense that it works in one case and not the other ones, even though I seem to be doing the exact same thing for all.
My process is the following:
I have some QFocusOutTextEdit elements in which I write text possibly containing accents and stuff (é,à,è,...)
I get the written text with:
text = self.ui.text_area.toPlainText()
text = text.toUtf8()
Then, to insert it into my database, I do:
text = str(text).decode('unicode_escape').encode('iso8859-1').decode('utf8')
I also set the character set of my database, the specific tables and the specific columns of the table to utf8.
It is working for one of my text areas, but for the others it puts weird characters in my db instead.
Any hint on this is appreciated!
RESOLVED:
Sorry for the disturbance; apparently some fields in my database weren't up to date, and that was somehow blocking the encoding.
You are doing a lot of encoding, decoding, and re-encoding, which is hard to follow even if you know what all of it means. You should try to simplify this down to just working natively with Unicode strings. In Python 3 that means str (normal strings), and in Python 2 that means unicode (u"this kind of string").
Arrange for your connection to the MySQL database to use Unicode on input and output. If you use something high-level like SQLAlchemy, you probably don't need to do anything. If you use MySQLdb directly, make sure you pass charset="utf8" (which implies use_unicode) to the connect() method.
Then make sure the value you are getting from PyQt is a unicode value. I don't know PyQt; check the type of self.ui.text_area.toPlainText(). Hopefully it is already a Unicode string. If yes, you're all set. If no, it's a byte string, probably encoded in UTF-8, so you can decode it with text.decode('utf8'), which will give you a Unicode object.
Once your code is dealing with all Unicode objects and no more encoded byte strings, you don't need to do any kind of encoding or decoding anymore. Just pass the strings directly from PyQt to MySQL.
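Putting that together, a minimal sketch in Python 2 (the host, credentials, and the notes table are made up for illustration):

# -*- coding: utf-8 -*-
import MySQLdb

# charset='utf8' implies use_unicode=True: parameters may be passed as
# unicode objects and results come back as unicode, so no manual
# encode/decode step is needed anywhere.
conn = MySQLdb.connect(host='localhost', user='me', passwd='secret',
                       db='mydb', charset='utf8')
cur = conn.cursor()

text = u'héllo à tous'  # a real unicode object, e.g. from toPlainText()
cur.execute('INSERT INTO notes (body) VALUES (%s)', (text,))
conn.commit()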

JSON and escaping characters

I have a string which gets serialized to JSON in Javascript, and then deserialized to Java.
It looks like if the string contains a non-ASCII character (here 'ø'), then I get a problem.
I could use some help in figuring out who to blame:
is it the Spidermonkey 1.8 implementation? (this has a JSON implementation built-in)
is it Google gson?
is it me for not doing something properly?
Here's what happens in JSDB:
js>s='15\u00f8C'
15øC
js>JSON.stringify(s)
"15øC"
I would have expected "15\u00f8C", which leads me to believe that Spidermonkey's JSON implementation isn't doing the right thing... except that the JSON homepage's syntax description (is that the spec?) says that a char can be:
any-Unicode-character-except-"-or-\-or-control-character
so maybe it passes the string along as-is without encoding it as \u00f8... in which case I would think the problem is with the gson library.
Can anyone help?
I suppose my workaround is to use either a different JSON library, or manually escape strings myself after calling JSON.stringify() -- but if this is a bug then I'd like to file a bug report.
This is not a bug in either implementation. There is no requirement to escape U+00F8. To quote the RFC:
2.5. Strings
The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
Any character may be escaped.
Escaping everything inflates the size of the data: every code point can be represented in four or fewer bytes in every Unicode transformation format, whereas \u-escaping takes six bytes per BMP code point and twelve for a surrogate pair.
It is more likely that you have a text transcoding bug somewhere in your code and escaping everything in the ASCII subset masks the problem. It is a requirement of the JSON spec that all data use a Unicode encoding.
hmm, well here's a workaround anyway:
function JSON_stringify(s, emit_unicode)
{
    var json = JSON.stringify(s);
    // When emit_unicode is false, \u-escape everything outside printable
    // ASCII so the output survives any 8-bit transport or transcoding.
    return emit_unicode ? json : json.replace(/[\u007f-\uffff]/g,
        function(c) {
            return '\\u' + ('0000' + c.charCodeAt(0).toString(16)).slice(-4);
        }
    );
}
test case:
js>s='15\u00f8C 3\u0111';
15øC 3đ
js>JSON_stringify(s, true)
"15°C 3◄"
js>JSON_stringify(s, false)
"15\u00f8C 3\u0111"
This is SUPER late and probably not relevant anymore, but if anyone stumbles upon this answer, I believe I know the cause.
So the JSON-encoded string is perfectly valid with the non-ASCII character in it, as the other answer mentions. The problem is most likely in the character encoding that you are reading/writing with. Depending on how you are using Gson, you are probably passing it a java.io.Reader instance. Any time you create a Reader from an InputStream, you need to specify the character encoding, as a name or a java.nio.charset.Charset instance (it's usually best to use java.nio.charset.StandardCharsets.UTF_8). If you don't specify a Charset, Java will use your platform default encoding, which on Windows is usually CP-1252.
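A minimal sketch of the fix (the payload.json resource and the Map target type are just for illustration):

import com.google.gson.Gson;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class GsonUtf8 {
    public static void main(String[] args) throws Exception {
        // Placeholder source; in practice this is your file or socket stream.
        InputStream in = GsonUtf8.class.getResourceAsStream("/payload.json");

        // Naming the charset is the crucial step: new InputStreamReader(in)
        // alone falls back to the platform default (often CP-1252 on Windows).
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            Map<?, ?> data = new Gson().fromJson(reader, Map.class);
            System.out.println(data);
        }
    }
}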

check input for UTF-8, count characters, use regular expressions

I want to write a C program that gets some strings from input. I want to save them in a MySQL database.
For security, I would like to check whether the input is a valid UTF-8 string, count the number of characters, and also use some regular expressions to validate the input.
Is there a library that offers me that functionality?
I thought about using wide characters, but as far as I understand, whether they support UTF-8 depends on the implementation and is not defined by a standard.
And I would still be missing the regular expressions.
PCRE supports UTF-8. To validate the string before any processing, the W3C suggests this expression, which I re-implemented in plain C; but PCRE already checks for valid UTF-8 automatically, in accordance with RFC 3629.
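A minimal sketch with the classic libpcre API (the pattern, letters/digits/underscores up to 64 characters, is just an example; link with -lpcre):

#include <pcre.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *error;
    int erroffset;

    /* PCRE_UTF8 turns on UTF-8 mode; PCRE_UCP makes \w match all
     * Unicode letters and digits rather than ASCII only. */
    pcre *re = pcre_compile("^\\w{1,64}$", PCRE_UTF8 | PCRE_UCP,
                            &error, &erroffset, NULL);
    if (re == NULL) {
        fprintf(stderr, "compile failed at %d: %s\n", erroffset, error);
        return 1;
    }

    const char *input = "caf\xc3\xa9";    /* "café" encoded as UTF-8 */
    int ovector[30];
    int rc = pcre_exec(re, NULL, input, (int)strlen(input),
                       0, 0, ovector, 30);

    /* In UTF-8 mode pcre_exec() first validates the subject, so a
     * malformed byte sequence is reported instead of being matched. */
    if (rc >= 0)
        printf("valid UTF-8, input accepted\n");
    else if (rc == PCRE_ERROR_BADUTF8)
        printf("rejected: not valid UTF-8\n");
    else
        printf("valid UTF-8, but does not match the pattern\n");

    pcre_free(re);
    return 0;
}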

convert html entities to unicode(utf-8) strings in c? [duplicate]

Possible Duplicate:
How to decode HTML Entities in C?
This question is very similar to that one, but I need to do the same thing in C, not Python. Here are some examples of what the function should do:
input      output
&lt;       <
&gt;       >
&auml;     ä
&szlig;    ß
The function should have the signature char *html2str(char *html) or similar. I'm not reading byte by byte from a stream.
Is there a library function I can use?
There isn't a standard library function to do the job. There must be a large number of implementations available in the Open Source world - just about any program that has to deal with HTML will have one.
There are two aspects to the problem:
Finding the HTML entities in the source string.
Inserting the appropriate replacement text in its place.
Since every entity is at least four characters long ('&', at least two characters of name, ';') and the longest possible UTF-8 character representation is 4 bytes, the replacement text is never longer than the entity it replaces. Hence, it is possible to edit in situ safely.
There's an illustration of HTML entity decoding in 'The Practice of Programming' by Kernighan and Pike, though it is done somewhat 'in passing'. They use a tokenizer to recognize the entity, plus a sorted table of entity names and replacement values, so that they can use a binary search to identify the replacements. This is only needed for the non-algorithmic entity names. For numeric entities such as '&#xDF;', you use an algorithmic technique to decode them.
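A compact sketch of that approach (the entity table here is deliberately tiny; a real one has a couple of hundred entries):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct entity { const char *name; unsigned cp; };

/* Must stay sorted by name for bsearch(). */
static const struct entity entities[] = {
    { "amp", '&' }, { "auml", 0xE4 }, { "gt", '>' },
    { "lt", '<' }, { "szlig", 0xDF },
};

static int cmp(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const struct entity *)elem)->name);
}

/* Append code point cp to out as UTF-8; returns the number of bytes. */
static int put_utf8(char *out, unsigned cp)
{
    if (cp < 0x80) { out[0] = (char)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Decodes in place: the result is never longer than the input. */
char *html2str(char *html)
{
    char *r = html, *w = html;
    while (*r) {
        char *semi;
        if (*r == '&' && (semi = strchr(r, ';')) != NULL && semi > r + 1) {
            unsigned cp = 0;
            if (r[1] == '#') {      /* numeric entity, decimal or hex */
                int hex = (r[2] == 'x' || r[2] == 'X');
                cp = (unsigned)strtoul(r + 2 + hex, NULL, hex ? 16 : 10);
                if (cp > 0x10FFFF) cp = 0;      /* not a code point */
            } else {                /* named entity via binary search */
                char name[16] = { 0 };
                size_t n = (size_t)(semi - r) - 1;
                if (n < sizeof name) {
                    memcpy(name, r + 1, n);
                    const struct entity *e = bsearch(name, entities,
                        sizeof entities / sizeof *entities,
                        sizeof *entities, cmp);
                    if (e) cp = e->cp;
                }
            }
            if (cp) { w += put_utf8(w, cp); r = semi + 1; continue; }
        }
        *w++ = *r++;    /* not an entity we know: copy verbatim */
    }
    *w = '\0';
    return html;
}

int main(void)
{
    char buf[] = "&lt;p&gt; &szlig; &#228; &#xE4;";
    puts(html2str(buf));    /* prints: <p> ß ä ä */
}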
This sounds like a job for flex. Granted, flex is usually stream-based, but you can change that using the flex function yy_scan_string (or its relatives). For details, see The flex Manual: Scanning Strings.
Flex's built-in Unicode support is pretty weak, but if you don't mind hand-coding the byte patterns, it could be a workaround; see the sketch below. There are probably other tools that can do what you want as well.
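For what it's worth, a minimal flex sketch of the yy_scan_string approach (only three named entities handled, for brevity; build with flex entities.l && cc lex.yy.c):

/* entities.l */
%option noyywrap
%%
"&lt;"     putchar('<');
"&gt;"     putchar('>');
"&amp;"    putchar('&');
.|\n       ECHO;
%%
int main(void)
{
    /* yy_scan_string() makes the scanner read from an in-memory
     * buffer instead of a stream (yyin). */
    YY_BUFFER_STATE buf = yy_scan_string("a &lt;b&gt; c");
    yylex();
    yy_delete_buffer(buf);
    return 0;
}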