check input for UTF-8, count characters, use regular expressions - mysql

I want to write a C program that reads some strings from input and saves them in a MySQL database.
For security I would like to check whether the input is valid UTF-8, count the number of characters, and also use some regular expressions to validate the input.
Is there a library that offers me that functionality?
I thought about using wide characters, but as far as I understand, whether they support UTF-8 depends on the implementation and is not defined by any standard.
And I would still be missing the regular expressions.

PCRE supports UTF-8. To validate the string before any processing, the W3C suggests a regular expression, which I re-implemented in plain C; but PCRE already checks for valid UTF-8 automatically, in accordance with RFC 3629.
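For illustration, here is a minimal sketch of that check using the PCRE2 API (the pattern, the function name, and the error handling here are my own illustrative choices, not from the answer above): compiling with PCRE2_UTF makes pcre2_match() reject subjects that are not well-formed UTF-8 before any matching takes place.

#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <string.h>

/* Sketch: compile an example pattern with PCRE2_UTF so that both the
 * pattern and the subject are verified to be valid UTF-8 (per RFC 3629)
 * before matching. The pattern itself is only a placeholder. */
int validate_utf8_input(const char *subject)
{
    int errcode;
    PCRE2_SIZE erroffset;
    pcre2_code *re = pcre2_compile(
        (PCRE2_SPTR)"^[\\p{L}\\p{N} ]{1,64}$",   /* example pattern */
        PCRE2_ZERO_TERMINATED, PCRE2_UTF | PCRE2_UCP,
        &errcode, &erroffset, NULL);
    if (re == NULL)
        return -1;

    pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
    /* Returns a negative PCRE2_ERROR_UTF8_ERR* value for malformed
     * UTF-8, PCRE2_ERROR_NOMATCH if a valid subject fails the pattern,
     * and a positive count on success. */
    int rc = pcre2_match(re, (PCRE2_SPTR)subject, strlen(subject),
                         0, 0, md, NULL);

    pcre2_match_data_free(md);
    pcre2_code_free(re);
    return rc;
}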

Related

Why does JSON encode UTF-16 surrogate pairs instead of Unicode code points directly?

To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
ECMA-404: The JSON Data Interchange Format
I believe that there is no need to encode this character at all, so it could be represented directly as "𝄞". However, should one wish to encode it, it must, per spec, be encoded as "\uD834\uDD1E", not (as would seem reasonable) as "\u1d11e". Why is this?
One of the key architectural features of JSON is that JSON-encoded objects are valid JavaScript literals that can be evaluated using the eval function, for example. Unfortunately, older JavaScript implementations only support 16-bit Unicode escape sequences with four hex characters in string literals, so for code points above 0xFFFF there is no portable alternative to UTF-16 surrogate pairs in escape sequences. (The \u{...} syntax that allows arbitrary code points was only introduced in ECMAScript 6.)
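The surrogate-pair arithmetic itself is fixed by the Unicode standard. As a small illustration (the function name is mine), this C sketch derives \uD834\uDD1E from U+1D11E:

#include <stdio.h>
#include <stdint.h>

/* Sketch: encode a supplementary code point (above U+FFFF) as the
 * twelve-character \uXXXX\uXXXX escape used by JSON. */
void print_surrogate_escape(uint32_t cp)
{
    uint32_t v  = cp - 0x10000;          /* reduce to a 20-bit value */
    uint32_t hi = 0xD800 + (v >> 10);    /* high (lead) surrogate    */
    uint32_t lo = 0xDC00 + (v & 0x3FF);  /* low (trail) surrogate    */
    printf("\\u%04X\\u%04X\n", hi, lo);  /* U+1D11E -> \uD834\uDD1E  */
}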
But as you mentioned, there's no need to use escape sequences if your application supports Unicode JSON text. Simply encode the characters directly in the respective Unicode format.

MySQL: Is it safe to lowercase or uppercase regular expression?

I use regular expressions in MySQL on multibyte-encoded (utf-8) data, but I need them to match case-insensitively. As MySQL has a long-unresolved bug that keeps it from properly matching multibyte-encoded strings case-insensitively, I am trying to simulate the insensitivity by lowercasing both the value and the regexp pattern. Is it safe to lowercase a regexp pattern this way? I mean, are there any edge cases I have forgotten?
Could the following cause any problems?
LOWER('šárKA') REGEXP LOWER('^Šárka$')
Update: I edited the question to be more concrete.
MySQL documentation:
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
The bug was filed in 2007 and remains unresolved. However, I can't just change the database to solve this issue. I need MySQL to somehow consider 'Š' equal to 'š', even if that means hacking it with a not-so-elegant solution. Characters other than accented (multi-byte) ones match fine and without issues.
The i option for the regex will make sure it matches case-insensitively.
Example:
'^(?i)Foo$' // (?i) turns on case insensitivity for the rest of the regex
'/^Foo$/i' // the i option turns off case sensitivity
Note that these may not work in your particular flavour of regex (which you haven't specified), so make sure you consult your manual for the correct syntax.
Update:
From here: http://dev.mysql.com/doc/refman/5.1/en/regexp.html
REGEXP is not case sensitive, except when used with binary strings.
As no one actually answered my original question, I did my own research and realized that it is not safe to lowercase or uppercase a regular expression without further processing. To be precise, it is safe with theoretically pure regular expressions, but every sane implementation adds character classes and special directives that can be vulnerable to case changing:
Escape sequences like \n, \t, etc.
Character classes like \W (non-alphanumeric) and \w (alphanumeric).
Character classes like [.characters.], [=character_class=], or [:character_class:] (MySQL regular expressions dialect).
Lowercasing or uppercasing \W and \w could completely change a regular expression's meaning. This leads to the following conclusion:
The presented solution is a no-go as such.
The presented solution is possible, but the regular expression must be lowercased in a more sophisticated way than just by using LOWER or something similar. It has to be parsed, and the case has to be changed carefully.
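As a rough sketch of what "parsed and changed carefully" could mean (my illustration, not the poster's code; it handles only ASCII and conservatively leaves bracket expressions untouched):

#include <ctype.h>
#include <stddef.h>

/* Sketch: lowercase a regex pattern while leaving backslash escapes
 * (\W, \n, ...) and bracket expressions such as [[:alpha:]] alone.
 * Multibyte characters like 'Š' would additionally need real Unicode
 * case mapping, which plain tolower() cannot provide. */
void regex_tolower(char *p)
{
    int in_bracket = 0;
    for (size_t i = 0; p[i] != '\0'; i++) {
        if (p[i] == '\\' && p[i + 1] != '\0')
            i++;                  /* skip escaped char: \W stays \W */
        else if (!in_bracket && p[i] == '[')
            in_bracket = 1;       /* entering a bracket expression  */
        else if (in_bracket && p[i] == ']')
            in_bracket = 0;
        else if (!in_bracket)
            p[i] = (char)tolower((unsigned char)p[i]);
    }
}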

How to set locale properly in AS3

What is the proper way of setting the locale in ActionScript, so that functions like String.toLocaleUpperCase() and String.toLocaleLowerCase() work as expected?
Here's an interesting line from the documentation:
While this method is intended to handle the conversion in a locale-specific way, the ActionScript 3.0 implementation does not produce a different result from the toUpperCase() method.
Following this information, here is what the documentation for .toUpperCase() has to say:
This method converts all characters (not simply a-z) for which Unicode uppercase equivalents exist.
These case mappings are defined in the Unicode Character Database specification.
In summary, I don't think there is actually a way to set a locale.

JSON and escaping characters

I have a string which gets serialized to JSON in Javascript, and then deserialized to Java.
It looks like if the string contains a degree symbol, then I get a problem.
I could use some help in figuring out who to blame:
is it the Spidermonkey 1.8 implementation? (this has a JSON implementation built-in)
is it Google gson?
is it me for not doing something properly?
Here's what happens in JSDB:
js>s='15\u00b0C'
15°C
js>JSON.stringify(s)
"15°C"
I would have expected "15\u00b0C", which leads me to believe that Spidermonkey's JSON implementation isn't doing the right thing... except that the JSON homepage's syntax description (is that the spec?) says that a char can be
any-Unicode-character-except-"-or-\-or-control-character
so maybe it passes the string along as-is without encoding it as \u00b0... in which case I would think the problem is with the gson library.
Can anyone help?
I suppose my workaround is to use either a different JSON library, or manually escape strings myself after calling JSON.stringify() -- but if this is a bug then I'd like to file a bug report.
This is not a bug in either implementation. There is no requirement to escape U+00B0. To quote the RFC:
2.5. Strings
The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
Any character may be escaped.
Escaping everything inflates the size of the data: every code point can be represented in four or fewer bytes in all Unicode transformation formats, whereas escaping them all takes six or twelve bytes each.
It is more likely that you have a text transcoding bug somewhere in your code and escaping everything in the ASCII subset masks the problem. It is a requirement of the JSON spec that all data use a Unicode encoding.
hmm, well here's a workaround anyway:
function JSON_stringify(s, emit_unicode)
{
    var json = JSON.stringify(s);
    return emit_unicode ? json : json.replace(/[\u007f-\uffff]/g,
        function(c) {
            // zero-pad each non-ASCII code unit to a four-digit \uXXXX escape
            return '\\u' + ('0000' + c.charCodeAt(0).toString(16)).slice(-4);
        }
    );
}
test case:
js>s='15\u00b0C 3\u0111';
15°C 3đ
js>JSON_stringify(s, true)
"15°C 3đ"
js>JSON_stringify(s, false)
"15\u00b0C 3\u0111"
This is SUPER late and probably not relevant anymore, but if anyone stumbles upon this answer, I believe I know the cause.
So the JSON encoded string is perfectly valid with the degree symbol in it, as the other answer mentions. The problem is most likely in the character encoding that you are reading/writing with. Depending on how you are using Gson, you are probably passing it a java.io.Reader instance. Any time you are creating a Reader from an InputStream, you need to specify the character encoding, or java.nio.charset.Charset instance (it's usually best to use java.nio.charset.StandardCharsets.UTF_8). If you don't specify a Charset, Java will use your platform default encoding, which on Windows is usually CP-1252.

Which are the valid control characters in HTML/XHTML forms

I'm trying to create a form validation unit that, in addition to "regular" tests, checks the encoding as well.
According to this article http://www.w3.org/International/questions/qa-forms-utf-8 the allowed characters in the range 0-31 are CR, LF and TAB; DEL (127) is not allowed.
On the other hand, there are control characters in the range 0x80-0xA0. In different sources I have seen both that they are allowed and that they are not. I have also seen that this differs between XHTML, HTML and XML.
Some articles say that FF is allowed as well.
Can someone provide a good answer, with sources, on what is allowed and what isn't?
EDIT: Even there, http://www.w3.org/International/questions/qa-controls, there is some ambiguity. It says:
The C1 range is supported
But the table there shows these characters as illegal, while the previously shown UTF-8 validation allows them?
I think you're looking at this the wrong way around. The resources you link specify what encoded values are valid in (X)HTML, but it sounds like you want to validate the "response" from a web form, i.e. the values of the various form controls as passed back to your server. In that case, you shouldn't be looking at what's valid in (X)HTML, but at what's valid in the application/x-www-form-urlencoded, and possibly also multipart/form-data, MIME types. The HTML 4.01 standard for <FORM> elements clearly states that for application/x-www-form-urlencoded, "Non-alphanumeric characters are replaced by '%HH'":
This is the default content type. Forms submitted with this content type must be encoded as follows:
Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A').
The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.
As for what character encoding is contained, (i.e. whether %A0 is a non-breaking space or an error), that's negotiated by the accept-charset attribute on your <FORM> element and the response's (well, really a GET or POST request) Content-Type header.
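To make the quoted rules concrete, here is a hedged C sketch of the encoding side (function name mine; real applications should use an existing URL-encoding routine): alphanumerics pass through, spaces become '+', and every other byte becomes %HH, so a CR LF pair naturally comes out as %0D%0A.

#include <ctype.h>
#include <stdio.h>

/* Sketch of application/x-www-form-urlencoded encoding as described
 * in the HTML 4.01 quote above. */
void form_urlencode(const unsigned char *s, FILE *out)
{
    for (; *s != '\0'; s++) {
        if (isalnum(*s))
            fputc(*s, out);             /* alphanumerics unchanged */
        else if (*s == ' ')
            fputc('+', out);            /* space becomes '+'       */
        else
            fprintf(out, "%%%02X", *s); /* e.g. '\n' becomes %0A   */
    }
}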
Postel's Law: Be conservative in what you do; be liberal in what you accept from others.
If you're generating documents for others to read, you should avoid/escape all control characters, even if they're technically legal. And if you're parsing documents, you should endeavor to accept all control characters even if they're technically illegal.
The Unicode characters in these ranges are valid in HTML 4.01:
0x09..0x0A
0x0D
0x20..0x7E
0x00A0..0xD7FF
0xE000..0x10FFFF
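Expressed as code, the ranges above collapse to a one-line predicate (a sketch, assuming the input has already been decoded to Unicode code points):

#include <stdbool.h>
#include <stdint.h>

/* Sketch: is a decoded Unicode code point valid character data in
 * HTML 4.01, per the ranges listed above? */
bool is_valid_html401_char(uint32_t cp)
{
    return cp == 0x09 || cp == 0x0A || cp == 0x0D
        || (cp >= 0x20   && cp <= 0x7E)
        || (cp >= 0xA0   && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0x10FFFF);
}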
In XHTML 1.0... it's unclear. See http://cmsmcq.com/2007/C1.xml#o127626258
First of all, any octet is valid. The regular expression for UTF-8 sequences mentioned there simply omits some of them because they are rather uncommon for a user to enter in practice. But that doesn't mean they are invalid; they are just not expected to occur.
The first link you mention does not have anything to do with validating the characters allowed in XHTML; the example on that page simply shows a common/generic pattern for detecting whether raw data is UTF-8 encoded.
This is a quote from the second link:
HTML, XHTML and XML 1.0 do not support the C0 range, except for HT (Horizontal Tabulation) U+0009, LF (Line Feed) U+000A, and CR (Carriage Return) U+000D. The C1 range is supported, i.e. you can encode the controls directly or represent them as NCRs (Numeric Character References).
The way I read this is:
Any control character in the C1 range is supported, if you include it directly in the document's character encoding or represent it as an NCR (Numeric Character Reference).
Only U+0009, U+000A, and U+000D are supported in the C0 range. No other control code in that range can be represented.
If the document is known to be XHTML, then you should just load it and validate it against the schema.
What programming language are you using? At least for Java there are libraries to check the encoding of a string (or byte array). I guess similar libraries exist for other languages too.
Do I understand your question correctly: you want to check whether the data submitted by a form is valid, and properly encoded?
If so, why do several things at once? It would be a lot easier to separate those checks, and perform them step by step, IMHO.
You want to check that the submitted form data is correctly encoded (in UTF-8, I gather). As Archchancellor Ridcully says, that's easy to check in most languages.
Then, if the encoding is correct, you can check whether it's valid form data.
Then, if the form data is valid, you can check whether the data contains what you expect.
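A minimal sketch of that staged approach in C (my illustration; the scanner below is a compact RFC 3629 well-formedness check that rejects overlong forms, surrogates, and values above U+10FFFF, and would run as step one before any form- or content-level checks):

#include <stdbool.h>
#include <stddef.h>

/* Step 1: check that the raw bytes are well-formed UTF-8 before doing
 * any form-level or content-level validation. */
static bool is_utf8(const unsigned char *s, size_t n)
{
    for (size_t i = 0; i < n; ) {
        if (s[i] < 0x80) { i++; continue; }      /* ASCII fast path */
        size_t len; unsigned long min;
        if ((s[i] & 0xE0) == 0xC0)      { len = 2; min = 0x80; }
        else if ((s[i] & 0xF0) == 0xE0) { len = 3; min = 0x800; }
        else if ((s[i] & 0xF8) == 0xF0) { len = 4; min = 0x10000; }
        else return false;                       /* bad lead byte   */
        if (i + len > n) return false;           /* truncated       */
        unsigned long cp = s[i] & (0x7F >> len);
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min) return false;              /* overlong form   */
        if (cp > 0x10FFFF) return false;         /* out of range    */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; /* surrogate */
        i += len;
    }
    return true;
}
/* Steps 2 and 3 - length limits, regex checks, application rules -
 * would follow only once this encoding check has passed. */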