Which are the valid control characters in HTML/XHTML forms

I'm trying to create a form validation unit that, in addition to the "regular" tests, checks
encoding as well.
According to this article http://www.w3.org/International/questions/qa-forms-utf-8 the only
allowed characters in the range 0-31 are CR, LF and TAB, and DEL (127) is not allowed.
On the other hand, there are control characters in the range 0x80-0xA0. Some sources
say they are allowed and others say they are not. I have also seen that this differs
between XHTML, HTML and XML.
Some articles say that FF is allowed as well?
Can someone provide a good answer, with sources, on what is allowed and what isn't?
EDIT: Even http://www.w3.org/International/questions/qa-controls is somewhat ambiguous. It says:
The C1 range is supported
But the table there shows that they are illegal, while the UTF-8 validation shown earlier allows them?

I think you're looking at this the wrong way around. The resources you link specify what encoded values are valid in (X)HTML, but it sounds like you want to validate the "response" from a web form, that is, the values of the various form controls as passed back to your server. In that case, you shouldn't be looking at what's valid in (X)HTML, but at what's valid in the application/x-www-form-urlencoded, and possibly also multipart/form-data, MIME types. The HTML 4.01 standard's section on <FORM> elements clearly states that for application/x-www-form-urlencoded, "Non-alphanumeric characters are replaced by '%HH'":
This is the default content type. Forms submitted with this content type must be encoded as follows:
Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A').
The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.
As for which character encoding is used (i.e. whether %A0 is a non-breaking space or an error), that's negotiated by the accept-charset attribute on your <FORM> element and the Content-Type header of the response (well, really a GET or POST request).
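To make the encoding rules quoted above concrete, here is a minimal Python sketch. It assumes the server receives the raw urlencoded body as a string and that the form was submitted as UTF-8; the field names and values are made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode

# Hypothetical request body for two form controls, as a browser might send it.
body = "name=Ada+Lovelace&note=line+one%0D%0Aline+two%09tabbed"

# parse_qsl splits on '&' and '=', turns '+' into a space and decodes %HH escapes.
# errors="strict" makes byte sequences that are not valid UTF-8 raise an error
# instead of being silently replaced.
fields = dict(parse_qsl(body, keep_blank_values=True,
                        encoding="utf-8", errors="strict"))

# Encoding the decoded values again yields the same escaped form.
round_trip = urlencode(fields)

print(repr(fields["note"]))
print(round_trip == body)
```

With errors="strict", a value such as %A0 on its own is rejected with a UnicodeDecodeError, because a lone 0xA0 byte is not valid UTF-8. Under a different negotiated charset the same escape could be a legitimate character.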

Postel's Law: Be conservative in what you do; be liberal in what you accept from others.
If you're generating documents for others to read, you should avoid/escape all control characters, even if they're technically legal. And if you're parsing documents, you should endeavor to accept all control characters even if they're technically illegal.

The Unicode characters in these ranges are valid in HTML 4.01:
0x09..0x0A
0x0D
0x20..0x7E
0x00A0..0xD7FF
0xE000..0x10FFFF
In XHTML 1.0... it's unclear. See http://cmsmcq.com/2007/C1.xml#o127626258
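The HTML 4.01 ranges listed above translate directly into a code point check. This is only a sketch in Python; the function names are made up, and it says nothing about the unclear XHTML 1.0 case.

```python
# Inclusive code point ranges, copied from the list above.
HTML401_RANGES = [
    (0x09, 0x0A),
    (0x0D, 0x0D),
    (0x20, 0x7E),
    (0x00A0, 0xD7FF),
    (0xE000, 0x10FFFF),
]


def is_allowed(ch):
    """Return True if the character falls inside one of the listed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HTML401_RANGES)


def first_disallowed(text):
    """Return (index, character) of the first character outside the ranges, or None."""
    for index, ch in enumerate(text):
        if not is_allowed(ch):
            return index, ch
    return None


print(first_disallowed("tab\tand newline\r\n"))  # nothing outside the ranges
print(first_disallowed("form\x0cfeed"))          # FF (0x0C) is outside the ranges
```

By these ranges FF (0x0C), DEL (0x7F) and the whole 0x80..0x9F block are rejected, while U+00A0 is accepted.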

First of all any octet is valid. The mentioned regular expression for UTF-8 sequences just omits some of them as they are rather uncommon in practice to be entered by a user. But that doesn’t mean that they are invalid. They are just not expected to occur.

The first link you mention does not have anything to do with validating the allowed characters in XHTML. The example on that link simply shows a common, generic pattern for detecting whether raw data is UTF-8 encoded.
This is a quote from the second link:
HTML, XHTML and XML 1.0 do not support
the C0 range, except for HT
(Horizontal Tabulation) U+0009, LF
(Line Feed) U+000A, and CR (Carriage
Return) U+000D. The C1 range is
supported, i.e. you can encode the
controls directly or represent them as
NCRs (Numeric Character References).
The way I read this is:
Any control character in the C1 range is supported if you encode it (using base64 or hex representations) or represent it as an NCR.
Only U+0009, U+000A, and U+000D are supported in the C0 range. No other control code in that range can be represented.

If the document is known to be XHTML, then you should just load it and validate it against the schema.

What programming language do you use? At least for Java there exist libraries to check the encoding of a string (or byte-array). I guess similar libraries would exist for other languages too.

Do I understand your question correctly: you want to check whether the data submitted by a form is valid, and properly encoded?
If so, why do several things at once? It would be a lot easier to separate those checks, and perform them step by step, IMHO.
You want to check that the submitted form data is correctly encoded (in UTF-8, I gather). As Archchancellor Ridcully says, that's easy to check in most languages.
Then, if the encoding is correct, you can check whether it's valid form data.
Then, if the form data is valid, you can check whether the data contains what you expect.
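The three steps could be kept as separate functions, for example like this in Python. The policy in the second step (reject every control character except TAB, LF and CR, including DEL and the C1 range) and the length limit in the third step are assumptions made for the sketch, not something the standards dictate.

```python
import unicodedata

ALLOWED_CONTROLS = {"\t", "\n", "\r"}


class FormValueError(ValueError):
    """Raised when a submitted value fails one of the checks."""


def check_encoding(raw):
    """Step 1: the bytes must decode as strict UTF-8."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise FormValueError(f"not valid UTF-8: {exc}") from exc


def check_form_data(text):
    """Step 2: no control characters other than TAB, LF and CR (assumed policy)."""
    for index, ch in enumerate(text):
        # Category "Cc" covers C0 controls, DEL and the C1 range.
        if unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS:
            raise FormValueError(f"control character U+{ord(ch):04X} at index {index}")
    return text


def check_content(text, max_length=200):
    """Step 3: application-specific expectations (hypothetical rules)."""
    if not text.strip():
        raise FormValueError("value is empty")
    if len(text) > max_length:
        raise FormValueError("value is too long")
    return text


def validate(raw):
    """Run the checks in order; the first failure stops the chain."""
    return check_content(check_form_data(check_encoding(raw)))
```

Because each step raises its own error, the caller can tell an encoding problem apart from a content problem.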

Related

What does "URL-safe" mean?

The definition for JSON Web Tokens (JWT, see RFC 7519) says that it is a "URL-safe means of representing claims to be transferred between two parties".
I'm wondering, what does it mean if something is URL-safe?
As far as I know, JWT are not passed around as part of the URL. Is it just that, or is there more to it?
Later in the RFC it says:
A JWT is represented as a sequence of URL-safe parts separated by
period ('.') characters. Each part contains a base64url-encoded
value.
This, combined with the RFC not specifying some other meaning explicitly, suggests it means simply "safe to put in a URL" (e.g., doesn't have unencoded / or ? or & characters, etc.).
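A short Python sketch shows the difference between standard Base64 and the base64url variant. The helper names are made up; dropping the trailing "=" padding follows the JWT convention.

```python
import base64


def b64url_encode(data):
    """Encode bytes as base64url without '=' padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text):
    """Decode base64url text, restoring the padding that was stripped."""
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


raw = bytes([0xFB, 0xFF, 0xFE])
print(base64.b64encode(raw).decode("ascii"))  # standard alphabet: +//+
print(b64url_encode(raw))                     # URL-safe alphabet: -__-
```

The standard alphabet uses "+" and "/", which have special meaning in URLs, while base64url replaces them with "-" and "_", so each part can be placed in a URL without further escaping.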

What is the correct syntax to fold a JSON string?

I am using Delphi 2009 to build up a string variable containing a simple JSON string from values I get from a database. This results in a string of the form below (although the real string could be much longer).
{"alice#example.com": {"first":"Alice", "id": 2},"bob#example.com": {"first":"Bob", "id":1},"cath#example.com": {"first":"Cath", "id":3},"derek#example.com": {"first":"Derek", "id": 4}}
This string gets sent as a header called Recipient-Variables in an email to a company.
The instructions I have for sending the emails to the company say
Note The value of the “Recipient-Variables” header should be
valid JSON string, otherwise we won’t be able to parse it. If
your “Recipient-Variables” header exceeds 998 characters,
you should use folding to spread the variables over multiple lines.
I have looked at these SO posts to try to understand what is meant by folding but cannot really understand the replies as they often seem to be referencing a particular editor.
notepad++ user defined regions with folding
Folding JSON at specific points
Can you customize code folding?
Please can somebody use my example to show me what syntax I should use or what characters I need to insert in my string to comply with the instruction and fold my JSON string, say in between the records for bob and cath?
(BTW I understand what is meant by folding when viewing JSON or other code in an editor but I don't understand how a simple JSON string needs to be formatted in order for the folding to happen at a specific place)
I finally found the answer myself so posting here to help others, just in case.
The answer is given in RFC 2822, published in 2001 by the Network Working Group (P. Resnick, Editor):
https://www.rfc-editor.org/rfc/rfc2822#page-11
The document ...
specifies a syntax for text messages that are sent between computer
users, within the framework of "electronic mail" messages.
...and describes how emails are constructed and, in particular, how to deal with long headers.
Section 2.2.3 talks about long header fields (more than 998 characters) and says such headers need to be folded by inserting CRLF followed immediately by some white space, e.g. a space character.
If the receiving server follows the same standard, it will strip out the CRLF before parsing the header, and the parsing itself will discard the space characters.
Though structured field bodies are defined in such a way that
folding can take place between many of the lexical tokens (and even
within some of the lexical tokens), folding SHOULD be limited to
placing the CRLF at higher-level syntactic breaks. For instance, if
a field body is defined as comma-separated values, it is recommended
that folding occur after the comma separating the structured items in
preference to other places where the field could be folded, even if
it is allowed elsewhere.
Later, in section 3.2.3 it explains how comments may be combined with folding white space.
So it seems that when generating the string in code, long header lines have to be folded by finding a higher-level syntactic boundary, such as a comma, that is less than 998 characters from the start of the header (or from the last fold point) and inserting the three bytes 0x0D 0x0A 0x20 (CR, LF, space) there. This could be done after the header has been constructed or on the fly as it is generated.
As a follow up, I now notice that the Overbytes ICS component I am using (TSslSmtpCli) has a boolean property FoldHeaders so this might do all the work for me.
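For anyone without such a component, the folding described above can be sketched as follows. The sketch is in Python rather than Delphi, and the function names and the choice of "}," as the fold boundary are assumptions. Folding only after "}," keeps the inserted space between records rather than inside a quoted value, as long as no string value itself contains "},".

```python
import re

CRLF = "\r\n"
MAX_LINE = 998  # characters per line, not counting the CRLF


def fold_header(name, value, limit=MAX_LINE, boundary="},"):
    """Fold 'name: value' by inserting CRLF + space after a record boundary."""
    line = f"{name}: {value}"
    folded = []
    while len(line) > limit:
        # Last boundary that still fits entirely within the limit.
        cut = line.rfind(boundary, 0, limit)
        if cut == -1:
            raise ValueError("no fold point within the line limit")
        end = cut + len(boundary)
        folded.append(line[:end])
        # A continuation line must start with white space.
        line = " " + line[end:].lstrip(" ")
    folded.append(line)
    return CRLF.join(folded)


def unfold_header(text):
    """Undo folding: remove each CRLF that is followed by a space or tab."""
    return re.sub(r"\r\n(?=[ \t])", "", text)
```

With the string from the question and a deliberately small limit such as 80, this puts each record on its own line, so a fold lands between the records for bob and cath. After unfolding, the only change is a space after the comma, which a JSON parser ignores.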

Inside a <video> tag, what is the meaning of data:?

When trying to download a video on Vevo by inspecting the element, I discovered that this was impossible even though the content wasn't DRM protected. The video tag refers to a file that I can't trace or find using Ctrl+I (Firefox Developer Edition), while it is still playing in the browser. Instead of /folder/video it says data:folder/video. How does this data: work?
A quick Google search turns up this from our friend Wikipedia:
The data URI scheme is a uniform resource identifier (URI) scheme that provides a way to include data in-line in web pages as if they were external resources. It is a form of file literal or here document. This technique allows normally separate elements such as images and style sheets to be fetched in a single Hypertext Transfer Protocol (HTTP) request, which may be more efficient than multiple HTTP requests.
Syntax
The scheme followed by a colon (data:).
An optional media type. The media type part may include one or more parameters, in the format attribute=value, separated by semicolons. A common media type parameter is charset, specifying the character set of the media type, where the value is from the IANA list of character set names. If one is not specified, the media type of the data URI is assumed to be text/plain;charset=US-ASCII.
An optional base64 extension base64, separated from the preceding part by a semicolon. When present, this indicates that the data content of the URI is binary data, encoded in ASCII format using the Base64 scheme for binary-to-text encoding. The base64 extension is distinguished from any media type parameters by virtue of not having a =value component and by coming after any media type parameters.
The data, separated from the preceding part by a comma. The data is a sequence of zero or more octets represented as characters. The comma is required in a data URI, even when the data part has zero length. The characters permitted within the data part include ASCII upper and lowercase letters, digits, and many ASCII punctuation and special characters. Note that this may include characters, such as colon, semicolon, and comma which are delimiters in the URI components preceding the data part. Other octets must be percent-encoded. If the data is Base64-encoded, then the data part may contain only valid Base64 characters. Note that Base64-encoded data: URIs use the standard Base64 character set (with + and / as characters 62 and 63) rather than the so-called "URL-safe Base64" character set
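The syntax described above is simple enough to take apart by hand. The following Python sketch (the function name is made up, and it does no validation of the media type) splits a data: URI into its media type and decoded bytes.

```python
import base64
from urllib.parse import unquote_to_bytes

DEFAULT_MEDIA_TYPE = "text/plain;charset=US-ASCII"


def parse_data_uri(uri):
    """Return (media_type, payload_bytes) for a data: URI."""
    if not uri.startswith("data:"):
        raise ValueError("not a data: URI")
    # The first comma separates the header from the data; it is required.
    header, comma, data = uri[len("data:"):].partition(",")
    if not comma:
        raise ValueError("missing comma before the data part")
    # The base64 marker, if present, comes after any media type parameters.
    is_base64 = header.endswith(";base64")
    if is_base64:
        header = header[: -len(";base64")]
    media_type = header or DEFAULT_MEDIA_TYPE
    payload = base64.b64decode(data) if is_base64 else unquote_to_bytes(data)
    return media_type, payload


print(parse_data_uri("data:text/plain;base64,SGVsbG8="))
print(parse_data_uri("data:,Hello%2C%20World"))
```

The first call decodes Base64 data; the second has no media type, so the default applies and the data part is percent-decoded.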

What does "data types" in HTML5 mean?

I don't understand what it really means when someone refers to data types in HTML5.
I googled it, and found http://www.w3.org/TR/html-markup/datatypes.html
It says,
data types (microsyntaxes) that are referenced by attribute
descriptions
But now I'm even more confused about what it means by microsyntaxes.
Wikipedia says:
[...] the syntax of a computer language is the set of rules that defines the combinations of symbols that are considered to be a correctly structured document or fragment in that language. This applies both to programming languages, where the document represents source code, and markup languages, where the document represents data.
So in order for an HTML document to be read and understood by a browser, it should adhere to the syntax of HTML: That is, it should follow the rules that define the language. A microsyntax is essentially a very small syntax, applying to a very specific thing.
A data type is simply a type of data. The HTML specifications refer to various data types (e.g. String, Token, Integer, Date, Set of comma-separated strings, etc) and the document you linked describes exactly what those things are. It does this by defining a set of rules, or a microsyntax.
E.g. the microsyntax which defines a Set of comma-separated strings is:
Zero or more strings that are themselves each zero or more characters, each optionally with leading and/or trailing space characters, and each separated from the next by a single "," (comma) character. Each string itself must not begin or end with any space characters, and each string itself must not contain any "," (comma) characters.
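As an illustration, the rule quoted above fits in a few lines of Python. The function name is made up, and two choices are assumptions: "space characters" is taken to mean space, tab, LF, FF and CR, and an empty value is read as zero strings.

```python
from typing import List

# Assumed definition of HTML "space characters": space, tab, LF, FF, CR.
HTML_SPACE = " \t\n\f\r"


def parse_comma_separated(value: str) -> List[str]:
    """Split a 'set of comma-separated strings' into its individual strings."""
    if value.strip(HTML_SPACE) == "":
        return []  # assumption: an empty value means zero strings
    # Commas separate the strings; surrounding space characters are not part of them.
    return [item.strip(HTML_SPACE) for item in value.split(",")]


print(parse_comma_separated("image/png, image/jpeg ,text/html"))
```

Each returned string contains no comma and has no leading or trailing space characters, which is exactly what the microsyntax requires of the individual strings.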

check input for UTF-8, count characters, use regular expressions

I want to write a C program that gets some strings from input. I want to save them in a MySQL database.
For security I would like to check whether the input is a (possible) UTF-8 string, count the number of characters and also use some regular expressions to validate the input.
Is there a library that offers me that functionality?
I thought about using wide characters, but as far as I understand, whether they support UTF-8 depends on the implementation and is not defined by a standard.
I would also be missing the regular expressions.
PCRE supports UTF-8. To validate the string before any processing, the W3C suggests this expression, which I re-implemented in plain C, but PCRE already checks for UTF-8 automatically, in accordance with RFC 3629.
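For readers who want to see what such a byte-level check looks like, here is a sketch in Python rather than C. The pattern follows the structure of the W3C expression as I understand it and should be checked against the W3C article before use; the helper names are made up.

```python
import re

# Byte-level pattern for well-formed UTF-8. It excludes overlong forms and
# surrogates, and accepts only TAB, LF and CR among the ASCII control characters.
UTF8_PATTERN = re.compile(
    rb"""\A(?:
        [\x09\x0A\x0D\x20-\x7E]              # ASCII text plus TAB, LF, CR
      | [\xC2-\xDF][\x80-\xBF]               # non-overlong 2-byte sequences
      | \xE0[\xA0-\xBF][\x80-\xBF]           # 3-byte, excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # straight 3-byte sequences
      | \xED[\x80-\x9F][\x80-\xBF]           # 3-byte, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}        # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}            # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}        # plane 16
    )*\Z""",
    re.VERBOSE,
)


def is_plausible_utf8(raw):
    """Return True if the bytes match the pattern above."""
    return UTF8_PATTERN.match(raw) is not None


def count_characters(raw):
    """Count code points by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for byte in raw if byte & 0xC0 != 0x80)


sample = "héllo wörld".encode("utf-8")
print(is_plausible_utf8(sample), count_characters(sample))
```

Counting bytes that are not continuation bytes gives the number of code points, which is usually what "number of characters" means for a database column limit. The count is only meaningful once the input has passed the validity check.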