Why does xmlGetProp substitute character entity references when reading attribute values?

I'm using libxml2 to parse/read an HTML page. The following code is used to read the value of an attribute:
char *value = (char*)xmlGetProp(node, attr->name);
But xmlGetProp substitutes character entity references when it reads the attribute content. E.g.
<p onload="readId=&quot;blahString&quot;; myFun();"> Event handler in P HTML TAG</p>
In the above case, it returns the following string as "onload" attribute value:
readId="blahString";myFun();
The character entity reference is substituted in the above reading process. Is there any way to read the attribute value keeping the original HTML content using libxml2?

What you call "HTML encoding" is actually called character entity reference. To answer your question: No, the HTML parser of libxml2 has no option to turn off substitution of character references.
The XML parser keeps character entity references by default, but it can't be used for typical HTML documents.
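As an illustration only: the sketch below uses Python's standard html.parser, not libxml2, so it is an analogy rather than a fix for the C code in the question. It shows the same general behaviour: the attribute value handed to the application already has character references substituted, and the original markup is only recoverable if the parser exposes the raw tag text. The class name RawAttrParser is made up for this sketch.

```python
from html.parser import HTMLParser


class RawAttrParser(HTMLParser):
    """Collects decoded attribute values and the raw start-tag text."""

    def __init__(self):
        super().__init__()
        self.decoded = {}   # attribute name -> value after substitution
        self.raw_tags = []  # start tags exactly as written in the source

    def handle_starttag(self, tag, attrs):
        self.decoded.update(dict(attrs))
        self.raw_tags.append(self.get_starttag_text())


parser = RawAttrParser()
parser.feed('<p onload="readId=&quot;blahString&quot;; myFun();"> Event handler in P HTML TAG</p>')
parser.close()

decoded_value = parser.decoded["onload"]  # references already substituted
raw_tag = parser.raw_tags[0]              # original markup, references intact
print(decoded_value)
print(raw_tag)
```

With libxml2's HTML parser there is, per the answer above, no comparable switch, so keeping the original text would mean going back to the source bytes yourself.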

Are there some valid HTML entities without the semicolon?

Looking at this official entities.json file, some of the entities are defined without an ending semicolon.
For example:
"&Acirc": { "codepoints": [194], "characters": "\u00C2" },
"&Acirc;": { "codepoints": [194], "characters": "\u00C2" },
Where is that documented in HTML5? Or is that a browser thing¹?
¹ thing as in extension for backward compatibility.
The HTML named character list is defined at https://html.spec.whatwg.org/multipage/named-characters.html and yes, some of the names don't have a trailing ";", e.g. &not
Named HTML entities without a semicolon are not valid, per the HTML spec, but browsers are required to support some of them anyway. (This spec pattern, where something is officially illegal for you to do as an HTML author but still has a single unambiguously specified behaviour that browsers must implement, is used a lot in the HTML spec.)
There are a few pertinent sections in the spec:
§13.1.4 Character references
Pertinent quote:
Named character references
The ampersand must be followed by one of the names given in the named character references section, using the same case. The name must be one that is terminated by a U+003B SEMICOLON character (;).
§13.2 Parsing HTML Documents, especially 13.2.5.73 Named character reference state (if you really want to pick through the horrible hard-to-read implementation details of the parsing algorithm).
The non-normative §1.11.2 Syntax errors, which contains some explanation on why the spec makes references without semicolons errors (though I don't personally find it hugely compelling):
Errors involving fragile syntax constructs
There are syntax constructs that, for historical reasons, are relatively fragile. To help reduce the number of users who accidentally run into such problems, they are made non-conforming.
Example
For example, the parsing of certain named character references in attributes happens even with the closing semicolon being omitted. It is safe to include an ampersand followed by letters that do not form a named character reference, but if the letters are changed to a string that does form a named character reference, they will be interpreted as that character instead.
In this fragment, the attribute's value is "?bill&ted":
<a href="?bill&ted">Bill and Ted</a>
In the following fragment, however, the attribute's value is actually "?art©", not the intended "?art&copy", because even without the final semicolon, "&copy" is handled the same as "&copy;" and thus gets interpreted as "©":
<a href="?art&copy">Art and Copy</a>
To avoid this problem, all named character references are required to end with a semicolon, and uses of named character references without a semicolon are flagged as errors.
Thus, the correct way to express the above cases is as follows:
<a href="?bill&ted">Bill and Ted</a> <!-- &ted is ok, since it's not a named character reference -->
<a href="?art&amp;copy">Art and Copy</a> <!-- the & has to be escaped, since &copy is a named character reference -->
As a final bit of corroboration that entities like &Acirc are invalid but work anyway, we can use this test document:
<!DOCTYPE html>
<html lang="en">
<title>Test page</title>
<div>&Acirc</div>
</html>
Open it in Chrome, and it works and shows us an A with a circumflex accent.
But paste it into the Nu Html Checker (endorsed by WHATWG), and we get an error stating "Named character reference was not terminated by a semicolon."
i.e. it works, but it's invalid.
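The "works but invalid" behaviour can also be reproduced outside a browser. A minimal sketch, assuming Python's html.unescape is an acceptable stand-in for an HTML5-style decoder (it applies the rules for ordinary text and does not model the extra attribute-specific checks a browser performs):

```python
import html

# A legacy name is decoded with or without the trailing semicolon.
without_semicolon = html.unescape("&Acirc")
with_semicolon = html.unescape("&Acirc;")

# "&ted" is not a named character reference, so it is left alone.
not_a_reference = html.unescape("?bill&ted")

# "&copy" is one, so it is replaced even though no semicolon follows.
accidental_reference = html.unescape("?art&copy")

print(without_semicolon, with_semicolon, not_a_reference, accidental_reference)
```

The last two calls mirror the "Bill and Ted" and "Art and Copy" examples quoted from the spec.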
I made a program in python to get some numbers, and I found out that:
Of the 2231 total entities, 106 (4.75%) are defined without a semicolon at the end.
All those entities:
&AElig, &AMP, &Aacute, &Acirc, &Agrave, &Aring, &Atilde, &Auml, &COPY, &Ccedil, &ETH, &Eacute, &Ecirc, &Egrave, &Euml, &GT, &Iacute, &Icirc, &Igrave, &Iuml, &LT, &Ntilde, &Oacute, &Ocirc, &Ograve, &Oslash, &Otilde, &Ouml, &QUOT, &REG, &THORN, &Uacute, &Ucirc, &Ugrave, &Uuml, &Yacute, &aacute, &acirc, &acute, &aelig, &agrave, &amp, &aring, &atilde, &auml, &brvbar, &ccedil, &cedil, &cent, &copy, &curren, &deg, &divide, &eacute, &ecirc, &egrave, &eth, &euml, &frac12, &frac14, &frac34, &gt, &iacute, &icirc, &iexcl, &igrave, &iquest, &iuml, &laquo, &lt, &macr, &micro, &middot, &nbsp, &not, &ntilde, &oacute, &ocirc, &ograve, &ordf, &ordm, &oslash, &otilde, &ouml, &para, &plusmn, &pound, &quot, &raquo, &reg, &sect, &shy, &sup1, &sup2, &sup3, &szlig, &thorn, &times, &uacute, &ucirc, &ugrave, &uml, &uuml, &yacute, &yen, &yuml
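The poster's program is not shown. A rough way to reproduce the count, assuming Python's html.entities.html5 table mirrors the entities.json file mentioned in the question, could look like this:

```python
from html.entities import html5

# Keys in html5 omit the leading "&"; legacy names appear both with
# and without the trailing ";".
names = ["&" + name for name in html5]
no_semicolon = sorted(n for n in names if not n.endswith(";"))

share = 100 * len(no_semicolon) / len(names)
print(f"{len(no_semicolon)} of {len(names)} ({share:.2f}%)")
print(", ".join(no_semicolon))
```

Every name in no_semicolon also exists in the table with a trailing semicolon, which matches the pair shown in the question for &Acirc.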

Does "Text" in the HTML5 syntax mean "any character"?

I wasn't able to find any restrictions on which characters are allowed in Text. Does this imply that everything is allowed, or are there restrictions that affect HTML documents in general?
For example the Character Reference Section states that:
The numeric character reference forms [...] are allowed to reference any Unicode code point other than U+0000, U+000D, permanently undefined Unicode characters (noncharacters), surrogates (U+D800–U+DFFF), and control characters other than space characters.
Are those characters still allowed in their "unescaped" form in Text? E.g. as attribute value: <span title="Hello ␀ World"></span> where ␀ is the U+0000 NULL character (not U+2400).
The character restriction for text on your page and in your markup is defined according to your selected character set. If you don't define a character set, the browser will take a guess or assert its default option (usually, whatever is the least restrictive). The character set is defined by using the meta tag with the charset attribute in your document's head section. The most common example of this uses the UTF-8 character set:
<meta charset="UTF-8" />
The value of this attribute can be any of the character sets defined by the Internet Assigned Numbers Authority (IANA). The full list of defined character sets is available here.
Additionally, there may be specific restrictions on unescaped text used within certain elements (or types of elements). In this case, you would have to read the specifications for that tag or type of tag, or simply escape the characters in question by replacing them with their ampersand-encoded html entities escape values.
I don't think there is any restriction on Text in the context you have pointed to. Text here means all the allowed letters, numbers and other alphanumeric characters.
The answer is in the link you provided:
Text is allowed inside elements, attribute values, and comments. Extra constraints are placed on what is and what is not allowed in text based on where the text is to be put, as described in the other sections
Now if we go to the syntax definition for CDATA sections:
CDATA sections must consist of the following components, in this order:
The string "<![CDATA[".
Optionally, text, with the additional restriction that the text must not contain the string "]]>".
The string "]]>".
So every type of content has its own set of restrictions, and text is just used to define the superset of all characters, symbols and so on...
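For the numeric character references quoted in the question, one way to see how a decoder that follows the HTML5 rules treats the forbidden code points is Python's html.unescape. This is only a sketch of decoder behaviour for references; it says nothing about which raw characters a conformance checker accepts in Text.

```python
import html

# An ordinary numeric reference decodes to the character it names.
letter_a = html.unescape("&#65;")

# References to U+0000 and to surrogates are not passed through;
# the decoder substitutes U+FFFD REPLACEMENT CHARACTER instead.
null_ref = html.unescape("&#0;")
surrogate_ref = html.unescape("&#xD800;")

print(repr(letter_a), repr(null_ref), repr(surrogate_ref))
```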

Using SQL Server xml.modify on an XML document with escaped XML

I want to make modifications to an XML document using SQL Server's XML.modify. My problem is that my XML document contains some escaped XML, so "<" appears as "&lt;" and ">" appears as "&gt;". I want to know whether it is possible to set the value of an element that is surrounded by escaped XML. An example of what I'm dealing with is below:
Declare @myDoc as xml;
Set @myDoc = '<Root>
<ProductDescription ProductID="1" ProductName="Road Bike">
<Features>
<BikeLight>False</BikeLight>
&lt;BikeHorn&gt;True&lt;/BikeHorn&gt;
</Features>
</ProductDescription>
</Root>' ;
I know I can edit the value of the BikeLight element by using
set @myDoc.modify('replace value of (/Root/ProductDescription/Features/BikeLight/text())[1] with "True"')
but trying to do the same with BikeHorn only returns the XML document, unmodified. Is it possible to modify the value of elements surrounded by escaped XML? Any help would be appreciated, thanks. Also, just to note that in my actual code all elements under Features would be surrounded by escaped XML.
The problem is that you don't have a node called <BikeHorn>; you have a complex node <Features> which contains some text of its own in addition to a <BikeLight> child node. So you need to modify the text of <Features> to change the value of BikeHorn:
set @myDoc.modify('
replace value of (/Root/ProductDescription/Features/text())[1]
with "<BikeHorn>False</BikeHorn>"')
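To see why the escaped BikeHorn is text rather than an element, here is a small sketch using Python's xml.etree.ElementTree instead of SQL Server (an assumption made purely for illustration; the node model differs slightly, since ElementTree stores trailing text on the preceding child's tail):

```python
import xml.etree.ElementTree as ET

doc = """<Root>
<ProductDescription ProductID="1" ProductName="Road Bike">
<Features>
<BikeLight>False</BikeLight>
&lt;BikeHorn&gt;True&lt;/BikeHorn&gt;
</Features>
</ProductDescription>
</Root>"""

root = ET.fromstring(doc)
features = root.find("ProductDescription/Features")

# Only BikeLight is a real child element of Features.
child_tags = [child.tag for child in features]

# The escaped markup is plain text that follows BikeLight.
bike_light = features.find("BikeLight")
trailing_text = bike_light.tail

# "Modifying BikeHorn" therefore means rewriting that text.
bike_light.tail = trailing_text.replace("True", "False")
updated = ET.tostring(root, encoding="unicode")
print(child_tags, trailing_text.strip())
print(updated)
```

On serialization the angle brackets are escaped again, so the document keeps the same shape as before.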

Namespace and HTML 5

In the HTML specs one can find the following line:
In the HTML syntax, namespace prefixes and namespace declarations do not have the same effect as in XML. For instance, the colon has no special meaning in HTML element names.
After looking into the Grammar definition there are the following sections:
On tag names it states:
Tags contain a tag name, giving the element's name. HTML elements all have names that only use alphanumeric ASCII characters. In the HTML syntax, tag names, even those for foreign elements, may be written with any mix of lower- and uppercase letters that, when converted to all-lowercase, matches the element's tag name; tag names are case-insensitive.
This leaves almost no room for interpretation. There is no underscore or dollar sign here. There is also no ':', which makes it impossible to legally express namespaces. It would also seem to allow a name consisting only of a number, like <1>, but then the grammar states:
Uppercase ASCII letter
Create a new start tag token, set its tag name to the lowercase version of the current input character (add 0x0020 to the character's code point), then switch to the tag name state. (Don't emit the token yet; further details will be filled in before it is emitted.)
Lowercase ASCII letter
Create a new start tag token, set its tag name to the current input character, then switch to the tag name state. (Don't emit the token yet; further details will be filled in before it is emitted.)
So we are left with only something like <a1234>.
On attribute names it states:
Attributes have a name and a value. Attribute names must consist of one or more characters other than the space characters, U+0000 NULL, U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), ">" (U+003E), "/" (U+002F), and "=" (U+003D) characters, the control characters, and any characters that are not defined by Unicode. In the HTML syntax, attribute names, even those for foreign elements, may be written with any mix of lower- and uppercase letters that are an ASCII case-insensitive match for the attribute's name.
Reading this it seems this is possible:
<div ::::::="hello" $_$="dollar"></div>
From all this, using namespaces for tag names is forbidden, and for attributes it is merely a convention you may follow but do not need to.
So, to put it simply, namespaces do not exist in HTML 5, and at least for tag names they cannot be emulated, since we have no underscore, no dot or anything alike.
Is this correct? On the other hand HTML 5 specs state that we are free to add xmlns attributes to the elements making it possible to clearly introduce new namespaces. How does this fit?
[Update]
I rechecked the specification using the single-page version of the specs, and it actually states that the namespace declaration is allowed for XHTML leftovers but has to be ignored, so no namespaces for us. Sad thing.
[/Update]
So the only question left is: if there is no ':' or anything else, what can I legally do with element tag names? Can I use a special one I have made up? Remember we have a relaxed specification for the parser here. The parser should be built in a way that it can handle unknown element tags. The question here is, how do parsers handle unknown element tags?
The HTML 5 specification allows xmlns namespace attributes only for compatibility with the XHTML document specification. Those namespaces are ignored and carry no meaning.
The tag name section of the specs is a bit confusing since it only talks about HTML elements. The parser section for tag names reads:
8.2.4.10 Tag name state
Consume the next input character:
"tab" (U+0009)
"LF" (U+000A)
"FF" (U+000C)
U+0020 SPACE
-> Switch to the before attribute name state.
"/" (U+002F)
-> Switch to the self-closing start tag state.
">" (U+003E)
-> Switch to the data state. Emit the current tag token.
Uppercase ASCII letter
-> Append the lowercase version of the current input character (add 0x0020 to the character's code point) to the current tag token's tag name.
U+0000 NULL
-> Parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current tag token's tag name.
EOF
-> Parse error. Switch to the data state. Reconsume the EOF character.
Anything else
-> Append the current input character to the current tag token's tag name.
The last line is the important part. Also, the specification's restriction on names applies only to the elements it defines as HTML elements. Therefore we are free to put almost any character after the first letter of a tag name, and the result is considered a valid element but not a valid HTML element. The question is how a browser or editor reacts to such character soup. But again, it is a valid element name, just not a valid HTML element name.
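The tag name state quoted above is small enough to sketch directly. The following is a simplified illustration, not a full tokenizer: the function name read_tag_name is made up, the input is assumed to be the text right after "<", and the first-letter check stands in for the tag open state quoted in the question.

```python
from typing import Tuple


def read_tag_name(source: str) -> Tuple[str, str]:
    """Return (tag name, next tokenizer state) for text following '<'."""
    first = source[:1]
    if not (first.isascii() and first.isalpha()):
        # Tag open state: only an ASCII letter starts a tag name.
        raise ValueError("a tag name must start with an ASCII letter")

    name = []
    for ch in source:
        if ch in "\t\n\f ":
            return "".join(name), "before attribute name"
        if ch == "/":
            return "".join(name), "self-closing start tag"
        if ch == ">":
            return "".join(name), "data"
        if "A" <= ch <= "Z":
            name.append(ch.lower())   # uppercase ASCII is lowercased
        elif ch == "\0":
            name.append("\ufffd")     # parse error: NULL is replaced
        else:
            name.append(ch)           # "anything else" is appended as-is
    return "".join(name), "data"      # EOF: parse error, back to data state


print(read_tag_name("a1234>"))
print(read_tag_name("My:Tag_$ class='x'>"))
```

The second call shows the "anything else" branch at work: the colon, underscore and dollar sign all end up in the tag name.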

What actually are PCDATA and CDATA?

It seems that a loose definition of PCDATA and CDATA is that
PCDATA is character data, but is to be parsed.
CDATA is character data, and is not to be parsed.
But then someone told me that CDATA is actually parsed or PCDATA is actually not parsed... so it is a bit confusing. Does anyone know what the real deal is?
Update: I actually added the PCDATA definition on Wikipedia... so don't take that answer too seriously as that's only my rough understanding of it.
From WIKI:
PCDATA
Simply speaking, PCDATA stands for Parsed Character Data. That means the characters are to be parsed by the XML, XHTML, or HTML parser (&lt; will be changed to <, <p> will be taken to mean a paragraph tag, etc.). Compare that with CDATA, where the characters are not to be parsed by the XML, XHTML, or HTML parser.
CDATA
The term CDATA, meaning character data, is used for distinct, but related purposes in the markup languages SGML and XML. The term indicates that a certain portion of the document is general character data, rather than non-character data or character data with a more specific, limited structure.
Both PCDATA and CDATA are parsed. They are both character data.
They both must include only valid characters. For example if your document encoding is UTF-8, the content of CDATA sections must still be valid UTF-8 characters. So random binary data will probably prevent the document from being well-formed. Also CDATA sections are still parsed, if only to find the end section tag. But other markup-like characters, like <, > and & are ignored and passed as-is by the parser.
OTOH in PCDATA literal < and & (and ' or " in attribute values) must be escaped, or they will be interpreted as markup. Entities will also be expanded.
So yes, CDATA sections are indeed parsed. I am not sure why you were told that PCDATA is not parsed though.
PCDATA - Parsed Character Data
CDATA - (Unparsed) Character Data
http://www.w3schools.com/XML/xml_cdata.asp
PCDATA is text that will be parsed by a parser. Tags inside the text will be treated as markup and entities will be expanded.
CDATA is text that will not be parsed by a parser. Tags inside the text will not be treated as markup and entities will not be expanded.
By default, everything is PCDATA. In the following example, ignoring the root, <bar> will be parsed, and it'll have no content, but one child.
<?xml version="1.0"?>
<foo>
<bar><test>content!</test></bar>
</foo>
When we want to specify that an element will only contain text, and no child elements, we use the keyword PCDATA, because this keyword specifies that the element must contain parsable character data, that is, any text except the characters less-than (<), greater-than (>), ampersand (&), quote (') and double quote (").
In the next example, bar is CDATA, and isn't parsed, and has the content "<test>content!</test>".
<?xml version="1.0"?>
<foo>
<bar><![CDATA[<test>content!</test>]]></bar>
</foo>
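The difference between the two documents above can be checked with any XML parser. A minimal sketch using Python's xml.etree.ElementTree (chosen here only for illustration; the XML declaration is left out for brevity):

```python
import xml.etree.ElementTree as ET

parsed_doc = ET.fromstring("<foo><bar><test>content!</test></bar></foo>")
cdata_doc = ET.fromstring("<foo><bar><![CDATA[<test>content!</test>]]></bar></foo>")

bar_parsed = parsed_doc.find("bar")
bar_cdata = cdata_doc.find("bar")

# PCDATA: <test> is parsed as markup, so bar has one child and no text.
print(len(bar_parsed), bar_parsed.text, bar_parsed[0].tag)

# CDATA section: the same characters arrive as literal text, no children.
print(len(bar_cdata), bar_cdata.text)
```"
</test>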
There are several content models in SGML. The #PCDATA content model says that an element may contain plain text. The "parsed" part of it means that markup (including PIs, comments and SGML directives) in it is parsed instead of displayed as raw text. It also means that entity references are replaced.
Another type of content model allowing plain text contents is CDATA. In XML, the element content model may not implicitly be set to CDATA, but in SGML, it means that markup and entity references are ignored in the contents of the element. In attributes of CDATA type however, entity references are replaced.
In XML #PCDATA is the only plain text content model. You use it if you at all want to allow text contents in the element. The CDATA content model may be used explicitly through the CDATA block markup in #PCDATA, but element contents may not be defined as CDATA per default.
In a DTD, the type of an attribute that contains text must be CDATA. The CDATA keyword in an attribute declaration has a different meaning than the CDATA section in an XML document. In a CDATA section all characters are legal (including the <, >, &, ' and " characters) except the "]]>" end tag.
#PCDATA is not appropriate for the type of an attribute. It is used for the type of "leaf" text.
#PCDATA is prefixed with a hash (also known as a "hashtag" or octothorp) simply for historical reasons.
Your first definition is correct.
PCDATA is parsed which means that entities are expanded and that text is treated as markup. CDATA is not parsed by an XML parser.
If only script elements were set to CDATA by default in the XHTML DTDs, it would save a lot of ugly manual overrides... Why would script blocks contain other elements? If there are such elements, they are handled by the JS interpreter in DOM manipulation actions, in which case they should still be completely ignored by the XML parser before document insertion and rendering. I suppose it may have been designed to force the use of external script resource files, which is ultimately a good thing.