what actually is PCDATA and CDATA? - html

it seems that a loose definition of PCDATA and CDATA is that
PCDATA is character data, but is to be parsed.
CDATA is character data, and is not to be parsed.
but then someone told me that CDATA is actually parsed or PCDATA is actually not parsed... so it is a bit of a confusion. Does anyone know the real deal is?
Update: I actually added the PCDATA definition on Wikipedia... so don't take that answer too seriously as that's only my rough understanding of it.

From WIKI:
PCDATA
Simply speaking, PCDATA stands for Parsed Character Data. That means the characters are to be parsed by the XML, XHTML, or HTML parser. (< will be changed to <, <p> will be taken to mean a paragraph tag, etc). Compare that with CDATA, where the characters are not to be parsed by the XML, XHTML, or HTML parser.
CDATA
The term CDATA, meaning character data, is used for distinct, but related purposes in the markup languages SGML and XML. The term indicates that a certain portion of the document is general character data, rather than non-character data or character data with a more specific, limited structure.

Both PCDATA and CDATA are parsed. They are both character data.
They both must include only valid characters. For example if your document encoding is UTF-8, the content of CDATA sections must still be valid UTF-8 characters. So random binary data will probably prevent the document from being well-formed. Also CDATA sections are still parsed, if only to find the end section tag. But other markup-like characters, like <, > and & are ignored and passed as-is by the parser.
OTOH in PCDATA literal < and & (and ' or " in attribute values) must be escaped, or they will be interpreted as markup. Entities will also be expanded.
So yes, CDATA sections are indeed parsed. I am not sure why you were told that PCDATA is not parsed though.

PCDATA - Parsed Character Data
CDATA - (Unparsed) Character Data
http://www.w3schools.com/XML/xml_cdata.asp

PCDATA is text that will be parsed by a parser. Tags inside the text
will be treated as markup and entities will be expanded.
CDATA is text that will not be parsed by a parser. Tags inside the text will
not be treated as markup and entities will not be expanded.
By default, everything is PCDATA. In the following example, ignoring the root, <bar> will be parsed, and it'll have no content, but one child.
<?xml version="1.0"?>
<foo>
<bar><test>content!</test></bar>
</foo>
When we want to specify that an element will only contain text, and no child elements, we use the keyword PCDATA, because this keyword specifies that the element must contain parsable character data – that is , any text except the characters less-than (<) , greater-than (>) , ampersand (&), quote(') and double quote (").
In the next example, bar is CDATA, and isn't parsed, and has the content "<test>content!</test>".
<?xml version="1.0"?>
<foo>
<bar><![CDATA[<test>content!</test>]]></bar>
</foo>
There are several content models in SGML. The #PCDATA content model says that an element may contain plain text. The "parsed" part of it means that markup (including PIs, comments and SGML directives) in it is parsed instead of displayed as raw text. It also means that entity references are replaced.
Another type of content model allowing plain text contents is CDATA. In XML, the element content model may not implicitly be set to CDATA, but in SGML, it means that markup and entity references are ignored in the contents of the element. In attributes of CDATA type however, entity references are replaced.
In XML #PCDATA is the only plain text content model. You use it if you at all want to allow text contents in the element. The CDATA content model may be used explicitly through the CDATA block markup in #PCDATA, but element contents may not be defined as CDATA per default.
In a DTD, the type of an attribute that contains text must be CDATA. The CDATA keyword in an attribute declaration has a different meaning than the CDATA section in an XML document. In CDATA section all characters are legal (including <,>,&,’ and “ characters) except the “]]>” end tag.
#PCDATA is not appropriate for the type of an attribute. It is used for the type of "leaf" text.
#PCDATA is prepended by a hash (also known as a "hashtag" or octothorp) simply for historical reasons.

Your first definition is correct.
PCDATA is parsed which means that entities are expanded and that text is treated as markup. CDATA is not parsed by an XML parser.

If only elements were set to CDATA by default in the XHTML DTDs, it would save a lot of ugly manual overrides... Why would script blocks contain other elements? If there are such elements, they are handled by the JS interpreter in DOM manipulation actions -- in which case they should still be completely ignored by the XML parser before document insertion and rendering. I suppose it may have been designed to force the use of external script resource files, which is a ultimately a good thing.

Related

Random Letter html Tag

I was wondering if you can use a random letter as an html tag. Like, f isn't a tag, but I tried it in some code and it worked just like a span tag. Sorry if this is a bad question, I've just been curious about it for a while, and I couldn't find anything online.
I was wondering if you can use a random letter as an html tag.
Yes and no.
"Yes" - in that it works, but it isn't correct: when you have something like <z> it only works because the web (HTML+CSS+JS) has a degree of forwards compatibility built-in: browsers will render HTML elements that they don't recognize basically the same as a <span> (i.e. an inline element that doesn't do anything other than reify a range of the document's text).
However, to use HTML5 Custom Elements correctly you need to conform to the Custom Elements specification which states:
The name of a custom element must contain a dash (-). So <x-tags>, <my-element>, and <my-awesome-app> are all valid names, while <tabs> and <foo_bar> are not. This requirement is so the HTML parser can distinguish custom elements from regular elements. It also ensures forward compatibility when new tags are added to HTML.
So if you use <my-z> then you'll be fine.
The HTML Living Standard document, as of 2021-12-04, indeed makes an explicit reference to forward-compatibility in its list of requirements for custom element names:
https://html.spec.whatwg.org/#valid-custom-element-name
They start with an ASCII lower alpha, ensuring that the HTML parser will treat them as tags instead of as text.
They do not contain any ASCII upper alphas, ensuring that the user agent can always treat HTML elements ASCII-case-insensitively.
They contain a hyphen, used for namespacing and to ensure forward compatibility (since no elements will be added to HTML, SVG, or MathML with hyphen-containing local names in the future).
They can always be created with createElement() and createElementNS(), which have restrictions that go beyond the parser's.
Apart from these restrictions, a large variety of names is allowed, to give maximum flexibility for use cases like <math-α> or <emotion-😍>.
So, by example:
<a>, <q>, <b>, <i>, <u>, <p>, <s>
No: these single-letter elements are already used by HTML.
<z>
No: element names that don't contain a hyphen - cannot be custom elements and will be interpreted by present-day browsers as invalid/unrecognized markup that they will nevertheless (largely) treat the same as a <span> element.
<a:z>
No: using a colon to use an XML element namespace is not a thing in HTML5 unless you're using XHTML5.
<-z>
No - the element name must start with a lowercase ASCII character from a to z, so - is not allowed.
<a-z>
Yes - this is fine.
<a-> and <a-->
Unsure - these two names are curious:
The HTML spec says the name must match the grammar rule [a-z] (PCENChar)* '-' (PCENChar)*.
The * denotes "zero-or-more" which is odd, because that implies the hyphen doesn't need to be followed by another character.
PCENChar represents a huge range of visible characters permitted in element names, curiously this includes -, so by that rule <a--> should be valid.
But note that -- is a reserved character sequence in the greater SGML-family (including HTML and XML) which may cause weirdness. YMMV!

What Are The Reserved Characters In (X)HTML?

Yes, I've googled it, and surprisingly got confusing answers.
One page says that < > & " are the only reserved characters in (X)HTML. No doubt, this makes sense.
This page says < > & " ' are the reserved characters in (X)HTML. A little confusing, but okay, this makes sense too.
And then comes this page which says < > & " © ° £ and non-breaking space (&nbsp) are all reserved characters in (X)HTML. This makes no sense at all, and pretty much adds to my confusion.
Can someone knowledgeable, who actually do know this stuff, clarify which the reserved characters in (X)HTML actually are?
EDIT: Also, should all the reserved characters in code be escaped when wrapped in <pre> tag? or is it just these three -- < > & ??
The XHTML 1.0 specification states at http://www.w3.org/TR/2002/REC-xhtml1-20020801/#xhtml:
XHTML 1.0 [...] is a reformulation of the three HTML 4 document types as
applications of XML 1.0 [XML].
The XML 1.0 specification states at http://www.w3.org/TR/2008/REC-xml-20081126/#syntax:
Character Data and Markup: Text consists of intermingled character
data and markup. [...] The ampersand character (&) and the left angle
bracket (<) MUST NOT appear in their literal form, except when used as
markup delimiters, or within a comment, a processing instruction, or a
CDATA section. If they are needed elsewhere, they MUST be escaped
using either numeric character references or the strings "&" and
"<" respectively. The right angle bracket (>) may be represented
using the string ">", and MUST, for compatibility, be escaped
using either ">" or a character reference when it appears in the
string "]]>" in content, when that string is not marking the end of
a CDATA section.
This means that when writing the text parts of an XHTML document you must escape &, <, and >.
You can escape a lot more, e.g. ü for umlaut u. You can as well state that the document is encoded in for example UTF-8 and write the byte sequence 0xc3bc instead to get the same umlaut u.
When writing the element parts (col. "tags") of the document, there are different rules. You have to take care of ", ' and a lot of rules concerning comments, CDATA and so on. There are also rules which characters can be used in element and attribute names. You can look it up in the XML specification, but in the end it comes down to: for element and attribute names, use letters, digits and "-"; do not use "_". For attribute values, you must escape & and (depending on the quote style) either ' or ".
If you use one of the many libraries to write XML / XHTML documents, somebody else has already taken care of this and you just have to tell the library to write text or elements. All the escaping is done the in the background.&
Only < and & need to be escaped. Inside attributes, " or ' (depending on which quote style you use for the attribute's value) needs to be escaped, too.
<a href="#" onclick='here you can use " safely'></a>
By writing "(X)HTML", you are asking (at least) two different questions.
By the HTML rules, with "HTML" meaning any HTML version up to and including HTML 4.01, only "<" and "&" are reserved. The rules are somewhat complex. They should not not appear literally except in their syntactic use in tags, entity references, and character references. But by the formal rules, they may appear literally e.g. in the context "A & B" or "A < B" (but A&B be formally wrong, and so would A<B).
The XHTML rules, based on XML, are somewhat stricter, simpler: "<" and "&" are unconditionally reserved.
The ASCII quotation mark " and the ASCII apostrophe ' are not reserved, except in the very specific sense that a quoted attribute value must not literally contain the character used as quote, i.e. in "foo" the string foo must not contain " as such and in 'foo' the string foo must not contain ' as such.
The characters < > & " are reserved by XML format.
It means that you can use < and > chars only to define tags (<mytag></mytag>).
Double quotes (") are used to define values of attributes (<mytag attribute="value" />)
Ampersand (&) is used to write entities (& is used when you actually want to write ampersand, NOT &). Also, when you write url in your XML document, you should use &, not just &: www.aaa.com?a=1&b=2 - is wrong; www.aaa.com?a=1&b=2 - is good!
XHTML is based on XML, so what I have wrote applies to XHTML.
© ° £ - These are not reserved chars. These are entities defined specifically for XHTML, not for XML.
In XML you can simply write ©. In XHMTL you can also simply write ©, or use entity ©, or numeric entity &00A9;.
In addition to the other answers, it might help to know that there are also forbidden characters: all control characters in ASCII and ISO-8859-1 except TAB, LF, and CR.
https://www.w3.org/MarkUp/html3/specialchars.html

Can data-* attribute contain HTML tags?

I.E. <img src="world.jpg" data-title="Hello World!<br/>What gives?"/>
As far as I understand the guidelines, it is basically valid, but it's better to use HTML entities.
From the HTML 4 reference:
You should also escape & within attribute values since entity references are allowed within cdata attribute values. In addition, you should escape > as > to avoid problems with older user agents that incorrectly perceive this as the end of a tag when coming across this character in quoted attribute values.
From the HTML 5 reference:
Except where otherwise specified, attributes on HTML elements may have any string value, including the empty string. Except where explicitly stated, there is no restriction on what text can be specified in such attributes.
So the best thing to do, as #tdammers already says, is to escape these characters (quoting the W3C reference)
& to represent the & sign.
< to represent the < sign.
> to represent the > sign.
" to represent the " mark.
and decoding them from their entity values if they are to be used as HTML.
Providing you're serving it as text/html, then yes it's valid.
Note that not only is it possible to include markup inside attributes, but the HTML5 srcdoc attribute on the iframe element positively encourages it. The HTML5 draft says:
In the HTML syntax, authors need only
remember to use U+0022 QUOTATION MARK
characters (") to wrap the attribute
contents and then to escape all U+0022
QUOTATION MARK (") and U+0026
AMPERSAND (&) characters, ....
Note, that when served with an XML content type (e.g. application/xhtml+xml), it is not valid, or even well-formed.
I'd say yes, as in it's still valid HTML5. Older browsers (which ones?) may not parse correctly.
Section 3.2.4.1 Attributes of the current HTML5 draft says this:
Except where otherwise specified, attributes on HTML elements may have any string value, including the empty string. Except where explicitly stated, there is no restriction on what text can be specified in such attributes.
HTML tags inside attributes also validates at http://html5.validator.nu
No. That would be invalid - HTML does not allow < or > inside attributes.
<img src="world.jpg" data-title="Hello World!<br/>What gives?"/> would be valid, but it would display the <br/> literally, not as a newline.

Is > ever necessary?

I now develop websites and XML interfaces since 7 years, and never, ever came in a situation, where it was really necessary to use the > for a >. All disambiguition could so far be handled by quoting <, &, " and ' alone.
Has anyone ever been in a situation (related to, e.g., SGML processing, browser issues, XSLT, ...) where you found it indespensable to escape the greater-than sign with >?
Update: I just checked with the XML spec, where it says, for example, about character data in section 2.4:
Character Data
[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
So even there, the > isn't mentioned as something special, except from the ending sequence of a CDATA section.
This one single case, where the > is of any significance, would be the ending of a CDATA section, ]]>, but then again, if you'd quote it, the quote (i.e., the literal string ]]>) would land literally in the output (since it's CDATA).
You don't need to absolutely because almost any XML interpreter will understand what you mean. But still you use a special character without any protection if you do so.
XML is all about semantic, and this is not really semantic compliant.
About your update, you forgot this part :
The right angle bracket (>) may be represented using the string " > ", and must, for compatibility, be escaped using either " > " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.
The use case given in the documentation is more about something like this :
<xmlmarkup>
]]>
</xmlmarkup>
Here the ]]> part could be a problem with old SGML parsers, so it must be escaped into = ]]> for compatibilities reasons.
I used one not 19 hours ago to pass a strict xml validator. Another case is when you use them actually in html/xml content text (rather than attributes), like this: <.
Sure, a lax parser will accept most anything you throw at it, but if you're ever worried about XSS, < is your friend.
Update: Here's an example where you need to escape > in Firefox:
<?xml version="1.0" encoding="utf-8" ?>
<test>
]]>
</test>
Granted, it still isn't an example of having to escape a lone >.
Not so much as an author of (x)html documents, but more as a user of sloppy written comments fields in websites, that "offer" you to insert html.
I mean if you do your site the right way, you wouldn't hardcode your content anyway, right? So your call to htmlentities or whatever (long time no see, php) would take care of replacing special characters for you.
So sure, you wouldn't manually type > but I hope you take measures so > is automatically replaced.
I just thought of another example, where you need to quote > in HTML5 (not XHTML5) documents: If you need it in attributes without quotes (which is something, that can be argued of course).
<img src=arrow.png alt=>>
should be equivalent to XHTML
<img src="arrow.png" alt=">" />
But then again, (?<!X)HTML is not SGML.
Imagine that you have the following text this is a not a ]]> nice day and you decide to surround it by CDATA sections <![CDATA[this is a not a ]]> nice day]]>.
In order to avoid that (and for allowing parsing of SGML fragments with unterminated marked sections), clause 10.4 of ISO 8879:1986 declares that the occurrence of ]]> outside a marked
section is an error.
Also, in the times of SGML marked sections were very popular, as they were not only used for CDATA (as in XML), but also for RCDATA (only entities and character references allowed) and IGNORE and INCLUDE (which allowed for recognition of markup inside them).
For instance, in SGML one could write:
<!ENTITY %WHATTODO "INCLUDE">
<![%WHATTODO;[<b>]]></b>]]>
Which is equivalent to:
<b>]]></b>

Why is <br> an HTML element rather than an HTML entity?

Why indeed? Wouldn't something like &br; be more appropriate?
An HTML entity reference is, depending on HTML version either an SGML entity or an XML entity (HTML inherits entities from the underlying technology). Entities are a way of inserting chunks of content defined elsewhere into the document.
All HTML entities are single-character entities, and are hence basically the same as character references (technically they are different to character references, but as there are no multi-character entities defined, the distinction has no impact on HTML).
When an HTML processor sees, for example — it replaces it with the content of that entity reference with the appropriate entity, based on the section in the DTD that says:
<!ENTITY mdash CDATA "—" -- em dash, U+2014 ISOpub -->
So it replaces the entity reference with the entity — which is in turn a character reference that gets replaced by the character — (U+2014). In reality unless you are doing this with a general-purpose XML or SGML processor that doesn't understand HTML directly, this will really be done in one step.
Now, what would we replace your hypothetical &br; with to cause a line-break to happen? We can't do so with a newline character, or even the lesser known U+2028 LINE SEPARATOR (which semantically in plain text has the same meaning as <br/> in HTML), because they are whitespace characters which are not significant in most HTML code, which is something that you should be grateful for as writing HTML would be much harder if we couldn't format for readability within the source code.
What we need is not an entity, but a way to indicate semantically that the rendered content contains a line-break at this point. We also need to not indicate anything else (we can already indicate a line-break by beginning or ending a block element, but that's not what we want). The only reasonable way to do so is to have an element that means exactly that, and so we have the <br/> element, with its related tag being put into the source code.
A tag and a character entity reference exist for different reasons - character entities are stand-ins for certain characters (sometimes required as escape sequences - for example & for an ampersand &), tags are there for structure.
The reason the <br> tag exists is that HTML collapses whitespace. There needs to be a way to specify a hard line break - a place that has to have a line break. This is the function of the <br> tag.
There is no single character that has this meaning, though U+2028 LINE SEPARATOR has similar meaning, and even if it were to be used it would not help as it is considered to be whitespace and HTML would collapse it.
See the answers from #John Kugelman and #John Hanna for more detail on this aspect.
Not entirely related, there is another reason why a &br; character entity reference does not exist: a line break is defined in such a way that it could have more than one character, see the HTML 4 spec:
A line break is defined to be a carriage return (
), a line feed (
), or a carriage return/line feed pair.
Character entities are single character escapes, so cannot represent this, again in the HTML 4 spec:
A character entity reference is an SGML construct that references a character of the document character set.
You will see that all the defined character entities map to a single character. A line break/new line cannot be cleanly mapped this way, thus an entity is required instead of a character entity reference.
This is why a line break cannot be represented by a character entity reference.
Regardless, it not not needed as simply using the Enter key inserts a line break.
Entities are stand-ins for other characters or bits of text. In HTML they are used to represent characters that are hard to type (e.g. — for "—") or for characters that need to be escaped (& for "&"). What would a hypothetical &br; entity stand for?
It couldn't be \r or \n or \r\n as these are already easy enough to type (just press enter). The issue you're trying to workaround is that HTML collapses whitespace in most contexts and treats newlines as spaces. That is, \n is not a line break character, it is just whitespace like tabs and spaces.
An entity &br; would have to be replaced by some other text. What character do you use to represent the concept of "hard line break"? The standard line break character \n is exactly the right character, but unfortunately it's unsuitable since it's thrown in the generic "whitespace" bucket. You'd have to either overload some other control character to represent "hard line break", or use some extended Unicode character. When HTML was designed Unicode was only a nascent, still-developing standard, so that wasn't an option.
A <br> element was the simple, straightforward way to add the concept of "hard line break" to a document since no character could represent that concept.
In HTML all line breaks are treated as white space:
A line break is defined to be a carriage return (
), a line feed (
), or a carriage return/line feed pair. All line breaks constitute white space.
And white space does only separate words and sequences of white space is collapsed:
For all HTML elements except PRE, sequences of white space separate "words" (we use the term "word" here to mean "sequences of non-white space characters"). […]
[…]
Note that a sequence of white spaces between words in the source document may result in an entirely different rendered inter-word spacing (except in the case of the PRE element). In particular, user agents should collapse input white space sequences when producing output inter-word space. […]
This means that line breaks cannot be expressed by plain characters. And although there are certain special characters in Unicode to unambiguously separate lines and paragraphs, they are not specified to do this in HTML too:
Note that although 
 and 
 are defined in [ISO10646] to unambiguously separate lines and paragraphs, respectively, these do not constitute line breaks in HTML […]
That means there is no plain character or sequence of plain characters that is to mark a line break in HTML. And that’s why there is the BR element.
Now if you want to use &br; instead of <br>, you just need to declare the entity br to represent the value <br>:
<!ENTITY br "<br>">
Having this additional entity named br declared, a general-purpose XML or SGML processor will replace every occurrence of the entity reference &br; with the value it represents (<br>). An example document:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd" [
<!ENTITY br "<br>">
]>
<HTML>
<HEAD>
<TITLE>My first HTML document</TITLE>
</HEAD>
<BODY>
<P>Hello &br;world!
</BODY>
</HTML>
Entities are content, tags are structure or layout (very roughly speaking). It seems whoever made the <br> a tag decided that breaking a line has more to do with structure and layout than with content. Not being able to actually "see" a <br>
I'd tend to agree. Oh and I'm making this up as I go so feel free to disagree ;)
HTML is a mark-up language - it represents the structure of a document, not how that document should appear visually. Take the <EM> tag as an example - it tells user-agents that they should give emphasis to any text that is placed between the opening and closing <EM> tags. However, it does not state how that emphasis should be represented. Yes, most visual web-browsers will place the text in italics, but this is only convention. Other browsers, such as monochrome text-only browsers may display the text in inverse. A screen reader might read the text in a louder voice, or change the pronunciation. A search-engine spider might decide the text is more important than other elements.
The same goes for the <BR> tag - it isn't just another character entity, it actually represents a break in the document structure. A <BR> is not just a replacement for a newline character, but is a "semantic" part of the document and how it is structured. This is similar to the way an <H1> is not just a way of making text bigger and bolder, but is an integral part of the way the document is structured.
br elements can be styled, though. How would you style an HTML entity? Because they're elements it makes them more flexible.
Yes. An HTML entity would be more appropriate, as a break tag cannot contain text and behaves much like a newline.
That's just not the way things are, though. Too late. I can't tell you the number of non-XML-compatible HTML documents I've had to deal with because of unclosed break tags...