Is > ever necessary? - html

I now develop websites and XML interfaces since 7 years, and never, ever came in a situation, where it was really necessary to use the > for a >. All disambiguition could so far be handled by quoting <, &, " and ' alone.
Has anyone ever been in a situation (related to, e.g., SGML processing, browser issues, XSLT, ...) where you found it indespensable to escape the greater-than sign with >?
Update: I just checked with the XML spec, where it says, for example, about character data in section 2.4:
Character Data
[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
So even there, the > isn't mentioned as something special, except from the ending sequence of a CDATA section.
This one single case, where the > is of any significance, would be the ending of a CDATA section, ]]>, but then again, if you'd quote it, the quote (i.e., the literal string ]]>) would land literally in the output (since it's CDATA).

You don't need to absolutely because almost any XML interpreter will understand what you mean. But still you use a special character without any protection if you do so.
XML is all about semantic, and this is not really semantic compliant.
About your update, you forgot this part :
The right angle bracket (>) may be represented using the string " > ", and must, for compatibility, be escaped using either " > " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.
The use case given in the documentation is more about something like this :
<xmlmarkup>
]]>
</xmlmarkup>
Here the ]]> part could be a problem with old SGML parsers, so it must be escaped into = ]]> for compatibilities reasons.

I used one not 19 hours ago to pass a strict xml validator. Another case is when you use them actually in html/xml content text (rather than attributes), like this: <.
Sure, a lax parser will accept most anything you throw at it, but if you're ever worried about XSS, < is your friend.
Update: Here's an example where you need to escape > in Firefox:
<?xml version="1.0" encoding="utf-8" ?>
<test>
]]>
</test>
Granted, it still isn't an example of having to escape a lone >.

Not so much as an author of (x)html documents, but more as a user of sloppy written comments fields in websites, that "offer" you to insert html.
I mean if you do your site the right way, you wouldn't hardcode your content anyway, right? So your call to htmlentities or whatever (long time no see, php) would take care of replacing special characters for you.
So sure, you wouldn't manually type > but I hope you take measures so > is automatically replaced.

I just thought of another example, where you need to quote > in HTML5 (not XHTML5) documents: If you need it in attributes without quotes (which is something, that can be argued of course).
<img src=arrow.png alt=>>
should be equivalent to XHTML
<img src="arrow.png" alt=">" />
But then again, (?<!X)HTML is not SGML.

Imagine that you have the following text this is a not a ]]> nice day and you decide to surround it by CDATA sections <![CDATA[this is a not a ]]> nice day]]>.
In order to avoid that (and for allowing parsing of SGML fragments with unterminated marked sections), clause 10.4 of ISO 8879:1986 declares that the occurrence of ]]> outside a marked
section is an error.
Also, in the times of SGML marked sections were very popular, as they were not only used for CDATA (as in XML), but also for RCDATA (only entities and character references allowed) and IGNORE and INCLUDE (which allowed for recognition of markup inside them).
For instance, in SGML one could write:
<!ENTITY %WHATTODO "INCLUDE">
<![%WHATTODO;[<b>]]></b>]]>
Which is equivalent to:
<b>]]></b>

Related

Having issues with xml to have subject and bcc with mailto: . It is only letting me do one or the other [duplicate]

I want to use & character but the Visual Studio throw exception. How Have to write this?
Replace any & with
&
It would load properly in XML.
There are two ways to represent characters which have special meaning in XML (such as < and >) in an XML document.
CDATA sections
As entities
A CDATA section can only be used in places where you could have a text node.
<foo><![CDATA[Here is some data including < and > (and &!) ]]></foo>
The caveat is that you can't include the sequence ]]> as data in a CDATA section.
Entities can be used everywhere (except inside CDATA sections) and consist of &, then an identifier, then ;.
<foo>Here is some data including < and > (and &!)</foo>
You can use & or &, or you can wrap it in a CDATA section like this:
<![CDATA[Foo & Bar]]>
Use the entity, &.
(+1 to the other answers that discuss CDATA.)
You may use a CDATA :
Everything inside a CDATA section is ignored by the parser.
A CDATA section starts with <![CDATA[ and ends with ]]>

How to avoid <> in HTML?

I would like to paste into my HTML code a phrase
"<car>"
and I would like that this word "car" will be between <>. In some text will be
"<car>"
and this is not a HTML expression. The problem is that when I put it the parser think that this is the HTML syntax how to avoid it. Is there any expression which need to be between this?
replace < by < and > by >
Live on JSFiddle.
< and > are special characters, more special characters in HTML you can find here.
More about HTML entities you can find here.
use > for > and < for <
$gt;car<
you need to use special character .. To know more about Special Character link here
CODE:
<p>"<car >"</p>
OUTPUT:
"<car>"
< = < less than
> = > greater than
The same applies for XML too. Take a look here, special characters for HTML.
If you really want LESS THAN SIGN “<” to appear visibly in page content, write it as &, so that it will not be treated as starting a tag. Ref.: 5.3.2 Character entity references in HTML 4.01.
So you would write
<car>
If you like, you can write “>” as > for symmetry, but there is no need to.
But if you really want to put something in angle brackets, e.g. using a mathematical notation, rather than a markup notation (as in HTML and XML), consider using U+27E8 MATHEMATICAL LEFT ANGLE BRACKET “⟨” and U+27E9 MATHEMATICAL RIGHT ANGLE BRACKET “⟩”. They cause no problems to HTML markup, as they are not markup-significant. If you don’t know how to type them in your authoring environment, you can use character references for them:
⟨car⟩
This would result in ⟨car⟩, though as always with less common special characters, you would need to consider character (font) problems.
You can use the "greater than" and "less than" entities:
<car>
The W3C, the organization responsible for setting web standards, has some pretty good documentation on HTML entities. They consist of an ampersand followed by an entity name followed by a semicolon (&name;) or an ampersand followed by a pound sign followed by an entity number followed by a semicolon (&#number;). The link I provided has a table of common HTML entities.

What Are The Reserved Characters In (X)HTML?

Yes, I've googled it, and surprisingly got confusing answers.
One page says that < > & " are the only reserved characters in (X)HTML. No doubt, this makes sense.
This page says < > & " ' are the reserved characters in (X)HTML. A little confusing, but okay, this makes sense too.
And then comes this page which says < > & " © ° £ and non-breaking space (&nbsp) are all reserved characters in (X)HTML. This makes no sense at all, and pretty much adds to my confusion.
Can someone knowledgeable, who actually do know this stuff, clarify which the reserved characters in (X)HTML actually are?
EDIT: Also, should all the reserved characters in code be escaped when wrapped in <pre> tag? or is it just these three -- < > & ??
The XHTML 1.0 specification states at http://www.w3.org/TR/2002/REC-xhtml1-20020801/#xhtml:
XHTML 1.0 [...] is a reformulation of the three HTML 4 document types as
applications of XML 1.0 [XML].
The XML 1.0 specification states at http://www.w3.org/TR/2008/REC-xml-20081126/#syntax:
Character Data and Markup: Text consists of intermingled character
data and markup. [...] The ampersand character (&) and the left angle
bracket (<) MUST NOT appear in their literal form, except when used as
markup delimiters, or within a comment, a processing instruction, or a
CDATA section. If they are needed elsewhere, they MUST be escaped
using either numeric character references or the strings "&" and
"<" respectively. The right angle bracket (>) may be represented
using the string ">", and MUST, for compatibility, be escaped
using either ">" or a character reference when it appears in the
string "]]>" in content, when that string is not marking the end of
a CDATA section.
This means that when writing the text parts of an XHTML document you must escape &, <, and >.
You can escape a lot more, e.g. ü for umlaut u. You can as well state that the document is encoded in for example UTF-8 and write the byte sequence 0xc3bc instead to get the same umlaut u.
When writing the element parts (col. "tags") of the document, there are different rules. You have to take care of ", ' and a lot of rules concerning comments, CDATA and so on. There are also rules which characters can be used in element and attribute names. You can look it up in the XML specification, but in the end it comes down to: for element and attribute names, use letters, digits and "-"; do not use "_". For attribute values, you must escape & and (depending on the quote style) either ' or ".
If you use one of the many libraries to write XML / XHTML documents, somebody else has already taken care of this and you just have to tell the library to write text or elements. All the escaping is done the in the background.&
Only < and & need to be escaped. Inside attributes, " or ' (depending on which quote style you use for the attribute's value) needs to be escaped, too.
<a href="#" onclick='here you can use " safely'></a>
By writing "(X)HTML", you are asking (at least) two different questions.
By the HTML rules, with "HTML" meaning any HTML version up to and including HTML 4.01, only "<" and "&" are reserved. The rules are somewhat complex. They should not not appear literally except in their syntactic use in tags, entity references, and character references. But by the formal rules, they may appear literally e.g. in the context "A & B" or "A < B" (but A&B be formally wrong, and so would A<B).
The XHTML rules, based on XML, are somewhat stricter, simpler: "<" and "&" are unconditionally reserved.
The ASCII quotation mark " and the ASCII apostrophe ' are not reserved, except in the very specific sense that a quoted attribute value must not literally contain the character used as quote, i.e. in "foo" the string foo must not contain " as such and in 'foo' the string foo must not contain ' as such.
The characters < > & " are reserved by XML format.
It means that you can use < and > chars only to define tags (<mytag></mytag>).
Double quotes (") are used to define values of attributes (<mytag attribute="value" />)
Ampersand (&) is used to write entities (& is used when you actually want to write ampersand, NOT &). Also, when you write url in your XML document, you should use &, not just &: www.aaa.com?a=1&b=2 - is wrong; www.aaa.com?a=1&b=2 - is good!
XHTML is based on XML, so what I have wrote applies to XHTML.
© ° £ - These are not reserved chars. These are entities defined specifically for XHTML, not for XML.
In XML you can simply write ©. In XHMTL you can also simply write ©, or use entity ©, or numeric entity &00A9;.
In addition to the other answers, it might help to know that there are also forbidden characters: all control characters in ASCII and ISO-8859-1 except TAB, LF, and CR.
https://www.w3.org/MarkUp/html3/specialchars.html

Encoding rules for URL with the `javascript:` pseudo-protocol?

Is there any authoritative reference about the syntax and encoding of an URL for the pseudo-protocol javascript:? (I know it's not very well considered, but anyway it's useful for bookmarklets).
First, we know that standard URLs follow the syntax:
scheme://username:password#domain:port/path?query_string#anchor
but this format doesn't seem to apply here. Indeed, it seems, it would be more correct to speak of URI instead of URL : here is listed the "unofficial" format javascript:{body}.
Now, then, which are the valid characters for such a URI, (what are the escape/unescape rules) when embedding in a HTML?
Specifically, if I have the code of a javascript function and I want to embed it in a javascript: URI, which are the escape rules to apply?
Of course one could escape every non alfanumeric character, but that would be overkill and make the code unreadable. I want to escape only the necessary characters.
Further, it's clear that it would be bad to use some urlencode/urldecode routine pair (those are for query string values), we don't want to decode '+' to spaces, for example.
My findings, so far:
First, there are the rules for writing a valid HTML attribute value: but here the standard only requires (if the attribute value if enclosed in quotes) an arbitrary CDATA (actually a %URI, but HTML itself does not impose additional validation at its level: any CDATA will validate).
Some examples:
<a href="javascript:alert('Hi!')"> (1)
<a href="javascript:if(a > b && 1 < 0) alert( b ? 'hi' : 'bye')"> (2)
<a href="javascript:if(a>b &&& 1 < 0) alert( b ? 'hi' : 'bye')"> (3)
Example (1) is valid. But also example (2) is valid HTML 4.01 Strict. To make it valid XHTML we only need to escape the XML special characters < > & (example 3 is valid XHTML 1.0 Strict).
Now, is example (2) a valid javascript: URI ? I'm not sure, but I'd say it's not.
From RFC 2396: an URI is subject to some addition restrictions and, in particular, the escape/unescape via %xx sequences. And some characters are always prohibited:
among them spaces and {}# .
The RFC also defines a subset of opaque URIs: those that do not have hierarchical components, and for which the separating charactes have no special meaning (for example, they dont have a 'query string', so the ? can be used as any non special character). I assume javascript: URIs should be considered among them.
This would imply that the valid characters inside the 'body' of a javascript: URI are
a-zA-Z0-9
_|. !~*'();?:#&=+$,/-
%hh : (escape sequence, with two hexadecimal digits)
with the additional restriction that it can't begin with /.
This stills leaves out some "important" ASCII characters, for example
{}#[]<>^\
Also % (because it's used for escape sequences), double quotes " and (most important) all blanks.
In some respects, this seems quite permissive: it's important to note that + is valid (and hence it should not be 'unescaped' when decoding, as a space).
But in other respects, it seems too restrictive. Braces and brackets, specially: I understand that they are normally used unescaped and browsers have no problems.
And what about spaces? As braces, they are disallowed by the RFC, but I see no problem in this kind of URI. However, I see that in most bookmarklets they are escaped as "%20". Is there any (empirical or theorical) explanation for this?
I still don't know if there are some standard functions to make this escape/unescape (in mainstream languages) or some sample code.
javascript: URLs are currently part of the HTML spec and are specified at https://html.spec.whatwg.org/multipage/browsing-the-web.html#the-javascript:-url-special-case

what actually is PCDATA and CDATA?

it seems that a loose definition of PCDATA and CDATA is that
PCDATA is character data, but is to be parsed.
CDATA is character data, and is not to be parsed.
but then someone told me that CDATA is actually parsed or PCDATA is actually not parsed... so it is a bit of a confusion. Does anyone know the real deal is?
Update: I actually added the PCDATA definition on Wikipedia... so don't take that answer too seriously as that's only my rough understanding of it.
From WIKI:
PCDATA
Simply speaking, PCDATA stands for Parsed Character Data. That means the characters are to be parsed by the XML, XHTML, or HTML parser. (< will be changed to <, <p> will be taken to mean a paragraph tag, etc). Compare that with CDATA, where the characters are not to be parsed by the XML, XHTML, or HTML parser.
CDATA
The term CDATA, meaning character data, is used for distinct, but related purposes in the markup languages SGML and XML. The term indicates that a certain portion of the document is general character data, rather than non-character data or character data with a more specific, limited structure.
Both PCDATA and CDATA are parsed. They are both character data.
They both must include only valid characters. For example if your document encoding is UTF-8, the content of CDATA sections must still be valid UTF-8 characters. So random binary data will probably prevent the document from being well-formed. Also CDATA sections are still parsed, if only to find the end section tag. But other markup-like characters, like <, > and & are ignored and passed as-is by the parser.
OTOH in PCDATA literal < and & (and ' or " in attribute values) must be escaped, or they will be interpreted as markup. Entities will also be expanded.
So yes, CDATA sections are indeed parsed. I am not sure why you were told that PCDATA is not parsed though.
PCDATA - Parsed Character Data
CDATA - (Unparsed) Character Data
http://www.w3schools.com/XML/xml_cdata.asp
PCDATA is text that will be parsed by a parser. Tags inside the text
will be treated as markup and entities will be expanded.
CDATA is text that will not be parsed by a parser. Tags inside the text will
not be treated as markup and entities will not be expanded.
By default, everything is PCDATA. In the following example, ignoring the root, <bar> will be parsed, and it'll have no content, but one child.
<?xml version="1.0"?>
<foo>
<bar><test>content!</test></bar>
</foo>
When we want to specify that an element will only contain text, and no child elements, we use the keyword PCDATA, because this keyword specifies that the element must contain parsable character data – that is , any text except the characters less-than (<) , greater-than (>) , ampersand (&), quote(') and double quote (").
In the next example, bar is CDATA, and isn't parsed, and has the content "<test>content!</test>".
<?xml version="1.0"?>
<foo>
<bar><![CDATA[<test>content!</test>]]></bar>
</foo>
There are several content models in SGML. The #PCDATA content model says that an element may contain plain text. The "parsed" part of it means that markup (including PIs, comments and SGML directives) in it is parsed instead of displayed as raw text. It also means that entity references are replaced.
Another type of content model allowing plain text contents is CDATA. In XML, the element content model may not implicitly be set to CDATA, but in SGML, it means that markup and entity references are ignored in the contents of the element. In attributes of CDATA type however, entity references are replaced.
In XML #PCDATA is the only plain text content model. You use it if you at all want to allow text contents in the element. The CDATA content model may be used explicitly through the CDATA block markup in #PCDATA, but element contents may not be defined as CDATA per default.
In a DTD, the type of an attribute that contains text must be CDATA. The CDATA keyword in an attribute declaration has a different meaning than the CDATA section in an XML document. In CDATA section all characters are legal (including <,>,&,’ and “ characters) except the “]]>” end tag.
#PCDATA is not appropriate for the type of an attribute. It is used for the type of "leaf" text.
#PCDATA is prepended by a hash (also known as a "hashtag" or octothorp) simply for historical reasons.
Your first definition is correct.
PCDATA is parsed which means that entities are expanded and that text is treated as markup. CDATA is not parsed by an XML parser.
If only elements were set to CDATA by default in the XHTML DTDs, it would save a lot of ugly manual overrides... Why would script blocks contain other elements? If there are such elements, they are handled by the JS interpreter in DOM manipulation actions -- in which case they should still be completely ignored by the XML parser before document insertion and rendering. I suppose it may have been designed to force the use of external script resource files, which is a ultimately a good thing.