Why are certain characters prohibited in the HTML5 spec? - html

According to the HTML5 spec (just after the table), the following characters are prohibited:
Otherwise, return a character token for the Unicode character whose code point is that number. Additionally, if the number is in the range 0x0001 to 0x0008, 0x000D to 0x001F, 0x007F to 0x009F, 0xFDD0 to 0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or 0x10FFFF, then this is a parse error.
What was the reasoning or motivation behind this exclusion?

They're code points that cause interoperability problems, either with XML/XHTML documents or with extant HTML parsers. As none of them have any obvious valid use they should be avoided.
The noncharacters (U+FDD0–U+FDEF and every code point of the form U+nFFFE or U+nFFFF) and most of the C0 control characters (U+0000–U+0008, U+000B, U+000C, U+000E–U+001F) are either invalid or discouraged in XML 1.0. Character references in the range 0x80–0x9F produce different results in XML and HTML parsers due to the substitutions in the immediately-preceding table (and there are also many non-browser HTML parsers that do not implement this weird historical quirk).
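As a rough illustration, the list quoted from the spec can be written as a small predicate. This is only a sketch: the function name is made up for the example, and the ranges are transcribed from the quote above rather than from any parser's source. The last line shows the 0x80–0x9F quirk using Python's standard html.unescape, which applies the HTML substitution table.

```python
import html

def is_flagged_code_point(n: int) -> bool:
    """Return True if a numeric reference to n is in the quoted parse-error list.

    Hypothetical helper; the ranges are copied from the spec text quoted above.
    """
    if 0x0001 <= n <= 0x0008 or 0x000D <= n <= 0x001F or 0x007F <= n <= 0x009F:
        return True  # control characters and DELETE
    if 0xFDD0 <= n <= 0xFDEF or n == 0x000B:
        return True  # noncharacter block, plus VERTICAL TAB
    # Last two code points of every plane: U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF
    return 0 <= n <= 0x10FFFF and (n & 0xFFFE) == 0xFFFE

for cp in (0x09, 0x0B, 0x41, 0x80, 0xFDD0, 0x1FFFF):
    status = "parse error" if is_flagged_code_point(cp) else "ok"
    print(f"U+{cp:04X}: {status}")

# An HTML-style decoder maps &#x80; to the euro sign (U+20AC),
# whereas an XML parser would yield the control character U+0080.
print(repr(html.unescape("&#x80;")))
```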

Related

Random Letter html Tag

I was wondering if you can use a random letter as an html tag. Like, f isn't a tag, but I tried it in some code and it worked just like a span tag. Sorry if this is a bad question, I've just been curious about it for a while, and I couldn't find anything online.
I was wondering if you can use a random letter as an html tag.
Yes and no.
"Yes" - in that it works, but it isn't correct: when you have something like <z> it only works because the web (HTML+CSS+JS) has a degree of forwards compatibility built-in: browsers will render HTML elements that they don't recognize basically the same as a <span> (i.e. an inline element that doesn't do anything other than reify a range of the document's text).
However, to use HTML5 Custom Elements correctly you need to conform to the Custom Elements specification which states:
The name of a custom element must contain a dash (-). So <x-tags>, <my-element>, and <my-awesome-app> are all valid names, while <tabs> and <foo_bar> are not. This requirement is so the HTML parser can distinguish custom elements from regular elements. It also ensures forward compatibility when new tags are added to HTML.
So if you use <my-z> then you'll be fine.
The HTML Living Standard document, as of 2021-12-04, indeed makes an explicit reference to forward-compatibility in its list of requirements for custom element names:
https://html.spec.whatwg.org/#valid-custom-element-name
They start with an ASCII lower alpha, ensuring that the HTML parser will treat them as tags instead of as text.
They do not contain any ASCII upper alphas, ensuring that the user agent can always treat HTML elements ASCII-case-insensitively.
They contain a hyphen, used for namespacing and to ensure forward compatibility (since no elements will be added to HTML, SVG, or MathML with hyphen-containing local names in the future).
They can always be created with createElement() and createElementNS(), which have restrictions that go beyond the parser's.
Apart from these restrictions, a large variety of names is allowed, to give maximum flexibility for use cases like <math-α> or <emotion-😍>.
So, by example:
<a>, <q>, <b>, <i>, <u>, <p>, <s>
No: these single-letter elements are already used by HTML.
<z>
No: element names that don't contain a hyphen (-) cannot be custom elements and will be interpreted by present-day browsers as invalid/unrecognized markup that they will nevertheless (largely) treat the same as a <span> element.
<a:z>
No: using a colon to use an XML element namespace is not a thing in HTML5 unless you're using XHTML5.
<-z>
No - the element name must start with a lowercase ASCII letter from a to z, so a leading - is not allowed.
<a-z>
Yes - this is fine.
<a-> and <a-->
Unsure - these two names are curious:
The HTML spec says the name must match the grammar rule [a-z] (PCENChar)* '-' (PCENChar)*.
The * denotes "zero-or-more" which is odd, because that implies the hyphen doesn't need to be followed by another character.
PCENChar represents a huge range of visible characters permitted in element names. Curiously, this includes - itself, so by that rule <a--> should be valid.
But note that -- is a reserved character sequence in the greater SGML-family (including HTML and XML) which may cause weirdness. YMMV!
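The grammar rule can be tried out with a regular expression. The following is a minimal sketch, not an authoritative validator: the PCENChar ranges and the list of reserved names were transcribed by hand from the HTML Standard's "valid custom element name" definition as it stood around the date mentioned above, and the helper name is invented for this example.

```python
import re

# PCENChar character class (hand-transcribed; verify against the spec before relying on it).
_PCEN = (
    r"[-._0-9a-z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D\u037F-\u1FFF"
    r"\u200C-\u200D\u203F-\u2040\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    r"\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF]"
)

# [a-z] (PCENChar)* '-' (PCENChar)*
_NAME = re.compile(rf"[a-z]{_PCEN}*-{_PCEN}*")

# Hyphenated names already taken by SVG and MathML, which the spec excludes.
_RESERVED = {
    "annotation-xml", "color-profile", "font-face", "font-face-src",
    "font-face-uri", "font-face-format", "font-face-name", "missing-glyph",
}

def is_valid_custom_element_name(name: str) -> bool:
    """Check a name against the grammar rule and the reserved-name list."""
    return _NAME.fullmatch(name) is not None and name not in _RESERVED

for candidate in ("z", "a:z", "-z", "a-z", "a-", "a--", "math-α", "emotion-😍"):
    print(candidate, is_valid_custom_element_name(candidate))
```

Under this reading of the grammar, both a- and a-- match, which agrees with the observation that the hyphen does not need to be followed by another character.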

Are there some valid HTML entities without the semicolon?

Looking at this official entities.json file, some of the entities are defined without an ending semicolon.
For example:
"&Acirc": { "codepoints": [194], "characters": "\u00C2" },
"&Acirc;": { "codepoints": [194], "characters": "\u00C2" },
Where is that documented in HTML5? Or is that a browser thing¹?
¹ thing as in extension for backward compatibility.
The list of HTML named character references is defined at https://html.spec.whatwg.org/multipage/named-characters.html and yes, some of them don't have a trailing ;, e.g. &not
Named HTML entities without a semicolon are not valid, per the HTML spec, but browsers are required to support some of them anyway. (This spec pattern - where something is officially illegal for you to do as a HTML author, but still has a single unambiguously specified behaviour that browsers must implement - is used a lot in the HTML spec.)
There are a few pertinent sections in the spec:
§13.1.4 Character references
Pertinent quote:
Named character references
The ampersand must be followed by one of the names given in the named character references section, using the same case. The name must be one that is terminated by a U+003B SEMICOLON character (;).
§13.2 Parsing HTML Documents, especially 13.2.5.73 Named character reference state (if you really want to pick through the horrible hard-to-read implementation details of the parsing algorithm).
The non-normative §1.11.2 Syntax errors, which contains some explanation on why the spec makes references without semicolons errors (though I don't personally find it hugely compelling):
Errors involving fragile syntax constructs
There are syntax constructs that, for historical reasons, are relatively fragile. To help reduce the number of users who accidentally run into such problems, they are made non-conforming.
Example
For example, the parsing of certain named character references in attributes happens even with the closing semicolon being omitted. It is safe to include an ampersand followed by letters that do not form a named character reference, but if the letters are changed to a string that does form a named character reference, they will be interpreted as that character instead.
In this fragment, the attribute's value is "?bill&ted":
<a href="?bill&ted">Bill and Ted</a>
In the following fragment, however, the attribute's value is actually "?art©", not the intended "?art&copy", because even without the final semicolon, "&copy" is handled the same as "&copy;" and thus gets interpreted as "©":
<a href="?art&copy">Art and Copy</a>
To avoid this problem, all named character references are required to end with a semicolon, and uses of named character references without a semicolon are flagged as errors.
Thus, the correct way to express the above cases is as follows:
<a href="?bill&ted">Bill and Ted</a> <!-- &ted is ok, since it's not a named character reference -->
<a href="?art&amp;copy">Art and Copy</a> <!-- the & has to be escaped, since &copy is a named character reference -->
As a final bit of corroboration that entities like &Acirc are invalid but work anyway, we can use this test document:
<!DOCTYPE html>
<html lang="en">
<title>Test page</title>
<div>&Acirc</div>
</html>
Open it in Chrome, and it works and shows us an A with a circumflex accent:
But paste it into the Nu Html Checker (endorsed by WHATWG), and we get an error stating "Named character reference was not terminated by a semicolon.":
i.e. it works, but it's invalid.
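The same "works but is invalid" behaviour can be observed outside a browser. As a small sketch, Python's standard html.unescape follows the HTML5 named-reference table, including the legacy names that lack a semicolon. It decodes plain text only and does not model the extra attribute-value rules of a real HTML parser, so treat it as an approximation.

```python
import html

# Legacy references are decoded even without the trailing semicolon.
print(html.unescape("&Acirc"))         # the character U+00C2
print(html.unescape("?art&copy"))      # &copy is a legacy name, so it is decoded

# Letters that do not form a named reference are left alone.
print(html.unescape("?bill&ted"))

# Escaping the ampersand keeps the intended text.
print(html.unescape("?art&amp;copy"))
```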
I wrote a Python program to get some numbers, and found that:
Of the 2231 entities in total, 106 (4.75%) are defined without a trailing semicolon.
All those entities:
&AElig, &AMP, &Aacute, &Acirc, &Agrave, &Aring, &Atilde, &Auml, &COPY, &Ccedil, &ETH, &Eacute, &Ecirc, &Egrave, &Euml, &GT, &Iacute, &Icirc, &Igrave, &Iuml, &LT, &Ntilde, &Oacute, &Ocirc, &Ograve, &Oslash, &Otilde, &Ouml, &QUOT, &REG, &THORN, &Uacute, &Ucirc, &Ugrave, &Uuml, &Yacute, &aacute, &acirc, &acute, &aelig, &agrave, &amp, &aring, &atilde, &auml, &brvbar, &ccedil, &cedil, &cent, &copy, &curren, &deg, &divide, &eacute, &ecirc, &egrave, &eth, &euml, &frac12, &frac14, &frac34, &gt, &iacute, &icirc, &iexcl, &igrave, &iquest, &iuml, &laquo, &lt, &macr, &micro, &middot, &nbsp, &not, &ntilde, &oacute, &ocirc, &ograve, &ordf, &ordm, &oslash, &otilde, &ouml, &para, &plusmn, &pound, &quot, &raquo, &reg, &sect, &shy, &sup1, &sup2, &sup3, &szlig, &thorn, &times, &uacute, &ucirc, &ugrave, &uml, &uuml, &yacute, &yen, &yuml
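The original program is not shown, so the following is only a sketch of one way to reproduce such a count. It assumes that Python's html.entities.html5 dictionary mirrors the entities.json file (its keys omit the leading &), which is how the standard library documents it.

```python
from html.entities import html5

# Names defined without a trailing semicolon, e.g. "Acirc" next to "Acirc;".
no_semicolon = sorted(name for name in html5 if not name.endswith(";"))

print(f"total entities: {len(html5)}")
print(f"without semicolon: {len(no_semicolon)} ({len(no_semicolon) / len(html5):.2%})")
print(", ".join("&" + name for name in no_semicolon))

# Every semicolon-less name should have a twin with a semicolon and the same value.
print(all(html5[name] == html5[name + ";"] for name in no_semicolon))
```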

What are the exponent characters (in non-formatted text)? How can I create these exponent characters?

I'm searching for a list of exponents like ¹²³ and so on and the same with letters. Note these still remain superscripted even in plain text.
Does something like these exist? If not, how can I create those?
(I need them for a website-project)
Unicode versions of superscripted/subscripted characters exist for all ten digits but not for all letters. They remain superscripted/subscripted in a plain-text environment without the need for formatting tags such as <sup>/<sub>.
However (as of Unicode 14), the letters that do exist are scattered across different Unicode ranges and are in fact used mainly for phonetic transcription. They also serve compatibility purposes, especially where the text cannot carry markup superscripts and subscripts.
Exponent characters:
These are mostly used for mathematical and referencing usage.
- ⁰ [U+2070]
- ¹ [U+00B9, Latin-1 Supplement]
- ² [U+00B2, Latin-1 Supplement]
- ³ [U+00B3, Latin-1 Supplement]
- ⁴ [U+2074]
- ⁵ [U+2075]
- ⁶ [U+2076]
- ⁷ [U+2077]
- ⁸ [U+2078]
- ⁹ [U+2079]
- ⁺ [U+207A]
- ⁻ [U+207B]
- ⁼ [U+207C]
- ⁽ [U+207D]
- ⁾ [U+207E]
- ⁿ [U+207F]
- ⁱ [U+2071]
The "linear", "squared", and "cubed" superscripts (¹, ², ³) are the most familiar and are found in Latin-1 Supplement. All the others are found in Superscripts and Subscripts. For the digits outside Latin-1 Supplement (0 and 4 to 9), add the digit's value to 0x2070 to obtain its code point. See this Wikipedia article and the official Unicode codepage segment.
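The code point arithmetic can be sketched in a few lines. The function names below are made up for this example; the mapping uses only the code points listed above (U+2070 plus the digit, with the three Latin-1 Supplement exceptions and U+207B for a minus sign).

```python
# Digits 1, 2 and 3 live in Latin-1 Supplement; the rest are at U+2070 + digit.
_LATIN1_SUPERSCRIPTS = {1: 0x00B9, 2: 0x00B2, 3: 0x00B3}
_SUPERSCRIPT_MINUS = "\u207b"

def superscript_digit(digit: int) -> str:
    """Return the Unicode superscript character for a single digit 0-9."""
    if not 0 <= digit <= 9:
        raise ValueError("digit must be between 0 and 9")
    return chr(_LATIN1_SUPERSCRIPTS.get(digit, 0x2070 + digit))

def to_superscript(number: int) -> str:
    """Render an integer as a plain-text exponent."""
    return "".join(
        _SUPERSCRIPT_MINUS if ch == "-" else superscript_digit(int(ch))
        for ch in str(number)
    )

print("x" + to_superscript(2))
print("10" + to_superscript(-19))
```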
Interesting notes
There are also subtle differences between <sup> superscripts and Unicode superscripts: the Unicode ones are entirely separate code points, and some fonts include professionally designed superscript glyphs, because <sup> superscripts may look thin.
Compare x² with x<sup>2</sup>, and similarly x⁺ with x<sup>+</sup> (in each pair the first is Unicode, the second is markup).
The best solution is to use markup, such as <sup>.
You can't create the characters, but you can format them as superscripts if you are generating HTML.
To find which ones exist, you just have to use a Unicode character search resource and look for "superscript" to get a listing.
This query, for example:
https://www.fileformat.info/info/unicode/char/search.htm?q=superscript&preview=entity
As you can see, all digits are available (more than once, even), but very few letters.
However, if you intend to generate HTML output, the <sup> tag will work for any text you want, and give the necessary semantic meaning to the text - you can read about it and try it online here: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/sup

Is &gt; ever necessary?

I have been developing websites and XML interfaces for 7 years now, and have never been in a situation where it was really necessary to use &gt; for a >. All disambiguation could so far be handled by quoting <, &, " and ' alone.
Has anyone ever been in a situation (related to, e.g., SGML processing, browser issues, XSLT, ...) where you found it indispensable to escape the greater-than sign with &gt;?
Update: I just checked with the XML spec, where it says, for example, about character data in section 2.4:
Character Data
[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
So even there, the > isn't mentioned as something special, except as part of the ending sequence of a CDATA section.
The one single case where the > is of any significance would be the ending of a CDATA section, ]]>, but then again, if you quoted it, the quoted form (i.e., the literal string ]]&gt;) would land literally in the output (since it's CDATA).
You don't absolutely need to, because almost any XML parser will understand what you mean. But if you leave it unescaped, you are still using a special character without any protection.
XML is all about semantics, and this is not really semantically clean.
About your update, you forgot this part:
The right angle bracket (>) may be represented using the string " &gt; ", and must, for compatibility, be escaped using either " &gt; " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.
The use case given in the documentation is more about something like this :
<xmlmarkup>
]]>
</xmlmarkup>
Here the ]]> part could be a problem with old SGML parsers, so it must be escaped as ]]&gt; for compatibility reasons.
I used one not 19 hours ago to pass a strict XML validator. Another case is when you actually use them in HTML/XML content text (rather than attributes), like this: &lt;.
Sure, a lax parser will accept almost anything you throw at it, but if you're ever worried about XSS, &lt; is your friend.
Update: Here's an example where you need to escape > in Firefox:
<?xml version="1.0" encoding="utf-8" ?>
<test>
]]>
</test>
Granted, it still isn't an example of having to escape a lone >.
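The same behaviour can be checked without a browser. The sketch below uses Python's xml.etree.ElementTree, which is built on the expat parser, together with xml.sax.saxutils.escape, which escapes &, < and > by default. The helper name is made up for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def is_well_formed(document: str) -> bool:
    """Return True if the XML parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# A literal ]]> in content is rejected; escaping the > makes it acceptable.
print(is_well_formed("<test>]]></test>"))
print(is_well_formed("<test>]]&gt;</test>"))

# A lone > in content is fine, which matches the question's experience.
print(is_well_formed("<test>a > b</test>"))

# The standard escaping helper replaces > anyway, so ]]> never reaches the output.
print(escape("a ]]> b"))
```") is False
assert is_well_formed("<test>]]&gt;</test>") is True
assert is_well_formed("<test>a > b</test>") is True
assert escape("a ]]> b") == "a ]]&gt; b"
</test>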
Not so much as an author of (X)HTML documents, but more as a user of sloppily written comment fields on websites that "offer" to let you insert HTML.
I mean, if you build your site the right way, you wouldn't hardcode your content anyway, right? So your call to htmlentities or whatever (long time no see, PHP) would take care of replacing special characters for you.
So sure, you wouldn't manually type &gt;, but I hope you take measures so that > is automatically replaced.
I just thought of another example where you need to quote > in HTML5 (not XHTML5) documents: if you need it in attribute values without quotes (which is debatable practice, of course).
<img src=arrow.png alt=&gt;>
should be equivalent to XHTML
<img src="arrow.png" alt=">" />
But then again, (?<!X)HTML is not SGML.
Imagine that you have the text "this is a not a ]]> nice day" and you decide to wrap it in a CDATA section: <![CDATA[this is a not a ]]> nice day]]>. The section would then end at the first ]]>, earlier than intended.
In order to avoid that (and to allow parsing of SGML fragments with unterminated marked sections), clause 10.4 of ISO 8879:1986 declares that the occurrence of ]]> outside a marked section is an error.
Also, in the days of SGML, marked sections were very popular, as they were used not only for CDATA (as in XML), but also for RCDATA (only entities and character references allowed) and IGNORE and INCLUDE (which allowed for recognition of markup inside them).
For instance, in SGML one could write:
<!ENTITY % WHATTODO "INCLUDE">
<![%WHATTODO;[<b>]]></b>]]>
Which is equivalent to:
<b>]]></b>

Encoding rules for URL with the `javascript:` pseudo-protocol?

Is there any authoritative reference about the syntax and encoding of an URL for the pseudo-protocol javascript:? (I know it's not very well considered, but anyway it's useful for bookmarklets).
First, we know that standard URLs follow the syntax:
scheme://username:password@domain:port/path?query_string#anchor
but this format doesn't seem to apply here. Indeed, it seems, it would be more correct to speak of URI instead of URL : here is listed the "unofficial" format javascript:{body}.
Now, then, which are the valid characters for such a URI, (what are the escape/unescape rules) when embedding in a HTML?
Specifically, if I have the code of a javascript function and I want to embed it in a javascript: URI, which are the escape rules to apply?
Of course one could escape every non-alphanumeric character, but that would be overkill and make the code unreadable. I want to escape only the necessary characters.
Further, it's clear that it would be bad to use some urlencode/urldecode routine pair (those are for query string values), we don't want to decode '+' to spaces, for example.
My findings, so far:
First, there are the rules for writing a valid HTML attribute value: but here the standard only requires (if the attribute value is enclosed in quotes) an arbitrary CDATA (actually a %URI, but HTML itself does not impose additional validation at its level: any CDATA will validate).
Some examples:
<a href="javascript:alert('Hi!')"> (1)
<a href="javascript:if(a > b && 1 < 0) alert( b ? 'hi' : 'bye')"> (2)
<a href="javascript:if(a&gt;b &amp;&amp; 1 &lt; 0) alert( b ? 'hi' : 'bye')"> (3)
Example (1) is valid. But also example (2) is valid HTML 4.01 Strict. To make it valid XHTML we only need to escape the XML special characters < > & (example 3 is valid XHTML 1.0 Strict).
Now, is example (2) a valid javascript: URI ? I'm not sure, but I'd say it's not.
From RFC 2396: a URI is subject to some additional restrictions, in particular escaping/unescaping via %xx sequences. And some characters are always prohibited, among them spaces and {}#.
The RFC also defines a subset of opaque URIs: those that do not have hierarchical components, and for which the separating characters have no special meaning (for example, they don't have a 'query string', so the ? can be used like any non-special character). I assume javascript: URIs should be considered among them.
This would imply that the valid characters inside the 'body' of a javascript: URI are
a-zA-Z0-9
_|. !~*'();?:@&=+$,/-
%hh : (escape sequence, with two hexadecimal digits)
with the additional restriction that it can't begin with /.
This still leaves out some "important" ASCII characters, for example
{}#[]<>^\
Also % (because it's used for escape sequences), double quotes " and (most important) all blanks.
In some respects, this seems quite permissive: it's important to note that + is valid (and hence it should not be 'unescaped' when decoding, as a space).
But in other respects, it seems too restrictive. Braces and brackets, especially: I understand that they are normally used unescaped and browsers have no problems.
And what about spaces? Like braces, they are disallowed by the RFC, but I see no problem in this kind of URI. However, I see that in most bookmarklets they are escaped as "%20". Is there any (empirical or theoretical) explanation for this?
I still don't know if there are standard functions to do this escaping/unescaping (in mainstream languages), or some sample code.
javascript: URLs are currently part of the HTML spec and are specified at https://html.spec.whatwg.org/multipage/browsing-the-web.html#the-javascript:-url-special-case
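As far as I can tell from that section, the browser percent-decodes everything after javascript: before evaluating it, so percent-encoding the body is always safe. A minimal sketch of an encoder in Python follows. It is an assumption-laden illustration rather than an official algorithm: the set of characters left unescaped is the RFC 2396 reserved and mark characters discussed in the question, and the function names are made up. urllib.parse.quote encodes spaces as %20 and never turns + into a space, which avoids the urlencode problem mentioned above.

```python
import html
from urllib.parse import quote, unquote

# Characters left as-is (assumption: RFC 2396 reserved + mark characters).
# Letters, digits and _ . - ~ are never escaped by quote().
SAFE = "!*'();?:@&=+$,/"

def to_javascript_uri(body: str) -> str:
    """Percent-encode a script body and prefix it with the javascript: scheme."""
    return "javascript:" + quote(body, safe=SAFE)

def to_href_attribute(body: str) -> str:
    """Additionally escape the URI for use inside a quoted HTML attribute."""
    return html.escape(to_javascript_uri(body), quote=True)

body = "if (a > b) { alert('hi + bye'); }"
uri = to_javascript_uri(body)
print(uri)
print(to_href_attribute("if (a && b) alert(1)"))

# Decoding with unquote (not unquote_plus) restores the body, including the +.
print(unquote(uri[len("javascript:"):]) == body)
```

Two layers are involved: percent-encoding makes the text a valid URI, and HTML escaping (for &, <, > and quotes) makes that URI a valid attribute value. Keeping them as separate steps avoids double-escaping mistakes.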