In what scopes do special HTML characters need to be escaped? - html

In HTML,
Dust & Bones
needs to be escaped as follows:
Dust &amp; Bones
What's the scope of where &amp; needs to be applied? Is it just href, or is it anywhere within HTML text? What about
<input value="http://... & ">?
or within
<script>... & ... </script>
do these need escaping?
update
The bigger question, which would explain this, is: when does the HTML parser look for &XXX; tokens and replace them? Is it done once on the whole document, or do different rules apply for the text between tags vs. attribute values within a tag vs. within tagA vs. within tagB? Different parsing rules seem to apply within <script>, so I may write && (for AND) and < (for LESS-THAN). So, what rules apply in which scopes?

The rules vary depending on the version of HTML you are dealing with, but they are always more complex than is worth trying to remember.
The safe approach is "Use character references to represent the 5 HTML special characters everywhere except inside script and style elements", which makes you safe for everything except XHTML.
For XHTML the rule is the same with the additional proviso of "and use explicit CDATA sections in script and style elements".
The bigger question, which would explain this, is, when does the HTML parser look for &XXX; tokens and replace them?
As it parses the HTML (depending on what the current state of the tokeniser is ("inside start tag" and "inside attribute value" are examples of different states)).
Is it done once on the whole document
Unless you trigger additional HTML parsing (e.g. by setting innerHTML on an element).
or do different rules apply for the text between tags vs. attribute values within a tag vs. within tagA vs. within tagB
Different rules apply in different places. The complete, current rules are (as I suggested in a comment) rather complex and would require a lot of work to extract from the HTML 5 parsing rules. This is why I suggest, if you are an HTML author and not a browser author, using the simpler rules of "Use character references unless you are in a script or style element".
-- different parsing rules seem to apply within <script>, so I may write && (for AND) and < (for LESS-THAN). So, what rules apply in which scopes?
In HTML 4 terms, script and style elements are defined as containing CDATA (where the only sequence of characters with special meaning in HTML is </, which terminates the CDATA section). Everywhere else in the document (including, counter-intuitively, attribute values that are defined as containing CDATA) & indicates the start of a character reference (although there might be a few exceptions based on what the character following the & is).
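The difference between these scopes can be observed with a parser. The following is a minimal sketch using Python's standard html.parser module, which is not a browser parser but follows the same broad split: character references are decoded in text and in attribute values, while script content is passed through untouched. The Collector class and the parse helper are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Records attribute values and per-element text as the parser reports them."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.attrs = {}    # attribute name -> decoded value
        self.text = {}     # element name -> text seen directly inside it
        self._stack = []   # currently open elements

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)
        self.attrs.update(attrs)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        # Data may arrive in several chunks, so append rather than assign.
        if self._stack:
            current = self._stack[-1]
            self.text[current] = self.text.get(current, "") + data


def parse(markup):
    """Feed the whole markup string and return the populated collector."""
    collector = Collector()
    collector.feed(markup)
    collector.close()
    return collector


result = parse(
    '<p title="a &amp; b">x &lt; y</p>'
    '<script>if (a && b < c) {}</script>'
)
# result.attrs["title"]  is the decoded attribute value
# result.text["p"]       is the decoded text between the tags
# result.text["script"]  is the raw script source, with & and < left alone
```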
The HTML 5 rules are more complicated, but the basic principle of "It is safe and sane to use character references for &, <, >, " and ' everywhere except inside script and style elements" holds.
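As a rough illustration of that rule, here is a sketch in Python using the standard library's html.escape. The helper names (escape_html, input_tag, script_tag) and the example URL are assumptions made up for this example, and rejecting "</" inside a script is just one simple way to respect the terminator mentioned above.

```python
import html


def escape_html(value: str) -> str:
    """Escape &, <, >, " and ' for use in text or in attribute values."""
    return html.escape(value, quote=True)


def input_tag(value: str) -> str:
    """Build an <input> whose value attribute holds arbitrary text."""
    return f'<input value="{escape_html(value)}">'


def script_tag(code: str) -> str:
    """Embed script source as-is: character references are not decoded there."""
    if "</" in code:
        # Assumption: refuse the input rather than try to rewrite the script.
        raise ValueError("script content must not contain '</'")
    return f"<script>{code}</script>"


print(escape_html("Dust & Bones"))
print(input_tag("http://example.com/?a=1&b=2"))
print(script_tag("if (a && b < c) {}"))
```

The same escape_html call serves both element text and quoted attribute values, which is what makes the "escape everywhere except script and style" rule easy to apply.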

Related

Random Letter html Tag

I was wondering if you can use a random letter as an html tag. Like, f isn't a tag, but I tried it in some code and it worked just like a span tag. Sorry if this is a bad question, I've just been curious about it for a while, and I couldn't find anything online.
I was wondering if you can use a random letter as an html tag.
Yes and no.
"Yes" - in that it works, but it isn't correct: when you have something like <z> it only works because the web (HTML+CSS+JS) has a degree of forwards compatibility built-in: browsers will render HTML elements that they don't recognize basically the same as a <span> (i.e. an inline element that doesn't do anything other than reify a range of the document's text).
However, to use HTML5 Custom Elements correctly you need to conform to the Custom Elements specification which states:
The name of a custom element must contain a dash (-). So <x-tags>, <my-element>, and <my-awesome-app> are all valid names, while <tabs> and <foo_bar> are not. This requirement is so the HTML parser can distinguish custom elements from regular elements. It also ensures forward compatibility when new tags are added to HTML.
So if you use <my-z> then you'll be fine.
The HTML Living Standard document, as of 2021-12-04, indeed makes an explicit reference to forward-compatibility in its list of requirements for custom element names:
https://html.spec.whatwg.org/#valid-custom-element-name
They start with an ASCII lower alpha, ensuring that the HTML parser will treat them as tags instead of as text.
They do not contain any ASCII upper alphas, ensuring that the user agent can always treat HTML elements ASCII-case-insensitively.
They contain a hyphen, used for namespacing and to ensure forward compatibility (since no elements will be added to HTML, SVG, or MathML with hyphen-containing local names in the future).
They can always be created with createElement() and createElementNS(), which have restrictions that go beyond the parser's.
Apart from these restrictions, a large variety of names is allowed, to give maximum flexibility for use cases like <math-α> or <emotion-😍>.
So, by example:
<a>, <q>, <b>, <i>, <u>, <p>, <s>
No: these single-letter elements are already used by HTML.
<z>
No: element names that don't contain a hyphen (-) cannot be custom elements and will be interpreted by present-day browsers as invalid/unrecognized markup that they will nevertheless (largely) treat the same as a <span> element.
<a:z>
No: using a colon to use an XML element namespace is not a thing in HTML5 unless you're using XHTML5.
<-z>
No - the element name must start with a lowercase ASCII character from a to z, so - is not allowed.
<a-z>
Yes - this is fine.
<a-> and <a-->
Unsure - these two names are curious:
The HTML spec says the name must match the grammar rule [a-z] (PCENChar)* '-' (PCENChar)*.
The * denotes "zero-or-more" which is odd, because that implies the hyphen doesn't need to be followed by another character.
PCENChar represents a huge range of visible characters permitted in element names, curiously this includes -, so by that rule <a--> should be valid.
But note that -- is a reserved character sequence in the greater SGML-family (including HTML and XML) which may cause weirdness. YMMV!
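The examples above can be checked against a simplified version of that grammar rule. This Python sketch is an approximation only: it covers just the ASCII subset of PCENChar (lowercase letters, digits, ".", "_" and "-"), so names such as <math-α> are wrongly rejected, and it ignores the short list of hyphenated names the spec reserves. The function name is hypothetical.

```python
import re

# ASCII-only approximation of: [a-z] (PCENChar)* '-' (PCENChar)*
_PCEN_ASCII = r"[a-z0-9._-]"
_CUSTOM_NAME = re.compile(rf"[a-z]{_PCEN_ASCII}*-{_PCEN_ASCII}*")


def looks_like_custom_element_name(name: str) -> bool:
    """Return True if name matches the simplified (ASCII-only) grammar."""
    return _CUSTOM_NAME.fullmatch(name) is not None


for candidate in ["z", "a:z", "-z", "a-z", "a-", "a--", "my-z", "foo_bar"]:
    print(candidate, looks_like_custom_element_name(candidate))
```

Under this approximation <a-> and <a--> both match, which agrees with the literal reading of the grammar discussed above.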

In what contexts can you use less than as text in html

In what contexts can I use the less-than symbol < as text in HTML?
For example, < and <= render as text perfectly fine if they are in a tag:
<p>
<
<=
</p>
However <t will be parsed as HTML by the browser and not produce the text <t.
Is there a rule for which characters can follow the less-than symbol for the browser to assume that it is the start of a tag?
The rule is: almost never.
Only inside quoted attribute values (and in raw text tags like script and style) are you permitted to write < unescaped. I think attribute names permit these too, but not > (though why you would put a < in an attribute name is beyond me).
Browsers will do their best to recover from bad HTML, so sometimes you might get away with it if you forget.
But it's best to always encode your entities.
You should scan the HTML spec, but here's one relevant chapter with some of the constraints listed in various sections.
Use an HTML validator in strict mode to make sure you're getting it right; the HTML you gave in your question is rejected by the linked tool, with a suggestion to switch to &lt;.

<in a nutshell> as text not html tag

I have a text: Our process<in a nutshell>
that has an output as:
Our process<in nutshell="" a=""></in>
I didn't even know in is a tag and cannot find on google what it does.
How do I post it as text? And what is <in>?
Thanks!
In HTML:
Our process &lt;in a nutshell&gt;
There is no <in> tag defined in HTML, but browsers and other parsers still treat <in a nutshell> as tag. It creates an element node in the document tree, representing an unknown element, so it has only a set of general properties. It has no special rendering, and no functionality is associated with it. But you could style it and/or use client-side JavaScript to add functionality to it.
In this case, you didn’t mean to do anything like that, but the tag is still parsed, and in is treated as the element name (tag name) and nutshell and a as attribute names, with attribute values defaulted to the empty string. Since tags are treated as code for starting an element, the tag itself is not rendered. Browsers may imply a closing tag </in> under certain conditions. This explains the “output” presented in the question; it’s really just the fragment of code viewed in a browser’s Developer Tools. The actual rendering in the example case is just the string “Our process”.
To prevent this processing, the “<” character needs to be escaped somehow; &lt; is the best and most common method, so you would write
Our process&lt;in a nutshell>
There is no need to escape the “>”, but you may do so, for symmetry, using &gt;.
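To see the effect described above, the following sketch runs both versions of the text through Python's standard html.parser. This is not what a browser does internally, only a parser with comparable behaviour for this input, and the TagAndText class and inspect helper are hypothetical names for illustration. Note that this parser reports a valueless attribute as None, whereas browser developer tools show it as an empty string.

```python
from html.parser import HTMLParser


class TagAndText(HTMLParser):
    """Collects start tags (with attributes) and the visible text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = ""

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))

    def handle_data(self, data):
        self.text += data


def inspect(markup):
    """Parse markup and return (start tags, text) as seen by the parser."""
    parser = TagAndText()
    parser.feed(markup)
    parser.close()
    return parser.tags, parser.text


# Unescaped: "in" becomes an element name, "a" and "nutshell" attribute names.
print(inspect("Our process<in a nutshell>"))

# Escaped "<": no tag is created and the full sentence survives as text.
print(inspect("Our process&lt;in a nutshell>"))
```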
Try to replace
< with &lt;
and replace
> with &gt;
Does this give you the expected results?
The browser is interpreting anything in '<>' as a tag.
You need to use the character code to display those symbols as text:
Our process &lt;in a nutshell&gt;

invalid tags in HTML <abc> vs <1234>

I was writing a simple web page, and I wanted to print <abc> and <1234> inside the page. Why is <1234> printed but not <abc>? I know <abc> is an invalid tag, and that's why it is not rendered. But what about <1234>?
You have to do it like:
&lt;abc&gt; and &lt;1234&gt;
Use HTML entities.
&lt; = <
&gt; = >
Using them tells HTML that you want the < and > to be displayed as it is and not be interpreted as the < and > in <html>
DEMO
P.S.: Here's a list of them.
This is down to the way that browsers parse the HTML into a format that gets displayed as a web page.
As a rule, HTML tags must start with letters. Because of this, the browser attempts to parse <abc> as a valid tag (therefore hiding it), but doesn't recognise <1234> as a tag and therefore leaves it untouched.
Edit:
As @Arkana pointed out below, there's nothing I can see in the HTML specification that specifically forbids starting an HTML tag with a number. My best guess is that because no (currently valid) HTML tags actually do start with a number, the browser's parser just ignores these tags, based on the same rule that IDs and Names follow according to the HTML4 spec.
In XHTML and in HTML5 (even in HTML serialization), both <abc> and <123> are invalid. In HTML 4.01, <123> is valid, though not recommended, and it simply means those five data characters.
What matters in browsers is how they parse an HTML document. There is an attempted semi-formal description of this in HTML5 CR, but it’s a bit hard reading. The bottom line is that < triggers special parsing: if the next character is a letter, data is parsed as an HTML tag; otherwise, the < as well as data after it are taken as normal data characters.
When a tag like <abc> has been parsed, modern browsers construct an element node in the document tree – even though the tag is invalid and the tag name is not known to the browser at all. If there is no end tag </abc>, the node contains all the rest there is in the document. But for an element node with an unknown name, there is no default styling and no default action. You won’t notice its existence, unless you try to do something with it (like put abc { color: red } in a style sheet).
Technically, one could say that the cause of the difference is that “a” is a name start character (a character that may appear as the first one in a tag name), whereas “1” is not.
It is safest to always escape a “<” character in content (except for style and script and xmp elements, which have rules of their own) as &lt;. There is no need to escape a “>”, but if desired, for symmetry, you can escape it as &gt;.
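The bottom-line rule from this answer ("<" followed by a letter starts a tag, otherwise it is data) can be written down as a small sketch. It is a deliberate simplification in Python: it only models start tags and says nothing about "</", "<!" or "<?", which have their own handling in real parsers. The function name is hypothetical.

```python
import re

# "<" immediately followed by an ASCII letter opens a start tag.
_START_TAG_OPEN = re.compile(r"<[A-Za-z]")


def opens_start_tag(text: str, pos: int = 0) -> bool:
    """Return True if the "<" at text[pos] would begin a start tag."""
    return _START_TAG_OPEN.match(text, pos) is not None


for sample in ["<abc>", "<1234>", "<t", "< ", "<="]:
    kind = "tag" if opens_start_tag(sample) else "text"
    print(repr(sample), "->", kind)
```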
Unrecognised elements are added to the DOM for forward compatibility (they can be enhanced with CSS/JS). Element names may not begin with a number though, so they are not added to the DOM and error recovery treats them as text instead.
Use &lt; and &gt; if you want to include < and > as data instead of markup.

What else can be used instead of < or > in HTML codes?

When we do any HTML coding, we use < and > to specify a tag, which the browser does not show as text but interprets as markup. Can anything else (any coding, for HTML) be used instead of these symbols?
I think you are asking whether it is possible to use characters other than < and > as tag start and tag end characters. For example, can one somehow define that [ and ] are used instead, so that we would write [p] and not <p>.
The answer is no. HTML was formally based on SGML, which has provisions for such definitions; in SGML, < and > are just “reference concrete syntax” characters for abstract “start of tag” and “end of tag” notations. But HTML was never actually implemented as SGML-based, and the HTML specifications even formally fixed the syntax to use < and >. And XML, the simplified version of SGML, upon which XHTML is based, has no provisions for setting such syntax features.
In practical terms: No. Only < and > mark the start and end of a tag in HTML.
In theoretical terms only (because this is not supported by any mainstream browser), in HTML 4 and earlier you could use SHORT TAGS. The syntax for this is to use / instead of > to end the start tag and then / instead of the entire end tag:
For example:
<title/This is the title/
or
<br/ <!-- note that the end tag for br elements must be omitted in HTML 4 and earlier -->
Some other SGML features may allow other options, but they would also not be supported by browsers.
The following is my answer to what appeared to be the original question after someone had edited it to show &lt; instead of <.
In theory, for HTML 4 and earlier, you can use CDATA sections … but they never saw widespread support in browsers so aren't of any practical value in HTML.
There is also the <xmp> element, which is obsolete. The HTML 5 draft marks it as non-conforming and says:
Use pre and code instead, and escape "<" and "&" characters as "&lt;" and "&amp;" respectively
The W3C Wiki has this to say about xmp:
No, really. Don't use it.
Character references (&lt; and co) are the correct tools for the job. Any desire to avoid them is better replaced by learning to love a programmatic solution or the find & replace feature of your editor.