Why is <br> an HTML element rather than an HTML entity? - html

Why indeed? Wouldn't something like &br; be more appropriate?

An HTML entity reference is, depending on HTML version either an SGML entity or an XML entity (HTML inherits entities from the underlying technology). Entities are a way of inserting chunks of content defined elsewhere into the document.
All HTML entities are single-character entities, and are hence basically the same as character references (technically they are different to character references, but as there are no multi-character entities defined, the distinction has no impact on HTML).
When an HTML processor sees, for example — it replaces it with the content of that entity reference with the appropriate entity, based on the section in the DTD that says:
<!ENTITY mdash CDATA "—" -- em dash, U+2014 ISOpub -->
So it replaces the entity reference with the entity — which is in turn a character reference that gets replaced by the character — (U+2014). In reality unless you are doing this with a general-purpose XML or SGML processor that doesn't understand HTML directly, this will really be done in one step.
Now, what would we replace your hypothetical &br; with to cause a line-break to happen? We can't do so with a newline character, or even the lesser known U+2028 LINE SEPARATOR (which semantically in plain text has the same meaning as <br/> in HTML), because they are whitespace characters which are not significant in most HTML code, which is something that you should be grateful for as writing HTML would be much harder if we couldn't format for readability within the source code.
What we need is not an entity, but a way to indicate semantically that the rendered content contains a line-break at this point. We also need to not indicate anything else (we can already indicate a line-break by beginning or ending a block element, but that's not what we want). The only reasonable way to do so is to have an element that means exactly that, and so we have the <br/> element, with its related tag being put into the source code.

A tag and a character entity reference exist for different reasons - character entities are stand-ins for certain characters (sometimes required as escape sequences - for example & for an ampersand &), tags are there for structure.
The reason the <br> tag exists is that HTML collapses whitespace. There needs to be a way to specify a hard line break - a place that has to have a line break. This is the function of the <br> tag.
There is no single character that has this meaning, though U+2028 LINE SEPARATOR has similar meaning, and even if it were to be used it would not help as it is considered to be whitespace and HTML would collapse it.
See the answers from #John Kugelman and #John Hanna for more detail on this aspect.
Not entirely related, there is another reason why a &br; character entity reference does not exist: a line break is defined in such a way that it could have more than one character, see the HTML 4 spec:
A line break is defined to be a carriage return (
), a line feed (
), or a carriage return/line feed pair.
Character entities are single character escapes, so cannot represent this, again in the HTML 4 spec:
A character entity reference is an SGML construct that references a character of the document character set.
You will see that all the defined character entities map to a single character. A line break/new line cannot be cleanly mapped this way, thus an entity is required instead of a character entity reference.
This is why a line break cannot be represented by a character entity reference.
Regardless, it not not needed as simply using the Enter key inserts a line break.

Entities are stand-ins for other characters or bits of text. In HTML they are used to represent characters that are hard to type (e.g. — for "—") or for characters that need to be escaped (& for "&"). What would a hypothetical &br; entity stand for?
It couldn't be \r or \n or \r\n as these are already easy enough to type (just press enter). The issue you're trying to workaround is that HTML collapses whitespace in most contexts and treats newlines as spaces. That is, \n is not a line break character, it is just whitespace like tabs and spaces.
An entity &br; would have to be replaced by some other text. What character do you use to represent the concept of "hard line break"? The standard line break character \n is exactly the right character, but unfortunately it's unsuitable since it's thrown in the generic "whitespace" bucket. You'd have to either overload some other control character to represent "hard line break", or use some extended Unicode character. When HTML was designed Unicode was only a nascent, still-developing standard, so that wasn't an option.
A <br> element was the simple, straightforward way to add the concept of "hard line break" to a document since no character could represent that concept.

In HTML all line breaks are treated as white space:
A line break is defined to be a carriage return (
), a line feed (
), or a carriage return/line feed pair. All line breaks constitute white space.
And white space does only separate words and sequences of white space is collapsed:
For all HTML elements except PRE, sequences of white space separate "words" (we use the term "word" here to mean "sequences of non-white space characters"). […]
[…]
Note that a sequence of white spaces between words in the source document may result in an entirely different rendered inter-word spacing (except in the case of the PRE element). In particular, user agents should collapse input white space sequences when producing output inter-word space. […]
This means that line breaks cannot be expressed by plain characters. And although there are certain special characters in Unicode to unambiguously separate lines and paragraphs, they are not specified to do this in HTML too:
Note that although 
 and 
 are defined in [ISO10646] to unambiguously separate lines and paragraphs, respectively, these do not constitute line breaks in HTML […]
That means there is no plain character or sequence of plain characters that is to mark a line break in HTML. And that’s why there is the BR element.
Now if you want to use &br; instead of <br>, you just need to declare the entity br to represent the value <br>:
<!ENTITY br "<br>">
Having this additional entity named br declared, a general-purpose XML or SGML processor will replace every occurrence of the entity reference &br; with the value it represents (<br>). An example document:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd" [
<!ENTITY br "<br>">
]>
<HTML>
<HEAD>
<TITLE>My first HTML document</TITLE>
</HEAD>
<BODY>
<P>Hello &br;world!
</BODY>
</HTML>

Entities are content, tags are structure or layout (very roughly speaking). It seems whoever made the <br> a tag decided that breaking a line has more to do with structure and layout than with content. Not being able to actually "see" a <br>
I'd tend to agree. Oh and I'm making this up as I go so feel free to disagree ;)

HTML is a mark-up language - it represents the structure of a document, not how that document should appear visually. Take the <EM> tag as an example - it tells user-agents that they should give emphasis to any text that is placed between the opening and closing <EM> tags. However, it does not state how that emphasis should be represented. Yes, most visual web-browsers will place the text in italics, but this is only convention. Other browsers, such as monochrome text-only browsers may display the text in inverse. A screen reader might read the text in a louder voice, or change the pronunciation. A search-engine spider might decide the text is more important than other elements.
The same goes for the <BR> tag - it isn't just another character entity, it actually represents a break in the document structure. A <BR> is not just a replacement for a newline character, but is a "semantic" part of the document and how it is structured. This is similar to the way an <H1> is not just a way of making text bigger and bolder, but is an integral part of the way the document is structured.

br elements can be styled, though. How would you style an HTML entity? Because they're elements it makes them more flexible.

Yes. An HTML entity would be more appropriate, as a break tag cannot contain text and behaves much like a newline.
That's just not the way things are, though. Too late. I can't tell you the number of non-XML-compatible HTML documents I've had to deal with because of unclosed break tags...

Related

Valid and Invalid HTML tags

So recently I found a question that was
Which of the following is a valid tag?
<213person>
<_person> (This is given as the right answer)
Both
None
(Note: this is the explanation that was given:- Valid HTML tags are surrounded by the angle brackets and the tag name can only either start from an alphabet or an underscore(_))
As far as my knowledge goes none of the reserved tags start with an underscore and according to what I've read about custom HTML tags it has to start with an alphabet(I tested it and it doesn't work with a custom tag starting with any character that's not an alphabet). So in my opinion and according to what I tested HTML tags can only start with alphabets or! (in case of !-- -- and !DOCTYPE HTML)
What I want to know is if the given explanation is correct or not and if it's correct then can someone provide some proper documentation and working examples for it?
As mentioned by #Rob, the standard defines a valid tag name as string containing alphanumeric ASCII characters, being:
0-9|a-z|A-Z
However, browsers handle things differently.
There's a few main points that I've noticed which don't align with the current standard.
Tag names must start with a letter
If a tag name starts with any character outside a-z|A-Z, the start tag ends up being interpreted as text and the end tag gets converted into a comment.
Special characters can be used
The following HTML is valid in a lot of browsers and will create an element:
<Z[\]^_`a></Z[\]^_`a>
This seems to be browsers only checking if the characters are ASCII. The only exception is the first character (as stated above).
Initially, I thought this was a simplified check, so instead of [A-Z]|[a-z| they checked [A-z], but you can use any character outside this range.
This makes the following HTML also "valid" in the eyes of certain browsers:
<a!></a!>
<aʬ></aʬ>
<a͢͢͢></a͢͢͢>
<a͢͢͢ʬ͢ʬ͢ʬ͢ʬ͢ʬ͢ʬ͢ΘΘΘΘ></a͢͢͢ʬ͢ʬ͢ʬ͢ʬ͢ʬ͢ʬ͢ΘΘΘΘ>
<a></a>
I tested the HTML elements in both Chrome and Firefox, I didn't test any other browsers. I also didn't test every ASCII character, just some very high and low in terms of their character code.
From the HTML standard:
Start tags must have the following format:
The first character of a start tag must be a U+003C LESS-THAN SIGN
character (<). The next few characters of a start tag must be the
element's tag name.
So what is allowed in the element's tag name? This is defined just above:
Tags contain a tag name, giving the element's name. HTML elements all
have names that only use ASCII alphanumerics. In the HTML syntax, tag
names, even those for foreign elements, may be written with any mix of
lower- and uppercase letters that, when converted to all-lowercase,
matches the element's tag name; tag names are case-insensitive.

Random Letter html Tag

I was wondering if you can use a random letter as an html tag. Like, f isn't a tag, but I tried it in some code and it worked just like a span tag. Sorry if this is a bad question, I've just been curious about it for a while, and I couldn't find anything online.
I was wondering if you can use a random letter as an html tag.
Yes and no.
"Yes" - in that it works, but it isn't correct: when you have something like <z> it only works because the web (HTML+CSS+JS) has a degree of forwards compatibility built-in: browsers will render HTML elements that they don't recognize basically the same as a <span> (i.e. an inline element that doesn't do anything other than reify a range of the document's text).
However, to use HTML5 Custom Elements correctly you need to conform to the Custom Elements specification which states:
The name of a custom element must contain a dash (-). So <x-tags>, <my-element>, and <my-awesome-app> are all valid names, while <tabs> and <foo_bar> are not. This requirement is so the HTML parser can distinguish custom elements from regular elements. It also ensures forward compatibility when new tags are added to HTML.
So if you use <my-z> then you'll be fine.
The HTML Living Standard document, as of 2021-12-04, indeed makes an explicit reference to forward-compatibility in its list of requirements for custom element names:
https://html.spec.whatwg.org/#valid-custom-element-name
They start with an ASCII lower alpha, ensuring that the HTML parser will treat them as tags instead of as text.
They do not contain any ASCII upper alphas, ensuring that the user agent can always treat HTML elements ASCII-case-insensitively.
They contain a hyphen, used for namespacing and to ensure forward compatibility (since no elements will be added to HTML, SVG, or MathML with hyphen-containing local names in the future).
They can always be created with createElement() and createElementNS(), which have restrictions that go beyond the parser's.
Apart from these restrictions, a large variety of names is allowed, to give maximum flexibility for use cases like <math-α> or <emotion-😍>.
So, by example:
<a>, <q>, <b>, <i>, <u>, <p>, <s>
No: these single-letter elements are already used by HTML.
<z>
No: element names that don't contain a hyphen - cannot be custom elements and will be interpreted by present-day browsers as invalid/unrecognized markup that they will nevertheless (largely) treat the same as a <span> element.
<a:z>
No: using a colon to use an XML element namespace is not a thing in HTML5 unless you're using XHTML5.
<-z>
No - the element name must start with a lowercase ASCII character from a to z, so - is not allowed.
<a-z>
Yes - this is fine.
<a-> and <a-->
Unsure - these two names are curious:
The HTML spec says the name must match the grammar rule [a-z] (PCENChar)* '-' (PCENChar)*.
The * denotes "zero-or-more" which is odd, because that implies the hyphen doesn't need to be followed by another character.
PCENChar represents a huge range of visible characters permitted in element names, curiously this includes -, so by that rule <a--> should be valid.
But note that -- is a reserved character sequence in the greater SGML-family (including HTML and XML) which may cause weirdness. YMMV!

What characters can come immediately after the < in a tag?

On a webpage I found a tag that begins with a Unicode letter 休
Is there a list somewhere of the letters and symbols may validly follow right after the less than sign?
They're not using a mark-up less than sign "<" on their site but rather they are using the HTML entity less than < to display the reserved character as text rather than HTML.
This can be treated just like ordinary text. So in essence, it's not a tag, its just ordinary text.
For instance the line:
<font style="color:#F00;"><休闲文化></font>
Actually is:
<font style="color:#F00;"><休闲文化></font>
Thus, <休闲文化> isn't a tag itself, but rather just text (which uses HTML reserved characters within it - perhaps marking you confuse it for a tag)
In which context is it used? XML, HTML,...?
In case of HTML there are tags already defined, you can't use a random one. In XML you can define you're own tags. In both cases you might use random tags, while not ending up with error you would notice, the tag might just get skipped.
I believe this Wiki page might help you:
https://en.m.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
All characters you want (but you should quote few).
As you notice, the character <休 come in <a href="http:(...)" target="_blank" title="<休闲文化>【成都大熊猫基. So this is inside a string (an attribute value). The < in this case will not indicate a start of a tag.

HTML's handling of white-space characters depends on context - but what are the rules?

The Unicode catalogue includes a number of white-space characters, some of which don't appear to work in any context in HTML documents - but some of which, rather usefully, do.
Here is an example:
<h1 title="Hi! As a title attribute, 
I can contain horizontal tabs 
and carriage returns
and line feeds.">HTML's handling of &009; | &010; | &013;</h1>
<p>Hello. As a paragraph element, I can't contain horizontal tabs 
or carriage returns
or line feeds.</p>
<input type="submit" value="I am a value attribute and
like title I can also handle line feeds" /><br />
<input type="submit" value="I am another value attribute. Like title I can handle horizontal tabs" /><br />
<input type="submit" value="I am a third value attribute. 
Unlike title I can't handle carriage returns" />
Is there any official spec or series of guidelines which detail which white-space characters can be deployed in HTML documents and where?
It's a little unclear what you mean by work, but I'm going to assume you mean rendering, at which point what happens is really up to CSS.
https://www.w3.org/TR/CSS2/text.html#white-space-model defines how most whitespace characters are normalized away, unless you adjust the white-space property.
Note that the display of toolbars (such as from the title attribute) and form controls (such as from input elements) is not defined by any standard, leaving that effectively up to browsers.
Disclaimer: this answer was composed for the question as originally written, making explicit references to ASCII control characters. It was apparently a red herring so the information here may look confusing now.
First of all, I don't think nobody uses ASCII any more. In 2016 the only sensible encoding is UTF-8. Whatever, UTF-8 is a superset of ASCII (and you can use ASCII anyway) so the question is still be valid.
Secondly, your example isn't correct. All the HTML entities you mention are printable characters:
is 'CHARACTER TABULATION' (U+0009) (i.e. a tab)
is 'CARRIAGE RETURN (CR)' (U+000D) (i.e. a legacy MacOS line feed)
is 'LINE FEED (LF)' (U+000A) (i.e. a Unix line feed)
(And please note that Windows line feeds are a combination of CR+LF.)
If you're really talking about control characters:
EOT End of Transmission
ACK Acknowledgement
BEL Bell
...
... we first need to understand that HTML is meant to be plain text (as such, it's MIME content type is text/html). The HTML5 Living Standard provides a definition of control character that's wider than the ASCII one but in any case it doesn't seem to be allowed:
Any occurrences of any characters in the ranges U+0001 to U+0008,
U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters
U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE,
U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF,
U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE,
U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF,
U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse
errors. These are all control characters or permanently undefined
Unicode characters (noncharacters).
Any character that is a not a Unicode character, i.e. any isolated
surrogate, is a parse error. (These can only find their way into the
input stream via script APIs such as document.write().)
If you actually refer to the characters in your example, some of then are considered exceptions in the parsing stage:
U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
characters are treated specially. Any LF character that immediately
follows a CR character must be ignored, and all CR characters must
then be converted to LF characters. Thus, newlines in HTML DOMs are
represented by LF characters, and there are never any CR characters in
the input to the tokenization stage.
... but I suspect you are only interested in white-space collapsing:
In HTML, only the following characters are defined as white space
characters:
ASCII space ( )
ASCII tab ( )
ASCII form feed ()
Zero-width space (​)
[...]
In particular, user agents should collapse input white space sequences
when producing output inter-word space.
[...]
The PRE element is used for preformatted text, where white space is
significant.
In other words, consecutive white space characters become a simple space (except inside <pre> tag). (I could only find a link for HTML 4 but that's something that hasn't changed significantly).
Is there any official spec or series of guidelines? Sure they are: you have the official W3C recommendations and the WHATWG specs but they're basically technical documentation mostly addressed at browser vendors: extensive, comprehensive and hard to decipher into plain English ;-)

Is it allowed to use other tags inside <title>?

Is it correct practice or valid syntax to use other tags inside a <title>?
An example for multi-language title
<html lang=en>
<title>Some title in English and a <i lang=fr>word in French</i></title>
See http://www.w3.org/TR/html401/struct/global.html#h-7.4.2:
Titles may contain character entities (for accented characters, special characters, etc.), but may not contain other markup (including comments).
(my emphasis)
No, it may not
http://www.w3.org/Provider/Style/TITLE.html
You can try to use whatever you want, but it will all be used as title string, without any additional parsing/processing from the browser (if that's what you expect). RFC says you have to resist from placing markup in title, though.
TLDR: The <title> tag (1) must contain text (it must not be empty), (2) must only contain text (i.e. no other elements), and (3) must contain text that is not just white-space.
In HTML 5, the Content Model of the title element is:
Text that is not inter-element white space.
where inter-element white space is any Text node that is either empty or only contains sequences of space characters:
U+0020 SPACE
U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+000D CARRIAGE RETURN (CR)