HTML whitespace: spaces before and after <br> - html

I am trying to better understand the HTML whitespace processing model. Right now I'm comparing two HTML snippets:
<div>a <br>z</div>
and
<div>a<br> z</div>
The first snippet, when rendered, yields two lines: "a " and "z". (So the first line has a trailing space.)
The second snippet yields two lines: "a" and "z". There is no leading space on the second line.
My question is: why? I'm currently using this http://www.w3.org/TR/CSS2/text.html#white-space-model as a reference. It states
If a space (U+0020) at the beginning of a line has 'white-space' set to 'normal', 'nowrap', or 'pre-line', it is removed.
All tabs (U+0009) are rendered as a horizontal shift that lines up the start edge of the next glyph with the next tab stop. Tab stops occur at points that are multiples of 8 times the width of a space (U+0020) rendered in the block's font from the block's starting content edge.
If a space (U+0020) at the end of a line has 'white-space' set to 'normal', 'nowrap', or 'pre-line', it is also removed.
If spaces (U+0020) or tabs (U+0009) at the end of a line have 'white-space' set to 'pre-wrap', UAs may visually collapse them.
A naive reading of this would indicate that, since a space at the beginning or end of a line is to be removed (when 'white-space' is 'normal'), the first of my snippets ought to result in no trailing space. But that isn't the case.
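To make that naive reading concrete, here is a rough Python model of just the two quoted line-edge rules. The helper name `naive_line_edges` is made up for illustration, and treating <br> as a plain line separator is an assumption of the sketch, not something the spec states. It models the literal reading only, not what browsers actually do.

```python
def naive_line_edges(lines):
    """Apply the two quoted rules literally: drop a space at each line edge."""
    result = []
    for line in lines:
        if line.startswith(" "):
            line = line[1:]   # space at the beginning of a line is removed
        if line.endswith(" "):
            line = line[:-1]  # space at the end of a line is removed
        result.append(line)
    return result

# Assumption of this sketch: <br> simply splits the text into lines.
first = naive_line_edges("a <br>z".split("<br>"))
second = naive_line_edges("a<br> z".split("<br>"))
print(first, second)
```

Under this model both snippets come out identical, with no space on either side of the break, which is exactly the mismatch with the observed rendering of the first snippet.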
So what's going on?
My current theory is that the <br> is secretly counted as a "character" which, in the first snippet, prevents the trailing space from being at the "end" of its line. But I really have no idea.
EDIT: To be clear, I know how to use &nbsp; to create spaces at will. My question is about what rule (with regard to some spec) induces the above behavior.

Good question! I've confirmed the behavior in both Chrome and Firefox, and confirmed that it has nothing to do with <br>, as it's also triggered by an ordinary linebreak in white-space: pre-line conditions:
<div style="white-space:pre-line">a
z</div>
I've sent an email to the list asking for clarification on this issue, and inquiring whether we should change the spec to match implementations, or file bugs on browsers to match the spec.

Related

Term to describe the space left between blocks of code (e.g. in a script)?

What's the correct term for the space between blocks of code? The best I have come up with is 'block delimiter' (as in code block delimiter)
Background
I'm writing some documentation and need to know what the space between blocks of code is called. I can see a very common pattern is to leave a single line gap (in other words, \n\n goes between the last character of the last code block and the first character of the next code block - examples here).
Question
What is the appropriate term for the space between the last character of a code block and the first character of the following code block?
Consider the term "padding line between blocks", inspired by ESLint's rule padding-line-between-statements:
Require or disallow padding lines between statements
(padding-line-between-statements)
This rule requires or disallows blank lines between the given 2 kinds
of statements. Properly blank lines help developers to understand the
code.

Indent subsequent lines without hanging indent

In Sublime, long lines are wrapped using hanging indent (also known as "reverse indent"). Example:
(I use "word_wrap": "true" in my settings).
Is there a way to make long lines wrap without this hanging indentation, i.e. like in Brackets:
From my observations, ST always adds one indentation level to lines that were wrapped, if the base scope begins with source. (If indent_subsequent_lines is set to true, ST will indent subsequent word wrapped lines to the level of the line being wrapped and then add the extra level of indentation.)
It seems there is no way to disable hanging indentation in ST for source code. Indeed, this behavior can cause text to be unviewable if it is indented further than the window width - https://github.com/SublimeTextIssues/Core/issues/286.
It might be worth logging an issue at https://github.com/SublimeTextIssues/Core/issues to ask the ST devs to consider making an option to disable the hanging indentation.

How are tabs interpreted in CommonMark?

See the description before Example 6 in the CommonMark spec at: http://spec.commonmark.org/0.27/#example-5
I am trying to understand how the following code leads to a code-block starting with two spaces.
>→→foo
Example 6 shows that this would translate to the following.
<blockquote>
<pre><code>  foo
</code></pre>
</blockquote>
But Section 2.2 clearly states:
However, in contexts where whitespace helps to define block structure, tabs behave as if they were replaced by spaces with a tab stop of 4 characters.
So as per my understanding, the above Markdown behaves like the following (I denote a space with a dot).
>........foo
Since one optional space is allowed after >, and 4 spaces are used to indent a code block, we are left with,
>...foo
That's a code-block starting with three spaces. How does CommonMark claim then that it should lead to a code-block starting with two spaces? What am I missing?
The key is in the very first paragraph of the Tabs section (emphasis added):
Tabs in lines are not expanded to spaces. However, in contexts where whitespace helps to define block structure, tabs behave as if they were replaced by spaces with a tab stop of 4 characters.
Notice that it says "4 characters", not 4 spaces.
If you configure your text editor to use a tab stop of length four and to replace tabs with spaces (any good text editor should offer this setting), the text editor will use columns that are four characters wide. When you press the tab key, it will move the cursor forward to the next column, which will only ever be four characters wide. If the column already contains any characters, then only as many spaces are added as are needed to bring the column to four characters, which in this case is fewer than four spaces.
For example, if you type an angle bracket (>) character in your editor and then press tab, you will get the following (when configured to replace tabs with spaces):
>···
Therefore the angle bracket plus the tab moves forward to the end of the column (four characters) for a total of three spaces. As we are now at the beginning of the next column, pressing tab a second time would move us to the next column (4 more spaces) for a total of 7 spaces:
>·······
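The same column arithmetic can be checked with Python's built-in `str.expandtabs`, which pads each tab up to the next multiple of the tab size. This is only a sketch of the "as if replaced by spaces" view; CommonMark itself does not rewrite the tabs.

```python
# Tab stop of 4 characters, as in the CommonMark Tabs section.
TAB_STOP = 4

one_tab = ">\t".expandtabs(TAB_STOP)        # ">" fills column 0, the tab pads to column 4
two_tabs = ">\t\tfoo".expandtabs(TAB_STOP)  # the second tab pads from column 4 to column 8

print(repr(one_tab))
print(repr(two_tabs))
print(one_tab.count(" "), two_tabs.count(" "))
```

The first tab contributes three spaces because the angle bracket already occupies one character of the column; the second contributes a full four.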
We can confirm this is the correct interpretation with a more recent change to the spec committed in 3bc01c5dc (which apparently hasn't made it into a release yet). As the commit comment suggests, the clarification helps the math make more sense (emphasis added):
Normally the > that begins a block quote may be followed
optionally by a space, which is not considered part of the
content. In the following case > is followed by a tab,
which is treated as if it were expanded into three spaces.
Since one of these spaces is considered part of the
delimiter, foo is considered to be indented six spaces
inside the block quote context, so we get an indented
code block starting with two spaces.
Notice the added wording, "treated as if it were expanded into three spaces", which confirms that the first tab only adds "three spaces".
Therefore, as we have now established, we start with an angle bracket plus seven spaces. So first we break off the blockquote delimiter, which consists of the angle bracket and the first space (in the following examples the | is used to indicate where the parser breaks the string and should not be counted as a character):
>·|······
The text contained in the blockquote is now indented six spaces. Four of them are the code block delimiter:
>·|····|··
Which leaves two spaces at the start of the code block.
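The splitting steps above can be sketched in Python. The function `blockquote_code_content` is a hypothetical helper written for this one input shape (a block quote marker followed by an indented code block); it is not how a real CommonMark parser is structured, and it expands the tabs up front purely for illustration.

```python
def blockquote_code_content(line, tab_stop=4):
    """Return the code block content of a line like '>' + tabs + text."""
    expanded = line.expandtabs(tab_stop)  # "as if" tabs were spaces
    if not expanded.startswith(">"):
        raise ValueError("not a block quote line")
    rest = expanded[1:]
    if rest.startswith(" "):
        rest = rest[1:]  # the optional space belongs to the delimiter
    if not rest.startswith(" " * 4):
        raise ValueError("not an indented code block")
    return rest[4:]      # four spaces are the code block indent

content = blockquote_code_content(">\t\tfoo")
print(repr(content))
```

For `>` followed by two tabs and `foo`, seven spaces minus one for the blockquote delimiter minus four for the code indent leaves two, matching Example 6.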
Of course, as stated back at the beginning (of the section in the spec), the tabs aren't actually replaced with spaces, it just behaves as if they were. And that can be confusing at times. It may help to configure your text editor to always replace tabs with spaces and then you can avoid this confusion.

HTML textarea cuts off beginning new lines

An HTML text area works fine with new lines ("\n") when they're after any other content in the text area, whether it be whitespace characters like spaces or tabs ("\t") or not.
However, when text area content begins with a new line (for example, "\ntest"), that new line gets cut off on display.
Any ideas on what causes this/how to remedy it?
This seems to be by the spec.
A single newline may be placed immediately after the start tag of pre and textarea elements. If the element's contents are intended to start with a newline, two consecutive newlines thus need to be included by the author.
Note that in the past there were some bugs in the various browsers regarding leading new lines in elements:
https://bugzilla.mozilla.org/show_bug.cgi?id=591988
https://bugs.chromium.org/p/chromium/issues/detail?id=62901
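One way to work with the quoted rule is to always emit a newline right after the start tag when generating the markup, so that the parser drops that one and any newline at the start of the value survives. Below is a minimal Python sketch; `render_textarea` and `parsed_textarea_value` are hypothetical helpers, and the second is only a toy model of the parser rule, not a real HTML parser.

```python
import html

def render_textarea(value):
    """Serialize a value so that a leading newline is preserved."""
    # The extra "\n" after the start tag is the one the parser will discard.
    return "<textarea>\n" + html.escape(value, quote=False) + "</textarea>"

def parsed_textarea_value(markup):
    """Toy model: drop a single newline immediately after the start tag."""
    inner = markup[len("<textarea>"):-len("</textarea>")]
    if inner.startswith("\n"):
        inner = inner[1:]
    return html.unescape(inner)

print(repr(parsed_textarea_value("<textarea>\ntest</textarea>")))
print(repr(parsed_textarea_value(render_textarea("\ntest"))))
```

The first call shows the newline being lost, as in the question; the second shows it surviving once two consecutive newlines are emitted.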

What character should I use to maintain height of an empty (zero width) string?

I have a string that can potentially be empty, and in that case, I want to substitute it with a special character to maintain the ordinary text height while having zero width. In TeX, this would be called \strut. What is the counterpart for that in HTML? I came up with two candidates: &#x2060; and &#xfeff;. Should I use one of these?
On modern browsers, any zero-width character will do the job, provided that the browser either knows that the character is zero-width or uses a font that contains an empty glyph for it. But some characters may have effects, depending on the context and on software used to process the HTML file.
U+2060 WORD JOINER has the effect of preventing line break.
U+FEFF ZERO WIDTH NO-BREAK SPACE has the same effect. It is formally deprecated for any use except as Byte Order Mark, but in reality it works more often than WORD JOINER (though there are exceptions).
U+200B ZERO WIDTH SPACE has the effect of allowing a line break even when it would otherwise not be permitted; it’s like SPACE, but with zero width.
Usually the worst-case scenario for characters like this is an old version of IE. Checking in IE 6 shows that U+FEFF and U+200B are OK, but U+2060 shows as a small rectangle (i.e., the browser tries to render the character but finds no glyph for it).
So I’d use &#xfeff; or &#x200b; depending on whether I’d like to prevent or allow a line break at that point. If it does not matter, &#x200b; is more logical to use.
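As a small illustration, the three code points discussed above can be looked up with Python's `unicodedata` module and turned into numeric character references. The helper `strut_if_empty` is a hypothetical name; it simply encodes the choice between U+FEFF (prevent a break) and U+200B (allow a break).

```python
import unicodedata

CANDIDATES = ["\u2060", "\ufeff", "\u200b"]

def describe(ch):
    """Code point, Unicode name and hexadecimal character reference."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} -> &#x{ord(ch):X};"

def strut_if_empty(text, allow_break=True):
    """Substitute a zero-width character reference for an empty string."""
    if text:
        return text
    return "&#x200B;" if allow_break else "&#xFEFF;"

for ch in CANDIDATES:
    print(describe(ch))
print(strut_if_empty(""), strut_if_empty("", allow_break=False))
```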
I would suggest  or if zero width is not essential or if it is essential you could try the Unicode character ⁠ which is a zero width non-breaking space.