HTML special characters in CSS content, using attr()

Relevant codepen: http://codepen.io/anon/pen/ocptF/
EDIT: The codepen uses Jade, and thus messes a few things up. I was not aware of this when starting this question.
Essentially, I thought CSS attr() would copy over an HTML attribute literally, but that is not the case.
I'd like to use the CSS attr() function to fill in the content for some pseudo-elements. However, it prints out 004 when the HTML attribute is set to \f004, and 08a when it is set to \f08a.
Relevant lines:
HTML:
<div class="zoomfade" data-fill='\f004' data-unfill='\f08a'></div>
CSS:
.zoomfade:before {
content: attr(data-unfill);
position: absolute;
}
.zoomfade:before {
content: attr(data-fill);
position: absolute;
}
Thanks!

Escape sequences in CSS are only treated specially in CSS syntax. When you specify it in an HTML attribute value then use that value in CSS attr(), it is taken literally. From the CSS2.1 spec:
attr(X)
This function returns as a string the value of attribute X for the subject of the selector. The string is not parsed by the CSS processor. [...]
Since you're specifying character codes in HTML attribute values, you can either use HTML character references, entity references or the Unicode characters themselves. It's worth noting that the two character codes you have do not appear to be valid, however, so they may not work at all.
EDIT: [...] Essentially, I thought CSS attr() would copy over an HTML attribute literally, but that is not the case.
It copies the attribute value according to the DOM, which may be different from how it is represented in the source, e.g. the source markup, or the script that is generating the element.
For example, if the source is represented in raw HTML markup, then as I mention above, you will need to use HTML character escapes, because HTML is parsed by an HTML parser. If the elements are generated using a JS-based template engine such as Jade, then the character escapes take the form of \u followed by the hexadecimal code-points. In both cases, the respective parsers will translate the escape sequences into their representative characters, and the characters themselves are what is stored in the DOM as part of the attribute value.
Of course, again there's always the alternative of just using the Unicode characters directly. If your source files are all encoded as UTF-8, you should have no problem using the characters directly.
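To make the difference concrete, here is a minimal sketch in Python. It uses the standard library's html.parser as a stand-in for a browser's HTML parser (an assumption for illustration only; browsers have their own parsers), and AttrCollector and parsed_attrs are hypothetical names. It shows that a CSS-style escape in an attribute value stays as literal text, while an HTML character reference is decoded to the character itself before it is stored as the attribute value.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Records the attribute values the parser reports for start tags."""

    def __init__(self):
        super().__init__()
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; character references
        # in the values have already been decoded by the parser.
        self.attrs.update(attrs)


def parsed_attrs(markup):
    """Hypothetical helper: parse markup and return the attributes seen."""
    collector = AttrCollector()
    collector.feed(markup)
    collector.close()
    return collector.attrs


# A CSS-style escape means nothing to an HTML parser:
# the value stays as five literal characters (backslash, f, 0, 0, 4).
css_style = parsed_attrs("<div data-fill='\\f004'></div>")["data-fill"]

# An HTML character reference is decoded to the single character U+F004.
html_style = parsed_attrs("<div data-fill='&#xf004;'></div>")["data-fill"]

print(repr(css_style))
print(repr(html_style))
```

Whatever string ends up in the parsed attribute is what attr() would hand to the content property, so only the second form yields the intended glyph.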

Related

Valid and Invalid HTML tags

So recently I found a question that was
Which of the following is a valid tag?
<213person>
<_person> (This is given as the right answer)
Both
None
(Note: this is the explanation that was given:- Valid HTML tags are surrounded by the angle brackets and the tag name can only either start from an alphabet or an underscore(_))
As far as I know, none of the standard tags start with an underscore, and according to what I've read about custom HTML tags, the name has to start with a letter (I tested it, and a custom tag starting with any character that's not a letter doesn't work). So in my opinion, and according to what I tested, HTML tags can only start with a letter or ! (in the case of !-- -- and !DOCTYPE HTML).
What I want to know is if the given explanation is correct or not and if it's correct then can someone provide some proper documentation and working examples for it?
As mentioned by @Rob, the standard defines a valid tag name as a string containing only alphanumeric ASCII characters, being:
0-9|a-z|A-Z
However, browsers handle things differently.
There are a few main points I've noticed where browser behavior doesn't align with the current standard.
Tag names must start with a letter
If a tag name starts with any character outside a-z|A-Z, the start tag ends up being interpreted as text and the end tag gets converted into a comment.
Special characters can be used
The following HTML is valid in a lot of browsers and will create an element:
<Z[\]^_`a></Z[\]^_`a>
This seems to be browsers only checking if the characters are ASCII. The only exception is the first character (as stated above).
Initially, I thought this was a simplified check, so instead of [A-Z]|[a-z] they checked [A-z], but you can use any character outside this range.
This makes the following HTML also "valid" in the eyes of certain browsers:
<a!></a!>
<aʬ></aʬ>
<a͢͢͢></a͢͢͢>
<a͢͢͢ʬ͢ʬ͢ʬ͢ʬ͢ʬ͢ʬ͢ΘΘΘΘ></a͢͢͢ʬ͢ʬ͢ʬ͢ʬ͢ʬ͢ʬ͢ΘΘΘΘ>
<a></a>
I tested the HTML elements in both Chrome and Firefox; I didn't test any other browsers. I also didn't test every ASCII character, just some with very high and very low character codes.
From the HTML standard:
Start tags must have the following format:
The first character of a start tag must be a U+003C LESS-THAN SIGN
character (<). The next few characters of a start tag must be the
element's tag name.
So what is allowed in the element's tag name? This is defined just above:
Tags contain a tag name, giving the element's name. HTML elements all
have names that only use ASCII alphanumerics. In the HTML syntax, tag
names, even those for foreign elements, may be written with any mix of
lower- and uppercase letters that, when converted to all-lowercase,
matches the element's tag name; tag names are case-insensitive.
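As a rough sketch of how the quiz options fare, the check below combines two statements from this thread: the observation that a tag name must start with a letter, and the standard's note that HTML element names only use ASCII alphanumerics. The function name looks_like_standard_tag_name is hypothetical, and this only models those two statements, not how a browser's parser actually tokenizes markup.

```python
import re

# A letter first, then only ASCII letters and digits. This pattern is an
# assumption that merges the two rules quoted in the thread.
TAG_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def looks_like_standard_tag_name(name):
    """Return True if name fits the letter-first, ASCII-alphanumeric pattern."""
    return TAG_NAME.fullmatch(name) is not None


for candidate in ["person", "h1", "213person", "_person"]:
    print(candidate, looks_like_standard_tag_name(candidate))
```

Under this reading, neither 213person nor _person qualifies, which matches the test results described in the question.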

Random Letter html Tag

I was wondering if you can use a random letter as an html tag. Like, f isn't a tag, but I tried it in some code and it worked just like a span tag. Sorry if this is a bad question, I've just been curious about it for a while, and I couldn't find anything online.
I was wondering if you can use a random letter as an html tag.
Yes and no.
"Yes" - in that it works, but it isn't correct: when you have something like <z> it only works because the web (HTML+CSS+JS) has a degree of forwards compatibility built-in: browsers will render HTML elements that they don't recognize basically the same as a <span> (i.e. an inline element that doesn't do anything other than reify a range of the document's text).
However, to use HTML5 Custom Elements correctly you need to conform to the Custom Elements specification which states:
The name of a custom element must contain a dash (-). So <x-tags>, <my-element>, and <my-awesome-app> are all valid names, while <tabs> and <foo_bar> are not. This requirement is so the HTML parser can distinguish custom elements from regular elements. It also ensures forward compatibility when new tags are added to HTML.
So if you use <my-z> then you'll be fine.
The HTML Living Standard document, as of 2021-12-04, indeed makes an explicit reference to forward-compatibility in its list of requirements for custom element names:
https://html.spec.whatwg.org/#valid-custom-element-name
They start with an ASCII lower alpha, ensuring that the HTML parser will treat them as tags instead of as text.
They do not contain any ASCII upper alphas, ensuring that the user agent can always treat HTML elements ASCII-case-insensitively.
They contain a hyphen, used for namespacing and to ensure forward compatibility (since no elements will be added to HTML, SVG, or MathML with hyphen-containing local names in the future).
They can always be created with createElement() and createElementNS(), which have restrictions that go beyond the parser's.
Apart from these restrictions, a large variety of names is allowed, to give maximum flexibility for use cases like <math-α> or <emotion-😍>.
So, by example:
<a>, <q>, <b>, <i>, <u>, <p>, <s>
No: these single-letter elements are already used by HTML.
<z>
No: element names that don't contain a hyphen - cannot be custom elements and will be interpreted by present-day browsers as invalid/unrecognized markup that they will nevertheless (largely) treat the same as a <span> element.
<a:z>
No: using a colon to use an XML element namespace is not a thing in HTML5 unless you're using XHTML5.
<-z>
No - the element name must start with a lowercase ASCII character from a to z, so - is not allowed.
<a-z>
Yes - this is fine.
<a-> and <a-->
Unsure - these two names are curious:
The HTML spec says the name must match the grammar rule [a-z] (PCENChar)* '-' (PCENChar)*.
The * denotes "zero-or-more" which is odd, because that implies the hyphen doesn't need to be followed by another character.
PCENChar represents a huge range of visible characters permitted in element names, curiously this includes -, so by that rule <a--> should be valid.
But note that -- is a reserved character sequence in the greater SGML-family (including HTML and XML) which may cause weirdness. YMMV!
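The grammar rule quoted above can be tried out with a small regular expression. This is a sketch under a stated simplification: PCENChar is reduced to its ASCII members (hyphen, dot, underscore, digits, lowercase letters), so names such as <math-α> are out of scope, and the handful of reserved hyphenated names the spec excludes is not checked. The names ASCII_PCEN_CHAR and matches_custom_name_grammar are hypothetical.

```python
import re

# ASCII-only simplification of PCENChar. The real production also allows
# many non-ASCII ranges, which this sketch deliberately leaves out.
ASCII_PCEN_CHAR = r"[-._0-9a-z]"

# [a-z] (PCENChar)* '-' (PCENChar)*
CUSTOM_NAME = re.compile(rf"[a-z]{ASCII_PCEN_CHAR}*-{ASCII_PCEN_CHAR}*")


def matches_custom_name_grammar(name):
    """Return True if name matches the simplified grammar rule."""
    return CUSTOM_NAME.fullmatch(name) is not None


for name in ["z", "a:z", "-z", "a-z", "my-z", "a-", "a--", "foo_bar"]:
    print(name, matches_custom_name_grammar(name))
```

By the grammar alone, a- and a-- both match, which is consistent with the reading that nothing has to follow the hyphen. Whether browsers handle them gracefully is a separate question.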

Displaying tags in HTML5

In HTML5, if you include <pre>, for example, in the text of a paragraph, the result won't display '<pre>' in the paragraph; instead, it will run the command <pre> on the words after it.
What do I have to do to display text that includes characters like " " or <>, without running the command?
How can I accomplish this?
What you're looking for are known as HTML entities: escape sequences for reserved characters that would otherwise be parsed as HTML. Using these entities allows you to write out characters that would usually be interpreted as markup.
For example, attempting to write out the <pre> tag within a parent <pre> tag will normally result in the inner tag being treated as HTML:
<pre><pre>The relevant tags surround this text</pre></pre>
Using the HTML entities &lt; and &gt; for the left and right angle brackets respectively, however, causes them to be displayed as text rather than parsed as a tag:
<pre>&lt;pre&gt;The relevant tags surround this text&lt;/pre&gt;</pre>
A full list of HTML entities can be found here.
Hope this helps! :)
The 'best' way is to replace every < and > character with &lt; and &gt;.
But if you want to do it fast, you can use the xmp tag. It's deprecated but is still supported by all browsers.
<xmp>
<div>Lorem ipsum</div>
<p>Hello</p>
</xmp>
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/xmp
You need to escape the provided content to display it as-is, without it being interpreted as HTML.
https://github.com/sindresorhus/escape-goat
This involves taking the reserved characters for the given language (e.g. HTML) and converting them to a representation that either uses an escape sequence or only uses unreserved characters.
In this case, HTML prescribes the use of entities to display characters that would otherwise be used by the syntax for tags and attributes within the source code itself.
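If the text to display is produced by a program rather than typed by hand, the escaping step can be automated. The sketch below uses Python's standard html.escape purely as an illustration (the escape-goat package linked above plays the same role in JavaScript); the variable names are made up for the example.

```python
import html

# Markup we want to show to the reader as text, not have interpreted.
snippet = "<pre>The relevant tags surround this text</pre>"

# html.escape replaces &, < and > (and, by default, both quote characters)
# with the corresponding character references.
escaped = html.escape(snippet)
print(escaped)

# Quotes and ampersands are handled too, which matters inside attribute values.
quoted = html.escape('say "hi" & wave')
print(quoted)
```

The escaped string can then be placed inside a <pre> or any other element and will render as the literal tag text.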

In what scopes do special HTML characters need to be escaped?

In HTML,
Dust & Bones
needs to be escaped as follows:
Dust &amp; Bones
What's the scope of where &amp; needs to be applied? Is it just href, or is it anywhere within HTML text? What about
<input value="http://... & ">?
or within
<script>... & ... </script>
do these need escaping?
update
The bigger question, which would explain this, is: when does the HTML parser look for &XXX; tokens and replace them? Is it done once on the whole document, or do different rules apply for the text between tags vs. attribute values within a tag vs. within tagA vs. within tagB? Different parsing rules seem to apply within <script>, so I may write && (for AND) and < (for LESS-THAN). So, what rules apply in which scopes?
The rules vary depending on the version of HTML you are dealing with, but they are always more complex than is worth trying to remember.
The safe approach is "Use character references to represent the 5 HTML special characters everywhere except inside script and style elements", which makes you safe for everything except XHTML.
For XHTML the rule is the same with the additional proviso of "and use explicit CDATA sections in script and style elements".
The bigger question, which would explain this, is, when does the HTML parser look for &XXX; tokens and replace them?
As it parses the HTML (depending on what the current state of the tokeniser is ("inside start tag" and "inside attribute value" are examples of different states)).
Is it done once on the whole document
Unless you trigger additional HTML parsing (e.g. by setting innerHTML on an element).
or do different rules apply for the text between tags vs. attribute values within a tag vs. within tagA vs. within tagB
Different rules apply in different places. The complete, current rules are (as I suggested in a comment) rather complex and would require a lot of work to extract from the HTML 5 parsing rules. This is why I suggest, if you are an HTML author and not a browser author, using the simpler rules of "Use character references unless you are in a script or style element".
-- different parsing rules seem to apply within <script>, so I may write && (for AND) and < (for LESS-THAN). So, what rules apply in which scopes?
In HTML 4 terms, script and style elements are defined as containing CDATA (where the only sequence of characters with special meaning in HTML is </, which terminates the CDATA section). Everywhere else in the document (including, counter-intuitively, attribute values that are defined as containing CDATA), & indicates the start of a character reference (although there might be a few exceptions based on what the character following the & is).
The HTML 5 rules are more complicated, but the basic principle of "It is safe and sane to use character references for &, <, >, " and ' everywhere except inside script and style elements" holds.
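The scope difference can be observed with any parser that follows these rules. The sketch below uses Python's standard html.parser as a stand-in for a browser (an assumption for illustration; it is not the HTML 5 parsing algorithm), and ScopeCollector is a hypothetical helper class. It feeds the same &amp; reference in element text, in an attribute value, and inside a script element.

```python
from html.parser import HTMLParser


class ScopeCollector(HTMLParser):
    """Records text and attributes per tag so the scopes can be compared."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.current = None
        self.text = {}
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        self.current = tag
        self.attrs[tag] = dict(attrs)

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        # Data may arrive in several chunks, so accumulate it per tag.
        if self.current is not None:
            self.text[self.current] = self.text.get(self.current, "") + data


collector = ScopeCollector()
collector.feed(
    '<p>Dust &amp; Bones</p>'
    '<input value="a &amp; b">'
    '<script>if (a &amp;&amp; b) {}</script>'
)
collector.close()

print(collector.text["p"])                # element text: reference decoded
print(collector.attrs["input"]["value"])  # attribute value: reference decoded
print(collector.text["script"])           # script content: left untouched
```

The script content comes through with &amp;&amp; intact, which is why character references must not be used for operators inside script elements in the HTML syntax.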

Can data-* attribute contain HTML tags?

I.E. <img src="world.jpg" data-title="Hello World!<br/>What gives?"/>
As far as I understand the guidelines, it is basically valid, but it's better to use HTML entities.
From the HTML 4 reference:
You should also escape & within attribute values since entity references are allowed within cdata attribute values. In addition, you should escape > as &gt; to avoid problems with older user agents that incorrectly perceive this as the end of a tag when coming across this character in quoted attribute values.
From the HTML 5 reference:
Except where otherwise specified, attributes on HTML elements may have any string value, including the empty string. Except where explicitly stated, there is no restriction on what text can be specified in such attributes.
So the best thing to do, as @tdammers already says, is to escape these characters (quoting the W3C reference)
&amp; to represent the & sign.
&lt; to represent the < sign.
&gt; to represent the > sign.
&quot; to represent the " mark.
and decoding them from their entity values if they are to be used as HTML.
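The escape-then-decode round trip described above might look like this when the markup is generated by a program. This is a sketch using Python's standard html module; the variable names and the way the tag is assembled are illustrative assumptions, not part of any answer in this thread.

```python
import html

# The markup we want to carry inside the data-title attribute.
title = 'Hello World!<br/>What gives?'

# Escape &, <, > and quotes so the value is safe inside a quoted attribute.
escaped = html.escape(title, quote=True)
tag = f'<img src="world.jpg" data-title="{escaped}"/>'
print(tag)

# Decoding the references gives back the original markup, which can then be
# inserted as HTML if that is what the script consuming the attribute wants.
restored = html.unescape(escaped)
print(restored)
```

In a browser the decoding step happens automatically when the attribute is parsed, so reading the attribute from the DOM already returns the restored string.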
Providing you're serving it as text/html, then yes it's valid.
Note that not only is it possible to include markup inside attributes, but the HTML5 srcdoc attribute on the iframe element positively encourages it. The HTML5 draft says:
In the HTML syntax, authors need only
remember to use U+0022 QUOTATION MARK
characters (") to wrap the attribute
contents and then to escape all U+0022
QUOTATION MARK (") and U+0026
AMPERSAND (&) characters, ....
Note, that when served with an XML content type (e.g. application/xhtml+xml), it is not valid, or even well-formed.
I'd say yes, as in it's still valid HTML5. Older browsers (which ones?) may not parse correctly.
Section 3.2.4.1 Attributes of the current HTML5 draft says this:
Except where otherwise specified, attributes on HTML elements may have any string value, including the empty string. Except where explicitly stated, there is no restriction on what text can be specified in such attributes.
HTML tags inside attributes also validate at http://html5.validator.nu
No. That would be invalid - HTML does not allow < or > inside attributes.
<img src="world.jpg" data-title="Hello World!&lt;br/&gt;What gives?"/> would be valid, but it would display the <br/> literally, not as a newline.