Right angle bracket in HTML - html

For obvious reasons, the left angle bracket cannot be used literally in HTML text. The status of the right angle bracket is not quite so clear. It does work when I try it, but of course browsers are notoriously forgiving of constructs that are not strictly valid HTML.
http://www.w3.org/TR/html4/charset.html#h-5.4 seems to be saying it's valid, though may not be supported by older browsers, but also makes specific mention of quoted attribute values. Is it necessary to html encode right angle brackets? also says it's valid but again specifically talks about quoted attribute values.
What's the answer for plain chunks of text (contents of a <pre> Element happens to be the case I'm looking at), and does it differ in any way?

The character “>” can be used as such as data character in any version of HTML, both in element content and in an attribute value. This follows from the lack of any statement to the contrary in the specifications.
It is often routinely escaped as >, which is valid but not required for any formal or technical reason. It is used partly because people assume it is needed the same way as the “<” character needs to be escaped, partly for symmetry: writing, say, <code> may look more symmetric than <code>.
The character “>” is the GREATER THAN character. It is used in many contexts, like HTML markup, as a delimiter of a kind, in a bracket-like manner, but the real angle brackets, as used in some mathematical notations, are rather different, such as “⟩” U+27E9. If you need to include angle brackets in an HTML document, you have some serious issues to consider, but they relate to fonts (and semantics), not to any potential clash with markup-significant characters.

Right angle brackets are legal within a <pre> tag or as text within an element.
There is no ambiguity when using them in this manner and parsers have no issue with "understanding" them.
Personally, I just escape these whenever I need to use them, just to match left angle brackets...

Related

HTML Minification: Whitespace between element attributes

I'd like to remove more unnecessary bytes from my output, and it seems it's acceptable (in practice) to strip what can add up to quite a lot of whitespace from HTML markup by omitting/collapsing the gaps between DOM element attributes.
Although I've tested and researched (a little in both cases), I'm wondering how safe it would be?
I tested in Chrome (43.0.2357.65 m), IE (11.0.9600.17801), FF (38.0.1) and Safari (5.1.7 (blah-di-blah)) and they didn't seem to mind, and couldn't find anything specific in The Specs about whitespace between attributes.
w3.org's Validator complains, which is a strong indication that this is not safe and shouldn't be expected to work, but (there's always a "but") it's possible the requirement for a space is only strict when no quotes are present (for obvious reasons).
Also (snippy but poignant): their SSL is "out of date" which doesn't inspire confidence in their opinion.
I noted also that someone's HTML compressor could (when enabled) strip quotes around attribute values where those values had no whitespace within them (e.g. id), which implies that at least most if not all HTML parsing is focussed on the text either side of the equals signs (except with booleans of course), and where quotes are in use, they'd be considered the prioritized delimiter.
So, would:
<!DOCTYPE html><html><body>
Yabba Dabba Doo!
</body></html>
▲ that ever go wrong, and if so, under which conditions?
What other reasons could there be to maintain this whitespace in production output (code "readability" is a non issue in this case)?
Update (since finding an answer):
Although I basically answered my own question insofar that there is a specification governing whether there should be a space between attributes, I still wonder if omitting them when using quoted values can be considered practically safe, and would appreciate feedback on this point.
Considering how often spaces may be omitted by accident in production HTML, and that the browsers I tested don't seem to mind when they are, I assume it would be very rare if ever that a browser failed to handle documents with these spaces omitted.
Although it's sensible to follow the specs in pretty much all situations, might this be one time cheating a bit could be acceptable?
After all - if we can magically save several hundred bytes without affecting the quality of the output, why not?
There is a specification (after all)
It turns out I should have looked harder. My bad.
According to these specs:
If an attribute using the empty attribute syntax is to be followed by another attribute, then there must be a space character separating the two.
and
If an attribute using the unquoted attribute syntax is to be followed by another attribute or by the optional U+002F SOLIDUS character (/) allowed in step 6 of the start tag syntax above, then there must be a space character separating the two.
and
If an attribute using the single-quoted attribute syntax is to be followed by another attribute, then there must be a space character separating the two.
and
If an attribute using the double-quoted attribute syntax is to be followed by another attribute, then there must be a space character separating the two.
Which unless I am mistaken (again), means there must always be spaces between attributes.
You could try online HTML minifiers like http://www.whak.ca/minify/HTML.htm or http://www.scriptcompress.com/minify-HTML.htm (search google for more) and find little things they change for hints to what can be taken out yet still render the HTML code.
On the first link your code:
<!DOCTYPE html><html><body>
Yabba Dabba Doo!
</body></html>
Turns into:
<!DOCTYPE html><html><body>Yabba Dabba Doo!
saving you 18 bytes already...

How to represent negative number in HTML

Simple question, what is the proper character to use for representing a negative number? Should I use a normal dash, a minus entity or is there a more appropriate entity to use?
To be more clear, in an HTML document if I want to display a negative temperature, should I use:
-5 °C
or
−5 °C
or something else?
There are two separate questions here: 1) what do you use as the character that acts as the sign of a negative number (or, more generally, how you write a negative number with characters), and 2) how do you represent that character in HTML. Only the latter is HTML-specific and can thus be considered on-topic at SO.
However, question 1 is, too, somewhat programming-related. In most computer languages (such as JavaScript, HTML, and CSS), a negative number is almost always written using the common ASCII hyphen, officially HYPHEN-MINUS, “-”, U+002D. For this reason, people often use the ASCII hyphen in general texts, too, and even in mathematical texts. This typically violates the rules of human languages. In most languages, the MINUS SIGN “−” U+2212 is preferred. It is also typographically much better, especial in quality fonts, where e.g. “−42” has a noticeable sign and in “-42” the ASCII hyphen is less noticeable. The MINUS SIGN also has better line breaking properties: web browsers do not treat it as allowing a line break between it and the following digit, as they may well do to the ASCII hyphen.
Having chosen to use the minus sign, the simplest approach is to use the character “−” itself. For this, you need some method of inputting it. You also need to take care of character encoding issues, normally using UTF-8, but this is something that should be done anyway.
You can also use the named character reference −. It stands for the minus sign, and it might be convenient casually when you need to use the character but lack a convenient quick way of typing it.
Here you can find different operators representation in html.
The other answers give you the correct information on the html, in that there are character codes that mean "minus", and the downside of using the hyphen is that the browser could word-wrap the number and put the digits on the line after the hyphen.
However, you asked "Should I use a normal dash"? That really depends on the context. In particular, if you want the user to be able to copy your text and paste it into another program, and have that program interpret your negative numbers as negative numbers, you are going to have to use a hyphen.
For example, copy the following two lines and paste it into an Excel spreadsheet:
−45
-45
You will notice that the first number is treated as text, and the second is treated as a number, even though according to the html, the opposite should happen.
To use a hyphen with negative numbers, and prevent wrapping, use a white-space style of "nowrap".

HTML - Importance of Double Quotes in HTML Tags

i have been working with HTML for some time, recently a question popped in my mind that when ever give tags
<input type = "text" name="c-id" />
<input type =text name=c-id />
with or without quotes there is no difference, when working with V-Studio, this IDE doesn't put quotes automatically, but while working with the Dreamweaver if i remember correctly it does put the quotes automatically...
what i want is to know is that at what point\s the absence of quotes makes a difference or creates a problem, and what is the best practice, always go quoted or quotes-less...
p.s. there is very good chance this question has been asked before but didn't show up in the search
It makes a difference when the attribute value contains a space, linefeed, formfeed, or tab character, because the attribute value will end at any one of those characters if the value is not wrapped in quotes.
From an IDE point of view, it's little more than a matter of preference whether to add the quotes or not when the attribute value does not contain one of those whitespace characters.
In XHTML syntax, the quotation marks are always required (but you can alternatively use single quotes, i.e. 'apostrophes'). This does not matter if the page is served with text/html content type, as it almost always is: browsers will parse it as sloppy HTML, not as real XHTML. But if the page is served with an XML content type, or if it is opened in a program that expects XHTML or other XML, then lack of quotation marks causes Draconian error processing: only an error message is shown, not the content at all.
In the example case, XHTML syntax is used otherwise: the “/” before closing “>” belongs to XHTML, not HTML. It makes little point to write XHTML in that respect but not consistently.
In HTML syntax, the formal rules depend on HTML version. HTML5 is much more permissive than HTML 4.01. For example, <a href=/foo/bar/ title=What???> is valid HTML5 but not valid HTML 4.01. This mostly matters in validation and depend on which version of HTML you wish to validate against (i.e., which one do you mostly try to comply with). In this issue, HTML5 reflects browser practices: browsers have long been more permissive than HTML 4.01.
By HTML5 rules, the quotes (though always allowed) are needed only when the attribute value contains any of the following characters: space, tab, line feed, form feed, carriage return, quotation mark ("), apostrophe ('), equals sign (=), less-than sign (<), greater-than sign (>), or grave (`), or is empty.
There are also opinions as well as coding style guides and other recommendations on the matter. Most people who think about such issues apparently favor the “safe” way of always putting quotes around attribute values. Some people think the quotes improve readability of code; others think they reduce it.
Not putting your values into quotes may cause some problems, ie you can't provide multiple css classes for your elements. Browser will only understand the first value, and the page won't pass validation.
You should always put your values into quotes. The markup is easier to read and you won't run into some unexpected browser behaviour.
For me, any practice that improves code/markup (re-)readability is worth making a habit, and having quotes on the attribute values visibly separates one property from another (readability again).
On the technical side, some attribute values have spaces and the whole tag might not work without the quotes on these. Those that come to mind: img's alt, multiple class values, title (for tooltips), an href with a url that has the equal sign (without quotes, the first equal in href= is doomed), etc.
And here is the almost same question in SO.
--
My thanks to Alohci for enlightening me. I can't up vote yet.

HTML: Should I encode greater than or not? ( > > )

When encoding possibly unsafe data, is there a reason to encode >?
It validates either way.
The browser interprets the same either way, (In the cases of attr="data", attr='data', <tag>data</tag>)
I think the reasons somebody would do this are
To simplify regex based tag removal. <[^>]+>? (rare)
Non-quoted strings attr=data. :-o (not happening!)
Aesthetics in the code. (so what?)
Am I missing anything?
Strictly speaking, to prevent HTML injection, you need only encode < as <.
If user input is going to be put in an attribute, also encode " as ".
If you're doing things right and using properly quoted attributes, you don't need to worry about >. However, if you're not certain of this you should encode it just for peace of mind - it won't do any harm.
The HTML4 specification in its section 5.3.2 says that
authors should use ">" (ASCII decimal 62) in text instead of ">"
so I believe you should encode the greater > sign as > (because you should obey the standards).
Current browsers' HTML parsers have no problems with uquoted >s
However, unfortunately, using regular expressions to "parse" HTML in JS is pretty common. (example: Ext.util.Format.stripTags). Also poorly written command line tools, IDEs, or Java classes etc. may not be sophisticated enough to determine the limiter of an opening tag.
So, you may run into problems with code like this:
<script data-usercontent=">malicious();//"></script>
(Note how the syntax highlighter treats this snippet!)
Always
This is to prevent XSS injections (through users using any of your forms to submit raw HTML or javascript). By escaping your output, the browser knows not to parse or execute any of it - only display it as text.
This may feel like less of an issue if you're not dealing with dynamic output based on user input, however it's important to at least understand, if not to make a good habit.
Yes, because if signs were not encoded, this allows xss on forms social media and many other because a attacker can use <script> tag. If you parse the signs the browser would not execute it but instead show the sign.
Encoding html chars is always a delicate job. You should always encode what needs to be encoded and always use standards. Using double quotes is standard, and even quotes inside double quotes should be encoded. ENCODE always. Imagine something like this
<div> this is my text an img></div>
Probably the img> will be parsed from the browser as an image tag. Browsers always try to resolve unclosed tags or quotes. As basile says use standards, otherwise you could have unexpected results without understanding the source of errors.

What are all the HTML escaping contexts?

When outputting HTML, there are several different places where text can be interpreted as control characters rather than as text literals. For example, in "regular" text (that is, outside any element markup):
<div>This is regular text</div>
As well as within the values of attributes:
<input value="this is value text">
And, I believe, within HTML comments:
<!-- This text here might be programmatically generated
and could, in theory, contain the double-hyphen character
sequence, which is verboten inside comments -->
Each of these three kinds of text has different rules for how it must be escaped in order to be treated as non-markup. So my first question is, are there any other contexts in HTML in which characters can be interpreted as markup/control characters? The above contexts clearly have different rules about what needs to be escaped.
The second question is, what are the canonical, globally-safe lists of characters (for each context) that need to be escaped to ensure that any embedded text is treated as non-markup? For example, in theory you only need to escape ' and " in attribute values, since within an attribute value only the closing-delimiter character (' or " depending on which delimiter the attribute value started with) would have control meaning. Similarly, within "regular" text only < and & have control meaning. (I realize that not all HTML parsers are identical. I'm mostly interested in what is the minimum set of characters that need escaping in order to appease a spec-conforming parser.)
Tangentially: The following text will throw errors as HTML 4.01 Strict:
foo
Specifically, it says that it doesn't know what the entity "&y" is supposed to be. If you put a space after the &, however, it validates just fine. But if you're generating this on the fly, you're probably not going to want to check whether each use of & will cause a validation error, and instead just escape all & inside attribute values.
<div>This is regular text</div>
Text content: & must be escaped. < must be escaped.
If producing a document in a non-UTF encoding, characters that do not fit inside the chosen encoding must be escaped.
In XHTML (and XML in general), the sequence ]]> must not occur in text content, so in that specific case one of the characters in that sequence must be escaped, traditionally the >. For consistency, the Canonical XML specification chooses to escape > every time in text content, which is not a bad strategy for an escaping function, though you can certainly skip it for hand-authoring.
<input value="this is value text">
Attribute values: & must be escaped. The attribute value delimiter " or ' must be escaped. If no attribute value delimiter is used (don't do that) no escape is possible.
Canonical XML always chooses " as the delimiter and therefore escapes it. The > character does not need to be escaped in attribute values and Canonical XML does not. The HTML4 spec suggested encoding > anyway for backwards compatibility, but this affects only a few truly ancient and dreadful browsers that no-one remembers now; you can ignore that.
In XHTML < must be escaped. Whilst you can get away with not escaping it in HTML4, it's not a good idea.
To include tabs, CR or LF in attribute values (without them being turned into plain spaces by the attribute value normalisation algorithm) you must encode them as character references.
For both text content and attribute values: in XHTML under XML 1.1, you must escape the Restricted Characters, which are the Delete character and C0 and C1 control codes, minus tab, CR, LF and NEL. In total, [\x01-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]. The null character may not be included at all even escaped in XML 1.1. Outside XML 1.1 you can't use any of these characters at all, nor is there a good reason you'd ever want to.
<!-- This text here might be programmatically generated
and could, in theory, contain the double-hyphen character
sequence, which is verboten inside comments -->
Yes, but since there is no escaping possible inside comments, there is nothing you can do about it. If you write <!-- < -->, it literally means a comment containing “ampersand-letter l-letter t-semicolon” and will be reflected as such in the DOM or other infoset. A comment containing -- simply cannot be serialised at all.
<![CDATA[ sections and <?pi​s in XML also cannot use escaping. The traditional solution to serialise a CDATA section including a ]]> sequence is to split that sequence over two CDATA sections so it doesn't occur together. You can't serialise it in a single CDATA section, and you can't serialise a PI with ?> in the data.
CDATA-elements like <script> and <style> in HTML (not XHTML) may not contain the </ (ETAGO) sequence as this would end the element early and then error if not followed by the end-tag-name. Since no escaping is possible within CDATA-elements, this sequence must be avoided and worked around (eg. by turning document.write('</p>') into document.write('<\/p>');. (You see a lot of more complicated silly strategies to get around this one, like calling unescape on a JS-%-encoded string; even often '</scr'+'ipt>' which is still quite invalid.)
There is one more context in HTML and XML where different rules apply, and that's in the DTD (including the internal subset in the DOCTYPE declaration, if you have one), where the % character has Special Powers and would need to be escaped to be used literally. But as an HTML document author it is highly unlikely you would ever need to go anywhere near that whole mess.
The following text will throw errors as HTML 4.01 Strict:
foo
Yes, and it's just as much an error in Transitional.
If you put a space after the &, however, it validates just fine.
Yes, under SGML rules anything but [A-Za-z] and # doesn't start parsing as a reference. Not a good idea to rely on this though. (Of course, it's not well-formed in XHTML.)
The above contexts clearly have different rules about what needs to be escaped.
I'm not sure that the different elements have different encoding rules like you say. All the examples you list require the HTML encoding.
E.g.
<h1>Fish & Chips</h1>
<img alt="Awesome picture of Meat Pie & Chips" />
Fish & Chips
The last example includes some URL Encoding for the ampersand too (&) and its at this point things get hairy (sending an ampersand as data, which is why it must be encoded).
So my first question is, are there any other contexts in HTML in which characters can be interpreted as markup/control characters?
Anywhere within the HTML document, if the control characters are not being used as control characters, you should encode them (as a good rule of thumb). Most of the time, its HTML Encoding, & or > etc. Othertimes, when trying to pass these characters via a URL, use URL Encoding %20, %26 etc.
The second question is, what are the canonical, globally-safe lists of characters (for each context) that need to be escaped to ensure that any embedded text is treated as non-markup?
I'd say that the Wikipedia article has a few good comments on it and might be worth a read - also the W3 Schools article I guess is a good point. Most languages have built in functions to prepare text as safe HTML, so it may be worth checking your language of choice (if you are indeed even using any scripting languages and not hand coding the HTML).
Specifically, Wikipedia says: "Characters <, >, " and & are used to delimit tags, attribute values, and character references. Character entity references <, >, " and &, which are predefined in HTML, XML, and SGML, can be used instead for literal representations of the characters."
For URL Encoding, this article seems a good starting point.
Closing thoughts as I've already rambled a bit: This is all excluding the thoughts of XML / XHTML which brings a whole other ballgame to the court and its requirement that pretty much the world and its dog needs to be encoded. If you are using a scripting language and writing out a variable via that, I'm pretty sure it'll be easier to find the built in function, or download a library that'll do this for you. :) I hope this answer was scoped ok and didn't miss the point or question or come across in the wrong tone. :)
If you are looking for the best practices to escape characters in web browsers (including HTML, JavaScript and style sheets), the XSS prevention cheat sheet by Michael Coates is probably what you're looking for. It includes a description of the different interpretation contexts, tables indicating how to encode characters in each context and code samples (using ESAPI).
http://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet
Beware that <script> followed by <!-- followed by <script> again, enters double-escaped state, in which you probably never want to be, so ideally you should escape < with "\u003C" within your script's strings (and regexps) to not trigger it accidentally.
You can read more about it here http://qbolec-memdump.blogspot.com/2013/11/script-tag-content-madness.html
If you are this concerned about the validity of the final HTML, you might consider constructing the HTML via a DOM, versus as text.
You don't say what environment you are targeting.