ASCII terminology and HTML - html

I have a question about ASCII code and HTML.
Most sites state what ASCII is but then mention things like HTML alternative or HTML code. Is this still ASCII?
Anyway, my actual question is: is &lt; ASCII (if not, what 'language' is it)?

ASCII is an encoding: it defines how the characters you see are encoded as 0s and 1s (in practice, as bytes). That is entirely separate from how a browser displays the characters it decodes from an HTML file.
You can send a browser a file containing the characters &lt; in any encoding, be it UTF-8, ASCII, or another one.
&lt; is a character entity reference, coming from SGML and defined in both XML and HTML.
Here's the official reference about HTML4 character entities.

&lt; is an HTML entity. HTML entities are used when a character cannot safely be written literally in the markup. For example, if you wanted a less-than sign within the content of your page, a literal < would be interpreted by the browser as the start of a new tag. Using an HTML entity tells the browser to render the actual character and not read it as the start of a tag.
http://www.w3schools.com/html/html_entities.asp

The notation &lt; consists of four characters, which all have a representation in the ASCII character code, but that’s immaterial. In HTML (as well as in SGML and XML), the notation denotes the LESS-THAN character “<”, in most contexts. The “<” character, too, has a representation in ASCII, but this too is immaterial.
People often use the expression “ASCII character” to denote a character that has a representation in ASCII, i.e. an ASCII code. In reality, the characters need not be ASCII encoded. But the concept “ASCII character” is still useful for some practical purposes. And using it, we can say that &lt; is a sequence of ASCII characters that denotes an ASCII character.
The “language” here is really a set of markup languages in which some sequences of ASCII characters are defined to mean certain (ASCII or non-ASCII) characters.
The need for using &lt; (when you wish to include “<” as text content) stems from the principle that in most contexts in HTML, the “<” character starts a tag, instead of being taken as such.
Things like &lt; are called entity references in the SGML tradition, though in HTML contexts the term is often prefixed with the word “character” to emphasize that the predefined entities of HTML all evaluate to single characters. The HTML5 drafts, abandoning the SGML tradition, use the term “named character references” instead.
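To make the distinction concrete, here is a small Python sketch (Python is used purely for illustration; it is not part of the original answers). It treats &lt; as plain text, checks that its four characters are all ASCII, and then resolves the reference with the standard library's html.unescape, roughly the way an HTML parser would.

```python
import html

reference = "&lt;"

# The notation itself is four characters, each of which has an ASCII code.
char_count = len(reference)
all_ascii = reference.isascii()

# Its bytes are identical in ASCII and UTF-8, so the file encoding is immaterial.
same_bytes = reference.encode("ascii") == reference.encode("utf-8")

# Resolving the reference yields the single LESS-THAN character, U+003C.
resolved = html.unescape(reference)

print(char_count, all_ascii, same_bytes, resolved, hex(ord(resolved)))
```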

Any way, my actual question is, is &lt; ASCII (if not, what 'language' is it)?
They are called entities and are part of HTML.

Related

HTML Entities: When to Use Decimal vs. Hex

Is there a good rule of thumb for when to use decimal vs. hexadecimal notation for HTML entities?
For example, a non-breaking hyphen is written in decimal as &#8209; and in hex as &#x2011;.
This answer says that hexadecimal is for Unicode; does that mean hex should be used if you're using the <meta charset="utf-8"> tag in the document <head>?
Occasionally, I will notice entity characters mistakenly rendered instead of the entities they represent -- for example, &amp; appearing (instead of an ampersand) in an email subject line or RSS headline. Is either hex or decimal better for avoiding this?
One last consideration: can using hex or decimal affect the rendering clarity (crispness) of the character?
The rule of thumb is: use whichever you prefer, but prefer hex. ☺
There is no difference in meaning and no difference in browser support (the last browsers that supported decimal references only died in the 1990s).
As @AlexW describes, hexadecimal references are more natural than decimal, due to the way character code standards are written. But if you find decimal references more convenient, use them.
The issue has nothing to do with meta tags and character encodings. The main reason why character references were introduced into HTML is that they let you enter characters quite independently of the encoding of the document. This includes characters that cannot be directly written at all in the encoding used. Thanks to them, you can enter any Unicode character even if the character encoding is ASCII or some other limited encoding, like ISO-8859-1.
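As an illustration of that independence, the sketch below uses Python's built-in "xmlcharrefreplace" error handler to squeeze a non-ASCII character into a pure-ASCII byte string. The sample text is made up for the example; note that Python happens to emit decimal references here.

```python
import html

text = "Ohm: Ω"

# Characters that do not fit in ASCII are replaced by numeric character references.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)

# An HTML parser (approximated here by html.unescape) recovers the original text.
round_tripped = html.unescape(ascii_bytes.decode("ascii"))
print(round_tripped == text)
```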
In the old days, it was common to recommend the use of named references (or “entity references” as they are formally called in classic HTML), when possible, because a reference like &Omega;, when displayed literally to the user, is more understandable than a reference like &#x3A9; or &#937;. This hasn’t been relevant for over a decade, as far as web browsers are concerned. But e.g. e-mail clients might be kind of stupid^H^H^H^H^H^H^H^H^H underdeveloped in this respect. They might e.g. show references as such in a list of messages, even though they can interpret them properly when viewing a message. But there does not seem to be any consistent behavior that you could count on.
Overall
HTML (and XML) offers three ways to encode special characters: numeric hex &#x26;, numeric decimal &#38; (aka "character references"), and named &amp; (aka "entity references"). They've remained equally valid and fully supported by all major browsers for decades. They work with any encoding, but always render from the Unicode set (which is compatible with ASCII, ISO Latin, and Windows Latin, minus codes 128-159).
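A quick way to convince yourself that the three notations are interchangeable is to decode them side by side. This is only a sketch using Python's html.unescape as a stand-in for a browser's parser.

```python
import html

# Hex, decimal and named forms of the ampersand.
forms = ["&#x26;", "&#38;", "&amp;"]
decoded = [html.unescape(form) for form in forms]
print(decoded)

# The non-breaking hyphen from the question, in decimal and in hex.
hyphen_decimal = html.unescape("&#8209;")
hyphen_hex = html.unescape("&#x2011;")
print(hyphen_decimal == hyphen_hex)
```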
So it's up to personal preference, with a few things worth noting.
Necessity
If you add the proper charset meta tag to your HTML, you don't need to encode special characters at all (except & < > " ', or more generally, just & < in loose text). The exception is wanting to encode a character not present in the specified encoding. But if you use UTF-8, you can represent anything from Unicode anyway.
Brevity
For any character below index 10, decimal is shorter. A tab is &#9;, versus &#x09;, so it may be worth it for pre tags containing a lot of TSV data, for example.
Ease of Use
Named references are the easiest to use and memorize, especially for code shared among developers of different backgrounds and skill sets. &lt; is much more intuitive than &#x3c;. As for someone else's comment regarding relevance, they're actually still fully supported as part of the W3C standard, and have even been expanded on for HTML5.
Best Practice
Using named or decimal references may not be the best general practice since the names are English-only, and unique to HTML (even XML lacks named references, minus the "big five"). Most programming languages and character tables use hex encoding, so it makes things easier and more portable in the long run when you stay consistent. Though for small projects or special cases, it may not really matter.
More info: http://xmlnews.org/docs/xml-basics.html#references
These are called numeric character references. They are derived from SGML, and the numeric portion references the specific Unicode code point of the character you are trying to display. They allow you to represent any Unicode character, even if the particular character set you wrote the HTML in doesn't have the character you are referencing. Whether you reference the code point in decimal or hexadecimal does not matter, except for very old browsers that prefer decimal. Hexadecimal support was added because Unicode code points are referenced in hex notation, which makes it much easier to look up the code point and then add the reference without having to convert to decimal:
U+007D = &#x007D;
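Going the other way, from a character to its reference, is a one-liner in most languages. The helper names below (hex_reference, decimal_reference) are hypothetical and exist only for this sketch.

```python
def hex_reference(ch: str) -> str:
    """Build a hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):X};"


def decimal_reference(ch: str) -> str:
    """Build a decimal numeric character reference for one character."""
    return f"&#{ord(ch)};"


# U+007D is the right curly bracket; U+03A9 is the Greek capital omega.
print(hex_reference("}"), decimal_reference("}"))
print(hex_reference("Ω"), decimal_reference("Ω"))
```

The hex form lines up directly with the U+ notation used in code charts, which is the convenience described above.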
To answer your question:
This answer says that hexadecimal is for Unicode; does that mean hex should be used if you're using the <meta charset="utf-8"> tag in the document <head>?
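You have to understand that UTF-8 is backwards-compatible with ASCII: the first 128 code points are encoded as the same single bytes in both. (The first 256 Unicode code points match ISO-8859-1, but UTF-8 encodes those above 127 as two bytes, so the byte sequences differ.) Hex is just easier because Unicode code points are conventionally written in hex. The code space runs from U+0000 to U+10FFFF, which is 1,114,112 code points, so it's easier to write &#x10FFFF; than it is to write &#1114111; etc.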
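The numbers involved are easy to check from Python, whose str type covers the whole Unicode code space. The sample characters are arbitrary choices for this sketch.

```python
import sys

# The highest code point is U+10FFFF, so the code space holds 0x110000 values.
highest = sys.maxunicode
total_code_points = highest + 1
print(hex(highest), total_code_points)

# ASCII characters are encoded as identical bytes in ASCII and UTF-8 ...
ascii_matches = "A".encode("ascii") == "A".encode("utf-8")

# ... but a Latin-1 character above 127 takes one byte in ISO-8859-1 and two in UTF-8.
latin1_bytes = "é".encode("latin-1")
utf8_bytes = "é".encode("utf-8")
print(ascii_matches, latin1_bytes, utf8_bytes)
```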

When to use which ASCII representation

Doing web work with PHP, JavaScript, HTML, etc. I had an issue where a special character, in this case the less-than symbol, had to be replaced with an ASCII representation in order for the code to work properly. No issues with the concept, but how do you decide which ASCII representation to use? Stated another way, are there guidelines on when to use decimal 60, hex \x3c, octal \074, or just the HTML character reference &#60;?
They are all the same character and should all be interpreted the same by a browser.
If possible, use the actual character literal, though in HTML < has a special meaning and should be escaped as &lt;.
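The sketch below spells out the "same character" claim in Python. One caveat it tries to make visible: each notation is understood by a different layer. \x3c and \074 are string-literal escapes handled by the programming language, while &#60; and &lt; are only meaningful to an HTML parser (approximated here by html.unescape).

```python
import html

candidates = {
    "decimal code 60": chr(60),
    "hex string escape": "\x3c",
    "octal string escape": "\074",
    "HTML numeric reference": html.unescape("&#60;"),
    "HTML named reference": html.unescape("&lt;"),
}

for label, value in candidates.items():
    print(f"{label}: {value!r}")

all_less_than = all(value == "<" for value in candidates.values())
print(all_less_than)
```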

Ascii characters more reliable in html?

I'm making a webpage. In the HTML, is it better to use ASCII characters? The following look the same to me when I test in different browsers, but is the first one better practice?
Opening &#45;
Opening -
It's OK to use literals like - instead of escaped references like &#45;, and it is also encouraged for readability. The only characters you have to escape are the so-called "unsafe" characters (like < and >), as they can mark a new tag and are therefore ambiguous to the browser.
If you declare the document encoding as UTF-8, then you can insert any character (including non-ASCII ones, like letters from foreign alphabets or accented letters) that does not violate markup syntax.
The only reason to keep &...; references is compatibility with ancient browsers that do not recognize UTF-8.

What are all the HTML escaping contexts?

When outputting HTML, there are several different places where text can be interpreted as control characters rather than as text literals. For example, in "regular" text (that is, outside any element markup):
<div>This is regular text</div>
As well as within the values of attributes:
<input value="this is value text">
And, I believe, within HTML comments:
<!-- This text here might be programmatically generated
and could, in theory, contain the double-hyphen character
sequence, which is verboten inside comments -->
Each of these three kinds of text has different rules for how it must be escaped in order to be treated as non-markup. So my first question is, are there any other contexts in HTML in which characters can be interpreted as markup/control characters? The above contexts clearly have different rules about what needs to be escaped.
The second question is, what are the canonical, globally-safe lists of characters (for each context) that need to be escaped to ensure that any embedded text is treated as non-markup? For example, in theory you only need to escape ' and " in attribute values, since within an attribute value only the closing-delimiter character (' or " depending on which delimiter the attribute value started with) would have control meaning. Similarly, within "regular" text only < and & have control meaning. (I realize that not all HTML parsers are identical. I'm mostly interested in what is the minimum set of characters that need escaping in order to appease a spec-conforming parser.)
Tangentially: The following text will throw errors as HTML 4.01 Strict:
foo
Specifically, it says that it doesn't know what the entity "&y" is supposed to be. If you put a space after the &, however, it validates just fine. But if you're generating this on the fly, you're probably not going to want to check whether each use of & will cause a validation error, and instead just escape all & inside attribute values.
<div>This is regular text</div>
Text content: & must be escaped. < must be escaped.
If producing a document in a non-UTF encoding, characters that do not fit inside the chosen encoding must be escaped.
In XHTML (and XML in general), the sequence ]]> must not occur in text content, so in that specific case one of the characters in that sequence must be escaped, traditionally the >. For consistency, the Canonical XML specification chooses to escape > every time in text content, which is not a bad strategy for an escaping function, though you can certainly skip it for hand-authoring.
<input value="this is value text">
Attribute values: & must be escaped. The attribute value delimiter " or ' must be escaped. If no attribute value delimiter is used (don't do that) no escape is possible.
Canonical XML always chooses " as the delimiter and therefore escapes it. The > character does not need to be escaped in attribute values and Canonical XML does not. The HTML4 spec suggested encoding > anyway for backwards compatibility, but this affects only a few truly ancient and dreadful browsers that no-one remembers now; you can ignore that.
In XHTML < must be escaped. Whilst you can get away with not escaping it in HTML4, it's not a good idea.
To include tabs, CR or LF in attribute values (without them being turned into plain spaces by the attribute value normalisation algorithm) you must encode them as character references.
For both text content and attribute values: in XHTML under XML 1.1, you must escape the Restricted Characters, which are the Delete character and C0 and C1 control codes, minus tab, CR, LF and NEL. In total, [\x01-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]. The null character may not be included at all even escaped in XML 1.1. Outside XML 1.1 you can't use any of these characters at all, nor is there a good reason you'd ever want to.
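The rules above for text content and attribute values can be sketched as two small functions. The names escape_text and escape_attr are hypothetical, and the sketch assumes double-quoted attribute values and a UTF-8 document (so nothing needs escaping merely because it is non-ASCII). It is a minimal illustration, not a replacement for a vetted library.

```python
def escape_text(value: str) -> str:
    """Escape a string for use as element text content."""
    # & must go first, otherwise the & in "&lt;" would be escaped again.
    return value.replace("&", "&amp;").replace("<", "&lt;")


def escape_attr(value: str) -> str:
    """Escape a string for use inside a double-quoted attribute value."""
    escaped = (
        value.replace("&", "&amp;")
        .replace('"', "&quot;")
        .replace("<", "&lt;")  # required in XHTML, harmless in HTML
    )
    # Tabs and line breaks survive attribute normalisation only as references.
    for ch in "\t\n\r":
        escaped = escaped.replace(ch, f"&#{ord(ch)};")
    return escaped


print(escape_text("a < b && c"))
print(escape_attr('say "hi"\t<now>'))
```

The > character is deliberately left alone in both functions, following the answer; escaping it as well (as Canonical XML does in text content) would also be fine.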
<!-- This text here might be programmatically generated
and could, in theory, contain the double-hyphen character
sequence, which is verboten inside comments -->
Yes, but since there is no escaping possible inside comments, there is nothing you can do about it. If you write <!-- &lt; -->, it literally means a comment containing “ampersand-letter l-letter t-semicolon” and will be reflected as such in the DOM or other infoset. A comment containing -- simply cannot be serialised at all.
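Both points can be checked with a short sketch. The function name comment_markup is hypothetical; the extra check for a trailing hyphen comes from the XML comment grammar rather than from the answer above. The second half uses xml.dom.minidom to show that a reference inside a comment is kept literally.

```python
from xml.dom import minidom


def comment_markup(text: str) -> str:
    """Serialise text as a comment, refusing text that cannot be represented."""
    if "--" in text or text.endswith("-"):
        raise ValueError("text cannot be serialised inside a comment")
    return f"<!--{text}-->"


markup = comment_markup(" &lt; ")

# The parser reports the comment data verbatim; &lt; is not decoded.
document = minidom.parseString(f"<root>{markup}</root>")
comment_data = document.documentElement.firstChild.data
print(repr(comment_data))
```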
<![CDATA[ sections and <?pi​s in XML also cannot use escaping. The traditional solution to serialise a CDATA section including a ]]> sequence is to split that sequence over two CDATA sections so it doesn't occur together. You can't serialise it in a single CDATA section, and you can't serialise a PI with ?> in the data.
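The CDATA splitting trick looks like this in practice. The function name cdata_section and the sample text are made up for the sketch; ElementTree is used only to confirm that a parser reassembles the original text.

```python
import xml.etree.ElementTree as ET


def cdata_section(text: str) -> str:
    """Wrap text in CDATA, splitting any ]]> across two adjacent sections."""
    # "]]" ends the first section and ">" starts the next one.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


original = "if (a[b[0]]>1) {}"
markup = cdata_section(original)
print(markup)

# A parser concatenates the adjacent sections back into the original text.
root = ET.fromstring(f"<script>{markup}</script>")
print(root.text == original)
```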
CDATA-elements like <script> and <style> in HTML (not XHTML) may not contain the </ (ETAGO) sequence, as this would end the element early and then cause an error if not followed by the end-tag-name. Since no escaping is possible within CDATA-elements, this sequence must be avoided and worked around (e.g. by turning document.write('</p>') into document.write('<\/p>')). (You see a lot of more complicated silly strategies to get around this one, like calling unescape on a JS-%-encoded string; even often '</scr'+'ipt>' which is still quite invalid.)
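When the script content is generated on the server, the <\/ workaround can be applied mechanically. The sketch below assumes the value is being emitted as a JavaScript string literal and uses json.dumps to produce that literal; the function name js_string_for_inline_script is hypothetical.

```python
import json


def js_string_for_inline_script(value: str) -> str:
    """Return a JS string literal that never contains the </ sequence."""
    literal = json.dumps(value)  # quotes and backslashes are escaped here
    # \/ is a valid escape for / in both JSON and JavaScript strings.
    return literal.replace("</", "<\\/")


literal = js_string_for_inline_script("</p>")
print(f"<script>document.write({literal});</script>")

# The escaped literal still decodes to the original value.
print(json.loads(literal) == "</p>")
```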
There is one more context in HTML and XML where different rules apply, and that's in the DTD (including the internal subset in the DOCTYPE declaration, if you have one), where the % character has Special Powers and would need to be escaped to be used literally. But as an HTML document author it is highly unlikely you would ever need to go anywhere near that whole mess.
The following text will throw errors as HTML 4.01 Strict:
foo
Yes, and it's just as much an error in Transitional.
If you put a space after the &, however, it validates just fine.
Yes, under SGML rules anything but [A-Za-z] and # doesn't start parsing as a reference. Not a good idea to rely on this though. (Of course, it's not well-formed in XHTML.)
The above contexts clearly have different rules about what needs to be escaped.
I'm not sure that the different elements have different encoding rules like you say. All the examples you list require the HTML encoding.
E.g.
<h1>Fish &amp; Chips</h1>
<img alt="Awesome picture of Meat Pie &amp; Chips" />
Fish &amp; Chips
The last example includes some URL Encoding for the ampersand too (%26), and it's at this point that things get hairy (sending an ampersand as data, which is why it must be encoded).
So my first question is, are there any other contexts in HTML in which characters can be interpreted as markup/control characters?
Anywhere within the HTML document, if the control characters are not being used as control characters, you should encode them (as a good rule of thumb). Most of the time, it's HTML Encoding: &amp; or &gt; etc. Other times, when trying to pass these characters via a URL, use URL Encoding: %20, %26 etc.
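The two kinds of encoding stack rather than compete: data is URL-encoded when it goes into a URL, and the finished URL is HTML-encoded when it goes into an attribute. A rough sketch, in which the products?food=...&page=2 URL is invented purely for illustration:

```python
import html
from urllib.parse import quote

label = "Fish & Chips"

# URL encoding protects the value inside the query string.
url_value = quote(label, safe="")
href = f"products?food={url_value}&page=2"  # hypothetical URL

# HTML encoding protects the attribute value and the link text.
link = f'<a href="{html.escape(href)}">{html.escape(label)}</a>'
print(url_value)
print(link)
```

Note that the & separating the two query parameters becomes &amp; in the markup, while the & that is part of the data has already become %26.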
The second question is, what are the canonical, globally-safe lists of characters (for each context) that need to be escaped to ensure that any embedded text is treated as non-markup?
I'd say that the Wikipedia article has a few good comments on it and might be worth a read - also the W3 Schools article I guess is a good point. Most languages have built in functions to prepare text as safe HTML, so it may be worth checking your language of choice (if you are indeed even using any scripting languages and not hand coding the HTML).
Specifically, Wikipedia says: "Characters <, >, " and & are used to delimit tags, attribute values, and character references. Character entity references &lt;, &gt;, &quot; and &amp;, which are predefined in HTML, XML, and SGML, can be used instead for literal representations of the characters."
For URL Encoding, this article seems a good starting point.
Closing thoughts as I've already rambled a bit: This is all excluding the thoughts of XML / XHTML which brings a whole other ballgame to the court and its requirement that pretty much the world and its dog needs to be encoded. If you are using a scripting language and writing out a variable via that, I'm pretty sure it'll be easier to find the built in function, or download a library that'll do this for you. :) I hope this answer was scoped ok and didn't miss the point or question or come across in the wrong tone. :)
If you are looking for the best practices to escape characters in web browsers (including HTML, JavaScript and style sheets), the XSS prevention cheat sheet by Michael Coates is probably what you're looking for. It includes a description of the different interpretation contexts, tables indicating how to encode characters in each context and code samples (using ESAPI).
http://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet
Beware that <script> followed by <!-- followed by <script> again, enters double-escaped state, in which you probably never want to be, so ideally you should escape < with "\u003C" within your script's strings (and regexps) to not trigger it accidentally.
You can read more about it here http://qbolec-memdump.blogspot.com/2013/11/script-tag-content-madness.html
If you are this concerned about the validity of the final HTML, you might consider constructing the HTML via a DOM, versus as text.
You don't say what environment you are targeting.
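As one possible reading of the "construct it via a DOM" suggestion, here is a minimal sketch using Python's xml.etree.ElementTree (an assumption, since no environment was named). The tree holds raw strings, and the serialiser decides what to escape for each context.

```python
import xml.etree.ElementTree as ET

div = ET.Element("div")
div.set("title", 'Meat Pie & "Chips"')  # raw, unescaped attribute value
div.text = "1 < 2 & 3"                   # raw, unescaped text content

# Escaping happens once, at serialisation time.
markup = ET.tostring(div, encoding="unicode")
print(markup)

# Parsing the result gives back exactly the strings that went in.
parsed = ET.fromstring(markup)
print(parsed.get("title"), "|", parsed.text)
```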

Why are HTML character entities necessary?

Why are HTML character entities necessary? What good are they? I don't see the point.
Two main things.
They let you use characters that are not defined in a current charset. E.g., you can legally use ASCII as the charset, and still include arbitrary Unicode characters through entities.
They let you quote characters that HTML gives special meaning to, as Simon noted.
"1 &lt; 2" lets you put "1 < 2" in your page.
Long answer:
Since HTML uses '<' to open tags, you can't just type '<' if you want that as text. Therefore, you have to have a way to say "I want the text < in my page". Whoever designed HTML (or, actually SGML, HTML's predecessor) decided to use '&something;', so you can also put things like non-breaking space: '&nbsp;' (spaces that are not collapsed or allow a line break). Of course, now you need to have a way to say '&', so you get '&amp;'...
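The chain of reasoning in the long answer can be replayed with Python's html module, used here as a rough stand-in for what an HTML author and a browser each do.

```python
import html

# The author escapes the text before placing it in the page.
escaped = html.escape("1 < 2", quote=False)
print(escaped)

# The browser decodes references when rendering.
shown = html.unescape(escaped)
nbsp = html.unescape("&nbsp;")

# To display the five characters "&lt;" themselves, the & is escaped as well.
literal_reference = html.unescape("&amp;lt;")
print(shown, repr(nbsp), literal_reference)
```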
They aren't, apart from &amp;, &lt;, &gt;, &quot; and probably &nbsp;. For all other characters, just use UTF-8.
In SGML and XML they aren't just for characters. They are generic inclusion mechanism, and their use for special characters is just one of many cases.
<!ENTITY signature "<hr/><p>Regards, <i>&myname;</i></p>">
<!ENTITY myname "John Doe">
This kind of entity is not useful for web sites, because it works only in XML mode, and you can't use an external DTD file without enabling "validating" parsing mode in the browser configuration.
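Outside the browser, an ordinary XML parser will expand entities declared in the internal subset. The sketch below is a simplified, text-only variant of the declarations above (the letter element is invented for the example), parsed with Python's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

document = """<!DOCTYPE letter [
  <!ENTITY myname "John Doe">
  <!ENTITY signature "Regards, &myname;">
]>
<letter>&signature;</letter>"""

# The parser expands &signature;, which in turn expands &myname;.
root = ET.fromstring(document)
print(root.text)
```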
Entities can be expanded recursively. This allows XML to be used for a Denial of Service attack called the "Billion Laughs Attack".
Firefox uses entities internally (in XUL and such) for internationalization and brand-independent messages (to make life easier for Flock and IceWeasel):
<!ENTITY hidemac.label "Hide &brandShortName;">
<!ENTITY hidewin.label "Hide - &brandShortName;">
In HTML you just need &lt;, &amp; and &quot; to avoid ambiguities between text and markup.
All other entities are basically obsoleted by Unicode encodings and remain only as a convenience (but a good text editor should have macros/snippets that can replace them).
In XHTML all entities except the basic few are problematic, because they won't work with stand-alone XML parsers (e.g. won't work).
To parse all XHTML entities you need validating XML parser (option's usually called "resolve externals") which is slower and needs DTD Catalog set up. If you ignore or screw up your DTD Catalog, you'll be participating in DDoS of W3C servers.
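The stand-alone parser problem is easy to reproduce. In this sketch Python's xml.etree.ElementTree plays the role of a non-validating XML parser with no DTD available; the helper name parses is hypothetical, and &nbsp; is just one example of an HTML-only entity.

```python
import xml.etree.ElementTree as ET


def parses(markup: str) -> bool:
    """Report whether a stand-alone XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


predefined_ok = parses("<p>a &lt; b</p>")    # one of the predefined XML entities
numeric_ok = parses("<p>a&#160;b</p>")       # numeric references need no DTD
html_only_ok = parses("<p>a&nbsp;b</p>")     # defined only in the (X)HTML DTD
print(predefined_ok, numeric_ok, html_only_ok)
```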
Character entities are used to represent characters that are reserved for writing HTML, e.g. <, >, /, & etc. If you want to represent these characters in your content you should use character entities; this helps the parser distinguish between content and markup.
You use entities to help the parser distinguish between a character that should be interpreted as HTML and one that you really want to show the user, as HTML reserves a special set of characters for itself.
Typing this literally in HTML
I don't mean it like that </sarcasm>
will cause the "</sarcasm>" tag to disappear,
e.g.
I don't mean it like that
as HTML does not have a tag defined as such. In this case, using entities will allow the text to display properly.
e.g.
No, really! &lt;/sarcasm&gt;
gives
No, really! </sarcasm>
as desired.
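What the parser "sees" in each case can be observed with Python's html.parser, used here as an approximation of a browser's tokenizer. The Recorder class and record function are hypothetical names for this sketch.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Collect the text and the end tags that the parser reports."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.end_tags = []

    def handle_data(self, data):
        self.text_parts.append(data)

    def handle_endtag(self, tag):
        self.end_tags.append(tag)


def record(markup: str):
    parser = Recorder()
    parser.feed(markup)
    parser.close()
    return "".join(parser.text_parts), parser.end_tags


# Typed literally, </sarcasm> is treated as markup and drops out of the text.
literal = record("I don't mean it like that </sarcasm>")
# Written with entities, it stays part of the text.
escaped = record("No, really! &lt;/sarcasm&gt;")
print(literal)
print(escaped)
```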