Doing web work with PHP, JavaScript, HTML, etc. Had an issue where using a special character, in this case the less than symbol, had to be replaced with an ASCII representation in order for the code to work properly. No issues with the concept but how do you decide on which ASCII representation to use? Stated another way, are there some guidelines on when to use Dec 60, HEX \x3c, Octal \074, or just the HTML special character &lt;?
They are all the same character and should all be interpreted the same by a browser.
If possible, use the actual character literal, though in HTML < has a special meaning and should be escaped as &lt;.
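The equivalence is easy to check. The following is only a sketch, written in Python for illustration (the question itself concerns PHP and JavaScript), using the standard library's `html` module to decode each notation:

```python
import html

# Three HTML spellings of the less-than sign: decimal, hexadecimal, named.
references = ["&#60;", "&#x3c;", "&lt;"]
decoded = [html.unescape(ref) for ref in references]

# The string-literal escapes from the question denote the same character.
literals = ["\x3c", "\074", chr(60)]

# Going the other way, html.escape produces the named form.
escaped = html.escape("<")

print(decoded, literals, escaped)
```

All of the notations come out as the same single character; only the context (HTML text versus a string literal in code) decides which spelling is legal.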
I'm trying to get special characters into HTML, and am not sure if this is even possible. If anyone remembers Kroz, or just about every DOS interface - there is a special set of shape characters. I'm wanting to use the single braces, double braces, shadows, and other shape characters, but I can't seem to track any of these down anywhere.
Also, will using these characters in an HTML environment present any localization concerns / will there be a required charset?
Thanks!
There is no “extended ASCII”; ASCII ends at code position 127 decimal, 7F hexadecimal. What is called “extended ASCII” is a set of mutually incompatible 8-bit encodings that contain the printable ASCII characters in the same positions as in ASCII. In your case, you seem to want to use the Code Page 437. All of its characters exist in Unicode. You can find the correspondence at
http://en.wikipedia.org/wiki/Code_page_437
which I believe to be correct in this issue; but the authoritative reference is
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT
There are various ways to enter the characters. You can use, say, “▓” as such in HTML, if you have some way of entering it and you use UTF-8 on the page. Alternatively, you can use character references like &#x2593;.
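If the source material is an old DOS text file, the mapping can be done mechanically. A minimal sketch in Python (chosen for illustration), assuming the bytes are CP437-encoded and relying on Python's built-in `cp437` codec; the sample bytes are made up for the example:

```python
# Bytes as they might appear in a CP437-encoded DOS text file:
# a double-line box top (C9 CD CD BB) followed by a dark shade (B2).
dos_bytes = bytes([0xC9, 0xCD, 0xCD, 0xBB, 0xB2])

# The "cp437" codec maps each byte to its Unicode character.
text = dos_bytes.decode("cp437")

# Hexadecimal character references, usable regardless of the page encoding.
references = "".join(f"&#x{ord(ch):04X};" for ch in text)

print(references)
```

The resulting references can be pasted into any HTML page; writing the characters directly works just as well if the page is served as UTF-8.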
Yes, these characters exist in Unicode (and can therefore be encoded in UTF-8). They are called box drawing characters.
See: http://www.fileformat.info/info/unicode/block/box_drawing/utf8test.htm
Is there a good rule of thumb for when to use decimal vs. hexadecimal notation for HTML entities?
For example, a non-breaking hyphen is written in decimal as &#8209; and in hex as &#x2011;.
This answer says that hexadecimal is for Unicode; does that mean hex should be used if you're using the <meta charset="utf-8"> tag in the document <head>?
Occasionally, I will notice entity characters mistakenly rendered instead of the entities they represent -- for example, &amp; appearing (instead of an ampersand) in an email subject line or RSS headline. Is either hex or decimal better for avoiding this?
One last consideration: can using hex or decimal affect the rendering clarity (crispness) of the character?
The rule of thumb is: use whichever you prefer, but prefer hex. ☺
There is no difference in meaning and no difference in browser support (the last browsers that supported decimal references only died in the 1990s).
As @AlexW describes, hexadecimal references are more natural than decimal, due to the way character code standards are written. But if you find decimal references more convenient, use them.
The issue has nothing to do with meta tags and character encodings. The main reason why character references were introduced into HTML is that they let you enter characters quite independently of the encoding of the document. This includes characters that cannot be directly written at all in the encoding used. Thanks to them, you can enter any Unicode character even if the character encoding is ASCII or some other limited encoding, like ISO-8859-1.
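To illustrate the encoding independence, here is a small sketch in Python (used purely for illustration). It relies on the standard `xmlcharrefreplace` error handler, which substitutes decimal references for anything the target encoding cannot represent:

```python
omega = "\u03a9"  # GREEK CAPITAL LETTER OMEGA

# ASCII cannot represent the character directly...
try:
    omega.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

# ...but the "xmlcharrefreplace" error handler substitutes a decimal
# numeric character reference, which is itself pure ASCII.
ascii_html = f"<p>{omega}</p>".encode("ascii", errors="xmlcharrefreplace")

print(fits_in_ascii, ascii_html)
```

The page stays pure ASCII on the wire, yet a browser still displays the Greek letter.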
In the old days, it was common to recommend the use of named references (or “entity references” as they are formally called in classic HTML), when possible, because a reference like &Omega;, when displayed literally to the user, is more understandable than a reference like &#937; or &#x3A9;. This hasn’t been relevant for over a decade, as far as web browsers are concerned. But e.g. e-mail clients might be kind of stupid^H^H^H^H^H^H^H^H^H underdeveloped in this respect. They might e.g. show references as such in a list of messages, even though they can interpret them properly when viewing a message. But there does not seem to be any consistent behavior that you could count on.
Overall
HTML (and XML) offers three ways to encode special characters: numeric hex &#x26;, numeric decimal &#38; (aka "character references"), and named &amp; (aka "entity references"). They've remained equally valid and fully supported by all major browsers for decades. They work with any encoding, but always render from the Unicode set (which is compatible with ASCII, ISO Latin, and Windows Latin, minus codes 128-159).
So it's up to personal preference, with a few things worth noting.
Necessity
If you add the proper charset meta tag to your HTML, you don't need to encode special characters at all (except & < > " ', or more generally, just & < in loose text). The exception is wanting to encode a character not present in the specified encoding. But if you use UTF-8, you can represent anything from Unicode anyway.
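As a rough illustration of how little needs escaping, here is a sketch using Python's standard `html.escape` (Python is an assumption made for the example; PHP's and JavaScript's tooling differs). The sample string is invented:

```python
import html

text = 'Tom & Jerry say "1 < 2"'

# For loose text, leaving quotes alone is enough: only & < > are replaced.
loose = html.escape(text, quote=False)

# For attribute values, the default (quote=True) also replaces quotes.
attribute = html.escape(text)

print(loose)
print(attribute)
```

Note that `html.escape` also replaces `>`, which is harmless but not strictly required in loose text.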
Brevity
For any character below index 10, decimal is shorter. A tab is &#9;, versus &#x9;, so it may be worth it for pre tags containing a lot of TSV data, for example.
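The length difference can be checked with a couple of helper functions. This is just a sketch in Python; `decimal_ref` and `hex_ref` are hypothetical names introduced for the example:

```python
def decimal_ref(code_point: int) -> str:
    """Decimal numeric character reference for a code point."""
    return f"&#{code_point};"


def hex_ref(code_point: int) -> str:
    """Hexadecimal numeric character reference for a code point."""
    return f"&#x{code_point:X};"


# Tab, less-than sign, non-breaking hyphen.
for cp in (9, 60, 0x2011):
    print(cp, decimal_ref(cp), hex_ref(cp))
```

For the tab, the `x` prefix makes the hex form one character longer than the decimal form.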
Ease of Use
Named references are the easiest to use and memorize, especially for code shared among developers of different backgrounds and skill sets. &lt; is much more intuitive than &#60;. As for someone else's comment regarding relevance, they're actually still fully supported as part of the W3C standard, and have even been expanded on for HTML5.
Best Practice
Using named or decimal references may not be the best general practice since the names are English-only, and unique to HTML (even XML lacks named references, minus the "big five"). Most programming languages and character tables use hex encoding, so it makes things easier and more portable in the long run when you stay consistent. Though for small projects or special cases, it may not really matter.
More info: http://xmlnews.org/docs/xml-basics.html#references
These are called numeric character references. They are derived from SGML and the numeric portion of them references the specific Unicode code point of the character you are trying to display. They allow you to represent characters of Unicode, even if the particular character set you wrote the HTML in doesn't have the character you are referencing. Whether you reference the code point with decimal or hexadecimal does not matter, except for very old browsers that prefer decimal. Hexadecimal support was added because Unicode code points are referenced in hex notation and it makes it much easier to look up the code point and then add the reference, without having to convert to decimal:
U+007D = &#x007D;
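The lookup-and-paste step amounts to copying the hex digits. A minimal sketch in Python (for illustration only; `reference_from_u_notation` is a hypothetical helper, not part of any library):

```python
import html


def reference_from_u_notation(notation: str) -> str:
    """Turn a 'U+XXXX' label into a hexadecimal character reference."""
    if not notation.upper().startswith("U+"):
        raise ValueError(f"expected U+XXXX notation, got {notation!r}")
    hex_digits = notation[2:]
    int(hex_digits, 16)  # raises ValueError if the digits are not valid hex
    return f"&#x{hex_digits};"


ref = reference_from_u_notation("U+007D")
print(ref, html.unescape(ref))
```

No arithmetic is involved, which is the convenience being described; a decimal reference would require converting 7D to 125 first.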
To answer your question:
This answer says that hexadecimal is for Unicode; does that mean hex
should be used if you're using the <meta charset="utf-8"> tag in the
document <head>?
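You have to understand that UTF-8 is backwards-compatible with ASCII: the first 128 code points are encoded as the same single bytes in both. The first 256 Unicode code points also match ISO-8859-1, although UTF-8 encodes those above 127 as two bytes. Hex is just easier for Unicode because the code space runs to 1,114,112 code points (U+0000 through U+10FFFF), and the standard lists them in hex. So it's easier to write &#x10FFFF; than it is to write &#1114111;, etc.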
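The relationship between ASCII, ISO-8859-1, and UTF-8, and the hex/decimal correspondence at the top of the Unicode range, can be checked directly. A small sketch, with Python chosen only for illustration:

```python
ascii_text = "<"
accented = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Below code point 128, UTF-8 and ASCII produce identical bytes.
same_for_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Above 127, UTF-8 needs two bytes where ISO-8859-1 needs one.
utf8_bytes = accented.encode("utf-8")
latin1_bytes = accented.encode("iso-8859-1")

# The Unicode code space ends at U+10FFFF.
highest_decimal = int("10FFFF", 16)
code_space_size = highest_decimal + 1

print(same_for_ascii, utf8_bytes, latin1_bytes, highest_decimal, code_space_size)
```

Character references sidestep the byte-level differences entirely, because they name the code point rather than its encoded bytes.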
I'm making a webpage. In the HTML, is it better to use ASCII character codes? The following look the same for me when I test in different browsers, but is the first one better practice?
Opening &#45;
Opening -
It's OK to use literals like - instead of escaped references like &#45;, and it is encouraged for readability. The only characters you have to escape are the so-called "unsafe" characters (like < and >), which can mark a new tag and are therefore ambiguous to the browser.
If you declare the document encoding as UTF-8, then you can insert any character (also non ASCII, like letters from foreign alphabets or accented letters) which will not violate markup syntax.
The only reason to keep &...; references is compatibility with ancient browsers that do not recognize UTF-8.
Although I included the ISO-8859-1 content-type META, my website isn't displaying special characters, such as ã and ê. If I 'echo' a string from a MYSQL query, the special character is displaying properly. If I write the SAME character in plain HTML, it won't display in the same website.
Thanks in advance.
http://popguest.com.br/event/index3.php?c=48&p=3
Maybe you could use a simple routine to encode all special characters to entities (&#x----; with correct hexadecimal unicode number after x)
(and switching to UTF-8 is not a bad idea)
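Such a routine could look roughly like the following. This is only a sketch in Python (the site in question runs PHP, so treat it as pseudocode for the idea); `encode_non_ascii` is a hypothetical name and the sample string is invented:

```python
def encode_non_ascii(text: str) -> str:
    """Replace every non-ASCII character with a hex character reference."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):X};"
        for ch in text
    )


sample = "S\u00e3o Paulo, voc\u00ea"  # contains a-tilde and e-circumflex
encoded = encode_non_ascii(sample)
print(encoded)
```

The output is pure ASCII, so it displays the same whether the page is declared as ISO-8859-1 or UTF-8. The sketch does not escape markup characters such as < or &; that would be a separate step.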
I have a question about ASCII code and HTML.
Most sites state what ASCII is but then mention things like HTML alternative or HTML code. Is this still ASCII?
Any way, my actual question is, is &lt; ASCII (if not, what 'language' is it)?
ASCII is an encoding: it defines how the characters you see are encoded as 0s and 1s (in fact, as bytes). That is totally unrelated to how a browser interprets the characters it decodes from an HTML file.
You can send to a browser a file containing the characters &lt; in any encoding, be it UTF-8, ASCII, or another one.
&lt; is a character entity reference, coming from SGML and defined both in XML and HTML.
Here's the official reference about HTML4 character entities.
&lt; is an HTML entity. HTML entities are used when a character cannot be safely used within the browser. For example, if you wanted to use a less than sign within the content of your page, using < would get interpreted by the browser as the start of a new tag. Using an HTML entity tells the browser to render the actual character and not read it as the start of a tag.
http://www.w3schools.com/html/html_entities.asp
The notation &lt; consists of four characters, which all have a representation in the ASCII character code, but that’s immaterial. In HTML (as well as in SGML and XML), the notation denotes the LESS-THAN character “<”, in most contexts. The “<” character, too, has a representation in ASCII, but this too is immaterial.
People often use the expression “ASCII character” to denote a character that has a representation in ASCII, i.e. an ASCII code. In reality, the characters need not be ASCII encoded. But the concept “ASCII character” is still useful for some practical purposes. And using it, we can say that &lt; is a sequence of ASCII characters that denotes an ASCII character.
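The distinction between the notation and the character it denotes can be made concrete. A brief sketch in Python (for illustration only; `str.isascii` requires Python 3.7 or later):

```python
import html

notation = "&lt;"

# The notation itself is four characters, each with an ASCII code.
is_ascii = notation.isascii()
code_values = [ord(ch) for ch in notation]

# An HTML parser resolves the notation to a single character.
denoted = html.unescape(notation)

print(is_ascii, code_values, denoted, ord(denoted))
```

The four code values belong to the ampersand, the letters l and t, and the semicolon; the single denoted character has code 60.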
The “language” here is really a set of markup languages in which some sequences of ASCII characters are defined to mean certain (ASCII or non-ASCII) characters.
The need for using &lt; (when you wish to include “<” as text content) stems from the principle that in most contexts in HTML, the “<” character starts a tag, instead of being taken as such.
Things like &lt; are called entity references in SGML tradition, though in HTML contexts often prefixed with the word character to emphasize that the predefined entities of HTML all evaluate to single characters. The HTML5 drafts, abandoning the SGML tradition, use the term named character references instead.
Any way, my actual question is, is &lt; ASCII (if not, what 'language'
is it)?
They are called entities and are part of HTML.