Effects of Non-ASCII Characters in HTML vs HTML Encoded Characters

I had an issue earlier today where someone couldn't compile a static site due to some non-ASCII characters in a kramdown file. While writing a small script that finds these characters in our content, I ran across a large number of non-HTML encoded special characters.
What are the implications of including these characters directly in the HTML? Take the © character.
If I include the character directly in HTML, it seems to render correctly in my browser. That being said, I don't know the side-effects for those who don't have fonts installed that support these characters.
What are the side effects of leaving these non-ASCII characters in the HTML? I know in some situations it can lead to strange (?) characters showing up, but I'd like more specific information on how these special characters get rendered.
If I HTML encode these special characters and a client doesn't have a font that supports them, does it show the same (?) character? Is there any meaningful difference between using the HTML-encoded vs non-encoded characters?

Is there any meaningful difference between using the HTML-encoded vs non encoded characters?
Not in terms of the browser being able to display them in general.
If you want to use these as you call them "non-standard" characters (which are very much standard characters, just not ASCII characters), you should specify an encoding, preferably utf-8. The HTML5 way of doing this (which is backwards compatible and supported by pretty much all browsers) is
<meta charset="utf-8">
That said, some tools compiling static HTML from markdown etc. might have problems with it, but that depends on the tool. You're safer using entities like &copy; there, and entities also always work without specifying an encoding.
This is not the full story, as the way a browser is decoding a file can also be influenced by other factors, like HTTP Response Headers. Also, even if you omit it, as you could observe, browsers do everything they can to still parse it correctly, there's just no guarantee.
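As a rough illustration of the point above (a Python sketch, not part of the original answer), the entity form and the literal character decode to the same text, and only the bytes stored in the file differ. The sample strings and the helper name are made up for the example.

```python
import html

# Hypothetical sample text: the same notice written with an entity and with the literal character.
ENTITY_FORM = "&copy; Example Corp"
LITERAL_FORM = "© Example Corp"


def same_after_parsing(entity_text, literal_text):
    """Return True if decoding the entities yields exactly the literal text."""
    return html.unescape(entity_text) == literal_text


print(same_after_parsing(ENTITY_FORM, LITERAL_FORM))  # True
# The entity form is pure ASCII, so it survives any ASCII-compatible encoding.
print(ENTITY_FORM.isascii())                          # True
# The literal form needs a declared encoding: in UTF-8 the sign is the two bytes C2 A9.
print(LITERAL_FORM.encode("utf-8")[:2])               # b'\xc2\xa9'
```

Whether a glyph is available for the character is a separate, font-level question and is the same for both forms.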

Related

Why include <meta charset="" />?

I mean, if a browser is already reading the HTML file and is able to read the text <meta charset="" />, that means it already knows the encoding of the HTML file. So why does it need to be specified inside the HTML file? Isn't it redundant?
Is it because the browser starts reading the file using the smallest charset, like ASCII, which is a subset of many charsets?
From Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)":
For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself — not in the HTML itself, but as one of the response headers that are sent before the HTML page.
This causes problems. Suppose you have a big web server with lots of sites and hundreds of pages contributed by lots of people in lots of different languages and all using whatever encoding their copy of Microsoft FrontPage saw fit to generate. The web server itself wouldn’t really know what encoding each file was written in, so it couldn’t send the Content-Type header.
It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy… how can you read the HTML file until you know what encoding it’s in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it’s going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.
See also W3.org:
Always declare the encoding of your document using a meta element with a charset attribute, or using the http-equiv and content attributes (called a pragma directive). The declaration should fit completely within the first 1024 bytes at the start of the file, so it's best to put it immediately after the opening head tag.
So yes. The entire premise is that until the HTML parser of your browser reads that meta tag, there should not be any bytes that can be ambiguously interpreted as other bytes; the entire text shown including the charset attribute value ("utf-8") fits into the ASCII encoding.
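One way to check this claim is the following Python sketch. The list of encodings is an arbitrary sample of ASCII-compatible encodings chosen for the example, and the helper name is hypothetical.

```python
META = '<meta charset="utf-8">'

# Arbitrary sample of ASCII-compatible encodings (an assumption for this example).
ASCII_COMPATIBLE = ["ascii", "utf-8", "iso-8859-1", "windows-1252", "shift_jis", "koi8-r"]


def encodes_identically(text, encodings):
    """Return True if the text maps to the same byte sequence in every listed encoding."""
    return len({text.encode(name) for name in encodings}) == 1


# The declaration itself is byte-for-byte identical in all of them.
print(encodes_identically(META, ASCII_COMPATIBLE))          # True
# A non-ASCII character is where the encodings start to disagree.
print(encodes_identically("é", ["utf-8", "iso-8859-1"]))    # False
```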
From Joel's article:
Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working.
The average HTML parser goes like this:
1. Is there a Content-Type response header with a charset parameter? Use that to decode the bytes of the received content into a string.
2. Otherwise, start reading the HTML as ASCII (or UTF-8). Is there a <meta http-equiv="Content-Type"> element with a usable charset? Use that.
3. Otherwise, start parsing the bytes and use heuristics to determine the most likely encoding used.
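The lookup order above can be sketched in Python roughly as follows. This is a simplified illustration, not how any particular browser implements it: the function name, the regular expressions, and the windows-1252 fallback are all assumptions made for the example, and the "heuristics" step is reduced to a single UTF-8 validity check.

```python
import re


def pick_encoding(content_type_header, body, fallback="windows-1252"):
    """Simplified sketch of the lookup order: HTTP header, then meta tag, then a guess."""
    # 1. A charset parameter in the HTTP Content-Type header wins.
    if content_type_header:
        match = re.search(r'charset=["\']?([\w.:-]+)', content_type_header, re.I)
        if match:
            return match.group(1).lower()
    # 2. Otherwise read the first 1024 bytes as ASCII and look for a meta declaration.
    head = body[:1024].decode("ascii", errors="replace")
    match = re.search(r'<meta[^>]*charset=["\']?([\w.:-]+)', head, re.I)
    if match:
        return match.group(1).lower()
    # 3. Stand-in for real heuristics: valid UTF-8 is taken as UTF-8,
    #    anything else falls back to a hypothetical legacy default.
    try:
        body.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback


print(pick_encoding("text/html; charset=ISO-8859-1", b'<meta charset="utf-8">'))  # iso-8859-1
print(pick_encoding(None, b'<html><head><meta charset="utf-8">'))                 # utf-8
```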
It is an obsolete tag, but here is the reason it works: ISO 646 (since 1967) defines a standard set of characters. ASCII pins down the few characters that ISO 646 leaves optional, so ISO 646 is the ancestor of most encodings.
Note: most systems are based on this standard, possibly using the ISO 2022 extension, which lets you encode 7-bit and 8-bit characters with a few different encodings (used, for example, for Asian character sets, where more than 256 characters are needed). In any case, the start of a text is compatible with ISO 646. Control sequences may then change the meaning.
So a browser can read most ASCII data (really ISO 646 or ISO 2022) and then detect exactly how to interpret all other characters.
In Western languages, you get mostly ASCII in the lower codes (up to 127), but how to interpret the higher codes depends on the language (Nordic characters, Western accented characters, Greek characters, etc.). There are various encodings, which cannot really be detected without an explicit specification.
Note: this method fails on a few encodings, e.g. multi-byte ones like UCS-2, UTF-16, and UTF-32, but the W3C had some methods to detect them: the header should be mostly ASCII, so it should contain a lot of 00 bytes. EBCDIC and other encodings not based on ISO 646 (or ASCII) were already rare. In principle you can check for some byte strings, but I do not know whether browsers did so.
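To make the "lot of 00 bytes" idea concrete, here is a hedged Python sketch. It is not the W3C's or any browser's actual procedure: the function name is hypothetical, and the only assumption is that the document starts either with a byte-order mark or with the "<" of its first tag.

```python
import codecs


def sniff_wide_encoding(data):
    """Guess UTF-16/UTF-32 from a byte-order mark or from NUL padding around '<'."""
    # Check UTF-32 marks first: the UTF-32-LE mark begins with the UTF-16-LE mark.
    marks = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for mark, name in marks:
        if data.startswith(mark):
            return name
    # No mark: look at where the 00 bytes fall around the opening '<'.
    if data[:4] == b"<\x00\x00\x00":
        return "utf-32-le"
    if data[:4] == b"\x00\x00\x00<":
        return "utf-32-be"
    if data[:2] == b"<\x00":
        return "utf-16-le"
    if data[:2] == b"\x00<":
        return "utf-16-be"
    return None  # Looks like a single-byte or ASCII-compatible encoding.


print(sniff_wide_encoding("<html>".encode("utf-16-le")))  # utf-16-le
print(sniff_wide_encoding(b"<html>"))                     # None
```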
In short: with heuristics (and ISO 646) you can guess how to read the ASCII part, but to know how to interpret "special characters", e.g. accented characters, we need more information, given by META or by the HTTP header. Note: this also works with many Asian encodings (those based on ISO 2022).
Why META? It is about control. Changing the HTTP header often required webmaster intervention, but with META the author of a page could override the encoding (e.g. when writing static pages; now most dynamic page generators can override HTTP headers).

HTML encoding of Japanese text

I'm making a static HTML page that displays courtesy text in multiple languages. I noticed that if I paste ウェブサイトのメンテナンスの下で into Expression Blend, that text appears the same in the code. I think it's bad for compatibility and should be replaced by proper HTML entities.
I have tried http://www.opinionatedgeek.com/DotNet/Tools/HTMLEncode/encode.aspx but it returns me the same Japanese text.
Is it correct, from the point of view of browser compatibility, to paste that Japanese right into the source code of an HTML page?
Else, what is the correct HTML encoding of that text? Or, better, is there any tool that I can use to convert non-ASCII characters to HTML entities, possibly online and possibly free?
I think it's bad for compatibility and should be replaced by proper HTML entities.
Quite the opposite, actually: your preference should be to not use HTML entities but rather to correctly declare the document encoding as UTF-8 and use the actual characters. There are quite a few compelling reasons to do so, but the real question is why not use it, since it's a well-supported and widely supported standard?
Some of those points have been summarised previously:
UTF-8 encodings are easier to read and edit for those who understand what the character means and know how to type it.
UTF-8 encodings are just as unintelligible as HTML entity encodings for those who don't understand them, but they have the advantage of rendering as special characters rather than hard to understand decimal or hex encodings.
[For example] Wikipedia... actually go through articles and convert character entities to their corresponding real characters for the sake of user-friendliness and searchability.
As long as you mark your web page as UTF-8, either in the HTTP headers or the meta tags, having foreign characters in your web pages should be a non-issue. Alternatively, you could encode/decode these strings using the encodeURI/decodeURI functions in JavaScript. Note that these produce URL percent-encoding, not HTML entities:
encodeURI('ウェブサイトのメンテナンスの下で')
// returns "%E3%82%A6%E3%82%A7%E3%83%96%E3%82%B5%E3%82%A4%E3%83%88%E3%81%AE%E3%83%A1%E3%83%B3%E3%83%86%E3%83%8A%E3%83%B3%E3%82%B9%E3%81%AE%E4%B8%8B%E3%81%A7"
decodeURI("%E3%82%A6%E3%82%A7%E3%83%96%E3%82%B5%E3%82%A4%E3%83%88%E3%81%AE%E3%83%A1%E3%83%B3%E3%83%86%E3%83%8A%E3%83%B3%E3%82%B9%E3%81%AE%E4%B8%8B%E3%81%A7")
// returns "ウェブサイトのメンテナンスの下で"
If you are looking for a tool to convert a bunch of static strings to Unicode characters, you could simply use the encodeURI/decodeURI functions from a browser's developer console (Firebug for Mozilla Firefox). Hope this helps!
HTML entities are only useful if you need to represent a character that cannot be represented in the encoding your document is saved in. For example, ASCII has no specification for how to represent "€". If you want to use that character in an ASCII encoded HTML document, you have to encode it as &euro; or not use it at all.
If you are using a character encoding for your document that can represent all the characters you need though, like UTF-8, there's no need for HTML entities. You simply need to make sure the browser knows what encoding the document is in so it can interpret it correctly. This is really the preferable method, since it simply keeps the source code readable. It really makes no sense to want to work with HTML entities if you can simply work with the actual characters.
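For the case where the document really must stay ASCII, a small Python sketch of the conversion is shown below. The helper name is hypothetical. It emits decimal character references such as &#8364; rather than named entities such as &euro;, and it does not escape <, > or &.

```python
import html


def to_ascii_html(text):
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_ascii_html("Price: 5 €"))  # Price: 5 &#8364;

# Round trip: a browser (here html.unescape) turns the references back into the characters.
japanese = "ウェブサイトのメンテナンスの下で"
print(html.unescape(to_ascii_html(japanese)) == japanese)  # True
```

With a UTF-8 document none of this is needed; the sketch only shows what the entity route amounts to.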
See http://kunststube.net/frontback for some more information.

HTML Unicode Issue: How to display special characters

Currently, I have my webpage set to Unicode/UTF-8. When trying to display a special character (for example, em dash, double arrow, etc), it shows up as a question mark symbol. I cannot change these characters to the HTML entity equivalent. How can I circumvent this issue?
A question mark in a lozenge, �, indicates a character-level error: the data contains bytes that do not represent any character, according to the character encoding being applied. This typically happens when the document is declared as UTF-8 encoded but is really in iso-8859-1, windows-1252, or some similar encoding. Windows-1252 is a common default encoding used by various programs on Windows platforms. So you may need to open the file in your authoring program and re-save it as UTF-8 encoded.
If problems remain, please post the URL. Posting the code alone is not sufficient, since the character encoding is primarily specified in HTTP headers.
If you see a question mark in a small box, then it might be a font-level problem (lack of glyph in the fonts being used), but this would be very rare for common characters like the em dash. Different browsers have different ways of indicating character- or font-level problems.
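The character-level error described above can be reproduced in a few lines of Python. This is only a sketch of the mismatch, with a hypothetical helper name; it uses the em dash from the question as the test character.

```python
EM_DASH = "\u2014"


def decode_as(data, encoding):
    """Decode bytes the way a browser would: invalid sequences become U+FFFD."""
    return data.decode(encoding, errors="replace")


# File saved as windows-1252 but declared as UTF-8: the single byte 0x97 is not valid UTF-8.
saved_as_cp1252 = EM_DASH.encode("windows-1252")
print(decode_as(saved_as_cp1252, "utf-8") == "\ufffd")  # True

# The opposite mismatch (UTF-8 bytes read as windows-1252) gives three wrong characters instead.
saved_as_utf8 = EM_DASH.encode("utf-8")
print(len(decode_as(saved_as_utf8, "windows-1252")))    # 3

# When the declaration matches the bytes, the em dash comes back intact.
print(decode_as(saved_as_utf8, "utf-8") == EM_DASH)     # True
```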
Make sure your document is set to the correct character encoding in the actual code editor, as well as in the meta tag. Both are necessary. I spent hours trying to tweak HTML when the only problem was that I needed to set the text setting in Coda.
<head>
<meta charset="utf-8">
If you write the characters as references instead, make sure they refer to the right code points. They will probably look something like this:
&#174; or U+00AE
http://www.kinsmancreative.com/transfer/char/index.php is a handy site for finding the decimal values of commonly used UTF-8 special characters if you need a reference.

Are unicode characters better or more semantic than the simple text versions?

When I copy/paste text from most sites and pdfs, the following characters are almost always in the unicode equivalent:
double quote: " is “ and ” (&ldquo; and &rdquo;)
single quote: ' is ‘ and ’ (&lsquo; and &rsquo;)
ellipsis: ... is … (&hellip;)
I understand ones that can't be represented without unicode like © and ¢, but even for those, I wonder.
When should you use these unicode equivalents? Are they more semantic than not using them? Are they better interpreted by devices (copy/paste/print)? I always find it annoying getting those quote and ellipsis characters because with textmate + programming, you don't use them.
When should you use these unicode equivalents? Are they more semantic than not using them?
Note that these are not “unicode equivalents”. Those characters are available in many character sets other than Unicode, and they are strictly distinct from the alternatives that you propose.
In typography, the left and right versions of the single and double quotation marks are correct. They provide the traditional appearance for those characters that has been used in print media for many years. The ellipsis character provides the correct spacing for an ellipsis that does not naturally occur when using consecutive full stop characters. So the reason all of these are used is to make the text appear correctly to human readers.
Are they better interpreted by devices (copy/paste/print)?
Any system that uses any character set should be designed to correctly handle that character set. If the text is encoded in Unicode, then any recent system (from the last 15 years at least) should be able to handle it, since Unicode is the de facto standard character set for all modern systems.
Not all Unicode-conformant systems will be able to display all characters correctly. This will depend on the fonts available, and even the rendering system that uses the fonts. But any Unicode-conformant system will be able to transmit the characters unaltered (such as in a copy and paste operation).
I always find it annoying getting those quote and ellipsis characters because with textmate + programming, you don't use them.
It is unusual to copy English (or whatever language) text directly into a program without having to add separate delimiters to that text. But most modern programming languages will not have any difficulty handling the text once it is properly delimited.
Any systems that cannot handle Unicode correctly should be updated. Legacy character encodings will have no place in the future.
I think there's a simple explanation: MS Word converts these characters/sequences automatically as you type, and a lot of text on the internet has been copied from that editor.
Most of the articles I get for my site from other authors are sent as .doc files and I have to convert them. Usually, they contain the characters you've mentioned.
I'd also add one more: many different types of dashes instead of the hyphen. And also the low opening double quote (as seen in some European languages).
I usually let them stay in the text (all my pages are unicode). It's just important to remember it when playing around with regex etc (especially the dashes can be tricky and hard to spot).
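Here is a small Python sketch of the regex pitfall mentioned above and one possible workaround. The set of dash characters is a hand-picked, non-exhaustive sample and the helper name is hypothetical.

```python
import re

# A few common Unicode dashes (hyphen, non-breaking hyphen, figure dash,
# en dash, em dash, horizontal bar, minus sign). Not an exhaustive list.
DASHES = re.compile("[\u2010\u2011\u2012\u2013\u2014\u2015\u2212]")


def normalize_dashes(text):
    """Replace typographic dashes with the plain ASCII hyphen-minus."""
    return DASHES.sub("-", text)


year_range = "2010\u20132020"  # looks like 2010-2020 but contains an en dash
print(re.search(r"\d{4}-\d{4}", year_range))                                 # None
print(re.search(r"\d{4}-\d{4}", normalize_dashes(year_range)) is not None)   # True
```

Normalizing is only appropriate for matching or searching; the typographic dash can stay in the stored text.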
HTML entities serve a triple purpose:
Being able to use characters that do not belong to the document character set, e.g., insert a euro symbol in an ISO-8859-1 document.
Escape characters that have a special meaning in HTML, such as angle brackets.
Make it easier to type characters that are not on your keyboard or are not supported by your editor, e.g. a copyright symbol.
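The second purpose, escaping characters that are special in HTML, is the one that stays necessary even in a UTF-8 document. A minimal Python sketch using the standard library (the wrapper name is hypothetical):

```python
import html


def escape_markup(text):
    """Escape &, <, > and quotes so the text is displayed literally, not parsed as HTML."""
    return html.escape(text)


print(escape_markup("if a < b && b > c"))  # if a &lt; b &amp;&amp; b &gt; c

# Non-ASCII characters are left alone; only the markup-significant ones change.
print(escape_markup("© 5 €"))              # © 5 €
```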
Update:
My info is correct but I suspect I've answered the wrong question...
On the web, I would consider that markup adds semantic meaning, content does not. So it doesn't really matter which you use in this context.
Typographers would insist on “ and ”, where programmers don't care and just use regular old quotes ".
The key here is interoperability. There are different encoding schemes. As we've all been victim to, people paste content into an editor from Word, which uses windows-1252 encoding. When you serve this content up via AJAX it usually breaks, because AJAX uses UTF-8 encoding by default.
Office 2010 now allows for the saving of documents in UTF-8 format. Also, databases have different Unicode encoding schemes. The best bet is to use UTF-8 end-to-end.
When you copy-paste text that includes special characters, they will be left as they are. This is perfectly fine if the characters match the charset used by the webpage.
HTML entities are just a convenience for producing specific characters in any character set. Keyboards tend not to have keys to get symbols like ©, so the HTML entity is a shortcut.
I'm going to generalize and say that most of the time the content is UTF-8 (please correct me if I'm wrong). The copied characters are usually copied correctly and everything works great. If they aren't copied correctly, or the charset is subject to change, or you're after i18n support, go with the HTML or XML entities. Otherwise, leave them as they are; the browser will display them just fine.

Content type vs HTML encoding

I'm building a site and I've set its content type to use charset UTF-8. I'm also using HTML encoding for the special characters, i.e. instead of having á I've got &aacute;.
Now I wonder (still building the site) whether it was really necessary to do both things. Looking for the answer I found this:
http://www.w3.org/International/questions/qa-escapes.en.php
It says that I should not use HTML encoding for any special characters except >, < and &. But the reason given is that escapes
can make it difficult to read and maintain source code, and can also significantly increase file size.
I think that's true but a very poor argument. Is using the escapes really THE SAME thing as using the special characters?
The article is in fact correct. If you have proper UTF-8 encoded data, there is no reason to use HTML entities for special characters on normal web pages any more.
I say "on normal web pages", because there are highly exotic borderline scenarios where using entities is still the safest bet (e.g. when serving JavaScript code to an external page with unknown encoding). But for serving pages to a browser, this doesn't apply.
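To put a number on the file-size point from the W3C article, here is a small Python sketch (the helper name is hypothetical) comparing the UTF-8 byte count of a literal character with that of its escape:

```python
def utf8_size(text):
    """Number of bytes the text occupies when saved as UTF-8."""
    return len(text.encode("utf-8"))


print(utf8_size("á"), utf8_size("&aacute;"))   # 2 8
print(utf8_size("ウ"), utf8_size("&#12454;"))  # 3 8
```

For occasional accented letters the difference is small, but for text written mostly in a non-Latin script the escaped form is several times larger.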