UTF-8 encoding in sublime text and visual studio - html

The question might be a bit basic – considering I'm not what the vast majority would consider a newcomer to front end web development.
I am teaching an 8 year old html, css and javascript. I'm taking the opportunity to also teach about utf-8 encoding, in particular the way HTML uses it to allow non-English characters to be encoded and displayed.
I want to show him how accented characters do not appear properly without including <meta charset="UTF-8"/>.
Surprisingly I was able to display "Á" in the test webpage when in theory this shouldn't have been possible as the utf-8 charset meta tag was missing.
After some research I came to the conclusion that in modern IDE's the encoding system comes "built in", hence there's no real need to write down <meta charset />. If this is wrong please correct me as I am currently confused as to what exactly happened and I don't want to teach wrong information to an 8 year old.

After some research I came to the conclusion that in modern IDE's the encoding system comes "built in", hence there's no real need to write down . If this is wrong please correct me
Yes, that is wrong!
Surprisingly I was able to display "Á" in the test webpage when in theory this shouldn't have been possible as the utf-8 charset meta tag was missing.
This is also wrong, let me explain!
UTF-8 is an encoding system. This means it describes how to map bytes into textual characters. It's certainly possible to display "Á" without using utf-8.
The letter A (normal, no accents) is encoded with the number 65 in both ASCII and UTF-8. In fact, all english characters and punctuation are encoded the same way across virtually all encodings, so encoding problems rarely become apparent in English-only text.
However, accented letters, non-english characters and emojis (😁) are encoded differently in different encoding systems. What causes "corrupt" text to be displayed is an encoding mismatch: your web browser thinks the encoding used is X while the file was actually encoded with system Y, so byte values no longer map to correct characters. For example, system X uses number 250 to encode 😁, while system Y uses number 190, and under system Y 250 is mapped to "Ë". So now my 😁 appear as "Ë".
<meta charset="utf-8"/> specifies the encoding used for the HTML file. It is absolutely needed. Your webpage worked without because browsers may use other ways to get it, including educated guesses, but it should always be explicitly written in the HTML to avoid problems down the line.

You should specify the encoding for several reasons:
Even if the encoding system would come buit-in, you cannot know which is the default encoding chosen for the IDE.
HTML5 specification says that the default encoding should be taken from the transport layer when not specified which will be the default encoding charset for HTTP1.1: ISO-8859-1.
See the full explaination here: Why it's necessary to specify the character encoding in an HTML5 document if the default character encoding for HTML5 is UTF-8?

Related

Why to include <meta charset=“” />?

I mean if a browser is already reading the HTML file and is able to read the text <meta charset=“” /> that means it already knows the encoding of the HTML file. So why is it needed to be specified inside the HTML file? Isn’t it redundant?
Is it because browser starts reading file using smallest charset, like ASCII, and it is subset of many charsets?
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself — not in the HTML itself, but as one of the response headers that are sent before the HTML page.
This causes problems. Suppose you have a big web server with lots of sites and hundreds of pages contributed by lots of people in lots of different languages and all using whatever encoding their copy of Microsoft FrontPage saw fit to generate. The web server itself wouldn’t really know what encoding each file was written in, so it couldn’t send the Content-Type header.
It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy… how can you read the HTML file until you know what encoding it’s in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
But that meta tag really has to be the very first thing in the section because as soon as the web browser sees this tag it’s going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.
See also W3.org:
Always declare the encoding of your document using a meta element with a charset attribute, or using the http-equiv and content attributes (called a pragma directive). The declaration should fit completely within the first 1024 bytes at the start of the file, so it's best to put it immediately after the opening head tag.
So yes. The entire premise is that until the HTML parser of your browser reads that meta tag, there should not be any bytes that can be ambiguously interpreted as other bytes; the entire text shown including the charset attribute value ("utf-8") fits into the ASCII encoding.
From Joel's article:
Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working.
The average HTML parser goes like this:
Is there a Content-Type response header with a charset parameter? Use that to decode the bytes of the received content into a string.
Start reading the HTML as ASCII (or UTF-8). Is there a <meta http-equiv="Content-Type"> header with a usable charset? Use that.
Start parsing the bytes and use heuristics to determine the most likely encoding used.
It is an obsolete tag, but the reason: we have ISO 646 (since 1967) which defines a standard set of characters. ASCII specifies the few optional characters on ISO 646, so ISO 646 is the mother of most of encodings.
Note: most systems are based on this standard, ev. using the extension ISO 2022, where you can encode 7-bit and 8-bit characters with few different encodings (e.g. used for Asian character set, where we need more then 256 characters). In any case, the start of a text is compatible with ISO 646. Then control sequences may change the meaning.
So browser can read most of ASCII data (really ISO 646, ISO 2022), and detect exactly how to interpret all other characters.
On Western languages, you get mostly ASCII on lower codes (until 127), but how to interpret the higher codes depends on language (Nordic characters, Western accented characters, Greek characters, etc.). And there are various encoding, which cannot be really detected without explicit specification.
Note: this method fails on few encodings, e.g. multibytes, like UCS-2, UTF-16, UTF-32, but W3C had some methods to detect it: the header should be mostly ASCII charset, so we should have a lot of 00 characters. EBCDIC and other encodings not based on ISO 646 (or ASCII) were already seldom. In principle you can check for some byte strings, but I do not know if browser did it.
In short: with heuristic (and ISO 646) you can guess on how to read ASCII charset, but to know how to interpret "special characters", e.g. accented characters, we must have more information, given by META or by HTTP header. Note: this works also with many Asian encoding (ISO 2022 based)
Why META? It is about control. HTTP header often required webmaster intervention, but with META the author of a page could override the encoding. (e.g. writing static pages, now most dynamic page generators can override HTTP headers).

Effects of Non-ASCII Characters in HTML vs HTML Encoded Characters

I had an issue earlier today where someone couldn't compile a static site due to some non-ASCII characters in a kramdown file. While writing a small script that finds these characters in our content, I ran across a large number of non-HTML encoded special characters.
What are the implications in including these characters directly in the HTML? Take the © character.
If I include the character directly in HTML, it seems to render correctly in my browser. That being said, I don't know the side-effects for those who don't have fonts installed that support these characters.
What are the side effects of leaving these non-ASCII characters in the HTML? I know in some situations it can lead to strange (?) characters showing up, but I'd like more specific information on how these special characters get rendered.
If I HTML encode these special characters and a client doesn't have a font that supports them, does it show the same (?) character? Is there any meaningful difference between using the HTML-encoded vs non encoded characters?usign
Is there any meaningful difference between using the HTML-encoded vs non encoded characters?
Not in terms of the browser being able to display them in general.
If you want to use these as you call them "non-standard" characters (which are very much standard characters, just not ASCII characters), you should specify an encoding, preferably utf-8. The HTML5 way of doing this (which is backwards compatible and supported by pretty much all browsers) is
<meta charset="utf-8">
That said, some tools compiling static HTML from markdown etc. might have problems with it, but that depends on the tool. You're safer using the entities like © there; which you can also always use without specifying an encoding.
This is not the full story, as the way a browser is decoding a file can also be influenced by other factors, like HTTP Response Headers. Also, even if you omit it, as you could observe, browsers do everything they can to still parse it correctly, there's just no guarantee.

How HTML meta charset works

How does meta charset work? Please correct my understanding if I am wrong. As I understand it, the charset is used as to indicate what encoding the page is to be shown? If I put a very specific encoding, others might not be able to see it displayed correctly. But why? Isn't the encoding set on the meta tage and the browser renders characters based on the charset? Or do I have the wrong idea (probably)?
Letters, numbers and other characters have to represented in computers as bytes.
There are different ways (character encodings) that can be used to represent the same characters. Usually you'll want to use UTF-8 these days.
Meta charset tells the browser which one you have used so it knows how to decode the bytes into characters correctly.
If you tell the browser you are using UTF-8 when you are actually using ISO-8859-1, then you'll get errors (the wrong characters) showing up in places where the encodings do not overlap.
character_set Specifies the character encoding for the HTML document.
In theory, any character encoding can be used, but no browser understands all of them. The more widely a character encoding is used, the better the chance that a browser will understand it.

HTML Unicode Issue: How to display special characters

Currently, I have my webpage set to Unicode/UTF-8. When trying to display a special character (for example, em dash, double arrow, etc), it shows up as a question mark symbol. I cannot change these characters to the HTML entity equivalent. How can I circumvent this issue?
A question mark in a lozenge, �, indicates a character-level error: the data contains bytes that do no represent any character, according to the character encoding being applied. This typically happens when the document is declared as UTF-8 encoded but is really in iso-8859-1, windows-1252, or some similar encoding. Windows-1252 is a common default encoding used by various programs on Windows platforms. So you may need to open the file in your authoring program and re-save it as UTF-8 encoded.
If problems remain, please post the URL. Posting the code alone is not sufficient, since the character encoding is primarily specified in HTTP headers.
If you see a question mark in a small box, then it might be a font-level problem (lack of glyph in the fonts being used), but this would be very rare for common characters like the em dash. Different browsers have different ways of indicating character- or font-level problems.
Make sure your document is set to the correct character encoding in the actual code editor, as well as in the doctype. Both are necessary. I spent hours trying to tweak HTML when the only problem was that I needed to set the text setting in Coda.
<head>
<meta charset="utf-8">
See the following screenshot:
Make sure your characters are actually UTF-8 characters. They will probably look something like this:
® or U+0020
http://www.kinsmancreative.com/transfer/char/index.php is a handy site for finding the decimal values of commonly used UTF-8 special characters if you need a reference.

Foreign characters in website

I found a website that contains the string "don’t". The obvious intent was the word "don't". I looked at the source expecting to see some character references, but didn't (it just shows the literal string "don’t". A Google search yielded nothing (expect lots of other sites that have the same problem!). Can anyone explain what's happening here?
Edit: Here's the meta tag that was used:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Would this not cause the page to be served up as Latin-1 in the HTTP header?
In your browser, switch the page encoding to "UTF-8". You're seeing a right single quote character, which is encoded by the octets 0xE2 0x80 0x99 in UTF-8. In your charset, windows-1252, those 3 octets render as "’". The page should be explicitly specifying UTF-8 as its charset either in the HTTP headers or in an HTML <meta> tag, but it probably isn't.
According to Character encondings in HTML a lemme in wikipedia:
HTML (Hypertext Markup Language) has
been in use since 1991, but HTML 4.0
(December 1997) was the first
standardized version where
international characters were given
reasonably complete treatment. When an
HTML document includes special
characters outside the range of
seven-bit ASCII two goals are worth
considering: the information's
integrity, and universal browser
display.
I suppose the site you checked, isn't impelemented with this in mind.
This has all got to do with encoding. Take a look back at the source, is there a tag at the top specifying it (charset)? My guess is it'll be UTF8 - although it could be something completely different.
This thread explains all. A combination of using a weird UTF-8 apostrophe character (probably originating from a Word Document), on a server that probably reports its encoding as non-UTF-8, despite the page having UTF characters (and possible even correctly reporting its own encoding).