Difference between meta charset="utf-8" and Notepad++ utf-8 encoding? - html

Do I need to do both of them, only set the notepad++ encoding or only do it in the meta tag?

If you save as “UTF-8” (and not as “UTF-8 without BOM”) in Notepad++, then the meta tag is not needed, since browsers and search engines will infer the encoding from the BOM. This is what browsers do in practice, and HTML5 formalizes it in clause 8.2.2.1, “Determining the character encoding”.
Writing a meta tag does not change the actual encoding. If present, it should match the encoding, of course.

The meta tag tells the browser what encoding the file has been saved in, so it needs to match the encoding you tell Notepad++ to save it in. If you were to save it in UTF-8, which uses a variable number of bytes per character, and have a meta tag stating ISO-8859-1 (Latin-1), then the browser will interpret each single byte as a character.
For example, if you save a cent character in a UTF-8 encoded document then it'll use two bytes: C2A2. However, if you interpret those bytes as Latin-1 you'll get two characters. Oddly enough, the second of those is the cent character.
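The cent example can be reproduced outside a browser. The following is a minimal Python sketch (the variable names are illustrative only); it uses Python's `iso-8859-1` codec to stand in for a browser that has been told the page is Latin-1.

```python
# The cent sign (U+00A2) saved as UTF-8 takes two bytes.
cent_utf8 = "¢".encode("utf-8")
print(cent_utf8.hex())             # c2a2

# A reader told "ISO-8859-1" maps every single byte to one character.
misread = cent_utf8.decode("iso-8859-1")
print(misread)                     # Â¢
print(len(misread))                # 2
```

The second character is the cent sign because U+00A2 sits at byte value 0xA2 in Latin-1, which happens to be the trailing byte of its UTF-8 form.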

The meta tag tells the web browser what encoding to open the file as; it does not tell Notepad++ anything. You need to set the encoding in Notepad++ to make sure that it actually saves the file as UTF-8. So the answer is both.

Related

confusion between encoding of a web document and the encoding explicitly used in the document

I know it's a very dumb question, but unfortunately I couldn't figure it out on my own. I always get confused when it comes to encoding and character set topics. I'll explain what I understand about the topic, then I'll ask my questions.
When you want to save a file, you do it in a certain character encoding, meaning that each character of the file fits in memory according to its encoding. Right?
For example, if an HTML file has UTF-16 encoding, does that mean that the browser uses UTF-16 to decode the given file to read the source code?
Does using the charset attribute in the meta element define what encoding the language (HTML) should use to properly display characters in the browser?
And did HTML add the "HTML character reference" on its own, so that it has nothing to do with Unicode character codes?
Edit1:
So after @snakecharmerb's answer I realized some of my mistakes:
1- I didn't know that there is no metadata about a [text] file's encoding.
2- The charset attribute tells the browser the encoding of the file because this information can't be inferred from the file itself (to some extent it can; see this answer).
3- A text file can only have one encoding, and if a file is encoded with UTF-8 it means it follows the Unicode Character Set (UCS). You can't use UTF-8 encoding with another character set, and today the terms UTF-8 and Unicode are almost interchangeable.
when you want to save a file, you do it in a certain character encoding, meaning that each character of the file fits in memory according to its encoding. right?
yes, each character is encoded to a specific numeric value; decoding converts the numeric value back to the character
for example, if an HTML file has UTF-16 encoding, does that mean that the browser uses UTF-16 to decode the given file to read the source code?
The browser will attempt to decode the page using the encoding given in the Content-Type header of the web server's response; if the header is missing or does not specify an encoding, the meta charset tag in the page is used. If neither is specified, the browser may attempt to infer the encoding from the document content, and finally falls back to a default, typically latin-1 (in practice usually windows-1252, depending on the user's locale).
The W3C recommends always setting the meta tag, only setting the Content-Type header if you are sure it will be correct, and always using UTF-8 as your encoding.
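The lookup order described in this answer can be sketched as a small function. This is only an illustration under stated assumptions: `pick_encoding` is a hypothetical helper, the regular expressions are a rough stand-in for real header and HTML parsing, and real browsers do more (they also check for a byte-order mark and may sniff the content).

```python
import re

def pick_encoding(content_type_header, html_bytes, fallback="iso-8859-1"):
    """Hypothetical helper following the order described above."""
    # 1. charset parameter of the HTTP Content-Type header, if present
    if content_type_header:
        m = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type_header, re.I)
        if m:
            return m.group(1).lower()
    # 2. <meta charset=...> (or the http-equiv form) near the start of the page
    head = html_bytes[:1024].decode("ascii", errors="replace")
    m = re.search(r'<meta[^>]*charset\s*=\s*["\']?([\w.:-]+)', head, re.I)
    if m:
        return m.group(1).lower()
    # 3. nothing declared: use the fallback (stands in for the browser default)
    return fallback

print(pick_encoding("text/html; charset=UTF-8", b"<meta charset='iso-8859-1'>"))  # utf-8
print(pick_encoding("text/html", b'<html><head><meta charset="utf-8">'))          # utf-8
print(pick_encoding(None, b"<html>"))                                             # iso-8859-1
```

The first call shows the header winning over a conflicting meta tag, which is why a wrong server header cannot be repaired from inside the document.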
does using the charset attribute in the meta element define what encoding the language (HTML) should use to properly display characters in the browser?
It tells the browser which encoding should be used to decode the page.
and did HTML add the "HTML character reference" on its own, so that it has nothing to do with Unicode character codes?
HTML entities (like &#39; or &apos;) are independent of any particular encoding, but their constituent characters will themselves be encoded and decoded.
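On the last question, numeric character references do refer to Unicode code points, while the reference text itself is plain ASCII. A small sketch using Python's standard `html` module (the variable names are illustrative):

```python
import html

# A numeric reference names a Unicode code point: 162 == 0xA2 == U+00A2.
decimal_ref = html.unescape("&#162;")   # decimal form
hex_ref = html.unescape("&#xA2;")       # hexadecimal form
named_ref = html.unescape("&cent;")     # named form
print(decimal_ref, hex_ref, named_ref)  # ¢ ¢ ¢

# The reference is written with ASCII characters only, so its bytes are
# identical whether the file is saved as UTF-8 or ISO-8859-1.
reference = "&#162;"
same_bytes = reference.encode("utf-8") == reference.encode("iso-8859-1")
print(same_bytes)                       # True
```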

How HTML meta charset works

How does meta charset work? Please correct my understanding if I am wrong. As I understand it, the charset is used to indicate what encoding the page is to be shown in? If I put a very specific encoding, others might not be able to see it displayed correctly. But why? Isn't the encoding set in the meta tag, and the browser renders characters based on the charset? Or do I have the wrong idea (probably)?
Letters, numbers and other characters have to be represented in computers as bytes.
There are different ways (character encodings) that can be used to represent the same characters. Usually you'll want to use UTF-8 these days.
Meta charset tells the browser which one you have used so it knows how to decode the bytes into characters correctly.
If you tell the browser you are using UTF-8 when you are actually using ISO-8859-1, then you'll get errors (the wrong characters) showing up in places where the encodings do not overlap.
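A short Python sketch of that mismatch (the sample word is an arbitrary choice): the ASCII letters, where the two encodings overlap, survive, and the accented letter does not.

```python
text = "café"
latin1_bytes = text.encode("iso-8859-1")
print(latin1_bytes)                # b'caf\xe9'

# Declared as UTF-8 but actually saved as ISO-8859-1: the lone byte 0xE9
# is not a valid UTF-8 sequence, so it becomes the replacement character.
shown = latin1_bytes.decode("utf-8", errors="replace")
print(shown)                       # caf�
```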
character_set: specifies the character encoding for the HTML document.
In theory, any character encoding can be used, but no browser understands all of them. The more widely a character encoding is used, the better the chance that a browser will understand it.

meta charset for unicode

I am using Jekyll which has some issues with UTF-8 files. I was able to work around this by saving the file as Unicode (UTF-16 LE).
However, it is an HTML document, in which I have until now been using the
<meta charset="utf-8">
line. Is this charset still correct, or should I be using another?
If you save the file as UTF-16 LE, you have to update the <meta> tag to match.
The document cited deals with “incorrect UTF-8 characters”, whatever that means. Just don’t do incorrect UTF-8 characters.
Saving an HTML file as UTF-16 is normally pointless, because UTF-16 just does not work on the web. Of course the meta tag should describe the real encoding, but that’s not the point, and charset declaration in HTTP headers will override any meta tags.
So keep using UTF-8, and fix the problem with your character data, instead of creating a new, serious problem.
I found some information from the World Wide Web Consortium.
HTML5 with UTF-16
Ensure that there is a byte-order mark at the beginning of the file.
The HTML Working Group is currently discussing whether you can use a meta element declaration in the head element when the encoding is UTF-16. For now, don't.
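For illustration, this is roughly what "a byte-order mark at the beginning of the file" means at the byte level. It is a minimal Python sketch with a made-up document string, not a recommendation to serve UTF-16.

```python
import codecs

html_text = "<!DOCTYPE html><title>demo</title>"

# The "utf-16-le" codec does not emit a BOM, so prepend one explicitly.
data = codecs.BOM_UTF16_LE + html_text.encode("utf-16-le")
print(data[:2].hex())              # fffe

# The generic "utf-16" codec reads the BOM to choose the byte order
# and drops it from the decoded text.
round_trip = data.decode("utf-16")
print(round_trip == html_text)     # True
```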

Displaying UTF-16 characters on web browser

I printed some UTF-16 encoded characters and tried to display them in Firefox, and it displayed them as �.
So I went to Tools->Encoding and changed the encoding from UTF-8 to UTF-16 (I also tried changing the charset directly in the HTML). However, when I did that, my page was completely flooded with symbols:
਍ℼ佄呃偙⁅瑨汭ാ㰊瑨汭ാഊ㰊敨摡ാ †ഠ †㰠楴汴㹥楬畮⁸‭楆敲潦⁸楤灳慬獹朠牡慢敧挠慨慲瑣牥⁳湩氠敩⁵景眠扥 瀠条⁥‭畓数⁲獕牥⼼楴汴㹥਍††氼湩敲㵬猢潨瑲畣⁴捩湯•牨晥∽瑨灴⼺振湤献瑳瑡捩渮瑥猯灵牥獵牥椯杭是癡捩湯椮潣㸢਍††氼湩敲㵬愢灰敬琭畯档椭潣≮栠敲㵦栢瑴㩰⼯摣⹮獳慴楴⹣敮............
How can web browsers display UTF-16 characters without wrecking the page?
The “flooded with symbols” excerpt looks like an HTML document that is UTF-8 encoded but treated as if it were UTF-16 encoded. Or it might contain mostly UTF-8 data with some UTF-16 encoded data thrown in, which won’t work.
If you save your data as properly UTF-16 encoded and declare the encoding in HTTP headers and/or meta tags, then some browsers will display it OK, some won’t. Search engines generally fail to process UTF-16, and UTF-16 is mostly not used and should not be used on the web, except by mutual agreement between consenting well-informed partners.
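The “flooded with symbols” effect is easy to reproduce. In this Python sketch the input string is simply the likely start of such a page; forcing UTF-16 onto UTF-8 bytes yields the same characters that open the excerpt in the question.

```python
page_start = "<!DOCTYPE html"      # 14 ASCII characters, 14 bytes in UTF-8
utf8_bytes = page_start.encode("utf-8")

# Reading the bytes as UTF-16 (little-endian) pairs them up:
# 0x3C 0x21 ("<!") becomes U+213C, 0x44 0x4F ("DO") becomes U+4F44, and so on.
garbled = utf8_bytes.decode("utf-16-le")
print(garbled)                     # ℼ佄呃偙⁅瑨汭
```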
Firefox could not figure out the correct charset of your document.
For web pages, a meta tag in the head should be used to indicate the content's charset.
It should be placed at the beginning of the HTML file, indicating which charset the browser should use for the rest of the file.
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
So the browser is charset blind until it reads that line. With UTF-8 that is no problem, because every character up to that point is encoded in UTF-8 the same way it would be in ASCII (the same goes for latin-1 and others). That's not the case in UTF-16.
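A quick way to see why the browser can read the declaration before it knows the charset, sketched in Python with an illustrative prefix string:

```python
prefix = "<html><head><meta charset="

ascii_bytes = prefix.encode("ascii")
same_utf8 = prefix.encode("utf-8") == ascii_bytes
same_latin1 = prefix.encode("iso-8859-1") == ascii_bytes
print(same_utf8, same_latin1)      # True True

# In UTF-16 every one of these characters takes two bytes, so a parser
# scanning for the ASCII bytes of "<meta" will not find them.
utf16_bytes = prefix.encode("utf-16-le")
print(utf16_bytes[:8])             # b'<\x00h\x00t\x00m\x00'
```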
W3C says:
There are three different Unicode character encodings: UTF-8, UTF-16
and UTF-32. Of these three, only UTF-8 should be used for Web content.
So you should use UTF-8. But if you still want to try something with UTF-16, put a BOM at the beginning of your file. That gives your browser a better chance of figuring out the encoding and decoding the content properly.
This other answer is very succinct about UTF-16 usage, while Joel gives a full lesson on character encoding and on why HTML puts its declaration inside the content rather than in header information.
Sending UTF-16 data as a Web page to browsers is an XSS risk in older browsers. (See another answer.) Don’t do it. Instead, convert the data to UTF-8 on the server and send UTF-8 over HTTP.
The way to make this work is for the page to say what encoding it's in. In the case of UTF-16, it also helps to include a BOM. The "flooded with Chinese" effect is most likely because your page is UTF-16LE but the browser treated it as UTF-16BE or vice versa...
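The byte-order mix-up mentioned here can be shown on two characters. This is a minimal Python sketch with an arbitrary sample string:

```python
le_bytes = "Hi".encode("utf-16-le")
print(le_bytes.hex())                   # 48006900

# Read with the opposite byte order, each pair is swapped:
# 0x48 0x00 becomes U+4800 and 0x69 0x00 becomes U+6900 (both CJK ideographs).
wrong = le_bytes.decode("utf-16-be")
print([hex(ord(c)) for c in wrong])     # ['0x4800', '0x6900']
```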

What to do with umlauts (äöü) in metatags?

Declaring them as &xuml; etc. didn't work, just writing them as they are leads to display errors.
What to do?
If your page is encoded as UTF-8, you should be able to use special characters directly (i.e. without converting them into their HTML entity counterparts) without problems. Note that if you declare the encoding in a content-type meta tag, you should put that tag at the very beginning of the head section.
Use an encoding which can encode the characters. I'd recommend UTF-8, which is generally the preferred solution for western languages.
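Whether an encoding "can encode the characters" can be checked directly. The helper below is hypothetical and only a sketch; it relies on Python raising `UnicodeEncodeError` when a character has no representation in the chosen charset.

```python
def can_encode(text, charset):
    """Hypothetical helper: True if every character fits in the charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

print(can_encode("äöü", "utf-8"))         # True
print(can_encode("äöü", "iso-8859-1"))    # True
print(can_encode("äöü€", "iso-8859-1"))   # False: ISO-8859-1 has no euro sign
print(can_encode("äöü", "ascii"))         # False
```

UTF-8 passes for any text, which is one reason it is the usual recommendation.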
Keep in mind that HTTP headers have precedence over <meta http-equiv=...>, but you should set both to ensure that the correct encoding is used when the document is loaded from non-HTTP sources (e.g. after saving the file locally).
You should never have to use HTML entities for those characters, since they have no special meaning in HTML. Just make sure the character encoding of the text you're outputting matches your charset header.