utf-8 on web page translating some characters - html

I seem to have encountered a weird behaviour with my HTML page: it displays
only some Chinese characters, and the rest show up as ??
I have already changed the HTTP header to utf-8
Content-Type: "text/html; charset=utf-8"
�?��?好
The question is: why are only some Chinese characters shown and others not?
Edit :-
I have dug deeper into the issue and there are two parts in the code.
There is a function in the code that encodes strings to cp1252, and it is applied to this string
before encoding :-
<pre>
<font size=\"2\">\x{e6}\x{99}\x{9a}\x{e5}\x{ae}\x{89} </font>
</pre>
after encoding :-
Then I tested by changing the encoding to iso-8859-1 and everything shows fine. My question now is: why is that? I'm assuming cp1252 is older and does not support some of the byte values that UTF-8 uses?
Thank You.
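One likely explanation can be reproduced in a few lines of Python. The original encoding function is not shown, so this sketch assumes it treats each UTF-8 byte as a separate character (U+00E6, U+0099, and so on) before encoding. ISO-8859-1 maps every code point from U+0000 to U+00FF straight back to the same byte, so the UTF-8 bytes survive. cp1252 uses bytes 0x80 to 0x9F for other characters (0x99 is the trademark sign, for example), so it has no byte for U+0099, U+009A or U+0089, and those come out as ?.

```python
# The six UTF-8 bytes from the question.
utf8_bytes = b"\xe6\x99\x9a\xe5\xae\x89"

# Assumption: the page's code sees each byte as its own character,
# i.e. code points U+00E6, U+0099, U+009A, U+00E5, U+00AE, U+0089.
as_chars = utf8_bytes.decode("iso-8859-1")

# ISO-8859-1 maps U+0000..U+00FF one-to-one onto bytes, so nothing is lost.
via_latin1 = as_chars.encode("iso-8859-1")

# cp1252 cannot encode U+0099, U+009A or U+0089; replace them with "?".
via_cp1252 = as_chars.encode("cp1252", errors="replace")

print(via_latin1 == utf8_bytes)  # the round trip preserves the bytes
print(via_cp1252)                # three of the six bytes are now "?"
```

Under that assumption, cp1252 is not "older"; it simply cannot pass those three byte values through unchanged.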

Related

Swift 3 - Apostrophes being converted to â on screen-scrape

In my Swift 3 application, I am getting HTML text from a web api to render inside of a UIWebView. However, apostrophes specifically and maybe other special characters are rendering as accent letters instead of their real value. For example, the text for “Transportation’s” displays as “Transportationâs”.
The code is simple.
var myHTMLString = try String(contentsOf: myURL, encoding: .ascii)
//myHTMLString = "some bâd string"
webview.loadHTMLString(myHTMLString, baseURL: nil);
The values are correct on the API. Why is this happening when grabbing?
Change the encoding to .utf8
I'm not sure why the .ascii isn't working, though I suspect it has to do with HTML entity encoding. If I find a reason, I'll update this answer...
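Update: This w3schools.com page explains that UTF-8 is now the standard encoding for HTML, and that it can represent a far larger set of characters than ASCII. The likely cause here is that the API returns UTF-8 and the "apostrophe" is really a right single quotation mark (’, U+2019), which UTF-8 stores as the three bytes E2 80 99. Decoded with a single-byte encoding, the first byte E2 shows up as â and the other two as stray or invisible characters.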
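The effect can be reproduced outside Swift. The following Python sketch assumes the API sends UTF-8 and that the apostrophe is U+2019; `decode_as` is a hypothetical helper, and cp1252 stands in for "some single-byte encoding", since the exact fallback the Swift call uses is not confirmed here.

```python
# U+2019 is the right single quotation mark used as a typographic apostrophe.
text = "Transportation\u2019s"
data = text.encode("utf-8")  # what a UTF-8 API would send over the wire


def decode_as(raw: bytes, encoding: str) -> str:
    """Decode raw bytes, substituting U+FFFD for anything undecodable."""
    return raw.decode(encoding, errors="replace")


wrong = decode_as(data, "cp1252")  # three bytes become three separate characters
right = decode_as(data, "utf-8")   # three bytes become one character again

print(data)          # the apostrophe occupies bytes E2 80 99
print(ascii(wrong))  # \xe2 is the "a with circumflex" seen on screen
print(right == text)
```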

Basics on encoding

Today I've started my first HTML page. Where is the page encoding stored exactly?
At first, é turned into Ã©. Then I used my text editor to save the file with a specific encoding. "UTF-8" didn't work. Then I used "ISO 8859-1", which did work. How did my browser know it was encoded with "ISO 8859-1"?
I can't see it anywhere in my file, so I'm very curious about where the info is stored.
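A plain-text file carries no encoding label of its own, apart from an optional byte order mark at its start; the editor simply picks an encoding when it writes the bytes. Notepad++ and similar programs usually provide a number of options to view and change it.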
Additionally, you can provide a value by using the meta tag:
<meta charset="UTF-8"> (HTML5)
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"> (HTML4)
Those tags are used by browsers to parse your file. However, they do not define the encoding of the file itself (and that's what seems to be happening in your case: your file has encoding A, and the browser is trying to read encoding B), and browsers can ignore those declarations.
The default encoding can also be defined (and overridden) by your server. A sample .htaccess encoding configuration:
AddDefaultCharset utf-8
AddType 'text/html; charset=utf-8' .html .htm .shtml
UTF-8 is the recommended encoding standard for the web.
The UTF-8 encoding for é is the two hex bytes C3 A9.
C3 A9, when interpreted as ISO 8859-1, is two characters: Ã©.
Browsers tend to guess correctly at the encoding. Or you can explicitly tell the browser how to interpret the bytes. Try that out -- you will probably see the text change between é and Ã©.
A third case is when "double encoding" occurs. That is, somehow, the Ã© is itself treated as text and encoded as UTF-8 again, giving hex C3 83 C2 A9.
So, to really be sure of what is going on, you need to get the HEX.
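One way to get the hex is a short Python sketch like the one below. `hex_bytes` is a hypothetical helper name, and `bytes.hex` with a separator assumes Python 3.8 or later.

```python
def hex_bytes(text: str, encoding: str = "utf-8") -> str:
    """Return the bytes of `text` in `encoding` as spaced uppercase hex."""
    return text.encode(encoding).hex(" ").upper()


e_acute = "\u00e9"  # the character e with an acute accent

# Correct UTF-8 and correct ISO 8859-1.
print(hex_bytes(e_acute))                # C3 A9
print(hex_bytes(e_acute, "iso-8859-1"))  # E9

# Double encoding: the UTF-8 bytes are misread as ISO 8859-1 (giving two
# characters) and that text is then encoded as UTF-8 a second time.
mojibake = e_acute.encode("utf-8").decode("iso-8859-1")
print(hex_bytes(mojibake))               # C3 83 C2 A9
```

Comparing the hex of the stored bytes against these three patterns shows which of the cases applies.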

Include Unicode Signature (BOM) in HTML files or not?

In Dreamweaver I have the option "Include Unicode Signature (BOM)".
If I check this box and save the file, the HTML file looks good when viewed in the web browser. If not, it gives me strange symbols for Swedish letters like åäö.
If I serve this HTML file with strange letters using the response header "Content-Type: text/html; charset=utf-8", it still gives me strange symbols.
Q1) Does that mean that it's not a UTF-8 encoded file (the one without BOM that shows strange symbols)?
Q2) What makes a file UTF-8 encoded, is it just the Unicode signature (BOM)?
Q3) Should I or should I not add the Include Unicode Signature (BOM) in my files (HTML, Javascript, CSS, PHP)?
I know that I can add <meta charset="UTF-8"> in the HTML code or type AddDefaultCharset UTF-8 in my .htaccess. I just figure the optimal solution would be to have a response header that says "it's a UTF-8 encoded file" and then also actually serve a UTF-8 encoded file. Nothing else.
Q4) I thought HTML files were plain text-files. What other information is hidden in those files and how can I read this information?
The BOM is entirely optional for UTF-8. The Unicode consortium points out that it can create problems while offering no real advantage; the W3C says that it can be a substitute for other forms of declaring the encodings and should work on all modern browsers.
The BOM is only there to clarify the endianness of the encoding. Since UTF-8 only has one kind of endianness it is superfluous. It's only useful for UTF-16 and other encodings. A UTF-8 encoded file is UTF-8 encoded regardless of the presence of the BOM.
HTML files do not "hide" any other information, they're plain text.
My recommendation would be:
encode as UTF-8 without BOM
add the HTTP Content-Type header to denote the encoding of the file
also add the <meta> tag into the HTML itself as a fallback, should the file be interpreted outside of an HTTP context (meaning where no HTTP header exists because the file is not read over HTTP)
This gives you the best compatibility with the least potential for issues. If your characters are still appearing funny, then your file is not actually UTF-8 encoded or the HTTP header is not being set correctly.
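To check Q1 and Q2 directly, the bytes of the file can be inspected. The Python sketch below is a rough check under the assumption that the file's bytes are already in hand; `has_utf8_bom` and `is_valid_utf8` are hypothetical helper names. It shows that being UTF-8 is a property of the bytes, with or without the signature.

```python
import codecs


def has_utf8_bom(raw: bytes) -> bool:
    """True if the data starts with the UTF-8 signature EF BB BF."""
    return raw.startswith(codecs.BOM_UTF8)


def is_valid_utf8(raw: bytes) -> bool:
    """True if the data decodes as UTF-8 without errors."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


swedish = "\u00e5\u00e4\u00f6"  # the three Swedish letters from the question

without_bom = swedish.encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom
latin1 = swedish.encode("iso-8859-1")

print(has_utf8_bom(without_bom), is_valid_utf8(without_bom))  # False True
print(has_utf8_bom(with_bom), is_valid_utf8(with_bom))        # True True
print(has_utf8_bom(latin1), is_valid_utf8(latin1))            # False False
```

If the file saved without the BOM fails the second check, the editor probably wrote it in a different encoding, which would explain the strange symbols even with the correct header.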

meta tag to correct Â®

I'm having some trouble getting a special character properly encoded.
Â® keeps coming through instead of the registered trademark symbol (®). I've tried changing the meta tag to UTF-8 and Windows-1252, but it still comes through in the garbled form. Can I add a meta tag to fix this?
Make sure to save your file with the proper encoding:
Here is an example; on the left side, the file is saved with Windows-1252 encoding.
On the right side, it's saved with UTF-8 encoding
HTML options
For such characters, encoding with ISO-8859-1 might do it too, but UTF-8 is greatly encouraged.
Make sure your DOCTYPE is clearly defined : <!DOCTYPE HTML>.
Make sure your meta tag is written properly: <meta charset="UTF-8">.
PHP options
If you use PHP within your page, add the following at the beginning of the page:
<?php header('Content-Type: text/html; charset=utf-8'); ?>
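If the content is output from a database as ISO-8859-1, you might want to use utf8_encode() to convert it to UTF-8. Note that it only handles ISO-8859-1 input, and applying it to text that is already UTF-8 double-encodes it.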
utf8_encode()
Encodes an ISO-8859-1 string to UTF-8
The information about encoding should correspond to the actual encoding. So instead of making guesses and trial and error, find out what the encoding really is. It seems to be UTF-8, and if declaring UTF-8 in a meta tag does not help, the probable culprit is an HTTP header that the server sends and that declares a different encoding, trumping the meta tag. Use e.g. an HTTP header viewer to check out the situation.
If the server announces iso-8859-1 or windows-1252 and if you cannot change this, then you just have to use that encoding instead of UTF-8. Then save the page in your authoring program as windows-1252 encoded.
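The precedence described above can be sketched in Python. Both function names are hypothetical and the header values are made-up examples; the parsing relies on the standard library's `email.message.Message`, which understands the `type/subtype; charset=...` syntax that Content-Type uses.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None when absent


def effective_charset(header_value: Optional[str],
                      meta_charset: Optional[str]) -> Optional[str]:
    """The HTTP header wins; the meta tag is only a fallback."""
    if header_value:
        from_header = declared_charset(header_value)
        if from_header:
            return from_header
    return meta_charset.lower() if meta_charset else None


# The server announces iso-8859-1, so the UTF-8 meta tag is ignored.
print(effective_charset("text/html; charset=ISO-8859-1", "UTF-8"))
# No charset in the header, so the meta tag applies.
print(effective_charset("text/html", "UTF-8"))
```

This is a simplification: it leaves out byte order marks and browser sniffing, which the source answer does not cover either.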

validator.w3.org reports a markup error - detected character encoding "utf-8"

validator.w3.org reports for www.besaltnlight.ca:
Character Encoding Override in effect!
The detected character encoding "utf-8" has been suppressed and "iso-8859-1" used instead.
The php code outputs iso-8859-1 and php sets that as the default characterset.
What is causing this problem? Am I using the wrong doctype?
Oh, and would any of this cause quirks mode in IE?
Thanks for your help.
Gerry
The document is encoded in UTF-8. It has a byte order mark, smart quotes, and an ellipsis, all properly encoded in UTF-8. It begins with two byte order marks, which is invalid. You must remove one, and the validator also says that the presence of a BOM in a UTF-8 document may be confusing, so you may remove them both.
Since you’re outputting UTF-8, you must change the HTTP header to:
Content-type: text/html; charset=utf-8
Since you are missing that header, you force the browser to guess. Additionally, the meta tag must be changed to
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
for the same reason.
Your output starts with a Unicode byte order mark, encoded in UTF-8.
These are likely the first few bytes of your PHP file, or of any PHP file included by your main file. Your editor may not even show them. Interpreted as ISO-8859-1, the start of the output looks like ï»¿ï»¿<!DOCTYPE html, which is actually two byte order marks, one after the other.
As said by jleedev, either make sure your files are really encoded in Latin-1, or declare the encoding as UTF-8.
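To see the doubled byte order mark and remove it, something like the following Python sketch could be run over the bytes of the output or of each source file. `strip_leading_boms` is a hypothetical helper, and the sample bytes are constructed here rather than taken from the site.

```python
import codecs


def strip_leading_boms(raw: bytes) -> bytes:
    """Remove every UTF-8 byte order mark (EF BB BF) at the start of raw."""
    while raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw


# Constructed example: two BOMs, e.g. one from the main file and one
# from an included file, followed by the start of the markup.
output = codecs.BOM_UTF8 * 2 + b"<!DOCTYPE html>"

# Read as ISO-8859-1, each BOM turns into three visible characters.
print(ascii(output.decode("iso-8859-1")))

print(strip_leading_boms(output))  # b'<!DOCTYPE html>'
```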