Foreign characters in website - html

I found a website that contains the string "don’t". The obvious intent was the word "don't". I looked at the source expecting to see some character references, but didn't (it just shows the literal string "don’t". A Google search yielded nothing (expect lots of other sites that have the same problem!). Can anyone explain what's happening here?
Edit: Here's the meta tag that was used:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Would this not cause the page to be served up as Latin-1 in the HTTP header?

In your browser, switch the page encoding to "UTF-8". You're seeing a right single quote character, which is encoded by the octets 0xE2 0x80 0x99 in UTF-8. In your charset, windows-1252, those 3 octets render as "’". The page should be explicitly specifying UTF-8 as its charset either in the HTTP headers or in an HTML <meta> tag, but it probably isn't.

According to Character encondings in HTML a lemme in wikipedia:
HTML (Hypertext Markup Language) has
been in use since 1991, but HTML 4.0
(December 1997) was the first
standardized version where
international characters were given
reasonably complete treatment. When an
HTML document includes special
characters outside the range of
seven-bit ASCII two goals are worth
considering: the information's
integrity, and universal browser
display.
I suppose the site you checked, isn't impelemented with this in mind.

This has all got to do with encoding. Take a look back at the source, is there a tag at the top specifying it (charset)? My guess is it'll be UTF8 - although it could be something completely different.

This thread explains all. A combination of using a weird UTF-8 apostrophe character (probably originating from a Word Document), on a server that probably reports its encoding as non-UTF-8, despite the page having UTF characters (and possible even correctly reporting its own encoding).

Related

Why to include <meta charset=“” />?

I mean if a browser is already reading the HTML file and is able to read the text <meta charset=“” /> that means it already knows the encoding of the HTML file. So why is it needed to be specified inside the HTML file? Isn’t it redundant?
Is it because browser starts reading file using smallest charset, like ASCII, and it is subset of many charsets?
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself — not in the HTML itself, but as one of the response headers that are sent before the HTML page.
This causes problems. Suppose you have a big web server with lots of sites and hundreds of pages contributed by lots of people in lots of different languages and all using whatever encoding their copy of Microsoft FrontPage saw fit to generate. The web server itself wouldn’t really know what encoding each file was written in, so it couldn’t send the Content-Type header.
It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy… how can you read the HTML file until you know what encoding it’s in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
But that meta tag really has to be the very first thing in the section because as soon as the web browser sees this tag it’s going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.
See also W3.org:
Always declare the encoding of your document using a meta element with a charset attribute, or using the http-equiv and content attributes (called a pragma directive). The declaration should fit completely within the first 1024 bytes at the start of the file, so it's best to put it immediately after the opening head tag.
So yes. The entire premise is that until the HTML parser of your browser reads that meta tag, there should not be any bytes that can be ambiguously interpreted as other bytes; the entire text shown including the charset attribute value ("utf-8") fits into the ASCII encoding.
From Joel's article:
Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working.
The average HTML parser goes like this:
Is there a Content-Type response header with a charset parameter? Use that to decode the bytes of the received content into a string.
Start reading the HTML as ASCII (or UTF-8). Is there a <meta http-equiv="Content-Type"> header with a usable charset? Use that.
Start parsing the bytes and use heuristics to determine the most likely encoding used.
It is an obsolete tag, but the reason: we have ISO 646 (since 1967) which defines a standard set of characters. ASCII specifies the few optional characters on ISO 646, so ISO 646 is the mother of most of encodings.
Note: most systems are based on this standard, ev. using the extension ISO 2022, where you can encode 7-bit and 8-bit characters with few different encodings (e.g. used for Asian character set, where we need more then 256 characters). In any case, the start of a text is compatible with ISO 646. Then control sequences may change the meaning.
So browser can read most of ASCII data (really ISO 646, ISO 2022), and detect exactly how to interpret all other characters.
On Western languages, you get mostly ASCII on lower codes (until 127), but how to interpret the higher codes depends on language (Nordic characters, Western accented characters, Greek characters, etc.). And there are various encoding, which cannot be really detected without explicit specification.
Note: this method fails on few encodings, e.g. multibytes, like UCS-2, UTF-16, UTF-32, but W3C had some methods to detect it: the header should be mostly ASCII charset, so we should have a lot of 00 characters. EBCDIC and other encodings not based on ISO 646 (or ASCII) were already seldom. In principle you can check for some byte strings, but I do not know if browser did it.
In short: with heuristic (and ISO 646) you can guess on how to read ASCII charset, but to know how to interpret "special characters", e.g. accented characters, we must have more information, given by META or by HTTP header. Note: this works also with many Asian encoding (ISO 2022 based)
Why META? It is about control. HTTP header often required webmaster intervention, but with META the author of a page could override the encoding. (e.g. writing static pages, now most dynamic page generators can override HTTP headers).

W3 validation error "content" "charset" [duplicate]

In order to define charset for HTML5 Doctype, which notation should I use?
Short:
<meta charset="utf-8" />
Long:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
In HTML5, they are equivalent. Use the shorter one, as it is easier to remember and type. Browser support is fine since it was designed for backwards compatibility.
Both forms of the meta charset declaration are equivalent and should work the same across browsers. But, there are a few things you need to remember when declaring your web files character-set as UTF-8:
Save your file(s) in UTF-8 encoding without the byte-order mark (BOM).
Declare the encoding in your HTML files using meta charset (like above).
Your web server must serve your files, declaring the UTF-8 encoding in the Content-Type HTTP header.
Apache servers are configured to serve files in ISO-8859-1 by default, so you need to add the following line to your .htaccess file:
AddDefaultCharset UTF-8
This will configure Apache to serve your files declaring UTF-8 encoding in the Content-Type response header, but your files must be saved in UTF-8 (without BOM) to begin with.
Notepad cannot save your files in UTF-8 without the BOM. A free editor that can is Notepad++. On the program menu bar, select "Encoding > Encode in UTF-8 without BOM". You can also open files and re-save them in UTF-8 using "Encoding > Convert to UTF-8 without BOM".
More on the Byte Order Mark (BOM) at Wikipedia.
Another reason to go with the short one is that it matches other instances where you might specify a character set in markup. For example:
<script type="javascript" charset="UTF-8" src="/script.js"></script>
<p><a charset="UTF-8" href="http://example.com/">Example Site</a></p>
Consistency helps to reduce errors and make code more readable.
Note that the charset attribute is case-insensitive. You can use UTF-8 or utf-8, however UTF-8 is clearer, more readable, more accurate.
Also, there is absolutely no reason at all to use any value other than UTF-8 in the meta charset attribute or page header. UTF-8 is the default encoding for Web documents since HTML4 in 1999 and the only practical way to make modern Web pages.
Also you should not use HTML entities in UTF-8. Characters like the copyright symbol should be typed directly. The only entities you should use are for the five reserved markup characters: less than, greater than, ampersand, prime, double prime.
Entities need an HTML parser, which you may not always want to use going forward. They introduce errors, make your code less readable, increase your file sizes, and sometimes decode incorrectly in various browsers depending on which entities you used. Learn how to type/insert copyright, trademark, open quote, close quote, apostrophe, em dash, en dash, bullet, Euro, and any other characters you encounter in your content, and use those actual characters in your code.
The Mac has a Character Viewer that you can turn on in the Keyboard System Preference, and you can find and then drag and drop the characters you need, or use the matching Keyboard Viewer to see which keys to type. For example, trademark is Option + 2. UTF-8 contains all of the characters and symbols from every written human language.
So there is no excuse for using -- instead of an em dash. It is not a bad idea to learn the rules of punctuation and typography also ... for example, knowing that a period goes inside a close quote, not outside.
Using a <meta> tag for something like content-type and encoding is highly
ironic, since without knowing those things, you couldn't parse the file
to get the value of the meta tag.
No, that is not true. The browser starts out parsing the file as the browser's default encoding, either UTF-8 or ISO-8859-1. Since US-ASCII is a subset of both ISO-8859-1 and UTF-8, the browser can read <html><head> just fine either way ... it is the same. When the browser encounters the meta charset tag, if the encoding is different than what the browser is already using, the browser reloads the page in the specified encoding.
That is why we put the meta charset tag at the top, right after the head tag, before anything else, even the title. That way you can use UTF-8 characters in your title.
You must save your file(s) in UTF-8 encoding without BOM
That is not strictly true. If you only have US-ASCII characters in your document, you can Save it as US-ASCII and serve it as UTF-8, because it is a subset. But if there are Unicode characters, you are correct, you must Save as UTF-8 without BOM.
If you want a good text editor that will save your files
in UTF-8, I recommend Notepad++.
On the Mac, use Bare Bones TextWrangler (free) from Mac App Store, or Bare Bones BBEdit which is at Mac App Store for $39.99 ... very cheap for such a great tool.
In either app, there is a menu at the bottom of the document window where you specify the document encoding and you can easily choose "UTF-8 no BOM". And of course you can set that as the default for new documents in Preferences.
But if your Webserver serves the encoding in the HTTP header,
which is recommended, both [meta tags] are needless.
That is incorrect. You should of course set the encoding in the HTTP header, but you should also set it in the meta charset attribute so that the page can be saved by the user, out of the browser onto local storage and then opened again later, in which case the only indication of the encoding that will be present is the meta charset attribute.
You should also set a base tag for the same reason ... on the server, the base tag is unnecessary, but when opened from local storage, the base tag enables the page to work as if it is on the server, with all the assets in place and so on, no broken links.
AddDefaultCharset UTF-8
Or you can just change the encoding of particular file types like so:
AddType text/html;charset=utf-8 html
A tip for serving both UTF-8 and Latin-1 (ISO-8859-1) files is to give the UTF-8 files a "text" extension and Latin-1 files "txt."
AddType text/plain;charset=iso-8859-1 txt
AddType text/plain;charset=utf-8 text
Finally, consider saving your documents with Unix line endings, not legacy DOS or (classic) Mac line endings, which don't help and may hurt, especially down the line as we get further and further from those legacy systems.
An HTML document with valid HTML5, UTF-8 encoding, and Unix line endings is a job well done. You can share and edit and store and read and recover and rely on that document in many contexts. It's lingua franca. It's digital paper.
<meta charset="utf-8"> was introduced with/for HTML5.
As mentioned in the documentation, both are valid. However, <meta charset="utf-8"> is only for HTML5 (and easier to type/remember).
In due time, the old style is bound to become deprecated in the near future. I'd stick to the new <meta charset="utf-8">. There's only one way, but up. In tech's case, that's phasing out the old (really, REALLY fast)
Documentation: HTML meta charset Attribute—W3Schools
While not contesting the other answers, I think the following is worthy of mentioning.
The “long” (http-equiv) notation and the “short” one are equal. Whichever comes first wins;
Web server headers will override all the <meta> tags;
BOM (byte order mark) will override everything, and in many cases it will affect HTML 4 (and probably other stuff, too);
If you don't declare any encoding, you will probably get your text in “fallback text encoding” that is defined your browser. Neither in Firefox nor in Chrome it's UTF-8;
In absence of other clues the browser will attempt to read your document as if it was in ASCII to get the encoding, so you can't use any weird encodings (UTF-16 with BOM should do, though);
While the specifications say that the encoding declaration must be within the first 512 bytes of the document, most browsers will try reading more than that.
You can test by running echo 'HTTP/1.1 200 OK\r\nContent-type: text/html; charset=windows-1251\r\n\r\n\xef\xbb\xbf<!DOCTYPE html><html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta charset="windows-1251"><title>привет</title></head><body>привет</body></html>' | nc -lp 4500 and pointing your browser at localhost:4500. (Of course you will want to change or remove parts. The BOM part is \xef\xbb\xbf. Be wary of the encoding of your shell.)
Please mind that it's very important that you explicitly declare the encoding. Letting browsers guess can lead to security issues.
Use <meta charset="utf-8" /> for web browsers when using HTML5.
Use <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> when using HTML4 or XHTML, or for outdated DOM parsers, like DOMDocument in PHP 5.3.
To embed a signature in an email, I would use the long version:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
The reason is that not many email readers use HTML5, so it's always better use old HTML styles. Actually, it's better to use tables than divs + CSS as well.
There is some news based on Mozilla Foundation, and SitePoint:
Do not use this value (http-equiv=content-type) as it is obsolete.
Prefer the charset attribute on the <meta> element.

Displaying UTF-16 characters on web browser

I printed some UTF-16 encoded characters and tried to display it in Firefox and it displayed it as �.
So I went to Tools->Encoding and changed the encoding from UTF-8 to UTF-16 (I also tried changing charset directly in the HTML) However, when I did that, my page was completely flooded with symbols:
਍ℼ佄呃偙⁅瑨汭ാ㰊瑨汭ാഊ㰊敨摡ാ †ഠ †㰠楴汴㹥楬畮⁸‭楆敲潦⁸楤灳慬獹朠牡慢敧挠慨慲瑣牥⁳湩氠敩⁵景眠扥 瀠条⁥‭畓数⁲獕牥⼼楴汴㹥਍††氼湩敲㵬猢潨瑲畣⁴捩湯•牨晥∽瑨灴⼺振湤献瑳瑡捩渮瑥猯灵牥獵牥椯杭是癡捩湯椮潣㸢਍††氼湩敲㵬愢灰敬琭畯档椭潣≮栠敲㵦栢瑴㩰⼯摣⹮獳慴楴⹣敮............
How can web browsers display UTF-16 characters without wrecking the page?
The “flooded with symbols” excerpt looks like an HTML document that is UTF-8 encoded but treated as if it were UTF-16 encoded. Or it might contain mostly UTF-8 data with some UTF-16 encoded data thrown in, which won’t work.
If you save your data as properly UTF-16 encoded and declare the encoding in HTTP headers and/or meta tags, then some browsers will display it OK, some won’t. Search engines generally fail to process UTF-16, and UTF-16 is mostly not used and should not be used on the web, except by mutual agreement between consenting well-informed partners.
Firefox could not figure the correct charset in your document.
For web pages head meta tag should be used to indicate the content's charset.
It should be placed in the beginning of the HTML file indicating which charset the browser should use for the rest of the file.
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
So the browser is charset blind until it reads that line. But using utf-8 is no problem. Because every character up to that point is encoded in utf-8 the same way it would be in ASCII (same goes for latin-1 and others). That's not the case in utf-16.
W3C says:
There are three different Unicode character encodings: UTF-8, UTF-16
and UTF-32. Of these three, only UTF-8 should be used for Web content.
So you should use utf-8. But if you still want to try something with utf-16 use the BOM in the begging of your file. You're going to give your browser a better chance of figuring it out and properly decode the content.
This other answer is very succinct about utf-16 usage.
While Joel gives a full lesson on character encoding and why HTML uses it declaration inside the content and not as a header information.
Sending UTF-16 data as a Web page to browsers is an XSS risk in older browsers. (See another answer.) Don’t do it. Instead, convert the data to UTF-8 on the server and send UTF-8 over HTTP.
The way to make this work is for the page to say what encoding it's in. In the case of UTF-16, it also helps to include a BOM. The "flooded with Chinese" effect is most likely because your page is UTF-16LE but the browser treated it as UTF-16BE or vice versa...

Single quotes showing as diamond shaped question mark in browsers (no database or PHP)

I am working with a web page in which I switched the character set from iso-8859-1 to utf-8. The top of the page reads like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>[title of site]</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
I am only using ASCII characters in the page, and since utf-8 encoding supersets ASCII, this should be fine. However, single quotes in the text are showing up as question marks surrounded by black diamonds. I have verified these are are ASCII single quotes (not straight quotes).
I've read much online that describes solutions to the problem that involve PHP, magic quotes, database configuration, etc. However, this is a flat HTML page that isn't being rendered by any programs.
Also, many who have this problem are told to switch to UTF-8 to fix the problem. This is exactly how I introduced the problem.
Please look at http://mch.blackcatwebinc.com/src/events.html to see this problem.
The only quotes in ASCII are the single quote ' (0x27 or 39) and the double quote " (0x22 or 33). What you have there is an 8-bit encoding that places quotes at 145 (0x91) and 146 (0x92) called CP1252; it's the standard 8-bit Western European encoding for Windows. If what you want is UTF-8, you need to convert that to UTF-8, since it's not valid UTF-8; valid UTF-8 uses multiple bytes for characters above 127 (0x7F), and places the opening and closing quotes at U+2018 and U+2019 respectively.
According to the W3C, the meta charset
should appear as close as possible to the top of the head element
From http://www.w3.org/International/questions/qa-html-encoding-declarations#metacontenttype
So, I might try to place the meta tag above the title.
Also, as mentioned in the first answer by #user1505373, UTF is always capitalized and there is no space after the = in any of the examples I saw.
Your source code is not saved in UTF-8 but Latin1 CP1252, and those quotes are not simple quotes but U+2019 RIGHT SINGLE QUOTATION MARKS (encoded in Latin1). Save the source file in UTF-8 and it'll work.
The simplest fix is to change UTF-8 to windows-1252 in the meta tag. This works, because the server announces no encoding in the Content-Type header, so browsers and other clients will use the one specified in a meta tag.
The name windows-1252 is the preferred MIME name for the 8-bit Windows Latin-1 encoding, also known as cp1252 and some other names (often misrepresented as “ANSI”).
As #deceze explains, the actual encoding of the data is windows-1252, not UTF-8. You can alternatively change the actual encoding to UTF-8 by saving the file with a suitable command in your authoring software. But what really matters is that the declared encoding matches the real one.
Yet another possibility is to use “escapes” for the apostrophe, such as ’. They work independently of encoding, but they make the source code less legible.
The only difference I see between your tag and the one on the site I'm working on is the space after the semicolon and that utf is lowercase on yours. Try capitalizing UTF.
All ASCII printable characters have their equivalent HTML Entity Code. Some of these characters are generally supported by most common OS typefaces, some are categorized as Symbols that bring us to your rendering issue.
What you supposedly have there is a closing single quote, and in order to get it rightly printed you should use it's entity code, or ’ respectively.
If it turns to be an opening single quote, then you should use ‘ instead.
Note, there's no HTML Entity Name for the two ASCII characters (and some more) so you're required to opt the entity code variant.

HTML Unicode Issue: How to display special characters

Currently, I have my webpage set to Unicode/UTF-8. When trying to display a special character (for example, em dash, double arrow, etc), it shows up as a question mark symbol. I cannot change these characters to the HTML entity equivalent. How can I circumvent this issue?
A question mark in a lozenge, �, indicates a character-level error: the data contains bytes that do no represent any character, according to the character encoding being applied. This typically happens when the document is declared as UTF-8 encoded but is really in iso-8859-1, windows-1252, or some similar encoding. Windows-1252 is a common default encoding used by various programs on Windows platforms. So you may need to open the file in your authoring program and re-save it as UTF-8 encoded.
If problems remain, please post the URL. Posting the code alone is not sufficient, since the character encoding is primarily specified in HTTP headers.
If you see a question mark in a small box, then it might be a font-level problem (lack of glyph in the fonts being used), but this would be very rare for common characters like the em dash. Different browsers have different ways of indicating character- or font-level problems.
Make sure your document is set to the correct character encoding in the actual code editor, as well as in the doctype. Both are necessary. I spent hours trying to tweak HTML when the only problem was that I needed to set the text setting in Coda.
<head>
<meta charset="utf-8">
See the following screenshot:
Make sure your characters are actually UTF-8 characters. They will probably look something like this:
® or U+0020
http://www.kinsmancreative.com/transfer/char/index.php is a handy site for finding the decimal values of commonly used UTF-8 special characters if you need a reference.