HTML Unicode Issue: How to display special characters - html

Currently, I have my webpage set to Unicode/UTF-8. When trying to display a special character (for example, em dash, double arrow, etc), it shows up as a question mark symbol. I cannot change these characters to the HTML entity equivalent. How can I circumvent this issue?

A question mark in a lozenge, �, indicates a character-level error: the data contains bytes that do not represent any character, according to the character encoding being applied. This typically happens when the document is declared as UTF-8 encoded but is really in iso-8859-1, windows-1252, or some similar encoding. Windows-1252 is a common default encoding used by various programs on Windows platforms. So you may need to open the file in your authoring program and re-save it as UTF-8 encoded.
If problems remain, please post the URL. Posting the code alone is not sufficient, since the character encoding is primarily specified in HTTP headers.
If you see a question mark in a small box, then it might be a font-level problem (lack of glyph in the fonts being used), but this would be very rare for common characters like the em dash. Different browsers have different ways of indicating character- or font-level problems.
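The character-level error described above can be reproduced in a few lines. This is only a sketch in Python, not part of the original answer; the sample string and variable names are made up for illustration. It saves an em dash as windows-1252, reads the bytes back as UTF-8, and then applies the re-save fix.

```python
# U+2014 is the em dash; the escape keeps this source file pure ASCII.
original = "a \u2014 b"

# Save the text the way an editor defaulting to windows-1252 would.
# In that encoding the em dash is the single byte 0x97.
saved_bytes = original.encode("cp1252")

# Read the bytes back as UTF-8, as a browser does for a page declared UTF-8.
# A lone 0x97 is not a valid UTF-8 sequence, so it becomes U+FFFD.
misread = saved_bytes.decode("utf-8", errors="replace")

# The fix: decode with the encoding the file is really in, re-save as UTF-8.
resaved_bytes = saved_bytes.decode("cp1252").encode("utf-8")
fixed = resaved_bytes.decode("utf-8")

print(repr(misread))
print(repr(fixed))
```

U+FFFD is the replacement character that browsers draw as the question mark in a lozenge.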

Make sure your document is saved in the correct character encoding in the actual code editor, and that the same encoding is declared in the meta tag inside the head (the doctype itself does not carry an encoding). Both are necessary. I spent hours trying to tweak HTML when the only problem was that I needed to set the text encoding setting in Coda.
<head>
<meta charset="utf-8">
</head>

Make sure your characters are actually UTF-8 characters. They will probably look something like this:
&#174; or U+00AE
http://www.kinsmancreative.com/transfer/char/index.php is a handy site for finding the decimal values of commonly used UTF-8 special characters if you need a reference.

Related

UTF-8 encoding in sublime text and visual studio

The question might be a bit basic – considering I'm not what the vast majority would consider a newcomer to front end web development.
I am teaching an 8 year old html, css and javascript. I'm taking the opportunity to also teach about utf-8 encoding, in particular the way HTML uses it to allow non-English characters to be encoded and displayed.
I want to show him how accented characters do not appear properly without including <meta charset="UTF-8"/>.
Surprisingly I was able to display "Á" in the test webpage when in theory this shouldn't have been possible as the utf-8 charset meta tag was missing.
After some research I came to the conclusion that in modern IDE's the encoding system comes "built in", hence there's no real need to write down <meta charset />. If this is wrong please correct me as I am currently confused as to what exactly happened and I don't want to teach wrong information to an 8 year old.
After some research I came to the conclusion that in modern IDE's the encoding system comes "built in", hence there's no real need to write down <meta charset />. If this is wrong please correct me
Yes, that is wrong!
Surprisingly I was able to display "Á" in the test webpage when in theory this shouldn't have been possible as the utf-8 charset meta tag was missing.
This is also wrong, let me explain!
UTF-8 is an encoding system. This means it describes how to map bytes into textual characters. It's certainly possible to display "Á" without using utf-8.
The letter A (normal, no accents) is encoded with the number 65 in both ASCII and UTF-8. In fact, all English characters and punctuation are encoded the same way across virtually all encodings, so encoding problems rarely become apparent in English-only text.
However, accented letters, non-English characters and emojis (😁) are encoded differently in different encoding systems. What causes "corrupt" text to be displayed is an encoding mismatch: your web browser thinks the encoding used is X while the file was actually encoded with system Y, so byte values no longer map to the correct characters. As a made-up example, suppose system X uses number 250 to encode 😁, while system Y uses number 190, and under system Y 250 is mapped to "Ë". So now my 😁 appears as "Ë".
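A real instance of this overlap and mismatch can be checked directly. The following is a small Python sketch, not something from the original answer; the choice of encodings and the variable names are assumptions made for illustration.

```python
encodings = ["ascii", "latin-1", "cp1252", "utf-8"]

# Plain "A" is the single byte 65 (0x41) in every one of these encodings.
plain_a = {name: "A".encode(name) for name in encodings}

# U+00C1 is the accented capital A; its bytes differ between encodings.
accented_latin1 = "\u00c1".encode("latin-1")  # one byte
accented_utf8 = "\u00c1".encode("utf-8")      # two bytes

# Mismatch: text saved as UTF-8 (U+00E9, e with acute) but read as Latin-1.
garbled = "\u00e9".encode("utf-8").decode("latin-1")

print(plain_a)
print(accented_latin1, accented_utf8)
print(garbled)
```

The last line prints two characters where one was intended, which is the typical look of an encoding mismatch.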
<meta charset="utf-8"/> specifies the encoding used for the HTML file. It is absolutely needed. Your webpage worked without it because browsers may use other ways to determine the encoding, including educated guesses, but it should always be explicitly written in the HTML to avoid problems down the line.
You should specify the encoding for several reasons:
Even if the encoding system came built in, you cannot know which default encoding the IDE has chosen.
The HTML5 specification says that when the document does not specify an encoding, it should be taken from the transport layer, and the default charset for HTTP/1.1 is ISO-8859-1.
See the full explanation here: Why it's necessary to specify the character encoding in an HTML5 document if the default character encoding for HTML5 is UTF-8?

Can I use HTML-entities without a fallback?

I am wondering if I can use HTML entities like
<h5><em>&#8646;</em> Headline</h5>
without any fallback if I use utf-8? (because on my systems this works totally fine). Are all these chars from http://dev.w3.org/html5/html-author/charref really all embedded into the utf-8-charset by default?
And how would I use it correctly, like this:
<h5><em>&#8646;</em> Headline</h5>
that
<h5><em>&lrarr;</em> Headline</h5>
or
<h5><em>⇆</em> Headline</h5>
There are two separate issues here:
get the browser to understand which character you want
render that character visually
For the first point, there are two options:
Embed the character directly as is, for which you will need to serve the HTML in an encoding that can encode that character. Yes, "⇆" is a Unicode character and can be encoded by any Unicode encoding. UTF-8 is the best choice here. The browser then simply needs to understand that the document is encoded in UTF-8 and it will be able to read and understand the character correctly. Set the appropriate HTTP header to denote the encoding.
Embed the character as an HTML entity. HTML entities are a way to embed any arbitrary character using only ASCII characters, e.g. &lrarr;. To encode this, your encoding of choice only needs to be able to encode &, l, r, a and ;, which are very standard characters in any encoding. This special sequence of characters is understood by the browser to mean the character "⇆". By embedding characters as HTML entities you can largely ignore the intricacies of managing encodings correctly, but it makes your source code rather unreadable. You should not do this in this day and age.
Whether you use named entities (&lrarr;) or refer to the character by its Unicode code point (&#8646;) doesn't really matter, they both result in the same thing.
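That the named and numeric forms resolve to the same character can be verified with Python's standard html module. This is only an illustrative sketch; the hexadecimal form &#x21C6; is added here for completeness and does not appear in the answer above.

```python
import html

# Named, decimal and hexadecimal references for U+21C6.
named = html.unescape("&lrarr;")
decimal = html.unescape("&#8646;")
hexadecimal = html.unescape("&#x21C6;")

# The entity form needs nothing beyond ASCII to be stored or transmitted.
ascii_bytes = "<em>&lrarr;</em>".encode("ascii")

print(named == decimal == hexadecimal)
print(ascii_bytes)
```

All three references decode to the same single code point, so the choice between them is purely about source readability.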
Having handled this, the character needs to be actually rendered as a glyph on screen. For this, an appropriate font is necessary. You'll have to test whether most of your target audience uses a system which has a font installed by default which contains this character. You can also provide your own font to the browser which contains this character as a web font.

How HTML meta charset works

How does meta charset work? Please correct my understanding if I am wrong. As I understand it, the charset is used to indicate which encoding the page is to be shown in? If I put a very specific encoding, others might not be able to see it displayed correctly. But why? Isn't the encoding set on the meta tag and the browser renders characters based on the charset? Or do I have the wrong idea (probably)?
Letters, numbers and other characters have to be represented in computers as bytes.
There are different ways (character encodings) that can be used to represent the same characters. Usually you'll want to use UTF-8 these days.
Meta charset tells the browser which one you have used so it knows how to decode the bytes into characters correctly.
If you tell the browser you are using UTF-8 when you are actually using ISO-8859-1, then you'll get errors (the wrong characters) showing up in places where the encodings do not overlap.
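The role of the declaration can be sketched as a two-step process: find the declared charset while looking only at ASCII, then decode the bytes with it. The Python below is a simplified illustration and not the algorithm browsers actually implement; declared_charset is a hypothetical helper, and both the 1024-byte scan window and the windows-1252 fallback are assumptions chosen for the example.

```python
import re

# Matches both <meta charset="..."> and the older http-equiv form.
META_CHARSET = re.compile(r"""<meta[^>]*charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)

def declared_charset(raw: bytes, default: str = "windows-1252") -> str:
    """Return the charset named in a meta tag, or a fallback (hypothetical helper)."""
    # The declaration itself is plain ASCII, so it is readable before the
    # real encoding is known. Only the start of the document is scanned.
    head = raw[:1024].decode("ascii", errors="replace")
    match = META_CHARSET.search(head)
    return match.group(1).lower() if match else default

# A page saved as UTF-8 that also declares UTF-8 (Croatian letters as escapes).
page_bytes = '<meta charset="utf-8"><p>\u010d \u0107 \u017e</p>'.encode("utf-8")
decoded = page_bytes.decode(declared_charset(page_bytes))

print(declared_charset(page_bytes))
print(decoded)
```

If the declaration named a different encoding than the one used to save the file, the same decode step would produce wrong characters wherever the two encodings disagree.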
character_set Specifies the character encoding for the HTML document.
In theory, any character encoding can be used, but no browser understands all of them. The more widely a character encoding is used, the better the chance that a browser will understand it.

How to make the website show signs like "č" and "ć"?

I'm making a website that is in Croatian, and I need to use signs like: "č", "ć", "ž", "đ" and "š". They are currently displayed as little boxes.
Info:
I use Notepad ++.
I set the encoding there to UTF-8.
I put the following line of HTML in: <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
However, it does not work. Even Notepad ++ can't display my characters using UTF-8, so that would suggest that I should probably use something else...
http://webdesign.maratz.com/lab/utf_table/
Use HTML entities, for example
č : &#269;
ž : &#382;
This sounds more like a font issue than a character encoding issue. If it were a character encoding issue, the characters would most likely be displayed as 2+ ASCII characters. The boxes, however, typically mean the character encoding is correct, but that specific character is not available in the font being used (which is especially common with lesser-used fonts). This would explain why it's behaving incorrectly in both the website and Notepad++.
To fix the issue, simply use a different font in your editor and website.
Note: I recommend a widely used font for the best chance of it working. Specifying a generic name in the website (e.g. serif or sans-serif) will probably have even better results, as the OS/browser would decide on the best font to use.
In short, be consistent about your character encoding throughout.
Configure your editor to save in the encoding you want
If you use any server side programming, make sure it isn't transcoding your data
If you use a database, make sure it is configured to use the same encoding
Configure your server to emit a Content-Type header that specifies that encoding
Use the meta tag in your question
The W3C provides useful material on encodings that starts here.
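The consistency checklist above boils down to using one encoding name everywhere. Here is a minimal Python sketch of the server-side part, assuming a hypothetical build_response helper; it is not tied to any particular web framework.

```python
def build_response(text: str, encoding: str = "utf-8") -> tuple:
    """Encode a page and build headers naming the same encoding (hypothetical helper)."""
    body = text.encode(encoding)
    headers = {
        # The charset parameter must name the encoding used for the body.
        "Content-Type": f"text/html; charset={encoding}",
        "Content-Length": str(len(body)),
    }
    return headers, body

# Croatian letters written as escapes so this source file stays ASCII.
page = '<meta charset="utf-8"><p>\u010d \u0107 \u017e \u0111 \u0161</p>'
headers, body = build_response(page)

print(headers["Content-Type"])
print(body.decode("utf-8") == page)
```

Deriving the header value and the encoded bytes from the same variable makes it hard for the two to drift apart.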
A useful site for special characters and their ASCII-codes: CopyPaste Character
To 'type' them, use the alt codes.
However, to use them in your site, you'd better use the HTML codes, which you can find on CPC
As a test, try this:
<span style="font-family:Arial Unicode MS">
č ć ž đ š
</span>
You should be able to see your characters correctly.
I've just copied and pasted a line from your question along with your meta tag, placed it into a plain text file in vi.
It works just fine - all characters are displayed fine: http://www.dusystems.com/tmp/1.html
If you can't do the same with your editor then the problem is with the editor and not character sets and encodings.
If you're on Windows you can use its built-in Notepad to edit UTF-8 files. Open Notepad, type all of your special characters, add the meta tag. When doing Save As select UTF-8 from the Encoding drop-down in the dialog. Save as something.html and open in IE. It will 100% work.

Foreign characters in website

I found a website that contains the string "don’t". The obvious intent was the word "don't". I looked at the source expecting to see some character references, but didn't (it just shows the literal string "don’t"). A Google search yielded nothing (except lots of other sites that have the same problem!). Can anyone explain what's happening here?
Edit: Here's the meta tag that was used:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Would this not cause the page to be served up as Latin-1 in the HTTP header?
In your browser, switch the page encoding to "UTF-8". You're seeing a right single quote character, which is encoded by the octets 0xE2 0x80 0x99 in UTF-8. In your charset, windows-1252, those 3 octets render as "’". The page should be explicitly specifying UTF-8 as its charset either in the HTTP headers or in an HTML <meta> tag, but it probably isn't.
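The three octets mentioned above can be checked with a short round trip. This Python sketch is only an illustration with made-up variable names; it reproduces the garbled string and then reverses the mistake.

```python
# U+2019 is the right single quotation mark used as a typographic apostrophe.
apostrophe = "\u2019"
utf8_octets = apostrophe.encode("utf-8")

# What a browser shows when UTF-8 bytes are interpreted as windows-1252.
seen = "don\u2019t".encode("utf-8").decode("cp1252")

# Reversing the wrong decoding step recovers the intended text.
repaired = seen.encode("cp1252").decode("utf-8")

print(utf8_octets)
print(seen)
print(repaired)
```

The repair step works here because all three bytes happen to map to defined windows-1252 characters; that is not guaranteed for every byte sequence.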
According to "Character encodings in HTML", an article on Wikipedia:
HTML (Hypertext Markup Language) has been in use since 1991, but HTML 4.0 (December 1997) was the first standardized version where international characters were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit ASCII two goals are worth considering: the information's integrity, and universal browser display.
I suppose the site you checked isn't implemented with this in mind.
This has all got to do with encoding. Take a look back at the source, is there a tag at the top specifying it (charset)? My guess is it'll be UTF8 - although it could be something completely different.
This thread explains all. The cause is a combination of a typographic UTF-8 apostrophe character (probably originating from a Word document) and a server that probably reports its encoding as non-UTF-8, despite the page having UTF-8 characters (and possibly even correctly reporting its own encoding).