Displaying Unicode symbols in HTML

I simply want to display the tick (✔) and cross (✘) symbols in an HTML page, but they show up as either a box or goop like âœ” - obviously something to do with the encoding.
I have set the meta tag to show utf-8 but obviously I'm missing something.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Edit/Solution: From the comments made, using Firebug I found that the headers being passed by my page were in fact "Content-Type: text/html" and not UTF-8. Looking at the file format using Notepad++ showed my file was formatted as "UTF-8 without BOM". After changing this to just UTF-8 the symbols now show correctly... but Firebug still seems to indicate the same content-type.

You should ensure the HTTP server headers are correct.
In particular, the header:
Content-Type: text/html; charset=utf-8
should be present.
The meta tag is ignored by browsers if the HTTP header is present.
Also ensure that your file is actually encoded as UTF-8 before serving it. Check or try the following:
Ensure your editor saves it as UTF-8.
Ensure your FTP or other file transfer program does not alter the file.
Try HTML numeric character references, like &#uuu;.
To be really sure, hexdump the file and look at the character; for ✔ it should be E2 9C 94.
Note: If you use a Unicode character for which your system can't find a glyph (no font has that character), your browser should display a question mark or some block-like symbol. But if you see multiple Roman characters like you do, this indicates an encoding problem.
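If no hexdump tool is at hand, the same check can be sketched in a few lines of Python. This is only an illustration, not part of the original answer; the helper name hex_bytes is made up for the example.

```python
def hex_bytes(text: str, encoding: str = "utf-8") -> str:
    """Return the encoded bytes of `text` as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


tick = "\u2714"   # ✔ HEAVY CHECK MARK
cross = "\u2718"  # ✘ HEAVY BALLOT X

print(hex_bytes(tick))   # E2 9C 94
print(hex_bytes(cross))  # E2 9C 98

# What appears when those UTF-8 bytes are decoded as Windows-1252 instead:
# three unrelated characters, one per byte.
mojibake = tick.encode("utf-8").decode("cp1252")
print(mojibake)
```

Seeing three Latin-looking characters where one symbol was expected is the typical sign that UTF-8 bytes are being read with a single-byte encoding.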

I know an answer has already been accepted, but wanted to point a few things out.
Setting the content-type and charset is obviously a good practice, and doing it on the server is much better because it ensures consistency across your application.
However, I would use UTF-8 only when the language of my application uses a lot of characters that are not available in a single-byte charset. If you want to show a Unicode character or symbol in one-off cases, you can do so without changing the charset of your page.
HTML renderers have always been able to display symbols that are not part of the page's character encoding, as long as you write the symbol as a numeric character reference (NCR). Sounds weird, but it's true.
So, even if your HTML has a header stating an encoding of ANSI or any of the ISO charsets, you can display a check mark by using its numeric character reference, in decimal - &#10003; or in hex - &#x2713;
So it's a little difficult to understand why you are facing this issue on your pages. Can you check whether the NCR value is correct? This is a good reference: http://www.fileformat.info/info/unicode/char/2713/index.htm
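To double-check an NCR value without looking it up, something like the following Python sketch should do. It only uses the standard html module; the sample markup is invented for illustration.

```python
import html

check = "\u2713"  # ✓ CHECK MARK

# Build the decimal and hexadecimal numeric character references
decimal_ref = f"&#{ord(check)};"
hex_ref = f"&#x{ord(check):X};"
print(decimal_ref, hex_ref)

# Both references decode back to the same character
assert html.unescape(decimal_ref) == check
assert html.unescape(hex_ref) == check

# The references are plain ASCII, so the markup can be saved in a
# single-byte charset without losing the symbol
page = f"<p>Done {decimal_ref}</p>".encode("iso-8859-1")
print(page)
```

Because the reference consists only of ASCII characters, it does not matter which ASCII-compatible encoding the page is saved or served in.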

Make sure that you actually save the file as UTF-8, alternatively use HTML entities (&#nnn;) for the special characters.

Contrary to what Nicolas says, the meta tag isn’t actually ignored by browsers. However, the Content-Type HTTP header always takes precedence over a meta tag in the document.
So make sure that you either send the correct encoding via the HTTP header, or don’t send this HTTP header at all (not recommended). The meta tag is mainly a fallback option for local documents which aren’t sent via HTTP traffic.
Using HTML entities should also be considered a workaround – that’s tiptoeing around the real problem. Configuring the web server properly prevents a lot of nuisance.
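The precedence described here can be sketched roughly as follows in Python. This is a simplification under stated assumptions: it only models "header charset first, meta tag as fallback" and ignores BOM sniffing and other browser heuristics. The function names header_charset and effective_charset are hypothetical.

```python
from email.message import Message
from typing import Optional


def header_charset(content_type: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, or None if absent


def effective_charset(content_type: Optional[str],
                      meta_charset: Optional[str]) -> Optional[str]:
    """The header charset wins; the meta declaration is only a fallback."""
    from_header = header_charset(content_type)
    if from_header:
        return from_header
    return meta_charset.lower() if meta_charset else None


# Header declares a charset: the meta tag is overruled
print(effective_charset("text/html; charset=ISO-8859-1", "utf-8"))
# Header present but without charset: the meta tag is used
print(effective_charset("text/html", "UTF-8"))
# No header at all (e.g. a local file): the meta tag is used
print(effective_charset(None, "utf-8"))
```

This also fits the question's observation: a header of plain "Content-Type: text/html" carries no charset, so the meta tag is what the browser falls back to.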

I think this is a file problem: you simply saved your file in a 1-byte encoding like Latin-1. Look up how to make your editor save files as UTF-8.
I wonder why there are editors that don't default to utf-8.

Related

Include Unicode Signature (BOM) in HTML files or not?

In Dreamweaver I have the option "Include Unicode Signature (BOM)".
If I check this box and save the HTML file, it looks good when viewed in the web browser. If not, it gives me strange symbols for Swedish letters like åäö.
If I serve this HTML file with strange letters using the response header "Content-Type: text/html; charset=utf-8", it still gives me strange symbols.
Q1) Does that mean that it's not a UTF-8 encoded file (the one without BOM that shows strange symbols)?
Q2) What makes a file UTF-8 encoded, is it just the Unicode signature (BOM)?
Q3) Should I or should I not add the Include Unicode Signature (BOM) in my files (HTML, Javascript, CSS, PHP)?
I know that I can add <meta charset="UTF-8"> in the HTML code or put AddDefaultCharset UTF-8 in my .htaccess. I just figure the optimal solution would be to have a response header that says "it's a UTF-8 encoded file" and then also actually serve a UTF-8 encoded file. Nothing else.
Q4) I thought HTML files were plain text-files. What other information is hidden in those files and how can I read this information?
The BOM is entirely optional for UTF-8. The Unicode consortium points out that it can create problems while offering no real advantage; the W3C says that it can be a substitute for other forms of declaring the encodings and should work on all modern browsers.
The BOM is only there to clarify the endianness of the encoding. Since UTF-8 only has one kind of endianness it is superfluous. It's only useful for UTF-16 and other encodings. A UTF-8 encoded file is UTF-8 encoded regardless of the presence of the BOM.
HTML files do not "hide" any other information, they're plain text.
My recommendation would be:
encode as UTF-8 without BOM
add the HTTP Content-Type header to denote the encoding of the file
also add the <meta> tag into the HTML itself as a fallback, should the file be interpreted outside of an HTTP context (meaning where no HTTP header exists because the file is not read over HTTP)
This gives you the best compatibility with the least potential for issues. If your characters are still appearing funny, then your file is not actually UTF-8 encoded or the HTTP header is not being set correctly.
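To see what the BOM actually is at the byte level, here is a small Python sketch (not from the original answer). It uses Python's "utf-8-sig" codec, which writes the signature on encoding and strips it on decoding.

```python
import codecs

text = "\u00e5\u00e4\u00f6"  # åäö

without_bom = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")  # same bytes, prefixed with the BOM

print(without_bom.hex(" "))  # c3 a5 c3 a4 c3 b6
print(with_bom.hex(" "))     # ef bb bf c3 a5 c3 a4 c3 b6

# The BOM is just three extra bytes at the start of the file
assert with_bom == codecs.BOM_UTF8 + without_bom

# Both variants hold the same text; "utf-8-sig" strips a BOM only if present
assert with_bom.decode("utf-8-sig") == text
assert without_bom.decode("utf-8-sig") == text
```

The letters themselves are encoded identically in both files, which is why a file without the BOM is still a UTF-8 file.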

meta tag to correct Â®

I'm having some trouble getting a special character properly encoded.
Â® keeps coming through instead of the registered trademark symbol. I've tried changing the meta tag to UTF-8 and Windows-1252, but it still comes through in the encoded format. Can I add a meta tag to fix this?
Make sure to save your file with the proper encoding.
(Screenshot not preserved: on the left side, the file saved with Windows-1252 encoding; on the right side, saved with UTF-8 encoding.)
HTML options
For such characters, encoding with ISO-8859-1 might do it too, but UTF-8 is greatly encouraged.
Make sure your DOCTYPE is clearly defined : <!DOCTYPE HTML>.
Make sure your meta tag is written properly: <meta charset="UTF-8">.
PHP options
If you use PHP within your page, add the following at the beginning of the page:
<?php header('Content-Type: text/html; charset=utf-8'); ?>
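If the content is output from a database and is stored as ISO-8859-1, you can convert it to UTF-8 with utf8_encode(). Note that it only handles ISO-8859-1 input and is deprecated as of PHP 8.2; mb_convert_encoding() is the general-purpose alternative.
utf8_encode()
Encodes an ISO-8859-1 string to UTF-8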
The information about encoding should correspond to the actual encoding. So instead of making guesses and trial and error, find out what the encoding really is. It seems to be UTF-8, and if declaring UTF-8 in a meta tag does not help, the probable culprit is an HTTP header that the server sends and that declares a different encoding, trumping the meta tag. Use e.g. an HTTP header viewer to check out the situation.
If the server announces iso-8859-1 or windows-1252 and if you cannot change this, then you just have to use that encoding instead of UTF-8. Then save the page in your authoring program as windows-1252 encoded.
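One way to find out what the encoding really is, rather than guessing, is to test the bytes directly. The Python sketch below is only an illustration; is_valid_utf8 is a made-up helper, and note that pure ASCII content passes this check under either encoding.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


registered = "\u00ae"  # ® REGISTERED SIGN

as_utf8 = registered.encode("utf-8")     # b'\xc2\xae'
as_cp1252 = registered.encode("cp1252")  # b'\xae'

print(is_valid_utf8(as_utf8))    # True
print(is_valid_utf8(as_cp1252))  # False: a lone 0xAE byte is not valid UTF-8

# UTF-8 bytes read as Windows-1252 give the two-character sequence
# described in the question
print(as_utf8.decode("cp1252"))  # Â®
```

So a page that shows Â® is most likely stored as UTF-8 but being decoded as Windows-1252 (or ISO-8859-1) somewhere along the way.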

Safe HTML form accept charset?

I faced a parameter encoding issue when submitting a form with the GET method (I can't use the POST method). Some accented characters were not escaped in the URL, since my page was UTF-8. The Spring controller received bad characters instead.
I solved this issue by setting accept-charset="ISO-8859-1" on my form, but now I am wondering which charset is safe for all server/browser combinations. Is there a recommended charset for my forms and GET URLs?
This is frustrating (to put it mildly) with servlets. The standard URL encoding must use UTF-8 yet servlets not only default to ISO-8859-1 but don't offer any way to change that with code.
Sure, you can call request.setCharacterEncoding("UTF-8") before you read anything, but for some ungodly reason this only affects the request body, not query string parameters. There is nothing in the servlet request interface to specify the encoding used for query string parameters.
Using ISO-8859-1 in your form is a hack. Using this ancient encoding will cause more problems than it solves. In particular, browsers do not actually support ISO-8859-1 and always treat it as Windows-1252, whereas servlets treat ISO-8859-1 as ISO-8859-1, so you will be screwed beyond belief if you go with this.
To change this in Tomcat, for example, you can use the URIEncoding attribute on your <Connector> element:
<Connector ... URIEncoding="UTF-8" ... />
If you don't use a container that has such a setting, can't change its settings, or hit some other issue, you can still make it work, because ISO-8859-1 decoding retains full information from the original bytes.
String correct = new String(request.getParameter("test").getBytes("ISO-8859-1"), "UTF-8");
So let's say test=ä and everything is set up correctly: the browser encodes it as test=%C3%A4. Your servlet will incorrectly decode it as ISO-8859-1 and give you the resulting string "Ã¤". If you apply the correction, you get ä back:
System.out.println(new String("Ã¤".getBytes("ISO-8859-1"), "UTF-8").equals("ä"));
// true
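The same round trip can be tried outside a servlet container. Here is a rough Python sketch using urllib.parse; it is not servlet code and only mimics the byte-level steps described above.

```python
from urllib.parse import quote, unquote

original = "\u00e4"  # ä

# A browser on a UTF-8 page percent-encodes the UTF-8 bytes
encoded = quote(original, encoding="utf-8")
print(encoded)  # %C3%A4

# A server that assumes ISO-8859-1 turns each byte into one character
misdecoded = unquote(encoded, encoding="iso-8859-1")
print(misdecoded)  # two characters instead of one

# Re-encoding as ISO-8859-1 restores the original bytes,
# which can then be decoded as UTF-8
repaired = misdecoded.encode("iso-8859-1").decode("utf-8")
assert repaired == original
```

The repair works because ISO-8859-1 maps every byte value to exactly one character, so nothing is lost in the wrong decoding step.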
nickdos is right.
Another way of doing this is using the meta-data tag:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
Also keep in mind that when handling the response on the server, the code should use the correct (same) encoding.
Example:
use stringParameter.getBytes("utf-8") instead of stringParameter.getBytes()
And when using Spring make sure the correct encoding is configured for message converters in the DispatcherServlet's configuration file (XYZ_-servlet.xml), e.g.:
<bean id="stringHttpMessageConverter" class="org.springframework.http.converter.StringHttpMessageConverter">
<property name="supportedMediaTypes" value = "text/plain;charset=UTF-8"/>
</bean>
The problem is that URLs are always transmitted as 7-bit ASCII. Because your form sends characters outside the standard ASCII set via GET, you have several issues going on:
URLs have practical length limits (older Internet Explorer versions capped them at 2,083 characters), so long form values might get truncated.
If a user enters characters outside the ISO charset you set in the form's accept-charset attribute, they will not be encoded correctly into the URL. That is because the browser first encodes the text using the form's (or page's) encoding and then percent-encodes the resulting bytes into ASCII. Any character not in that ISO set cannot be represented faithfully.
The browser always encodes the characters in your URL using the page encoding or meta tags first. But if there is a server HTTP header, that encoding overrides your meta tag encoding. The default encoding for HTML5 pages is UTF-8, but you are overriding that with an ISO standard. Even so, all encoding done by browsers replaces each non-ASCII byte with a "%" followed by two hexadecimal digits, taken from the page's encoding or, in your case, the form's accept-charset encoding. That is what gets sent to the server, so look at your URL to see what has been sent.
When your URL arrives at the server, it arrives as ASCII, so you first need to read the string as ASCII and then decode the percent-escapes using the page encoding or, in your case, the form's accept-charset value to get the true values.
I recommend you remove the form encoding, use the page's UTF-8 settings for broader character support, and drop in the two meta tags below to make sure you are sending back UTF-8 encoded data, which covers all the characters needed and is easily decoded on the server as described by the other posters above.
<meta charset="utf-8" />
<meta content="text/html; charset=utf-8" http-equiv="content-type" />
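To see how the chosen charset changes what ends up in the URL, here is a small Python sketch. It uses urllib.parse.quote as a stand-in for the browser's percent-encoding; real browsers do not raise an error for unencodable characters but send a substitute instead, so treat the last part as an approximation.

```python
from urllib.parse import quote

e_acute = "\u00e9"  # é
euro = "\u20ac"     # €

# Same character, different bytes in the URL depending on the charset
print(quote(e_acute, encoding="utf-8"))       # %C3%A9
print(quote(e_acute, encoding="iso-8859-1"))  # %E9

# UTF-8 can encode any character
print(quote(euro, encoding="utf-8"))          # %E2%82%AC

# ISO-8859-1 has no byte for the euro sign at all
try:
    quote(euro, encoding="iso-8859-1")
    latin1_can_encode_euro = True
except UnicodeEncodeError:
    latin1_can_encode_euro = False
print(latin1_can_encode_euro)  # False
```

This is the practical argument for UTF-8 forms: every character a user can type has a well-defined percent-encoded form.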

HTML - When using UTF-8 or ISO-8859-1 do I still need to type the codes for the special characters?

It is as the title says:
HTML - When using UTF-8 or ISO-8859-1 do I still need to type the codes for the special characters?
Or can I just type them normally?
Ex: I'm using UTF-8 in my HTML META tag. I need to type ç. Should I just type it, or type its code, which is &ccedil;?
I know this is a trivial question, but it's fundamental so I just can't skip it.
No, you only need to use a character reference if:
The character you want cannot be represented in the character encoding you are using or
The character has some special meaning in HTML (such as < or &).
Note that declaring you are using UTF-8 in the meta tag is insufficient. You also have to encode the HTML source in UTF-8 (good editors will default to this) and not override it with a declaration of some other encoding in the real HTTP headers. You should also set the real HTTP headers to state that UTF-8 is being used.
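The first condition (whether a character can be represented in the encoding you are using) is easy to test. Below is a minimal Python sketch; can_type_directly is a hypothetical helper name, and the xmlcharrefreplace error handler is Python's built-in way of falling back to numeric character references.

```python
def can_type_directly(char: str, encoding: str) -> bool:
    """True if `char` can be stored as-is in a file saved with `encoding`."""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


c_cedilla = "\u00e7"  # ç
tick = "\u2714"       # ✔

print(can_type_directly(c_cedilla, "utf-8"))       # True
print(can_type_directly(c_cedilla, "iso-8859-1"))  # True
print(can_type_directly(tick, "iso-8859-1"))       # False

# Characters the encoding cannot hold can be written as numeric references
fallback = tick.encode("iso-8859-1", errors="xmlcharrefreplace")
print(fallback)  # b'&#10004;'
```

With UTF-8 the check always succeeds, so only the characters with special meaning in HTML still need references.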
Yes, you can include those characters directly in your HTML source, without using the entity for the character. Just make sure that the encoding you are saving the file in really does match what the web server serves it in.
The part about ensuring that the encoding is correct is important, and easy to get wrong. One thing to note is that the meta tag is not the primary source of information that the browser uses for interpreting the encoding of the document. The primary source of information is the Content-type header, sent as part of the HTTP headers. The meta tag was originally supposed to be used to communicate to the web server what Content-type to use, but most web servers use configuration separate from the document itself for this. So if you are saving your document as UTF-8, make sure that the web server is configured to serve pages as UTF-8 as well.
The meta tag is used by browsers as a fallback if the Content-type header is not provided or does not include valid encoding information. It is useful to have if you are ever going to be loading from a source that doesn't provide Content-type information, like using a file: URL to view the page on your local machine.
So, there are 3 places you should make sure your encoding is set up properly; in your text editor (so that it saves the file with the appropriate encoding), in your web server configuration (so that it communicates the appropriate encoding to the browser), and in the meta tag, so that when you view the page locally, it is displayed with the correct encoding.
Finally, you shouldn't use ISO-8859-1. That's a legacy encoding, only still supported for compatibility. Every major browser and text editor supports UTF-8 by now, which covers all of Unicode, and provides a lot fewer encoding headaches.

Characters not displaying correctly in different browsers

I used certain characters in website such as • — “ ” ‘ ’ º ©.
I found that when testing to see what my website looked like under different browsers (BrowserLab)
the afore-mentioned characters are replaced with �.
I then changed the charset in the webpage header from:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
to
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Suddenly all the pages have the above mentioned characters replaced with a ?.
Even more puzzling is this is not always consistent across and even within the same page, as some sections display the character • and © correctly.
In particular, I need to replace the character • with one that will display across browsers, can anyone help me with the answer? Thanks.
You should save your HTML source as UTF-8.
Alternatively, you can use HTML entities instead.
The source code needs to be saved in the same encoding as you're instructing the browser to parse it in. If you're saving your files in UTF-8, instruct the browser to parse it as UTF-8 by setting an appropriate HTTP header or HTML meta tag (headers preferable, your web server may be setting one without you knowing). Use a decent editor that clearly tells you what encoding you're saving the file as. If it doesn't display correctly, there's a discrepancy between what you're telling your browser the file is encoded in and what it's really encoded in.
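That discrepancy is easy to reproduce. The following Python sketch assumes the original file was saved as Windows-1252 (as the first meta tag in the question suggests) and shows why the bullet turns into a replacement character once the page is declared as UTF-8.

```python
bullet = "\u2022"  # •

saved_as_cp1252 = bullet.encode("cp1252")  # b'\x95'
saved_as_utf8 = bullet.encode("utf-8")     # b'\xe2\x80\xa2'

# File saved as Windows-1252 but declared as UTF-8: the byte is invalid,
# so the decoder substitutes U+FFFD, the replacement character
shown = saved_as_cp1252.decode("utf-8", errors="replace")
assert shown == "\ufffd"

# File saved and declared as UTF-8: the character survives
assert saved_as_utf8.decode("utf-8") == bullet
```

Changing only the declaration therefore just swaps one kind of garbage for another; the file itself has to be re-saved in the declared encoding.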
Check to see if Apache is set up to send the charset. Look for the directive "AddDefaultCharset" and set it to Off in .htaccess or your config file.
Most/all browsers will take what is sent in the HTTP headers over what is in the document.
If you're using Notepad++, I suggest using the EditPlus editor to copy the text (which has the special characters) and paste it into your file. This should work.
Yes, I had this problem too: in Notepad++, copying and pasting wasn't working with some symbols.
I think SLaks is right
The HTML entity for the copyright symbol is &#169;