Safe HTML form accept charset? - html

I ran into a parameter encoding issue when submitting a form with the GET method (I can't use POST). Some accented characters were not encoded in the URL the way the server expected, since my page was UTF-8, and the Spring controller received garbled characters instead.
I solved this by setting accept-charset="ISO-8859-1" on my form, but now I am wondering which charset is safe for all server/browser combinations. Is there a recommended one for forms and GET URLs?

This is frustrating (to put it mildly) with servlets. URL encoding should use UTF-8, yet servlets not only default to ISO-8859-1 but also offer no way to change that in code.
Sure, you can call req.setCharacterEncoding("UTF-8") before you read anything, but for some ungodly reason this only affects the request body, not query string parameters. There is nothing in the servlet request interface to specify the encoding used for query string parameters.
Using ISO-8859-1 in your form is a hack, and this ancient encoding will cause more problems than it solves. In particular, browsers do not really support ISO-8859-1 and always treat it as Windows-1252, whereas servlets treat ISO-8859-1 as ISO-8859-1, so you will be screwed beyond belief if you go with this.
To change this in Tomcat, for example, you can use the URIEncoding attribute on the <Connector> element in server.xml:
<Connector ... URIEncoding="UTF-8" ... />
If you use a container that has no such setting, or you can't change its settings for some other reason, you can still make it work, because ISO-8859-1 decoding retains the full information of the original bytes:
String correct = new String(request.getParameter("test").getBytes("ISO-8859-1"), "UTF-8");
So let's say test=ä. If everything is set up correctly, the browser encodes it as test=%C3%A4. Your servlet will incorrectly decode that as ISO-8859-1 and give you the string "Ã¤". If you apply the correction, you get ä back:
System.out.println(new String("Ã¤".getBytes("ISO-8859-1"), "UTF-8").equals("ä"));
//true
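The round trip described above can be reproduced without a servlet container. The following is a minimal sketch, not the container's actual code path: the class name QueryParamRepair and the repair helper are made up for illustration, and URLDecoder stands in for whatever decoding the container performs. It assumes Java 10 or later for the Charset overloads of URLEncoder and URLDecoder.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class QueryParamRepair {

    // Re-interprets a string that was wrongly decoded as ISO-8859-1 as UTF-8.
    static String repair(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u00e4" is the letter a with diaeresis; the escape avoids depending on the source file encoding.
        String original = "\u00e4";

        // What a browser on a UTF-8 page puts on the wire.
        String onTheWire = URLEncoder.encode(original, StandardCharsets.UTF_8);

        // Simulates a container that decodes the query string as ISO-8859-1.
        String misdecoded = URLDecoder.decode(onTheWire, StandardCharsets.ISO_8859_1);

        System.out.println(onTheWire);                               // %C3%A4
        System.out.println(misdecoded.length());                     // 2
        System.out.println(repair(misdecoded).equals(original));     // true
    }
}
```

The repair only works because ISO-8859-1 maps every byte value to exactly one character, so no information is lost in the wrong decode. It should not be applied to values that were already decoded correctly.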

nickdos is right.
Another way of doing this is using the meta-data tag:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
Also keep in mind when handling the response on the server, the code should also use the correct (same) encoding.
Example:
use stringParameter.getBytes("utf-8") instead of stringParameter.getBytes()
And when using Spring make sure the correct encoding is configured for message converters in the DispatcherServlet's configuration file (XYZ_-servlet.xml), e.g.:
<bean id="stringHttpMessageConverter" class="org.springframework.http.converter.StringHttpMessageConverter">
<property name="supportedMediaTypes" value = "text/plain;charset=UTF-8"/>
</bean>

The problem is that URLs are always transmitted as 7-bit ASCII. Because your form sends characters outside the standard ASCII set via GET, you have several issues going on:
URLs have practical length limits (roughly 2,000 characters in some browsers and servers), so your form values might be getting truncated.
If a user enters characters outside the ISO charset you set in the form's accept-charset attribute, they will not be encoded correctly into the URL. The browser first encodes the text using the form's (or page's) encoding and then percent-encodes the resulting bytes into ASCII, so any character not in that ISO set cannot be represented faithfully.
By default the browser encodes form data using the page encoding. That encoding comes from the HTTP Content-Type header if the server sends a charset there, otherwise from the meta tag; the header overrides the meta tag. UTF-8 is the recommended encoding for HTML5 pages, but you are overriding it with an ISO charset. Either way, the browser replaces each non-ASCII character with a "%" followed by the hexadecimal digits of its bytes in the page's encoding or, in your case, the form's encoding. That is what gets sent to the server, so look at your URL to see what has been sent.
When your URL reaches the server, it arrives as ASCII, so you need to percent-decode it into bytes first and then decode those bytes using the page encoding or, in your case, the form's accept-charset value to get the true values.
I recommend you remove the form encoding, rely on the page's UTF-8 setting for broader character support, and add the two meta tags below to make sure you are sending back UTF-8 encoded data, which covers all the characters needed and is easily decoded on the server as described by the other posters above.
<meta charset="utf-8" />
<meta content="text/html; charset=utf-8" http-equiv="content-type" />
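Whether a given form value survives a particular charset can be checked up front. The sketch below is only an illustration of the point about characters falling outside the ISO set; the class name FormCharsetCheck and the method representable are hypothetical, and the check uses the JDK's CharsetEncoder rather than anything a browser does internally.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class FormCharsetCheck {

    // True if every character of the value can be represented in the given charset.
    static boolean representable(String value, Charset charset) {
        return charset.newEncoder().canEncode(value);
    }

    public static void main(String[] args) {
        String accented = "\u00e4";   // a with diaeresis, part of ISO-8859-1
        String tick = "\u2714";       // heavy check mark, not part of ISO-8859-1

        System.out.println(representable(accented, StandardCharsets.ISO_8859_1)); // true
        System.out.println(representable(tick, StandardCharsets.ISO_8859_1));     // false
        System.out.println(representable(tick, StandardCharsets.UTF_8));          // true
    }
}
```

With UTF-8 as the form encoding the check passes for any text a user can enter, which is the practical argument for dropping the ISO accept-charset.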

Related

meta tag to correct Â®

I'm having some trouble getting a special character properly encoded.
Â® keeps coming through instead of the registered trademark symbol (®). I've tried changing the meta tag to UTF-8 and Windows-1252, but it still comes through in the encoded format. Can I add a meta tag to fix this?
Make sure to save your file with the proper encoding:
Here is an example; on the left side, the file is saved with Windows-1252 encoding.
On the right side, it's saved with UTF-8 encoding
HTML options
For such characters, encoding with ISO-8859-1 might do it too, but UTF-8 is greatly encouraged.
Make sure your DOCTYPE is clearly defined : <!DOCTYPE HTML>.
Make sure your meta tag is written properly: <meta charset="UTF-8">.
PHP options
If you use PHP within your page, add the following at the beginning of the page:
<?php header('Content-Type: text/html; charset=utf-8'); ?>
If the content comes from a database that stores it as ISO-8859-1, you might want to use utf8_encode() to convert it to UTF-8:
utf8_encode()
Encodes an ISO-8859-1 string to UTF-8
The information about encoding should correspond to the actual encoding. So instead of making guesses and trial and error, find out what the encoding really is. It seems to be UTF-8, and if declaring UTF-8 in a meta tag does not help, the probable culprit is an HTTP header that the server sends and that declares a different encoding, trumping the meta tag. Use e.g. an HTTP header viewer to check out the situation.
If the server announces iso-8859-1 or windows-1252 and if you cannot change this, then you just have to use that encoding instead of UTF-8. Then save the page in your authoring program as windows-1252 encoded.
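The stray "Â" in front of the symbol is what a declared encoding that disagrees with the real one looks like. As a rough sketch, assuming the file is really UTF-8 and is being read as Windows-1252, the effect can be reproduced as follows. The class name MojibakeDemo and the method decodeUtf8BytesAs are made up for this example, and it assumes the JVM provides the windows-1252 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class MojibakeDemo {

    // Encodes the text as UTF-8 and then decodes those bytes with a different charset.
    static String decodeUtf8BytesAs(String text, Charset declared) {
        return new String(text.getBytes(StandardCharsets.UTF_8), declared);
    }

    public static void main(String[] args) {
        String registered = "\u00ae"; // registered trademark sign

        // UTF-8 stores the sign as the two bytes C2 AE; Windows-1252 reads them as two characters.
        String shown = decodeUtf8BytesAs(registered, Charset.forName("windows-1252"));

        System.out.println(shown.length());                   // 2
        System.out.println(shown.equals("\u00c2\u00ae"));     // true
    }
}
```

Seeing exactly this two-character pattern is a strong hint that the bytes are UTF-8 and only the declaration is wrong.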

HTML - When using UTF-8 or ISO-8859-1 do I still need to type the codes for the special characters?

It is as the title says:
HTML - When using UTF-8 or ISO-8859-1 do I still need to type the codes for the special characters?
Or can I just type them normally?
Ex: I'm using UTF-8 in my HTML meta tag. When I need to type ç, should I just type it, or type its code, which is &ccedil;?
I know this is a trivial question, but it's fundamental so I just can't skip it.
No, you only need to use a character reference if:
The character you want cannot be represented in the character encoding you are using or
The character has some special meaning in HTML (such as < or &).
Note that declaring you are using UTF-8 in the meta tag is insufficient. You also have to encode the HTML source in UTF-8 (good editors will default to this) and not override it with a declaration of some other encoding in the real HTTP headers. You should also set the real HTTP headers to state that UTF-8 is being used.
Yes, you can include those characters directly in your HTML source, without using the entity for the character. Just make sure that the encoding you are saving the file in really does match what the web server serves it in.
The part about ensuring that the encoding is correct is important, and easy to get wrong. One thing to note is that the meta tag is not the primary source of information that the browser uses for interpreting the encoding of the document. The primary source of information is the Content-type header, sent as part of the HTTP headers. The meta tag was originally supposed to be used to communicate to the web server what Content-type to use, but most web servers use configuration separate from the document itself for this. So if you are saving your document as UTF-8, make sure that the web server is configured to serve pages as UTF-8 as well.
The meta tag is used by browsers as a fallback if the Content-type header is not provided or does not include valid encoding information. It is useful to have if you are ever going to be loading from a source that doesn't provide Content-type information, like using a file: URL to view the page on your local machine.
So, there are 3 places you should make sure your encoding is set up properly; in your text editor (so that it saves the file with the appropriate encoding), in your web server configuration (so that it communicates the appropriate encoding to the browser), and in the meta tag, so that when you view the page locally, it is displayed with the correct encoding.
Finally, you shouldn't use ISO-8859-1. That's a legacy encoding, only still supported for compatibility. Every major browser and text editor supports UTF-8 by now, which covers all of Unicode, and provides a lot fewer encoding headaches.
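The precedence described in this answer (HTTP header first, meta tag as a fallback) can be written down as a small function. This is a deliberately simplified sketch and not how any browser implements encoding detection, which involves further rules such as byte order marks. The class name EffectiveCharset and both method names are hypothetical, and it assumes Java 9 or later for Optional.or.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper class, for illustration only.
class EffectiveCharset {

    // Extracts the charset parameter from a value such as "text/html; charset=utf-8".
    static Optional<String> charsetOf(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim().replace("\"", "");
                if (!value.isEmpty()) {
                    return Optional.of(value.toLowerCase(Locale.ROOT));
                }
            }
        }
        return Optional.empty();
    }

    // The header wins if it names a charset; otherwise the meta tag; otherwise the fallback.
    static String resolve(String httpContentType, String metaCharset, String fallback) {
        return charsetOf(httpContentType)
                .or(() -> Optional.ofNullable(metaCharset).map(s -> s.toLowerCase(Locale.ROOT)))
                .orElse(fallback);
    }

    public static void main(String[] args) {
        System.out.println(resolve("text/html; charset=ISO-8859-1", "utf-8", "windows-1252")); // iso-8859-1
        System.out.println(resolve("text/html", "UTF-8", "windows-1252"));                     // utf-8
        System.out.println(resolve(null, null, "windows-1252"));                               // windows-1252
    }
}
```

The first call shows the failure mode from the answer: a correct meta tag is silently overridden by a server header that names a different charset.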

Characters not displaying correctly in different browsers

I used certain characters in website such as • — “ ” ‘ ’ º ©.
I found that when testing to see what my website looked like under different browsers (BrowserLab)
the afore-mentioned characters are replaced with �.
I then changed the charset in the webpage header from:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
to
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Suddenly all the pages have the above mentioned characters replaced with a ?.
Even more puzzling is this is not always consistent across and even within the same page, as some sections display the character • and © correctly.
In particular, I need to replace the character • with one that will display across browsers, can anyone help me with the answer? Thanks.
You should save your HTML source as UTF8.
Alternatively, you can use HTML entities instead.
The source code needs to be saved in the same encoding as you're instructing the browser to parse it in. If you're saving your files in UTF-8, instruct the browser to parse it as UTF-8 by setting an appropriate HTTP header or HTML meta tag (headers preferable, your web server may be setting one without you knowing). Use a decent editor that clearly tells you what encoding you're saving the file as. If it doesn't display correctly, there's a discrepancy between what you're telling your browser the file is encoded in and what it's really encoded in.
Check whether Apache is set up to send a charset. Look for the "AddDefaultCharset" directive and set it to Off in .htaccess or your config file.
Most/all browsers will take what is sent in the HTTP headers over what is in the document.
If you're using Notepad++, I suggest you use the EditPlus editor to copy the text (the part with the special characters) and paste it into your file. This should work.
Yes, I had this problem too in Notepad++; copying and pasting wasn't working with some symbols.
I think SLaks is right
The HTML entity for the copyright symbol is &#169;

Displaying unicode symbols in HTML

I simply want to display the tick (✔) and cross (✘) symbols in an HTML page, but they show up as either a box or goop like âœ”, obviously something to do with the encoding.
I have set the meta tag to show utf-8 but obviously I'm missing something.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Edit/Solution: From comments made, using FireBug I found the headers being passed by my page were in fact "Content-Type: text/html" and not UTF-8. Looking at the file format using Notepad++ showed my file was formatted as "UTF-8 without BOM". Changing this to just UTF-8 the symbols now show correctly... but firebug still seems to indicate the same content-type.
You should ensure the HTTP server headers are correct.
In particular, the header:
Content-Type: text/html; charset=utf-8
should be present.
The meta tag is ignored by browsers if the HTTP header is present.
Also ensure that your file is actually encoded as UTF-8 before serving it. Check or try the following:
Ensure your editor saves it as UTF-8.
Ensure your FTP or any other file transfer program does not mess with the file.
Try with HTML encoded entities, like &#uuu;.
To be really sure, hexdump the file and look at the character; for the ✔, it should be E2 9C 94.
Note: If you use a Unicode character for which your system can't find a glyph (no font with that character), your browser should display a question mark or some block-like symbol. But if you see multiple Roman characters like you do, this denotes an encoding problem.
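If no hexdump tool is at hand, the expected byte sequence can be computed for comparison. A minimal sketch, with a made-up class name Utf8Hex and method hex:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class Utf8Hex {

    // Returns the UTF-8 bytes of the text as space-separated upper-case hex pairs.
    static String hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("\u2714")); // E2 9C 94 (heavy check mark)
        System.out.println(hex("A"));      // 41
    }
}
```

If the bytes in the served file differ from this sequence at the position of the symbol, the file was not saved as UTF-8 or was altered in transfer.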
I know an answer has already been accepted, but wanted to point a few things out.
Setting the content-type and charset is obviously good practice, and doing it on the server is much better because it ensures consistency across your application.
However, I would use UTF-8 only when the language of my application uses a lot of characters that are not available in a legacy single-byte charset. If you want to show a Unicode character or symbol in a one-off case, you can do so without changing the charset of your page.
HTML renderers have always been able to display symbols that are not part of the page's character encoding, as long as you write the symbol as a numeric character reference (NCR). Sounds weird, but it's true.
So, even if your HTML has a header that states an encoding of ANSI or any of the ISO charsets, you can display a check mark by using its HTML character reference, in decimal (&#10003;) or in hex (&#x2713;).
So it's a little difficult to understand why you are facing this issue on your pages. Can you check whether the NCR value is correct? This is a good reference: http://www.fileformat.info/info/unicode/char/2713/index.htm
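Producing those numeric character references can be automated. The following sketch replaces only the characters that the page's charset cannot represent and leaves everything else untouched; the class name NcrEscaper and the method escape are hypothetical, and the code does not escape characters with special meaning in HTML such as < or &.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class NcrEscaper {

    // Replaces each code point the charset cannot encode with a decimal numeric character reference.
    static String escape(String text, Charset pageCharset) {
        CharsetEncoder encoder = pageCharset.newEncoder();
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.append(s);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "OK \u2713"; // "OK " followed by a check mark (U+2713)

        System.out.println(escape(text, StandardCharsets.ISO_8859_1)); // OK &#10003;
        System.out.println(escape(text, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```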
Make sure that you actually save the file as UTF-8, alternatively use HTML entities (&#nnn;) for the special characters.
Contrary to what Nicolas suggests, the meta tag isn't actually ignored by browsers. However, the Content-Type HTTP header always takes precedence over a meta tag in the document.
So make sure that you either send the correct encoding via the HTTP header, or don’t send this HTTP header at all (not recommended). The meta tag is mainly a fallback option for local documents which aren’t sent via HTTP traffic.
Using HTML entities should also be considered a workaround – that’s tiptoeing around the real problem. Configuring the web server properly prevents a lot of nuisance.
I think this is a file problem: you simply saved your file in a 1-byte encoding like Latin-1. Google your editor and how to set files to UTF-8.
I wonder why there are editors that don't default to utf-8.

Setting the character encoding in form submit for Internet Explorer

I have a page that contains a form. This page is served with content type text/html;charset=utf-8. I need to submit this form to server using ISO-8859-1 character encoding. Is this possible with Internet Explorer?
Setting accept-charset attribute to form element, like this, works for Firefox, Opera etc. but not for IE.
<form accept-charset="ISO-8859-1">
...
</form>
Edit: This form is created by server A and will be submitted to server B. I have no control over server B.
If I set server A to serve content with charset ISO-8859-1 everything works, but I am looking for a way to make this work without changes to server A's encoding. I have another question about setting the encoding in server A.
There is a simple hack to this:
Insert a hidden input field in the form with an entity that only occurs in the character set accepted by the server you're posting (or doing a GET) to.
Example: If the form is located on a server serving ISO-8859-1 and the form will post to a server expecting UTF-8 insert something like this in the form:
<input name="iehack" type="hidden" value="&#9760;" />
IE will then "detect" that the form contains a UTF-8 character and use UTF-8 when you POST or GET. Strange, but it does work.
With decent browsers:
<form accept-charset="ISO-8859-1" .... >
With IE (any):
document.charset = 'ISO-8859-1'; // do this before submitting your non-utf8 <form>!
It seems that this can't be done, not at least with current versions of IE (6 and 7).
IE supports form attribute accept-charset, but only if its value is 'utf-8'.
The solution is to modify server A to produce encoding 'ISO-8859-1' for page that contains the form.
I've got the same problem here. I have a UTF-8 page and need to post to an ISO-8859-1 server.
Looks like IE can't handle ISO-8859-1. But it can handle ISO-8859-15.
<form accept-charset="ISO-8859-15">
...
</form>
So this worked for me, since ISO-8859-1 and ISO-8859-15 are almost the same.
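How close "almost the same" is can be measured by decoding every byte value with both charsets and comparing. A small sketch, with a hypothetical class name Latin1VsLatin9, assuming the JVM provides the ISO-8859-15 charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
class Latin1VsLatin9 {

    // Returns the byte values that decode to different characters in ISO-8859-1 and ISO-8859-15.
    static List<Integer> differingBytes() {
        Charset latin9 = Charset.forName("ISO-8859-15");
        List<Integer> diffs = new ArrayList<>();
        for (int b = 0; b < 256; b++) {
            byte[] single = { (byte) b };
            String asLatin1 = new String(single, StandardCharsets.ISO_8859_1);
            String asLatin9 = new String(single, latin9);
            if (!asLatin1.equals(asLatin9)) {
                diffs.add(b);
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        for (int b : differingBytes()) {
            System.out.printf("0x%02X%n", b);
        }
    }
}
```

The two standards differ in eight positions, one of which is 0xA4: the currency sign in ISO-8859-1 and the euro sign in ISO-8859-15. Text containing any of those characters will be corrupted by this workaround.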
If you have any access to the server at all, convert its processing to UTF-8. The art of submitting non-UTF-8 forms is a long and sorry story; this document about forms and i18n may be of interest. I understand you do not seem to care about international support; you can always convert the UTF-8 data to html entities to make sure it stays Latin-1.
Just got the same problem, and I have a relatively simple solution that does not require any change in the page character encoding (which is a pain in the ass).
For example, your site is in utf-8 and you want to post a form to a site in iso-8859-1. Just change the action of the post to a page on your site that will convert the posted values from utf-8 to iso-8859-1.
this could be done easily in php with something like this:
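<?php
// Re-encode every posted value from UTF-8 to ISO-8859-1 and build a query string.
$params = array();
foreach ($_POST as $key => $value) {
    $params[] = rawurlencode($key) . "=" . rawurlencode(utf8_decode($value));
}
$params = implode("&", $params);
// Then redirect to the final page, which expects ISO-8859-1.
?>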
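The same re-encoding step can be sketched in Java for readers on a servlet stack. This is an illustration of the idea rather than a port of the PHP above: the class name QueryReencoder and the method toQuery are made up, the parameters are passed in as a plain map instead of being read from a request, and it assumes Java 10 or later for the Charset overload of URLEncoder.encode.

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
class QueryReencoder {

    // Builds a query string whose percent-encoded bytes use the target charset.
    static String toQuery(Map<String, String> params, Charset target) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), target)
                        + "=" + URLEncoder.encode(e.getValue(), target))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("name", "caf\u00e9"); // "cafe" with an acute accent on the e

        System.out.println(toQuery(params, StandardCharsets.ISO_8859_1)); // name=caf%E9
        System.out.println(toQuery(params, StandardCharsets.UTF_8));      // name=caf%C3%A9
    }
}
```

Two differences from the PHP version are worth knowing: URLEncoder writes a space as "+" where rawurlencode writes "%20", and characters that do not exist in the target charset are replaced by a question mark.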
For Russian characters, the target encoding is 'windows-1251'. Submit the form as UTF-8:
<form action="yourProcessPage.php" method="POST" accept-charset="utf-8">
<input name="string" value="string" />
...
</form>
Then simply convert the string to CP1251:
$string = $_POST['string'];
$string = mb_convert_encoding($string, "CP1251", "UTF-8");
It looks like Microsoft knows about accept-charset, but their documentation doesn't say from which version it starts to work...
You don't say either which browser versions you tested it in.
I seem to remember that Internet Explorer gets confused if the accept-charset encoding doesn't match the encoding specified in the content-type header. In your example, you claim the document is sent as UTF-8, but want form submits in ISO-8859-1. Try matching those and see if that solves your problem.
I am pretty sure it won't be possible with older versions of IE. Before the accept-charset attribute was devised, there was no way for form elements to specify which character encoding they accepted, and the best that browsers could do is assume the encoding of the page the form is in will do.
It is a bit sad that you need to know which encoding was used -- nowadays we would expect our web frameworks to take care of such details invisibly and expose the text data to the application as Unicode strings, already decoded...
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">