How to detect character set encoding? (HTML)

For example, Chinese text (GB2312) is pasted into a text box (or text area) of an HTML page and the form is posted. On the server side, is there any means by which this character set can be detected?
How would this detection behave if texts belonging to different character sets are pasted into the same text box?

You need to tell the browser what encoding to use by adding an accept-charset="UTF-8" (or similar) attribute to the form. Apparently this defaults to the character set of the page, but I wouldn't count on that. The browser won't tell you what encoding it used when it submits the form, so you need to assume it used the one you told it to.
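As a sketch of that advice (handler.php and the field name "comment" are invented for this example), tell the browser which encoding to use, then verify it on the server rather than trying to detect it:

<form method="post" action="handler.php" accept-charset="UTF-8">
  <input type="text" name="comment">
  <input type="submit" value="Send">
</form>

<?php
// handler.php: assume the encoding we told the browser to use, but
// verify the bytes really are valid UTF-8 before trusting them.
$comment = $_POST['comment'] ?? '';
if (!mb_check_encoding($comment, 'UTF-8')) {
    http_response_code(400);
    exit('Form data was not valid UTF-8.');
}
?>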

The web browser should send a content type, including the encoding, when it posts the data.
I find it helpful to think of text as "just text" (without any particular encoding) until an encoding is required. So the browser shouldn't care what encoding (if any) was used to originally produce the text (e.g. if it was copied and pasted from a file, the file's encoding is irrelevant). It decides what encoding to use when posting it to the server, obviously making sure that it's an encoding which covers all the characters it needs to send.

If you use PHP on the server, you can use mb_detect_encoding().
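A rough sketch of how that might look (the field name "text" and the candidate list are assumptions for this example; detection is a heuristic, not a guarantee):

<?php
// Candidates are tried in order, so list the most restrictive first;
// the third argument enables strict byte checking.
$raw = $_POST['text'] ?? '';
$encoding = mb_detect_encoding($raw, array('UTF-8', 'GB2312', 'ISO-8859-1'), true);
if ($encoding === false) {
    exit('Could not detect the character encoding.');
}
// Normalize everything to UTF-8 for further processing.
$text = mb_convert_encoding($raw, 'UTF-8', $encoding);
?>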

Related

Spaces in filenames used in URLs and image sources: does the browser handle them, or should I encode manually?

I'm only asking because I tried sending an email (with PEAR) which contained an image whose name was something like "header image_3021", and noticed that the image didn't show up in the email.
When I checked the SRC in the received email, the space had been replaced with +, and that somehow made the link point to the wrong file. Now, IIRC, + is a correct encoding for spaces in URLs, yet the browser could not locate "header+image_2031".
I checked the original content of the email both with Gmail's show original and in the server logs and the space was still there, so the replacement was done either by the browser or by Gmail's rendering process.
I have since modified my upload algorithms to disallow spaces in filenames, but I have to ask: what's the best way to make sure the browser will display images with spaces in their file names? Replace them with %20 myself? Let the browser do it? Just disallow them?
Definitely encode them. By doing so, you remove any ambiguity about how different clients will interpret the string. As @mr alien said, there are out-of-the-box PHP functions that will handle that for you.
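For example, in PHP (the /uploads/ path is made up for this sketch):

<?php
$filename = 'header image_3021.png';
// rawurlencode() is the right tool for URL *path* segments: it encodes
// a space as %20. urlencode() targets query strings and produces '+',
// which is exactly the substitution that broke the email image above.
$src = '/uploads/' . rawurlencode($filename);
echo '<img src="' . htmlspecialchars($src) . '" alt="">';
// Output: <img src="/uploads/header%20image_3021.png" alt="">
?>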

HTML - When using UTF-8 or ISO-8859-1 do I still need to type the codes for the special characters?

It is as the title says:
HTML - When using UTF-8 or ISO-8859-1 do I still need to type the codes for the special characters?
Or can I just type them normally?
Ex: I'm using UTF-8 in my HTML meta tag. I need to type ç; should I just type it, or type its code, which is &ccedil;?
I know this is a trivial question, but it's fundamental so I just can't skip it.
No, you only need to use a character reference if:
The character you want cannot be represented in the character encoding you are using or
The character has some special meaning in HTML (such as < or &).
Note that declaring you are using UTF-8 in the meta tag is insufficient. You also have to encode the HTML source in UTF-8 (good editors will default to this) and not override it with a declaration of some other encoding in the real HTTP headers. You should also set the real HTTP headers to state that UTF-8 is being used.
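In PHP, for instance, the real header can be sent like this (a minimal sketch; header() must run before any other output):

<?php
header('Content-Type: text/html; charset=utf-8');
?>
<!DOCTYPE html>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<p>ç typed directly - no entity needed (the file itself is saved as UTF-8).</p>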
Yes, you can include those characters directly in your HTML source, without using the entity for the character. Just make sure that the encoding you are saving the file in really does match what the web server serves it in.
The part about ensuring that the encoding is correct is important, and easy to get wrong. One thing to note is that the meta tag is not the primary source of information that the browser uses for interpreting the encoding of the document. The primary source of information is the Content-type header, sent as part of the HTTP headers. The meta tag was originally supposed to be used to communicate to the web server what Content-type to use, but most web servers use configuration separate from the document itself for this. So if you are saving your document as UTF-8, make sure that the web server is configured to serve pages as UTF-8 as well.
The meta tag is used by browsers as a fallback if the Content-type header is not provided or does not include valid encoding information. It is useful to have if you are ever going to be loading from a source that doesn't provide Content-type information, like using a file: URL to view the page on your local machine.
So, there are 3 places you should make sure your encoding is set up properly; in your text editor (so that it saves the file with the appropriate encoding), in your web server configuration (so that it communicates the appropriate encoding to the browser), and in the meta tag, so that when you view the page locally, it is displayed with the correct encoding.
Finally, you shouldn't use ISO-8859-1. That's a legacy encoding, only still supported for compatibility. Every major browser and text editor supports UTF-8 by now, which covers all of Unicode, and provides a lot fewer encoding headaches.

accept-charset="UTF-8" parameter doesnt do anything, when used in form

I am using the accept-charset="utf-8" attribute on a form, and found that when I do a form post with non-ASCII characters, the headers have a different accept-charset option in the request header. Is there anything I am missing? My form looks like this:
<form method="post" action="controller" accept-charset="UTF-8">
..input text box
.. submit button
</form>
Thanks in advance
The question, as asked, is self-contradictory: the heading says that the accept-charset parameter does not do anything, whereas the question body says that when the accept-charset attribute (this is the correct term) is used, “the headers have different accept charset option in the request header”. I suppose a negation is missing from the latter statement.
Browsers send Accept-Charset parameters in HTTP request headers according to their own principles and settings. For example, my Chrome sends Accept-Charset:windows-1252,utf-8;q=0.7,*;q=0.3. Such a header is typically ignored by server-side software, but it could be used (and it was designed to be used) to determine which encoding is to be used in the server response, in case the server-side software (a form handler, in this case) is capable of using different encodings in the response.
The accept-charset attribute in a form element is not expected to affect HTTP request headers, and it does not. It is meant to specify the character encoding to be used for the form data in the request, and this is what it actually does. The HTML 4.01 spec is obscure about this, but the W3C HTML5 draft puts it much better, though for some odd reason it uses the plural: “gives the character encodings that are to be used for the submission”. I suppose the reason is that you could specify alternate encodings, to prepare for situations where a browser is unable to use your preferred encoding. And what actually happens in Chrome, for example, is that if you use accept-charset="foobar utf-8", then UTF-8 is used.
In practice, the attribute is used to make the encoding of data submission different from the encoding of the page containing the form. Suppose your page is ISO-8859-1 encoded and someone types Greek or Hebrew letters into your form. Browsers will have to do some error recovery, since those characters cannot be represented in ISO-8859-1. (In practice they turn the characters into numeric character references, which is logically all wrong but pragmatically perhaps the best they can do.) Using <form accept-charset=utf-8> helps here: no matter what the page encoding is, the form data will be sent UTF-8 encoded, which can handle any character.
If you wish to tell the form handler which encoding it should use in its response, then you can add a hidden (or non-hidden) field into the form for that.
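A sketch of that idea (the field name "response-charset" is invented for this example):

<input type="hidden" name="response-charset" value="UTF-8">

<?php
// Form handler: respond in the encoding the form asked for, whitelisted
// so the client cannot inject arbitrary header content.
$charset = $_POST['response-charset'] ?? 'UTF-8';
if (!in_array($charset, array('UTF-8', 'ISO-8859-1'), true)) {
    $charset = 'UTF-8';
}
header('Content-Type: text/html; charset=' . $charset);
?>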

Displaying unicode symbols in HTML

I want to simply display the tick (✔) and cross (✘) symbols in an HTML page, but they show up as either a box or goop (✔) - obviously something to do with the encoding.
I have set the meta tag to show utf-8 but obviously I'm missing something.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Edit/Solution: From comments made, using Firebug I found that the headers being passed by my page were in fact "Content-Type: text/html" and not UTF-8. Looking at the file format using Notepad++ showed my file was formatted as "UTF-8 without BOM". Changing this to just "UTF-8", the symbols now show correctly... but Firebug still seems to indicate the same Content-Type.
You should ensure the HTTP server headers are correct.
In particular, the header:
Content-Type: text/html; charset=utf-8
should be present.
The meta tag is ignored by browsers if the HTTP header is present.
Also ensure that your file is actually encoded as UTF-8 before serving it, check/try the following:
Ensure your editor saves it as UTF-8.
Ensure your FTP or any file transfer program does not mess with the file.
Try with HTML encoded entities, like &#uuu;.
To be really sure, hex-dump the file and look at the character; for ✔ it should be E2 9C 94.
Note: if you use a Unicode character for which your system can't find a glyph (no font with that character), your browser should display a question mark or some block-like symbol. But if you see multiple Roman characters like you do, this denotes an encoding problem.
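If you have PHP at hand, a quick way to check the bytes without a hex editor (page.html is a placeholder path, and this script must itself be saved as UTF-8):

<?php
// U+2714 HEAVY CHECK MARK is E2 9C 94 in UTF-8.
var_dump(bin2hex('✔')); // string(6) "e29c94"
// Dump the first bytes of the served file to spot a BOM (EF BB BF)
// or mis-encoded characters.
echo bin2hex(substr(file_get_contents('page.html'), 0, 16));
?>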
I know an answer has already been accepted, but wanted to point a few things out.
Setting the content-type and charset is obviously a good practice, doing it on the server is much better, because it ensures consistency across your application.
However, I would use UTF-8 only when the language of my application uses a lot of characters that are available only in the UTF-8 charset. If you want to show a Unicode character or symbol in just a few places, you can do so without changing the charset of your page.
HTML renderers have always been able to display symbols which are not part of the page's encoding, as long as you write the symbol as its numeric character reference (NCR). Sounds weird, but it's true.
So, even if your HTML has a header that states it has an encoding of ANSI or any of the ISO charsets, you can display a check mark by using its HTML character reference: in decimal, &#10003;, or in hex, &#x2713;.
So it's a little difficult to understand why you are facing this issue on your pages. Can you check that the NCR value is correct? This is a good reference: http://www.fileformat.info/info/unicode/char/2713/index.htm
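To illustrate, a page declared with a legacy charset can still render the symbol through its NCR:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<p>Done: &#10003; (decimal) / &#x2713; (hex) - both render a check mark.</p>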
Make sure that you actually save the file as UTF-8, alternatively use HTML entities (&#nnn;) for the special characters.
Contrary to what Nicolas proposed, the meta tag isn't actually ignored by browsers. However, the Content-Type HTTP header always has precedence over the presence of a meta tag in the document.
So make sure that you either send the correct encoding via the HTTP header, or don’t send this HTTP header at all (not recommended). The meta tag is mainly a fallback option for local documents which aren’t sent via HTTP traffic.
Using HTML entities should also be considered a workaround – that’s tiptoeing around the real problem. Configuring the web server properly prevents a lot of nuisance.
I think this is a file problem: you simply saved your file in a one-byte encoding like Latin-1. Look up how to make your editor save files as UTF-8.
I wonder why there are editors that don't default to UTF-8.

File upload mojibake

How do you do a file upload in an HTML form without running into mojibake?
I have a form that has three fields:
a file field
a required text field
a text field which accepts Japanese characters
I've set up my HTML form with the attribute enctype='multipart/form-data'. But when the form submission fails due to the missing required field, I get redirected to the same page, and my 2nd text field (the one that accepts the Japanese characters) is already mojibaked.
However, if I remove the enctype or change it to anything else, then when the form submission fails, I see the Japanese characters as they are (no mojibake). The problem is, if the submission succeeds, I am unable to read the uploaded files.
Any ideas how to fix this?
Mojibake (mangled display of Japanese characters) can have two causes:
The data on the page is in the right character encoding, but the browser does not recognize it.
Some characters on the page use the wrong encoding (the server wrote them in an incorrect encoding).
If the other characters on the page (outside of your form) show correctly, you produced broken output on your server.
If everything is clobbered, and you can fix it by manually setting a different encoding from the browser's menu, then the page encoding is not properly specified.
What kind of content-type headers and HTML meta tags do you use?
I've figured it out, by reverse-engineering AppFuse (appfuse.org), whose file upload form does not seem to be affected by mojibake.
It solves the problem by setting the character encoding to UTF-8 on the server side (with Spring's org.springframework.web.filter.CharacterEncodingFilter). Thus, I guess multipart/form-data really does screw up the character encoding (or at least it does for Java).
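For reference, wiring that filter up in web.xml looks roughly like this (these are the standard Spring parameter names; forceEncoding applies the encoding to the response as well):

<filter>
  <filter-name>encodingFilter</filter-name>
  <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
  <init-param>
    <param-name>encoding</param-name>
    <param-value>UTF-8</param-value>
  </init-param>
  <init-param>
    <param-name>forceEncoding</param-name>
    <param-value>true</param-value>
  </init-param>
</filter>
<filter-mapping>
  <filter-name>encodingFilter</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>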