HTML character sets & MySQL character sets

Which HTML character set would cover all these? Which character set do I need in MySQL to export and then import them?
SAINT RAPHAEL ARNÁIZ BARÓN (Spanish)
St Thérèse of the Child Jesus, Virgin, Doctor (French)
M. Orsola (Giulia) Ledóchowska, Religious (Eastern European)

In MySQL, use the UTF-8 character set. This will allow you to represent a very wide variety of data appropriately in your DBMS. If you choose your MySQL collation settings correctly, MySQL will collate (sort) this data nicely as well.
To render this stuff into HTML, you probably need to entitize characters other than the basic 7-bit ASCII ones. For example, look at this web page describing the Unicode character for uppercase Ñ: http://www.fileformat.info/info/unicode/char/00D1/index.htm
In HTML this is represented by the character reference &#xD1;
Your web app language (PHP? Java?) has functions built in to convert between UTF-8 strings (to stash in the DBMS) and entitized HTML (for display on the web). Use them.
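For example, a minimal PHP sketch of those conversions (assuming the script and its string literals are saved as UTF-8):
<?php
// A UTF-8 string, as it would be stored in a UTF-8 MySQL column.
$name = "SAINT RAPHAEL ARNÁIZ BARÓN";

// Escape only the HTML-significant characters and leave the UTF-8 text alone:
echo htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

// Or convert every character that has a named entity into its entity form:
echo htmlentities($name, ENT_QUOTES, 'UTF-8');
// prints: SAINT RAPHAEL ARN&Aacute;IZ BAR&Oacute;N

// And back again, e.g. for input that arrived already entitized:
echo html_entity_decode('ARN&Aacute;IZ BAR&Oacute;N', ENT_QUOTES, 'UTF-8');
// prints: ARNÁIZ BARÓN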

Use MySQL's UTF-8 character set for your tables and columns, and send a SET NAMES UTF8 statement after initialising the MySQL connection in your scripting language of choice. Ensure your script also sends an HTTP header indicating that your page is in UTF-8, and you should be good to go. You may want to read this, and the links for further reading look good too.
In PHP, to send this HTTP header, you would use
header("Content-Type: text/html; charset=UTF-8");. At the top of your <head> element in your HTML page, you can also add <meta charset="UTF-8"> (in HTML5), or <meta http-equiv="Content-type" content="text/html;charset=UTF-8"> (in HTML 4.01 or HTML5; but you can't use both ways and still get valid HTML5).

Related

How to make latin extended work?

I've been googling for a while but can't figure out how to make letters like č, ć, ž, š, đ work. I tried adding <body lang="sr"> because the text actually is Serbian (sr = Serbian), but it doesn't work. I get PoÄetna instead of Početna.
I tried adding <meta charset="ISO-8859-2"> into the head section but still nothing. What am I missing?
Pick a character encoding that supports the characters you want to use. ISO-8859-2 should do the job, but this isn't the 1990s any more. UTF-8 should be the default choice.
Ensure your editor is configured to save in that encoding.
Specify that you are using that encoding with document-level metadata: <meta charset="utf-8">
Specify that you are using that encoding in your HTTP response (this takes priority over the document-level declaration): Content-Type: text/html;charset=UTF-8. A minimal example combining these last two steps is sketched below.
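Here is that sketch in PHP (the file itself must also be saved as UTF-8, per the second step above):
<?php
// HTTP header - takes priority over the meta tag below.
header('Content-Type: text/html; charset=UTF-8');
?>
<!DOCTYPE html>
<html lang="sr">
<head>
  <meta charset="utf-8">
  <title>Početna</title>
</head>
<body>č ć ž š đ</body>
</html>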

Safe HTML form accept charset?

I faced a parameter encoding issue when submitting a form with the GET method (I can't use the POST method). Some accented characters were not escaped in the URL, since my page was UTF-8. The Spring controller retrieved bad characters instead.
I solved this issue by setting accept-charset="ISO-8859-1" on my form, but now I am wondering which charset is safe for all server/browser combinations. Is there a recommended one for my forms and GET URLs?
This is frustrating (to put it mildly) with servlets. The standard URL encoding must use UTF-8, yet servlets not only default to ISO-8859-1 but also offer no way to change that in code.
Sure, you can call request.setCharacterEncoding("UTF-8") before you read anything, but for some ungodly reason this only affects the request body, not the query string parameters. There is nothing in the servlet request interface to specify the encoding used for query string parameters.
Using ISO-8859-1 in your form is a hack. Using this ancient encoding will cause more problems than it solves, especially since browsers do not really support ISO-8859-1 and always treat it as Windows-1252, whereas servlets treat ISO-8859-1 as actual ISO-8859-1, so you will be screwed beyond belief if you go with this.
To change this in Tomcat, for example, you can use the URIEncoding attribute on your <Connector> element in server.xml:
<Connector ... URIEncoding="UTF-8" ... />
If your container doesn't offer such a setting, or you can't change its configuration, you can still make this work, because decoding with ISO-8859-1 retains the full information from the original bytes:
String correct = new String(request.getParameter("test").getBytes("ISO-8859-1"), "UTF-8");
So let's say test=ä: if everything is correctly set, the browser encodes it as test=%C3%A4. Your servlet will incorrectly decode it as ISO-8859-1 and give you the resulting string "Ã¤". If you apply the correction, you get ä back:
System.out.println(new String("Ã¤".getBytes("ISO-8859-1"), "UTF-8").equals("ä"));
//true
nickdos is right.
Another way of doing this is to use the meta tag:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
Also keep in mind that when handling the response on the server, the code should also use the correct (same) encoding.
Example:
use stringParameter.getBytes("UTF-8") instead of stringParameter.getBytes()
And when using Spring, make sure the correct encoding is configured for the message converters in the DispatcherServlet's configuration file (XYZ-servlet.xml), e.g.:
<bean id="stringHttpMessageConverter" class="org.springframework.http.converter.StringHttpMessageConverter">
<property name="supportedMediaTypes" value = "text/plain;charset=UTF-8"/>
</bean>
The problem is that URLs always get encoded as 7-bit ASCII. Because your form sends back character values outside the standard ASCII set via GET, you have several issues going on:
URLs are limited to roughly 2048 characters in some browsers, so your form values might be getting truncated.
If a user enters characters outside the ISO charset you set in the form's accept-charset attribute, they will not be encoded correctly into the URL. That is because the browser translates everything into 7-bit ASCII when encoding URLs, after first applying the page's (or the form's) encoding. Any special character not in that ISO set will be encoded incorrectly.
The browser always translates the characters in your URL using the page encoding or meta tags first. But if there is a server HTTP header, that encoding overrides your meta tag encoding. The default encoding for HTML5 pages is UTF-8, but you are overriding that with an ISO encoding. Either way, all encoding done by the browser replaces non-ASCII characters with a "%" followed by hexadecimal digits for the bytes in the page's encoding, or in your case the form's accept-charset encoding. That is then sent up to the server, so look at your URL to see what has actually been sent.
When your URL reaches the server, it arrives as 7-bit ASCII, so you need to read the string as ASCII first and then decode it back using the page encoding, or in your case the form's accept-charset, to get the true values.
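To see that round trip concretely, here is a PHP sketch (the mechanics are the same in any language; the script is assumed to be saved as UTF-8):
<?php
// The browser percent-encodes the UTF-8 bytes of "ä" (0xC3 0xA4) into plain ASCII:
echo rawurlencode('ä');       // prints: %C3%A4

// The server receives only ASCII and decodes it back to the original bytes,
// which, read as UTF-8, are "ä" again:
echo rawurldecode('%C3%A4');  // prints: ä (the two bytes 0xC3 0xA4)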
I recommend you remove the form encoding, use the page's UTF-8 setting for broader character support, and drop in the two meta tags below to make sure you are sending back UTF-8 encoded data, which covers all the characters needed and is easily decoded on the server as described by the other posters above.
<meta charset="utf-8" />
<meta content="text/html; charset=utf-8" http-equiv="content-type" />

Characters not displaying correctly in different browsers

I used certain characters in my website, such as • — “ ” ‘ ’ º ©.
I found that when testing to see what my website looked like in different browsers (BrowserLab), the aforementioned characters were replaced with �.
I then changed the charset in the webpage header from:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
to
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Suddenly all the pages have the above-mentioned characters replaced with a ?.
Even more puzzling, this is not always consistent across pages or even within the same page, as some sections display the • and © characters correctly.
In particular, I need to replace the character • with one that will display across browsers; can anyone help me with the answer? Thanks.
You should save your HTML source as UTF-8.
Alternatively, you can use HTML entities instead.
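For instance, a short PHP sketch (assuming the input is UTF-8) showing htmlentities() producing entity references for those characters:
<?php
// Each of these characters has a named HTML entity.
echo htmlentities('• — © º', ENT_QUOTES, 'UTF-8');
// prints: &bull; &mdash; &copy; &ordm;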
The source code needs to be saved in the same encoding as you're instructing the browser to parse it in. If you're saving your files in UTF-8, instruct the browser to parse them as UTF-8 by setting an appropriate HTTP header or HTML meta tag (headers are preferable; your web server may be setting one without you knowing). Use a decent editor that clearly tells you what encoding you're saving the file as. If it doesn't display correctly, there's a discrepancy between what you're telling your browser the file is encoded in and what it's really encoded in.
Check to see if Apache is setup to send the charset. Look for the directive "AddDefaultCharset" and set it to Off in .htaccess or your config file.
Most/all browsers will take what is sent in the HTTP headers over what is in the document.
If you're using Notepad++, I suggest you use the EditPlus editor to copy the text (which has the special characters) and paste it into your file. This should work.
Yes, I had this problem too; in Notepad++, copying and pasting wasn't working with some symbols.
I think SLaks is right.
The HTML entity for the copyright symbol is &#169;.

UTF-8 html without BOM displays strange characters

I have some HTML which contains some foreign characters (€, ó, á). The HTML document is saved as UTF-8 without BOM. When I view the page in the browser the foreign characters seem to get replaced with stranger character combinations (€, ó, Ã). It's only when I save my HTML document as UTF-8 with BOM that the characters then display properly.
I'd really rather not have to include a BOM in my files, but has anybody got any idea why it might do this, and a way to fix it (other than including a BOM)?
You are probably not specifying the correct character set in your HTML file. The BOM (thanks @Jukka) sends the browser into UTF-8 mode; in its absence, you need to use other means to declare the document UTF-8.
If you have access to your server configuration, you may want to make sure the server isn't sending the wrong character set info. See e.g. How to change the default encoding to UTF-8 for Apache?
If you have access only to your HTML, adding this meta tag in your document's head should do the trick:
<meta http-equiv='Content-Type' content='text/html; charset=utf-8'>
or, as @Mathias points out, the new HTML5 form
<meta charset="utf-8">
(valid only if you use an HTML5 doctype, against which there is no good argument any more, even if you don't use HTML5 markup).
Insert <meta charset="utf-8"> in <head>.
Or set the header Content-Type: text/html;charset=utf-8 on the server-side.
You can also add AddDefaultCharset UTF-8 in .htaccess; more info here: http://www.askapache.com/htaccess/setting-charset-in-htaccess.html

Displaying unicode symbols in HTML

I want to simply display the tick (✔) and cross (✘) symbols in an HTML page, but they show up as either a box or goop ✔ - obviously something to do with the encoding.
I have set the meta tag to show utf-8 but obviously I'm missing something.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Edit/Solution: Following the comments, I used Firebug and found the headers being sent by my page were in fact "Content-Type: text/html" with no UTF-8 charset. Looking at the file format using Notepad++ showed my file was saved as "UTF-8 without BOM". Changing this to plain UTF-8, the symbols now show correctly... but Firebug still seems to indicate the same Content-Type.
You should ensure the HTTP server headers are correct.
In particular, the header:
Content-Type: text/html; charset=utf-8
should be present.
The meta tag is ignored by browsers if the HTTP header is present.
Also ensure that your file is actually encoded as UTF-8 before serving it; check/try the following:
Ensure your editor saves it as UTF-8.
Ensure your FTP or any file transfer program does not mess with the file.
Try with HTML-encoded entities, like &#uuu;.
To be really sure, hexdump the file and look at the character; for the ✔, it should be E2 9C 94 (a scripted version of this check is sketched after the note below).
Note: If you use a Unicode character for which your system can't find a glyph (no font with that character), your browser should display a question mark or some block-like symbol. But if you see multiple Roman characters like you do, this denotes an encoding problem.
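If you prefer to script that hexdump check, a small PHP sketch like this works (the file name is just a placeholder):
<?php
// E2 9C 94 is the UTF-8 byte sequence for ✔ (U+2714).
$bytes = file_get_contents('page.html'); // placeholder file name
if (strpos($bytes, "\xE2\x9C\x94") !== false) {
    echo "Found the UTF-8 bytes E2 9C 94 for the check mark.\n";
} else {
    echo "Byte sequence not found - the file is probably not saved as UTF-8.\n";
}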
I know an answer has already been accepted, but wanted to point a few things out.
Setting the content type and charset is obviously good practice, and doing it on the server is much better, because it ensures consistency across your application.
However, I would use UTF-8 only when the language of my application uses a lot of characters that are available only in the UTF-8 charset. If you want to show a Unicode character or symbol in a few one-off cases, you can do so without changing the charset of your page.
HTML renderers have always been able to display symbols which are not part of the encoding character set of the page, as long as you write the symbol as its numeric character reference (NCR). Sounds weird, but it's true.
So, even if your HTML has a header that states it has an encoding of ANSI or any of the ISO charsets, you can display a check mark by using its HTML character reference, in decimal - &#10003; or in hex - &#x2713;
So it's a little difficult to understand why you are facing this issue on your pages. Can you check whether the NCR value is correct? This is a good reference: http://www.fileformat.info/info/unicode/char/2713/index.htm
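A quick way to verify that claim is a PHP sketch like this: the page is deliberately declared as ISO-8859-1, yet the NCRs still render as check marks, because NCRs always refer to Unicode code points:
<?php
// Deliberately declare a non-UTF-8 charset for the page.
header('Content-Type: text/html; charset=ISO-8859-1');
// These numeric character references still render as U+2713 (a check mark).
echo '<p>decimal: &#10003; hex: &#x2713;</p>';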
Make sure that you actually save the file as UTF-8; alternatively, use HTML entities (&#nnn;) for the special characters.
Contrary to what Nicolas proposes, the meta tag isn't actually ignored by browsers. However, the Content-Type HTTP header always takes precedence over a meta tag in the document.
So make sure that you either send the correct encoding via the HTTP header, or don’t send this HTTP header at all (not recommended). The meta tag is mainly a fallback option for local documents which aren’t sent via HTTP traffic.
Using HTML entities should also be considered a workaround – that’s tiptoeing around the real problem. Configuring the web server properly prevents a lot of nuisance.
I think this is a file problem: you simply saved your file in a 1-byte encoding like Latin-1. Google your editor and how to set files to UTF-8.
I wonder why there are editors that don't default to UTF-8.