File encoding (UTF-8 not working properly) - MySQL

In my webpage, there is a form with multiple inputs. However, the characters in the input values display differently from the characters in the input labels. I tried setting the file encoding to UTF-8 and to UTF-8 + BOM (I'm using EditPlus).
Using UTF-8:
Using UTF-8 + BOM:
The input values come from a MySQL database whose collation is utf8_unicode_ci (set via phpMyAdmin), so I don't know if that's the source of the problem. Any ideas?

This means both pieces of data are not in the same encoding. If the file is interpreted as Latin-1 (or a similar encoding), you get the first result in which the data in the input field is valid (meaning it's Latin-1 encoded) but the label is wrong (meaning it's not Latin-1 encoded). When the file is interpreted as UTF-8, the label is correct (meaning it's UTF-8 encoded) but the data in the input field is wrong (meaning it's not UTF-8 encoded). If data shows up as the � UNICODE REPLACEMENT CHARACTER, it's a sure sign the document is being interpreted as a Unicode encoding (e.g. UTF-8), but the byte sequence is invalid.
I'll guess that the label is hardcoded in the file but the data in the input field comes from a database. In this case you need to set the connection encoding for the database to return UTF-8.
As to why the file is interpreted as Latin-1 without a BOM and as UTF-8 with one: the browser recognizes the BOM as signifying UTF-8; without it, it defaults to Latin-1. You need to set the correct HTTP Content-Type header to tell the browser what encoding the file is in, and get rid of the BOM.
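Both failure modes described above can be reproduced in a few lines of Python (a minimal sketch; the string is just an illustration):

```python
# UTF-8 bytes for "é" misread as Latin-1 produce mojibake.
text = "héllo"
utf8_bytes = text.encode("utf-8")         # b'h\xc3\xa9llo'
as_latin1 = utf8_bytes.decode("latin-1")  # each byte becomes one character
print(as_latin1)                          # hÃ©llo

# Latin-1 bytes misread as UTF-8 are usually invalid byte sequences;
# Python raises an error, while a browser would show U+FFFD (�) instead.
latin1_bytes = text.encode("latin-1")     # b'h\xe9llo'
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("invalid UTF-8 byte sequence")
```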
Read these resources:
UTF-8 all the way through
Handling Unicode Front To Back In A Web App
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text

Solved it: I just changed the file encoding to "Western European (Windows) 1252" (using EditPlus) and now every character is shown correctly.

Related

HTML special character

I need some help. I entered a special character such as "ä" in a form and it was saved in the database, but when I retrieve it using JavaScript (a suggestion/autocomplete feature), it does not display correctly; it displays like this: "Ã¤".
Is it possible for "ä" from the form to be saved as "&auml;" in the database? How would that work?
This is caused by misinterpreting a UTF-8 string as ISO-8859-1. The character "ä" is represented in UTF-8 by the two bytes 0xC3 0xA4, but in ISO-8859-1, the byte 0xC3 represents the character "Ã", and the byte 0xA4 represents the character "¤".
Most likely, your program is sending the page as UTF-8 but not telling the browser what encoding was used, so the browser is assuming ISO-8859-1 (the default). You need to send a Content-Type HTTP header that specifies the encoding, such as text/html; charset=UTF-8.
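The byte arithmetic behind this can be checked directly (a small Python sketch):

```python
# "ä" is U+00E4; in UTF-8 it is the two bytes 0xC3 0xA4.
utf8 = "ä".encode("utf-8")
assert utf8 == b"\xc3\xa4"

# Decoding those same bytes as ISO-8859-1 turns each byte into one
# character: 0xC3 -> "Ã", 0xA4 -> "¤".
print(utf8.decode("iso-8859-1"))  # Ã¤
```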
If you haven't already, you should define the character set used by your page.
Add this inside your <head> tags:
<meta charset="utf-8">
There are other character sets as well, but UTF-8, as Google observed back in 2008, has become the most widely used on the web today.
This is due to your character set, but there are two character sets that you need to keep in mind. The first is the character set in the database, and the second is the character set in your HTML. They need to match.
Even if you call something like $mysqli->set_charset("utf8"); in your connection script, the database (including the table and column character sets) must also be utf8 for the data to display properly.

Garbled text when I open a Lua file containing Chinese in MacVim

How do I set the display encoding in MacVim?
Here is the garbled text I get when I open a Lua file that was created on Windows XP:
gControlMode = 0; -- 1£º¿ªÆôÖØÁ¦¸ÐÓ¦£¬ 0:¿ª´¥ÆÁģʽ
gState = GS_GAME;
sTotalTime = 0; --µ±Ç°¹Ø¿¨»¨µÄ×Üʱ¼ä
The text you posted seems like a Latin-1 (or ISO-8859-1, CP819) decoding of the CP936 encoding (or EUC‑CN, or GB18030 encodings1) of this text2:
gControlMode = 0; -- 1:开启重力感应, 0:开触屏模式
gState = GS_GAME;
sTotalTime = 0; --当前关卡花的总时间
When opening a file, Vim tries the list of encodings specified in the fileencodings option. Usually, latin1 is the last value in this list; reading as Latin-1 will always be successful since it is an 8-bit encoding that maps all 256 values. Thus, Vim is opening your CP936 encoded file as Latin-1.
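The byte-for-byte misreading can be reproduced in Python (a sketch using the gbk codec, which covers CP936): the two characters 开启 from the original comment come back as the "¿ªÆô" seen above.

```python
# CP936/GBK bytes for "开启" decoded as Latin-1, one character per byte:
# 0xBF 0xAA 0xC6 0xF4 -> "¿ªÆô"
gbk_bytes = "开启".encode("gbk")
print(gbk_bytes.decode("latin-1"))  # ¿ªÆô
```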
You have several choices for getting Vim to use another encoding:
You can specify an encoding with the ++enc= option to Vim’s :edit command (this will cause Vim to ignore the fileencodings list for the buffer):
:e ++enc=cp936 /path/to/file
You can apply this to an already-loaded file by leaving off the path:
:e ++enc=cp936
You can add your preferred encoding to fileencodings just before latin1 (e.g. in your ~/.vimrc):
let &fileencodings = substitute(&fileencodings, 'latin1', 'cp936,\0', '')
You can set the encoding option to your desired encoding. This is usually discouraged because it has wide-ranging impacts (see :help encoding).
It might make sense, if possible, to switch your files to UTF-8 since many editors will properly auto-detect UTF-8. Once you have the file loaded properly (see above), Vim can do the conversion like this (set fileencoding, then :write):
:set fenc=utf-8 | w
Vim should pretty much automatically handle reading and writing UTF-8 files (encoding defaults to UTF-8, and utf-8 is in the default fileencodings), but if you are using other editors (i.e. whatever Windows editor edited/created the CP936 file(s)), you may need to configure them to use UTF-8 instead of (e.g.) CP936.
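Outside of Vim, the same conversion can be scripted. A minimal Python sketch (the file path and contents here are hypothetical; a real script would take the path as an argument):

```python
import os
import tempfile

# Create a sample CP936/GBK-encoded file, standing in for the
# Windows-created Lua file.
path = os.path.join(tempfile.mkdtemp(), "game.lua")
with open(path, "wb") as f:
    f.write("sTotalTime = 0; --当前关卡花的总时间\n".encode("gbk"))

# Read it with the correct source encoding...
with open(path, encoding="gbk") as f:
    text = f.read()

# ...and write it back out as UTF-8.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, encoding="utf-8") as f:
    print(f.read(), end="")
```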
1 I am not familiar with the encodings used for Chinese text; these encodings appear to produce identical bytes for the “expected” text.
2 I do not read Chinese, but the presence and locations of the FULLWIDTH COLON and FULLWIDTH COMMA (and Google's translation of this text) make me think this is the text you expected.

UTF-8 is an Encoding or a Document Character Set?

According to the W3C Recommendation, every application requires a document character set (not to be confused with a character encoding).
A document character set consists of:
A Repertoire: A set of abstract characters, such as the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.
Code positions: A set of integer references to characters in the repertoire.
Each document is a sequence of characters from the repertoire.
Character Encoding is:
How those characters may be represented
When I save a file in Windows Notepad, I'm guessing that these are the "document character sets":
ANSI
UNICODE
UNICODE BIG ENDIAN
UTF-8
Three simple questions:
I want to know whether those are the "document character sets". And if they are,
why is UTF-8 on the list? Isn't UTF-8 supposed to be an encoding?
If I'm not wrong about all this:
Are there other document character sets that Windows does not allow you to choose?
How do you define another document character set?
In my understanding:
ANSI is both a character set and an encoding of that character set.
Unicode is a character set; the encoding in question is probably little-endian UTF-16. An alternative encoding of the same character set is big-endian UTF-16, which is probably what the third option refers to.
UTF-8 is an encoding of Unicode.
The purpose of that dropdown in the Save dialog is really to select both a character set and an encoding for it, but they've been a little careless with the naming of the options.
(Technically, though, an encoding just maps integers to byte sequences, so any encoding could be used with any character set that is small enough to "fit" the encoding. However, the UTF-* encodings are designed with Unicode in mind.)
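The distinction is easy to see in code: one character set (Unicode code points), several encodings of it. A small Python sketch, using the Chinese character for "water" mentioned earlier:

```python
# The same code position, U+6C34, maps to different byte sequences
# under different Unicode encodings.
ch = "水"
print(hex(ord(ch)))            # 0x6c34 -- the code position
print(ch.encode("utf-8"))      # b'\xe6\xb0\xb4'
print(ch.encode("utf-16-le"))  # 0x34 0x6C (little-endian)
print(ch.encode("utf-16-be"))  # 0x6C 0x34 (big-endian)
```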
Also, see Joel on Software's mandatory article on the subject.
UTF-8 is a character encoding that is also used to specify a character set for HTML and other textual documents. It is one of several Unicode encodings (UTF-16 is another).
To answer your questions:
It is on the list because Microsoft decided to implement it in Notepad.
There are many other character sets, though defining your own is neither useful nor, in practice, really possible.
You can't choose other character sets when saving with Notepad. Try a programmer's editor such as Notepad++, which gives you more character sets to work with.

How to distinguish UTF-8 and ASCII files?

How to distinguish UTF-8 (no BOM) and ASCII files?
If the file contains any bytes with the top bit set, then it is not ASCII.
So if the only possibilities are ASCII or UTF-8, then it's UTF-8.
If the file contains only bytes with the top bit clear, then it's meaningless to distinguish whether it's ASCII or UTF-8, since it represents exactly the same series of characters either way. But you can call it ASCII.
Of course this doesn't distinguish UTF-8 from ISO Latin or CP1252, and neither does it confirm that the so-called UTF-8 is actually valid.
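A minimal classifier along these lines (a sketch, with the caveats above: it cannot tell valid UTF-8 apart from Latin-1/CP1252 text that happens to decode cleanly):

```python
def classify(data: bytes) -> str:
    """Label a byte string as ASCII, UTF-8, or neither."""
    if all(b < 0x80 for b in data):   # every top bit clear
        return "ASCII"
    try:
        data.decode("utf-8")          # top bits set: is it valid UTF-8?
        return "UTF-8"
    except UnicodeDecodeError:
        return "neither"

print(classify(b"plain text"))             # ASCII
print(classify("naïve".encode("utf-8")))   # UTF-8
print(classify("naïve".encode("cp1252")))  # neither
```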
http://msdn.microsoft.com/en-us/library/dd318672%28v=vs.85%29.aspx
IsTextUnicode function: "Determines if a buffer is likely to contain a form of Unicode text."

What encoding does Stack Overflow use in MySQL?

I cannot save the character 𝑴 in my MySQL database, whose encoding is utf8, but I found that Stack Overflow can save and display it.
I made a mistake: Stack Overflow also cannot save 𝑴.
If you can't store the character, you are encoding or decoding it incorrectly, or converting it to a character set that doesn't support the character.
The UTF-8 encoding can handle any Unicode character, so it's quite unlikely to be a limitation of the encoding itself. (Note, however, that MySQL's utf8 character set only stores sequences of up to three bytes; characters outside the Basic Multilingual Plane, such as 𝑴, require the utf8mb4 character set.)
You have to use the Unicode character set or some Unicode encoding (UTF-7, UTF-8, UTF-16, UTF-32) for every step of the process. If you convert the text to some other character set and then back, you can only support the characters of that specific character set.
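You can verify where this character sits (a quick Python check): 𝑴 is a supplementary-plane character that takes four bytes in UTF-8, which is exactly what MySQL's three-byte utf8 charset cannot store.

```python
# "𝑴" is U+1D474, outside the Basic Multilingual Plane.
ch = "\U0001D474"
print(hex(ord(ch)))             # 0x1d474
print(len(ch.encode("utf-8")))  # 4 -- one byte more than MySQL utf8 allows
print(ch.encode("utf-8"))       # b'\xf0\x9d\x91\xb4'
```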
Stack Overflow is trying to display the character as &#119924;. So that character value may well be saved in the database (certainly some character value is); the reason we can't see the character may be the font used to render the HTML: perhaps it's the font, not the database, that doesn't support that character.