Garbled text when I open a Lua file containing Chinese with MacVim - macvim

How do I set MacVim's display encoding?
Here is the mess I see when I open a Lua file that was created on Windows XP.
gControlMode = 0; -- 1£º¿ªÆôÖØÁ¦¸ÐÓ¦£¬ 0:¿ª´¥ÆÁģʽ
gState = GS_GAME;
sTotalTime = 0; --µ±Ç°¹Ø¿¨»¨µÄ×Üʱ¼ä

The text you posted seems like a Latin-1 (or ISO-8859-1, CP819) decoding of the CP936 encoding (or the EUC-CN or GB18030 encodings[1]) of this text[2]:
gControlMode = 0; -- 1:开启重力感应, 0:开触屏模式
gState = GS_GAME;
sTotalTime = 0; --当前关卡花的总时间
When opening a file, Vim tries the list of encodings specified in the fileencodings option. Usually, latin1 is the last value in this list; reading as Latin-1 will always be successful since it is an 8-bit encoding that maps all 256 values. Thus, Vim is opening your CP936 encoded file as Latin-1.
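As an illustration of that fallback, here is a minimal Python sketch that reproduces the garbling; it assumes the file was written as CP936/GBK and then decoded as Latin-1, and uses the expected text from above as the sample:

# Reproduce the mojibake: CP936/GBK bytes interpreted as Latin-1 (Python 3).
text = "1：开启重力感应， 0:开触屏模式"
cp936_bytes = text.encode("gbk")           # what the Windows editor wrote to disk
misread = cp936_bytes.decode("latin-1")    # what Vim shows after falling back to latin1
print(misread)                             # prints garbled text like the lines in the question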
You have several choices for getting Vim to use another encoding:
You can specify an encoding with the ++enc= option to Vim’s :edit command (this will cause Vim to ignore the fileencodings list for the buffer):
:e ++enc=cp936 /path/to/file
You can apply this to an already-loaded file by leaving off the path:
:e ++enc=cp936
You can add your preferred encoding to fileencodings just before latin1 (e.g. in your ~/.vimrc):
let &fileencodings = substitute(&fileencodings, 'latin1', 'cp936,\0', '')
You can set the encoding option to your desired encoding. This is usually discouraged because it has wide-ranging impacts (see :help encoding).
It might make sense, if possible, to switch your files to UTF-8 since many editors will properly auto-detect UTF-8. Once you have the file loaded properly (see above), Vim can do the conversion like this (set fileencoding, then :write):
:set fenc=utf-8 | w
Vim should pretty much automatically handle reading and writing UTF-8 files (encoding defaults to UTF-8, and utf-8 is in the default fileencodings), but if you are using other editors (i.e. whatever Windows editor edited/created the CP936 file(s)), you may need to configure them to use UTF-8 instead of (e.g.) CP936.
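If you have many such files, the same conversion can be done outside Vim; a minimal Python sketch, assuming a CP936-encoded input file (the file names here are just placeholders):

# Convert a CP936-encoded Lua file to UTF-8 (file names are hypothetical).
with open("level.lua", "r", encoding="cp936") as src:
    content = src.read()
with open("level-utf8.lua", "w", encoding="utf-8") as dst:
    dst.write(content)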
[1] I am not familiar with the encodings used for Chinese text; these encodings seem to be identical for the "expected" text.
[2] I do not read Chinese, but the presence and locations of the FULLWIDTH COLON and FULLWIDTH COMMA (and Google's translation of this text) make me think this is the text you expected.

Related

Character encoding issues with migrating from MSSQL to MySQL

We have an application called JIRA running on Windows using MSSQL and I need to migrate it to Linux/MySQL. The character encoding in the existing MSSQL db is latin1 but I need to use UTF-8 in MySQL.
I take an XML dump of the MSSQL data using a backup mechanism provided by the application and run it through a Python filter to convert the encoding from Latin-1 to UTF-8. Here is the Python code that was provided to me by my colleague:
#!/usr/bin/python
import codecs, re

# Match characters outside the Basic Multilingual Plane (wide Python build),
# falling back to a surrogate-pair pattern on narrow builds.
try:
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

#fin = codecs.open('unicodestuff.txt', encoding='utf-8', errors='replace')
fin = codecs.open('entities.xml', encoding='latin1')
fout = codecs.open('stripped.xml', encoding='utf-8', mode='w', errors='replace')

# Strip the high code points from each line and write the result as UTF-8.
for line in fin:
    line = highpoints.sub(u'', line)
    fout.write(line)

fin.close()
fout.close()
I take the filtered XML dump and restore the data using a "restore" mechanism in the application. However, after restoring the data, I spot-checked a few records on the MySQL side and I see some weird characters, which I assume are related to character encoding. For example:
On the MSSQL side, the text string is
““Number of debits exceeds maximum of 0”
“2-Restrict All Credits”
Default ของประเภทบัญชีถูกต้อง แต่เลขบัญชีไม่ถูกต้อง
Branch : 724 มาบุญครอง
whereas on the MySQL side, the corresponding text appears as
â??â??Number of debits exceeds maximum of 0â?
â??2-Restrict All Creditsâ?
Default à¸à¸­à¸à¸à¸£à¸°à¹à¸ à¸à¸à¸±à¸à¸à¸µà¸à¸¹à¸à¸à¹à¸­à¸ à¹à¸à¹à¹à¸¥à¸à¸à¸±à¸à¸à¸µà¹à¸¡à¹à¸à¸¹à¸à¸à¹à¸­à¸
Branch : 724 มาà¸à¸¸à¸à¸à¸£à¸­à¸
Can you please provide me some ideas to fix these character encoding issues? Kindly let me know if additional information is required.
Thanks
Sam
Clearly your XML file does not actually use the Latin-1 character set. You've shown that text such as "ของประเภทบัญชีถูกต้อง แต่เลขบัญชีไม่ถูกต้อง" is present in it. The Latin-1 character set does what it says on the label: it represents letters from Latin alphabets. Those letters do not exist in it. If the headers in your XML file claim that it's in Latin-1, then those headers are untrue and the XML is, strictly speaking, not valid. But it might still be usable.
Now the problem is, what character encoding is that XML file actually using? To find out, you may have to examine the XML file in hexadecimal. There are three main possibilities: (1) it's using an old codepage such as 874 which contains these characters; (2) it's using UTF-16; (3) it's using UTF-8.
If you examine in hexadecimal a section of the XML which contains some of this non-latin text, and some of the latin letters nearby, here's what you might see. If it's in a codepage such as 874, each latin letter will be one byte with a value from 32 to 7F, and each nonlatin letter will be one (or possibly two?) bytes with values of 80 to FF. If it's in UTF-16, each latin letter will be two bytes, one from 32 to 7F and the other being always 00, and the nonlatin letters will be two bytes with neither being 00. If it's in UTF-8, the latin letters will be one byte from 32 to 7F, and the nonlatin letters will be (probably) three bytes, all being from 80 to FF.
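If you would rather not read hex dumps, roughly the same test can be applied by attempting each candidate decoding programmatically; a minimal Python sketch, using the entities.xml name from the question and cp874 as a stand-in for the Thai codepage:

# Report which of the three candidate encodings decode the dump without errors.
with open("entities.xml", "rb") as f:
    raw = f.read()
for enc in ("utf-8", "utf-16", "cp874"):
    try:
        raw.decode(enc)
        print(enc, "decodes without errors")
    except UnicodeDecodeError:
        print(enc, "is not a valid decoding for this file")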
There may be an alternative to examining hexadecimal. Some text editor programs can save text files in your choice of encoding formats. TextPad 7, for instance, can save as ANSI, DOS, UTF-8, Unicode, or Unicode (big-endian). The latter two options are actually UTF-16. Try loading the XML into such a program, and saving copies of it as UTF-8 and as Unicode. One of these copies should be the same size as the original (plus or minus two or three bytes), and the other will be a different size. Whichever matches the size is probably the correct format. If both differ, then you've got something weird.
Anyway, if you save a version as UTF-8 and then are able to open it and see your data intact, you should then be able to import that without using a Python translator.

File encoding (UTF-8 not working properly)

In my web page, there is a form with multiple inputs. However, the input characters behave differently from the input "label" characters. I tried setting the file encoding to UTF-8 and UTF-8 + BOM (I'm using EditPlus).
Using UTF-8:
Using UTF-8 + BOM:
The input characters come from a MySQL database where the collation is utf8_unicode_ci (using phpMyAdmin), so I don't know if that's the source of the problem. Any ideas?
This means both pieces of data are not in the same encoding. If the file is interpreted as Latin-1 (or a similar encoding), you get the first result in which the data in the input field is valid (meaning it's Latin-1 encoded) but the label is wrong (meaning it's not Latin-1 encoded). When the file is interpreted as UTF-8, the label is correct (meaning it's UTF-8 encoded) but the data in the input field is wrong (meaning it's not UTF-8 encoded). If data shows up as the � UNICODE REPLACEMENT CHARACTER, it's a sure sign the document is being interpreted as a Unicode encoding (e.g. UTF-8), but the byte sequence is invalid.
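A minimal Python sketch of the two failure modes described above, using é as the example character:

# The same character, stored two ways, viewed through the "wrong" decoder.
label_bytes = "é".encode("utf-8")      # b'\xc3\xa9' -- hard-coded label, UTF-8 encoded
field_bytes = "é".encode("latin-1")    # b'\xe9'     -- database value, Latin-1 encoded
print(label_bytes.decode("latin-1"))             # "Ã©": UTF-8 bytes misread as Latin-1
print(field_bytes.decode("utf-8", "replace"))    # "�": a Latin-1 byte that is invalid as UTF-8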
I'll guess that the label is hardcoded in the file but the data in the input field comes from a database. In this case you need to set the connection encoding for the database to return UTF-8.
As to why the file is interpreted as Latin-1 without the BOM and as UTF-8 with the BOM: the browser recognizes the BOM as signifying UTF-8; without it, it defaults to Latin-1. You need to set the correct HTTP header to tell the browser what encoding the file is in, and get rid of the BOM.
Read these resources:
UTF-8 all the way through
Handling Unicode Front To Back In A Web App
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
Solved it: I just changed the file encoding to "Western European (Windows) 1252" (using EditPlus) and now every character is shown correctly.

How to get the real file contents using TFileStream?

I try to get the file contents using TFileStream:
procedure ShowFileCont(myfile: string);
var
  tr: string;
  fs: TFileStream;
begin
  fs := TFileStream.Create(myfile, fmOpenRead or fmShareDenyNone);
  SetLength(tr, fs.Size);
  fs.Read(tr[1], fs.Size);
  ShowMessage(tr);
  fs.Free;
end;
I made a little text file containing only:
aaaaaaaJ“њРЉTщЂ®8ЈЏVд"Ј¦AИaaaaaaa
I saved this file (using AkelPad) with the 1251 (ANSI) codepage,
and saved it again with the 65001 (UTF-8) codepage.
These two files have different sizes, but their contents are equal - I opened them both in Notepad and they both show the same content.
But when I run the ShowFileCont procedure, it shows me different results:
aaaaaaaJ?ЊT?8?V?"?A?aaaaaaa
aaaaaaaJ“њРЉTщЂ®8ЈЏVд"Ј¦AИaaaaaaa
Questions:
How do I get the real file contents using TFileStream?
How can these two files have different sizes when their content (in Notepad) is equal?
Added: Sorry, I didn't say that I use Lazarus FPC and string = UTF8String.
Why do the files have different sizes?
Because they use different encodings. The 1251 encoding maps each character to a single byte. But UTF-8 uses variable numbers of bytes for each character.
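A minimal Python sketch of that size difference, using a made-up sample of one ASCII letter and three Cyrillic letters:

# CP1251 stores every character in one byte; UTF-8 needs two bytes per Cyrillic letter.
s = u"AБВГ"
print(len(s.encode("cp1251")))   # 4 bytes: one byte per character
print(len(s.encode("utf-8")))    # 7 bytes: 1 for "A", 2 each for the Cyrillic letters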
How do I get the true file contents?
You need to use a string type that matches the encoding used in the file. So, for example, if the content is UTF-8 encoded, which is the best choice, then you load the content into a UTF-8 string. You are using FPC in a mode where string is UTF-8 encoded, in which case the code in the question is what you need.
Loading an MBCS-encoded file with a code page of 1251, say, is trickier. You can load it into an AnsiString variable, and as long as your system's locale is 1251, any conversions will be performed correctly.
But the code will behave differently when run on a machine with a different locale. And if you wanted to load text using different MBCS encodings, for example 1252, then you cannot use this approach. You would need to load into a byte array and then convert from 1252, say, to UTF-8 so that you could then store that UTF-8 in a string variable.
In order to do that you can use the LConvEncoding unit from LCL. For example, you can use CP1251ToUTF8, CP1252ToUTF8 etc. to convert from MBCS to UTF-8.
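For comparison, the conceptual equivalent of that byte-array-then-convert approach in Python (CP1251ToUTF8 itself is a Pascal routine in LConvEncoding; the file name here is hypothetical):

# Read raw bytes, decode them as CP1251, and re-encode as UTF-8.
with open("legacy.txt", "rb") as f:
    raw = f.read()
utf8_bytes = raw.decode("cp1251").encode("utf-8")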
How can I determine from the file what encoding is used?
You cannot. You can make a guess that will be accurate in many cases. But in general, it is simply impossible to identify the encoding of an array of bytes that is meant to represent text.
It is sometimes possible to take a file and rule out certain encodings. For example, not all byte streams are valid UTF-8 or UTF-16 text. And so you can rule out such files. But for encodings like 1251, 1252 etc. then any byte stream is valid. There's simply no way for you to tell 1251 encoded streams apart from 1252 encoded streams with 100% accuracy.
The LConvEncoding unit has GuessEncoding which sounds like it may be of some use.
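Outside of Lazarus, the same kind of guess can be made with the third-party chardet package for Python; as explained above this is only a heuristic, and the file name here is hypothetical:

# Guess the encoding of an unknown text file (requires the chardet package).
import chardet
with open("mystery.txt", "rb") as f:
    guess = chardet.detect(f.read())
print(guess)   # e.g. {'encoding': 'windows-1251', 'confidence': 0.84, ...}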
Their contents are obviously not equal. You can see for yourself that the file sizes are different. Things of different size are never equal.
Your files might appear equal in Notepad because Notepad knows how to recognize certain character encodings. You saved your file two different ways. One way used an encoding that assigns one byte to each of 256 possible values. The other used an encoding that assigns between one and four bytes to each of well over 100,000 possible values. Some of the characters you saved require more than one byte, which explains why one version of the file is bigger than the other.
TFileStream doesn't pay attention to any of that. It just deals with bytes. Depending on your Delphi version, your string variable may or may not pay attention to encodings. Prior to Delphi 2009, string stored one byte per character. As of Delphi 2009, string uses two bytes per character, so your SetLength call is wrong, and everything after that is pointless to investigate much further.
With one byte per character, your ShowMessage call is not going to interpret the string as UTF-8-encoded. Instead, it will interpret your string using whatever your system code page is. If you know that the string you've read is encoded with UTF-8, then you'll want to convert it to UTF-16 prior to display by calling UTF8Decode. That will return a WideString, and you can use any number of functions to display it, such as MessageBoxW. If you have Delphi 2009 or later, then the compiler will insert conversion code for you automatically, if you've used Utf8String instead of string.

How to Encode and Decode "Acute accented characters" using Perl

I am working on a web-based educational website, where we are using Perl, MySQL 5, Apache, and Template Toolkit. We are planning to introduce support for multiple languages on our website.
What we have done so far:
If we have a tab name like <h1>Courses Main Page</h1> in our template file, we have converted that to
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<h1>[% glossary.$language.courses_main_page %]</h1>
where $language is getting the value which user selects when he logs in.
We have a table to maintain this data in our MySQL DB:
CREATE TABLE translation (
  english varchar(255) NOT NULL,
  language varchar(255) NOT NULL,
  translation varchar(2000) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Translation of Element text to a foreign language';
In the MySQL connect function, I am providing 'SET character_set_results=NULL'.
I tried utf8, but the issue, which had been limited to some tabs, then spread to many sections.
As soon as the user logs into the system, we fetch all the translations, store them in a Perl hash, and cache it. We pass this hash to the template file, which substitutes the values.
Problem: Acute accented characters like á and é are getting replaced with symbols from a different character set.
For example, on the front end we are seeing "Cursos PÃ¡gina Principal" for Cursos Página Principal.
It is very similar to the solution given in htmlentities and é (e acute)
Can anyone tell me how to achieve the same in Perl?
Denoting the charset
For example, on the front end we are seeing "Cursos PÃ¡gina Principal" for Cursos Página Principal.
This mojibake happens when the characters are transferred as UTF-8 but interpreted as ISO-8859-1 or similar. So I suggest the easiest way to fix this is to make sure that your HTML page is shipped to the client with a proper MIME type, i.e.
Content-Type: text/html; charset=utf-8
If that information is present in the HTTP header, the value there will override any setting in the HTML document itself. So make sure that either you set the HTTP header, or that your HTTP header specifies no charset at all, so that the browser will look at the meta setting.
In some browsers (Firefox for example) you can manually change the character set using View / Character Encoding. You can use that to check whether a wrong character encoding while rendering really is the cause of the problem.
Actually encoding and decoding
There are some situations where fixing the charset won't help. It might be that you simply don't control that part of your framework. Or that something translates your characters from ISO-8859-1 to UTF-8 twice, so that the unreadable symbols are in fact represented as UTF-8 already. In these cases, you can use the Encode module to encode the characters in Perl directly, using HTML character references as output:
use Encode qw(decode encode FB_HTMLCREF);
# maybe: $unicodeString = decode("utf-8", $byteString);
$htmlString = encode("ascii", $unicodeString, FB_HTMLCREF);
Whether or not the decode step is necessary depends on how you talk to your database. If your database connection is capable of supporting Unicode, then you'll already have Unicode strings, and you can simply encode these to HTML. For DBD::mysql there is a parameter mysql_enable_utf8 => 1 which achieves this. Using it is preferable to decoding things in your own code. This answer has details on the syntax.
One example of what these functions do:
$byteString = "Cursos P\xc3\xa1gina Principal."; # á as two UTF-8 bytes
$unicodeString = "Cursos P\N{U+00E1}gina Principal."; # á as one Unicode character
$htmlString = "Cursos P&#225;gina Principal."; # á as an HTML character reference

what type of encoding is this "Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators"?

I am trying to export the output of an MSSQL query, which must use UTF (UTF-16, I suppose) encoding according to the description; I am using the -W and -u options with sqlcmd.
Ä is converted by default to a z (with two dots, or something like an inverted ^, on top), and the file is listed as the ANSI character set.
When I try to use Notepad++ to convert this file to UTF-8, it shows me some strange highlighted characters (x8E) for Ä, and others such as x86 and x94 for other characters, no matter what encoding I use as the default in Notepad++.
When I transferred the file to an Ubuntu 12.04 machine, the file command says that it is:
user#user:~/Desktop/encoding/checkencoding$ file convertit4.csv
convertit4.csv: Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators
user#user:~/Desktop/encoding/checkencoding$ chardet convertit4.csv
convertit4.csv: ISO-8859-2 (confidence: 0.77)
I am confused about what kind of encoding it uses.
The purpose is to convert it to UTF-8 encoding without any errors, so that I can upload it to the Magmi importer.
Note: I am using this command to remove the underline after the headers: type c:\outfiles\convertit1.temp | findstr /r /v "^\-[;\-]*$" > c:\outfiles\convertit4.csv. I hope this line is not the problem.
I hope this information is complete enough to solve this issue. If any more information is needed, please let me know.
Regards.
Try the -f option, as described in
http://www.yaldex.com/sql_server_tutorial_3/ch06lev1sec1.html