Pinyin tone-marked symbols and MySQL

I'm creating a MySQL database storing Chinese characters with their associated pīnyīn pronunciations. I've set everything up to work in the UTF-8 charset, so I'm having no trouble with most of the symbols I'm using. Except, strangely, certain Latin characters with tone marks, and only when I write them into the database from $_POST, using PHP.
Those are: all characters with an acute accent (á, é, í, ó, ú), except ǘ (?!); and all characters with a grave accent (à, è, ì, ò, ù), again except ǜ. When they are typed into a form and that form is submitted to the db, those characters are simply cut off, as if they never existed. E.g., cháng is submitted as chng. Any other characters (with a caron, like ǎ, or a macron, like ā) are written in fine, and so are actual Chinese characters.
Again, I'm using UTF-8 everywhere possible, and so far this problem has only occurred when submitting data from a form. Earlier, I ran a script that manually inserted an array containing those characters into the database, and everything went fine.
Any ideas?

I think you could post the pinyin in a numbered format,
e.g. cháng as cha2ng,
and deal with the POSTed information in the PHP script with some mapping method.
Here's a method to deal with it:
Convert numbered to accentuated Pinyin?
Hopefully it helps you.
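For illustration, a minimal sketch of that mapping idea, written in Perl (the same table-driven lookup ports directly to PHP). It assumes, as in the cha2ng example above, that the tone number directly follows the vowel it marks:

use strict;
use warnings;
use utf8;
binmode STDOUT, ':utf8';

# tone number (1-4) -> accented vowel; 'v' stands in for ü
my %tone = (
    a => [qw(ā á ǎ à)],
    e => [qw(ē é ě è)],
    i => [qw(ī í ǐ ì)],
    o => [qw(ō ó ǒ ò)],
    u => [qw(ū ú ǔ ù)],
    v => [qw(ǖ ǘ ǚ ǜ)],
);

sub numbered_to_accented {
    my ($syllable) = @_;
    $syllable =~ s/([aeiouv])([1-4])/$tone{$1}[$2 - 1]/ge;
    return $syllable;
}

print numbered_to_accented('cha2ng'), "\n";   # cháng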

I got a solution!
Before:
SELECT 'liàng' = 'liǎng';
Change to:
SELECT CONVERT('liàng' USING BINARY) = CONVERT('liǎng' USING BINARY) AS equal;
This works because accent-insensitive collations such as utf8_general_ci treat the tone-marked vowels as equal, so the first comparison returns true; converting both sides to binary forces a byte-for-byte comparison.

Related

MySQL Workbench Unicode hell under Windows(R)

I'm tired, guys... Since 6.2.5, Workbench has not supported Unicode in its configuration.
Let's compare:
6.2.5:
The workspace restores just fine. Cyrillic in paths and names is OK.
8.0.2:
The same workspace isn't restorable: any Cyrillic in paths/names turns into unreadable hieroglyphs.
Almost two years ago I filed a bug report. But no reaction, no corrections, no attention...
So please suggest how to enable normal Unicode support, or suggest another, functionally similar tool. Fonts aren't the problem: as you can see in the second screenshot, the two-byte UTF-8 characters were converted to two-character representations (note the repeating "D"-like symbol).

Reading CSV file with Chinese Character [One character cannot be shown]

When I open a CSV file containing Chinese characters using Microsoft Excel, TextWrangler, or Sublime Text, there are some Chinese words that cannot be displayed properly. I have no idea why this is the case.
Specifically, the csv file can be found in the following link: https://www.hkex.com.hk/eng/plw/csv/List_of_Current_SEHK_EP.CSV
One of the words that cannot be displayed correctly is shown here:
As you can see, a ? appears where the character should be.
Using the Mac file command, as suggested by
http://osxdaily.com/2015/08/11/determine-file-type-encoding-command-line-mac-os-x/, tells me that the CSV is encoded as UTF-16LE.
I am wondering what the problem is: why can't I read that specific text?
Is it related to encoding, or to my laptop's settings? Both macOS and Windows 10 on the Mac (via Parallels Desktop) fail to display the word correctly.
Thanks for the help. I really want to know why this specific text cannot be displayed properly.
The actual name of HSBC Broking Securities is:
滙豐金融證券(香港)有限公司
The first character, U+6ED9 滙, is one of the troublesome HKSCS characters: characters that weren't available in standard pre-Unicode Big-5, which were grafted on in incompatible ways later.
For a while there was an unfortunate convention of converting these characters into Private Use Area characters when converting to Unicode. This data was presumably converted back then and is now mangled, replacing 滙 with U+E05E, a Private Use Area character.
For PUA cases that you're sure are the result of the HKSCS compatibility bodge, you can convert back to proper Unicode using this table.
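For illustration, a minimal Perl sketch of that clean-up, assuming the mapping pairs have been pulled out of the table (only the single mapping mentioned above is shown; the real table contains many more entries):

use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# One entry from the HKSCS compatibility table: U+E05E (PUA) -> U+6ED9 滙
my %hkscs_fix = ( "\x{E05E}" => "\x{6ED9}" );

# Read the UTF-16LE CSV and repair any PUA characters we have mappings for
open my $in, '<:encoding(UTF-16LE)', 'List_of_Current_SEHK_EP.CSV' or die $!;
while (my $line = <$in>) {
    $line =~ s{([\x{E000}-\x{F8FF}])}{ $hkscs_fix{$1} // $1 }ge;
    print $line;
}
close $in;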

special characters with Net::Twitter::Lite

I'm trying to send characters like ü, ä, ß, à and so on to Twitter. If I use Unicode characters in my scripts they come out wrong on Twitter. If I use HTML entities (which is possible in Twitter's web interface and which used to work previously) I now see the literal text &uuml; rather than "ü" in the post. Is there a parameter or something that I have to set? Some call to encode/decode? I am using:
use Net::Twitter::Lite::WithAPIv1_1;
I find myself checking out the test suites of Perl modules quite often, as they are a good source of examples.
Net::Twitter expects decoded characters, not encoded bytes. So, sending encoded UTF-8 to Net::Twitter will result in double-encoded data.
Source: https://metacpan.org/source/MMIMS/Net-Twitter-Lite-0.12006/t/unicode.t
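In practice that means handing update() a character string, not UTF-8 bytes. A minimal sketch (credential placeholders are mine, not from the question):

use strict;
use warnings;
use utf8;   # string literals in this file are decoded character strings
use Net::Twitter::Lite::WithAPIv1_1;

my $nt = Net::Twitter::Lite::WithAPIv1_1->new(
    consumer_key        => 'YOUR_KEY',
    consumer_secret     => 'YOUR_SECRET',
    access_token        => 'YOUR_TOKEN',
    access_token_secret => 'YOUR_TOKEN_SECRET',
);

# Thanks to "use utf8" this literal is already decoded characters,
# which is exactly what Net::Twitter expects.
$nt->update('Grüße aus Köln: ü ä ß à');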
Try
use utf8;
at the beginning of your script (the modern replacement for the old, now-deprecated encoding pragma); sometimes this solves a lot of UTF-8 problems.

Extended ASCII characters show up as junk in MySQL db when inserted through Perl

I have a MySQL 'articles' table and I am trying to make the following insert using SQLyog.
insert into articles (id,title) values (2356606,'Jérôme_Lejeune');
This works fine and the data shows fine when I do a select query.
The problem is that when I do the same insert through my Perl script, the name shows up in the database with junk characters in place of é and ô. I need to know how to store the name properly through my script. The part of the code that does the insert looks like this:
$sql_insert = "insert into articles (id,title) values (?,?)";
$sth_insert = $dbh->prepare($sql_insert);
$sth_insert->execute($id,$title);
$id and $title hold the correct data, which I have verified by printing them just before the insert. Please assist.
You have opened up the character encoding can of worms, and you have a lot to learn before you will solve this problem and have it stay solved.
You are probably already used to thinking about how a character of text can be encoded as a string of bits. Under the ASCII encoding, for example, the 8-bit string 01000001 (65) is used to indicate the character A. Once you consider how many different languages there are and how many different kinds of characters there are, you quickly realize that an 8-bit encoding is not going to get you very far, so a number of other character encodings have proliferated. Some of the most popular are latin1 (ISO-8859-1) and UTF-8. Both of these encodings can render the é and ô characters, but they use quite different bit strings to represent them.
As you write to a file (or to the terminal) or add a row to a database, Perl and MySQL have a notion of what the character encoding of the output stream is. An encoding is also used when you read data. If you don't know what this encoding is, then it doesn't make any sense to say that the data looks good or bad when you store it and retrieve it.
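To make that concrete, here is a small Perl illustration of how differently the two encodings represent the very same character:

use strict;
use warnings;
use Encode qw(encode);

my $e_acute = "\x{E9}";    # the character é

# The same character, as bytes, under each encoding:
printf "latin1: %vX\n", encode('ISO-8859-1', $e_acute);   # E9
printf "UTF-8:  %vX\n", encode('UTF-8', $e_acute);        # C3.A9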
Perl and MySQL can, with the right settings, handle both of these encodings and several more. Which encoding you choose to use is not as important as making sure that all the pieces of your application are using the same encoding. But you should choose an encoding that
can encode all of the characters you will need (for this problem, you mention é and ô, but will there be others? what about in the future?)
is supported by all the pieces of your application (front-end, database, back-end)
Here's some suggested reading to get you headed in the right direction:
The Encode module for Perl
character sets in MySQL
(others should feel free to recommend additional links)
I can't speak to MySQL so much, but character encoding support in Perl is rapidly evolving (which isn't to say that it ain't damn good). The latest versions of Perl will have the best support (for the most obscure character sets) and the best features (for example, regular expressions and character classes) for characters beyond ASCII.
There are a few things to take care of.
First, you have to make sure Perl knows that the data moving between your program and the DB is encoded as UTF-8 (I assume your databases and tables are set up properly). For this you need to say so explicitly when connecting to the database, like this:
my $dbh = DBI->connect(
    'dbi:mysql:test',
    'user',
    'password',
    { mysql_enable_utf8 => 1 },
);
Next, when you send data to output, you must set the output layer to encode it as UTF-8. For this I like this handy module:
use utf8::all;
But this module is not in core, so you may want to set things up with binmode yourself instead:
binmode STDIN, ":utf8";
binmode STDOUT, ":utf8";
And if you are dealing with web pages, you have to make sure the browser understands that you are sending your data encoded as UTF-8. For that, your HTTP headers should include the encoding:
Content-Type: text/html; charset=utf-8
and you can set it with an HTML meta tag too:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
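Putting it all together, a minimal end-to-end sketch for the insert from the question (database name and credentials are placeholders, reusing those from the connect example above):

use strict;
use warnings;
use utf8;                  # this source file and its literals are UTF-8
use DBI;

binmode STDOUT, ':utf8';   # encode program output as UTF-8

my $dbh = DBI->connect(
    'dbi:mysql:test', 'user', 'password',
    { mysql_enable_utf8 => 1, RaiseError => 1 },
);

# Same insert as in the question; the driver now sends proper UTF-8
my $sth = $dbh->prepare('insert into articles (id,title) values (?,?)');
$sth->execute(2356606, 'Jérôme_Lejeune');

# Read it back to verify the round trip
my ($title) = $dbh->selectrow_array(
    'select title from articles where id = ?', undef, 2356606);
print "$title\n";          # Jérôme_Lejeune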
Now you should have all the bases covered.

Migrating MS Access data to MySQL: character encoding issues

We have an MS Access .mdb file produced, I think, by an Access 2000 database. I am trying to export a table to SQL with mdbtools, using this command:
mdb-export -S -X \\ -I orig.mdb Reviewer > Reviewer.sql
That produces the file I expect, except for one thing: some of the characters are represented as question marks. This: "He wasn't ready" shows up like this: "He wasn?t ready", but only in some cases (primarily single/double curly quotes), where maybe the content was pasted into the DB from MS Word. Otherwise, the data look great.
I have tried various values for "export MDB_ICONV=". I've tried using iconv on the resulting file, with ISO-8859-1 in the from/to, with UTF-8 in the from/to, with WINDOWS-1250 and WINDOWS-1252 and WINDOWS-1256 in the from, in various combinations. But I haven't succeeded in getting those curly quotes back.
Frankly, based on the way the resulting file looks, I suspect the issue is either in the original .mdb file or in mdbtools. The malformed characters are all single question marks, but it is clear they are not all malformed versions of the same character; so (my gut says) there's not enough data left in the resulting file; so (my gut says) the issue can't be fixed in the resulting file.
Has anyone run into this one before? Any tips for moving forward? FWIW, I don't have and never have had MS Access -- the file is coming from a 3rd party -- so this could be as simple as changing something on the database, and I would be very glad to hear that.
Thanks.
Looks like "smart quotes" have claimed yet another victim.
MS Word takes plain ASCII quotes and translates them into curly left-quote and right-quote characters, and translates a straight apostrophe into a curly apostrophe. The characters in question belong to the Windows-1252 code page, which roughly matches the first 256 Unicode code points except for the block where these silly quote characters live.
There is a Perl script called 'demoroniser.pl' which undoes all this malarkey and converts the quotes back to plain ASCII.
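The core of that script is just a substitution table. A minimal sketch of the idea, assuming the input still contains the raw Windows-1252 bytes (only the quote characters are mapped here; demoroniser.pl handles more):

use strict;
use warnings;

# Windows-1252 "smart" punctuation -> plain ASCII
my %demoronise = (
    "\x91" => "'",    # left single quote
    "\x92" => "'",    # right single quote / apostrophe
    "\x93" => '"',    # left double quote
    "\x94" => '"',    # right double quote
);

while (my $line = <STDIN>) {
    $line =~ s/([\x91-\x94])/$demoronise{$1}/g;
    print $line;
}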
It's most likely due to the fact that the text in the Access file is Unicode, and MDB Tools is trying to convert it to ASCII/latin1/ISO-8859-1 or some other encoding. Since those encodings don't map all the Unicode characters properly, you end up with question marks. The information here may help you fix your encoding issues by getting MDB Tools to use the correct encoding.