MySQL fails to save UTF-8 string in some cases

While fighting spam, I found some spam comments stored without any content.
After trying to isolate the problem, here is what I found by saving similar comments to a file as well as to the MySQL database.
This is what the first few "characters" of the comment look like (in hex, because the input encoding is unknown):
D1EA E0F7 E0F2 FC20 EFEE EFF3 EBFF F0ED FBE5 20EF F0EE E3F0 E0EC ECFB
After executing INSERT INTO test VALUES (0xD1EAE0F7E0F2FC20EFEEEFF3EBFFF0EDFBE520EFF0EEE3F0E0ECECFB21),(0x21D1EAE0F7E0F2FC20EFEEEFF3EBFFF0EDFBE520EFF0EEE3F0E0ECECFB), (0x21), the MySQL table test (utf-8) contains 3 rows: the first without any text, the second and third with the single character "!" as text. (Note that hex code 21 for "!" is also at the end of the first entry, yet it is not saved.) (With latin1 encoding, some useless replacement text was saved for every byte, but this post is not about that.)
Of course, D1EA isn't a valid UTF-8 character (D1 = 1101 0001 should be followed by one 10xxxxxx byte, not 1110xxxx), but a robust system like a database server should be able to deal with it.
My guess is that MySQL (ver. 5.1.66-0+squeeze1) shouldn't pick and choose when to save data, even if the data is not a valid UTF-8 encoded character. Or at least it should not claim the query was successful when it decides not to store the data!
Is this a bug in MySQL, or what?
Thanks

The encoding is Windows-1251, and the bytes decode to
Скачать популярные программы
("Download popular software", according to Google Translate)
You should reject non-UTF8 input in your code before doing anything with it.
if( !mb_check_encoding($input, "UTF-8") ) {
    header("HTTP/1.1 400 Bad Request");
    die("Invalid encoding");
}
For the record, your queries use hex literals, not mis-encoded text.
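Both points can be checked with a few lines of Python (used here only for illustration; the helper name is_valid_utf8 is made up for this sketch). It takes the bytes from the question, confirms that they are not valid UTF-8, and decodes them as Windows-1251.

```python
# Bytes from the question's first hex literal, without the trailing 0x21.
raw = bytes.fromhex("D1EAE0F7E0F2FC20EFEEEFF3EBFFF0EDFBE520EFF0EEE3F0E0ECECFB")


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xD1 announces a two-byte sequence, but 0xEA is not a continuation byte.
utf8_ok = is_valid_utf8(raw)

# The same bytes read as Windows-1251 (codec name "cp1251" in Python).
decoded = raw.decode("cp1251")

print(utf8_ok)
print(decoded)
```

A check like this, done before the INSERT, plays the same role as the mb_check_encoding() call above: the application decides what to do with bad input instead of leaving it to the database.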

Related

Ascii control characters SOH, DLE, STX, ETX : How to escape binary data over a UART?

I want a simple, light-weight way for two basic 8-bit MCUs to talk to each other over an 8-bit UART connection, sending both ASCII characters as 8-bit values, and binary data as 8-bit values.
I would rather not re-invent the wheel, so I'm wondering if some ASCII implementation would work, using ASCII control characters in some standard way.
The problem: either I'm not understanding it correctly, or it's not capable of doing what I want.
The Wikipedia page on control characters says a packet could be sent like this:
< DLE > < SOH > - data link escape and start of heading
Heading data
< DLE > < STX > - data link escape and start of text
< payload >
< DLE > < ETX > - data link escape and end of text
But what if the payload is binary data containing two consecutive bytes equal to DLE and ETX? How should those bytes be escaped?
The link may be broken and re-established, so a receiving MCU should be able to start receiving mid-packet, and have a simple way of telling when the next packet has begun, so it can ignore data until the end of that partial packet.
Error checking will happen at a higher level to ensure that a received packet is valid, unless the ASCII standards can solve this too.

Since you are going to transfer binary data along with text messages, you do have to make sure the receiver won't confuse control bytes with payload contents. One way to do that is to encode the payload so that none of the special characters appear in the output. If the overhead is not a problem, then a simple encoding such as Base16 should be enough. Otherwise, you may want to take a look at escapeless encodings that have been specifically designed to remove certain characters from encoded data.
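As a rough sketch of the Base16 idea (in Python for readability, not MCU code; the function names frame_base16 and unframe_base16 and the choice of STX/ETX as delimiters are assumptions for this example), the payload is sent as hex digits, so no control byte can ever appear between the delimiters:

```python
STX = 0x02  # ASCII start of text
ETX = 0x03  # ASCII end of text


def frame_base16(payload: bytes) -> bytes:
    """Wrap the payload as STX + uppercase hex digits + ETX."""
    hex_digits = payload.hex().upper().encode("ascii")
    return bytes([STX]) + hex_digits + bytes([ETX])


def unframe_base16(frame: bytes) -> bytes:
    """Reverse frame_base16; raises ValueError on a malformed frame."""
    if len(frame) < 2 or frame[0] != STX or frame[-1] != ETX:
        raise ValueError("missing STX/ETX delimiters")
    return bytes.fromhex(frame[1:-1].decode("ascii"))


# A payload that happens to contain DLE, ETX and STX byte values.
frame = frame_base16(b"\x10\x03\x02")
print(frame)
print(unframe_base16(frame))
```

The cost is that every payload byte becomes two bytes on the wire, which is the overhead mentioned above.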

I understand this is an old question, but I thought I should suggest the Serial Line Internet Protocol (SLIP), which is defined in RFC 1055. It is a very simple protocol.
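For reference, SLIP framing fits in a few lines. The sketch below is in Python for readability rather than MCU code, and the function names slip_encode and slip_decode are made up here; the four byte values are the ones given in RFC 1055.

```python
END = 0xC0      # frame delimiter
ESC = 0xDB      # escape byte
ESC_END = 0xDC  # ESC ESC_END stands for a literal 0xC0 in the payload
ESC_ESC = 0xDD  # ESC ESC_ESC stands for a literal 0xDB in the payload


def slip_encode(payload: bytes) -> bytes:
    """Frame one packet. The leading END flushes any line noise."""
    out = bytearray([END])
    for b in payload:
        if b == END:
            out += bytes([ESC, ESC_END])
        elif b == ESC:
            out += bytes([ESC, ESC_ESC])
        else:
            out.append(b)
    out.append(END)
    return bytes(out)


def slip_decode(stream: bytes) -> list:
    """Split a byte stream into packets; empty packets are dropped."""
    packets, current, escaped = [], bytearray(), False
    for b in stream:
        if escaped:
            # Unknown escape codes are passed through unchanged.
            current.append(END if b == ESC_END else ESC if b == ESC_ESC else b)
            escaped = False
        elif b == ESC:
            escaped = True
        elif b == END:
            if current:
                packets.append(bytes(current))
                current = bytearray()
        else:
            current.append(b)
    return packets


encoded = slip_encode(b"\x01\xc0\xdb\x02")
print(encoded.hex(" "))
print(slip_decode(encoded))
```

Because 0xC0 never appears inside an encoded packet, a receiver that joins mid-packet can resynchronise at the next END byte. The partial packet it collected before that point still has to be rejected by the higher-level error check, since SLIP itself carries no checksum.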

Character encoding issues with migrating from MSSQL to MySQL

We have an application called JIRA running on Windows using MSSQL and I need to migrate it to Linux/MySQL. The character encoding in the existing MSSQL db is latin1 but I need to use UTF-8 in MySQL.
I take an XML dump of the MSSQL data using a backup mechanism provided by the application, then run it through a Python filter to convert the encoding from latin1 to UTF-8. Here is the Python code that was provided to me by my colleague.
#!/usr/bin/python
import codecs, re
try:
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
#fin = codecs.open('unicodestuff.txt', encoding='utf-8', errors='replace')
fin = codecs.open('entities.xml', encoding='latin1')
fout = codecs.open('stripped.xml', encoding='utf-8', mode='w', errors='replace')
for line in fin:
    line = highpoints.sub(u'', line)
    fout.write(line)
fin.close()
fout.close()
I take the filtered XML dump and restore the data using a "restore" mechanism in the application. However, after restoring the data, I spot-checked a few records on the MySQL side and I see some weird characters, which I assume are related to character encoding. For example,
on the MSSQL side, the text string is
““Number of debits exceeds maximum of 0”
“2-Restrict All Credits”
Default ของประเภทบัญชีถูกต้อง แต่เลขบัญชีไม่ถูกต้อง
Branch : 724 มาบุญครอง
whereas on the MySQL side, the corresponding text appears as
â??â??Number of debits exceeds maximum of 0â?
â??2-Restrict All Creditsâ?
Default à¸à¸à¸à¸à¸£à¸°à¹à¸ à¸à¸à¸±à¸à¸à¸µà¸à¸¹à¸à¸à¹à¸à¸ à¹à¸à¹à¹à¸¥à¸à¸à¸±à¸à¸à¸µà¹à¸¡à¹à¸à¸¹à¸à¸à¹à¸à¸
Branch : 724 à¸¡à¸²à¸à¸¸à¸à¸à¸£à¸à¸
Can you please provide me some ideas to fix these character encoding issues? Kindly let me know if additional information is required.
Thanks
Sam

Clearly your XML file does not actually use the Latin-1 character set. You've shown that text such as "ของประเภทบัญชีถูกต้อง แต่เลขบัญชีไม่ถูกต้อง" is present in it. The Latin-1 character set does what it says on the label: it represents letters from Latin alphabets. Those letters do not exist in it. If the headers in your XML file claim that it's in Latin-1, then those headers are untrue and the XML is, strictly speaking, not valid. But it might still be usable.
Now the problem is, what character encoding is that XML file actually using? To find out, you may have to examine the XML file in hexadecimal. There are three main possibilities: (1) it's using an old codepage such as 874 which contains these characters; (2) it's using UTF-16; (3) it's using UTF-8.
If you examine in hexadecimal a section of the XML that contains some of this non-Latin text and some Latin letters nearby, here's what you might see (all values in hex). If it's in a codepage such as 874, each Latin letter will be one byte with a value from 20 to 7F, and each non-Latin letter will be one (or possibly two?) bytes with values from 80 to FF. If it's in UTF-16, each Latin letter will be two bytes, one from 20 to 7F and the other always 00, and the non-Latin letters will be two bytes with neither being 00. If it's in UTF-8, the Latin letters will be one byte from 20 to 7F, and the non-Latin letters will be (probably) three bytes, all from 80 to FF.
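To make those three byte patterns concrete, here is a small Python sketch (the helper name byte_patterns is made up for this example) that encodes one Latin letter followed by one Thai letter in each of the candidate encodings and prints the bytes in hex:

```python
def byte_patterns(text: str) -> dict:
    """Return the hex bytes of text in each candidate encoding."""
    codecs_to_try = ("cp874", "utf-16-le", "utf-8")
    return {name: text.encode(name).hex(" ") for name in codecs_to_try}


# Latin "A" followed by the Thai letter U+0E01.
patterns = byte_patterns("A\u0e01")
for name, hex_bytes in patterns.items():
    print(name, hex_bytes)
```

Comparing a hex dump of the real file against these shapes (one high byte, two bytes with no 00, or three high bytes per Thai letter) should point to the encoding actually in use.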
There may be an alternative to examining hexadecimal. Some text editor programs can save text files in your choice of encoding formats. TextPad 7, for instance, can save as ANSI, DOS, UTF-8, Unicode, or Unicode (big-endian). The latter two options are actually UTF-16. Try loading the XML into such a program, and saving copies of it as UTF-8 and as Unicode. One of these copies should be the same size as the original (plus or minus two or three bytes), and the other will be a different size. Whichever matches the size is probably the correct format. If both differ, then you've got something weird.
Anyway, if you save a version as UTF-8 and then are able to open it and see your data intact, you should then be able to import that without using a Python translator.
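The garbled MySQL output shown in the question is at least consistent with the third possibility, a file that is already UTF-8 being read as latin1 and written out as UTF-8 again. This is an inference from the sample text, not something confirmed by the poster. A minimal Python sketch (the function name misread_as_latin1 is made up here) reproduces the start of the "Branch" line:

```python
def misread_as_latin1(text: str) -> str:
    """Simulate reading UTF-8 bytes with the wrong (latin1) decoder."""
    return text.encode("utf-8").decode("latin-1")


original = "\u0e21\u0e32"  # first two Thai letters after "Branch : 724"
garbled = misread_as_latin1(original)
print(garbled)

# If nothing else damaged the data, the mistake is reversible.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```

If the file does turn out to be UTF-8, opening it with encoding='utf-8' in the filter script, or skipping the filter altogether, would avoid the double conversion.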

Extended ASCII characters show up as junk in MySQL db when inserted through Perl

I have a MySQL 'articles' table and I am trying to make the following insert using SQLyog.
insert into articles (id,title) values (2356606,'Jérôme_Lejeune');
This works fine and the data shows fine when I do a select query.
The problem is that when I do the same insert query through my Perl script, the name shows up with some junk characters in place of é and ô in the database. I need to know how to properly store the name through my script. The part of the code that does the insert looks like this.
$sql_insert = "insert into articles (id,title) values (?,?)";
$sth_insert = $dbh->prepare($sql_insert);
$sth_insert->execute($id,$title);
$id and $title have the correct required data which I have checked by print before I am inserting them. Please assist.

You have opened up the character encoding can of worms, and you have a lot to learn before you will solve this problem and have it stay solved.
You are probably already used to thinking of how a character of text can be encoded as a string of bits. Under the ASCII encoding, for example, the 8-bit string 01000001 (65) is used to indicate the A character. When you start to think about how many different languages there are and how many different kinds of characters there are, you quickly realize that an 8-bit encoding is not going to get you very far. So a number of other character encodings have proliferated. Some of the most popular are latin1 (ISO-8859-1) and UTF-8. Both of these encodings can render the é and ô characters, but they use quite different bit strings to represent them. As you write to a file (or to the terminal) or add a row to a database, Perl and MySQL have a notion of what the character encoding of the output stream is. An encoding is also used when you read data. If you don't know what this encoding is, then it doesn't make any sense to say that the data looks good/looks bad when you store it and retrieve it.
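A quick way to see the difference between the two encodings is to print the bytes each one produces. The sketch below uses Python only because it makes the bytes easy to inspect; the helper name encodings_of is made up for this example.

```python
def encodings_of(ch: str) -> tuple:
    """Return (latin1 hex, utf-8 hex) for a single character."""
    return ch.encode("latin-1").hex(), ch.encode("utf-8").hex()


for ch in "\u00e9\u00f4":  # é and ô
    print(ch, encodings_of(ch))

# Writing UTF-8 but reading it back as latin1 yields the familiar junk:
junk = "\u00e9".encode("utf-8").decode("latin-1")
print(junk)
```

One byte in latin1 becomes two bytes in UTF-8, so a reader that assumes the wrong encoding shows two junk characters where one accented letter was stored.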
Perl and MySQL can, with the right settings, handle both of these encodings and several more. Which encoding you choose to use is not as important as making sure that all the pieces of your application are using the same encoding. But you should choose an encoding that
- can encode all of the characters you will need (for this problem, you mention é and ô, but will there be others? what about in the future?)
- is supported by all the pieces of your application (front-end, database, back-end)
Here's some suggested reading to get you headed in the right direction:
- The Encode module for Perl
- character sets in MySQL
(others should feel free to recommend additional links)
I can't speak to MySQL so much, but character encoding support in Perl is rapidly evolving (which isn't to say that it ain't damn good). The latest versions of Perl will have the best support (for the most obscure character sets) and the best features (for example, regular expressions and character classes) for characters beyond ASCII.

There are a few things to follow.
First, you have to make sure that Perl understands that the data moving between your program and the DB is encoded as UTF-8 (I assume your databases and tables are set up properly). To do this, say so explicitly when connecting to the database, like this:
my($dbh) = DBI->connect(
    'dbi:mysql:test',
    'user',
    'password',
    {
        mysql_enable_utf8 => 1,
    }
);
Next, when you send data to output, you must set the output handle to encode it as UTF-8. For this I like this pretty good module:
use utf8::all;
But this module is not in core, so you may want to set it with binmode yourself too:
binmode STDIN, ":utf8";
binmode STDOUT, ":utf8";
And if you deal with web pages, you have to make sure that the browser understands that you are sending your data encoded as UTF-8. For that, make sure your HTTP headers include the encoding:
Content-Type: text/html; charset=utf-8;
and set it with HTML META-tag too:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Now you should have everything covered.

Migrating MS Access data to MySQL: character encoding issues

We have an MS Access .mdb file produced, I think, by an Access 2000 database. I am trying to export a table to SQL with mdbtools, using this command:
mdb-export -S -X \\ -I orig.mdb Reviewer > Reviewer.sql
That produces the file I expect, except one thing: Some of the characters are represented as question marks. This: "He wasn't ready" shows up like this: "He wasn?t ready", only in some cases (primarily single/double curly quotes), where maybe the content was pasted into the DB from MS Word. Otherwise, the data look great.
I have tried various values for "export MDB_ICONV=". I've tried using iconv on the resulting file, with ISO-8859-1 in the from/to, with UTF-8 in the from/to, with WINDOWS-1250 and WINDOWS-1252 and WINDOWS-1256 in the from, in various combinations. But I haven't succeeded in getting those curly quotes back.
Frankly, based on the way the resulting file looks, I suspect the issue is either in the original .mdb file, or in mdbtools. The malformed characters are all single question marks, but it is clear that they are not malformed versions of the same thing; so (my gut says) there's not enough data in the resulting file; so (my gut says) the issue can't be fixed in the resulting file.
Has anyone run into this one before? Any tips for moving forward? FWIW, I don't have and never have had MS Access -- the file is coming from a 3rd party -- so this could be as simple as changing something on the database, and I would be very glad to hear that.
Thanks.

Looks like "smart quotes" have claimed yet another victim.
MS Word takes plain ASCII quotes and translates them to the double-byte left-quote and right-quote characters, and translates a single quote into the double-byte apostrophe character. The double-byte characters in question belong to an MS code page which is roughly compatible with UTF-16 except for the silly quote characters.
There is a Perl script called 'demoroniser.pl' which undoes all this malarkey and converts the quotes back to plain ASCII.

It's most likely due to the fact that the data in the Access file is UTF, and MDB Tools is trying to convert it to ASCII/latin1/ISO-8859-1 or some other encoding. Since these encodings can't represent all the UTF characters, you end up with question marks. The information here may help you fix your encoding issues by getting MDB Tools to use the correct encoding.
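The question-mark effect is easy to reproduce outside of MDB Tools. The Python sketch below only illustrates the general mechanism (it says nothing about how mdb-export is implemented): a curly apostrophe has no slot in ISO-8859-1, so a lossy conversion substitutes "?", while Windows-1252 does have a byte for it.

```python
text = "He wasn\u2019t ready"  # U+2019 is the curly apostrophe

# Lossy conversion: characters missing from the target become "?".
lossy = text.encode("latin-1", errors="replace")
print(lossy)

# Windows-1252 can represent the curly apostrophe as the single byte 0x92.
cp1252_bytes = text.encode("cp1252")
print(cp1252_bytes)
```

Once the "?" has been written, the original character cannot be recovered from the output file, which matches the suspicion in the question that the fix has to happen at export time.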

iconv gives "Illegal Character" with smart quotes -- how to get rid of them?

I have a MySQL table with 120,000 lines stored in UTF-8 format. There is one field, product name, that contains text with many accents. I need to fill a second field with this same name after converting it to a url-friendly form (ASCII).
Since PHP doesn't directly handle UTF-8, I'm using:
$value = iconv ('UTF-8', 'ISO-8859-1', $value);
to convert the name to ISO-8859-1, followed by a massive strtr statement to replace any accented character by its unaccented equivalent (à becomes a, for example).
However, the original text names were entered with smart quotes, and iconv chokes whenever it comes across one -- I get:
Unknown error type: [8]
iconv() [function.iconv]: Detected an illegal character in input string
To get rid of the smart quotes before using iconv, I have tried using three statements like:
$value = str_replace('â€™', "'", $value);
(â€™ is the raw value of a UTF-8 smart single quote)
Because the text file is so long, these str_replace's cause the script to time out every single time.
What is the fastest way to strip out the smart quotes (or any invalid characters) from a UTF-8 string, prior to running iconv?
Or, is there an easier solution to this whole problem? What is the fastest way to convert a name with many accents, in UTF-8, to a name with no accents, spelled correctly, in ASCII?

Glibc (and the GNU libiconv) supports //TRANSLIT and //IGNORE suffixes.
Thus, on Linux, this works just fine:
$ echo $'\xe2\x80\x99'
’
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1
iconv: illegal input sequence at position 0
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1//translit
'
I'm not sure what iconv is in use by PHP, but the documentation implies that //TRANSLIT and //IGNORE will work there too.

What do you mean by "link-friendly"? The only way that makes sense to me, since the text between <a>...</a> tags can be anything, is actually "URL-friendly", similar to SO's URLs where everything is converted to [a-z-].
If that's what you're going for, you'll need a transliteration library, not a character set conversion library. (I've had no luck getting iconv() to do the work in the past, but I haven't tried in a while.) There's a beta PHP extension, translit, that probably does the job.
If you can't add extensions to your PHP install, you'll have to look for a PHP library that does the same thing. I haven't used it, but the PHP UTF-8 library implements a utf8_to_ascii function that I assume does something like what you need.
(Also, if iconv() is failing like you said, it means that your input isn't actually valid UTF-8, so no amount of replacing valid UTF-8 with anything else will help the problem. EDIT: I may take that back: if ephemient's answer is correct, the iconv error you're seeing may very well be because there's no direct representation of the character in the destination character set. So, never mind.)
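To show what transliteration (as opposed to character set conversion) looks like, here is a minimal Python sketch. It is not the translit extension or the PHP UTF-8 library mentioned above; the function name to_ascii_slug and the small quote table are assumptions made for this example. It maps smart quotes to plain ones, decomposes accented letters, drops the accent marks, and reduces the rest to [a-z0-9-].

```python
import re
import unicodedata

# Smart quotes have no decomposition, so map them by hand first.
SMART_QUOTES = {0x2018: "'", 0x2019: "'", 0x201C: '"', 0x201D: '"'}


def to_ascii_slug(name: str) -> str:
    """Hypothetical helper: UTF-8 product name -> URL-friendly ASCII."""
    name = name.translate(SMART_QUOTES)
    # NFKD splits "é" into "e" plus a combining accent, which is not ASCII.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of other characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


slug = to_ascii_slug("J\u00e9r\u00f4me\u2019s Caf\u00e9 \u00e0 Paris")
print(slug)
```

Letters that do not decompose into an ASCII base (ø or ß, for example) are silently dropped by this approach, so a real transliteration library is still the safer choice for arbitrary input.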

Have you considered using MySQL's REPLACE string function to change the offending strings into apostrophes, or whatever? You may be able to put together the "string to be replaced" part e.g. by using CONCAT on CHAR calls...
