Very odd characters in a mysql dump -- what to do?

I've got weird characters mangling my data migration. These characters are embedded as-is in the actual mysql dump file:
北京东方å›悦大酒店<br />\n<br />\n“The impetus
I've been given mysql data dumps with those kinds of chars in them. I'm importing the data into Drupal, by first recreating the mysql tables, and then querying against them using Drupal's Migrate module.
Code looks like this:
DROP TABLE IF EXISTS `news`;
SET @saved_cs_client = @@character_set_client;
SET character_set_client = utf8;
CREATE TABLE `news` (
`id` int(11) NOT NULL auto_increment,
`uid` int(11) NOT NULL,
`pid` int(11) default NULL,
`puid` int(11) default NULL,
`headline` varchar(255) NOT NULL,
`teaser` varchar(500) NOT NULL,
`status` char(1) default NULL,
`date` datetime NOT NULL,
`url` varchar(255) default NULL,
`url_title` varchar(255) default NULL,
`body` text,
`caption` varchar(255) default NULL,
`gid` int(11) default NULL,
`feature` text,
`related` varchar(255) default NULL,
`change1_time` int(11) default NULL,
`change2_time` int(11) default NULL,
`change1_user` varchar(255) default NULL,
`change2_user` varchar(255) default NULL,
`expires` datetime default NULL,
`rank` char(1) default NULL,
PRIMARY KEY (`id`),
KEY `uid` (`uid`),
KEY `status` (`status`),
KEY `expires` (`expires`),
KEY `rank` (`rank`),
KEY `puid` (`puid`),
FULLTEXT KEY `headline` (`headline`,`teaser`,`body`)
) ENGINE=MyISAM AUTO_INCREMENT=6976 DEFAULT CHARSET=utf8;
SET character_set_client = @saved_cs_client;
Fastest solution is the winner here -- I'm on a tight deadline, and really suffering over here! I've tried a search and replace solution, but there appear to be too many different types of weird data. I can orchestrate a new data dump, if I know what to tell them (how to do the data dump).
Thanks,
John

This isn't a direct answer to your question, but I played a bit with the mojibake you quoted in your post. It looks like it was originally Chinese text in UTF-8 encoding, which was interpreted as Latin text in Windows-1252 encoding, re-encoded in UTF-8 and again interpreted as Windows-1252 (and finally once more encoded as UTF-8 when you posted it here). So it's not just mojibake, it's double mojibake.
Also, at some point, a byte was lost from the middle of the string (probably because it was one of the undefined code points in Windows-1252), mangling one of the original characters. Running the text through the encoding chain in reverse (encode as Windows-1252, decode as UTF-8, repeat), I get the output:
北京东方�悦大酒店<br />\n<br />\n“The impetus
where the replacement character � stands for the mangled character.
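If it helps, the same reversal can be sketched directly in MySQL. This is only a rough illustration: it assumes the double Windows-1252/UTF-8 round trip described above, uses MySQL's latin1 (which is close to, but not identical to, Windows-1252) as the stand-in charset, and borrows the news.body column from your dump; characters that can't survive a conversion step will come back as ?.
-- Undo one mojibake layer: take the utf8 text, get its latin1 byte form,
-- then reinterpret those bytes as utf8. Nesting the expression twice
-- undoes the double round trip.
SELECT CONVERT(BINARY CONVERT(
         CONVERT(BINARY CONVERT(body USING latin1) USING utf8)
       USING latin1) USING utf8) AS fixed_body
FROM news
LIMIT 10;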

Related

How to overcome performance issue when converting utf8mb4 to latin1?

Out of my own ignorance, I altered a few tables without specifying a collation.
That caused the changed columns, which used to be latin1, to become utf8mb4.
This brought a HUGE performance loss when running joins. And when I say HUGE I mean a fraction of a second changed to one hour or more!
So I made another request to convert it back to latin1.
And here comes the problem. A mere 60k-row table with ONE utf8mb4 column of 64 characters required 10 hours to complete. No, it is not a mistake. TEN hours. And my even bigger problem is that I have other tables with millions of rows, giving me an ETA measured in years from today!
So now I wonder what my options are, because I can't afford to have these tables read-only for longer than a day.
I know that MySQL's ALTER creates a copy of the table. That makes sense, since this is a field-size change, so I doubt I have the option to use ALGORITHM=INPLACE.
And if I cannot do INPLACE, then I cannot use the LOCK=NONE option either.
Why in the world could a utf8mb4 -> latin1 conversion make such a big impact?
Note that the converted column is indexed, and this may be a reason for the impact!
ANY suggestion or link would be greatly appreciated!
Maybe the solution would be to drop the index (to avoid funky multibyte issues in the index conversion), do a fast ALTER, and then add the index back?
Thanks in advance for any serious suggestion; I suspect I may not find much help because of how unusual this case is.
EDIT
jobs | CREATE TABLE `jobs` (
`auto_inc_key` int(11) NOT NULL AUTO_INCREMENT,
`request_entered_timestamp` datetime NOT NULL,
`hash_id` char(64) CHARACTER SET latin1 NOT NULL,
`name` varchar(128) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`host` char(20) CHARACTER SET latin1 NOT NULL,
`user_id` int(11) NOT NULL,
`start_date` datetime NOT NULL,
`end_date` datetime NOT NULL,
`state` char(12) CHARACTER SET latin1 NOT NULL,
`location` varchar(50) NOT NULL,
`value` int(10) NOT NULL DEFAULT '0',
`aggregation_job_id` char(64) CHARACTER SET latin1 DEFAULT NULL,
`aggregation_job_order` int(11) DEFAULT NULL,
PRIMARY KEY (`auto_inc_key`),
KEY `host` (`host`),
KEY `hash_id` (`hash_id`),
KEY `user_id` (`user_id`,`request_entered_timestamp`),
KEY `request_entered_timestamp_idx` (`request_entered_timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=9068466 DEFAULT CHARSET=utf8mb4
jobs_archive | CREATE TABLE `jobs_archive` (
`auto_inc_key` int(11) NOT NULL AUTO_INCREMENT,
`request_entered_timestamp` datetime NOT NULL,
`hash_id` char(64) CHARACTER SET latin1 NOT NULL,
`name` varchar(128) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`host` char(20) CHARACTER SET latin1 NOT NULL,
`user_id` int(11) NOT NULL,
`start_date` datetime NOT NULL,
`end_date` datetime NOT NULL,
`state` char(12) CHARACTER SET latin1 NOT NULL,
`value` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`auto_inc_key`),
KEY `host` (`host`),
KEY `hash_id` (`hash_id`),
KEY `user_id` (`user_id`,`request_entered_timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=239432 DEFAULT CHARSET=utf8mb4
(taken from PROCEDURE, but you catch the drift...)
INSERT INTO jobs_archive (SELECT * FROM jobs WHERE (TIMESTAMPDIFF(DAY, request_entered_timestamp, starttime) > days));
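For what it's worth, the drop-index / fast-ALTER / re-add idea floated in the question would look roughly like the sketch below. The index name here is hypothetical (the SHOW CREATE TABLE output above doesn't actually list an index on name), the MODIFY still copies the table, and any characters not representable in latin1 will be lost or rejected during the conversion.
-- Hypothetical sketch only; adjust index/column names to the real schema.
ALTER TABLE jobs DROP INDEX idx_name;
ALTER TABLE jobs MODIFY `name` VARCHAR(128) CHARACTER SET latin1 NOT NULL;
ALTER TABLE jobs ADD INDEX idx_name (`name`);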

Not inserting Turkish characters in SQL table. Displays 'xxx??xxx' when selected

I am trying to add Turkish names to my table, but when they are displayed I get ? instead of the Turkish characters. Any idea what I am missing here? This is my table:
CREATE TABLE IF NOT EXISTS `offerings` (
`dep` varchar(5) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`grade` varchar(4) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`section` varchar(3) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`name` varchar(100) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`teacher` varchar(50) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`quota` varchar(2) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`lec1` varchar(35) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`lec2` varchar(35) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`lec3` varchar(35) DEFAULT NULL,
`lec4` varchar(35) DEFAULT NULL,
`lec5` varchar(35) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf16;
As suggested by the answer I chose, here is the solution to the problem for whoever googles this topic. Special thanks to all who contributed to solving my problem.
CREATE TABLE IF NOT EXISTS `offerings` (
`dep` varchar(5) NOT NULL,
`grade` varchar(4) COLLATE utf8_unicode_ci NOT NULL,
`section` varchar(3) COLLATE utf8_unicode_ci NOT NULL,
`name` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
`teacher` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
`quota` varchar(2) COLLATE utf8_unicode_ci,
`lec1` varchar(35) DEFAULT NULL,
`lec2` varchar(35) DEFAULT NULL,
`lec3` varchar(35) DEFAULT NULL,
`lec4` varchar(35) DEFAULT NULL,
`lec5` varchar(35) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
(Beginnings of an answer...)
Please don't use utf16; there is virtually no reason for such in a MySQL table.
So, assuming you switch to utf8, let's see if we can get rid of the ? problems.
utf8 needs to be established in about 4 places.
The column(s) in the database -- Use SHOW CREATE TABLE to verify that they are explicitly set to utf8, or defaulted from the table definition. (It is not enough to change the database default.)
The connection between the client and the server -- see SET NAMES utf8 (sketched just after this list).
The bytes you have must actually be utf8-encoded. (This is probably already the case.)
If you are displaying the text in a web page, check the <meta> tag.
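For the first two of those places, a minimal sketch against the offerings table from the question (assuming you convert the existing table rather than recreate it; back up first, since the ALTER rewrites the table):
-- Columns and table default to utf8:
ALTER TABLE offerings CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
-- Client/server connection:
SET NAMES utf8;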
What probably happened:
you had utf8-encoded data (good)
SET NAMES latin1 was in effect (default, but wrong)
the column was declared CHARACTER SET latin1 (default, but wrong)
Since the CHARACTER SET disagrees with what you have shown, the problem is possibly more complex. Please provide
SELECT col, HEX(col) FROM tbl WHERE ...
for some simple cell with Turkish characters. With this, I may be able to figure out what happened.
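As a reference point (column and table names taken from the question), correctly stored UTF-8 Turkish letters show up as two-byte hex sequences, e.g. C59F for ş and C4B1 for ı, whereas a bare 3F ('?') means the character was already lost before it reached the table:
-- Example diagnostic query of the kind requested above:
SELECT teacher, HEX(teacher) FROM offerings LIMIT 5;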
Also, see: Reference notes on encodings.
VARCHARs are character strings, while NVARCHARS are Unicode character strings. NVARCHARS require more bits per character to store, but have a greater range. Try updating your data types. This should fix your problem.
EDIT This answer is wrong. The OP clearly asked for a MySQL solution, but the above applies only to SQL Server.

MySQL character set for multiple European languages + mathematic symbols

I need some help getting a MySQL table to store and output characters from the following languages:
English
French
Russian
Turkish
German
These are the languages I know of in the data. It also uses mathematical symbols such as this:
b ∈ A. Define s(A) := sup_{n≥0} r_A(n) for each A ⊆ ? ∪ {0}.
I'm using htmlentities to encode the text. The ? above is meant to display as a ℕ.
It displays this way when I look at the data in PhpMyAdmin. The other characters are encoded as expected.
The table is set to utf8_unicode_ci and all aspects of the website have been set to UTF-8 (including via the .htaccess file, a PHP header and a meta tag).
Please help?
Additional info:
Hosting environment:
Linux, Apache
Mysql 5.5.38
PHP Version 5.4.4-14
Connection code:
ini_set('default_charset', 'UTF-8');
$mysqli = new mysqli($DB_host , $DB_username, $DB_password);
$mysqli->set_charset("utf8");
$mysqli->select_db($DB_name);
SHOW CREATE TABLE mydatabase.mytable output:
CREATE TABLE `tablename` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`created` datetime NOT NULL,
`updated` datetime NOT NULL,
`product` int(11) NOT NULL,
`ppub` tinytext COLLATE utf8_unicode_ci NOT NULL,
`pubdate` date NOT NULL,
`numerous_other_tinytext_cols` tinytext COLLATE utf8_unicode_ci NOT NULL,
`numerous_other_tinytext_cols` tinytext COLLATE utf8_unicode_ci NOT NULL,
`text` text COLLATE utf8_unicode_ci NOT NULL,
`keywords` tinytext COLLATE utf8_unicode_ci NOT NULL,
`active` int(11) NOT NULL DEFAULT '1',
`orderid` int(11) NOT NULL,
`src` tinytext CHARACTER SET latin1 NOT NULL,
`views` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=17780 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
SELECT DEFAULT_CHARACTER_SET_NAME FROM information_schema.SCHEMATA output:
DEFAULT_CHARACTER_SET_NAME
utf8 [->UTF-8 Unicode]
utf8mb4 [->UTF-8 Unicode]
Fonts used:
Arial
Sample of text in the database:
Let <em>A</em> be a subset of the set of nonnegative integers ℕ ∪ {0}, and let <em>r</em><sub><em>A</em></sub> (<em>n</em>) be the number of representations of <em>n</em> ≥ 0 by the sum <em>a</em> + <em>b</em> with <em>a, b</em> ∈ <em>A</em>.
Output on web page:
Let <em>A</em> be a subset of the set of nonnegative integers ? ∪ {0}, and let <em>r</em><sub><em>A</em></sub> (<em>n</em>) be the number of representations of <em>n</em> ≥ 0 by the sum <em>a</em> + <em>b</em> with <em>a, b</em> ∈ <em>A</em>.
Which becomes
Let A be a subset of the set of nonnegative integers ? ∪ {0}, and let rA (n) be the number of representations of n ≥ 0 by the sum a + b with a, b ∈ A.
While your database and table are configured to use UTF-8, one of your columns still isn't:
CREATE TABLE `tablename` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`created` datetime NOT NULL,
`updated` datetime NOT NULL,
`product` int(11) NOT NULL,
`ppub` tinytext COLLATE utf8_unicode_ci NOT NULL,
`pubdate` date NOT NULL,
`numerous_other_tinytext_cols` tinytext COLLATE utf8_unicode_ci NOT NULL,
`numerous_other_tinytext_cols` tinytext COLLATE utf8_unicode_ci NOT NULL,
`text` text COLLATE utf8_unicode_ci NOT NULL,
`keywords` tinytext COLLATE utf8_unicode_ci NOT NULL,
`active` int(11) NOT NULL DEFAULT '1',
`orderid` int(11) NOT NULL,
`src` tinytext CHARACTER SET latin1 NOT NULL, <--------- This one
`views` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=17780 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
Since all the other symbols got HTML-encoded, they will survive all charsets, but not ℕ, which doesn't have a named entity reference.
You need to convert your column:
ALTER TABLE tablename MODIFY src TINYTEXT CHARACTER SET utf8;
NOTE: I noticed you like mathematical symbols. Some of them are outside of the Basic Multilingual Plane, i.e. they have codepoints > 0xFFFF -- for example the mathematical letter variants (fraktur, double-struck, semantic italic, etc.).
If you want to support them, you need to switch encoding in MySQL everywhere (table, columns, connection) to utf8mb4, which is the true UTF-8 (utf8 in MySQL means the subset of UTF-8 with BMP only), with utf8mb4_unicode_ci collation. Here is how to do the migration.
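A minimal sketch of that switch for the table shown above (utf8mb4 requires MySQL 5.5.3+, which the 5.5.38 server mentioned in the question satisfies; back up first):
-- Table default and all character columns to utf8mb4:
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- Connection as well (in the PHP code above: $mysqli->set_charset("utf8mb4")):
SET NAMES utf8mb4;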
Also, I have noticed that you are HTML-encoding HTML. Maybe you have a reason, but in my opinion storing this doesn't make sense:
&lt;em&gt;A&lt;/em&gt;
If you want to put it into an HTML document, now you need to HTML-decode it at least once, sometimes twice. I'd rather store what almost everybody else does:
<em>A</em>
This way, you will store Unicode characters natively, in an optimal way.

Utf8 characters cut off when saving text field in mysql

I'm working on a website in which we save Japanese characters in a text field.
The table looks like this:
| contactlogs | CREATE TABLE `contactlogs` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`contact_email` varchar(128) CHARACTER SET latin1 NOT NULL,
`name` varchar(128) CHARACTER SET latin1 DEFAULT NULL,
`company_name` varchar(128) CHARACTER SET latin1 DEFAULT NULL,
`email` varchar(128) CHARACTER SET latin1 DEFAULT NULL,
`telephone` varchar(30) CHARACTER SET latin1 DEFAULT NULL,
`fax` varchar(30) CHARACTER SET latin1 DEFAULT NULL,
`subject` text COLLATE utf8_unicode_ci NOT NULL,
`message` text COLLATE utf8_unicode_ci NOT NULL,
`created` datetime DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=4 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci |
When I retrieve the data that was saved, the message field is often cut off, with a garbage character at the end (I assume because it did not save all the bytes of the last character).
The website itself is in CakePHP. The data is being saved by just doing a $this->model->save($data) (I haven't changed anything about the way it saves; the model itself is empty). There are no special database settings -- just host, login, persistent => false, driver => mysql, database, and prefix.
Your DB and tables are probably using the latin1 default collation, so without 'encoding' => 'utf8' in your database config CakePHP is going to be talking to the DB in latin1. The transcoding from UTF-8 to latin1 and back leaves room for data to be lost. This is not really CakePHP's problem.
So go ahead and change the connection to UTF-8, and all default collations in the DB too. Newly saved data will be displayed fine; for old data it will depend -- I won't go into details because that's out of scope here.
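In SQL terms, that advice amounts to something like the sketch below (table name from the schema above; CakePHP's 'encoding' => 'utf8' setting makes the driver do roughly the equivalent of the SET NAMES on connect):
-- Convert the latin1 columns and the table default to utf8 (back up first):
ALTER TABLE contactlogs CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
-- What the connection should be speaking (normally handled by the framework):
SET NAMES utf8;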

ActiveRecord in Rails 3.0.3 turns the 8th field of MySQL into a BigDecimal. How to fix it?

I have the members table, which has 9 fields: id, email, and so on.
member_type is the 8th field
The 8th field is always converted to decimal, no matter what name it is or what type it is.
Here is some experimenting I have done:
irb(main):010:0> Member.all()[0].attributes
=> {"created_date"=>nil, "email"=>"tanixxx@yahoo.com", "id"=>1, "is_admin"=>0, "member_type"=>#<BigDecimal:4f87ce0,'0.0',4(8)>, "name"=>"tanin", "password"=>"3cf622832f10a313cb74a59e6032f115", "profile_picture_path"=>"aaaaa", "status"=>"APPROVED"}
Please notice :member_type, which is the 8th field.
Now if I query only some fields, the result is correct:
irb(main):007:0> Member.all(:select=>"member_type,email")[0].attributes
=> {"email"=>"tanixxx#yahoo.com", "member_type"=>"GENERAL"}
I think there must be a bug in ActiveRecord.
Here is some more experimentation. I added "test_8th_field" as the 8th field and got this:
irb(main):016:0> Member.all[0].attributes
=> {"created_date"=>nil, "email"=>"tanixxx@yahoo.com", "id"=>1, "is_admin"=>0, "member_type"=>"GENERAL", "name"=>"tanin", "password"=>"3cf622832f10a313cb74a59e6032f115", "profile_picture_path"=>"aaaaa", "status"=>"APPROVED", "test_8th_field"=>#<BigDecimal:30c87f0,'0.0',4(8)>}
The 8th field is a BigDecimal (it is a text field in MySQL, though). But the member_type field is amazingly correct this time.
I don't know what is wrong with the number 8...
Please help me.
Here is my schema dump, including test_8th_field:
CREATE TABLE IF NOT EXISTS `members` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`email` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`password` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`name` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`profile_picture_path` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`status` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`is_admin` int(11) NOT NULL,
`test_8th_field` text COLLATE utf8_unicode_ci NOT NULL,
`member_type` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT 'GENERAL',
`created_date` datetime NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=2 ;
I have solved it. It turns out that the MySQL client binary library did not match the version of the MySQL server itself.