MySQL character set for multiple European languages + mathematic symbols - mysql

I need some help getting a MySQL table to store and output characters from the following languages:
English
French
Russian
Turkish
German
These are the languages I know of in the data. It also uses mathematical symbols such as this:
b ∈ A. Define s(A):= supn≥0 r A (n) for each A ⊆ ? ∪ {0}.
I'm using htmlentities to encode the text. The ? above is meant to display as a ℕ.
It displays this way when I look at the data in PhpMyAdmin. The other characters are encoded as expected.
The table is set to utf8_unicode_ci and all aspects of the website have been set to UTF-8 (including via the .htaccess file, a PHP header and a meta tag).
Please help?
Additional info:
Hosting environment:
Linux, Apache
Mysql 5.5.38
PHP Version 5.4.4-14
Connection string :
ini_set('default_charset', 'UTF-8');
$mysqli = new mysqli($DB_host , $DB_username, $DB_password);
$mysqli->set_charset("utf8");
$mysqli->select_db($DB_name);
SHOW CREATE TABLE mydatabase.mytable output:
CREATE TABLE `tablename` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`created` datetime NOT NULL,
`updated` datetime NOT NULL,
`product` int(11) NOT NULL,
`ppub` tinytext COLLATE utf8_unicode_ci NOT NULL,
`pubdate` date NOT NULL,
`numerous_other_tinytext_cols` tinytext COLLATE utf8_unicode_ci NOT NULL,
`numerous_other_tinytext_cols` tinytext COLLATE utf8_unicode_ci NOT NULL,
`text` text COLLATE utf8_unicode_ci NOT NULL,
`keywords` tinytext COLLATE utf8_unicode_ci NOT NULL,
`active` int(11) NOT NULL DEFAULT '1',
`orderid` int(11) NOT NULL,
`src` tinytext CHARACTER SET latin1 NOT NULL,
`views` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=17780 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
SELECT DEFAULT_CHARACTER_SET_NAME FROM information_schema.SCHEMATA output:
DEFAULT_CHARACTER_SET_NAME
utf8 [->UTF-8 Unicode]
utf8mb4 [->UTF-8 Unicode]
Fonts used:
Arial
Sample of text in the database:
Let <em>A</em> be a subset of the set of nonnegative integers ℕ ∪ {0}, and let <em>r</em><sub><em>A</em></sub> (<em>n</em>) be the number of representations of <em>n</em> ≥ 0 by the sum <em>a</em> + <em>b</em> with <em>a, b</em> ∈ <em>A</em>.
Output on web page:
Let <em>A</em> be a subset of the set of nonnegative integers ? ∪ {0}, and let <em>r</em><sub><em>A</em></sub> (<em>n</em>) be the number of representations of <em>n</em> ≥ 0 by the sum <em>a</em> + <em>b</em> with <em>a, b</em> ∈ <em>A</em>.
Which becomes
Let A be a subset of the set of nonnegative integers ? ∪ {0}, and let rA (n) be the number of representations of n ≥ 0 by the sum a + b with a, b ∈ A.

While your database and table are configured to use UTF-8, one of your columns still isn't:
CREATE TABLE `tablename` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`created` datetime NOT NULL,
`updated` datetime NOT NULL,
`product` int(11) NOT NULL,
`ppub` tinytext COLLATE utf8_unicode_ci NOT NULL,
`pubdate` date NOT NULL,
`numerous_other_tinytext_cols` tinytext COLLATE utf8_unicode_ci NOT NULL,
`numerous_other_tinytext_cols` tinytext COLLATE utf8_unicode_ci NOT NULL,
`text` text COLLATE utf8_unicode_ci NOT NULL,
`keywords` tinytext COLLATE utf8_unicode_ci NOT NULL,
`active` int(11) NOT NULL DEFAULT '1',
`orderid` int(11) NOT NULL,
`src` tinytext CHARACTER SET latin1 NOT NULL, <--------- This one
`views` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=17780 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
Since all the other symbols got HTML-encoded, they will survive all charsets, but not ℕ, which doesn't have a named entity reference.
You need to convert your column:
ALTER TABLE tablename MODIFY src TINYTEXT CHARACTER SET utf8;
NOTE: I noticed you like mathematical symbols. Some of them are outside of the Basic Multilingual Plane, ie. have codepoints > 0xFFFF, for example mathematical letter variants (fraktur, double-struck, semantic italic etc.).
If you want to support them, you need to switch encoding in MySQL everywhere (table, columns, connection) to utf8mb4, which is the true UTF-8 (utf8 in MySQL means the subset of UTF-8 with BMP only), with utf8mb4_unicode_ci collation. Here is how to do the migration.
Also, I have noticed that you are HTML-encoding HTML. Maybe you have a reason, but in my opinion storing this doesn't make sense:
<em>A</em>
If you want to put it into an HTML document, now you need to HTML-decode it at least once, sometimes twice. I'd rather store what almost everybody else does:
<em>A</em>
This way, you will store Unicode characters natively, in optimal way.

Related

How to overcome performance issue when converting utf8mb4 to latin1?

By my ignorance, I have altered a few tables without specifying collation.
That caused changed columns, which used to be latin1 characters, to be changed to utf8mb4.
This brought HUGE performance loss running joins. And when I say HUGE I mean fraction of a second changed to one hour or more!
So I have made an other request to convert it back to latin1.
And here comes the problem. Mere 60k row table, with ONE utf8mb4 column of 64 characters required 10 hours to complete. No, it is not a mistake. TEN hours. And my even bigger problem is that I have other tables that have millions of rows giving me ETA in years from today!
So now, I wonder what my options are because I can't afford having these tables to be read-only for longer than one day time.
I know that MYSQL ALTER creates a copy of a table. It makes sense because this is field size change, so I doubt I have an option to use ALGORITHM=INPLACE.
If I cannot do INPLACE then I cannot use LOCK=NONE option.
Why in the world utf8mb4 -> latin1 conversion could make such a big impact?
Note that the converted column is indexed, and this may be a reason for the impact!
ANY suggestion or a link would be greatly appreciated!
Maybe the solution would be to drop index (to avoid funky multibyte issue in the index conversion,) do fast alter, and then add an index?
Thanks in advance for any serious suggestion and I suspect I may not find much of a help because of the uniqueness of it.
EDIT
jobs | CREATE TABLE `jobs` (
`auto_inc_key` int(11) NOT NULL AUTO_INCREMENT,
`request_entered_timestamp` datetime NOT NULL,
`hash_id` char(64) CHARACTER SET latin1 NOT NULL,
`name` varchar(128) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`host` char(20) CHARACTER SET latin1 NOT NULL,
`user_id` int(11) NOT NULL,
`start_date` datetime NOT NULL,
`end_date` datetime NOT NULL,
`state` char(12) CHARACTER SET latin1 NOT NULL,
`location` varchar(50) NOT NULL,
`value` int(10) NOT NULL DEFAULT '0',
`aggregation_job_id` char(64) CHARACTER SET latin1 DEFAULT NULL,
`aggregation_job_order` int(11) DEFAULT NULL,
PRIMARY KEY (`auto_inc_key`),
KEY `host` (`host`),
KEY `hash_id` (`hash_id`),
KEY `user_id` (`user_id`,`request_entered_timestamp`),
KEY `request_entered_timestamp_idx` (`request_entered_timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=9068466 DEFAULT CHARSET=utf8mb4
jobs_archive | CREATE TABLE `jobs_archive` (
`auto_inc_key` int(11) NOT NULL AUTO_INCREMENT,
`request_entered_timestamp` datetime NOT NULL,
`hash_id` char(64) CHARACTER SET latin1 NOT NULL,
`name` varchar(128) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`host` char(20) CHARACTER SET latin1 NOT NULL,
`user_id` int(11) NOT NULL,
`start_date` datetime NOT NULL,
`end_date` datetime NOT NULL,
`state` char(12) CHARACTER SET latin1 NOT NULL,
`value` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`auto_inc_key`),
KEY `host` (`host`),
KEY `hash_id` (`hash_id`),
KEY `user_id` (`user_id`,`request_entered_timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=239432 DEFAULT CHARSET=utf8mb4
(taken from PROCEDURE, but you catch the drift...)
INSERT INTO jobs_archive (SELECT * FROM jobs WHERE (TIMESTAMPDIFF(DAY, request_entered_timestamp, starttime) > days));

Not inserting Turkish characters in SQL table. Displays 'xxx??xxx' when selected

I am trying to add Turkish names on my table but then when displayed it gives me ? instead of any of them. Any help what I am missing here? This is my table:
CREATE TABLE IF NOT EXISTS `offerings` (
`dep` varchar(5) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`grade` varchar(4) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`section` varchar(3) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`name` varchar(100) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`teacher` varchar(50) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`quota` varchar(2) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`lec1` varchar(35) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`lec2` varchar(35) CHARACTER SET utf16 COLLATE utf16_turkish_ci DEFAULT NULL,
`lec3` varchar(35) DEFAULT NULL,
`lec4` varchar(35) DEFAULT NULL,
`lec5` varchar(35) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf16;
As suggested from the answer I choose here is the solution to the problem for whoever googles this topic. Special thanks to all who contributed in the solution of my problem.
CREATE TABLE IF NOT EXISTS `offerings` (
`dep` varchar(5) NOT NULL,
`grade` varchar(4) COLLATE utf8_unicode_ci NOT NULL,
`section` varchar(3) COLLATE utf8_unicode_ci NOT NULL,
`name` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
`teacher` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
`quota` varchar(2) COLLATE utf8_unicode_ci,
`lec1` varchar(35) DEFAULT NULL,
`lec2` varchar(35) DEFAULT NULL,
`lec3` varchar(35) DEFAULT NULL,
`lec4` varchar(35) DEFAULT NULL,
`lec5` varchar(35) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
(Beginnings of an answer...)
Please don't use utf16; there is virtually no reason for such in a MySQL table.
So, assuming you switch to utf8, let's see if we can get rid of the ? problems.
utf8 needs to be established in about 4 places.
The column(s) in the database -- Use SHOW CREATE TABLE to verify that they are explicitly set to utf8, or defaulted from the table definition. (It is not enough to change the database default.)
The connection between the client and the server. See SET NAMES utf8.
The bytes you have. (This is probably the case.)
If you are displaying the text in a web page, check the <meta> tag.
What probably happened:
you had utf8-encoded data (good)
SET NAMES latin1 was in effect (default, but wrong)
the column was declared CHARACTER SET latin1 (default, but wrong)
Since the CHARACTER SET disagrees with what you have shown, the problem is possibly more complex. Please provide
SELECT col, HEX(col) FROM tbl WHERE ...
for some simple cell with Turkish characters. With this, I may be able to figure out what happened.
Also, Reference notes on encodings.
VARCHARs are character strings, while NVARCHARS are Unicode character strings. NVARCHARS require more bits per character to store, but have a greater range. Try updating your data types. This should fix your problem.
EDIT This answer is wrong. The OP clearly asked for a MySQL solution, but the above applies only to SQL Server.

mysql weird stored data

Im having a strange problem with some data that is saved in mysql.
im saving a certain value as text in my db but for some reason some of the text is transformed to this "ATELOPHOBIA" kind of symbols.
the code:
$data = mysql_real_escape_string($_GET['data']);
$id = $_GET["id"];
mysql_query("INSERT INTO table (ID, name, pos,data) VALUES ('$id', '$regionname','$regionps[0]','$data')");
table structure:
CREATE TABLE IF NOT EXISTS `sim_scanner` (
`ID` varchar(64) COLLATE latin1_general_ci NOT NULL,
`simname` varchar(36) COLLATE latin1_general_ci NOT NULL,
`simpos` varchar(25) COLLATE latin1_general_ci NOT NULL,
`time` varchar(25) COLLATE latin1_general_ci NOT NULL,
`data` text COLLATE latin1_general_ci NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;
the data that is transforming itself to that is a value like this , <151.25,255.54,1254.12>(vector), as the data is generated by a script i don't know the exact value its basicly a characters position once it moves the data is another vector
thank you for your time
your mysql character encoding is different to the data character encoding that's why ...
for example try use the utf-8_general_ci in DB character set..
UTF-8 definition for the characters they support, usually UTF8 supports most of the characters

Very odd characters in a mysql dump -- what to do?

I've got weird data mangling my data migration. These weird characters are embedded as-is in the actual mysql dump file:
北京东方å›悦大酒店<br />\n<br />\n“The impetus
I've been given mysql data dumps with those kinds of chars in them. I'm importing the data into Drupal, by first recreating the mysql tables, and then querying against them using Drupal's Migrate module.
Code looks like this:
DROP TABLE IF EXISTS `news`;
SET #saved_cs_client = ##character_set_client;
SET character_set_client = utf8;
CREATE TABLE `news` (
`id` int(11) NOT NULL auto_increment,
`uid` int(11) NOT NULL,
`pid` int(11) default NULL,
`puid` int(11) default NULL,
`headline` varchar(255) NOT NULL,
`teaser` varchar(500) NOT NULL,
`status` char(1) default NULL,
`date` datetime NOT NULL,
`url` varchar(255) default NULL,
`url_title` varchar(255) default NULL,
`body` text,
`caption` varchar(255) default NULL,
`gid` int(11) default NULL,
`feature` text,
`related` varchar(255) default NULL,
`change1_time` int(11) default NULL,
`change2_time` int(11) default NULL,
`change1_user` varchar(255) default NULL,
`change2_user` varchar(255) default NULL,
`expires` datetime default NULL,
`rank` char(1) default NULL,
PRIMARY KEY (`id`),
KEY `uid` (`uid`),
KEY `status` (`status`),
KEY `expires` (`expires`),
KEY `rank` (`rank`),
KEY `puid` (`puid`),
FULLTEXT KEY `headline` (`headline`,`teaser`,`body`)
) ENGINE=MyISAM AUTO_INCREMENT=6976 DEFAULT CHARSET=utf8;
SET character_set_client = #saved_cs_client;
Fastest solution is the winner here -- I'm on a tight deadline, and really suffering over here! I've tried a search and replace solution, but there appear to be too many different types of weird data. I can orchestrate a new data dump, if I know what to tell them (how to do the data dump).
Thanks,
John
This isn't a direct answer to your question, but I played a bit with the mojibake you quoted in your post. It looks like it was originally Chinese text in UTF-8 encoding, which was interpreted as Latin text in Windows-1252 encoding, re-encoded in UTF-8 and again interpreted as Windows-1252 (and finally once more encoded as UTF-8 when you posted it here). So it's not just mojibake, it's double mojibake.
Also, at some point, a byte was lost from the middle of the string (probably because it was one of the undefined code points in Windows-1252), mangling one of the original characters. Running the text through the encoding chain in reverse (encode as Windows-1252, decode as UTF-8, repeat), I get the output:
北京东方�悦大酒店<br />\n<br />\n“The impetus
where the replacement character � stands for the mangled character.

Mysql german accents not-sensitive search in full-text searches

Let`s have a example hotels table:
CREATE TABLE `hotels` (
`HotelNo` varchar(4) character set latin1 NOT NULL default '0000',
`Hotel` varchar(80) character set latin1 NOT NULL default '',
`City` varchar(100) character set latin1 default NULL,
`CityFR` varchar(100) character set latin1 default NULL,
`Region` varchar(50) character set latin1 default NULL,
`RegionFR` varchar(100) character set latin1 default NULL,
`Country` varchar(50) character set latin1 default NULL,
`CountryFR` varchar(50) character set latin1 default NULL,
`HotelText` text character set latin1,
`HotelTextFR` text character set latin1,
`tagsforsearch` text character set latin1,
`tagsforsearchFR` text character set latin1,
PRIMARY KEY (`HotelNo`),
FULLTEXT KEY `fulltextHotelSearch` (`HotelNo`,`Hotel`,`City`,`CityFR`,`Region`,`RegionFR`,`Country`,`CountryFR`,`HotelText`,`HotelTextFR`,`tagsforsearch`,`tagsforsearchFR`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci;
In this table for example we have only one hotel with Region name = "Graubünden" (please note umlaut ü character)
And now I want to achieve same search match for phrases:
'graubunden' and
'graubünden'
This is simple with use of MySql built in
collations in regular searches as follows:
SELECT *
FROM `hotels`
WHERE `Region` LIKE CONVERT(_utf8 '%graubunden%' USING latin1)
COLLATE latin1_german1_ci
This works fine for 'graubunden' and 'graubünden' and
as a result I receive proper result, but problem is
when we make MySQL full text search
Whats wrong with this SQL statement?:
SELECT
*
FROM
hotels
WHERE
MATCH (`HotelNo`,`Hotel`,`Address`,`City`,`CityFR`,`Region`,`RegionFR`,`Country`,`CountryFR`, `HotelText`, `HotelTextFR`, `tagsforsearch`, `tagsforsearchFR`)
AGAINST( CONVERT('+graubunden' USING latin1) COLLATE latin1_german1_ci IN BOOLEAN MODE)
ORDER BY Country ASC, Region ASC, City ASC
This doesn`t return any result.
Any ideas where the dog is buried ?
When you define individual CHARACTER SETS for your columns, you override the collation you set default on table level.
Each of your columns has default latin1 collation (which is latin1_swedish_ci). You can see it by running SHOW CREATE TABLE.
In FULLTEXT queries, indexed columns have COERCIBILITY of 0, that is all fulltext queries are converted to the collation used in the index, not vice versa.
You need to remove CHARACTER SET definitions from your columns or explicitly set all columns to latin1_german_ci:
CREATE TABLE `hotels` (
`HotelNo` varchar(4) NOT NULL default '0000',
`Hotel` varchar(80) NOT NULL default '',
`City` varchar(100) default NULL,
`CityFR` varchar(100) default NULL,
`Region` varchar(50) default NULL,
`RegionFR` varchar(100) default NULL,
`Country` varchar(50) default NULL,
`CountryFR` varchar(50) default NULL,
`HotelText` text,
`HotelTextFR` text,
`tagsforsearch` text,
`tagsforsearchFR` text,
PRIMARY KEY (`HotelNo`),
FULLTEXT KEY `fulltextHotelSearch` (`HotelNo`,`Hotel`,`City`,`CityFR`,`Region`,`RegionFR`,`Country`,`CountryFR`,`HotelText`,`HotelTextFR`,`tagsforsearch`,`tagsforsearchFR`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci;
INSERT
INTO hotels (hotelText, HotelTextFR, tagsforsearch, tagsforsearchFR)
VALUES ('text', 'text', 'graubünden', 'tags');
SELECT *
FROM hotels
WHERE MATCH (`HotelNo`,`Hotel`,`City`,`CityFR`,`Region`,`RegionFR`,`Country`,`CountryFR`, `HotelText`, `HotelTextFR`, `tagsforsearch`, `tagsforsearchFR`)
AGAINST (CONVERT('+graubunden' USING latin1) COLLATE latin1_german1_ci IN BOOLEAN MODE)
ORDER BY
Country ASC, Region ASC, City ASC;