MySQL FULLTEXT aggravation

I'm having problems with case-sensitivity in MySQL FULLTEXT searches.
I've just followed the FULLTEXT example in the MySQL documentation at http://dev.mysql.com/doc/refman/5.1/en/fulltext-boolean.html . I'll reproduce it here for ease of reference:
CREATE TABLE articles (
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
title VARCHAR(200),
body TEXT,
FULLTEXT (title,body)
);
INSERT INTO articles (title,body) VALUES
('MySQL Tutorial','DBMS stands for DataBase ...'),
('How To Use MySQL Well','After you went through a ...'),
('Optimizing MySQL','In this tutorial we will show ...'),
('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
('MySQL vs. YourSQL','In the following database comparison ...'),
('MySQL Security','When configured properly, MySQL ...');
SELECT * FROM articles
WHERE MATCH (title,body)
AGAINST ('database' IN NATURAL LANGUAGE MODE);
My problem is that the manual shows this SELECT returning the first and fifth rows ('..DataBase..' and '..database..'), but I only get one row (the one containing 'database').
The example doesn't say what collation its table had, but I have ended up with latin1_general_cs on the title and body columns of my example table.
My version of MySQL is 5.1.39-log and the connection collation is utf8_unicode_ci.
I'd be really grateful if someone could suggest why my experience differs from the example in the manual.

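I guess that your default collation is case-sensitive somewhere, seeing as you ended up with latin1_general_cs in your table; perhaps it is set in the server's startup options? With a _cs collation on the indexed columns, FULLTEXT matching is case-sensitive, so 'database' no longer matches 'DataBase'.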
You can check using
show variables like 'collation%'
Which for me gives:
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
| collation_database | latin1_swedish_ci |
| collation_server | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)
So the example works as advertised on my server.
A column that doesn't declare a collation inherits the table's default, which in turn comes from the database default, which comes from the server default.
In other words, a collation specified at the column level overrides the table level, which overrides the database level, and so on.
Column collation is specified using this syntax:
col_name {CHAR | VARCHAR | TEXT} (col_length)
[CHARACTER SET charset_name]
[COLLATE collation_name]
See §9.1 of the MySQL documentation for the gory details.
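Applied to the articles table from the question, the fix might look like the sketch below. It assumes the data is latin1 and that case-insensitive matching is what you want; latin1_swedish_ci is used only because it is the default shown in the output above.

```sql
-- Inspect the collation each column actually ended up with.
SHOW FULL COLUMNS FROM articles;

-- Redeclare both indexed columns with a case-insensitive collation.
ALTER TABLE articles
  MODIFY title VARCHAR(200) CHARACTER SET latin1 COLLATE latin1_swedish_ci,
  MODIFY body  TEXT         CHARACTER SET latin1 COLLATE latin1_swedish_ci;

-- The manual's query should now match both 'DataBase' and 'database'.
SELECT * FROM articles
WHERE MATCH (title, body)
AGAINST ('database' IN NATURAL LANGUAGE MODE);
```

Both columns are changed in one statement because every column in a FULLTEXT index has to share the same character set and collation. If the server refuses the change, dropping the FULLTEXT index first and re-adding it afterwards is a reasonable fallback.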

Related

Use accent sensitive primary key in MySQL

Desired result :
Have an accent sensitive primary key in MySQL.
I have a table of unique words, so I use the word itself as a primary key (by the way, if someone can give me some advice about that, I have no idea whether it's good design/practice or not).
I need that field to be accent (and why not case) sensitive, because it must distinguish between, for instance, 'demandé' and 'demande', two different inflexions of the French verb "demander". I do not have any problem storing accented words in the database. I just can't insert two strings that are identical once their accents are removed.
Error :
When trying to create the 'demandé' row with the following query:
INSERT INTO `corpus`.`token` (`name_token`) VALUES ('demandé');
I got this error :
ERROR 1062: 1062: Duplicate entry 'demandé' for key 'PRIMARY'
Questions :
Where in the process should I make a modification in order to have two different unique primary keys for "demande" and "demandé" in that table?
SOLUTION: use 'collate utf8_bin' in the table declaration
How can I make accent-sensitive queries? Is the following the right way:
SELECT * FROM corpus.token WHERE name_token = 'demandé' COLLATE utf8_bin
SOLUTION: use 'collate utf8_bin' in the WHERE clause
I found that I can achieve this by using the BINARY keyword (see this sqlFiddle). What is the difference between COLLATE and BINARY?
Can I protect other tables from any changes? (I'll have to rebuild that table anyway, because it's kind of messy.)
I'm not very comfortable with encoding in MySQL. I don't have any problem yet with encoding in that database (and I'm kind of lucky, because my data might not always use the same encoding... and there is not much I can do about it). I have a feeling that any modification regarding that "accent sensitive" issue might create encoding issues with other queries or with data integrity. Am I right to be concerned?
Step by step :
Database creation :
CREATE DATABASE corpus DEFAULT CHARACTER SET utf8;
Table of unique words :
CREATE TABLE token (name_token VARCHAR(50), freq INTEGER, CONSTRAINT pk_token PRIMARY KEY (name_token))
Queries
SELECT * FROM corpus.token WHERE name_token = 'demande';
SELECT * FROM corpus.token WHERE name_token = 'demandé';
Both return the same row:
demande
Collations. You have two choices, not three:
utf8_bin treats all of these as different: demandé and demande and Demandé.
utf8_..._ci (typically utf8_general_ci or utf8_unicode_ci) treats all of these as the same: demandé and demande and Demandé.
If you want only case sensitivity (demandé = demande, but neither match Demandé), you are out of luck.
If you want only accent sensitivity (demandé = Demandé, but neither match demande), you are out of luck.
Declaration. The best way to do whatever you pick:
CREATE TABLE (
name VARCHAR(...) CHARACTER SET utf8 COLLATE utf8_... NOT NULL,
...
PRIMARY KEY(name)
)
Don't change collation on the fly. A comparison like the following won't use the index (that is, it will be slow) if the collation in the query differs from the one declared on name:
WHERE name = ... COLLATE ...
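For the token table from the question, declaring the collation once might look like this. It is a sketch that assumes the table holds utf8 data and that 'demande' is already present; the freq value is made up for illustration.

```sql
-- Redeclare the key column as binary-collated so that
-- 'demande' and 'demandé' become distinct key values.
ALTER TABLE corpus.token
  MODIFY name_token VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL;

-- The accented form can now be inserted next to the unaccented one.
INSERT INTO corpus.token (name_token, freq) VALUES ('demandé', 1);

-- Plain equality is now accent- and case-sensitive and can use the primary key.
SELECT * FROM corpus.token WHERE name_token = 'demandé';
```

Only this one column is altered, so other tables keep their current collations.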
BINARY. The datatypes BINARY, VARBINARY and BLOB are very much like CHAR, VARCHAR, and TEXT with COLLATE ..._bin. Perhaps the only difference is that text is checked for valid utf8 when stored into a VARCHAR ... COLLATE ..._bin column, but is not checked when stored into a VARBINARY column. Comparisons (WHERE, ORDER BY, etc.) behave the same; that is, they simply compare the bits, without case folding or accent stripping.
Maybe you need this:
_ci in a collation name means case-insensitive.
If your searches on that field are always going to be case-sensitive, then declare the collation of the field as utf8_bin; that will compare the utf8-encoded bytes for equality.
col_name varchar(10) collate utf8_bin
If searches are normally case-insensitive, but you want to make an exception for this search, try:
WHERE col_name = 'demandé' collate utf8_bin
Try this
mysql> SET NAMES 'utf8' COLLATE 'utf8_general_ci';
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE TABLE t1
-> (c1 CHAR(1) CHARACTER SET UTF8 COLLATE utf8_general_ci);
Query OK, 0 rows affected (0.01 sec)
mysql> INSERT INTO t1 VALUES ('a'),('A'),('À'),('á');
Query OK, 4 rows affected (0.00 sec)
Records: 4 Duplicates: 0 Warnings: 0
mysql> SELECT c1, HEX(c1), HEX(WEIGHT_STRING(c1)) FROM t1;
+------+---------+------------------------+
| c1 | HEX(c1) | HEX(WEIGHT_STRING(c1)) |
+------+---------+------------------------+
| a | 61 | 0041 |
| A | 41 | 0041 |
| À | C380 | 0041 |
| á | C3A1 | 0041 |
+------+---------+------------------------+
4 rows in set (0.00 sec)
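For contrast, the same kind of check can be made per comparison. This sketch assumes a utf8 connection (hence the SET NAMES) and should show the two strings comparing equal under utf8_general_ci but not under utf8_bin:

```sql
SET NAMES 'utf8';

-- COLLATE binds more tightly than =, so it applies to the right-hand literal
-- and decides which collation the comparison uses.
SELECT 'demandé' = 'demande' COLLATE utf8_general_ci AS equal_general_ci,
       'demandé' = 'demande' COLLATE utf8_bin        AS equal_bin;
```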

MySQL column name is a weird character - how do I change it?

I'm examining a table in MySQL that has a weird column name. I want to change the name of the column to not be weird. I can't figure out how to do so.
First, if I run
SET NAMES utf8;
DESC `tblName`;
I get
| Ԫ | varchar(255) | YES | MUL | NULL | |
Instead, doing
SET NAMES latin1;
DESC `tblName`;
Results in
| ? | varchar(255) | YES | MUL | NULL | |
Fair enough: this makes me think the column name is simply a latin1 question mark. But this statement doesn't work:
mysql> ALTER TABLE `tblName` CHANGE COLUMN `?` `newName` VARCHAR(255);
ERROR 1054 (42S22): Unknown column '?' in 'tblName'
So I went to the information_schema table for some info:
mysql> SELECT column_name, HEX(column_name), ordinal_position FROM information_schema.columns WHERE table_schema = 'myschema' AND table_name = 'tblName' ;
| ? | D4AA | 48 |
I looked up this hex point, and assuming I looked it up correctly (which may not be true), I determined this character is "풪" which is the "hangul syllable pweoj". So I tried that in an alter table statement to no avail:
ALTER TABLE `tblName` change column `풪` `newName` VARCHAR(255);
So that's where I'm stuck.
I figured out a way to do this (but I wonder if there's a better solution?)
I did a SHOW CREATE statement:
mysql> SHOW CREATE TABLE `tblName`;
...
`Ԫ` varchar(255) DEFAULT NULL,
I looked for the column in question, which was printed strangely (what you see above doesn't quite match it). The closing backtick wasn't visible. But I highlighted what was visible and pasted that into my ALTER TABLE and that finally fixed the issue.
I believe that Ԫ (displayed as a question mark in a box) appears that way because your system has no font with a glyph for that code point. From your HEX(column_name) we can see that the stored bytes are 0xD4AA, which is the UTF-8 encoding of the name. This decodes to Unicode code point U+052A (not U+D4AA, which is why the Hangul character did not match), for which I don't have a font on my Windows box either.
Setting the character set to latin1 simply meant that MySQL was unable to translate that character to a latin1/cp1252 value, so it replaced it with "?". (The bytes 0xD4AA could just as easily be read as two cp1252 characters, "Ôª". MySQL did not do that, presumably because it knows identifiers are stored as UTF-8.)
Now, how to rename the column? It should be as simple as you say, with ALTER TABLE ... CHANGE COLUMN. However, it seems that the mysql console doesn't play nicely with non-ASCII characters, especially the variable-length characters found in UTF-8.
The solution was to pass the SQL as an argument to mysql from Bash instead. For example (ensure the terminal encoding is UTF-8 before pasting Ԫ):
mysql --default-character-set=utf8 -e "ALTER TABLE test change column Ԫ test varchar(255);" test
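If pasting the character is unreliable, another option is to build the statement from the bytes that information_schema reported, so the character never has to be typed. This is an untested sketch: it reuses tblName, newName and myschema from the question and assumes the name really is the UTF-8 byte sequence D4AA.

```sql
SET NAMES 'utf8';

-- Rebuild the column name from its UTF-8 bytes instead of typing it.
SET @old_name = CONVERT(UNHEX('D4AA') USING utf8);

SET @sql = CONCAT('ALTER TABLE `tblName` CHANGE COLUMN `', @old_name,
                  '` `newName` VARCHAR(255)');

PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

-- Confirm that no multi-byte column names remain in the table:
-- a name with non-ASCII characters has more bytes than characters.
SELECT column_name, HEX(column_name)
FROM information_schema.columns
WHERE table_schema = 'myschema'
  AND table_name = 'tblName'
  AND LENGTH(column_name) <> CHAR_LENGTH(column_name);
```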

MySQL view - Illegal mix of collations

I'll be very clear: how do I create views in MySQL without getting the damned "Illegal mix of collations" error?
My SQL code looks like this (it has some Portuguese words), and my database default collation is latin1_swedish_ci:
CREATE VIEW v_veiculos AS
SELECT
v.id,
v.marca_id,
v.modelo,
v.placa,
v.cor,
CASE v.combustivel
WHEN 'A' THEN 'Álcool'
WHEN 'O' THEN 'Óleo Diesel'
WHEN 'G' THEN 'Gasolina'
ELSE 'Não Informado'
END AS combustivel,
marcas.marca,
/* I think the CONCAT and COALESCE below cause this error; without the next line the view works fine */
CONCAT(marca, ' ', v.modelo, ' - Placa: ', v.placa, ' - Combustível: ', COALESCE(v.combustivel, 'Não informado')) AS info_completa
FROM veiculos v
LEFT JOIN
marcas on(marcas.id = v.marca_id);
I think the error is caused by my use of COALESCE and/or CONCAT, as the full error description tells me: Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation 'coalesce'
You may also use CAST() to convert a string to a different character set. The syntax is:
CAST(character_string AS character_data_type CHARACTER SET charset_name)
e.g.:
SELECT CAST(_latin1'test' AS CHAR CHARACTER SET utf8);
Alternative: use CONVERT(expr USING transcoding_name)
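Applied to the view from the question, that could look like the sketch below. It assumes the table columns are latin1 (as the error message suggests) and converts only the literals that contain non-ASCII characters; the rest of the view is left out for brevity.

```sql
-- Try the expression as a plain SELECT before putting it back into CREATE VIEW.
SELECT
  CONCAT(
    marcas.marca, ' ', v.modelo,
    ' - Placa: ', v.placa,
    CONVERT(' - Combustível: ' USING latin1),
    COALESCE(v.combustivel, CONVERT('Não informado' USING latin1))
  ) AS info_completa
FROM veiculos v
LEFT JOIN marcas ON marcas.id = v.marca_id;
```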
This is kind of old, but I had this same error.
As far as I know, views do not have a collation of their own; the tables do.
So if you get the "illegal mix..." error, it is because your view is joining or comparing two tables with different collations.
The thing is, if you create a table you can specify the collation, for instance
CREATE TABLE IF NOT EXISTS `vwHotelCode_Terminal` (
`HOTELCODE` varchar(8)
,`TERMINALCODE` varchar(5)
,`DISTKM` varchar(6)
,`DISTMIN` varchar(3)
,`TERMINALNAME` varchar(50)
)ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_spanish_ci ;
But if you don't, the default collation is applied. For me the default collation is utf8_unicode_ci, so tables I created without specifying one got that collation, and I ended up with some tables in utf8_spanish_ci and the rest in utf8_unicode_ci.
If you are exporting from one server to another and the default collation is different, you are probably going to get the "illegal mix" message.
If you have views, phpMyAdmin likes to create a placeholder table for each view first and then the views themselves. Those tables are created without a collation, so they take the default one, and the view created afterwards often ends up mixing collations.
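To see which collations are actually in play before and after such an import, a query like the following can help. It is a sketch: some_table is a hypothetical name, and utf8_spanish_ci is just the collation from the example above.

```sql
-- Count string columns per table and collation in the current database.
SELECT TABLE_NAME, COLLATION_NAME, COUNT(*) AS string_columns
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND COLLATION_NAME IS NOT NULL
GROUP BY TABLE_NAME, COLLATION_NAME
ORDER BY COLLATION_NAME, TABLE_NAME;

-- Bring a stray table in line with the others (hypothetical table name).
ALTER TABLE some_table CONVERT TO CHARACTER SET utf8 COLLATE utf8_spanish_ci;
```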
That is actually a bug in MySQL.
Maybe you can update to the latest version of MySQL?
After searching around for a while and taking information from this answer, I found a hack that could be useful.
Simply check your server's character set variables with the command below:
SHOW VARIABLES LIKE "char%";
You'll see something like this:
mysql> SHOW VARIABLES LIKE "char%";
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 | <--
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
I just changed character_set_server, the server's default character set (character_set_system, marked above, is read-only), then copied the CREATE code of the view and created a new view from it, and that's all.
The idea is that the newly created view picks up the character set now in effect, which resolves the mismatch.
Use the command below to set the default character set:
SET character_set_server = 'latin2';
This worked in my case.
NOTE: Alternatively you can change the character set of that view. That would also do the trick but I wasn't able to find the solution so I used this hack.
REFERENCE: Read more on Illegal Collation Mix on MariaDB.
A CITATION FROM Illegal Collation Mix on MariaDB:
If you encounter this issue, set the character set in the view to force it to the value you want.
Read more about Collation and Character Sets here.
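To see which connection settings an existing view was created under, the stored values can be read back. A minimal sketch, assuming the view lives in the current database:

```sql
-- Each view records the client character set and connection collation
-- that were in effect when it was created.
SELECT TABLE_NAME, CHARACTER_SET_CLIENT, COLLATION_CONNECTION
FROM INFORMATION_SCHEMA.VIEWS
WHERE TABLE_SCHEMA = DATABASE();
```

If these differ from the collation of the underlying columns, string literals inside the view are a likely source of the conflict.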

How to search in mysql so that accented character is same as non-accented?

I'd like to have:
piščanec = piscanec in mysql. I mean, I'd like to search for piscanec to find piščanec also.
So the č and c would be same, š and s etc...
I know it can be done using regexp, but this is slow :-( Any other way with LIKE? I am also using full text searches a lot.
UPDATE:
select CONVERT('čšćžđ' USING ascii) as text
does not work. Produces: ?????
Declare the column with the collation utf8_general_ci. This collation considers š equal to s and č equal to c:
create temporary table t (t varchar(100) collate utf8_general_ci);
insert into t set t = 'piščanec';
insert into t set t = 'piscanec';
select * from t where t='piscanec';
+------------+
| t |
+------------+
| piščanec |
| piscanec |
+------------+
If you don't want to or can't use the utf8_general_ci collation for the column (maybe you have a unique index on the column and want to treat piščanec and piscanec as distinct), you can apply the collation in the query only:
create temporary table t (t varchar(100) collate utf8_bin);
insert into t set t = 'piščanec';
insert into t set t = 'piscanec';
select * from t where t='piscanec';
+------------+
| t |
+------------+
| piscanec |
+------------+
select * from t where t='piscanec' collate utf8_general_ci;
+------------+
| t |
+------------+
| piščanec |
| piscanec |
+------------+
The FULLTEXT index is supposed to use the column collation directly; you don't need to define a new collation. Apparently the fulltext index can only be in the column's storage collation, so if you want to use utf8_general_ci for searches and utf8_slovenian_ci for sorting, you have to use COLLATE in the ORDER BY:
select * from tab order by col collate utf8_slovenian_ci;
It's not straightforward, but you're probably best off creating your own collation for your fulltext searches. Here is an example:
http://dev.mysql.com/doc/refman/5.5/en/full-text-adding-collation.html
with more info here:
http://dev.mysql.com/doc/refman/5.5/en/adding-collation.html
That way, you have your collation logic completely independent of your SQL and business logic, and you're not having to do any heavy-lifting yourself with SQL-workarounds.
EDIT: since collations are used for all string-matching operations, this may not be the best way to go: you will end up obfuscating differences between characters that are linguistically discrete.
If you want to suppress these differences for specific operations, then you might consider writing a function that takes a string and replaces, in a targeted way, characters which, for the purposes of the current operation, are to be considered identical.
You could define one table holding your base characters (š, č etc.) and another holding the equivalences. Then run a REPLACE over your string.
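A minimal sketch of that idea, with the equivalences hard-coded as nested REPLACE calls rather than read from a table. The character list is only the one from the question and would need extending, including uppercase forms.

```sql
-- REPLACE matches case-sensitively, so each accented letter is
-- substituted individually by its base letter.
SELECT REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
         'čšćžđ',
         'č', 'c'), 'š', 's'), 'ć', 'c'), 'ž', 'z'), 'đ', 'd') AS folded;
```

The same expression could be wrapped in a stored function, or applied once to fill a separate searchable column so that queries do not repeat the work.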
Another way is to CONVERT your string to ASCII, which suppresses all non-ASCII characters. Note that, as the update in the question shows, each of them becomes ? rather than its base letter.
e.g.
SELECT CONVERT('<your text here>' USING ascii) as as_ascii

mysql check collation of a table

How can I see what collation a table has? That is, I want to see:
+-----------------------------+
| table | collation |
|-----------------------------|
| t_name | latin_general_ci |
+-----------------------------+
SHOW TABLE STATUS shows information about a table, including the collation.
For example SHOW TABLE STATUS where name like 'TABLE_NAME'
The above answer is great, but it doesn't actually provide an example that saves the user from having to look up the syntax:
show table status like 'test';
Where test is the table name.
Checking the collation of a specific table
You can query INFORMATION_SCHEMA.TABLES and get the collation for a specific table:
SELECT TABLE_SCHEMA
, TABLE_NAME
, TABLE_COLLATION
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME = 't_name';
This gives much more readable output than SHOW TABLE STATUS, which contains a lot of irrelevant information.
Checking the collation of columns
Note that collation can also be applied to columns (which might have a different collation than the table itself). To fetch the columns' collation for a particular table, you can query INFORMATION_SCHEMA.COLUMNS:
SELECT TABLE_SCHEMA
, TABLE_NAME
, COLUMN_NAME
, COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 't_name';
For more details you can refer to the article How to Check and Change the Collation of MySQL Tables
Use this query:
SHOW CREATE TABLE tablename
You will get the full table definition, including its default character set and collation.
Check collation of the whole database
If someone is looking here also for a way to check collation on the whole database:
use mydatabase; (where mydatabase is the name of the database you're going to check)
SELECT @@character_set_database, @@collation_database;
You should see the result like:
+--------------------------+----------------------+
| @@character_set_database | @@collation_database |
+--------------------------+----------------------+
| utf8mb4 | utf8mb4_unicode_ci |
+--------------------------+----------------------+
1 row in set (0.00 sec)
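The same information is available without switching databases by querying INFORMATION_SCHEMA.SCHEMATA; mydatabase is the placeholder name from above:

```sql
SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM INFORMATION_SCHEMA.SCHEMATA
WHERE SCHEMA_NAME = 'mydatabase';
```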