How do I insert Chinese characters into MySQL from a script? - mysql

I have the following script
set username utf8;
insert into tables values ('Active','活跃')
However, after the script runs, the inserted value for the Chinese characters is
活跃
What did I miss here?

You have to change the collation to a Chinese one.
For example, with the Big5 character set:
CREATE TABLE big5 (BIG5 CHAR(1) CHARACTER SET BIG5);
mysql> INSERT INTO big5 VALUES (0xf9dc);
mysql> SELECT * FROM big5;
+------+
| big5 |
+------+
| 嫺 |
+------+
See also: MySQL Big5 Chinese character set

In the database, also set the collation of the fields to utf8_general_ci. If you can set the collation of the database and tables to utf8_general_ci as well, that is even better.

Related

Mariadb query utf8 escaped string

I am using 5.5.65-MariaDB MariaDB Server.
I have a table with a column of type medium text, named "remoteData", where I store a json string.
String values in this json string are stored as escaped utf8 sequences, for example
"patientFirstName":"\u0395\u039b\u0395\u03a5\u0398\u0395\u03a1\u0399\u039f\u03a3"
The above value is the Greek Name "ΕΛΕΥΘΕΡΙΟΣ".
I am trying to search this column using the query
Select * from sync_details where remoteData like "%ΛΕΥΘΕΡ%"
but I get an empty set.
I assume this is because of the values being escaped but I don't know what to do.
EDIT: The query will run through php so we can use a solution that includes php functions.
Thank you in advance.
Christoforos
With a database defined to use CHARACTER SET utf8 and a utf8_general_ci collation, it should just work like this:
CREATE DATABASE IF NOT EXISTS `test` CHARACTER SET utf8 COLLATE utf8_general_ci;
CREATE TABLE `test`.`sync_details` (`remoteData` MEDIUMTEXT);
INSERT INTO `test`.`sync_details` (`remoteData`) VALUES ('{"patientFirstName":"\\u0395\\u039b\\u0395\\u03a5\\u0398\\u0395\\u03a1\\u0399\\u039f\\u03a3"}');
SELECT `remoteData` FROM `test`.`sync_details` WHERE `remoteData` LIKE '%ΛΕΥΘΕΡ%';
+----------------------------------------------+
| remoteData |
+----------------------------------------------+
| {"patientFirstName": "ΕΛΕΥΘΕΡΙΟΣ"} |
+----------------------------------------------+
1 row in set (0,00 sec)
You could also try JSON_EXTRACT to get structured data from the stored JSON object. I just tested it like this:
SELECT JSON_EXTRACT(`remoteData`, "$.patientFirstName")
FROM `test`.`sync_details`
WHERE JSON_EXTRACT(`remoteData`, "$.patientFirstName")
LIKE '%ΛΕΥΘΕΡ%';
+--------------------------------------------------+
| JSON_EXTRACT(`remoteData`, "$.patientFirstName") |
+--------------------------------------------------+
| "ΕΛΕΥΘΕΡΙΟΣ" |
+--------------------------------------------------+
1 row in set (0,00 sec)
To index data in the JSON object you could add a "Generated Column" to your table using the GENERATED ALWAYS syntax
ALTER TABLE `test`.`sync_details` ADD COLUMN `firstName` VARCHAR(100) GENERATED ALWAYS AS (`remoteData` ->> '$.patientFirstName');
CREATE INDEX `firstnames_idx` ON `test`.`sync_details`(`firstName`);
SELECT `firstName` FROM `test`.`sync_details` WHERE `firstName` LIKE '%ΛΕΥΘΕΡ%';
+----------------------+
| firstName |
+----------------------+
| ΕΛΕΥΘΕΡΙΟΣ |
+----------------------+
1 row in set (0,00 sec)
This will only work with MariaDB >= 10.2 and with a utf8 encoded db and a utf8_general_ci collation.
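Since the question says the query runs from application code, another option on older servers is to escape the search term the same way the JSON was escaped and search for that text. The following is only a sketch in Python (the question uses PHP); the function name `like_pattern_for_escaped_json` is hypothetical, and it assumes the stored escapes use lowercase hex digits, as in the example value from the question.

```python
import json


def like_pattern_for_escaped_json(term):
    """Build a LIKE pattern that matches `term` inside \\uXXXX-escaped JSON text."""
    # json.dumps escapes non-ASCII characters as \uXXXX with lowercase hex
    # digits; strip the surrounding double quotes it adds.
    escaped = json.dumps(term)[1:-1]
    # Backslash is the default LIKE escape character, so double it, then
    # escape the LIKE wildcards themselves.
    escaped = escaped.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return "%" + escaped + "%"


# The raw text as it sits in the remoteData column (literal backslashes).
stored = ('{"patientFirstName":"\\u0395\\u039b\\u0395\\u03a5\\u0398'
          '\\u0395\\u03a1\\u0399\\u039f\\u03a3"}')

# The escaped form of the search term is a plain substring of the stored text.
term_escaped = json.dumps("\u039b\u0395\u03a5\u0398\u0395\u03a1")[1:-1]
found = term_escaped in stored

pattern = like_pattern_for_escaped_json("\u039b\u0395\u03a5\u0398\u0395\u03a1")
```

The resulting pattern is meant to be passed as a bound parameter (WHERE remoteData LIKE ?), not concatenated into the SQL string, which would need yet another level of backslash escaping.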

Can I convert user input language to default collation of database?

I want to search for user input in my database. The database collation is latin1_swedish_ci. I don't want to change that; instead, can I convert the UTF-8 user input to latin1_swedish_ci?
Edit:
I tried two methods.
Method 1: I imported and used default collation latin1_swedish_ci and character set latin1. Then I have
Here I can query like SELECT * FROM dict WHERE english_word = '$_value' and I get all the values of column including malayalam_definition in the browser as desired. But problem is I can't query like SELECT * FROM dict WHERE malayalam_definition = '$_value'. It returns no result.
Method 2: I changed collation to utf8_unicode_ci and character set to utf8. Then in mysql I get desired values like
Here, when I query like SELECT * FROM dict WHERE english_word = '$_value' in the browser, I get question marks in the malayalam_definition values like
Result of SHOW VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name | Value |
+--------------------------+--------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
+--------------------------+--------+
7 rows in set (0.00 sec)
Do I need to change character_set_server? If so, how do I do it?
First of all, the "database collation" is only a default. The real question is what is the CHARACTER SET of the columns that you are interested in.
Then, what are the bytes in your client? Are they encoded as latin1? Or utf8? In either case, tell MySQL that that is what is coming at it. This is preferably done in the connection parameters. (What is your client language?) Alternatively, use SET NAMES latin1 or SET NAMES utf8, according to the client encoding.
Now, what MySQL will do on INSERT and SELECT... It will convert the encoding from the client's encoding to the column's encoding as you do an INSERT. No further action is needed to achieve this.
Similarly, MySQL will convert the other way during a SELECT.
(Of course, if the column and the client are talking the same encoding, no "convert" is needed.)
Your question mentions "collation". So far, I have only talked about CHARACTER SETs, also known as "encodings". In contrast, the rules for sorting and comparing two strings are the COLLATION.
For the CHARACTER SET latin1, the default COLLATION is latin1_swedish_ci.
For the CHARACTER SET utf8, the default COLLATION is utf8_general_ci.
There are several different "collations" to handle the quirks of German or Turkish or Spanish or (etc) orderings.
Please explain why you are trying to do what you stated. There are many ways you can do it wrong, so I do not want to give you an ALTER statement -- it may just make things worse for the real goal.
It is better to use utf8mb4 instead of utf8. The outside world refers to UTF-8; this is equivalent to MySQL's utf8mb4.
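The practical difference is that MySQL's utf8 (utf8mb3) stores at most three bytes per character, which covers only the Basic Multilingual Plane, while utf8mb4 also stores the four-byte characters. A small Python sketch of that boundary (the helper name `needs_utf8mb4` is made up for illustration):

```python
def needs_utf8mb4(text):
    """Return True if `text` has a character outside the Basic Multilingual Plane.

    Such characters take four bytes in UTF-8 and do not fit in MySQL's
    three-byte utf8 (utf8mb3) character set.
    """
    return any(ord(ch) > 0xFFFF for ch in text)


accented = "demand\u00e9"      # e-acute is two bytes in UTF-8
emoji = "\U0001F600"           # an emoji is four bytes in UTF-8

accented_needs_mb4 = needs_utf8mb4(accented)
emoji_needs_mb4 = needs_utf8mb4(emoji)
emoji_byte_length = len(emoji.encode("utf-8"))
```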
Edit (after OP's Edit)
The first screenshot shows "Mojibake". Another screenshot shows question marks. The causes of each are covered in "Trouble with UTF-8 characters; what I see is not what I stored".
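Both symptoms can be reproduced outside MySQL. The sketch below uses Python's cp1252 codec as a stand-in for MySQL's latin1; the variable names are only illustrative. Mojibake appears when UTF-8 bytes are read as if they were latin1, and question marks appear when text is converted into a character set that cannot represent it.

```python
original = "demand\u00e9"

# Mojibake: the client sends UTF-8 bytes, but they are interpreted as cp1252.
# The two bytes of the e-acute turn into two separate characters.
mojibake = original.encode("utf-8").decode("cp1252")

# Mojibake is usually recoverable by reversing the wrong step.
repaired = mojibake.encode("cp1252").decode("utf-8")

# Question marks: Malayalam has no representation in cp1252, so every
# character is replaced during conversion and the information is lost.
word = "\u0d2e\u0d32\u0d2f\u0d3e\u0d33\u0d02"
question_marks = word.encode("cp1252", errors="replace").decode("cp1252")
```

This is why the first case can often be repaired after the fact, while the second requires reloading the data with the correct settings.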

Use accent sensitive primary key in MySQL

Desired result :
Have an accent sensitive primary key in MySQL.
I have a table of unique words, so I use the word itself as a primary key (by the way, if someone can give me advice about this, I have no idea whether it's good design/practice or not).
I need that field to be accent (and why not case) sensitive, because it must distinguish between, for instance, 'demandé' and 'demande', two different inflexions of the French verb "demander". I do not have any problem storing accented words in the database. I just can't insert two accented character strings that are identical when unaccented.
Error :
When trying to create the 'demandé' row with the following query:
INSERT INTO `corpus`.`token` (`name_token`) VALUES ('demandé');
I got this error :
ERROR 1062: 1062: Duplicate entry 'demandé' for key 'PRIMARY'
Questions :
Where in the process should I make a modification in order to have two different unique primary keys for "demande" and "demandé" in that table?
SOLUTION using 'collate utf8_general_ci' in table declaration
How can I make accent sensitive queries? Is the following the right way:
SELECT * FROM corpus.token WHERE name_token = 'demandé' COLLATE utf8_bin
SOLUTION using 'collate utf8_bin' with WHERE statement
I found that I can achieve this by using the BINARY keyword (see this sqlFiddle). What is the difference between COLLATE and BINARY?
Can I preserve other tables from any changes? (I'll have to rebuild that table anyway, because it's kind of messy.)
I'm not very comfortable with encoding in MySQL. I don't have any problem yet with encoding in that database (and I'm kind of lucky because my data might not always use the same encoding... and there is not much I can do about it). I have a feeling that any modification regarding to that "accent sensitive" issue might create some encoding issue with other queries or data integrity. Am I right to be concerned?
Step by step :
Database creation :
CREATE DATABASE corpus DEFAULT CHARACTER SET utf8;
Table of unique words :
CREATE TABLE token (name_token VARCHAR(50), freq INTEGER, CONSTRAINT pk_token PRIMARY KEY (name_token))
Queries
SELECT * FROM corpus.token WHERE name_token = 'demande';
SELECT * FROM corpus.token WHERE name_token = 'demandé';
Both return the same row:
demande
Collations. You have two choices, not three:
utf8_bin treats all of these as different: demandé and demande and Demandé.
utf8_..._ci (typically utf8_general_ci or utf8_unicode_ci) treats all of these as the same: demandé and demande and Demandé.
If you want only case sensitivity (demandé = demande, but neither match Demandé), you are out of luck.
If you want only accent sensitivity (demandé = Demandé, but neither match demande), you are out of luck.
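The two choices can be imitated in Python to see what each comparison does. This is only an approximation for illustration: `equal_bin` compares the encoded bytes, and `equal_ci` strips combining marks and case, which resembles the behaviour of a _ci collation for these words but is not MySQL's actual weight table. Both function names are made up here.

```python
import unicodedata


def fold_accents_and_case(text):
    """Drop combining marks and case, roughly like a _ci collation would."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_marks.casefold()


def equal_bin(a, b):
    """Compare like utf8_bin: the UTF-8 encoded bytes must be identical."""
    return a.encode("utf-8") == b.encode("utf-8")


def equal_ci(a, b):
    """Compare after folding: accents and case are ignored."""
    return fold_accents_and_case(a) == fold_accents_and_case(b)


words = ["demand\u00e9", "demande", "Demand\u00e9"]

# Every pair of distinct words, compared both ways.
pairs = [(a, b) for i, a in enumerate(words) for b in words[i + 1:]]
bin_results = [equal_bin(a, b) for a, b in pairs]
ci_results = [equal_ci(a, b) for a, b in pairs]
```

With the binary comparison all three words are distinct, so all three could coexist under a PRIMARY KEY; with the folded comparison they collide, which is the duplicate-entry error from the question.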
Declaration. The best way to do whatever you pick:
CREATE TABLE (
name VARCHAR(...) CHARACTER SET utf8 COLLATE utf8_... NOT NULL,
...
PRIMARY KEY(name)
)
Don't change collation on the fly. This won't use the index (that is, it will be slow) if the collation differs from the one declared for name:
WHERE name = ... COLLATE ...
BINARY. The datatypes BINARY, VARBINARY and BLOB are very much like CHAR, VARCHAR, and TEXT with COLLATE ..._bin. Perhaps the only difference is that text is checked for valid utf8 when storing into a VARCHAR ... COLLATE ..._bin, but it is not checked when storing into VARBINARY.... Comparisons (WHERE, ORDER BY, etc) will be the same; that is, they simply compare the bits, without case folding or accent stripping.
Maybe you need this:
_ci in a collation name means case-insensitive.
If your searches on that field are always going to be case-sensitive, then declare the collation of the field as utf8_bin... that'll compare for equality the utf8-encoded bytes.
col_name varchar(10) collate utf8_bin
If searches are normally case-insensitive, but you want to make an exception for this search, try:
WHERE col_name = 'demandé' collate utf8_bin
More here
Try this
mysql> SET NAMES 'utf8' COLLATE 'utf8_general_ci';
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE TABLE t1
-> (c1 CHAR(1) CHARACTER SET UTF8 COLLATE utf8_general_ci);
Query OK, 0 rows affected (0.01 sec)
mysql> INSERT INTO t1 VALUES ('a'),('A'),('À'),('á');
Query OK, 4 rows affected (0.00 sec)
Records: 4 Duplicates: 0 Warnings: 0
mysql> SELECT c1, HEX(c1), HEX(WEIGHT_STRING(c1)) FROM t1;
+------+---------+------------------------+
| c1 | HEX(c1) | HEX(WEIGHT_STRING(c1)) |
+------+---------+------------------------+
| a | 61 | 0041 |
| A | 41 | 0041 |
| À | C380 | 0041 |
| á | C3A1 | 0041 |
+------+---------+------------------------+
4 rows in set (0.00 sec)

COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'binary'?

mysql> SELECT LOCATE("n", "München") COLLATE utf8_general_ci;
ERROR 1253 (42000): COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'binary'
How do I get rid of this error?
What I already tried (copy&paste):
$ mysql -u admin -p $DATABASE
Enter password:
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.1.69 Source distribution
Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> SELECT LOCATE("n", "München") COLLATE utf8_general_ci;
ERROR 1253 (42000): COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'binary'
mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT LOCATE("n", "München") COLLATE utf8_general_ci;
ERROR 1253 (42000): COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'binary'
mysql> SELECT LOCATE(_utf8"n", _utf8"München") COLLATE utf8_general_ci;
ERROR 1253 (42000): COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'binary'
mysql> SHOW VARIABLES LIKE "character_set_database";
+------------------------+-------+
| Variable_name | Value |
+------------------------+-------+
| character_set_database | utf8 |
+------------------------+-------+
1 row in set (0.00 sec)
Possibly the server has been compiled with a default character set of binary, so that string literals are being interpreted as such, or the client is set to use a binary mode when communicating with the server. You can change the client and connection character set by calling SET NAMES utf8 (though this is not recommended if your SQL statements are being issued from PHP, for example, as PHP will have its own commands for setting the connection character set). See Connection Character Sets and Collations in the MySQL reference manual.
Alternatively you can use "introducers" to specify explicitly the charset used for the string literals in your LOCATE function, for instance:
LOCATE(_utf8"n", _utf8"München")
See the reference manual page Character String Literal Character Set and Collation for more details.
The COLLATE in my example sets the collation of the return value of
LOCATE, the result of which is of type binary.
To set the collation of the arguments:
mysql> SELECT LOCATE(_utf8"n" COLLATE utf8_general_ci,
_utf8"München" COLLATE utf8_general_ci) AS locate;
+--------+
| locate |
+--------+
| 3 |
+--------+
1 row in set (0.00 sec)
My motivation actually was finding out whether MySQL takes the collation
into account when searching for the substring. Unfortunately it does
not. See the result of the second command:
mysql> SELECT LOCATE(_utf8"ü" COLLATE utf8_general_ci,
_utf8"München" COLLATE utf8_general_ci) AS locate;
+--------+
| locate |
+--------+
| 2 |
+--------+
1 row in set (0.00 sec)
mysql> SELECT LOCATE(_utf8"u" COLLATE utf8_general_ci,
_utf8"München" COLLATE utf8_general_ci) AS locate;
+--------+
| locate |
+--------+
| 0 |
+--------+
1 row in set (0.00 sec)
Test with a temporary table (collation taken into account in WHERE clause, but not in
LOCATE):
mysql> CREATE TEMPORARY TABLE test
(text VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_general_ci);
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO test VALUES("München");
Query OK, 1 row affected (0.00 sec)
mysql> SELECT text FROM test WHERE text LIKE "%u%";
+---------+
| text |
+---------+
| München |
+---------+
1 row in set (0.00 sec)
mysql> SELECT LOCATE("u", text) AS locate FROM test WHERE text LIKE "%u%";
+--------+
| locate |
+--------+
| 0 |
+--------+
1 row in set (0.01 sec)
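If a collation-aware position is needed, one workaround is to compute it in application code. The sketch below is a rough Python approximation, not MySQL's implementation: `locate_folded` is a hypothetical helper that strips accents and case from both strings before searching and then maps the hit back to the original character position.

```python
import unicodedata


def fold_char(ch):
    """Fold one character: drop combining marks, then ignore case."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.casefold()


def locate_folded(needle, haystack):
    """Return the 1-based position of needle in haystack, or 0 if absent.

    Accents and case are ignored on both sides.
    """
    folded_needle = "".join(fold_char(c) for c in needle)
    if not folded_needle:
        return 0  # an empty needle is treated as "not found" in this sketch

    # Fold the haystack, remembering which original character each folded
    # character came from (a fold can yield zero or several characters).
    folded, origin = [], []
    for index, ch in enumerate(haystack):
        for f in fold_char(ch):
            folded.append(f)
            origin.append(index)

    pos = "".join(folded).find(folded_needle)
    return 0 if pos < 0 else origin[pos] + 1


city = "M\u00fcnchen"
pos_plain_u = locate_folded("u", city)
pos_umlaut_u = locate_folded("\u00fc", city)
pos_n = locate_folded("n", city)
```

Here the plain "u" is found at the position of "ü", which matches what the LIKE "%u%" query above accepts.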
I know this is late, but I hope it helps someone. I kept getting the same error and I knew my charsets and collations were fine.
Check for '#' symbols in your statement that don't belong. I was testing my stored procedure out as a select statement with variables, then when creating the stored proc forgot to remove the '#' symbols. Needless to say, I felt very silly.
I also know this doesn't seem to be the case in this question but this is my first SO post and I don't have enough rep to do much else, so I apologize.

How to search in mysql so that accented character is same as non-accented?

I'd like to have:
piščanec = piscanec in mysql. I mean, I'd like to search for piscanec to find piščanec also.
So the č and c would be same, š and s etc...
I know it can be done using regexp, but this is slow :-( Any other way with LIKE? I am also using full text searches a lot.
UPDATE:
select CONVERT('čšćžđ' USING ascii) as text
does not work. Produces: ?????
Declare the column with the collation utf8_general_ci. This collation considers š equal to s and č equal to c:
create temporary table t (t varchar(100) collate utf8_general_ci);
insert into t set t = 'piščanec';
insert into t set t = 'piscanec';
select * from t where t='piscanec';
+------------+
| t |
+------------+
| piščanec |
| piscanec |
+------------+
If you don't want to or can't use the utf8_general_ci collation for the column (maybe you have a unique index on the column and want to consider piščanec and piscanec distinct?), you can use the collation in the query only:
create temporary table t (t varchar(100) collate utf8_bin);
insert into t set t = 'piščanec';
insert into t set t = 'piscanec';
select * from t where t='piscanec';
+------------+
| t |
+------------+
| piscanec |
+------------+
select * from t where t='piscanec' collate utf8_general_ci;
+------------+
| t |
+------------+
| piščanec |
| piscanec |
+------------+
The FULLTEXT index is supposed to use the column collation directly; you don't need to define a new collation. Apparently the fulltext index can only be in the column's storage collation, so if you want to use utf8_general_ci for searches and utf8_slovenian_ci for sorting, you have to use COLLATE in the ORDER BY:
select * from tab order by col collate utf8_slovenian_ci;
It's not straightforward, but you're probably best off creating your own collation for your fulltext searches. Here is an example:
http://dev.mysql.com/doc/refman/5.5/en/full-text-adding-collation.html
with more info here:
http://dev.mysql.com/doc/refman/5.5/en/adding-collation.html
That way, you have your collation logic completely independent of your SQL and business logic, and you're not having to do any heavy-lifting yourself with SQL-workarounds.
EDIT: since collations are used for all string-matching operations, this may not be the best way to go: you will end up obfuscating differences between characters that are linguistically discrete.
If you want to suppress these differences for specific operations, then you might consider writing a function that takes a string and replaces, in a targeted way, characters which, for the purposes of the current operation, are to be considered identical.
You could define one table holding your base characters (š, č etc.) and another holding the equivalences. Then run a REPLACE over your string.
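As an illustration of the targeted-replacement idea, here is a minimal Python sketch. The mapping `EQUIVALENTS` and the function `replace_equivalents` are hypothetical names; the mapping covers only the five characters mentioned in the question and would need to be extended for real data.

```python
# Base characters and the ASCII letters they should be treated as.
EQUIVALENTS = {
    "\u0161": "s",  # s with caron
    "\u010d": "c",  # c with caron
    "\u0107": "c",  # c with acute
    "\u017e": "z",  # z with caron
    "\u0111": "d",  # d with stroke (has no Unicode decomposition)
}

# str.maketrans turns the mapping into a table usable by str.translate.
TRANSLATION_TABLE = str.maketrans(EQUIVALENTS)


def replace_equivalents(text):
    """Replace each listed character with its equivalent; leave the rest alone."""
    return text.translate(TRANSLATION_TABLE)


search_key = replace_equivalents("pi\u0161\u010danec")
all_five = replace_equivalents("\u010d\u0161\u0107\u017e\u0111")
```

Storing the result in a separate search column and indexing it keeps the original spelling intact while still allowing the search term to be matched with a plain equality or LIKE comparison.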
Another way is just to CONVERT your string to ASCII, which suppresses all non-ASCII characters (each one is replaced with '?', not with its unaccented equivalent).
e.g.
SELECT CONVERT('<your text here>' USING ascii) as as_ascii