search mysql text field on plain equivalents chars - mysql

I need to search a text field in a database so that special characters do not cause mismatches for what is otherwise the same phrase.
For example, if the value saved in the DB field is "I lòve mysql ánd query", I would like it to match searches for "I love mysql ánd query", "I love mysql and query", "I löve mysql ánd query", etc.
I was thinking of converting the phrases with a PHP function that I use for URL rewrites, always flattening them to "I love mysql and query", but I'm not sure whether I can flatten them in the query itself.

Since your data is already written to the DB with accents, you can try using a DB collation that treats accented characters as equal to their base letters:
$connection->query("SET NAMES utf8 COLLATE utf8_general_ci");
The following example shows clearly what this collation will do for you:
mysql> SET NAMES 'utf8' COLLATE 'utf8_general_ci';
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE TABLE t1
(c1 CHAR(1) CHARACTER SET UTF8 COLLATE utf8_general_ci);
Query OK, 0 rows affected (0.01 sec)
mysql> INSERT INTO t1 VALUES ('a'),('A'),('À'),('á');
Query OK, 4 rows affected (0.00 sec)
Records: 4 Duplicates: 0 Warnings: 0
mysql> SELECT c1, HEX(c1), HEX(WEIGHT_STRING(c1)) FROM t1;
+------+---------+------------------------+
| c1 | HEX(c1) | HEX(WEIGHT_STRING(c1)) |
+------+---------+------------------------+
| a | 61 | 0041 |
| A | 41 | 0041 |
| À | C380 | 0041 |
| á | C3A1 | 0041 |
+------+---------+------------------------+
4 rows in set (0.00 sec)
You can also test it for yourself directly in the DB:
mysql> SET NAMES 'utf8' COLLATE 'utf8_general_ci';
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT 'a' = 'A', 'a' = 'À', 'a' = 'á';
+-----------+-----------+-----------+
| 'a' = 'A' | 'a' = 'À' | 'a' = 'á' |
+-----------+-----------+-----------+
| 1 | 1 | 1 |
+-----------+-----------+-----------+
1 row in set (0.06 sec)
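For intuition only, the folding that makes these comparisons succeed can be imitated on the client side, which is roughly what the PHP flattening function mentioned in the question would do. The Python sketch below is an illustration under assumptions, not how MySQL implements utf8_general_ci: the helper name `fold` is hypothetical, and it simply strips combining marks after NFD decomposition and then case-folds. That covers the accented Latin letters in the question, but not every equivalence a real collation defines.

```python
import unicodedata


def fold(text: str) -> str:
    """Hypothetical helper: drop accents and case, roughly like an
    accent-insensitive, case-insensitive collation would for Latin letters."""
    # NFD splits "ò" into "o" plus a combining grave accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


stored = "I lòve mysql ánd query"
searches = [
    "I love mysql ánd query",
    "I love mysql and query",
    "I löve mysql ánd query",
]

# Each search phrase is compared with the stored value after folding both.
matches = [fold(s) == fold(stored) for s in searches]
print(matches)
```

Letting the collation do this inside the database is usually preferable, because the comparison can then use an index on the column.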

Related

Adding a UCA Collation to a Unicode Character Set: why doesn't it work?

In the Unicode Locale Data Markup Language (LDML), the <rules> element and its sub-elements have been deprecated since version 24, but the MySQL example still uses the deprecated element.
When I added a collation to MySQL using the latest CLDR collation definition, which is marked up with the <cr> element, it did not take effect.
I want to add a collation to MySQL for the UTF-8 character set using the stroke collation in zh.xml.
MySQL Path: mysql-8.0.28-winx64\share\charsets\index.xml
MySQL version: 8.0.28
Stroke collation in: https://github.com/unicode-org/cldr/blob/main/common/collation/zh.xml
http://www.unicode.org/reports/tr35/#Element_rules
https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/ldml-rules.html
How to repeat
Step 1. Edit mysql-8.0.28-winx64\share\charsets\index.xml
Add a collation element (content copied from the CLDR collation file zh.xml) like:
<charset name="utf8mb4">
<family>Unicode</family>
<description>UTF-8 Unicode</description>
<collation name="utf8mb4_stroke_ci" id="1030" type='stroke'>
<cr><![CDATA[
[import zh-u-co-private-pinyin]
...more data...
]]></cr>
</collation>
</charset>
Step 2. Restart mysql server
Step 3. Check that the collation was added successfully
mysql> SHOW COLLATION WHERE Collation = 'utf8mb4_stroke_ci';
+----------------+---------+------+---------+----------+---------+---------------+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
+----------------+---------+------+---------+----------+---------+---------------+
| utf8mb4_stroke_ci | utf8 | 1030 | | | 8 | PAD SPACE |
+----------------+---------+------+---------+----------+---------+---------------+
1 row in set (0.00 sec)
Step 4. Create a database and table then insert some data
mysql> create database collation_test;
Query OK, 1 row affected (0.02 sec)
mysql> use collation_test;
Database changed
mysql> SET NAMES utf8mb4 COLLATE utf8mb4_stroke_ci;
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE TABLE member_stroke (
-> name VARCHAR(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_stroke_ci
-> );
Query OK, 0 rows affected (0.05 sec)
mysql> insert into member_stroke values('一'); -- character '一' means '1', stroke 1.
Query OK, 1 row affected (0.01 sec)
mysql> insert into member_stroke values('二'); -- character '二' means '2', stroke 2.
Query OK, 1 row affected (0.01 sec)
mysql> insert into member_stroke values('三'); -- character '三' means '3', stroke 3.
Query OK, 1 row affected (0.01 sec)
Step 5. Select data and order by name
mysql> select * from member_stroke order by name;
+------+
| name |
+------+
| 一 |
| 三 |
| 二 |
+------+
3 rows in set (0.00 sec)
Expected result
+------+
| name |
+------+
| 一 |
| 二 |
| 三 |
+------+
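One observation that may help with diagnosis, sketched in Python: the order returned by the ORDER BY query above is exactly plain code point order, which would be consistent with the custom tailoring not being applied at all. The `stroke_count` mapping is a hypothetical stand-in built from the stroke counts given in the insert comments; it is not CLDR data.

```python
names = ["一", "二", "三"]

# Plain code point order, i.e. what an untailored comparison falls back to.
code_point_order = sorted(names)
for ch in code_point_order:
    print(ch, f"U+{ord(ch):04X}")

# Hypothetical stroke table, taken from the comments on the INSERT statements.
stroke_count = {"一": 1, "二": 2, "三": 3}
stroke_order = sorted(names, key=lambda ch: stroke_count[ch])
print(stroke_order)
```

Here 三 (U+4E09) sorts before 二 (U+4E8C) by code point, matching the unexpected result, while sorting by stroke count gives the expected one.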
Additional information
When I use the <rules> element to define the collation, it succeeds! But that element has been deprecated in LDML since version 24 (2013-09-18).
<charset name="utf8mb4">
<family>Unicode</family>
<description>UTF-8 Unicode</description>
<collation name="utf8mb4_stroke_ci" id="1030" type='stroke' alt='short'>
<rules>
<!-- START AUTOGENERATED STROKE SHORT -->
<reset><last_non_ignorable /></reset>
<p>﷐⠁</p><!-- INDEX 1 -->
<pc>一</pc><!-- 1 -->
<p>﷐⠁</p><!-- INDEX 2 -->
<pc>二</pc><!-- 2 -->
<p>﷐⠁</p><!-- INDEX 3 -->
<pc>三</pc><!-- 3 -->
</rules>
</collation>
</charset>
mysql> select * from member_stroke order by name;
+------+
| name |
+------+
| 一 |
| 二 |
| 三 |
+------+
3 rows in set (0.00 sec)

MySQL 5.5: Why does any base work, when I use "CONV" on BIT data type

To understand MySQL's BIT type, I created a table named bit_demo and added a few rows to it, as shown below:
mysql> CREATE TABLE `bit_demo` (
-> `mybit` bit(10) NOT NULL DEFAULT b'0'
-> );
Query OK, 0 rows affected (0.05 sec)
mysql> insert into bit_demo values(b'1111111111');
Query OK, 1 row affected (0.00 sec)
mysql> insert into bit_demo values(b'0');
Query OK, 1 row affected (0.00 sec)
mysql> insert into bit_demo values(b'1');
Query OK, 1 row affected (0.00 sec)
But when I used a plain SELECT query, it showed some strange characters on screen:
mysql> select mybit from bit_demo;
+-------+
| mybit |
+-------+
| ♥  |
| |
| ☺ |
+-------+
3 rows in set (0.00 sec)
So I tried using CONV(value,from_base,to_base) in MySQL to see the values as the bits I entered. Since I didn't know which base would work, I tried different values. To my surprise, any value between 2 and 36 for from_base works in this case.
Successful results with from_base or to_base set to 2, 10, or 36:
mysql> select conv(mybit,2,2) mybit from bit_demo;
+------------+
| mybit |
+------------+
| 1111111111 |
| 0 |
| 1 |
+------------+
3 rows in set (0.00 sec)
mysql> select conv(mybit,2,36) mybit from bit_demo;
+-------+
| mybit |
+-------+
| SF |
| 0 |
| 1 |
+-------+
3 rows in set (0.00 sec)
mysql> select conv(mybit,36,2) mybit from bit_demo;
+------------+
| mybit |
+------------+
| 1111111111 |
| 0 |
| 1 |
+------------+
3 rows in set (0.00 sec)
mysql> select conv(mybit,10,2) mybit from bit_demo;
+------------+
| mybit |
+------------+
| 1111111111 |
| 0 |
| 1 |
+------------+
3 rows in set (0.00 sec)
mysql> select conv(mybit,36,36) mybit from bit_demo;
+-------+
| mybit |
+-------+
| SF |
| 0 |
| 1 |
+-------+
3 rows in set (0.00 sec)
mysql> select conv(mybit,10,10) mybit from bit_demo;
+-------+
| mybit |
+-------+
| 1023 |
| 0 |
| 1 |
+-------+
3 rows in set (0.00 sec)
As you can see above, the results do not change when from_base varies between 2 and 36, but they do change when to_base varies between 2 and 36.
I also used CAST(value AS UNSIGNED), and it worked like CONV(value, <2 to 36>, 10):
mysql> select cast(mybit as unsigned) mybit from bit_demo;
+-------+
| mybit |
+-------+
| 1023 |
| 0 |
| 1 |
+-------+
3 rows in set (0.00 sec)
What explains the fact that any value of from_base works with CONV on a BIT column?
TL;DR
The answer to all the questions and doubts above is: because it's the BIT data type, it stores bits. Bits are exactly what any data in any storage consists of, so you cannot judge them by how they happen to be displayed. Bits are only content; the "shape" they take depends on the context.
What is BIT, actually?
Some definitions
Well, as stated in the documentation, it stores values as plain bits. What does that mean? It means the data is stored as a bit sequence, with no information about what kind of data it is. The DBMS simply doesn't care about the type of the data; there is no definition of it, and the name "BIT" should not confuse you. "BIT" does not point to any real data type; instead it states that the data inside is nothing more than a sequence of bits.
What "sequence of bits" means
Storing a sequence of bits means that the real meaning of that sequence depends on context. You cannot really say what a certain sequence means without pointing to the context. Take integers and strings, for instance. What is an integer? It's a number stored as a sequence of bits. What is a string? It's a sequence of bits, too. So how do we distinguish them? That's exactly why we have data types. Each type is a structure that determines how to deal with a certain value, and that value is always a sequence of bits.
Now, "BIT data type" is really terrible naming, because in fact there is no "data type" at all. It just says that it stores data, without binding what that data means. Let's illustrate with some examples. Say we want to store the string "s". How will it be represented? As a bit sequence, and we can recover its "internal" view:
mysql> SELECT ORD("s");
+----------+
| ORD("s") |
+----------+
| 115 |
+----------+
1 row in set (0.00 sec)
So now we know the "numeric" representation. Next:
mysql> SELECT CONV(115, 10, 2);
+------------------+
| CONV(115, 10, 2) |
+------------------+
| 1110011 |
+------------------+
1 row in set (0.01 sec)
OK, these are our "bits", as we wanted. I put "bits" in quotes because it is only a visualization, not the real data. Finally, we can insert it as a bit literal:
mysql> INSERT INTO bit_demo (mybit) VALUES (b'1110011');
Query OK, 1 row affected (0.02 sec)
And now, some "magic":
mysql> SELECT * FROM bit_demo;
+-------+
| mybit |
+-------+
| s |
+-------+
1 row in set (0.00 sec)
Tada! As you can see, I didn't make any conversions, yet I can see a valid "s" string on the screen. Why? Because when you select something and the MySQL client displays it, the client interprets the incoming data as strings. That's why "it worked": we simply wrote a bit sequence that can be interpreted as "s", and since the client tried to interpret the incoming data as a string, all went well and we saw our string.
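The same round trip can be reproduced outside MySQL. This is a minimal Python sketch of the idea, assuming the client receives the stored value as the single byte 0x73 and decodes it as ASCII; the variable names are illustrative only.

```python
bits = "1110011"

# Interpret the bit pattern as an integer (what CONV(..., 2, 10) displays).
value = int(bits, 2)

# Interpret the very same value as one byte of text (what the client shows).
as_text = bytes([value]).decode("ascii")

print(value, as_text)
```

Nothing about the stored pattern changes between the two lines; only the interpretation applied to it does.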
More for strings: encodings
Strings are also a good example because they have their own interpretation issue: encodings. A symbol is nothing more than a sequence of bits, and what you see on the screen when a symbol is shown is just the graphical shape that corresponds to it in the chosen encoding. To illustrate:
mysql> insert into bit_demo values(b'0111111111');
Query OK, 1 row affected (0.02 sec)
Let that be our value. Now, the first case:
mysql> SET NAMES UTF8;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from bit_demo;
+-------+
| mybit |
+-------+
| � |
+-------+
1 row in set (0.00 sec)
and second case:
mysql> SET NAMES cp1251;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from bit_demo;
+-------+
| mybit |
+-------+
| я |
+-------+
1 row in set (0.00 sec)
As you can see, what a certain "symbol" actually means depends on which encoding we use. Bits, which are just values, know nothing about what those values should mean.
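A small Python sketch of the same effect, assuming the BIT(10) value reaches the client as the two bytes 0x01 0xFF (an assumption about the wire representation made for illustration):

```python
# b'0111111111' packed into two bytes: 0x01 0xFF.
raw = int("0111111111", 2).to_bytes(2, "big")

# 0xFF is never valid in UTF-8, so it becomes the replacement character.
as_utf8 = raw.decode("utf-8", errors="replace")

# In cp1251 (Windows Cyrillic) the byte 0xFF is the letter "я".
as_cp1251 = raw.decode("cp1251")

print(repr(as_utf8), repr(as_cp1251))
```

The leading 0x01 byte is a control character in both encodings, which is why only one visible symbol shows up in each console output.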
Integer operations
This is the same issue you see with CONV(). The values you pass to that function are interpreted simply as integer values. No information about such a thing as a "radix" is supplied, and it is not even applicable here: your bits store just a value, which is the same in any radix, so it doesn't matter which radix you convert "from"; the value in bits won't change. That is why, for any input radix (2..36), you see the same conversion result as long as the destination radix is the same.
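To make the point concrete, here is a minimal Python sketch: once the bits are held as an integer, there is no source radix left to choose, and only the output radix affects what is printed. The helper `to_base` is hypothetical and written for this illustration; it is not a MySQL function.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_base(value: int, base: int) -> str:
    """Render a non-negative integer in the given base (2..36)."""
    if not 2 <= base <= 36:
        raise ValueError("base must be between 2 and 36")
    if value < 0:
        raise ValueError("value must be non-negative")
    if value == 0:
        return "0"
    out = []
    while value:
        value, remainder = divmod(value, base)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


# The value stored by b'1111111111'; it is a number, not a digit string.
stored = 0b1111111111

for base in (2, 10, 36):
    print(base, to_base(stored, base))
```

The three renderings correspond to the 1111111111, 1023 and SF rows in the question's output.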
When a value in bits is used as an integer, it takes on that integer type, and it is then subject to the rules that the type defines. Again, let's illustrate (for this sample, I'm using a 64-bit length):
mysql> INSERT INTO bit_demo VALUES (b'1111111111111111111111111111111111111111111111111111111111111101');
Query OK, 1 row affected (0.07 sec)
Let's see "what" it is:
mysql> SELECT CAST(mybit AS SIGNED) FROM bit_demo;
+-----------------------+
| CAST(mybit AS SIGNED) |
+-----------------------+
| -3 |
+-----------------------+
1 row in set (0.00 sec)
And:
mysql> SELECT CAST(mybit AS UNSIGNED) FROM bit_demo;
+-------------------------+
| CAST(mybit AS UNSIGNED) |
+-------------------------+
| 18446744073709551613 |
+-------------------------+
1 row in set (0.00 sec)
Pretty huge difference, right? Again, that's because a given data type defines the rules for the stored value, but the value itself has no clue how it will be used and represented.
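The two casts can be mirrored in Python as a sketch, assuming the 64 bits are laid out as eight big-endian bytes and that the signed reading uses two's complement:

```python
# 62 one-bits followed by "01": 0xFFFFFFFFFFFFFFFD.
raw = (0xFFFFFFFFFFFFFFFD).to_bytes(8, "big")

# Same bytes, two different sets of rules for reading them.
as_signed = int.from_bytes(raw, "big", signed=True)
as_unsigned = int.from_bytes(raw, "big", signed=False)

print(as_signed, as_unsigned)
```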
Conclusion
You may think of the "BIT data type" as "no-type" values. It doesn't specify any information about what a value means; it only stores that value. How to work with it and how to interpret it is a separate matter. Keep that in mind when using this type: any value, no matter what it is and where it is, is in the end just a sequence of bits.

MYSQL UTF Characters mixup ü vs u

I have a MySQL table with the utf8_general_ci collation where I keep data in different languages, mostly English, Turkish, Farsi, etc.
The problem is that the sql statement:
SELECT * FROM `qkw` WHERE `eword` = 'turk'
returns rows with both "turk" and "türk" values.
I have the same problem with indexing, which treats ü and u as the same. Is this a bug in MySQL, or should I use a different encoding? Any suggestions?
Thanks
This behavior is documented along with the different collations, including the effect you're seeing:
To further illustrate, the following equalities hold in both utf8_general_ci and utf8_unicode_ci (for the effect this has in comparisons or when doing searches, see Section 10.1.7.8, “Examples of the Effect of Collation”):
Ä = A
Ö = O
Ü = U
If you don't want that, you can choose a collation from that list that does not see them as equivalent, for example utf8_swedish_ci.
Your best bet would probably be to use the utf8_turkish_ci collation.
It will distinguish between 'u' and 'ü' as you wish. It is a case-insensitive collation (hence the _ci suffix):
create table t (v varchar(255)
character set utf8
collate utf8_turkish_ci);
insert into t values ("turk"), ("türk"), ("top"), ("twin");
mysql> select * from t order by v;
+-------+
| v |
+-------+
| türk |
| top |
| turk |
| twin |
+-------+
mysql> select * from t where v = "turk";
+------+
| v |
+------+
| turk |
+------+
mysql> select * from t where v = "TURK";
+------+
| v |
+------+
| turk |
+------+
Because it simply compares the binary code of each character, utf8_bin will produce slightly different results. Not only will it be case sensitive, but the ordering will also be different:
mysql> alter table t change column v v varchar(255) collate utf8_bin;
Query OK, 4 rows affected (0.24 sec)
Records: 4 Duplicates: 0 Warnings: 0
mysql> select * from t order by v;
+-------+
| v |
+-------+
| top |
| turk |
| twin |
| türk |
+-------+
4 rows in set (0.00 sec)
mysql> select * from t where v = "turk";
+------+
| v |
+------+
| turk |
+------+
1 row in set (0.00 sec)
mysql> select * from t where v = "TURK";
Empty set (0.00 sec)
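The utf8_bin ordering and matching above can be reproduced with a plain byte comparison. This Python sketch assumes the strings are compared by their UTF-8 bytes, which is an approximation of what a binary collation does:

```python
words = ["turk", "türk", "top", "twin"]

# Order by raw UTF-8 bytes: "ü" encodes as 0xC3 0xBC, above every ASCII letter.
binary_order = sorted(words, key=lambda w: w.encode("utf-8"))
print(binary_order)

# Equality on bytes is case sensitive, so "TURK" finds nothing.
exact_matches = [w for w in words if w.encode("utf-8") == "TURK".encode("utf-8")]
print(exact_matches)
```

Because the first byte of "ü" is larger than the byte for "w", "türk" lands after "twin", as in the utf8_bin result.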

LIKE query does not work after I change CHARACTER SET of a MySQL Table to UTF-8

I have a MySQL database with 30 rows in the customer_customer table, of which 5 records have adm_name set to Mike.
mysql> select id from customer_customer where adm_name like '%mike%';
+----+
| id |
+----+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
+----+
5 rows in set (0.00 sec)
Now I have changed the character set of my table to utf8:
mysql> ALTER TABLE customer_customer CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;
Query OK, 30 rows affected (0.03 sec)
Records: 30 Duplicates: 0 Warnings: 0
If I run the same LIKE query again, MySQL does not return any records.
mysql> select id from customer_customer where adm_name like '%mike%';
Empty set (0.00 sec)
I am not able to understand this behavior. Has anyone come across this situation? Am I doing anything wrong?
You changed the collation to a binary one (utf8_bin), so comparison is done byte by byte rather than character by character. That makes it case sensitive: '%mike%' no longer matches 'Mike'. The BINARY operator shows the same effect:
mysql> SELECT 'a' = 'A';
-> 1
mysql> SELECT BINARY 'a' = 'A';
-> 0
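As a rough model of why the LIKE '%mike%' query stopped matching, the Python sketch below contrasts a byte-wise substring test with a case-folded one. Both helper names are hypothetical, and real LIKE semantics (wildcards, escapes, collation weights) are richer than a substring check.

```python
def contains_binary(haystack: str, needle: str) -> bool:
    """Substring test on raw UTF-8 bytes, like a binary collation."""
    return needle.encode("utf-8") in haystack.encode("utf-8")


def contains_case_insensitive(haystack: str, needle: str) -> bool:
    """Substring test after case folding, like a _ci collation."""
    return needle.casefold() in haystack.casefold()


adm_name = "Mike"
print(contains_binary(adm_name, "mike"))
print(contains_case_insensitive(adm_name, "mike"))
```

Converting the table with a case-insensitive collation instead of utf8_bin, for example utf8_general_ci, would keep the original matching behavior.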

Hmm, why does finding by '2' or '２' return the same record?

Forgive my newbie question, but why does finding by '2' or '２' in MySQL return the same record?
For example:
Say I have a record with a string field named 'slug' whose value is '2'. The following SQL statements return the same record.
SELECT * From articles WHERE slug='2'
SELECT * From articles WHERE slug='２'
It has to do with the collation of your database:
mysql> SHOW VARIABLES LIKE 'collation_%';
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
| collation_database | latin1_swedish_ci |
| collation_server | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)
mysql> SELECT '2'='２';
+-----------+
| '2'='２' |
+-----------+
| 0 |
+-----------+
1 row in set (0.00 sec)
mysql> SET NAMES 'utf8' COLLATE 'utf8_unicode_ci';
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT '2'='２';
+-----------+
| '2'='２' |
+-----------+
| 1 |
+-----------+
1 row in set (0.00 sec)
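The two results above only make sense if the two literals are different characters. The Python sketch below assumes one of them is the fullwidth digit U+FF12, which is a guess based on the outputs. It uses NFKC normalization as an analogy for the equivalence that utf8_unicode_ci applies; it is not the algorithm MySQL uses.

```python
import unicodedata

ascii_two = "2"
fullwidth_two = "\uff12"  # assumed second literal: FULLWIDTH DIGIT TWO

# Compared code point by code point, the two are different characters.
same_code_point = ascii_two == fullwidth_two

# After compatibility normalization, the fullwidth form maps to plain "2".
same_after_nfkc = unicodedata.normalize("NFKC", fullwidth_two) == ascii_two

print(same_code_point, same_after_nfkc)
```

If the slug column must tell the two apart, a binary collation on that column is one option.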
They should not return the same row for equality, but if you use LIKE you are probably getting the same row. With LIKE, MySQL will use fuzzy matching, so 2 and ２ will be the same (after all, they are both a form of 2, aren't they?)
What is the data type of slug? I think it's a numeric one. If so, MySQL casts the value to an integer, and either way '2' or ' 2 ' will become 2. This won't happen with string data types.