mysql group_concat one table to another - mysql

I would like a query that solves my problem in native SQL.
I have a table named "synonym" which holds words and their synonyms.
id, word, synonym
1, abandon, forsaken
2, abandon, desolate
...
As you can see, words are repeated many times in this table, which makes it unnecessarily big. I would like to have a table named "words" which doesn't contain duplicate words, like:
id, word, synonyms
1, abandon, 234|90
...
Note: "234" and "90" here are the ids of forsaken and desolate in the newly created words table.
So I have already created a new "words" table with the unique words from the word field of the synonym table. What I need is an SQL query that looks up each word's synonyms in the synonym table, finds their ids in the words table, and updates the "synonyms" field with the ids separated by vertical lines. Then I will just drop the synonym table.
Something like:
UPDATE words SET synonyms= ( vertical line separated ids (ids from the words table) of the word's synonyms in the synonym table )
I know I must use group_concat but I couldn't achieve this.
Hope this is clear enough. Thanks for the help!

Your proposed schema is plain horrible.
Why not use a many-to-many relationship ?
Table words
id word
1 abandon
234 forsaken
Table synonyms
wid sid
1 234
1 90
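A minimal sketch of how that many-to-many layout could be filled from the existing data. It assumes the original table is synonym(id, word, synonym) and that words already contains every distinct word, including words that only appear in the synonym column (pairs whose synonym is missing from words are silently skipped by the inner join). The table name word_synonyms is hypothetical, chosen to avoid clashing with the existing synonym table:

```sql
-- Junction table: one row per (word, synonym) pair, both stored as ids.
CREATE TABLE word_synonyms (
    wid INT NOT NULL,  -- id of the word in words
    sid INT NOT NULL,  -- id of its synonym in words
    PRIMARY KEY (wid, sid)
);

-- Translate each (word, synonym) text pair into a pair of ids.
-- INSERT IGNORE skips pairs that occur more than once in the source table.
INSERT IGNORE INTO word_synonyms (wid, sid)
SELECT w.id, s2.id
FROM synonym s
JOIN words w  ON w.word  = s.word
JOIN words s2 ON s2.word = s.synonym;
```

Looking up the synonyms of a word is then a plain join on word_synonyms, with no string splitting involved.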

You can avoid using update and do it using the queries below:
TRUNCATE TABLE words;
INSERT INTO words
SELECT (@rowNum := @rowNum+1),
a.word,
SUBSTRING(REPLACE(a.syns, a.id + '|', ''), 2) syns
FROM (
SELECT a.*,group_concat(id SEPARATOR '|') syns
FROM synonyms a
GROUP BY word
) a,
(SELECT @rowNum := 0) b
Test Script:
CREATE TABLE `ts_synonyms` (
`id` INT(11) NULL DEFAULT NULL,
`word` VARCHAR(20) NULL DEFAULT NULL,
`synonym` VARCHAR(2000) NULL DEFAULT NULL
);
CREATE TABLE `ts_words` (
`id` INT(11) NULL DEFAULT NULL,
`word` VARCHAR(20) NULL DEFAULT NULL,
`synonym` VARCHAR(2000) NULL DEFAULT NULL
);
INSERT INTO ts_synonyms
VALUES ('1','abandon','forsaken'),
('2','abandon','desolate'),
('3','test','tester'),
('4','test','tester4'),
('5','ChadName','Chad'),
('6','Charles','Chuck'),
('8','abandon','something');
INSERT INTO ts_words
SELECT (@rowNum := @rowNum+1),
a.word,
SUBSTRING(REPLACE(a.syns, a.id + '|', ''), 2) syns
FROM (
SELECT a.*,
GROUP_CONCAT(id SEPARATOR '|') syns
FROM ts_synonyms a
GROUP BY word
) a,
(SELECT @rowNum := 0) b;
SELECT * FROM ts_synonyms;
SELECT * FROM ts_words;
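If the pipe-separated column is kept as the question describes it, the UPDATE could be sketched as follows. This is an assumption-laden sketch, not a tested migration: it takes the table and column names from the question (words.id, words.word, words.synonyms, synonym.word, synonym.synonym) and it only finds ids for synonyms that also exist as rows in words:

```sql
-- GROUP_CONCAT results are cut off at group_concat_max_len (1024 by default),
-- so raise it for this session before building long id lists.
SET SESSION group_concat_max_len = 100000;

-- The derived table x is built first, which is what allows words to be
-- both read (as w2) and updated (as w) in one statement.
UPDATE words w
JOIN (
    SELECT s.word,
           GROUP_CONCAT(DISTINCT w2.id ORDER BY w2.id SEPARATOR '|') AS syn_ids
    FROM synonym s
    JOIN words w2 ON w2.word = s.synonym
    GROUP BY s.word
) x ON x.word = w.word
SET w.synonyms = x.syn_ids;
```

Words without any resolvable synonym are left untouched by the join, so their synonyms column keeps whatever value it had.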

Related

SELECT FROM Table WHERE exact number not partial is in a string SQL

I have a table that contains a bunch of numbers separated by commas.
I would like to retrieve rows from table where an exact number not a partial number is within the string.
EXAMPLE:
CREATE TABLE IF NOT EXISTS `teams` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) NOT NULL,
`uids` text NOT NULL,
`islive` tinyint(1) NOT NULL DEFAULT '1',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=5 ;
INSERT INTO `teams` (`id`, `name`, `uids`, `islive`) VALUES
(1, 'Test Team', '1,2,8', 1),
(3, 'Test Team 2', '14,18,19', 1),
(4, 'Another Team', '1,8,20,23', 1);
I would like to search where 1 is within the string.
At present, if I use CONTAINS or LIKE, it brings back every row containing a 1, but 18, 19 etc. are not 1; they merely have a 1 within them.
I have set up a sqlfiddle here
Do I need to do a regex?
You only need 1 condition:
select *
from teams
where concat(',', uids, ',') like '%,1,%'
I would search for all four possible locations of the ID you are searching for:
As the only element of the list.
As the first element of the list.
As the last element of the list.
As an inner element of the list.
The query would look like:
select *
from teams
where uids = '1' -- only
or uids like '1,%' -- first
or uids like '%,1' -- last
or uids like '%,1,%' -- inner
You could probably catch them all with an OR
SELECT ...
WHERE uids LIKE '1,%'
OR uids LIKE '%,1'
OR uids LIKE '%, 1'
OR uids LIKE '%,1,%'
OR uids = '1'
You didn't specify which version of SQL Server you're using, but if you're using 2016+ you have access to the STRING_SPLIT function which you can use in this case. Here is an example:
CREATE TABLE #T
(
id int,
string varchar(20)
)
INSERT INTO #T
SELECT 1, '1,2,8' UNION
SELECT 2, '14,18,19' UNION
SELECT 3, '1,8,20,23'
SELECT * FROM #T
CROSS APPLY string_split(string, ',')
WHERE value = 1
Your SQL Fiddle is using MySQL and your syntax is consistent with MySQL. There is a built-in function to use:
select t.*
from teams t
where find_in_set(1, uids) > 0;
Having said that, FIX YOUR DATA MODEL SO YOU ARE NOT STORING LISTS IN A SINGLE COLUMN. Sorry that came out so loudly, it is just an important principle of database design.
You should have a table called teamUsers with one row per team and per user on that team. There are numerous reasons why your method of storing the data is bad:
Numbers should be stored as numbers, not strings.
Columns should contain a single value.
Foreign key relationships should be properly declared.
SQL (in general) has lousy string handling functions.
The resulting queries cannot be optimized.
Simple things like listing the uids in order or removing duplicates are unnecessarily hard.
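The teamUsers table suggested above could be sketched like this. The column names team_id and user_id are assumptions (the answer only names the table), and the rows are the uids lists from the question split by hand:

```sql
-- One row per (team, user) pair instead of a comma-separated uids column.
CREATE TABLE teamUsers (
    team_id INT NOT NULL,
    user_id INT NOT NULL,
    PRIMARY KEY (team_id, user_id),
    FOREIGN KEY (team_id) REFERENCES teams (id)
);

INSERT INTO teamUsers (team_id, user_id) VALUES
    (1, 1), (1, 2), (1, 8),
    (3, 14), (3, 18), (3, 19),
    (4, 1), (4, 8), (4, 20), (4, 23);

-- Teams that contain user 1: an exact integer comparison,
-- so 14, 18 and 19 can never match by accident.
SELECT t.*
FROM teams t
JOIN teamUsers tu ON tu.team_id = t.id
WHERE tu.user_id = 1;
```

With the sample data this returns 'Test Team' and 'Another Team', and the lookup can use the primary key instead of scanning strings.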

Bulk apply alias to table columns in MYSQL

I'm working with a third-party MySQL database over which I have no control, except that I can read from it. It contains 51 tables with identical column structure but slightly different names. Each holds daily summaries for a different data source. Example table:
CREATE TABLE `archive_day_?????` (
`dateTime` int(11) NOT NULL,
`min` double DEFAULT NULL,
`mintime` int(11) DEFAULT NULL,
`max` double DEFAULT NULL,
`maxtime` int(11) DEFAULT NULL,
`sum` double DEFAULT NULL,
`count` int(11) DEFAULT NULL,
`wsum` double DEFAULT NULL,
`sumtime` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
where ????? changes to indicate the type of data held.
The dateTime field is mirrored across all tables being midnight of every day since the system has been running.
I want to produce a single data set across all tables using an inner join on the dateTime. But to avoid writing
SELECT ad1.maxtime as ad1_maxtime, ad2.maxtime as ad2_maxtime...
51 times for 9 fields is there a way I can bulk create aliases e.g
ad1.* as ad_*, ad2.* as ad_* and so on.
I have looked at "Create Aliases In Bulk?" but this doesn't seem to work for MySQL. Ultimately the data is being used by a Django ORM.
EDIT: Unfortunately Union doesn't uniquely identify the fields or group them together e.g.
SELECT * FROM `archive_day_ET` UNION ALL SELECT * FROM `archive_day_inTemp`
results in:
To generate a string with all the field names from those tables, you could query information_schema.columns
For example:
SELECT
GROUP_CONCAT(CONCAT(TABLE_NAME,'.`',column_name,'` AS `',column_name,'_',replace(TABLE_NAME,'archive_day_',''),'`') SEPARATOR ',\r\n')
FROM information_schema.columns
WHERE TABLE_NAME like 'archive_day_%'
A test on db<>fiddle here
And to generate the JOIN's then you could use information_schema.tables
For example:
SELECT CONCAT('FROM (\r\n ',GROUP_CONCAT(CONCAT('SELECT `dateTime` FROM ',TABLE_NAME) SEPARATOR '\r\n UNION\r\n '),'\r\n) AS dt \r\nLEFT JOIN ',
GROUP_CONCAT(CONCAT(TABLE_NAME,' ON ',
TABLE_NAME,'.`dateTime` = dt.`dateTime`') SEPARATOR '\r\nLEFT JOIN ')) as SqlJoins
FROM information_schema.tables
WHERE TABLE_NAME like 'archive_day_%'
A test on db<>fiddle here
For the 2 example tables they would generate
archive_day_ET.`dateTime` AS `dateTime_ET`,
archive_day_ET.`min` AS `min_ET`,
archive_day_ET.`mintime` AS `mintime_ET`,
archive_day_ET.`max` AS `max_ET`,
archive_day_ET.`maxtime` AS `maxtime_ET`,
archive_day_ET.`sum` AS `sum_ET`,
archive_day_ET.`count` AS `count_ET`,
archive_day_ET.`wsum` AS `wsum_ET`,
archive_day_ET.`sumtime` AS `sumtime_ET`,
archive_day_inTemp.`dateTime` AS `dateTime_inTemp`,
archive_day_inTemp.`min` AS `min_inTemp`,
archive_day_inTemp.`mintime` AS `mintime_inTemp`,
archive_day_inTemp.`max` AS `max_inTemp`,
archive_day_inTemp.`maxtime` AS `maxtime_inTemp`,
archive_day_inTemp.`sum` AS `sum_inTemp`,
archive_day_inTemp.`count` AS `count_inTemp`,
archive_day_inTemp.`wsum` AS `wsum_inTemp`,
archive_day_inTemp.`sumtime` AS `sumtime_inTemp`
And
FROM (
SELECT `dateTime` FROM archive_day_ET
UNION
SELECT `dateTime` FROM archive_day_inTemp
) AS dt
LEFT JOIN archive_day_ET ON archive_day_ET.`dateTime` = dt.`dateTime`
LEFT JOIN archive_day_inTemp ON archive_day_inTemp.`dateTime` = dt.`dateTime`
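The two generated fragments could also be stitched together and run directly with a prepared statement, which needs no write access to the tables. This is a sketch under assumptions: the archive tables live in the current default schema, the variable names @cols, @joins and @sql are made up for illustration, and line breaks in the generated SQL are replaced by spaces:

```sql
-- GROUP_CONCAT output is truncated at group_concat_max_len (1024 by default),
-- which is far too short for 51 tables x 9 columns.
SET SESSION group_concat_max_len = 1000000;

-- Column list: table.`col` AS `col_suffix` for every archive column.
SELECT GROUP_CONCAT(
           CONCAT(TABLE_NAME, '.`', COLUMN_NAME, '` AS `', COLUMN_NAME, '_',
                  REPLACE(TABLE_NAME, 'archive_day_', ''), '`')
           SEPARATOR ', ')
INTO @cols
FROM information_schema.columns
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME LIKE 'archive_day_%';

-- FROM clause: union of all dateTime values, left-joined to every table.
SELECT CONCAT(
           'FROM (',
           GROUP_CONCAT(CONCAT('SELECT `dateTime` FROM ', TABLE_NAME)
                        SEPARATOR ' UNION '),
           ') AS dt LEFT JOIN ',
           GROUP_CONCAT(CONCAT(TABLE_NAME, ' ON ', TABLE_NAME,
                               '.`dateTime` = dt.`dateTime`')
                        SEPARATOR ' LEFT JOIN '))
INTO @joins
FROM information_schema.tables
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME LIKE 'archive_day_%';

-- Assemble and execute the full statement.
SET @sql = CONCAT('SELECT ', @cols, ' ', @joins);
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
```

Selecting @sql instead of executing it gives the finished statement as text, which may be the more convenient form to paste into a view definition or a raw query on the Django side.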

Mysql Matching "Same" Emails

I have a table with 2 columns email and id. I need to find emails that are closely related. For example:
john.smith12@example.com
and
john.smith12@some.subdomains.example.com
These should be considered the same because the username (john.smith12) and the top-most domain (example.com) are the same. They are currently 2 different rows in my table. I've written the expression below, which should do that comparison, but it takes hours to execute (possibly/probably because of the regex). Is there a better way to write this:
select c1.email, c2.email
from table as c1
join table as c2
on (
c1.leadid <> c2.leadid
and
c1.email regexp replace(replace(c2.email, '.', '[.]'), '@', '@[^@]*'))
The explain of this query comes back as:
id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, SIMPLE, c1, ALL, NULL, NULL, NULL, NULL, 577532, NULL
1, SIMPLE, c2, ALL, NULL, NULL, NULL, NULL, 577532, Using where; Using join buffer (Block Nested Loop)
The create table is:
CREATE TABLE `table` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`Email` varchar(100) DEFAULT NULL,
KEY `Table_Email` (`Email`),
KEY `Email` (`Email`)
) ENGINE=InnoDB AUTO_INCREMENT=667020 DEFAULT CHARSET=latin1
I guess the indices aren't being used because of the regexp.
The regex comes out as:
john[.]smith12@[^@]*example[.]com
which should match both addresses.
Update:
I've modified the on to be:
on (c1.email <> '' and c2.email <> '' and c1.leadid <> c2.leadid and substr(c1.email, 1, (locate('@', c1.email) -1)) = substr(c2.email, 1, (locate('@', c2.email) -1))
and
substr(c1.email, locate('@', c1.email) + 1) like concat('%', substr(c2.email, locate('@', c2.email) + 1)))
and the explain with this approach is at least using the indices.
id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, SIMPLE, c1, range, table_Email,Email, table_Email, 103, NULL, 288873, Using where; Using index
1, SIMPLE, c2, range, table_Email,Email, table_Email, 103, NULL, 288873, Using where; Using index; Using join buffer (Block Nested Loop)
So far this has executed for 5 minutes, will update if there is a vast improvement.
Update 2:
I've split the email so the username is a column and domain is a column. I've stored the domain in reverse order so the index of it can be used with a trailing wildcard.
CREATE TABLE `table` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`Email` varchar(100) DEFAULT NULL,
`domain` varchar(100) CHARACTER SET utf8 DEFAULT NULL,
`username` varchar(500) CHARACTER SET utf8 DEFAULT NULL,
KEY `Table_Email` (`Email`),
KEY `Email` (`Email`),
KEY `domain` (`domain`)
) ENGINE=InnoDB AUTO_INCREMENT=667020 DEFAULT CHARSET=latin1
Query to populate new columns:
update table
set username = trim(SUBSTRING_INDEX(trim(email), '@', 1)),
domain = reverse(trim(SUBSTRING_INDEX(SUBSTRING_INDEX(trim(email), '@', -1), '.', -3)));
New query:
select c1.email, c2.email, c2.domain, c1.domain, c1.username, c2.username, c1.leadid, c2.leadid
from table as c1
join table as c2
on (c1.email is not null and c2.email is not null and c1.leadid <> c2.leadid
and c1.username = c2.username and c1.domain like concat(c2.domain, '%'))
New Explain Results:
1, SIMPLE, c1, ALL, table_Email,Email, NULL, NULL, NULL, 649173, Using where
1, SIMPLE, c2, ALL, table_Email,Email, NULL, NULL, NULL, 649173, Using where; Using join buffer (Block Nested Loop)
From that explain it looks like the domain index is not being used. I also tried to force the usage with USE but that also didn't work, that resulted in no indices being used:
select c1.email, c2.email, c2.domain, c1.domain, c1.username, c2.username, c1.leadid, c2.leadid
from table as c1
USE INDEX (domain)
join table as c2
USE INDEX (domain)
on (c1.email is not null and c2.email is not null and c1.leadid <> c2.leadid
and c1.username = c2.username and c1.domain like concat(c2.domain, '%'))
Explain with use:
1, SIMPLE, c1, ALL, NULL, NULL, NULL, NULL, 649173, Using where
1, SIMPLE, c2, ALL, NULL, NULL, NULL, NULL, 649173, Using where; Using join buffer (Block Nested Loop)
You told us that the table has 700K rows.
This is not much, but you are joining it to itself, so in the worst case the engine would have to process 700K * 700K = 490 000 000 000 = 490B rows.
An index can definitely help here.
The best index depends on the data distribution.
What does the following query return?
SELECT COUNT(DISTINCT username)
FROM table
If the result is high, say 100K or more, there are a lot of different usernames and you'd better focus on them rather than on domain. If the result is low, say 100, then indexing username is unlikely to be useful.
I hope that there are a lot of different usernames, so, I'd create an index on username, since the query joins on that column using simple equality comparison and this join would greatly benefit from this index.
Another option to consider is a composite index on (username, domain) or even covering index on (username, domain, leadid, email). The order of columns in the index definition is important.
I'd delete all other indexes, so that optimiser can't make another choice, unless there are other queries that may need them.
Most likely it won't hurt to define a primary key on the table as well.
There is one more not so important thing to consider. Does your data really have NULLs? If not, define the columns as NOT NULL. Also, in many cases it is better to have empty strings, rather than NULLs, unless you have very specific requirements and you have to distinguish between NULL and ''.
The query would be slightly simpler:
select
c1.email, c2.email,
c1.domain, c2.domain,
c1.username, c2.username,
c1.leadid, c2.leadid
from
table as c1
join table as c2
on c1.username = c2.username
and c1.domain like concat(c2.domain, '%')
and c1.leadid <> c2.leadid
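The composite index suggested above could be tried like this. It is a sketch, not a measured result: the index name idx_username_domain is made up, and the prefix length of 100 on username is an assumption to keep the key within InnoDB's index size limits, since the column is declared varchar(500) in utf8. The leadid column comes from the question's queries:

```sql
-- Equality column first, then the column used for the prefix match.
ALTER TABLE `table`
    ADD INDEX idx_username_domain (username(100), domain);

-- Check whether the optimizer now picks the new index for the self-join.
EXPLAIN
SELECT c1.email, c2.email
FROM `table` AS c1
JOIN `table` AS c2
  ON c1.username = c2.username
 AND c1.domain LIKE CONCAT(c2.domain, '%')
 AND c1.leadid <> c2.leadid;
```

If the EXPLAIN output shows a ref lookup on idx_username_domain for c2 rather than type ALL, each row of c1 is compared only against rows with the same username instead of the whole table.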
No REGEXP_REPLACE needed, so it will work in all versions of MySQL/MariaDB:
UPDATE tbl
SET email = CONCAT(SUBSTRING_INDEX(email, '@', 1),
'@',
SUBSTRING_INDEX(
SUBSTRING_INDEX(email, '@', -1),
'.',
-2));
Since no index is useful, you may as well not bother with a WHERE clause.
If you are searching for related data, you could look at data-mining tools or Elasticsearch, for instance, which are built for this kind of matching.
I have another possible "database-only" solution, but I don't know whether it would work or whether it would be the best one. If I had to do this, I would build a table of "word references", filled by splitting every email on all non-alphanumeric characters.
In your example, this table would contain john, smith12, some, subdomains, example and com, each word with a unique id. A second table, a link ("union") table, would then tie each email to its words. Both tables would need indexes.
To find closely related emails, you split the source email with a regex and loop over each sub-word (as in the example below, using CONNECT BY), look each word up in the word-reference table, and then use the link table to find the emails that contain it.
On top of that query, you can group by email and count how many words each found email shares with the source, keeping the best-matching ones (excluding the source email itself, of course).
Sorry for this "not-sure" answer; it was too long for a comment. I'm going to try to make an example.
Here is an example with some data (written for Oracle; CONNECT BY, sequences and varchar2 would need MySQL equivalents):
---------------------------------------------
-- Table containing emails and people info
CREATE TABLE PEOPLE (
ID NUMBER(11) PRIMARY KEY NOT NULL,
EMAIL varchar2(100) DEFAULT NULL,
USERNAME varchar2(500) DEFAULT NULL
);
-- Table containing word references
CREATE TABLE WORD_REF (
ID number(11) NOT NULL PRIMARY KEY,
WORD varchar2(20) DEFAULT NULL
);
-- Table containing ids of both previous tables
CREATE TABLE UNION_TABLE (
EMAIL_ID number(11) NOT NULL,
WORD_ID number(11) NOT NULL,
CONSTRAINT EMAIL_FK FOREIGN KEY (EMAIL_ID) REFERENCES PEOPLE (ID),
CONSTRAINT WORD_FK FOREIGN KEY (WORD_ID) REFERENCES WORD_REF (ID)
);
-- Here is my oracle sequence to simulate the auto increment
CREATE SEQUENCE MY_SEQ
MINVALUE 1
MAXVALUE 999999
START WITH 1
INCREMENT BY 1
CACHE 20;
---------------------------------------------
-- Some data in the people table
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'john.smith12@example.com', 'jsmith12');
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'john.smith12@some.subdomains.example.com', 'admin');
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'john.doe@another.domain.eu', 'jdo');
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'nathan.smith@example.domain.com', 'nsmith');
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'david.cayne@some.domain.st', 'davidcayne');
COMMIT;
-- Word reference data from the people data
INSERT INTO WORD_REF (ID, WORD)
(select MY_SEQ.NEXTVAL, WORD FROM
(select distinct REGEXP_SUBSTR(EMAIL, '\w+',1,LEVEL) WORD
from PEOPLE
CONNECT BY REGEXP_SUBSTR(EMAIL, '\w+',1,LEVEL) IS NOT NULL
));
COMMIT;
-- Union table filling
INSERT INTO UNION_TABLE (EMAIL_ID, WORD_ID)
select words.ID EMAIL_ID, word_ref.ID WORD_ID
FROM
(select distinct ID, REGEXP_SUBSTR(EMAIL, '\w+',1,LEVEL) WORD
from PEOPLE
CONNECT BY REGEXP_SUBSTR(EMAIL, '\w+',1,LEVEL) IS NOT NULL) words
left join WORD_REF on word_ref.word = words.WORD;
COMMIT;
---------------------------------------------
-- Finally, the query which ranks the emails that match the source email 'john.smith12@example.com'
SELECT COUNT(1) email_match
,email
FROM (SELECT word_ref.id
,words.word
,uni.email_id
,ppl.email
FROM (SELECT DISTINCT regexp_substr('john.smith12@example.com'
,'\w+'
,1
,LEVEL) word
FROM dual
CONNECT BY regexp_substr('john.smith12@example.com'
,'\w+'
,1
,LEVEL) IS NOT NULL) words
LEFT JOIN word_ref
ON word_ref.word = words.word
LEFT JOIN union_table uni
ON uni.word_id = word_ref.id
LEFT JOIN people ppl
ON ppl.id = uni.email_id)
WHERE email <> 'john.smith12@example.com'
GROUP BY email
ORDER BY email_match DESC;
The query results:
4 john.smith12@some.subdomains.example.com
2 nathan.smith@example.domain.com
1 john.doe@another.domain.eu
You get the name (i.e. the part before '@') with
substring_index(email, '@', 1)
You get the domain with
substring_index(replace(email, '@', '.'), '.', -2)
(because if we substitute the '@' with a dot, then it's always the part after the second-to-last dot).
Hence you find duplicates with
select *
from users
where exists
(
select *
from users other
where other.id <> users.id
and substring_index(other.email, '@', 1) =
substring_index(users.email, '@', 1)
and substring_index(replace(other.email, '@', '.'), '.', -2) =
substring_index(replace(users.email, '@', '.'), '.', -2)
);
If this is too slow, then you may want to create a computed column on the two combined and index it:
alter table users add main_email varchar(100) as
(concat(substring_index(email, '@', 1), '@', substring_index(replace(email, '@', '.'), '.', -2)));
create index idx on users(main_email);
select *
from users
where exists
(
select *
from users other
where other.id <> users.id
and other.main_email = users.main_email
);
Of course you can just as well keep the two separate and index them:
alter table users add email_name varchar(100) as (substring_index(email, '@', 1));
alter table users add email_domain varchar(100) as (substring_index(replace(email, '@', '.'), '.', -2));
create index idx on users(email_name, email_domain);
select *
from users
where exists
(
select *
from users other
where other.id <> users.id
and other.email_name = users.email_name
and other.email_domain = users.email_domain
);
And of course, if you allow for both upper and lower case in the email address column, you will also want to apply LOWER on it in above expressions (lower(email)).
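As an alternative to the EXISTS self-join, the same normalisation can be grouped in a single pass. This is a sketch reusing the users table and id column from the answer above; the aliases main_email, cnt and ids are made up. Note that keeping only the last two domain labels treats addresses under suffixes such as co.uk as the same domain, which may or may not be what you want:

```sql
-- Normalise each address to name@last-two-domain-labels, then group.
SELECT LOWER(CONCAT(SUBSTRING_INDEX(email, '@', 1),
                    '@',
                    SUBSTRING_INDEX(SUBSTRING_INDEX(email, '@', -1), '.', -2)))
           AS main_email,
       COUNT(*)         AS cnt,  -- how many rows share this normalised address
       GROUP_CONCAT(id) AS ids   -- the ids of those rows
FROM users
GROUP BY main_email
HAVING COUNT(*) > 1;
```

Both john.smith12@example.com and john.smith12@some.subdomains.example.com normalise to john.smith12@example.com here, so they land in the same group.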

MySQL LIKE matching at the end of the string

I'm trying to figure out why these two LIKE statements are not evaluated equally. In the first, I'm doing a simple select which returns 22 rows. In the second, I'm expecting that my update / replace should also report 22 rows affected. Can anybody see what I'm doing wrong? These should match strings like "I got a knee mri".
SET @acro = 'mri';
SELECT title FROM mytable WHERE title LIKE concat('% ', @acro);
-- returns n rows
UPDATE mytable
SET title = REPLACE(title, CONCAT(' ', @acro), CONCAT(' ', UPPER(@acro)))
WHERE title LIKE CONCAT('% ', @acro);
-- returns 0 rows
CREATE TABLE `mytable` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`title` text,
`author` varchar(50) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=119232 DEFAULT CHARSET=utf8;
The "rows affected" count is the number of rows that were modified, not the number of rows that were matched.
One explanation is that the column title uses a case-insensitive collation, that is, a collation whose name ends in _ci.
It's possible that 22 rows were "matched", but no rows needed to be modified.
If the column is defined with character set/collation latin1_swedish_ci, you could try comparing the results from a query like this:
SET @acro = _latin1'mri';
SELECT title
FROM mytable
WHERE title COLLATE latin1_general_cs LIKE UPPER(CONCAT('% ', @acro));
Ah, here's the answer: MySQL Update query with LIKE in WHERE clause not affecting matching rows
Replace() is case sensitive but like is not.
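To make the mismatch visible, and to upper-case the trailing acronym whatever its current case, something like the following could be tried. It is a sketch against the mytable definition from the question, not the linked answer's code:

```sql
SET @acro = 'mri';

-- Byte-wise comparison: only titles ending in lowercase ' mri' match.
-- If this returns no rows, REPLACE() has nothing to change either.
SELECT title
FROM mytable
WHERE title LIKE BINARY CONCAT('% ', @acro);

-- Rebuild the ending instead of using REPLACE(): keep everything up to and
-- including the last space, then append the upper-cased acronym.
UPDATE mytable
SET title = CONCAT(LEFT(title, CHAR_LENGTH(title) - CHAR_LENGTH(@acro)),
                   UPPER(@acro))
WHERE title LIKE CONCAT('% ', @acro);
```

Rows that already end in ' MRI' are still matched by the case-insensitive WHERE clause but are written back unchanged, so they do not count towards "rows affected".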

INSTR(str,substr) does not work when str contains 'é' or 'ë' and substr only 'e'

In another post on stackoverflow, I read that INSTR could be used to order results by relevance.
My understanding of col LIKE '%str%' and INSTR(col, 'str') is that they both behave the same. There seems to be a difference in how collations are handled.
CREATE TABLE `users` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(64) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
INSERT INTO users (name)
VALUES ('Joël'), ('René');
SELECT * FROM users WHERE name LIKE '%joel%'; -- 1 record returned
SELECT * FROM users WHERE name LIKE '%rene%'; -- 1 record returned
SELECT * FROM users WHERE INSTR(name, 'joel') > 0; -- 0 records returned
SELECT * FROM users WHERE INSTR(name, 'rene') > 0; -- 0 records returned
SELECT * FROM users WHERE INSTR(name, 'joël') > 0; -- 1 record returned
SELECT * FROM users WHERE INSTR(name, 'rené') > 0; -- 1 record returned
Although INSTR does some conversion, it finds ë in é.
SELECT INSTR('é', 'ë'), INSTR('é', 'e'), INSTR('e', 'ë');
-- returns 1, 0, 0
Am I missing something?
http://sqlfiddle.com/#!2/9bf21/6 (using mysql-version: 5.5.22)
This is due to bug 70767 on LOCATE() and INSTR(), which has been verified.
Though the INSTR() documentation states that it can be used for multi-byte strings, it doesn't seem to work, as you note, with collations like utf8_general_ci, which should be case and accent insensitive:
This function is multi-byte safe, and is case sensitive only if at least one argument is a binary string.
The bug report states that MySQL does respect the collation, but only when the number of bytes is also identical:
However, you can easily observe that they do not (completely) respect collations when looking for one string inside another one. It seems that what's happening is that MySQL looks for a substring which is collation-equal to the target which has exactly the same length in bytes as the target. This is only rarely true.
To adapt the report's example, if you create the following table:
create table t ( needle varchar(10), haystack varchar(10)
) COLLATE=utf8_general_ci;
insert into t values ("A", "a"), ("A", "XaX");
insert into t values ("A", "á"), ("A", "XáX");
insert into t values ("Á", "a"), ("Á", "XaX");
insert into t values ("Å", "á"), ("Å", "XáX");
then run this query, you can see the same behaviour demonstrated:
select needle
, haystack
, needle=haystack as `=`
, haystack LIKE CONCAT('%',needle,'%') as `like`
, instr(haystack, needle) as `instr`
from t;
SQL Fiddle