MySQL Speeding up left outer join / check for null queries - mysql

The object of my query is to get all rows from table a where gender = f and username does not exist in table b where campid = xxxx. Here is the query I am using with success:
SELECT `id`
FROM pool
LEFT JOIN sent
ON pool.username = sent.username
AND sent.campid = 'YA1LGfh9'
WHERE sent.username IS NULL
AND pool.gender = 'f'
The problem is that the query takes over 9 minutes to complete, the pool table contains over 10 million rows and the sent table is eventually going to grow even larger than that. I have created indexes for many of the columns including username and gender. However, MySQL refuses to use any of my indexes for this query. I even tried using FORCE INDEX. Here are my indexes from pool and the output of EXPLAIN for my query:
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| pool | 0 | PRIMARY | 1 | id | A | 9326880 | NULL | NULL | | BTREE | |
| pool | 1 | username | 1 | username | A | 9326880 | NULL | NULL | | BTREE | |
| pool | 1 | source | 1 | source | A | 6 | NULL | NULL | | BTREE | |
| pool | 1 | gender | 1 | gender | A | 9 | NULL | NULL | | BTREE | |
| pool | 1 | location | 1 | location | A | 59030 | NULL | NULL | | BTREE | |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
6 rows in set (0.00 sec)
mysql> explain SELECT `id` FROM pool FORCE INDEX (username) LEFT JOIN sent ON pool.username = sent.username AND sent.campid = 'YA1LGfh9' WHERE sent.username IS NULL AND pool.gender = 'f';
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------------------+
| 1 | SIMPLE | pool | ALL | NULL | NULL | NULL | NULL | 9326881 | Using where |
| 1 | SIMPLE | sent | ALL | NULL | NULL | NULL | NULL | 351 | Using where; Not exists |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------------------+
2 rows in set (0.00 sec)
also, here are my indexes for the sent table:
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| sent | 0 | PRIMARY | 1 | primary_key | A | 351 | NULL | NULL | | BTREE | |
| sent | 1 | username | 1 | username | A | 351 | NULL | NULL | | BTREE | |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
2 rows in set (0.00 sec)
You can see that no indexes are not being used and so my query takes extremely too long. If anyone has a solution that involves reworking the query, please show me an example of how to do it using my data structure so that I won't have any confusion of how to implement and test. Thank you.

First, your original query was correct in your placement of everything... including the camp. By using a LEFT JOIN from Pool to Sent, and then pulling a required equality such as "CAMP" into the WHERE clause as previously suggested is ultimately converting that into an INNER JOIN, thus requiring entry on both sides. Leave it as you had it.
You already have an index on user name on the sent table, but I would do the following.
build an index on the "sent" table on (CampID, UserName) as a composite (ie: multiple key) index. This way the left join will be optimized for BOTH entries.
On your "pool" table, try a composite index on 3 fields of (gender, username, id ).
By doing this, you can take advantage of NOT having to go through all the actual pages of data that encompass your 10+ million records. Since the index HAS the columns for compare, it doesn't have to find the actual record and look at the columns, it can use those of the index directly.
Also, for grins, I added keyword "STRAIGHT_JOIN" which tells MySQL to query exactly as I show and don't try to think for me. MANY times, I've found this to significantly improve query performance... On very few have I been given feedback that it has NOT helped.
SELECT STRAIGHT_JOIN
p.id
FROM
pool p
LEFT JOIN sent s
ON s.campid = 'YA1LGfh9'
AND p.username = s.username
WHERE
p.gender = 'f'
AND s.username IS NULL
All that said, you are still going to be returning how many records out of the 10+ million... if the pool has 10+ million, and the single camp only has 5,000. You will still be returning almost the entire set.

Related

mysql create table and set createtime(int) for index, but select not use, so why?

mysql create table and set createtime(int) for index, but select not use, so why?
mysql> EXPLAIN SELECT
-> player_id,
-> COUNT(*) AS count_num,
-> SUM( add_gold ) AS sum_add_gold
-> FROM
-> cloud_data_player_gold_log
-> WHERE
-> 1 = 1
-> AND create_time >= 1561046400
-> GROUP BY
-> player_id
-> ORDER BY
-> sum_add_gold ASC
-> LIMIT 0, 10;
+----+-------------+----------------------------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------------------------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------+
| 1 | SIMPLE | cloud_data_player_gold_log | NULL | ALL | create_time | NULL | NULL | NULL | 555659 | 44.47 | Using where; Using temporary; Using filesort |
+----+-------------+----------------------------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------+
1 row in set, 1 warning (0.00 sec)
mysql> show index from cloud_data_player_gold_log;
+----------------------------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+----------------------------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| cloud_data_player_gold_log | 0 | PRIMARY | 1 | log_id | A | 555659 | NULL | NULL | | BTREE | | | YES | NULL |
| cloud_data_player_gold_log | 1 | channel_id | 1 | channel_id | A | 6 | NULL | NULL | | BTREE | | | YES | NULL |
| cloud_data_player_gold_log | 1 | game_id | 1 | game_id | A | 12 | NULL | NULL | | BTREE | | | YES | NULL |
| cloud_data_player_gold_log | 1 | game_id_2 | 1 | game_id | A | 12 | NULL | NULL | | BTREE | | | YES | NULL |
| cloud_data_player_gold_log | 1 | game_id_2 | 2 | room_id | A | 14 | NULL | NULL | | BTREE | | | YES | NULL |
| cloud_data_player_gold_log | 1 | create_time | 1 | create_time | A | 15356 | NULL | NULL | | BTREE | | | YES | NULL |
+----------------------------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
6 rows in set (0.04 sec)
table total data line is 560000, the sql run time over 0.6s
MySQL assumes that a significant number of rows (44%) are after create_time >= 1561046400. If it would use an index, this would mean MySQL would have to read half of your table by jumping back and forth inside it. As data is stored in blocks (that consist of more than 1 row), it would actually read the table size several times over, since for every useful row it also reads several unneeded rows (although a lot of this is mitigated by caching).
In such a case, it's faster to just read the whole table from start to end once and throw the unneeded rows away, which MySQL has decided to do here.
You can prevent MySQL from having to read the actual table data by providing all the data it needs in the index, by having a covering index (create_time, player_id, add_gold). Then it can just read the index from create_time >= 1561046400 to end in one go without the time consuming jumping inside the table.
If MySQL estimated incorrectly, e.g. there might actually be only a handful of rows in your time range, or if you just want to test the execution time with the index, you can force MySQL to use it with e.g.
... FROM cloud_data_player_gold_log FORCE INDEX (create_time) WHERE ...
This has the general disadvantage that MySQL cannot adapt to changed data, different create_time-parameters or additional filters, e.g. if using a different index would be actually faster.
An alternative index you could try would be (player_id, create_time) (or, covering, (player_id, create_time, add_gold)), which supports the group by.
It will depend on your data distribution which one will be faster: an index starting with create_time has to read less rows, an index starting with player_id has to sort one less time. Depending on how many rows there are, one will offset the other.

Why does the output of EXPLAIN change after each SHOW index?

I was trying to improve performance on some queries through indexes using EXPLAIN and I noticed each time I used SHOW index FROM TableB; the output of the rows colums in the EXPLAIN of a query changed
Ex:
mysql> EXPLAIN Select A.id
From TableA A
Inner join TableB B
On A.address = B.address And A.code = B.code
Group by A.id
Having count(distinct B.id) = 1;
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------------+-------+----------------------------------------------+
| 1 | SIMPLE | B | index | test_index | PRIMARY | 518 | NULL | 10561 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | A | eq_ref | PRIMARY | PRIMARY | 514 | db.B.address,db.B.code | 1 | |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------------+-------+----------------------------------------------+
2 rows in set (0.00 sec)
mysql> show index from TableB;
+-----------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-----------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| TableB | 0 | PRIMARY | 1 | id | A | 7 | NULL | NULL | | BTREE | |
| TableB | 0 | PRIMARY | 2 | address | A | 21 | NULL | NULL | | BTREE | |
| TableB | 0 | PRIMARY | 3 | code | A | 10402 | NULL | NULL | | BTREE | |
| TableB | 1 | test_index | 1 | address | A | 1 | NULL | NULL | | BTREE | |
| TableB | 1 | test_index | 2 | code | A | 10402 | NULL | NULL | | BTREE | |
| TableB | 1 | test_index | 3 | id | A | 10402 | NULL | NULL | | BTREE | |
+-----------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
6 rows in set (0.03 sec)
and...
mysql> EXPLAIN Select A.id
From TableA A
Inner join TableB B
On A.address = B.address And A.code = B.code Group by A.id
Having count(distinct B.id) = 1;
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------------+-------+----------------------------------------------+
| 1 | SIMPLE | B | index | test_index | PRIMARY | 518 | NULL | 9800 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | A | eq_ref | PRIMARY | PRIMARY | 514 | db.B.address,db.B.code | 1 | |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------------+-------+----------------------------------------------+
2 rows in set (0.00 sec)
Why does this happen?
The rows column should be taken as a rough estimate only. It's not a precise number.
It's based on statistical estimates of how many rows will be examined during a query. The actual number of rows cannot be known until you actually execute the query.
The statistics are based on samples read from the table periodically. These samples are re-read occasionally, for example after you run ANALYZE TABLE or certain INFORMATION_SCHEMA queries, or certain SHOW statements.
I don't find 20% variation in statistics to be a big deal. In many situations, think of the graph being like an upturned parabola, and you need to know which side of the minimum point you are on. In complex queries, where the Optimizer is likely to goof, it need a lot more than simple stats, such as Histograms of MariaDB 10.0 / 10.1. (I don't have enough experience with such to say whether that makes much headway.)
Your particular query is probably going to be performed in only one way, regardless of the statistics. An example of a complicated query would be a JOIN with WHERE clauses filtering each table. The optimizer has to decide which table to start with. Another case is a single table with a WHERE and ORDER BY and they cannot both be handled by a single index -- should it use an index to filter, but then have to sort? or should it use an index for ORDER BY, but then have to filter on the fly?

MySQL index slowing down query

MySQL Server version: 5.0.95
Tables All: InnoDB
I am having an issue with a MySQL db query. Basically I am finding that if I index a particular varchar(50) field tag.name, my queries take longer (x10) than not indexing the field. I am trying to speed this query up, however my efforts seem to be counter productive.
The culprit line and field seems to be:
WHERE `t`.`name` IN ('news','home')
I have noticed that if i query the tag table directly without a join using the same criteria and with the name field indexed, i do not have the issue.. It actually works faster as expected.
EXAMPLE Query **
SELECT `a`.*, `u`.`pen_name`
FROM `tag_link` `tl`
INNER JOIN `tag` `t`
ON `t`.`tag_id` = `tl`.`tag_id`
INNER JOIN `article` `a`
ON `a`.`article_id` = `tl`.`link_id`
INNER JOIN `user` `u`
ON `a`.`user_id` = `u`.`user_id`
WHERE `t`.`name` IN ('news','home')
AND `tl`.`type` = 'article'
AND `a`.`featured` = 'featured'
GROUP BY `article_id`
LIMIT 0 , 5
EXPLAIN with index **
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+--------------------------+---------+---------+-------------------+------+-----------------------------------------------------------+
| 1 | SIMPLE | t | range | PRIMARY,name | name | 152 | NULL | 4 | Using where; Using index; Using temporary; Using filesort |
| 1 | SIMPLE | tl | ref | tag_id,link_id,link_id_2 | tag_id | 4 | portal.t.tag_id | 10 | Using where |
| 1 | SIMPLE | a | eq_ref | PRIMARY,fk_article_user1 | PRIMARY | 4 | portal.tl.link_id | 1 | Using where |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | portal.a.user_id | 1 | |
+----+-------------+-------+--------+--------------------------+---------+---------+-------------------+------+-----------------------------------------------------------+
EXPLAIN without index **
+----+-------------+-------+--------+--------------------------+---------+---------+---------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+--------------------------+---------+---------+---------------------+------+-------------+
| 1 | SIMPLE | a | index | PRIMARY,fk_article_user1 | PRIMARY | 4 | NULL | 8742 | Using where |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | portal.a.user_id | 1 | |
| 1 | SIMPLE | tl | ref | tag_id,link_id,link_id_2 | link_id | 4 | portal.a.article_id | 3 | Using where |
| 1 | SIMPLE | t | eq_ref | PRIMARY | PRIMARY | 4 | portal.tl.tag_id | 1 | Using where |
+----+-------------+-------+--------+--------------------------+---------+---------+---------------------+------+-------------+
TABLE CREATE
CREATE TABLE `tag` (
`tag_id` int(11) NOT NULL auto_increment,
`name` varchar(50) NOT NULL,
`type` enum('layout','image') NOT NULL,
`create_dttm` datetime default NULL,
PRIMARY KEY (`tag_id`)
) ENGINE=InnoDB AUTO_INCREMENT=43077 DEFAULT CHARSET=utf8
INDEXS
SHOW INDEX FROM tag_link;
+----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| tag_link | 0 | PRIMARY | 1 | tag_link_id | A | 42023 | NULL | NULL | | BTREE | |
| tag_link | 1 | tag_id | 1 | tag_id | A | 10505 | NULL | NULL | | BTREE | |
| tag_link | 1 | link_id | 1 | link_id | A | 14007 | NULL | NULL | | BTREE | |
+----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
SHOW INDEX FROM article;
+---------+------------+------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+---------+------------+------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| article | 0 | PRIMARY | 1 | article_id | A | 5723 | NULL | NULL | | BTREE | |
| article | 1 | fk_article_user1 | 1 | user_id | A | 1 | NULL | NULL | | BTREE | |
| article | 1 | create_dttm | 1 | create_dttm | A | 5723 | NULL | NULL | YES | BTREE | |
+---------+------------+------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
Final Solution
It seems that MySQL is just sorted the data incorrectly. In the end it turned out faster to look at the tag table as a sub query returning the ids.
It seems that article_id is the primary key for the article table.
Since you're grouping by article_id, MySQL needs to return the records in order by that column, in order to perform the GROUP BY.
You can see that without the index, it scans all records in the article table, but they're at least in order by article_id, so no later sort is required. The LIMIT optimization can be applied here, since it's already in order, it can just stop after it gets five rows.
In the query with the index on tag.name, instead of scanning the entire articles table, it utilizes the index, but against the tag table, and starts there. Unfortunately, when doing this, the records must later be sorted by article.article_id in order to complete the GROUP BY clause. The LIMIT optimization can't be applied since it must return the entire result set, then order it, in order to get the first 5 rows.
In this case, MySQL just guesses wrongly.
Without the LIMIT clause, I'm guessing that using the index is faster, which is maybe what MySQL was guessing.
How big are your tables?
I noticed in the first explain you have a "Using temporary; Using filesort" which is bad. Your query is likely being dumped to disc which makes it way slower than in memory queries.
Also try to avoid using "select *" and instead query the minimum fields needed.

MySQL: How do I speed up a "Count()" query with a "JOIN" and "order_by"?

I have the following two (simplified for the sake of example) tables in my MySQL db:
DESCRIBE appname_item;
-----------------+---------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+---------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| name | varchar(200) | NO | | NULL | |
+-----------------+---------------+------+-----+---------+----------------+
DESCRIBE appname_favorite;
+---------------+----------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+----------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| user_id | int(11) | NO | MUL | NULL | |
| item_id | int(11) | NO | MUL | NULL | |
+---------------+----------+------+-----+---------+----------------+
I'm trying to get a list of items ordered by the number of favorites. The query below works, however there are thousands of records in the Item table, and the query is taking up to a couple of minutes to complete.
SELECT `appname_item`.`id`, `appname_item`.`name`, COUNT(`appname_favorite`.`id`) AS `num_favorites`
FROM `appname_item`
LEFT OUTER JOIN `appname_favorite` ON (`appname_item`.`id` = `appname_favorite`.`item_id`)
GROUP BY `appname_item`.`id`, `appname_item`.`name`
ORDER BY `num_favorites` DESC;
Here are the results of EXPLAIN, which provides some insight as to why the query is so slow (type "ALL", "using temporary", and "using filesort" should all be avoided if possible.)
+----+-------------+--------------------+------+-----------------------------+-----------------------------+---------+-------------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+------+-----------------------------+-----------------------------+---------+-------------------------------+------+---------------------------------+
| 1 | SIMPLE | appname_item | ALL | NULL | NULL | NULL | NULL | 574 | Using temporary; Using filesort |
| 1 | SIMPLE | appname_favorite | ref | appname_favorite_67b70d25 | appname_favorite_67b70d25 | 4 | appname.appname_item.id | 1 | |
+----+-------------+--------------------+------+-----------------------------+-----------------------------+---------+-------------------------------+------+---------------------------------+
I know that the easiest way to optimize the query is to add an Index, but I can't seem to figure out how to add an Index for a Count() query that involves a JOIN and an order_by. I should also mention that I am running this through the Django ORM, so would prefer to not change the sql query and just work on fixing and fine tuning the db to run the query in the most effective way.
I've been trying to figure this out for a while, so any help would be much appreciated!
UPDATE
Here are the indexes that are already in the db:
+--------------------+------------+-----------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------------+------------+-----------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| appname_favorite | 0 | PRIMARY | 1 | id | A | 594 | NULL | NULL | | BTREE | |
| appname_favorite | 1 | appname_favorite_fbfc09f1 | 1 | user_id | A | 12 | NULL | NULL | | BTREE | |
| appname_favorite | 1 | appname_favorite_67b70d25 | 1 | item_id | A | 594 | NULL | NULL | | BTREE | |
+--------------------+------------+-----------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
Actually you can't avoid filesort because the count is determined at the calculation time and is unknown in the index. The only solution I can imagine is to create a composite index for table appname_item, which may help a little or not, depending on your particular data:
ALTER TABLE appname_item ADD UNIQUE INDEX `item_id_name` (`id` ASC, `name` ASC);
There is nothing wrong with your query - it looks good.
It could be the the optimizer has out-of-date info about the table. Try running this:
ANALYZE TABLE <tableaname>;
for all tables involved.
Firstly, for the count() function, you can check this answer to know more detail:
https://stackoverflow.com/a/2710630/1020600
For example, using MySQL, count(*) will be fast under a MyISAM table
but slow under an InnoDB. Under InnoDB you should use count(1) or
count(pk)
If your storage engines is MYISAM and if you want to count on row (i guess so), use count(*) is enough.
From your EXPLAIN, I found there's no Key for appname_item, if i try to add a condition
where `appname_item`.`id` = `appname_favorite`.`item_id`
then the "key" appears. so funny but it's work.
The final sql like this
explain SELECT `appname_item`.`id`, `appname_item`.`name`, COUNT(*) AS `num_favorites`
FROM `appname_item`
LEFT OUTER JOIN `appname_favorite` ON (`appname_item`.`id` = `appname_favorite`.`item_id`)
where `appname_item`.`id` = `appname_favorite`.`item_id`
GROUP BY `appname_item`.`id`, `appname_item`.`name`
ORDER BY `num_favorites` DESC;
+----+-------------+------------------+--------+---------------+---------+---------+-------------------------------+------+----------------------------------------------+ | id | select_type | table | type | possible_keys | key
| key_len | ref | rows | Extra
|
+----+-------------+------------------+--------+---------------+---------+---------+-------------------------------+------+----------------------------------------------+ | 1 | SIMPLE | appname_favorite | index | item_id |
item_id | 5 | NULL | 2312 | Using
index; Using temporary; Using filesort | | 1 | SIMPLE |
appname_item | eq_ref | PRIMARY | PRIMARY | 4 |
test.appname_favorite.item_id | 1 | Using where
|
+----+-------------+------------------+--------+---------------+---------+---------+-------------------------------+------+----------------------------------------------+
On my computer, table appname_item has 1686 rows and appname_favorite has 2312 rows, old sql takes from 15 to 23ms. new sql takes 3.7 to 5.3ms

How can a 'WHERE column LIKE "%expression%" ' perform better than a MATCH(column) AGAINST("expression") in MySQL?

I've run into a serious MySQL performance bottleneck which I'm unable to understand and resolve. Here are the table structures, indexes and record counts (bear with me, it's only two tables):
mysql> desc elggobjects_entity;
+-------------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------------------+------+-----+---------+-------+
| guid | bigint(20) unsigned | NO | PRI | NULL | |
| title | text | NO | MUL | NULL | |
| description | text | NO | | NULL | |
+-------------+---------------------+------+-----+---------+-------+
3 rows in set (0.00 sec)
mysql> show index from elggobjects_entity;
+--------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| elggobjects_entity | 0 | PRIMARY | 1 | guid | A | 613637 | NULL | NULL | | BTREE | |
| elggobjects_entity | 1 | title | 1 | title | NULL | 131 | NULL | NULL | | FULLTEXT | |
| elggobjects_entity | 1 | title | 2 | description | NULL | 131 | NULL | NULL | | FULLTEXT | |
+--------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
3 rows in set (0.00 sec)
mysql> select count(*) from elggobjects_entity;
+----------+
| count(*) |
+----------+
| 613637 |
+----------+
1 row in set (0.00 sec)
mysql> desc elggentity_relationships;
+--------------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+---------------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| guid_one | bigint(20) unsigned | NO | MUL | NULL | |
| relationship | varchar(50) | NO | MUL | NULL | |
| guid_two | bigint(20) unsigned | NO | MUL | NULL | |
| time_created | int(11) | NO | | NULL | |
+--------------+---------------------+------+-----+---------+----------------+
5 rows in set (0.00 sec)
mysql> show index from elggentity_relationships;
+--------------------------+------------+--------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------------------+------------+--------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+
| elggentity_relationships | 0 | PRIMARY | 1 | id | A | 11408236 | NULL | NULL | | BTREE | |
| elggentity_relationships | 0 | guid_one | 1 | guid_one | A | NULL | NULL | NULL | | BTREE | |
| elggentity_relationships | 0 | guid_one | 2 | relationship | A | NULL | NULL | NULL | | BTREE | |
| elggentity_relationships | 0 | guid_one | 3 | guid_two | A | 11408236 | NULL | NULL | | BTREE | |
| elggentity_relationships | 1 | relationship | 1 | relationship | A | 11408236 | NULL | NULL | | BTREE | |
| elggentity_relationships | 1 | guid_two | 1 | guid_two | A | 11408236 | NULL | NULL | | BTREE | |
+--------------------------+------------+--------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+
6 rows in set (0.00 sec)
mysql> select count(*) from elggentity_relationships;
+----------+
| count(*) |
+----------+
| 11408236 |
+----------+
1 row in set (0.00 sec)
Now I'd like to use an INNER JOIN on those two tables and perform a full text search.
Query:
SELECT
count(DISTINCT o.guid) as total
FROM
elggobjects_entity o
INNER JOIN
elggentity_relationships r on (r.relationship="image" AND r.guid_one = o.guid)
WHERE
((MATCH (o.title, o.description) AGAINST ('scelerisque' )))
This gave me a 6 minute (!) response time.
On the other hand this one
SELECT
count(DISTINCT o.guid) as total
FROM
elggobjects_entity o
INNER JOIN
elggentity_relationships r on (r.relationship="image" AND r.guid_one = o.guid)
WHERE
((o.title like "%scelerisque%") OR (o.description like "%scelerisque%"))
returned the same count value in 0.02 seconds.
How is that possible? What am I missing here?
(MySQL info: mysql Ver 14.14 Distrib 5.1.49, for debian-linux-gnu (x86_64) using readline 6.1)
EDIT
EXPLAINing the first query (using match .. against) gives:
+----+-------------+-------+----------+-----------------------+--------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+----------+-----------------------+--------------+---------+-------+------+-------------+
| 1 | SIMPLE | r | ref | guid_one,relationship | relationship | 152 | const | 6145 | Using where |
| 1 | SIMPLE | o | fulltext | PRIMARY,title | title | 0 | | 1 | Using where |
+----+-------------+-------+----------+-----------------------+--------------+---------+-------+------+-------------+
while the second query (using LIKE "%..%"):
+----+-------------+-------+--------+-----------------------+--------------+---------+---------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------------+--------------+---------+---------------------+------+-------------+
| 1 | SIMPLE | r | ref | guid_one,relationship | relationship | 152 | const | 6145 | Using where |
| 1 | SIMPLE | o | eq_ref | PRIMARY | PRIMARY | 8 | elgg1710.r.guid_one | 1 | Using where |
+----+-------------+-------+--------+-----------------------+--------------+---------+---------------------+------+-------------+
By combining your experience and EXPLAIN's results, it seems that fulltext index is not as useful as you expect in this particular case. This depends on particular data in your database, on database structure or/and particular query.
Usually database engines use no more than one index per table. So when the table has more than one index, query optimizer tries to use the better one. But optimizer is not always clever enough.
EXPLAIN's output shows that database query optimizer decided to use indexes for relationship and title. The relationship filter reduces table elggentity_relationships to 6145 rows. And the title filter reduces the table elggobjects_entity to 72697 rows. Then MySQL needs to join those tables (6145 x 72697 = 446723065 filtering operations) without using any index because indexes have already been used for filtering. In this case this can be too much. MySQL can even make a decision to keep intermediate calculations in the hard drive by trying to keep enough free space in memory.
Now let's take a look into another query. It uses relationship and PRIMARY KEY (of table elggobjects_entity) as its indexes. The relationship filter reduces table elggentity_relationships to 6145 rows. By joining those tables on PRIMARY KEY index, the result gets only 3957 rows. This is not much for the last filter (i.e. LIKE "%scelerisque%"), even if index is NOT used for this purpose at all.
As you can see the speed much depends on indexes selected for a query. So, in this particular case the PRIMARY KEY index is much more useful than fulltext title index, because PRIMARY KEY has bigger impact for result reduction than title.
MySQL is not always clever to set the right indexes. We can do this manually, by using clauses like IGNORE INDEX (index_name), FORCE INDEX (index_name), etc.
But in your case the problem is that if we use MATCH() AGAINST() in a query then the fulltext index is required, because MATCH() AGAINST() doesn't work without fulltext index at all. So this is the main reason why MySQL has chosen incorrect indexes for the query.
UPDATE
OK, I did some investigation.
Firstly, you may try to force MySQL to use guid_one index instead of relationship on table elggentity_relationships: USE INDEX (guid_one).
But for even better performance I think you can try to create one index for the composition of two columns (guid_one, membership). Current index guid_one is very similar, but for 3 columns, not for 2. In this query there are only 2 columns used. In my opinion after index creation MySQL should automatically use the right index. If not, force MySQL to use it.
Note: After index creation don't forget to remove old USE INDEX instruction from your query, because this may prevent query from using the newly created index. :)