Do I need to have a multicolumn index? - mysql

EXPLAIN SELECT *
FROM (
`phppos_items`
)
WHERE (
name LIKE 'AB10LA2%'
OR item_number LIKE 'AB10LA2%'
OR category LIKE 'AB10LA2%'
)
AND deleted =0
ORDER BY `name` ASC
LIMIT 16
+----+-------------+--------------+-------+-----------------------------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+-----------------------------------+------+---------+------+------+-------------+
| 1 | SIMPLE | phppos_items | index | item_number,name,category,deleted | name | 257 | NULL | 32 | Using where |
+----+-------------+--------------+-------+-----------------------------------+------+---------+------+------+-------------+
This query takes 9 seconds to run (the table has 1 million + rows).
I have an index on item_number,name,category,deleted separately. How can I speed up this query?

As far as I'm aware, MySQL doesn't know how to perform bitmap OR index scans. But you can rewrite the query as the union of three queries to force that behavior, provided you have an index on each field. If so, this will be very fast:
select *
from (
select * from (
select *
from phppos_items
where name like 'AB10LA2%' and deleted = 0
order by `name` limit 16
) t
union
select * from (
select *
from phppos_items
where item_number like 'AB10LA2%' and deleted = 0
order by `name` limit 16
) t
union
select * from (
select *
from phppos_items
where category like 'AB10LA2%' and deleted = 0
order by `name` limit 16
) t
) as top_rows
order by `name` limit 16

The OR operator can be poison for an execution plan. You could try to re-phrase your query replacing the OR clauses by an equivalent UNION:
SELECT *
FROM (
SELECT * FROM `phppos_items`
WHERE name LIKE 'AB10LA2%'
UNION
SELECT * FROM `phppos_items`
WHERE item_number LIKE 'AB10LA2%'
UNION
SELECT * FROM `phppos_items`
WHERE category LIKE 'AB10LA2%'
) AS u
WHERE deleted =0
ORDER BY `name` ASC
LIMIT 16
This gives MySQL the chance to evaluate each sub-query separately, each with its own index, before the UNION operator combines the sub-queries' results. I know this can help a lot with Oracle. Maybe MySQL can do similar things? Note: I assume that LIKE 'AB10LA2%' is quite a selective filter. Otherwise, this might not improve things, because ordering and limiting happen late in the execution plan. See Denis's answer for a more general approach.
In any case, I think a multi-column index won't help you because you have '%' signs in your search expressions. That way, only the first column in the multi-column index could be used, the rest would still need index-scanning or a full table scan.
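To make the OR-to-UNION rewrite concrete, here is a minimal sketch that checks the two forms return the same top 16 rows. It assumes Python's built-in sqlite3 module as a stand-in for MySQL (so it says nothing about MySQL's timings), and the items table and its generated rows are hypothetical sample data, not the real phppos_items.

```python
import sqlite3

# Hypothetical stand-in for phppos_items; SQLite is used only so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, "
    "item_number TEXT, category TEXT, deleted INTEGER)"
)
# One single-column index per searched field, as in the question.
for col in ("name", "item_number", "category"):
    conn.execute(f"CREATE INDEX idx_{col} ON items ({col})")

# Made-up rows: some match the prefix on name, some on item_number, some on category.
rows = []
for i in range(300):
    name = f"AB10LA2-{i:03d}" if i % 3 == 0 else f"item-{i:03d}"
    item_number = f"AB10LA2{i}" if i % 7 == 0 else f"N{i}"
    category = "AB10LA2 parts" if i % 11 == 0 else "misc"
    rows.append((i, name, item_number, category, 1 if i % 5 == 0 else 0))
conn.executemany("INSERT INTO items VALUES (?, ?, ?, ?, ?)", rows)

OR_QUERY = """
SELECT id, name FROM items
WHERE (name LIKE 'AB10LA2%' OR item_number LIKE 'AB10LA2%' OR category LIKE 'AB10LA2%')
  AND deleted = 0
ORDER BY name LIMIT 16
"""

def branch(column):
    """One UNION branch: filter on a single column, keep only its first 16 rows by name."""
    return (
        "SELECT * FROM (SELECT id, name FROM items "
        f"WHERE {column} LIKE 'AB10LA2%' AND deleted = 0 "
        "ORDER BY name LIMIT 16)"
    )

UNION_QUERY = (
    "SELECT id, name FROM ("
    + " UNION ".join(branch(c) for c in ("name", "item_number", "category"))
    + ") ORDER BY name LIMIT 16"
)

or_rows = conn.execute(OR_QUERY).fetchall()
union_rows = conn.execute(UNION_QUERY).fetchall()
print(or_rows == union_rows)
```

The per-branch LIMIT is safe because any row in the overall top 16 is necessarily in the top 16 of at least one branch. This relies on the sort key being unambiguous; with duplicate names, ties could be broken differently by the two forms.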

Avoid "filesort" on UNION RESULT

Sub-query 1:
SELECT * from big_table
where category = 'fruits' and name = 'apple'
order by yyyymmdd desc
Explain:
table | key | extra
big_table | name_yyyymmdd | using where
Looks great!
Sub-query 2:
SELECT * from big_table
where category = 'fruits' and (taste = 'sweet' or wildcard = '*')
order by yyyymmdd desc
Explain:
table | key | extra
big_table | category_yyyymmdd | using where
Looks great!
Now if I combine those with UNION:
SELECT * from big_table
where category = 'fruits' and name = 'apple'
UNION
SELECT * from big_table
where category = 'fruits' and (taste = 'sweet' or wildcard = '*')
Order by yyyymmdd desc
Explain:
table | key | extra
big_table | name | using index condition, using where
big_table | category | using index condition
UNION RESULT| NULL | using temporary; using filesort
Not so good, it uses filesort.
This is a trimmed-down version of a more complex query. Here are some facts about big_table:
big_table has 10M + rows
There are 5 unique "category"s
There are 5 unique "taste"s
There are about 10,000 unique "name"s
There are about 10,000 unique "yyyymmdd"s
I have created a single-column index on each of those fields, plus composite indexes such as yyyymmdd_category_taste_name, but MySQL is not using them.
SELECT * FROM big_table
WHERE category = 'fruits'
AND ( name = 'apple'
OR taste = 'sweet'
OR wildcard = '*' )
ORDER BY yyyymmdd DESC
And have INDEX(category) or some index starting with category. However, if more than about 20% of the table is category = 'fruits', the optimizer will probably decide to ignore the index and simply do a table scan. (Since you say there are only 5 categories, I suspect the optimizer will rightly eschew the index.)
Or this might be beneficial: INDEX(category, yyyymmdd), in this order.
The UNION had to do a sort (either in memory or on disk; it is not clear which) because it was unable to fetch the rows in the desired order.
A composite index INDEX(yyyymmdd, ...) might be used to avoid the 'filesort', but it won't use any columns after yyyymmdd.
When constructing a composite index, start with any WHERE columns compared with '='. After that you can add one range, GROUP BY, or ORDER BY column.
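As a rough illustration of that rule, the sketch below asks a query planner whether it still needs a separate sort once an index on (category, yyyymmdd) exists. It assumes SQLite (via Python's built-in sqlite3 module) as a stand-in, so the plan text is SQLite's and not MySQL's EXPLAIN output; the table definition is a hypothetical cut-down big_table.

```python
import sqlite3

# Hypothetical, empty copy of big_table; only the planner's choice matters here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE big_table (id INTEGER PRIMARY KEY, category TEXT, "
    "name TEXT, taste TEXT, yyyymmdd TEXT)"
)

QUERY = "SELECT * FROM big_table WHERE category = 'fruits' ORDER BY yyyymmdd DESC"

def plan(sql):
    """Return SQLite's EXPLAIN QUERY PLAN steps joined into one string."""
    steps = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable description is the last column of each row.
    return " | ".join(row[-1] for row in steps)

plan_without_index = plan(QUERY)

# Equality column first, then the ORDER BY column.
conn.execute("CREATE INDEX category_yyyymmdd ON big_table (category, yyyymmdd)")
plan_with_index = plan(QUERY)

print(plan_without_index)
print(plan_with_index)
```

Without the index, the plan includes a temporary B-tree for the ORDER BY (SQLite's counterpart of a filesort); with it, the rows come out of the index already ordered. Whether MySQL makes the same choice on a 10M-row table with only 5 categories is a separate question, as noted above.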
UNION is often a good choice for avoiding a slow OR, but in this case it would need three indexes
INDEX(category, name)
INDEX(category, taste)
INDEX(category, wildcard)
and adding yyyymmdd would not help unless you add a LIMIT.
And the query would be:
( SELECT * FROM big_table WHERE category = 'fruits' AND name = 'apple' )
UNION DISTINCT
( SELECT * FROM big_table WHERE category = 'fruits' AND taste = 'sweet' )
UNION DISTINCT
( SELECT * FROM big_table WHERE category = 'fruits' AND wildcard = '*' )
ORDER BY yyyymmdd DESC
Adding a limit would be even messier. First tack yyyymmdd on the end of each of the three composite indexes, then
( SELECT ... ORDER BY yyyymmdd DESC LIMIT 10 )
UNION DISTINCT
( SELECT ... ORDER BY yyyymmdd DESC LIMIT 10 )
UNION DISTINCT
( SELECT ... ORDER BY yyyymmdd DESC LIMIT 10 )
ORDER BY yyyymmdd DESC LIMIT 10
Adding an OFFSET would be even worse.
Two other techniques -- "covering" index and "lazy lookup" might help, but I doubt it.
Yet another technique is to put all the words in the same column and use a FULLTEXT index. But this may be problematic for several reasons.
This should also work without UNION:
SELECT * from big_table
where
( category = 'fruits' and name = 'apple' )
OR
( category = 'fruits' and (taste = 'sweet' or wildcard = '*') )
ORDER BY yyyymmdd desc;

Efficient query involving union and virtual/derived table [duplicate]

Both of these MySQL queries produce exactly the same result. Query A is a simple union with the WHERE postType clause embedded inside the individual queries, whereas query B applies the same WHERE clause in the outer SELECT over the virtual table that is the union of the individual query results. I am concerned that the virtual table sigma in query B might get needlessly large if there are a lot of rows. But I am also a bit confused about how ORDER BY works for query A: would it not also have to build a virtual table, or something similar, to sort the results? It may all depend on how ORDER BY works for a union. If ORDER BY on a union also creates a temp table, would query A then cost roughly the same as query B in resources? (Query B would be much easier for us to implement in our system than query A.) Please guide or advise in any way possible, thanks.
Query A
SELECT `t1`.*, `t2`.*
FROM `t1` INNER JOIN `t2` ON
`t1`.websiteID= `t2`.ownerID
AND `t1`.authorID= `t2`.authorID
AND `t1`.authorID=1559 AND `t1`.postType="simplePost"
UNION
SELECT `t1`.*
FROM `t1` where websiteID=1559 AND postType="simplePost"
ORDER BY postID limit 0,50
Query B
Select * from (
SELECT `t1`.*,`t2`.*
FROM `t1` INNER JOIN `t2` ON
`t1`.websiteID= `t2`.ownerID
AND `t1`.authorID= `t2`.authorID
AND `t1`.authorID=1559
UNION
SELECT `t1`.*
FROM `t1` where websiteID=1559
)
As sigma where postType="simplePost" ORDER BY postID limit 0,50
EXPLAIN FOR QUERY A
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | PRIMARY | t2 | ref | userID | userID | 4 | const | 1 |
1 | PRIMARY | t1 | ref | authorID | authorID | 4 | const | 2 | Using where
2 | UNION | t1 | ref | websiteID | websiteID | 4 | const | 9 | Using where
NULL | UNION RESULT | <union1,2> | ALL | NULL | NULL | NULL | NULL | NULL | Using filesort
EXPLAIN FOR QUERY B
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 10 | Using where; Using filesort
2 | DERIVED | t2 | ref | userID | userID | 4 | | 1 |
2 | DERIVED | t1 | ref | authorID | authorID | 4 | | 2 | Using where
3 | UNION | t1 | ref | websiteID | websiteID | 4 | | 9 |
NULL | UNION RESULT | <union2,3> | ALL | NULL | NULL | NULL | NULL | NULL |
There is no doubt that version 1 (separate where clauses in each side of the union) will be faster. Let's look at why version 2 (where clause over the union result) is worse:
data volume: there are always going to be at least as many rows in the union result, because there are fewer conditions on which rows are returned. This means more disk I/O (depending on indexes) and more temporary storage to hold the rowset, which means more processing time.
repeated scan: the entire result of the union must be scanned again to apply the condition, when it could have been handled during the initial scan. This means handling the rowset twice; it is probably in memory, but it's still extra work.
indexes aren't used for where clauses on a union result. If you have an index over the foreign key fields and postType, it would not be used.
If you want maximum performance, use UNION ALL, which passes the rows straight out into the result with no overhead, instead of UNION, which removes duplicates (usually by sorting), can be expensive, and is unnecessary based on your comments.
Define these indexes and use version 1 for maximum performance:
create index t1_authorID_postType on t1(authorID, postType);
create index t1_websiteID_postType on t1(websiteID, postType);
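For anyone who wants to convince themselves that the two placements of the filter are equivalent in result (only the cost differs), here is a small sketch. It is simplified on purpose: the join to t2 is left out, Python's built-in sqlite3 module stands in for MySQL, and the generated posts are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t1 (postID INTEGER PRIMARY KEY, authorID INTEGER, "
    "websiteID INTEGER, postType TEXT)"
)
# The two composite indexes suggested above.
conn.execute("CREATE INDEX t1_authorID_postType ON t1 (authorID, postType)")
conn.execute("CREATE INDEX t1_websiteID_postType ON t1 (websiteID, postType)")

# Made-up posts: every 4th belongs to author 1559, every 6th to website 1559.
posts = []
for post_id in range(1, 201):
    author = 1559 if post_id % 4 == 0 else post_id
    website = 1559 if post_id % 6 == 0 else post_id
    post_type = "simplePost" if post_id % 2 == 0 else "other"
    posts.append((post_id, author, website, post_type))
conn.executemany("INSERT INTO t1 VALUES (?, ?, ?, ?)", posts)

# Version 1: the postType filter sits inside each branch, where an index can serve it.
QUERY_A = """
SELECT * FROM t1 WHERE authorID = 1559 AND postType = 'simplePost'
UNION
SELECT * FROM t1 WHERE websiteID = 1559 AND postType = 'simplePost'
ORDER BY postID LIMIT 50
"""

# Version 2: the filter is applied to the derived table after the union is built.
QUERY_B = """
SELECT * FROM (
    SELECT * FROM t1 WHERE authorID = 1559
    UNION
    SELECT * FROM t1 WHERE websiteID = 1559
) AS sigma
WHERE postType = 'simplePost'
ORDER BY postID LIMIT 50
"""

rows_a = conn.execute(QUERY_A).fetchall()
rows_b = conn.execute(QUERY_B).fetchall()
print(rows_a == rows_b)
```

Note that a post matching both branches would appear twice under UNION ALL, so switching to it is only safe when the branches cannot overlap or when duplicates are acceptable.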
Perhaps this would work instead:
SELECT
`t1`.*
,`t2`.*
FROM `t1`
LEFT JOIN `t2` ON `t1`.websiteID = `t2`.ownerID
AND `t1`.authorID = `t2`.authorID
AND `t1`.authorID = 1559
WHERE ( `t1`.authorID = 1559 OR `t1`.websiteID = 1559 )
AND `t1`.postType = 'simplePost'
ORDER BY postID limit 0 ,50

MySQL: how to increase speed of a select query with 2 joins and 1 subquery

In a table 'ttraces' I have many records for different tasks (whose value is held in 'taskid' column and is a foreign key of a column 'id' in a table 'ttasks'). Each task inserts a record to 'ttraces' every 8-10 seconds, so caching data to increase performance is not a good idea. What I need is to select only the newest records for each task from 'ttraces', that means the records with the maximum value of the column 'time'. At the moment, I have over 500000 records in the table. The very simplified structure of these two tables looks as follows:
-----------------------
| ttasks |
-----------------------
| id | name | blocked |
-----------------------
---------------------
| ttraces |
---------------------
| id | taskid | time |
---------------------
And my query is shown below:
SELECT t.name,tr.time
FROM
ttraces tr
JOIN
ttasks t ON tr.itask = t.id
JOIN (
SELECT taskid, MAX(time) AS max_time
FROM ttraces
GROUP BY itask
) x ON tr.taskid = x.taskid AND tr.time = x.max_time
WHERE t.blocked
All columns used in WHERE and JOIN clauses are indexed. For now the query runs for ~1.5 seconds. It's crucial to increase its speed. Thanks for all suggestions. BTW: the database is running on a hosted, shared server and I can't move it anywhere else for the moment.
[EDIT]
EXPLAIN SELECT... results are:
--------------------------------------------------------------------------------------------------------------
id select_type table type possible_keys key key_len ref rows Extra
--------------------------------------------------------------------------------------------------------------
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 74
1 PRIMARY t eq_ref PRIMARY PRIMARY 4 x.taskid 1 Using where
1 PRIMARY tr ref taskid,time time 9 x.max_time 1 Using where
2 DERIVED ttraces index NULL itask 5 NULL 570853
--------------------------------------------------------------------------------------------------------------
The engine is InnoDB.
I may be having a bit of a moment, but is this query not logically the same, and (almost certainly) faster?
SELECT t.id, t.name,max(tr.time)
FROM
ttraces tr
JOIN
ttasks t ON tr.itask = t.id
where BLOCKED
group by t.id, t.name
Here's my idea... You need one composite index on ttraces having taskid and time columns (in that order). Then, use this query:
SELECT t.name,
trm.mtime
FROM ttasks AS t
JOIN (SELECT taskid,
Max(time) AS mtime
FROM ttraces
GROUP BY taskid) AS trm
ON t.id = trm.taskid
WHERE t.blocked
Does this query return the correct result? If so, how fast is it?
SELECT t.name, max_time
FROM ttasks t JOIN (
SELECT taskid, MAX(time) AS max_time
FROM ttraces
GROUP BY taskid
) x ON t.id = x.taskid
If there are many traces for each task then you can keep a table with only the newest traces. Whenever you insert into ttraces you also upsert into ttraces_newest:
insert into ttraces_newest (id, taskid, time) values
(3, 1, '2012-01-01 08:02:01')
on duplicate key update
id = values(id),
`time` = values(`time`)
The primary key of ttraces_newest would be taskid, so each task keeps exactly one row. Querying ttraces_newest would be cheaper. How much cheaper depends on how many traces there are for each task. Now the query is:
SELECT t.name,tr.time
FROM
ttraces_newest tr
JOIN
ttasks t ON tr.taskid = t.id
WHERE t.blocked
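Here is a small sketch of the "keep a table with only the newest traces" idea, checked against the GROUP BY / MAX(time) query. It assumes SQLite through Python's built-in sqlite3 module, with INSERT OR REPLACE standing in for MySQL's INSERT ... ON DUPLICATE KEY UPDATE, and it assumes taskid is the key of the summary table. The helper record_trace and the sample tasks are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ttasks (id INTEGER PRIMARY KEY, name TEXT, blocked INTEGER);
CREATE TABLE ttraces (id INTEGER PRIMARY KEY, taskid INTEGER, time TEXT);
CREATE INDEX ttraces_taskid_time ON ttraces (taskid, time);
-- One row per task: taskid is the key, so a newer trace replaces the older one.
CREATE TABLE ttraces_newest (taskid INTEGER PRIMARY KEY, id INTEGER, time TEXT);
""")

def record_trace(trace_id, taskid, time):
    """Insert a trace and overwrite that task's row in the summary table."""
    conn.execute("INSERT INTO ttraces VALUES (?, ?, ?)", (trace_id, taskid, time))
    conn.execute(
        "INSERT OR REPLACE INTO ttraces_newest (taskid, id, time) VALUES (?, ?, ?)",
        (taskid, trace_id, time),
    )

conn.executemany(
    "INSERT INTO ttasks VALUES (?, ?, ?)",
    [(1, "backup", 1), (2, "sync", 1), (3, "idle", 0)],
)
# Nine traces in increasing time order, cycling through the three tasks.
for n in range(1, 10):
    record_trace(n, (n - 1) % 3 + 1, f"2012-01-01 08:00:{n:02d}")

GROUPED = """
SELECT t.name, x.max_time
FROM ttasks t
JOIN (SELECT taskid, MAX(time) AS max_time FROM ttraces GROUP BY taskid) x
  ON t.id = x.taskid
WHERE t.blocked
ORDER BY t.name
"""

SUMMARY = """
SELECT t.name, tr.time
FROM ttraces_newest tr
JOIN ttasks t ON tr.taskid = t.id
WHERE t.blocked
ORDER BY t.name
"""

grouped_rows = conn.execute(GROUPED).fetchall()
summary_rows = conn.execute(SUMMARY).fetchall()
print(grouped_rows == summary_rows)
```

The sketch overwrites unconditionally, which is only correct if traces arrive in time order; if they can arrive late, the write would need to compare timestamps first.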

Need advice optimizing SQL query (update on MySQL)

I did a performance profiling on my database with the slow query log. It turned out this is the number one annoyance:
UPDATE
t1
SET
v1t1 =
(
SELECT
t2.v3t2
FROM
t2
WHERE
t2.v2t2 = t1.v2t1
AND t2.v1t2 <= '2012-04-24'
ORDER BY
t2.v1t2 DESC,
t2.v3t2 DESC
LIMIT 1
);
The subquery itself is already slow. I tried variations with DISTINCT, GROUP BY and more subqueries but nothing performed below 4 seconds. For example the following query
SELECT v2t2, v3t2
FROM t2
WHERE t2.v1t2 <= '2012-04-24'
GROUP BY v2t2
ORDER BY v1t2 DESC
takes:
mysql> SELECT ...
...
69054 rows in set (5.61 sec)
mysql> EXPLAIN SELECT ...
+----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
| 1 | SIMPLE | t2 | ALL | v1t2 | NULL | NULL | NULL | 5203965 | Using where; Using temporary; Using filesort |
+----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
mysql> SHOW CREATE TABLE t2;
...
PRIMARY KEY (`v3t2`),
KEY `v1t2_v3t2` (`v1t2`,`v3t2`),
KEY `v1t2` (`v1t2`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
SELECT COUNT(*) FROM t1;
+----------+
| COUNT(*) |
+----------+
| 77070 |
+----------+
SELECT COUNT(*) FROM t2;
+----------+
| COUNT(*) |
+----------+
| 5203965 |
+----------+
I am trying to fetch the newest entry (v3t2) and its parent (v2t2). Should not be that big of a deal, should it? Does anyone have any advice which knobs I should turn? Any help or hint is greatly appreciated!
This should be a more appropriate SELECT statement:
SELECT
t1.v2t1,
(
SELECT
t2.v3t2
FROM
t2
WHERE
t2.v2t2 = t1.v2t1
AND t2.v1t2 <= '2012-04-24'
ORDER BY
t2.v1t2 DESC,
t2.v3t2 DESC
LIMIT 1
) AS latest
FROM
t1
Your ORDER BY ... LIMIT 1 is forcing the database to perform a full scan of the table to return only 1 row. It looks very much like a candidate for indexing.
Before you build the index, check the field's selectivity by running:
SELECT count(*), count(v1t2), count(DISTINCT v1t2) FROM t2;
If you have a high number of non-NULL values in your column and the number of distinct values is more than 40% of the non-NULLs, then building an index is a good way to go.
If the index provides no help, you should analyze the data in your columns. You're using the condition t2.v1t2 <= '2012-04-24', which, if your table holds a historical set of records, gives the planner nothing to work with: all rows are expected to be in the past, so a full scan is the best choice anyway and an index is useless.
What you should do instead is think about how to rewrite your query so that only a limited subset of records is checked. Your construct ORDER BY ... DESC LIMIT 1 shows that you probably want the most recent entry up to and including '2012-04-24'. Why don't you try to rewrite your query to something like:
SELECT v2t2, v3t2
FROM t2
WHERE t2.v1t2 >= date_add('2012-04-24', interval -10 DAY)
GROUP BY v2t2
ORDER BY v1t2 DESC;
This is just an example; knowing the design of your database and the nature of your data, a more precise query can be built.
I would take a look at the indexes that are built for the sub-select on t2. You should have an index for v2t2 and possibly one for v1t2 and v3t2 because of the ordering. The index should reduce the time the sub-select spends looking for results before they are used in your update query.
Does this work any better? It gets rid of one of the sorts and groups by the key being used.
UPDATE
t1
SET
v1t1 =
(
SELECT
MAX(t2.v3t2)
FROM
t2
WHERE
t2.v2t2 = t1.v2t1
AND t2.v1t2 <= '2012-04-24'
GROUP BY t2.v1t2
ORDER BY t2.v1t2 DESC
LIMIT 1
);
Alternate Version
UPDATE `t1`
SET `v1t1` = (
SELECT MAX(`t2`.`v3t2`)
FROM `t2`
WHERE `t2`.`v2t2` = `t1`.`v2t1`
AND `t2`.`v1t2` = (
SELECT MAX(`t2`.`v1t2`)
FROM `t2`
WHERE `t2`.`v2t2` = `t1`.`v2t1`
AND `t2`.`v1t2` <= '2012-04-24'
LIMIT 1
)
LIMIT 1
);
And add this index to t2:
KEY `v2t2_v1t2` (`v2t2`, `v1t2`)
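To check that the nested-MAX rewrite picks the same row as the original ORDER BY ... LIMIT 1 subquery, here is a minimal sketch on a handful of made-up rows. It assumes SQLite through Python's built-in sqlite3 module instead of MySQL, and the helper run_update and the sample data are hypothetical. It only verifies that the results agree and says nothing about speed on 5 million rows.

```python
import sqlite3

# Original form: newest date up to the cutoff, ties broken by the highest v3t2.
ORDERED = """
UPDATE t1 SET v1t1 = (
    SELECT t2.v3t2 FROM t2
    WHERE t2.v2t2 = t1.v2t1 AND t2.v1t2 <= '2012-04-24'
    ORDER BY t2.v1t2 DESC, t2.v3t2 DESC
    LIMIT 1
)
"""

# Rewrite: find the newest date first, then the highest v3t2 on that date.
NESTED_MAX = """
UPDATE t1 SET v1t1 = (
    SELECT MAX(a.v3t2) FROM t2 AS a
    WHERE a.v2t2 = t1.v2t1
      AND a.v1t2 = (
          SELECT MAX(b.v1t2) FROM t2 AS b
          WHERE b.v2t2 = t1.v2t1 AND b.v1t2 <= '2012-04-24'
      )
)
"""

def run_update(update_sql):
    """Build a fresh sample database, apply one UPDATE, return {v2t1: v1t1}."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE t1 (v2t1 INTEGER PRIMARY KEY, v1t1 INTEGER);
    CREATE TABLE t2 (v3t2 INTEGER PRIMARY KEY, v2t2 INTEGER, v1t2 TEXT);
    CREATE INDEX v2t2_v1t2 ON t2 (v2t2, v1t2);
    INSERT INTO t1 (v2t1) VALUES (1), (2), (3);
    INSERT INTO t2 VALUES
        (10, 1, '2012-04-20'), (11, 1, '2012-04-23'), (12, 1, '2012-04-23'),
        (13, 1, '2012-04-25'), (20, 2, '2012-04-24'), (21, 2, '2012-05-01'),
        (30, 3, '2012-05-02');
    """)
    conn.execute(update_sql)
    return dict(conn.execute("SELECT v2t1, v1t1 FROM t1").fetchall())

ordered_result = run_update(ORDERED)
nested_result = run_update(NESTED_MAX)
print(ordered_result == nested_result)
```

Parent 3 has no entry on or before the cutoff, so both forms leave its v1t1 as NULL; that is worth keeping in mind if the real UPDATE must not overwrite existing values with NULL.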