how to set indexes for join and group by queries - mysql

Let's say we have a common join like the one below:
EXPLAIN SELECT *
FROM visited_links vl
JOIN device_tracker dt ON ( dt.Client_id = vl.Client_id
AND dt.Device_id = vl.Device_id )
GROUP BY dt.id
if we execute an explain, it says:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE vl index NULL vl_id 273 NULL 1977 Using index; Using temporary; Using filesort
1 SIMPLE dt ref Device_id,Device_id_2 Device_id 257 datumprotect.vl.device_id 4 Using where
I know that sometimes it's difficult to choose the right indexes when you are using group by but, what indexes could I set for avoiding 'using temporary, using filesort' in this query? why is this happening? and specially, why this happens after using an index?

One point to mention is that the fields returned by the select (* in this case) should either be in the GROUP BY clause or be using agregate functions such as SUM() or MAX(). Otherwise unexpected results can occur. This is because if the database is not told how to choose fields that are not in the group by clause you may get any member of the group, pretty much at random.
The way I look at it is to break the query down into bits.
you have a join on (dt.Client_id = vl.Client_id and dt.Device_id = vl.Device_id) so all of those fields should be indexed in their respective tables.
You are using GROUP BY dt.id so you need an index that includes dt.id
BUT...
an index on (dt.client_id,dt.device_id,dt.id) will not work for the GROUP BY
and
an index on (dt.id, dt.client_id, dt.device_id) will not work for the join.
Sometimes you end up with a query which just can't use an index.
See also:
http://ntsrikanth.blogspot.com/2007/11/sql-query-order-of-execution.html

You didn't post your indices, but first of all, you'll want to have an index for (client_id, device_id) on visited_links, and (client_id, device_id, id) on device_tracker to make sure that query is fully indexed.
From page 191 of the excellent High Performance MySQL, 2nd Ed.:
MySQL has two kinds of GROUP BY strategies when it can't use an index: it can use a temporary table or a filesort to perform the grouping. Either one can be more efficient depending on the query. You can force the optimizer to choose one method or the other with the SQL_BIG_RESULT and SQL_SMALL_RESULT optimizer hints.
In your case, I think the issue stems from joining on multiple columns and using GROUP BY together, even after the suggested indices are in place. If you remove either (a) one of the join conditions or (b) the GROUP BY, this shouldn't need a filesort.
However, keep in mind that a filesort doesn't always use actual files, it can also happen entirely within a memory buffer if the result set is small enough, so the performance penalty may be minimal. Consider the wall-clock time for the query too.
HTH!

Related

is it always bad to have Using temporary and filesort?

I have a situation where I specify up to 100 primary keys in where clause.
An example query:
select product_id, max(date_created) AS last_order_date
from orders
where product_id=28906
or product_id=28903
or product_id=28897
or product_id=28848
or product_id=28841
or product_id=28839
or product_id=28838
or product_id=28837
or product_id=28833
or product_id=28832
or product_id=28831
or product_id=28821
or product_id=28819
or product_id=28816
or product_id=28814
or product_id=28813
or product_id=28802
or product_id=28800
or product_id=28775
or product_id=28773
group by product_id
order by date_created desc
EXPLAIN shows Using index condition; Using temporary; Using filesort
I'm aware that I should avoid a query with Using temporary; Using filesort, but do I have to avoid it even if the query execution time is fast even for a large dataset? I've given a list of IDs, so that query is the best I can do.
What side effects or disadvantages should I expect if I decide to continue using the query?
Explain output:
1 SIMPLE wc_order_product_lookup range product_id product_id 8 NULL 3 Using index condition; Using temporary; Using filesort
Do what Gordon says, but use
ORDER BY last_order_date DESC
order by date_created desc does not make sense.
It may switch to a table scan if the list is "too long". This may be a difference in EXPLAIN between MySQL and MariaDB. (The resultsets will be identical.)
If you do EXPLAIN FORMAT=JSON SELECT ..., you may find that there are two filesorts.
Back to your original question...
"filesort" and "using temporary" are necessary in some cases -- especially like your case. After it GROUPs the results, the ORDER BY calls for sorting in a way that was not specified by the GROUP BY. This necessitates storing the data and sorting it.
"FILEsort" is a misnomer. In most cases, the rows are sitting in RAM and can be very quickly sorted. For very large resultsets and other complex situations, a "temporary" "file" will actually be used.
The Optimizer turns your list of ORs into an IN like Gordon's answer. So, there is essentially no difference between the two ways of writing it. (I find IN to be cleaner and more concise.)
Using index condition means that InnoDB is taking on some of the work that the generic "Handler" normally does. (This is good, but not a big deal.) However, replacing INDEX(product_id) by INDEX(product_id, date_created) is likely to be even better because it is "covering", which will be indicated by Using index.
"I have index for both fields" -- That is not the same as the composite index I am recommending.
You say "100 primary keys", but I suspect you mean "secondary" keys. Please provide SHOW CREATE TABLE orders to discuss this.
I disagree with the old wives' tale: "should avoid a query with Using temporary; Using filesort". Those are just clues that you are doing something that needs such complexity. It can rarely be "avoided".
The filesort is being used for the group by or the order by. It is pretty hard to avoid. You might find that in, though, helps with the where clause:
select product_id, max(date_created) AS last_order_date
from orders
where product_id in (28906, 28903, 28897, . . . )
group by product_id
order by date_created desc;

understanding perf of mysql query using explain extended

I am trying to understand performance of an SQL query using MySQL.
With only indexes on the PK, the query failed to complete in over 10mins.
I have added indexes on all the columns used in the where clauses (timestamp, hostname, path, type) and the query now completes in approx 50seconds -- however this still seems a long time for what does not seem an overly complex query.
So, I'd like to understand what it is about the query that is causing this. My assumption is that my inner subquery is in someway causing an explosion in the number of comparisons necessary.
There are two tables involved:
storage (~5,000 rows / 4.6MB ) and machines (12 rows, <4k)
The query is as follows:
SELECT T.hostname, T.path, T.used_pct,
T.used_gb, T.avail_gb, T.timestamp, machines.type AS type
FROM storage AS T
JOIN machines ON T.hostname = machines.hostname
WHERE timestamp = ( SELECT max(timestamp) FROM storage AS st
WHERE st.hostname = T.hostname AND
st.path = T.path)
AND (machines.type = 'nfs')
ORDER BY used_pct DESC
An EXPLAIN EXTENDED for the query returns the following:
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY machines ref hostname,type type 768 const 1 100.00 Using where; Using temporary; Using filesort
1 PRIMARY T ref fk_hostname fk_hostname 768 monitoring.machines.hostname 4535 100.00 Using where
2 DEPENDENT SUBQUERY st ref fk_hostname,path path 1002 monitoring.T.path 648 100.00 Using where
Noticing that the 'extra' column for Row 1 includes 'using filesort' and question:
MySQL explain Query understanding
states that "Using filesort is a sorting algorithm where MySQL isn't able to use an index for sorting and therefore can't do the complete sort in memory."
What is the nature of this query which is causing slow performance?
Why is it necessary for MySQL to use 'filesort' for this query?
Indexes don't get populated, they are there as soon as you create them. That's why inserts and updates become slower the more indexes you have on a table.
Your query runs fast after the first time because the whole result of the query is put into cache. To see how fast the query is without using the cache you can do
SELECT SQL_NO_CACHE T.hostname ...
MySQL uses filesort usually for ORDER BY or in your case to determine the maximum value for timestamp. Instead of going through all possible values and memorizing which value is the greatest, MySQL sorts the values descending and picks the first one.
So, why is your query slow? Two things jumped into my eye.
1) Your subquery
WHERE timestamp = ( SELECT max(timestamp) FROM storage AS st
WHERE st.hostname = T.hostname AND
st.path = T.path)
gets evaluated for every (hostname, path). Have a try with an index on timestamp (btw, I discourage naming columns like keywords / datatypes). If that alone doesn't help, try to rewrite your query. There are two excellent examples in the MySQL manual: The Rows Holding the Group-wise Maximum of a Certain Column.
2) This is a minor issue, but it seems you are joining on char/varchar fields. Numbers / IDs are much faster.

What type of index is ideal for this query?

I have an example query such as:
SELECT
rest.name, rest.shortname
FROM
restaurant AS rest
INNER JOIN specials ON rest.id=specials.restaurantid
WHERE
specials.dateend >= CURDATE()
AND
rest.state='VIC'
AND
rest.status = 1
AND
specials.status = 1
ORDER BY
rest.name ASC;
Just wondering of the below two indexes, which would be best on the restaurant table?
id,state,status,name
state,status,name
Just not sure if column used in the join should be included?
Funny enough though, I have created both types for testing and both times MySQL chooses the primary index, which is just id. Why is that?
Explain Output:
1,'SIMPLE','specials','index','NewIndex1\,NewIndex2\,NewIndex3\,NewIndex4','NewIndex4','11',\N,82,'Using where; Using index; Using temporary; Using filesort',
1,'SIMPLE','rest','eq_ref','PRIMARY\,search\,status\,state\,NewIndex1\,NewIndex2\,id-suburb\,NewIndex3\,id-status-name','PRIMARY','4','db_name.specials.restaurantid',1,'Using where'
Not many rows at the moment so perhaps that's why it's choosing PRIMARY!?
For optimum performance, you need at least 2 indexes:
The most important index is the one on the foreign key:
CREATE INDEX specials_rest_fk ON specials(restaurantid);
Without this, your queries will perform poorly, because every row in rest that matches the WHERE conditions will require a full tablescan of specials.
The next index to define would be the one that helps look up the fewest rows of rest given your conditions. Only one index is ever used, so you want to make that index find as few rows from rest as possible.
My guess, state and status:
CREATE INDEX rest_index_1 on rest(state, status);
Your index suggestion of (id, ...) is pointless, because id is unique - adding more column won't help, and in fact would worsen performance if it were used, because the index entries would be larger and you'd get less entries per I/O page read.
But you can gain performance by writing the query better too; if you move the conditions on specials into the join ON condition, you'll gain significant performance, because join conditions are evaluated as the join is made, but where conditions are evaluated on all joined rows, meaning the temporary result set that is filtered by the WHERE clause is much larger and therefore slower.
Change your query to this:
SELECT rest.name, rest.shortname
FROM restaurant AS rest
INNER JOIN specials
ON rest.id=specials.restaurantid
AND specials.dateend >= CURDATE()
AND specials.status = 1
WHERE rest.state='VIC'
AND rest.status = 1
ORDER BY rest.name;
Note how the conditions on specials are now in the ON clause.

MySQL join query not using Indexes?

I have this query:
SELECT
COUNT(*) AS `numrows`
FROM (`tbl_A`)
JOIN `tbl_B` ON `tbl_A`.`B_id` = `tbl_B`.`id`
WHERE
`tbl_B`.`boolean_value` <> 1;
I added three indexes for tbl_A.B_id, tbl_B.id and tbl_B.boolean_value but mysql still says it doesn't use indexes (in queries not using indexes log) and it examine whole of tables to retrieve the result.
I need to know what I should do to optimize this query.
EDIT:
Explain output:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE tbl_B ALL PRIMARY,boolean_value NULL NULL NULL 5049 Using where
1 SIMPLE tbl_A ref B_id B_id 9 tbl_B.id 9 Using where; Using index
The explain show us that an index is used to make the join to tbl_B but no index is used to filter tbl_A on the boolean value.
An index was available but the engine choose not to use it. Why it happen:
maybe 5049 rows is not a big deal and the engine saw that using the index to filter something like 10% of the rows using the index would be as fast as doing it without using it.
Booleans takes only 3 values: 1, 0 or NULL. So the cardinality of the index will always be very low (3 max). Low cardinality index are usually dropped by the query analyser (which is quite right usually thinking this index won't help him a lot)
It would be interesting to see if the query analyser behaves the same way when you have a 50/50 repartition of true and false value for this boolean, or when you have just a few False.
Now usually boolean fields are useful only on indexes containing multiple keys, so that if your queries use all the fields of the index (in where or order by) the query analyser will trust that index to be really a good tool.
Note that indexes are slowing down your writes and takes extra-spaces, do not add useless indexes. Using logt-query-not-using-indexes is a good thing, but you should compensate that log information with the slow queries log.If the query is fast it's not a problem.
if boolean_value it's really boolean value indexing of it not so good idea. Index wouldn't be effective.

MySQL performance, inner join, how to avoid Using temporary and filesort

I have a table 1 and table 2.
Table 1
PARTNUM - ID_BRAND
partnum is the primary key
id_brand is "indexed"
Table 2
ID_BRAND - BRAND_NAME
id_brand is the primary key
brand_name is "indexed"
The table 1 contains 1 million of records and the table 2 contains 1.000 records.
I'm trying to optimize some query using EXPLAIN and after a lot of try I have reached a dead end.
EXPLAIN
SELECT pm.partnum, pb.brand_name
FROM products_main AS pm
LEFT JOIN products_brands AS pb ON pm.id_brand=pb.id_brand
ORDER BY pb.brand ASC
LIMIT 0, 10
The query returns this execution plan:
ID, SELECT_TYPE, TABLE, TYPE, POSSIBLE_KEYS, KEY, KEY_LEN , REF, ROWS, EXTRA
1, SIMPLE, pm, range, PRIMARY, PRIMARY, 1, , 1000000, Using where; Using temporary; Using filesort
1, SIMPLE, pb, ref, PRIMARY, PRIMARY, 4, demo.pm.id_pbrand, 1,
The MySQL query optimizer shows a temporary + filesort in the execution plan.
How can I avoid this?
The "EVIL" is in the ORDER BY pb.brand ASC. Ordering by that external field seems to be the bottleneck..
First of all, I question the use of an outer join seeing as the order by is operating on the rhs, and the NULL's injected by the left join are likely to play havoc with it.
Regardless, the simplest approach to speeding up this query would be a covering index on pb.id_brand and pb.brand. This will allow the order by to be evaluated 'using index' with the join condition. The alternative is to find some way to reduce the size of the intermediate result passed to the order-by.
Still, the combination of outer-join, order-by, and limit, leaves me wondering what exactly you are querying for, and if there might not be a better way of expressing the query itself.
Try replacing the join with a subquery. MySQL's optimizer kind of sucks; subqueries often give better performance than joins.
First, try changing your index on the products_brands table. Delete the existing one on brand_name, and create a new one:
ALTER TABLE products_brands ADD INDEX newIdx (brand_name, id_brand)
Then, the table will already have a "orderedByBrandName" index with the ids you need for the join, and you can try:
EXPLAIN
SELECT pb.brand_name, pm.partnum
FROM products_brands AS pb
LEFT JOIN products_main AS pm ON pb.id_brand = pm.id_brand
LIMIT 0, 10
Note that I also changed the order of the tables in the query, so you start with the small one.
This question is somewhat outdated, but I did find it, and so will other people.
Mysql uses temporary if the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue.
So you just need to have the join order reversed by using STRAIGHT_JOIN, to bypass the order invented by optimizer:
SELECT STRAIGHT_JOIN pm.partnum, pb.brand_name
FROM products_brands AS pb
RIGHT JOIN products_main AS pm ON pm.id_brand=pb.id_brand
ORDER BY pb.brand ASC
LIMIT 0, 10
Also make sure that max_heap_table_size AND tmp_table_size variables are set to a number big enough to store the results:
SET global tmp_table_size=100000000;
SET global max_heap_table_size=100000000;
-- 100 megabytes in this example. These can be set in my.cnf config file, too.