Optimise Select Query containing inner join - mysql

MySQL version: 5.5.39
I have two tables, Bugs and BugStatus.
I want to fetch the Open and Closed bug counts for a given user.
I am currently using this query:
SELECT BugStatus.name,
count(BugStatus.name) AS count
FROM bugs
INNER JOIN BugStatus ON bugs.status = bugstatus.id
WHERE bugs.assignee='irakam'
GROUP BY bugstatus.name;
Now let's assume I am going to have 100,000 rows in my Bugs table. Will this query still hold up, or how should I modify it? I did run EXPLAIN, but I am still confused. Can this query be optimised?

Select bs.name,
count(*) as count -- simply count(*) unless you are avoiding nulls
from bugs
inner join BugStatus AS bs ON bugs.status = bs.id
where bugs.assignee='irakam'
group by bs.name;
bugs: INDEX(assignee) -- since filtering occurs first
Index Cookbook
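The count(*) remark above can be checked on a toy dataset. This is only a sketch: Python's built-in sqlite3 module stands in for MySQL, the sample rows and the index name idx_bugs_assignee are made up for illustration, and it says nothing about how MySQL will actually execute the query.

```python
import sqlite3

# SQLite is a stand-in for MySQL; table/column names follow the question,
# the rows and the index name are invented for this example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE BugStatus (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE bugs (id INTEGER PRIMARY KEY, assignee TEXT, status INTEGER);
    CREATE INDEX idx_bugs_assignee ON bugs (assignee);
    INSERT INTO BugStatus VALUES (1, 'Open'), (2, 'Closed');
    INSERT INTO bugs (assignee, status) VALUES
        ('irakam', 1), ('irakam', 1), ('irakam', 2), ('someone', 1);
""")

# The rewritten query from the answer.
counts = conn.execute("""
    SELECT bs.name, COUNT(*) AS cnt
    FROM bugs
    INNER JOIN BugStatus AS bs ON bugs.status = bs.id
    WHERE bugs.assignee = 'irakam'
    GROUP BY bs.name
    ORDER BY bs.name
""").fetchall()

# COUNT(*) counts every row, COUNT(x) skips rows where x is NULL.
null_demo = conn.execute("""
    SELECT COUNT(*), COUNT(x)
    FROM (SELECT 1 AS x UNION ALL SELECT NULL)
""").fetchone()

print(counts)
print(null_demo)
```

So the two forms only differ when the counted column can be NULL, which is the "unless you are avoiding nulls" caveat.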

You can further optimize your table by creating a composite index on bugs.assignee and bugs.status:
CREATE INDEX idx_bugs_assignee_status on bugs(assignee, status);
As far as the execution plan goes:
Select Type: Simple
This means you are executing a simple query, without any subqueries or unions.
Type: ALL
This means that a full-table scan is being done on the BugStatus table (every row is inspected). This should be avoided for large tables, but it is fine for BugStatus, since it only contains 2 rows.
Type: ref
This means all rows with the matching index values are read from the Bugs table, for each combination of rows found in BugStatus.
possible_keys
This lists out the possible indexes that might be used to answer your query (The primary key of BugStatus, and the foreign key on bugs.status)
Key
This is the actual index that the optimizer chose to answer your query (none in the case of the BugStatus table, since a full-table scan is being performed on it, and the foreign key on status in the case of the bugs table.)
ref
This shows which columns or constants are compared against the index named in the key column.
rows
This column shows the optimizer's estimate of the number of rows it has to examine.
extra: Using temporary; Using filesort
'Using temporary' means that MySQL needs to create a temporary table to hold the intermediate result, which happens here because of your GROUP BY clause.
'Using filesort' means that MySQL has to make an extra pass to sort the rows, because it cannot read them in the required order from an index.
extra: Using where
This means a WHERE condition is used to filter the rows read from the table before they are joined or returned.
See: https://dev.mysql.com/doc/refman/5.5/en/explain-output.html
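To get a feel for what the composite index buys you, here is a rough sketch. It is an assumption-heavy illustration: it uses Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN, whose output format and planner are not MySQL's, and the query is simplified to the bugs table only. It shows the same idea though: the planner can answer both the assignee filter and the grouping on status from the (assignee, status) index.

```python
import sqlite3

# SQLite stands in for MySQL; its planner and plan output differ from MySQL's EXPLAIN.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bugs (id INTEGER PRIMARY KEY, assignee TEXT, status INTEGER);
    CREATE INDEX idx_bugs_assignee_status ON bugs (assignee, status);
""")

# Simplified single-table version of the question's query (an assumption for this demo).
query = """
    SELECT status, COUNT(*)
    FROM bugs
    WHERE assignee = 'irakam'
    GROUP BY status
"""

# The last column of each EXPLAIN QUERY PLAN row is a human-readable description.
plan_details = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
uses_composite_index = any("idx_bugs_assignee_status" in d for d in plan_details)

print(plan_details)
print(uses_composite_index)
```

On MySQL you would run EXPLAIN on the real query and look at the key and Extra columns instead.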

Related

MySQL: Efficiency of views containing GROUP BY

The fact that I haven't been able to come up (or research) a solution to this question means that I'm either too stupid to read the docs or it is in fact a complicated problem.
In a rather big database I often need a query like this:
SELECT ... WHERE condition GROUP BY something;
This takes a fraction of a second to complete. So I put this in a VIEW:
CREATE VIEW view_x AS SELECT ... GROUP BY something;
And when I then do
SELECT * FROM view_x WHERE condition;
it takes more than a minute to complete. Now it's easy to see why: in the plain SELECT, the DB engine first selects a few hundred results from millions of records and then does the aggregating and grouping only on the matching records. When using the view, it seems to first evaluate the entire dataset, aggregating and grouping everything, and then return only the records meeting the condition, throwing away the expensively calculated rest.
Is there a more intelligent VIEW solution, or do I have to use the full SELECT each time?
Thanks.
EDIT: Here's the original SQL code for the view:
CREATE VIEW v_status1 AS SELECT
FROM_UNIXTIME(J.ts_start) AS job_start,
J.id AS job_id, J.carrier, J.n_wafers,
count(W.id) AS n
FROM job AS J
JOIN wafer AS W ON J.id=W.job_id
GROUP BY J.carrier, J.n_wafers, W.status_id;
table job: 100k records, table wafer: 2M records.
Comparison is between these queries:
SELECT * FROM v_status1 WHERE carrier LIKE 'W96L00%'; -- very slow
versus the identical SELECT in the VIEW definition with the WHERE clause before the GROUP BY clause.
Some additional information: The query yields 9 records. Using the view it takes 19 seconds to execute. Using the direct query, it takes 0.000 seconds according to MySQL Workbench.
When I replace the WHERE clause in the direct query by a HAVING clause with the same condition at the end of the query, I end up at the same execution time as the query using the view.
Yes, I forgot some columns in the GROUP BY part. Put them in, doesn't make much of a difference.
Minimal example (5 seconds execution time):
CREATE VIEW v_status2 AS SELECT
job_id,
status_id,
count(id) AS n
FROM wafer
GROUP BY job_id, status_id;
This yields 2 records for a given job_id.
Well, I did the obvious and asked MySQL to EXPLAIN. The output is below. My interpretation is what I suspected all along: MySQL first builds a temporary table, doing all the hard work of aggregating and grouping, and then selects only the rows matching the selection criteria. In other words, MySQL is not intelligent enough to first analyze the view to find where it can efficiently cull the original dataset and only work on the remaining records.
BTW, this has nothing to do with joins and indexes. You can see the effect with any sufficiently large two-column table.
id  select_type  table       type    possible_keys                 key                  key_len  ref                 rows    Extra
1   PRIMARY      <derived2>  ALL     NULL                          NULL                 NULL     NULL                952929  Using where
2   DERIVED      WS          index   PRIMARY                       ix_waferstatus_text  123      NULL                9       Using index; Using temporary; Using filesort
2   DERIVED      W           ref     ix_wafer_job_id,wafer_ibfk_2  wafer_ibfk_2         5        jobwatch.WS.id      105881  Using where
2   DERIVED      J           eq_ref  PRIMARY,job_ibkf_2            PRIMARY              4        jobwatch.W.job_id   1       Using where
2   DERIVED      T           eq_ref  PRIMARY                       PRIMARY              4        jobwatch.J.tool_id  1
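For reference, the three variants discussed in this question (filter on the view, WHERE before GROUP BY, HAVING after GROUP BY) can be reproduced on a toy table. This is a minimal sketch with made-up rows, using Python's sqlite3 module in place of MySQL. It only confirms that the three forms return the same rows, so the difference described above is purely in how much work is done before the filter is applied; it does not reproduce MySQL's timings or plan.

```python
import sqlite3

# SQLite stands in for MySQL; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE wafer (id INTEGER PRIMARY KEY, job_id INTEGER, status_id INTEGER);
    INSERT INTO wafer (job_id, status_id) VALUES
        (1, 10), (1, 10), (1, 20), (2, 10), (2, 30), (3, 10);
    CREATE VIEW v_status2 AS
        SELECT job_id, status_id, COUNT(id) AS n
        FROM wafer
        GROUP BY job_id, status_id;
""")

job = (1,)

# 1) Filter applied to the view, i.e. outside the aggregation.
via_view = conn.execute(
    "SELECT job_id, status_id, n FROM v_status2 WHERE job_id = ? ORDER BY status_id",
    job,
).fetchall()

# 2) Filter in WHERE, before GROUP BY (the fast form in the question).
where_first = conn.execute(
    "SELECT job_id, status_id, COUNT(id) AS n FROM wafer "
    "WHERE job_id = ? GROUP BY job_id, status_id ORDER BY status_id",
    job,
).fetchall()

# 3) Filter in HAVING, after GROUP BY (as slow as the view in the question).
having_last = conn.execute(
    "SELECT job_id, status_id, COUNT(id) AS n FROM wafer "
    "GROUP BY job_id, status_id HAVING job_id = ? ORDER BY status_id",
    job,
).fetchall()

print(via_view, where_first, having_last)
```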

Why the performance is affected with a LEFT JOIN and a GROUP BY?

I'm not understanding what MySQL (InnoDB) is doing with my queries. I have a query to extract data from two tables and it runs in ~35 ms. If I run the query without the LEFT JOIN, it does it in ~2.5 ms. Even the “equivalent” query to what the LEFT JOIN is doing takes ~0.5 ms. Why?
The “slow” query is the following:
SELECT
`Assigned`.`id`,
`Assigned`.`name`,
(COUNT(`Action`.`id`)) AS `Action__total_actions`
FROM `actions` AS `Action`
LEFT JOIN `users` AS `Assigned` ON (`Assigned`.`id` = `Action`.`user_assigned_id`)
WHERE
`Action`.`company_id` = 1 AND
`Action`.`action_date` BETWEEN '2014-12-28 00:00:00' AND '2015-01-28 23:59:59'
GROUP BY `Action`.`user_assigned_id`
ORDER BY `Assigned`.`name` ASC;
I have a primary key on the users table and the following index on the actions table:
ALTER TABLE `actions` ADD INDEX `actions_report_by_assigned` (`company_id`, `action_date`, `user_assigned_id`);
This is where it gets weird. If I “extract” the LEFT JOIN, the index is still used (for both queries), but the following query is 10 times faster:
SELECT
`Action`.`user_assigned_id`,
(COUNT(`Action`.`id`)) AS `Action__total_actions`
FROM `actions` AS `Action`
WHERE
`Action`.`company_id` = 1 AND
`Action`.`action_date` BETWEEN '2014-12-28 00:00:00' AND '2015-01-28 23:59:59'
GROUP BY `Action`.`user_assigned_id`
ORDER BY `Action`.`user_assigned_id`;
I think the index is well designed, because both queries go through the same total number of rows being counted. EXPLAIN tells me that the index is being used, but in both queries the Extra column also says “Using where; Using index; Using temporary; Using filesort” (even though one is 10 times faster).
Maybe it is the filesort with my LEFT JOIN, because if I remove the GROUP BY from my first query, it speeds up to ~15 ms. Sadly, I can't do that. Am I missing something?
Should I ignore this? What is the best way to solve it?
I would add an index on the single column user_assigned_id, because a multi-column index is only usable when the query filters on all of the index's columns or on its leading columns, in index order. Reordering your index like this might also work:
ALTER TABLE `actions` ADD INDEX `actions_report_by_assigned` (`user_assigned_id`, `company_id`, `action_date`);
See http://dev.mysql.com/doc/refman/5.0/en/multiple-column-indexes.html:
For example, if you have a three-column index on (col1, col2, col3), you have indexed search capabilities on (col1), (col1, col2), and (col1, col2, col3).
For now, your actions_report_by_assigned INDEX can't be used for this JOIN:
INNER JOIN `users` AS `Assigned` ON (`Assigned`.`id` = `Action`.`user_assigned_id`)
This is because user_assigned_id is the last column of your multi-column index.
The difference is the order that the tables are being accessed.
The LEFT JOIN is an outer join, it has to return rows from the table on the left side that don't have a matching row from the table on the right side.
An INNER JOIN only returns rows that are matching, so MySQL only has to find matching rows, so it can use either table as the driver for the nested loops operation, and usually, MySQL will use the table that returns fewer rows.
With the outer join, MySQL can't use the table on the right side as the driver, because there may be rows from the table on the left side that also need to be returned.
That's the why of it. As to how to solve it...
It's a bit odd to have an expression in a GROUP BY clause and not return that expression. (It's valid to do that in SQL, but how does the client know which row is for which value of the GROUP BY expression?)
What is the purpose of the GROUP BY Action.user_assigned_id?
If the LEFT JOIN query you are talking about (which we don't see in the question) is the same as the INNER JOIN, just replacing the INNER keyword with the LEFT keyword...
With a GROUP BY col, sometimes MySQL can make efficient use of an index with leading column col to avoid a "Using filesort" operation, but in your case, there's an ORDER BY on a different expression, so I don't think there's any way to get around a "Using filesort" operation.
Your best bet is probably making sure you have an appropriate index to satisfy the predicates in the WHERE clause, if that limits the rows to a small subset of rows in the table.
... ON `actions` (`company_id`, `action_date`, `user_assigned_id`, `id`)
MySQL should be able to make use of that index for the equality predicate on company_id, and for a range scan operation on action_date. Having the other two columns in the index makes that a covering index, so the query can be satisfied entirely from the index, without any lookups to data pages in the underlying table.
If that's the case, the Extra column in the EXPLAIN output will show "Using index".
Avoid LEFT JOINs on big tables where you can.
Hint: split the query into smaller pieces.
A query that takes 5 minutes may then run in under a second.
Try it.
Also check the EXPLAIN plan.
Get the fields involved in the join.
Check whether there is an index on the join fields on both sides.
Then check the EXPLAIN plan again; you should see the row counts go down.
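One possible reading of the "split the query" hint is to do the aggregation on actions first and only then join the (much smaller) grouped result to users. That is my interpretation, not something the answers spell out. The sketch below uses Python's sqlite3 module as a stand-in for MySQL with invented rows, shortens the aliases to a/u/t, and only checks that the rewritten query returns the same rows as the original; whether it is faster on MySQL needs to be measured there.

```python
import sqlite3

# SQLite stands in for MySQL; rows are invented, aliases a/u/t are shortened for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE actions (
        id INTEGER PRIMARY KEY,
        company_id INTEGER,
        action_date TEXT,
        user_assigned_id INTEGER
    );
    CREATE INDEX actions_report_by_assigned
        ON actions (company_id, action_date, user_assigned_id);
    INSERT INTO users VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO actions (company_id, action_date, user_assigned_id) VALUES
        (1, '2015-01-05 10:00:00', 1),
        (1, '2015-01-06 11:00:00', 1),
        (1, '2015-01-07 12:00:00', 2),
        (1, '2015-01-08 09:00:00', NULL),  -- unassigned action
        (1, '2014-11-01 09:00:00', 1),     -- outside the date range
        (2, '2015-01-05 10:00:00', 2);     -- other company
""")

date_range = ("2014-12-28 00:00:00", "2015-01-28 23:59:59")

# Original shape: join first, then group.
join_then_group = conn.execute("""
    SELECT u.id, u.name, COUNT(a.id) AS total_actions
    FROM actions AS a
    LEFT JOIN users AS u ON u.id = a.user_assigned_id
    WHERE a.company_id = 1 AND a.action_date BETWEEN ? AND ?
    GROUP BY a.user_assigned_id
    ORDER BY u.name ASC
""", date_range).fetchall()

# Rewritten shape: group inside a derived table, then join the few grouped rows.
group_then_join = conn.execute("""
    SELECT u.id, u.name, t.total_actions
    FROM (
        SELECT user_assigned_id, COUNT(id) AS total_actions
        FROM actions
        WHERE company_id = 1 AND action_date BETWEEN ? AND ?
        GROUP BY user_assigned_id
    ) AS t
    LEFT JOIN users AS u ON u.id = t.user_assigned_id
    ORDER BY u.name ASC
""", date_range).fetchall()

print(join_then_group)
print(group_then_join)
```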

understanding perf of mysql query using explain extended

I am trying to understand performance of an SQL query using MySQL.
With indexes only on the PK, the query failed to complete in over 10 minutes.
I have added indexes on all the columns used in the WHERE clauses (timestamp, hostname, path, type) and the query now completes in approx. 50 seconds. However, this still seems a long time for what does not seem an overly complex query.
So, I'd like to understand what it is about the query that is causing this. My assumption is that my inner subquery is in some way causing an explosion in the number of comparisons necessary.
There are two tables involved:
storage (~5,000 rows / 4.6MB ) and machines (12 rows, <4k)
The query is as follows:
SELECT T.hostname, T.path, T.used_pct,
T.used_gb, T.avail_gb, T.timestamp, machines.type AS type
FROM storage AS T
JOIN machines ON T.hostname = machines.hostname
WHERE timestamp = ( SELECT max(timestamp) FROM storage AS st
WHERE st.hostname = T.hostname AND
st.path = T.path)
AND (machines.type = 'nfs')
ORDER BY used_pct DESC
An EXPLAIN EXTENDED for the query returns the following:
id  select_type         table     type  possible_keys     key          key_len  ref                           rows  filtered  Extra
1   PRIMARY             machines  ref   hostname,type     type         768      const                         1     100.00    Using where; Using temporary; Using filesort
1   PRIMARY             T         ref   fk_hostname       fk_hostname  768      monitoring.machines.hostname  4535  100.00    Using where
2   DEPENDENT SUBQUERY  st        ref   fk_hostname,path  path         1002     monitoring.T.path             648   100.00    Using where
I noticed that the Extra column for row 1 includes 'Using filesort', and this question:
MySQL explain Query understanding
states that "Using filesort is a sorting algorithm where MySQL isn't able to use an index for sorting and therefore can't do the complete sort in memory."
What is the nature of this query which is causing slow performance?
Why is it necessary for MySQL to use 'filesort' for this query?
Indexes don't get populated, they are there as soon as you create them. That's why inserts and updates become slower the more indexes you have on a table.
Your query runs fast after the first time because the whole result of the query is put into cache. To see how fast the query is without using the cache you can do
SELECT SQL_NO_CACHE T.hostname ...
MySQL usually uses filesort for ORDER BY (and sometimes GROUP BY) when no index returns the rows in the required order. In your case that is the ORDER BY used_pct DESC on the joined result.
So, why is your query slow? Two things caught my eye.
1) Your subquery
WHERE timestamp = ( SELECT max(timestamp) FROM storage AS st
WHERE st.hostname = T.hostname AND
st.path = T.path)
gets evaluated for every (hostname, path). Have a try with an index on timestamp (btw, I discourage naming columns like keywords / datatypes). If that alone doesn't help, try to rewrite your query. There are two excellent examples in the MySQL manual: The Rows Holding the Group-wise Maximum of a Certain Column.
2) This is a minor issue, but it seems you are joining on char/varchar fields. Numbers / IDs are much faster.
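As a rough illustration of the rewrite suggested in point 1, the correlated subquery can be replaced by a join against a derived table that computes the latest timestamp per (hostname, path) once. This is a sketch under assumptions: Python's sqlite3 module stands in for MySQL, the rows are invented, the column list is trimmed, and the alias latest and column max_ts are names I made up. It only checks that both forms return the same rows.

```python
import sqlite3

# SQLite stands in for MySQL; rows are invented and the column list is trimmed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE machines (hostname TEXT PRIMARY KEY, type TEXT);
    CREATE TABLE storage (hostname TEXT, path TEXT, used_pct INTEGER, timestamp INTEGER);
    INSERT INTO machines VALUES ('nfs1', 'nfs'), ('web1', 'web');
    INSERT INTO storage VALUES
        ('nfs1', '/a', 50, 100),
        ('nfs1', '/a', 60, 200),
        ('nfs1', '/b', 10, 100),
        ('nfs1', '/b', 80, 200),
        ('web1', '/',  99, 200);
""")

# Original form: the subquery depends on the outer row (hostname, path).
correlated = conn.execute("""
    SELECT T.hostname, T.path, T.used_pct, T.timestamp
    FROM storage AS T
    JOIN machines ON T.hostname = machines.hostname
    WHERE T.timestamp = (SELECT MAX(st.timestamp) FROM storage AS st
                         WHERE st.hostname = T.hostname AND st.path = T.path)
      AND machines.type = 'nfs'
    ORDER BY T.used_pct DESC
""").fetchall()

# Rewritten form: compute the per-group maximum once, then join on it.
# "latest" and "max_ts" are hypothetical names chosen for this example.
derived = conn.execute("""
    SELECT T.hostname, T.path, T.used_pct, T.timestamp
    FROM storage AS T
    JOIN (SELECT hostname, path, MAX(timestamp) AS max_ts
          FROM storage
          GROUP BY hostname, path) AS latest
      ON latest.hostname = T.hostname
     AND latest.path = T.path
     AND latest.max_ts = T.timestamp
    JOIN machines ON T.hostname = machines.hostname
    WHERE machines.type = 'nfs'
    ORDER BY T.used_pct DESC
""").fetchall()

print(correlated)
print(derived)
```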

What type of index is ideal for this query?

I have an example query such as:
SELECT
rest.name, rest.shortname
FROM
restaurant AS rest
INNER JOIN specials ON rest.id=specials.restaurantid
WHERE
specials.dateend >= CURDATE()
AND
rest.state='VIC'
AND
rest.status = 1
AND
specials.status = 1
ORDER BY
rest.name ASC;
Of the two indexes below, which would be best on the restaurant table?
id,state,status,name
state,status,name
I'm not sure whether the column used in the join should be included.
Funny enough though, I have created both types for testing and both times MySQL chooses the primary index, which is just id. Why is that?
Explain Output:
1,'SIMPLE','specials','index','NewIndex1\,NewIndex2\,NewIndex3\,NewIndex4','NewIndex4','11',\N,82,'Using where; Using index; Using temporary; Using filesort',
1,'SIMPLE','rest','eq_ref','PRIMARY\,search\,status\,state\,NewIndex1\,NewIndex2\,id-suburb\,NewIndex3\,id-status-name','PRIMARY','4','db_name.specials.restaurantid',1,'Using where'
Not many rows at the moment so perhaps that's why it's choosing PRIMARY!?
For optimum performance, you need at least 2 indexes:
The most important index is the one on the foreign key:
CREATE INDEX specials_rest_fk ON specials(restaurantid);
Without this, your queries will perform poorly, because every row in rest that matches the WHERE conditions will require a full tablescan of specials.
The next index to define would be the one that helps look up the fewest rows of rest given your conditions. MySQL generally uses only one index per table in a query, so you want that index to find as few rows from rest as possible.
My guess, state and status:
CREATE INDEX rest_index_1 ON restaurant(state, status);
Your index suggestion of (id, ...) is pointless, because id is unique. Adding more columns won't help, and in fact would worsen performance if the index were used, because the index entries would be larger and you'd get fewer entries per I/O page read.
You can also write the query more clearly: if you move the conditions on specials into the join's ON condition, the filter sits next to the table it applies to. Note that for an INNER JOIN MySQL's optimizer normally treats conditions in ON and in WHERE the same way, so compare the EXPLAIN output and timings before expecting a speed-up from this alone.
Change your query to this:
SELECT rest.name, rest.shortname
FROM restaurant AS rest
INNER JOIN specials
ON rest.id=specials.restaurantid
AND specials.dateend >= CURDATE()
AND specials.status = 1
WHERE rest.state='VIC'
AND rest.status = 1
ORDER BY rest.name;
Note how the conditions on specials are now in the ON clause.
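For an INNER JOIN, conditions placed in ON and conditions placed in WHERE select the same rows, which is easy to confirm on a toy dataset. The sketch below uses Python's sqlite3 module as a stand-in for MySQL, invented rows, and a bound date parameter in place of CURDATE() (SQLite has no CURDATE()). It checks result equivalence only and makes no claim about speed on MySQL.

```python
import sqlite3

# SQLite stands in for MySQL; rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE restaurant (
        id INTEGER PRIMARY KEY, name TEXT, shortname TEXT, state TEXT, status INTEGER
    );
    CREATE TABLE specials (
        id INTEGER PRIMARY KEY, restaurantid INTEGER, dateend TEXT, status INTEGER
    );
    CREATE INDEX specials_rest_fk ON specials (restaurantid);
    INSERT INTO restaurant VALUES
        (1, 'Alpha', 'al', 'VIC', 1),
        (2, 'Beta',  'be', 'VIC', 1),
        (3, 'Gamma', 'ga', 'NSW', 1),
        (4, 'Delta', 'de', 'VIC', 0);
    INSERT INTO specials (restaurantid, dateend, status) VALUES
        (1, '2030-01-01', 1),  -- current special
        (1, '2030-06-01', 0),  -- inactive special
        (2, '2000-01-01', 1),  -- expired special
        (3, '2030-01-01', 1),  -- wrong state
        (4, '2030-01-01', 1);  -- inactive restaurant
""")

today = ("2024-01-01",)  # fixed date used instead of CURDATE() for a repeatable demo

conditions_in_where = conn.execute("""
    SELECT rest.name, rest.shortname
    FROM restaurant AS rest
    INNER JOIN specials ON rest.id = specials.restaurantid
    WHERE specials.dateend >= ?
      AND rest.state = 'VIC'
      AND rest.status = 1
      AND specials.status = 1
    ORDER BY rest.name ASC
""", today).fetchall()

conditions_in_on = conn.execute("""
    SELECT rest.name, rest.shortname
    FROM restaurant AS rest
    INNER JOIN specials
        ON rest.id = specials.restaurantid
       AND specials.dateend >= ?
       AND specials.status = 1
    WHERE rest.state = 'VIC'
      AND rest.status = 1
    ORDER BY rest.name
""", today).fetchall()

print(conditions_in_where)
print(conditions_in_on)
```

With a LEFT JOIN the two placements would not be interchangeable, so this equivalence is specific to inner joins.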

MySQL join query not using Indexes?

I have this query:
SELECT
COUNT(*) AS `numrows`
FROM (`tbl_A`)
JOIN `tbl_B` ON `tbl_A`.`B_id` = `tbl_B`.`id`
WHERE
`tbl_B`.`boolean_value` <> 1;
I added three indexes, on tbl_A.B_id, tbl_B.id and tbl_B.boolean_value, but MySQL still says it doesn't use indexes (in the queries-not-using-indexes log) and it examines the whole of both tables to retrieve the result.
I need to know what I should do to optimize this query.
EDIT:
Explain output:
id  select_type  table  type  possible_keys          key   key_len  ref       rows  Extra
1   SIMPLE       tbl_B  ALL   PRIMARY,boolean_value  NULL  NULL     NULL      5049  Using where
1   SIMPLE       tbl_A  ref   B_id                   B_id  9        tbl_B.id  9     Using where; Using index
The EXPLAIN output shows that an index (B_id) is used to make the join to tbl_A, but no index is used to filter tbl_B on the boolean value.
An index was available, but the engine chose not to use it. Possible reasons:
Maybe 5049 rows is not a big deal, and the engine estimated that using the index to filter something like 10% of the rows would be about as fast as doing it without the index.
Booleans take only 3 values: 1, 0 or NULL. So the cardinality of the index will always be very low (3 at most). Low-cardinality indexes are usually ignored by the query optimizer (which is usually right in thinking such an index won't help much).
It would be interesting to see whether the optimizer behaves the same way when you have a 50/50 distribution of true and false values for this boolean, or when you have just a few false values.
Now, boolean fields are usually useful only in indexes containing multiple columns, so that if your queries use all the fields of the index (in WHERE or ORDER BY) the optimizer will trust that index to be a really good tool.
Note that indexes slow down your writes and take extra space, so do not add useless indexes. Using log-queries-not-using-indexes is a good thing, but you should weigh that log information against the slow query log. If the query is fast, it's not a problem.
If boolean_value really is a boolean, indexing it is not such a good idea; the index wouldn't be effective.
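Before deciding whether an index on boolean_value is worth keeping, it helps to look at how the values are actually distributed. A minimal sketch, using Python's sqlite3 module as a stand-in for MySQL and an invented 90/10 split; the same two queries can be run on the real tbl_B.

```python
import sqlite3

# SQLite stands in for MySQL; the 90/10 split is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_B (id INTEGER PRIMARY KEY, boolean_value INTEGER)")
conn.executemany(
    "INSERT INTO tbl_B (boolean_value) VALUES (?)",
    [(1,)] * 90 + [(0,)] * 10,
)

# How many rows carry each value?
distribution = dict(
    conn.execute("SELECT boolean_value, COUNT(*) FROM tbl_B GROUP BY boolean_value")
)
total = sum(distribution.values())

# Fraction of rows that the filter `boolean_value <> 1` keeps.
matching = conn.execute(
    "SELECT COUNT(*) FROM tbl_B WHERE boolean_value <> 1"
).fetchone()[0]
matching_fraction = matching / total

print(distribution)
print(matching_fraction)
```

If the filter keeps only a small fraction of the rows, an index is more likely to pay off; with something close to 50/50, a full scan is usually the cheaper choice.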