MySQL JOIN and ORDER BY - performance issue - mysql

I have this query that drives me crazy for quite some time. It has 3 tables (originally it has a lot more but I isolated the performance issue), 1 base table, 1 product table which adds more data, and 1 with product types.
The product types table contains a "max age" column which indicates the maximum age of a row I want to fetch (anything older is considered "archived") and its value is different according to the product type.
My poor performance query goes like this and it takes 50 seconds for a 250,000 rows base table:
(select d_baseservices.ID
from d_baseservices
inner join d_products on d_baseservices.ServiceID = d_products.ServiceID
inner join md_prodtypes on d_products.ProdType = md_prodtypes.ProdType
where
(d_baseservices.CreationDate > (curdate() - INTERVAL md_prodtypes.MaxAge DAY))
order by CreationDate desc
limit 750);
Here is the EXPLAIN of this query:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE md_prodtypes index PRIMARY,ProdType_UNIQUE,ID_MAX_AGE MAX_AGE 5 23 Using index; Using temporary; Using filesort
1 SIMPLE d_products ref PRIMARY,ServiceID_UNIQUE,fk_Products_BaseServices1,fk_d_products_md_prodtypes1 fk_d_products_md_prodtypes1 4 combina.md_prodtypes.ProdType 8625
1 SIMPLE d_baseservices eq_ref PRIMARY,CreationDateDesc_index,CreationDate_index PRIMARY 8 combina.d_products.ServiceID 1 Using where
I found a clue a few days back, when I was able to determine that limiting the query to 750 records would cause is to go fast, but 751 would bring poor performance.
I tried creating indexes of many kinds, with no success.
I tried removing the reference to MAX_AGE and the curdate function and just set a fixed value, with little success as the query now takes 20 seconds:
(select d_baseservices.ID
from d_baseservices
inner join d_products on d_baseservices.ServiceID = d_products.ServiceID
inner join md_prodtypes on d_products.ProdType = md_prodtypes.ProdType
where
(d_baseservices.CreationDate > '2015-09-21 19:02:25')
order by CreationDate desc
limit 750);
And the EXPLAIN command output:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE md_prodtypes index PRIMARY,ProdType_UNIQUE,ID_MAX_AGE ProdType_UNIQUE 4 23 Using index; Using temporary; Using filesort
1 SIMPLE d_products ref PRIMARY,ServiceID_UNIQUE,fk_Products_BaseServices1,fk_d_products_md_prodtypes1 fk_d_products_md_prodtypes1 4 combina.md_prodtypes.ProdType 8625
1 SIMPLE d_baseservices eq_ref PRIMARY,CreationDateDesc_index,CreationDate_index PRIMARY 8 combina.d_products.ServiceID 1 Using where\
Can anyone please help? I'm stuck for almost a month

It's hard to say exactly what to do without knowing more about the specific data you have (how many rows in each table, how many rows you expect the query to return, the distribution of the data values, etc), but I'll make some educated guesses and hopefully point you in the right direction.
First an explanation about why taking md_prodtypes.MaxAge out of the query greatly reduced the run time: Prior to that change the database had no ability at all to filter using indexes because in order to see if rows are candidates for inclusion it had to join the three tables in order to compare CreationDate from the first table to MaxAge in the third table. There is simply no index that you can add to correlate these two values. You're forcing the database engine to look at every single row.
As to the 750 magic number - I'm guessing that past 750 results the database has to page data or that it's hitting some other memory limit based on the values in your specific MySQL configuration file. I wouldn't read too much into that 750 number.
Lastly I'd like to point out that the EXPLAIN of your second query is a bit strange since it's showing md_prodtypes as the first table despite the fact that you took MaxAge out of the WHERE. That means the database is starting from md_prodtypes then moving up to d_products and finally to d_baseservices and only then filtering based on the date. I'm guessing that you're expecting it to first filter on the date then join only when it's decided what baseservices records to include. It's impossible to know why this is happening with the information you've provided. Perhaps you are missing an index.
Another possibility may have to do with variance in your CreationDate column. Let me explain by example: Say you had a table of users, and each user had a gender column that could be either f or m. Let's pretend that we have a 50%/50% split of females and males. Now, if you add an index on the column gender and do a query filtered by WHERE gender='f' expecting that the index will filter out half of the records, you'd be surprised to see that the database will totally ignore the index and just scan the table. The reason being is that it's cheaper to just read the whole table if you know the index isn't filtering out enough (the alternative being jumping constantly from the index to the main table data). In your case, if the WHERE on the CreationDate column doesn't filter out enough records, then even if you have an index on it, it won't be used.

With a constant date...
INDEX(CreationDate)
That will encourage the optimizer to start with the table that can be filtered. Also, since the ORDER BY is on the same field, the WHERE, ORDER BY and LIMIT can all be done at the same time.
Otherwise, it must read all the relevant records from all 3 tables, sort them, then deliver 750 (or 751) of them.
Using MAX_AGE...
Now the optimizer won't know whether it is better to do as above or find all the rows, sort them, then deliver the LIMIT.

Related

MySQL: Efficiency of views containing GROUP BY

The fact that I haven't been able to come up (or research) a solution to this question means that I'm either too stupid to read the docs or it is in fact a complicated problem.
In a rather big database I often need a query like this:
SELECT ... WHERE condition GROUP BY something;
This takes a fraction of a second to complete. So I put this in a VIEW:
CREATE VIEW view_x AS SELECT ... GROUP BY something;
And when I then do
SELECT * FROM view_x WHERE condition;
it takes more than a minute to complete. Now it's easy to see why: In the plain SELECT, the DB engine first selects a few hundred results from millions of records and then does the aggregating and grouping only on the matching records. When using the view, it seems to first evaluate the entire dataset, aggregating and grouping everything, and then returns only the records meeting the condition and throwing away the expensively calculated rest.
Is there a more intelligent VIEW solution, or do I have to use the full SELECT each time?
Thanks.
EDIT: Here's the original SQL code for the view:
CREATE VIEW v_status1 AS SELECT
FROM_UNIXTIME(J.ts_start) AS job_start,
J.id AS job_id, J.carrier, J.n_wafers,
count(W.id) AS n
FROM job AS J
JOIN wafer AS W ON J.id=W.job_id
GROUP BY J.carrier, J.n_wafers, W.status_id;
table job: 100k records, table wafer: 2M records.
Comparison is between these queries:
SELECT * FROM v_status1 WHERE carrier LIKE 'W96L00%'; -- very slow
versus the identical SELECT in the VIEW definition with the WHERE clause before the GROUP BY clause.
Some additional information: The query yields 9 records. Using the view it takes 19 seconds to execute. Using the direct query, it takes 0.000 seconds according to MySQL Workbench.
When I replace the WHERE clause in the direct query by a HAVING clause with the same condition at the end of the query, I end up at the same execution time as the query using the view.
Yes, I forgot some columns in the GROUP BY part. Put them in, doesn't make much of a difference.
Minimal example (5 seconds execution time):
CREATE VIEW v_status2 AS SELECT
job_id,
status_id,
count(id) AS n
FROM wafer
GROUP BY job_id, status_id;
yields 2 records given some job_id
well, I did the obvious and asked MySQL to EXPLAIN. The output is below. My interpretation is what I suspected all along: MySQL first builds a temporary table, doing all the hard work aggregating and grouping, and then selects only the rows matching the selection criteria. In other words, MySQL is not intelligent enough to first analyze the view to find where it can efficiently cull the original dataset and only work on the remaining records.
BTW, this has nothing to do with joins and indexes. You can see the effect with any sufficiently large two-column table.
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 952929 Using where
2 DERIVED WS index PRIMARY ix_waferstatus_text 123 NULL 9 Using index; Using temporary; Using filesort
2 DERIVED W ref ix_wafer_job_id,wafer_ibfk_2 wafer_ibfk_2 5 jobwatch.WS.id 105881 Using where
2 DERIVED J eq_ref PRIMARY,job_ibkf_2 PRIMARY 4 jobwatch.W.job_id 1 Using where
2 DERIVED T eq_ref PRIMARY PRIMARY 4 jobwatch.J.tool_id 1

Condition on joined table faster than condition on reference

I have a query involving two tables: table A has lots of rows, and contains a field called b_id, which references a record from table B, which has about 30 different rows. Table A has an index on b_id, and table B has an index on the column name.
My query looks something like this:
SELECT COUNT(A.id) FROM A INNER JOIN B ON B.id = A.b_id WHERE (B.name != 'dummy') AND <condition>;
With condition being some random condition on table A (I have lots of those, all exhibiting the same behavior).
This query is extremely slow (taking north of 2 seconds), and using explain, shows that query optimizer starts with table B, coming up with about 29 rows, and then scans table A. Doing a STRAIGHT_JOIN, turned the order around and the query ran instantaneously.
I'm not a fan of black magic, so I decided to try something else: come up with the id for the record in B that has the name dummy, let's say 23, and then simplify the query to:
SELECT COUNT(A.id) FROM A WHERE (b_id != 23) AND <condition>;
To my surprise, this query was actually slower than the straight join, taking north of a second.
Any ideas on why the join would be faster than the simplified query?
UPDATE: following a request in the comments, the outputs from explain:
Straight join:
+----+-------------+-------+--------+-----------------+---------+---------+---------------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------+---------+---------+---------------+--------+-------------+
| 1 | SIMPLE | A | ALL | b_id | NULL | NULL | NULL | 200707 | Using where |
| 1 | SIMPLE | B | eq_ref | PRIMARY,id_name | PRIMARY | 4 | schema.A.b_id | 1 | Using where |
+----+-------------+-------+--------+-----------------+---------+---------+---------------+--------+-------------+
No join:
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | A | ALL | b_id | NULL | NULL | NULL | 200707 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
UPDATE 2:
Tried another variant:
SELECT COUNT(A.id) FROM A WHERE b_id IN (<all the ids except for 23>) AND <condition>;
This runs faster than the no join, but still slower than the join, so it seems that the inequality operation is responsible for part of the performance hit, but not all.
If you are using MySQL 5.6 or later then you can ask the query optimizer what it is doing;
SET optimizer_trace="enabled=on";
## YOUR QUERY
SELECT COUNT(*) FROM transactions WHERE (id < 9000) and user != 11;
##END YOUR QUERY
SELECT trace FROM information_schema.optimizer_trace;
SET optimizer_trace="enabled=off";
You will almost certainly need to refer to the following sections in the MySQL reference Tracing the Optimiser and The Optimizer
Looking at the first explain it appears that the query is quicker probably because the optimizer can use the table B to filter down to the rows required based on the join and then use the foreign key to get the rows in table A.
In the explain it's this bit that is interesting; there is only one row matching and it's using schema.A.b_id. Effectively this is pre-filtering the rows from A which is where I think the performance difference comes from.
| ref | rows | Extra |
| schema.A.b_id | 1 | Using where |
So, as is usual with queries it is all down to indexes - or more accurately missing indexes. Just because you have indexes on individual fields it doesn't necessarily mean that these are suitable for the query you're running.
Basic rule: If the EXPLAIN doesn't say Using Index then you need to add a suitable index.
Looking at the explain output the first interesting thing is ironically the last thing on each line; namely the Extra
In the first example we see
| 1 | SIMPLE | A | .... Using where |
| 1 | SIMPLE | B | ... Using where |
Both of these Using where is not good; ideally at least one, and preferably both should say Using index
When you do
SELECT COUNT(A.id) FROM A WHERE (b_id != 23) AND <condition>;
and see Using where then you need to add an index as it's doing a table scan.
for example if you did
EXPLAIN SELECT COUNT(A.id) FROM A WHERE (Id > 23)
You should see Using where; Using index (assuming here that Id is the primary key and has an index)
If you then added a condition onto the end
EXPLAIN SELECT COUNT(A.id) FROM A WHERE (Id > 23) and Field > 0
and see Using where then you need to add an index for the two fields. Just having an index on a field doesn't mean that MySQL will be able to use that index during the query across multiple fields - this is something that internally the query optimizer will decide upon. I'm not exactly certain of the internal rules; but generally adding an extra index to match the query helps immensely.
So adding an index (on the two fields in the query above):
ALTER TABLE `A` ADD INDEX `IndexIdField` (`Id`,`Field`)
should change it such that when querying based upon those two fields there is an index.
I've tried this on one of my databases that has Transactions and User tables.
I'll use this query
EXPLAIN SELECT COUNT(*) FROM transactions WHERE (id < 9000) and user != 11;
Running without index on the two fields:
PRIMARY,user PRIMARY 4 NULL 14334 Using where
Then add an index:
ALTER TABLE `transactions` ADD INDEX `IndexIdUser` (`id`, `user`);
Then the same query again and this time
PRIMARY,user,Index 4 Index 4 4 NULL 12628 Using where; Using index
This time it's using the indexes - and as a result will be a lot quicker.
From comments by #Wrikken - and also bear in mind that I don't have the accurate schema / data so some of this investigation has required assumptions about the schema (which may be wrong)
SELECT COUNT(A.id) FROM A FORCE INDEX (b_id)
would perform at least as good as
SELECT COUNT(A.id) FROM A INNER JOIN B ON A.b_id = B.id.
If we look at the first EXPLAIN in the OP we see that there are two elements to the query. Referring to the EXPLAIN documentation for *eq_ref* I can see that this is going to define the rows for consideration based on this relationship.
The order of the explain output doesn't necessarily mean it's doing one and then the other; it's simply what has been chosen to execute the query (at least as far as I can tell).
For some reason the query optimizer has decided not to use the index on b_id - I'm assuming here that because of the query the optimizer has decided that it will be more efficient to do a table scan.
The second explain concerns me a little because it's not considering the index on b_id; possibly because of the AND <condition> (which is omitted so I'm guessing as to what it could be). When I try this with an index on b_id it does use the index; but as soon as a condition is added it doesn't use the index.
So, when doing
SELECT COUNT(A.id) FROM A INNER JOIN B ON A.b_id = B.id.
This all indicates to me is that the PRIMARY index on B is where the speed difference is coming from; I'm assuming because of the schema.A.b_id in the explain that there is a Foreign key on this table; which must be a better collection of related rows than the index on b_id - so the query optimizer can use this relationship to define which rows to pick - and because a primary index is better than secondary indexes it's going to be much quicker to select rows out of B and then use the relationship link to match against the rows in A.
I do not see any strange behavior here. What you need is to understand the basics of how MySQL uses indexes. Here is an article I usually recommend: 3 ways MySQL uses indexes.
It is always funny to observe people writing things like WHERE (B.name != 'dummy') AND <condition>, because this AND <condition> might be the reason why MySQL optimizer chose the specific index, and there is no valid reason to compare the performance of the query with that of another one with WHERE b_id != 23 AND <condition>, because the two queries usually need different indexes to perform good.
One thing you should understand, is that MySQL likes equality comparisons, and does not like range conditions and inequality comparisons. It is usually better to specify the correct values than to use a range condition or specify a != value.
So, let's compare the two queries.
With straight join
For each row in the A.id order (which is the primary key and is clustered, that is data is stored in its order on disk) take data for the row from disk to check if your <condition> is met and b_id, then (I repeat for each matching row) find the appropriate row for b_id, go on disk, take b.name, compare it with 'dummy'. Even though this plan in not at all efficient, you have only 200000 rows in your A table, so that it seems rather performant.
Without straight join
For each row in table B compare if name is matching, look into the A.b_id index (which is obviously sorted by b_id, since it is an index, hence contains A.ids in random order), and for each A.id for the given A.b_id find the corresponding A row on disk to check the <condition>, if it matches count the id, otherwise, discard the row.
As you see, there is nothing strange in the fact that the second query takes so long, you basically force MySQL to randomly access almost each row in A table, where in the first query you read the A table in the order it is stored on disk.
The query with no join does not use any index at all. It actually should take about the same as the query with straight join. My guess is that the order of the b_id!=23 and <condition> is significant.
UPD1: Could you still compare the performance of your query without join with the following:
SELECT COUNT(A.id)
FROM A
WHERE IF(b_id!=23, <condition>, 0);
UPD2: the fact the you do not see an index in EXPLAIN does not mean that no index is used at all. An index is at least used to define the reading order: when there is no other useful index, it is usually the primary key, but, as I said above, when there is an equility condition and the corresponding index, MySQL will use the index. So, basically, to understand which index is used you can look at the order in which rows are output. If the order is the same as the primary key, than no index was used (that is the primary key index was used), if the order of rows is shuffled - than there was some other index involved.
In your case, the second condition seems to be true for most of the rows, but the index is still used, that is to get b_id MySQL goes on disk in random order, that's why it is slow. No black magic here, and this second condition does affect the performance.
Probably this should be a comment rather than an answer but it will be a bit long.
First of all, it is hard to believe that two queries that have (almost) exactly the same explain run at different speed. Furthermore, this is less likely if the one with the extra line in the explain runs faster. And I guess the word faster is the key here.
You've compared speed (the time it takes for a query to finish) and that is an extremely empiric way of testing. For example, you could have improperly disabled the cache, which makes that comparison useless. Not to mention that your <insert your preferred software application here> could have made a page fault or any other operation at the time you've run the test that could have resulted in a decrease of the query speed.
The right way of measuring query performance is based on the explain (that's why it is there)
So the closest thing I have to answer the question: Any ideas on why the join would be faster than the simplified query?... is, in short, a layer 8 error.
I do have some other comments, though, that should be taken into account in order to speed things up. If A.id is a primary key (the name smells like it is), according to your explain, why does the count(A.id) have to scan all the rows? It should be able to get the data directly from the index but I don't see the Using index in the extra flags. It seems you don't even have a unique index on and that it is not a non nullable field. That also smells odd. Make sure that the field is not null and that there is a unique index on it, run the explain again, confirm the extra flags contain the Using index and then (properly) time the query. It should run much faster.
Also note that an approach that would result in the same performance improvement as I mentioned above would be to replace count(A.id) with count(*).
Just my 2 cents.
Because MySQL will not use index for index!=val in where.
The optimizer will decide to use an index by guessing. As a "!=" will more likely fetch everything, it skip and prevent using index to reduce overhead. (yes, mysql is stupid, and it does not statistic index column)
You may do a faster SELECT, by using index in(everything other then val), that MySQL will learn to use the index.
Example here showing query optimizer will choose to not use index by value
The answer to this question is actually a very simple consequence of algorithm design:
The key difference between these two queries is the merge operation.
Before I give a lesson on algorithms, I will mention the reason why the merge operation improves the performance. The merge improves the performance because it reduces the overall load on the aggregation. This is an iteration vs recursion issue. In the iteration analogy, we are simply looping through the entire index and counting the matches. In the recursion analogy, we are dividing and conquering (so to speak); or in other words, we are filtering the results that we need to count, thus reducing the volume of numbers we actually need to count.
Here are the key questions:
Why is a merge sort faster than an insertion sort?
Is a merge sort always faster than an insertion sort?
Let's explain this with a parable:
Let's say we have a deck of playing cards, and we need to sum the numbers of playing cards that have the numbers 7, 8 and 9 (assuming we don't know the answer in advance).
Let's say that we decide upon two ways to solve this problem:
We can hold the deck in one hand and move the cards to the table, one by one, counting as we go.
We can separate the cards into two groups: black suits and red suits. Then we can perform step 1 upon one of the groups and reuse the results for the second group.
If we choose option 2, then we have divided our problem in half. As a consequence, we can count the matching black cards and multiply the number by 2. In other words, we are re-using the part of the query execution plan that required the counting. This reasoning especially works when we know in advance how the cards were sorted (aka "clustered index"). Counting half of the cards is obviously much less time consuming than counting the entire deck.
If we wanted to improve the performance yet again, depending on how large the size of our database is, we may even further consider sorting into four groups (instead of two groups): clubs, diamonds, hearts, and spades. Whether or not we want to perform this further step depends on whether or not the overhead of sorting the cards into the additional groups is justified by the performance gain. In small numbers of cards, the performance gain is likely not worth the extra overhead required to sort into the different groups. As the number of cards grows, the performance gain begins to outweigh the overhead cost.
Here is an excerpt from "Introduction to Algorithms, 3rd edition," (Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein):
(Note: If someone can tell me how to format the sub-notation, I will edit this to improve readability.)
(Also, keep in mind that "n" is the number of objects we are dealing with.)
"As an example, in Chapter 2, we will see two algorithms for sorting.
The first, known as insertion sort, takes time roughly equal to c1n2
to sort n items, where c1 is a constant that does not depend on n.
That is, it takes time roughly proportional to n2. The second, merge
sort, takes time roughly equal to c2n lg n, where lg n stands for
log2 n and c2 is another constant that also does not depend on n.
Insertion sort typically has a smaller constant factor than merge
sort, so that c1 < c2. We shall see that the constant factors can
have far less of an impact on the running time than the dependence on
the input size n. Let’s write insertion sort’s running time as c1n ·
n and merge sort’s running time as c2n · lg n. Then we see that where
insertion sort has a factor of n in its running time, merge sort has
a factor of lg n, which is much smaller. (For example, when n = 1000,
lg n is approximately 10, and when n equals one million, lg n is
approximately only 20.) Although insertion sort usually runs faster
than merge sort for small input sizes, once the input size n becomes
large enough, merge sort’s advantage of lg n vs. n will more than
compensate for the difference in constant factors. No matter how much
smaller c1 is than c2, there will always be a crossover point beyond
which merge sort is faster."
Why is this relevant? Let us look at the query execution plans for these two queries. We will see that there is a merge operation caused by the inner join.

How to prevent MySQL selecting one index when a better one is available?

I have a table with 30,000 rows (and growing), which I join with another table. One some pages, I need to run a some 100+ of those queries, and things get slow. If I EXPLAIN the query, I notice that one table uses a primary key and is fast, but another table using one of its indexes, which is not the best one. Here's an overview:
SIMPLE | acc_entries | ref | ledger,date,type,status,status_ledger_date_type | type | 1 | const | 15359 | Using where
This is a sample query:
SELECT SUM(usd) AS total FROM acc_entries
LEFT JOIN acc_ledgers ON acc_entries.ledger = acc_ledgers.id
WHERE acc_entries.status = 1 AND
acc_ledgers.account = 3004 AND
date >= '2011-01-01' AND
date <= '2011-08-30' AND
type = 'credit'
As you can see, I am using in my WHERE the fields status, ledger (which is the field that joins with acc_ledgers.account), date and type. All of these fields have indexes. However, there is also a specific index that is used for all of them, in that same order. It is called status_ledger_data_type, and as you can see it is one of the indexes that MySQL considers using. However, at the end MySQL opts to use type as an index. This has some 15,000 possible rows (half of the table), whereas the other combined index only features a fraction of this. So my questions is: why does MySQL selects this index when a better one is available, and how can I prevent this?
You can try using index hints to force the use of your desired index.
MySql docs on Index Hints
The Battle Between Force Index and the Query Optimizer
7 ways to convince MySQL to use the right index
Actually, you want your index based on your smaller granularity. The Ledger from your Acc_Entries table will join to your ACC_Ledgers table on ITS primary index of ID, so the Acc_Ledgers is not really utilizing the Ledger portion for the WHERE clause. Your index should match as closely to the WHERE clause of your common queries. In this case, I would have an index on
(Account, Status, Type, Date)
The reason for Account being first, smaller result set. You could have 5,000 entries. Of those, 300 entries for the one account accounts, so you've already eliminated a huge amount of data to go through. Then, the Status... of the 300, you could have 100 # status 1, 100 # status 2, 100 # status 3, so you've now reduced the set even more, etc by other criteria of type and date.
Your query otherwise is completely fine... just a personal style in writing, I try to write my queries with the WHERE conditions as closely matching the index in same sequence too, so I would just have the Account clause first, then Status, Type and Date... but again, thats a personal style in writing queries.

Strange Performance Issues with INNER JOIN vs. LEFT JOIN

I was using a query that looked similar to this one:
SELECT `episodes`.*, IFNULL(SUM(`views_sum`.`clicks`), 0) as `clicks`
FROM `episodes`, `views_sum`
WHERE `views_sum`.`index` = "episode" AND `views_sum`.`key` = `episodes`.`id`
GROUP BY `episodes`.`id`
... which takes ~0.1s to execute. But it's problematic, because some episodes don't have a corresponding views_sum row, so those episodes aren't included in the result.
What I want is NULL values when a corresponding views_sum row doesn't exist, so I tried using a LEFT JOIN instead:
SELECT `episodes`.*, IFNULL(SUM(`views_sum`.`clicks`), 0) as `clicks`
FROM `episodes`
LEFT JOIN `views_sum` ON (`views_sum`.`index` = "episode" AND `views_sum`.`key` = `episodes`.`id`)
GROUP BY `episodes`.`id`
This query produces the same columns, and it also includes the few rows missing from the 1st query.
BUT, the 2nd query takes 10 times as long! A full second.
Why is there such a huge discrepancy between the execution times when the result is so similar? There's nowhere near 10 times as many rows — it's like 60 from the 1st query, and 70 from the 2nd. That's not to mention that the 10 additional rows have no views to sum!
Any light shed would be greatly appreciated!
(There are indexes on episodes.id, views_sum.index, and views_sum.key.)
EDIT:
I copy-pasted the SQL from above, and here are the EXPLAINs, in order:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE views_sum ref index,key index 27 const 6532 Using where; Using temporary; Using filesort
1 SIMPLE episodes eq_ref PRIMARY PRIMARY 4 db102914_itw.views_sum.key 1 Using where
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE episodes ALL NULL NULL NULL NULL 70 Using temporary; Using filesort
1 SIMPLE views_sum ref index,key index 27 const 6532
Here's the query I ultimately came up with, after many, many iterations. (The SQL_NO_CACHE flag is there so I can test execution times.)
SELECT SQL_NO_CACHE e.*, IFNULL(SUM(vs.`clicks`), 0) as `clicks`
FROM `episodes` e
LEFT JOIN
(SELECT * FROM `views_sum` WHERE `index` = "episode") vs
ON vs.`key` = e.`id`
GROUP BY e.`id`
Because the ON condtion views_sum.index = "episode" is static, i.e., isn't dependent on the row it's joined to, I was able to get a massive performance boost by first using a subquery to limit the views_sum table before joining.
My query now takes ~0.2s. And what's even better, the time doesn't grow as you increase the offset of the query (unlike my first LEFT JOIN attempt). It stays the same, even if you do a sort on the clicks column.
You should have a combined index on views_sum.index and views_sum.key. I suspect you will always use both fields together if i look at the names. Also, I would rewrite the first query to use a proper INNER JOIN clause instead of a filtered cartesian product.
I suspect the performance of both queries will be much closer together if you do this. And, more importantly, much faster than they are now.
edit: Thinking about it, I would probably add a third column to that index: views_sum.clicks, which probably can be used for the SUM. But remember that multi-column indexes can only be used left to right.
It's all about the indexes. You'll have to play around with it a bit or post your database schema on here. Just as a rough guess i'd say you should make sure you have an index on views_sum.key.
Normally, a LEFT JOIN will be slower than an INNER JOIN or a CROSS JOIN because it has to view the first table differently. Put another way, the difference in time isn't related to the size of the result, but the full size of the left table.
I also wonder if you're asking MySQL to figure things out for you that you should be doing yourself. Specifically, that SUM() function would normally require a GROUP BY clause.

what index(es) needs to be added for this query to work properly?

This query pops up in my slow query logs:
SELECT
COUNT(*) AS ordersCount,
SUM(ItemsPrice + COALESCE(extrasPrice, 0.0)) AS totalValue,
SUM(ItemsPrice) AS totalValue,
SUM(std_delivery_charge) AS totalStdDeliveryCharge,
SUM(extra_delivery_charge) AS totalExtraDeliveryCharge,
this_.type AS y5_,
this_.transmissionMethod AS y6_,
this_.extra_delivery AS y7_
FROM orders this_
WHERE this_.deliveryDate BETWEEN '2010-01-01 00:00:00' AND '2010-09-01 00:00:00'
AND this_.status IN(1, 3, 2, 10, 4, 5, 11)
AND this_.senderShop_id = 10017
GROUP BY this_.type, this_.transmissionMethod, this_.extra_delivery
ORDER BY this_.deliveryDate DESC;
The table is InnoDB and has about 880k rows and takes between 9-12 seconds to execute. I tried adding the following index ALTER TABLE orders ADD INDEX _deliverydate_senderShopId_status ( deliveryDate , senderShop_id , status, type, transmissionMethod, extra_delivery); with no practical gains. Any help and/or suggestion is welcomed
Here is the query execution plan right now:
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE this_ ref FKC3DF62E57562BA6F 8 const 139894 100.00 Using where; Using temporary; Using filesort
I took out the possible_keys value out of the text because i think it listed all the indexes in the table. The key used (FKC3DF62E57562BA6F) looks like
Keyname Type Unique Packed Field Cardinality Collation Null Comment
FKC3DF62E57562BA6F BTREE No No senderShop_id 4671 A
I'll tell you one thing that you can look at for increasing the speed.
You only generally have NULL values in the data for either unknown or non-applicable rows. It appears to me that, since you're treating NULL as 0 anyway, you should think about getting rid of them and making sure that all extrasPrice values are 0 where they were previously NULL so that you can get rid of the time penalty of the coalesce.
In fact, you could go one step further and introduce another column called totalPrice which you set with an insert/update trigger to the actual value ItemsPrice + extrasPrice or (ItemsPrice + COALESCE(extrasPrice,0.0) if you still need nullability of extrasPrice).
Then, you can simply use:
SELECT
COUNT(*) AS ordersCount,
SUM(totalPrice) AS totalValue,
SUM(ItemsPrice) AS totalValue2,
:
(I'm not sure you should have two output columns with the same name or whether that was a typo, that's going to be, at worst, an error, at best, confusing).
This moves the cost of the calculation to insert/update time rather than select time and amortises that cost over all the selects - most database tables are read far more often than written. The consistency of the data is maintained due to the trigger and the performance should be better, at the cost of some storage requirements.
But, since the vast majority of database questions are "How can I get more speed?" rather than "How can I use less disk?", that's often a good idea.
Another suggestion is to provide a non-composite index on the column that reduces your result set the fastest (high cardinality). In other words, if you store only two weeks worth of data (14 different dates) in your table but 400 different shops, you should have an index on senderShop_id and make sure your statistics are up to date.
This should cause the DBMS execution engine to whittle down the result set using that key so that subsequent operations are faster.
A composite index on deliveryDate,senderShop_id,... will not be able to use senderShop_id to whittle down the results because the key ordering will be senderShop_id within deliveryDate.