I have a query involving two tables: table A has lots of rows, and contains a field called b_id, which references a record from table B, which has about 30 different rows. Table A has an index on b_id, and table B has an index on the column name.
My query looks something like this:
SELECT COUNT(A.id) FROM A INNER JOIN B ON B.id = A.b_id WHERE (B.name != 'dummy') AND <condition>;
With condition being some random condition on table A (I have lots of those, all exhibiting the same behavior).
This query is extremely slow (taking north of 2 seconds), and EXPLAIN shows that the query optimizer starts with table B, coming up with about 29 rows, and then scans table A. Using STRAIGHT_JOIN turned the order around and the query ran instantaneously.
I'm not a fan of black magic, so I decided to try something else: come up with the id for the record in B that has the name dummy, let's say 23, and then simplify the query to:
SELECT COUNT(A.id) FROM A WHERE (b_id != 23) AND <condition>;
To my surprise, this query was actually slower than the straight join, taking north of a second.
Any ideas on why the join would be faster than the simplified query?
UPDATE: following a request in the comments, the outputs from explain:
Straight join:
+----+-------------+-------+--------+-----------------+---------+---------+---------------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------+---------+---------+---------------+--------+-------------+
| 1 | SIMPLE | A | ALL | b_id | NULL | NULL | NULL | 200707 | Using where |
| 1 | SIMPLE | B | eq_ref | PRIMARY,id_name | PRIMARY | 4 | schema.A.b_id | 1 | Using where |
+----+-------------+-------+--------+-----------------+---------+---------+---------------+--------+-------------+
No join:
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | A | ALL | b_id | NULL | NULL | NULL | 200707 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
UPDATE 2:
Tried another variant:
SELECT COUNT(A.id) FROM A WHERE b_id IN (<all the ids except for 23>) AND <condition>;
This runs faster than the no join, but still slower than the join, so it seems that the inequality operation is responsible for part of the performance hit, but not all.
If you are using MySQL 5.6 or later, you can ask the query optimizer what it is doing:
SET optimizer_trace="enabled=on";
## YOUR QUERY
SELECT COUNT(*) FROM transactions WHERE (id < 9000) and user != 11;
##END YOUR QUERY
SELECT trace FROM information_schema.optimizer_trace;
SET optimizer_trace="enabled=off";
You will almost certainly need to refer to the following sections in the MySQL reference manual: Tracing the Optimizer and The Optimizer.
Looking at the first EXPLAIN, it appears that the query is quicker probably because the optimizer can use table B to filter down to the rows required based on the join, and then use the foreign key to get the matching rows in table A.
In the EXPLAIN it's this bit that is interesting: there is only one matching row and it is looked up using schema.A.b_id. Effectively this pre-filters the rows from A, which is where I think the performance difference comes from.
| ref | rows | Extra |
| schema.A.b_id | 1 | Using where |
So, as is usual with queries, it all comes down to indexes - or more accurately, missing indexes. Just because you have indexes on individual fields doesn't necessarily mean that they are suitable for the query you're running.
Basic rule: If the EXPLAIN doesn't say Using Index then you need to add a suitable index.
Looking at the EXPLAIN output, the first interesting thing is, ironically, the last thing on each line; namely the Extra column.
In the first example we see
| 1 | SIMPLE | A | .... Using where |
| 1 | SIMPLE | B | ... Using where |
Both of these say Using where, which is not good; ideally at least one, and preferably both, should say Using index.
When you do
SELECT COUNT(A.id) FROM A WHERE (b_id != 23) AND <condition>;
and see Using where, then you need to add an index, because the query is doing a table scan.
For example, if you did
EXPLAIN SELECT COUNT(A.id) FROM A WHERE (Id > 23)
You should see Using where; Using index (assuming here that Id is the primary key and has an index)
If you then added a condition onto the end
EXPLAIN SELECT COUNT(A.id) FROM A WHERE (Id > 23) and Field > 0
and see Using where, then you need to add an index covering the two fields. Just having an index on a field doesn't mean that MySQL will be able to use it when the query filters across multiple fields - that is something the query optimizer decides internally. I'm not exactly certain of the internal rules, but generally adding an extra index that matches the query helps immensely.
So adding an index (on the two fields in the query above):
ALTER TABLE `A` ADD INDEX `IndexIdField` (`Id`,`Field`)
should change it such that when querying based upon those two fields there is an index.
I've tried this on one of my databases that has Transactions and User tables.
I'll use this query
EXPLAIN SELECT COUNT(*) FROM transactions WHERE (id < 9000) and user != 11;
Running without index on the two fields:
PRIMARY,user PRIMARY 4 NULL 14334 Using where
Then add an index:
ALTER TABLE `transactions` ADD INDEX `IndexIdUser` (`id`, `user`);
Then the same query again and this time
PRIMARY,user,Index 4 Index 4 4 NULL 12628 Using where; Using index
This time it's using the indexes - and as a result will be a lot quicker.
From comments by @Wrikken - and also bear in mind that I don't have the actual schema / data, so some of this investigation has required assumptions about the schema (which may be wrong):
SELECT COUNT(A.id) FROM A FORCE INDEX (b_id)
would perform at least as well as
SELECT COUNT(A.id) FROM A INNER JOIN B ON A.b_id = B.id.
If we look at the first EXPLAIN in the OP we see that there are two elements to the query. Referring to the EXPLAIN documentation for *eq_ref* I can see that this is going to define the rows for consideration based on this relationship.
The order of the EXPLAIN output doesn't necessarily mean it does one and then the other; it is simply the plan that has been chosen to execute the query (at least as far as I can tell).
For some reason the query optimizer has decided not to use the index on b_id - I'm assuming that, given the query, the optimizer has decided a table scan will be more efficient.
The second explain concerns me a little because it's not considering the index on b_id; possibly because of the AND <condition> (which is omitted so I'm guessing as to what it could be). When I try this with an index on b_id it does use the index; but as soon as a condition is added it doesn't use the index.
So, when doing
SELECT COUNT(A.id) FROM A INNER JOIN B ON A.b_id = B.id
everything indicates to me that the PRIMARY index on B is where the speed difference comes from. Judging by the schema.A.b_id in the EXPLAIN, I'm assuming there is a foreign key on this table, which must give a better collection of related rows than the index on b_id alone - so the query optimizer can use this relationship to decide which rows to pick, and because a primary index is better than secondary indexes it is much quicker to select rows out of B and then use the relationship link to match against the rows in A.
I do not see any strange behavior here. What you need is to understand the basics of how MySQL uses indexes. Here is an article I usually recommend: 3 ways MySQL uses indexes.
It is always funny to observe people writing things like WHERE (B.name != 'dummy') AND <condition>, because this AND <condition> might be the very reason the MySQL optimizer chose the specific index, and there is no valid reason to compare the performance of that query with another one using WHERE b_id != 23 AND <condition>, because the two queries usually need different indexes to perform well.
One thing you should understand is that MySQL likes equality comparisons, and does not like range conditions and inequality comparisons. It is usually better to specify the exact values than to use a range condition or a != comparison.
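As a minimal sketch of that advice applied to the query in the question (the condition column, here called cond_col, the index name and the id list are all hypothetical; adjust them to your schema):
ALTER TABLE A ADD INDEX idx_bid_cond (b_id, cond_col);
SELECT COUNT(*) FROM A
WHERE b_id IN (1, 2, 3, 4, 5)   -- enumerate the ~29 real ids instead of writing b_id != 23
  AND cond_col > 0;             -- stand-in for <condition>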
So, let's compare the two queries.
With straight join
For each row in A.id order (which is the primary key and is clustered, that is, the data is stored on disk in that order), fetch the row from disk to check whether your <condition> is met and to read b_id, then (I repeat, for each matching row) find the appropriate B row for that b_id, go to disk, fetch b.name, and compare it with 'dummy'. Even though this plan is not at all efficient, you have only 200,000 rows in your A table, so it seems rather performant.
Without straight join
For each row in table B, check whether the name matches, then look into the A.b_id index (which is obviously sorted by b_id, since it is an index, and hence lists A.ids in effectively random order), and for each A.id for the given b_id, fetch the corresponding A row from disk to check the <condition>; if it matches, count the id, otherwise discard the row.
As you see, there is nothing strange in the fact that the second query takes so long: you basically force MySQL to randomly access almost every row in the A table, whereas in the first query you read the A table in the order it is stored on disk.
The query with no join does not use any index at all. It should actually take about the same time as the query with the straight join. My guess is that the order of the b_id != 23 and <condition> predicates is significant.
UPD1: Could you still compare the performance of your query without join with the following:
SELECT COUNT(A.id)
FROM A
WHERE IF(b_id!=23, <condition>, 0);
UPD2: the fact that you do not see an index in EXPLAIN does not mean that no index is used at all. An index is at least used to define the reading order: when there is no other useful index, it is usually the primary key, but, as I said above, when there is an equality condition and a corresponding index, MySQL will use that index. So, basically, to understand which index is used you can look at the order in which rows are output. If the order is the same as the primary key, then no other index was used (that is, the primary key index was used); if the order of rows is shuffled, then some other index was involved.
In your case, the second condition seems to be true for most of the rows, but the index is still used; that is, to get b_id MySQL goes to disk in random order, and that's why it is slow. No black magic here, and this second condition does affect the performance.
Probably this should be a comment rather than an answer but it will be a bit long.
First of all, it is hard to believe that two queries with (almost) exactly the same EXPLAIN run at different speeds. Furthermore, this is even less likely when the one with the extra line in the EXPLAIN runs faster. And I guess the word faster is the key here.
You've compared speed (the time it takes for a query to finish), and that is an extremely empirical way of testing. For example, you could have improperly disabled the cache, which makes the comparison useless. Not to mention that your <insert your preferred software application here> could have triggered a page fault or some other operation at the moment you ran the test, decreasing the measured query speed.
The right way of measuring query performance is based on the EXPLAIN (that's why it is there).
So the closest thing I have to answer the question: Any ideas on why the join would be faster than the simplified query?... is, in short, a layer 8 error.
I do have some other comments, though, that should be taken into account in order to speed things up. If A.id is a primary key (the name suggests it is), then, according to your EXPLAIN, why does COUNT(A.id) have to scan all the rows? It should be able to get the data directly from the index, but I don't see Using index in the Extra flags. It seems you don't even have a unique index on it and that it is a nullable field. That also smells odd. Make sure that the field is NOT NULL and that there is a unique index on it, run the EXPLAIN again, confirm that the Extra flags contain Using index, and then (properly) time the query. It should run much faster.
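A sketch of that suggestion (the column type and index name are assumptions; if id is already the primary key, the ALTERs are unnecessary):
ALTER TABLE A MODIFY id INT NOT NULL;
ALTER TABLE A ADD UNIQUE INDEX uq_a_id (id);
EXPLAIN SELECT COUNT(A.id) FROM A WHERE b_id != 23;   -- re-run and look for "Using index" under Extra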
Also note that an approach that would result in the same performance improvement as I mentioned above would be to replace count(A.id) with count(*).
Just my 2 cents.
Because MySQL will not use an index for a column != value condition in the WHERE clause.
The optimizer decides whether to use an index by estimating. Since a != condition will most likely match nearly everything, it skips the index to avoid the overhead (yes, MySQL is stupid here, and it does not collect statistics on the index column's values).
You may get a faster SELECT by using IN (every value other than val), so that MySQL will decide to use the index.
There is an example here showing the query optimizer choosing not to use an index depending on the value.
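A quick way to see this yourself on the question's table (the exact plan will depend on your data, but the pattern is typical):
EXPLAIN SELECT COUNT(*) FROM A WHERE b_id != 23;             -- usually type=ALL, key=NULL: full scan
EXPLAIN SELECT COUNT(*) FROM A WHERE b_id IN (1, 2, 3, 4);   -- usually type=range, key=b_id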
The answer to this question is actually a very simple consequence of algorithm design:
The key difference between these two queries is the merge operation.
Before I give a lesson on algorithms, I will mention the reason why the merge operation improves the performance. The merge improves performance because it reduces the overall load on the aggregation. This is an iteration versus recursion issue. In the iteration analogy, we simply loop through the entire index and count the matches. In the recursion analogy, we divide and conquer (so to speak); in other words, we filter the results that we need to count, thus reducing the volume of numbers we actually need to count.
Here are the key questions:
Why is a merge sort faster than an insertion sort?
Is a merge sort always faster than an insertion sort?
Let's explain this with a parable:
Let's say we have a deck of playing cards, and we need to count the playing cards that show the numbers 7, 8 and 9 (assuming we don't know the answer in advance).
Let's say that we decide upon two ways to solve this problem:
We can hold the deck in one hand and move the cards to the table, one by one, counting as we go.
We can separate the cards into two groups: black suits and red suits. Then we can perform step 1 upon one of the groups and reuse the results for the second group.
If we choose option 2, then we have divided our problem in half. As a consequence, we can count the matching black cards and multiply the number by 2. In other words, we are re-using the part of the query execution plan that required the counting. This reasoning especially works when we know in advance how the cards were sorted (aka "clustered index"). Counting half of the cards is obviously much less time consuming than counting the entire deck.
If we wanted to improve the performance yet again, depending on how large the size of our database is, we may even further consider sorting into four groups (instead of two groups): clubs, diamonds, hearts, and spades. Whether or not we want to perform this further step depends on whether or not the overhead of sorting the cards into the additional groups is justified by the performance gain. In small numbers of cards, the performance gain is likely not worth the extra overhead required to sort into the different groups. As the number of cards grows, the performance gain begins to outweigh the overhead cost.
Here is an excerpt from "Introduction to Algorithms, 3rd edition," (Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein):
(Note: If someone can tell me how to format the sub-notation, I will edit this to improve readability.)
(Also, keep in mind that "n" is the number of objects we are dealing with.)
"As an example, in Chapter 2, we will see two algorithms for sorting.
The first, known as insertion sort, takes time roughly equal to c1n2
to sort n items, where c1 is a constant that does not depend on n.
That is, it takes time roughly proportional to n2. The second, merge
sort, takes time roughly equal to c2n lg n, where lg n stands for
log2 n and c2 is another constant that also does not depend on n.
Insertion sort typically has a smaller constant factor than merge
sort, so that c1 < c2. We shall see that the constant factors can
have far less of an impact on the running time than the dependence on
the input size n. Let’s write insertion sort’s running time as c1n ·
n and merge sort’s running time as c2n · lg n. Then we see that where
insertion sort has a factor of n in its running time, merge sort has
a factor of lg n, which is much smaller. (For example, when n = 1000,
lg n is approximately 10, and when n equals one million, lg n is
approximately only 20.) Although insertion sort usually runs faster
than merge sort for small input sizes, once the input size n becomes
large enough, merge sort’s advantage of lg n vs. n will more than
compensate for the difference in constant factors. No matter how much
smaller c1 is than c2, there will always be a crossover point beyond
which merge sort is faster."
Why is this relevant? Let us look at the query execution plans for these two queries. We will see that there is a merge operation caused by the inner join.
So this might be a bit silly, but the alternative I was using is worse. I am trying to write an Excel sheet using data from my database and a PHP tool called Box/Spout. The thing is that Box/Spout reads rows one at a time; they are not retrieved via index (e.g. rows[10], rows[42], rows[156]).
I need to retrieve data from the database in the order the rows come out. I have a database with a list of customers, that came in via Import and I have to write them into the excel spreadsheet. They have phone numbers, emails, and an address. Sorry for the confusion... :/ So I compiled this fairly complex query:
SELECT
`Import`.`UniqueID`,
`Import`.`RowNum`,
`People`.`PeopleID`,
`People`.`First`,
`People`.`Last`,
GROUP_CONCAT(
DISTINCT CONCAT_WS(',', `PhonesTable`.`Phone`, `PhonesTable`.`Type`)
ORDER BY `PhonesTable`.`PhoneID` DESC
SEPARATOR ';'
) AS `Phones`,
GROUP_CONCAT(
DISTINCT CONCAT_WS(',', `EmailsTable`.`Email`)
ORDER BY `EmailsTable`.`EmailID` DESC
SEPARATOR ';'
) AS `Emails`,
`Properties`.`Address1`,
`Properties`.`city`,
`Properties`.`state`,
`Properties`.`PostalCode5`,
...(17 more `People` Columns)...,
FROM `T_Import` AS `Import`
LEFT JOIN `T_CustomerStorageJoin` AS `CustomerJoin`
ON `Import`.`UniqueID` = `CustomerJoin`.`ImportID`
LEFT JOIN `T_People` AS `People`
ON `CustomerJoin`.`PersID`=`People`.`PeopleID`
LEFT JOIN `T_JoinPeopleIDPhoneID` AS `PeIDPhID`
ON `People`.`PeopleID` = `PeIDPhID`.`PeopleID`
LEFT JOIN `T_Phone` AS `PhonesTable`
ON `PeIDPhID`.`PhoneID`=`PhonesTable`.`PhoneID`
LEFT JOIN `T_JoinPeopleIDEmailID` AS `PeIDEmID`
ON `People`.`PeopleID` = `PeIDEmID`.`PeopleID`
LEFT JOIN `T_Email` AS `EmailsTable`
ON `PeIDEmID`.`EmailID`=`EmailsTable`.`EmailID`
LEFT JOIN `T_JoinPeopleIDPropertyID` AS `PeIDPrID`
ON `People`.`PeopleID` = `PeIDPrID`.`PeopleID`
AND `PeIDPrID`.`PropertyCP`='CurrentImported'
LEFT JOIN `T_Property` AS `Properties`
ON `PeIDPrID`.`PropertyID`=`Properties`.`PropertyID`
WHERE `Import`.`CustomerCollectionID`=$ccID
AND `RowNum` >= $rnOffset
AND `RowNum` < $rnLimit
GROUP BY `RowNum`;
So I have indexes on every ON segment and on the WHERE segment. When RowNum is around 0-2500 in value, the query runs great and executes within a couple of seconds. But the query execution time seems to grow exponentially the larger RowNum gets.
I have an EXPLAIN here, and also at pastebin ( https://pastebin.com/PksYB4n2 ):
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE Import NULL ref CustomerCollectionID,RowNumIndex CustomerCollectionID 4 const 48108 8.74 Using index condition; Using where; Using filesort;
1 SIMPLE CustomerJoin NULL ref ImportID ImportID 4 MyDatabase.Import.UniqueID 1 100 NULL
1 SIMPLE People NULL eq_ref PRIMARY,PeopleID PRIMARY 4 MyDatabase.CustomerJoin.PersID 1 100 NULL
1 SIMPLE PeIDPhID NULL ref PeopleID PeopleID 5 MyDatabase.People.PeopleID 8 100 NULL
1 SIMPLE PhonesTable NULL eq_ref PRIMARY,PhoneID,PhoneID_2 PRIMARY 4 MyDatabase.PeIDPhID.PhoneID 1 100 NULL
1 SIMPLE PeIDEmID NULL ref PeopleID PeopleID 5 MyDatabase.People.PeopleID 5 100 NULL
1 SIMPLE EmailsTable NULL eq_ref PRIMARY,EmailID,DupeDeleteSelect PRIMARY 4 MyDatabase.PeIDEmID.EmailID 1 100 NULL
1 SIMPLE PeIDPrID NULL ref PeopleMSCP,PeopleID,PropertyCP PeopleMSCP 5 MyDatabase.People.PeopleID 4 100 Using where
1 SIMPLE Properties NULL eq_ref PRIMARY,PropertyID PRIMARY 4 MyDatabase.PeIDPrID.PropertyID 1 100 NULL
I apologize if the formatting is absolutely terrible. I'm not sure what good formatting looks like so I may have jumbled it a bit on accident, plus the tabs got screwed up.
What I want to know is how to speed up the query time. The databases are very large, as in tens of millions of rows. They aren't always like this, since our tables are constantly changing, but I would like to be able to handle it when they are.
I tried using LIMIT 2000, 1000 for example, but I know that it's less efficient than using an indexed column, so I switched over to RowNum. I feel like this was a good decision, but it seems like MySQL is still looping over every single row before the offset, which kind of defeats the purpose of my index... I think? I'm not sure. I also tried basically splitting this particular query into about 10 singular queries and running them one by one, for each row of the Excel file. It takes a LONG time... TOO LONG. This is fast, but obviously I'm having a problem.
Any help would be greatly appreciated, and thank you ahead of time. I'm sorry again for my lack of post organization.
The order of the columns in an index matters. The order of the clauses in WHERE does not matter (usually).
INDEX(a), INDEX(b) is not the same as the "composite" INDEX(a,b). I deliberately made composite indexes where they seemed useful.
INDEX(a,b) and INDEX(b,a) are not interchangeable unless both a and b are tested with =. (Plus a few exceptions.)
A "covering" index is one where all the columns for the one table are found in the one index. This sometimes provides an extra performance boost. Some of my recommended indexes are "covering". It implies that only the index BTree need be accessed, not also the data BTree; this is where it picks up some speed.
In EXPLAIN SELECT ... a "covering" index is indicated by "Using index" (which is not the same as "Using index condition"). (Your Explain shows no covering indexes currently.)
An index 'should not' have more than 5 columns. (This is not a hard and fast rule.) T5's index had 5 columns to be covering; it was not practical to make a covering index for T2.
When JOINing, the order of the tables does not matter; the Optimizer is free to shuffle them around. However, these "rules" apply:
A LEFT JOIN may force ordering of the tables. (I think it does in this case.) (I ordered the columns based on what I think the Optimizer wants; there may be some flexibility.)
The WHERE clause usually determines which table to "start with". (You test on T1 only, so obviously it will start with T1.)
The "next table" to be referenced (via NLJ - Nested Loop Join) is determined by a variety of things. (In your case it is pretty obvious -- namely the ON column(s).)
More on indexing: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
Revised Query
1. Import: (CustomerCollectionID, -- '=' comes first
RowNum, -- 'range'
UniqueID) -- 'covering'
Import shows up in WHERE, so is first in Explain; Also due to LEFTs
Properties: (PropertyID) -- is that the PK?
PeIDPrID: (PropertyCP, PeopleID, PropertyID)
3. People: (PeopleID)
I assume that is the `PRIMARY KEY`? (Too many for "covering")
(Since `People` leads to 3 other tables, I won't number the rest.)
EmailsTable: (EmailID, Email)
PeIDEmID: (PeopleID, -- JOIN from People
EmailID) -- covering
PhonesTable: (PhoneID, Type, Phone)
PeIDPhID: (PeopleID, PhoneID)
2. CustomerJoin: (ImportID, -- coming from `Import` (see ON...)
PersID) -- covering
After adding those, I expect most lines of EXPLAIN to say Using index.
The lack of at least a composite index on Import is the main problem leading to your performance complaint.
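As a sketch, the recommendations above translate into DDL roughly like this (index names are made up; the entries where the suggested column is already the table's PRIMARY KEY are skipped, and every column name should be checked against the real schema):
ALTER TABLE T_Import                 ADD INDEX idx_ccid_rownum_uid (CustomerCollectionID, RowNum, UniqueID);
ALTER TABLE T_CustomerStorageJoin    ADD INDEX idx_importid_persid (ImportID, PersID);
ALTER TABLE T_JoinPeopleIDPhoneID    ADD INDEX idx_peopleid_phoneid (PeopleID, PhoneID);
ALTER TABLE T_Phone                  ADD INDEX idx_phoneid_type_phone (PhoneID, Type, Phone);
ALTER TABLE T_JoinPeopleIDEmailID    ADD INDEX idx_peopleid_emailid (PeopleID, EmailID);
ALTER TABLE T_Email                  ADD INDEX idx_emailid_email (EmailID, Email);
ALTER TABLE T_JoinPeopleIDPropertyID ADD INDEX idx_propcp_peopleid_propid (PropertyCP, PeopleID, PropertyID);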
Bad GROUP BY
When there is a GROUP BY that does not include all the non-aggregated columns that are not directly dependent on the GROUP BY column(s), you get indeterminate values for the extras. I see from the EXPLAIN ("rows") that several tables probably contribute multiple rows. You really ought to think about the garbage being generated by this query.
Curiously, Phones and Emails are fed into GROUP_CONCAT(), thereby avoiding the above issue, but their "rows" is only 1.
(Read about ONLY_FULL_GROUP_BY; it might explain the issue better.)
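A tiny illustration of the point, using a hypothetical table t(a, b):
SELECT a, b FROM t GROUP BY a;              -- b is indeterminate unless it is functionally dependent on a
SELECT a, ANY_VALUE(b) FROM t GROUP BY a;   -- MySQL 5.7+: makes the arbitrary choice explicit and satisfies ONLY_FULL_GROUP_BY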
(I'm listing this as a separate Answer since it is orthogonal to my other Answer.)
I call this the "explode-implode" syndrome. The query does a JOIN, getting a bunch of rows, thereby generating several rows, and puts multiple rows into an intermediate table. Then the GROUP BY implodes back to down to the original set of rows.
Let me focus on a portion of the query that could be reformulated to provide a performance improvement:
SELECT ...
GROUP_CONCAT(
DISTINCT CONCAT_WS(',', `EmailsTable`.`Email`)
ORDER BY `EmailsTable`.`EmailID` DESC
SEPARATOR ';'
) AS `Emails`,
...
FROM ...
LEFT JOIN `T_Email` AS `EmailsTable`
ON `PeIDEmID`.`EmailID`=`EmailsTable`.`EmailID`
...
GROUP BY `RowNum`;
Instead, move the table and aggregation function into a subquery
SELECT ...
( SELECT GROUP_CONCAT(
DISTINCT CONCAT_WS(',', `Email`)
ORDER BY `EmailID` DESC
SEPARATOR ';' )
FROM T_Email
WHERE `PeIDEmID`.`EmailID` = `EmailID`
) AS `Emails`,
...
FROM ...
-- and Remove: LEFT JOIN `T_Email` ON ...
...
-- and possibly Remove: GROUP BY ...;
Ditto for PhonesTable.
(It is unclear whether the GROUP BY can be removed; other things may need it.)
I have this query that drives me crazy for quite some time. It has 3 tables (originally it has a lot more but I isolated the performance issue), 1 base table, 1 product table which adds more data, and 1 with product types.
The product types table contains a "max age" column which indicates the maximum age of a row I want to fetch (anything older is considered "archived") and its value is different according to the product type.
My poorly performing query goes like this, and it takes 50 seconds on a 250,000-row base table:
(select d_baseservices.ID
from d_baseservices
inner join d_products on d_baseservices.ServiceID = d_products.ServiceID
inner join md_prodtypes on d_products.ProdType = md_prodtypes.ProdType
where
(d_baseservices.CreationDate > (curdate() - INTERVAL md_prodtypes.MaxAge DAY))
order by CreationDate desc
limit 750);
Here is the EXPLAIN of this query:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE md_prodtypes index PRIMARY,ProdType_UNIQUE,ID_MAX_AGE MAX_AGE 5 23 Using index; Using temporary; Using filesort
1 SIMPLE d_products ref PRIMARY,ServiceID_UNIQUE,fk_Products_BaseServices1,fk_d_products_md_prodtypes1 fk_d_products_md_prodtypes1 4 combina.md_prodtypes.ProdType 8625
1 SIMPLE d_baseservices eq_ref PRIMARY,CreationDateDesc_index,CreationDate_index PRIMARY 8 combina.d_products.ServiceID 1 Using where
I found a clue a few days back, when I was able to determine that limiting the query to 750 records makes it go fast, but 751 brings poor performance.
I tried creating indexes of many kinds, with no success.
I tried removing the reference to MaxAge and the CURDATE() function and just setting a fixed value, with little success, as the query now takes 20 seconds:
(select d_baseservices.ID
from d_baseservices
inner join d_products on d_baseservices.ServiceID = d_products.ServiceID
inner join md_prodtypes on d_products.ProdType = md_prodtypes.ProdType
where
(d_baseservices.CreationDate > '2015-09-21 19:02:25')
order by CreationDate desc
limit 750);
And the EXPLAIN command output:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE md_prodtypes index PRIMARY,ProdType_UNIQUE,ID_MAX_AGE ProdType_UNIQUE 4 23 Using index; Using temporary; Using filesort
1 SIMPLE d_products ref PRIMARY,ServiceID_UNIQUE,fk_Products_BaseServices1,fk_d_products_md_prodtypes1 fk_d_products_md_prodtypes1 4 combina.md_prodtypes.ProdType 8625
1 SIMPLE d_baseservices eq_ref PRIMARY,CreationDateDesc_index,CreationDate_index PRIMARY 8 combina.d_products.ServiceID 1 Using where
Can anyone please help? I've been stuck for almost a month.
It's hard to say exactly what to do without knowing more about the specific data you have (how many rows in each table, how many rows you expect the query to return, the distribution of the data values, etc), but I'll make some educated guesses and hopefully point you in the right direction.
First, an explanation of why taking md_prodtypes.MaxAge out of the query greatly reduced the run time: before that change, the database had no ability at all to filter using indexes, because in order to see whether a row is a candidate for inclusion it had to join the three tables just to compare CreationDate from the first table to MaxAge in the third table. There is simply no index you can add that correlates these two values. You're forcing the database engine to look at every single row.
As to the 750 magic number - I'm guessing that past 750 results the database has to page data or that it's hitting some other memory limit based on the values in your specific MySQL configuration file. I wouldn't read too much into that 750 number.
Lastly I'd like to point out that the EXPLAIN of your second query is a bit strange since it's showing md_prodtypes as the first table despite the fact that you took MaxAge out of the WHERE. That means the database is starting from md_prodtypes then moving up to d_products and finally to d_baseservices and only then filtering based on the date. I'm guessing that you're expecting it to first filter on the date then join only when it's decided what baseservices records to include. It's impossible to know why this is happening with the information you've provided. Perhaps you are missing an index.
Another possibility may have to do with the variance in your CreationDate column. Let me explain by example: say you had a table of users, and each user had a gender column that could be either f or m, with a 50%/50% split of females and males. Now, if you add an index on the gender column and run a query filtered by WHERE gender='f', expecting the index to filter out half of the records, you'd be surprised to see that the database totally ignores the index and just scans the table. The reason is that it's cheaper to read the whole table if the index isn't filtering out enough (the alternative being constantly jumping from the index to the main table data). In your case, if the WHERE on the CreationDate column doesn't filter out enough records, then even if you have an index on it, it won't be used.
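A hypothetical illustration of that selectivity point (table and data are invented):
CREATE TABLE users (id INT PRIMARY KEY, gender CHAR(1), INDEX idx_gender (gender));
EXPLAIN SELECT * FROM users WHERE gender = 'f';   -- with a 50/50 split this often shows type=ALL: the index is ignored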
With a constant date...
INDEX(CreationDate)
That will encourage the optimizer to start with the table that can be filtered. Also, since the ORDER BY is on the same field, the WHERE, ORDER BY and LIMIT can all be done at the same time.
Otherwise, it must read all the relevant records from all 3 tables, sort them, then deliver 750 (or 751) of them.
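A sketch of that combination, using the constant-date query from the question (the EXPLAIN above suggests CreationDate_index may already exist, so the ALTER is only needed if it does not; the FORCE INDEX hint is optional and only there to make the intent explicit):
-- ALTER TABLE d_baseservices ADD INDEX CreationDate_index (CreationDate);
SELECT d_baseservices.ID
FROM d_baseservices FORCE INDEX (CreationDate_index)
INNER JOIN d_products   ON d_baseservices.ServiceID = d_products.ServiceID
INNER JOIN md_prodtypes ON d_products.ProdType = md_prodtypes.ProdType
WHERE d_baseservices.CreationDate > '2015-09-21 19:02:25'
ORDER BY d_baseservices.CreationDate DESC
LIMIT 750;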
Using MAX_AGE...
Now the optimizer won't know whether it is better to do as above or find all the rows, sort them, then deliver the LIMIT.
I'm no MySQL whiz, but I get it. I have just inherited a pretty large table (600,000 rows and around 90 columns (please kill me...)) and I have a smaller table that I've created to link it with a categories table.
I'm trying to query said table with a left join so I have both sets of data in one object, but it runs terribly slowly and I'm not hot enough to sort it out; I'd really appreciate a little guidance and an explanation as to why it's so slow.
SELECT
`products`.`Product_number`,
`products`.`Price`,
`products`.`Previous_Price_1`,
`products`.`Previous_Price_2`,
`products`.`Product_number`,
`products`.`AverageOverallRating`,
`products`.`Name`,
`products`.`Brand_description`
FROM `product_categories`
LEFT OUTER JOIN `products`
ON `products`.`product_id`= `product_categories`.`product_id`
WHERE COALESCE(product_categories.cat4, product_categories.cat3,
product_categories.cat2, product_categories.cat1) = '123456'
AND `product_categories`.`product_id` != 0
The two tables are MyISAM, the products table has indexes on Product_number and Brand_description, and the product_categories table has a unique index over all columns combined, if this info is of any help at all.
Having inherited this system I need to get this working asap before I nuke it and do it properly so any help right now will earn you my utmost respect!
[Edit]
Here is the output of the explain extended:
+----+-------------+--------------------+-------+---------------+------+---------+------+---------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------------+-------+---------------+------+---------+------+---------+----------+--------------------------+
| 1 | SIMPLE | product_categories | index | NULL | cat1 | 23 | NULL | 1224419 | 100.00 | Using where; Using index |
| 1 | SIMPLE | products | ALL | Product_id | NULL | NULL | NULL | 512376 | 100.00 | |
+----+-------------+--------------------+-------+---------------+------+---------+------+---------+----------+--------------------------+
Optimize Table
To establish a baseline, I would first recommend running an OPTIMIZE TABLE command on both tables. Please note that this might take some time. From the docs:
OPTIMIZE TABLE should be used if you have deleted a large part of a
table or if you have made many changes to a table with variable-length
rows (tables that have VARCHAR, VARBINARY, BLOB, or TEXT columns).
Deleted rows are maintained in a linked list and subsequent INSERT
operations reuse old row positions. You can use OPTIMIZE TABLE to
reclaim the unused space and to defragment the data file. After
extensive changes to a table, this statement may also improve
performance of statements that use the table, sometimes significantly.
[...]
For MyISAM tables, OPTIMIZE TABLE works as follows:
If the table has deleted or split rows, repair the table.
If the index pages are not sorted, sort them.
If the table's statistics are not up to date (and the repair could not be accomplished by sorting the index), update them.
Indexing
If space and index management isn't a concern, you can try adding a composite index on
product_categories.cat4, product_categories.cat3, product_categories.cat2, product_categories.cat1
This would be advised if you use a leftmost subset of these columns often in your queries. The query plan indicates that it can use the cat1 index of product_categories. This most likely only includes the cat1 column. By adding all four category columns to an index, it can more efficiently seek to the desired row. From the docs:
MySQL can use multiple-column indexes for queries that test all the
columns in the index, or queries that test just the first column, the
first two columns, the first three columns, and so on. If you specify
the columns in the right order in the index definition, a single
composite index can speed up several kinds of queries on the same
table.
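A sketch of that composite index, using the column order suggested above (the index name is arbitrary):
ALTER TABLE product_categories ADD INDEX idx_cats (cat4, cat3, cat2, cat1);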
Structure
Furthermore, given that your table has 90 columns you should also be aware that a wider table can lead to slower query performance. You may want to consider Vertically Partitioning your table into multiple tables:
Having too many columns can bloat your record size, which in turn
results in more memory blocks being read in and out of memory causing
higher I/O. This can hurt performance. One way to combat this is to
split your tables into smaller more independent tables with smaller
cardinalities than the original. This should now allow for a better
Blocking Factor (as defined above) which means less I/O and faster
performance. This process of breaking apart the table like this is a
called a Vertical Partition.
The meaning of your query seems to be "find all products that have the category '123456'." Is that correct?
COALESCE is an extraordinarily expensive function to use in a WHERE statement, because it operates on index-hostile NULL values. Your explain result shows that your query is not being very selective on your product_categories table. In MySQL you need to avoid functions in WHERE statements altogether if you want to exploit indexes to make your queries fast.
The thing someone else said about 90-column tables being harmful is also true. But you're stuck with it, so let's just deal with it.
Can we rework your query to get rid of the function-based WHERE? Let's try this.
SELECT /* some columns from the products table */
FROM products
WHERE product_id IN
(
SELECT DISTINCT product_id
FROM product_categories
WHERE product_id <> 0
AND ( cat1='123456'
OR cat2='123456'
OR cat3='123456'
OR cat4='123456')
)
For this to work fast you're going to need to create separate indexes on your four cat columns. The composite unique index ("on all columns combined") is not going to help you. It still may not be so good.
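A sketch of those separate single-column indexes (names are arbitrary):
CREATE INDEX idx_cat1 ON product_categories (cat1);
CREATE INDEX idx_cat2 ON product_categories (cat2);
CREATE INDEX idx_cat3 ON product_categories (cat3);
CREATE INDEX idx_cat4 ON product_categories (cat4);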
A better solution might be FULLTEXT searching IN BOOLEAN MODE. You're working with the MyISAM access method so this is possible. It's definitely worth a try. It could be very fast indeed.
SELECT /* some columns from the products table */
FROM products
WHERE product_id IN
(
SELECT product_id
FROM product_categories
WHERE MATCH(cat1,cat2,cat3,cat4)
AGAINST('123456' IN BOOLEAN MODE)
AND product_id <> 0
)
For this to work fast you're going to need to create a FULLTEXT index like so.
CREATE FULLTEXT INDEX cat_lookup
ON product_categories (cat1, cat2, cat3, cat4)
Note that neither of these suggested queries produces precisely the same results as your COALESCE query. The way your COALESCE query is set up, some combinations that will match these queries won't match it. For example:
cat1 cat2 cat3 cat4
123451 123453 123455 123456 matches your and my queries
123456 123455 123454 123452 matches my queries but not yours
But it's likely that my queries will produce a useful list of products, even if it has a few more items than yours.
You can debug this stuff by just working with the inner queries on product_categories.
There is something strange. Does the table product_categories indeed have a product_id column? Shouldn't the from and where clauses be like this:
FROM `product_categories` pc
LEFT OUTER JOIN `products` p ON p.category_id = pc.id
WHERE
COALESCE(product_categories.cat4, product_categories.cat3,product_categories.cat2, product_categories.cat1) = '123456'
AND pc.id != 0
I have a table with 10 columns. Now I want to give the users an option to sort the data by any column they want. For example, suppose a combo box with 7 items, each of which is a column of the table; the user chooses one item and gets the data sorted by the chosen column.
Now what is the problem?
My table has 3M records, and if I sort the data by an indexed column I have no problem, but with a non-indexed column it takes 3.5 minutes to sort!!!
What is the solution I am thinking about?
Add an index to every column of the table that needs to be sortable! In my case I would have indexes on 8 columns!!!!
What is the problem of my solution?
Having a lot of indexes on columns may decrease the speed of INSERT/UPDATE queries! In my case the table is updated frequently (every second!!!!!)
What is your solution for this case?!
Read this for more details on optimization: http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
In some cases, MySQL cannot use indexes to resolve the ORDER BY, although it still uses indexes to find the rows that match the WHERE clause. Using an index for sorting often goes together with using an index to find the rows; however, it can also be used just for the sort, for example if you're using ORDER BY without any WHERE clause on the table. In that case you would see the "index" type in EXPLAIN, which corresponds to scanning the (potentially) complete table in index order. It is very important to understand under which conditions an index can be used to sort data while also restricting the number of rows.
Looking at the same index (A,B), things like ORDER BY A; ORDER BY A,B; ORDER BY A DESC, B DESC will be able to use the full index for sorting (note that MySQL may choose not to use the index for the sort if you sort the full table without a LIMIT). However, ORDER BY B or ORDER BY A, B DESC will not be able to use the index, because the requested order does not line up with the order of data in the BTREE. If you have both a restriction and sorting, things like these would work: A=5 ORDER BY B; A=5 ORDER BY B DESC; A>5 ORDER BY A; A>5 ORDER BY A,B; A>5 ORDER BY A DESC, which again can be easily visualized as scanning a range in the BTREE. Things like these, however, would not work: A>5 ORDER BY B, A>5 ORDER BY A,B DESC, or A IN (3,4) ORDER BY B - in these cases getting data in sorted order would require a bit more than a simple range scan in the BTREE, and MySQL decides to pass on it.
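To make the rules above concrete, here is a small hypothetical illustration with a table t carrying INDEX(a, b):
EXPLAIN SELECT * FROM t WHERE a = 5 ORDER BY b LIMIT 10;   -- equality on a: the index handles both the filter and the sort
EXPLAIN SELECT * FROM t WHERE a > 5 ORDER BY b LIMIT 10;   -- range on a: sorting by b falls back to a filesort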
Option #1: If you are limited to MySQL, there's no better option but to create 8 indexes for the possible sort columns. Your inserts/updates are going to suffer for it, for sure, but no real visitor will wait 3.5 minutes for a list to be sorted.
Tune #1: To make it a little faster you can create prefix (partial) indexes instead of standard indexes, which will use much less space (I assume some of these columns are VARCHAR), and that means fewer writes and a smaller footprint in memory. You just need to check the entropy for each column at the chosen prefix length and make sure you still have distinctiveness over 90%.
For example with a query like this:
> select count(distinct(substring(COLUMN, 1, 5))) as part_5, count(distinct(substring(COLUMN, 1, 10))) as part_10, count(distinct(substring(COLUMN, 1, 20))) as part_20, count(distinct(COLUMN)) as sum from TABLE;
+--------+---------+---------+---------+
| part_5 | part_10 | part_20 | sum |
+--------+---------+---------+---------+
| 892183 | 1996053 | 1996058 | 1996058 |
+--------+---------+---------+---------+
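Given a distribution like the one above (where part_10 already equals the full distinct count), a 10-character prefix index would be a reasonable sketch (`TABLE` and `COLUMN` are the same placeholders as in the query above):
ALTER TABLE `TABLE` ADD INDEX idx_column_10 (`COLUMN`(10));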
Tune #2: You can make your insert/update statements execute in the background. The application won't be faster, but the user experience will be much better.
Tune #3: Use bigger transactions if you can for the inserts/updates.
Option #2: You can try one of the search engines that have been built for this usage pattern (too). I would recommend Solr, as I've been using it for a while with great satisfaction, but I've heard good things about Elasticsearch as well.
I have a table with 30,000 rows (and growing), which I join with another table. On some pages, I need to run some 100+ of those queries, and things get slow. If I EXPLAIN the query, I notice that one table uses a primary key and is fast, but the other table uses one of its indexes, and not the best one. Here's an overview:
SIMPLE | acc_entries | ref | ledger,date,type,status,status_ledger_date_type | type | 1 | const | 15359 | Using where
This is a sample query:
SELECT SUM(usd) AS total FROM acc_entries
LEFT JOIN acc_ledgers ON acc_entries.ledger = acc_ledgers.id
WHERE acc_entries.status = 1 AND
acc_ledgers.account = 3004 AND
date >= '2011-01-01' AND
date <= '2011-08-30' AND
type = 'credit'
As you can see, I am using in my WHERE the fields status, ledger (which is the field that joins with acc_ledgers.account), date and type. All of these fields have indexes. However, there is also a combined index covering all of them, in that same order. It is called status_ledger_date_type, and as you can see it is one of the indexes that MySQL considers using. However, in the end MySQL opts to use type as the index. That matches some 15,000 possible rows (half of the table), whereas the combined index matches only a fraction of that. So my question is: why does MySQL select this index when a better one is available, and how can I prevent it?
You can try using index hints to force the use of your desired index.
MySql docs on Index Hints
The Battle Between Force Index and the Query Optimizer
7 ways to convince MySQL to use the right index
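For the query in the question, an index hint would look roughly like this (FORCE INDEX goes right after the table name; whether it actually helps should be verified with EXPLAIN):
SELECT SUM(usd) AS total FROM acc_entries FORCE INDEX (status_ledger_date_type)
LEFT JOIN acc_ledgers ON acc_entries.ledger = acc_ledgers.id
WHERE acc_entries.status = 1 AND
acc_ledgers.account = 3004 AND
date >= '2011-01-01' AND
date <= '2011-08-30' AND
type = 'credit'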
Actually, you want your index based on the smaller granularity. The Ledger from your Acc_Entries table joins to your Acc_Ledgers table on ITS primary index of ID, so Acc_Ledgers is not really utilizing the Ledger portion for the WHERE clause. Your index should match the WHERE clause of your common queries as closely as possible. In this case, I would have an index on
(Account, Status, Type, Date)
The reason for Account being first: smaller result set. You could have 5,000 entries; of those, say 300 entries are for the one account, so you've already eliminated a huge amount of data to go through. Then the Status: of those 300, you could have 100 at status 1, 100 at status 2, 100 at status 3, so you've now reduced the set even more, and so on with the other criteria of type and date.
Your query is otherwise completely fine... Just as a personal style in writing, I try to write my queries with the WHERE conditions matching the index in the same sequence, so I would put the Account clause first, then Status, Type and Date... but again, that's a personal style in writing queries.