I have a query involving two tables: table A has lots of rows, and contains a field called b_id, which references a record from table B, which has about 30 different rows. Table A has an index on b_id, and table B has an index on the column name.
My query looks something like this:
SELECT COUNT(A.id) FROM A INNER JOIN B ON B.id = A.b_id WHERE (B.name != 'dummy') AND <condition>;
With <condition> being some arbitrary condition on table A (I have lots of those, all exhibiting the same behavior).
This query is extremely slow (north of 2 seconds), and EXPLAIN shows that the query optimizer starts with table B, coming up with about 29 rows, and then scans table A. Using a STRAIGHT_JOIN turned the order around, and the query ran instantaneously.
I'm not a fan of black magic, so I decided to try something else: come up with the id for the record in B that has the name dummy, let's say 23, and then simplify the query to:
SELECT COUNT(A.id) FROM A WHERE (b_id != 23) AND <condition>;
To my surprise, this query was actually slower than the straight join, taking north of a second.
Any ideas on why the join would be faster than the simplified query?
UPDATE: following a request in the comments, the outputs from explain:
Straight join:
+----+-------------+-------+--------+-----------------+---------+---------+---------------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------+---------+---------+---------------+--------+-------------+
| 1 | SIMPLE | A | ALL | b_id | NULL | NULL | NULL | 200707 | Using where |
| 1 | SIMPLE | B | eq_ref | PRIMARY,id_name | PRIMARY | 4 | schema.A.b_id | 1 | Using where |
+----+-------------+-------+--------+-----------------+---------+---------+---------------+--------+-------------+
No join:
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | A | ALL | b_id | NULL | NULL | NULL | 200707 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
UPDATE 2:
Tried another variant:
SELECT COUNT(A.id) FROM A WHERE b_id IN (<all the ids except for 23>) AND <condition>;
This runs faster than the no-join version, but still slower than the join, so it seems the inequality operation is responsible for part of the performance hit, but not all of it.
If you are using MySQL 5.6 or later then you can ask the query optimizer what it is doing:
SET optimizer_trace="enabled=on";
## YOUR QUERY
SELECT COUNT(*) FROM transactions WHERE (id < 9000) and user != 11;
##END YOUR QUERY
SELECT trace FROM information_schema.optimizer_trace;
SET optimizer_trace="enabled=off";
You will almost certainly need to refer to the following sections of the MySQL reference manual: Tracing the Optimizer and The Optimizer.
Looking at the first EXPLAIN, the query is probably quicker because the optimizer can use table B to filter down to the rows required based on the join, and then use the foreign key to get the matching rows in table A.
In the EXPLAIN it's this bit that is interesting: only one row matches, and it's using schema.A.b_id. Effectively this pre-filters the rows from A, which is where I think the performance difference comes from.
| ref | rows | Extra |
| schema.A.b_id | 1 | Using where |
So, as is usual with queries, it all comes down to indexes - or, more accurately, missing indexes. Just because you have indexes on individual fields doesn't necessarily mean they are suitable for the query you're running.
Basic rule: if the EXPLAIN doesn't say Using index then you need to add a suitable index.
Looking at the EXPLAIN output, the first interesting thing is, ironically, the last thing on each line: the Extra column.
In the first example we see
| 1 | SIMPLE | A | .... Using where |
| 1 | SIMPLE | B | ... Using where |
Both of these say Using where, which is not good; ideally at least one, and preferably both, should say Using index.
When you do
SELECT COUNT(A.id) FROM A WHERE (b_id != 23) AND <condition>;
and see Using where, then you need to add an index, as it's doing a table scan.
For example, if you did
EXPLAIN SELECT COUNT(A.id) FROM A WHERE (Id > 23)
You should see Using where; Using index (assuming here that Id is the primary key and has an index)
If you then added a condition onto the end
EXPLAIN SELECT COUNT(A.id) FROM A WHERE (Id > 23) and Field > 0
and see Using where, then you need to add an index covering the two fields. Just having an index on each individual field doesn't mean that MySQL will be able to use those indexes in a query across multiple fields - that is something the query optimizer decides internally. I'm not exactly certain of the internal rules, but generally adding a composite index to match the query helps immensely.
So adding an index (on the two fields in the query above):
ALTER TABLE `A` ADD INDEX `IndexIdField` (`Id`,`Field`)
should change it such that when querying based upon those two fields there is an index.
I've tried this on one of my databases that has Transactions and User tables.
I'll use this query
EXPLAIN SELECT COUNT(*) FROM transactions WHERE (id < 9000) and user != 11;
Running without index on the two fields:
PRIMARY,user PRIMARY 4 NULL 14334 Using where
Then add an index:
ALTER TABLE `transactions` ADD INDEX `IndexIdUser` (`id`, `user`);
Then run the same query again; this time the output is
PRIMARY,user,IndexIdUser IndexIdUser 4 NULL 12628 Using where; Using index
This time it's using the indexes - and as a result will be a lot quicker.
From comments by @Wrikken - and also bear in mind that I don't have the actual schema/data, so some of this investigation has required assumptions about the schema (which may be wrong):
SELECT COUNT(A.id) FROM A FORCE INDEX (b_id)
would perform at least as well as
SELECT COUNT(A.id) FROM A INNER JOIN B ON A.b_id = B.id
If we look at the first EXPLAIN in the OP we see that there are two elements to the query. Referring to the EXPLAIN documentation for eq_ref, I can see that this is going to define the rows for consideration based on this relationship.
The order of the explain output doesn't necessarily mean it's doing one and then the other; it's simply what has been chosen to execute the query (at least as far as I can tell).
For some reason the query optimizer has decided not to use the index on b_id - I'm assuming that, given the query, the optimizer has decided a table scan will be more efficient.
The second explain concerns me a little because it's not considering the index on b_id; possibly because of the AND <condition> (which is omitted so I'm guessing as to what it could be). When I try this with an index on b_id it does use the index; but as soon as a condition is added it doesn't use the index.
So, when doing
SELECT COUNT(A.id) FROM A INNER JOIN B ON A.b_id = B.id
this all indicates to me that the PRIMARY index on B is where the speed difference is coming from. Because of the schema.A.b_id in the EXPLAIN, I'm assuming there is a foreign key on this table, which must be a better collection of related rows than the index on b_id alone - so the query optimizer can use this relationship to define which rows to pick, and because a primary index is better than secondary indexes, it's going to be much quicker to select rows out of B and then use the relationship link to match against the rows in A.
I do not see any strange behavior here. What you need is to understand the basics of how MySQL uses indexes. Here is an article I usually recommend: 3 ways MySQL uses indexes.
It is always funny to observe people writing things like WHERE (B.name != 'dummy') AND <condition>, because this AND <condition> might be the reason why the MySQL optimizer chose the specific index, and there is no valid reason to compare the performance of the query with that of another one with WHERE b_id != 23 AND <condition>, because the two queries usually need different indexes to perform well.
One thing you should understand, is that MySQL likes equality comparisons, and does not like range conditions and inequality comparisons. It is usually better to specify the correct values than to use a range condition or specify a != value.
So, let's compare the two queries.
With straight join
For each row, in A.id order (the primary key, which is clustered, i.e. the data is stored on disk in that order), read the row from disk to check whether your <condition> is met and to get b_id; then (again, for each matching row) find the appropriate B row for that b_id, go to disk, take B.name, and compare it with 'dummy'. Even though this plan is not at all efficient, you have only 200000 rows in your A table, so it seems rather performant.
Without straight join
For each row in table B, check whether name matches; then look into the A.b_id index (which is sorted by b_id, since it is an index, and hence contains A.ids in random order), and for each A.id with the given b_id, find the corresponding A row on disk to check the <condition>; if it matches, count the id, otherwise discard the row.
As you see, there is nothing strange in the fact that the second query takes so long: you basically force MySQL to randomly access almost every row in the A table, whereas in the first query you read the A table in the order it is stored on disk.
The query with no join does not use any index at all. It actually should take about the same time as the query with the straight join. My guess is that the order of b_id != 23 and <condition> is significant.
UPD1: Could you still compare the performance of your query without join with the following:
SELECT COUNT(A.id)
FROM A
WHERE IF(b_id!=23, <condition>, 0);
UPD2: the fact that you do not see an index in EXPLAIN does not mean that no index is used at all. An index is at least used to define the reading order: when there is no other useful index, it is usually the primary key; but, as I said above, when there is an equality condition and a corresponding index, MySQL will use that index. So, basically, to understand which index is used, you can look at the order in which rows are output. If the order is the same as the primary key, then no secondary index was used (that is, the primary key index was used); if the order of rows is shuffled, then some other index was involved.
In your case, the second condition seems to be true for most of the rows, but the index is still used; that is, to get b_id, MySQL goes to disk in random order, which is why it is slow. No black magic here, and this second condition does affect the performance.
Probably this should be a comment rather than an answer but it will be a bit long.
First of all, it is hard to believe that two queries with (almost) exactly the same EXPLAIN run at different speeds. Furthermore, this is even less likely when the one with the extra line in the EXPLAIN runs faster. And I guess the word faster is the key here.
You've compared speed (the time it takes for a query to finish), and that is an extremely empirical way of testing. For example, you could have improperly disabled the cache, which makes the comparison useless. Not to mention that your <insert your preferred software application here> could have taken a page fault or performed some other operation at the time you ran the test, resulting in a decrease in query speed.
The right way of measuring query performance is based on the EXPLAIN (that's why it is there).
So the closest thing I have to answer the question: Any ideas on why the join would be faster than the simplified query?... is, in short, a layer 8 error.
I do have some other comments, though, that should be taken into account in order to speed things up. If A.id is a primary key (the name smells like it is), then according to your EXPLAIN, why does the count(A.id) have to scan all the rows? It should be able to get the data directly from the index, but I don't see Using index in the Extra flags. It seems you don't even have a unique index on it, and that it is a nullable field. That also smells odd. Make sure the field is NOT NULL and that there is a unique index on it, run the EXPLAIN again, confirm the Extra flags contain Using index, and then (properly) time the query. It should run much faster.
Also note that an approach that would result in the same performance improvement as I mentioned above would be to replace count(A.id) with count(*).
Just my 2 cents.
Because MySQL will not use an index for an indexed_column != value condition in the WHERE clause.
The optimizer decides whether to use an index by estimating. Since a != condition will most likely fetch nearly everything, it skips the index to avoid the overhead. (Yes, MySQL is stupid here, and it does not keep statistics on the index column's values.)
You may get a faster SELECT by using indexed_column IN (everything other than value), since MySQL will then use the index.
There is an example here showing that the query optimizer chooses whether to use an index based on the value.
The answer to this question is actually a very simple consequence of algorithm design:
The key difference between these two queries is the merge operation.
Before I give a lesson on algorithms, I will mention the reason why the merge operation improves the performance. The merge improves the performance because it reduces the overall load on the aggregation. This is an iteration vs recursion issue. In the iteration analogy, we are simply looping through the entire index and counting the matches. In the recursion analogy, we are dividing and conquering (so to speak); or in other words, we are filtering the results that we need to count, thus reducing the volume of numbers we actually need to count.
Here are the key questions:
Why is a merge sort faster than an insertion sort?
Is a merge sort always faster than an insertion sort?
Let's explain this with a parable:
Let's say we have a deck of playing cards, and we need to count the playing cards that have the numbers 7, 8 and 9 (assuming we don't know the answer in advance).
Let's say that we decide upon two ways to solve this problem:
We can hold the deck in one hand and move the cards to the table, one by one, counting as we go.
We can separate the cards into two groups: black suits and red suits. Then we can perform step 1 upon one of the groups and reuse the results for the second group.
If we choose option 2, then we have divided our problem in half. As a consequence, we can count the matching black cards and multiply the number by 2. In other words, we are re-using the part of the query execution plan that required the counting. This reasoning especially works when we know in advance how the cards were sorted (aka "clustered index"). Counting half of the cards is obviously much less time consuming than counting the entire deck.
If we wanted to improve the performance yet again, depending on how large the size of our database is, we may even further consider sorting into four groups (instead of two groups): clubs, diamonds, hearts, and spades. Whether or not we want to perform this further step depends on whether or not the overhead of sorting the cards into the additional groups is justified by the performance gain. In small numbers of cards, the performance gain is likely not worth the extra overhead required to sort into the different groups. As the number of cards grows, the performance gain begins to outweigh the overhead cost.
Here is an excerpt from "Introduction to Algorithms, 3rd edition," (Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein):
(Also, keep in mind that "n" is the number of objects we are dealing with.)
"As an example, in Chapter 2, we will see two algorithms for sorting.
The first, known as insertion sort, takes time roughly equal to c1n2
to sort n items, where c1 is a constant that does not depend on n.
That is, it takes time roughly proportional to n2. The second, merge
sort, takes time roughly equal to c2n lg n, where lg n stands for
log2 n and c2 is another constant that also does not depend on n.
Insertion sort typically has a smaller constant factor than merge
sort, so that c1 < c2. We shall see that the constant factors can
have far less of an impact on the running time than the dependence on
the input size n. Let’s write insertion sort’s running time as c1n ·
n and merge sort’s running time as c2n · lg n. Then we see that where
insertion sort has a factor of n in its running time, merge sort has
a factor of lg n, which is much smaller. (For example, when n = 1000,
lg n is approximately 10, and when n equals one million, lg n is
approximately only 20.) Although insertion sort usually runs faster
than merge sort for small input sizes, once the input size n becomes
large enough, merge sort’s advantage of lg n vs. n will more than
compensate for the difference in constant factors. No matter how much
smaller c1 is than c2, there will always be a crossover point beyond
which merge sort is faster."
Why is this relevant? Let us look at the query execution plans for these two queries. We will see that there is a merge operation caused by the inner join.
This MySQL query is run on a large (about 200,000 records, 41 columns) MyISAM table:
select t1.*
from table t1
where 1
  and t1.inactive = '0'
  and (t1.code like '%searchtext%' or t1.name like '%searchtext%' or t1.ext like '%searchtext%')
order by t1.id desc
LIMIT 0, 15
id is the primary index.
I tried adding a multiple-column index on all 3 searched (LIKE) columns. It works OK, but the results are served in an auto-filled AJAX table on a website, and the 2-second return delay is a bit too slow.
I also tried adding separate indexes on all 3 columns, and a fulltext index on all 3 columns, without significant improvement.
What would be the best way to optimize this type of query? I would like to achieve under 1 sec performance - is it doable?
The best thing you can do is implement paging. No matter what you do, that IO cost is going to be huge. If you only return one page of records (10, 25, or whatever), that will help a lot.
As for the index, you need to check the plan to see if your index is actually being used. A full-text index might help, but that depends on how many rows you return and what you pass in. Wildcards such as % really drain performance. An index can still be used if the pattern ends with % but not if it starts with %. If you put % on both sides of the text you are searching for, indexes can't help much.
You can create a full-text index that covers the three columns: code, name, and ext. Then perform a full-text query using the MATCH() AGAINST () function:
select t1.*
from table t1
where match(code, name, ext) against ('searchtext')
order by t1.id desc
limit 0, 15
If you omit the ORDER BY clause, the rows are sorted by default by the MATCH() relevance value. For more information, read the Full-Text Search Functions documentation.
As @Vulcronos notes, the query optimizer is not able to use the index when the LIKE operator is used with an expression that starts with a wildcard %.
I have a table with 10 columns, and I want to give users the option to sort the data by any column they want. For example, suppose a combo box with 7 items, each of which is a column of the table; the user chooses one item and gets the data sorted by the chosen column.
Now what is the problem?
My table has 3M records, and if I sort the data by an indexed column I have no problem, but sorting by a non-indexed column takes 3.5 minutes!
What is the solution I am thinking about?
Add an index to every column of the table that needs to be sorted by! In my case I would have indexes on 8 columns!
What is the problem of my solution?
Having a lot of indexes on columns may decrease the speed of INSERT/UPDATE queries! In my case the table is updated frequently (every second!).
What is your solution for this case?!
Read this for more details on optimization: http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
In some cases, MySQL cannot use an index to resolve the ORDER BY, although it may still use an index to find the rows that match the WHERE clause. Using an index for sorting often comes together with using an index to find the rows; however, it can also be used just for the sort, for example if you're using ORDER BY without any WHERE clause on the table. In such a case you would see the "index" type in EXPLAIN, which corresponds to scanning the (potentially) complete table in index order. It is very important to understand under which conditions an index can be used to sort the data while also restricting the number of rows.
Looking at an index on (A,B): things like ORDER BY A; ORDER BY A,B; and ORDER BY A DESC, B DESC will be able to use the full index for sorting (note that MySQL may choose not to use the index for the sort if you sort the full table without a LIMIT). However, ORDER BY B or ORDER BY A, B DESC will not be able to use the index, because the requested order does not line up with the order of the data in the BTREE. If you have both a restriction and sorting, things like these will work: A=5 ORDER BY B; A=5 ORDER BY B DESC; A>5 ORDER BY A; A>5 ORDER BY A,B; A>5 ORDER BY A DESC - all of which can easily be visualized as scanning a range in the BTREE. Things like these, however, will not work: A>5 ORDER BY B; A>5 ORDER BY A,B DESC; A IN (3,4) ORDER BY B - in these cases getting the data in sorted order would require more than a simple range scan in the BTREE, and MySQL decides to pass on it.
Option #1: If you are limited to MySQL, there's no better option but to create 8 indexes for the possible sort columns. Your inserts/updates are going to suffer for sure, but no real visitor will wait 3.5 minutes for a list to be sorted.
Tune #1: To make it a little faster you can create prefix indexes instead of standard indexes, which will use much less space (I assume some of these columns are varchar), meaning fewer writes and a smaller footprint in memory. You just need to check the entropy of each column at a few prefix lengths and make sure you still keep over 90% distinct values.
For example with a query like this:
select count(distinct(substring(COLUMN, 1, 5))) as part_5,
       count(distinct(substring(COLUMN, 1, 10))) as part_10,
       count(distinct(substring(COLUMN, 1, 20))) as part_20,
       count(distinct(COLUMN)) as sum
from TABLE;
+--------+---------+---------+---------+
| part_5 | part_10 | part_20 | sum |
+--------+---------+---------+---------+
| 892183 | 1996053 | 1996058 | 1996058 |
+--------+---------+---------+---------+
Tune #2: You can make your insert/update statements execute in the background. The application won't be faster, but the user experience will be much better.
Tune #3: Use bigger transactions if you can for the inserts/updates.
Option #2: You can try one of the search engines that were built for this usage pattern (too). I would recommend Solr, as I've been using it for a while with great satisfaction, but I've heard good things about Elasticsearch as well.
I have this query in MySQL:
select *
from alias where
name like '%jandro%';
Which results are:
Jandro, Alejandro
The index on name cannot be applied to improve performance because it is a range filter. Is there any way of improving that query's performance?
I have tried with a full-text index, but it only works for complete words.
I also tried with a MEMORY ENGINE table, and it is faster, but I would like a better choice.
EDIT
I think I will just have to accept this for now:
select *
from alias where match(name) against ('jandro*' in boolean mode);
I've done this in the past (not on MySQL, and before full-text searching was commonly available on database servers) by creating a lookup table, in which I stored all the left-chopped substrings to search on.
In my case, it was merited - a key user journey involved searching for names in much the way you suggest, and performance was key.
It worked with triggers on insert, update and delete (see the sketch at the end of this answer).
Translated to your example:
Table alias
ID name
1 Jandro
2 Alejandro
Table name_lookup
alias_id name_substring
1 Jandro
1 andro
1 ndro
1 dro
1 ro
2 Alejandro
2 lejandro
2 ejandro
2 jandro
2 andro
2 ndro
2 dro
2 ro
Your query then becomes
select alias_id, name
from alias a,
name_lookup nl
where a.id = nl.alias_id
and nl.name_substring like 'andro%'
That way, you hit the index on name_lookup.name_substring.
It's only worth doing for common queries on huge data sets - but it works, and it's quick.
Currently I'm running these two queries:
SELECT COUNT(*) FROM `mytable`
SELECT * FROM `mytable` WHERE `id`=123
I'm wondering what format will be the most efficient. Does the order the queries are executed make a difference? Is there a single query that will do what I want?
Both queries are fairly unrelated. The COUNT doesn't use any indexes, while the SELECT likely uses the primary key for a fast look-up. The only thing the queries have in common is the table.
Since these are so simple, the query optimizer and results cache shouldn't have a problem performing very well on these queries.
Are they causing you performance problems? If not, don't bother optimizing them.
Does the order the queries are executed make a difference?
No, they do different things. The count reads a stored value that contains the number of rows in the table (on MyISAM), while the select by id uses the primary key index. Both are fast and simple.
Is there a single query that will do what I want?
Yes, but it will make your code less clear and less maintainable (due to mixing concerns), and in the best case it will not improve performance (it will probably make it worse).
If you really, really want to group them somehow, create a stored procedure; but unless you use this pair of queries a lot, or in several places in the code, it could be overkill.
First off: Ben S. makes a good point. This is not worth optimizing.
But if one wants to put those two statements in one SQL statement, I think this is one way to do it:
select *,count(*) from mytable
union all
select *,-1 from mytable where id = 123
This will give one row for the count(*) (where one ignores all but the last column) and as many rows as match id = 123 (where one ignores the last column, as it is always -1).
Like this:
| Column1    | Column2    | Column3    | ColumnN    | Count(*) Column |
|------------|------------|------------|------------|-----------------|
| ignore     | ignore     | ignore     | ignore     | 4711            |
| valid data | valid data | valid data | valid data | -1 (ignore)     |
What table engine are you using?
SELECT COUNT(*) is better on MyISAM compared to InnoDB. MyISAM stores the number of rows of each table, so when you do COUNT(*) (with no WHERE clause) that stored value is simply returned. InnoDB doesn't do this because it supports transactions, so the exact visible row count can differ per transaction.
More info:
http://www.mysqlperformanceblog.com/2006/12/01/count-for-innodb-tables/