I have a table from a legacy system which does not have a primary key. It records transactional data for issuing materials in a factory.
For simplicity's sake, let's say each row contains job_number, part_number, quantity and date_issued.
I added an index to the date_issued column. When I run EXPLAIN SELECT * FROM issued_parts WHERE date_issued > '20100101', it shows this:
+----+-------------+----------------+------+-------------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------+------+-------------------+------+---------+------+---------+-------------+
| 1 | SIMPLE | issued_parts | ALL | date_issued_alloc | NULL | NULL | NULL | 9724620 | Using where |
+----+-------------+----------------+------+-------------------+------+---------+------+---------+-------------+
So it sees the key, but it doesn't use it?
Can someone explain why?
Something tells me the MySQL Query Optimizer decided correctly.
Here is how you can tell. Run these:
Count of Rows
SELECT COUNT(1) FROM issued_parts;
Count of Rows Matching Your Query
SELECT COUNT(1) FROM issued_parts WHERE date_issued > '20100101';
As a rough rule of thumb, if the number of rows you are actually retrieving exceeds about 5% of the table's total row count, the MySQL Query Optimizer may decide it would be less effort to do a full table scan.
Now, if your query was more exact, for example, with this:
SELECT * FROM issued_parts WHERE date_issued = '20100101';
then, you will get a different EXPLAIN plan altogether.
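The comparison of the two counts can be sketched in a few lines of Python. This is only an illustration of the rule of thumb above, not of MySQL's actual cost model: the function names are made up for this example, the 5% default is the answer's rough figure rather than a documented constant, and the matching-row counts are invented.

```python
def matching_fraction(matching_rows: int, total_rows: int) -> float:
    """Fraction of the table that the WHERE clause selects."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return matching_rows / total_rows


def full_scan_likely(matching_rows: int, total_rows: int,
                     threshold: float = 0.05) -> bool:
    """True when the query selects more than `threshold` of the table.

    The 0.05 default is the rule of thumb quoted in the answer,
    not a value taken from MySQL itself.
    """
    return matching_fraction(matching_rows, total_rows) > threshold


# total_rows is the row estimate from the EXPLAIN output in the question;
# the matching_rows values are invented for illustration.
print(full_scan_likely(matching_rows=3_000_000, total_rows=9_724_620))
print(full_scan_likely(matching_rows=12_000, total_rows=9_724_620))
```

Plugging in the results of the two COUNT queries gives a quick feel for whether the full scan in the EXPLAIN output is surprising.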
possible_keys lists the indexes that contain the relevant columns, but that doesn't mean that every index in it is going to be useful for the query. In this case, none are.
There are multiple types of indexes. A hash index is a fast way to look up an item given a specific value. If you have a set of discrete values that you are querying against (for example, a list of 10 dates), then you can calculate a hash for each of those values and look them up in the index. Since you aren't doing a lookup on a specific value, but rather a comparison, a hash index won't help you.
On the other hand, a B-Tree index can help you because it gives an ordering to the elements it is indexing. For instance, see http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html for MySQL (search for B-Tree Index Characteristics). You may want to check that your table is using a B-tree index for its index column.
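The difference between the two index types can be imitated in Python with a dict (hash-style lookup) and a sorted list searched with bisect (ordered, B-tree-like lookup). This is a conceptual sketch only; the sample dates and helper names are invented, and a real storage engine works on pages, not Python lists.

```python
from bisect import bisect_right

# Invented sample column values; the list position stands in for the row id.
dates = ["20091230", "20091231", "20100101", "20100102", "20100105"]

# Hash-style index: value -> row ids. Supports equality lookups only.
hash_index = {}
for row_id, value in enumerate(dates):
    hash_index.setdefault(value, []).append(row_id)

# Ordered index: keys kept sorted, with the row id stored next to each key.
ordered_index = sorted((value, row_id) for row_id, value in enumerate(dates))
ordered_keys = [value for value, _ in ordered_index]


def rows_equal(value):
    """Equality lookup: one hash probe."""
    return hash_index.get(value, [])


def rows_greater_than(value):
    """Range lookup: binary search for the boundary, then read to the end."""
    start = bisect_right(ordered_keys, value)
    return [row_id for _, row_id in ordered_index[start:]]


print(rows_equal("20100101"))
print(rows_greater_than("20100101"))
```

The dict has no notion of "greater than", so a range predicate would have to visit every key, whereas the sorted structure finds the boundary once and reads forward from there.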
I have some tables I want to join, but it cannot take dozens of seconds.
I want to go from this query that takes ~1s
SELECT COUNT(*) FROM business_group bg WHERE bg.group_id=1040
+----------+
| COUNT(*) |
+----------+
| 1229380 |
+----------+
1 row in set
Time: 1.173s
to this joined query that is taking ~50s
SELECT COUNT(*) FROM business b
INNER JOIN business_group bg ON b.id=bg.business_id
WHERE bg.group_id=1040
+----------+
| COUNT(*) |
+----------+
| 1229380 |
+----------+
1 row in set
Time: 51.346s
Why does it take that long if the only thing it does differently is to join on the primary key of the business table (business.id)?
Besides this primary key index, I also have this one (group_id, business_id) on business_group (with (business_id, group_id) it took even longer).
Following is the execution plan:
+----+-------------+-------+------------+--------+---------------------------------------------------------+-----------------------------+---------+----------------------+---------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+---------------------------------------------------------+-----------------------------+---------+----------------------+---------+----------+--------------------------+
| 1 | SIMPLE | bg | <null> | ref | FKo2q0jurx07ein31bgmfvuk8gf,idx_bg_group_id_business_id | idx_bg_group_id_business_id | 9 | const | 2654528 | 100.0 | Using index |
| 1 | SIMPLE | b | <null> | eq_ref | PRIMARY | PRIMARY | 4 | database.bg.group_id | 1 | 100.0 | Using where; Using index |
+----+-------------+-------+------------+--------+---------------------------------------------------------+-----------------------------+---------+----------------------+---------+----------+--------------------------+
Is it possible to optimize the second query so it takes less time?
business table is ~45M rows while business_group is ~60M rows.
I'm writing this as someone who does a lot of indexing setups on SQL Server rather than MySQL. It is too long as a comment, and is based on what I believe are fundamentals, so hopefully it will help.
Why?
Firstly - why does it take so long for the second query to run? The answer is that it needs to do a lot more work in the second one.
To demonstrate, imagine the only non-clustered index you have is on business_group for group_id.
You run the first query SELECT COUNT(*) FROM business_group bg WHERE bg.group_id=1040.
All the engine needs to do is to seek to the appropriate spot in the index (where group_id = 1040), then read/count rows from the index (which is just a series of ints) until it changes - because the non-clustered index is sorted by that column.
Note if you had a second field in the non-clustered index (e.g., group_id, business_id), it would be almost as fast because it's still sorted on group_id first. However, it will be a slightly larger read as each row is 2x the size of the single-column version (but would still be very small).
Imagine you then run a slightly different query, counting business_id instead of * e.g., SELECT COUNT(business_id) FROM business_group bg WHERE bg.group_id=1040.
Assuming business_id is not the PK (and is not in the non-clustered index), then for every row it finds in the index, it needs to go back, read the business_id from the table and check that it's not null (either in some sort of loop/reference, or by reading the whole table - I'm not 100% sure how MySQL deals with these). Either way, it is a lot more work than above.
If business_id was in the index (as above, for group_id, business_id), then it could read that data straight from the index and not need to refer back to the original table - which is good.
Now add the join (your second query) SELECT COUNT(*) FROM business b INNER JOIN business_group bg ON b.id=bg.business_id WHERE bg.group_id=1040. The engine needs to
Get each business_id as above
Potentially sort the business IDs to help with the join
Join it to the business table (to ensure it has a valid row in the business table)
... and to do so, it may need to read all the row's data in the business table
Suggestion #1 - Avoid going to the business table
If you set up foreign keys to ensure that business_id in business_group is valid - then do you need to run the version with the join? Just run the first version.
Suggestion #2 - Indexes
If this was SQL Server and you needed that second query to run as fast as possible, I would set up two non-clustered indexes
NONCLUSTERED INDEX ... ON business_group (group_id, business_id)
NONCLUSTERED INDEX ... ON business (id)
The first means the engine can seek directly to the specific group_id, and then get a sorted list of business_id.
The second provides a sorted list of id (business_id) from the business table. As it has the same sort order as the results from the first index, the join is a lot less work.
However, the second one is controversial - many people would say 'no' to this as it overlaps your PK (or, more specifically, the clustered index), which is sorted the same way. However, at least in SQL Server, the clustered index also includes all the other data about the businesses, e.g., the business name, address, etc. So to read the list of IDs from the clustered index on business, you'd also need to read the rest of the data - taking a lot more time.
However, if you put a non-clustered index just on ID, it will be a very narrow index (just the IDs) and therefore the amount of data to be read would be much less - and therefore often a lot faster.
Note though, that this is not as fast as if you could avoid doing the join altogether (e.g., Suggestion #1 above).
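Why two inputs sorted the same way make the join cheaper can be shown with a small merge-join sketch in Python. It is an illustration of the general technique under simplifying assumptions (in-memory lists, unique ids on the business side); the function name and sample ids are invented, and it says nothing about which join algorithm MySQL or SQL Server actually picks.

```python
def count_merge_join(left_ids, right_ids):
    """Count join matches between two ascending id lists.

    left_ids:  business_id values read from the (group_id, business_id) index
               for one group; duplicates are allowed.
    right_ids: id values read from the index on business; assumed unique.
    Each list is walked once, so the work is O(len(left) + len(right)).
    """
    i = j = matches = 0
    while i < len(left_ids) and j < len(right_ids):
        if left_ids[i] == right_ids[j]:
            matches += 1
            i += 1  # keep j in place: the next left row may have the same id
        elif left_ids[i] < right_ids[j]:
            i += 1  # no business row for this id
        else:
            j += 1  # this business is not in the group
    return matches


# Invented ids: 9 has no row in business, 3 appears twice in the group.
print(count_merge_join([2, 3, 3, 7, 9], [1, 2, 3, 4, 5, 7]))
```

If either side were unsorted, the engine would have to sort it first or probe the other side once per row, which is the extra work described above.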
Very simple problem yet hard to find a solution.
The address table, with 2,498,739 rows, has min_ip and max_ip fields. These are the core anchors of the table for filtering.
The query is very simple.
SELECT *
FROM address a
WHERE min_ip < value
AND max_ip > value;
So it is logical to create an index on min_ip and max_ip to make the query faster.
I created the following indexes.
CREATE INDEX ip_range ON address (min_ip, max_ip) USING BTREE;
CREATE INDEX min_ip ON address (min_ip ASC) USING BTREE;
CREATE INDEX max_ip ON address (max_ip DESC) USING BTREE;
I did try creating just the first index (the combination of min_ip and max_ip), but it did not work, so I created all three indexes to give MySQL more options for index selection. (Note that this table is pretty much static and more of a lookup table.)
+------------------------+---------------------+------+-----+---------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+------------------------+---------------------+------+-----+---------------------+-----------------------------+
| id | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| network | varchar(20) | YES | | NULL | |
| min_ip | int(11) unsigned | NO | MUL | NULL | |
| max_ip | int(11) unsigned | NO | MUL | NULL | |
+------------------------+---------------------+------+-----+---------------------+-----------------------------+
Now, it should be straightforward to query the table with min_ip and max_ip as the filter criteria.
EXPLAIN
SELECT *
FROM address a
WHERE min_ip < 2410508496
AND max_ip > 2410508496;
The query runs in around 0.120 to 0.200 seconds. However, under a load test, performance degrades rapidly.
MySQL server CPU usage skyrockets to 100% with just a few simultaneous queries, and the query does not scale.
The slow query log on the MySQL server was enabled with a threshold of 10 seconds, and the SELECT query showed up in the log just a few seconds into the load test.
So I checked the query with EXPLAIN and found out that it didn't use an index.
Explain plan result
id select_type table type possible_keys key key_len ref rows Extra
------ ----------- ------ ------ ---------------------- ------ ------- ------ ------- -------------
1 SIMPLE a ALL ip_range,min_ip,max_ip (NULL) (NULL) (NULL) 2417789 Using where
Interestingly, it was able to identify ip_range, min_ip and max_ip as potential indexes, but it never used any of them, as shown in the key column.
I know I can use FORCE INDEX and tried to use explain plan on it.
EXPLAIN
SELECT *
FROM address a
FORCE INDEX (ip_range)
WHERE min_ip < 2410508496
AND max_ip > 2410508496;
Explain plan with FORCE INDEX result
id select_type table type possible_keys key key_len ref rows Extra
------ ----------- ------ ------ ------------- -------- ------- ------ ------- -----------------------
1 SIMPLE a range ip_range ip_range 4 (NULL) 1208894 Using index condition
With FORCE INDEX, yes, it uses the ip_range index as the key, and rows shows a subset of what the query without FORCE INDEX examines: 1,208,894 instead of 2,417,789.
So using the index should definitely give better performance. (Unless I misunderstood the EXPLAIN result.)
But what is more interesting is that, after a couple of tests, I found that in some cases MySQL does use the index even without FORCE INDEX. My observation is that it uses the index when the value is small.
EXPLAIN
SELECT *
FROM address a
WHERE min_ip < 508496
AND max_ip > 508496;
Explain Result
id select_type table type possible_keys key key_len ref rows Extra
------ ----------- ------ ------ ---------------------- -------- ------- ------ ------ -----------------------
1 SIMPLE a range ip_range,min_ip,max_ip ip_range 4 (NULL) 1 Using index condition
So it puzzles me that, based on the value passed to the SELECT query, MySQL decides when to use an index and when not to.
I can't work out the basis on which it decides whether to use the index for a given value. I understand that an index may not be used if no index suits the WHERE condition, but in this case the ip_range index, which is built on the min_ip and max_ip columns, clearly suits the WHERE condition.
But the bigger problem I have is: what about other queries? Do I have to go and test those queries on a grand scale?
And even then, as the data grows, can I rely on MySQL to use the index?
Yes, I can always use FORCE INDEX to ensure it uses the index. But this is not standard SQL that works on all databases.
ORM frameworks may not support the FORCE INDEX syntax when they generate the SQL, and it tightly couples your query to your index names.
I'm not sure if anyone else has encountered this issue, but it seems to be a very big problem for me.
Fully agree with Vatev and the others. MySQL is not the only database that does this. Scanning the table is sometimes cheaper than looking at the index first and then looking up the corresponding entries on disk.
The only time it uses the index for sure is when it's a covering index, meaning that every column the query needs (from this particular table, of course) is present in the index. For example, if you only need the network column
SELECT network
FROM address a
WHERE min_ip < 2410508496
AND max_ip > 2410508496;
then a covering index like
CREATE INDEX ip_range ON address (min_ip, max_ip, network) USING BTREE;
would only look at the index as there's no need to lookup additional data on disk at all. And the whole index could be kept in memory.
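To see why the plan in the question changes with the value passed in, it helps to count how many index entries a range scan on min_ip has to touch. The key_len of 4 in the question's EXPLAIN output suggests that only the 4-byte min_ip part of ip_range bounds the scan, so the sketch below models just that column. The data is invented (1000 evenly spread ranges) and the function name is hypothetical; only the two probe values come from the question.

```python
from bisect import bisect_left


def entries_scanned(sorted_min_ips, value):
    """Number of index entries with min_ip < value.

    With a range condition on the leading column, every one of these entries
    has to be visited; max_ip can only filter them, not shrink the scan.
    """
    return bisect_left(sorted_min_ips, value)


# Invented data: 1000 ranges whose min_ip values are spread evenly.
min_ips = list(range(0, 4_000_000_000, 4_000_000))
total = len(min_ips)

# The two values used in the question's EXPLAIN runs.
for value in (508_496, 2_410_508_496):
    scanned = entries_scanned(min_ips, value)
    print(f"value={value}: {scanned} of {total} entries ({scanned / total:.0%})")
```

For the small value the scan touches a single entry, while for the large one it touches more than half the index, and each of those entries may need a further row lookup unless the index is covering. That is the situation in which a full table scan can look cheaper to the optimizer.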
Ranges like that are nasty to optimize. But I have a technique. It requires non-overlapping ranges and stores only a start_ip, not the end_ip (which is effectively available from the 'next' record). It provides stored routines to hide the messy code, involving ORDER BY ... LIMIT 1 and other tricks. For most operations it won't hit more than one block of data, unlike the obvious approaches that tend to fetch half or all the table.
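The idea of storing only start_ip for non-overlapping ranges can be sketched in Python with a sorted list and a binary search, which plays the role that a query along the lines of `WHERE start_ip <= ? ORDER BY start_ip DESC LIMIT 1` would play in SQL. This is a minimal sketch of the general idea, not the answer's stored routines: the table contents, the use of None to mark unassigned gaps, and the function name are all assumptions made for illustration.

```python
from bisect import bisect_right

# Invented, sorted, non-overlapping ranges. Each row stores only its start;
# the end is implied by the start of the next row. An owner of None marks an
# unassigned gap (an assumption of this sketch).
RANGES = [
    (0, None),
    (1000, "net-a"),
    (2000, None),
    (5000, "net-b"),
]
STARTS = [start for start, _ in RANGES]


def lookup(ip):
    """Return the owner of the range containing ip, or None if unassigned."""
    pos = bisect_right(STARTS, ip) - 1  # last row with start_ip <= ip
    if pos < 0:
        return None
    return RANGES[pos][1]


print(lookup(1500))
print(lookup(2000))
```

Because the search lands on exactly one row, the cost does not depend on how large the probed value is, unlike a range scan over min_ip.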
I agree with all the answers above, but you can try creating just one composite index like this:
create index ip_rang on address (min_ip ASC,max_ip DESC) using BTREE;
As you know, an index also has the disadvantage of using disk space, so consider which index is optimal to keep.
I'm running the following query on a table and only changing the values in the WHERE condition. In one case it uses one index, and in another case it uses a different (wrong?) index.
The row count for query 1 is 402954 and it takes approx. 1.5 sec.
The row count for query 2 is 52097 and it takes approx. 35 sec.
Query 1 and query 2 are the same; only the values in the WHERE condition differ.
query 1
EXPLAIN SELECT
log_type,count(DISTINCT subscriber_id) AS distinct_count,
count(subscriber_id) as total_count
FROM campaign_logs
WHERE
domain = 'xxx' AND
campaign_id='123' AND
log_type IN ('EMAIL_SENT', 'EMAIL_CLICKED', 'EMAIL_OPENED', 'UNSUBSCRIBED') AND
log_time BETWEEN
CONVERT_TZ('2015-02-12 00:00:00','+05:30','+00:00') AND
CONVERT_TZ('2015-02-19 23:59:58','+05:30','+00:00')
GROUP BY log_type;
EXPLAIN of above query
+----+-------------+---------------+-------+------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+-------+------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+------+--------+-------------+
| 1 | SIMPLE | campaign_logs | range | campaign_id_index,domain_index,log_type_index,log_time_index,campaignid_domain_logtype_logtime_index | campaignid_domain_logtype_logtime_index | 468 | NULL | 402954 | Using where |
+----+-------------+---------------+-------+------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+------+--------+-------------+
query 2
EXPLAIN SELECT
log_type,count(DISTINCT subscriber_id) AS distinct_count,
count(subscriber_id) as total_count
FROM stats.campaign_logs
WHERE
domain = 'yyy' AND
campaign_id='345' AND
log_type IN ('EMAIL_SENT', 'EMAIL_CLICKED', 'EMAIL_OPENED', 'UNSUBSCRIBED') AND
log_time BETWEEN
CONVERT_TZ('2014-02-05 00:00:00','+05:30','+00:00') AND
CONVERT_TZ('2015-02-19 23:59:58','+05:30','+00:00')
GROUP BY log_type;
explain of above query
+----+-------------+---------------+-------------+------------------------------------------------------------------------------------------------------+--------------------------------+---------+------+-------+------------------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+-------------+------------------------------------------------------------------------------------------------------+--------------------------------+---------+------+-------+------------------------------------------------------------------------------+
| 1 | SIMPLE | campaign_logs | index_merge | campaign_id_index,domain_index,log_type_index,log_time_index,campaignid_domain_logtype_logtime_index | campaign_id_index,domain_index | 153,153 | NULL | 52097 | Using intersect(campaign_id_index,domain_index); Using where; Using filesort |
+----+-------------+---------------+-------------+------------------------------------------------------------------------------------------------------+--------------------------------+---------+------+-------+------------------------------------------------------------------------------+
Query 1 is using the correct index, because I have a composite index.
Query 2 is using an index merge, and it takes a long time to execute.
Why is MySQL using different indexes for the same query?
I know we can specify USE INDEX in the query, but why is MySQL not picking the correct index in this case? Am I doing anything wrong?
No, you're not doing anything wrong.
As Chipmonkey stated in comments, sometimes MySQL will choose the wrong execution plan because of outdated table statistics. You can update the table statistics by performing ANALYZE TABLE.
Still, the MySQL optimizer isn't that sophisticated. It sees that in both cases it will have to visit the secondary index and then perform a lookup in the clustered index to get the actual table data. So when it estimated that the second query would have better selectivity using the two separate indexes and merging them, you can't blame it too much just because it guessed wrong.
I'm guessing that if you had a covering index so that MySQL could perform the entire query with just the index, it will favor that index over performing a merge.
Try adding subscriber_id to the end of your multi-column index to get a covering index.
Otherwise, use USE INDEX or FORCE INDEX, because that's what they're there for. You know more about the data than MySQL does.
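What "Using intersect(campaign_id_index,domain_index)" amounts to can be modelled in Python by intersecting the row ids found through two single-column indexes and comparing that with one probe of a composite index. This is a conceptual sketch only: the rows, the helper names and the "entries read" counter are invented, and real index merge works on sorted row id streams rather than Python sets.

```python
# Invented sample rows; only the two equality columns matter here.
ROWS = [
    {"id": 1, "campaign_id": "345", "domain": "yyy"},
    {"id": 2, "campaign_id": "345", "domain": "yyy"},
    {"id": 3, "campaign_id": "345", "domain": "zzz"},
    {"id": 4, "campaign_id": "999", "domain": "yyy"},
    {"id": 5, "campaign_id": "123", "domain": "xxx"},
    {"id": 6, "campaign_id": "999", "domain": "yyy"},
]


def build_index(rows, *columns):
    """Map a tuple of column values to the ids of the rows that have them."""
    index = {}
    for row in rows:
        key = tuple(row[c] for c in columns)
        index.setdefault(key, []).append(row["id"])
    return index


campaign_idx = build_index(ROWS, "campaign_id")
domain_idx = build_index(ROWS, "domain")
composite_idx = build_index(ROWS, "campaign_id", "domain")


def via_intersect(campaign_id, domain):
    """Index merge: read both single-column indexes, keep the common ids."""
    a = set(campaign_idx.get((campaign_id,), []))
    b = set(domain_idx.get((domain,), []))
    return sorted(a & b), len(a) + len(b)  # (matching ids, entries read)


def via_composite(campaign_id, domain):
    """Composite index: one probe returns only the matching ids."""
    ids = composite_idx.get((campaign_id, domain), [])
    return sorted(ids), len(ids)


print(via_intersect("345", "yyy"))
print(via_composite("345", "yyy"))
```

Both paths return the same ids, but the intersect path reads every entry for the campaign and every entry for the domain before discarding most of them, which is one plausible reason the merged plan ends up slower than its row estimate suggests.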
I suggest you try this:
Add this permutation of your compound index.
(campaign_id,domain,log_time,log_type,subscriber_id)
Change your query to remove the WHERE log_type IN() criterion, thus allowing the aggregate function to use all the records it finds in the range scan on log_time. Including subscriber_id in the index should allow the whole query to be satisfied directly from the index. That is, this is a covering index.
Finally, you can filter on your log_type values by wrapping the whole query in
SELECT *
FROM (/*the whole query*/) x
WHERE log_type IN
('EMAIL_SENT', 'EMAIL_CLICKED', 'EMAIL_OPENED', 'UNSUBSCRIBED')
ORDER BY log_type
This should give you better, and more predictable, performance.
(Unless the log_types you want are a tiny subset of the records, in which case please ignore this suggestion.)
I have a table that has 4,000,000 records.
The table was created like this: (user_id int, partner_id int, PRIMARY KEY ( user_id )) engine=InnoDB;
I want to test the performance of selecting 100 records.
So I tested the following:
mysql> explain select user_id from MY_TABLE use index (PRIMARY) where user_id IN ( 1 );
+----+-------------+----------+-------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+---------------+---------+---------+-------+------+-------------+
| 1 | PRIMARY | MY_TABLE | const | PRIMARY | PRIMARY | 4 | const | 1 | Using index |
+----+-------------+----------+-------+---------------+---------+---------+-------+------+-------------+
1 row in set, 1 warning (0.00 sec)
This is OK.
But the result of this query is cached by MySQL.
So the test is meaningless after the first run.
Then I thought of a query that selects by a random value.
I tested the following:
mysql> explain select user_id from MY_TABLE use index (PRIMARY) where user_id IN ( select ceil( rand() ) );
+----+-------------+----------+-------+---------------+---------+---------+------+---------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+---------------+---------+---------+------+---------+--------------------------+
| 1 | PRIMARY | MY_TABLE | index | NULL | PRIMARY | 4 | NULL | 3998727 | Using where; Using index |
+----+-------------+----------+-------+---------------+---------+---------+------+---------+--------------------------+
But this is bad.
EXPLAIN shows that possible_keys is NULL.
So a full index scan is planned, and in fact it is much slower than the previous query.
How do I write a query that selects by a random value and still uses an index lookup?
Thanks
Using rand() in SQL is usually a sure-fire way to make the query slow. A common theme here is people using it in ORDER BY to get a random sequence. It's slow because not only does it throw away the indexes, but it also reads through the whole table.
However in your case, the fact that the function calls are in a sub-query ought to allow the outer query to still use its indexes. The fact that it isn't seems quite odd (so I've given the question a +1 vote).
My theory is that perhaps MySQL's optimiser is getting it wrong -- it's seeing the functions in the inner query, and deciding incorrectly that it can't use an index.
The only thing I can suggest to work around that is using force index to push MySQL into using the index you want.
See the definition of rand().
If I understand correctly, you are trying to get a random record from the database. If that is the case, again from the rand() definition:
ORDER BY RAND() combined with LIMIT is useful for selecting a random sample from a set of rows:
SELECT * FROM table1, table2 WHERE a=b AND c<d ORDER BY RAND() LIMIT 1000;
It's a limitation of the MySQL optimizer that it can't tell that the subquery returns exactly one value. It has to assume the subquery returns multiple rows with unpredictable values, potentially even all the values of user_id. Therefore it decides it's just going to do an index scan.
Here's a workaround:
mysql> explain select user_id from MY_TABLE use index (PRIMARY)
where user_id = ( select ceil( rand() ) );
Note that MySQL's RAND() function returns a value in the range 0 <= v < 1.0. If you CEIL() it, you'll likely get the value 1. Therefore you'll virtually always get the row where user_id=1. If you don't have such a row in your table, you'll get an empty set result. You certainly won't get a user chosen randomly among all your users.
To fix that problem, you'd have to multiply the rand() by the number of distinct user_id values. And that brings up the problem that you might have gaps, so a randomly chosen value won't match any existing user_id.
Re your comment:
You'll always see possible keys as NULL when you get an index scan (i.e., "type" is "index").
I tried your explain query on a similar table, and it appears that the optimizer can't figure out that the subquery is a constant expression. You can work around this limitation by calculating the random number in application code and then using the result as a constant value in your query:
select user_id from MY_TABLE use index (PRIMARY)
where user_id = $random;
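Calculating the random value in application code might look like the Python sketch below. It also shows why rounding up a value in [0, 1) is not a useful random id, and one possible way to cope with gaps: snap the candidate to the next existing id. The id list and function names are invented; in a real application the snapping step would more likely be a query such as `WHERE user_id >= ? ORDER BY user_id LIMIT 1`, which is an assumption of this sketch and not something the answer spells out.

```python
import math
import random
from bisect import bisect_left


def naive_pick(rng):
    """Mimics ceil(rand()): always 1, unless random() returns exactly 0.0."""
    return math.ceil(rng.random())


def random_existing_id(sorted_ids, rng):
    """Pick a candidate in [min id, max id], then snap to the next existing id.

    Ids that follow a large gap are chosen more often, so the pick is only
    roughly uniform when the ids are sparse.
    """
    candidate = rng.randint(sorted_ids[0], sorted_ids[-1])
    return sorted_ids[bisect_left(sorted_ids, candidate)]


rng = random.Random(42)         # seeded only to make the sketch repeatable
user_ids = [1, 2, 5, 9, 40]     # invented ids with gaps
print(naive_pick(rng))
print(random_existing_id(user_ids, rng))
```

The value returned by random_existing_id can then be passed to the query as a plain constant, which lets the optimizer use a primary key lookup.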
I am using a MySQL database.
I know that if I create an index on a column, querying the table by that column will be fast. But I still have the following questions:
(Suppose I have a table named cars with a column named country, and I have created an index on the country column.)
I know that, for example, the query SELECT * FROM cars WHERE country='japan' will use the index on the country column, which is fast. What about the != operation? Will SELECT * FROM cars WHERE country!='japan'; also use the index?
Does the WHERE ... IN ... operation use the index? For example, SELECT * FROM cars WHERE country IN ('japan','usa','sweden');
The general answer is: it depends. It depends on what the database optimizer thinks is the best way to retrieve the data, and its decision may depend on the distribution of the data.
For example, if 99% of your rows have country = 'japan', the first query (=) may not use the index, while the query with != may use it.
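The effect of data distribution can be made concrete by computing what fraction of rows each predicate matches. The Python sketch below uses the 99% example from this answer with an invented 100-row column; the function name is hypothetical, and the optimizer works from its own statistics rather than an exact count like this.

```python
from collections import Counter


def fraction_matched(values, predicate):
    """Fraction of rows whose column value satisfies the predicate."""
    counts = Counter(values)
    total = sum(counts.values())
    matched = sum(n for value, n in counts.items() if predicate(value))
    return matched / total


# Invented column: 99 rows for 'japan' and 1 row for 'usa'.
countries = ["japan"] * 99 + ["usa"]

print(fraction_matched(countries, lambda c: c == "japan"))
print(fraction_matched(countries, lambda c: c != "japan"))
print(fraction_matched(countries, lambda c: c in ("japan", "usa", "sweden")))
```

Here the equality predicate matches almost the whole table, so an index offers little benefit, while the != predicate matches a single row and is the selective one.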
You can use EXPLAIN SELECT to find out if your query uses an index or not.
For example:
EXPLAIN SELECT *
FROM A
WHERE foo NOT IN (1,4,5,6);
Might yield:
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | A | ALL | NULL | NULL | NULL | NULL | 2 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
In this case, the query had no possible_keys and therefore used no key to do the query. It's the key column you'd be interested in.
More information here:
http://dev.mysql.com/doc/refman/5.0/en/explain.html
http://dev.mysql.com/doc/refman/5.0/en/optimization-indexes.html
Use 'EXPLAIN' to see what happens with your query. You will probably be interested in the 'possible_keys' and 'key' column.
EXPLAIN SELECT * FROM CARS WHERE `country` != 'japan'
Both queries will use indexes (assuming there is an index with country as its first column).
When in doubt use EXPLAIN. You will also want to read (at least parts of) this