How to index when querying a range of values in MySQL

I have this SQL query:
SELECT level, data, string, soln, uid
FROM user_data
WHERE level = 10 AND (timetaken >= 151 AND timetaken <= 217) AND uid != 1
LIMIT 8852, 1;
which fetches from a table with 1.5 million rows.
I have indexed it using:
alter table user_data add index a_idx (level, timetaken, uid);
The issue I am facing is that the query takes more than 30 seconds in some cases and less than 0.01 seconds in others. Is there any issue with the indexing here?
Edit:
Added the EXPLAIN details:
+----+-------------+------------------+-------+---------------+------------+---------+------+-------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+-------+---------------+------------+---------+------+-------+--------------------------+
| 1 | SIMPLE | user_data | range | a_idx | a_idx | 30 | NULL | 24091 | Using where; Using index |
+----+-------------+------------------+-------+---------------+------------+---------+------+-------+--------------------------+
The data field in the table is a TEXT field, and its length is greater than 255 characters in most cases. Does this cause an issue?

First of all, you should get the execution plan of this query with EXPLAIN:
EXPLAIN SELECT level, data, string, soln, uid
FROM user_data
WHERE level = 10 AND (timetaken >= 151 AND timetaken <= 217) AND uid != 1
LIMIT 8852, 1;
This slide deck is a good walkthrough of the topic:
http://www.slideshare.net/phpcodemonkey/mysql-explain-explained
Try adding different indexes:
one on uid and level
a separate one on timetaken
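A sketch of what those statements might look like (the index names here are arbitrary):
ALTER TABLE user_data ADD INDEX uid_level_idx (uid, level);
ALTER TABLE user_data ADD INDEX timetaken_idx (timetaken);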

The problem is the high offset. In order to select the 8853rd result, MySQL has to scan past all 8852 rows before it.
By the way, using LIMIT without ORDER BY may lead to unexpected results, since the row order is not guaranteed.
To speed up queries with a high offset, you should move to a since..until (keyset) pagination strategy, as sketched below.
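A minimal sketch of that strategy, assuming you keep track of the last uid you returned (@last_seen_uid is a placeholder for that value, and any unique, indexed ordering column would do):
-- continue from where the previous page stopped
SELECT level, data, string, soln, uid
FROM user_data
WHERE level = 10
  AND timetaken BETWEEN 151 AND 217
  AND uid != 1
  AND uid > @last_seen_uid
ORDER BY uid
LIMIT 1;
Each request then resumes where the previous one ended, so MySQL never has to scan and discard thousands of leading rows.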

Related

MySQL not picking the correct index for a few queries

I'm running the following query on the table, changing only the values in the WHERE condition. In one case it uses one index, and in the other case it picks another (wrong?) index.
The row count for query 1 is 402954 and it takes approx. 1.5 sec.
The row count for query 2 is 52097 and it takes approx. 35 sec.
Queries 1 and 2 are the same; only the values in the WHERE condition change.
query 1
EXPLAIN SELECT
log_type,count(DISTINCT subscriber_id) AS distinct_count,
count(subscriber_id) as total_count
FROM campaign_logs
WHERE
domain = 'xxx' AND
campaign_id='123' AND
log_type IN ('EMAIL_SENT', 'EMAIL_CLICKED', 'EMAIL_OPENED', 'UNSUBSCRIBED') AND
log_time BETWEEN
CONVERT_TZ('2015-02-12 00:00:00','+05:30','+00:00') AND
CONVERT_TZ('2015-02-19 23:59:58','+05:30','+00:00')
GROUP BY log_type;
EXPLAIN of the above query:
+----+-------------+---------------+-------+------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+-------+------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+------+--------+-------------+
| 1 | SIMPLE | campaign_logs | range | campaign_id_index,domain_index,log_type_index,log_time_index,campaignid_domain_logtype_logtime_index | campaignid_domain_logtype_logtime_index | 468 | NULL | 402954 | Using where |
+----+-------------+---------------+-------+------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+------+--------+-------------+
query 2
EXPLAIN SELECT
log_type,count(DISTINCT subscriber_id) AS distinct_count,
count(subscriber_id) as total_count
FROM stats.campaign_logs
WHERE
domain = 'yyy' AND
campaign_id='345' AND
log_type IN ('EMAIL_SENT', 'EMAIL_CLICKED', 'EMAIL_OPENED', 'UNSUBSCRIBED') AND
log_time BETWEEN
CONVERT_TZ('2014-02-05 00:00:00','+05:30','+00:00') AND
CONVERT_TZ('2015-02-19 23:59:58','+05:30','+00:00')
GROUP BY log_type;
EXPLAIN of the above query:
+----+-------------+---------------+-------------+------------------------------------------------------------------------------------------------------+--------------------------------+---------+------+-------+------------------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+-------------+------------------------------------------------------------------------------------------------------+--------------------------------+---------+------+-------+------------------------------------------------------------------------------+
| 1 | SIMPLE | campaign_logs | index_merge | campaign_id_index,domain_index,log_type_index,log_time_index,campaignid_domain_logtype_logtime_index | campaign_id_index,domain_index | 153,153 | NULL | 52097 | Using intersect(campaign_id_index,domain_index); Using where; Using filesort |
+----+-------------+---------------+-------------+------------------------------------------------------------------------------------------------------+--------------------------------+---------+------+-------+------------------------------------------------------------------------------+
Query 1 is using the correct index because I have a composite index.
Query 2 is using an index merge, and it's taking a long time to execute.
Why is MySQL using different indexes for the same query?
I know we can specify USE INDEX in the query, but why is MySQL not picking the correct index in this case? Am I doing anything wrong?
No, you're not doing anything wrong.
As Chipmonkey stated in comments, sometimes MySQL will choose the wrong execution plan because of outdated table statistics. You can update the table statistics by performing ANALYZE TABLE.
Still, the MySQL optimizer isn't that sophisticated. It sees that in both cases it will have to visit the secondary index and then do a lookup into the clustered index to get the actual table data. It estimated that the second query would get better selectivity by using the two separate indexes and merging them, so you can't blame it too much for guessing wrong.
I'm guessing that if you had a covering index, so that MySQL could perform the entire query with just the index, it would favor that index over performing a merge.
Try adding subscriber_id to the end of your multi-column index to get a covering index.
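A sketch of what that might look like, assuming the existing compound index is on (campaign_id, domain, log_type, log_time) and with a made-up name for the new index:
ALTER TABLE campaign_logs
  ADD INDEX campaignid_domain_logtype_logtime_subscriber_index
    (campaign_id, domain, log_type, log_time, subscriber_id);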
Otherwise, use USE INDEX or FORCE INDEX, because that's what they're there for. You know more about the data than MySQL does.
I suggest you try this:
Add this permutation of your compound index.
(campaign_id,domain,log_time,log_type,subscriber_id)
Change your query to remove the WHERE log_type IN() criterion, thus allowing the aggregate function to use all the records it finds in the range scan on log_time. Including subscriber_id in the index should allow the whole query to be satisfied directly from the index. That is, this is a covering index.
Finally, you can filter on your log_type values by wrapping the whole query in
SELECT *
FROM (/*the whole query*/) x
WHERE log_type IN
('EMAIL_SENT', 'EMAIL_CLICKED', 'EMAIL_OPENED', 'UNSUBSCRIBED')
ORDER BY log_type
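Assembled, the rewritten query might look like this (using the first query's values for illustration):
SELECT *
FROM (
    SELECT log_type,
           count(DISTINCT subscriber_id) AS distinct_count,
           count(subscriber_id) AS total_count
    FROM campaign_logs
    WHERE domain = 'xxx'
      AND campaign_id = '123'
      AND log_time BETWEEN
          CONVERT_TZ('2015-02-12 00:00:00','+05:30','+00:00') AND
          CONVERT_TZ('2015-02-19 23:59:58','+05:30','+00:00')
    GROUP BY log_type
) x
WHERE log_type IN ('EMAIL_SENT', 'EMAIL_CLICKED', 'EMAIL_OPENED', 'UNSUBSCRIBED')
ORDER BY log_type;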
This should give you better, and more predictable, performance.
(Unless the log_types you want are a tiny subset of the records, in which case please ignore this suggestion.)

Why does MySQL say "not using key" when I use RAND() in WHERE?

I have a table that has 4,000,000 records.
The table is created as: (user_id int, partner_id int, PRIMARY KEY (user_id)) ENGINE=InnoDB;
I want to test the performance of selecting 100 records.
So I tested the following:
mysql> explain select user_id from MY_TABLE use index (PRIMARY) where user_id IN ( 1 );
+----+-------------+----------+-------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+---------------+---------+---------+-------+------+-------------+
| 1 | PRIMARY | MY_TABLE | const | PRIMARY | PRIMARY | 4 | const | 1 | Using index |
+----+-------------+----------+-------+---------------+---------+---------+-------+------+-------------+
1 row in set, 1 warning (0.00 sec)
This is OK.
But this query result is cached by MySQL, so the test is meaningless after the first run.
So I thought of a query that selects by a random value, and tested the following:
mysql> explain select user_id from MY_TABLE use index (PRIMARY) where user_id IN ( select ceil( rand() ) );
+----+-------------+----------+-------+---------------+---------+---------+------+---------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+---------------+---------+---------+------+---------+--------------------------+
| 1 | PRIMARY | MY_TABLE | index | NULL | PRIMARY | 4 | NULL | 3998727 | Using where; Using index |
+----+-------------+----------+-------+---------------+---------+---------+------+---------+--------------------------+
But this is bad.
EXPLAIN shows that possible_keys is NULL.
So a full index scan is planned, and in fact it's much slower than the query before.
So please teach me: how do I select by a random value while still using an index lookup?
Thanks
Using rand() in SQL is usually a sure-fire way to make the query slow. A common theme here is people using it in ORDER BY to get a random sequence. It's slow because not only does it throw away the indexes, but it also reads through the whole table.
However in your case, the fact that the function calls are in a sub-query ought to allow the outer query to still use its indexes. The fact that it isn't seems quite odd (so I've given the question a +1 vote).
My theory is that perhaps MySQL's optimiser is getting it wrong: it's seeing the functions in the inner query, and deciding incorrectly that it can't use an index.
The only thing I can suggest to work around that is using force index to push MySQL into using the index you want.
See the definition of RAND().
If I understand right, you are trying to get a random record from the database. If that is the case, again from the RAND() definition:
ORDER BY RAND() combined with LIMIT is useful for selecting a random sample from a set of rows:
SELECT * FROM table1, table2 WHERE a=b AND c<d ORDER BY RAND() LIMIT 1000;
It's a limitation of the MySQL optimizer, that it can't tell that the subquery returns exactly one value, it has to assume the subquery returns multiple rows with unpredictable values, potentially even all the values of user_id. Therefore it decides it's just going to do an index scan.
Here's a workaround:
mysql> explain select user_id from MY_TABLE use index (PRIMARY)
where user_id = ( select ceil( rand() ) );
Note that MySQL's RAND() function returns a value in the range 0 <= v < 1.0. If you CEIL() it, you'll likely get the value 1. Therefore you'll virtually always get the row where user_id=1. If you don't have such a row in your table, you'll get an empty set result. You certainly won't get a user chosen randomly among all your users.
To fix that problem, you'd have to multiply the rand() by the number of distinct user_id values. And that brings up the problem that you might have gaps, so a randomly chosen value won't match any existing user_id.
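One common gap-tolerant sketch of that idea: pick a random value up to MAX(user_id) and take the first existing id at or above it. The distribution is slightly biased toward ids that follow large gaps, but the lookup stays on the primary key index:
SELECT user_id
FROM MY_TABLE,
     -- derived table computes the random target once, as a constant
     (SELECT FLOOR(1 + RAND() * (SELECT MAX(user_id) FROM MY_TABLE)) AS r) AS pick
WHERE user_id >= pick.r
ORDER BY user_id
LIMIT 1;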
Re your comment:
You'll always see possible_keys as NULL when you get an index scan (i.e., "type" is "index").
I tried your explain query on a similar table, and it appears that the optimizer can't figure out that the subquery is a constant expression. You can workaround this limitation by calculating the random number in application code and then using the result as a constant value in your query:
select user_id from MY_TABLE use index (PRIMARY)
where user_id = $random;

How can I speed up this query that joins a table on itself?

We have a `users` table that holds information about our users. One of the fields within this table is called 'query'. I am trying to SELECT the user IDs of all users that have the same query. So my output should look like this:
user1_id user2_id common_query
43 2 "foo"
117 433 "bar"
1 119 "baz"
1 52 "qux"
Unfortunately, I can't get this query to finish in under an hour (the users table is pretty big). This is my current query:
SELECT u1.id,
u2.id,
u1.query
FROM users u1
INNER JOIN users u2
ON u1.query = u2.query
AND u1.id <> u2.id
My explain:
+----+-------------+-------+-------+----------------------+----------------------+---------+---------------------------------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+----------------------+----------------------+---------+---------------------------------+----------+--------------------------+
| 1 | SIMPLE | u1 | index | index_users_on_query | index_users_on_query | 768 | NULL | 10905267 | Using index |
| 1 | SIMPLE | u2 | ref | index_users_on_query | index_users_on_query | 768 | u1.query | 11 | Using where; Using index |
+----+-------------+-------+-------+----------------------+----------------------+---------+---------------------------------+----------+--------------------------+
As you can see from the EXPLAIN, the users table is indexed on query, and that index appears to be used in my SELECT. I'm wondering why the 'rows' column for table u2 has a value of 11 and not 1. Is there anything I can do to speed this query up? Is my '<>' comparison within the join bad practice? Also, the id field is the primary key.
My biggest concern is the key_len, which indicates that MySQL must compare up to 768 bytes in order to lookup each index entry.
For this query, a hash index on query could be much more performant, as it would involve substantially shorter comparisons, at the cost of calculating hashes and being unable to sort records using that index:
ALTER TABLE users ADD INDEX (query) USING HASH
Note, though, that InnoDB and MyISAM only support BTREE indexes; they will quietly build a BTREE even if you write USING HASH, so a true hash index requires the MEMORY engine.
You might also consider making this a composite index on (query, id) so that MySQL need not scan into the record itself to test the <> criterion.
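A sketch of that composite variant (the index name is arbitrary):
ALTER TABLE users ADD INDEX query_id_idx (query, id);
With id alongside query in the index, both the join condition and the <> test can be answered without touching the row data.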
The main driver of the query is the equality on the query field, if it's indexed. The <> on id is probably not very selective, which shows in the 'ref' access type being used for it.
Below only applies if 'query' is not indexed....
If id is the primary key you could just do this:
CREATE INDEX index_1 ON users (query);
The result of adding such an index will be a covering index for the query and will result in the fastest execution for the query.
How many distinct queries do you have? You could add a table UsersInQueries:
id queryId userId
0 5 453
1 23 732
2 15 761
then select from this table and group by queryId, as sketched below.
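A sketch of that select, assuming the hypothetical UsersInQueries table above:
SELECT queryId, GROUP_CONCAT(userId) AS userIds
FROM UsersInQueries
GROUP BY queryId
HAVING COUNT(*) > 1;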
If you only have up to two users per query, you could do this instead:
select query, min(id) as FirstID, max(id) as SecondId
from users
group by query
having count(*) > 1
If you have more than two users with the same query, can you explain why you would want all pairs of such users?

How can I speed up a GROUP BY query that already uses indexes?

We have a MyISAM table with approximately 75 million rows that has 5 columns:
id (int),
user_id(int),
page_id (int),
type (enum with 6 strings)
date_created(datetime).
We have a primary index on the id column, a unique index (user_id, page_id, date_created), and a composite index (page_id, date_created).
The problem is that the query below takes up to 90 seconds to complete:
SELECT SQL_NO_CACHE user_id, count(id) nr
FROM `table`
WHERE `page_id`=301
and `date_created` BETWEEN '2012-01-03' AND '2012-02-03 23:59:59'
AND page_id<>user_id
group by `user_id`
This is the explain of this query
+----+-------------+----------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| 1 | SIMPLE | table | range | page_id | page_id | 12 | NULL | 520024 | Using where; Using temporary; Using filesort |
+----+-------------+----------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
EDIT:
At the suggestion of ypercube I tried adding a new index (page_id, user_id, date_created). However, MySQL does not use it by default, so I had to suggest it to the query optimizer. Here is the new query and its EXPLAIN:
SELECT SQL_NO_CACHE user_id, count(*) nr
FROM `table` USE INDEX (usridexp)
WHERE `page_id`=301
and `date_created` BETWEEN '2012-01-03' AND '2012-02-03 23:59:59'
AND page_id<>user_id
group by `user_id` ORDER BY NULL
+----+-------------+----------------------------+------+---------------+----------+---------+-------+---------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------------+------+---------------+----------+---------+-------+---------+--------------------------+
| 1 | SIMPLE | table | ref | usridexp | usridexp | 4 | const | 3943444 | Using where; Using index |
+----+-------------+----------------------------+------+---------------+----------+---------+-------+---------+--------------------------+
Some changes that may improve the query:
Change COUNT(id) to COUNT(*). Since id is (I guess) the PRIMARY KEY and NOT NULL, the results will be identical.
Add an ORDER BY NULL after the GROUP BY clause. In MySQL, a GROUP BY operation also sorts the results unless you specify otherwise.
The (page_id, date_created) is probably the best index that MySQL can use for this query but you could also try (page_id, user_id, date_created) (can you also post the EXPLAIN if you add this index?)
Another thing not related to the performance of this query:
If your (user_id, page_id, date_created) is UNIQUE and the id is auto generated (and not used for anything else but as a Primary Key), you can make it the PRIMARY KEY and drop the id column. One less index and 4 bytes less per row.
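A sketch of that change, assuming id is not referenced anywhere else (and remember to switch COUNT(id) to COUNT(*) first, as suggested above):
ALTER TABLE `table`
  DROP PRIMARY KEY,
  DROP COLUMN id,
  ADD PRIMARY KEY (user_id, page_id, date_created);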
1) It depends on your data - but you should have multiple indexes available to allow MySQL to choose the best one. e.g. if the table had an index on page_id it wouldn't be scanning so many rows.
2) There is a way of optimising date searches. I haven't actually implemented this myself yet, but have a similar problem that I have thought about quite a bit.
Basically you are looking up data by day - but date compares are really slow. What you could do is create another table that stores earliest and latest ID from table for each day. That table would need to be populated at the end of each day.
After that you could break your query into two parts:
i) Find the IDs to search by running two queries:
select earliestID from idCacheTable where date = '2012-01-03';
select latestID from idCacheTable where date = '2012-02-03';
ii) You can then search directly on the primary key of the table, without doing a date compare on each row, which would be waaaaaay faster.
SELECT SQL_NO_CACHE user_id, count(id) nr
FROM table
WHERE page_id=301
and (id >= earliestID and id <= latestID)
AND page_id<>user_id
group by user_id;
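A minimal sketch of what that cache table and its nightly population might look like (table and column names are just placeholders):
CREATE TABLE idCacheTable (
  date DATE PRIMARY KEY,
  earliestID INT NOT NULL,
  latestID INT NOT NULL
);
-- run once at the end of each day
INSERT INTO idCacheTable (date, earliestID, latestID)
SELECT CURDATE(), MIN(id), MAX(id)
FROM `table`
WHERE date_created >= CURDATE()
  AND date_created < CURDATE() + INTERVAL 1 DAY;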
The exact solution to your problem will depend on what your data looks like though, rather than one of those two things always being correct.
Sounds odd, but try adding a JOIN:
SELECT SQL_NO_CACHE user_id, count(id) nr
FROM `table` t
JOIN `table` t2 ON t.`user_id`= t2.`user_id`
WHERE t.`page_id`=301
and t.`date_created` BETWEEN '2012-01-03' AND '2012-02-03 23:59:59'
AND t.`page_id`<>t.`user_id`
group by t.`user_id`
For a similar problem, this made the query execute 20 times faster (3-4s instead of 60+s). The JOIN does not do anything smart; the speedup seems to come entirely from MySQL's internal implementation (tested on MySQL 5.1; the table had rare user_id duplicates).

Optimizing MySQL Aggregation Query

I've got a very large table (~100 million records) in MySQL that contains information about files. One of the pieces of information is the modified date of each file.
I need to write a query that will count the number of files that fit into specified date ranges. To do that I made a small table that specifies these ranges (all in days) and looks like this:
DateRanges
range_id  range_name  range_start  range_end
1         0-90        0            90
2         91-180      91           180
3         181-365     181          365
4         366-1095    366          1095
5         1096+       1096         999999999
And wrote a query that looks like this:
SELECT r.range_name,
  sum(IF((DATEDIFF(CURDATE(),t.file_last_access) > r.range_start
      and DATEDIFF(CURDATE(),t.file_last_access) < r.range_end),1,0)) as FileCount
FROM `DateRanges` r, `HugeFileTable` t
GROUP BY r.range_name
However, quite predictably, this query takes forever to run. I think that is because I am asking MySQL to go through the HugeFileTable 5 times, each time performing the DATEDIFF() calculation on each file.
What I want to do instead is to go through the HugeFileTable record by record only once, and for each file increment the count in the appropriate range_name running total. I can't figure out how to do that....
Can anyone help out with this?
Thanks.
EDIT: MySQL Version: 5.0.45, Tables are MyISAM
EDIT2: Here's the DESCRIBE output that was asked for in the comments:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE r ALL NULL NULL NULL NULL 5 Using temporary; Using filesort
1 SIMPLE t ALL NULL NULL NULL NULL 96506321
First, create an index on HugeFileTable.file_last_access.
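For example (the index name is arbitrary):
ALTER TABLE HugeFileTable ADD INDEX file_last_access_idx (file_last_access);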
Then try the following query:
SELECT r.range_name, COUNT(t.file_last_access) as FileCount
FROM `DateRanges` r
JOIN `HugeFileTable` t
ON (t.file_last_access BETWEEN
CURDATE() - INTERVAL r.range_end DAY AND
CURDATE() - INTERVAL r.range_start DAY)
GROUP BY r.range_name;
Here's the EXPLAIN plan that I got when I tried this query on MySQL 5.0.75 (edited down for brevity):
+-------+-------+------------------+----------------------------------------------+
| table | type | key | Extra |
+-------+-------+------------------+----------------------------------------------+
| t | index | file_last_access | Using index; Using temporary; Using filesort |
| r | ALL | NULL | Using where |
+-------+-------+------------------+----------------------------------------------+
It's still not going to perform very well. By using GROUP BY, the query incurs a temporary table, which may be expensive. Not much you can do about that.
But at least this query eliminates the Cartesian product that you had in your original query.
Update: here's another query that uses a correlated subquery, but I have eliminated the GROUP BY.
SELECT r.range_name,
(SELECT COUNT(*)
FROM `HugeFileTable` t
WHERE t.file_last_access BETWEEN
CURDATE() - INTERVAL r.range_end DAY AND
CURDATE() - INTERVAL r.range_start DAY
) as FileCount
FROM `DateRanges` r;
The EXPLAIN plan shows no temporary table or filesort (at least with the trivial amount of rows I have in my test tables):
+----+--------------------+-------+-------+------------------+--------------------------+
| id | select_type | table | type | key | Extra |
+----+--------------------+-------+-------+------------------+--------------------------+
| 1 | PRIMARY | r | ALL | NULL | |
| 2 | DEPENDENT SUBQUERY | t | index | file_last_access | Using where; Using index |
+----+--------------------+-------+-------+------------------+--------------------------+
Try this query on your data set and see if it performs better.
Well, start by making sure that file_last_access is indexed in the HugeFileTable table.
I'm not sure if this is possible/better, but try computing the date limits first (files from date A to date B), then use a query with >= and <=. Theoretically at least, it will improve performance.
The comparison would be something like:
t.file_last_access >= StartDate AND t.file_last_access <= EndDate
You could get a small improvement by removing CURDATE() and putting a fixed date in the query, as otherwise the function appears twice in each row comparison in your SQL.
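A sketch of that idea using a user variable computed once before the query runs (the variable name is arbitrary):
SET @today = CURDATE();
SELECT r.range_name,
       (SELECT COUNT(*)
        FROM `HugeFileTable` t
        WHERE t.file_last_access BETWEEN
              @today - INTERVAL r.range_end DAY AND
              @today - INTERVAL r.range_start DAY
       ) AS FileCount
FROM `DateRanges` r;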