I have written the query below and it takes nearly 5 minutes to run. The table has 6 million rows, and the execution plan shows that the query somehow does not use any index, even though every column of the table is indexed.
Query
SELECT
event_date as date,
(CAST('2014-05-31' AS DATE)- INTERVAL 5 MONTH + INTERVAL 1 DAY) AS FROM_DATE,
COUNT(DISTINCT(IF( Column1 !=0 OR Column2!=0 OR Column3 !=0, account, NULL))) AS total_account1,
COUNT(DISTINCT(IF( Column4 !=0 OR Column5 !=0 OR Column6!=0, account, NULL))) AS total_account2,
COUNT(DISTINCT(IF( Column7 !=0 OR Column8 !=0 OR Column9!=0, account, NULL))) AS total_account3
FROM Table_name
WHERE cast(event_date as DATE) BETWEEN CAST('2014-05-31' AS DATE)- INTERVAL 5 MONTH and CAST('2014-05-31' AS DATE)
AND cast(event_date as DATE) < NOW() - INTERVAL 2 DAY
GROUP BY MONTH(event_date)
"Explain" above query output is -
+----+-------------+---------+------+---------------+------+---------+------+---------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+------+---------------+------+---------+------+---------+-----------------------------+
| 1 | SIMPLE | table_name | ALL | NULL | NULL | NULL | NULL | 5764552 | Using where; Using filesort |
+----+-------------+---------+------+---------------+------+---------+------+---------+-----------------------------+
Why is my query not using the indexes available to it?
You can explicitly force the engine to use an index with an index hint.
See http://dev.mysql.com/doc/refman/5.1/en/index-hints.html
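A likely reason the plan shows no usable key is that the WHERE clause wraps event_date in CAST(), so an index on the bare column cannot be used for a range scan. Below is a minimal sketch of an equivalent filter on the bare column, assuming event_date is an indexed DATE or DATETIME column. The COUNT(*) is only there to keep the example short, and the NOW() - INTERVAL 2 DAY condition would need the same treatment.

```sql
-- Compare the bare column to constant bounds so the optimizer can consider
-- a range scan on the event_date index.
EXPLAIN
SELECT COUNT(*)
FROM Table_name
WHERE event_date >= DATE('2014-05-31') - INTERVAL 5 MONTH   -- 2013-12-31
  AND event_date <  DATE('2014-05-31') + INTERVAL 1 DAY;    -- covers all of 2014-05-31
```

Whether the index is actually chosen still depends on how much of the table falls inside the range, so it is worth checking the key column of the EXPLAIN output.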
I'm looking to achieve efficient indexing technique for my logs table that looks like this:
MariaDB [Webapp]> explain logs;
+----------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| user_id | int(11) | YES | MUL | NULL | |
| activity_name | varchar(20) | NO | | NULL | |
| activity_key | varchar(255) | NO | | NULL | |
| activity_value | varchar(255) | NO | | NULL | |
| activity_date | datetime | NO | MUL | NULL | |
+----------------+--------------+------+-----+---------+----------------+
I do searching like this:
SELECT *
FROM logs
WHERE user_id IN (1, 3)
AND activity_name IN ('login', 'logout')
AND activity_date >= '2020-02-01'
AND activity_date <= '2020-06-01'
Where columns user_id, activity_name and activity_date are involved
And sometimes like this:
SELECT *
FROM logs
WHERE user_id IN (1, 3)
AND activity_name IN ('login', 'logout')
Where both user_id and activity_name are involved but no date.
And like this too:
SELECT *
FROM logs
WHERE user_id IN (1, 3)
AND activity_date >= '2020-02-01'
AND activity_date <= '2020-06-01'
SELECT *
FROM logs
WHERE activity_name IN ('login', 'logout')
AND activity_date >= '2020-02-01'
AND activity_date <= '2020-06-01'
I did read about compound indexes and that they would be good if my search was ordered, but as you can see it's not, so I think they're not suitable.
I also read that only one single-column index can be used at a time, so I think that won't be good for my case either.
Any ideas, please? I'm not very familiar with MySQL. How can I make my queries optimal?
Note: I don't normally use the wildcard (*) because I read that it slows things down; I only used it here to shorten the queries and make them easier to read.
For each query, the basic idea is to have an index whose columns cover the where clause. This cannot be achieved using a single index for the four queries - I think that you need 3 indexes.
First, consider the following index:
logs(user_id, activity_name, activity_date)
It matches on the where clause of the first query:
WHERE
user_id IN (1, 3)
AND activity_name IN ('login', 'logout')
AND activity_date >= '2020-02-01'
AND activity_date <= '2020-06-01'
And also on the second query (the third index column is ignored here):
WHERE
user_id IN (1, 3)
AND activity_name IN ('login', 'logout')
For the two other queries, you need two separate indexes:
WHERE
user_id IN (1, 3)
AND activity_date >= '2020-02-01'
AND activity_date <= '2020-06-01'
Needs:
logs(user_id, activity_date)
And:
WHERE
activity_name IN ('login', 'logout')
AND activity_date >= '2020-02-01'
AND activity_date <= '2020-06-01'
Needs:
logs(activity_name, activity_date)
Side note: in general, do not blindly select *; instead, enumerate the columns you want in the result set - especially if you don't want them all. If you just need two or three columns, consider adding them at the end of the index, turning it into a covering index.
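The three indexes suggested above could be created as follows. This is only a sketch: the index names are made up for illustration, and each index adds some write and storage overhead to the logs table.

```sql
-- Serves the first two queries (user_id + activity_name, optionally + date range).
CREATE INDEX idx_logs_user_name_date ON logs (user_id, activity_name, activity_date);

-- Serves the query filtering on user_id and a date range.
CREATE INDEX idx_logs_user_date ON logs (user_id, activity_date);

-- Serves the query filtering on activity_name and a date range.
CREATE INDEX idx_logs_name_date ON logs (activity_name, activity_date);
```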
SELECT IF(priority_date, priority_date, created_at) as created_at
FROM table
WHERE IF(priority_date , priority_date , created_at)
BETWEEN '2017-10-10 00:00:00' AND '2017-10-10 23:59:59';
What is the best way to execute this query, performance-wise?
I have a fairly large table that has two datetimes. created_at and priority_date.
priority_date doesn't always exist, but if it does, it should be what is queried upon, else it falls back to created_at. created_at is always generated upon creation of the row. The above query causes a (nearly) full table scan.
The explain plan for initial query:
+------+-------------+-----------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | table | ALL | NULL | NULL | NULL | NULL | 444877 | Using where |
+------+-------------+-----------------+------+---------------+------+---------+------+--------+-------------+
I should also note that priority_date or created_at may not necessarily both be within the time frame in question on a single row. So doing something like:
WHERE priority_date BETWEEN '2017-10-10 00:00:00' AND '2017-10-10 23:59:59'
OR created_at BETWEEN '2017-10-10 00:00:00' AND '2017-10-10 23:59:59'
Could give bad results if priority_date was 2017-10-04 23:10:43 and created_at was 2017-10-10 01:23:45
My current rows for said table: 582739
Count of WHERE priority_date BETWEEN... : 3908
Count of WHERE created_at BETWEEN...: 3437
Example explain of just one of the columns queried in WHERE BETWEEN:
+------+-------------+-----------------+-------+----------------------------------+----------------------------------+---------+------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------------+-------+----------------------------------+----------------------------------+---------+------+------+-----------------------+
| 1 | SIMPLE | table | range | table_created_at_index | table_created_at_index | 5 | NULL | 3436 | Using index condition |
+------+-------------+-----------------+-------+----------------------------------+----------------------------------+---------+------+------+-----------------------+
Clearly the IF is not the most efficient. Both columns are indexed, and the row estimates in the explain plans for each individual column match the counts above. How can I leverage having a priority/fallback query without the wild performance loss?
EDIT
The best I've been able to figure (But WOW, is that verbose and copy/paste-y feeling)
SELECT IF(priority_date, priority_date, created_at) as created_at, priority_date
FROM table
WHERE priority_date BETWEEN '2017-10-10 00:00:00' AND '2017-10-10 23:59:59'
OR created_at BETWEEN '2017-10-10 00:00:00' AND '2017-10-10 23:59:59'
HAVING ((priority_date AND priority_date BETWEEN '2017-10-10 00:00:00' AND '2017-10-10 23:59:59')
OR created_at BETWEEN '2017-10-10 00:00:00' AND '2017-10-10 23:59:59');
And its explain plan:
+------+-------------+-----------------+-------------+-----------------------------------------------------------------------+-----------------------------------------------------------------------+---------+------+------+------------------------------------------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------------+-------------+-----------------------------------------------------------------------+-----------------------------------------------------------------------+---------+------+------+------------------------------------------------------------------------------------------------------+
| 1 | SIMPLE | table | index_merge | table_priority_date_index,table_created_at_index | table_priority_date_index,table_created_at_index | 6,5 | NULL | 7343 | Using sort_union(table_priority_date_index,table_created_at_index); Using where |
+------+-------------+-----------------+-------------+-----------------------------------------------------------------------+-----------------------------------------------------------------------+---------+------+------+------------------------------------------------------------------------------------------------------+
First you need a compound index on (priority_date, created_at), then you can use a query like this:
SELECT IF(priority_date, priority_date, created_at) as created_at, priority_date
FROM table
WHERE priority_date BETWEEN '2017-10-10' AND '2017-10-10 23:59:59'
OR (priority_date IS NULL AND created_at BETWEEN '2017-10-10' AND '2017-10-10 23:59:59');
Having priority_date first in the compound index makes a big difference. No union is required.
Explain results on 400k rows with 2000 results:
Extra: Using where; Using index
key: priority_created_compound
rows: 2000
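For reference, the compound index named in the explain output above could be created like this. It is a sketch in which your_table is a placeholder for the real table name (the question only calls it "table", which is a reserved word in MySQL and would need backticks).

```sql
-- priority_date comes first so the range on priority_date and the
-- "priority_date IS NULL AND created_at BETWEEN ..." branch can both use it.
ALTER TABLE your_table
  ADD INDEX priority_created_compound (priority_date, created_at);
```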
SELECT priority_date as created_at
FROM table
WHERE priority_date BETWEEN '2017-10-10 00:00:00' AND '2017-10-10 23:59:59'
UNION ALL
SELECT created_at
FROM table
WHERE created_at BETWEEN '2017-10-10 00:00:00' AND '2017-10-10 23:59:59'
AND priority_date IS NULL;
You'll need an index starting with priority_date for the first half of this query, and an index on (created_at, priority_date) for the second half.
The first half will naturally not match any rows where the priority_date is NULL.
The second half will do the range-condition on created_at, and then among the subset of matching rows, further test that priority_date is NULL. This may be done by index condition pushdown.
( SELECT priority_date AS created_at
FROM table
WHERE priority_date >= '2017-10-10'
AND priority_date < '2017-10-10' + INTERVAL 1 DAY )
UNION DISTINCT
( SELECT created_at
FROM table
WHERE created_at >= '2017-10-10'
AND created_at < '2017-10-10' + INTERVAL 1 DAY
AND priority_date IS NULL )
With
INDEX(priority_date, created_at) -- in this order
Notes:
Writing the range as >= start AND < start + INTERVAL 1 DAY, instead of BETWEEN, works for date and time types other than DATETIME as well, and avoids having to compute leap days, etc. (There is no performance difference.)
For each subquery, the one index is "covering" and optimal. No ICP should be needed.
I chose DISTINCT on the UNION -- Though slower than ALL, it may be more to your app's liking. Switch to ALL if there can't be dups, or if dups are OK.
I have got two different tables with temperature values and a timestamp. I join those tables with this query:
SELECT UNIX_TIMESTAMP(l.TimeDate) time
, AVG(l.intemp)
, AVG(n.intemp)
, DATE_FORMAT(l.TimeDate, '%Y-%m-%d-%H') dates
FROM values.temps l
LEFT
JOIN values.net n
ON DATE_FORMAT(l.TimeDate, '%Y-%m-%d-%H') = DATE_FORMAT(n.TimeDate, '%Y-%m-%d-%H')
WHERE YEARWEEK('2017-01-17 00:00:00',1) = YEARWEEK(l.TimeDate,1)
GROUP
BY dates
ORDER
BY dates ASC
This query is a little bit slow, but it works and gives me the values for 1 week. So how can I optimize it?
I haven't responded because actually I'm struggling to think how to express your YEARWEEK condition in terms of a range query.
I thought something like this would work, but it refuses to use 'range'.
SELECT *
FROM my_table
WHERE dt BETWEEN CONCAT(STR_TO_DATE(CONCAT(YEARWEEK('2017-01-25'), ' Monday'), '%x%v %W'), ' 00:00:00')
AND CONCAT(STR_TO_DATE(CONCAT(YEARWEEK('2017-01-25'), ' Sunday'), '%x%v %W'), ' 23:59:59')
Perhaps others can spot my schoolboy error.
+----+-------------+----------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | my_table | ALL | dt | NULL | NULL | NULL | 100 | Using where |
+----+-------------+----------+------+---------------+------+---------+------+------+-------------+
For example, I have created 3 indexes:
click_date - transaction table, daily_metric table
order_date - transaction table
I want to check whether my query uses these indexes. I ran EXPLAIN and got this result:
+----+--------------+--------------+-------+---------------+------------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+--------------+-------+---------------+------------+---------+------+--------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 668 | Using temporary; Using filesort |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 645 | |
| 2 | DERIVED | <derived4> | ALL | NULL | NULL | NULL | NULL | 495 | |
| 4 | DERIVED | transaction | ALL | order_date | NULL | NULL | NULL | 291257 | Using where; Using temporary; Using filesort |
| 3 | DERIVED | daily_metric | range | click_date | click_date | 3 | NULL | 812188 | Using where; Using temporary; Using filesort |
| 5 | UNION | <derived7> | ALL | NULL | NULL | NULL | NULL | 495 | |
| 5 | UNION | <derived6> | ALL | NULL | NULL | NULL | NULL | 645 | Using where; Not exists |
| 7 | DERIVED | transaction | ALL | order_date | NULL | NULL | NULL | 291257 | Using where; Using temporary; Using filesort |
| 6 | DERIVED | daily_metric | range | click_date | click_date | 3 | NULL | 812188 | Using where; Using temporary; Using filesort |
| NULL | UNION RESULT | <union2,5> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------+--------------+-------+---------------+------------+---------+------+--------+----------------------------------------------+
In the EXPLAIN results I see that the order_date index of the transaction table is not used. Do I understand that correctly?
The click_date index of the daily_metric table was used, correct?
Please tell me how to tell from the EXPLAIN result whether the indexes I created are used properly by the query.
My query:
SELECT
partner_id,
the_date,
SUM(clicks) as clicks,
SUM(total_count) as total_count,
SUM(count) as count,
SUM(total_sum) as total_sum,
SUM(received_sum) as received_sum,
SUM(partner_fee) as partner_fee
FROM (
SELECT
clicks.partner_id,
clicks.click_date as the_date,
clicks,
orders.total_count,
orders.count,
orders.total_sum,
orders.received_sum,
orders.partner_fee
FROM
(SELECT
partner_id, click_date, sum(clicks) as clicks
FROM
daily_metric WHERE DATE(click_date) BETWEEN '2013-04-01' AND '2013-04-30'
GROUP BY partner_id , click_date) as clicks
LEFT JOIN
(SELECT
partner_id,
DATE(order_date) as order_dates,
SUM(order_sum) as total_sum,
SUM(customer_paid_sum) as received_sum,
SUM(partner_fee) as partner_fee,
count(*) as total_count,
count(CASE
WHEN status = 1 THEN 1
ELSE NULL
END) as count
FROM
transaction WHERE DATE(order_date) BETWEEN '2013-04-01' AND '2013-04-30'
GROUP BY DATE(order_date) , partner_id) as orders ON orders.partner_id = clicks.partner_id AND clicks.click_date = orders.order_dates
UNION ALL SELECT
orders.partner_id,
orders.order_dates as the_date,
clicks,
orders.total_count,
orders.count,
orders.total_sum,
orders.received_sum,
orders.partner_fee
FROM
(SELECT
partner_id, click_date, sum(clicks) as clicks
FROM
daily_metric WHERE DATE(click_date) BETWEEN '2013-04-01' AND '2013-04-30'
GROUP BY partner_id , click_date) as clicks
RIGHT JOIN
(SELECT
partner_id,
DATE(order_date) as order_dates,
SUM(order_sum) as total_sum,
SUM(customer_paid_sum) as received_sum,
SUM(partner_fee) as partner_fee,
count(*) as total_count,
count(CASE
WHEN status = 1 THEN 1
ELSE NULL
END) as count
FROM
transaction WHERE DATE(order_date) BETWEEN '2013-04-01' AND '2013-04-30'
GROUP BY DATE(order_date) , partner_id) as orders ON orders.partner_id = clicks.partner_id AND clicks.click_date = orders.order_dates
WHERE
clicks.partner_id is NULL
ORDER BY the_date DESC
) as t
GROUP BY the_date ORDER BY the_date DESC LIMIT 50 OFFSET 0
Although I can't explain everything the EXPLAIN has dumped, I thought there must be an easier solution than what you have and came up with the following. I would suggest the indexes below to optimize your existing query for the WHERE date range and the grouping by partner.
Additionally, when a query applies a FUNCTION to a column, it can't take advantage of an index on that column, as with your DATE(order_date) and DATE(click_date). To let the index be used, compare the bare column against the full date/time range, from 12:00am (midnight) up to 11:59pm. I would typically do this via
x >= someDate (at 12:00am) AND x < firstDayAfterRange
In your example that would be (notice less than May 1st, which covers everything up to April 30th at 11:59:59pm):
click_date >= '2013-04-01' AND click_date < '2013-05-01'
Table Index
transaction (order_date, partner_id)
daily_metric (click_date, partner_id)
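As a sketch, the two indexes above could be created and checked like this. The index names are made up for illustration, and the EXPLAIN uses a simplified form of the orders sub-query.

```sql
CREATE INDEX idx_transaction_date_partner ON transaction (order_date, partner_id);
CREATE INDEX idx_daily_metric_date_partner ON daily_metric (click_date, partner_id);

-- The bare order_date column is compared to constants, so a range scan is possible.
EXPLAIN
SELECT partner_id, DATE(order_date) AS order_dates, COUNT(*) AS total_count
FROM transaction
WHERE order_date >= '2013-04-01' AND order_date < '2013-05-01'
GROUP BY DATE(order_date), partner_id;
```

In the EXPLAIN output, a non-NULL key together with type range means the index is used for the date filter, while key NULL with type ALL (as in the transaction rows of the plan in the question) means a full table scan.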
Now, an adjustment. Since your clicks table may have entries the transactions dont, and vice-versa, I would adjust this query to do a pre-query of all possible date/partners, then left-join to respective aggregate queries such as:
SELECT
AllPartners.Partner_ID,
AllPartners.the_Date,
coalesce( clicks.clicks, 0 ) Clicks,
coalesce( orders.total_count, 0 ) TotalCount,
coalesce( orders.count, 0 ) OrderCount,
coalesce( orders.total_sum, 0 ) OrderSum,
coalesce( orders.received_sum, 0 ) ReceivedSum,
coalesce( orders.partner_fee, 0 ) PartnerFee
from
( select distinct
dm.partner_id,
DATE( dm.click_date ) as the_Date
FROM
daily_metric dm
WHERE
dm.click_date >= '2013-04-01' AND dm.click_date < '2013-05-01'
UNION
select
t.partner_id,
DATE(t.order_date) as the_Date
FROM
transaction t
WHERE
t.order_date >= '2013-04-01' AND t.order_date < '2013-05-01' ) AllPartners
LEFT JOIN
( SELECT
dm.partner_id,
DATE( dm.click_date ) sumDate,
sum( dm.clicks) as clicks
FROM
daily_metric dm
WHERE
dm.click_date >= '2013-04-01' AND dm.click_date < '2013-05-01'
GROUP BY
dm.partner_id,
DATE( dm.click_date ) ) as clicks
ON AllPartners.partner_id = clicks.partner_id
AND AllPartners.the_date = clicks.sumDate
LEFT JOIN
( SELECT
t.partner_id,
DATE(t.order_date) as sumDate,
SUM(t.order_sum) as total_sum,
SUM(t.customer_paid_sum) as received_sum,
SUM(t.partner_fee) as partner_fee,
count(*) as total_count,
count(CASE WHEN t.status = 1 THEN 1 ELSE NULL END) as COUNT
FROM
transaction t
WHERE
t.order_date >= '2013-04-01' AND t.order_date < '2013-05-01'
GROUP BY
t.partner_id,
DATE(t.order_date) ) as orders
ON AllPartners.partner_id = orders.partner_id
AND AllPartners.the_date = orders.sumDate
order by
AllPartners.the_date DESC
limit 50 offset 0
This way, the first query will be quick on the index to get all possible combinations from EITHER table. Then the left-join will AT MOST join to one row per set. If found, get the number, if not, I am applying COALESCE() so if null, defaults to zero.
CLARIFICATION.
Just as you built your pre-aggregate queries "clicks" and "orders", "AllPartners" is the ALIAS for the result of the select distinct of partners and dates within the date range you were interested in. Its resulting columns, "partner_id" and "the_date", correspond to the join columns of the two aggregate queries, so they are the basis for joining to "clicks" and "orders". Since these two columns are in the alias "AllPartners", I took them from there for the field list: the other aliases are LEFT-JOINed, and a given partner/date may not exist in one or the other of them.
Could you please help me optimize this query? I've spent lots of time and still cannot rephrase it to be fast enough (say, running in a matter of seconds, not minutes as it does now).
The query:
SELECT m.my_id, m.my_value, m.my_timestamp
FROM (
SELECT my_id, MAX(my_timestamp) AS most_recent_timestamp
FROM my_table
WHERE my_timestamp < '2011-03-01 08:00:00'
GROUP BY my_id
) as tmp
LEFT OUTER JOIN my_table m
ON tmp.my_id = m.my_id AND tmp.most_recent_timestamp = m.my_timestamp
ORDER BY m.my_timestamp;
my_table is defined as follows:
CREATE TABLE my_table (
my_id INTEGER NOT NULL,
my_value VARCHAR(4000),
my_timestamp TIMESTAMP default CURRENT_TIMESTAMP NOT NULL,
INDEX MY_ID_IDX (my_id),
INDEX MY_TIMESTAMP_IDX (my_timestamp),
INDEX MY_ID_MY_TIMESTAMP_IDX (my_id, my_timestamp)
);
The goal of this query is to select the most recent my_value for each my_id before some timestamp. my_table contains ~100 million entries and the query takes ~8 minutes to run.
explain:
+----+-------------+-------------+-------+------------------------------------------------+-------------------------+---------+---------------------------+-------+---------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+------------------------------------------------+-------------------------+---------+---------------------------+-------+---------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 90721 | Using temporary; Using filesort |
| 1 | PRIMARY | m | ref | MY_ID_IDX,MY_TIMESTAMP_IDX,MY_ID_TIMESTAMP_IDX | MY_TIMESTAMP_IDX | 4 | tmp.most_recent_timestamp | 1 | Using where |
| 2 | DERIVED | my_table | range | MY_TIMESTAMP_IDX | MY_ID_MY_TIMESTAMP_IDX | 8 | NULL | 61337 | Using where; Using index for group-by |
+----+-------------+-------------+-------+------------------------------------------------+-----------------------+---------+---------------------------+------+---------------------------------------+
If I understand correctly, you should be able to drop the nested select completely, and move the where clause to the main query, order by my_timestamp descending and limit 1.
SELECT my_id, my_value, max(my_timestamp)
FROM my_table
WHERE my_timestamp < '2011-03-01 08:00:00'
GROUP BY my_id
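*edit - added max and group by. Note that my_value is then not guaranteed to come from the row holding the maximum my_timestamp, because it is neither aggregated nor listed in the GROUP BY.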
A trick to get the most recent record is to use ORDER BY together with LIMIT 1 instead of a MAX aggregation together with a self-join.
Something like this (not tested):
SELECT m.my_id, m.my_value, m.my_timestamp
FROM my_table m
WHERE my_timestamp < '2011-03-01 08:00:00'
ORDER BY m.my_timestamp DESC
LIMIT 1
;
Update: the above doesn't work because a grouping is required.
Another solution uses WHERE ... IN (subselect) instead of the JOIN you've used.
It could be faster; please test with your data.
SELECT m.my_id, m.my_value, m.my_timestamp
FROM my_table m
WHERE ( m.my_id, m.my_timestamp ) IN (
SELECT i.my_id, MAX(i.my_timestamp)
FROM my_table i
WHERE i.my_timestamp < '2011-03-01 08:00:00'
GROUP BY i.my_id
)
ORDER BY m.my_timestamp;
I notice in the explain plan that the optimizer is using the MY_ID_MY_TIMESTAMP_IDX index for the sub-query, but not the outer query.
You may be able to speed it up using an index hint. I also updated the ON clause to refer to tmp.most_recent_timestamp using its alias.
SELECT m.my_id, m.my_value, m.my_timestamp
FROM (
SELECT my_id, MAX(my_timestamp) AS most_recent_timestamp
FROM my_table
WHERE my_timestamp < '2011-03-01 08:00:00'
GROUP BY my_id
) as tmp
LEFT OUTER JOIN my_table m use index (MY_ID_MY_TIMESTAMP_IDX)
ON tmp.my_id = m.my_id AND tmp.most_recent_timestamp = m.my_timestamp
ORDER BY m.my_timestamp;