I have a table that has 1.6M rows. Whenever I run the query below, it takes an average of 7.5 seconds.
select * from table
where pid = 170
and cdate between '2017-01-01 0:00:00' and '2017-12-31 23:59:59';
I tried adding LIMIT 1000 or 10000, and changing the date filter to cover only one month, but it still takes an average of 7.5s. I also tried adding a composite index on pid and cdate, but that made it about 1 second slower.
Here is the INDEX list
https://gist.github.com/primerg/3e2470fcd9b21a748af84746554309bc
Can I still make it faster? Is this an acceptable performance considering the amount of data?
Looks like the index is missing. Create this index and see if it helps:
CREATE INDEX cid_date_index ON table_name (pid, cdate);
Also modify your query as below:
select * from table
where pid = 170
and cdate between CAST('2017-01-01 0:00:00' AS DATETIME) and CAST('2017-12-31 23:59:59' AS DATETIME);
Please provide SHOW CREATE TABLE clicks.
How many rows are returned? If it is 100K rows, the effort to shovel that many rows is significant. And what will you do with that many rows? If you then summarize them, consider summarizing in SQL!
Do have cdate as DATETIME.
Do you use id for anything? Perhaps this would be better:
PRIMARY KEY (pid, cdate, id) -- to get benefit from clustering
INDEX(id) -- if still needed (and to keep AUTO_INCREMENT happy)
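As a concrete statement, that change might look like this (a sketch only; it assumes the table is named clicks, id is the AUTO_INCREMENT column, and pid/cdate are NOT NULL):
ALTER TABLE clicks
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (pid, cdate, id),  -- cluster rows by (pid, cdate)
  ADD INDEX (id);                    -- keeps AUTO_INCREMENT valid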
This smells like Data Warehousing. DW benefits significantly from building and maintaining Summary table(s), such as one that has the daily click count (etc), from which you could very rapidly sum up 365 counts to get the answer.
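A minimal sketch of such a summary table, assuming the base table is the clicks table from the question and using made-up column names:
-- Hypothetical daily summary table (names are illustrative, not from the question).
CREATE TABLE clicks_daily (
  pid INT NOT NULL,
  day DATE NOT NULL,
  click_count INT UNSIGNED NOT NULL,
  PRIMARY KEY (pid, day)
);

-- Build or refresh it from the base table.
INSERT INTO clicks_daily (pid, day, click_count)
SELECT pid, DATE(cdate), COUNT(*)
FROM clicks
GROUP BY pid, DATE(cdate)
ON DUPLICATE KEY UPDATE click_count = VALUES(click_count);

-- A year's total then only reads ~365 rows per pid.
SELECT SUM(click_count)
FROM clicks_daily
WHERE pid = 170
  AND day >= '2017-01-01'
  AND day <  '2017-01-01' + INTERVAL 1 YEAR;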
CAST is unnecessary. Furthermore 0:00:00 is optional -- it can be included or excluded for either DATE or DATETIME. I prefer
cdate >= '2017-01-01'
AND cdate < '2017-01-01' + INTERVAL 1 YEAR
to avoid leap year, midnight, date arithmetic, etc.
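Combined with the composite index on (pid, cdate), the original query then becomes something like this (a sketch, assuming the table is named clicks):
SELECT *
FROM clicks
WHERE pid = 170
  AND cdate >= '2017-01-01'
  AND cdate <  '2017-01-01' + INTERVAL 1 YEAR;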
I have 3 models: stores, customers, and subscriptions.
subscription has two foreign keys, to the store and customer models, and also has start_date and end_date.
The tables are pretty simple: store only has id and name, and customers is the same.
I'm running this query.
SELECT subscription_subscription.store_id, COUNT(*) AS sub_store
FROM subscription_subscription
WHERE CURRENT_DATE() <= subscription_subscription.end_date
GROUP BY subscription_subscription.store_id
ORDER BY sub_store DESC
The result: 621,760 rows in total; the query took 9.6737 seconds.
All of the tables have 1 million rows.
But when I remove WHERE CURRENT_DATE() <= subscription_subscription.end_date, the query takes 0.3177 seconds.
How can I optimize date comparison?
You can try these two things (a sketch follows below):
Use a variable to store CURRENT_DATE() and use that variable in the query instead of the function.
Create an index on end_date that includes store_id.
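A sketch of both suggestions (the index name is made up):
-- 1) Capture CURRENT_DATE() once in a user variable.
SET @today := CURRENT_DATE();

-- 2) Composite index: the range on end_date can also cover store_id.
CREATE INDEX idx_end_date_store ON subscription_subscription (end_date, store_id);

SELECT store_id, COUNT(*) AS sub_store
FROM subscription_subscription
WHERE end_date >= @today
GROUP BY store_id
ORDER BY sub_store DESC;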
How would the following three queries compare in terms of performance? I'm trying to get all records with year=2017:
Using EXTRACT:
SELECT count(*), completed_by_id FROM table
WHERE EXTRACT(YEAR FROM completed_on)=2017
GROUP BY completed_by_id
# Took 11.8s
Using YEAR:
SELECT count(*), completed_by_id FROM table
WHERE YEAR(completed_on)=2017
GROUP BY completed_by_id
# Took 5.15s
Using LIKE '2017%':
SELECT count(*), completed_by_id FROM table
WHERE completed_on LIKE '2017%'
GROUP BY completed_by_id
# Took 6.61s
Note: In my own testing I found YEAR() to be the fastest, LIKE to be the second fastest, and EXTRACT() to be the slowest.
There are about 5M rows in the table and completed_on is a DATETIME field that has been indexed.
You haven't described your table or indexes so all advice about query performance is guesswork.
If your completed_on column is a DATETIME, DATE, or TIMESTAMP type and it is indexed, this query will radically outperform all the ones you have shown, and maintain its performance as your table grows.
SELECT count(*), completed_by_id
FROM table
WHERE completed_on >= '2017-01-01'
AND completed_on < '2017-01-01' + INTERVAL 1 YEAR
GROUP BY completed_by_id
Why? It can do a range scan on the index rather than a nonsargable function call on each row's value.
Notice the use of >= at the beginning of the date range and < at the end. We want to include all rows from the first moment of new years day 2017, up until but not including the first moment of new years day 2018. BETWEEN can't do this, because it uses <= rather than < at the end of its range.
If an index is in place, both BETWEEN and the syntax I have shown use a range scan, and perform about the same.
For best results speeding up this query use a compound index on (completed_on, completed_by_id).
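For example (a sketch; the index name is arbitrary and the table name is the placeholder from the question):
ALTER TABLE `table` ADD INDEX idx_completed (completed_on, completed_by_id);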
If you are storing completed_on as DATE or DATETIME you can use:
SELECT count(*) as cnt, LEFT(completed_on, 4) AS year
FROM table
GROUP BY year
HAVING year=2017
I have a table user_notifications that has 1,100,000 records. I have to run the query below, but it takes more than 3 minutes to complete. What can I do to improve the fetch time?
SELECT `user_notifications`.`user_id`
FROM `user_notifications`
WHERE `user_notifications`.`notification_template_id` = 175
AND (DATE(sent_at) >= DATE_SUB(CURDATE(), INTERVAL 4 day))
AND `user_notifications`.`user_id` IN (
1203, 1282, 1499, 2244, 2575, 2697, 2828, 2900, 3085, 3989,
5264, 5314, 5368, 5452, 5603, 6133, 6498..
)
The user ids in the IN block are sometimes up to 1k.
For optimisation I have indexed the user_id and notification_template_id columns in the user_notifications table.
Big IN() lists are inherently slow. Create a temporary table with an index and put the values from the IN() list into that temporary table instead; then you'll get the power of an indexed join instead of a giant IN() list.
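A sketch of that approach, keeping the predicates from the question (the temporary table's name is illustrative):
CREATE TEMPORARY TABLE tmp_user_ids (
  user_id INT NOT NULL,
  PRIMARY KEY (user_id)
);

INSERT INTO tmp_user_ids (user_id) VALUES (1203), (1282), (1499) /* , ... */;

SELECT un.user_id
FROM user_notifications AS un
JOIN tmp_user_ids AS t ON t.user_id = un.user_id
WHERE un.notification_template_id = 175
  AND DATE(un.sent_at) >= DATE_SUB(CURDATE(), INTERVAL 4 DAY);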
You seem to be querying for a small date range. How about having an index on the sent_at column? Do you know which index the current query is using?
(1) Don't hide columns in functions if you might need to use an index:
AND (DATE(sent_at) >= DATE_SUB(CURDATE(), INTERVAL 4 day))
-->
AND sent_at >= CURDATE() - INTERVAL 4 day
(2) Use a "composite" index for
WHERE `notification_template_id` = 175
AND sent_at >= ...
AND `user_id` IN (...)
The first column should be the one with '='. It is unclear what to put next, so I suggest adding both of these indexes:
INDEX(notification_template_id, user_id, sent_at)
INDEX(notification_template_id, sent_at)
The Optimizer will probably pick between them correctly.
Composite indexes are not the same as indexes on the individual columns.
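As concrete statements, those two suggestions might look like this (a sketch; the index names are made up):
ALTER TABLE user_notifications
  ADD INDEX idx_tpl_user_sent (notification_template_id, user_id, sent_at),
  ADD INDEX idx_tpl_sent (notification_template_id, sent_at);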
(3) Yes, you could try putting the IN list in a tmp table, but the cost of doing such might outweigh the benefit. I don't think of 1K values in IN() as being "too many".
(4) My cookbook on building indexes.
This is my table structure (about 1 million records):
I need to select a few indices at certain dates, but only Year and Month are relevant:
SELECT `index_name`,`results` FROM `mst_ind` WHERE
((`index_name`='MSCI EAFE Mid NR USD' AND MONTH(`date`) = 3 AND YEAR(`date`) = 2003) OR
(`index_name`='MSCI Morocco PR USD' AND MONTH(`date`) = 3 AND YEAR(`date`) = 2003))
AND `time_period`='M1'
It works fine, but the performance is horrible. I ran the query through the profiler, but it could not suggest any possible keys.
The primary key contains index_id, date and time_period.
How can I optimize/improve this query?
Thanks!
Update: the explain report:
You are probably invalidating the use of an index, because applying functions such as MONTH and YEAR transforms the fields that would otherwise be indexed.
You could:
write the WHERE clause differently such that it doesn't use the MONTH and YEAR functions, such as:
date >= '2003-03-01' and date < '2003-04-01'
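Applied to the whole query, it might look something like this (a sketch; the two OR branches collapse into an IN because both use the same month):
SELECT index_name, results
FROM mst_ind
WHERE index_name IN ('MSCI EAFE Mid NR USD', 'MSCI Morocco PR USD')
  AND `date` >= '2003-03-01'
  AND `date` <  '2003-04-01'
  AND time_period = 'M1';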
Edit: just realized you probably don't have any indexes on this table. Consider adding indexes to the index_name, date and time_period fields.
I have one simple but large table.
id_tick INTEGER eg: 1622911
price DOUBLE eg: 1.31723
timestamp DATETIME eg: '2010-04-28 09:34:23'
For 1 month of data, I have 2.3 million rows (150 MB).
My query aims at returning the latest price at a given time.
I first set up a SQLite table and used the query:
SELECT max(id_tick), price, timestamp
FROM EURUSD
WHERE timestamp <='2010-04-16 15:22:05'
It is running in 1.6s.
As I need to run this query several thousand times, 1.6s is far too long...
I then set up a MySQL table and modified the query (selecting other columns alongside MAX() behaves differently in SQLite than in MySQL):
SELECT id_tick, price, timestamp
FROM EURUSD
WHERE id_tick = (SELECT MAX(id_tick)
FROM EURUSD WHERE timestamp <='2010-04-16 15:22:05')
The execution time is far worse: 3.6s.
(I know I can avoid the sub query using ORDER BY and LIMIT 1 but it does not improve the execution time.)
I am only using one month of data for now, but I will have to use several years at some point.
My questions are then the following:
is there a way to improve my query?
given the large dataset, should I use another database engine?
any tips ?
Thanks !
1) Make sure you have an index on timestamp
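For example (a sketch; the index name is arbitrary):
CREATE INDEX idx_eurusd_timestamp ON EURUSD (`timestamp`);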
2) Assuming that id_tick is both the PRIMARY KEY and clustered index, and that it increments as a function of time (since you are doing a MAX), you can try this:
SELECT id_tick, price, timestamp
FROM EURUSD
WHERE id_tick = (SELECT id_tick
FROM EURUSD WHERE timestamp <='2010-04-16 15:22:05'
ORDER BY id_tick DESC
LIMIT 1)
This should perform similarly to janmoesen's answer, though, since there should be high page correlation between id_tick and timestamp in any event.
Do you have any indexed fields?
Indexing timestamp and/or id_tick could change a lot of things.
Also, why don't you use an interval for timestamp?
WHERE timestamp >= '2010-04-15 15:22:05' AND timestamp <= '2010-04-16 15:22:05'
That would ease the burden of the MAX function.
Are you doing analysis using ALL the ticks over large intervals? I'd try to aggregate the data into minute/hour/day etc. series first.
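For instance, one possible per-minute pre-aggregation might look like this (a sketch only; taking the last tick of each minute is just one of several reasonable choices):
-- Last tick id per minute; join back to EURUSD if the price is needed.
SELECT DATE_FORMAT(`timestamp`, '%Y-%m-%d %H:%i:00') AS minute_start,
       MAX(id_tick) AS last_id_tick
FROM EURUSD
GROUP BY minute_start;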
OK, I guess my index was corrupted somehow; rebuilding it greatly improved the performance.
The following now executes in 0.0012s (non-cached):
SELECT id_tick, price, timestamp
FROM EURUSD
WHERE timestamp <= '2010-05-11 05:30:10'
ORDER by id_tick desc
LIMIT 1
Thanks!