Hey guys, I have a quick question about SQL performance. I have a really large table, and the query below takes forever to run. Note that the table has a timestamp column:
select name,emails,
count(*) as cnt
from table
where date(timestamp) between '2016-01-20' and '2016-02-3'
and name is not null
group by 1,2;
A friend suggested using this query instead:
select name,emails,
count(*) as cnt
from table
where timestamp between date_sub(curdate(), interval 14 day)
and date_add(curdate(), interval 1 day)
and name is not null
group by 1,2;
And this takes much less time to run. Why? What's the difference between those two time functions?
Is there another way to make this even faster, such as an index? Can someone explain how MySQL runs these queries? Thanks a lot!
Just add an index on the timestamp column and use the query below:
select name,emails,
count(*) as cnt
from table
where `timestamp` between '2016-01-20 00:00:00' and '2016-02-03 23:59:59'
and name is not null
group by 1,2;
Why? What's the difference between those two time functions?
Your first query wraps your own column in the DATE() function. Because of that, MySQL cannot use an index on the column and has to scan the whole table. The second, suggested query drops DATE(timestamp) and compares the bare column, so MySQL can look the values up in an index on that column (if one exists) instead of scanning the table, which is why it is fast.
MySQL will use the index for my query in the same way.
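A minimal sketch of that fix, assuming the table really is named `table` (a reserved word, hence the backticks) and that no index on the column exists yet. The index name idx_timestamp is made up for illustration. EXPLAIN shows whether MySQL picks the index:

```sql
-- Hypothetical index name; building it on a very large table can take a while.
CREATE INDEX idx_timestamp ON `table` (`timestamp`);

-- Bare column in the predicate: MySQL is able to do a range scan on the index.
EXPLAIN
SELECT name, emails, COUNT(*) AS cnt
FROM `table`
WHERE `timestamp` >= '2016-01-20'
  AND `timestamp` <  '2016-02-04'
  AND name IS NOT NULL
GROUP BY name, emails;

-- Column wrapped in DATE(): the index on `timestamp` cannot be used for the range.
EXPLAIN
SELECT name, emails, COUNT(*) AS cnt
FROM `table`
WHERE DATE(`timestamp`) BETWEEN '2016-01-20' AND '2016-02-03'
  AND name IS NOT NULL
GROUP BY name, emails;
```

In the EXPLAIN output, the first query would typically show type = range with key = idx_timestamp, while the second shows type = ALL (a full table scan). The optimizer may still prefer a full scan when the date range covers a large share of the table.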
I have a table that has 1.6M rows. Whenever I use the query below, I get an average of 7.5 seconds.
select * from table
where pid = 170
and cdate between '2017-01-01 0:00:00' and '2017-12-31 23:59:59';
I tried adding LIMIT 1000 or 10000, and narrowing the date filter to a single month, but the query still takes about 7.5s on average. I also tried adding a composite index on pid and cdate, but that made it 1 second slower.
Here is the INDEX list
https://gist.github.com/primerg/3e2470fcd9b21a748af84746554309bc
Can I still make it faster? Is this an acceptable performance considering the amount of data?
It looks like the index is missing. Create this index and see if it helps.
CREATE INDEX cid_date_index ON table_name (pid, cdate);
Also modify your query as follows:
select * from table
where pid = 170
and cdate between CAST('2017-01-01 0:00:00' AS DATETIME) and CAST('2017-12-31 23:59:59' AS DATETIME);
Please provide SHOW CREATE TABLE clicks.
How many rows are returned? If it is 100K rows, the effort to shovel that many rows is significant. And what will you do with that many rows? If you then summarize them, consider summarizing in SQL!
Do make sure cdate is declared as DATETIME.
Do you use id for anything? Perhaps this would be better:
PRIMARY KEY (pid, cdate, id) -- to get benefit from clustering
INDEX(id) -- if still needed (and to keep AUTO_INCREMENT happy)
This smells like Data Warehousing. DW benefits significantly from building and maintaining Summary table(s), such as one that has the daily click count (etc), from which you could very rapidly sum up 365 counts to get the answer.
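A rough sketch of such a summary table, assuming the source table is called clicks (as in the SHOW CREATE TABLE request above) with the pid and cdate columns from the question. The table name clicks_daily and its column names are hypothetical:

```sql
-- Hypothetical summary table: one row per pid per day.
CREATE TABLE clicks_daily (
  pid         INT          NOT NULL,
  click_date  DATE         NOT NULL,
  click_count INT UNSIGNED NOT NULL,
  PRIMARY KEY (pid, click_date)
);

-- Load one finished day (run once per day, e.g. shortly after midnight).
-- Assumes that day has not been loaded before.
INSERT INTO clicks_daily (pid, click_date, click_count)
SELECT pid, DATE(cdate), COUNT(*)
FROM clicks
WHERE cdate >= '2017-01-01'
  AND cdate <  '2017-01-01' + INTERVAL 1 DAY
GROUP BY pid, DATE(cdate);

-- Yearly total for one pid: reads at most 366 small rows instead of the raw data.
SELECT SUM(click_count) AS clicks_2017
FROM clicks_daily
WHERE pid = 170
  AND click_date >= '2017-01-01'
  AND click_date <  '2017-01-01' + INTERVAL 1 YEAR;
```

The daily load still touches the raw table, but only one day's worth of rows at a time, and the reporting queries never touch it at all.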
CAST is unnecessary. Furthermore 0:00:00 is optional -- it can be included or excluded for either DATE or DATETIME. I prefer
cdate >= '2017-01-01'
AND cdate < '2017-01-01' + INTERVAL 1 YEAR
to avoid problems with leap years, midnight, date arithmetic, etc.
How would the following three queries compare in terms of performance? I'm trying to get all records with year=2017:
Using EXTRACT:
SELECT count(*), completed_by_id FROM table
WHERE EXTRACT(YEAR FROM completed_on)=2017
GROUP BY completed_by_id
# Took 11.8s
Using YEAR:
SELECT count(*), completed_by_id FROM table
WHERE YEAR(completed_on)=2017
GROUP BY completed_by_id
# Took 5.15s
Using LIKE 'YEAR%'
SELECT count(*), completed_by_id FROM table
WHERE completed_on LIKE '2017%'
GROUP BY completed_by_id
# Took 6.61s
Note: In my own testing I found YEAR() to be the fastest, LIKE to be the second fastest, and EXTRACT() to be the slowest.
There are about 5M rows in the table and completed_on is DATETIME field that has been indexed.
You haven't described your table or indexes so all advice about query performance is guesswork.
If your completed_on column is a DATETIME, DATE, or TIMESTAMP type and it is indexed, this query will radically outperform all the ones you have shown, and maintain its performance as your table grows.
SELECT count(*), completed_by_id
FROM table
WHERE completed_on >= '2017-01-01'
AND completed_on < '2017-01-01' + INTERVAL 1 YEAR
GROUP BY completed_by_id
Why? It can do a range scan on the index rather than a nonsargable function call on each row's value.
Notice the use of >= at the beginning of the date range and < at the end. We want to include all rows from the first moment of New Year's Day 2017 up to, but not including, the first moment of New Year's Day 2018. BETWEEN can't do this, because it uses <= rather than < at the end of its range.
If an index is in place, both BETWEEN and the syntax I have shown use a range scan, and perform about the same.
For best results speeding up this query use a compound index on (completed_on, completed_by_id).
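As a sketch, the compound index could be created and checked like this. The index name is invented for illustration, and `table` is backticked because it is a reserved word:

```sql
-- Hypothetical index name. completed_on comes first so the range predicate can use it;
-- completed_by_id is included so the query can be answered from the index alone.
CREATE INDEX idx_completed_on_by ON `table` (completed_on, completed_by_id);

EXPLAIN
SELECT COUNT(*), completed_by_id
FROM `table`
WHERE completed_on >= '2017-01-01'
  AND completed_on <  '2017-01-01' + INTERVAL 1 YEAR
GROUP BY completed_by_id;
```

If the index is used as intended, EXPLAIN would typically report type = range and "Using index" in the Extra column, meaning the rows themselves are never read.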
If you are storing completed_on as DATE or DATETIME you can use:
SELECT count(*) as cnt, LEFT(completed_on, 4) AS year
FROM table
GROUP BY year
HAVING year=2017
I need to query a table that has 1,852,789,683 rows and is 179.3GB in size, in the fastest way possible. The condition is that the query must cover one whole day (24 hours) in Japan time.
Query:
SELECT COUNT(*) CNT
FROM info_table
WHERE DATE(CONVERT_TZ(created_at, '+00:00', '+09:00')) = 20141216;
I have left it running for almost an hour now but it's still not done. Any advice?
DESCRIBE:
id  select_type  table       type  possible_keys  key   key_len  ref   rows        Extra
1   SIMPLE       info_table  ALL   NULL           NULL  NULL     NULL  1793315059  Using where
Your query is going to evaluate that function on the created_at column for every flipping row in the table; that's a full scan.
To enable MySQL to do an efficient range scan operation on an index, you need to reference the bare column in the predicate, and you need an index with a leading column of created_at, and the query needs to be of the form:
WHERE created_at >= val1
AND created_at < val2
The trick will be developing val1 and val2, the expressions that return the upper and lower bounds for the timestamp.
if we know:
DATE(CONVERT_TZ(created_at, '+00:00', '+09:00')) = 20141216
then we know:
CONVERT_TZ(created_at, '+00:00', '+09:00') >= '2014-12-16'
AND CONVERT_TZ(created_at, '+00:00', '+09:00') < '2014-12-17'
and (maybe?)...
created_at >= CONVERT_TZ('2014-12-16','+09:00','+00:00')
AND created_at < CONVERT_TZ('2014-12-17','+09:00','+00:00')
I'm not sure about the behavior of the CONVERT_TZ function, or whether the inversion is equivalent for all values in your case. Again, the "trick" will be getting the expressions that return the upper and lower bounds of your timestamp.
In our environment, we use GMT for all date, datetime and timestamp in the database; we use GMT for the database connections. The application layer does the appropriate timezone conversions. When I have a need to do something like you're doing, I'd be inclined to write something like this:
created_at >= '2014-12-16' + INTERVAL -9 HOUR
AND created_at < '2014-12-16' + INTERVAL 24-9 HOUR
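Putting the pieces together, here is a sketch that assumes created_at is stored in UTC, that the session time zone is also UTC, and that an index on created_at exists. The index name idx_created_at is made up. Japan uses a fixed +09:00 offset with no daylight saving time, so a constant offset is safe here:

```sql
-- Hypothetical index; on a table this size, building it will take a long time.
ALTER TABLE info_table ADD INDEX idx_created_at (created_at);

-- Sanity check: midnight in Japan (+09:00) expressed in UTC (+00:00).
SELECT CONVERT_TZ('2014-12-16 00:00:00', '+09:00', '+00:00') AS lower_bound,  -- 2014-12-15 15:00:00
       CONVERT_TZ('2014-12-17 00:00:00', '+09:00', '+00:00') AS upper_bound;  -- 2014-12-16 15:00:00

-- The bare column is compared against constant bounds, so a range scan is possible.
SELECT COUNT(*) AS cnt
FROM info_table
WHERE created_at >= '2014-12-16' + INTERVAL -9 HOUR
  AND created_at <  '2014-12-16' + INTERVAL 15 HOUR;
```

Numeric offsets such as '+09:00' work in CONVERT_TZ without the time zone tables being loaded; named zones such as 'Asia/Tokyo' need those tables.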
You should write the statement so that it takes advantage of an index, and then create the index if you need to run this often. With a table this large, it may take some time to create the index. To use an index, you can rewrite the statement as:
select count(*) cnt
from info_table
where created_at >= '2014-12-16' and created_at< '2014-12-17'
Even without an index the above may run a bit faster.
The issue is that you are converting every row's value before it gets checked. Apply the conversion to the other side of the comparison instead:
SELECT COUNT(*) CNT
FROM info_table
WHERE created_at = YourConvertedTimeZoneDateValue
I need to create a query that selects some data from my MySQL db based on date, but in my WHERE clause I have two options:
1 - trunc the date:
select count(*) from mailing_user where date_format(create_date, '%Y-%m-%d')='2013-11-05';
2 - use between
select count(*) from mailing_user where create_date between '2013-11-05 00:00:00' and '2013-11-05 23:59:59';
Both queries work, but which one is better? Or, which is recommended? Why?
Here is an article to read.
http://willem.stuursma.name/2009/01/09/mysql-performance-with-date-functions/
If your create_date column is indexed, the 2nd query will be faster.
But if the column is not indexed and if this is your defined date format, you can use the following query.
select count(*) from mailing_user where DATE(create_date) = '2013-11-05';
I use DATE instead of DATE_FORMAT because DATE already returns the value in this format ('2013-11-05').
From your question it seems you want to select the records from one day. According to the documentation, a DATETIME or TIMESTAMP value can include a trailing fractional seconds part with up to microsecond (6 digits) precision.
This means your second query might get unlucky and miss records that were inserted during the very last second of that day (after 23:59:59.000000), which is why I would say the first one is more precise and is guaranteed to always get you the correct result.
The downside of this is that you cannot index that column using the date_format-function, because MySQL isn't cool with that.
If you don't want to use date_format but still want to get around the precision issue, you would change
where create_date between '2013-11-05 00:00:00' and '2013-11-05 23:59:59'
into
where create_date >= '2013-11-05 00:00:00' and create_date < '2013-11-06 00:00:00'
Number 2 will be faster if you have an index on the create_date because number one won't be able to use the index to quickly scan the results.
However this requires there to be an index on the create_date.
Otherwise I imagine they would run at a similar speed. The second might still be faster because the comparison is cheaper (comparing datetimes rather than converting each value to a string and comparing strings), but I doubt the difference would be significant.
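For reference, a sketch of the indexed, half-open version of the one-day count. The index name idx_create_date is hypothetical:

```sql
-- Hypothetical index name.
CREATE INDEX idx_create_date ON mailing_user (create_date);

-- Half-open range: includes every moment of 2013-11-05, fractional seconds
-- included, and nothing from 2013-11-06.
SELECT COUNT(*)
FROM mailing_user
WHERE create_date >= '2013-11-05'
  AND create_date <  '2013-11-05' + INTERVAL 1 DAY;
```

Only the start date appears in the query, so changing the day means editing a single literal.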
I have a table that is getting hundreds of requests per minute. The issue that I'm having is that I need a way to select only the rows that have been inserted in the past 5 minutes. I am trying this:
SELECT count(id) as count, field1, field2
FROM table
WHERE timestamp > DATE_SUB(NOW(), INTERVAL 5 MINUTE)
ORDER BY timestamp DESC
My issue is that it returns 70k+ results and counting. I am not sure what it is that I am doing wrong, but I would love to get some help on this. In addition, if there were a way to group them by minute to have it look like:
| count | field1 | field2 |
----------------------------
I'd love the help and direction on this, so please let me know your thoughts.
You don't really need DATE_ADD/DATE_SUB; plain date arithmetic is much simpler:
SELECT COUNT(id), DATE_FORMAT(`timestamp`, '%Y-%m-%d %H:%i')
FROM `table`
WHERE `timestamp` >= CURRENT_TIMESTAMP - INTERVAL 5 MINUTE
GROUP BY 2
ORDER BY 2
The following seems like it would work which is mighty close to what you had:
SELECT
MINUTE(date_field) as `minute`,
count(id) as count
FROM table
WHERE date_field > date_sub(now(), interval 5 minute)
GROUP BY MINUTE(date_field)
ORDER BY MINUTE(date_field);
Note the added column to show the minute, and the GROUP BY clause that gathers the results into the corresponding minute. Imagine that you had 5 little buckets labeled with the last 5 minutes. Now imagine you tossed each row into the bucket for the minute it was inserted in. count() then counts the number of entries found in each bucket. That's a quick visualization of how GROUP BY works. http://www.tizag.com/mysqlTutorial/mysqlgroupby.php seems to be a decent writeup on GROUP BY if you need more info.
If you run that and the number of entries in each minute seems too high, you'll want to do some troubleshooting. Try replacing COUNT(id) with MAX(date_field) and MIN(date_field) so you can get an idea what kind of dates it is capturing. If MIN() and MAX() are inside the range, you may have more data written to your database than you realize.
You might also double check that you don't have dates in the future as they would all be > now(). The MIN()/MAX() checks mentioned above should identify that too if it's a problem.
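The troubleshooting checks above could be combined into one query. This is only a sketch; it reuses the placeholder names date_field and `table` from the query above, and the column aliases are made up:

```sql
SELECT MIN(date_field)         AS earliest,       -- should be no older than 5 minutes
       MAX(date_field)         AS latest,         -- should not be later than NOW()
       COUNT(*)                AS rows_matched,
       SUM(date_field > NOW()) AS rows_in_future  -- anything above 0 points to bad dates
FROM `table`
WHERE date_field > NOW() - INTERVAL 5 MINUTE;
```

If earliest and latest both fall inside the last five minutes and rows_in_future is 0, the high count reflects real insert volume rather than a problem with the filter.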