Query huge table fast based on timezone - mysql

I need to query a table that has 1,852,789,683 rows (179.3 GB) in the fastest way possible. My condition is that the result must cover a whole day (24 hrs) in Japan time.
Query:
SELECT COUNT(*) CNT
FROM info_table
WHERE DATE(CONVERT_TZ(created_at, '+00:00', '+09:00')) = 20141216;
I have left it running for almost an hour now but it's still not done. Any advice?
DESCRIBE:
id: 1, select_type: SIMPLE, table: info_table, type: ALL, possible_keys: NULL, key: NULL, key_len: NULL, ref: NULL, rows: 1793315059, Extra: Using where

Your query is going to evaluate that function on the created_at column for every flipping row in the table; that's a full scan.
To enable MySQL to do an efficient range scan operation on an index, you need to reference the bare column in the predicate, and you need an index with a leading column of created_at, and the query needs to be of the form:
WHERE created_at >= val1
AND created_at < val2
The trick will be developing val1 and val2, the expressions that return the upper and lower bounds for the timestamp.
if we know:
DATE(CONVERT_TZ(created_at, '+00:00', '+09:00')) = 20141216
then we know:
CONVERT_TZ(created_at, '+00:00', '+09:00') >= '2014-12-16'
AND CONVERT_TZ(created_at, '+00:00', '+09:00') < '2014-12-17'
and (maybe?)...
created_at >= CONVERT_TZ('2014-12-16','+09:00','+00:00')
AND created_at < CONVERT_TZ('2014-12-17','+09:00','+00:00')
I'm not sure about the behavior of the CONVERT_TZ function, i.e. whether the inversion is equivalent for all values in your case. Again, the "trick" will be getting the expressions that return the upper and lower bounds of your timestamp.
In our environment, we use GMT for all DATE, DATETIME and TIMESTAMP columns in the database, and GMT for the database connections; the application layer does the appropriate timezone conversions. When I need to do something like you're doing, I'd be inclined to write something like this:
created_at >= '2014-12-16' + INTERVAL -9 HOUR
AND created_at < '2014-12-16' + INTERVAL 24-9 HOUR
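Putting that together, a minimal sketch of the rewritten count (assuming created_at is stored as UTC and an index with leading column created_at exists):
-- rows for 2014-12-16 Japan time (UTC+9), expressed as a UTC range
-- so MySQL can do a range scan on an index on created_at
SELECT COUNT(*) CNT
FROM info_table
WHERE created_at >= '2014-12-16' + INTERVAL -9 HOUR
  AND created_at <  '2014-12-16' + INTERVAL 24-9 HOUR;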

You should write the statement so that it takes advantage of an index, and then create the index if you need to run this often. With a table this large it may take some time to create the index. To use an index you can rewrite the statement as:
select count(*) cnt
from info_table
where created_at >= '2014-12-16' and created_at < '2014-12-17'
Even without an index, the above may run a bit faster. (Note that to count a Japan-time day you would shift these bounds by the +09:00 offset, as shown in the previous answer.)
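If you do need to run this regularly, the index would look something like this (the index name is illustrative; on a table this size, building it may take a long time):
CREATE INDEX idx_created_at ON info_table (created_at);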

The issue is that you are converting each row's value before it gets checked. Apply the conversion to the constant side instead:
SELECT COUNT(*) CNT
FROM info_table
WHERE created_at = YourConvertedTimeZoneDateValue
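Note that matching a whole day needs a range rather than an equality. A sketch of that idea, converting the constants once instead of the column (assuming created_at is stored in UTC):
SELECT COUNT(*) CNT
FROM info_table
WHERE created_at >= CONVERT_TZ('2014-12-16 00:00:00', '+09:00', '+00:00')
  AND created_at <  CONVERT_TZ('2014-12-17 00:00:00', '+09:00', '+00:00');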

Related

How to generate faster mysql query with 1.6M rows

I have a table that has 1.6M rows. Whenever I use the query below, I get an average of 7.5 seconds.
select * from table
where pid = 170
and cdate between '2017-01-01 0:00:00' and '2017-12-31 23:59:59';
I tried adding LIMIT 1000 or 10000, and changing the date filter to a single month; it still averages 7.5 s. I tried adding a composite index on pid and cdate, but that made it about 1 second slower.
Here is the INDEX list
https://gist.github.com/primerg/3e2470fcd9b21a748af84746554309bc
Can I still make it faster? Is this an acceptable performance considering the amount of data?
Looks like the index is missing. Create this index and see if it helps:
CREATE INDEX cid_date_index ON table_name (pid, cdate);
Also modify your query as below:
select * from table
where pid = 170
and cdate between CAST('2017-01-01 0:00:00' AS DATETIME) and CAST('2017-12-31 23:59:59' AS DATETIME);
Please provide SHOW CREATE TABLE clicks.
How many rows are returned? If it is 100K rows, the effort to shovel that many rows is significant. And what will you do with that many rows? If you then summarize them, consider summarizing in SQL!
Do have cdate as DATETIME.
Do you use id for anything? Perhaps this would be better:
PRIMARY KEY (pid, cdate, id) -- to get benefit from clustering
INDEX(id) -- if still needed (and to keep AUTO_INCREMENT happy)
This smells like Data Warehousing. DW benefits significantly from building and maintaining Summary table(s), such as one that has the daily click count (etc), from which you could very rapidly sum up 365 counts to get the answer.
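A minimal sketch of such a summary table (all names here are illustrative, using the clicks table name mentioned above):
CREATE TABLE clicks_daily (
  pid INT NOT NULL,
  dy DATE NOT NULL,
  click_count INT UNSIGNED NOT NULL,
  PRIMARY KEY (pid, dy)
);
-- refreshed periodically, e.g. nightly for yesterday's rows:
INSERT INTO clicks_daily (pid, dy, click_count)
SELECT pid, DATE(cdate), COUNT(*)
FROM clicks
WHERE cdate >= CURDATE() - INTERVAL 1 DAY
  AND cdate <  CURDATE()
GROUP BY pid, DATE(cdate);
-- a year's total for one pid is then a sum over at most 365 small rows:
SELECT SUM(click_count) FROM clicks_daily
WHERE pid = 170
  AND dy >= '2017-01-01'
  AND dy <  '2017-01-01' + INTERVAL 1 YEAR;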
CAST is unnecessary. Furthermore 0:00:00 is optional -- it can be included or excluded for either DATE or DATETIME. I prefer
cdate >= '2017-01-01'
AND cdate < '2017-01-01' + INTERVAL 1 YEAR
to avoid leap year, midnight, date arithmetic, etc.
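Written out against the question's query (a sketch; the clicks table name comes from the comments above, and the (pid, cdate) index is assumed):
SELECT *
FROM clicks
WHERE pid = 170
  AND cdate >= '2017-01-01'
  AND cdate <  '2017-01-01' + INTERVAL 1 YEAR;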

Comparing performance of query for year in date

How would the following three queries compare in terms of performance? I'm trying to get all records with year=2017:
Using EXTRACT:
SELECT count(*), completed_by_id FROM table
WHERE EXTRACT(YEAR FROM completed_on)=2017
GROUP BY completed_by_id
# Took 11.8s
Using YEAR:
SELECT count(*), completed_by_id FROM table
WHERE YEAR(completed_on)=2017
GROUP BY completed_by_id
# Took 5.15s
Using LIKE '2017%':
SELECT count(*), completed_by_id FROM table
WHERE completed_on LIKE '2017%'
GROUP BY completed_by_id
# Took 6.61s
Note: In my own testing I found YEAR() to be the fastest, LIKE to be the second fastest, and EXTRACT() to be the slowest.
There are about 5M rows in the table and completed_on is a DATETIME field that has been indexed.
You haven't described your table or indexes so all advice about query performance is guesswork.
If your completed_on column is a DATETIME, DATE, or TIMESTAMP type and it is indexed, this query will radically outperform all the ones you have shown, and maintain its performance as your table grows.
SELECT count(*), completed_by_id
FROM table
WHERE completed_on >= '2017-01-01'
AND completed_on < '2017-01-01' + INTERVAL 1 YEAR
GROUP BY completed_by_id
Why? It can do a range scan on the index rather than a nonsargable function call on each row's value.
Notice the use of >= at the beginning of the date range and < at the end. We want to include all rows from the first moment of new years day 2017, up until but not including the first moment of new years day 2018. BETWEEN can't do this, because it uses <= rather than < at the end of its range.
If an index is in place, both BETWEEN and the syntax I have shown use a range scan, and perform about the same.
For best results speeding up this query use a compound index on (completed_on, completed_by_id).
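For example (the index name is illustrative; backticks because the question anonymizes the table as table, a reserved word):
CREATE INDEX idx_completed ON `table` (completed_on, completed_by_id);
This index also covers the query, so MySQL can answer it from the index alone without touching the table rows.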
If you are storing completed_on as DATE or DATETIME you can use:
SELECT count(*) as cnt, LEFT(completed_on, 4) AS year
FROM table
GROUP BY year
HAVING year=2017

mysql tuning timestamp performance?

Hey guys, I have a quick question regarding SQL performance. I have a really, really large table and it takes forever to run the query below; note that there is a timestamp column.
select name,emails,
count(*) as cnt
from table
where date(timestamp) between '2016-01-20' and '2016-02-3'
and name is not null
group by 1,2;
So my friend suggested to use this query below:
select name,emails,
count(*) as cnt
from table
where timestamp between date_sub(curdate(), interval 14 day)
and date_add(curdate(), interval 1 day) 
and name is not null
group by 1,2;
And this takes much less time to run. Why? What's the difference between those two time functions?
And is there another way to run this even faster, like an index? Can someone explain how MySQL runs this? Thanks a lot!
Just add an index on the timestamp field and use the query below:
select name,emails,
count(*) as cnt
from table
where `timestamp` between '2016-01-20 00:00:00' and '2016-02-03 23:59:59'
and name is not null
group by 1,2;
Why? What's the difference between those two time functions?
In the first query you wrap your own column in the DATE() function, so MySQL cannot use an index on it and has to do a full table scan. In the second, suggested query the DATE(timestamp) wrapper is gone, so MySQL can check values from the index instead of scanning the table; that is why it is faster.
MySQL will use the index in my query above for the same reason.
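A sketch of both steps, using the (anonymized) names from the question, backticked since table and timestamp are keywords:
ALTER TABLE `table` ADD INDEX idx_ts (`timestamp`);

SELECT name, emails, COUNT(*) AS cnt
FROM `table`
WHERE `timestamp` >= '2016-01-20'
  AND `timestamp` <  '2016-02-04'  -- half-open range also avoids the 23:59:59 edge
  AND name IS NOT NULL
GROUP BY 1, 2;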

MySQL fastest way to search by DATE if a record exist

I have found many ways to search for a MySQL record by DATE.
Method 1:
SELECT id FROM table WHERE datetime LIKE '2015-01-01%' LIMIT 1
Method 2 (same as method 1 + ORDER BY):
SELECT id FROM table WHERE datetime LIKE '2015-01-01%' ORDER BY datetime DESC LIMIT 1
Method 3:
SELECT id FROM table WHERE datetime BETWEEN '2015-01-01' AND '2015-01-01 23:59:59' LIMIT 1
Method 4:
SELECT id FROM table WHERE DATE_FORMAT( datetime, '%y.%m.%d' ) = DATE_FORMAT( '2015-01-01', '%y.%m.%d' )
Method 5 (I think it is the slowest):
SELECT id FROM table WHERE DATE(`datetime`) = '2015-01-01' LIMIT 1
What is the fastest?
In my case the table has 1 million rows, and the date to search is always recent.
The fastest of the methods you've mentioned is
SELECT id
FROM table
WHERE datetime BETWEEN '2015-01-01' AND '2015-01-01 23:59:59'
LIMIT 1
This is made fast when you create an index on the datetime column. The index can be random-accessed to find the first matching row, and then scanned until the last matching row. So it's not necessary to read the whole table, or even the whole index. And, when you use LIMIT 1, it just reads the single row. Very fast, even on an enormous table.
Your other means of search apply a function to each row:
datetime LIKE '2015-01-01%' casts datetime to a string for each row.
Methods 4 and 5 use explicit functions such as DATE_FORMAT() and DATE() on the contents of each row.
The use of these functions defeats the use of indexes to find your data.
Pro tip: Don't use BETWEEN for date arithmetic because it handles the ending condition poorly. Instead use
WHERE datetime >= '2015-01-01'
AND datetime < '2015-01-02'
This performs just as well as BETWEEN and gets you out of having to write the last moment of 2015-01-01 explicitly as 23:59:59. That isn't correct with higher precision timestamps anyway.
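For example, method 3 rewritten with a half-open range (assuming an index on the datetime column; backticks since both names are keywords):
SELECT id
FROM `table`
WHERE `datetime` >= '2015-01-01'
  AND `datetime` <  '2015-01-01' + INTERVAL 1 DAY
LIMIT 1;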
The fastest way, assuming there's an index on the datetime column, is a variant of method 3 in which both range values are datetime literals:
SELECT id FROM table
WHERE datetime BETWEEN '2015-01-01 00:00:00' AND '2015-01-01 23:59:59'
LIMIT 1
Using literals of the same type as the column means there won't be any casting of the column to perform the comparison, giving the best chance of using an index on the column. I have used this in production to great effect.

What is better: select date with trunc date or between

I need to create a query to select some data from my MySQL db based on date, but in my WHERE clause I have two options:
1 - truncate the date:
select count(*) from mailing_user where date_format(create_date, '%Y-%m-%d')='2013-11-05';
2 - use between
select count(*) from mailing_user where create_date between '2013-11-05 00:00:00' and '2013-11-05 23:59:59';
Both queries will work, but which is better? What's recommended, and why?
Here is an article to read.
http://willem.stuursma.name/2009/01/09/mysql-performance-with-date-functions/
If your create_date column is indexed, the second query will be faster.
But if the column is not indexed and if this is your defined date format, you can use the following query.
select count(*) from mailing_user where DATE(create_date) = '2013-11-05';
I use DATE instead of DATE_FORMAT because DATE natively returns values in this format ('2013-11-05').
From your question it seems you want to select records from one day. According to the documentation, a DATETIME or TIMESTAMP value can include a trailing fractional seconds part with up to microseconds (6 digits) precision.
This means your second query might actually get unlucky and miss some records that were inserted during the very last second of that day, which is why I would say the first one is more precise and is guaranteed to always give you the correct result.
The downside of this is that you cannot index the column wrapped in the date_format function, because MySQL isn't cool with that.
If you don't want to use date_format and still want to get around the precision issue, you would change
where create_date between '2013-11-05 00:00:00' and '2013-11-05 23:59:59'
into
where create_date >= '2013-11-05 00:00:00' and create_date < '2013-11-06 00:00:00'
Number 2 will be faster if you have an index on create_date, because number 1 can't use the index to quickly narrow down the results.
Without such an index, I imagine they would be similar in speed; possibly the second would still be faster because the comparison is cheaper (a datetime comparison rather than converting each value to a string and comparing strings), but I doubt the difference would be significant.
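One way to see the difference for yourself is to compare the plans with EXPLAIN (a sketch; the exact output depends on your schema, and an index on create_date is assumed):
EXPLAIN SELECT COUNT(*) FROM mailing_user
WHERE create_date >= '2013-11-05'
  AND create_date <  '2013-11-05' + INTERVAL 1 DAY;
-- expect type: range, with the create_date index as key

EXPLAIN SELECT COUNT(*) FROM mailing_user
WHERE date_format(create_date, '%Y-%m-%d') = '2013-11-05';
-- expect type: ALL, key: NULL -- a full scan, since the function hides the column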