Optimizing MySQL Aggregation Query

I've got a very large table (~100 million records) in MySQL that contains information about files. One of the pieces of information is the modified date of each file.
I need to write a query that will count the number of files that fit into specified date ranges. To do that I made a small table that specifies these ranges (all in days) and looks like this:
DateRanges
range_id | range_name | range_start | range_end
1        | 0-90       | 0           | 90
2        | 91-180     | 91          | 180
3        | 181-365    | 181         | 365
4        | 366-1095   | 366         | 1095
5        | 1096+      | 1096        | 999999999
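(For reference, a minimal sketch of the DDL for such a table; the column types are assumed, not from the original post:)
CREATE TABLE DateRanges (
  range_id    INT PRIMARY KEY,
  range_name  VARCHAR(20),
  range_start INT,
  range_end   INT
);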
And wrote a query that looks like this:
SELECT r.range_name,
       SUM(IF(DATEDIFF(CURDATE(), t.file_last_access) > r.range_start
          AND DATEDIFF(CURDATE(), t.file_last_access) < r.range_end, 1, 0)) AS FileCount
FROM `DateRanges` r, `HugeFileTable` t
GROUP BY r.range_name
However, quite predictably, this query takes forever to run. I think that is because I am asking MySQL to go through the HugeFileTable 5 times, each time performing the DATEDIFF() calculation on each file.
What I want to do instead is to go through the HugeFileTable record by record only once, and for each file increment the count in the appropriate range_name running total. I can't figure out how to do that.
Can anyone help out with this?
Thanks.
EDIT: MySQL Version: 5.0.45, Tables are MyISAM
EDIT2: Here's the describe that was asked for in the comments
id | select_type | table | type | possible_keys | key  | key_len | ref  | rows     | Extra
1  | SIMPLE      | r     | ALL  | NULL          | NULL | NULL    | NULL | 5        | Using temporary; Using filesort
1  | SIMPLE      | t     | ALL  | NULL          | NULL | NULL    | NULL | 96506321 |

First, create an index on HugeFileTable.file_last_access.
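For example (the index name here is an assumption):
CREATE INDEX idx_file_last_access ON HugeFileTable (file_last_access);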
Then try the following query:
SELECT r.range_name, COUNT(t.file_last_access) as FileCount
FROM `DateRanges` r
JOIN `HugeFileTable` t
ON (t.file_last_access BETWEEN
    CURDATE() - INTERVAL r.range_end DAY AND
    CURDATE() - INTERVAL r.range_start DAY)
GROUP BY r.range_name;
Here's the EXPLAIN plan that I got when I tried this query on MySQL 5.0.75 (edited down for brevity):
+-------+-------+------------------+----------------------------------------------+
| table | type | key | Extra |
+-------+-------+------------------+----------------------------------------------+
| t | index | file_last_access | Using index; Using temporary; Using filesort |
| r | ALL | NULL | Using where |
+-------+-------+------------------+----------------------------------------------+
It's still not going to perform very well. By using GROUP BY, the query incurs a temporary table, which may be expensive. Not much you can do about that.
But at least this query eliminates the Cartesian product that you had in your original query.
Update: Here's another query that uses a correlated subquery, but I have eliminated the GROUP BY.
SELECT r.range_name,
  (SELECT COUNT(*)
   FROM `HugeFileTable` t
   WHERE t.file_last_access BETWEEN
         CURDATE() - INTERVAL r.range_end DAY AND
         CURDATE() - INTERVAL r.range_start DAY
  ) AS FileCount
FROM `DateRanges` r;
The EXPLAIN plan shows no temporary table or filesort (at least with the trivial amount of rows I have in my test tables):
+----+--------------------+-------+-------+------------------+--------------------------+
| id | select_type | table | type | key | Extra |
+----+--------------------+-------+-------+------------------+--------------------------+
| 1 | PRIMARY | r | ALL | NULL | |
| 2 | DEPENDENT SUBQUERY | t | index | file_last_access | Using where; Using index |
+----+--------------------+-------+-------+------------------+--------------------------+
Try this query on your data set and see if it performs better.

Well, start by making sure that file_last_access is indexed on the table HugeFileTable.
I'm not sure if this is possible or better, but try computing the date limits first (files from date A to date B), then use a query with >= and <=. Theoretically at least, it should improve the performance.
The comparison would be something like:
t.file_last_access >= StartDate AND t.file_last_access <= EndDate
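A minimal sketch of that idea for a single range (here the 91-180 day bucket; the variable names are placeholders):
SET @EndDate   := CURDATE() - INTERVAL 91 DAY;
SET @StartDate := CURDATE() - INTERVAL 180 DAY;

SELECT COUNT(*) AS FileCount
FROM HugeFileTable t
WHERE t.file_last_access >= @StartDate
  AND t.file_last_access <= @EndDate;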

You could get a small improvement by removing CURDATE() and embedding a literal date in the query, since otherwise the function is run twice for every row in your SQL.
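For instance, a sketch of the generated SQL with the date baked in (the literal date is of course hypothetical):
SELECT r.range_name, COUNT(t.file_last_access) AS FileCount
FROM `DateRanges` r
JOIN `HugeFileTable` t
  ON (t.file_last_access BETWEEN
      '2009-03-26' - INTERVAL r.range_end DAY AND
      '2009-03-26' - INTERVAL r.range_start DAY)
GROUP BY r.range_name;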


Weird query in mysql optimizer

I'm working with MySQL 5.5.52 on a Debian 8 machine, and sometimes we have a slow query (>3s) that usually takes 0.1s. I've started with the EXPLAIN command to find out what is happening.
This is the query and the explain info
explain
SELECT
`box`.`message_id` ID
, `messages`.`tipo`
, `messages`.`text`
, TIME_TO_SEC(TIMEDIFF(NOW(), `messages`.`date`)) `date`
FROM (`box`)
INNER JOIN `messages` ON `messages`.`id` = `box`.`message_id`
WHERE `box`.`user_id` = '1010231' AND `box`.`deleted` = 0
AND `messages`.`deleted` = 0
AND `messages`.`date` + INTERVAL 10 MINUTE > NOW()
ORDER BY `messages`.`id` ASC LIMIT 100;
id | select_type | table    | type   | possible_keys      | key     | key_len | ref            | rows | Extra
1  | SIMPLE      | box      | ref    | user_id,message_id | user_id | 4       | const          | 2200 | Using where; Using temporary; Using filesort
1  | SIMPLE      | messages | eq_ref | PRIMARY            | PRIMARY | 4       | box.message_id | 1    | Using where
I know that a temporary table and a filesort are bad things, and I suppose the problem is that the ORDER BY key doesn't belong to the first table in the query (box). Changing it to box.message_id, the EXPLAIN info is
id | select_type | table    | type   | possible_keys      | key        | key_len | ref            | rows | Extra
1  | SIMPLE      | box      | index  | user_id,message_id | message_id | 4       | NULL           | 443  | Using where
1  | SIMPLE      | messages | eq_ref | PRIMARY            | PRIMARY    | 4       | box.message_id | 1    | Using where
It looks better, but I don't understand why it's using the message_id index, and worse, now the query takes 1.5s instead of the initial 0.1s.
Edit:
Forcing the query to use the user_id index, I get the same result (0.1s) as the initial query, but without the temporary table:
explain
SELECT
`box`.`message_id` ID
, `messages`.`tipo`
, `messages`.`text`
, TIME_TO_SEC(TIMEDIFF(NOW(), `messages`.`date`)) `date`
FROM (`box` use index(user_id) )
INNER JOIN `messages` ON `messages`.`id` = `box`.`message_id`
WHERE `box`.`user_id` = '1010231' AND `box`.`deleted` = 0
AND `messages`.`deleted` = 0
AND `messages`.`date` + INTERVAL 10 MINUTE > NOW()
ORDER BY `box`.`message_id` ASC LIMIT 100;
id | select_type | table    | type   | possible_keys      | key     | key_len | ref            | rows | Extra
1  | SIMPLE      | box      | ref    | user_id,message_id | user_id | 4       | const          | 2200 | Using where; Using filesort
1  | SIMPLE      | messages | eq_ref | PRIMARY            | PRIMARY | 4       | box.message_id | 1    | Using where
I think that skipping the temporary table is a better solution than the initial query; the next step is to check a combined index, as ysth recommends.
It is not a good idea to do calculations on field values in a comparison; that forces a FULL TABLE SCAN, because MySQL must evaluate the expression for every row before it can check the condition. It's better to do the arithmetic on the constant side of the condition. Then MySQL can use an index (if there is one on the field).
change from
AND messages.date + INTERVAL 10 MINUTE > NOW()
to
AND messages.date > NOW() - INTERVAL 10 MINUTE
Temporary and file sort are not bad here; they are needed because using the best index (user_id) doesn't naturally produce records sorted in the order you ask for.
It's possible you might do better having a combined user_id,message_id index, but that might also end up worse. Depends on your exact data.
It isn't clear to me if you are seeing longer queries for certain user id's or the same user id sometimes taking much longer.
Update: it seems likely that having a combined index and changing order by to box.user_id,box.message_id will solve your problem, at least for users that don't have a large number of deleted messages.
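A sketch of that combined-index suggestion (the index name is made up; the query also folds in the sargable date rewrite from above):
ALTER TABLE box ADD INDEX idx_user_message (user_id, message_id);

SELECT box.message_id ID, messages.tipo, messages.text,
       TIME_TO_SEC(TIMEDIFF(NOW(), messages.date)) `date`
FROM box
INNER JOIN messages ON messages.id = box.message_id
WHERE box.user_id = '1010231' AND box.deleted = 0
  AND messages.deleted = 0
  AND messages.date > NOW() - INTERVAL 10 MINUTE
ORDER BY box.user_id, box.message_id ASC LIMIT 100;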

MySQL query with JOIN not using INDEX

I have the following two tables in MySQL (Simplified).
clicks (InnoDB)
Contains around about 70,000,000 records
Has an index on the date_added column
Has a column link_id which refers to a record in the links table
links (MyISAM)
Contains far fewer records, around about 65,000
I'm trying to run some analytical queries using these tables. I need to pull out some data, about clicks that occurred inside of two specified dates while applying some other user selected filters using other tables and joining them into the links table.
My question revolves around the use of indexes however. When I run the following query:
SELECT
COUNT(1)
FROM
clicks
WHERE
date_added >= '2016-11-01 00:00:00'
AND date_added <= '2016-11-03 23:59:59';
I get a response back in 1.40 sec. Using EXPLAIN, I find that MySQL uses the index on the date_added column as expected.
EXPLAIN SELECT COUNT(1) FROM clicks WHERE date_added >= '2016-11-01 00:00:00' AND date_added <= '2016-11-16 23:59:59';
+----+-------------+--------+-------+---------------+------------+---------+------+---------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+------------+---------+------+---------+--------------------------+
| 1 | SIMPLE | clicks | range | date_added | date_added | 4 | NULL | 1559288 | Using where; Using index |
+----+-------------+--------+-------+---------------+------------+---------+------+---------+--------------------------+
However, when I LEFT JOIN in my links table I find that the query takes much longer to execute:
SELECT
COUNT(1) AS clicks
FROM
clicks AS c
LEFT JOIN links AS l ON l.id = c.link_id
WHERE
c.date_added >= '2016-11-01 00:00:00'
AND c.date_added <= '2016-11-16 23:59:59';
Which completed in 6.50 sec. Using EXPLAIN I find that the index was not used on the date_added column:
EXPLAIN SELECT COUNT(1) AS clicks FROM clicks AS c LEFT JOIN links AS l ON l.id = c.link_id WHERE c.date_added >= '2016-11-01 00:00:00' AND c.date_added <= '2016-11-16 23:59:59';
+----+-------------+-------+--------+---------------+------------+---------+---------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+------------+---------+---------------+---------+-------------+
| 1 | SIMPLE | c | range | date_added | date_added | 4 | NULL | 6613278 | Using where |
| 1 | SIMPLE | l | eq_ref | PRIMARY | PRIMARY | 4 | c.link_id | 1 | Using index |
+----+-------------+-------+--------+---------------+------------+---------+---------------+---------+-------------+
As you can see, the index isn't being used on the date_added column of the larger table, and the query takes far longer. This seems to get even worse when I join in other tables.
Does anyone know why this is happening or if there's anything I can do to get it to use the index on the date_added column in the clicks table?
Edit
I've just attempted to get my stats out of the database using a different method. The first step in my method involves pulling out a distinct set of link_ids from the clicks table. I've found that I'm seeing the same problem here again, without a JOIN. The index is not being used:
My query:
SELECT
DISTINCT(link_id) AS link_id
FROM
clicks
WHERE
date_added >= '2016-11-01 00:00:00'
AND date_added <= '2016-12-05 10:16:00'
This query took almost a minute to complete. I ran an EXPLAIN on this and found that the query is not using the index as I expected it would:
+----+-------------+---------+-------+---------------+----------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------+---------------+----------+---------+------+----------+-------------+
| 1 | SIMPLE | clicks | index | date_added | link_id | 4 | NULL | 79786609 | Using where |
+----+-------------+---------+-------+---------------+----------+---------+------+----------+-------------+
I expected that it would use the index on date_added to filter down the result set and then pull out the distinct link_id values. Any idea why this is happening? I have an index on link_id as well as date_added.
Do you want to use an ordinary JOIN in place of the LEFT JOIN? LEFT JOIN preserves all the rows on the left, so it will yield the same value of COUNT() as the unjoined table. If you want to count only the rows from your left-hand table that have matching rows in the right-hand table, use JOIN, not LEFT JOIN.
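In that case the query would become (inner join, same date range as before):
SELECT COUNT(*) AS clicks
FROM clicks AS c
JOIN links AS l ON l.id = c.link_id
WHERE c.date_added >= '2016-11-01 00:00:00'
  AND c.date_added <= '2016-11-16 23:59:59';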
Try dropping your index on date_added and replacing it with a compound index on (date_added, link_id). This sort of index is called a covering index. When the query planner knows it can get everything it needs from an index, it doesn't have to bounce back to the table. In this case the query planner can random-access the index to the beginning of your date range, then do an index range scan to the end of the range. It's still going to have to refer to the other table, though.
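A sketch of that change, assuming the existing index is actually named date_added as the EXPLAIN output suggests (the new index name is made up):
ALTER TABLE clicks
  DROP INDEX date_added,
  ADD INDEX idx_date_link (date_added, link_id);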
(Edit) For the sake of experimentation, try a narrower date range. See if EXPLAIN changes. If it does, the query planner might be guessing your date_added column's cardinality wrong.
You might try an index hint. For example, try
SELECT COUNT(1) AS clicks
FROM clicks AS c USE INDEX (date_added)
LEFT JOIN links AS l ON l.id = c.link_id
WHERE etc
But, judging from your EXPLAIN output, you're already doing a range scan on date_added. Your next step, like it or not, is the compound covering index.
Make sure there's an index on links(id). There probably is, because it's probably the PK.
Try using COUNT(*) instead of COUNT(1). It probably won't make a difference, but it's worth a try. COUNT(*) simply counts rows rather than evaluating something for each row it counts.
(Nitpick) Your date range smells funny. Use < for the end of your range for best results, like so.
WHERE c.date_added >= '2016-11-01'
AND c.date_added < '2016-11-17';
Edit: Look, the MySQL query planner uses lots of internal knowledge about how tables are structured. And, it can only use one index per table to satisfy a query as of late 2016. That's a limitation.
SELECT DISTINCT column is actually a fairly complex query, because it has to de-dupe the column in question. If there's an index on that column, the query planner is likely to use it. Choosing that index means it could not choose some other index.
Compound indexes (covering indexes) sometimes but not always resolve this kind of index-selection dilemma, and allow index dual usage. You can read about all this at http://use-the-index-luke.com/
But if your operational constraints prevent the adding of compound indexes, you'll need to live with the one-second query. It isn't that bad.
Of course, saying you can't add compound indexes to get your job done is like this:
A: stuff is falling off my truck on the freeway.
B: put a tarp over the stuff and tie it down.
A: my boss won't let me put a tarp on the truck.
B: well, then, drive slow.
Not absolutely sure, but consider moving the condition from the WHERE clause to the JOIN ON clause. Since you are performing an outer join (LEFT JOIN), it makes a difference to performance, unlike an inner join, where the condition is equivalent whether it's in the WHERE or the JOIN ON clause.
SELECT COUNT(1) AS clicks
FROM clicks AS c
LEFT JOIN links AS l ON l.id = c.link_id
AND (c.date_added >= '2016-11-01 00:00:00'
AND c.date_added <= '2016-11-16 23:59:59');

Search distinct date parts fast in mysql

I've got a database of ~10 million entries, each of which contains a date stored as DATE.
I've indexed that column using a non-unique BTREE.
I'm running a query that counts the number of entries for each distinct year:
SELECT DISTINCT(YEAR(awesome_date)) as year, COUNT(id) as count
FROM all_entries
WHERE awesome_date IS NOT NULL
GROUP BY YEAR(awesome_date)
ORDER BY year DESC;
The query takes about 90 seconds to run at the moment, and the EXPLAIN output demonstrates why:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
----------------------------------------------------------------------------------------------------------------------------------------
1 | SIMPLE | all_entries | ALL | awesome_date | | | | 9759848 | Using where; Using temporary; Using filesort
If I FORCE KEY(awesome_date) that drops the rows count down to ~8 million and the key_len = 4, but is still Using where; Using temporary; Using filesort.
I also run queries selecting DISTINCT(MONTH(awesome_date)) and DISTINCT(DAY(awesome_date)) with the relevant WHERE conditions restricting them to a particular year or month.
Other than storing the year, month and day information in separate columns, is there a way of speeding up this query and/or avoiding temporary tables and filesort?
Without splitting the date to 3 columns, you could:
First, you should remove the DISTINCT; it is useless here, since GROUP BY already returns one row per distinct year.
Remove the ORDER BY year; it helps speed things up (a bit). Change the GROUP BY to: GROUP BY YEAR(awesome_date) DESC (this works in the MySQL dialect only).
Change the COUNT(id) to COUNT(*) (assuming that id can never be NULL, this is faster in many MySQL versions).
In all, the query will become:
SELECT YEAR(awesome_date) AS year,
       COUNT(*) AS cnt      -- not good practice to use reserved words for aliases
FROM all_entries
WHERE awesome_date IS NOT NULL
GROUP BY YEAR(awesome_date) DESC ;
Even better (faster) solutions are:
your proposal to split the column into 3 (year, month, day)
change from MySQL to MariaDB (a MySQL fork) and use a VIRTUAL PERSISTENT column for the year, and add an index on that virtual column
stay in MySQL and add a persistent year column yourself, maintained by triggers; both options are sketched below.
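First, the MariaDB route (column and index names are made up):
-- MariaDB
ALTER TABLE all_entries
  ADD COLUMN awesome_year SMALLINT AS (YEAR(awesome_date)) PERSISTENT,
  ADD INDEX idx_awesome_year (awesome_year);

And the trigger-based version for plain MySQL (again, a sketch with invented names):
-- MySQL
ALTER TABLE all_entries
  ADD COLUMN awesome_year SMALLINT NULL,
  ADD INDEX idx_awesome_year (awesome_year);

UPDATE all_entries SET awesome_year = YEAR(awesome_date);  -- backfill existing rows once

CREATE TRIGGER all_entries_bi BEFORE INSERT ON all_entries
FOR EACH ROW SET NEW.awesome_year = YEAR(NEW.awesome_date);

CREATE TRIGGER all_entries_bu BEFORE UPDATE ON all_entries
FOR EACH ROW SET NEW.awesome_year = YEAR(NEW.awesome_date);

Either way the query becomes a plain scan over the new index:
SELECT awesome_year AS year, COUNT(*) AS cnt
FROM all_entries
WHERE awesome_year IS NOT NULL
GROUP BY awesome_year DESC;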

How can I speed up a GROUP BY query that already uses indexes?

We have a MyISAM table with approximately 75 million rows that has 5 columns:
id (int)
user_id (int)
page_id (int)
type (enum with 6 strings)
date_created (datetime)
We have a primary index on the id column, a unique index on (user_id, page_id, date_created) and a composite index on (page_id, date_created).
The problem is that the query below takes up to 90 seconds to complete
SELECT SQL_NO_CACHE user_id, count(id) nr
FROM `table`
WHERE `page_id`=301
and `date_created` BETWEEN '2012-01-03' AND '2012-02-03 23:59:59'
AND page_id<>user_id
group by `user_id`
This is the explain of this query
+----+-------------+----------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| 1 | SIMPLE | table | range | page_id | page_id | 12 | NULL | 520024 | Using where; Using temporary; Using filesort |
+----+-------------+----------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
EDIT:
At the suggestion of ypercube I tried adding a new index (page_id, user_id, date_created). However, MySQL does not use it by default, so I had to suggest it to the query optimizer. Here is the new query and the explain:
SELECT SQL_NO_CACHE user_id, count(*) nr
FROM `table` USE INDEX (usridexp)
WHERE `page_id`=301
  AND `date_created` BETWEEN '2012-01-03' AND '2012-02-03 23:59:59'
  AND page_id<>user_id
GROUP BY `user_id` ORDER BY NULL
+----+-------------+----------------------------+------+---------------+----------+---------+-------+---------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------------+------+---------------+----------+---------+-------+---------+--------------------------+
| 1 | SIMPLE | table | ref | usridexp | usridexp | 4 | const | 3943444 | Using where; Using index |
+----+-------------+----------------------------+------+---------------+----------+---------+-------+---------+--------------------------+
Some changes that may improve the query:
Change COUNT(id) to COUNT(*). Since id is (I guess) the PRIMARY KEY and NOT NULL, the results will be identical.
Add an ORDER BY NULL after the GROUP BY clause. In MySQL, a GROUP BY operation also sorts the results, unless you specify otherwise.
The (page_id, date_created) is probably the best index that MySQL can use for this query, but you could also try (page_id, user_id, date_created), as sketched below (can you also post the EXPLAIN if you add this index?)
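For reference, a sketch of the DDL for that index, using the name usridexp that appears in the OP's later edit:
ALTER TABLE `table` ADD INDEX usridexp (page_id, user_id, date_created);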
Another thing not related to the performance of this query:
If your (user_id, page_id, date_created) is UNIQUE and the id is auto generated (and not used for anything else but as a Primary Key), you can make it the PRIMARY KEY and drop the id column. One less index and 4 bytes less per row.
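A sketch of that change, assuming nothing else references id and that the unique index is named uniq_user_page_date (both names are assumptions; adjust to your schema):
ALTER TABLE `table`
  DROP PRIMARY KEY,
  DROP COLUMN id,
  DROP INDEX uniq_user_page_date,  -- now redundant: the same columns become the PK
  ADD PRIMARY KEY (user_id, page_id, date_created);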
1) It depends on your data - but you should have multiple indexes available to allow MySQL to choose the best one. e.g. if the table had an index on page_id it wouldn't be scanning so many rows.
2) There is a way of optimising date searches. I haven't actually implemented this myself yet, but have a similar problem that I have thought about quite a bit.
Basically you are looking up data by day - but date compares are really slow. What you could do is create another table that stores earliest and latest ID from table for each day. That table would need to be populated at the end of each day.
After that you could break your query into two parts:
i) Find the IDs to search by running two queries:
select earliestID from idCacheTable where date = '2012-01-03';
select latestID from idCacheTable where date = '2012-02-03';
ii) You can then search directly on the primary key of the table, without doing a date compare on each row, which would be waaaaaay faster.
SELECT SQL_NO_CACHE user_id, count(id) nr
FROM table
WHERE page_id=301
and (id >= earliestID and id <= latestID)
AND page_id<>user_id
group by user_id;
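A sketch of the cache table and its end-of-day population (all names are invented; the approach assumes id grows in step with date_created):
CREATE TABLE idCacheTable (
  date       DATE PRIMARY KEY,
  earliestID INT NOT NULL,
  latestID   INT NOT NULL
);

-- run once at the end of each day to record that day's ID bounds
INSERT INTO idCacheTable (date, earliestID, latestID)
SELECT DATE(date_created), MIN(id), MAX(id)
FROM `table`
WHERE date_created >= CURDATE()
  AND date_created < CURDATE() + INTERVAL 1 DAY
GROUP BY DATE(date_created);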
The exact solution to your problem will depend on what your data looks like though, rather than one of those two things always being correct.
Sounds odd, but try adding a JOIN statement:
SELECT SQL_NO_CACHE user_id, count(id) nr
FROM `table` t
JOIN `table` t2 ON t.`user_id`= t2.`user_id`
WHERE t.`page_id`=301
and t.`date_created` BETWEEN '2012-01-03' AND '2012-02-03 23:59:59'
AND t.`page_id`<>t.`user_id`
group by t.`user_id`
For a similar problem, I got the query to execute 20 times faster (3-4s instead of 60+). The JOIN statement does not perform anything smart - the speedup seems to be due entirely to MySQL's internal implementation (tested on MySQL 5.1; the table has rare user_id duplicates).

MySQL multiple Date subqueries

I am writing a query for a very specific report that contains a variable number of columns, based on specific relationships of an item. I am open to suggestions on how to change the query if need be, but I don't think it can be. I would prefer to keep this as a single query, as opposed to running it in a loop. The table that is being searched contains around 4 million records, and cannot be archived.
What I would like to know is why the DATEADD index is not being used in the subquery, although it is being used in the outer query, which is on the same table. I am aware that functions on a field stop MySQL from being able to use an index, but here the function is only on the value being compared against, not on the indexed field itself.
The result of the report is a number for each specific item (subquery) for each date in the range where something took place. The date range is generated dynamically. The subqueries should return the results for a single day.
We are using MySQL version 5.0.77, which we cannot change as it is managed by our Hosting Provider.
Here is the query:
SELECT DATE_FORMAT(DATEADD, '%d/%m/%y') AS DATEADD,
  (SELECT COUNT(ID)
   FROM ATABLE AS VT
   WHERE ELEMNAME = 'ANELEMENT' AND COMPID = 132
     AND VT.DATEADD BETWEEN CONCAT(DATE(V.DATEADD), ' 00:00:00')
                        AND CONCAT(DATE(V.DATEADD), ' 23:59:59')
  ) AS '132',
  (SELECT COUNT(ID)
   FROM ATABLE AS VT
   WHERE ELEMNAME = 'ANELEMENT' AND COMPID = 149
     AND VT.DATEADD BETWEEN CONCAT(DATE(V.DATEADD), ' 00:00:00')
                        AND CONCAT(DATE(V.DATEADD), ' 23:59:59')
  ) AS '149'
FROM ATABLE AS V
WHERE 1 = 1 AND COMPID = 132
  AND (V.DATEADD >= '2010-09-01 00:00:00'
   AND V.DATEADD <= '2010-10-26 23:59:59')
  AND 1 = 1
  AND ELEMNAME = 'ANELEMENT'
GROUP BY DATE_FORMAT(DATEADD, '%Y-%m-%d')
The number of times the subquery is ran depends on the number of links this item has, and is determined when the query is built.
We have tried:
replacing the BETWEEN with
VT.DATEADD >= DATE(V.DATEADD) AND VT.DATEADD <= DATE(V.DATEADD) + 1
however this doesn't work either. Changing it to
VT.DATEADD = DATE(V.DATEADD)
does use the index, but doesn't return the correct number of rows, as DATEADD is a DATETIME. If we change it to:
VT.DATEADD >= '2010-09-01' AND VT.DATEADD <= '2010-09-02'
The output from Explain is
+----+--------------------+-------+-------+-------------------------+----------+---------+-------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+-------+-------------------------+----------+---------+-------+-------+----------------------------------------------+
| 1 | PRIMARY | V | range | DATEADD,COMPID,ELEMNAME | DATEADD | 8 | NULL | 1386 | Using where; Using temporary; Using filesort |
| 2 | DEPENDENT SUBQUERY | VT | ref | COMPID,ELEMNAME | ELEMNAME | 103 | const | 44277 | Using where |
+----+--------------------+-------+-------+-------------------------+----------+---------+-------+-------+----------------------------------------------+
Using USE INDEX or FORCE INDEX (when the index is available but not chosen) results in key = NULL.
Without fixing this, the query runs incredibly slowly even over a tiny date range, and it locks the database up.
I don't know if I'm oversimplifying what you want overall, but will this one work for you? It appears you want to know how much activity there was for two compid values within a given date range.
SELECT
    DATE_FORMAT(DATEADD, '%Y-%m-%d'),
    SUM( IF( compid = 132, 1, 0 ) ) AS Count132,
    SUM( IF( compid = 149, 1, 0 ) ) AS Count149
FROM
    ATable
WHERE
    elemname = 'ANELEMENT'
    AND ( compid = 132 OR compid = 149 )
    AND DATEADD BETWEEN '2010-09-01 00:00:00' AND '2010-10-26 23:59:59'
GROUP BY
    DATE_FORMAT(DATEADD, '%Y-%m-%d')