I have the following two tables in MySQL (Simplified).
clicks (InnoDB)
Contains around about 70,000,000 records
Has an index on the date_added column
Has a column link_id which refers to a record in the links table
links (MyISAM)
Contains far fewer records, around about 65,000
I'm trying to run some analytical queries using these tables. I need to pull out data about clicks that occurred between two specified dates, while applying other user-selected filters by joining other tables to the links table.
My question, however, is about the use of indexes. When I run the following query:
SELECT
COUNT(1)
FROM
clicks
WHERE
date_added >= '2016-11-01 00:00:00'
AND date_added <= '2016-11-03 23:59:59';
I get a response back in 1.40 sec. Using EXPLAIN I find that MySQL uses the index on the date_added column as expected.
EXPLAIN SELECT COUNT(1) FROM clicks WHERE date_added >= '2016-11-01 00:00:00' AND date_added <= '2016-11-16 23:59:59';
+----+-------------+--------+-------+---------------+------------+---------+------+---------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+------------+---------+------+---------+--------------------------+
| 1 | SIMPLE | clicks | range | date_added | date_added | 4 | NULL | 1559288 | Using where; Using index |
+----+-------------+--------+-------+---------------+------------+---------+------+---------+--------------------------+
However, when I LEFT JOIN in my links table I find that the query takes much longer to execute:
SELECT
COUNT(1) AS clicks
FROM
clicks AS c
LEFT JOIN links AS l ON l.id = c.link_id
WHERE
c.date_added >= '2016-11-01 00:00:00'
AND c.date_added <= '2016-11-16 23:59:59';
Which completed in 6.50 sec. Using EXPLAIN I find that the index was not used on the date_added column:
EXPLAIN SELECT COUNT(1) AS clicks FROM clicks AS c LEFT JOIN links AS l ON l.id = c.link_id WHERE c.date_added >= '2016-11-01 00:00:00' AND c.date_added <= '2016-11-16 23:59:59';
+----+-------------+-------+--------+---------------+------------+---------+---------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+------------+---------+---------------+---------+-------------+
| 1 | SIMPLE | c | range | date_added | date_added | 4 | NULL | 6613278 | Using where |
| 1 | SIMPLE | l | eq_ref | PRIMARY | PRIMARY | 4 | c.link_id | 1 | Using index |
+----+-------------+-------+--------+---------------+------------+---------+---------------+---------+-------------+
As you can see the index isn't being used for the date_added column in the larger table and seems to take far longer. This seems to get even worse when I join in other tables.
Does anyone know why this is happening or if there's anything I can do to get it to use the index on the date_added column in the clicks table?
Edit
I've just attempted to get my stats out of the database using a different method. The first step in my method involves pulling out a distinct set of link_ids from the clicks table. I've found that I'm seeing the same problem here again, without a JOIN. The index is not being used:
My query:
SELECT
DISTINCT(link_id) AS link_id
FROM
clicks
WHERE
date_added >= '2016-11-01 00:00:00'
AND date_added <= '2016-12-05 10:16:00'
This query took almost a minute to complete. I ran an EXPLAIN on this and found that the query is not using the index as I expected it would:
+----+-------------+---------+-------+---------------+----------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------+---------------+----------+---------+------+----------+-------------+
| 1 | SIMPLE | clicks | index | date_added | link_id | 4 | NULL | 79786609 | Using where |
+----+-------------+---------+-------+---------------+----------+---------+------+----------+-------------+
I expected that it would use the index on date_added to filter down the result set and then pull out the distinct link_id values. Any idea why this is happening? I have an index on link_id as well as date_added.
Do you want to use an ordinary JOIN in place of the LEFT JOIN? LEFT JOIN preserves all the rows of the left-hand table (clicks), and each click matches at most one row in links, so it will yield the same value of COUNT() as the unjoined table. If you want to count only the rows from clicks that have a matching row in links, use JOIN, not LEFT JOIN.
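To make the difference concrete, here is a sketch of both forms against the tables from the question. It assumes links.id is unique (it shows up as PRIMARY in the EXPLAIN output), so a click can match at most one link.

```sql
-- LEFT JOIN: every clicks row in the date range is kept, matched or not,
-- so the count is the same as for the query without the join.
SELECT COUNT(*) AS clicks
FROM clicks AS c
LEFT JOIN links AS l ON l.id = c.link_id
WHERE c.date_added >= '2016-11-01'
  AND c.date_added <  '2016-11-17';

-- INNER JOIN: only clicks whose link_id has a matching row in links are counted.
SELECT COUNT(*) AS clicks
FROM clicks AS c
JOIN links AS l ON l.id = c.link_id
WHERE c.date_added >= '2016-11-01'
  AND c.date_added <  '2016-11-17';
```

If the two numbers come out identical, every click in the range has a link and the join adds nothing to a plain count.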
Try dropping your index on date_added and replacing it with a compound index on (date_added, link_id). This sort of index is called a covering index. When the query planner knows it can get everything it needs from an index, it doesn't have to bounce back to the table. In this case the query planner can random-access the index to the beginning of your date range, then do an index range scan to the end of the range. It's still going to have to refer to the other table, though.
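A minimal sketch of that index change follows. The new index name date_added_link_id is made up for illustration, and the existing index is assumed to be named date_added because that is what the key column of your EXPLAIN shows. On a table of about 70 million rows the ALTER will take a while.

```sql
-- Replace the single-column index with a compound one in a single ALTER.
-- date_added_link_id is a hypothetical name; pick whatever fits your scheme.
ALTER TABLE clicks
  DROP INDEX date_added,
  ADD INDEX date_added_link_id (date_added, link_id);

-- Re-check the plan. If the index covers the query, the Extra column
-- for table c should mention "Using index" again.
EXPLAIN
SELECT COUNT(*) AS clicks
FROM clicks AS c
LEFT JOIN links AS l ON l.id = c.link_id
WHERE c.date_added >= '2016-11-01'
  AND c.date_added <  '2016-11-17';
```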
(Edit) For the sake of experimentation, try a narrower date range. See if EXPLAIN changes. In that case, the query planner might be guessing your date_added column's cardinality wrong.
You might try an index hint. For example, try
SELECT COUNT(1) AS clicks
FROM clicks AS c USE INDEX (date_added)
LEFT JOIN links AS l ON l.id = c.link_id
WHERE etc
But, judging from your EXPLAIN output, you're already doing a range scan on date_added. Your next step, like it or not, is the compound covering index.
Make sure there's an index on links(id). There probably is, because it's probably the PK.
Try using COUNT(*) instead of COUNT(1). It almost certainly won't make a difference, because InnoDB handles COUNT(*) and COUNT(1) the same way, but COUNT(*) is the conventional way to say "count the rows".
(Nitpick) Your date range smells funny. Use < for the end of your range for best results, like so.
WHERE c.date_added >= '2016-11-01'
AND c.date_added < '2016-11-17';
Edit: Look, the MySQL query planner uses lots of internal knowledge about how tables are structured. And, as of late 2016, it generally uses only one index per table to satisfy a query (the index merge optimization being the exception). That's a limitation.
SELECT DISTINCT column is actually a fairly complex query, because it has to de-dupe the column in question. If there's an index on that column, the query planner is likely to use it. Choosing that index means it could not choose some other index.
Compound indexes (covering indexes) sometimes but not always resolve this kind of index-selection dilemma, and allow index dual usage. You can read about all this at http://use-the-index-luke.com/
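For the DISTINCT query from your edit, a sketch of how to check this, assuming the compound index on (date_added, link_id) suggested above already exists. The name date_added_link_id in the hint is hypothetical.

```sql
-- First see what the planner picks on its own.
EXPLAIN
SELECT DISTINCT link_id
FROM clicks
WHERE date_added >= '2016-11-01 00:00:00'
  AND date_added <= '2016-12-05 10:16:00';

-- If it still walks the link_id index, try steering it to the compound index.
-- date_added_link_id is an assumed index name, not one from your schema.
EXPLAIN
SELECT DISTINCT link_id
FROM clicks FORCE INDEX (date_added_link_id)
WHERE date_added >= '2016-11-01 00:00:00'
  AND date_added <= '2016-12-05 10:16:00';
```

With both columns in one index, the range scan on date_added can read link_id straight from the index entries instead of visiting the table rows.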
But if your operational constraints prevent the adding of compound indexes, you'll need to live with the one-second query. It isn't that bad.
Of course, saying you can't add compound indexes to get your job done is like this:
A: stuff is falling off my truck on the freeway.
B: put a tarp over the stuff and tie it down.
A: my boss won't let me put a tarp on the truck.
B: well, then, drive slow.
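I'm not absolutely sure, but consider moving the condition from the WHERE clause to the JOIN ... ON clause. Because you are performing an outer join (LEFT JOIN), the placement can make a difference, unlike an inner join, where a condition in WHERE and the same condition in ON are equivalent. Note that the placement also changes the result: a condition in ON does not remove rows of the left-hand table, so check that the count is still the one you want.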
SELECT COUNT(1) AS clicks
FROM clicks AS c
LEFT JOIN links AS l ON l.id = c.link_id
AND (c.date_added >= '2016-11-01 00:00:00'
AND c.date_added <= '2016-11-16 23:59:59');
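A side-by-side sketch of the two placements, using only the tables from the question, shows why the result should be checked before comparing timings:

```sql
-- Date predicate in ON: it only decides which links rows get matched.
-- Every clicks row is still returned, so this counts the whole clicks table.
SELECT COUNT(1) AS clicks
FROM clicks AS c
LEFT JOIN links AS l ON l.id = c.link_id
  AND c.date_added >= '2016-11-01 00:00:00'
  AND c.date_added <= '2016-11-16 23:59:59';

-- Date predicate in WHERE: clicks rows outside the range are filtered out,
-- so this counts only the clicks in the range.
SELECT COUNT(1) AS clicks
FROM clicks AS c
LEFT JOIN links AS l ON l.id = c.link_id
WHERE c.date_added >= '2016-11-01 00:00:00'
  AND c.date_added <= '2016-11-16 23:59:59';
```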
Related
I have, in a project, a database with two big tables, "terminosnoticia" have 400 Million rows and "noticia" 3 Million. I have one query I want to make lighter (it spend from 10s to 400s):
SELECT noticia_id, termino_id
FROM noticia
LEFT JOIN terminosnoticia on terminosnoticia.noticia_id=noticia.id AND termino_id IN (7818,12345)
WHERE noticia.fecha BETWEEN '2016-09-16 00:00' AND '2016-09-16 10:00'
AND noticia_id is not null AND termino_id is not null;
The only viable solution I have to explore is to denormalize the database to include the 'fecha' field in the big table, but, this will multiply the index sizes.
Explain plan:
+----+-------------+-----------------+--------+-----------------------+------------+---------+-----------------------------------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------+--------+-----------------------+------------+---------+-----------------------------------------+-------+-------------+
| 1 | SIMPLE | terminosnoticia | ref | noticia_id,termino_id | termino_id | 4 | const | 58480 | Using where |
| 1 | SIMPLE | noticia | eq_ref | PRIMARY,fecha | PRIMARY | 4 | db_resumenes.terminosnoticia.noticia_id | 1 | Using where |
+----+-------------+-----------------+--------+-----------------------+------------+---------+-----------------------------------------+-------+-------------+
Changing the query and creating the index as suggested, the explain plan is now:
+----+-------------+-------+--------+-------------------------------------------+---------------------+---------+---------------------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-------------------------------------------+---------------------+---------+---------------------------+-------+-------------+
| 1 | SIMPLE | T | ref | noticia_id,termino_id,terminosnoticia_cpx | terminosnoticia_cpx | 4 | const | 60600 | Using index |
| 1 | SIMPLE | N | eq_ref | PRIMARY,fecha | PRIMARY | 4 | db_resumenes.T.noticia_id | 1 | Using where |
+----+-------------+-------+--------+-------------------------------------------+---------------------+---------+---------------------------+-------+-------------+
But the execution time does not vary too much...
Any idea?
As Strawberry pointed out, having the IS NOT NULL conditions ANDed into your WHERE clause
makes this the same as a regular INNER JOIN, so the query can be reduced to:
SELECT
N.id as noticia_id,
T.termino_id
FROM
noticia N USE INDEX (fecha)
JOIN terminosnoticia T
on N.id = T.noticia_id
AND T.termino_id IN (7818,12345)
WHERE
N.fecha BETWEEN '2016-09-16 00:00' AND '2016-09-16 10:00'
Now, that said and aliases applied, I would suggest the following covering indexes as
table index
Noticia ( fecha, id )
terminosnoticia ( noticia_id, termino_id )
This way the query can get all the results directly from the indexes and not have to go to the raw data pages to qualify the other fields.
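As a sketch, the two covering indexes could be created like this. Both index names are made up for illustration, and building an index on a 400-million-row table will take a long time.

```sql
-- Hypothetical index names; the column lists are the ones suggested above.
CREATE INDEX noticia_fecha_id
  ON noticia (fecha, id);

CREATE INDEX terminosnoticia_noticia_termino
  ON terminosnoticia (noticia_id, termino_id);

-- Then check whether both tables report "Using index" in the Extra column.
EXPLAIN
SELECT N.id AS noticia_id, T.termino_id
FROM noticia N
JOIN terminosnoticia T
  ON N.id = T.noticia_id
 AND T.termino_id IN (7818,12345)
WHERE N.fecha BETWEEN '2016-09-16 00:00' AND '2016-09-16 10:00';
```

If noticia is an InnoDB table with id as its primary key, a secondary index on fecha alone already carries id, so the first index may not change much there.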
Assuming noticia_id is noticia's primary key, I would add the following indexes:
create index noticia_fecha_idx on noticia(fecha);
create index terminosnoticia_id_noticia_idx on terminosnoticia(noticia_id);
And try your queries again.
Do include the current execution plan of your query. It might help in figuring this one out.
Try this:
SELECT tbl1.noticia_id, tbl1.termino_id FROM
( SELECT noticia_id, termino_id FROM terminosnoticia WHERE
terminosnoticia.termino_id IN (7818,12345)
AND terminosnoticia.noticia_id is not null
) tbl1 INNER JOIN
( SELECT id FROM noticia
WHERE noticia.fecha
BETWEEN '2016-09-16 00:00' AND '2016-09-16 10:00'
) tbl2 ON tbl1.noticia_id=tbl2.id
We're assuming that the noticia_id and termino_id are columns in terminosnoticia table. (We wouldn't have to guess, if all of the column references were qualified with the table name or a short table alias.)
Why is this an outer join? The predicates in the WHERE clause are going to exclude rows with NULL values for columns from terminosnoticia. That's going to negate the "outerness" of the join.
And if we write this as an inner join, those predicates in the WHERE clause are redundant. We already know that noticia_id won't be NULL (if it satisfies the equality predicate in the ON clause). Same for termino_id, that won't be NULL if it's equal to a value in the IN list.
I believe this query will return an equivalent result:
SELECT t.noticia_id
, t.termino_id
FROM noticia n
JOIN terminosnoticia t
ON t.noticia_id = n.id
AND t.termino_id IN (7818,12345)
WHERE n.fecha BETWEEN '2016-09-16 00:00' AND '2016-09-16 10:00'
What's left now is figuring out if there's any implicit datatype conversions.
We don't see the datatype of termino_id. So we don't know if that's defined as numeric. It's bad news if it's not, since MySQL will have to perform a conversion to numeric, for every row in the table, so it can do the comparison to the numeric literals.
We don't see the datatypes of the noticia_id, and whether that matches the datatype of the column it's being compared to, the id column from noticia table.
We also don't see the datatype of fecha. Based on the string literals in the between predicate, it looks like it's probably a DATETIME or TIMESTAMP. But that's just a guess. We don't know, since we don't have a table definition available to us.
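One way to check those datatypes, sketched under the assumption that the schema is named db_resumenes (that name is taken from the ref column of the EXPLAIN output in the question):

```sql
-- Column types for the columns involved in the join and the filters.
SELECT table_name, column_name, column_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'db_resumenes'
  AND (   (table_name = 'noticia'
           AND column_name IN ('id', 'fecha'))
       OR (table_name = 'terminosnoticia'
           AND column_name IN ('noticia_id', 'termino_id')));

-- Or simply dump the full definitions, indexes included.
SHOW CREATE TABLE noticia;
SHOW CREATE TABLE terminosnoticia;
```

What we're hoping to see is that noticia.id and terminosnoticia.noticia_id have the same integer type, that termino_id is numeric, and that fecha is DATETIME or TIMESTAMP.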
Once we have verified that there aren't any implicit datatype conversions that are going to bite us...
For the query with the inner join (as above), the best shot at reasonable performance will likely be with MySQL making effective use of covering indexes. (A covering index allows MySQL to satisfy the query directly from the index blocks, without needing to look up pages in the underlying table.)
As DRApp's answer already states, the best candidates for covering indexes, for this particular query, would be:
... ON noticia (fecha, id)
... ON terminosnoticia (noticia_id, termino_id)
An index that has those same leading columns in that same order would also be suitable, and would render these indexes redundant.
The addition of these indexes will render other indexes redundant.
The first index would be redundant with ... ON noticia (fecha). Assuming the index isn't enforcing a UNIQUE constraint, it could be dropped. Any query making effective use of that index could use the new index, since fecha is the leading column in the new index.
Similarly, an index ... ON terminosnoticia (noticia_id) would be redundant. Again, assuming it's not a unique index, enforcing a UNIQUE constraint, that index could be dropped as well.
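A sketch of that cleanup, assuming the existing single-column indexes are named fecha and noticia_id (those names appear under possible_keys in the EXPLAIN output, but we haven't seen their definitions):

```sql
-- Confirm what the existing indexes contain before dropping anything.
-- Non_unique = 1 means the index is not enforcing a UNIQUE constraint.
SHOW INDEX FROM noticia;
SHOW INDEX FROM terminosnoticia;

-- Only if each one is a plain, single-column, non-unique index
-- and the new compound indexes are already in place:
ALTER TABLE noticia DROP INDEX fecha;
ALTER TABLE terminosnoticia DROP INDEX noticia_id;
```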
I have the following query. I picked it from mysql slow queries log:
SELECT AVG(item.duration) AS dur
FROM `item`
INNER JOIN item_step ON item_step.item_id = item.id
WHERE
item_step.number = '2' AND
(IS_OK(item_step.result) OR item_step.result2 IN ("R1", "R2")) AND
item.time >= '2015-03-01 07:00:00' AND
item.time < '2015-05-01 07:00:00';
As usual, I tried to inspect it using EXPLAIN:
+----+-------------+-----------+------+----------------------------+---------+---------+------------------+--------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------+----------------------------+---------+---------+------------------+--------+----------+-------------+
| 1 | SIMPLE | item | ALL | PRIMARY,time | NULL | NULL | NULL | 790464 | 38.74 | Using where |
| 1 | SIMPLE | item_step | ref | number,item_id,result2_idx | item_id | 4 | debug_db.item.id | 1 | 100.00 | Using where |
+----+-------------+-----------+------+----------------------------+---------+---------+------------------+--------+----------+-------------+
Adding an index on (id, time) to the item table made no difference.
The time column already has an index, and the tables are connected by foreign keys, which are indexed as well.
I have no idea what to do here. Is it really impossible to optimize this query to avoid the join type ALL?
Since you already seem to have a FK from item_step.item_id to item.id, the only option you have for improvement is focusing on the parts being used to filter out records.
Slightly reformatting your query we have :
SELECT AVG(item.duration) AS dur
FROM `item`
INNER JOIN item_step
ON item_step.item_id = item.id
AND item_step.number = '2'
AND (IS_OK(item_step.result) OR item_step.result2 IN ("R1", "R2"))
WHERE item.time >= '2015-03-01 07:00:00'
AND item.time < '2015-05-01 07:00:00';
First thing to notice is IS_OK(item_step.result). I have no clue what's behind this function but I'm pretty sure it blocks the optimizer from using any index on this field efficiently. If the formula is something that can be written in the query directly I would suggest doing so (e.g. IN (1, 4, 9), or IN (SELECT OK FROM result_values) etc.).
Going by the field names, I'm going to assume that we first want to reduce the list of item ids to a minimum and then use that reduced list to work on the item_step table. To do so you'll need an index on the time field. I'm assuming that the id field is automatically included in the index as it's the PK field, but I'm no MySQL specialist and it might also depend on your storage engine. Anyway, in MSSQL that's how it would work, YMMV.
The second thing to do then is to go with this list of item_ids to the item_step table and reduce the number of records there. For this you'll want a compound index on item_id, number, result2, result. If you manage to write the IS_OK() function 'inline' into the query you might want to try swapping the last two fields around... something you'll need to test.
From what I read here and there, MySQL does not support something like INCLUDE on indexes in the same way as MSSQL does. A way around that would be to create a 'covering' index on time, duration on item. That way, everything can be done from the index directly, at the cost of more disk-space and CPU requirements when adding data to the item table.
In short:
add index on item on time, duration
add index on item_step on item_id, number, result2, result
see if you can inline the IS_OK() function.
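In MySQL syntax the two indexes could look roughly like this. The index names are invented for the example, and if result or result2 are long string columns the index may need prefix lengths, which I can't tell from the question.

```sql
-- Hypothetical index names. `time` is backticked because it is also a type name.
CREATE INDEX item_time_duration
  ON item (`time`, duration);

CREATE INDEX item_step_item_number_results
  ON item_step (item_id, number, result2, result);

-- Then check whether item is still read with type = ALL.
EXPLAIN
SELECT AVG(item.duration) AS dur
FROM `item`
INNER JOIN item_step
        ON item_step.item_id = item.id
       AND item_step.number = '2'
       AND (IS_OK(item_step.result) OR item_step.result2 IN ("R1", "R2"))
WHERE item.time >= '2015-03-01 07:00:00'
  AND item.time <  '2015-05-01 07:00:00';
```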
I have been trying to create an index in MySQL, but keep getting temporary and filesort whenever I run an explain on my query.
A simplified version of my tables looks like:
ordered_products
op_id INT UNSIGNED NOT NULL AUTO_INCREMENT
op_orderid INT UNSIGNED NOT NULL
op_orderdate TIMESTAMP NOT NULL
op_productid INT UNSIGNED NOT NULL
products
p_id INT UNSIGNED NOT NULL AUTO_INCREMENT
p_productname VARCHAR(128) NOT NULL
p_enabled TINYINT NOT NULL
The 'ordered_products' table currently has more than 1,000,000 rows and is a record of all products that have been ordered, as well as the orders that they belong to. This table grows rapidly.
The 'products' table currently has around 3,000 rows and contains a list of products that are for sale.
The site displays a list of the top products for a given period (normally the last 3 days) and my query looks like:
SELECT COUNT(op.op_productid) AS ProductCount, op.op_productid
FROM ordered_products op
LEFT JOIN products p ON op.op_productid=p.p_id
WHERE op.op_orderdate>='2014-03-08 00:00:00'
AND p.p_enabled=1
GROUP BY op.op_productid
ORDER BY ProductCount DESC, p.p_productname ASC
When I run that query, it normally takes around 800 milliseconds (0.8 seconds) to execute, which is ridiculous. We've remedied this with caching, however whenever the cache expires, we have a slowdown. I need to fix this.
I have tried to index the tables, but no matter what I try, I can't avoid temporary and filesort. The output from EXPLAIN is:
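+----+-------------+-------+-------+-----------------------------+---------------------+---------+-------------------+------+-----------------------------------------------------------+
| id | select_type | table | type  | possible_keys               | key                 | key_len | ref               | rows | Extra                                                     |
+----+-------------+-------+-------+-----------------------------+---------------------+---------+-------------------+------+-----------------------------------------------------------+
|  1 | SIMPLE      | p     | index | PRIMARY,idx_enabled_id_name | idx_enabled_id_name | 782     | NULL              | 1477 | Using where; Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | op    | ref   | idx_pid_oid_date            | idx_pid_oid_date    | 4       | test_store.p.p_id |    9 | Using where; Using index                                  |
+----+-------------+-------+-------+-----------------------------+---------------------+---------+-------------------+------+-----------------------------------------------------------+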
If I remove the GROUP BY, the filesort disappears, however I need it to ensure the ProductCount value shows me every product count rather than a total sum of all products.
If I remove the GROUP BY and the ORDER BY ProductCount, both temporary and filesort disappear, but now I am left with a very bad result set.
Can anyone please help me solve this? I have tried a multitude of different indexes, and have tried rewriting the SQL numerous times, but can never succeed.
Any help would be greatly appreciated.
You can't get rid of the temp table and filesort while you are using ORDER BY on a calculated column ProductCount. There's no index for the calculated column, so it has to do the sorting at the time of the query.
I tried experimentally to reproduce your results. I can put an index on op_productid and then the optimizer might use it to perform the GROUP BY.
mysql> EXPLAIN SELECT COUNT(op.op_productid) AS ProductCount, op.op_productid
FROM ordered_products op FORCE INDEX (op_productid) STRAIGHT_JOIN products p
ON op.op_productid=p.p_id
WHERE op.op_orderdate>='2014-03-08 00:00:00' AND p.p_enabled=1
GROUP BY op.op_productid ORDER BY null;
In my case, I had to use STRAIGHT_JOIN and FORCE INDEX to override the optimizer. But that might be due to my test environment, where I have only 1 or 2 rows per table for testing, and it throws off the optimizer's choices. In your real data, it might make a more sensible choice.
Also, don't use LEFT JOIN if you have conditions in the WHERE clause that make the join implicitly an inner join. Learn the types of joins and how they work -- don't always use LEFT JOIN by default.
+----+-------------+-------+-------+---------------+--------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+--------------+---------+------+------+-------------+
| 1 | SIMPLE | op | index | op_productid | op_productid | 4 | NULL | 5 | Using where |
| 1 | SIMPLE | p | ALL | PRIMARY | NULL | NULL | NULL | 1 | Using where |
+----+-------------+-------+-------+---------------+--------------+---------+------+------+-------------+
Your only alternative is to store a denormalized table, where the counts are persisted. Then if your cache fails, it isn't an expensive query to refresh the cache.
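A rough sketch of what such a denormalized table might look like. The table product_daily_counts and its pdc_ columns are hypothetical names, and the product id 123 is a placeholder for whatever the application inserts.

```sql
-- One row per product per day, maintained alongside ordered_products.
CREATE TABLE product_daily_counts (
  pdc_day       DATE         NOT NULL,
  pdc_productid INT UNSIGNED NOT NULL,
  pdc_count     INT UNSIGNED NOT NULL DEFAULT 0,
  PRIMARY KEY (pdc_day, pdc_productid)
);

-- Run for each ordered product (from application code or a trigger).
INSERT INTO product_daily_counts (pdc_day, pdc_productid, pdc_count)
VALUES (CURDATE(), 123, 1)
ON DUPLICATE KEY UPDATE pdc_count = pdc_count + 1;

-- Refreshing the cache then reads a few days of small summary rows
-- instead of scanning ordered_products.
SELECT c.pdc_productid, SUM(c.pdc_count) AS ProductCount
FROM product_daily_counts c
JOIN products p ON p.p_id = c.pdc_productid
WHERE c.pdc_day >= '2014-03-08'
  AND p.p_enabled = 1
GROUP BY c.pdc_productid, p.p_productname
ORDER BY ProductCount DESC, p.p_productname ASC;
```

The final query still sorts on the computed sum, but it only has to aggregate and sort at most one row per product per day, which is far less work than the original.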
We have a MyISAM table with approximately 75 million rows that has 5 columns:
id (int),
user_id(int),
page_id (int),
type (enum with 6 strings)
date_created(datetime).
We have a primary index on the ID column, a unique index (user_id, page_id, date_created) AND a composite index (page_id, date_created)
The problem is that the query below takes up to 90 seconds to complete
SELECT SQL_NO_CACHE user_id, count(id) nr
FROM `table`
WHERE `page_id`=301
and `date_created` BETWEEN '2012-01-03' AND '2012-02-03 23:59:59'
AND page_id<>user_id
group by `user_id`
This is the explain of this query
+----+-------------+----------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| 1 | SIMPLE | table | range | page_id | page_id | 12 | NULL | 520024 | Using where; Using temporary; Using filesort |
+----+-------------+----------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
EDIT:
At the suggestion of ypercube I tried adding a new index (page_id, user_id, date_created). However, MySQL does not use it by default, so I had to suggest it to the query optimizer. Here is the new query and the explain:
SELECT SQL_NO_CACHE user_id, count(*) nr FROM `table` USE INDEX (usridexp) WHERE `page_id`=301 and `date_created` BETWEEN '2012-01-03' AND '2012-02-03 23:59:59' AND page_id<>user_id group by `user_id` ORDER BY NULL
+----+-------------+----------------------------+------+---------------+----------+---------+-------+---------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------------+------+---------------+----------+---------+-------+---------+--------------------------+
| 1 | SIMPLE | table | ref | usridexp | usridexp | 4 | const | 3943444 | Using where; Using index |
+----+-------------+----------------------------+------+---------------+----------+---------+-------+---------+--------------------------+
Some changes that may improve the query:
Change COUNT(id) to COUNT(*). Since id is (I guess) the PRIMARY KEY and NOT NULL, the results will be identical.
Add an ORDER BY NULL after the GROUP BY clause. In MySQL, a GROUP BY operation also sorts the results, unless you specify otherwise.
The (page_id, date_created) is probably the best index that MySQL can use for this query but you could also try (page_id, user_id, date_created) (can you also post the EXPLAIN if you add this index?)
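Put together, the suggestions above might look like the sketch below. The index name usridexp is only an example, and whether the optimizer picks it on its own depends on your data.

```sql
-- Candidate index: equality column first, then the GROUP BY column,
-- then the range column, so the query can be answered from the index.
ALTER TABLE `table`
  ADD INDEX usridexp (page_id, user_id, date_created);

-- COUNT(*) instead of COUNT(id), and ORDER BY NULL to skip the implicit sort.
EXPLAIN
SELECT SQL_NO_CACHE user_id, COUNT(*) nr
FROM `table`
WHERE `page_id` = 301
  AND `date_created` BETWEEN '2012-01-03' AND '2012-02-03 23:59:59'
  AND page_id <> user_id
GROUP BY `user_id`
ORDER BY NULL;
```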
Another thing not related to the performance of this query:
If your (user_id, page_id, date_created) is UNIQUE and the id is auto generated (and not used for anything else but as a Primary Key), you can make it the PRIMARY KEY and drop the id column. One less index and 4 bytes less per row.
1) It depends on your data - but you should have multiple indexes available to allow MySQL to choose the best one. e.g. if the table had an index on page_id it wouldn't be scanning so many rows.
2) There is a way of optimising date searches. I haven't actually implemented this myself yet, but have a similar problem that I have thought about quite a bit.
Basically you are looking up data by day - but date compares are really slow. What you could do is create another table that stores earliest and latest ID from table for each day. That table would need to be populated at the end of each day.
After that you could break your query into two parts:
i) Find the IDs to search by running two queries:
select earliestID from idCacheTable where date = '2012-01-03';
select latestID from idCacheTable where date = '2012-02-03';
ii) You can then search directly on the primary key of the table, without doing a date compare on each row, which would be waaaaaay faster.
SELECT SQL_NO_CACHE user_id, count(id) nr
FROM `table`
WHERE page_id=301
and (id >= earliestID and id <= latestID)
AND page_id<>user_id
group by user_id;
The exact solution to your problem will depend on what your data looks like though, rather than one of those two things always being correct.
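For what it's worth, the daily lookup table from point 2 could be sketched as follows. I haven't run this in production. It assumes that id grows in step with date_created (rows are only ever appended), otherwise the id range for a day is not contiguous and the trick breaks down.

```sql
-- One row per day with the first and last id created on that day.
CREATE TABLE idCacheTable (
  `date`     DATE NOT NULL,
  earliestID INT  NOT NULL,
  latestID   INT  NOT NULL,
  PRIMARY KEY (`date`)
);

-- Nightly job: record yesterday's id range.
-- Inserts nothing if no rows were created yesterday.
INSERT INTO idCacheTable (`date`, earliestID, latestID)
SELECT DATE(date_created), MIN(id), MAX(id)
FROM `table`
WHERE date_created >= CURDATE() - INTERVAL 1 DAY
  AND date_created <  CURDATE()
GROUP BY DATE(date_created);
```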
Sounds odd, but try to add JOIN statement:
SELECT SQL_NO_CACHE t.user_id, count(t.id) nr
FROM `table` t
JOIN `table` t2 ON t.`user_id`= t2.`user_id`
WHERE t.`page_id`=301
and t.`date_created` BETWEEN '2012-01-03' AND '2012-02-03 23:59:59'
AND t.`page_id`<>t.`user_id`
group by t.`user_id`
For a similar problem, this made the query execute 20 times faster (3-4 s instead of 60+ s). The JOIN does not do anything smart; the speedup seems to be entirely due to MySQL's internal implementation. (Tested on MySQL 5.1; the table has few user_id duplicates.)
I've got a very large table (~100Million Records) in MySQL that contains information about files. One of the pieces of information is the modified date of each file.
I need to write a query that will count the number of files that fit into specified date ranges. To do that I made a small table that specifies these ranges (all in days) and looks like this:
DateRanges
range_id range_name range_start range_end
1 0-90 0 90
2 91-180 91 180
3 181-365 181 365
4 366-1095 366 1095
5 1096+ 1096 999999999
And wrote a query that looks like this:
SELECT r.range_name, sum(IF((DATEDIFF(CURDATE(),t.file_last_access) > r.range_start and DATEDIFF(CURDATE(),t.file_last_access) < r.range_end),1,0)) as FileCount
FROM `DateRanges` r, `HugeFileTable` t
GROUP BY r.range_name
However, quite predictably, this query takes forever to run. I think that is because I am asking MySQL to go through the HugeFileTable 5 times, each time performing the DATEDIFF() calculation on each file.
What I want to do instead is to go through the HugeFileTable record by record only once, and for each file increment the count in the appropriate range_name running total. I can't figure out how to do that....
Can anyone help out with this?
Thanks.
EDIT: MySQL Version: 5.0.45, Tables are MyISAM
EDIT2: Here's the describe that was asked for in the comments
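+----+-------------+-------+------+---------------+------+---------+------+----------+---------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows     | Extra                           |
+----+-------------+-------+------+---------------+------+---------+------+----------+---------------------------------+
|  1 | SIMPLE      | r     | ALL  | NULL          | NULL | NULL    | NULL |        5 | Using temporary; Using filesort |
|  1 | SIMPLE      | t     | ALL  | NULL          | NULL | NULL    | NULL | 96506321 |                                 |
+----+-------------+-------+------+---------------+------+---------+------+----------+---------------------------------+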
First, create an index on HugeFileTable.file_last_access.
Then try the following query:
SELECT r.range_name, COUNT(t.file_last_access) as FileCount
FROM `DateRanges` r
JOIN `HugeFileTable` t
ON (t.file_last_access BETWEEN
CURDATE() - INTERVAL r.range_end DAY AND
CURDATE() - INTERVAL r.range_start DAY)
GROUP BY r.range_name;
Here's the EXPLAIN plan that I got when I tried this query on MySQL 5.0.75 (edited down for brevity):
+-------+-------+------------------+----------------------------------------------+
| table | type | key | Extra |
+-------+-------+------------------+----------------------------------------------+
| t | index | file_last_access | Using index; Using temporary; Using filesort |
| r | ALL | NULL | Using where |
+-------+-------+------------------+----------------------------------------------+
It's still not going to perform very well. By using GROUP BY, the query incurs a temporary table, which may be expensive. Not much you can do about that.
But at least this query eliminates the Cartesian product that you had in your original query.
update: Here's another query that uses a correlated subquery but I have eliminated the GROUP BY.
SELECT r.range_name,
(SELECT COUNT(*)
FROM `HugeFileTable` t
WHERE t.file_last_access BETWEEN
CURDATE() - INTERVAL r.range_end DAY AND
CURDATE() - INTERVAL r.range_start DAY
) as FileCount
FROM `DateRanges` r;
The EXPLAIN plan shows no temporary table or filesort (at least with the trivial amount of rows I have in my test tables):
+----+--------------------+-------+-------+------------------+--------------------------+
| id | select_type | table | type | key | Extra |
+----+--------------------+-------+-------+------------------+--------------------------+
| 1 | PRIMARY | r | ALL | NULL | |
| 2 | DEPENDENT SUBQUERY | t | index | file_last_access | Using where; Using index |
+----+--------------------+-------+-------+------------------+--------------------------+
Try this query on your data set and see if it performs better.
Well, start by making sure that there is an index on file_last_access in the table HugeFileTable.
I'm not sure if this is possible or better, but try to compute the date limits first (files from date A to date B), then use a query with >= and <=. It will, theoretically at least, improve the performance.
The comparison would be something like:
t.file_last_access >= StartDate AND t.file_last_access <= EndDate
You might get a small improvement by replacing CURDATE() with a literal date, but MySQL evaluates CURDATE() only once per statement, so the bigger cost in your SQL is that DATEDIFF() runs twice for every row it examines.
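One way to sketch the precomputed-limits idea in a single pass over the table is below. The bucket boundaries are copied by hand from the DateRanges table in the question rather than joined in, and they are inclusive at both ends, which differs slightly from the strict > and < in the original query.

```sql
-- Each row is compared against date limits that are fixed for the whole
-- statement, and the table is read only once.
SELECT
  CASE
    WHEN t.file_last_access >= CURDATE() - INTERVAL 90 DAY   THEN '0-90'
    WHEN t.file_last_access >= CURDATE() - INTERVAL 180 DAY  THEN '91-180'
    WHEN t.file_last_access >= CURDATE() - INTERVAL 365 DAY  THEN '181-365'
    WHEN t.file_last_access >= CURDATE() - INTERVAL 1095 DAY THEN '366-1095'
    ELSE '1096+'
  END AS range_name,
  COUNT(*) AS FileCount
FROM `HugeFileTable` t
WHERE t.file_last_access IS NOT NULL  -- keep NULL dates out of the '1096+' bucket
GROUP BY range_name;
```

Dates in the future would land in the '0-90' bucket, and a range with no files will simply be missing from the output, so treat this as a starting point.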