MySQL query takes a long time with a WHERE clause - mysql

I have a table structure like below:
id item_id created
1 5 2012-09-05 09:37:59
2 5 2012-09-05 10:25:09
3 5 2012-09-05 11:05:09
4 1 2012-09-05 10:25:09
5 3 2012-09-05 03:05:01
I want to know which item_id is viewed the most on a given date, passing the date in the WHERE clause as below:
SELECT item_id, COUNT( id ) AS TOTAL
FROM stats_item
WHERE DAY( created ) = '05'
AND MONTH( created ) = '07'
AND YEAR( created ) = '2013'
GROUP BY item_id
ORDER BY TOTAL DESC
LIMIT 0 , 30
The query result in MySQL:
Showing rows 0 - 29 ( 30 total, Query took 4.1747 sec)
It takes up to 4.1747 sec.
Below are the indexes on the table:
Table Non_unique Key_name Seq_in_index Column_name Collation Cardinality Sub_part Packed Null Index_type Comment
stats_item 0 PRIMARY 1 id A 2575580 NULL NULL BTREE
stats_item 1 created 1 created A 515116 NULL NULL YES BTREE
Why does the query take so long with a WHERE clause that filters with YEAR, MONTH and DAY?
==================================
Edit with EXPLAIN:
Field Type Null Key Default Extra
id int(11) unsigned NO PRI NULL auto_increment
item_id int(11) unsigned YES NULL
created timestamp YES MUL NULL

Try adding a composite index on (created, item_id).
Try a query like:
SELECT item_id, COUNT( id ) AS TOTAL FROM stats_item
WHERE created >= "2013-07-05" and created <= "2013-07-05 23:59:59"
GROUP BY item_id ORDER BY TOTAL DESC LIMIT 0 , 30

Try this. I bet it runs faster. When you wrap a date column in a function, the engine generally cannot use an index on that column. Comparing the column directly against a date range should be faster.
SELECT item_id, COUNT( id ) AS TOTAL
FROM stats_item
WHERE created BETWEEN '2013-07-05' AND '2013-07-05 23:59:59'
GROUP BY item_id
ORDER BY TOTAL DESC
LIMIT 0, 30

Why does it take a long time? Two possible reasons. One is that the query is doing a full-table scan (because the functions in the where clause preclude the use of indexes). The other is that there are lots and lots of rows.
The first problem is solved by JW's solution:
WHERE created >= '2013-07-05' AND
created < '2013-07-05' + INTERVAL 1 DAY
Direct comparisons, with no functions applied to the column, should be able to use an index.
If that does not fix it, let me assume that there are lots and lots of rows for the day. If so, the problem is something called thrashing that can happen with indexes. This basically means that every matching reference is on a different page, so you still have to read in lots of pages. The fix for this is to add item_id to the index. In the create table statement, you would do:
index (created, item_id)
Or you would do:
create index stats_item_created_item_id on stats_item(created, item_id)
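Putting the two suggestions together, a minimal sketch might look like the following. It assumes the stats_item table from the question and that the index above has already been created; EXPLAIN is only there so you can check which key the optimizer actually picks on your data.

```sql
-- Assumes the composite index on (created, item_id) from above exists.
-- Half-open range: everything on 2013-07-05, with no function applied to the column.
EXPLAIN
SELECT item_id, COUNT(*) AS total
FROM stats_item
WHERE created >= '2013-07-05'
  AND created <  '2013-07-05' + INTERVAL 1 DAY
GROUP BY item_id
ORDER BY total DESC
LIMIT 30;
```

The half-open form (>= start, < next day) avoids having to spell out 23:59:59 and still compares the bare column against constants.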

The stats_item table really does have a lot of rows. I tried adding an index on (created, item_id), but the result was still slow.
The only thing that worked for me was BETWEEN. It is much better: it took only 0.0686 sec, very different from 4.1747 sec.

Related

MySQL shows "possible_keys" but does not use it

I have a table with more than a million entries and around 42 columns. I am trying to run SELECT query on this table which takes a minute to execute. In order to reduce the query execution time I added an index on the table, but the index is not being used.
The table structure is as follows. Though the table has 42 columns I am only showing here those that are relevant to my query
CREATE TABLE `tas_usage` (
`uid` int(11) NOT NULL AUTO_INCREMENT,
`userid` varchar(255) DEFAULT NULL,
`companyid` varchar(255) DEFAULT NULL,
`SERVICE` varchar(2000) DEFAULT NULL,
`runstatus` varchar(255) DEFAULT NULL,
`STATUS` varchar(2000) DEFAULT NULL,
`servertime` datetime DEFAULT NULL,
`machineId` varchar(2000) DEFAULT NULL,
PRIMARY KEY (`uid`)
) ENGINE=InnoDB AUTO_INCREMENT=2992891 DEFAULT CHARSET=latin1
The index that I have added is as follows
ALTER TABLE TAS_USAGE ADD INDEX last_quarter (SERVERTIME,COMPANYID(20),MACHINEID(20),SERVICE(50),RUNSTATUS(10));
My SELECT Query
EXPLAIN SELECT DISTINCT t1.COMPANYID, t1.USERID, t1.MACHINEID FROM TAS_USAGE t1
LEFT JOIN TAS_INVALID_COMPANY INVL ON INVL.COMPANYID = t1.COMPANYID
LEFT JOIN TAS_INVALID_MACHINE INVL_MAC_ID ON INVL_MAC_ID.MACHINEID = t1.MACHINEID
WHERE t1.SERVERTIME >= '2018-10-01 00:00:00' AND t1.SERVERTIME <= '2018-12-31 00:00:00' AND
INVL.companyId IS NULL AND INVL_MAC_ID.machineId IS NULL AND
t1.SERVICE NOT IN ('credentialtest%', 'webupdate%') AND
t1.RUNSTATUS NOT IN ('Failed', 'Failed Failed', 'Failed Success', 'Success Failed', '');
EXPLAIN result is as follows
+----+-------------+-------------+------------+--------+-----------------------+-----------------------+---------+-----------------------------+---------+----------+------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------+------------+--------+-----------------------+-----------------------+---------+-----------------------------+---------+----------+------------------------------------------------+
| 1 | SIMPLE | t1 | NULL | ALL | last_quarter | NULL | NULL | NULL | 1765296 | 15.68 | Using where; Using temporary |
| 1 | SIMPLE | INVL | NULL | ref | invalid_company_index | invalid_company_index | 502 | servicerunprod.t1.companyid | 1 | 100.00 | Using where; Not exists; Using index; Distinct |
| 1 | SIMPLE | INVL_MAC_ID | NULL | eq_ref | machineId | machineId | 502 | servicerunprod.t1.machineId | 1 | 100.00 | Using where; Not exists; Using index; Distinct |
+----+-------------+-------------+------------+--------+-----------------------+-----------------------+---------+-----------------------------+---------+----------+------------------------------------------------+
Explanation of my Query
I want to select all the records from table TAS_USAGE
which are between date range (inclusive) 1st October 2018 and 31st Dec 2018 AND
which do not have columns COMPANYID and MACHINEID matching in tables TAS_INVALID_COMPANY and TAS_INVALID_MACHINE AND
which do not contain values ('credentialtest%', 'webupdate%') in SERVICE column and values ('Failed', 'Failed Failed', 'Failed Success', 'Success Failed', '') in RUNSTATUS column
WHERE t1.SERVERTIME >= '2018-10-01 00:00:00'
AND t1.SERVERTIME <= '2018-12-31 00:00:00'
is strange. It covers 3 months minus 1 day plus 1 second. Suggest you rephrase thus:
WHERE t1.SERVERTIME >= '2018-10-01'
AND t1.SERVERTIME < '2018-10-01' + INTERVAL 3 MONTH
There are multiple possible reasons why the INDEX(servertime, ...) was not used and/or was not "useful" even if used:
If more than perhaps 20% of the table involved that daterange, using the index is likely to be less efficient than simply scanning the table. Using the index would involve bouncing between the index's BTree and the data's BTree.
Starting an index with a 'range' means that the rest of the index will not be used.
Index "prefixing" (foo(10)) is next to useless.
What you can do:
Normalize most of those string columns. How many "machines" do you have? Probably nowhere near 3 million. Replacing repeated strings with a small id (perhaps a 2-byte SMALLINT UNSIGNED with a max of 65K) will save a lot of space in this table. This, in turn, will speed up the query, and eliminate the desire for index prefixing.
If normalizing is not practical because there really are upwards of 3 million distinct values, then see if you can shorten the VARCHAR. If you get it under 255, prefixing is no longer needed.
NOT IN is not optimizable. If you can invert the test and make it IN(...), more possibilities open up, such as INDEX(service, runstatus, servertime). If you have a new enough version of MySQL, I think the optimizer will hop around in the index on the two IN columns and use the index for the time range.
NOT IN ('credentialtest%', 'webupdate%') -- Is % part of the string? If you are using % as a wildcard, that construct will not work. You would need two LIKE clauses.
Reformulate the query thus:
SELECT t1.COMPANYID, t1.USERID, t1.MACHINEID
FROM TAS_USAGE t1
WHERE t1.SERVERTIME >= '2018-10-01'
AND t1.SERVERTIME < '2018-10-01' + INTERVAL 3 MONTH
AND t1.SERVICE NOT IN ('credentialtest%', 'webupdate%')
AND t1.RUNSTATUS NOT IN ('Failed', 'Failed Failed',
'Failed Success', 'Success Failed', '')
AND NOT EXISTS( SELECT 1 FROM TAS_INVALID_COMPANY WHERE companyId = t1.COMPANYID )
AND NOT EXISTS( SELECT 1 FROM TAS_INVALID_MACHINE WHERE MACHINEID = t1.MACHINEID );
If the trio t1.COMPANYID, t1.USERID, t1.MACHINEID is unique, then get rid of DISTINCT.
Since there are only 6 (of 42) columns being used in this query, building a "covering" index will probably help:
INDEX(SERVERTIME, SERVICE, RUNSTATUS, COMPANYID, USERID, MACHINEID)
This is because the query can be performed entirely within the index. In this case, I deliberately put the range first.
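On the wildcard point above: if % in 'credentialtest%' and 'webupdate%' was meant as a wildcard, a sketch of the filter with two LIKE tests might look like this. It assumes the intent is "SERVICE does not start with either prefix", which the question does not confirm.

```sql
-- Assumption: % was intended as a wildcard, so NOT IN is replaced by two NOT LIKE tests.
SELECT t1.COMPANYID, t1.USERID, t1.MACHINEID
FROM TAS_USAGE t1
WHERE t1.SERVERTIME >= '2018-10-01'
  AND t1.SERVERTIME <  '2018-10-01' + INTERVAL 3 MONTH
  AND t1.SERVICE NOT LIKE 'credentialtest%'
  AND t1.SERVICE NOT LIKE 'webupdate%';
```

As with NOT IN, rows where SERVICE is NULL are filtered out by these tests.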
Focussing on the date range, MySQL basically has two options :
read the complete table consecutively and throw away records that do not fit the date range
use the index to identify the records in the date range and then look up each record in the table (using the primary key) individually ("random access")
Consecutive reads are significantly faster than random access, but you need to read more data. There will be some break-even point at which using an index becomes slower than simply reading everything, and MySQL assumes this is the case here. Whether that is the right choice will largely depend on how correctly it guessed how many records are actually in the range. If you make the range smaller, it should actually use the index at some point.
If you know that (or want to test if) using the index is faster, you can force MySQL to use it with
... FROM TAS_USAGE t1 force index (last_quarter) LEFT JOIN ...
You should test it with different ranges, and if you generate your query dynamically, only force the index when you are decently certain (as MySQL will not correct you if you e.g. specify a range that would include all rows).
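One way to run that test is to compare the plans for the same range with and without the hint. This is only a sketch: the one-week range is an arbitrary example, and the joins and other filters from the question are left out to keep it short.

```sql
-- Let the optimizer choose (narrow range picked arbitrarily for illustration):
EXPLAIN
SELECT DISTINCT t1.COMPANYID, t1.USERID, t1.MACHINEID
FROM TAS_USAGE t1
WHERE t1.SERVERTIME >= '2018-10-01'
  AND t1.SERVERTIME <  '2018-10-01' + INTERVAL 7 DAY;

-- Same query with the index forced:
EXPLAIN
SELECT DISTINCT t1.COMPANYID, t1.USERID, t1.MACHINEID
FROM TAS_USAGE t1 FORCE INDEX (last_quarter)
WHERE t1.SERVERTIME >= '2018-10-01'
  AND t1.SERVERTIME <  '2018-10-01' + INTERVAL 7 DAY;
```

Compare the key and rows columns of both plans, then time the real queries, since EXPLAIN only shows estimates.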
There is one important way around the slow random access to the table, although it unfortunately does not work with your prefixed index, but I mention it in case you can reduce your field sizes (or change them to lookups/enums). You can include every column that MySQL needs to evaluate the query by using a covering index:
An index that includes all the columns retrieved by a query. Instead of using the index values as pointers to find the full table rows, the query returns values from the index structure, saving disk I/O.
As mentioned, since in a prefixed index, part of the data is missing, those columns unfortunately cannot be used to cover though.
Actually, they also cannot be used for much at all, especially not to filter records before doing the random access, because evaluating your where-condition for RUNSTATUS or SERVICE requires the complete value anyway. So you could check if e.g. RUNSTATUS is very significant (maybe 99% of your records are in status 'Failed') and in that case add an unprefixed index on just (SERVERTIME, RUNSTATUS) (and MySQL might even pick that index then on its own).
The distinct clause is the one that interferes with the index usage. Since the index cannot be used to help with the distinct, mysql decided against the use of index completely.
If you rearrange the order of fields in the select list, in the index, and in the where clause, mysql may decide to use it:
ALTER TABLE TAS_USAGE ADD INDEX last_quarter (COMPANYID(20),MACHINEID(20), SERVERTIME, SERVICE(50),RUNSTATUS(10));
SELECT DISTINCT t1.COMPANYID, t1.MACHINEID, t1.USERID FROM TAS_USAGE t1
LEFT JOIN TAS_INVALID_COMPANY INVL ON INVL.COMPANYID = t1.COMPANYID
LEFT JOIN TAS_INVALID_MACHINE INVL_MAC_ID ON INVL_MAC_ID.MACHINEID = t1.MACHINEID
WHERE
INVL.companyId IS NULL AND INVL_MAC_ID.machineId IS NULL AND
t1.SERVERTIME >= '2018-10-01 00:00:00' AND t1.SERVERTIME <= '2018-12-31 00:00:00' AND
t1.SERVICE NOT IN ('credentialtest%', 'webupdate%') AND
t1.RUNSTATUS NOT IN ('Failed', 'Failed Failed', 'Failed Success', 'Success Failed', '');
This way the COMPANYID and MACHINEID fields become the leftmost fields in the distinct, where, and index, although the prefix may still cause the index to be discarded. You may want to consider reducing your varchar(255) fields.

query becomes slow with GROUP BY

I have spent 4 hours googling and trying all sorts of indexes, mysqlyog, reading, searching etc. When I add the GROUP BY the query changes from 0.002 seconds to 0.093 seconds. Is this normal and acceptable? Or can I alter the indexes and/or the query?
Table:
uniqueid int(11) NO PRI NULL auto_increment
ip varchar(64) YES NULL
lang varchar(16) YES MUL NULL
timestamp int(11) YES MUL NULL
correct decimal(12,2) YES NULL
user varchar(32) YES NULL
timestart int(11) YES NULL
timeend int(11) YES NULL
speaker varchar(64) YES NULL
postedAnswer int(32) YES NULL
correctAnswerINT int(32) YES NULL
Query:
SELECT
SQL_NO_CACHE
user,
lang,
COUNT(*) AS total,
SUM(correct) AS correct,
ROUND(SUM(correct) / COUNT(*) * 100) AS score,
TIMESTAMP
FROM
maths_score
WHERE TIMESTAMP > 1
AND lang = 'es'
GROUP BY USER
ORDER BY (
(SUM(correct) / COUNT(*) * 100) + SUM(correct)
) DESC
LIMIT 500
explain extended:
id select_type table type possible_keys key key_len ref rows filtered Extra
------ ----------- ----------- ------ ------------------------- -------------- ------- ------ ------ -------- ---------------------------------------------------------------------
1 SIMPLE maths_score ref scoretable,fulltablething fulltablething 51 const 10631 100.00 Using index condition; Using where; Using temporary; Using filesort
Current indexes (I have tried many)
Keyname Type Unique Packed Column Cardinality Collation Null Comment
uniqueid BTREE Yes No uniqueid 21262 A No
scoretable BTREE No No timestamp 21262 A Yes
lang 21262 A Yes
fulltablething BTREE No No lang 56 A Yes
timestamp 21262 A Yes
user 21262 A Yes
Please use SHOW CREATE TABLE; it is more descriptive than DESCRIBE.
Do you have INDEX(lang, TIMESTAMP)? (Why.) It is likely to help both versions of the query.
Without the GROUP BY, you get one row, correct? With the GROUP BY, you get many rows, correct? Guess what, it takes more time to deliver more rows.
In addition, the GROUP BY probably involves an extra sort. The ORDER BY involves a sort, but in one case there is only 1 row to sort, hence faster. If there are a million USERs, then the ORDER BY will need to sort a million rows, only to deliver 500.
Please provide EXPLAIN SELECT ... for each case -- you will see some of what I am saying.
So you ran the query without GROUP BY and got one result row in 0.002 secs. Then you added GROUP BY (and ORDER BY obviously) and ended up with multiple result rows in 0.093 secs.
In order to produce this result, the DBMS must somehow order your records by user or create buckets per user, so as to get record count, sum, etc. per user. This of course takes much more time than just running through the table, counting records and summing up a value unconditionally. Finally, the DBMS must sort these results again. I am not surprised this runs much longer.
The most appropriate index for this query should be:
create index idx on maths_score (lang, timestamp, user, correct);
This is a covering index, starting with the columns in WHERE, continuing with the column in GROUP BY and ending with all other columns used in the query.
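To check whether the index is actually used as a covering index, you could run EXPLAIN on a trimmed version of the query. This is a sketch that assumes the index above has been created and leaves out the ORDER BY and LIMIT; "Using index" in the Extra column is what indicates the query is answered from the index alone.

```sql
-- Assumes: create index idx on maths_score (lang, timestamp, user, correct);
EXPLAIN
SELECT user, lang, COUNT(*) AS total, SUM(correct) AS correct
FROM maths_score
WHERE timestamp > 1
  AND lang = 'es'
GROUP BY user;
```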

Two SQL Queries - Performance difference?

I'm using MySQL with PDO in PHP and I have a SQL query, which works as expected. However, I care about performance and would like to know if I could improve my query. I'm also asking, because I want to gain some more background knowledge of SQL.
Let's say I have two tables that have a few equal fields (and some additional information, which are different in each table):
table `blog_comments`: id, userid (int) | timestamp (int) | content (varchar) | other
table `projects_comments`: id, userid (int) | timestamp (int) | content (varchar) | other
The field id is the primary key, userid + timestamp have an index in both tables, and timestamp is simply the unixtime with the length of 10 (integer).
As a simple spam protection, I block a user from submitting a new comment (no matter if blog, project or anything else) until 60 seconds have passed since his last comment. To achieve this, I get the latest timestamp of that user from all the comments tables.
This is my working query:
SELECT MAX(`last_timestamp`) AS `last_timestamp`
FROM
(
SELECT `userid`, max(`timestamp`) AS `last_timestamp`
FROM `blog_comments`
GROUP BY `userid`
UNION ALL
SELECT `userid`, max(`timestamp`) as `last_timestamp`
FROM `projects_comments`
GROUP BY `userid`
) AS `subquery`
WHERE `userid` = 1
LIMIT 0, 1;
As you can notice, I use GROUP BY inside the subqueries, and in the main query I simply filter the userid (in this case: 1). The advantage: I just need to pass the userid once as a parameter.
Now, I am interested in how SQL exactly works. I think it will be like this: SQL first performs the subqueries, groups all the existing rows by userid and returns the whole set to the main query, which then applies the where clause to find the required userid. This seems like a big waste of performance to me.
So I thought about slightly changing the query:
SELECT max(`last_timestamp`) AS `last_timestamp`
FROM
(
SELECT max(`timestamp`) AS `last_timestamp`
FROM `blog_comments`
WHERE `userid` = 1
UNION ALL
SELECT max(`timestamp`) as `last_timestamp`
FROM `projects_comments`
WHERE `userid` = 1
) AS `subquery`
LIMIT 0, 1
Now I have to pass the userid twice, and still the whole set of rows will be looked up for the given userid. I am not sure if this really improves the performance.
I don't have a large amount of data yet to really test it; maybe I will do some test scenarios later. I would be really interested in knowing whether there would be a difference once there are many rows in those tables.
Would appreciate any ideas, information and tips, thanks in advance.
Edit:
MySQL explain of the first query:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 4 Using where
2 DERIVED blog_comments range NULL userid 8 NULL 10 Using index for group-by
3 UNION projects_comments index NULL userid 12 NULL 6 Using index
NULL UNION RESULT <union2,3> ALL NULL NULL NULL NULL NULL
MySQL explain of the second query:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 2
2 DERIVED NULL NULL NULL NULL NULL NULL NULL Select tables optimized away
3 UNION NULL NULL NULL NULL NULL NULL NULL Select tables optimized away
NULL UNION RESULT <union2,3> ALL NULL NULL NULL NULL NULL
As an alternative approach...
SELECT 'It''s been more than 1 minute since your last post' As result
FROM DUAL
WHERE NOT EXISTS (
SELECT *
FROM blog_comments
WHERE userid = 1
-- timestamp is an integer unix time, so compare against UNIX_TIMESTAMP()
AND timestamp > UNIX_TIMESTAMP() - 60
)
AND NOT EXISTS (
SELECT *
FROM projects_comments
WHERE userid = 1
AND timestamp > UNIX_TIMESTAMP() - 60
)
There will be a result if userid = 1 hasn't got a timestamped record within the last minute in either table.
You can also swap the logic around...
SELECT 'You''re not allowed to post just yet...' As result
FROM DUAL
WHERE EXISTS (
SELECT *
FROM blog_comments
WHERE userid = 1
-- timestamp is an integer unix time, so compare against UNIX_TIMESTAMP()
AND timestamp > UNIX_TIMESTAMP() - 60
)
OR EXISTS (
SELECT *
FROM projects_comments
WHERE userid = 1
AND timestamp > UNIX_TIMESTAMP() - 60
)
This second option will probably be more efficient (EXISTS vs NOT EXISTS) but that's for you to test and prove ;)
The answer to your question is that the second should perform better in MySQL than the first, for exactly the reason you gave. MySQL will run the full group by on all the data and then select the one group.
You can see the difference in execution paths by putting an explain in front of the query. That will give you some idea of what the query is really doing.
If you have an index on (userid, timestamp), then the second query will run quite fast, only using the index. Even without an index, the second query would do a full table scan of the two tables -- and that is it. The first will do a full table scan and a file sort for the aggregation, so the first takes longer.
If you wanted to pass in the userid only once, you could do something like:
select coalesce(greatest(bc_last_timestamp, pc_last_timestamp),
bc_last_timestamp, pc_last_timestamp
)
from (select (SELECT max(`timestamp`) FROM `blog_comments` bc where bc.userid = const.userid
) bc_last_timestamp,
(SELECT max(`timestamp`) FROM `projects_comments` pc where pc.userid = const.userid
) pc_last_timestamp
from (select 1 as userid) const
) t;
The query looks arcane but it should optimize similarly to your second one.
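For the 60-second rule itself, the comparison can also be done in SQL on top of the second query from the question. This is a sketch: the must_wait alias is made up, and it relies on timestamp being an integer unix time as described in the question.

```sql
-- must_wait is a hypothetical alias: 1 if the user posted within the last 60 seconds, else 0.
-- COALESCE covers the case where the user has no comments in either table.
SELECT COALESCE(MAX(last_timestamp), 0) > UNIX_TIMESTAMP() - 60 AS must_wait
FROM
(
  SELECT MAX(`timestamp`) AS last_timestamp
  FROM blog_comments
  WHERE userid = 1
  UNION ALL
  SELECT MAX(`timestamp`) AS last_timestamp
  FROM projects_comments
  WHERE userid = 1
) AS subquery;
```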

Removing "Using temporary; Using filesort" from this MySQL select+join+group by

I have the following query:
select
t.Chunk as LeftChunk,
t.ChunkHash as LeftChunkHash,
q.Chunk as RightChunk,
q.ChunkHash as RightChunkHash,
count(t.ChunkHash) as ChunkCount
from
chunks as t
join
chunks as q
on
t.ID = q.ID
group by LeftChunkHash, RightChunkHash
And the following explain table:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t ALL IDIndex NULL NULL NULL 17796190 "Using temporary; Using filesort"
1 SIMPLE q ref IDIndex IDIndex 4 sotero.t.Id 12
note the "using temporary; using filesort".
When this query is run, I quickly run out of RAM (presumably b/c of the temp table), and then the HDD kicks in, and the query slows to a halt.
I thought it might be an index issue, so I started adding a few that sort of made sense:
Table Non_unique Key_name Seq_in_index Column_name Collation Cardinality Sub_part Packed Null Index_type Comment Index_comment
chunks 0 PRIMARY 1 ChunkId A 17796190 NULL NULL BTREE
chunks 1 ChunkHashIndex 1 ChunkHash A 243783 NULL NULL BTREE
chunks 1 IDIndex 1 Id A 1483015 NULL NULL BTREE
chunks 1 ChunkIndex 1 Chunk A 243783 NULL NULL BTREE
chunks 1 ChunkTypeIndex 1 ChunkType A 2 NULL NULL BTREE
chunks 1 chunkHashByChunkIDIndex 1 ChunkHash A 243783 NULL NULL BTREE
chunks 1 chunkHashByChunkIDIndex 2 ChunkId A 17796190 NULL NULL BTREE
chunks 1 chunkHashByChunkTypeIndex 1 ChunkHash A 243783 NULL NULL BTREE
chunks 1 chunkHashByChunkTypeIndex 2 ChunkType A 261708 NULL NULL BTREE
chunks 1 chunkHashByIDIndex 1 ChunkHash A 243783 NULL NULL BTREE
chunks 1 chunkHashByIDIndex 2 Id A 17796190 NULL NULL BTREE
But still using the temporary table.
The db engine is MyISAM.
How can I get rid of the using temporary; using filesort in this query?
Just changing to InnoDB w/o explaining the underlying cause is not a particularly satisfying answer. Besides, if the solution is to just add the proper index, then that's much easier than migrating to another db engine.
I am new to relational databases. So I'm hoping that the solution is something obvious to the experts.
EDIT1:
ID is not the primary key. ChunkID is. There are approximately 40 ChunkIDs for each ID. So adding an additional ID to the table adds about 40 rows. Each unique chunk has a unique chunkHash associated with it.
EDIT2:
Here's the schema:
Field Type Null Key Default Extra
ChunkId int(11) NO PRI NULL
ChunkHash int(11) NO MUL NULL
Id int(11) NO MUL NULL
Chunk varchar(255) NO MUL NULL
ChunkType varchar(255) NO MUL NULL
EDIT 3:
The end objective of the query is to create a table of word co-occurrences across documents. ChunkIDs are word instances. Each instance is a word that is associated with a particular document (ID). About 40 words present per document. About 1 million documents. So the resulting table of co-occurrences is highly compressed compared to the full cross-product temporary table that is (apparently) being created. That is, the full cross-product temp table is 1 mil * 40 * 40 = 1.6 billion rows. The compressed resulting table is estimated at about 40 million rows.
EDIT 4:
Adding postgresql tag to see if any postgresql users can get a better execution plan on that SQL implementation. If that's the case, I'll switch over.
How about summarizing the table before the join?
The summary might be:
select count(*) count,
Chunk,
ChunkHash
from chunks
group by Chunk, ChunkHash
Then the join would be:
Select r.Chunk as RightChunk,
r.ChunkHash as RightChunkHash,
l.Chunk as LeftChunk,
l.ChunkHash as LeftChunkHash,
sum(l.Count) + sum(r.Count) as Count
from (
select count(*) count,
Chunk,
ChunkHash
from chunks
group by Chunk, ChunkHash
) l
join (
select count(*) count,
Chunk,
ChunkHash
from chunks
group by Chunk, ChunkHash
) r on l.Chunk = r.Chunk
group by r.Chunk, r.ChunkHash, l.Chunk, l.ChunkHash
The thing I'm not sure about is what you're counting, exactly. So my SUM() + SUM() is a guess. You might want SUM() * SUM().
Also, I'm assuming that two Chunk values are equal if and only if ChunkHash values are equal.
Updated with a query that produces the same results. It won't be any faster though.
Create Index IX_ID On Chunks (ID);
Select
LeftChunk,
LeftChunkHash,
RightChunk,
RightChunkHash,
Sum(ChunkCount)
From (
Select
t.Chunk as LeftChunk,
t.ChunkHash as LeftChunkHash,
q.Chunk as RightChunk,
q.ChunkHash as RightChunkHash,
count(t.ChunkHash) as ChunkCount
From
chunks as t
inner join
chunks as q
on t.ID = q.ID
Group By
t.ID,
t.ChunkHash,
q.ChunkHash
) x
Group By
LeftChunk,
LeftChunkHash,
RightChunk,
RightChunkHash
Fiddle with example test data http://sqlfiddle.com/#!3/ea1a5/2
Latest Fiddle, with the problem reformulated as words and documents: http://sqlfiddle.com/#!3/f5aef/12
With the problem reformulated as documents and words, how many documents do you have, how many words, and how many document words?
Also, using the documents and words analogy, would you say your query is "For all pairs of words that appear in a document together, how often do they appear together in any document. If word A appears n times in a document and word B m times in the same document, then this counts as n * m times in the total."
I migrated from MySQL to PostgreSQL, and query execution time went from ~1.5 days to ~10 mins.
Here's the PostgreSQL query execution plan:
I am no longer using MySQL.

Create index to optimize slow query

There is a query that takes too long on a 250,000-row table. I need to speed it up:
create table occurrence (
occurrence_id int(11) primary key auto_increment,
client_id varchar(16) not null,
occurrence_cod varchar(50) not null,
entry_date datetime not null,
zone varchar(8) null default null
)
;
insert into occurrence (client_id, occurrence_cod, entry_date, zone)
values
('1116', 'E401', '2011-03-28 18:44', '004'),
('1116', 'R401', '2011-03-28 17:44', '004'),
('1116', 'E401', '2011-03-28 16:44', '004'),
('1338', 'R401', '2011-03-28 14:32', '001')
;
select client_id, occurrence_cod, entry_date, zone
from occurrence o
where
occurrence_cod = 'E401'
and
entry_date = (
select max(entry_date)
from occurrence
where client_id = o.client_id
)
;
+-----------+----------------+---------------------+------+
| client_id | occurrence_cod | entry_date | zone |
+-----------+----------------+---------------------+------+
| 1116 | E401 | 2011-03-28 16:44:00 | 004 |
+-----------+----------------+---------------------+------+
1 row in set (0.00 sec)
The table structure is from a commercial application and can not be altered.
What would be the best index(es) to optimize it? Or a better query?
EDIT:
It is the last occurrence of the E401 code for each client and only if the last occurrence is that code.
The ideal indexes for such a query would be:
index #1: [client_id] + [entry_date]
index #2: [occurrence_cod] + [entry_date]
Nevertheless, those indexes can be simplified depending on the characteristics of the data. This saves file space, and also time when data are updated (insert/delete/update).
If there is rarely more than one "occurrence" record for each [client_id], then index #1 can be only [client_id].
In the same way, if there is rarely more than one "occurrence" record for each [occurrence_cod], then index #2 can be only [occurrence_cod].
It may be more useful to turn index #2 into [entry_date] + [occurrence_cod]. This will enable you to use the index for criteria that are only on [entry_date].
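As a sketch, the two indexes could be created like this. The index names are made up for illustration, and this assumes that adding indexes is allowed even though the table's columns cannot be changed.

```sql
-- Index names are hypothetical.
-- index #1: lets the correlated subquery find max(entry_date) per client
CREATE INDEX idx_client_entry ON occurrence (client_id, entry_date);
-- index #2: supports the filter on occurrence_cod
CREATE INDEX idx_cod_entry ON occurrence (occurrence_cod, entry_date);
```

Running EXPLAIN on the original query before and after should show whether either index is picked up.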
Unless you are truly trying to get the row with the max date, if and only if the occurrence_cod matches, this should work:
select client_id, occurrence_cod, entry_date, zone
from occurrence o
where occurrence_cod = 'E401'
ORDER BY entry_date DESC
LIMIT 1;
It will return the most recent row with occurrence_cod='E401'
select
a.client_id,
a.occurrence_cod,
a.entry_date,
a.zone
from occurrence a
inner join (
-- latest entry_date per client
select client_id, max(entry_date) as entry_date
from occurrence
group by client_id
) as b
on
a.client_id = b.client_id and
a.entry_date = b.entry_date
where
a.occurrence_cod = 'E401'
Using this approach you're avoiding the subselect per row, and it should be faster to compare two big sets of data than a big set of data for each row of the set.
I'd re-write the query:
select client_id, occurrence_cod, max(entry_date), zone
from occurrence
group by client_id, occurrence_cod, zone;
(assuming the other lines are indeed identical, and entry date is the only thing that changes).
Did you try putting an index on occurrence_cod?
Try this if other approaches are not available.
Create a new table: last_occurrence.
Every time a new occurrence is recorded for a client, update the corresponding row in this last_occurrence table.
By doing this, you just need the following SQL to get your result :)
select * from last_occurrence
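A rough sketch of what that could look like follows. The column layout of last_occurrence is an assumption (one row per client, mirroring the occurrence columns), and how the upsert gets triggered (application code, a trigger, or a periodic job) is left open.

```sql
-- Hypothetical summary table: one row per client holding its latest occurrence.
CREATE TABLE last_occurrence (
  client_id      VARCHAR(16) NOT NULL PRIMARY KEY,
  occurrence_cod VARCHAR(50) NOT NULL,
  entry_date     DATETIME    NOT NULL,
  zone           VARCHAR(8)  NULL
);

-- Upsert for each new occurrence; only overwrites when the incoming row is not older.
-- entry_date is assigned last because assignments are evaluated left to right.
-- (VALUES() here is deprecated in recent MySQL 8.0 releases but still accepted.)
INSERT INTO last_occurrence (client_id, occurrence_cod, entry_date, zone)
VALUES ('1116', 'E401', '2011-03-28 18:44', '004')
ON DUPLICATE KEY UPDATE
  occurrence_cod = IF(VALUES(entry_date) >= entry_date, VALUES(occurrence_cod), occurrence_cod),
  zone           = IF(VALUES(entry_date) >= entry_date, VALUES(zone), zone),
  entry_date     = GREATEST(entry_date, VALUES(entry_date));

-- Clients whose latest occurrence is E401:
SELECT client_id, occurrence_cod, entry_date, zone
FROM last_occurrence
WHERE occurrence_cod = 'E401';
```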