I have a somewhat complex (to me) query where I am joining three tables. I have been steadily trying to optimize it, reading how to improve things by looking at the EXPLAIN output.
One of the tables person_deliveries is growing by one to two million records per day, so the query is taking longer and longer due to my poor optimization. Any insight would be GREATLY appreciated.
Here is the query:
SELECT
DATE(pdel.date) AS date,
pdel.ip_address AS ip_address,
pdel.sending_campaigns_id AS campaigns_id,
(substring_index(pe.email, '#', -1)) AS recipient_domain,
COUNT(DISTINCT(concat(pdel.emails_id, pdel.date))) AS deliveries,
COUNT(CASE WHEN pdel.ip_address = pc.ip_address AND pdel.sending_campaigns_id = pc.campaigns_id AND pdel.emails_id = pc.emails_id THEN pdel.emails_id ELSE NULL END) AS complaints
FROM
person_deliveries pdel
LEFT JOIN person_complaints pc on pc.ip_address = pdel.ip_address
LEFT JOIN person_emails pe ON pe.id = pdel.emails_id
WHERE
(pdel.date >= '2022-03-11' AND pdel.date <= '2022-03-12')
AND pe.id IS NOT NULL
AND pdel.ip_address is NOT NULL
GROUP BY date(pdel.date), pdel.ip_address, pdel.sending_campaigns_id
ORDER BY date(pdel.date), INET_ATON(pdel.ip_address), pdel.sending_campaigns_id ASC ;
Here is the output of EXPLAIN:
+----+-------------+-------+------------+--------+------------------------------------------------+------------+---------+----------------------------+---------+----------+---------------------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+------------------------------------------------+------------+---------+----------------------------+---------+----------+---------------------------------------------------------------------+
| 1 | SIMPLE | pdel | NULL | range | person_campaign_date,ip_address,date,emails_id | date | 5 | NULL | 2333678 | 50.00 | Using index condition; Using where; Using temporary; Using filesort |
| 1 | SIMPLE | pe | NULL | eq_ref | PRIMARY | PRIMARY | 4 | subscriber.pdel.emails_id | 1 | 100.00 | NULL |
| 1 | SIMPLE | pc | NULL | ref | ip_address | ip_address | 18 | subscriber.pdel.ip_address | 128 | 100.00 | NULL |
+----+-------------+-------+------------+--------+------------------------------------------------+------------+---------+----------------------------+---------+----------+---------------------------------------------------------------------+
I added a few indexes to get it to this point, but the query still takes an extraordinary amount of resources/time to process.
I know I am missing something here, either an index or a function that is causing it to be slow, but from everything I have read I haven't figured it out yet.
UPDATE:
I neglected to include table info, so I am providing that to be more helpful.
person_deliveries:
CREATE TABLE `person_deliveries` (
`emails_id` int unsigned NOT NULL,
`sending_campaigns_id` int NOT NULL,
`date` datetime NOT NULL,
`vmta` varchar(255) DEFAULT NULL,
`ip_address` varchar(15) DEFAULT NULL,
`sending_domain` varchar(255) DEFAULT NULL,
UNIQUE KEY `person_campaign_date` (`emails_id`,`sending_campaigns_id`,`date`),
KEY `ip_address` (`ip_address`),
KEY `sending_domain` (`sending_domain`),
KEY `sending_campaigns_id` (`sending_campaigns_id`),
KEY `date` (`date`),
KEY `emails_id` (`emails_id`)
)
person_complaints:
CREATE TABLE `person_complaints` (
`emails_id` int unsigned NOT NULL,
`campaigns_id` int unsigned NOT NULL,
`complaint_datetime` datetime DEFAULT NULL,
`added_datetime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`ip_address` varchar(15) DEFAULT NULL,
`networks_id` int DEFAULT NULL,
`mailing_domains_id` int DEFAULT NULL,
UNIQUE KEY `email_campaign_date` (`emails_id`,`campaigns_id`,`complaint_datetime`),
KEY `ip_address` (`ip_address`)
)
person_emails:
CREATE TABLE `person_emails` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
`data_providers_id` tinyint unsigned DEFAULT NULL,
`email` varchar(255) NOT NULL,
`email_md5` varchar(255) DEFAULT NULL,
`original_import` timestamp NULL DEFAULT NULL,
`last_import` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
UNIQUE KEY `email` (`email`),
KEY `data_providers_id` (`data_providers_id`),
KEY `email_md5` (`email_md5`)
)
Hopefully this extra info helps.
There were too many questions to fit in comments, so I'll answer here.
It appears that for the date criteria you are only pulling a SINGLE date. Is this always the case, or just in this sample? Is pdel.date stored as a date or a date/time? Your query is doing >= '2022-03-11' AND <= '2022-03-12'. Is this because you are trying to get up to and including 2022-03-11 at 11:59:59pm? If so, should it be LESS THAN 03-12?
If your counts are on a single-day basis, and this data is rather fixed (that is, you are not going to be changing deliveries, etc. on a day that has already passed), this might be a candidate for a stored aggregate table that is built on a daily basis. That way, when you are looking for activity patterns, the non-changing aggregates are already computed and you can query just those. If you need the details, go back to the raw data.
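As a sketch of that idea (the summary table and its columns are hypothetical, modeled on the query in the question), a nightly job could roll up the previous day's deliveries:

```sql
CREATE TABLE daily_delivery_summary (
  date DATE NOT NULL,
  ip_address VARCHAR(15) NOT NULL,
  sending_campaigns_id INT NOT NULL,
  deliveries INT UNSIGNED NOT NULL,
  PRIMARY KEY (date, ip_address, sending_campaigns_id)
);

-- Run once per day, for the day that just closed:
INSERT INTO daily_delivery_summary
SELECT DATE(date), ip_address, sending_campaigns_id,
       COUNT(DISTINCT emails_id, date)
FROM person_deliveries
WHERE date >= CURDATE() - INTERVAL 1 DAY
  AND date <  CURDATE()
  AND ip_address IS NOT NULL
GROUP BY DATE(date), ip_address, sending_campaigns_id;
```

Reports over past days then read the small summary table instead of scanning millions of delivery rows.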
These indexes are "covering", which should help some:
pdel: INDEX(date, ip_address, sending_campaigns_id, emails_id)
pc: INDEX(ip_address, campaigns_id, emails_id)
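In DDL form, those would be something like (index names are placeholders):

```sql
ALTER TABLE person_deliveries
  ADD INDEX pdel_covering (date, ip_address, sending_campaigns_id, emails_id);

ALTER TABLE person_complaints
  ADD INDEX pc_covering (ip_address, campaigns_id, emails_id);
```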
Assuming date is a DATETIME, this contains an extra midnight:
WHERE pdel.date >= '2022-03-11'
AND pdel.date <= '2022-03-12'
I prefer the pattern:
WHERE pdel.date >= '2022-03-11'
AND pdel.date < '2022-03-11' + INTERVAL 1 DAY
When the GROUP BY and ORDER BY are different, an extra sort is (usually) required. So, write the GROUP BY to be just like the ORDER BY (after removing "ASC").
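For example, the two clauses could be made to match (assuming grouping by the numeric form of the IP, which groups identically for well-formed addresses):

```sql
GROUP BY DATE(pdel.date), INET_ATON(pdel.ip_address), pdel.sending_campaigns_id
ORDER BY DATE(pdel.date), INET_ATON(pdel.ip_address), pdel.sending_campaigns_id;
```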
A minor simplification (and speedup):
COUNT(DISTINCT(concat(pdel.emails_id, pdel.date))) AS deliveries,
-->
COUNT(DISTINCT pdel.emails_id, pdel.date) AS deliveries,
Consider storing the numeric version of the IPv4 in INT UNSIGNED (only 4 bytes) instead of a VARCHAR. It will be smaller and you can eliminate some conversions, but will add an INET_NTOA in the SELECT.
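A sketch of that conversion (the new column name is illustrative):

```sql
-- Store the packed numeric form; 4 bytes instead of up to 15 characters.
ALTER TABLE person_deliveries
  ADD COLUMN ip_num INT UNSIGNED;

UPDATE person_deliveries SET ip_num = INET_ATON(ip_address);

-- The ORDER BY no longer needs a conversion, but SELECTing the
-- human-readable form does:
SELECT INET_NTOA(ip_num) AS ip_address FROM person_deliveries LIMIT 1;
```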
The COUNT(CASE ... ) can be simplified to
SUM( pdel.ip_address = pc.ip_address
AND pdel.sending_campaigns_id = pc.campaigns_id
AND pdel.emails_id = pc.emails_id ) AS complaints
In
(substring_index(pe.email, '#', -1)) AS recipient_domain,
I think it should be 1, not -1; otherwise the alias is 'wrong'.
Please change LEFT JOIN pe ... WHERE pe.id IS NOT NULL to the equivalent, but simpler, JOIN pe without the null test.
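The rewritten FROM clause would look like this (other parts of the query unchanged):

```sql
FROM person_deliveries pdel
JOIN person_emails pe ON pe.id = pdel.emails_id   -- inner join; no IS NOT NULL test needed
LEFT JOIN person_complaints pc ON pc.ip_address = pdel.ip_address
```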
Sorry, but those will not provide a huge performance improvement. The next step would be to build and maintain a Summary Table and use it to generate the desired 'report'. (See DRapp's Answer.)
Related
I have a rather large database where I would like to search/filter on a MEDIUMTEXT (tags), DATETIME (created_time) and a BIT (include) column.
Let's say the database looks like this:
+------+-----------------------+--------------------------+---------+
| id | created_time | tags | include |
|(INT) | (DATETIME) | (MEDIUMTEXT) | (BIT) |
+------+-----------------------+--------------------------+---------+
| 1 | '2017-02-20 08:58:06' | 'client 1' | 1 |
| 2 | '2017-03-01 18:12:00' | 'client 1 and client 2' | 0 |
| 3 | '2017-03-02 02:52:35' | 'client 3 plus client 1' | 0 |
| 4 | '2017-03-03 12:41:58' | 'client 1' | 1 |
| 5 | '2017-03-05 18:03:12' | 'client 2, client 3' | 1 |
| 6 | '2017-03-06 20:25:45' | 'client 1 and client 3' | 0 |
| 7 | '2017-03-08 22:51:22' | 'client 1' | 1 |
+------+-----------------------+--------------------------+---------+
I have indexed the DATETIME and BIT columns and I have used a FULLTEXT index on the MEDIUMTEXT column.
If I run this statement:
select statement 1
------------------
SELECT COUNT(*)
FROM database
WHERE (MATCH(tags) AGAINST('"client 1"' IN BOOLEAN MODE))
AND created_time >= '2017-03-01 12:00:00'
AND include = 0;
It takes 14 sec. to run and returns 6,700 rows.
However, if I run:
select statement 2
------------------
SELECT COUNT(*)
FROM database
WHERE (MATCH(tags) AGAINST('"client 1"' IN BOOLEAN MODE));
It takes 0.4 sec. to run and returns 145,000 rows, and if I run:
select statement 3
------------------
SELECT COUNT(*)
FROM database
WHERE created_time >= '2017-03-01 12:00:00'
AND include = 0;
It takes 0.5 sec. to run and returns 25,000 rows.
Now my question is: how do I make 'select statement 1' run faster? Do I need to first run 'select statement 2' and then run 'select statement 3' on the results? If so, how? Does anyone have experience with UNION, and can I use it here? Or is there a way to create a multiple-column index combining an ordinary INDEX and a FULLTEXT index?
Added info on the actual table (not the example above), with special thanks to #rick-james.
Query 1:
SELECT SQL_NO_CACHE count(*)
FROM Twitter_tweet
WHERE created_time >= '2017-01-01 23:00:00'
AND MATCH(tags) AGAINST('\"dkpol\"' IN BOOLEAN MODE);
Query 2:
SELECT SQL_NO_CACHE count(*)
FROM Twitter_tweet
WHERE MATCH(tags) AGAINST('\"dkpol\"' IN BOOLEAN MODE);
Query 3:
SELECT SQL_NO_CACHE count(*)
FROM Twitter_tweet
WHERE created_time >= '2017-01-01 23:00:00';
EXPLAIN for the 3 queries:
+----+-------------+---------------+----------+----------------------------------------------------+--------------------+---------+-------+--------+----------+-----------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+----------+----------------------------------------------------+--------------------+---------+-------+--------+----------+-----------------------------------+
| 1 | SIMPLE | Twitter_tweet | fulltext | created_time_INDEX,SELECT_tags_INDEX,tags_FULLTEXT | tags_FULLTEXT | 0 | const | 1 | 50.00 | Using where; Ft_hints: no_ranking |
+----+-------------+---------------+----------+----------------------------------------------------+--------------------+---------+-------+--------+----------+-----------------------------------+
| 2 | SIMPLE | | | | | | | | | Select tables optimized away |
+----+-------------+---------------+----------+----------------------------------------------------+--------------------+---------+-------+--------+----------+-----------------------------------+
| 3 | SIMPLE | Twitter_tweet | range | created_time_INDEX,SELECT_tags_INDEX | created_time_INDEX | 6 | | 572286 | 100.00 | Using where; Using index |
+----+-------------+---------------+----------+----------------------------------------------------+--------------------+---------+-------+--------+----------+-----------------------------------+
SHOW CREATE TABLE:
CREATE TABLE `Twitter_tweet` (
`post_id` bigint(20) unsigned NOT NULL,
`from_user_id` bigint(20) unsigned NOT NULL,
`from_user_username` tinytext,
`from_user_fullname` tinytext,
`message` mediumtext,
`created_time` datetime DEFAULT NULL,
`quoted_post_id` bigint(20) unsigned DEFAULT NULL,
`quoted_user_id` bigint(20) unsigned DEFAULT NULL,
`quoted_user_username` tinytext,
`quoted_user_fullname` tinytext,
`to_post_id` bigint(20) unsigned DEFAULT NULL,
`to_user_id` bigint(20) unsigned DEFAULT NULL,
`to_user_username` tinytext,
`truncated` bit(1) DEFAULT NULL,
`is_retweet` bit(1) DEFAULT NULL,
`retweeting_post_id` bigint(20) unsigned DEFAULT NULL,
`retweeting_user_id` bigint(20) unsigned DEFAULT NULL,
`retweeting_user_username` tinytext,
`retweeting_user_fullname` tinytext,
`tags` text,
`mentions_user_id` text,
`mentions_user_username` text,
`mentions_user_fullname` text,
`post_urls` text,
`count_favourite` int(11) DEFAULT NULL,
`count_retweet` int(11) DEFAULT NULL,
`lang` tinytext,
`location_longitude` float(13,10) DEFAULT NULL,
`location_latitude` float(13,10) DEFAULT NULL,
`place_id` tinytext,
`place_fullname` tinytext,
`source` tinytext,
`fetchtime` datetime DEFAULT NULL,
PRIMARY KEY (`post_id`),
UNIQUE KEY `post_id_UNIQUE` (`post_id`),
KEY `from_user_id_INDEX` (`from_user_id`),
KEY `quoted_user_id_INDEX` (`quoted_user_id`),
KEY `to_user_id_INDEX` (`to_user_id`),
KEY `retweeting_user_id_INDEX` (`retweeting_user_id`),
KEY `created_time_INDEX` (`created_time`),
KEY `retweeting_post_id_INDEX` (`retweeting_post_id`),
KEY `post_all_id_INDEX` (`post_id`,`retweeting_post_id`,`to_post_id`,`quoted_post_id`),
KEY `quoted_post_id_INDEX` (`quoted_post_id`),
KEY `to_post_id_INDEX` (`to_post_id`),
KEY `is_retweet_INDEX` (`is_retweet`),
KEY `SELECT_tags_INDEX` (`created_time`,`is_retweet`,`post_id`),
FULLTEXT KEY `tags_FULLTEXT` (`tags`),
FULLTEXT KEY `mentions_user_id_FULLTEXT` (`mentions_user_id`),
FULLTEXT KEY `message_FULLTEXT` (`message`),
FULLTEXT KEY `content_select` (`tags`,`message`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
When timing, do two things:
Turn off the Query cache (or use SELECT SQL_NO_CACHE ...).
Run the query twice.
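For example (note that the Query cache exists through MySQL 5.7 and was removed in 8.0):

```sql
-- Per query: bypass the cache explicitly.
SELECT SQL_NO_CACHE COUNT(*)
FROM Twitter_tweet
WHERE MATCH(tags) AGAINST('"dkpol"' IN BOOLEAN MODE);

-- Or disable the cache server-wide while testing (5.x only):
SET GLOBAL query_cache_size = 0;
```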
When a query is run, these happen:
Check the QC to see if exactly the same query was recently run; if so, return the result from that run. This usually takes ~1ms. (This is not what happened in the examples you gave.)
Perform the query. Now there are multiple sub-cases:
If the "buffer pool" is 'cold', this is likely to involve lots of I/O. I/O is slow. This may explain your 14 second run.
If the desired data is cached in RAM, then it will run faster. This probably explains why the other two runs were a lot faster.
If, after compensating for these, you still have issues, please provide SHOW CREATE TABLE and EXPLAIN SELECT ... for the cases. (There could be other factors involved.)
Schema critique
One way to improve performance (some) is to shrink the data.
lang tinytext, -- there is a 5 char standard
BIGINT takes 8 bytes. A 4-byte INT is enough for half the people in the world. (But first verify that your AUTO_INCREMENTs are not burning a lot of ids.)
For subtle reasons, VARCHAR(255) is better than TINYTEXT, even though they seem equivalent. Whenever practical, use something less than 255.
FLOAT(13,10) has some issues; I recommend DECIMAL(8,6)/(9,6) as sufficient for distinguishing two tweeters sitting next to each other (not that GPS is that precise).
A PRIMARY KEY is a UNIQUE key; get rid of the redundant UNIQUE.
With INDEX(a, b), you don't also need INDEX(a). (There are at least 2 such cases here.)
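Sketched as DDL against the table shown above (verify each change against your data before running):

```sql
ALTER TABLE Twitter_tweet
  MODIFY lang VARCHAR(5),                 -- 5-char standard language codes
  MODIFY location_longitude DECIMAL(9,6), -- longitude needs the wider range
  MODIFY location_latitude  DECIMAL(8,6),
  DROP INDEX post_id_UNIQUE,              -- redundant with the PRIMARY KEY
  DROP INDEX created_time_INDEX;          -- leading prefix of SELECT_tags_INDEX
```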
Bulk
What will you do with 6,700 or 25,000 rows in the resultset? I ask because the effort of returning lots of rows is part of the performance problem. If your next step is to further whittle down the output, then it may be better to do the whittling in SQL.
Analysis
Looking at the second set of Queries:
FT + date range. This first did the FT search, then further filtered by date.
FT, count results, quit. Note that all of that was done in the EXPLAIN, hence "Select tables optimized away" -- and the EXPLAIN time is the same as the SELECT time.
Scan one index for an estimated 572K rows -- done entirely in the index. This cannot be improved. However, it can be made severely worse -- such as by adding a seemingly innocuous AND include = 0. In this case it would not be able to use just the index, but instead would have to bounce between the index and the data -- a lot more costly. A cure for this case: INDEX(include, created_time), which would run faster.
COUNT(*) is potentially cheap -- no need to return lots of data, often can be completed within an index, etc.
SELECT col1, col2 is faster than SELECT * -- especially because of TEXT columns.
I have a query which purpose is to generate statistics for how many musical work (track) has been downloaded from a site at different periods (by month, by quarter, by year etc). The query operates on the tables entityusage, entityusage_file and track.
To get the number of downloads for tracks belonging to an specific album I would do the following query :
select
date_format(eu.updated, '%Y-%m-%d') as p, count(eu.id) as c
from entityusage as eu
inner join entityusage_file as euf
ON euf.entityusage_id = eu.id
inner join track as t
ON t.id = euf.track_id
where
t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
and entitytype = 't'
and action = 1
group by date_format(eu.updated, '%Y%m%d')
I need to set entitytype = 't' as the entityusage can hold downloads of other entities as well (if entitytype = 'a' then an entire album would have been downloaded, and entityusage_file would then hold all tracks which the album "translated" into at the point of download).
This query takes 40 - 50 seconds. I've been trying to optimize this query for a while, but I have the feeling that I'm approaching this the wrong way.
This is one out of 4 similar queries which must run to generate a report. The report should preferable be able to finish while a user waits for it. Right now, I'm looking at 3 - 4 minutes. That's a long time to wait.
Can this query be optimised further with indexes, or do I need to take another approach to get this job done?
CREATE TABLE `entityusage` (
`id` char(36) NOT NULL,
`title` varchar(255) DEFAULT NULL,
`entitytype` varchar(5) NOT NULL,
`entityid` char(36) NOT NULL,
`externaluser` int(10) NOT NULL,
`action` tinyint(1) NOT NULL,
`updated` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `e` (`entityid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `entityusage_file` (
`id` char(36) NOT NULL,
`entityusage_id` char(36) NOT NULL,
`track_id` char(36) NOT NULL,
`file_id` char(36) NOT NULL,
`type` varchar(3) NOT NULL,
`quality` int(1) NOT NULL,
`size` int(20) NOT NULL,
`updated` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `file_id` (`file_id`),
KEY `entityusage_id` (`entityusage_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `track` (
`id` char(36) NOT NULL,
`album_id` char(36) NOT NULL,
`number` int(3) NOT NULL DEFAULT '0',
`title` varchar(255) DEFAULT NULL,
`updated` datetime NOT NULL DEFAULT '2000-01-01 00:00:00',
PRIMARY KEY (`id`),
KEY `album` (`album_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 CHECKSUM=1 DELAY_KEY_WRITE=1 ROW_FORMAT=DYNAMIC;
An EXPLAIN on the query gives me the following :
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
| 1 | SIMPLE | eu | ALL | NULL | NULL | NULL | NULL | 7832817 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | euf | ref | entityusage_id | entityusage_id | 108 | func | 1 | Using index condition |
| 1 | SIMPLE | t | eq_ref | PRIMARY,album | PRIMARY | 108 | trackerdatabase.euf.track_id | 1 | Using where |
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
This is your query:
select date_format(eu.updated, '%Y-%m-%d') as p, count(eu.id) as c
from entityusage eu join
entityusage_file euf
on euf.entityusage_id = eu.id join
track t
on t.id = euf.track_id
where t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7' and
eu.entitytype = 't' and
eu.action = 1
group by date_format(eu.updated, '%Y%m%d');
I would suggest indexes on track(album_id, id), entityusage_file(track_id, entityusage_id), and entityusage(id, entitytype, action).
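In DDL form, those suggestions are:

```sql
ALTER TABLE track            ADD INDEX (album_id, id);
ALTER TABLE entityusage_file ADD INDEX (track_id, entityusage_id);
ALTER TABLE entityusage      ADD INDEX (id, entitytype, action);
```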
Assuming that entityusage_file is mostly a many:many mapping table, see this for tips on improving it. Note that it calls for getting rid of the id and making a pair of 2-column indexes, one of which is the PRIMARY KEY(track_id, entityusage_id). Since your table has a few extra columns, that link does not cover everything.
The UUIDs could be shrunk from 108 bytes to 36, then to 16 by going to BINARY(16) and using a compression function. Many exist (including a builtin pair in version 8.0); here's mine.
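With the 8.0 builtins, the round trip looks like this (on 5.x, UNHEX(REPLACE(uuid, '-', '')) or a stored function serves the same purpose):

```sql
-- Requires MySQL 8.0+
SELECT UUID_TO_BIN('0054a47e-b594-407b-86df-3be078b4e7b7') AS packed;    -- 16 bytes
SELECT BIN_TO_UUID(UUID_TO_BIN('0054a47e-b594-407b-86df-3be078b4e7b7')) AS unpacked;

-- Converting a column (all referencing columns must change together):
-- ALTER TABLE track MODIFY id BINARY(16) NOT NULL;
```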
To explain one thing... The query execution should have started with track (on the assumption that '0054a47e-b594-407b-86df-3be078b4e7b7' is very selective). The hangup was that there was no index to get from there to the next table. Gordon's suggested indexes include such.
date_format(eu.updated, '%Y-%m-%d') and date_format(eu.updated, '%Y%m%d') can be simplified to DATE(eu.updated). (No significant performance change.)
(The other Answers and Comments cover a number of issues; I won't repeat them here.)
Because the GROUP BY operation is on an expression involving a function, MySQL can't use an index to optimize that operation. It's going to require a "Using filesort" operation.
I believe the indexes that Gordon suggested are the best bets, given the current table definitions. But even with those indexes, the "tall pole" is the eu table: chunking through and sorting all those rows.
To get more reasonable performance, you may need to introduce a "precomputed results" table. It's going to be expensive to generate the counts for everything... but we can pay that price ahead of time...
CREATE TABLE usage_track_by_day
( track_id CHAR(36) NOT NULL
, updated_dt DATE NOT NULL
, cnt INT UNSIGNED NOT NULL
, PRIMARY KEY (track_id, updated_dt)
)
AS
SELECT euf.track_id
     , DATE(eu.updated) AS updated_dt
     , SUM(IF(eu.action = 1, 1, 0)) AS cnt
  FROM entityusage eu
  JOIN entityusage_file euf
    ON euf.entityusage_id = eu.id
 GROUP
    BY euf.track_id
     , DATE(eu.updated)
An index ON entityusage (updated, action), along with one ON entityusage_file (track_id, entityusage_id), may benefit performance. (Note that track_id lives in entityusage_file, not entityusage.)
Then, we could write a query against the new "precomputed results" table, with a better shot at reasonable performance.
The "precomputed results" table would get stale, and would need to be periodically refreshed.
This isn't necessarily the best solution to the issue, but it's a technique we can use in datawarehouse/datamart applications. This lets us churn through lots of detail rows to get counts one time, and then save those counts for fast access.
Can you try this? I can't really test it without some sample data from you.
In this case the query looks first in table track and then joins the other tables.
SELECT
date_format(eu.updated, '%Y-%m-%d') AS p
, count(eu.id) AS c
FROM track AS t
INNER JOIN entityusage_file AS euf ON t.id = euf.track_id
INNER JOIN entityusage AS eu ON euf.entityusage_id = eu.id
WHERE
t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
AND eu.entitytype = 't'
AND eu.action = 1
GROUP BY date_format(eu.updated, '%Y%m%d');
I have a mysql table with following structure:
mysql> show create table logs \G;
Create Table: CREATE TABLE `logs` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`request` text,
`response` longtext,
`msisdn` varchar(255) DEFAULT NULL,
`username` varchar(255) DEFAULT NULL,
`shortcode` varchar(255) DEFAULT NULL,
`response_code` varchar(255) DEFAULT NULL,
`response_description` text,
`transaction_name` varchar(250) DEFAULT NULL,
`system_owner` varchar(250) DEFAULT NULL,
`request_date_time` datetime DEFAULT NULL,
`response_date_time` datetime DEFAULT NULL,
`comments` text,
`user_type` varchar(255) DEFAULT NULL,
`channel` varchar(20) DEFAULT 'WEB',
/**
other columns here....
other 18 columns here, with Type varchar and Text
**/
PRIMARY KEY (`id`),
KEY `transaction_name` (`transaction_name`) USING BTREE,
KEY `msisdn` (`msisdn`) USING BTREE,
KEY `username` (`username`) USING BTREE,
KEY `request_date_time` (`request_date_time`) USING BTREE,
KEY `system_owner` (`system_owner`) USING BTREE,
KEY `shortcode` (`shortcode`) USING BTREE,
KEY `response_code` (`response_code`) USING BTREE,
KEY `channel` (`channel`) USING BTREE,
KEY `request_date_time_2` (`request_date_time`),
KEY `response_date_time` (`response_date_time`)
) ENGINE=InnoDB AUTO_INCREMENT=59582405 DEFAULT CHARSET=utf8
and it has more than 30,000,000 records in it.
mysql> select count(*) from logs;
+----------+
| count(*) |
+----------+
| 38962312 |
+----------+
1 row in set (1 min 17.77 sec)
Now the problem is that it is very slow, the result of select takes ages to fetch records from table.
My following sub query takes almost 30 minutes to fetch records of one day:
SELECT
COUNT(sub.id) AS count,
DATE(sub.REQUEST_DATE_TIME) AS transaction_date,
sub.SYSTEM_OWNER,
sub.transaction_name,
sub.response,
MIN(sub.response_time),
MAX(sub.response_time),
AVG(sub.response_time),
sub.channel
FROM
(SELECT
id,
REQUEST_DATE_TIME,
RESPONSE_DATE_TIME,
TIMESTAMPDIFF(SECOND, REQUEST_DATE_TIME, RESPONSE_DATE_TIME) AS response_time,
SYSTEM_OWNER,
transaction_name,
(CASE
WHEN response_code IN ('0' , '00', 'EIL000') THEN 'Success'
ELSE 'Failure'
END) AS response,
channel
FROM
logs
WHERE
response_code != ''
AND DATE(REQUEST_DATE_TIME) BETWEEN '2016-10-26 00:00:00' AND '2016-10-27 00:00:00'
AND SYSTEM_OWNER != '') sub
GROUP BY DATE(sub.REQUEST_DATE_TIME) , sub.channel , sub.SYSTEM_OWNER , sub.transaction_name , sub.response
ORDER BY DATE(sub.REQUEST_DATE_TIME) DESC , sub.SYSTEM_OWNER , sub.transaction_name , sub.response DESC;
I have also added indexes to my table, but still it is very slow.
Any help on how can I make it fast ?
EDIT:
Ran the above query using EXPLAIN
+----+-------------+------------+------+----------------------------+------+---------+------+----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+----------------------------+------+---------+------+----------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 16053297 | Using temporary; Using filesort |
| 2 | DERIVED | logs | ALL | system_owner,response_code | NULL | NULL | NULL | 32106592 | Using where |
+----+-------------+------------+------+----------------------------+------+---------+------+----------+---------------------------------+
As it stands, the query must scan the entire table.
But first, let's air a possible bug:
AND DATE(REQUEST_DATE_TIME) BETWEEN '2016-10-26 00:00:00'
AND '2016-10-27 00:00:00'
Gives you the logs for two days -- all of the 26th and all of the 27th. Or is that what you really wanted? (BETWEEN is inclusive.)
But the performance problem is that the index will not be used because request_date_time is hiding inside a function (DATE).
Jump forward to a better way to phrase it:
AND REQUEST_DATE_TIME >= '2016-10-26'
AND REQUEST_DATE_TIME < '2016-10-26' + INTERVAL 1 DAY
A DATETIME can be compared against a date.
Midnight of the morning of the 26th is included, but midnight of the 27th is not.
You can easily change 1 to however many days you wish -- without having to deal with leap days, etc.
This formulation allows the use of the index on request_date_time, thereby cutting back severely on amount of data to be scanned.
As for other tempting areas:
!= does not optimize well, so no 'composite' index is likely to be beneficial.
Since we can't really get past the WHERE, no index is useful for GROUP BY or ORDER BY.
My comments about DATE() in WHERE do not apply to GROUP BY; no change needed.
Why have the subquery? I think it can be done in a single layer. This will eliminate a rather large temp table. (Yeah, it means 3 uses of TIMESTAMPDIFF(), but that is probably a lot cheaper than the temp table.)
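A single-layer version might look like this (a sketch, untested against your data; it folds the date-range and CASE rewrites discussed above into one query):

```sql
SELECT COUNT(*) AS count,
       DATE(request_date_time) AS transaction_date,
       system_owner,
       transaction_name,
       IF(response_code IN ('0','00','EIL000'), 'Success', 'Failure') AS response,
       MIN(TIMESTAMPDIFF(SECOND, request_date_time, response_date_time)) AS min_response_time,
       MAX(TIMESTAMPDIFF(SECOND, request_date_time, response_date_time)) AS max_response_time,
       AVG(TIMESTAMPDIFF(SECOND, request_date_time, response_date_time)) AS avg_response_time,
       channel
FROM logs
WHERE response_code != ''
  AND system_owner  != ''
  AND request_date_time >= '2016-10-26'
  AND request_date_time <  '2016-10-26' + INTERVAL 1 DAY
GROUP BY transaction_date, channel, system_owner, transaction_name, response
ORDER BY transaction_date DESC, system_owner, transaction_name, response DESC;
```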
How much RAM? What is the value of innodb_buffer_pool_size?
If my comments are not enough, and if you frequently run a query like this (over a day or over a date range), then we can talk about building and maintaining a Summary table, which might give you a 10x speedup.
Been spending several days profiling a wide variety of queries used by a distributed application of ours in a MySQL database. Our app potentially stores millions of records on client database servers, and the queries can vary enough that the design of the indexes isn't always clear or easy. A tiny bit of extra overhead on query writes is acceptable if the lookup speed is fast enough.
I've managed to narrow down a few composite indexes that work very well for nearly all of our most common queries. There may be some columns in the below indexes I can weed out, but I need to run tests to be sure.
However, my problem: A certain query actually runs faster when it uses an index that contains fewer columns present in the conditions.
The table structure with current composite indexes:
CREATE TABLE IF NOT EXISTS `prism_data` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`epoch` int(10) unsigned NOT NULL,
`action_id` int(10) unsigned NOT NULL,
`player_id` int(10) unsigned NOT NULL,
`world_id` int(10) unsigned NOT NULL,
`x` int(11) NOT NULL,
`y` int(11) NOT NULL,
`z` int(11) NOT NULL,
`block_id` mediumint(5) DEFAULT NULL,
`block_subid` mediumint(5) DEFAULT NULL,
`old_block_id` mediumint(5) DEFAULT NULL,
`old_block_subid` mediumint(5) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `epoch` (`epoch`),
KEY `block` (`block_id`,`action_id`,`player_id`),
KEY `location` (`world_id`,`x`,`z`,`y`,`epoch`,`action_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
I have eight common queries that I've been testing and they all show incredible performance improvement on a database with 50 million records. One query however, doesn't.
The following query returns 11088 rows in (9.77 sec) and uses the location index
SELECT SQL_NO_CACHE id,
epoch,
action,
player,
world_id,
x,
y,
z
FROM prism_data
INNER JOIN prism_players p ON p.player_id = prism_data.player_id
INNER JOIN prism_actions a ON a.action_id = prism_data.action_id
WHERE world_id =
(SELECT w.world_id
FROM prism_worlds w
WHERE w.world = 'world')
AND (a.action = 'world-edit')
AND (prism_data.x BETWEEN -7220 AND -7020)
AND (prism_data.y BETWEEN -22 AND 178)
AND (prism_data.z BETWEEN -9002 AND -8802)
AND prism_data.epoch >= 1392220467;
+----+-------------+------------+--------+----------------+----------+---------+--------------------------------+--------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+----------------+----------+---------+--------------------------------+--------+------------------------------------+
| 1 | PRIMARY | a | ref | PRIMARY,action | action | 77 | const | 1 | Using where; Using index |
| 1 | PRIMARY | prism_data | ref | epoch,location | location | 4 | const | 660432 | Using index condition; Using where |
| 1 | PRIMARY | p | eq_ref | PRIMARY | PRIMARY | 4 | minecraft.prism_data.player_id | 1 | NULL |
| 2 | SUBQUERY | w | ref | world | world | 767 | const | 1 | Using where; Using index |
+----+-------------+------------+--------+----------------+----------+---------+--------------------------------+--------+------------------------------------+
If I remove the world condition, the query no longer matches the location index and instead uses the epoch index. Amazingly, it returns 11088 rows in (0.31 sec).
9.77 sec versus 0.31 sec is too much of a difference to ignore. I don't understand why I'm not seeing such a performance kill on my other queries that use the location index, but more importantly I don't know what I can do to fix this.
Presumably, the "epoch" index is more selective than the "location" index.
Note that MySQL might be running the subquery once for every row. That could have considerable overhead, even with an index. Doing 30 million index lookups might take a little time.
Try doing the query this way:
SELECT SQL_NO_CACHE id,
epoch,
action,
player,
world_id,
x,
y,
z
FROM prism_data
INNER JOIN prism_players p ON p.player_id = prism_data.player_id
INNER JOIN prism_actions a ON a.action_id = prism_data.action_id
CROSS JOIN (SELECT w.world_id FROM prism_worlds w WHERE w.world = 'world') w
WHERE world_id = w.world_id
AND (a.action = 'world-edit')
AND (prism_data.x BETWEEN -7220 AND -7020)
AND (prism_data.y BETWEEN -22 AND 178)
AND (prism_data.z BETWEEN -9002 AND -8802)
AND prism_data.epoch >= 1392220467;
If this doesn't show an improvement, then the issue is selectivity of the indexes. MySQL is simply making the wrong decision on which is the best index to use. If this does show an improvement, then it is because the subquery is being executed only one time in the from clause.
EDIT:
Your location index is:
KEY `location` (`world_id`,`x`,`z`,`y`,`epoch`,`action_id`)
Can you change this to:
KEY `location` (`world_id`, `action_id`, `x`, `z`, `y`, `epoch`)
This allows the where filtering to use action_id as well as x. (Only the first inequality in the key can use direct index lookups.)
Or better yet, one of these:
KEY `location` (`world_id`, `action_id`, `epoch`, `x`, `z`, `y`)
KEY `location` (`world_id`, `epoch`, `action_id`, `x`, `z`, `y`)
KEY `location` (`epoch`, `world_id`, `action_id`, `x`, `z`, `y`)
The idea is to move epoch before x so it will be used for the where clause conditions.
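As a sketch, assuming the table and index names shown above, the second variant could be applied like this (dropping and re-adding the index in a single ALTER keeps it to one table rebuild):

```sql
-- Replace the old index with an epoch-first ordering in one rebuild.
ALTER TABLE prism_data
  DROP INDEX `location`,
  ADD INDEX `location` (`world_id`, `epoch`, `action_id`, `x`, `z`, `y`);
```

On a table this size the rebuild can take a while, so it is worth trying on a copy of the data first.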
A query that used to work just fine on a production server has started becoming extremely slow (in a matter of hours).
This is it:
SELECT * FROM news_articles WHERE published = '1' AND news_category_id = '4' ORDER BY date_edited DESC LIMIT 1;
This takes up to 20-30 seconds to execute (the table has ~200,000 rows).
This is the output of EXPLAIN:
+----+-------------+---------------+-------------+----------------------------+----------------------------+---------+------+------+--------------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+-------------+----------------------------+----------------------------+---------+------+------+--------------------------------------------------------------------------+
| 1 | SIMPLE | news_articles | index_merge | news_category_id,published | news_category_id,published | 5,5 | NULL | 8409 | Using intersect(news_category_id,published); Using where; Using filesort |
+----+-------------+---------------+-------------+----------------------------+----------------------------+---------+------+------+--------------------------------------------------------------------------+
Playing around with it, I found that hinting a specific index (date_edited) makes it much faster:
SELECT * FROM news_articles USE INDEX (date_edited) WHERE published = '1' AND news_category_id = '4' ORDER BY date_edited DESC LIMIT 1;
This one takes milliseconds to execute.
EXPLAIN output for this one is:
+----+-------------+---------------+-------+---------------+-------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+-------+---------------+-------------+---------+------+------+-------------+
| 1 | SIMPLE | news_articles | index | NULL | date_edited | 8 | NULL | 1 | Using where |
+----+-------------+---------------+-------+---------------+-------------+---------+------+------+-------------+
Columns news_category_id, published and date_edited are all indexed.
The storage engine is InnoDB.
This is the table structure:
CREATE TABLE `news_articles` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`title` text NOT NULL,
`subtitle` text NOT NULL,
`summary` text NOT NULL,
`keywords` varchar(500) DEFAULT NULL,
`body` mediumtext NOT NULL,
`source` varchar(255) DEFAULT NULL,
`source_visible` int(11) DEFAULT NULL,
`author_information` enum('none','name','signature') NOT NULL DEFAULT 'name',
`date_added` datetime NOT NULL,
`date_edited` datetime NOT NULL,
`views` int(11) DEFAULT '0',
`news_category_id` int(11) DEFAULT NULL,
`user_id` int(11) DEFAULT NULL,
`c_forwarded` int(11) DEFAULT '0',
`published` int(11) DEFAULT '0',
`deleted` int(11) DEFAULT '0',
`permalink` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`),
KEY `news_category_id` (`news_category_id`),
KEY `published` (`published`),
KEY `deleted` (`deleted`),
KEY `date_edited` (`date_edited`),
CONSTRAINT `news_articles_ibfk_3` FOREIGN KEY (`news_category_id`) REFERENCES `news_categories` (`id`) ON DELETE SET NULL ON UPDATE CASCADE,
CONSTRAINT `news_articles_ibfk_4` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=192588 DEFAULT CHARSET=utf8
I could change all the queries my web application makes to hint at that index, but that is considerable work.
Is there some way to tune MySQL so that the first query is made more efficient without actually rewriting all queries?
Just a few tips:
1 - It seems to me the fields published and news_category_id are INTEGER. If so, please remove the single quotes from the values in your query. It can make a real difference in performance;
2 - Your field published probably has very few distinct values (likely just 1 for yes and 0 for no, or something like that). If I'm right, this is not a good field to index on its own: a lookup on it still has to visit a large share of the records to find what it is looking for. In that case, move news_category_id to be the first condition in your WHERE clause.
3 - "Don't forget about the leftmost index prefix." This applies to your SELECT, JOINs, WHERE, and ORDER BY. Even the order of the columns inside an index is important: keep the most useful ones first. Indexes are your friend as long as you know how to play with them.
Hope it helps you somehow.
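Building on tip 3: a single composite index that covers both equality filters and then the sort column would let MySQL avoid the index merge and the filesort entirely. A sketch (the index name is my own invention):

```sql
-- Equality columns first, then the ORDER BY column.
ALTER TABLE news_articles
  ADD INDEX `idx_cat_pub_edited` (`news_category_id`, `published`, `date_edited`);
```

With this in place, the optimizer can seek directly to (news_category_id = 4, published = 1) and read the newest date_edited straight off the end of the index range, so LIMIT 1 touches only one row.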
Original:
SELECT * FROM news_articles
WHERE published = 1 AND news_category_id = 4
ORDER BY date_edited DESC LIMIT 1;
Since you have LIMIT 1, you're only selecting the latest row. ORDER BY date_edited tells MySQL to sort the whole result set and then take one row off the top. That is really slow, and it is why USE INDEX helps.
Try matching MAX(date_edited) in the WHERE clause instead. That should get the query planner to use its index automatically.
Select by MAX(date_edited):
SELECT * FROM news_articles
WHERE published = 1 AND news_category_id = 4
AND date_edited = (SELECT MAX(date_edited) FROM news_articles
                   WHERE published = 1 AND news_category_id = 4);
Note that the subquery repeats the same filters; without them, the overall maximum date_edited might belong to a row that is unpublished or in another category, and the outer query would return nothing.
Please change your query to:
SELECT * FROM news_articles WHERE published = 1 AND news_category_id = 4 ORDER BY date_edited DESC LIMIT 1;
Please note that I have removed the quotes around the values '1' and '4'.
A mismatch between the datatype of the value passed and the column definition can prevent MySQL from using the index on these two columns.