I have a MySQL table with the following structure:
mysql> show create table logs \G;
Create Table: CREATE TABLE `logs` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`request` text,
`response` longtext,
`msisdn` varchar(255) DEFAULT NULL,
`username` varchar(255) DEFAULT NULL,
`shortcode` varchar(255) DEFAULT NULL,
`response_code` varchar(255) DEFAULT NULL,
`response_description` text,
`transaction_name` varchar(250) DEFAULT NULL,
`system_owner` varchar(250) DEFAULT NULL,
`request_date_time` datetime DEFAULT NULL,
`response_date_time` datetime DEFAULT NULL,
`comments` text,
`user_type` varchar(255) DEFAULT NULL,
`channel` varchar(20) DEFAULT 'WEB',
/**
other columns here....
other 18 columns here, with Type varchar and Text
**/
PRIMARY KEY (`id`),
KEY `transaction_name` (`transaction_name`) USING BTREE,
KEY `msisdn` (`msisdn`) USING BTREE,
KEY `username` (`username`) USING BTREE,
KEY `request_date_time` (`request_date_time`) USING BTREE,
KEY `system_owner` (`system_owner`) USING BTREE,
KEY `shortcode` (`shortcode`) USING BTREE,
KEY `response_code` (`response_code`) USING BTREE,
KEY `channel` (`channel`) USING BTREE,
KEY `request_date_time_2` (`request_date_time`),
KEY `response_date_time` (`response_date_time`)
) ENGINE=InnoDB AUTO_INCREMENT=59582405 DEFAULT CHARSET=utf8
It has more than 30,000,000 records in it:
mysql> select count(*) from logs;
+----------+
| count(*) |
+----------+
| 38962312 |
+----------+
1 row in set (1 min 17.77 sec)
The problem is that it is very slow: a SELECT takes ages to fetch records from the table.
The following query, which uses a subquery, takes almost 30 minutes to fetch the records for one day:
SELECT
COUNT(sub.id) AS count,
DATE(sub.REQUEST_DATE_TIME) AS transaction_date,
sub.SYSTEM_OWNER,
sub.transaction_name,
sub.response,
MIN(sub.response_time),
MAX(sub.response_time),
AVG(sub.response_time),
sub.channel
FROM
(SELECT
id,
REQUEST_DATE_TIME,
RESPONSE_DATE_TIME,
TIMESTAMPDIFF(SECOND, REQUEST_DATE_TIME, RESPONSE_DATE_TIME) AS response_time,
SYSTEM_OWNER,
transaction_name,
(CASE
WHEN response_code IN ('0' , '00', 'EIL000') THEN 'Success'
ELSE 'Failure'
END) AS response,
channel
FROM
logs
WHERE
response_code != ''
AND DATE(REQUEST_DATE_TIME) BETWEEN '2016-10-26 00:00:00' AND '2016-10-27 00:00:00'
AND SYSTEM_OWNER != '') sub
GROUP BY DATE(sub.REQUEST_DATE_TIME) , sub.channel , sub.SYSTEM_OWNER , sub.transaction_name , sub.response
ORDER BY DATE(sub.REQUEST_DATE_TIME) DESC , sub.SYSTEM_OWNER , sub.transaction_name , sub.response DESC;
I have also added indexes to my table, but it is still very slow.
Any help on how I can make it fast?
EDIT:
Ran the above query using EXPLAIN
+----+-------------+------------+------+----------------------------+------+---------+------+----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+----------------------------+------+---------+------+----------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 16053297 | Using temporary; Using filesort |
| 2 | DERIVED | logs | ALL | system_owner,response_code | NULL | NULL | NULL | 32106592 | Using where |
+----+-------------+------------+------+----------------------------+------+---------+------+----------+---------------------------------+
As it stands, the query must scan the entire table.
But first, let's air a possible bug:
AND DATE(REQUEST_DATE_TIME) BETWEEN '2016-10-26 00:00:00'
AND '2016-10-27 00:00:00'
Gives you the logs for two days -- all of the 26th and all of the 27th. Or is that what you really wanted? (BETWEEN is inclusive.)
But the performance problem is that the index will not be used because request_date_time is hiding inside a function (DATE).
Jump forward to a better way to phrase it:
AND REQUEST_DATE_TIME >= '2016-10-26'
AND REQUEST_DATE_TIME < '2016-10-26' + INTERVAL 1 DAY
A DATETIME can be compared against a date.
Midnight of the morning of the 26th is included, but midnight of the 27th is not.
You can easily change 1 to however many days you wish -- without having to deal with leap days, etc.
This formulation allows the use of the index on request_date_time, thereby cutting back severely on the amount of data to be scanned.
As for other tempting areas:
!= does not optimize well, so no 'composite' index is likely to be beneficial.
Since we can't really get past the WHERE, no index is useful for GROUP BY or ORDER BY.
My comments about DATE() in WHERE do not apply to GROUP BY; no change needed.
Why have the subquery? I think it can be done in a single layer. This will eliminate a rather large temp table. (Yeah, it means 3 uses of TIMESTAMPDIFF(), but that is probably a lot cheaper than the temp table.)
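A single-layer version might look something like the sketch below. It assumes the column names from the posted CREATE TABLE and uses the half-open date range suggested above. The aliases transaction_date, outcome and the three *_response_time names are made up for illustration. The alias is outcome rather than response because the table already has a response column, and GROUP BY resolves an unqualified name to a table column before it looks at aliases. It has not been run against the real data:

```sql
SELECT
    COUNT(*)                AS count,
    DATE(request_date_time) AS transaction_date,
    system_owner,
    transaction_name,
    -- "outcome" is a hypothetical alias; see the note above about "response"
    CASE WHEN response_code IN ('0', '00', 'EIL000')
         THEN 'Success' ELSE 'Failure' END AS outcome,
    MIN(TIMESTAMPDIFF(SECOND, request_date_time, response_date_time)) AS min_response_time,
    MAX(TIMESTAMPDIFF(SECOND, request_date_time, response_date_time)) AS max_response_time,
    AVG(TIMESTAMPDIFF(SECOND, request_date_time, response_date_time)) AS avg_response_time,
    channel
FROM logs
WHERE request_date_time >= '2016-10-26'                    -- index-friendly range
  AND request_date_time <  '2016-10-26' + INTERVAL 1 DAY   -- excludes the next midnight
  AND response_code != ''
  AND system_owner  != ''
GROUP BY transaction_date, channel, system_owner, transaction_name, outcome
ORDER BY transaction_date DESC, system_owner, transaction_name, outcome DESC;
```

Running EXPLAIN on it should show whether the request_date_time index is now chosen for a range scan instead of type ALL.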
How much RAM? What is the value of innodb_buffer_pool_size?
If my comments are not enough, and if you frequently run a query like this (over a day or over a date range), then we can talk about building and maintaining a Summary table, which might give you a 10x speedup.
I have a somewhat complex (to me) query where I am joining three tables. I have been steadily trying to optimize it, reading how to improve things by looking at the EXPLAIN output.
One of the tables person_deliveries is growing by one to two million records per day, so the query is taking longer and longer due to my poor optimization. Any insight would be GREATLY appreciated.
Here is the query:
SELECT
DATE(pdel.date) AS date,
pdel.ip_address AS ip_address,
pdel.sending_campaigns_id AS campaigns_id,
(substring_index(pe.email, '#', -1)) AS recipient_domain,
COUNT(DISTINCT(concat(pdel.emails_id, pdel.date))) AS deliveries,
COUNT(CASE WHEN pdel.ip_address = pc.ip_address AND pdel.sending_campaigns_id = pc.campaigns_id AND pdel.emails_id = pc.emails_id THEN pdel.emails_id ELSE NULL END) AS complaints
FROM
person_deliveries pdel
LEFT JOIN person_complaints pc on pc.ip_address = pdel.ip_address
LEFT JOIN person_emails pe ON pe.id = pdel.emails_id
WHERE
(pdel.date >= '2022-03-11' AND pdel.date <= '2022-03-12')
AND pe.id IS NOT NULL
AND pdel.ip_address is NOT NULL
GROUP BY date(pdel.date), pdel.ip_address, pdel.sending_campaigns_id
ORDER BY date(pdel.date), INET_ATON(pdel.ip_address), pdel.sending_campaigns_id ASC ;
Here is the output of EXPLAIN:
+----+-------------+-------+------------+--------+------------------------------------------------+------------+---------+----------------------------+---------+----------+---------------------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+------------------------------------------------+------------+---------+----------------------------+---------+----------+---------------------------------------------------------------------+
| 1 | SIMPLE | pdel | NULL | range | person_campaign_date,ip_address,date,emails_id | date | 5 | NULL | 2333678 | 50.00 | Using index condition; Using where; Using temporary; Using filesort |
| 1 | SIMPLE | pe | NULL | eq_ref | PRIMARY | PRIMARY | 4 | subscriber.pdel.emails_id | 1 | 100.00 | NULL |
| 1 | SIMPLE | pc | NULL | ref | ip_address | ip_address | 18 | subscriber.pdel.ip_address | 128 | 100.00 | NULL |
+----+-------------+-------+------------+--------+------------------------------------------------+------------+---------+----------------------------+---------+----------+---------------------------------------------------------------------+
I added a few indexes to get it to this point, but the query still takes an extraordinary amount of resources/time to process.
I know I am missing something here, either an index or using a function that is causing it to be slow, but from everything I have read I haven't figured it out yet.
UPDATE:
I neglected to include table info, so I am providing that to be more helpful.
person_deliveries:
CREATE TABLE `person_deliveries` (
`emails_id` int unsigned NOT NULL,
`sending_campaigns_id` int NOT NULL,
`date` datetime NOT NULL,
`vmta` varchar(255) DEFAULT NULL,
`ip_address` varchar(15) DEFAULT NULL,
`sending_domain` varchar(255) DEFAULT NULL,
UNIQUE KEY `person_campaign_date` (`emails_id`,`sending_campaigns_id`,`date`),
KEY `ip_address` (`ip_address`),
KEY `sending_domain` (`sending_domain`),
KEY `sending_campaigns_id` (`sending_campaigns_id`),
KEY `date` (`date`),
KEY `emails_id` (`emails_id`)
person_complaints:
CREATE TABLE `person_complaints` (
`emails_id` int unsigned NOT NULL,
`campaigns_id` int unsigned NOT NULL,
`complaint_datetime` datetime DEFAULT NULL,
`added_datetime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`ip_address` varchar(15) DEFAULT NULL,
`networks_id` int DEFAULT NULL,
`mailing_domains_id` int DEFAULT NULL,
UNIQUE KEY `email_campaign_date` (`emails_id`,`campaigns_id`,`complaint_datetime`),
KEY `ip_address` (`ip_address`)
person_emails:
CREATE TABLE `person_emails` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
`data_providers_id` tinyint unsigned DEFAULT NULL,
`email` varchar(255) NOT NULL,
`email_md5` varchar(255) DEFAULT NULL,
`original_import` timestamp NULL DEFAULT NULL,
`last_import` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
UNIQUE KEY `email` (`email`),
KEY `data_providers_id` (`data_providers_id`),
KEY `email_md5` (`email_md5`)
Hopefully this extra info helps.
Too many questions to fit in comments.
It appears that your date criteria pull only a SINGLE date. Is that always the case, or just in this sample? Is pdel.date stored as a date or as a date/time? Your query uses >= '2022-03-11' AND <= '2022-03-12'. Is this because you are trying to get everything up to and including 2022-03-11 at 11:59:59pm? If so, should it be LESS than 03-12?
If your counts are on a per-day basis and this data is rather fixed (that is, you are not going to change deliveries, etc. on a day that has already passed), this might be a candidate for a stored aggregate table that is built on a daily basis. This way, when you are looking for activity patterns, you have the non-changing aggregates already computed and can just query those. Then, if you need the details, go back to the raw data.
These indexes are "covering", which should help some:
pdel: INDEX(date, ip_address, sending_campaigns_id, emails_id)
pc: INDEX(ip_address, campaigns_id, emails_id)
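As a rough sketch, the two indexes above could be added as follows. The index names are invented for illustration. On a table growing by millions of rows per day, building the index may take a while:

```sql
-- Hypothetical index names; column lists are the ones suggested above.
ALTER TABLE person_deliveries
    ADD INDEX date_ip_campaign_email (`date`, ip_address, sending_campaigns_id, emails_id);

ALTER TABLE person_complaints
    ADD INDEX ip_campaign_email (ip_address, campaigns_id, emails_id);
```

If EXPLAIN then shows "Using index" for pdel and pc, the query is being answered from the indexes without touching the table rows.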
Assuming date is a DATETIME, this contains an extra midnight:
WHERE pdel.date >= '2022-03-11'
AND pdel.date <= '2022-03-12'
I prefer the pattern:
WHERE pdel.date >= '2022-03-11'
AND pdel.date < '2022-03-11' + INTERVAL 1 DAY
When the GROUP BY and ORDER BY are different, an extra sort is (usually) required. So, write the GROUP BY to be just like the ORDER BY (after removing "ASC").
A minor simplification (and speedup):
COUNT(DISTINCT(concat(pdel.emails_id, pdel.date))) AS deliveries,
-->
COUNT(DISTINCT pdel.emails_id, pdel.date) AS deliveries,
Consider storing the numeric version of the IPv4 in INT UNSIGNED (only 4 bytes) instead of a VARCHAR. It will be smaller and you can eliminate some conversions, but will add an INET_NTOA in the SELECT.
The COUNT(CASE ... ) can be simplified to
SUM( pdel.ip_address = pc.ip_address
AND pdel.sending_campaigns_id = pc.campaigns_id
AND pdel.emails_id = pc.emails_id ) AS complaints
In
(substring_index(pe.email, '#', -1)) AS recipient_domain,
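check the sign: with -1, SUBSTRING_INDEX returns the part after the last '#', and with 1 it returns the part before the first '#'. The alias is 'wrong' unless the part after the delimiter really is the domain.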
Please change LEFT JOIN pe ... WHERE pe.id IS NOT NULL to the equivalent, but simpler, JOIN pe without the null test.
Sorry, but those will not provide a huge performance improvement. The next step would be to build and maintain a Summary Table and use that to generate the desired 'report'. (See DRapp's Answer.)
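A minimal sketch of such a summary table is shown below. The table name delivery_daily_summary and its column names are hypothetical, and only the deliveries count is covered. Complaints would need a similar column or a second table. Because person_deliveries has a unique key on (emails_id, sending_campaigns_id, date), COUNT(*) per day, IP and campaign should match the COUNT(DISTINCT ...) of the report, as long as no join multiplies the rows:

```sql
-- Hypothetical summary table: one row per day, IP address and campaign.
CREATE TABLE delivery_daily_summary (
    delivery_date DATE         NOT NULL,
    ip_address    VARCHAR(15)  NOT NULL,
    campaigns_id  INT          NOT NULL,
    deliveries    INT UNSIGNED NOT NULL,
    PRIMARY KEY (delivery_date, ip_address, campaigns_id)
);

-- Run once per day, after that day's deliveries are complete.
INSERT INTO delivery_daily_summary
    (delivery_date, ip_address, campaigns_id, deliveries)
SELECT DATE(pdel.`date`), pdel.ip_address, pdel.sending_campaigns_id, COUNT(*)
FROM person_deliveries pdel
JOIN person_emails pe ON pe.id = pdel.emails_id   -- keeps the "pe.id IS NOT NULL" filter
WHERE pdel.`date` >= '2022-03-11'
  AND pdel.`date` <  '2022-03-11' + INTERVAL 1 DAY
  AND pdel.ip_address IS NOT NULL
GROUP BY DATE(pdel.`date`), pdel.ip_address, pdel.sending_campaigns_id;
```

The report would then read a few thousand summary rows per day instead of scanning millions of detail rows.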
I have a MySQL table structured like this:
CREATE TABLE `messages` (
`id` int NOT NULL AUTO_INCREMENT,
`author` varchar(250) COLLATE utf8mb4_unicode_ci NOT NULL,
`message` varchar(2000) COLLATE utf8mb4_unicode_ci NOT NULL,
`serverid` varchar(200) COLLATE utf8mb4_unicode_ci NOT NULL,
`date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`guildname` varchar(1000) COLLATE utf8mb4_unicode_ci NOT NULL,
PRIMARY KEY (`id`,`date`)
) ENGINE=InnoDB AUTO_INCREMENT=27769461 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
I need to query this table for various statistics using date ranges for Grafana graphs, however all of those queries are extremely slow, despite the table being indexed using a composite key of id and date.
"id" is auto-incrementing and date is also always increasing.
The queries generated by Grafana look like this:
SELECT
UNIX_TIMESTAMP(date) DIV 120 * 120 AS "time",
count(DISTINCT(serverid)) AS "servercount"
FROM messages
WHERE
date BETWEEN FROM_UNIXTIME(1615930154) AND FROM_UNIXTIME(1616016554)
GROUP BY 1
ORDER BY UNIX_TIMESTAMP(date) DIV 120 * 120
This query takes over 30 seconds to complete with 27 million records in the table.
Explaining the query results in this output:
+----+-------------+----------+------------+------+---------------+------+---------+------+----------+----------+-----------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+------+---------------+------+---------+------+----------+----------+-----------------------------+
| 1 | SIMPLE | messages | NULL | ALL | PRIMARY | NULL | NULL | NULL | 26952821 | 11.11 | Using where; Using filesort |
+----+-------------+----------+------------+------+---------------+------+---------+------+----------+----------+-----------------------------+
This indicates that MySQL is indeed using the composite primary key I created for indexing the data, but still has to scan almost the entire table, which I do not understand. How can I optimize this table for date range queries?
Plan A:
PRIMARY KEY(date, id), -- to cluster by date
INDEX(id) -- needed to keep AUTO_INCREMENT happy
Assuming the table is quite big, having date at the beginning of the PK puts the rows in the given date range all next to each other. This minimizes (somewhat) the I/O.
Plan B:
PRIMARY KEY(id),
INDEX(date, serverid)
Now the secondary index is exactly what is needed for the one query you have provided. It is optimized for searching by date, and it is smaller than the whole table, hence even faster (I/O-wise) than Plan A.
But, if you have a lot of different queries like this, adding a lot more indexes gets impractical.
Plan C: There may be a still better way:
PRIMARY KEY(id),
INDEX(serverid, date)
In theory, it can hop through that secondary index checking each serverid. But I am not sure that such an optimization exists.
Plan D: Do you need id for anything other than providing a unique PRIMARY KEY? If not, there may be other options.
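For illustration, Plan B could be applied with a single ALTER along these lines. The index name date_serverid is made up. The statement rebuilds the table, so on 27 million rows it is worth trying on a copy first. It also assumes the default DYNAMIC row format, so that the wide serverid column fits within the index key-length limit:

```sql
-- Sketch of Plan B: primary key on id alone, secondary index for the date-range query.
ALTER TABLE messages
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (id),
    ADD INDEX date_serverid (`date`, serverid);   -- hypothetical index name
```

With this in place, EXPLAIN for the Grafana query should show a range scan on date_serverid rather than type ALL.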
The index on (id, date) doesn't help because the first key is id not date.
You can either
(a) drop the current index and index (date, id) instead -- when date is in the first place this can be used to filter for date regardless of the following columns -- or
(b) just create an additional index only on (date) to support the query.
I have the following tables:
mysql> show create table rsspodcastitems \G
*************************** 1. row ***************************
Table: rsspodcastitems
Create Table: CREATE TABLE `rsspodcastitems` (
`id` char(20) NOT NULL,
`description` mediumtext,
`duration` int(11) default NULL,
`enclosure` mediumtext NOT NULL,
`guid` varchar(300) NOT NULL,
`indexed` datetime NOT NULL,
`published` datetime default NULL,
`subtitle` varchar(255) default NULL,
`summary` mediumtext,
`title` varchar(255) NOT NULL,
`podcast_id` char(20) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `podcast_id` (`podcast_id`,`guid`),
UNIQUE KEY `UKfb6nlyxvxf3i2ibwd8jx6k025` (`podcast_id`,`guid`),
KEY `IDXkcqf7wi47t3epqxlh34538k7c` (`indexed`),
KEY `IDXt2ofice5w51uun6w80g8ou7hc` (`podcast_id`,`published`),
KEY `IDXfb6nlyxvxf3i2ibwd8jx6k025` (`podcast_id`,`guid`),
KEY `published` (`published`),
FULLTEXT KEY `title` (`title`),
FULLTEXT KEY `summary` (`summary`),
FULLTEXT KEY `subtitle` (`subtitle`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
mysql> show create table station_cache \G
*************************** 1. row ***************************
Table: station_cache
Create Table: CREATE TABLE `station_cache` (
`Station_id` char(36) NOT NULL,
`item_id` char(20) NOT NULL,
`item_type` int(11) NOT NULL,
`podcast_id` char(20) NOT NULL,
`published` datetime NOT NULL,
KEY `Station_id` (`Station_id`,`published`),
KEY `IDX12n81jv8irarbtp8h2hl6k4q3` (`Station_id`,`published`),
KEY `item_id` (`item_id`,`item_type`),
KEY `IDXqw9yqpavo9fcduereqqij4c80` (`item_id`,`item_type`),
KEY `podcast_id` (`podcast_id`,`published`),
KEY `IDXkp2ehbpmu41u1vhwt7qdl2fuf` (`podcast_id`,`published`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
The "item_id" column of the second refers to the "id" column of the former (there isn't a foreign key between the two because the relationship is polymorphic, i.e. the second table may have references to entities that aren't in the first but in other tables that are similar but distinct).
I'm trying to get a query that lists the most recent items in the first table that do not have any corresponding items in the second. The highest performing query I've found so far is:
select i.*,
(select count(station_id)
from station_cache
where item_id = i.id) as stations
from rsspodcastitems i
having stations = 0
order by published desc
I've also considered using a where not exists (...) subquery to perform the restriction, but this was actually slower than the one I have above. But this is still taking a substantial length of time to complete. MySQL's query plan doesn't seem to be using the available indices:
+----+--------------------+---------------+------+---------------+------+---------+------+--------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------------+------+---------------+------+---------+------+--------+----------------+
| 1 | PRIMARY | i | ALL | NULL | NULL | NULL | NULL | 106978 | Using filesort |
| 2 | DEPENDENT SUBQUERY | station_cache | ALL | NULL | NULL | NULL | NULL | 44227 | Using where |
+----+--------------------+---------------+------+---------------+------+---------+------+--------+----------------+
Note that neither portion of the query is using a key, whereas it ought to be able to use KEY published (published) from the primary table and KEY item_id (item_id,item_type) for the subquery.
Any suggestions how I can get an appropriate result without waiting for several minutes?
I would expect the fastest query to be:
select i.*
from rsspodcastitems i
where not exists (select 1
from station_cache sc
where sc.item_id = i.id
)
order by published desc;
This would take advantage of an index on station_cache(item_id) and perhaps rsspodcastitems(published, id).
Your query could be faster if it returns a significant number of rows, because your phrasing allows the index on rsspodcastitems(published) to avoid the file sort. If you remove the order by, the exists version should be faster.
I should note that I like your use of the having clause. When faced with this in the past, I have used a subquery:
select i.*,
(select count(station_id)
from station_cache
where item_id = i.id) as stations
from (select i.*
from rsspodcastitems i
order by published desc
) i
where not exists (select 1
from station_cache sc
where sc.item_id = i.id
);
This allows one index for sorting.
I prefer a slight variation on your method:
select i.*,
(exists (select 1
from station_cache sc
where sc.item_id = i.id
)
) as has_station
from rsspodcastitems i
having has_station = 0
order by published desc;
This should be slightly faster than the version with count().
You might want to detect and remove redundant indexes from your tables. Reviewing the CREATE TABLE output for both tables will help you discover several, including (podcast_id, guid), (Station_id, published), (item_id, item_type) and (podcast_id, published); there may be more.
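Going by the posted CREATE TABLE output, the duplicates could be dropped roughly as follows. Each statement keeps one index per column list and removes the copies. This is only a sketch, and it assumes nothing else (for example a schema generator that created the IDX... and UK... names) depends on those names or will re-create them:

```sql
-- rsspodcastitems: keep UNIQUE KEY podcast_id (podcast_id, guid), drop its two copies.
ALTER TABLE rsspodcastitems
    DROP INDEX `UKfb6nlyxvxf3i2ibwd8jx6k025`,
    DROP INDEX `IDXfb6nlyxvxf3i2ibwd8jx6k025`;

-- station_cache: every index is defined twice; drop the IDX... copy of each.
ALTER TABLE station_cache
    DROP INDEX `IDX12n81jv8irarbtp8h2hl6k4q3`,
    DROP INDEX `IDXqw9yqpavo9fcduereqqij4c80`,
    DROP INDEX `IDXkp2ehbpmu41u1vhwt7qdl2fuf`;
```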
My eventual solution was to delete the full text indices and use an externally generated index table (produced by iterating over the words in the text, filtering stop words, and applying a stemming algorithm) to allow searching. I don't know why the full text indices were causing performance problems, but they seemed to slow down every query that touched the table even if they weren't used.
I have a query whose purpose is to generate statistics on how many musical works (tracks) have been downloaded from a site over different periods (by month, by quarter, by year, etc.). The query operates on the tables entityusage, entityusage_file and track.
To get the number of downloads for tracks belonging to a specific album, I run the following query:
select
date_format(eu.updated, '%Y-%m-%d') as p, count(eu.id) as c
from entityusage as eu
inner join entityusage_file as euf
ON euf.entityusage_id = eu.id
inner join track as t
ON t.id = euf.track_id
where
t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
and entitytype = 't'
and action = 1
group by date_format(eu.updated, '%Y%m%d')
I need to set entitytype = 't' as the entityusage can hold downloads of other entities as well (if entitytype = 'a' then an entire album would have been downloaded, and entityusage_file would then hold all tracks which the album "translated" into at the point of download).
This query takes 40 - 50 seconds. I've been trying to optimize this query for a while, but I have the feeling that I'm approaching this the wrong way.
This is one of 4 similar queries which must run to generate a report. The report should preferably finish while a user waits for it. Right now, I'm looking at 3 - 4 minutes. That's a long time to wait.
Can this query be optimised further with indexes, or do I need to take another approach to get this job done?
CREATE TABLE `entityusage` (
`id` char(36) NOT NULL,
`title` varchar(255) DEFAULT NULL,
`entitytype` varchar(5) NOT NULL,
`entityid` char(36) NOT NULL,
`externaluser` int(10) NOT NULL,
`action` tinyint(1) NOT NULL,
`updated` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `e` (`entityid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `entityusage_file` (
`id` char(36) NOT NULL,
`entityusage_id` char(36) NOT NULL,
`track_id` char(36) NOT NULL,
`file_id` char(36) NOT NULL,
`type` varchar(3) NOT NULL,
`quality` int(1) NOT NULL,
`size` int(20) NOT NULL,
`updated` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `file_id` (`file_id`),
KEY `entityusage_id` (`entityusage_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `track` (
`id` char(36) NOT NULL,
`album_id` char(36) NOT NULL,
`number` int(3) NOT NULL DEFAULT '0',
`title` varchar(255) DEFAULT NULL,
`updated` datetime NOT NULL DEFAULT '2000-01-01 00:00:00',
PRIMARY KEY (`id`),
KEY `album` (`album_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 CHECKSUM=1 DELAY_KEY_WRITE=1 ROW_FORMAT=DYNAMIC;
An EXPLAIN on the query gives me the following :
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
| 1 | SIMPLE | eu | ALL | NULL | NULL | NULL | NULL | 7832817 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | euf | ref | entityusage_id | entityusage_id | 108 | func | 1 | Using index condition |
| 1 | SIMPLE | t | eq_ref | PRIMARY,album | PRIMARY | 108 | trackerdatabase.euf.track_id | 1 | Using where |
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
This is your query:
select date_format(eu.updated, '%Y-%m-%d') as p, count(eu.id) as c
from entityusage eu join
entityusage_file euf
on euf.entityusage_id = eu.id join
track t
on t.id = euf.track_id
where t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7' and
eu.entitytype = 't' and
eu.action = 1
group by date_format(eu.updated, '%Y%m%d');
I would suggest indexes on track(album_id, id), entityusage_file(track_id, entityusage_id), and entityusage(id, entitytype, action).
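Spelled out as statements, those three indexes might be created like this (the index names are invented for illustration):

```sql
-- Hypothetical index names; the column lists are the ones suggested above.
CREATE INDEX track_album_id    ON track (album_id, id);
CREATE INDEX euf_track_usage   ON entityusage_file (track_id, entityusage_id);
CREATE INDEX eu_id_type_action ON entityusage (id, entitytype, action);
```

The second one is what lets the optimizer start from track and reach entityusage_file by track_id, which no existing index supports.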
Assuming that entityusage_file is mostly a many:many mapping table, see this for tips on improving it. Note that it calls for getting rid of the id and making a pair of 2-column indexes, one of which is the PRIMARY KEY(track_id, entityusage_id). Since your table has a few extra columns, that link does not cover everything.
The UUIDs could be shrunk from 108 bytes to 36, and then to 16 by going to BINARY(16) and using a compression function. Many exist (including a builtin pair in version 8.0); here's mine.
To explain one thing... The query execution should have started with track (on the assumption that '0054a47e-b594-407b-86df-3be078b4e7b7' is very selective). The hangup was that there was no index to get from there to the next table. Gordon's suggested indexes include such.
date_format(eu.updated, '%Y-%m-%d') and date_format(eu.updated, '%Y%m%d') can be simplified to DATE(eu.updated). (No significant performance change.)
(The other Answers and Comments cover a number of issues; I won't repeat them here.)
Because the GROUP BY operation is on an expression involving a function, MySQL can't use an index to optimize that operation. It's going to require a "Using filesort" operation.
I believe the indexes that Gordon suggested are the best bets, given the current table definitions. But even with those indexes, the "tall post" is the eu table, chunking through and sorting all those rows.
To get more reasonable performance, you may need to introduce a "precomputed results" table. It's going to be expensive to generate the counts for everything... but we can pay that price ahead of time...
CREATE TABLE usage_track_by_day
( track_id   CHAR(36) NOT NULL
, updated_dt DATE NOT NULL
, cnt        INT NOT NULL
, PRIMARY KEY (track_id, updated_dt)
)
AS
SELECT euf.track_id                  -- track_id lives in entityusage_file
     , DATE(eu.updated) AS updated_dt
     , SUM(IF(eu.action = 1,1,0)) AS cnt
  FROM entityusage eu
  JOIN entityusage_file euf
    ON euf.entityusage_id = eu.id
 WHERE eu.entitytype = 't'           -- track downloads only, as in the report query
 GROUP
    BY euf.track_id
     , DATE(eu.updated)
An index ON entityusage_file (entityusage_id, track_id) may benefit performance.
Then, we could write a query against the new "precomputed results" table, with a better shot at reasonable performance.
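For example, the per-album report might become something like the following sketch. It assumes the usage_track_by_day table above, with its track_id, updated_dt and cnt columns, and assumes that table was built from track downloads only:

```sql
-- Daily download counts for one album, read from the precomputed table.
SELECT u.updated_dt AS p
     , SUM(u.cnt)   AS c
  FROM track t
  JOIN usage_track_by_day u
    ON u.track_id = t.id
 WHERE t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
 GROUP BY u.updated_dt;
```

This touches one summary row per track per day rather than every detail row in entityusage.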
The "precomputed results" table would get stale, and would need to be periodically refreshed.
This isn't necessarily the best solution to the issue, but it's a technique we can use in datawarehouse/datamart applications. This lets us churn through lots of detail rows to get counts one time, and then save those counts for fast access.
Can you try this? I can't really test it without some sample data from you.
In this case the query looks in the table track first and then joins the other tables.
SELECT
date_format(eu.updated, '%Y-%m-%d') AS p
, count(eu.id) AS c
FROM track AS t
INNER JOIN entityusage_file AS euf ON t.id = euf.track_id
INNER JOIN entityusage AS eu ON euf.entityusage_id = eu.id
WHERE
t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
AND entitytype = 't'
AND ACTION = 1
GROUP BY date_format(eu.updated, '%Y%m%d');
A query that used to work just fine on a production server has started becoming extremely slow (in a matter of hours).
This is it:
SELECT * FROM news_articles WHERE published = '1' AND news_category_id = '4' ORDER BY date_edited DESC LIMIT 1;
This takes up to 20-30 seconds to execute (the table has ~200.000 rows)
This is the output of EXPLAIN:
+----+-------------+---------------+-------------+----------------------------+----------------------------+---------+------+------+--------------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+-------------+----------------------------+----------------------------+---------+------+------+--------------------------------------------------------------------------+
| 1 | SIMPLE | news_articles | index_merge | news_category_id,published | news_category_id,published | 5,5 | NULL | 8409 | Using intersect(news_category_id,published); Using where; Using filesort |
+----+-------------+---------------+-------------+----------------------------+----------------------------+---------+------+------+--------------------------------------------------------------------------+
Playing around with it, I found that hinting a specific index (date_edited) makes it much faster:
SELECT * FROM news_articles USE INDEX (date_edited) WHERE published = '1' AND news_category_id = '4' ORDER BY date_edited DESC LIMIT 1;
This one takes milliseconds to execute.
EXPLAIN output for this one is:
+----+-------------+---------------+-------+---------------+-------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+-------+---------------+-------------+---------+------+------+-------------+
| 1 | SIMPLE | news_articles | index | NULL | date_edited | 8 | NULL | 1 | Using where |
+----+-------------+---------------+-------+---------------+-------------+---------+------+------+-------------+
Columns news_category_id, published and date_edited are all indexed.
The storage engine is InnoDB.
This is the table structure:
CREATE TABLE `news_articles` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`title` text NOT NULL,
`subtitle` text NOT NULL,
`summary` text NOT NULL,
`keywords` varchar(500) DEFAULT NULL,
`body` mediumtext NOT NULL,
`source` varchar(255) DEFAULT NULL,
`source_visible` int(11) DEFAULT NULL,
`author_information` enum('none','name','signature') NOT NULL DEFAULT 'name',
`date_added` datetime NOT NULL,
`date_edited` datetime NOT NULL,
`views` int(11) DEFAULT '0',
`news_category_id` int(11) DEFAULT NULL,
`user_id` int(11) DEFAULT NULL,
`c_forwarded` int(11) DEFAULT '0',
`published` int(11) DEFAULT '0',
`deleted` int(11) DEFAULT '0',
`permalink` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`),
KEY `news_category_id` (`news_category_id`),
KEY `published` (`published`),
KEY `deleted` (`deleted`),
KEY `date_edited` (`date_edited`),
CONSTRAINT `news_articles_ibfk_3` FOREIGN KEY (`news_category_id`) REFERENCES `news_categories` (`id`) ON DELETE SET NULL ON UPDATE CASCADE,
CONSTRAINT `news_articles_ibfk_4` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`) ON DELETE SET NULL ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=192588 DEFAULT CHARSET=utf8
I could possibly change all the queries my web application runs to hint that index, but this is considerable work.
Is there some way to tune MySQL so that the first query is made more efficient without actually rewriting all queries?
Just a few tips:
1 - It seems to me the fields published and news_category_id are INTEGER. If so, please remove the single quotes from your query. It can make a huge difference when it comes to performance;
2 - Also, I'd say that your field published does not have many different values (it is probably 1 - yes and 0 - no, or something like that). If I'm right, this is not a good field to index at all. MySQL in this case still has to go through all the records to find what it is looking for. In this case move news_category_id to be the first field in your WHERE clause.
3 - "Don't forget about the leftmost index". This rule is valid for your SELECT, JOINS, WHERE, ORDER BY. Even the position of the columns in the table is important; keep the indexed ones at the top. Indexes are your friend as long as you know how to play with them.
Hope it helps somehow.
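One way the leftmost-prefix idea could be applied here, without rewriting any queries, is a composite index that starts with the two equality columns and ends with the sort column. This is only a sketch: the index name is made up, and it assumes most queries filter on these columns in this way:

```sql
-- Hypothetical composite index: equality columns first, ORDER BY column last.
ALTER TABLE news_articles
    ADD INDEX category_published_edited (news_category_id, published, date_edited);

-- The original query is unchanged; EXPLAIN shows whether the new index is picked.
EXPLAIN
SELECT * FROM news_articles
WHERE published = 1 AND news_category_id = 4
ORDER BY date_edited DESC LIMIT 1;
```

With such an index, MySQL can locate the (4, 1) entries and read the one with the highest date_edited directly, so neither the index merge nor the filesort should be needed.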
SELECT * FROM news_articles WHERE published = '1' AND news_category_id = '4' ORDER BY date_edited DESC LIMIT 1;
Original:
SELECT * FROM news_articles
WHERE published = 1 AND news_category_id = 4
ORDER BY date_edited DESC LIMIT 1;
Since you have LIMIT 1, you're only selecting the latest row. ORDER BY date_edited tells MySQL to sort then take 1 row off the top. This is really slow, and why USE INDEX would help.
Try to match MAX(date_edited) in the WHERE clause instead. That should get the query planner to use its index automatically.
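Choose MAX(date_edited), restricted to the same filters so that the row it finds also satisfies the outer WHERE:
SELECT * FROM news_articles
WHERE published = 1 AND news_category_id = 4
AND date_edited = (SELECT MAX(date_edited) FROM news_articles
WHERE published = 1 AND news_category_id = 4);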
Please change your query to :
SELECT * FROM news_articles WHERE published = 1 AND news_category_id = 4 ORDER BY date_edited DESC LIMIT 1;
Please note that I have removed the quotes from '1' and '4' in the query.
A mismatch between the datatype passed and the column type can keep MySQL from using the indexes on these 2 columns.