I have a query that takes roughly four minutes to run on a high-powered SSD server with no other notable processes running. I'd like to make it faster if possible.
The database stores a match history for a popular video game called Dota 2. In this game, ten players (five on each team) each select a "hero" and battle it out.
The intention of my query is to create a list of past matches along with how much of an "XP dependence" each team had, based on the heroes used. With 200,000 matches (and a 2,000,000 row matches-to-heroes relationship table) the query takes about four minutes. With 1,000,000 matches, it takes roughly 15 minutes.
I have full control of the server, so any configuration suggestions are also appreciated. Thanks for any help guys. Here are the details...
CREATE TABLE matches (
match_id BIGINT UNSIGNED NOT NULL,
start_time INT UNSIGNED NOT NULL,
skill_level TINYINT NOT NULL DEFAULT -1,
winning_team TINYINT UNSIGNED NOT NULL,
PRIMARY KEY (match_id),
KEY start_time (start_time),
KEY skill_level (skill_level),
KEY winning_team (winning_team));
CREATE TABLE heroes (
hero_id SMALLINT UNSIGNED NOT NULL,
name CHAR(40) NOT NULL DEFAULT '',
faction TINYINT NOT NULL DEFAULT -1,
primary_attribute TINYINT NOT NULL DEFAULT -1,
group_index TINYINT NOT NULL DEFAULT -1,
match_count BIGINT UNSIGNED NOT NULL DEFAULT 0,
win_count BIGINT UNSIGNED NOT NULL DEFAULT 0,
xp_from_wins BIGINT UNSIGNED NOT NULL DEFAULT 0,
team_xp_from_wins BIGINT UNSIGNED NOT NULL DEFAULT 0,
xp_from_losses BIGINT UNSIGNED NOT NULL DEFAULT 0,
team_xp_from_losses BIGINT UNSIGNED NOT NULL DEFAULT 0,
gold_from_wins BIGINT UNSIGNED NOT NULL DEFAULT 0,
team_gold_from_wins BIGINT UNSIGNED NOT NULL DEFAULT 0,
gold_from_losses BIGINT UNSIGNED NOT NULL DEFAULT 0,
team_gold_from_losses BIGINT UNSIGNED NOT NULL DEFAULT 0,
included TINYINT UNSIGNED NOT NULL DEFAULT 0,
PRIMARY KEY (hero_id));
CREATE TABLE matches_heroes (
match_id BIGINT UNSIGNED NOT NULL,
player_id INT UNSIGNED NOT NULL,
hero_id SMALLINT UNSIGNED NOT NULL,
xp_per_min SMALLINT UNSIGNED NOT NULL,
gold_per_min SMALLINT UNSIGNED NOT NULL,
position TINYINT UNSIGNED NOT NULL,
PRIMARY KEY (match_id, hero_id),
KEY match_id (match_id),
KEY player_id (player_id),
KEY hero_id (hero_id),
KEY xp_per_min (xp_per_min),
KEY gold_per_min (gold_per_min),
KEY position (position));
Query
SELECT
matches.match_id,
SUM(CASE
WHEN position < 5 THEN xp_from_wins / team_xp_from_wins
ELSE 0
END) AS radiant_xp_dependence,
SUM(CASE
WHEN position >= 5 THEN xp_from_wins / team_xp_from_wins
ELSE 0
END) AS dire_xp_dependence,
winning_team
FROM
matches
INNER JOIN
matches_heroes
ON matches.match_id = matches_heroes.match_id
INNER JOIN
heroes
ON matches_heroes.hero_id = heroes.hero_id
GROUP BY
matches.match_id
Sample Results
match_id | radiant_xp_dependence | dire_xp_dependence | winning_team
2298874871 | 1.0164 | 0.9689 | 1
2298884079 | 0.9932 | 1.0390 | 0
2298885606 | 0.9877 | 1.0015 | 1
EXPLAIN
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | heroes | ALL | PRIMARY | NULL | NULL | NULL | 111 | Using temporary; Using filesort
1 | SIMPLE | matches_heroes | ref | PRIMARY,match_id,hero_id | hero_id | 2 | dota_2.heroes.hero_id | 3213 |
1 | SIMPLE | matches | eq_ref | PRIMARY | PRIMARY | 8 | dota_2.matches_heroes.match_id | 1 |
Machine Specs
Intel Xeon E5
E5-1630v3 4/8t
3.7 / 3.8 GHz
64 GB of RAM
DDR4 ECC 2133 MHz
2 x 480GB of SSD SOFT
Database
MariaDB 10.0
InnoDB
In all likelihood, the main performance driver is the GROUP BY. Sometimes, in MySQL, it can be faster to use correlated subqueries. So, try writing the query like this:
SELECT m.match_id,
(SELECT SUM(h.xp_from_wins / h.team_xp_from_wins)
FROM matches_heroes mh INNER JOIN
heroes h
ON mh.hero_id = h.hero_id
WHERE m.match_id = mh.match_id AND mh.position < 5
) AS radiant_xp_dependence,
(SELECT SUM(h.xp_from_wins / h.team_xp_from_wins)
FROM matches_heroes mh INNER JOIN
heroes h
ON mh.hero_id = h.hero_id
WHERE m.match_id = mh.match_id AND mh.position >= 5
) AS dire_xp_dependence,
m.winning_team
FROM matches m;
Then, you want indexes on:
matches_heroes(match_id, position)
heroes(hero_id, xp_from_wins, team_xp_from_wins)
For completeness, you might want this index as well:
matches(match_id, winning_team)
This would be more important if you added order by match_id to the query.
As has already been mentioned in a comment, there is little you can do, because you select all data from the tables. The query looks perfect.
One idea that comes to mind is covering indexes. With indexes containing all data needed for the query, the tables themselves don't have to be accessed anymore.
CREATE INDEX matches_quick ON matches(match_id, winning_team);
CREATE INDEX heroes_quick ON heroes(hero_id, xp_from_wins, team_xp_from_wins);
CREATE INDEX matches_heroes_quick ON matches_heroes (match_id, hero_id, position);
There is no guarantee for this to speed up your query, as you are still reading all data, so running through the indexes may be just as much work as reading the tables. But there is a chance that the joins will be faster and there would probably be fewer physical reads. Just give it a try.
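One way to check whether a covering index is actually being used is to look at the plan. Here is a self-contained sketch using SQLite from Python (a different engine than MariaDB, but the covering-index concept carries over); the table is a cut-down stand-in for matches:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Deliberately no primary key, so the index below is the only fast path.
con.execute("CREATE TABLE matches (match_id INTEGER, winning_team INTEGER)")
con.executemany("INSERT INTO matches VALUES (?, ?)",
                [(i, i % 2) for i in range(1000)])
con.execute("CREATE INDEX matches_quick ON matches(match_id, winning_team)")

# Every column the query touches is in the index, so the table itself
# never has to be read; the plan reports a COVERING INDEX.
plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT match_id, winning_team FROM matches WHERE match_id > 500"
).fetchall()
print(plan[0][-1])
```

In MariaDB/MySQL, the equivalent signal is "Using index" in the Extra column of EXPLAIN.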
Waiting for another idea? :-)
Well, there is always the data warehouse approach. If you must run this query again and again and always for all matches ever played, then why not store the query results and access them later?
I suppose that matches played won't be altered, so you could reuse all the results you computed, say, last week and only retrieve results for the games played since then from your real tables.
Create a table archived_results. Add a flag archived to your matches table. Then add query results to the archived_results table and set the flag to TRUE for those matches. When you have to perform your query, you would either bring the archived_results table fully up to date and show only its contents, or combine archive and current:
select match_id, radiant_xp_dependence, dire_xp_dependence, winning_team
from archived_results
union all
SELECT
matches.match_id,
SUM(CASE
WHEN position < 5 THEN xp_from_wins / team_xp_from_wins
ELSE 0
END) AS radiant_xp_dependence,
...
WHERE matches.archived = FALSE
GROUP BY matches.match_id;
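To illustrate the archive-plus-fresh idea end to end, here is a toy sketch (SQLite from Python for self-containment; the single score column is a stand-in for the computed dependence values):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE matches (
    match_id INTEGER PRIMARY KEY,
    score    REAL,
    archived INTEGER NOT NULL DEFAULT 0)""")
con.execute("""CREATE TABLE archived_results (
    match_id INTEGER PRIMARY KEY,
    score    REAL)""")
con.executemany("INSERT INTO matches (match_id, score) VALUES (?, ?)",
                [(i, i / 10.0) for i in range(1, 11)])

# Archive the results computed "last week" and flag those matches.
con.execute("INSERT INTO archived_results "
            "SELECT match_id, score FROM matches WHERE match_id <= 7")
con.execute("UPDATE matches SET archived = 1 WHERE match_id <= 7")

# Report = cheap archived rows UNION ALL a fresh computation over the rest.
rows = con.execute("""
    SELECT match_id, score FROM archived_results
    UNION ALL
    SELECT match_id, score FROM matches WHERE archived = 0
""").fetchall()
print(len(rows))  # all 10 matches, but only 3 computed "live"
```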
People's comments about loading whole tables into memory got me thinking. I searched for "MySQL memory allocation" and learned how to change the buffer pool size for InnoDB tables. The default is much smaller than my database, so I ramped it up to 8 GB using the innodb_buffer_pool_size directive in my.cnf. The speed of the query increased drastically from 1308 seconds to only 114.
After researching more settings, my my.cnf file now looks like the following (no further speed improvements, but it should be better in other situations).
[mysqld]
bind-address=127.0.0.1
character-set-server=utf8
collation-server=utf8_general_ci
innodb_buffer_pool_size=8G
innodb_buffer_pool_dump_at_shutdown=1
innodb_buffer_pool_load_at_startup=1
innodb_flush_log_at_trx_commit=2
innodb_log_buffer_size=8M
innodb_log_file_size=64M
innodb_read_io_threads=64
innodb_write_io_threads=64
Thanks everyone for taking the time to help out. This will be a massive improvement to my website.
Related
I have a somewhat complex (to me) query where I am joining three tables. I have been steadily trying to optimize it, reading how to improve things by looking at the EXPLAIN output.
One of the tables person_deliveries is growing by one to two million records per day, so the query is taking longer and longer due to my poor optimization. Any insight would be GREATLY appreciated.
Here is the query:
SELECT
DATE(pdel.date) AS date,
pdel.ip_address AS ip_address,
pdel.sending_campaigns_id AS campaigns_id,
(substring_index(pe.email, '#', -1)) AS recipient_domain,
COUNT(DISTINCT(concat(pdel.emails_id, pdel.date))) AS deliveries,
COUNT(CASE WHEN pdel.ip_address = pc.ip_address AND pdel.sending_campaigns_id = pc.campaigns_id AND pdel.emails_id = pc.emails_id THEN pdel.emails_id ELSE NULL END) AS complaints
FROM
person_deliveries pdel
LEFT JOIN person_complaints pc on pc.ip_address = pdel.ip_address
LEFT JOIN person_emails pe ON pe.id = pdel.emails_id
WHERE
(pdel.date >= '2022-03-11' AND pdel.date <= '2022-03-12')
AND pe.id IS NOT NULL
AND pdel.ip_address is NOT NULL
GROUP BY date(pdel.date), pdel.ip_address, pdel.sending_campaigns_id
ORDER BY date(pdel.date), INET_ATON(pdel.ip_address), pdel.sending_campaigns_id ASC ;
Here is the output of EXPLAIN:
+----+-------------+-------+------------+--------+------------------------------------------------+------------+---------+----------------------------+---------+----------+---------------------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+------------------------------------------------+------------+---------+----------------------------+---------+----------+---------------------------------------------------------------------+
| 1 | SIMPLE | pdel | NULL | range | person_campaign_date,ip_address,date,emails_id | date | 5 | NULL | 2333678 | 50.00 | Using index condition; Using where; Using temporary; Using filesort |
| 1 | SIMPLE | pe | NULL | eq_ref | PRIMARY | PRIMARY | 4 | subscriber.pdel.emails_id | 1 | 100.00 | NULL |
| 1 | SIMPLE | pc | NULL | ref | ip_address | ip_address | 18 | subscriber.pdel.ip_address | 128 | 100.00 | NULL |
+----+-------------+-------+------------+--------+------------------------------------------------+------------+---------+----------------------------+---------+----------+---------------------------------------------------------------------+
I added a few indexes to get it to this point, but the query still takes an extraordinary amount of resources/time to process.
I know I am missing something here, either an index or using a function that is causing it to be slow, but from everything I have read I haven't figured it out yet.
UPDATE:
I neglected to include table info, so I am providing that to be more helpful.
person_deliveries:
CREATE TABLE `person_deliveries` (
`emails_id` int unsigned NOT NULL,
`sending_campaigns_id` int NOT NULL,
`date` datetime NOT NULL,
`vmta` varchar(255) DEFAULT NULL,
`ip_address` varchar(15) DEFAULT NULL,
`sending_domain` varchar(255) DEFAULT NULL,
UNIQUE KEY `person_campaign_date` (`emails_id`,`sending_campaigns_id`,`date`),
KEY `ip_address` (`ip_address`),
KEY `sending_domain` (`sending_domain`),
KEY `sending_campaigns_id` (`sending_campaigns_id`),
KEY `date` (`date`),
KEY `emails_id` (`emails_id`)
person_complaints:
CREATE TABLE `person_complaints` (
`emails_id` int unsigned NOT NULL,
`campaigns_id` int unsigned NOT NULL,
`complaint_datetime` datetime DEFAULT NULL,
`added_datetime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`ip_address` varchar(15) DEFAULT NULL,
`networks_id` int DEFAULT NULL,
`mailing_domains_id` int DEFAULT NULL,
UNIQUE KEY `email_campaign_date` (`emails_id`,`campaigns_id`,`complaint_datetime`),
KEY `ip_address` (`ip_address`)
person_emails:
CREATE TABLE `person_emails` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
`data_providers_id` tinyint unsigned DEFAULT NULL,
`email` varchar(255) NOT NULL,
`email_md5` varchar(255) DEFAULT NULL,
`original_import` timestamp NULL DEFAULT NULL,
`last_import` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
UNIQUE KEY `email` (`email`),
KEY `data_providers_id` (`data_providers_id`),
KEY `email_md5` (`email_md5`)
Hopefully this extra info helps.
Too many questions to fit in comments.
It appears that for the date criteria you are only pulling a SINGLE date. Is this always the case, or just in this sample? And is pdel.date stored as a date or a date/time? Your query does >= '2022-03-11' AND <= '2022-03-12'. Is this because you are trying to get up to and including 2022-03-11 at 11:59:59pm? If so, should it be LESS than '2022-03-12'?
If your counts are based on a single day and this data is rather fixed (that is, you are not going to be changing deliveries, etc. on a day that has already passed), this might be a candidate for a stored aggregate table built on a daily basis. That way, when you are looking for activity patterns, the non-changing aggregates are already computed and you can query just those. Then, if you need the details, go back to the raw data.
These indexes are "covering", which should help some:
pdel: INDEX(date, ip_address, sending_campaigns_id, emails_id)
pc: INDEX(ip_address, campaigns_id, emails_id)
Assuming date is a DATETIME, this contains an extra midnight:
WHERE pdel.date >= '2022-03-11'
AND pdel.date <= '2022-03-12'
I prefer the pattern:
WHERE pdel.date >= '2022-03-11'
AND pdel.date < '2022-03-11' + INTERVAL 1 DAY
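The off-by-one-midnight is easy to see outside SQL too. A small Python sketch of both range styles, using three made-up timestamps (timedelta(days=1) plays the role of INTERVAL 1 DAY):

```python
from datetime import datetime, timedelta

rows = [datetime(2022, 3, 11, 0, 0, 0),
        datetime(2022, 3, 11, 23, 59, 59),
        datetime(2022, 3, 12, 0, 0, 0)]   # the extra midnight

start = datetime(2022, 3, 11)        # MySQL coerces '2022-03-11' to this
end = start + timedelta(days=1)      # '2022-03-11' + INTERVAL 1 DAY

closed = [d for d in rows if start <= d <= end]    # >= AND <=
half_open = [d for d in rows if start <= d < end]  # >= AND <

print(len(closed), len(half_open))  # 3 2
```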
When the GROUP BY and ORDER BY are different, an extra sort is (usually) required. So, write the GROUP BY to be just like the ORDER BY (after removing "ASC").
A minor simplification (and speedup):
COUNT(DISTINCT(concat(pdel.emails_id, pdel.date))) AS deliveries,
-->
COUNT(DISTINCT pdel.emails_id, pdel.date) AS deliveries,
Consider storing the numeric version of the IPv4 in INT UNSIGNED (only 4 bytes) instead of a VARCHAR. It will be smaller and you can eliminate some conversions, but will add an INET_NTOA in the SELECT.
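If you go that route, the conversion that INET_ATON/INET_NTOA do on the server can be mirrored in application code. A Python sketch of the same mapping:

```python
import socket
import struct

def ip_to_int(ip: str) -> int:
    """Dotted quad -> 32-bit integer, like MySQL's INET_ATON."""
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def int_to_ip(n: int) -> str:
    """32-bit integer -> dotted quad, like MySQL's INET_NTOA."""
    return socket.inet_ntoa(struct.pack("!I", n))

print(ip_to_int("192.168.1.10"))  # 3232235786
print(int_to_ip(3232235786))      # 192.168.1.10

# Bonus: the integers also sort correctly, where VARCHAR does not.
assert "10.0.0.2" < "9.9.9.9"                        # string order is wrong
assert ip_to_int("10.0.0.2") > ip_to_int("9.9.9.9")  # numeric order is right
```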
The COUNT(CASE ... ) can be simplified to
SUM( pdel.ip_address = pc.ip_address
AND pdel.sending_campaigns_id = pc.campaigns_id
AND pdel.emails_id = pc.emails_id ) AS complaints
In
(substring_index(pe.email, '#', -1)) AS recipient_domain,
I think it should be 1, not -1, or the alias is 'wrong'.
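For reference, SUBSTRING_INDEX with a positive count keeps everything before the Nth delimiter counted from the left, and with a negative count everything after the Nth delimiter counted from the right. A small Python approximation of that behavior (my own model, assuming a nonzero count):

```python
def substring_index(s: str, delim: str, count: int) -> str:
    """Rough model of MySQL's SUBSTRING_INDEX for nonzero count."""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # before the Nth delim (from left)
    return delim.join(parts[count:])       # after the Nth delim (from right)

# With '@' as the delimiter: 1 keeps the local part, -1 keeps the domain.
print(substring_index("user@example.com", "@", 1))   # user
print(substring_index("user@example.com", "@", -1))  # example.com
```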
Please change LEFT JOIN pe ... WHERE pe.id IS NOT NULL to equivalent, but simpler JOIN pe without the null test.
Sorry, but those will not provide a huge performance improvement. The next step would be to build and maintain a Summary Table and use that to generate the desired 'report'. (See DRapp's Answer.)
I have a rather large database where I would like to search/filter on a MEDIUMTEXT (tags), DATETIME (created_time) and a BIT (include) column.
Let's say the database looks like this:
+------+-----------------------+--------------------------+---------+
| id | created_time | tags | include |
|(INT) | (DATETIME) | (MEDIUMTEXT) | (BIT) |
+------+-----------------------+--------------------------+---------+
| 1 | '2017-02-20 08:58:06' | 'client 1' | 1 |
| 2 | '2017-03-01 18:12:00' | 'client 1 and client 2' | 0 |
| 3 | '2017-03-02 02:52:35' | 'client 3 plus client 1' | 0 |
| 4 | '2017-03-03 12:41:58' | 'client 1' | 1 |
| 5 | '2017-03-05 18:03:12' | 'client 2, client 3' | 1 |
| 6 | '2017-03-06 20:25:45' | 'client 1 and client 3' | 0 |
| 7 | '2017-03-08 22:51:22' | 'client 1' | 1 |
+------+-----------------------+--------------------------+---------+
I have indexed the DATETIME and BIT columns and I have used a FULLTEXT index on the MEDIUMTEXT column.
If I run this statement:
select statement 1
------------------
SELECT COUNT(*)
FROM database
WHERE (MATCH(tags) AGAINST('"client 1"' IN BOOLEAN MODE))
AND created_time >= '2017-03-01 12:00:00'
AND include = 0;
It takes 14 sec. to run and returns 6700 rows.
However, if I run:
select statement 2
------------------
SELECT COUNT(*)
FROM database
WHERE (MATCH(tags) AGAINST('"client 1"' IN BOOLEAN MODE));
It takes 0.4 sec. to run and returns 145000 rows, and if I run:
select statement 3
------------------
SELECT COUNT(*)
FROM database
WHERE created_time >= '2017-03-01 12:00:00'
AND include = 0;
It takes 0.5 sec. to run and returns 25000 rows.
Now my question is, how do I make ‘select statement 1’ run faster? Do I need to first run ‘select statement 2’ and then run ‘select statement 3’ on the results? If so, how? Does anyone have experience with UNION, and can I use it here? Or is there a way I can create a multiple-column index combining a regular INDEX and a FULLTEXT index?
Added info on the actual table (and not the example above) with special thanks to #rick-james
Query 1:
SELECT SQL_NO_CACHE count(*)
FROM Twitter_tweet
WHERE created_time >= '2017-01-01 23:00:00'
AND MATCH(tags) AGAINST('\"dkpol\"' IN BOOLEAN MODE);
Query 2:
SELECT SQL_NO_CACHE count(*)
FROM Twitter_tweet
WHERE MATCH(tags) AGAINST('\"dkpol\"' IN BOOLEAN MODE);
Query 3:
SELECT SQL_NO_CACHE count(*)
FROM Twitter_tweet
WHERE created_time >= '2017-01-01 23:00:00';
EXPLAIN for the 3 queries:
+----+-------------+---------------+----------+----------------------------------------------------+--------------------+---------+-------+--------+----------+-----------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+----------+----------------------------------------------------+--------------------+---------+-------+--------+----------+-----------------------------------+
| 1 | SIMPLE | Twitter_tweet | fulltext | created_time_INDEX,SELECT_tags_INDEX,tags_FULLTEXT | tags_FULLTEXT | 0 | const | 1 | 50.00 | Using where; Ft_hints: no_ranking |
+----+-------------+---------------+----------+----------------------------------------------------+--------------------+---------+-------+--------+----------+-----------------------------------+
| 2 | SIMPLE | | | | | | | | | Select tables optimized away |
+----+-------------+---------------+----------+----------------------------------------------------+--------------------+---------+-------+--------+----------+-----------------------------------+
| 3 | SIMPLE | Twitter_tweet | range | created_time_INDEX,SELECT_tags_INDEX | created_time_INDEX | 6 | | 572286 | 100.00 | Using where; Using index |
+----+-------------+---------------+----------+----------------------------------------------------+--------------------+---------+-------+--------+----------+-----------------------------------+
SHOW CREATE TABLE:
CREATE TABLE `Twitter_tweet` (
`post_id` bigint(20) unsigned NOT NULL,
`from_user_id` bigint(20) unsigned NOT NULL,
`from_user_username` tinytext,
`from_user_fullname` tinytext,
`message` mediumtext,
`created_time` datetime DEFAULT NULL,
`quoted_post_id` bigint(20) unsigned DEFAULT NULL,
`quoted_user_id` bigint(20) unsigned DEFAULT NULL,
`quoted_user_username` tinytext,
`quoted_user_fullname` tinytext,
`to_post_id` bigint(20) unsigned DEFAULT NULL,
`to_user_id` bigint(20) unsigned DEFAULT NULL,
`to_user_username` tinytext,
`truncated` bit(1) DEFAULT NULL,
`is_retweet` bit(1) DEFAULT NULL,
`retweeting_post_id` bigint(20) unsigned DEFAULT NULL,
`retweeting_user_id` bigint(20) unsigned DEFAULT NULL,
`retweeting_user_username` tinytext,
`retweeting_user_fullname` tinytext,
`tags` text,
`mentions_user_id` text,
`mentions_user_username` text,
`mentions_user_fullname` text,
`post_urls` text,
`count_favourite` int(11) DEFAULT NULL,
`count_retweet` int(11) DEFAULT NULL,
`lang` tinytext,
`location_longitude` float(13,10) DEFAULT NULL,
`location_latitude` float(13,10) DEFAULT NULL,
`place_id` tinytext,
`place_fullname` tinytext,
`source` tinytext,
`fetchtime` datetime DEFAULT NULL,
PRIMARY KEY (`post_id`),
UNIQUE KEY `post_id_UNIQUE` (`post_id`),
KEY `from_user_id_INDEX` (`from_user_id`),
KEY `quoted_user_id_INDEX` (`quoted_user_id`),
KEY `to_user_id_INDEX` (`to_user_id`),
KEY `retweeting_user_id_INDEX` (`retweeting_user_id`),
KEY `created_time_INDEX` (`created_time`),
KEY `retweeting_post_id_INDEX` (`retweeting_post_id`),
KEY `post_all_id_INDEX` (`post_id`,`retweeting_post_id`,`to_post_id`,`quoted_post_id`),
KEY `quoted_post_id_INDEX` (`quoted_post_id`),
KEY `to_post_id_INDEX` (`to_post_id`),
KEY `is_retweet_INDEX` (`is_retweet`),
KEY `SELECT_tags_INDEX` (`created_time`,`is_retweet`,`post_id`),
FULLTEXT KEY `tags_FULLTEXT` (`tags`),
FULLTEXT KEY `mentions_user_id_FULLTEXT` (`mentions_user_id`),
FULLTEXT KEY `message_FULLTEXT` (`message`),
FULLTEXT KEY `content_select` (`tags`,`message`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
When timing, do two things:
Turn off the Query cache (or use SELECT SQL_NO_CACHE...)
Run the query twice.
When a query is run, these happen:
Check the QC to see if exactly the same query was recently run; if so, return the result from that run. This usually takes ~1ms. (This is not what happened in the examples you gave.)
Perform the query. Now there are multiple sub-cases:
If the "buffer pool" is 'cold', this is likely to involve lots of I/O. I/O is slow. This may explain your 14 second run.
If the desired data is cached in RAM, then it will run faster. This probably explains why the other two runs were a lot faster.
If, after compensating for these, you still have issues, please provide SHOW CREATE TABLE and EXPLAIN SELECT ... for the cases. (There could be other factors involved.)
Schema critique
One way to improve performance (some) is to shrink the data.
lang tinytext, -- there is a 5 char standard
BIGINT takes 8 bytes. A 4-byte INT is enough for half the people in the world. (But first verify that your AUTO_INCREMENTs are not burning a lot of ids.)
For subtle reasons, VARCHAR(255) is better than TINYTEXT, even though they seem equivalent. Whenever practical, use something less than 255.
FLOAT(13,10) has some issues; I recommend DECIMAL(8,6)/(9,6) as sufficient for distinguishing two tweeters sitting next to each other (not that GPS is that precise).
A PRIMARY KEY is a UNIQUE key; get rid of the redundant UNIQUE.
With INDEX(a, b), you don't also need INDEX(a). (at least 2 cases of such)
Bulk
What will you do with 6700 or 25000 rows in the resultset? I ask because the effort of returning lots of rows is part of the performance problem. If your next step is to further whittle down the output, then it may be better to do the whittling in SQL.
Analysis
Looking at the second set of Queries:
FT + date range. This first did the FT search, then further filtered by date.
FT, count results, quit. Note that all of that was done in the EXPLAIN, hence "Select tables optimized away" -- and the EXPLAIN time is the same as the SELECT time.
Scan one index for an estimated 572K rows -- done entirely in the index. This cannot be improved. However, it can be made severely worse -- such as by adding a seemingly innocuous AND include = 0. In this case it would not be able to use just the index, but instead would have to bounce between the index and the data -- a lot more costly. A cure for this case: INDEX(include, created_time), which would run faster.
COUNT(*) is potentially cheap -- no need to return lots of data, often can be completed within an index, etc.
SELECT col1, col2 is faster than SELECT * -- especially because of TEXT columns.
I have a query whose purpose is to generate statistics on how many musical works (tracks) have been downloaded from a site at different periods (by month, by quarter, by year, etc). The query operates on the tables entityusage, entityusage_file and track.
To get the number of downloads for tracks belonging to a specific album, I would do the following query:
select
date_format(eu.updated, '%Y-%m-%d') as p, count(eu.id) as c
from entityusage as eu
inner join entityusage_file as euf
ON euf.entityusage_id = eu.id
inner join track as t
ON t.id = euf.track_id
where
t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
and entitytype = 't'
and action = 1
group by date_format(eu.updated, '%Y%m%d')
I need to set entitytype = 't' as the entityusage can hold downloads of other entities as well (if entitytype = 'a' then an entire album would have been downloaded, and entityusage_file would then hold all tracks which the album "translated" into at the point of download).
This query takes 40 - 50 seconds. I've been trying to optimize this query for a while, but I have the feeling that I'm approaching this the wrong way.
This is one out of 4 similar queries which must run to generate a report. The report should preferably be able to finish while a user waits for it. Right now, I'm looking at 3 - 4 minutes. That's a long time to wait.
Can this query be optimised further with indexes, or do I need to take another approach to get this job done?
CREATE TABLE `entityusage` (
`id` char(36) NOT NULL,
`title` varchar(255) DEFAULT NULL,
`entitytype` varchar(5) NOT NULL,
`entityid` char(36) NOT NULL,
`externaluser` int(10) NOT NULL,
`action` tinyint(1) NOT NULL,
`updated` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `e` (`entityid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `entityusage_file` (
`id` char(36) NOT NULL,
`entityusage_id` char(36) NOT NULL,
`track_id` char(36) NOT NULL,
`file_id` char(36) NOT NULL,
`type` varchar(3) NOT NULL,
`quality` int(1) NOT NULL,
`size` int(20) NOT NULL,
`updated` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `file_id` (`file_id`),
KEY `entityusage_id` (`entityusage_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `track` (
`id` char(36) NOT NULL,
`album_id` char(36) NOT NULL,
`number` int(3) NOT NULL DEFAULT '0',
`title` varchar(255) DEFAULT NULL,
`updated` datetime NOT NULL DEFAULT '2000-01-01 00:00:00',
PRIMARY KEY (`id`),
KEY `album` (`album_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 CHECKSUM=1 DELAY_KEY_WRITE=1 ROW_FORMAT=DYNAMIC;
An EXPLAIN on the query gives me the following :
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
| 1 | SIMPLE | eu | ALL | NULL | NULL | NULL | NULL | 7832817 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | euf | ref | entityusage_id | entityusage_id | 108 | func | 1 | Using index condition |
| 1 | SIMPLE | t | eq_ref | PRIMARY,album | PRIMARY | 108 | trackerdatabase.euf.track_id | 1 | Using where |
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
This is your query:
select date_format(eu.updated, '%Y-%m-%d') as p, count(eu.id) as c
from entityusage eu join
entityusage_file euf
on euf.entityusage_id = eu.id join
track t
on t.id = euf.track_id
where t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7' and
eu.entitytype = 't' and
eu.action = 1
group by date_format(eu.updated, '%Y%m%d');
I would suggest indexes on track(album_id, id), entityusage_file(track_id, entityusage_id), and entityusage(id, entitytype, action).
Assuming that entityusage_file is mostly a many:many mapping table, see this for tips on improving it. Note that it calls for getting rid of the id and making a pair of 2-column indexes, one of which is the PRIMARY KEY(track_id, entityusage_id). Since your table has a few extra columns, that link does not cover everything.
The UUIDs could be shrunk from 108 bytes to 36, then to 16 by going to BINARY(16) and using a compression function. Many exist (including a builtin pair in version 8.0); here's mine.
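As a quick illustration of the 36-character versus 16-byte point, using Python's uuid module (MySQL 8.0's UUID_TO_BIN/BIN_TO_UUID do the equivalent server-side):

```python
import uuid

u = "0054a47e-b594-407b-86df-3be078b4e7b7"  # the album_id from the question

# CHAR(36) in utf8 costs up to 108 bytes per key; the raw value is 16 bytes.
packed = uuid.UUID(u).bytes
print(len(packed))  # 16

# And back again, losslessly.
restored = str(uuid.UUID(bytes=packed))
print(restored == u)  # True
```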
To explain one thing... The query execution should have started with track (on the assumption that '0054a47e-b594-407b-86df-3be078b4e7b7' is very selective). The hangup was that there was no index to get from there to the next table. Gordon's suggested indexes include such an index.
date_format(eu.updated, '%Y-%m-%d') and date_format(eu.updated, '%Y%m%d') can be simplified to DATE(eu.updated). (No significant performance change.)
(The other Answers and Comments cover a number of issues; I won't repeat them here.)
Because the GROUP BY operation is on an expression involving a function, MySQL can't use an index to optimize that operation. It's going to require a "Using filesort" operation.
I believe the indexes that Gordon suggested are the best bets, given the current table definitions. But even with those indexes, the "tall pole" is the eu table, chunking through and sorting all those rows.
To get more reasonable performance, you may need to introduce a "precomputed results" table. It's going to be expensive to generate the counts for everything... but we can pay that price ahead of time...
CREATE TABLE usage_track_by_day
( updated_dt DATE NOT NULL
, PRIMARY KEY (track_id, updated_dt)
)
AS
SELECT eu.track_id
, DATE(eu.updated) AS updated_dt
, SUM(IF(eu.action = 1, 1, 0)) AS cnt
FROM entityusage eu
WHERE eu.track_id IS NOT NULL
AND eu.updated IS NOT NULL
GROUP
BY eu.track_id
, DATE(eu.updated)
An index ON entityusage (track_id,updated,action) may benefit performance.
Then, we could write a query against the new "precomputed results" table, with a better shot at reasonable performance.
The "precomputed results" table would get stale, and would need to be periodically refreshed.
This isn't necessarily the best solution to the issue, but it's a technique we can use in datawarehouse/datamart applications. This lets us churn through lots of detail rows to get counts one time, and then save those counts for fast access.
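Here is a toy version of that precomputed-results flow (SQLite from Python so it is runnable as-is; the usage_track_by_day layout loosely follows the sketch above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE entityusage (track_id TEXT, action INTEGER, updated TEXT)")
con.executemany("INSERT INTO entityusage VALUES (?, ?, ?)", [
    ("t1", 1, "2023-05-01 10:00:00"),
    ("t1", 1, "2023-05-01 18:30:00"),
    ("t1", 0, "2023-05-01 19:00:00"),   # not a download (action != 1)
    ("t2", 1, "2023-05-02 09:15:00"),
])

# Pay the aggregation cost once, up front.
con.execute("""
    CREATE TABLE usage_track_by_day AS
    SELECT track_id,
           DATE(updated)   AS updated_dt,
           SUM(action = 1) AS cnt
    FROM entityusage
    GROUP BY track_id, DATE(updated)
""")

# Reports then read the small summary instead of millions of detail rows.
rows = con.execute(
    "SELECT updated_dt, cnt FROM usage_track_by_day WHERE track_id = 't1'"
).fetchall()
print(rows)  # [('2023-05-01', 2)]
```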
Can you try this? I can't really test it without some sample data from you.
In this case the query looks first in table track and then joins the other tables.
SELECT
date_format(eu.updated, '%Y-%m-%d') AS p
, count(eu.id) AS c
FROM track AS t
INNER JOIN entityusage_file AS euf ON t.id = euf.track_id
INNER JOIN entityusage AS eu ON euf.entityusage_id = eu.id
WHERE
t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
AND entitytype = 't'
AND ACTION = 1
GROUP BY date_format(eu.updated, '%Y%m%d');
Been spending several days profiling a wide variety of queries used by a distributed application of ours in a MySQL database. Our app potentially stores millions of records on client database servers and the queries can vary enough that the design of the indexes isn't always clear or easy. A tiny bit of extra overhead on query writes is acceptable if the lookup speed is fast enough.
I've managed to narrow down a few composite indexes that work very well for nearly all of our most common queries. There may be some columns in the below indexes I can weed out, but I need to run tests to be sure.
However, my problem: a certain query actually runs faster when it uses an index that contains fewer of the columns present in its conditions.
The table structure with current composite indexes:
CREATE TABLE IF NOT EXISTS `prism_data` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`epoch` int(10) unsigned NOT NULL,
`action_id` int(10) unsigned NOT NULL,
`player_id` int(10) unsigned NOT NULL,
`world_id` int(10) unsigned NOT NULL,
`x` int(11) NOT NULL,
`y` int(11) NOT NULL,
`z` int(11) NOT NULL,
`block_id` mediumint(5) DEFAULT NULL,
`block_subid` mediumint(5) DEFAULT NULL,
`old_block_id` mediumint(5) DEFAULT NULL,
`old_block_subid` mediumint(5) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `epoch` (`epoch`),
KEY `block` (`block_id`,`action_id`,`player_id`),
KEY `location` (`world_id`,`x`,`z`,`y`,`epoch`,`action_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
I have eight common queries that I've been testing and they all show incredible performance improvement on a database with 50 million records. One query however, doesn't.
The following query returns 11088 rows in (9.77 sec) and uses the location index
SELECT SQL_NO_CACHE id,
epoch,
action,
player,
world_id,
x,
y,
z
FROM prism_data
INNER JOIN prism_players p ON p.player_id = prism_data.player_id
INNER JOIN prism_actions a ON a.action_id = prism_data.action_id
WHERE world_id =
(SELECT w.world_id
FROM prism_worlds w
WHERE w.world = 'world')
AND (a.action = 'world-edit')
AND (prism_data.x BETWEEN -7220 AND -7020)
AND (prism_data.y BETWEEN -22 AND 178)
AND (prism_data.z BETWEEN -9002 AND -8802)
AND prism_data.epoch >= 1392220467;
+----+-------------+------------+--------+----------------+----------+---------+--------------------------------+--------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+----------------+----------+---------+--------------------------------+--------+------------------------------------+
| 1 | PRIMARY | a | ref | PRIMARY,action | action | 77 | const | 1 | Using where; Using index |
| 1 | PRIMARY | prism_data | ref | epoch,location | location | 4 | const | 660432 | Using index condition; Using where |
| 1 | PRIMARY | p | eq_ref | PRIMARY | PRIMARY | 4 | minecraft.prism_data.player_id | 1 | NULL |
| 2 | SUBQUERY | w | ref | world | world | 767 | const | 1 | Using where; Using index |
+----+-------------+------------+--------+----------------+----------+---------+--------------------------------+--------+------------------------------------+
If I remove the world condition, the query no longer matches the location index and uses the epoch index instead. Amazingly, it then returns the same 11088 rows in (0.31 sec).
9.77 sec versus 0.31 sec is too much of a difference to ignore. I don't understand why I'm not seeing such a performance kill on my other queries that use the location index, but more importantly, I don't know what I can do to fix this.
Presumably, the "epoch" index is more selective than the "location" index.
Note that MySQL might be running the subquery once for every row. That per-row overhead adds up even with an index: doing 30 million index lookups takes real time.
Try doing the query this way:
SELECT SQL_NO_CACHE id,
epoch,
action,
player,
world_id,
x,
y,
z
FROM prism_data
INNER JOIN prism_players p ON p.player_id = prism_data.player_id
INNER JOIN prism_actions a ON a.action_id = prism_data.action_id
CROSS JOIN (SELECT w.world_id FROM prism_worlds w WHERE w.world = 'world') w
WHERE world_id = w.world_id
AND (a.action = 'world-edit')
AND (prism_data.x BETWEEN -7220 AND -7020)
AND (prism_data.y BETWEEN -22 AND 178)
AND (prism_data.z BETWEEN -9002 AND -8802)
AND prism_data.epoch >= 1392220467;
If this doesn't show an improvement, then the issue is selectivity of the indexes. MySQL is simply making the wrong decision on which is the best index to use. If this does show an improvement, then it is because the subquery is being executed only one time in the from clause.
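You can also test the selectivity theory directly by forcing the optimizer's hand with an index hint (index names taken from the table definition above); comparing the forced epoch plan against the default location plan tells you whether the optimizer is simply mispricing the two:

```sql
-- Force the epoch index to compare its runtime against the
-- optimizer's default choice of the location index
SELECT SQL_NO_CACHE id, epoch, action, player, world_id, x, y, z
FROM prism_data FORCE INDEX (epoch)
INNER JOIN prism_players p ON p.player_id = prism_data.player_id
INNER JOIN prism_actions a ON a.action_id = prism_data.action_id
WHERE world_id =
      (SELECT w.world_id FROM prism_worlds w WHERE w.world = 'world')
  AND a.action = 'world-edit'
  AND prism_data.x BETWEEN -7220 AND -7020
  AND prism_data.y BETWEEN -22 AND 178
  AND prism_data.z BETWEEN -9002 AND -8802
  AND prism_data.epoch >= 1392220467;
```

If the forced plan is consistently faster, `FORCE INDEX` (or the gentler `USE INDEX`) can serve as a stopgap until the index itself is restructured.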
EDIT:
Your location index is:
KEY `location` (`world_id`,`x`,`z`,`y`,`epoch`,`action_id`)
Can you change this to:
KEY `location` (`world_id`, `action_id`, `x`, `z`, `y`, `epoch`)
This allows the where filtering to use action_id as well as x. (Only the columns up to and including the first inequality can use direct index lookups.)
or better yet, one of these:
KEY `location` (`world_id`, `action_id`, `epoch`, `x`, `z`, `y`)
KEY `location` (`world_id`, `epoch`, `action_id`, `x`, `z`, `y`)
KEY `location` (`epoch`, `world_id`, `action_id`, `x`, `z`, `y`)
The idea is to move epoch before x so it will be used for the where clause conditions.
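As a sketch, one of the reorderings above could be swapped in with a single ALTER statement (dropping and re-adding the index in one statement so the table is only rebuilt once); which column order is best depends on your actual workload, so benchmark before committing to one:

```sql
-- Replace the existing location index with a reordered version
ALTER TABLE prism_data
  DROP INDEX `location`,
  ADD INDEX `location` (`world_id`, `action_id`, `epoch`, `x`, `z`, `y`);
```

On a 50-million-row InnoDB table this rebuild will take a while, so schedule it during a quiet period.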
I have this table, which contains around 80,000,000 rows.
CREATE TABLE `mytable` (
`date` date NOT NULL,
`parameters` mediumint(8) unsigned NOT NULL,
`num` tinyint(3) unsigned NOT NULL,
`val1` int(11) NOT NULL,
`val2` int(10) NOT NULL,
`active` tinyint(3) unsigned NOT NULL,
`ref` int(10) unsigned NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`ref`) USING BTREE,
KEY `parameters` (`parameters`)
) ENGINE=MyISAM AUTO_INCREMENT=79092001 DEFAULT CHARSET=latin1
it's articulated around 2 main columns: "parameters" and "date".
there are around 67,000 possible values for "parameters"
for each "parameters" there are around 1200 rows, each with a different date.
so for each date, there are 67,000 rows.
1200 * 67,000 = 80,400,000.
table size appears as 1.5GB, index size 1.4GB.
now, I want to query the table to retrieve all rows of one "parameters"
(actually I want to do it for each parameter, but this is a good start)
SELECT val1 FROM mytable WHERE parameters=1;
the first run gives me results in 8 seconds
subsequent runs for different but close values of parameters (2, 3, 4...) are instantaneous
a run for a "far away" value (parameters=1000) gives me results in 8 seconds again.
I did tests running the same query without the index and got results in 20 seconds, so I guess the index is kicking in (as EXPLAIN shows), but it's not giving a drastic jump in performance:
+----+-------------+----------+------+---------------+------------+---------+-------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+------+---------------+------------+---------+-------+------+-------+
| 1 | SIMPLE | mytable | ref | parameters | parameters | 3 | const | 1097 | |
+----+-------------+----------+------+---------------+------------+---------+-------+------+-------+
but I'm still baffled by the time taken for such an easy request (no join, reading directly from the index).
the server is a two-year-old machine with two quad-core 2.6 GHz CPUs, running Ubuntu with 4 GB of RAM.
I've raised the key_buffer_size parameter to 1G and restarted MySQL, but noticed no change whatsoever.
should I consider this normal? Or is there something I'm doing wrong? I get the feeling that with the right config, this request should be almost immediate.
Try using a covering index, i.e. an index that includes both of the columns your query needs. MySQL then won't need a second disk I/O to fetch the values from the main table, since the data is right there in the index.
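For your query, a minimal sketch would be (the index name `parameters_val1` is just a suggestion):

```sql
-- Covering index: the query can be answered from the index alone
ALTER TABLE mytable ADD INDEX parameters_val1 (parameters, val1);

-- EXPLAIN should now show "Using index" in the Extra column,
-- meaning no row lookups against the main table are needed
EXPLAIN SELECT val1 FROM mytable WHERE parameters = 1;
```

Note that the existing single-column `parameters` index becomes redundant once this one exists (it is a leftmost prefix of the new index), so you can drop it to save the 1.4 GB of index space it currently occupies.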