I am new to SQL (using MySQL Workbench, Community edition) and am not sure where to begin with this problem.
Here is the overview: I have two tables in my food database: branded_food and food_nutrient
The important columns in branded_food are fdc_id and kcals.
The important columns in food_nutrient are fdc_id, nutrient_id, and value.
branded_food's fdc_id column refers to food_nutrient's fdc_id column. However, joining on fdc_id alone returns every nutrient in the food, when I only want the value entry for nutrient_id 208.
Here is an example:
branded_food looks like:
fdc_id | kcals
-----------------
123 | (Empty)
456 | (Empty)
... | (Empty)
food_nutrient looks like:
fdc_id | nutrient_id | value
----------------------------
123 | 203 | 23
123 | 204 | 25
123 | ... | ...
123 | 208 | 500
Essentially, I would like to write some sort of loop that goes through each fdc_id in branded_food, finds the row in food_nutrient that has that fdc_id and nutrient_id 208, and then populates kcals for that fdc_id's row in branded_food. Thus the first example row should populate like:
fdc_id | kcals
-----------------
123 | 500
As an update, I have looked at INNER JOIN and have created this:
SELECT food_nutrient.amount, food_branded_food.description, food_branded_food.fdc_id
FROM food_nutrient
INNER JOIN food_branded_food ON food_nutrient.fdc_id = food_branded_food.fdc_id
WHERE food_nutrient.nutrient_id = 208
LIMIT 1;
This will correctly display the kcals together with food_branded_food.description (the name of the food) and food_branded_food.fdc_id. I limit to 1 because the query otherwise takes very long. Is there a better way?
Update #2: Here is something I recently tried, but just spins forever:
UPDATE backup_branded_food bf
INNER JOIN (
SELECT food_nutrient.fdc_id, food_nutrient.amount AS amt FROM food_nutrient WHERE food_nutrient.nutrient_id = 208
) mn ON bf.fdc_id = mn.fdc_id
SET bf.kcals = mn.amt
WHERE bf.kcals IS NULL;
Running EXPLAIN: (output not shown)
And SHOW CREATE TABLE food_nutrient:
| food_nutrient | CREATE TABLE `food_nutrient` (
`id` bigint DEFAULT NULL,
`fdc_id` bigint DEFAULT NULL,
`nutrient_id` bigint DEFAULT NULL,
`amount` bigint DEFAULT NULL,
`data_points` bigint DEFAULT NULL,
`derivation_id` bigint DEFAULT NULL,
`min` double DEFAULT NULL,
`max` double DEFAULT NULL,
`median` double DEFAULT NULL,
`loq` text,
`footnote` text,
`min_year_acquired` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
Running SHOW CREATE TABLE backup_branded_food (I use a backup of branded_food instead of the actual table):
| backup_branded_food | CREATE TABLE `backup_branded_food` (
`fdc_id` bigint DEFAULT NULL,
`data_type` text,
`description` text,
`food_category_id` bigint DEFAULT NULL,
`publication_date` text,
`brand_owner` varchar(255) DEFAULT NULL,
`brand_name` varchar(255) DEFAULT NULL,
`serving_size` double DEFAULT NULL,
`serving_size_unit` varchar(50) DEFAULT NULL,
`kcals` double DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
Table Indexes:
The table structure info obtained from SHOW CREATE TABLE table_name shows that both tables don't have any indexes or a primary key. This is probably why your query runs very slowly. To quickly fix this, let's start by adding indexes on the columns that appear in WHERE and in ON (in the JOIN):
ALTER TABLE food_nutrient
ADD INDEX fdc_id(fdc_id),
ADD INDEX nutrient_id(nutrient_id);
ALTER TABLE branded_food
ADD INDEX fdc_id(fdc_id);
With these indexes added, the EXPLAIN shows the following:
+----+-------------+-------+------------+------+--------------------+-------------+---------+-----------------------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys      | key         | key_len | ref                   | rows | filtered | Extra       |
+----+-------------+-------+------------+------+--------------------+-------------+---------+-----------------------+------+----------+-------------+
|  1 | SIMPLE      | fn    | NULL       | ref  | fdc_id,nutrient_id | nutrient_id | 9       | const                 |    1 |   100.00 | Using where |
|  1 | SIMPLE      | bf    | NULL       | ref  | fdc_id             | fdc_id      | 9       | db_40606077.fn.fdc_id |    1 |   100.00 | NULL        |
+----+-------------+-------+------------+------+--------------------+-------------+---------+-----------------------+------+----------+-------------+
Since I don't know the size of the tables, I can't really test how quick the query will be after adding these indexes, but I assume this will improve the query speed significantly.
P.S.: Normally, you would have at least one column assigned as PRIMARY KEY, which will never contain duplicates. In your table food_nutrient, the id column might be the PRIMARY KEY, and there's a possible unique combination of fdc_id and nutrient_id. Therefore, you might consider adding a UNIQUE KEY on those two columns apart from adding a PRIMARY KEY on id. See 24.6.1 Partitioning Keys, Primary Keys, and Unique Keys.
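A minimal sketch of that suggestion, assuming id really has no duplicates or NULLs and no (fdc_id, nutrient_id) pair repeats (the key name is just an example):
ALTER TABLE food_nutrient
ADD PRIMARY KEY (id),
ADD UNIQUE KEY uq_fdc_nutrient (fdc_id, nutrient_id);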
Usage of aliases:
This is to help make your query more readable. You didn't use any in your current query, so you end up appending the full table name to every column you reference:
....
FROM food_nutrient AS fn
INNER JOIN food_branded_food fbf /*can simply be written without "..AS.."*/
ON fn.fdc_id = fbf.fdc_id /*the operation afterwards didn't require you to append full table name*/
...
Similarly, once you've added the table alias, you can use it in SELECT too:
SELECT fn.amount, fbf.description,
fbf.fdc_id AS 'FBF_id'
/*you can also assign a custom/desired alias to your column - as your output column name*/
...
Couldn't find official documentation on the MySQL website, but here's a further explanation from a different site.
Alternative UPDATE syntax:
Your current UPDATE query should be able to perform what you need, but you probably don't need the subquery at all. This UPDATE query should work as well:
UPDATE branded_food bf
JOIN food_nutrient fn ON bf.fdc_id = fn.fdc_id
SET bf.kcals = fn.amount
WHERE fn.nutrient_id = 208
AND bf.kcals IS NULL;
Here's a demo fiddle for reference
An UPDATE with an INNER JOIN gets you your wanted result:
UPDATE branded_food bf
INNER JOIN (
    SELECT fdc_id, SUM(value) AS svalue
    FROM food_nutrient
    GROUP BY fdc_id
) mn ON bf.fdc_id = mn.fdc_id
SET bf.kcals = mn.svalue
WHERE bf.kcals IS NULL;
Related
I have a legacy query that is terribly slow. I'll show the query, and the background to it after.
The query takes ~10s, which is ridiculously slow. EXPLAIN gives me the output shown below, after the table structures.
Query:
select staff.id as Id,
staff.eid as AccountId,
staff.Surname
from staff
LEFT JOIN app_roles ON (app_roles.app_staff_id = staff.id )
where staff.eid = 7227
AND app_roles.application_id = '1'
and staff.last_modified > '2022-05-11 13:15:21Z'
The staff table contains 280k rows; app_roles contains 644k rows. Staff rows with eid 7227: 87 rows. app_roles rows for those matching staff ids: 75 rows.
Table structures:
CREATE TABLE `app_roles` (
`application_id` varchar(40) NOT NULL,
`app_staff_id` varchar(40) NOT NULL,
`role` varchar(40) NOT NULL,
PRIMARY KEY (`application_id`,`app_staff_id`),
KEY `application_id` (`application_id`),
KEY `app_staff_id` (`app_staff_id`)
) ENGINE=InnoDB
CREATE TABLE `staff` (
`eid` int NOT NULL,
`id` varchar(40) NOT NULL,
`forename` varchar(60) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`surname` varchar(150) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`last_modified` DATETIME NOT NULL,
... columns omitted for simplicity
PRIMARY KEY (`eid`,`id`),
KEY `email` (`email`),
KEY `app_login` (`app_login`),
KEY `app_passwd` (`app_password`),
KEY `id` (`id`),
KEY `eid` (`eid`)
) ENGINE=InnoDB
+----+-------------+-----------+------------+--------+-------------------------------------+----------------+---------+---------------------------------------+--------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+--------+-------------------------------------+----------------+---------+---------------------------------------+--------+----------+--------------------------+
| 1 | SIMPLE | app_roles | NULL | ref | PRIMARY,application_id,app_staff_id | application_id | 42 | const | 330114 | 100.00 | Using where; Using index |
| 1 | SIMPLE | staff | NULL | eq_ref | PRIMARY,id,eid | PRIMARY | 126 | const,inventry.app_roles.app_staff_id | 1 | 33.33 | Using where |
+----+-------------+-----------+------------+--------+-------------------------------------+----------------+---------+---------------------------------------+--------+----------+--------------------------+
I don't understand why the left join and the where are not filtering rows out, and why the indexes are not helping.
All other things being equal, MySQL likes to do joins by primary key lookup. It has a strong preference for that, because primary key lookups are a bit more efficient than secondary key lookups.
It may even change the order of the join to satisfy this preference. Inner join is commutative, so the optimizer can access either table first and then join to the other.
But you used a LEFT [OUTER] JOIN, so how can this be optimized like an inner join? You wrote a condition app_roles.application_id = '1' in the WHERE clause. If you test for a non-NULL value on the right table of a left outer join, it eliminates any of the rows that would make that join an outer join. It's effectively an inner join. Therefore the optimizer is free to reorder the tables in the join.
Both orders of join result in the join using primary key lookups. In both cases, the first column of the lookup is based on a constant condition in your query. The second column of the lookup is a reference from the first table.
So the optimizer has a dilemma. It can choose either join order, and both satisfy the preference for a primary key lookup. So it chooses one arbitrarily.
The failure is that it apparently didn't take into account that the condition on application_id causes it to examine over 330k rows. Either the optimizer has a blindness to this cost, or else the table statistics are not up to date and are fooling the optimizer.
You can refresh the table statistics. This is easy to do and has very small impact on the running system, so you might as well do it to rule out the possibility that bad statistics are causing a bad query optimization.
ANALYZE TABLE app_roles;
ANALYZE TABLE staff;
Then try your query again.
If it's still choosing a bad optimization strategy, you can use a join hint to force it to use the join order matching what you wrote in your query.
select id as Id,
eid as AccountId,
Surname
from staff
STRAIGHT_JOIN app_roles ON (app_roles.app_staff_id = staff.id )
where staff.eid = 7227
AND app_roles.application_id = '1'
and last_modified > '2022-05-11 13:15:21Z'
There might also be a way to incorporate last_modified into an index, but I can't tell which table it belongs to.
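Assuming last_modified belongs to the staff table, as its CREATE TABLE above suggests, a minimal sketch (the index name is arbitrary):
ALTER TABLE staff ADD INDEX eid_last_modified (eid, last_modified);
That would let the eid = 7227 equality and the last_modified range be resolved from a single index.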
I would assume you have an issue with the character set / collation. Make sure the fields you are joining match. To verify this, run :
SHOW FULL COLUMNS FROM staff;
SHOW FULL COLUMNS FROM app_roles;
More specifically, make sure app_roles.app_staff_id and staff.id are the same type.
These 'composite' and 'covering' indexes should help:
staff: INDEX(eid, last_modified, id, Surname)
app_roles: INDEX(application_id, app_staff_id)
Get rid of the Z on the DATETIME literal; MySQL does not understand such.
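For reference, a sketch of the staff index above as DDL, together with the corrected literal; app_roles' existing PRIMARY KEY (application_id, app_staff_id) already covers the second suggestion (the index name is arbitrary):
ALTER TABLE staff ADD INDEX eid_lm_id_surname (eid, last_modified, id, surname);
-- and in the query: AND staff.last_modified > '2022-05-11 13:15:21'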
I have a MySQL table structured like this:
CREATE TABLE `messages` (
`id` int NOT NULL AUTO_INCREMENT,
`author` varchar(250) COLLATE utf8mb4_unicode_ci NOT NULL,
`message` varchar(2000) COLLATE utf8mb4_unicode_ci NOT NULL,
`serverid` varchar(200) COLLATE utf8mb4_unicode_ci NOT NULL,
`date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`guildname` varchar(1000) COLLATE utf8mb4_unicode_ci NOT NULL,
PRIMARY KEY (`id`,`date`)
) ENGINE=InnoDB AUTO_INCREMENT=27769461 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
I need to query this table for various statistics using date ranges for Grafana graphs, however all of those queries are extremely slow, despite the table being indexed using a composite key of id and date.
"id" is auto-incrementing and date is also always increasing.
The queries generated by Grafana look like this:
SELECT
UNIX_TIMESTAMP(date) DIV 120 * 120 AS "time",
count(DISTINCT(serverid)) AS "servercount"
FROM messages
WHERE
date BETWEEN FROM_UNIXTIME(1615930154) AND FROM_UNIXTIME(1616016554)
GROUP BY 1
ORDER BY UNIX_TIMESTAMP(date) DIV 120 * 120
This query takes over 30 seconds to complete with 27 million records in the table.
Explaining the query results in this output:
+----+-------------+----------+------------+------+---------------+------+---------+------+----------+----------+-----------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+------+---------------+------+---------+------+----------+----------+-----------------------------+
| 1 | SIMPLE | messages | NULL | ALL | PRIMARY | NULL | NULL | NULL | 26952821 | 11.11 | Using where; Using filesort |
+----+-------------+----------+------------+------+---------------+------+---------+------+----------+----------+-----------------------------+
This indicates that MySQL is indeed using the composite primary key I created for indexing the data, but still has to scan almost the entire table, which I do not understand. How can I optimize this table for date range queries?
Plan A:
PRIMARY KEY(date, id), -- to cluster by date
INDEX(id) -- needed to keep AUTO_INCREMENT happy
Assuming the table is quite big, having date at the beginning of the PK puts the rows in the given date range all next to each other. This minimizes (somewhat) the I/O.
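A sketch of Plan A as DDL against the messages table above; note that rebuilding the clustered index copies all ~27M rows, so it is a one-time expensive operation:
ALTER TABLE messages
DROP PRIMARY KEY,
ADD PRIMARY KEY (`date`, id),
ADD INDEX (id);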
Plan B:
PRIMARY KEY(id),
INDEX(date, serverid)
Now the secondary index is exactly what is needed for the one query you have provided. It is optimized for searching by date, and it is smaller than the whole table, hence even faster (I/O-wise) than Plan A.
But, if you have a lot of different queries like this, adding a lot more indexes gets impractical.
Plan C: There may be a still better way:
PRIMARY KEY(id),
INDEX(server_id, date)
In theory, it can hop through that secondary index checking each server_id. But I am not sure that such an optimization exists.
Plan D: Do you need id for anything other than providing a unique PRIMARY KEY? If not, there may be other options.
The index on (id, date) doesn't help because the first key is id not date.
You can either
(a) drop the current index and index (date, id) instead -- when date is in the first place this can be used to filter for date regardless of the following columns -- or
(b) just create an additional index only on (date) to support the query.
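Option (b) as a sketch against the messages table above (the index name is arbitrary):
ALTER TABLE messages ADD INDEX idx_date (`date`);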
I have the following tables:
mysql> show create table rsspodcastitems \G
*************************** 1. row ***************************
Table: rsspodcastitems
Create Table: CREATE TABLE `rsspodcastitems` (
`id` char(20) NOT NULL,
`description` mediumtext,
`duration` int(11) default NULL,
`enclosure` mediumtext NOT NULL,
`guid` varchar(300) NOT NULL,
`indexed` datetime NOT NULL,
`published` datetime default NULL,
`subtitle` varchar(255) default NULL,
`summary` mediumtext,
`title` varchar(255) NOT NULL,
`podcast_id` char(20) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `podcast_id` (`podcast_id`,`guid`),
UNIQUE KEY `UKfb6nlyxvxf3i2ibwd8jx6k025` (`podcast_id`,`guid`),
KEY `IDXkcqf7wi47t3epqxlh34538k7c` (`indexed`),
KEY `IDXt2ofice5w51uun6w80g8ou7hc` (`podcast_id`,`published`),
KEY `IDXfb6nlyxvxf3i2ibwd8jx6k025` (`podcast_id`,`guid`),
KEY `published` (`published`),
FULLTEXT KEY `title` (`title`),
FULLTEXT KEY `summary` (`summary`),
FULLTEXT KEY `subtitle` (`subtitle`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
mysql> show create table station_cache \G
*************************** 1. row ***************************
Table: station_cache
Create Table: CREATE TABLE `station_cache` (
`Station_id` char(36) NOT NULL,
`item_id` char(20) NOT NULL,
`item_type` int(11) NOT NULL,
`podcast_id` char(20) NOT NULL,
`published` datetime NOT NULL,
KEY `Station_id` (`Station_id`,`published`),
KEY `IDX12n81jv8irarbtp8h2hl6k4q3` (`Station_id`,`published`),
KEY `item_id` (`item_id`,`item_type`),
KEY `IDXqw9yqpavo9fcduereqqij4c80` (`item_id`,`item_type`),
KEY `podcast_id` (`podcast_id`,`published`),
KEY `IDXkp2ehbpmu41u1vhwt7qdl2fuf` (`podcast_id`,`published`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
The "item_id" column of the second refers to the "id" column of the former (there isn't a foreign key between the two because the relationship is polymorphic, i.e. the second table may have references to entities that aren't in the first but in other tables that are similar but distinct).
I'm trying to get a query that lists the most recent items in the first table that do not have any corresponding items in the second. The highest performing query I've found so far is:
select i.*,
(select count(station_id)
from station_cache
where item_id = i.id) as stations
from rsspodcastitems i
having stations = 0
order by published desc
I've also considered using a where not exists (...) subquery to perform the restriction, but this was actually slower than the one I have above. But this is still taking a substantial length of time to complete. MySQL's query plan doesn't seem to be using the available indices:
+----+--------------------+---------------+------+---------------+------+---------+------+--------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------------+------+---------------+------+---------+------+--------+----------------+
| 1 | PRIMARY | i | ALL | NULL | NULL | NULL | NULL | 106978 | Using filesort |
| 2 | DEPENDENT SUBQUERY | station_cache | ALL | NULL | NULL | NULL | NULL | 44227 | Using where |
+----+--------------------+---------------+------+---------------+------+---------+------+--------+----------------+
Note that neither portion of the query is using a key, whereas it ought to be able to use KEY published (published) from the primary table and KEY item_id (item_id,item_type) for the subquery.
Any suggestions how I can get an appropriate result without waiting for several minutes?
I would expect the fastest query to be:
select i.*
from rsspodcastitems i
where not exists (select 1
from station_cache sc
where sc.item_id = i.id
)
order by published desc;
This would take advantage of an index on station_cache(item_id) and perhaps rsspodcastitems(published, id).
Your query could be faster if it returns a significant number of rows. Your phrasing of the query allows the index on rsspodcastitems(published) to avoid the file sort. If you remove the group by, the exists version should be faster.
I should note that I like your use of the having clause. When faced with this in the past, I have used a subquery:
select i.*,
(select count(station_id)
from station_cache
where item_id = i.id) as stations
from (select i.*
from rsspodcastitems i
order by published desc
) i
where not exists (select 1
from station_cache sc
where sc.item_id = i.id
);
This allows one index for sorting.
I prefer a slight variation on your method:
select i.*,
(exists (select 1
from station_cache sc
where sc.item_id = i.id
)
) as has_station
from rsspodcastitems i
having has_station = 0
order by published desc;
This should be slightly faster than the version with count().
You might want to detect and remove redundant indexes from your tables. Reviewing your CREATE TABLE information for both tables will help you discover several, including the duplicated (podcast_id, guid), (Station_id, published), (item_id, item_type), and (podcast_id, published) indexes; there may be more. A sketch follows.
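For illustration, dropping the duplicates visible in the CREATE TABLE output above (keep whichever copy of each pair you prefer):
ALTER TABLE rsspodcastitems
DROP INDEX `UKfb6nlyxvxf3i2ibwd8jx6k025`,  -- duplicate of UNIQUE KEY `podcast_id` (podcast_id, guid)
DROP INDEX `IDXfb6nlyxvxf3i2ibwd8jx6k025`; -- non-unique copy of the same pair
ALTER TABLE station_cache
DROP INDEX `IDX12n81jv8irarbtp8h2hl6k4q3`, -- duplicate of `Station_id` (Station_id, published)
DROP INDEX `IDXqw9yqpavo9fcduereqqij4c80`, -- duplicate of `item_id` (item_id, item_type)
DROP INDEX `IDXkp2ehbpmu41u1vhwt7qdl2fuf`; -- duplicate of `podcast_id` (podcast_id, published)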
My eventual solution was to delete the full text indices and use an externally generated index table (produced by iterating over the words in the text, filtering stop words, and applying a stemming algorithm) to allow searching. I don't know why the full text indices were causing performance problems, but they seemed to slow down every query that touched the table even if they weren't used.
I'm trying to speed up a query for the below:
My table has around 4 million records.
EXPLAIN SELECT * FROM chrecords WHERE company_number = 'test' OR MATCH (company_name,registered_office_address_address_line_1,registered_office_address_address_line_2) AGAINST('test') LIMIT 0, 10;
+------+-------------+-----------+------+------------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------+------+------------------+------+---------+------+---------+-------------+
| 1 | SIMPLE | chrecords | ALL | i_company_number | NULL | NULL | NULL | 2208348 | Using where |
+------+-------------+-----------+------+------------------+------+---------+------+---------+-------------+
1 row in set (0.00 sec)
I've created two indexes using the below:
ALTER TABLE `chapp`.`chrecords` ADD INDEX `i_company_number` (`company_number`);
ALTER TABLE `chapp`.`chrecords` ADD FULLTEXT(
`company_name`,
`registered_office_address_address_line_1`,
`registered_office_address_address_line_2`
);
How can "combine" the two indexes however? As the above query takes 15+ seconds to execute (only using one index).
The entire table definition:
CREATE TABLE `chapp`.`chrecords` (
`id` INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
`company_name` VARCHAR(100) NULL,
`company_number` VARCHAR(100) NULL,
`registered_office_care_of` VARCHAR(100) NULL,
`registered_office_po_box` VARCHAR(100) NULL,
`registered_office_address_address_line_1` VARCHAR(100) NULL,
`registered_office_address_address_line_2` VARCHAR(100) NULL,
`registered_office_locality` VARCHAR(100) NULL,
`registered_office_region` VARCHAR(100) NULL,
`registered_office_country` VARCHAR(100) NULL,
`registered_office_postal_code` VARCHAR(100) NULL
);
ALTER TABLE `chapp`.`chrecords` ADD INDEX `i_company_name` (`company_name`);
ALTER TABLE `chapp`.`chrecords` ADD INDEX `i_company_number` (`company_number`);
ALTER TABLE `chapp`.`chrecords` ADD INDEX `i_registered_office_address_address_line_1` (`registered_office_address_address_line_1`);
ALTER TABLE `chapp`.`chrecords` ADD INDEX `i_registered_office_address_address_line_2` (`registered_office_address_address_line_2`);
ALTER TABLE `chapp`.`chrecords` ADD FULLTEXT(
`company_name`,
`registered_office_address_address_line_1`,
`registered_office_address_address_line_2`
);
(
SELECT *
FROM chrecords
WHERE company_number = 'test'
ORDER BY something
LIMIT 10
)
UNION DISTINCT
(
SELECT *
FROM chrecords
WHERE MATCH (company_name, registered_office_address_address_line_1,
registered_office_address_address_line_2)
AGAINST('test')
ORDER BY something
LIMIT 10
)
ORDER BY something
LIMIT 10
Notes:
No need for an outer SELECT
Explicitly say DISTINCT (the default) or ALL (which is faster), so that you will know you thought about whether de-duplication was needed versus speed.
A LIMIT without an ORDER BY is not very meaningful
However, if you just want some rows to look at, you can remove the ORDER BYs.
Yes, the ORDER BY and LIMIT need to be repeated outside so that you get the ordering correct and limit to 10.
If you need an OFFSET, the inner queries need a full count: say LIMIT 50 for 5 pages; then the outside needs to skip to the 5th page: LIMIT 40,10 (sketched below).
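A sketch of that paging pattern against the question's table; ORDER BY company_name is a stand-in for whatever sort key you actually need:
( SELECT * FROM chrecords
  WHERE company_number = 'test'
  ORDER BY company_name LIMIT 50 )
UNION ALL
( SELECT * FROM chrecords
  WHERE MATCH (company_name, registered_office_address_address_line_1,
               registered_office_address_address_line_2) AGAINST ('test')
  ORDER BY company_name LIMIT 50 )
ORDER BY company_name
LIMIT 40, 10;  -- page 5: skip 40 rows, return rows 41-50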
Try using a UNION rather than OR.
SELECT *
FROM (
SELECT *
FROM chrecords
WHERE company_number = 'test'
) a
UNION (
SELECT *
FROM chrecords
WHERE MATCH (company_name,
registered_office_address_address_line_1,
registered_office_address_address_line_2)
AGAINST('test')
LIMIT 0, 10
)
If this helps, it's because MySQL struggles to use more than one index in a single subquery. This gives the query planner two queries.
You can run EXPLAIN on each of the subqueries separately to understand their performance. UNION just puts their results together and eliminates duplicates. If you want to keep the duplicates, do UNION ALL.
Please notice that lots of single-column indexes on MySQL tables are generally harmful to performance. You should refrain from creating indexes unless they're constructed to help specific queries.
I have a query which purpose is to generate statistics for how many musical work (track) has been downloaded from a site at different periods (by month, by quarter, by year etc). The query operates on the tables entityusage, entityusage_file and track.
To get the number of downloads for tracks belonging to an specific album I would do the following query :
select
date_format(eu.updated, '%Y-%m-%d') as p, count(eu.id) as c
from entityusage as eu
inner join entityusage_file as euf
ON euf.entityusage_id = eu.id
inner join track as t
ON t.id = euf.track_id
where
t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
and entitytype = 't'
and action = 1
group by date_format(eu.updated, '%Y%m%d')
I need to set entitytype = 't' as the entityusage can hold downloads of other entities as well (if entitytype = 'a' then an entire album would have been downloaded, and entityusage_file would then hold all tracks which the album "translated" into at the point of download).
This query takes 40 - 50 seconds. I've been trying to optimize this query for a while, but I have the feeling that I'm approaching this the wrong way.
This is one out of 4 similar queries which must run to generate a report. The report should preferably be able to finish while a user waits for it. Right now, I'm looking at 3-4 minutes. That's a long time to wait.
Can this query be optimised further with indexes, or do I need to take another approach to get this job done?
CREATE TABLE `entityusage` (
`id` char(36) NOT NULL,
`title` varchar(255) DEFAULT NULL,
`entitytype` varchar(5) NOT NULL,
`entityid` char(36) NOT NULL,
`externaluser` int(10) NOT NULL,
`action` tinyint(1) NOT NULL,
`updated` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `e` (`entityid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `entityusage_file` (
`id` char(36) NOT NULL,
`entityusage_id` char(36) NOT NULL,
`track_id` char(36) NOT NULL,
`file_id` char(36) NOT NULL,
`type` varchar(3) NOT NULL,
`quality` int(1) NOT NULL,
`size` int(20) NOT NULL,
`updated` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `file_id` (`file_id`),
KEY `entityusage_id` (`entityusage_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `track` (
`id` char(36) NOT NULL,
`album_id` char(36) NOT NULL,
`number` int(3) NOT NULL DEFAULT '0',
`title` varchar(255) DEFAULT NULL,
`updated` datetime NOT NULL DEFAULT '2000-01-01 00:00:00',
PRIMARY KEY (`id`),
KEY `album` (`album_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 CHECKSUM=1 DELAY_KEY_WRITE=1 ROW_FORMAT=DYNAMIC;
An EXPLAIN on the query gives me the following :
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
| 1 | SIMPLE | eu | ALL | NULL | NULL | NULL | NULL | 7832817 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | euf | ref | entityusage_id | entityusage_id | 108 | func | 1 | Using index condition |
| 1 | SIMPLE | t | eq_ref | PRIMARY,album | PRIMARY | 108 | trackerdatabase.euf.track_id | 1 | Using where |
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
This is your query:
select date_format(eu.updated, '%Y-%m-%d') as p, count(eu.id) as c
from entityusage eu join
entityusage_file euf
on euf.entityusage_id = eu.id join
track t
on t.id = euf.track_id
where t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7' and
eu.entitytype = 't' and
eu.action = 1
group by date_format(eu.updated, '%Y%m%d');
I would suggest indexes on track(album_id, id), entityusage_file(track_id, entityusage_id), and entityusage(id, entitytype, action).
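Those suggestions as DDL, for reference (index names are arbitrary):
ALTER TABLE track ADD INDEX idx_album_id (album_id, id);
ALTER TABLE entityusage_file ADD INDEX idx_track_eu (track_id, entityusage_id);
ALTER TABLE entityusage ADD INDEX idx_id_type_action (id, entitytype, action);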
Assuming that entityusage_file is mostly a many:many mapping table, see this for tips on improving it. Note that it calls for getting rid of the id and making a pair of 2-column indexes, one of which is the PRIMARY KEY(track_id, entityusage_id). Since your table has a few extra columns, that link does not cover everything.
The UUIDs could be shrunk from 108 bytes to 36, then to 16, by going to BINARY(16) and using a compression function. Many exist (including a builtin pair in version 8.0); here's mine.
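For reference, the 8.0 builtin pair is UUID_TO_BIN()/BIN_TO_UUID(); a quick check using the album UUID from the question:
SELECT HEX(UUID_TO_BIN('0054a47e-b594-407b-86df-3be078b4e7b7')) AS packed,
       BIN_TO_UUID(UUID_TO_BIN('0054a47e-b594-407b-86df-3be078b4e7b7')) AS unpacked;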
To explain one thing... The query execution should have started with track (on the assumption that '0054a47e-b594-407b-86df-3be078b4e7b7' is very selective). The hangup was that there was no index to get from there to the next table. Gordon's suggested indexes include such.
date_format(eu.updated, '%Y-%m-%d') and date_format(eu.updated, '%Y%m%d') can be simplified to DATE(eu.updated). (No significant performance change.)
(The other Answers and Comments cover a number of issues; I won't repeat them here.)
Because the GROUP BY operation is on an expression involving a function, MySQL can't use an index to optimize that operation. It's going to require a "Using filesort" operation.
I believe the indexes that Gordon suggested are the best bets, given the current table definitions. But even with those indexes, the "tall post" is the eu table, chunking through and sorting all those rows.
To get more reasonable performance, you may need to introduce a "precomputed results" table. It's going to be expensive to generate the counts for everything... but we can pay that price ahead of time...
CREATE TABLE usage_track_by_day
( updated_dt DATE NOT NULL
, PRIMARY KEY (track_id, updated_dt)
)
AS
SELECT eu.track_id  -- note: assumes entityusage carries track_id; with the schema above you would join entityusage_file to get it
     , DATE(eu.updated) AS updated_dt
     , SUM(IF(eu.action = 1, 1, 0)) AS cnt
  FROM entityusage eu
 WHERE eu.track_id IS NOT NULL
   AND eu.updated IS NOT NULL
 GROUP BY eu.track_id
        , DATE(eu.updated);
An index ON entityusage (track_id,updated,action) may benefit performance.
Then, we could write a query against the new "precomputed results" table, with a better shot at reasonable performance.
The "precomputed results" table would get stale, and would need to be periodically refreshed.
This isn't necessarily the best solution to the issue, but it's a technique we can use in datawarehouse/datamart applications. This lets us churn through lots of detail rows to get counts one time, and then save those counts for fast access.
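A hypothetical report query against that precomputed table, reusing the album filter from the question (column names as defined above):
SELECT u.updated_dt AS p, SUM(u.cnt) AS c
FROM usage_track_by_day u
JOIN track t ON t.id = u.track_id
WHERE t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
GROUP BY u.updated_dt;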
Can you try this? I can't really test it without some sample data from you.
In this case the query looks first at table track and then joins the other tables.
SELECT
date_format(eu.updated, '%Y-%m-%d') AS p
, count(eu.id) AS c
FROM track AS t
INNER JOIN entityusage_file AS euf ON t.id = euf.track_id
INNER JOIN entityusage AS eu ON euf.entityusage_id = eu.id
WHERE
t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
AND eu.entitytype = 't'
AND eu.action = 1
GROUP BY date_format(eu.updated, '%Y%m%d');