I am hoping to get some advice on how to optimize the performance of this query I have with an outer join. First I will explain what I am trying to do and then I'll show the code and results.
I have an Accounts table that has a list of all customer accounts. And I have a datausage table which keeps track of how much data each customer is using. A backend process running on multiple servers inserts records into the datausage table each day to keep track of how much usage occurred that day for each customer on that server.
The backend process works like this - if there is no activity on that server for an account on that day, no records are written for that account. If there is activity, one record is written with a "LogDate" of that day. This is happening on multiple servers. So collectively the datausage table winds up with no rows (no activity at all for that customer each day), one row (activity was only on one server for that day), or multiple rows (activity was on multiple servers for that day).
We need to run a report that lists ALL customers, along with their usage for a specific date range. Some customers may have no usage at all (nothing whatsoever in the datausage table). Some customers may have no usage at all for the current period (but usage in other periods).
Regardless of whether there is any usage or not (ever, or for the selected period) we need EVERY customer in the Accounts table to be listed in the report, even if they show no usage. Therefore it seems this required an outer join.
Here is the query I am using:
SELECT
Accounts.accountID as AccountID,
IFNULL(Accounts.name,Accounts.accountID) as AccountName,
AccountPlans.plantype as AccountType,
Accounts.status as AccountStatus,
date(Accounts.created_at) as Created,
sum(IFNULL(datausage.Core,0) + (IFNULL(datausage.CoreDeluxe,0) * 3)) as 'CoreData'
FROM `Accounts`
LEFT JOIN `datausage` on `Accounts`.`accountID` = `datausage`.`accountID`
LEFT JOIN `AccountPlans` on `AccountPlans`.`PlanID` = `Accounts`.`PlanID`
WHERE
(
(`datausage`.`LogDate` >= '2014-06-01' and `datausage`.`LogDate` < '2014-07-01')
or `datausage`.`LogDate` is null
)
GROUP BY Accounts.accountID
ORDER BY `AccountName` asc
This query takes about 2 seconds to run. However it only takes 0.3 seconds to run if the "or datausage.LogDate is NULL" is removed. However, it seems I must have that clause in there, because accounts with no usage are excluded from the result set if that does not appear.
Here is the table data:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+--------+---------------------------------------------------------+---------+---------+----------------------+------- +----------------------------------------------------+
| 1 | SIMPLE | Accounts | ALL | PRIMARY,accounts_planid_foreign,accounts_cardid_foreign | NULL | NULL | NULL | 57 | Using temporary; Using filesort |
| 1 | SIMPLE | datausage | ALL | NULL | NULL | NULL | NULL | 96805 | Using where; Using join buffer (Block Nested Loop) |
| 1 | SIMPLE | AccountPlans | eq_ref | PRIMARY | PRIMARY | 4 | mydb.Accounts.planID | 1 | NULL |
The indexes on Accounts table are as follows:
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------+------------+-------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Accounts | 0 | PRIMARY | 1 | accountID | A | 57 | NULL | NULL | | BTREE | | |
| Accounts | 1 | accounts_planid_foreign | 1 | planID | A | 5 | NULL | NULL | | BTREE | | |
| Accounts | 1 | accounts_cardid_foreign | 1 | cardID | A | 0 | NULL | NULL | YES | BTREE | | |
The index on the datausage table is as follows:
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| datausage | 0 | PRIMARY | 1 | UsageID | A | 96805 | NULL | NULL | | BTREE | | |
I tried creating different indexes on datausage to see if it would help, but nothing did. I tried an index on AccountID, an index on AccountID, LogData, and index on LogData, AccountID, and an index on LogData. None of these made any difference.
I also tried using a UNION ALL with one of the queries with the logdata range and the other query just where logdata is null, but the result was about the same (actually a bit worse).
Can someone please help me understand what may be going on and the ways in which I can optimize the query execution time? Thank you!!
UPDATE: At Philipxy's request, here are the table definitions. Note that I removed some columns and constraints that are not related to this query to help keep things as tight and clean as possible.
CREATE TABLE `Accounts` (
`accountID` varchar(25) NOT NULL,
`name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`status` int(11) NOT NULL,
`planID` int(10) unsigned NOT NULL DEFAULT '1',
`created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00'
PRIMARY KEY (`accountID`),
KEY `accounts_planid_foreign` (`planID`),
KEY `acctname_id_ndx` (`name`,`accountID`),
CONSTRAINT `accounts_planid_foreign` FOREIGN KEY (`planID`) REFERENCES `AccountPlans` (`planID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
CREATE TABLE `datausage` (
`UsageID` int(11) NOT NULL AUTO_INCREMENT,
`Core` int(11) DEFAULT NULL,
`CoreDelux` int(11) DEFAULT NULL,
`AccountID` varchar(25) DEFAULT NULL,
`LogDate` date DEFAULT NULL
PRIMARY KEY (`UsageID`),
KEY `acctusage` (`AccountID`,`LogDate`)
) ENGINE=MyISAM AUTO_INCREMENT=104303 DEFAULT CHARSET=latin1
CREATE TABLE `AccountPlans` (
`planID` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
`params` text COLLATE utf8_unicode_ci NOT NULL,
`created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`plantype` varchar(25) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`planID`),
KEY `acctplans_id_type_ndx` (`planID`,`plantype`)
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
First, you can simplify the query by moving the where clause to the on clause:
SELECT a.accountID as AccountID, coalesce(a.name, a.accountID) as AccountName,
ap.plantype as AccountType, a.status as AccountStatus,
date(a.created_at) as Created,
sum(coalesce(du.Core, 0) + (coalesce(du.CoreDeluxe, 0) * 3)) as CoreData
FROM Accounts a LEFT JOIN
datausage du
on a.accountID = du.`accountID` AND
du.`LogDate` >= '2014-06-01' and du.`LogDate` < '2014-07-01'
LEFT JOIN
AccountPlans ap
on ap.`PlanID` = a.`PlanID`
GROUP BY a.accountID
ORDER BY AccountName asc ;
(I also introduced table aliases to make the query easier to read.)
This version should make better uses of indexes because it eliminates the or in the where clause. However, it still won't use an index for the outer sort. The following might be better:
SELECT a.accountID as AccountID, coalesce(a.name, a.accountID) as AccountName,
ap.plantype as AccountType, a.status as AccountStatus,
date(a.created_at) as Created,
sum(coalesce(du.Core, 0) + (coalesce(du.CoreDeluxe, 0) * 3)) as CoreData
FROM Accounts a LEFT JOIN
datausage du
on a.accountID = du.`accountID` AND
du.LogDate >= '2014-06-01' and du.LogDate < '2014-07-01'LEFT JOIN
AccountPlans ap
on ap.PlanID = a.PlanID
GROUP BY a.accountID
ORDER BY a.name, a.accountID ;
For this, I would recommend the following indexes:
Accounts(name, AccountId)
Datausage(AccountId, LogDate)
AccountPlans(PlanId, PlanType)
When you left join with datausage you should restrict the output as much as possible right there. (JOIN means AND means WHERE means ON. Put the conditions in essentially whatever order will be clear and/or optimize when necessary.) The result will be a null-extended row when there was no usage; you want to leave that row in.
When you join with AccountPlans you don't want to introduce null rows (which can't happen anyway) so that's just an inner join.
The version below has the AccountPlan join as an inner join and put first. (Indexed) Accounts FK PlanID to AccountPlan means the DBMS knows the inner join will only ever generate one row per Accounts PK. So the output has key AccountId. That row can be immediately inner joined to datausage. (An index on its AccountID should help, eg for a merge join.) For the other way around there is no PlanID key/index on the outer join result to join with AccountPlan.
SELECT
a.accountID as AccountID,
IFNULL(a.name,a.accountID) as AccountName,
ap.plantype as AccountType,
a.status as AccountStatus,
date(a.created_at) as Created,
sum(IFNULL(du.Core,0) + (IFNULL(du.CoreDeluxe,0) * 3)) as CoreData
FROM Accounts a
JOIN AccountPlans ap ON ap.PlanID = a.PlanID
LEFT JOIN datausage du ON a.accountID = du.accountID AND du.LogDate >= '2014-06-01' AND du.LogDate < '2014-07-01'
GROUP BY a.accountID
Related
I'm trying to figure out which is the best way to optimize my current selection query on a MySQL database.
I have 2 MySQL tables with a relationship one-to-many. One is the user table that contains the unique list of users and It has around 22krows. One is the linedata table which contains all the possible coordinates for each user and it has around 490k rows.
In this case we can assume the foreign key between the 2 tables is the id value. In the case of the user table the id is also the auto-increment primary key, while in the linedata table it's not primary key cause we can have more rows for the same user.
The CREATE STMT structure
CREATE TABLE `user` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`isActive` tinyint(4) NOT NULL,
`userId` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`name` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`gender` varchar(45) COLLATE utf8_unicode_ci NOT NULL,
`age` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=21938 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `linedata` (
`id` int(11) NOT NULL,
`userId` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`timestamp` datetime NOT NULL,
`x` float NOT NULL,
`y` float NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The selection query
SELECT
u.id,
u.isActive,
u.userId,
u.name,
u.gender,
u.age,
GROUP_CONCAT(CONCAT_WS(', ',timestamp,x, y)
ORDER BY timestamp ASC SEPARATOR '; '
) as linedata_0
FROM user u
JOIN linedata l
ON u.id=l.id
WHERE DATEDIFF(l.timestamp, '2018-02-28T20:00:00.000Z') >= 0
AND DATEDIFF(l.timestamp, '2018-11-20T09:20:08.218Z') <= 0
GROUP BY userId;
The EXPLAIN output
+-------+---------------+-----------+-----------+-------------------+-----------+---------------+-----------+-----------+------------------------------------------------------------+
| ID | SELECT_TYPE | TABLE | TYPE | POSSIBLE_KEYS | KEY | KEY_LEN | REF | ROWS | EXTRA |
+-------+---------------+-----------+-----------+-------------------+-----------+---------------+-----------+-----------+------------------------------------------------------------+
| 1 | SIMPLE | l | ALL | NULL | NULL | NULL | NULL | 491157 | "Using where; Using temporary; Using filesort" |
+-------+---------------+-----------+-----------+-------------------+-----------+---------------+-----------+-----------+------------------------------------------------------------+
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | l.id | 1 | NULL |
+-------+---------------+-----------+-----------+-------------------+-----------+---------------+-----------+-----------+------------------------------------------------------------+
The selection query works if for example I add another WHERE condition for filter single users. Let's say that I want to select just 200 user, then I got around 14 seconds as execution time. Around 7 seconds if I select just the first 100 users. But in case of having only datetime range condition it seems loading without an ending point. Any suggestions?
UPDATE
After following the Rick's suggestions now the query benchmark is around 14 seconds. Here below the EXPLAIN EXTENDED:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,filtered,Extra
1,PRIMARY,u,index,PRIMARY,PRIMARY,4,NULL,21959,100.00,NULL
1,PRIMARY,l,ref,id_timestamp_index,id_timestamp_index,4,u.id,14,100.00,"Using index condition"
2,"DEPENDENT SUBQUERY",NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,"No tables used"
I have changed a bit some values of the tables:
Where the id in user table can be joined with userId in linedata table. And they are integer now. We will have string type just for the userId value in user table cause it is a sort of long string identifier like 0000309ab2912b2fd34350d7e6c079846bb6c5e1f97d3ccb053d15061433e77a_0.
So, just for make a quick example we will have in user and in linedata table:
+-------+-----------+-----------+-------------------+--------+---+
| id | isActive | userId | name | gender |age|
+-------+-----------+-----------+-------------------+--------+---+
| 1 | 1 | x4by4d | john | m | 22|
| 2 | 1 | 3ub3ub | bob | m | 50|
+-------+-----------+-----------+-------------------+--------+---+
+-------+-----------+-----------+------+---+
| id | userId |timestamp | x | y |
+-------+-----------+-----------+------+----+
| 1 | 1 | somedate | 30 | 10 |
| 2 | 1 | somedate | 45 | 15 |
| 3 | 1 | somedate | 50 | 20 |
| 4 | 2 | somedate | 20 | 5 |
| 5 | 2 | somedate | 25 | 10 |
+-------+-----------+-----------+------+----+
I have added a compound index made of userId and timestamp values in linedata table.
Maybe instead of having as primary key an ai id value for linedata table, if I add a composite primary key made of userId+timestamp? Should increase the performance or maybe not?
I need to help you fix several bugs before discussing performance.
First of all, '2018-02-28T20:00:00.000Z' won't work in MySQL. It needs to be '2018-02-28 20:00:00.000' and something needs to be done about the timezone.
Then, don't "hide a column in a function". That is DATEDIFF(l.timestamp ...) cannot use any indexing on timestamp.
So, instead of
WHERE DATEDIFF(l.timestamp, '2018-02-28T20:00:00.000Z') >= 0
AND DATEDIFF(l.timestamp, '2018-11-20T09:20:08.218Z') <= 0
do something like
WHERE l.timestamp >= '2018-02-28 20:00:00.000'
AND l.timestamp < '2018-11-20 09:20:08.218'
I'm confused about the two tables. Both have id and userid, yet you join on id. Perhaps instead of
CREATE TABLE `linedata` (
`id` int(11) NOT NULL,
`userId` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
...
you meant
CREATE TABLE `linedata` (
`id` int(11) NOT NULL AUTO_INCREMENT, -- (the id for `linedata`)
`userId` int NOT NULL, -- to link to the other table
...
PRIMARY KEY(id)
...
Then there could be several linedata rows for each user.
At that point, this
JOIN linedata l ON u.id=l.id
becomes
JOIN linedata l ON u.id=l.userid
Now, for performance: linedata needs INDEX(userid, timestamp) - in that order.
Now, think about the output. You are asking for up to 22K rows, with possibly hundreds of "ts,x,y" strung together in one of the columns. What will receive this much data? Will it choke on it?
And GROUP_CONCAT has a default limit of 1024 bytes. That will allow for about 50 points. If a 'user' can be in more than 50 spots in 9 days, consider increasing group_concat_max_len before running the query.
To make it work even faster, reformulate it this way:
SELECT u.id, u.isActive, u.userId, u.name, u.gender, u.age,
( SELECT GROUP_CONCAT(CONCAT_WS(', ',timestamp, x, y)
ORDER BY timestamp ASC
SEPARATOR '; ')
) as linedata_0
FROM user u
JOIN linedata l ON u.id = l.userid
WHERE l.timestamp >= '2018-02-28 20:00:00.000'
AND l.timestamp < '2018-11-20 09:20:08.218';
Another thing. You probably want to be able to look up a user by name; so add INDEX(name)
Oh, what the heck is the VARCHAR(255) for userID?? Ids are normally integers.
I'm having some major issues getting a query that thins out a data set to return results in a timely manner. I've pasted below the indexes on this table (which I thought would be enough to improve the speed) as well as the logic of the query.
One thing I tried was removing the "ORDER BY" in the inner query, but that actually did not improved the timing on it whatsoever, as well as cleaning out some additional unneeded columns I was selecting which improve the time, but did not make it "fast enough".
SELECT unix_date, price FROM
(SELECT #row := #row +1 as row_num, unix_date, price
FROM (SELECT #row:=0, unix_date, price FROM
price_data WHERE created_date >= '2017-03-26 00:00:00' AND created_date
<= '2017-06-26 23:59:59' AND currency= 'USD' ORDER BY unix_date DESC)
AS p) AS d
WHERE MOD(row_num, 288) = 1;
The whole point of this query is just trying to return a result set of price data points (unix timestamp, price) but thin it out to only return every 288th (or X) data points. This table is honestly pretty small currently (total rows: 198109) so I am having a hard time understanding why the query would take so long to return.
Here are the indexes on the table currently:
| Table | Non_unique | Key_name | Seq_in_index |
Column_name | Collation | Cardinality | Sub_part | Packed | Null |
Index_type | Comment | Index_comment |
+--------------------+------------+--------------+--------------+------
--------+-----------+-------------+----------+--------+------+---------
---+---------+---------------+
| price_data | 0 | PRIMARY | 1 |
unix_date | A | 200002 | NULL | NULL | |
BTREE | | |
| price_data | 0 | PRIMARY | 2 |
currency | A | 200002 | NULL | NULL | |
BTREE | | |
| price_data | 1 | created_date | 1 |
created_date | A | 200002 | NULL | NULL | |
BTREE | | |
| price_data | 1 | price | 1 | price
| A | 200002 | NULL | NULL | YES | BTREE |
| |
+--------------------+------------+--------------+--------------+------
--------+-----------+-------------+----------+--------+------+---------
---+---------+--
Per suggestion I've added the create table:
CREATE TABLE `price_data` (
`created_date` datetime NOT NULL,
`unix_date` int(11) NOT NULL,
`currency` varchar(255) NOT NULL DEFAULT '',
`price` decimal(10,6) DEFAULT NULL,
PRIMARY KEY (`unix_date`,`currency`),
KEY `created_date` (`created_date`),
KEY `price` (`price`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
Any advice on how to improve the speed of this query would be greatly appreciated.
EDIT: Could the issue be something along the lines of the final WHERE of this query, is actually evaluating information against a "fake column" of row_num, which is derived from the previous inner queries, thus does not really have an index on it per se? So when the WHERE is evaluating it is not doing so with the speed of something a normal indexed column would?
Change the primary key to currency, unix_date. Since you have an equality condition for currency, and a range condition for unix_date, you should put the column with the equality condition first. Then both the range condition on unix_date and the ORDER BY should use the primary key order.
Apply the condition to unix_date, not create_date, to get it to use the primary key index.
You have to use a derived table subquery, but you don't have to use two nested levels of subqueries.
SELECT row_num, unix_date, price
FROM (
SELECT #row := #row + 1 AS row_num, unix_date, price
FROM (SELECT #row := 0) AS _init
CROSS JOIN price_data
WHERE currency = 'USD'
AND unix_date BETWEEN UNIX_TIMESTAMP('2017-03-26 00:00:00')
AND UNIX_TIMESTAMP('2017-06-26 23:59:59')
ORDER BY unix_timestamp DESC
) AS t
WHERE MOD(row_num, 288) = 1
You should learn to use EXPLAIN to help you analyze index usage.
You might also like my presentation How to Design Indexes, Really, and the video: https://www.youtube.com/watch?v=ELR7-RdU9XU
MySQL 8.0 should have windowing functions, so look for that next year sometime.
I have a biggish InnoDB table which at this moment contains about 20 million rows with ~20000 new rows inserted every day. They contain messages for different topics.
CREATE TABLE IF NOT EXISTS `Messages` (
`ID` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`TopicID` bigint(20) unsigned NOT NULL,
`DATESTAMP` int(11) DEFAULT NULL,
`TIMESTAMP` int(10) unsigned NOT NULL,
`Message` mediumtext NOT NULL,
`Checksum` varchar(50) DEFAULT NULL,
`Nickname` varchar(80) NOT NULL,
PRIMARY KEY (`ID`),
UNIQUE KEY `TopicID` (`TopicID`,`Checksum`),
KEY `DATESTAMP` (`DATESTAMP`),
KEY `Nickname` (`Nickname`),
KEY `TIMESTAMP` (`TIMESTAMP`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=25195126 ;
NOTE: The Cheksum stores an MD5 checksum which prevents same messages inserted twice in the same topics. (nickname + timestamp + topicid + last 20 chars of message)
The site I'm building has a newsfeed in which users can select to view newest messages from different Nicknames from different forums. The query is as follows:
SELECT
Messages.ID AS MessageID,
Messages.Message,
Messages.TIMESTAMP,
Messages.Nickname,
Topics.ID AS TopicID,
Topics.Title AS TopicTitle,
Forums.Title AS ForumTitle
FROM Messages
JOIN FollowedNicknames ON FollowedNicknames.UserID = 'MYUSERID'
JOIN Forums ON Forums.ID = FollowedNicknames.ForumID
JOIN Subforums ON Subforums.ForumID = Forums.ID
JOIN Topics ON Topics.SubforumID = Subforums.ID
WHERE
Messages.Nickname = FollowedNicknames.Nickname AND
Messages.TopicID = Topics.ID AND Messages.DATESTAMP = '2013619'
ORDER BY Messages.TIMESTAMP DESC
The TIMESTAMP contains an unix timestamp and DATESTAMP is simply a date generated from the unix timestamp for faster access via '=' operator instead of range scans with unix timestamps.
The problem is, this query takes about 13 seconds ( or more ) unbuffered. That is of course unacceptable for the intented usage. Adding the DATESTAMP seemed to speed things up, but not by much.
At this point, I don't really know what should I do. I've read about composite primary keys, but I am still unsure whether they would do any good and how to correctly implement one in this particular case.
I know that using BIGINTs may be a little overkill, but do they affect that much?
EXPLAIN:
+----+-------------+-----------------------+--------+---------------------------------------+------------+---------+-----------------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------------+--------+---------------------------------------+------------+---------+-----------------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | FollowedNicknames | ALL | UserID,ForumID,Nickname | NULL | NULL | NULL | 8 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | Forums | eq_ref | PRIMARY | PRIMARY | 8 | database.FollowedNicknames.ForumiID | 1 | NULL |
| 1 | SIMPLE | Messages | ref | TopicID,DATETIME,Nickname | Nickname | 242 | database.FollowedNicknames.Nickname | 15 | Using where |
| 1 | SIMPLE | Topics | eq_ref | PRIMARY,SubforumID | PRIMARY | 8 | database.Messages.TopicID | 1 | NULL |
| 1 | SIMPLE | Subforums | eq_ref | PRIMARY,ForumID | PRIMARY | 8 | database.Topics.SubforumID | 1 | Using where |
+----+-------------+-----------------------+--------+---------------------------------------+------------+---------+-----------------------------------------------+------+----------------------------------------------+
You shouldn't be JOINing on a VARCHAR column (Nickname); you should use the user ID to join those tables. That is definitely slowing the query down and is probably the biggest issue. It would also be easier to follow if you wrote all of the JOINs explicitly instead of at the end in the WHERE clause like this:
SELECT
Messages.ID AS MessageID,
Messages.Message,
Messages.TIMESTAMP,
Messages.Nickname,
Topics.ID AS TopicID,
Topics.Title AS TopicTitle,
Forums.Title AS ForumTitle
FROM Messages
JOIN FollowedNicknames ON Messages.Nickname = FollowedNicknames.Nickname
AND FollowedNicknames.UserID = 'MYUSERID'
JOIN Forums ON Forums.ID = FollowedNicknames.ForumID
JOIN Subforums ON Subforums.ForumID = Forums.ID
JOIN Topics ON Messages.TopicID = Topics.ID
AND Topics.SubforumID = Subforums.ID
WHERE Messages.DATESTAMP = '2013619'
ORDER BY Messages.TIMESTAMP DESC
Instead of INT as the data type for the DATESTAMP column, I would use DATE. The Checksum column should probably use latin1_general_ci as the collation. I would use INT for the ID columns as long as their values are less than 2,000,000,000 since INT UNSIGNED can store values up to roughly 4,000,000,000. InnoDB is affected by the primary key much more than MyISAM and it could make a noticeable difference.
I have two tables:
localities:
CREATE TABLE `localities` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(100) NOT NULL,
`type` varchar(30) NOT NULL,
`parent_id` int(11) DEFAULT NULL,
`lft` int(11) DEFAULT NULL,
`rgt` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_localities_on_parent_id_and_type` (`parent_id`,`type`),
KEY `index_localities_on_name` (`name`),
KEY `index_localities_on_lft_and_rgt` (`lft`,`rgt`)
) ENGINE=InnoDB;
locatings:
CREATE TABLE `locatings` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`localizable_id` int(11) DEFAULT NULL,
`localizable_type` varchar(255) DEFAULT NULL,
`locality_id` int(11) NOT NULL,
`category` varchar(50) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_locatings_on_locality_id` (`locality_id`),
KEY `localizable_and_category_index` (`localizable_type`,`localizable_id`,`category`),
KEY `index_locatings_on_category` (`category`)
) ENGINE=InnoDB;
localities table is implemented as a nested set.
Now, when user belongs to some locality (through some locating) he also belongs to all its ancestors (higher level localities). I need a query that will select all the localities that all the users belong to into a view.
Here is my try:
select distinct lca.*, lt.localizable_type, lt.localizable_id
from locatings lt
join localities lc on lc.id = lt.locality_id
left join localities lca on (lca.lft <= lc.lft and lca.rgt >= lc.rgt)
The problem here is that it takes way too much time to execute.
I consulted EXPLAIN:
+----+-------------+-------+--------+---------------------------------+---------+---------+----------------------------------+-------+----------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+--------+---------------------------------+---------+---------+----------------------------------+-------+----------+-----------------+
| 1 | SIMPLE | lt | ALL | index_locatings_on_locality_id | NULL | NULL | NULL | 4926 | 100.00 | Using temporary |
| 1 | SIMPLE | lc | eq_ref | PRIMARY | PRIMARY | 4 | bzzik_development.lt.locality_id | 1 | 100.00 | |
| 1 | SIMPLE | lca | ALL | index_localities_on_lft_and_rgt | NULL | NULL | NULL | 11439 | 100.00 | |
+----+-------------+-------+--------+---------------------------------+---------+---------+----------------------------------+-------+----------+-----------------+
3 rows in set, 1 warning (0.00 sec)
The last join obviously doesn’t use lft, rgt index as I expect it to. I’m desperate.
UPDATE:
After adding a condition as #cairnz suggested, the query takes still too much time to process.
UPDATE 2: Column names instead of the asterisk
Updated query:
SELECT DISTINCT lca.id, lt.`localizable_id`, lt.`localizable_type`
FROM locatings lt FORCE INDEX(index_locatings_on_category)
JOIN localities lc
ON lc.id = lt.locality_id
INNER JOIN localities lca
ON lca.lft <= lc.lft AND lca.rgt >= lc.rgt
WHERE lt.`category` != "Unknown";
Updated EXAPLAIN:
+----+-------------+-------+--------+-----------------------------------------+-----------------------------+---------+---------------------------------+-------+----------+-------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+--------+-----------------------------------------+-----------------------------+---------+---------------------------------+-------+----------+-------------------------------------------------+
| 1 | SIMPLE | lt | range | index_locatings_on_category | index_locatings_on_category | 153 | NULL | 2545 | 100.00 | Using where; Using temporary |
| 1 | SIMPLE | lc | eq_ref | PRIMARY,index_localities_on_lft_and_rgt | PRIMARY | 4 | bzzik_production.lt.locality_id | 1 | 100.00 | |
| 1 | SIMPLE | lca | ALL | index_localities_on_lft_and_rgt | NULL | NULL | NULL | 11570 | 100.00 | Range checked for each record (index map: 0x10) |
+----+-------------+-------+--------+-----------------------------------------+-----------------------------+---------+---------------------------------+-------+----------+-------------------------------------------------+
Any help appreciated.
Ah, it just occurred to me.
Since you are asking for everything in the table, mysql decides to use a full table scan instead, as it deems it more efficient.
In order to get some key usage, add in some filters to restrict looking for every row in all the tables anyways.
Updating Answer:
Your second query does not make sense. You are left joining to lca yet you have a filter in it, this negates the left join by itself. Also you're looking for data in the last step of the query, meaning you will have to look through all of lt, lc and lca in order to find your data. Also you have no index with left-most column 'type' on locations, so you still need a full table scan to find your data.
If you had some sample data and example of what you are trying to achieve it would perhaps be easier to help.
try to experiment with forcing index - http://dev.mysql.com/doc/refman/5.1/en/index-hints.html, maybe it's just optimizer issue.
It looks like you're wanting the parents of the single result.
According to the person credited with defining Nested Sets in SQL, Joe Celko at http://www.ibase.ru/devinfo/DBMSTrees/sqltrees.html "This model is a natural way to show a parts explosion, because a final assembly is made of physically nested assemblies that break down into separate parts."
In other words, Nested Sets are used to filter children efficiently to an arbitrary number of independent levels within a single collection. You have two tables, but I don't see where the properties of the set "locatings" can't be de-normalized into "localities"?
If the localities table had a geometry column, could I not find the one locality from a "locating" and then select on the one table using a single filter: parent.lft <= row.left AND parent.rgt >= row.rgt ?
UPDATED
In this answer https://stackoverflow.com/a/1743952/3018894, there is an example from http://explainextended.com/2009/09/29/adjacency-list-vs-nested-sets-mysql/ where the following example gets all the ancestors to an arbitrary depth of 100000:
SELECT hp.id, hp.parent, hp.lft, hp.rgt, hp.data
FROM (
SELECT #r AS _id,
#level := #level + 1 AS level,
(
SELECT #r := NULLIF(parent, 0)
FROM t_hierarchy hn
WHERE id = _id
)
FROM (
SELECT #r := 1000000,
#level := 0
) vars,
t_hierarchy hc
WHERE #r IS NOT NULL
) hc
JOIN t_hierarchy hp
ON hp.id = hc._id
ORDER BY
level DESC
I have a website where visitors can leave comments. I want to add the ability to answer comments (i.e. nested comments).
At first this query was fast but after I populated the table with the existing comments (about 30000) a simple query like:
SELECT c.id, c2.id
FROM (SELECT id
FROM swb_comments
WHERE pageId = 1411
ORDER BY id DESC
LIMIT 10) AS c
LEFT JOIN swb_comments AS c2 ON c.id = c2.parentId
took over 2 seconds, with no childComments(!).
How do I optimize a query like this? On possible solution would be http://www.ferdychristant.com/blog//articles/DOMM-7QJPM7 (scroll to "The Flat Table Model done right") but this makes pagination rather difficult (how do I limit to 10 parent comments within 1 query?)
The table has 3 indexes, id, pageId and ParentId.
Thanks in advance!
EDIT:
Table definition added. This is the full definition with some differences to the above SELECT query, (i.e. pageId instead of numberId to avoid confussion)
CREATE TABLE `swb_comments` (
`id` mediumint(9) NOT NULL auto_increment,
`userId` mediumint(9) unsigned NOT NULL default '0',
`numberId` mediumint(9) unsigned default NULL,
`orgId` mediumint(9) unsigned default NULL,
`author` varchar(100) default NULL,
`email` varchar(255) NOT NULL,
`message` text NOT NULL,
`IP` varchar(40) NOT NULL,
`timestamp` varchar(25) NOT NULL,
`editedTimestamp` varchar(25) default NULL COMMENT 'last edited timestamp',
`status` varchar(20) NOT NULL default 'publish',
`parentId` mediumint(9) unsigned NOT NULL default '0',
`locale` varchar(10) NOT NULL,
PRIMARY KEY (`id`),
KEY `userId` (`userId`),
KEY `numberId` (`numberId`),
KEY `orgId` (`orgId`),
KEY `parentId` (`parentId`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=34748 ;
The issue is that MySQL cannot apply index if it need to deal with a result from a derived query (that's why you have NULL in the possible_keys column). So I suggest to filter out ten comments that you need:
SELECT * FROM swb_comments WHERE pageId = 1411 ORDER BY id DESC LIMIT 10
And after that send separate request to get answers for each comment id:
SELECT * FROM swb_comments WHERE parentId IN ($commentId1, $commentId2, ..., $commentId10)
In this case database engine will be able to apply pageId and parentId indexes efficiently.
If Mr Fedorenko is correct and the subquery is causing the optimiser difficulties, could you not try...
SELECT c.id, c2.id
FROM swb_comments c LEFT JOIN swb_comments c2 ON c.id = c2.parentID
WHERE c.pageId = 1411
ORDER BY c.id DESC
LIMIT 10;
and see if it's any improvement?
Later - I have created a table using your definition, filled it in with 30,000 skeletal rows, and tried both the queries. They both complete in too short a time to notice. The explain plans are here...
mysql> EXPLAIN SELECT c.id, c2.id
FROM swb_comments c LEFT JOIN swb_comments c2 ON c.id = c2.parentID
WHERE c.numberId = 1411 ORDER BY c.id DESC LIMIT 10;
+----+-------------+-------+------+---------------+----------+---------+------------+------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+----------+---------+------------+------+-----------------------------+
| 1 | SIMPLE | c | ref | numberId | numberId | 4 | const | 1 | Using where; Using filesort |
| 1 | SIMPLE | c2 | ref | parentId | parentId | 3 | books.c.id | 14 | |
+----+-------------+-------+------+---------------+----------+---------+------------+------+-----------------------------+
mysql> EXPLAIN SELECT c.id, c2.id
FROM swb_comments c LEFT JOIN swb_comments c2 ON c.id = c2.parentID
WHERE c.numberId = 1411 ORDER BY c.id DESC LIMIT 10;
+----+-------------+-------+------+---------------+----------+---------+------------+------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+----------+---------+------------+------+-----------------------------+
| 1 | SIMPLE | c | ref | numberId | numberId | 4 | const | 1 | Using where; Using filesort |
| 1 | SIMPLE | c2 | ref | parentId | parentId | 3 | books.c.id | 14 | |
+----+-------------+-------+------+---------------+----------+---------+------------+------+-----------------------------+
and are exactly what I'd expect.
This is very mysterious.
I'll think about it a bit more to see if there's anything else we can try.