Optimizing a query whose ORDER BY results in "Using filesort" - MySQL

Query :
SELECT
r.reply_id,
r.msg_id,
r.uid,
r.body,
r.date,
u.username as username,
u.profile_picture as profile_picture
FROM
pm_replies as r
LEFT JOIN users as u
ON u.uid = r.uid
WHERE
r.msg_id = '784351921943772258'
ORDER BY r.date DESC
I tried every index combination I could think of and searched Google for the best way to index this, but nothing worked.
This query takes 0.33 seconds for 500 returned rows, and counting...
EXPLAIN:
id  select_type  table  type    possible_keys  key   key_len  ref         rows  Extra
1   SIMPLE       r      ALL     index1         NULL  NULL     NULL        540   Using where; Using filesort
1   SIMPLE       u      eq_ref  uid            uid   8        site.r.uid  1
SHOW CREATE TABLE pm_replies
CREATE TABLE `pm_replies` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`reply_id` bigint(20) NOT NULL,
`msg_id` bigint(20) NOT NULL,
`uid` bigint(20) NOT NULL,
`body` text COLLATE utf8_unicode_ci NOT NULL,
`date` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `index1` (`msg_id`,`date`,`uid`)
) ENGINE=MyISAM AUTO_INCREMENT=541 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
SHOW CREATE TABLE users
CREATE TABLE `users` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`uid` bigint(20) NOT NULL,
`username` varchar(20) COLLATE utf8_unicode_ci NOT NULL,
`email` text CHARACTER SET latin1 NOT NULL,
`password` text CHARACTER SET latin1 NOT NULL,
`profile_picture` text COLLATE utf8_unicode_ci NOT NULL,
`date_registered` datetime NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `uid` (`uid`),
UNIQUE KEY `username` (`username`)
) ENGINE=MyISAM AUTO_INCREMENT=2004 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

For the query as it is, the best indexes would seem to be...
pm_replies: (msg_id, date, uid)
users: (uid)
The important one is pm_replies. You use it to both filter your data (the filter column is first) then order your data (the order column is second).
This would be different if you removed the filter. Then you'd just want (date, uid) as your index.
The last field in the index just makes it a fraction friendlier to the join; the important part for the join is actually the index on users.
There is a lot more that could be said on this, a whole chapter in a book at the very least, and several books if you wanted to. But I hope this helps.
EDIT
Note that my suggested index for pm_replies is one index covering three fields, not three separate indexes. This ensures that all the entries in the index are pre-sorted by those columns. It's like sorting data in Excel by three columns.
Having three separate indexes is like having the Excel data on three tabs, each sorted by a different field.
Only with one index over three fields do you get this behaviour...
- You can select one 'bunch' of records with the same msg_id
- That whole 'bunch' are next to each other, no gaps, etc
- That whole 'bunch' are sorted in date order for that msg_id
- For any rows with the same date, they're ordered by uid
(Again, the uid part is really very minor.)
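As a rough sketch of how to check whether the composite index can actually remove the filesort, something like the following could be run. It uses index1, the question's existing KEY (msg_id, date, uid); the FORCE INDEX hint is only a diagnostic here, not a recommended permanent fix.

```sql
-- Diagnostic only: ask MySQL to use the existing composite index
-- index1 (msg_id, date, uid) and compare the EXPLAIN output with
-- the one shown in the question.
EXPLAIN
SELECT
    r.reply_id,
    r.msg_id,
    r.uid,
    r.body,
    r.`date`,
    u.username,
    u.profile_picture
FROM pm_replies AS r FORCE INDEX (index1)
LEFT JOIN users AS u
    ON u.uid = r.uid
WHERE r.msg_id = '784351921943772258'
ORDER BY r.`date` DESC;
```

If the Extra column no longer shows "Using filesort" with the hint, the index is able to serve both the filter and the ORDER BY. Whether the optimizer picks it on its own may depend on how large a share of the table matches the msg_id.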

Please try this:
SELECT
r.reply_id,
r.msg_id,
r.uid,
r.body,
r.date,
u.username as username,
u.profile_picture as profile_picture
FROM
pm_replies as r
LEFT JOIN users as u
ON (u.uid = r.uid AND r.msg_id = '784351921943772258')
ORDER BY r.date DESC
In my case it helped.

Add date to your index1 key so that msg_id and date are both in the index.

What Dems is saying should be correct, but there is one additional detail if you are using InnoDB: perhaps you are paying the price of secondary indexes on clustered tables. Essentially, accessing a row through a secondary index requires an additional lookup through the primary (i.e. clustering) index. This "double lookup" might make the index less attractive to the query optimizer.
To alleviate this, try covering all the fields in your select statement with the index:
pm_replies: (msg_id, date, uid, reply_id, body)
users: (uid, username, profile_picture)
(Note that body and profile_picture are TEXT columns, which MySQL can only index by prefix, so in practice these two columns cannot be fully covered.)

It appears the optimizer is trying to force the index by ID to make the join to the user table. Since you are doing a left-join (which doesn't make sense since I would expect every entry to have a user ID, thus a normal INNER JOIN), I'll keep it left join.
So, I would try the following. Query just the replies based on the MESSAGE ID and order by the date descending on its own merits, THEN left join, such as
SELECT
r.reply_id,
r.msg_id,
r.uid,
r.body,
r.date,
u.username as username,
u.profile_picture as profile_picture
FROM
( select r2.*
from pm_replies r2
where r2.msg_id = '784351921943772258' ) r
LEFT JOIN users as u
ON u.uid = r.uid
ORDER BY
r.date DESC
In addition, I don't have MySQL readily available and can't remember whether ORDER BY is allowed in a subquery. If it is, you can put the ORDER BY inside the inner prequery (alias "r2") so that it uses the (msg_id, date) index and returns just that set, THEN joins to the users table on uid. At that point no index is required on the SOURCE result set, just the index on the users table to find the match.
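For what it's worth, MySQL does accept ORDER BY inside a derived table, so the variant described above could be sketched as below. Treat it as an illustration only: SQL does not guarantee that the derived table's ordering survives the join, so the outer ORDER BY is kept.

```sql
SELECT
    r.reply_id,
    r.msg_id,
    r.uid,
    r.body,
    r.`date`,
    u.username,
    u.profile_picture
FROM
    ( -- Filter and sort the replies first, on their own.
      SELECT r2.*
      FROM pm_replies AS r2
      WHERE r2.msg_id = '784351921943772258'
      ORDER BY r2.`date` DESC ) AS r
LEFT JOIN users AS u
    ON u.uid = r.uid
-- Kept on purpose: only the outermost ORDER BY defines the result order.
ORDER BY r.`date` DESC;
```

Running EXPLAIN on both this form and the original is the only reliable way to see whether the plan changes on a given MySQL version.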

Related

Slow database queries

Since my website database has grown very large the performance for certain queries has become terrible. Some queries are taking over 30 seconds to perform. I'm wondering if someone can help me optimize my query or make a suggestion on how I can improve performance? I have set an index on all the foreign keys and ids.
SELECT p.*
, u.unique_id
, u.nick_name
, u.avatar_thumb
, t.desc as tag_desc
, pt.post_id as tag_post_id
from tt_post_tags pt
LEFT
JOIN tt_posts p
ON p.id = pt.post_id
RIGHT
JOIN tt_users u
ON p.user_id = u.user_id
LEFT
JOIN tt_tags t
ON t.name = "gameday"
WHERE pt.name = "gameday"
ORDER
BY create_date DESC
LIMIT 100
The above query takes 29 seconds to complete. If I remove the "create_date DESC" from the query it runs in .3 seconds. I've added an index to create_date but still, it takes 30 seconds for the query to run. The tt_posts table contains about 1.6 million records.
My database has the following tables: Posts, Users, Tags, and PostTags.
Posts table contains a foreign key for the users table.
Tags table contains a unique id and name for each tag
Post_tags table contains the foreign key from the Tags table as well as a foreign key for the post that the tag is for.
I can include a diagram tomorrow if it's not easy to understand. Hopefully, someone can assist me. Thanks in advance.
CREATE TABLE `tt_posts` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`post_id` bigint(30) NOT NULL,
`user_id` bigint(30) NOT NULL,
`create_date` datetime NOT NULL,
`cover` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`duration` int(10) DEFAULT NULL,
`desc` text CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
PRIMARY KEY (`id`),
UNIQUE KEY `post_id` (`post_id`),
KEY `user_id` (`user_id`),
KEY `create_date` (`create_date`)
) ENGINE=InnoDB AUTO_INCREMENT=4641550 DEFAULT CHARSET=utf8
CREATE TABLE `tt_tags` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`tt_tag_id` BIGINT(30) NULL DEFAULT NULL,
`name` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
PRIMARY KEY (`id`),
UNIQUE INDEX `name` (`name`),
UNIQUE INDEX `tt_tag_id` (`tt_tag_id`),
INDEX `tt_tag_id_key` (`tt_tag_id`),
INDEX `name_key` (`name`)
)
COLLATE='utf8mb4_unicode_ci'
ENGINE=InnoDB
AND
CREATE TABLE `tt_post_tags` (
`post_id` INT(11) NOT NULL,
`name` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
INDEX `post_id` (`post_id`),
INDEX `name` (`name`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB;
AND
CREATE TABLE `tt_users` (
`id` BIGINT(20) NOT NULL AUTO_INCREMENT,
`user_id` BIGINT(30) NOT NULL,
`unique_id` VARCHAR(190) NOT NULL COLLATE 'utf8mb4_unicode_ci',
`nick_name` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
`avatar` VARCHAR(190) NOT NULL COLLATE 'utf8mb4_unicode_ci',
`signature` TEXT NOT NULL COLLATE 'utf8mb4_unicode_ci',
PRIMARY KEY (`id`),
UNIQUE INDEX `user_id` (`user_id`),
UNIQUE INDEX `unique_id` (`unique_id`),
INDEX `unique_id_index` (`unique_id`),
INDEX `user_id_index` (`user_id`)
)
COLLATE='utf8mb4_unicode_ci'
ENGINE=InnoDB
In my opinion, the main issue with your query is the mix of left and right outer joins. Honestly, are you able to read this correctly?
The first join alone seems weird. You outer join a post to its post tags. But can a post tag without a post even exist? What would it refer to? (The other way round would make more sense: to also select posts that have no tags.) If I am not mistaken here, your join is rendered to a mere inner join. In your where clause you further limit this result to post tags named 'gameday'.
Then you right outer join users. We avoid right outer joins for being by far less readable than left outer joins, but well, you select all users, even those without 'gameday' post tags here.
Then you left outer join all 'gameday' tags. This looks completely unrelated to the other tables (i.e. you either find 'gameday' tags or not). But in your explanation you say "Post_tags table contains the foreign key from the Tags", so I surmise there is no tag_id in your post tags table, but the name is the tag ID really (and thus also the foreign key in your post tags table). This again leads to the question: Why would a post tag even exist, when it has no related tag? Probably this is not possible, and again all this is boiled down to a mere inner join. (I would recommend here to have a tag_id instead of the names in both tables, just for readability. The column name name kind of hides the foreign key relationship.)
In your query, you don't show any information of the post tags table, but I see you select pt.post_id as tag_post_id, which of course is just p.id as tag_post_id again. I suppose this is a typo and you want to show pt.id as tag_post_id instead?
I understand that you want to see all users, but are only interested in 'gameday' post tags. This makes writing the query a little complicated. I would probably just select users and outer join the complete post tag information.
Your create_date is not qualified with a table. I suppose it is a column in the posts table?
This is the query I am coming up with:
select
gdp.*,
u.unique_id,
u.nick_name,
u.avatar_thumb
from tt_users u
left join
(
select
p.*,
t.desc as tag_desc,
pt.id as tag_post_id
from tt_tags t
join tt_post_tags pt on pt.name = t.name
join tt_posts p on p.id = pt.post_id
where t.name = 'gameday'
) gdp on gdp.user_id = u.user_id
order by gdp.create_date desc;
There has been a lot of guessing on my side, so this query may still be a little different from what you need. I don't know.
Now let's look at which table columns are accessed, to provide good indexes for the query. Let's particularly look at the subquery where we collect all post tags:
We only want 'gameday' tags. As this seems to be the primary key for tt_tags, there should already be a unique index on tt_tags(name).
Being the foreign key, there should also be an index on tt_post_tags(name). This is good, but as we want to continue joining on the post_id, it would be beneficial to have this in the index, too: create unique index idx on tt_post_tags(name, post_id). However, as this is the table's natural key, this index should also already exist in order to ensure data integrity. If it doesn't exist yet, hurry up to provide it.
At last we join tt_posts on its primary key (i.e. there should be an index on tt_posts(id)). Once more: Nothing for us to do here.
You select all users and you select all 'gameday' tags. Then you must join all found tags to the users, which already is some work. You can imagine this as ordering all found tags by user_id first in order to join. Then you want to sort your result by post date. This means that the DBMS must again sort all result rows. Sorting takes time; that's just the way it is. How many rows does the result contain? If we are talking about millions of rows to sort, then this will probably remain slow. And if many post tags are 'gameday' tags, then even the indexes may not help much reading the tables and the DBMS may go for full sequential table reads instead. Make sure the statistics are up-to-date (https://dev.mysql.com/doc/refman/8.0/en/analyze-table.html).
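The index and statistics suggestions above might look like this in practice. The index name idx_post_tags_name_post is made up for illustration, and the UNIQUE variant assumes the table holds no duplicate (name, post_id) pairs; if it does, the statement will fail and the duplicates have to be cleaned up first.

```sql
-- Hypothetical index name. Covers the filter on name and the
-- join column post_id in one index.
CREATE UNIQUE INDEX idx_post_tags_name_post
    ON tt_post_tags (name, post_id);

-- Refresh the optimizer statistics for the tables in the query.
ANALYZE TABLE tt_posts, tt_post_tags, tt_tags, tt_users;
```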
(Iteration one of Answering the Question)
First, let's look at the query without users:
select p.id
from post_tags AS pt
join posts AS p ON p.id = pt.post_id
join tags AS t ON t.name = "gameday"
where pt.name = "gameday"
ORDER BY p.create_date
LIMIT 100;
It is not possible to have a single index that handles both pt.name and p.create_date. Is there any way to get them into the same table? I see, for example, that name seems to be stored redundantly in t and pt.
tt_post_tags sounds like a many-to-many mapping table between posts and tags; is it? If so, what is name, which seems to be in both tags and post_tags?
I think this
join tags AS t ON t.name = "gameday"
should be
join tags AS t ON t.name = "gameday" AND pt.tag_id = t.tag_id
If so, that might be the main problem. Please provide SHOW CREATE TABLE for the rest of the tables.
The following indexes may (or may not) help:
tags: (post_id, name)
tags: (name, tag_id)
posts: (create_date, id)
post_tags: (name, post_id)
More
A UNIQUE INDEX is an INDEX, so the second of these is redundant and should be dropped: UNIQUE(x), INDEX(x)
Index Cookbook: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
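Applied to the schemas posted in the question, dropping the redundant non-unique duplicates could look roughly like this. The index names are taken from the CREATE TABLE statements above; it is worth double-checking them with SHOW CREATE TABLE before running anything.

```sql
-- tt_tags: `name_key` and `tt_tag_id_key` duplicate the UNIQUE
-- indexes `name` and `tt_tag_id`.
ALTER TABLE tt_tags
    DROP INDEX name_key,
    DROP INDEX tt_tag_id_key;

-- tt_users: `unique_id_index` and `user_id_index` duplicate the
-- UNIQUE indexes `unique_id` and `user_id`.
ALTER TABLE tt_users
    DROP INDEX unique_id_index,
    DROP INDEX user_id_index;
```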

Mysql index is not being taken while field is mentioned in join on clause

explain select * from users u join wallet w on w.userId=u.uuid where w.userId='8319611142598331610'; //Index is taken
explain select * from users u join wallet w on w.userId=u.uuid where w.currencyId=8; //index is not taken
As can be seen above, the index userIdIdx is used in the former case, but not in the latter.
Following are the schema of the two tables -
CREATE TABLE `users` (
`uuid` varchar(600) DEFAULT NULL,
KEY `uuidIdx` (`uuid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `wallet` (
`Id` int(11) NOT NULL AUTO_INCREMENT,
`userId` varchar(200) NOT NULL DEFAULT '',
`currencyId` int(11) NOT NULL,
PRIMARY KEY (`Id`),
KEY `userIdIdx` (`userId`),
KEY `currencyIdIdx` (`currencyId`)
) ENGINE=InnoDB AUTO_INCREMENT=279668 DEFAULT CHARSET=latin1;
How do I force MySql to consider the userIdIdx or uuidIdx index?
There are two methods for improving this.
Method 1:
Adding a multiple column index wallet(userId, currencyId) looks to be better for both queries.
see demo https://www.db-fiddle.com/f/aesNYevEzwopmXrnQJRPoS/0
Method 2
Rewrite the query.
This works with the current table structure.
Query
SELECT
*
FROM (
SELECT
wallet.userId
FROM
wallet
WHERE
wallet.currencyId = 8
) AS wallet
INNER JOIN
users
ON
wallet.userId = users.uuid
see demo https://www.db-fiddle.com/f/aesNYevEzwopmXrnQJRPoS/3
P.S. I also advise you to add Id int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY to the users table when you use InnoDB as the table engine.
This post of mine explains why: https://dba.stackexchange.com/a/48184/27070
Both queries are doing the best they can with what you gave them.
select *
from users u
join wallet w ON w.userId=u.uuid
where w.userId='8319611142598331610';
select *
from users u
join wallet w ON w.userId=u.uuid
where w.currencyId=8;
If there is only one row in a table (such as users), the Optimizer takes a different path. That seems to be what happened with the first query.
Otherwise, both queries would start with wallet since there is filtering going on. Each of the secondary keys in wallet is handy for one of the queries. Even better would be
INDEX(userId, currencyId, id) -- for first query
INDEX(currencyId, userId, id) -- for second query
The first column is used in the WHERE; the other two columns make the index "covering" so that it does not need to bounce between the index and the data.
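A minimal sketch of adding those two indexes to the wallet table from the question follows. The index names are invented for illustration. InnoDB secondary indexes already carry the primary key internally, so listing Id explicitly mainly documents the intent.

```sql
ALTER TABLE wallet
    -- For: WHERE w.userId = '...'
    ADD INDEX idx_user_currency (userId, currencyId, Id),
    -- For: WHERE w.currencyId = 8
    ADD INDEX idx_currency_user (currencyId, userId, Id);

-- Re-check the plan for the query that was not using an index.
EXPLAIN
SELECT *
FROM users AS u
JOIN wallet AS w ON w.userId = u.uuid
WHERE w.currencyId = 8;
```

With these in place, the single-column userIdIdx and currencyIdIdx would probably become redundant, since each is a left prefix of one of the new indexes.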
(Geez, those tables have awfully few columns.)
After filtering in w, it moves on to u and uses INDEX(uuid). Since that is the only column in the table, (no name??), it can be "Using index", that is "covering".
And the only reason for reaching into u is to verify that there exists a user with the value matching w.userId. Since you probably always have that, why JOIN to users at all in the query??

In MySQL is it faster to execute one JOIN + one LIKE statement or two JOINs?

I have to create a cron job, which is simple in itself, but because it will run every minute I'm worried about performance. I have two tables, one has user names and the other has details about their network. Most of the time a user will belong to just one network, but it is theoretically possible that they might belong to more, but even then very few, maybe two or three. So, in order to reduce the number of JOINs, I saved the network ids separated by | in a field in the user table, e.g.
|1|3|9|
The (simplified for this question) user table structure is
CREATE TABLE `users` (
`u_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE,
`userid` VARCHAR(500) NOT NULL UNIQUE,
`net_ids` VARCHAR(500) NOT NULL DEFAULT '',
PRIMARY KEY (`u_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The (also simplified) network table structure is
CREATE TABLE `network` (
`n_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE,
`netname` VARCHAR(500) NOT NULL UNIQUE,
`login_time` DATETIME DEFAULT NULL,
`timeout_mins` TINYINT UNSIGNED NOT NULL DEFAULT 10,
PRIMARY KEY (`n_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I have to send a warning when timeout occurs, my query is
SELECT N.netname, N.timeout_mins, N.n_id, U.userid FROM
(SELECT netname, timeout_mins, n_id FROM network
WHERE is_open = 1 AND notify = 1
AND TIMESTAMPDIFF(SECOND, TIMESTAMPADD(MINUTE, timeout_mins, login_time), NOW()) < 60) AS N
INNER JOIN users AS U ON U.net_ids LIKE CONCAT('%|', N.n_id, '|%');
I made N a subquery to reduce the number of rows joined. But I would like to know if it would be faster to add a third table with u_id and n_id as columns, remove the net_ids column from users, and then do a join on all three tables? Because I read that using LIKE slows things down.
Which is the most efficient query to use in this case? One JOIN and a LIKE or two JOINs?
P.S. I did some experimentation and the initial values for using two JOINs are higher than using a JOIN and a LIKE. However, repeated runs of the same query seem to speed things up a lot. I suspect something is cached somewhere, either in my app or the database, and both become comparable, so I did not find this data satisfactory. It also contradicts what I was expecting based on what I have been reading.
I used this table:
CREATE TABLE `user_net` (
`u_id` BIGINT UNSIGNED NOT NULL,
`n_id` BIGINT UNSIGNED NOT NULL,
INDEX `u_id` (`u_id`),
FOREIGN KEY (`u_id`) REFERENCES `users`(`u_id`),
INDEX `n_id` (`n_id`),
FOREIGN KEY (`n_id`) REFERENCES `network`(`n_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
and this query:
SELECT N.netname, N.timeout_mins, N.n_id, U.userid FROM
(SELECT netname, timeout_mins, n_id FROM network
WHERE is_open = 1 AND notify = 1
AND TIMESTAMPDIFF(SECOND, TIMESTAMPADD(MINUTE, timeout_mins, login_time), NOW()) < 60) AS N
INNER JOIN user_net AS UN ON N.n_id = UN.n_id
INNER JOIN users AS U ON UN.u_id = U.u_id;
You should define composite indexes for the user_net table. One of them can (and should) be the primary key.
CREATE TABLE `user_net` (
`u_id` BIGINT UNSIGNED NOT NULL,
`n_id` BIGINT UNSIGNED NOT NULL,
PRIMARY KEY (`u_id`, `n_id`),
INDEX `uid_nid` (`n_id`, `u_id`),
FOREIGN KEY (`u_id`) REFERENCES `users`(`u_id`),
FOREIGN KEY (`n_id`) REFERENCES `network`(`n_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I would also rewrite your query to:
SELECT N.netname, N.timeout_mins, N.n_id, U.userid
FROM network N
INNER JOIN user_net AS UN ON N.n_id = UN.n_id
INNER JOIN users AS U ON UN.u_id = U.u_id
WHERE N.is_open = 1
AND N.notify = 1
AND TIMESTAMPDIFF(SECOND, TIMESTAMPADD(MINUTE, N.timeout_mins, N.login_time), NOW()) < 60
While your subquery will probably not hurt much, there is no need for it.
Note that the last condition cannot use an index, because you have to combine two columns. If your MySQL version is at least 5.7.6 you can define an indexed virtual (calculated) column.
CREATE TABLE `network` (
`n_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE,
`netname` VARCHAR(500) NOT NULL UNIQUE,
`login_time` DATETIME DEFAULT NULL,
`timeout_mins` TINYINT UNSIGNED NOT NULL DEFAULT 10,
`is_open` TINYINT UNSIGNED,
`notify` TINYINT UNSIGNED,
`timeout_dt` DATETIME AS (`login_time` + INTERVAL `timeout_mins` MINUTE),
PRIMARY KEY (`n_id`),
INDEX (`timeout_dt`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Now change the query to:
SELECT N.netname, N.timeout_mins, N.n_id, U.userid
FROM network N
INNER JOIN user_net AS UN ON N.n_id = UN.n_id
INNER JOIN users AS U ON UN.u_id = U.u_id
WHERE N.is_open = 1
AND N.notify = 1
AND N.timeout_dt > NOW() - INTERVAL 60 SECOND
and it will be able to use the index.
You can also try to replace
INDEX (`timeout_dt`)
with
INDEX (`is_open`, `notify`, `timeout_dt`)
and see if it is of any help.
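If the network table already exists, the generated column and the composite index can be added in place rather than by recreating the table. This is a sketch under the assumption that is_open and notify are already columns of network (the question's query uses them, although its simplified CREATE TABLE omits them); the index name is made up.

```sql
ALTER TABLE network
    -- Virtual generated column: computed from the two base columns,
    -- not stored in the row, but indexable.
    ADD COLUMN timeout_dt DATETIME
        AS (login_time + INTERVAL timeout_mins MINUTE) VIRTUAL,
    -- Equality columns first, the range column last.
    ADD INDEX idx_open_notify_timeout (is_open, notify, timeout_dt);
```

Putting the two equality-tested columns ahead of timeout_dt lets the range condition on timeout_dt be evaluated inside the index as well.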
Reformulate to avoid hiding columns inside functions. I can't grok your date expression, but note this:
login_time < NOW() - INTERVAL timeout_mins MINUTE
If you can achieve something like that, then this index should help:
INDEX(is_open, notify, login_time)
If that is not good enough, let's see the other formulation so we can compare them.
Having stuff separated by comma (or |) is likely to be a really bad idea.
Bottom line: Assume that JOINs are not a performance problem, write the queries with as many JOINs as needed. Then let's optimize that.

At what execution level will MySQL utilize the index for ORDER BY?

I would like to understand at what point in time will MySQL use the indexed column when using ORDER BY.
For example, the query
SELECT * FROM A
INNER JOIN B ON B.id = A.id
WHERE A.status = 1 AND A.name = 'Mike' AND A.created_on BETWEEN '2014-10-01 00:00:00' AND NOW()
ORDER BY A.accessed_on DESC
Based on my knowledge, a good index for the above query is an index on table A (id, status, name, created_on, accessed_on) and another on B.id.
I also understand that SQL execution follows the order below, but I am not sure how index selection and ordering work.
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
Question
Would it be better to start the index with the id column, or does it not matter in this case since WHERE is executed before the JOIN?
Second question: should the column accessed_on be at the beginning, the end, or the middle of the index? Or should the id column come after all the columns in the WHERE clause?
I would appreciate a detailed answer so I can understand the execution level of MySQL/SQL.
UPDATED
I added a few million records to both tables A and B, then added multiple indexes to see which would be the best. MySQL seems to like the index id_2, i.e. (status, name, created_on, id, accessed_on).
It seems to apply the WHERE first, for which it needs an index on status, name, created_on; then it applies the INNER JOIN, using the id column that follows the first three; finally, it looks for accessed_on as the last column. So the index (status, name, created_on, id, accessed_on) follows the same execution order.
Here are the table structures:
CREATE TABLE `a` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`status` int(2) NOT NULL,
`name` varchar(255) NOT NULL,
`created_on` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`accessed_on` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `status` (`status`,`name`),
KEY `status_2` (`status`,`name`,`created_on`),
KEY `status_3` (`status`,`name`,`created_on`,`accessed_on`),
KEY `status_4` (`status`,`name`,`accessed_on`),
KEY `id` (`id`,`status`,`name`,`created_on`,`accessed_on`),
KEY `id_2` (`status`,`name`,`created_on`,`id`,`accessed_on`)
) ENGINE=InnoDB AUTO_INCREMENT=3135750 DEFAULT CHARSET=utf8
CREATE TABLE `b` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3012644 DEFAULT CHARSET=utf8
The best indexes for this query are: A(status, name, created_on) and B(id). These indexes will satisfy the where clause and use the index for the join to B.
This index will not be used for sorting. There are two major impediments to using any index for sorting. The first is the join. The second is the non-equality on created_on. Some databases might figure out to use an index on A(status, name, accessed_on), but I don't think MySQL is smart enough for that.
You don't want id as the first column in the index. This precludes using the index to filter on A, because id is used for the join rather than in the where.
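One way to see the trade-off described above is to compare the plans under two of the indexes already defined in the question: status_2 (status, name, created_on), which serves the filter, and status_4 (status, name, accessed_on), which is in sort order but has to check created_on row by row. The hints below are for experimentation only, and which plan wins will depend on the data.

```sql
-- Filter-oriented index: expect a range scan plus a filesort.
EXPLAIN
SELECT *
FROM a FORCE INDEX (status_2)
INNER JOIN b ON b.id = a.id
WHERE a.status = 1
  AND a.name = 'Mike'
  AND a.created_on BETWEEN '2014-10-01 00:00:00' AND NOW()
ORDER BY a.accessed_on DESC;

-- Sort-oriented index: rows come out ordered by accessed_on,
-- but created_on has to be checked for each candidate row.
EXPLAIN
SELECT *
FROM a FORCE INDEX (status_4)
INNER JOIN b ON b.id = a.id
WHERE a.status = 1
  AND a.name = 'Mike'
  AND a.created_on BETWEEN '2014-10-01 00:00:00' AND NOW()
ORDER BY a.accessed_on DESC;
```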

Optimizing MySQL Query, takes almost 20 seconds!

I'm running the following query on a Macbook Pro 2.53ghz with 4GB of Ram:
SELECT
c.id AS id,
c.name AS name,
c.parent_id AS parent_id,
s.domain AS domain_name,
s.domain_id AS domain_id,
NULL AS stats
FROM
stats s
LEFT JOIN stats_id_category sic ON s.id = sic.stats_id
LEFT JOIN categories c ON c.id = sic.category_id
GROUP BY
c.name
It takes about 17 seconds to complete.
EXPLAIN:
(EXPLAIN output was posted as an image: http://img7.imageshack.us/img7/1364/picture1va.png)
The tables:
Information:
Number of rows: 147397
Data size: 20.3MB
Index size: 1.4MB
Table:
CREATE TABLE `stats` (
`id` int(11) unsigned NOT NULL auto_increment,
`time` int(11) NOT NULL,
`domain` varchar(40) NOT NULL,
`ip` varchar(20) NOT NULL,
`user_agent` varchar(255) NOT NULL,
`domain_id` int(11) NOT NULL,
`date` timestamp NOT NULL default CURRENT_TIMESTAMP,
`referrer` varchar(400) default NULL,
KEY `id` (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=147398 DEFAULT CHARSET=utf8
Information second table:
Number of rows: 1285093
Data size: 11MB
Index size: 17.5MB
Second table:
CREATE TABLE `stats_id_category` (
`stats_id` int(11) NOT NULL,
`category_id` int(11) NOT NULL,
KEY `stats_id` (`stats_id`,`category_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
Information third table:
Number of rows: 161
Data size: 3.9KB
Index size: 8KB
Third table:
CREATE TABLE `categories` (
`id` int(11) NOT NULL auto_increment,
`parent_id` int(11) default NULL,
`name` varchar(40) NOT NULL,
`questions_category_id` int(11) NOT NULL default '0',
`rank` int(2) NOT NULL default '0',
PRIMARY KEY (`id`),
KEY `id` (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=205 DEFAULT CHARSET=latin1
Hopefully someone can help me speed this up.
I see several WTF's in your query:
You use two LEFT OUTER JOINs but then you group by the c.name column which might have no matches. So perhaps you don't really need an outer join? If that's the case, you should use an inner join, because outer joins are often slower.
You are grouping by c.name but this gives ambiguous results for every other column in your select-list. I.e. there might be multiple values in these columns in each grouping by c.name. You're lucky you're using MySQL, because this query would simply give an error in any other RDBMS.
This is a performance issue because the GROUP BY is likely causing the "using temporary; using filesort" you see in the EXPLAIN. This is a notorious performance-killer, and it's probably the single biggest reason this query is taking 17 seconds. Since it's not clear why you're using GROUP BY at all (using no aggregate functions, and violating the Single-Value Rule), it seems like you need to rethink this.
You are grouping by c.name which doesn't have a UNIQUE constraint on it. You could in theory have multiple categories with the same name, and these would be lumped together in a group. I wonder why you don't group by c.id if you want one group per category.
SELECT NULL AS stats: I don't understand why you need this. It's kind of like creating a variable that you never use. It shouldn't harm performance, but it's just another WTF that makes me think you haven't thought this query through very well.
You say in a comment you're looking for number of visitors per category. But your query doesn't have any aggregate functions like SUM() or COUNT(). And your select-list includes s.domain and s.domain_id which would be different for every visitor, right? So what value do you expect to be in the result set if you only have one row per category? This isn't really a performance issue either, it just means your query results don't tell you anything useful.
Your stats_id_category table has an index over its two columns, but no primary key. So you can easily get duplicate rows, and this means your count of visitors may be inaccurate. You need to drop that redundant index and use a primary key instead. I'd order category_id first in that primary key, so the join can take advantage of the index.
ALTER TABLE stats_id_category DROP KEY stats_id,
ADD PRIMARY KEY (category_id, stats_id);
Now you can eliminate one of your joins, if all you need to count is the number of visitors:
SELECT c.id, c.name, c.parent_id, COUNT(*) AS num_visitors
FROM categories c
INNER JOIN stats_id_category sic ON (sic.category_id = c.id)
GROUP BY c.id;
Now the query doesn't need to read the stats table at all, or even the stats_id_category table. It can get its count simply by reading the index of the stats_id_category table, which should eliminate a lot of work.
You are missing the third table in the information provided (categories).
Also, it seems odd that you are doing a LEFT JOIN and then using the right table (which might be all NULLS) in the GROUP BY. You will end up grouping all of the non-matching rows together as a result, is that what you intended?
Finally, can you provide an EXPLAIN for the SELECT?
Harrison is right; we need the other table. I would start by adding an index on category_id to stats_id_category, though.
I agree with Bill. Point 2 is very important. The query doesn't even make logical sense. Also, the simple fact that there is no WHERE clause means that you have to pull back every row in the stats table, which seems to be around 140000 rows. It then has to sort all that data so that it can perform the GROUP BY. This is because sorting [ O(n log n) ] and then finding duplicates [ O(n) ] is much faster than finding duplicates without sorting the data set [ O(n^2)?? ].