Since my website database has grown very large the performance for certain queries has become terrible. Some queries are taking over 30 seconds to perform. I'm wondering if someone can help me optimize my query or make a suggestion on how I can improve performance? I have set an index on all the foreign keys and ids.
SELECT p.*
, u.unique_id
, u.nick_name
, u.avatar_thumb
, t.desc as tag_desc
, pt.post_id as tag_post_id
FROM tt_post_tags pt
LEFT JOIN tt_posts p ON p.id = pt.post_id
RIGHT JOIN tt_users u ON p.user_id = u.user_id
LEFT JOIN tt_tags t ON t.name = "gameday"
WHERE pt.name = "gameday"
ORDER BY create_date DESC
LIMIT 100
The above query takes 29 seconds to complete. If I remove the "create_date DESC" from the query it runs in .3 seconds. I've added an index to create_date but still, it takes 30 seconds for the query to run. The tt_posts table contains about 1.6 million records.
My database has the following tables: Posts, Users, Tags, and PostTags.
Posts table contains a foreign key for the users table.
Tags table contains a unique id and name for each tag
Post_tags table contains the foreign key from the Tags table as well as a foreign key for the post that the tag is for.
I can include a diagram tomorrow if it's not easy to understand. Hopefully, someone can assist me. Thanks in advance.
CREATE TABLE `tt_posts` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`post_id` bigint(30) NOT NULL,
`user_id` bigint(30) NOT NULL,
`create_date` datetime NOT NULL,
`cover` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`duration` int(10) DEFAULT NULL,
`desc` text CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
PRIMARY KEY (`id`),
UNIQUE KEY `post_id` (`post_id`),
KEY `user_id` (`user_id`),
KEY `create_date` (`create_date`)
) ENGINE=InnoDB AUTO_INCREMENT=4641550 DEFAULT CHARSET=utf8
CREATE TABLE `tt_tags` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`tt_tag_id` BIGINT(30) NULL DEFAULT NULL,
`name` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
PRIMARY KEY (`id`),
UNIQUE INDEX `name` (`name`),
UNIQUE INDEX `tt_tag_id` (`tt_tag_id`),
INDEX `tt_tag_id_key` (`tt_tag_id`),
INDEX `name_key` (`name`)
)
COLLATE='utf8mb4_unicode_ci'
ENGINE=InnoDB
AND
CREATE TABLE `tt_post_tags` (
`post_id` INT(11) NOT NULL,
`name` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
INDEX `post_id` (`post_id`),
INDEX `name` (`name`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB;
AND
CREATE TABLE `tt_users` (
`id` BIGINT(20) NOT NULL AUTO_INCREMENT,
`user_id` BIGINT(30) NOT NULL,
`unique_id` VARCHAR(190) NOT NULL COLLATE 'utf8mb4_unicode_ci',
`nick_name` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
`avatar` VARCHAR(190) NOT NULL COLLATE 'utf8mb4_unicode_ci',
`signature` TEXT NOT NULL COLLATE 'utf8mb4_unicode_ci',
PRIMARY KEY (`id`),
UNIQUE INDEX `user_id` (`user_id`),
UNIQUE INDEX `unique_id` (`unique_id`),
INDEX `unique_id_index` (`unique_id`),
INDEX `user_id_index` (`user_id`)
)
COLLATE='utf8mb4_unicode_ci'
ENGINE=InnoDB
In my opinion, the main issue with your query is the mix of left and right outer joins. Honestly, are you able to read this correctly?
The first join alone seems weird. You outer join a post to its post tags. But can a post tag without a post even exist? What would it refer to? (The other way round would make more sense: to also select posts that have no tags.) If I am not mistaken here, your join is rendered to a mere inner join. In your where clause you further limit this result to post tags named 'gameday'.
Then you right outer join users. We avoid right outer joins for being by far less readable than left outer joins, but well, you select all users, even those without 'gameday' post tags here.
Then you left outer join all 'gameday' tags. This looks completely unrelated to the other tables (i.e. you either find 'gameday' tags or not). But in your explanation you say "Post_tags table contains the foreign key from the Tags", so I surmise there is no tag_id in your post tags table, but the name is the tag ID really (and thus also the foreign key in your post tags table). This again leads to the question: Why would a post tag even exist, when it has no related tag? Probably this is not possible, and again all this is boiled down to a mere inner join. (I would recommend here to have a tag_id instead of the names in both tables, just for readability. The column name name kind of hides the foreign key relationship.)
In your query, you don't show any information of the post tags table, but I see you select pt.post_id as tag_post_id, which of course is just p.id as tag_post_id again. I suppose this is a typo and you want to show pt.id as tag_post_id instead?
I understand that you want to see all users, but are only interested in 'gameday' post tags. This makes writing the query a little complicated. I would probably just select users and outer join the complete post tag information.
Your create_date is not qualified with a table. I suppose it is a column in the posts table?
This is the query I am coming up with:
select
gdp.*,
u.unique_id,
u.nick_name,
u.avatar_thumb
from tt_users u
left join
(
select
p.*,
t.desc as tag_desc,
pt.id as tag_post_id
from tt_tags t
join tt_post_tags pt on pt.name = t.name
join tt_posts p on p.id = pt.post_id
where t.name = 'gameday'
) gdp on gdp.user_id = u.user_id
order by gdp.create_date desc;
There has been a lot of guessing on my side, so this query may still be a little different from what you need. I don't know.
Now let's look at which table columns are accessed, to provide good indexes for the query. Let's particularly look at the subquery where we collect all post tags:
We only want 'gameday' tags. As this seems to be the primary key for tt_tags, there should already be a unique index on tt_tags(name).
Being the foreign key, there should also be an index on tt_post_tags(name). This is good, but as we want to continue joining on the post_id, it would be beneficial to have this in the index, too: create unique index idx on tt_post_tags(name, post_id). However, as this is the table's natural key, this index should also already exist in order to ensure data integrity. If it doesn't exist yet, hurry up to provide it.
At last we join tt_posts on its primary key (i.e. there should be an index on tt_posts(id)). Once more: Nothing for us to do here.
You select all users and you select all 'gameday' tags. Then you must join all found tags to the users, which already is some work. You can imagine this as ordering all found tags by user_id first in order to join. Then you want to sort your result by post date. This means that the DBMS must again sort all result rows. Sorting takes time; that's just the way it is. How many rows does the result contain? If we are talking about millions of rows to sort, then this will probably remain slow. And if many post tags are 'gameday' tags, then even the indexes may not help much reading the tables and the DBMS may go for full sequential table reads instead. Make sure the statistics are up-to-date (https://dev.mysql.com/doc/refman/8.0/en/analyze-table.html).
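The row-count question and the statistics refresh above can be checked directly. This is only a sketch against the table names from the question; the simplified SELECT leaves out the users and tags joins and is my own reduction, not the asker's full query.

```sql
-- Refresh optimizer statistics for the tables involved.
ANALYZE TABLE tt_posts, tt_post_tags, tt_tags, tt_users;

-- How many rows must be sorted before LIMIT can be applied?
SELECT COUNT(*) FROM tt_post_tags WHERE name = 'gameday';

-- Inspect the plan; "Using temporary; Using filesort" in the Extra column
-- means the whole intermediate result is sorted before the first 100 rows are returned.
EXPLAIN
SELECT p.id
FROM tt_post_tags AS pt
JOIN tt_posts AS p ON p.id = pt.post_id
WHERE pt.name = 'gameday'
ORDER BY p.create_date DESC
LIMIT 100;
```

If the count is in the hundreds of thousands, the sort itself is likely the cost, and index changes on the join columns alone may not help much.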
(Iteration one of Answering the Question)
First, let's look at the query without users:
select p.id
from post_tags AS pt
join posts AS p ON p.id = pt.post_id
join tags AS t ON t.name = "gameday"
where pt.name = "gameday"
ORDER BY p.create_date
LIMIT 100;
It is not possible to have a single index that handles both pt.name and p.create_date. Is there any way to get them into the same table? I see, for example, that name seems to be stored redundantly in t and pt.
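One way to get both columns into the same table is to copy create_date into the tag mapping table. The following is a rough sketch under assumptions: the column is added as nullable and backfilled once, the index name idx_name_created is made up, and keeping the copy in sync on later inserts is left to the application.

```sql
-- Hypothetical denormalization: carry the post's create_date on each tag row.
ALTER TABLE tt_post_tags ADD COLUMN create_date DATETIME NULL;

-- One-time backfill from the posts table.
UPDATE tt_post_tags AS pt
JOIN tt_posts AS p ON p.id = pt.post_id
SET pt.create_date = p.create_date;

-- Composite index: filter column first, sort column second.
CREATE INDEX idx_name_created ON tt_post_tags (name, create_date);

-- The newest 100 'gameday' posts can now be found from the index alone,
-- and only those 100 ids need to be joined to posts and users afterwards.
SELECT pt.post_id
FROM tt_post_tags AS pt
WHERE pt.name = 'gameday'
ORDER BY pt.create_date DESC
LIMIT 100;
```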
tt_post_tags sounds like a many-to-many mapping table between posts and tags; is it? If so, what is name, which seems to be in both tags and post_tags?
I think this
join tags AS t ON t.name = "gameday"
should be
join tags AS t ON t.name = "gameday" AND pt.tag_id = t.tag_id
If so, that might be the main problem. Please provide SHOW CREATE TABLE for the rest of the tables.
The following indexes may (or may not) help:
tags: (post_id, name)
tags: (name, tag_id)
posts: (create_date, id)
post_tags: (name, post_id)
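Written out as DDL against the posted schema, the two suggestions whose columns actually exist there might look like the sketch below. The index names are made up, and the two tags entries are skipped because tt_tags has no post_id or tag_id column in the schema shown.

```sql
-- posts: (create_date, id)
-- On InnoDB the existing create_date index already carries the primary key,
-- so this one may add little.
CREATE INDEX idx_posts_created_id ON tt_posts (create_date, id);

-- post_tags: (name, post_id)
-- Lets the filter on name and the join on post_id be answered from one index.
CREATE INDEX idx_post_tags_name_post ON tt_post_tags (name, post_id);
```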
More
A UNIQUE INDEX is an INDEX, so the second of these is redundant and should be dropped: UNIQUE(x), INDEX(x)
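Applied to the schema in the question, the redundant plain indexes are the ones that repeat a UNIQUE index on the same column. A sketch, using the index names exactly as they appear in the posted CREATE TABLE statements:

```sql
-- tt_tags: UNIQUE name / UNIQUE tt_tag_id already cover these columns.
ALTER TABLE tt_tags
  DROP INDEX name_key,
  DROP INDEX tt_tag_id_key;

-- tt_users: UNIQUE user_id / UNIQUE unique_id already cover these columns.
ALTER TABLE tt_users
  DROP INDEX user_id_index,
  DROP INDEX unique_id_index;
```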
Index Cookbook: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
Query :
SELECT
r.reply_id,
r.msg_id,
r.uid,
r.body,
r.date,
u.username as username,
u.profile_picture as profile_picture
FROM
pm_replies as r
LEFT JOIN users as u
ON u.uid = r.uid
WHERE
r.msg_id = '784351921943772258'
ORDER BY r.date DESC
I tried all the index combinations I could think of and searched Google for the best way to index this, but nothing worked.
This query takes 0.33 seconds for 500 returned rows, and the table keeps growing.
EXPLAIN:
id  select_type  table  type    possible_keys  key   key_len  ref         rows  Extra
1   SIMPLE       r      ALL     index1         NULL  NULL     NULL        540   Using where; Using filesort
1   SIMPLE       u      eq_ref  uid            uid   8        site.r.uid  1
SHOW CREATE pm_replies
CREATE TABLE `pm_replies` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`reply_id` bigint(20) NOT NULL,
`msg_id` bigint(20) NOT NULL,
`uid` bigint(20) NOT NULL,
`body` text COLLATE utf8_unicode_ci NOT NULL,
`date` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `index1` (`msg_id`,`date`,`uid`)
) ENGINE=MyISAM AUTO_INCREMENT=541 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
SHOW CREATE users
CREATE TABLE `users` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`uid` bigint(20) NOT NULL,
`username` varchar(20) COLLATE utf8_unicode_ci NOT NULL,
`email` text CHARACTER SET latin1 NOT NULL,
`password` text CHARACTER SET latin1 NOT NULL,
`profile_picture` text COLLATE utf8_unicode_ci NOT NULL,
`date_registered` datetime NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `uid` (`uid`),
UNIQUE KEY `username` (`username`)
) ENGINE=MyISAM AUTO_INCREMENT=2004 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
For the query as it is, the best indexes would seem to be...
pm_replies: (msg_id, date, uid)
users: (uid)
The important one is pm_replies. You use it to both filter your data (the filter column is first) then order your data (the order column is second).
This would be different if you removed the filter. Then you'd just want (date, uid) as your index.
The last field in the index just makes it a fraction friendlier to the join, the important part is actually the index on users.
There is a lot more that could be said on this, a whole chapter in a book at the very least, and several books if you wanted to. But I hope this helps.
EDIT
Note that my suggested index for pm_replies is one index covering three fields, and not just three indexes. This ensures that all the entries in the index are pre-sorted by those columns. It's like sorting data in Excel by three columns.
Having three separate indexes is like having the Excel data on three tabs, each sorted by a different field.
Only with one index over three fields do you get this behaviour...
- You can select one 'bunch' of records with the same msg_id
- That whole 'bunch' are next to each other, no gaps, etc
- That whole 'bunch' are sorted in date order for that msg_id
- For any rows with the same date, they're ordered by uid
(Again the user_id part is really very minor.)
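To see whether the composite index can serve both the filter and the sort, it may help to compare the plan with and without an index hint. This is a diagnostic sketch only: index1 is the name from the posted schema, FORCE INDEX is used just for the comparison, and writing the id as a numeric literal (to match the BIGINT column) is my assumption about what is intended.

```sql
-- Plan as the optimizer chooses it.
EXPLAIN
SELECT r.reply_id, r.msg_id, r.uid, r.body, r.`date`
FROM pm_replies AS r
WHERE r.msg_id = 784351921943772258
ORDER BY r.`date` DESC;

-- Same query, forcing the (msg_id, date, uid) index.
-- If `key` shows index1 and "Using filesort" disappears from Extra,
-- the index is delivering the rows already in date order.
EXPLAIN
SELECT r.reply_id, r.msg_id, r.uid, r.body, r.`date`
FROM pm_replies AS r FORCE INDEX (index1)
WHERE r.msg_id = 784351921943772258
ORDER BY r.`date` DESC;
```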
Please try this:
SELECT
r.reply_id,
r.msg_id,
r.uid,
r.body,
r.date,
u.username as username,
u.profile_picture as profile_picture
FROM
pm_replies as r
LEFT JOIN users as u
ON (u.uid = r.uid AND r.msg_id = '784351921943772258')
ORDER BY r.date DESC
In my case it helped.
Add date to your index1 key so that msg_id and date are both in the index.
What Dems is saying should be correct, but there is one additional detail if you are using InnoDB: perhaps you are paying the price of secondary indexes on clustered tables. Essentially, accessing a row through the secondary index requires an additional lookup through the primary (i.e. clustering) index. This "double lookup" might make the index less attractive to the query optimizer.
To alleviate this, try covering all the fields in your select statement with the index:
pm_replies: (msg_id, date, uid, reply_id, body)
users: (uid, username, profile_picture)
It appears the optimizer is trying to force the index by ID to make the join to the user table. Since you are doing a left-join (which doesn't make sense since I would expect every entry to have a user ID, thus a normal INNER JOIN), I'll keep it left join.
So, I would try the following. Query just the replies based on the MESSAGE ID and order by the date descending on its own merits, THEN left join, such as
SELECT
r.reply_id,
r.msg_id,
r.uid,
r.body,
r.date,
u.username as username,
u.profile_picture as profile_picture
FROM
( select R2.*
from pm_replies R2
where R2.msg_id = '784351921943772258' ) r
LEFT JOIN users as u
ON u.uid = r.uid
ORDER BY
r.date DESC
In addition (I don't have MySQL readily available and can't remember whether ORDER BY is allowed in a sub-query): if it is, you can optimize the inner prequery (using alias "R2") by putting the ORDER BY there, so that it uses the (msg_id, date) index and returns just that set. The outer query THEN joins to the users table on the ID; no index is required on the SOURCE result set at that point, just the index on the users table to find the match.
I'm working on a blogging app that requires a unique query.
Problem: I need to display one parent post, all its child posts (up to a certain number before requiring pagination), and up to 5 comments associated with each child post and with the parent.
I wrote this query, but it doesn't work because it will return only 5 comments that belong to the parent post.
SELECT
posts.id, posts.postTypeId, posts.parentId, posts.ownerUserId, posts.body
, users.id AS authorId, users.displayname AS authorDisplayName
, comments.id AS commentId, comments.text AS commentText
, comments.commentOwnerUserId, comments.commentOwnerDisplayName
FROM posts
JOIN users ON posts.owneruserid = users.id
LEFT JOIN ( SELECT comments.id, comments.postId, comments.text, commenters.id AS commentOwnerUserId, commenters.displayname AS commentOwnerDisplayName
FROM comments
JOIN users AS commenters ON comments.userid = commenters.id
ORDER BY comments.createdat ASC
LIMIT 0,5 ) AS comments ON comments.postid = posts.id
WHERE posts.id = #postId OR posts.parentId = #postId
ORDER BY posts.posttypeid, posts.createdAt
The query returns the parent post, all its children, and the first 5 comments it encounters (usually they belong to the parent because we are ordering by postTypeId, and the parent is the first post). If the first post doesn't have 5 comments, it moves on to the next post and returns those comments, until the limit of 5 is reached.
What I need is to return one parent post and all its child posts, and up to 5 comments for each child and for the parent. I also need the owner data for each post and comment.
UPDATE I'm open to doing this with more than one query if it will scale well. The only condition is that the parent and children posts retrieval happens in the same query.
Any idea how I can write such a query? I included my schema below.
/* Posts table */
CREATE TABLE `posts` (
`id` int(10) NOT NULL AUTO_INCREMENT,
`posttypeid` int(10) NOT NULL,
`parentid` int(10) DEFAULT NULL,
`body` text NOT NULL,
`userid` int(10) NOT NULL,
`createdat` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `parentId` (`parentid`),
KEY `userId` (`userid`)
) ENGINE=InnoDB AUTO_INCREMENT=572 DEFAULT CHARSET=utf8
/* Comments table */
CREATE TABLE `comments` (
`id` int(10) NOT NULL AUTO_INCREMENT,
`postid` int(10) NOT NULL,
`userid` int(10) NOT NULL,
`text` text NOT NULL,
`createdat` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `postId` (`postid`),
KEY `userId` (`userid`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=utf8
/* users table */
CREATE TABLE `users` (
`id` int(10) NOT NULL AUTO_INCREMENT,
`email` varchar(50) NOT NULL,
`displayname` varchar(50) NOT NULL,
`createdat` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `email` (`email`)
) ENGINE=InnoDB AUTO_INCREMENT=66 DEFAULT CHARSET=utf8
It sounds like you're looking to store hierarchical data. This isn't so much a hard question as it is a bit of a time-consuming one. I would suggest reading a really good article from a few years ago named Managing Hierarchical Data in MySQL by Mike Hillyer. It's got some really good conceptual suggestions as well as example implementations of the kind of system it sounds like you're designing. Definitely read the "Find the Immediate Subordinates of a Node" section. :)
I assume you will have paging of some sort to restrict the amount of top level posts.
You would also need some summary information on the number of comments and child posts, either in the posts table or in a post summary table.
The comments table would need a sequence column per post.
1: Get all parent posts (parentId = 0) and construct an IN clause of post ids.
2: Get all child posts by passing the post ids obtained in step 1, ordered by post id, which helps with segregation. Add these posts to the overall IN clause.
3: Get the comments by passing the post ids from steps 1 and 2.
4: Restrict the number of comments by using the comment count and the sequence column. For example: join comments and post_summary where post_comment_seq between (noofcommentsforthepost - 5) and noofcommentsforthepost.
You can read about IN clause performance here: Performance of MYSQL "IN"
I've adjusted the other query from your previous question to simply include a WHERE clause on your ParentID. That was the condition I didn't know you were looking for to limit the return set. I added a condition where the post ID = the one you want OR the ParentID = the one you want.
By ordering by the post ID, the originating parent will naturally be in the first position, as the others are derived from it sequentially. I think that will solve it for you again.
I take it for granted that every child has exactly one parent. Then, I think this will work:
SELECT p.*   -- post details
     , u.*   -- user details
     , cc.*  -- comment details
FROM
  ( ( SELECT parentid AS id
      FROM posts
      WHERE posts.id = #mypostid   -- the id of the post we want
    )
    UNION ALL
    ( SELECT child.id
      FROM posts AS parent
      JOIN posts AS child
        ON child.parentid = parent.id
      WHERE parent.id =
            ( SELECT posts.parentid
              FROM posts
              WHERE posts.id = #mypostid )   -- the id of the post we want
      ORDER BY child.createdat   -- any order you prefer
      LIMIT x, 5                 -- 5 children posts
    )
  ) AS pp
  JOIN posts p
    ON p.id = pp.id
  JOIN users u
    ON u.id = p.userid
  JOIN comments cc
    ON cc.postid = pp.id
WHERE cc.id IN
      ( SELECT c.id
        FROM comments c
        WHERE c.postid = pp.id
        ORDER BY c.createdat   -- any order you prefer
        LIMIT y, 5             -- 5 comments for every post
      )
The x,5 should be replaced with 0,5 for the first five child posts and y,5 with 0,5 for the first five comments. Then use 5,5 for the next five, 10,5 for the five after that, etc.
UPDATE
Sorry, my mistake. The above gives the error:
This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'
I'll try to wrap my head around a workaround for this :)
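For what it's worth, on MySQL 8.0 or later the "up to 5 comments per post" requirement can be expressed with a window function instead of LIMIT inside IN. This is a sketch under assumptions: it needs window-function support (not available in older versions), it uses the userid column from the posted schema rather than ownerUserId, and 42 stands in for the #postId placeholder.

```sql
SELECT p.id, p.posttypeid, p.parentid, p.body,
       pu.id           AS authorId,
       pu.displayname  AS authorDisplayName,
       c.id            AS commentId,
       c.`text`        AS commentText,
       cu.id           AS commentOwnerUserId,
       cu.displayname  AS commentOwnerDisplayName
FROM posts AS p
JOIN users AS pu ON pu.id = p.userid
LEFT JOIN (
    -- Number each post's comments by age; only the relevant posts are ranked.
    SELECT id, postid, userid, `text`, createdat,
           ROW_NUMBER() OVER (PARTITION BY postid
                              ORDER BY createdat, id) AS rn
    FROM comments
    WHERE postid IN (SELECT id FROM posts WHERE id = 42 OR parentid = 42)
) AS c ON c.postid = p.id AND c.rn <= 5      -- keep at most 5 comments per post
LEFT JOIN users AS cu ON cu.id = c.userid
WHERE p.id = 42 OR p.parentid = 42
ORDER BY p.posttypeid, p.createdat, c.rn;
```

Posts without comments still appear once, with NULL comment columns, because both comment joins are LEFT JOINs.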
I have read about this, but I'm still confused about when to use a normal index or a unique index in MySQL. I have a table that stores posts and responses (id, parentId). I have set up three normal indices for parentId, userId, and editorId.
Would using unique indices benefit me in any way given the following types of queries I will generally run? And why?
Most of my queries will return a post and its responses:
SELECT * FROM posts WHERE id = #postId OR parentId = #postId ORDER BY postTypeId
Sometimes I will add a join to get user data:
SELECT * FROM posts
JOIN users AS owner ON owner.id = posts.userId
LEFT JOIN users AS editor ON editor.id = posts.editorId
WHERE id = #postId OR parentId = #postId ORDER BY postTypeId
Other times I may ask for a user and his/her posts:
SELECT * FROM users
LEFT JOIN posts ON users.id = posts.userid
WHERE id = #userId
My schema looks like this:
CREATE TABLE `posts` (
`id` int(10) NOT NULL AUTO_INCREMENT,
`posttypeid` int(10) NOT NULL,
`parentid` int(10) DEFAULT NULL,
`body` text NOT NULL,
`userid` int(10) NOT NULL,
`editorid` int(10) NOT NULL,
`updatedat` datetime DEFAULT NULL,
`createdat` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `userId` (`userid`),
KEY `editorId` (`editorid`),
KEY `parentId` (`parentid`)
) ENGINE=InnoDB AUTO_INCREMENT=572 DEFAULT CHARSET=utf8
When an index is created as UNIQUE, it only adds consistency to your table: inserting a new entry that reuses the same key by mistake will fail, instead of being accepted and leading to strange errors later.
So, you should use it for your IDs when you know there won't be duplicates (it's the default and mandatory for primary keys), but it won't give you any benefits performance-wise. It only gives you a guarantee that you won't have to deal with a specific kind of database corruption because of a bug in the client code.
However, if you know there can be duplicates (which I assume is the case for your columns userId, editorId, and parentId), using the UNIQUE attribute would be a serious bug: it would forbid multiple posts with the same userId, editorId or parentId.
In short: use it everywhere you can, but in this case you can't.
Unique is a constraint that just happens to be implemented by the index.
Use unique when you need unique values, i.e. no duplicates. Otherwise don't. That simple really.
Unique keys do not have any benefit over normal keys for data retrieval. Unique keys are indexes with a constraint: they prevent insertion of the same value twice, so their benefit shows up at insert time, as a data-integrity check.
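The difference is easy to see on a throwaway table. This is a minimal sketch; demo_posts, slug, uq_slug and idx_userid are hypothetical names and not part of the schema in the question.

```sql
CREATE TABLE demo_posts (
  id     INT NOT NULL AUTO_INCREMENT,
  slug   VARCHAR(50) NOT NULL,
  userid INT NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY uq_slug (slug),    -- constraint: at most one row per slug
  KEY idx_userid (userid)       -- plain index: duplicates are allowed
) ENGINE=InnoDB;

INSERT INTO demo_posts (slug, userid) VALUES ('first-post', 7);
INSERT INTO demo_posts (slug, userid) VALUES ('second-post', 7);  -- fine: userid may repeat

-- Rejected with a duplicate-entry error (MySQL error 1062),
-- because 'first-post' already exists in uq_slug.
INSERT INTO demo_posts (slug, userid) VALUES ('first-post', 8);
```

A UNIQUE key on userid would make the second insert fail as well, which is why it does not fit columns like userId, editorId, or parentId.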
I have a web application that uses a table schema similar to the one below. I simply want to optimize the selection of articles. Articles are selected based on the tag given. For example, if the tag is 'iphone', the query should output all open articles about 'iphone' from the last month.
CREATE TABLE `article` (
`id` int(11) NOT NULL auto_increment,
`title` varchar(100) NOT NULL,
`body` varchar(200) NOT NULL,
`date` timestamp NOT NULL default CURRENT_TIMESTAMP,
`author_id` int(11) NOT NULL,
`section` varchar(30) NOT NULL,
`status` int(1) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
CREATE TABLE `tags` (
`name` varchar(30) NOT NULL,
`article_id` int(11) NOT NULL,
PRIMARY KEY (`name`,`article_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
CREATE TABLE `users` (
`id` int(11) NOT NULL auto_increment,
`username` varchar(30) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=3 ;
The following is my MySQL query
explain select article.id,users.username,article.title
from article,users,tags
where article.id=tags.article_id and tags.name = 'iphone4'
and article.author_id=users.id and article.status = '1'
and article.section = 'mobile'
and article.date > '2010-02-07 13:25:46'
ORDER BY tags.article_id DESC
the output is
id  select_type  table    type    possible_keys  key      key_len  ref                     rows  Extra
1   SIMPLE       tags     ref     PRIMARY        PRIMARY  92       const                   55    Using where; Using index
1   SIMPLE       article  eq_ref  PRIMARY        PRIMARY  4        test.tags.article_id    1     Using where
1   SIMPLE       users    eq_ref  PRIMARY        PRIMARY  4        test.article.author_id  1
is it possible to optimize it more?
This query may be optimized, depending on which condition is more selective: tags.name = 'iphone4' or article.date > '2010-02-07 13:25:46'
If there are fewer articles tagged iphone than articles posted after Feb 7, then your original query is fine.
If there are many articles tagged iphone, but few posted after Feb 7, then this query will be more efficient:
SELECT article.id, users.username, article.title
FROM tags
JOIN article
ON article.id = tags.article_id
AND article.status = '1'
AND article.section = 'mobile'
AND article.date > '2010-02-07 13:25:46'
JOIN users
ON users.id = article.author_id
WHERE tags.name = 'iphone4'
ORDER BY
article.date DESC, article.id DESC
Note that the ORDER BY condition has changed. This may or may not be what you want, however, generally the orders of id and date correspond to each other.
If you really need your original ORDER BY condition you may leave it but it will add a filesort (or just revert to your original plan).
In either case, create an index on
article (status, section, date, id)
"the query should output all open articles about 'iphone' from the last month."
So the only query you are going to run on this data uses the tag and the date. You've got an index for the tag in the tags table, but the date is stored in a different table (article; you're a bit inconsistent with your naming scheme). Adding an index on the article table using date would be no benefit at all. Using id,date (in that order) would help a little, but really the date needs to be denormalised into the tags table to get the query running really fast.
Unless you're regularly moving around bulk data sets - just add a datetime column with a default of the current timestamp to the tags table.
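A rough sketch of that denormalisation, under assumptions: article_date and idx_name_date are names I made up, the one-time UPDATE copies the existing article dates, and new tag rows simply pick up the current timestamp by default.

```sql
-- Add the date to the tags table, with an index that serves
-- both the tag filter and the date range.
ALTER TABLE tags
  ADD COLUMN article_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  ADD INDEX idx_name_date (name, article_date);

-- Backfill existing rows from the article table.
UPDATE tags
JOIN article ON article.id = tags.article_id
SET tags.article_date = article.`date`;

-- Tag and date are now filtered and sorted from one index;
-- status and section are still checked on the joined article row.
SELECT article.id, users.username, article.title
FROM tags
JOIN article ON article.id = tags.article_id
JOIN users ON users.id = article.author_id
WHERE tags.name = 'iphone4'
  AND tags.article_date > '2010-02-07 13:25:46'
  AND article.status = 1
  AND article.section = 'mobile'
ORDER BY tags.article_date DESC;
```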
I expect that you may want to interact with the data in lots of other ways. Really you should set a low (no?) threshold for slow query logging, then analyse the resulting data to identify where your performance problems are (try looking at the queries with the highest values for duration^2*frequency first).
There's a script at the URL below which is useful for this analysis:
http://www.retards.org/projects/mysql/
You could index the additional fields in article that you are referencing in your select statement. In this case, I would suggest you create an index in article like this:
CREATE INDEX article_idx ON article (author_id, status, section, date);
Creating that index should speed up your query, depending on how many overall records you are dealing with. From my understanding, properly creating indexes involves looking at the queries you've written and indexing the columns that are part of your WHERE clause. This helps the query optimizer process the query better in general. That does not mean creating an index on each individual column, however, as it's both inefficient and ineffective to do so. When possible, create multiple-column indexes that match your select statement.
I'm running the following query on a MacBook Pro (2.53 GHz, 4 GB of RAM):
SELECT
c.id AS id,
c.name AS name,
c.parent_id AS parent_id,
s.domain AS domain_name,
s.domain_id AS domain_id,
NULL AS stats
FROM
stats s
LEFT JOIN stats_id_category sic ON s.id = sic.stats_id
LEFT JOIN categories c ON c.id = sic.category_id
GROUP BY
c.name
It takes about 17 seconds to complete.
EXPLAIN:
[Image: EXPLAIN output, http://img7.imageshack.us/img7/1364/picture1va.png]
The tables:
Information:
Number of rows: 147397
Data size: 20.3MB
Index size: 1.4MB
Table:
CREATE TABLE `stats` (
`id` int(11) unsigned NOT NULL auto_increment,
`time` int(11) NOT NULL,
`domain` varchar(40) NOT NULL,
`ip` varchar(20) NOT NULL,
`user_agent` varchar(255) NOT NULL,
`domain_id` int(11) NOT NULL,
`date` timestamp NOT NULL default CURRENT_TIMESTAMP,
`referrer` varchar(400) default NULL,
KEY `id` (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=147398 DEFAULT CHARSET=utf8
Information second table:
Number of rows: 1285093
Data size: 11MB
Index size: 17.5MB
Second table:
CREATE TABLE `stats_id_category` (
`stats_id` int(11) NOT NULL,
`category_id` int(11) NOT NULL,
KEY `stats_id` (`stats_id`,`category_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
Information third table:
Number of rows: 161
Data size: 3.9KB
Index size: 8KB
Third table:
CREATE TABLE `categories` (
`id` int(11) NOT NULL auto_increment,
`parent_id` int(11) default NULL,
`name` varchar(40) NOT NULL,
`questions_category_id` int(11) NOT NULL default '0',
`rank` int(2) NOT NULL default '0',
PRIMARY KEY (`id`),
KEY `id` (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=205 DEFAULT CHARSET=latin1
Hopefully someone can help me speed this up.
I see several WTF's in your query:
1. You use two LEFT OUTER JOINs but then you group by the c.name column which might have no matches. So perhaps you don't really need an outer join? If that's the case, you should use an inner join, because outer joins are often slower.
2. You are grouping by c.name but this gives ambiguous results for every other column in your select-list. I.e. there might be multiple values in these columns in each grouping by c.name. You're lucky you're using MySQL, because this query would simply give an error in any other RDBMS.
This is a performance issue because the GROUP BY is likely causing the "using temporary; using filesort" you see in the EXPLAIN. This is a notorious performance-killer, and it's probably the single biggest reason this query is taking 17 seconds. Since it's not clear why you're using GROUP BY at all (using no aggregate functions, and violating the Single-Value Rule), it seems like you need to rethink this.
3. You are grouping by c.name which doesn't have a UNIQUE constraint on it. You could in theory have multiple categories with the same name, and these would be lumped together in a group. I wonder why you don't group by c.id if you want one group per category.
4. SELECT NULL AS stats: I don't understand why you need this. It's kind of like creating a variable that you never use. It shouldn't harm performance, but it's just another WTF that makes me think you haven't thought this query through very well.
5. You say in a comment you're looking for number of visitors per category. But your query doesn't have any aggregate functions like SUM() or COUNT(). And your select-list includes s.domain and s.domain_id which would be different for every visitor, right? So what value do you expect to be in the result set if you only have one row per category? This isn't really a performance issue either, it just means your query results don't tell you anything useful.
6. Your stats_id_category table has an index over its two columns, but no primary key. So you can easily get duplicate rows, and this means your count of visitors may be inaccurate. You need to drop that redundant index and use a primary key instead. I'd order category_id first in that primary key, so the join can take advantage of the index.
ALTER TABLE stats_id_category DROP KEY stats_id,
ADD PRIMARY KEY (category_id, stats_id);
Now you can eliminate one of your joins, if all you need to count is the number of visitors:
SELECT c.id, c.name, c.parent_id, COUNT(*) AS num_visitors
FROM categories c
INNER JOIN stats_id_category sic ON (sic.category_id = c.id)
GROUP BY c.id;
Now the query doesn't need to read the stats table at all, or even the stats_id_category table. It can get its count simply by reading the index of the stats_id_category table, which should eliminate a lot of work.
You are missing the third table in the information provided (categories).
Also, it seems odd that you are doing a LEFT JOIN and then using the right table (which might be all NULLS) in the GROUP BY. You will end up grouping all of the non-matching rows together as a result, is that what you intended?
Finally, can you provide an EXPLAIN for the SELECT?
Harrison is right; we need the other table. I would start by adding an index on category_id to stats_id_category, though.
I agree with Bill. Point 2 is very important. The query doesn't even make logical sense. Also, the simple fact that there is no WHERE clause means that you have to pull back every row in the stats table, which seems to be around 140,000 rows. It then has to sort all that data so that it can perform the GROUP BY. This is because sorting [ O(n log n) ] and then finding duplicates [ O(n) ] is much faster than finding duplicates without sorting the data set [ O(n^2)?? ].