I have a MySQL DB of ~1 million entries.
I run this query:
SELECT a.id as aid, a.title as atitle, a.slug, summary,
a.link as alink, author, published, image, a.cat as acat,
a.rss as arss, a.site as asite
FROM articles a
ORDER BY published DESC
LIMIT 616150, 50;
It takes about 5 minutes or more to load.
My TABLE and INDEXes:
CREATE TABLE IF NOT EXISTS `articles` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`title` varchar(255) NOT NULL,
`slug` varchar(255) NOT NULL,
`summary` text NOT NULL,
`link` text NOT NULL,
`author` varchar(255) NOT NULL,
`published` datetime NOT NULL,
`image` text NOT NULL,
`cat` int(11) NOT NULL,
`rss` int(11) NOT NULL,
`site` int(11) NOT NULL,
`bitly` varchar(255) NOT NULL,
`checked` tinyint(4) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `title` (`title`),
KEY `cat` (`cat`),
KEY `published` (`published`),
KEY `site` (`site`),
KEY `rss` (`rss`),
KEY `checked` (`checked`),
KEY `id_publ_index` (`id`,`published`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1230234;
What explain says:
mysql> EXPLAIN EXTENDED SELECT a.id as aid, a.title as atitle, a.slug, summary, a.link as alink, author, published, image, a.cat as acat, a.rss as arss, a.site as asite FROM articles a ORDER BY published DESC LIMIT 616150, 50;
+----+-------------+-------+-------+---------------+-----------+---------+------+--------+----------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+-------+---------------+-----------+---------+------+--------+----------+-------+
| 1 | SIMPLE | a | index | NULL | published | 8 | NULL | 616200 | 152.94 | |
+----+-------------+-------+-------+---------------+-----------+---------+------+--------+----------+-------+
1 row in set, 1 warning (0.46 sec)
Any tips on how to optimize this query? Why does MySQL need to read all 616,200 rows and not just the 50 that were asked for?
Thank you for your time.
The reason you are seeing the published key being used is that it is what you are ordering by. How often does this query need to run?
There is one simple thing that you can do to help this query run much, much faster:
Make better use of your published key. Use WHERE to define what range of dates you want to retrieve from your table.
The reason you are reading 616,200 rows of your table right now is that nothing limits the range the index has to cover. MySQL has to walk the full published index to:
skip over the first 616,150 entries in DESC order, and then
finally return the next 50 rows.
If possible, you should filter the results of your database in a different way. Changing your query to be driven by a WHERE clause (making more efficient use of your index) will be the quickest win.
For example:
SELECT a.id as aid, a.title as atitle, a.slug, summary,
a.link as alink, author, published, image, a.cat as acat,
a.rss as arss, a.site as asite
FROM articles a
WHERE published > '2010-01-01'
ORDER BY published DESC
LIMIT 6150, 50;
The sad part is that ORDER BY ... LIMIT does not scale well, and you will lose speed very quickly (e.g. change your limit to 0, 50 and then to 900000, 50 and see how the speed is affected), so adding more information to your WHERE clause will make your query much faster.
EDIT:
There is no way I can know what to display by date, so putting a WHERE on it is not possible. In addition, this query runs on a news aggregator that collects news every ... second. The LIMIT is there so I can create paginated results.
Because you are inserting new posts, your LIMIT-based pagination is going to cause the news items to jump around as a user pages through anyway. For instance, if I am on page one and three items get added before I press 'Next', then by the time I click 'Next' I will see the last three items from the previous page.
For the best possible user experience, you should try adding the ID or the date of the last seen news item to the pagination somehow. This could be done either via sessions or as part of the query URL, but it will allow you to make much better use of your indexes.
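As a sketch of that "seek" approach (:last_seen_published stands for a bind parameter your application supplies from the last row of the previous page; it is not MySQL syntax):

SELECT a.id as aid, a.title as atitle, a.slug, summary,
       a.link as alink, author, published, image, a.cat as acat,
       a.rss as arss, a.site as asite
FROM articles a
WHERE published < :last_seen_published
ORDER BY published DESC
LIMIT 50;

Because the WHERE clause seeks straight into the published index, page 10,000 costs the same as page 1. (If published values can tie, carry the last seen id along as a tie-breaker as well.)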
I understand why the limit is there - it's just: how can you fix the issue of the query being slow after a certain number of pages have been hit?
To efficiently fix your speed issues, you will need to make better use of an index, rather than relying on LIMIT as your sole method of pagination. LIMIT is amazing, yes, but it is not optimized for retrieving records the way you are trying to, because you need to sort by the date.
Even though, as you say, 'there is no way I can know what to display by date' (at least currently...), there must be a way for your application to limit what needs to be fetched from your DB. In the same way, Facebook does not need to go through every member's individual posts just to put results on your wall. You need to find out how your case can be made more efficient.
CONTEXT
I have a large table full of "documents" that are updated by outside sources. When I notice the updates are more recent than my last touchpoint I need to address these documents. I'm having some serious performance issues though.
EXAMPLE CODE
select count(*) from documents;
gets me back 212,494,397 documents in 1 min 15.24 sec.
select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE);
which is approximately the actual query, gets me 55,988,860 in 14 min 36.23 sec.
select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE) limit 1;
notably takes about 15 minutes as well. (this was surprising to me)
THE PROBLEM
How do I perform the updated_at > last_indexed_at in a more
reasonable time?
DETAILS
I'm pretty certain that my query is, in some way, not sargable. Unfortunately, I can't find what about this query prevents it from being executed on a row independent basis.
select count(*)
from documents
where last_indexed_at is null or updated_at > last_indexed_at;
doesn't do any better.
nor does
select count( distinct( id ) )
from documents
where last_indexed_at is null or updated_at > last_indexed_at limit 1;
nor does
select count( distinct( id ) )
from documents limit 1;
EDIT: FOLLOW UP REQUESTED DATA
This question only involves one table (thankfully) in a rails project, so we conveniently have the rails definition for the table.
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `documents` (
`id` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`document_id` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`document_type` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`locale` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`allowed_ids` text COLLATE utf8mb4_unicode_ci NOT NULL,
`fields` mediumtext COLLATE utf8mb4_unicode_ci,
`created_at` datetime(6) NOT NULL,
`updated_at` datetime(6) NOT NULL,
`last_indexed_at` datetime(6) DEFAULT NULL,
`deleted_at` datetime(6) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_documents_on_document_type` (`document_type`),
KEY `index_documents_on_locale` (`locale`),
KEY `index_documents_on_last_indexed_at` (`last_indexed_at`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
SELECT VERSION(); got me 5.7.27-30-log
And probably most important,
explain select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE);
gets me exactly
+----+-------------+-----------+------------+------+---------------+------+---------+------+-----------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+------+---------+------+-----------+----------+-------------+
| 1 | SIMPLE | documents | NULL | ALL | NULL | NULL | NULL | NULL | 208793754 | 100.00 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+-----------+----------+-------------+
Oh! MySQL 5.7 introduced Generated Columns — which gives us a way of indexing expressions! 🥳
If you do something like this:
ALTER TABLE documents
ADD COLUMN dirty BOOL GENERATED ALWAYS AS (COALESCE(updated_at > last_indexed_at, TRUE)) STORED,
ADD INDEX index_documents_on_dirty(dirty);
...and change the query to:
SELECT COUNT(*) FROM documents WHERE dirty;
...what results do you get?
Hopefully, we're moving the work of evaluating COALESCE(updated_at > last_indexed_at, TRUE) from Read time to Write time.
Add a covering INDEX
If you had INDEX(last_indexed_at, updated_at), the 15-minute queries might run somewhat faster. (The order of the columns does not matter in this case.)
Assuming both of those columns are in the same table (they are), the query must still read every row, because it compares two columns of the same row and no index range scan can help with that. (I don't know if the term "sargable" covers this situation.)
The INDEX I suggest will be faster because it is "covering": only the index has to be read, so there is less I/O.
The repeated 15 minutes is probably because innodb_buffer_pool_size is not big enough to hold the entire table, so the query is I/O-bound. My INDEX will be much smaller, hence (hopefully) small enough to fit in the buffer pool. So it will be faster, and faster still on the second run.
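As a sketch, that suggestion in DDL (the index name is illustrative):

ALTER TABLE documents ADD INDEX ix_last_indexed_updated (last_indexed_at, updated_at);

Each entry in this index carries only the two datetimes plus the primary key, instead of the full row with its text columns, which is why the scan is so much cheaper.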
Slow OR
OR is usually a terrible slowdown. But I don't think it matters here.
If you were to initialize last_indexed_at to some old date (say, '2000-01-01'), you could get rid of the COALESCE or OR.
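A sketch of that one-time clean-up (the sentinel '2000-01-01' is arbitrary and assumes no legitimate value predates it; on 212M rows you would want to run the UPDATE in batches):

UPDATE documents SET last_indexed_at = '2000-01-01' WHERE last_indexed_at IS NULL;

-- with no NULLs left, the COALESCE/OR disappears:
SELECT COUNT(*) FROM documents WHERE updated_at > last_indexed_at;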
Another way to clean it up is
SELECT SUM(last_indexed_at IS NULL) +
SUM(updated_at > last_indexed_at) AS "Need indexing"
FROM t;
You still need the index. SUM(boolean expression) treats the expression as 0 (FALSE or NULL) or 1 (TRUE).
Meanwhile, I don't think COUNT(DISTINCT id) is any different from COUNT(*). And the pair of SUMs above should also give you that value.
Again, I am depending on "covering" being the trick.
"More than .." trick
In some situations, you don't really need the exact number, especially when all that matters is whether it is "more than some threshold".
SELECT 1 FROM tbl WHERE ... LIMIT 1000,1;
If it comes back with "1", there are at least 1000 rows. If it comes back empty (no row returned), there are not.
That will still have to touch up to 1000 rows (hopefully in an index), but that is better than touching a million.
You can, if you're on a recent MySQL version (5.7+), add a generated column to your table containing your search expression, then index that column.
ALTER TABLE t
ADD COLUMN needs_indexing TINYINT
GENERATED ALWAYS AS
(CASE WHEN last_indexed_at IS NULL THEN 1
WHEN updated_at > last_indexed_at THEN 1
ELSE 0 END) VIRTUAL;
ALTER TABLE t
ADD INDEX needs_indexing (needs_indexing);
This costs disk space for the index, but adds nothing to your table rows, since the column is VIRTUAL.
Then you can do SELECT SUM(needs_indexing) FROM t to get the number of items matching your criterion.
But: you don't have to count all the items to know you need to reindex some items. Doing a COUNT(*) on a large InnoDB table is very expensive as you have discovered. You can do this:
SELECT EXISTS (SELECT 1 FROM t WHERE needs_indexing = 1) something_needs_indexing;
You'll get 1 or 0 from this query very quickly. 1 means you have at least one row meeting your criteria.
And, of course, your indexing job can do
SELECT * FROM t WHERE needs_indexing=1 LIMIT 1;
or whatever makes sense. That will also be fast.
Okay, so as the title says: the queries were running fine two days ago, then all of a sudden yesterday the site was loading very slowly. I tracked it down to a couple of queries. For one of them I was able to add an index, which seems to have helped, but this one I just can't figure out. I tried running a repair and an optimize on the tables, and that didn't help. I don't know what could have changed so much that a query would go from taking less than a second to 20+ seconds. Any help would be much appreciated.
SELECT city
FROM listings LEFT JOIN agencies
ON listings.agencyid_fk = agencies.agencyid
WHERE listingstatus IN (1,3) AND appid_fk = 101 AND active = 1
AND auction IS NULL AND agencies.multilist = 1
AND notagency IS NULL
GROUP BY city
ORDER BY city;
I wasn't sure how to export the EXPLAIN result in a readable way, so here it is as plain text:
id  select_type  table     type    possible_keys           key       key_len  ref                        rows   Extra
1   SIMPLE       listings  ref     appid_fk,listingstatus  appid_fk  2        const                      21699  Using where; Using temporary; Using filesort
1   SIMPLE       agencies  eq_ref  PRIMARY,active          PRIMARY   2        mls2.listings.agencyid_fk  1      Using where
And here are the tables...
listings table:
CREATE TABLE mls2.listings (
listingid INT(11) AUTO_INCREMENT NOT NULL,
appid_fk SMALLINT(3) NOT NULL DEFAULT '0',
agencyid_fk SMALLINT(3) NOT NULL DEFAULT '0',
listingstatus SMALLINT(3),
city VARCHAR(30) CHARACTER SET latin1 COLLATE latin1_swedish_ci,
multilist TINYINT(1),
auction TINYINT(1),
PRIMARY KEY (listingid)
) ENGINE = myisam ROW_FORMAT = DEFAULT CHARACTER SET latin1;
agencies table:
CREATE TABLE mls2.agencies (
agencyid SMALLINT(6) AUTO_INCREMENT NOT NULL,
multilist TINYINT(4) DEFAULT '0',
notagency TINYINT(1),
active TINYINT(1),
PRIMARY KEY (agencyid)
) ENGINE = myisam ROW_FORMAT = DEFAULT CHARACTER SET latin1;
Once you've taken on board the comments above, try adding the following indexes to your tables...
INDEX(city,listingstatus,appid_fk,auction)
INDEX(active,multilist,notagency)
In both cases, the order in which the columns are arranged in the index may make a difference, so play around with those, although there are so few rows in the agencies table that that index won't really matter much.
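In DDL form, those suggestions might look like this (index names are illustrative):

ALTER TABLE listings ADD INDEX ix_listings_filter (city, listingstatus, appid_fk, auction);
ALTER TABLE agencies ADD INDEX ix_agencies_filter (active, multilist, notagency);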
Next, get rid of the GROUP BY clause, and write your query as follows.
SELECT DISTINCT l.city
FROM listings l
JOIN agencies a
ON a.agencyid = l.agencyid_fk
WHERE l.listingstatus IN (1,3)
AND l.appid_fk = 101
AND a.active = 1
AND l.auction IS NULL
AND a.multilist = 1
AND a.notagency IS NULL
ORDER
BY city;
Note: Although irrelevant for this particular problem, your original question showed that this schema is desperately in need of normalisation.
Forgive me for asking what should be a simple question but I am totally new to Sphinx.
I am using Sphinx with a mySQL datastore. The table looks like below with the Title and Content fields indexed by Sphinx.
CREATE TABLE `documents` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`group_id` int(11) NOT NULL,
`group_id2` int(11) NOT NULL,
`date_added` datetime NOT NULL,
`title` varchar(255) NOT NULL,
`content` text NOT NULL,
`url` varchar(255) NOT NULL,
`links` int(11) NOT NULL,
`hosts` int(11) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `url` (`url`)
) ENGINE=InnoDB AUTO_INCREMENT=439043 DEFAULT CHARSET=latin1
Now, if I connect through Sphinx with
mysql -h0 -P9306
I can run a simple query like...
SELECT * FROM test1 WHERE MATCH('test document');
And I will get back a result set like...
+--------+----------+------------+
| id | group_id | date_added |
+--------+----------+------------+
| 360625 | 1 | 1499727792 |
| 362257 | 1 | 1499727807 |
| 362777 | 1 | 1499727811 |
| 159717 | 1 | 1499717614 |
| 160557 | 1 | 1499717621 |
+--------+----------+------------+
When what I actually want is for it to return a result set with column values from the documents table (the URL, Title, Links, Hosts, etc. columns) and, if at all possible, to sort these by the relevancy of the Sphinx match.
Can that be accomplished in a single query? What might it look like?
Thanks in advance!
Two (main) options
Take the ids from the SphinxQL result, and run a MySQL query to get the full details; see http://sphinxsearch.com/info/faq/#row-storage
e.g. SELECT * FROM documents WHERE id IN (3,5,7) ORDER BY FIELD(id,3,5,7)
This MySQL query should be VERY quick, because it's a PK lookup retrieving only a few rows (i.e. one page of results) - the heavy lifting of searching the whole table has already been done by the first Sphinx query.
Duplicate all the columns you want to retrieve in the result set as attributes. You've already made group_id and date_added attributes; you would need to add more.
sql_field_string is a very convenient shortcut that makes BOTH a field and a string attribute from one column. It is not available for other column types, but that matters less, since numeric columns are not typically needed as fields anyway.
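As a sketch, the relevant source section of sphinx.conf might look like this (directive choices are assumptions based on the table above; connection settings elided):

source documents_src
{
    # every column in sql_query not declared below becomes a full-text field
    sql_query          = SELECT id, group_id, group_id2, UNIX_TIMESTAMP(date_added) AS date_added, \
                         title, content, url, links, hosts FROM documents
    sql_attr_uint      = group_id
    sql_attr_uint      = group_id2
    sql_attr_timestamp = date_added
    sql_attr_uint      = links
    sql_attr_uint      = hosts
    sql_field_string   = title   # full-text field AND string attribute
    sql_attr_string    = url     # string attribute only (not searchable)
}

With that in place, SELECT * FROM test1 WHERE MATCH('test document') should return title and url alongside the numeric attributes, sorted by relevance by default.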
Option 1 is good in that it avoids duplicating the data and saves memory (Sphinx typically wants to hold attributes in memory), so it may be the most practical on big datasets.
Option 2 is good in that it avoids a second query per search. But because you then have a copy of the data, it may mean additional complication keeping things in sync.
It doesn't look relevant in your case, but if you had, say, a 'clicks' column which you increment often (when users click!) and need in the result set, but don't really need in Sphinx for query purposes, the first option would let you increment it in the database only, and the MySQL query would always return the live value. The second option means having to keep the Sphinx index in sync at all times.
I have a site where there is an activity feed, similar to how social sites like Facebook have one. It is a "newest first" list that describes actions taken by users. In production, there's about 200k entries in that table.
Since this is going to be asked anyway, I'll first share the full table structure:
CREATE TABLE `karmalog` (
`id` int(11) NOT NULL auto_increment,
`guid` char(36) default NULL,
`user_id` int(11) default NULL,
`user_name` varchar(45) default NULL,
`user_avat_url` varchar(255) default NULL,
`user_sec_id` int(11) default NULL,
`user_sec_name` varchar(45) default NULL,
`user_sec_avat_url` varchar(255) default NULL,
`event` enum('EDIT_PROFILE','EDIT_AVATAR','EDIT_EMAIL','EDIT_PASSWORD','FAV_IMG_ADD','FAV_IMG_ADDED','FAV_IMG_REMOVE','FAV_IMG_REMOVED','FOLLOW','FOLLOWED','UNFOLLOW','UNFOLLOWED','COM_POSTED','COM_POST','COM_VOTE','COM_VOTED','IMG_VOTED','IMG_UPLOAD','LIST_CREATE','LIST_DELETE','LIST_ADMINDELETE','LIST_VOTE','LIST_VOTED','IMG_UPD','IMG_RESTORE','IMG_UPD_LIC','IMG_UPD_MOD','IMG_GEO','IMG_UPD_MODERATED','IMG_VOTE','IMG_VOTED','TAG_FAV_ADD','CLASS_DOWN','CLASS_UP','IMG_DELETE','IMG_ADMINDELETE','IMG_ADMINDELETEFAV','SET_PASSWORD','IMG_RESTORED','IMG_VIEW','FORUM_CREATE','FORUM_DELETE','FORUM_ADMINDELETE','FORUM_REPLY','FORUM_DELETEREPLY','FORUM_ADMINDELETEREPLY','FORUM_SUBSCRIBE','FORUM_UNSUBSCRIBE','TAG_INFO_EDITED','IMG_ADDSPECIE','IMG_REMOVESPECIE','SPECIE_ADDVIDEO','SPECIE_REMOVEVIDEO','EARN_MEDAL','JOIN') NOT NULL,
`event_type` enum('follow','tag','image','class','list','forum','specie','medal','user') NOT NULL,
`active` bit(1) NOT NULL,
`delete` bit(1) NOT NULL default '\0',
`object_id` int(11) default NULL,
`object_cache` text,
`object_sec_id` int(11) default NULL,
`object_sec_cache` text,
`karma_delta` int(11) NOT NULL,
`gold_delta` int(11) NOT NULL,
`newkarma` int(11) NOT NULL,
`newgold` int(11) NOT NULL,
`migrated` int(11) NOT NULL default '0',
`date_created` timestamp NOT NULL default '0000-00-00 00:00:00',
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`),
KEY `user_sec_id` (`user_sec_id`),
KEY `image_id` (`object_id`),
KEY `date_event` (`date_created`,`event`),
KEY `event` (`event`),
KEY `date_created` (`date_created`),
CONSTRAINT `karmalog_ibfk_1` FOREIGN KEY (`user_id`) REFERENCES `user` (`id`) ON DELETE SET NULL,
CONSTRAINT `karmalog_ibfk_2` FOREIGN KEY (`user_sec_id`) REFERENCES `user` (`id`) ON DELETE SET NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Before optimizing this table, my query had 5 joins and I ran into slow query times. I have since denormalized all of that data, so that not a single join remains; the table and query are flat.
As you can see in the table design, there's an "event" field which is an enum, holding a few dozen possible values. Throughout the site, I show activity feeds based on specific event types. Typically that query looks like this:
SELECT * FROM karmalog as k
WHERE k.event IN ($events) AND k.delete=0
ORDER BY k.date_created DESC, k.id DESC
LIMIT 0,30
What this query does is to find the latest 30 entries in the total set that match any of the events passed in $events, which can be multiple.
Due to removing the joins and having indices on most fields, I was expecting this to perform very well, but it doesn't. On 200k entries, it still takes over 3 seconds and I don't understand why.
Regarding solutions, I know I could archive older entries or partition the table per event type, but that will have quite a code impact, and I first would like to understand why the above is so slow.
As a temporary work-around, I'm now doing this:
SELECT * FROM
(SELECT * FROM karmalog ORDER BY date_created DESC, id DESC LIMIT 0,1000) as karma
WHERE karma.event IN ($events) AND karma.delete=0
LIMIT $page,$pagesize
What this does is limit the base set to search to the latest 1000 entries only, hoping and guessing that there are 30 entries to be found for the filters I pass in. It's not very robust though: it will not work for rarer events, and it brings pagination issues.
Therefore, I first like to get to the root cause of why my initial query is slow, against my expectation.
Edit: I was asked to share the execution plan. Here's the test query:
EXPLAIN SELECT * FROM karmalog
WHERE event IN ('FAV_IMG_ADD','FOLLOW','COM_POST','IMG_VOTE','LIST_VOTE','JOIN','CLASS_UP','LIST_CREATE','FORUM_REPLY','FORUM_CREATE','FORUM_SUBSCRIBE','IMG_GEO','IMG_ADDSPECIE','SPECIE_ADDVIDEO','EARN_MEDAL') AND karmalog.delete=0
ORDER BY date_created DESC, id DESC
LIMIT 0,36
Execution plan:
id = 1
select_type = SIMPLE
table = karmalog
type = range
possible_keys = event
key = event
key_len = 1
ref = NULL
rows = 80519
Extra = Using where; Using filesort
I'm not sure how to read much into the above, but I do know that the sort clause really seems to kill this query. With the sorting it takes 4.3 secs, without it 0.03 secs.
SELECT * sometimes slows down ordered queries by a huge amount, so let's start by refactoring your query as follows:
SELECT k.*
FROM karmalog AS k
JOIN (
SELECT id
FROM karmalog
WHERE event IN ($events)
AND `delete`=0
ORDER BY date_created DESC, id DESC
LIMIT 0,30
) AS m ON k.id = m.id
ORDER BY k.date_created DESC, k.id DESC
This will do your ORDER BY ... LIMIT operation without having to haul the whole table around in the sorting phase. Finally it will look up the appropriate thirty rows from the original table and sort just those again. This might save a whole lot of I/O and in-memory data shuffling.
Second, if id column values are assigned in ascending order as records are inserted, then the use of date_created in your ORDER BY operation is redundant. But MySQL doesn't know that, so leaving it out might help. This will be true if you always use the current date when inserting, and never update the dates.
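If that assumption holds for your data, the inner query above could drop date_created entirely, e.g.:

SELECT id
FROM karmalog
WHERE event IN ($events)
  AND `delete` = 0
ORDER BY id DESC
LIMIT 0, 30

(Note the backticks: delete is a reserved word when it is not qualified by a table alias.)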
Third, you might be able to use a compound covering index for the selection (inner) query. This is an index that contains all the fields you need. When you use a covering index, the whole query can be satisfied from the index, and there's no need to bounce back to the original table. This saves disk access time.
Try this compound covering index: (delete, event, id). If you decide you can't get rid of the use of date_created in your ordering, try this instead: (delete, event, date_created, id)
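As DDL, those might look like this (index names are illustrative; keep whichever one matches your final ORDER BY):

ALTER TABLE karmalog ADD INDEX ix_del_event_id (`delete`, event, id);
ALTER TABLE karmalog ADD INDEX ix_del_event_date_id (`delete`, event, date_created, id);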
Add a compound index over the two relevant columns. In your table, you can do that by specifying e.g.
KEY `date_created` (`date_created`, `event`)
This key can still be used to satisfy plain old date_created range searches. But in addition, the event data is included as well, so the DBMS will be able to find the relevant rows by looking only at the index.
If you want, you can try the other order as well: first event and then date. This might allow some optimization if there are many event types but your filter only contains a few. On the other hand, I'm not sure the system will be able to make use of the LIMIT clause in that case, so I'm not certain the other order will be any help at all.
Edit: I completely missed that your date_event index already contains this info. According to your execution plan, though, it isn't used; the optimizer seems to be getting things wrong. You could try removing the event index, and perhaps the date_created index as well, and see what happens then.
I have a web application that uses a table schema similar to the one below. Simply put, I want to optimize the selection of articles. Articles are selected based on a given tag; for example, if the tag is 'iphone', the query should output all open articles about 'iphone' from the last month.
CREATE TABLE `article` (
`id` int(11) NOT NULL auto_increment,
`title` varchar(100) NOT NULL,
`body` varchar(200) NOT NULL,
`date` timestamp NOT NULL default CURRENT_TIMESTAMP,
`author_id` int(11) NOT NULL,
`section` varchar(30) NOT NULL,
`status` int(1) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
CREATE TABLE `tags` (
`name` varchar(30) NOT NULL,
`article_id` int(11) NOT NULL,
PRIMARY KEY (`name`,`article_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
CREATE TABLE `users` (
`id` int(11) NOT NULL auto_increment,
`username` varchar(30) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=3 ;
The following is my MySQL query
explain select article.id,users.username,article.title
from article,users,tags
where article.id=tags.article_id and tags.name = 'iphone4'
and article.author_id=users.id and article.status = '1'
and article.section = 'mobile'
and article.date > '2010-02-07 13:25:46'
ORDER BY tags.article_id DESC
the output is
id  select_type  table    type    possible_keys  key      key_len  ref                     rows  Extra
1   SIMPLE       tags     ref     PRIMARY        PRIMARY  92       const                   55    Using where; Using index
1   SIMPLE       article  eq_ref  PRIMARY        PRIMARY  4        test.tags.article_id    1     Using where
1   SIMPLE       users    eq_ref  PRIMARY        PRIMARY  4        test.article.author_id  1
is it possible to optimize it more?
This query may be optimized, depending on which condition is more selective: tags.name = 'iphone4' or article.date > '2010-02-07 13:25:46'.
If there are fewer articles tagged iphone than articles posted after Feb 7, then your original query is fine.
If there are many articles tagged iphone but few posted after Feb 7, then this query will be more efficient:
SELECT article.id, users.username, article.title
FROM tags
JOIN article
ON article.id = tags.article_id
AND article.status = '1'
AND article.section = 'mobile'
AND article.date > '2010-02-07 13:25:46'
JOIN users
ON users.id = article.author_id
WHERE tags.name = 'iphone4'
ORDER BY
article.date DESC, article.id DESC
Note that the ORDER BY condition has changed. This may or may not be what you want; generally, however, the orderings by id and by date correspond to each other.
If you really need your original ORDER BY condition, you may keep it, but it will add a filesort (or just revert to your original plan).
In either case, create an index on
article (status, section, date, id)
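In DDL form (the index name is illustrative):

ALTER TABLE article ADD INDEX ix_status_section_date (status, section, date, id);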
the query should output all open articles about 'iphone' from the last month.
So the only query you are going to run on this data uses the tag and the date. You've got an index for the tag in the tags table, but the date is stored in a different table (article - you're a bit inconsistent with your naming scheme). Adding an index on the article table using date alone would be of no benefit at all. Using (id, date) in that order would help a little - but really the date needs to be denormalised into the tags table to get the query running really fast.
Unless you're regularly moving around bulk data sets, just add a datetime column with a default of the current timestamp to the tags table.
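A sketch of that change (the column name date_added, the index name, and the backfill are assumptions about how you'd wire it up):

ALTER TABLE tags
  ADD COLUMN date_added TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  ADD INDEX ix_name_date (name, date_added);

-- backfill existing rows from the article table:
UPDATE tags JOIN article ON article.id = tags.article_id
SET tags.date_added = article.date;

After the backfill, the tag-plus-date filter can be answered from the tags index alone.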
I expect that you may want to interact with the data in lots of other ways too - really you should set a low (no?) threshold for slow query logging, then analyse the resulting data to identify where your performance problems are (try looking at the queries with the highest values for duration² × frequency first).
There's a script at the URL below which is useful for this analysis:
http://www.retards.org/projects/mysql/
You could index the additional fields in article that you are referencing in your select statement. In this case, I would suggest you create an index in article like this:
CREATE INDEX article_idx ON article (author_id, status, section, date);
Creating that index should speed up your query, depending on how many records you are dealing with overall. From my understanding, properly creating indexes involves looking at the queries you've written and indexing the columns that appear in your WHERE clauses. This helps the query optimizer process the query better. That does not mean you should create an index on each individual column, however, as that is both inefficient and ineffective. When possible, create multi-column indexes that match your queries.