I have a biggish InnoDB table which at this moment contains about 20 million rows with ~20000 new rows inserted every day. They contain messages for different topics.
CREATE TABLE IF NOT EXISTS `Messages` (
`ID` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`TopicID` bigint(20) unsigned NOT NULL,
`DATESTAMP` int(11) DEFAULT NULL,
`TIMESTAMP` int(10) unsigned NOT NULL,
`Message` mediumtext NOT NULL,
`Checksum` varchar(50) DEFAULT NULL,
`Nickname` varchar(80) NOT NULL,
PRIMARY KEY (`ID`),
UNIQUE KEY `TopicID` (`TopicID`,`Checksum`),
KEY `DATESTAMP` (`DATESTAMP`),
KEY `Nickname` (`Nickname`),
KEY `TIMESTAMP` (`TIMESTAMP`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=25195126 ;
NOTE: The Checksum column stores an MD5 checksum that prevents the same message from being inserted twice in the same topic. (nickname + timestamp + topicid + last 20 chars of message)
The site I'm building has a newsfeed in which users can select to view newest messages from different Nicknames from different forums. The query is as follows:
SELECT
Messages.ID AS MessageID,
Messages.Message,
Messages.TIMESTAMP,
Messages.Nickname,
Topics.ID AS TopicID,
Topics.Title AS TopicTitle,
Forums.Title AS ForumTitle
FROM Messages
JOIN FollowedNicknames ON FollowedNicknames.UserID = 'MYUSERID'
JOIN Forums ON Forums.ID = FollowedNicknames.ForumID
JOIN Subforums ON Subforums.ForumID = Forums.ID
JOIN Topics ON Topics.SubforumID = Subforums.ID
WHERE
Messages.Nickname = FollowedNicknames.Nickname AND
Messages.TopicID = Topics.ID AND Messages.DATESTAMP = '2013619'
ORDER BY Messages.TIMESTAMP DESC
The TIMESTAMP column contains a Unix timestamp, and DATESTAMP is simply a date generated from the Unix timestamp for faster access via the '=' operator instead of range scans over Unix timestamps.
The problem is, this query takes about 13 seconds (or more) unbuffered. That is of course unacceptable for the intended usage. Adding the DATESTAMP seemed to speed things up, but not by much.
At this point, I don't really know what I should do. I've read about composite primary keys, but I am still unsure whether they would do any good and how to correctly implement one in this particular case.
I know that using BIGINTs may be a little overkill, but do they affect performance that much?
EXPLAIN:
+----+-------------+-----------------------+--------+---------------------------------------+------------+---------+-----------------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------------+--------+---------------------------------------+------------+---------+-----------------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | FollowedNicknames | ALL | UserID,ForumID,Nickname | NULL | NULL | NULL | 8 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | Forums | eq_ref | PRIMARY | PRIMARY | 8 | database.FollowedNicknames.ForumiID | 1 | NULL |
| 1 | SIMPLE | Messages | ref | TopicID,DATETIME,Nickname | Nickname | 242 | database.FollowedNicknames.Nickname | 15 | Using where |
| 1 | SIMPLE | Topics | eq_ref | PRIMARY,SubforumID | PRIMARY | 8 | database.Messages.TopicID | 1 | NULL |
| 1 | SIMPLE | Subforums | eq_ref | PRIMARY,ForumID | PRIMARY | 8 | database.Topics.SubforumID | 1 | Using where |
+----+-------------+-----------------------+--------+---------------------------------------+------------+---------+-----------------------------------------------+------+----------------------------------------------+
You shouldn't be JOINing on a VARCHAR column (Nickname); you should use the user ID to join those tables. That is definitely slowing the query down and is probably the biggest issue. It would also be easier to follow if you wrote all of the JOINs explicitly instead of at the end in the WHERE clause like this:
SELECT
Messages.ID AS MessageID,
Messages.Message,
Messages.TIMESTAMP,
Messages.Nickname,
Topics.ID AS TopicID,
Topics.Title AS TopicTitle,
Forums.Title AS ForumTitle
FROM Messages
JOIN FollowedNicknames ON Messages.Nickname = FollowedNicknames.Nickname
AND FollowedNicknames.UserID = 'MYUSERID'
JOIN Forums ON Forums.ID = FollowedNicknames.ForumID
JOIN Subforums ON Subforums.ForumID = Forums.ID
JOIN Topics ON Messages.TopicID = Topics.ID
AND Topics.SubforumID = Subforums.ID
WHERE Messages.DATESTAMP = '2013619'
ORDER BY Messages.TIMESTAMP DESC
Instead of INT as the data type for the DATESTAMP column, I would use DATE. The Checksum column should probably use latin1_general_ci as the collation, since an MD5 hex digest contains only ASCII characters. I would use INT UNSIGNED for the ID columns as long as their values are less than 2,000,000,000; INT UNSIGNED can store values up to roughly 4,000,000,000, which leaves headroom. InnoDB is affected by the size of the primary key much more than MyISAM, because every secondary index entry also stores the primary key, so it could make a noticeable difference.
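A minimal sketch of the DATE and collation changes, under some assumptions: `MessageDate` is a hypothetical column name (not in the original schema), the conversion uses the session time zone via FROM_UNIXTIME, and on a 20-million-row table each statement may take a while, so it is worth trying on a copy first.

```sql
-- Hypothetical new column: a real DATE derived from the Unix timestamp.
ALTER TABLE Messages ADD COLUMN MessageDate DATE NULL;

-- Backfill from the existing Unix timestamp (uses the session time zone).
UPDATE Messages SET MessageDate = DATE(FROM_UNIXTIME(`TIMESTAMP`));

-- Index the new column so equality lookups by day can use it.
ALTER TABLE Messages ADD KEY MessageDate (MessageDate);

-- Store the MD5 hex digest in a single-byte character set.
ALTER TABLE Messages
  MODIFY Checksum varchar(50) CHARACTER SET latin1 COLLATE latin1_general_ci DEFAULT NULL;
```

Once the application writes `MessageDate` directly, the query would filter on something like `Messages.MessageDate = '2013-06-19'` and the old integer DATESTAMP column could be dropped.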
I am trying to speed up select in query below where I have over 1000 items in WHERE IN
table:
CREATE TABLE `user_item` (
`user_id` int(11) unsigned NOT NULL,
`item_id` int(11) unsigned NOT NULL,
PRIMARY KEY (`user_id`,`item_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
query:
SELECT
item_id
FROM
user_item
WHERE
user_id = 2
AND item_id IN(3433456,67584634,587345,...)
With 1000 items in the IN list, the query takes about 3 seconds to execute. Is there any optimization that can be done in this case? There can be billions of rows in this table. Is there a faster alternative, whether another database or a different programming approach?
UPDATE:
Here's results of explain:
If I have 999 items in the IN(...) statement:
+------+-------------+----------+-------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------+-------+---------------+---------+---------+------+------+--------------------------+
| 1 | SIMPLE | user_item | range | PRIMARY | PRIMARY | 8 | NULL | 999 | Using where; Using index |
+------+-------------+----------+-------+---------------+---------+---------+------+------+--------------------------+
If I have 1000 items in IN(...) statement:
+------+--------------+-------------+--------+---------------+---------+---------+--------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------+-------------+--------+---------------+---------+---------+--------------------+------+--------------------------+
| 1 | PRIMARY | <subquery2> | ALL | distinct_key | NULL | NULL | NULL | 1000 | |
| 1 | PRIMARY | user_item | eq_ref | PRIMARY | PRIMARY | 8 | const,tvc_0._col_1 | 1 | Using where; Using index |
| 2 | MATERIALIZED | <derived3> | ALL | NULL | NULL | NULL | NULL | 1000 | |
| 3 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
+------+--------------+-------------+--------+---------------+---------+---------+--------------------+------+--------------------------+
Update 2
I want to explain why I need to do above:
I want to give the user the ability to list items ordered by sort_criteria_1, sort_criteria_2 or sort_criteria_3 and exclude from the list those items that have been marked by given (n) users in the user_item table.
Here's sample schema:
CREATE TABLE `user` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(45) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `item` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`file` varchar(45) NOT NULL,
`sort_criteria_1` int(11) DEFAULT NULL,
`sort_criteria_2` int(11) DEFAULT NULL,
`sort_criteria_3` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_sc1` (`sort_criteria_1`),
KEY `idx_sc2` (`sort_criteria_2`),
KEY `idx_sc3` (`sort_criteria_3`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `user_item` (
`user_id` int(11) NOT NULL,
`item_id` int(11) NOT NULL,
PRIMARY KEY (`user_id`,`item_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Here's how I would get items ordered by sort_criteria_2 excluding ones that have record by users (300, 6, 1344, 24) in user_item table:
SELECT
i.id
FROM
item i
LEFT JOIN user_item ui1 ON (i.id = ui1.item_id AND ui1.user_id = 300)
LEFT JOIN user_item ui2 ON (i.id = ui2.item_id AND ui2.user_id = 6)
LEFT JOIN user_item ui3 ON (i.id = ui3.item_id AND ui3.user_id = 1344)
LEFT JOIN user_item ui4 ON (i.id = ui4.item_id AND ui4.user_id = 24)
WHERE
ui1.item_id IS NULL
AND ui2.item_id IS NULL
AND ui3.item_id IS NULL
AND ui4.item_id IS NULL
ORDER BY
i.sort_criteria_2
LIMIT
800
The main problem with the above approach is that the more users I filter by, the more expensive the query gets. I want the cost of filtering to be paid by the client browser, so I would send the list of items and the list of matching user_item records per user to the client to filter. This would help with sharding as well, since I would not need the user_item tables or sets of records on the same machine.
It's hard to tell exactly, but there could be lag on parsing your huge query because of many constant item_id values.
Have you tried fetching all the values for the user_id and filtering afterwards? Since this column comes first in the PRIMARY KEY, the relevant index would still be used.
Have you tried replacing the constant list with a subquery? Maybe you're only interested in items of a specific type, for example.
Make sure that you use prepared statements, at least if your database and language support them. This would protect your code from possible SQL injection and may let the database reuse the parsed query (if your database supports it).
Instead of putting the 1000 item_ids into the IN clause, you could put them into a temporary table with an index and join it with the user_item table.
If you also have an index on both user_id and item_id, the query will be about as fast as it can get. The rest depends on the data distribution.
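The temporary-table idea could be sketched roughly like this. The table name `wanted_items` is made up for illustration, and only the three IDs visible in the question are inserted; in practice the application would batch-insert the full list.

```sql
-- Hypothetical scratch table holding the IDs that used to sit in the IN list.
CREATE TEMPORARY TABLE wanted_items (
    item_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (item_id)
) ENGINE=InnoDB;

-- Multi-row insert; the application would generate this from its list.
INSERT INTO wanted_items (item_id)
VALUES (3433456), (67584634), (587345);

-- Each wanted item is looked up through the (user_id, item_id) primary key.
SELECT ui.item_id
FROM wanted_items w
JOIN user_item ui
  ON ui.user_id = 2
 AND ui.item_id = w.item_id;

DROP TEMPORARY TABLE wanted_items;
```

Whether this beats the plain IN list depends on the data and the server version, so it is worth comparing both with EXPLAIN.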
I have a MySQL table where all status changes are recorded. I want to be able to query the status of all items on a specific date, or the last date for all items. The table I have now is:
CREATE TABLE `tra_rel_sta` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`tra_id` int(11) DEFAULT NULL,
`sta_id` int(11) DEFAULT NULL,
`changed_on` datetime DEFAULT NULL,
`changed_by` int(11) DEFAULT NULL,
`comments` text,
PRIMARY KEY (`id`),
KEY `tra_id` (`tra_id`),
KEY `rel` (`tra_id`,`sta_id`,`changed_on`),
KEY `sta_id` (`sta_id`),
KEY `changed_on` (`changed_on`),
KEY `tra_changed` (`tra_id`,`changed_on`)
) ENGINE=InnoDB AUTO_INCREMENT=51734 DEFAULT CHARSET=utf8;
(I know I'm probably overdoing the indexes, but I haven't exactly figured out how to optimize indexes yet).
The query I'm using now, which works is:
SELECT rel.changed_on, rel.changed_by, rel.tra_id, sta.id AS sta_id, sta.status, sta.description, sta.onHold, sta.awaitingApproval, sta.approved, sta.complete, sta.locked
FROM (
SELECT tra_id, MAX(changed_on) AS lst
FROM tra_rel_sta
GROUP BY tra_id
) AS rec
LEFT JOIN tra_rel_sta AS rel ON rel.changed_on = rec.lst AND rel.tra_id = rec.tra_id
LEFT JOIN tra_status AS sta ON sta.id = rel.sta_id
If I want to use a specific date, I insert a WHERE statement in the sub-query.
This works, but it takes about 0.65 seconds to run in PHP with about 51,733 records in the table. This query is used as a subquery in several others when I need to know the last status of an object, and as a result it is slowing down many parts of the application.
I've tried to use a sub query in the WHERE statement as described in MySQL: how to select record with latest date before a certain date but it takes almost twice as long. I've tried using a JOIN statement as described in MySQL select of record with latest date but I'm getting about the same or just slightly slower results.
How can I optimize this query or fix my indexes to make this more effective?
Thanks!!
As requested, EXPLAIN of query:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
---|-------------|-------------|--------|-----------------------------------|---------|---------|-------------------|-------|-------------
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 49931 | NULL
1 | PRIMARY | rel | ref | tra_id,rel,changed_on,tra_changed | tra_id | 5 | rec.tra_id | 1 | Using where
1 | PRIMARY | sta | eq_ref | PRIMARY | PRIMARY | 4 | csinfo.rel.sta_id | 1 | NULL
2 | DERIVED | tra_rel_sta | index | tra_id,rel,tra_changed | tra_id | 5 | NULL | 49931 | NULL
I am hoping to get some advice on how to optimize the performance of this query I have with an outer join. First I will explain what I am trying to do and then I'll show the code and results.
I have an Accounts table that has a list of all customer accounts. And I have a datausage table which keeps track of how much data each customer is using. A backend process running on multiple servers inserts records into the datausage table each day to keep track of how much usage occurred that day for each customer on that server.
The backend process works like this - if there is no activity on that server for an account on that day, no records are written for that account. If there is activity, one record is written with a "LogDate" of that day. This is happening on multiple servers. So collectively the datausage table winds up with no rows (no activity at all for that customer each day), one row (activity was only on one server for that day), or multiple rows (activity was on multiple servers for that day).
We need to run a report that lists ALL customers, along with their usage for a specific date range. Some customers may have no usage at all (nothing whatsoever in the datausage table). Some customers may have no usage at all for the current period (but usage in other periods).
Regardless of whether there is any usage or not (ever, or for the selected period) we need EVERY customer in the Accounts table to be listed in the report, even if they show no usage. Therefore it seems this requires an outer join.
Here is the query I am using:
SELECT
Accounts.accountID as AccountID,
IFNULL(Accounts.name,Accounts.accountID) as AccountName,
AccountPlans.plantype as AccountType,
Accounts.status as AccountStatus,
date(Accounts.created_at) as Created,
sum(IFNULL(datausage.Core,0) + (IFNULL(datausage.CoreDeluxe,0) * 3)) as 'CoreData'
FROM `Accounts`
LEFT JOIN `datausage` on `Accounts`.`accountID` = `datausage`.`accountID`
LEFT JOIN `AccountPlans` on `AccountPlans`.`PlanID` = `Accounts`.`PlanID`
WHERE
(
(`datausage`.`LogDate` >= '2014-06-01' and `datausage`.`LogDate` < '2014-07-01')
or `datausage`.`LogDate` is null
)
GROUP BY Accounts.accountID
ORDER BY `AccountName` asc
This query takes about 2 seconds to run. However it only takes 0.3 seconds to run if the "or datausage.LogDate is NULL" is removed. However, it seems I must have that clause in there, because accounts with no usage are excluded from the result set if that does not appear.
Here is the EXPLAIN output:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+--------+---------------------------------------------------------+---------+---------+----------------------+-------+----------------------------------------------------+
| 1 | SIMPLE | Accounts | ALL | PRIMARY,accounts_planid_foreign,accounts_cardid_foreign | NULL | NULL | NULL | 57 | Using temporary; Using filesort |
| 1 | SIMPLE | datausage | ALL | NULL | NULL | NULL | NULL | 96805 | Using where; Using join buffer (Block Nested Loop) |
| 1 | SIMPLE | AccountPlans | eq_ref | PRIMARY | PRIMARY | 4 | mydb.Accounts.planID | 1 | NULL |
The indexes on Accounts table are as follows:
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------+------------+-------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Accounts | 0 | PRIMARY | 1 | accountID | A | 57 | NULL | NULL | | BTREE | | |
| Accounts | 1 | accounts_planid_foreign | 1 | planID | A | 5 | NULL | NULL | | BTREE | | |
| Accounts | 1 | accounts_cardid_foreign | 1 | cardID | A | 0 | NULL | NULL | YES | BTREE | | |
The index on the datausage table is as follows:
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| datausage | 0 | PRIMARY | 1 | UsageID | A | 96805 | NULL | NULL | | BTREE | | |
I tried creating different indexes on datausage to see if it would help, but nothing did. I tried an index on AccountID, an index on (AccountID, LogDate), an index on (LogDate, AccountID), and an index on LogDate. None of these made any difference.
I also tried using a UNION ALL with one query for the LogDate range and the other query just where LogDate is null, but the result was about the same (actually a bit worse).
Can someone please help me understand what may be going on and the ways in which I can optimize the query execution time? Thank you!!
UPDATE: At Philipxy's request, here are the table definitions. Note that I removed some columns and constraints that are not related to this query to help keep things as tight and clean as possible.
CREATE TABLE `Accounts` (
`accountID` varchar(25) NOT NULL,
`name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`status` int(11) NOT NULL,
`planID` int(10) unsigned NOT NULL DEFAULT '1',
`created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`accountID`),
KEY `accounts_planid_foreign` (`planID`),
KEY `acctname_id_ndx` (`name`,`accountID`),
CONSTRAINT `accounts_planid_foreign` FOREIGN KEY (`planID`) REFERENCES `AccountPlans` (`planID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
CREATE TABLE `datausage` (
`UsageID` int(11) NOT NULL AUTO_INCREMENT,
`Core` int(11) DEFAULT NULL,
`CoreDelux` int(11) DEFAULT NULL,
`AccountID` varchar(25) DEFAULT NULL,
`LogDate` date DEFAULT NULL,
PRIMARY KEY (`UsageID`),
KEY `acctusage` (`AccountID`,`LogDate`)
) ENGINE=MyISAM AUTO_INCREMENT=104303 DEFAULT CHARSET=latin1
CREATE TABLE `AccountPlans` (
`planID` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
`params` text COLLATE utf8_unicode_ci NOT NULL,
`created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`plantype` varchar(25) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`planID`),
KEY `acctplans_id_type_ndx` (`planID`,`plantype`)
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
First, you can simplify the query by moving the where clause to the on clause:
SELECT a.accountID as AccountID, coalesce(a.name, a.accountID) as AccountName,
ap.plantype as AccountType, a.status as AccountStatus,
date(a.created_at) as Created,
sum(coalesce(du.Core, 0) + (coalesce(du.CoreDeluxe, 0) * 3)) as CoreData
FROM Accounts a LEFT JOIN
datausage du
on a.accountID = du.`accountID` AND
du.`LogDate` >= '2014-06-01' and du.`LogDate` < '2014-07-01'
LEFT JOIN
AccountPlans ap
on ap.`PlanID` = a.`PlanID`
GROUP BY a.accountID
ORDER BY AccountName asc ;
(I also introduced table aliases to make the query easier to read.)
This version should make better use of indexes because it eliminates the OR in the WHERE clause. However, it still won't use an index for the outer sort. The following might be better:
SELECT a.accountID as AccountID, coalesce(a.name, a.accountID) as AccountName,
ap.plantype as AccountType, a.status as AccountStatus,
date(a.created_at) as Created,
sum(coalesce(du.Core, 0) + (coalesce(du.CoreDeluxe, 0) * 3)) as CoreData
FROM Accounts a LEFT JOIN
datausage du
on a.accountID = du.`accountID` AND
du.LogDate >= '2014-06-01' and du.LogDate < '2014-07-01' LEFT JOIN
AccountPlans ap
on ap.PlanID = a.PlanID
GROUP BY a.accountID
ORDER BY a.name, a.accountID ;
For this, I would recommend the following indexes:
Accounts(name, AccountId)
Datausage(AccountId, LogDate)
AccountPlans(PlanId, PlanType)
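As DDL, those three indexes might look like the sketch below. The index names are taken from the table definitions posted in the question; if an index with that name already exists, the statement will fail and can simply be skipped.

```sql
-- Lets ORDER BY a.name, a.accountID read rows in index order.
CREATE INDEX acctname_id_ndx ON Accounts (name, accountID);

-- Supports the join on AccountID plus the LogDate range in the ON clause.
CREATE INDEX acctusage ON datausage (AccountID, LogDate);

-- Covers the plantype lookup by planID.
CREATE INDEX acctplans_id_type_ndx ON AccountPlans (planID, plantype);
```

One thing that may be worth checking, going only by the posted definitions: Accounts.accountID is utf8 while datausage.AccountID is latin1, and a character set mismatch on the join columns can keep MySQL from using the index on datausage.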
When you left join with datausage, you should restrict the rows as much as possible right there, in the ON clause. (For inner joins, a condition means the same thing whether it sits in ON or in WHERE, so put conditions wherever is clearest and reorder only when optimization requires it.) The result will be a null-extended row when there was no usage; you want to keep that row.
When you join with AccountPlans you don't want to introduce null rows (which can't happen anyway) so that's just an inner join.
The version below makes the AccountPlans join an inner join and puts it first. The (indexed) foreign key from Accounts.PlanID to AccountPlans means the DBMS knows the inner join will only ever generate one row per Accounts primary key, so the output is still keyed by AccountID. That row can then be left joined directly to datausage. (An index on its AccountID should help, e.g. for a merge join.) The other way around, there is no PlanID key or index on the outer join result to join with AccountPlans.
SELECT
a.accountID as AccountID,
IFNULL(a.name,a.accountID) as AccountName,
ap.plantype as AccountType,
a.status as AccountStatus,
date(a.created_at) as Created,
sum(IFNULL(du.Core,0) + (IFNULL(du.CoreDeluxe,0) * 3)) as CoreData
FROM Accounts a
JOIN AccountPlans ap ON ap.PlanID = a.PlanID
LEFT JOIN datausage du ON a.accountID = du.accountID AND du.LogDate >= '2014-06-01' AND du.LogDate < '2014-07-01'
GROUP BY a.accountID
I have a table of products with a score column, which has a B-Tree Index on it. I have a query which returns products that have not been shown to the user in the current session. I can't simply use simple pagination with LIMIT for it, because the result should be ordered by the score column, which can change between query calls.
My current solution works like this:
SELECT *
FROM products p
LEFT JOIN product_seen ps
ON (ps.session_id = ? AND p.product_id = ps.product_id )
WHERE ps.product_id is null
ORDER BY p.score DESC
LIMIT 30;
This works fine for the first few pages, but the response time grows linearly with the number of products already shown in the session and reaches about one second by the time this number reaches ~300. Is there a way to speed this up in MySQL? Or should I solve this problem in an entirely different way?
Edit:
These are the two tables:
CREATE TABLE `products` (
`product_id` int(15) NOT NULL AUTO_INCREMENT,
`shop` varchar(15) NOT NULL,
`shop_id` varchar(25) NOT NULL,
`shop_category_id` varchar(20) DEFAULT NULL,
`shop_subcategory_id` varchar(20) DEFAULT NULL,
`shop_designer_id` varchar(20) DEFAULT NULL,
`shop_designer_name` varchar(40) NOT NULL,
`created_at` timestamp NULL DEFAULT NULL,
`product_url` varchar(255) NOT NULL,
`name` varchar(255) NOT NULL,
`description` mediumtext NOT NULL,
`price_cents` int(10) NOT NULL,
`list_image_url` varchar(255) NOT NULL,
`list_image_height` int(4) NOT NULL,
`ending` timestamp NULL DEFAULT NULL,
`category_id` int(5) NOT NULL,
`last_update` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`included_at` timestamp NULL DEFAULT NULL,
`hearts` int(5) NOT NULL,
`score` decimal(10,5) NOT NULL,
`rand_field` decimal(16,15) NOT NULL,
`last_score_update` timestamp NULL DEFAULT NULL,
`active` tinyint(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`product_id`),
UNIQUE KEY `unique_shop_id` (`shop`,`shop_id`),
KEY `score_index` (`active`,`score`),
KEY `included_at_index` (`included_at`),
KEY `active_category_score` (`active`,`category_id`,`score`),
KEY `active_category` (`active`,`category_id`,`product_id`),
KEY `active_products` (`active`,`product_id`),
KEY `active_rand` (`active`,`rand_field`),
KEY `active_category_rand` (`active`,`category_id`,`rand_field`)
) ENGINE=InnoDB AUTO_INCREMENT=55985 DEFAULT CHARSET=utf8
CREATE TABLE `product_seen` (
`seenby_id` int(20) NOT NULL AUTO_INCREMENT,
`session_id` varchar(25) NOT NULL,
`product_id` int(15) NOT NULL,
`last_seen` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`sorting` varchar(10) NOT NULL,
`in_category` int(3) DEFAULT NULL,
PRIMARY KEY (`seenby_id`),
KEY `last_seen_index` (`last_seen`),
KEY `session_id` (`session_id`,`seenby_id`),
KEY `session_id_2` (`session_id`,`sorting`,`seenby_id`)
) ENGINE=InnoDB AUTO_INCREMENT=17431 DEFAULT CHARSET=utf8
Edit 2:
The query above is a simplification, this is the real query with EXPLAIN:
EXPLAIN SELECT
DISTINCT p.product_id AS id,
p.list_image_url AS image,
p.list_image_height AS list_height,
hearts,
active AS available,
(UNIX_TIMESTAMP( ) - ulp.last_action) AS last_loved
FROM `looksandgoods`.`products` p
LEFT JOIN `looksandgoods`.`user_likes_products` ulp
ON ( p.product_id = ulp.product_id AND ulp.user_id =1 )
LEFT JOIN `looksandgoods`.`product_seen` sb
ON (sb.session_id = 'y7lWunZKKABgMoDgzjwDjZw1'
AND sb.sorting = 'trend'
AND p.product_id = sb.product_id )
WHERE p.active =1
AND sb.product_id IS NULL
ORDER BY p.score DESC
LIMIT 30 ;
Explain output, there is still a temp table and filesort, although the keys for the join exist:
+----+-------------+-------+-------+----------------------------------------------------------------------------------------------------+------------------+---------+----------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+----------------------------------------------------------------------------------------------------+------------------+---------+----------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | p | range | score_index,active_category_score,active_category,active_products,active_rand,active_category_rand | score_index | 1 | NULL | 2299 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | ulp | ref | love_count_index,user_to_product_index,product_id | love_count_index | 9 | looksandgoods.p.product_id,const | 1 | |
| 1 | SIMPLE | sb | ref | session_id,session_id_2 | session_id | 77 | const | 711 | Using where; Not exists; Distinct |
+----+-------------+-------+-------+----------------------------------------------------------------------------------------------------+------------------+---------+----------------------------------+------+----------------------------------------------+
New answer
I think the problem with the real query is the DISTINCT clause. The implication is that either or both of the product_seen and user_likes_products tables can join multiple rows for each product_id which could potentially appear in the result set (given the somewhat disturbing lack of UNIQUE KEYs on the product_seen table), and this is the reason you've included the DISTINCT clause. Unfortunately, it also means MySQL will have to create a temp table to process the query.
Before I go any further, if it's possible to do...
ALTER TABLE product_seen ADD UNIQUE KEY (session_id, product_id, sorting);
...and...
ALTER TABLE user_likes_products ADD UNIQUE KEY (user_id, product_id);
...then the DISTINCT clause is redundant, and removing it should eliminate the problem. N.B. I'm not suggesting you necessarily need to add these keys, but rather just to confirm that these fields are always unique.
If it's not possible, then there may be another solution, but I'd need to know a lot more about the tables involved in the joins.
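Assuming the uniqueness described above does hold, the real query from the question with DISTINCT dropped would look roughly like this. It is the asker's query otherwise unchanged, except that I've qualified hearts and active with the p alias on the assumption that both come from products.

```sql
SELECT
    p.product_id AS id,
    p.list_image_url AS image,
    p.list_image_height AS list_height,
    p.hearts,
    p.active AS available,
    (UNIX_TIMESTAMP() - ulp.last_action) AS last_loved
FROM `looksandgoods`.`products` p
LEFT JOIN `looksandgoods`.`user_likes_products` ulp
    ON (p.product_id = ulp.product_id AND ulp.user_id = 1)
LEFT JOIN `looksandgoods`.`product_seen` sb
    ON (sb.session_id = 'y7lWunZKKABgMoDgzjwDjZw1'
        AND sb.sorting = 'trend'
        AND p.product_id = sb.product_id)
WHERE p.active = 1
  AND sb.product_id IS NULL      -- anti-join: keep only products not yet seen
ORDER BY p.score DESC
LIMIT 30;
```

If the EXPLAIN for this version no longer shows "Using temporary; Using filesort" on p, that would suggest the DISTINCT was the cause.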
Old answer
An EXPLAIN for your query yields...
+----+-------------+-------+------+---------------+------------+---------+-------+------+-------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------------+---------+-------+------+-------------------------+
| 1 | SIMPLE | p | ALL | NULL | NULL | NULL | NULL | 10 | Using filesort |
| 1 | SIMPLE | ps | ref | session_id | session_id | 27 | const | 1 | Using where; Not exists |
+----+-------------+-------+------+---------------+------------+---------+-------+------+-------------------------+
...which shows it's not using an index on the products table, so it's having to do a table scan and a filesort, which is why it's slow.
I noticed there's an index on (active, score) which you could use by changing the query to only show active products...
SELECT *
FROM products p
LEFT JOIN product_seen ps
ON (ps.session_id = ? AND p.product_id = ps.product_id )
WHERE p.active=TRUE AND ps.product_id is null
ORDER BY p.score DESC
LIMIT 30;
...which changes the EXPLAIN to...
+----+-------------+-------+-------+-----------------------------+-------------+---------+-------+------+-------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+-----------------------------+-------------+---------+-------+------+-------------------------+
| 1 | SIMPLE | p | range | score_index,active_products | score_index | 1 | NULL | 10 | Using where |
| 1 | SIMPLE | ps | ref | session_id | session_id | 27 | const | 1 | Using where; Not exists |
+----+-------------+-------+-------+-----------------------------+-------------+---------+-------+------+-------------------------+
...which is now doing a range scan and no filesort, which should be much faster.
Or if you want it to also return inactive products, then you'll need to add an index on score only, with...
ALTER TABLE products ADD KEY (score);
I have the following query:
explain select * from users, dls where dls.user_id=users.id and users.status = 'accepted' and users.acc = 0 order by users.user_name desc limit 18416, 16
Which results in the following explain;
+----+-------------+-------+------+------------------------+-------------+---------+---------------------------------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+------------------------+-------------+---------+---------------------------------+-------+---------------------------------+
| 1 | SIMPLE | dls | ALL | PRIMARY,user_id | NULL | NULL | NULL | 19910 | Using temporary; Using filesort |
| 1 | SIMPLE | users | ref | PRIMARY,id_user_name | id_user_name | 4 | dls.user_id | 1 | Using where |
+----+-------------+-------+------+------------------------+-------------+---------+---------------------------------+-------+---------------------------------+
2 rows in set (0.00 sec)
This query is really, really slow and I cannot figure out how to fix it. I tried all kinds of indexes from reading articles on how to optimize order by / limit queries, but the result remains the same. Can anyone please help?
Edit: schemas:
CREATE TABLE `users` (
`id` int(10) unsigned NOT NULL auto_increment,
`user_name` varchar(100) character set utf8 NOT NULL,
`status` enum('accepted','rejected') character set utf8 NOT NULL,
`acc` varchar(6) character set utf8 NOT NULL,
PRIMARY KEY (`id`),
KEY `user_name` (`user_name`),
KEY `id_user_name` (`id`,`user_name`)
)
CREATE TABLE `dls` (
`user_id` int(10) unsigned NOT NULL,
`category_id` bigint(20) NOT NULL,
`download_url` varchar(255) character set utf8 NOT NULL,
PRIMARY KEY (`user_id`,`category_id`),
KEY `user_id` (`user_id`)
)
Output for query by Scrummeister;
+----+-------------+-------+------+------------------------+--------+---------+------------------------------+-------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+------------------------+--------+---------+------------------------------+-------+-----------------------------+
| 1 | SIMPLE | u | ALL | PRIMARY,id_user_name | NULL | NULL | NULL | 10838 | Using where; Using filesort |
| 1 | SIMPLE | dls | ref | PRIMARY,user_id | user_id | 4 | u.id | 2 | |
+----+-------------+-------+------+------------------------+--------+---------+------------------------------+-------+-----------------------------+
MySQL is known to have issues with a LIMIT using a large offset.
The STRAIGHT_JOIN keyword tells MySQL to first scan the users table and then, for every user, look up the rows in the dls table.
SELECT STRAIGHT_JOIN *
FROM users u JOIN dls ON dls.user_id = u.id
WHERE u.status = 'accepted' and u.acc = 0
ORDER BY u.user_name desc
LIMIT 18416, 16
Using STRAIGHT_JOIN is not recommended unless there is a need for it. In this specific case I believe it might work, since it can use the user_name index for sorting.
Other options you have:
Increase the size of sort_buffer_size
Increase the size of read_rnd_buffer_size (with caution!)
Do the paging on the users table only, regardless of how many dls rows each user has, and only then apply the JOIN.
Handle the paging in your code. Assuming a user goes from page to page without skipping too many, you should store the first and last user names for each page. If the user clicks the next page, add WHERE user_name < '{LastPageLastUsername}' (the sort is descending) with LIMIT 0,16. This should increase performance, since MySQL no longer has to read and discard the skipped rows.
For other optimization, read ORDER BY Optimization and Limit Optimization
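The last two options could be sketched as follows. Both are illustrations under assumptions rather than drop-in replacements: the first pages over users, so a "page" becomes 16 users rather than 16 joined rows, and the placeholder 'LAST_NAME_FROM_PREVIOUS_PAGE' in the second is a hypothetical value the application would supply.

```sql
-- Option A: page on users only, then join the 16 selected users to dls.
SELECT u.*, d.*
FROM (
    SELECT id
    FROM users
    WHERE status = 'accepted' AND acc = 0
    ORDER BY user_name DESC
    LIMIT 18416, 16
) AS page
JOIN users u ON u.id = page.id
JOIN dls d ON d.user_id = u.id
ORDER BY u.user_name DESC;

-- Option B: remember the last user_name shown and continue from there,
-- so no rows have to be read and thrown away for the offset.
SELECT u.*, d.*
FROM users u
JOIN dls d ON d.user_id = u.id
WHERE u.status = 'accepted'
  AND u.acc = 0
  AND u.user_name < 'LAST_NAME_FROM_PREVIOUS_PAGE'
ORDER BY u.user_name DESC
LIMIT 16;
```

Option B can skip rows when several rows share the same user_name across a page boundary, so a tie-breaker column (for example the user id) would be needed if names are not unique.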
Try adding an index to the users table with the following columns
status, acc, user_name
or
acc, status, user_name
whichever is faster
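As statements, the two candidates might look like this (the index names are made up; only one of the two needs to be kept after comparing them with EXPLAIN):

```sql
-- Candidate 1: equality columns first, sort column last.
CREATE INDEX status_acc_user_name ON users (status, acc, user_name);

-- Candidate 2: same idea with the equality columns swapped.
CREATE INDEX acc_status_user_name ON users (acc, status, user_name);
```

Note that acc is a varchar in the posted schema, so the query may need to compare it to the string '0' rather than the number 0 before MySQL can use the index for that column.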