Question covers doubts on efficient SQL query for multiple subqueries:
I have 3 tables. I want to get details from table 1, based on filtering done from table 2 and table 3. Currently I am using IN clause on table 2 and table 3 but it takes around 6 seconds for 2M users. I tried join also but it was slower than subquery.
Table1:
mysql> describe users;
Field | Type | Null | Key | Default
| uuid | varchar(36) | NO | PRI | NULL
| firstname | varchar(512) | YES | | NULL
| status | varchar(512) | YES | | NULL
| createdAt | timestamp | YES | | CURRENT_TIMESTAMP
Table 2:
describe homes;
| Field | Type | Null | Key | Default | Extra
| uuid | varchar(50) | NO | PRI | NULL
| phoneNumberHash | varchar(512) | YES | MUL | NULL
| secondaryPhoneNumberHash | varchar(512) | YES | MUL | NULL
Table 3:
describe utility_tags:
| Field | Type | Null | Key | Default |
| tag_name | varchar(50) | NO | MUL | NULL |
| tag_value | varchar(50) | NO | MUL | NULL |
| user_id | varchar(50) | NO | MUL | NULL |
I have index on all the required fields ie.
User Table : Index on uuid
Home Table : Separate Index on phoneNumberHash and secondaryPhoneNumberHash
Utility_Tags: Separate Index on tag_name and tag_value
Query I am running:
SELECT uuid, firstname
FROM users
WHERE ( uuid in (
SELECT `uuid`
FROM `homes`
WHERE ( ( `phoneNumberHash` = '02c' OR `secondaryPhoneNumberHash` = '02c' ))
)
OR uuid in (
SELECT `user_id`
FROM `utility_tags`
WHERE ( `tag_name` = 'ACCOUNT_NUMBER' AND `tag_value`= '13' )
))
AND `status` != 'DELETED'
ORDER BY `createdAt` DESC LIMIT 10 OFFSET 0;
The query is slow and takes around 6 sec when there are 2M rows in user and homes table.
I tried join query:
SELECT users.uuid, firstname
FROM users inner join homes on homes.uuid=users.uuid
inner join utility_tags on utility_tags.user_id=users.uuid
WHERE ( phoneNumberHash = '02c' OR secondaryPhoneNumberHash = '02cd0' )
OR ( tag_name = 'ACCOUNT_NUMBER' AND tag_value= '1311851988' )
AND `status` != 'DELETED'
ORDER BY `createdAt` DESC
LIMIT 10 OFFSET 0;
This takes around 30 seconds.
Any help is highly appreciated.
You are selecting certain rows from your users table based on matches in your other tables. You're using a complex IN( ... ) clause for that.
Let's look at the contents of that clause for optimization possibilities. Here's one way you generate a set of uuid values.
SELECT uuid
FROM homes
WHERE phoneNumberHash = '02c'
OR secondaryPhoneNumberHash = '02c'
Here's the other
SELECT user_id
FROM utility_tags
WHERE tag_name = 'ACCOUNT_NUMBER'
AND tag_value= '13'
Let's recast all this as a UNION of several sets of uuid values, like this.
SELECT uuid FROM homes WHERE phoneNumberHash = '02c'
UNION
SELECT uuid FROM homes WHERE secondaryPhoneNumberHash = '02c'
UNION
SELECT user_id AS uuid
FROM utility_tags
WHERE tag_name = 'ACCOUNT_NUMBER'
AND tag_value= '13'
That union of three queries does the same thing as all your OR clauses. The first two of those queries should (if you're using InnoDB) be optimized by the indexes on phoneNumberHash and secondaryPhoneNumberHash respectively. The third query in that union needs a compound index on (tag_name, tag_value, user_id) to perform efficiently.
The cool thing about UNION is it does the same sort of set creation as OR, but lets you write queries within the UNION that are more likely to use indexes. I suggest you experiment with this UNION query and appropriate indexes until you're happy with its performance. Then you can use it in your outer query.
(It's possible that the query planner has become smart enough to handle phoneNumberHash = '02c' OR secondaryPhoneNumberHash = '02c' as a UNION all by itself, exploiting your two indexes one after the other. Recent MySQL versions have made great progress in query planning.)
So that leaves us with the outer query:
SELECT uuid, firstname
FROM users
WHERE matching uuids
AND status != 'DELETED'
ORDER BY createdAt DESC
LIMIT 10 OFFSET 0
This is hard to make sargable. The query planner doesn't like != operators. It likes = best because index equality scans are cheap. It likes <, <=, >=, and > OK because range scans are almost as cheap. But you're stuck with !=.
Also, the query planner hates ORDER BY ... LIMIT because it has to sort a whole mess of rows just to discard all except a tiny number.
The following compound covering index MAY optimize this query: (createdAt, status, uuid, firstname). The query planner may be able to dodge the separate ORDER BY if it has an index that provides both the match criteria and the needed results. It's also possible that this index will be better. (createdAt, status, uuid, status, firstname) You'll need to try them both. Don't keep them both, only the one that helps best.
Putting it all together:
SELECT u.uuid, u.firstname
FROM users u
JOIN (
SELECT uuid FROM homes WHERE phoneNumberHash = '02c'
UNION
SELECT uuid FROM homes WHERE secondaryPhoneNumberHash = '02c'
UNION
SELECT user_id AS uuid
FROM utility_tags
WHERE tag_name = 'ACCOUNT_NUMBER'
AND tag_value= '13'
) s ON s.uuid = u.uuid
WHERE status != 'DELETED'
ORDER BY createdAt DESC
LIMIT 10 OFFSET 0
Things get interesting on megarow tables when you want subsecond query response. http://use-the-index-luke.com/ is a fine reference for this stuff.
Your main problem is you're selecting from users first - move it to last so its index can be used (subqueries can't be indexed).
Also, SQL OR is notorious, mainly because (almost always) at most 1 index can be used.
Select from the subquery first, so the index into users can be used
Ensure there are indexes on all looked-up columns, ie (uuid), (phoneNumberHash), (secondaryPhoneNumberHash) and (tag_name, tag_value)
Break up your query to eradicate OR
Try this:
SELECT uuid, firstname
FROM (
SELECT uuid
FROM homes
WHERE phoneNumberHash = '02c'
UNION
SELECT uuid
FROM homes
WHERE secondaryPhoneNumberHash = '02c'
SELECT user_id
FROM utility_tags
WHERE tag_name = 'ACCOUNT_NUMBER'
AND tag_value = 13
) x
JOIN users ON users.uuid = x.uuid
AND status != 'DELETED'
ORDER BY createdAt DESC
LIMIT 10 OFFSET 0
Notice also that the test for status != 'DELETED' is in the join condition (not the WHERE clause), so it's executed at join time, not post-join, which will boost performance especially if there are a lot of deleted users.
Related
I have the following tables (minified for the sake of simplicity):
CREATE TABLE IF NOT EXISTS `product_bundles` (
bundle_id int AUTO_INCREMENT PRIMARY KEY,
-- More columns here for bundle attributes
) ENGINE=InnoDB;
CREATE TABLE IF NOT EXISTS `product_bundle_parts` (
`part_id` int AUTO_INCREMENT PRIMARY KEY,
`bundle_id` int NOT NULL,
`sku` varchar(255) NOT NULL,
-- More columns here for product attributes
KEY `bundle_id` (`bundle_id`),
KEY `sku` (`sku`)
) ENGINE=InnoDB;
CREATE TABLE IF NOT EXISTS `products` (
`product_id` mediumint(8) AUTO_INCREMENT PRIMARY KEY,
`sku` varchar(64) NOT NULL DEFAULT '',
`status` char(1) NOT NULL default 'A',
-- More columns here for product attributes
KEY (`sku`),
) ENGINE=InnoDB;
And I want to show only the 'product bundles' that are currently completely in stock and defined in the database (since these get retrieved from a third party vendor, there is no guarantee the SKU is defined). So I figured I'd need an anti-join to retrieve it accordingly:
SELECT SQL_CALC_FOUND_ROWS *
FROM product_bundles AS bundles
WHERE 1
AND NOT EXISTS (
SELECT *
FROM product_bundle_parts AS parts
LEFT JOIN products AS products ON parts.sku = products.sku
WHERE parts.bundle_id = bundles.bundle_id
AND products.status = 'A'
AND products.product_id IS NULL
)
-- placeholder for other dynamic conditions for e.g. sorting
LIMIT 0, 24
Now, I sincerely thought this would filter out the products by status, however, that seems not to be the case. I then changed one thing up a bit, and the query never finished (although I believe it to be correct):
SELECT SQL_CALC_FOUND_ROWS *
FROM product_bundles AS bundles
WHERE 1
AND NOT EXISTS (
SELECT *
FROM product_bundle_parts AS parts
LEFT JOIN products AS products ON parts.sku = products.sku
AND products.status = 'A'
WHERE parts.bundle_id = bundles.bundle_id
AND products.product_id IS NULL
)
-- placeholder for other dynamic conditions for e.g. sorting
LIMIT 0, 24
Example data:
product_bundles
bundle_id | etc.
1 |
2 |
3 |
product_bundle_parts
part_id | bundle_id | sku
1 | 1 | 'sku11'
2 | 1 | 'sku22'
3 | 1 | 'sku33'
4 | 1 | 'sku44'
5 | 2 | 'sku55'
6 | 2 | 'sku66'
7 | 3 | 'sku77'
8 | 3 | 'sku88'
products
product_id | sku | status
101 | 'sku11' | 'A'
102 | 'sku22' | 'A'
103 | 'sku33' | 'A'
104 | 'sku44' | 'A'
105 | 'sku55' | 'D'
106 | 'sku66' | 'A'
107 | 'sku77' | 'A'
108 | 'sku99' | 'A'
Example result: Since the product status of product #105 is 'D' and 'sku88' from part #8 was not found:
bundle_id | etc.
1 |
I am running Server version: 10.3.25-MariaDB-0ubuntu0.20.04.1 Ubuntu 20.04
So there are a few questions I have.
Why does the first query not filter out products that do not have the status A.
Why does the second query not finish?
Are there alternative ways of achieving the same thing in a more efficient matter, as this looks rather cumbersome.
First of all, I've read that SQL_CALC_FOUND_ROWS * is much slower than running two separate query (COUNT(*) and then SELECT * or, if you make your query inside another programming language, like PHP, executing the SELECT * and then count the number of rows of the result set)
Second: your first query returns all the boundles that doesn't have ANY active products, while you need the boundles with ALL products active.
I'd change it in the following:
SELECT SQL_CALC_FOUND_ROWS *
FROM product_bundles AS bundles
WHERE NOT EXISTS (
SELECT 'x'
FROM product_bundle_parts AS parts
LEFT JOIN products ON (parts.sku = products.sku)
WHERE parts.bundle_id = bundles.bundle_id
AND COALESCE(products.status, 'X') != 'A'
)
-- placeholder for other dynamic conditions for e.g. sorting
LIMIT 0, 24
I changed the products.status = 'A' in products.status != 'A': in this way the query will return all the boundles that DOESN'T have inactive products (I also removed the condition AND products.product_id IS NULL because it should have been in OR, but with a loss in performance).
You can see my solution in SQLFiddle.
Finally, to know why your second query doesn't end, you should check the structure of your tables and how they are indexed. Executing an Explain on the query could help you to find eventual issues on the structure. Just put the keyword EXPLAIN before the SELECT and you'll have your "report" (EXPLAIN SELECT * ....).
I have the following queries which both return the same result and row count:
select * from (
select UNIX_TIMESTAMP(network_time) * 1000 as epoch_network_datetime,
hbrl.business_rule_id,
display_advertiser_id,
hbrl.campaign_id,
truncate(sum(coalesce(hbrl.ad_spend_network, 0))/100000.0, 2) as demand_ad_spend_network,
sum(coalesce(hbrl.ad_view, 0)) as demand_ad_view,
sum(coalesce(hbrl.ad_click, 0)) as demand_ad_click,
truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else 100*sum(hbrl.ad_click)/sum(hbrl.ad_view) end, 0), 2) as ctr_percent,
truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else sum(hbrl.ad_spend_network)/100.0/sum(hbrl.ad_view) end, 0), 2) as ecpm,
truncate(coalesce(case when sum(hbrl.ad_click) = 0 then 0 else sum(hbrl.ad_spend_network)/100000.0/sum(hbrl.ad_click) end, 0), 2) as ecpc
from hourly_business_rule_level hbrl
where (publisher_network_id = 31534)
and network_time between str_to_date('2017-08-13 17:00:00.000000', '%Y-%m-%d %H:%i:%S.%f') and str_to_date('2017-08-14 16:59:59.999000', '%Y-%m-%d %H:%i:%S.%f')
and (network_time IS NOT NULL and display_advertiser_id > 0)
group by network_time, hbrl.campaign_id, hbrl.business_rule_id
having demand_ad_spend_network > 0
OR demand_ad_view > 0
OR demand_ad_click > 0
OR ctr_percent > 0
OR ecpm > 0
OR ecpc > 0
order by epoch_network_datetime) as atb
left join dim_demand demand on atb.display_advertiser_id = demand.advertiser_dsp_id
and atb.campaign_id = demand.campaign_id
and atb.business_rule_id = demand.business_rule_id
ran explain extended, and these are the results:
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+-----------------+---------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+-----------------+---------+----------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1451739 | 100.00 | NULL |
| 1 | PRIMARY | demand | ref | PRIMARY,join_index | PRIMARY | 4 | atb.campaign_id | 1 | 100.00 | Using where |
| 2 | DERIVED | hourly_business_rule_level | ALL | _hourly_business_rule_level_supply_idx,_hourly_business_rule_level_demand_idx | NULL | NULL | NULL | 1494447 | 97.14 | Using where; Using temporary; Using filesort |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+-----------------+---------+----------+----------------------------------------------+
and the other is:
select UNIX_TIMESTAMP(network_time) * 1000 as epoch_network_datetime,
hbrl.business_rule_id,
display_advertiser_id,
hbrl.campaign_id,
truncate(sum(coalesce(hbrl.ad_spend_network, 0))/100000.0, 2) as demand_ad_spend_network,
sum(coalesce(hbrl.ad_view, 0)) as demand_ad_view,
sum(coalesce(hbrl.ad_click, 0)) as demand_ad_click,
truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else 100*sum(hbrl.ad_click)/sum(hbrl.ad_view) end, 0), 2) as ctr_percent,
truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else sum(hbrl.ad_spend_network)/100.0/sum(hbrl.ad_view) end, 0), 2) as ecpm,
truncate(coalesce(case when sum(hbrl.ad_click) = 0 then 0 else sum(hbrl.ad_spend_network)/100000.0/sum(hbrl.ad_click) end, 0), 2) as ecpc
from hourly_business_rule_level hbrl
join dim_demand demand on hbrl.display_advertiser_id = demand.advertiser_dsp_id
and hbrl.campaign_id = demand.campaign_id
and hbrl.business_rule_id = demand.business_rule_id
where (publisher_network_id = 31534)
and network_time between str_to_date('2017-08-13 17:00:00.000000', '%Y-%m-%d %H:%i:%S.%f') and str_to_date('2017-08-14 16:59:59.999000', '%Y-%m-%d %H:%i:%S.%f')
and (network_time IS NOT NULL and display_advertiser_id > 0)
group by network_time, hbrl.campaign_id, hbrl.business_rule_id
having demand_ad_spend_network > 0
OR demand_ad_view > 0
OR demand_ad_click > 0
OR ctr_percent > 0
OR ecpm > 0
OR ecpc > 0
order by epoch_network_datetime;
and these are the results for the second query:
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+---------------------------------------------------------------+---------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+---------------------------------------------------------------+---------+----------+----------------------------------------------+
| 1 | SIMPLE | hourly_business_rule_level | ALL | _hourly_business_rule_level_supply_idx,_hourly_business_rule_level_demand_idx | NULL | NULL | NULL | 1494447 | 97.14 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | demand | ref | PRIMARY,join_index | PRIMARY | 4 | my6sense_datawarehouse.hourly_business_rule_level.campaign_id | 1 | 100.00 | Using where; Using index |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+---------------------------------------------------------------+---------+----------+----------------------------------------------+
the first one takes about 2 seconds while the second one takes over 2 minutes!
why is the second query taking so long?
what am I missing here?
thanks.
Use a subquery whenever the subquery significantly shrinks the number of rows before - ANY JOIN - always to reinforce Rick James Plan B.
To reinforce Rick & Paul's answer which you have already documented. The answers by Rick and Paul deserve Acceptance.
One possible reason is the number of rows that have to be joined with the second table.
The GROUP BY clause and the HAVING clause will limit the number of rows returned from your subquery.
Only those rows will be used for the join.
Without the subquery only the WHERE clause is limiting the number of rows for the JOIN.
The JOIN is done before the GROUP BY and HAVING clauses are processed.
Depending on group size and the selectivity of the HAVING conditions there would be much more rows that need to be joined.
Consider the following simplified example:
We have a table users with 1000 entries and the columns id, email.
create table users(
id smallint auto_increment primary key,
email varchar(50) unique
);
Then we have a (huge) log table user_actions with 1,000,000 entries and the columns id, user_id, timestamp, action
create table user_actions(
id mediumint auto_increment primary key,
user_id smallint not null,
timestamp timestamp,
action varchar(50),
index (timestamp, user_id)
);
The task is to find all users who have at least 900 entries in the log table since 2017-02-01.
The subquery solution:
select a.user_id, a.cnt, u.email
from (
select a.user_id, count(*) as cnt
from user_actions a
where a.timestamp >= '2017-02-01 00:00:00'
group by a.user_id
having cnt >= 900
) a
left join users u on u.id = a.user_id
The subquery returns 135 rows (users). Only those rows will be joined with the users table.
The subquery runs in about 0.375 seconds. The time needed for the join is almost zero, so the full query runs in about 0.375 seconds.
Solution without subquery:
select a.user_id, count(*) as cnt, u.email
from user_actions a
left join users u on u.id = a.user_id
where a.timestamp >= '2017-02-01 00:00:00'
group by a.user_id
having cnt >= 900
The WHERE condition filters the table to 866,081 rows.
The JOIN has to be done for all those 866K rows.
After the JOIN the GROUP BY and the HAVING clauses are processed and limit the result to 135 rows.
This query needs about 0.815 seconds.
So you can already see, that a subquery can improve the performance.
But let's make things worse and drop the primary key in the users table.
This way we have no index which can be used for the JOIN.
Now the first query runs in 0.455 seconds. The second query needs 40 seconds - almost 100 times slower.
Notes
It's difficult to say if the same applies to your case. Reasons are:
Your queries are quite complex and far away from from beeing an MVCE.
I don't see anything beeng selected from the demand table - So it's unclear why you are joining it at all.
You use a LEFT JOIN in one query and an INNER JOIN in another one.
The relation between the two tables is unclear.
No information about indexes. You should provide the CREATE statements (SHOW CREATE table_name).
Test setup
drop table if exists users;
create table users(
id smallint auto_increment primary key,
email varchar(50) unique
)
select seq as id, rand(1) as email
from seq_1_to_1000
;
drop table if exists user_actions;
create table user_actions(
id mediumint auto_increment primary key,
user_id smallint not null,
timestamp timestamp,
action varchar(50),
index (timestamp, user_id)
)
select seq as id
, floor(rand(2)*1000)+1 as user_id
#, '2017-01-01 00:00:00' + interval seq*20 second as timestamp
, from_unixtime(unix_timestamp('2017-01-01 00:00:00') + seq*20) as timestamp
, rand(3) as action
from seq_1_to_1000000
;
MariaDB 10.0.19 with sequence plugin.
The queries are different. One says JOIN, the other says LEFT JOIN. You are not using demand, so the join is probably useless. However, in the case of JOIN, you are filtering out advertisers that are not in dim_demand; it that the intent?
But that does not address the question.
The EXPLAINs estimate that there are 1.5M rows in hbrl. But how many show up in the result? I would guess it is a lot fewer. From this, I can answer your question.
Consider these two:
SELECT ... FROM ( SELECT ... FROM a
GROUP BY or HAVING or LIMIT ) x
JOIN b
SELECT ... FROM a
JOIN b
GROUP BY or HAVING or LIMIT
The first will decrease the number of rows that need to join to b; the second will need to do a full 1.5M joins. I suspect that the time taken to do the JOIN (be it LEFT or not) is where the difference is.
Plan A: Remove demand from the query.
Plan B: Use a subquery whenever the subquery significantly shrinks the number of rows before the JOIN.
Indexing (may speed up both variants):
INDEX(publisher_network_id, network_time)
and get rid of this as being useless (since the between will fail anyway for NULL):
and network_time IS NOT NULL
Side note: I recommend simplifying and fixing this
and network_time
between str_to_date('2017-08-13 17:00:00.000000', '%Y-%m-%d %H:%i:%S.%f')
AND str_to_date('2017-08-14 16:59:59.999000', '%Y-%m-%d %H:%i:%S.%f')
to
and network_time >= '2017-08-13 17:00:00
and network_time < '2017-08-13 17:00:00 + INTERVAL 24 HOUR
I have a forum and I would like to see the latest topics with the author's name and the last user who answered
Table Topic (forum)
| idTopic | IdParent | User | Title | Text |
--------------------------------------------------------
| 1 | 0 | Max | Help! | i need somebody |
--------------------------------------------------------
| 2 | 1 | Leo | | What?! |
Query:
SELECT
Question.*,
Response.User AS LastResponseUser
FROM Topic AS Question
LEFT JOIN (
SELECT User, IdParent
FROM Topic
ORDER BY idTopic DESC
) AS Response
ON ( Response.IdParent = Question.idTopic )
WHERE Question.IdParent = 0
GROUP BY Question.idTopic
ORDER BY Question.idTopic DESC
Output:
| idTopic | IdParent | User | Title | Text | LastResponseUser |
---------------------------------------------------------------------------
| 1 | 0 | Max | Help! | i need somebody | Leo |
---------------------------------------------------------------------------
Example:
http://sqlfiddle.com/#!2/22f72/4
The query works, but is very slow (more or less 0.90 seconds over 25'000 record).
How can I make it faster?
UPDATE
comparison between the proposed solutions
http://sqlfiddle.com/#!2/94068/22
If using your current schema, I'd recommend adding indexes (particularly a clustered index (primary key)) and simplifying your SQL to let mySQL do the work of optimising the statement, rather than forcing it to run a subquery, sort the results, then run the main query.
CREATE TABLE Topic (
idTopic INT
,IdParent INT
,User VARCHAR(100)
,Title VARCHAR(255)
,Text VARCHAR(255)
,CONSTRAINT Topic_PK PRIMARY KEY (idTopic)
,CONSTRAINT Topic_idTopic_UK UNIQUE (idTopic)
,INDEX Topic_idParentIdTopic_IX (idParent, idTopic)
);
INSERT INTO Topic (idTopic, IdParent, User, Title, Text) VALUES
(1, 0, 'Max', 'Help!', 'i need somebody'),
(2, 1, 'Leo', '', 'What!?');
SELECT Question.*
, Response.User AS LastResponseUser
FROM Topic AS Question
LEFT JOIN Topic AS Response
ON Response.IdParent = Question.idTopic
WHERE Question.IdParent = 0
order by Question.idTopic
;
http://sqlfiddle.com/#!2/7f1bc/1
Update
In the comments you mentioned you only want the most recent response. For that, try this:
SELECT Question.*
, Response.User AS LastResponseUser
FROM Topic AS Question
LEFT JOIN (
select a.user, a.idParent
from Topic as a
left join Topic as b
on b.idParent = a.idParent
and b.idTopic > a.idTopic
where b.idTopic is null
) AS Response
ON Response.IdParent = Question.idTopic
WHERE Question.IdParent = 0
order by Question.idTopic
;
http://sqlfiddle.com/#!2/7f1bc/3
Assuming the highest IDTopic is the last responses user...
and assuming you want to return topics without responses...
Select A.IDTopic, A.IDParent, A.User, A.Title, A.Text,
case when b.User is null then 'No Response' else B.User end as LastReponseUser
FROM topic A
LEFT JOIN Topic B
on A.IdTopic = B.IDParent
and B.IDTopic = (Select max(IDTopic) from Topic
where IDParent=B.IDParent group by IDParent)
WHERE A.IDParent =0
My mysql DB has become CPU hungry trying to execute a particularly slow query. When I do an explain, mysql says "Using where; Using temporary; Using filesort". Please help deciphering and solving this puzzle.
Table structure:
CREATE TABLE `topsources` (
`USER_ID` varchar(255) NOT NULL,
`UPDATED_TIME` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`URL_ID` int(11) NOT NULL,
`SOURCE_SLUG` varchar(100) NOT NULL,
`FEED_PAGE_URL` varchar(255) NOT NULL,
`CATEGORY_SLUG` varchar(100) NOT NULL,
`REFERRER` varchar(2048) DEFAULT NULL,
PRIMARY KEY (`USER_ID`,`URL_ID`),
KEY `USER_ID` (`USER_ID`),
KEY `FEED_PAGE_URL` (`FEED_PAGE_URL`),
KEY `SOURCE_SLUG` (`SOURCE_SLUG`),
KEY `CATEGORY_SLUG` (`CATEGORY_SLUG`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The table has 370K rows...sometimes higher. The below query takes 10+ seconds.
SELECT topsources.SOURCE_SLUG, COUNT(topsources.SOURCE_SLUG) AS VIEW_COUNT
FROM topsources
WHERE CATEGORY_SLUG = '/newssource'
GROUP BY topsources.SOURCE_SLUG
HAVING MAX(CASE WHEN topsources.USER_ID = 'xxxx' THEN 1 ELSE 0 END) = 0
ORDER BY VIEW_COUNT DESC;
Here's the extended explain:
+----+-------------+------------+------+---------------+---------------+---------+-------+--------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------+---------------+---------------+---------+-------+--------+----------+----------------------------------------------+
| 1 | SIMPLE | topsources | ref | CATEGORY_SLUG | CATEGORY_SLUG | 302 | const | 160790 | 100.00 | Using where; Using temporary; Using filesort |
+----+-------------+------------+------+---------------+----
-----------+---------+-------+--------+----------+----------------------------------------------+
Is there a way to improve this query? Also, are there any mysql settings that can help in reducing CPU load? I can allocate more memory that's available on my server.
The most likely thing to help the query is an index on CATEGORY_SLUG, especially if it takes on many values. (That is, if the query is highly selective.) The query needs to read the entire table to get the results -- although 10 seconds seems like a long time.
I don't think the HAVING clause would be affecting the query processing.
Does the query take just as long if you run it two times in a row?
If there are many rows that match your CATEGORY_SLUG criteria, it may be difficult to make this fast, but is this any quicker?
SELECT ts.SOURCE_SLUG, COUNT(ts.SOURCE_SLUG) AS VIEW_COUNT
FROM topsources ts
WHERE ts.CATEGORY_SLUG = '/newssource'
AND NOT EXISTS(SELECT 1 FROM topsources ts2
WHERE ts2.CATEGORY_SLUG = '/newssource'
AND ts.SOURCE_SLUG = TS2.SOURCE_SLUG
AND ts2.USER_ID = 'xxxx')
GROUP BY ts.SOURCE_SLUG
ORDER BY VIEW_COUNT DESC;
That should do the trick if I read this my sql alteration correcty
SELECT topsources.SOURCE_SLUG, COUNT(topsources.SOURCE_SLUG) AS VIEW_COUNT
FROM topsources
WHERE CATEGORY_SLUG = '/newssource' and
topsources.SOURCE_SLUG not in (
select distinct SOURCE_SLUG
from topsources
where USER_ID = 'xxxx'
)
GROUP BY topsources.SOURCE_SLUG
ORDER BY VIEW_COUNT DESC;
Always hard to optimise something when you can't just throw queries at the data yourself, but this would be my first attempt if I was doing it myself:
SELECT t.SOURCE_SLUG, COUNT(t.SOURCE_SLUG) AS VIEW_COUNT
FROM topsources t
LEFT JOIN (
SELECT SOURCE_SLUG
FROM topsources t
WHERE CATEGORY_SLUG = '/newssource'
AND USER_ID = 'xxx'
GROUP BY .SOURCE_SLUG
) x USING (SOURCE_SLUG)
WHERE t.CATEGORY_SLUG = '/newssource'
AND x.SOURCE_SLUG IS NULL
GROUP BY t.SOURCE_SLUG
ORDER BY VIEW_COUNT DESC;
I have a large mysql-table with about 110.000.000 items
The Table Design is:
CREATE TABLE IF NOT EXISTS `tracksim` (
`tracksimID` int(11) NOT NULL AUTO_INCREMENT,
`trackID1` int(11) NOT NULL,
`trackID2` int(11) NOT NULL,
`sim` double NOT NULL,
PRIMARY KEY (`tracksimID`),
UNIQUE KEY `TrackID1` (`trackID1`,`trackID2`),
KEY `sim` (`sim`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Now I want to query a normal query:
SELECT trackID1, trackID2 FROM `tracksim`
WHERE sim > 0.5 AND
(`trackID1` = 168123 OR `trackID2`= 168123)
ORDER BY sim DESC LIMIT 0,100
The Explain statement gives me:
+----+-------------+----------+-------+---------------+------+---------+------+----------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+-------+---------------+------+---------+------+----------+----------+-------------+
| 1 | SIMPLE | tracksim | range | TrackID1,sim | sim | 8 | NULL | 19980582 | 100.00 | Using where |
+----+-------------+----------+-------+---------------+------+---------+------+----------+----------+-------------+
The query seems to be very slow(about 185 seconds), but i don't know if it is only because of the amount of items in the table. Do you have a tip how I can speedup the query or the table-lookup?
With 110 million records, I can't imagine there are many entries with the track ID in question. I would have indexes such as
(trackID1, sim )
(trackID2, sim )
(tracksimID, sim)
and do a PREQUERY via union and join against that result
select STRAIGHT_JOIN
TS2.*
from
( select ts.tracksimID
from tracksim ts
where ts.trackID1 = 168123
and ts.sim > 0.5
UNION
select ts.trackSimID
from tracksim ts
where ts.trackid2 = 168123
and ts.sim > 0.5
) PreQuery
JOIN TrackSim TS2
on PreQuery.TrackSimID = TS2.TrackSimID
order by
TS2.SIM DESC
LIMIT 0, 100
Mostly I agree with Drap, but the following variation of the query might be even more efficient, especially for larger LIMIT:
SELECT TS2.*
FROM (
SELECT tracksimID, sim
FROM tracksim
WHERE trackID1 = 168123
AND sim > 0.5
UNION
SELECT trackSimID, sim
FROM tracksim
WHERE trackid2 = 168123
AND ts.sim > 0.5
ORDER BY sim DESC
LIMIT 0, 100
) as PreQuery
JOIN TrackSim TS2 USING (TrackSimID);
Requires (trackID1, sim) and (trackID2, sim) indexes.
Try filtering your query so you don't return the full table. Alternatively you could try applying an index to the table on one of the track ID's, for example:
CREATE INDEX TRACK_INDEX
ON tracksim (trackID1)
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
http://www.tutorialspoint.com/mysql/mysql-indexes.htm