Slow execution of a subquery when there are no matches - MySQL

Please note that I have asked this question on dba.stackexchange.com, but I thought I'd post it here too:
In MySQL, I have two basic tables - Posts and Followers:
CREATE TABLE Posts (
  id int(11) NOT NULL AUTO_INCREMENT,
  posted int(11) NOT NULL,
  body varchar(512) NOT NULL,
  authorId int(11) NOT NULL,
  PRIMARY KEY (id),
  KEY posted (posted),
  KEY authorId (authorId, posted)
) ENGINE=InnoDB;

CREATE TABLE Followers (
  userId int(11) NOT NULL,
  followerId int(11) NOT NULL,
  PRIMARY KEY (userId, followerId),
  KEY followerId (followerId)
) ENGINE=InnoDB;
I have the following query, which seems to be optimized enough:
SELECT p.*
FROM Posts p
WHERE p.authorId IN (SELECT f.userId
                     FROM Followers f
                     WHERE f.followerId = 9
                     ORDER BY authorId)
ORDER BY posted
LIMIT 0, 20
EXPLAIN output:
+------+--------------------+-------+-----------------+--------------------+---------+---------+------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------------+-------+-----------------+--------------------+---------+---------+------------+------+--------------------------+
| 1 | PRIMARY | p | index | NULL | posted | 4 | NULL | 20 | Using where |
| 2 | DEPENDENT SUBQUERY | f | unique_subquery | PRIMARY,followerId | PRIMARY | 8 | func,const | 1 | Using index; Using where |
+------+--------------------+-------+-----------------+--------------------+---------+---------+------------+------+--------------------------+
When followerId is a valid id (meaning it actually exists in both tables), query execution is almost immediate. However, when the id is not present in the tables, the query only returns its (empty) result set after a 7-second delay.
Why is this happening? Is there some way to speed up this query for cases where there are no matches (without having to do a check ahead of time)?

Is there some way to speed up this query ...???
Yes. You should do two things.
First, you should use EXISTS instead of IN (cross-reference SQL Server IN vs. EXISTS Performance). It will speed up the cases where there is a match, which will matter as your data set grows. It may be fast enough now, but that doesn't mean you shouldn't follow best practices, and in this case EXISTS is a better practice than IN.
Second, you should modify the keys on your second table slightly. You were off to a good start with the compound key on (userId, followerId), but to optimize this particular query you need to keep in mind the "leftmost prefix" rule of MySQL indexes, e.g.
If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to find rows. http://dev.mysql.com/doc/refman/5.6/en/multiple-column-indexes.html
What your query execution plan from EXPLAIN is telling you is that MySQL thinks it makes more sense to join Followers to Posts (using the primary key on Posts) and filter the results for a given followerId off of that index. Think of it as saying "Show me all the possible matches, then reduce that down to just the ones that match followerId = {}".
If you replace your followerId key with a compound key (followerId,userId), you should be able to quickly zoom in to just the user ids associated with a given followerID and do the existence check against those.
I wish I knew how to explain this better... it's kind of a tough concept to grasp until you have an "Aha!" moment and it clicks. But if you look into the leftmost-prefix rules on indexes, and change the key on followerId to a key on (followerId, userId), I think it will speed things up quite a bit. And if you use EXISTS instead of IN, that will help you maintain that speed as your data set grows.
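Putting both suggestions together, a sketch might look like this (against the schema above; the index name followerId matches the original key, and as always you'd verify the plan with EXPLAIN):

```sql
-- Replace the single-column key so (followerId, userId) lookups are
-- covered by the leftmost prefix of a single index.
ALTER TABLE Followers
  DROP INDEX followerId,
  ADD INDEX followerId (followerId, userId);

-- Rewrite IN as a correlated EXISTS.
SELECT p.*
FROM Posts p
WHERE EXISTS (SELECT 1
              FROM Followers f
              WHERE f.followerId = 9
                AND f.userId = p.authorId)
ORDER BY p.posted
LIMIT 0, 20;
```

With the compound index, the no-match case can be rejected from the index alone without scanning Posts.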

Try this one:
SELECT p.*
FROM Posts p
INNER JOIN Followers f ON f.userId = p.authorId
WHERE f.followerId = 9
ORDER BY posted
LIMIT 0, 20


How to optimize performance of a JOIN query on a large table

I am using Server version: 5.5.28-log MySQL Community Server (GPL).
I have a big table, A, consisting of 279703655 records. I have to join this table with one of my changelog tables, B, and then insert the matching records into a new temporary table, C.
Table B has an index on the column type.
Table A consists of prod_id, his_id, and other columns. Table A has an index on both columns prod_id and history_id.
When I run the following query
INSERT INTO C (prod, his_id, comm)
SELECT DISTINCT a.product_id, a.history_id, comm
FROM B AS b
INNER JOIN A AS a ON a.his_id = b.his_id AND b.type = 'applications'
GROUP BY prod_id
ON DUPLICATE KEY UPDATE
  his_id = VALUES(his_id);
it takes 7 to 8 minutes to insert the records.
Even a simple count on table A takes 15 minutes.
I have also tried a procedure that inserts records in batches, but since the count query alone takes 15 minutes, it is even slower than before.
BEGIN
  DECLARE n INT DEFAULT 0;
  DECLARE i INT DEFAULT 0;
  SELECT COUNT(*) FROM A INTO n;
  SET i = 5000000;
  WHILE i < n DO
    INSERT INTO C (product_id, history_id, comments)
    SELECT a.product_id, a.history_id, a.comments
    FROM B AS b
    INNER JOIN (SELECT * FROM A LIMIT i, 5000000) AS a
      ON a.history_id = b.history_id;
    SET i = i + 5000000;
  END WHILE;
END
But the above code also takes 15 to 20 minutes to execute.
Please suggest how I can make this faster.
Below is EXPLAIN result:
+----+-------------+-------+--------+---------------+---------+---------+-----------------+--------------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+-----------------+--------------+-------------+
| 1 | SIMPLE | a | ALL | (NULL) | (NULL) | (NULL) | (NULL) | 279703655 | |
| 1 | SIMPLE | b | eq_ref | PRIMARY | PRIMARY | 8 | DB.a.history_id | 1 | Using index |
+----+-------------+-------+--------+---------------+---------+---------+-----------------+--------------+-------------+
(from Comment)
CREATE TABLE B (
  history_id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  history_hash char(32) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
  type enum('products','brands','partnames','mc_partnames','applications') NOT NULL,
  stamp timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (history_id),
  UNIQUE KEY history_hash (history_hash),
  KEY type (type),
  KEY stamp (stamp)
);
Let's first look at the tables.
What you call table B is really a history table. Its primary key is the history_id.
What you call table A is really a product table with one product per row and product_id its primary key. Each product also has a history_id. Thus you have created a 1:n relation. A product has one history row; one history row relates to multiple products.
You are selecting the product table rows that have an 'application' type history entry. This should be written as:
select product_id, history_id, comm
from product
where history_id in
(
  select history_id
  from history
  where type = 'applications'
);
(A join would work just as well, but isn't as clear. As there is only one history row per product, you can't get duplicates; both GROUP BY and DISTINCT are completely superfluous in your query and should be removed so as not to give the DBMS unnecessary work. But as mentioned: better not to join at all. If you want rows from table A, select from table A. If you want to look up rows in table B, look them up in the WHERE clause, where all criteria belong.)
Now, we would have to know how many rows may be affected. If only 1% of all history rows are 'applications', then an index should be used. Preferably
create index idx1 on history (type, history_id);
… which finds rows by type and gets their history_id right away.
If, say, 20% of all history rows are 'applications', then reading the table sequentially might be more efficient.
Then, how many product rows may we get? Even with a single history row, we might get millions of related product rows. Or vice versa, with millions of history rows we might get no product row at all. Again, we can provide an index, which may or may not be used by the DBMS:
create index idx2 on product (history_id, product_id, comm);
This is about as fast as it gets: two indexes offered and a properly written query without an unnecessary join. There were times when MySQL had performance problems with IN, and people rewrote the clause with EXISTS then. I don't think this is still necessary.
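For reference, the EXISTS rewrite mentioned would look roughly like this (a sketch, equivalent to the IN form above):

```sql
select product_id, history_id, comm
from product p
where exists
(
  select 1
  from history h
  where h.history_id = p.history_id
    and h.type = 'applications'
);
```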
As of MySQL 8.0.3, you can create histogram statistics for tables.
analyze history update histogram on type;
analyze product update histogram on history_id;
This is an important step to help the optimizer to find the optimal way to select the data.
Indexes needed (assuming it is history_id, not his_id):
B: INDEX(type, history_id) -- in this order. Note: "covering"
A: INDEX(history_id, product_id, comm)
What column or combination of columns provides the uniqueness constraint that IODKU needs?
Really, please provide SHOW CREATE TABLE.

Can a UNIQUE key be used as an index?

I have this table:
// votes
+----+---------+---------+
| id | user_id | post_id |
+----+---------+---------+
| 1 | 12345 | 12 |
| 2 | 12345 | 13 |
| 3 | 52344 | 12 |
+----+---------+---------+
Also this is a part of my query:
EXISTS (select 1 from votes v where u.id = v.user_id and p.id = v.post_id)
To make my query more efficient, I have added a composite index on user_id and post_id:
ALTER TABLE `votes` ADD INDEX `user_id,post_id` (`user_id`, `post_id`)
What's my question? I also want to prevent duplicate votes from one user on one post, so I have to create a unique index on user_id and post_id too. Now I want to know: should I create another index, or is just the unique index enough, in which case I should remove the previous one?
You do not need two indexes serving a similar purpose. Only one of them would be used during a select operation, and both would have to be maintained on insert, update and delete; these are unnecessary overheads. Go with the unique index, since it serves both purposes. A range scan is almost guaranteed when the uniquely indexed columns are used in a where clause.
EDIT:
The term used for the index does not matter. When you create an index, a B-tree structure is built over the column values. If all entries in the given column are unique anyway, a normal index will be the same size as a unique index and will give the same performance.
A primary key is also a unique index, with the exception that it does not allow null values; null values are permitted in a unique index.
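A sketch of that suggestion (assuming the earlier index was created under the name `user_id,post_id`, as in the question):

```sql
-- The unique constraint is itself an index on (user_id, post_id),
-- so the earlier non-unique composite index becomes redundant.
ALTER TABLE votes
  DROP INDEX `user_id,post_id`,
  ADD CONSTRAINT uc_votes UNIQUE (user_id, post_id);
```

The EXISTS lookup from the question can then be satisfied by the unique index alone.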
If you're trying to prevent multiple votes from the same user_id on the same post_id, then why don't you use a UNIQUE constraint?
ALTER TABLE votes
ADD CONSTRAINT uc_votes UNIQUE (user_id,post_id)
With regards to whether you should remove your index, review EXPLAIN output for the query plan execution paths and performance. I suspect it will be better to keep both, but that requires testing.
In MySQL:
A PRIMARY KEY is a UNIQUE key.
A UNIQUE key is an INDEX.
"index" and "key" are synonyms.

MySQL index for non-empty string query

I have the table posts with a column descr in my application, and I need to query posts where the description is not empty. The table has too many rows, so I need to add an index. What is the best way to add this index?
Table structure (simplified):
CREATE TABLE posts (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  title VARCHAR(255),
  descr VARCHAR(1024),
  PRIMARY KEY (id)
) engine=InnoDB DEFAULT CHARSET=utf8;
Example of query:
SELECT * FROM posts WHERE descr <> '';
I don't want to create an index on the whole descr column, because it would be a huge overhead.
I also know the variant of adding another column is_empty_descr BOOLEAN and indexing that. I will use this solution if no other variant is found.
I tried to add INDEX( descr(1) ), but I couldn't find a way to use it:
descr <> '' - index is not used
LEFT(descr, 1) = '' - index is not used
SUBSTR(descr, 0, 1) = '' - index is not used
descr LIKE 'a%' - index is used! But this is a totally different case
In all my examples I see something like this:
mysql> EXPLAIN SELECT * FROM posts WHERE descr <> '';
+------+---------------+------+---------+------+------+-------------+
| type | possible_keys | key | key_len | ref | rows | Extra |
+------+---------------+------+---------+------+------+-------------+
| ALL | descr_1 | NULL | NULL | NULL | 42 | Using where |
+------+---------------+------+---------+------+------+-------------+
(I omitted some result columns, because the table is too wide for this site.)
Even if I use FORCE INDEX (descr_1), the result is the same.
In MySQL, you can add an index on a prefix of a string column using this syntax:
create index idx_posts_descr1 on posts(descr(1));
You should test this to see if the index is used for that particular where clause, though.
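If upgrading is an option, MySQL 5.7+ generated columns give a variant of the is_empty_descr idea from the question without having to maintain the flag by hand (has_descr and idx_has_descr are made-up names; untested sketch):

```sql
-- The flag is computed by MySQL on every write and kept in the index.
ALTER TABLE posts
  ADD COLUMN has_descr TINYINT(1) AS (descr <> '') STORED,
  ADD INDEX idx_has_descr (has_descr);

-- The predicate is now a plain indexed equality.
SELECT * FROM posts WHERE has_descr = 1;
```

This only pays off when non-empty rows are a small fraction of the table; otherwise a full scan is cheaper anyway.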

How to handle a data set without a unique ID

I'm working on a data import routine from one source into another, and I've got one table that doesn't have its own unique identifier. Instead it uses a combination of four fields to determine the record to be modified. My source table structure is below:
feed_hcp_leasenote table:
BLDGID varchar(255),
LEASID varchar(255),
NOTEDATE varchar(255),
REF1 varchar(255),
NOTETEXT varchar(8000),
tempid int PRIMARY, AUTONUMBER
The first four are the fields which, when evaluated altogether, make a record unique in the source database. I'm importing this data into two tables, one for the note and another for the other fields. Here is my structure for the new database:
lease_note table:
lnid int PRIMARY AUTONUMBER,
notetext longtext,
lid int (lease ID, links to lease table)
customfield_data table (holds other data):
cfdid int PRIMARY AUTONUMBER,
data_date datetime,
data_smtext varchar(1000),
linkid int (links the data to its source ID)
cfid int (links the data to its field type)
The problem I'm running into: when I try to identify the records that exist in the source database without a match in the new database, my query seems to duplicate records to the point that it never finishes and locks up my server. I can successfully query based on BLDGID and LEASID and limit the query to the proper records, but when I JOIN the customfield_data table aliased for the NOTEDATE and REF1 fields, it starts to duplicate records exponentially. Here's my query:
SELECT NOTEDATE, REF1, REF2, LASTDATE, USERID, NOTETEXT, lid
FROM feed_hcp_leasenote
JOIN customfield_data mrileaseid ON feed_hcp_leasenote.LEASID = mrileaseid.data_smtext AND mrileaseid.cfid = 36
JOIN leases ON mrileaseid.linkid = leases.lid
JOIN suites ON leases.sid = suites.sid
JOIN floors ON suites.fid = floors.fid
JOIN customfield_data coid ON floors.bid = coid.linkid AND coid.cfid = 1 AND coid.data_smtext = feed_hcp_leasenote.BLDGID
JOIN customfield_data status ON leases.lid = status.linkid AND status.cfid = 27 AND status.data_smtext <> 'I'
WHERE tempid NOT IN (
SELECT tempid
FROM feed_hcp_leasenote
JOIN customfield_data mrileaseid ON feed_hcp_leasenote.LEASID = mrileaseid.data_smtext AND mrileaseid.cfid = 36
JOIN leases ON mrileaseid.linkid = leases.lid
JOIN suites ON leases.sid = suites.sid
JOIN floors ON suites.fid = floors.fid
JOIN customfield_data coid ON floors.bid = coid.linkid AND coid.data_smtext = feed_hcp_leasenote.BLDGID AND coid.cfid = 1
JOIN customfield_data notedate ON STR_TO_DATE(feed_hcp_leasenote.NOTEDATE, '%e-%b-%Y') = notedate.data_date AND notedate.cfid = 55
JOIN customfield_data ref1 ON feed_hcp_leasenote.REF1 = ref1.data_smtext AND ref1.cfid = 56
JOIN lease_notes ON leases.lid = lease_notes.lid AND notedate.linkid = lease_notes.lnid AND ref1.linkid = lease_notes.lnid )
At the moment, I've narrowed the problem down to the NOT IN subquery: running just that part crashes the server. I imagine the problem is that because there can be multiple notes sharing some of BLDGID, LEASID, NOTEDATE, and REF1 (but never all four), the query keeps joining back on itself, effectively creating a combinatorial explosion.
Short of modifying the source database to include a unique ID (which I can't do), does anyone see a solution to this? Thanks in advance!
(Edits based on feedback)
Sorry for the lack of information; I was worried about that. Basically I'm importing the data in feed_hcp_leasenote from a CSV file dumped from another database that I have no control over. I add a tempid field once the data is imported into my server, with the idea of using it in the SELECT ... WHERE tempid NOT IN query, though I'm not married to that approach.
My goal is to split the data in feed_hcp_leasenote into two tables: lease_note, which holds the primary record (with a unique ID) and the note itself, and customfield_data, which holds other data related to the record.
The source data feed consists of about 65,000 records, of which I'm importing about 25,000, since the remainder are connected to records that have been deactivated.
(2nd Edit)
Visual Schema of relevant tables: http://www.tentenstudios.com/clients/relynx/schema.png
EXPLAIN query:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY status ref data_smtext,linkid,cfid cfid 4 const 928 Using where
1 PRIMARY mrileaseid ref data_smtext,linkid,cfid linkid 5 rl_hpsi.status.linkid 19 Using where
1 PRIMARY leases eq_ref PRIMARY,sid PRIMARY 4 rl_hpsi.mrileaseid.linkid 1 Using where
1 PRIMARY suites eq_ref PRIMARY,fid PRIMARY 4 rl_hpsi.leases.sid 1
1 PRIMARY floors eq_ref PRIMARY,bid PRIMARY 4 rl_hpsi.suites.fid 1
1 PRIMARY feed_hcp_leasenote ref BLDGID,LEASID LEASID 768 rl_hpsi.mrileaseid.data_smtext 19 Using where
1 PRIMARY coid ref data_smtext,linkid,cfid data_smtext 1002 rl_hpsi.feed_hcp_leasenote.BLDGID 10 Using where
2 DEPENDENT SUBQUERY feed_hcp_leasenote eq_ref PRIMARY,BLDGID,LEASID PRIMARY 4 func 1
2 DEPENDENT SUBQUERY mrileaseid ref data_smtext,linkid,cfid data_smtext 1002 rl_hpsi.feed_hcp_leasenote.LEASID 10 Using where
2 DEPENDENT SUBQUERY leases eq_ref PRIMARY,sid PRIMARY 4 rl_hpsi.mrileaseid.linkid 1
2 DEPENDENT SUBQUERY suites eq_ref PRIMARY,fid PRIMARY 4 rl_hpsi.leases.sid 1
2 DEPENDENT SUBQUERY floors eq_ref PRIMARY,bid PRIMARY 4 rl_hpsi.suites.fid 1
2 DEPENDENT SUBQUERY ref1 ref data_smtext,linkid,cfid data_smtext 1002 rl_hpsi.feed_hcp_leasenote.REF1 10 Using where
2 DEPENDENT SUBQUERY lease_notes eq_ref PRIMARY PRIMARY 4 rl_hpsi.ref1.linkid 1 Using where
2 DEPENDENT SUBQUERY coid ref data_smtext,linkid,cfid data_smtext 1002 rl_hpsi.feed_hcp_leasenote.BLDGID 10 Using where
2 DEPENDENT SUBQUERY notedate ref linkid,cfid linkid 5 rl_hpsi.ref1.linkid 19 Using where
doesn't have it's own unique identifier. Instead it uses a combination of four fields to determine the record to be modified
No: if the four fields in combination constitute a unique key, then you have a unique identifier - just one with four parts.
BLDGID varchar(255),
LEASID varchar(255),
NOTEDATE varchar(255),
REF1 varchar(255),
NOTETEXT varchar(8000)
So you've no idea how the data is actually structured, or you got this from an MS Access programmer who doesn't know either.
SELECT NOTEDATE, REF1, REF2, LASTDATE, USERID, NOTETEXT, lid
FROM feed_hcp_leasenote
OMG. If that's the answer then you're asking the wrong questions.
Short of modifying the source database to include a unique ID (which I can't do) does anyone see a solution to this?
Find another job? Seriously. If you can't add a primary key to the import table, or can't import it into a temporary table with a primary key defined, then you will spend a stupid amount of time trying to fix this.
BTW: while InnoDB will handle keys up to 3072 bytes (1024 on 32-bit builds), this will continue to run like a dog until you reduce your column sizes or use a hash of the actual PK data as the primary key.
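A sketch of that hash idea (the column name row_hash and key name uq_row_hash are made up for illustration; untested):

```sql
-- 1. Add a fixed-width column for a digest of the four-part natural key.
ALTER TABLE feed_hcp_leasenote
  ADD COLUMN row_hash CHAR(32) NOT NULL DEFAULT '';

-- 2. Populate it at import time.
UPDATE feed_hcp_leasenote
SET row_hash = MD5(CONCAT_WS('|', BLDGID, LEASID, NOTEDATE, REF1));

-- 3. Only then enforce uniqueness; a 32-byte key is far cheaper to
--    compare and join on than four varchar(255) columns.
ALTER TABLE feed_hcp_leasenote
  ADD UNIQUE KEY uq_row_hash (row_hash);
```

The same digest, computed on the target side, gives you a single-column anti-join instead of the four-way match.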
It's not clear from your question how many rows you are adding or how many rows are already in the database. Nor have you provided the structure of the other tables, nor an EXPLAIN plan, which should be your starting point for any performance problem.
It might be possible to get this running a lot faster; it's impossible to say from the information you have provided. But given the ridiculous constraint that you have to make it faster without changing the schema, I wonder what other horrors await.
I did think that, without knowing the details of the current schema, it would be possible to break the current query down into several components and check each one, maintaining a score in the import table, then use the score to determine what had unmatched data. However, that requires schema changes too.
BTW, have a Google for the DISTINCT keyword in SQL.

Is there any gain from storing comments and replies in separate tables instead of a unary relation on a comments table?

Choice 1:
comments {commentid,replyto,comment}
//replyto will be null on many posts
Choice 2:
comments {commentid,comment}
replies {replyid, replyto, reply}
It looks like a matter of preference rather than a clear-cut benefit analysis at the moment.
The first option looks like the simple one, but the problem is that you're building a tree structure in SQL,
and SQL does not handle hierarchical data well.
Not recommended - ever
CREATE TABLE comment (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  reply_to INT UNSIGNED,
  comment TEXT,
  FOREIGN KEY FK_comment_reply_to (reply_to) REFERENCES comment (id)
    ON UPDATE CASCADE ON DELETE CASCADE
);
Recommended - if you want a tree 2 levels deep
If you build it using 2 tables
CREATE TABLE main_post (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  body TEXT
);

CREATE TABLE reply (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  reply_to INT UNSIGNED,
  body TEXT,
  FOREIGN KEY FK_reply_reply_to (reply_to) REFERENCES main_post (id)
    ON UPDATE CASCADE ON DELETE CASCADE
);
Then you are building a much simpler structure that can be easily queried in SQL because the tree is only 1 level deep.
For this reason I'd recommend choice number 2.
Alternatives for deeper trees
If you want a deeper hierarchical structure, I'd look at nested sets instead; see:
http://www.pure-performance.com/2009/03/managing-hierarchical-data-in-sql/
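For completeness, a minimal nested-set sketch (the table and column names are illustrative, following the usual lft/rgt convention; untested):

```sql
CREATE TABLE comment_tree (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  lft INT NOT NULL,   -- left boundary of this node's subtree
  rgt INT NOT NULL,   -- right boundary of this node's subtree
  comment TEXT
);

-- Comment 1 plus all of its descendants, however deep, in one query:
SELECT c.id, c.comment
FROM comment_tree p
JOIN comment_tree c ON c.lft BETWEEN p.lft AND p.rgt
WHERE p.id = 1;
```

The trade-off is that inserts must renumber lft/rgt values, so nested sets suit read-heavy trees.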
In fact this is not 'only' a matter of choice, but a deliberate decision. Relational databases are not good at solving problems of a hierarchical nature. There have been tons of discussions, articles, and even books about that, so let's narrow the problem to your case.
The second choice would work fine ONLY if you allow replies to comments, and not to replies themselves; that gives a tree with a maximum of 2 levels. That might be OK, but if you were to do that, a better solution would be to keep everything in the COMMENTS table and add two columns: THREAD_ID (all comments with the same THREAD_ID belong to the same thread) and SEQ_NUM (or simply a DATE telling us which comment came first). A similar way of organising comments is implemented here on SO.
The first choice is quite simple and generic, but it implements recursion with all its cons. Let's stop a bit and think... note that we are actually NOT building a tree, but a 'forest'. We will have many comment threads, and every single thread will be a separate tree: a relatively small amount of data to organise. In that case I would add a THREAD_ID column to the COMMENTS table and use only that table (it would also be good to set a composite index on the COMMENTS table containing the THREAD_ID and COMMENTID columns, in exactly that order).
So based on the above, I would choose "choice 1".
The next decision should be about where to do the processing and comment-tree construction. I would just get all the comments of a thread from the table and organise them on the controller (MVC) side, i.e. in Java or C++. Traversing the list of comments and building the tree in main memory (using objects and pointers, or hash tables) is an easy thing, and a good option because of the small number of nodes (comments and replies) within one thread.
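The schema change described above might look like this (thread_id and the index name are illustrative; untested):

```sql
ALTER TABLE comments
  ADD COLUMN thread_id INT NOT NULL,
  ADD INDEX idx_thread_comment (thread_id, commentid);

-- Fetch a whole thread in one indexed pass; the tree itself is then
-- rebuilt in application code from the replyto pointers.
SELECT commentid, replyto, comment
FROM comments
WHERE thread_id = 42
ORDER BY commentid;
```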
I would say it depends very much on what you're trying to achieve. From what I can understand, if you want a tree of at most 2 levels you should go with choice 2; if you want a deeper tree, go with choice 1 with the following modification:
Choice 1: comments {commentid, toplevelcommentid | thread | (whatever parent this comment, and possibly other comments, is/are linked to, so you can easily recreate the structure afterwards), replyto, comment}
When displaying results, select everything that has commentid or toplevelcommentid equal to a given value and order by commentid, so you can easily recreate the structural data with a single SELECT query.
1) Queries against the TEXT table were always 3 times slower than those against the VARCHAR table (averages: 0.10 seconds for the VARCHAR table, 0.29 seconds for the TEXT table). The difference is 100% repeatable.
CREATE TABLE varcharTable (a varchar(255) NOT NULL, PRIMARY KEY (a)) ENGINE=MyISAM;
CREATE TABLE T (a text NOT NULL, PRIMARY KEY (a(255))) ENGINE=MyISAM;
mysql> explain SELECT SQL_NO_CACHE count(*) from varcharTable where a LIKE "n%";
+----+-------------+--------------+-------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table        | type  | possible_keys | key     | key_len | ref  | rows | Extra                    |
+----+-------------+--------------+-------+---------------+---------+---------+------+------+--------------------------+
|  1 | SIMPLE      | varcharTable | range | PRIMARY       | PRIMARY | 257     | NULL | 5882 | Using where; Using index |
+----+-------------+--------------+-------+---------------+---------+---------+------+------+--------------------------+
1 row in set (0.00 sec)
mysql> explain SELECT SQL_NO_CACHE count(*) from T where a LIKE "n%";
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | T | range | PRIMARY | PRIMARY | 257 | NULL | 5882 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
1 row in set (0.00 sec)
The index is used as a covering index for the VARCHAR table ("Using index" in the Extra column) but not for the TEXT table, which must read the rows.
2) Search is not required on the comments table, so querying it is not needed, and since comments are long, TEXT is the preferred type.
And since it is TEXT, you cannot search on it efficiently. So put the comments (non-searchable, and hurting performance) and the replies in separate tables, so that the replies table performs well and the comments table is kept just for storage, with no searches performed on it.
Conclusion: put the comments in a separate table.