How to optimize the performance of a JOIN query on a large table - MySQL

I am using Server version: 5.5.28-log MySQL Community Server (GPL).
I have a big table, A, consisting of 279,703,655 records. I have to join this table with one of my changelog tables, B, and then insert the matching records into a new tmp table, C.
Table B has an index on column type.
Table A consists of prod_id, his_id and other columns. Table A has an index on both columns prod_id and history_id.
When I run the following query
INSERT INTO C(prod,his_id,comm)
SELECT DISTINCT a.product_id,a.history_id,comm
FROM B as b INNER JOIN A as a ON a.his_id = b.his_id AND b.type="applications"
GROUP BY prod_id
ON DUPLICATE KEY UPDATE
`his_id` = VALUES(`his_id`);
it takes 7 to 8 minutes to insert the records.
Even a simple count on table A takes 15 minutes to return.
I have also tried a procedure that inserts the records in LIMIT-sized batches, but because the count query takes 15 minutes, it is even slower than before.
BEGIN
  DECLARE n INT DEFAULT 0;
  DECLARE i INT DEFAULT 0;
  SELECT COUNT(*) FROM A INTO n;
  SET i = 0;
  WHILE i < n DO
    INSERT INTO C (product_id, history_id, comments)
    SELECT a.product_id, a.history_id, a.comments
    FROM B AS b
    INNER JOIN (SELECT * FROM A LIMIT i, 5000000) AS a
            ON a.history_id = b.history_id;
    SET i = i + 5000000;
  END WHILE;
END
But the above code also takes 15 to 20 minutes to execute.
Please suggest how I can make it faster.
Below is the EXPLAIN result:
+----+-------------+-------+--------+---------------+---------+---------+-----------------+--------------+-------------+
| id | select_type | table | type   | possible_keys | key     | key_len | ref             | rows         | Extra       |
+----+-------------+-------+--------+---------------+---------+---------+-----------------+--------------+-------------+
|  1 | SIMPLE      | a     | ALL    | (NULL)        | (NULL)  | (NULL)  | (NULL)          |    279703655 |             |
|  1 | SIMPLE      | b     | eq_ref | PRIMARY       | PRIMARY | 8       | DB.a.history_id |            1 | Using index |
+----+-------------+-------+--------+---------------+---------+---------+-----------------+--------------+-------------+
(from Comment)
CREATE TABLE B (
  history_id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  history_hash char(32) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
  type enum('products','brands','partnames','mc_partnames','applications') NOT NULL,
  stamp timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (history_id),
  UNIQUE KEY history_hash (history_hash),
  KEY type (type),
  KEY stamp (stamp)
);

Let's first look at the tables.
What you call table B is really a history table. Its primary key is the history_id.
What you call table A is really a product table with one product per row and product_id its primary key. Each product also has a history_id. Thus you have created a 1:n relation. A product has one history row; one history row relates to multiple products.
You are selecting the product table rows that have an 'applications' type history entry. This should be written as:
select product_id, history_id, comm
from product
where history_id in
(
  select history_id
  from history
  where type = 'applications'
);
(A join would work just as well, but isn't as clear. As there is only one history row per product, you can't get duplicates. Both GROUP BY and DISTINCT are completely superfluous in your query and should be removed in order not to give the DBMS unnecessary work to do. But as mentioned: better not to join at all. If you want rows from table A, select from table A. If you want to look up rows in table B, look them up in the WHERE clause, where all criteria belong.)
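If you need the complete statement, the rewritten SELECT drops straight into the original INSERT (a sketch that keeps the question's IODKU clause and target columns):
insert into C (prod, his_id, comm)
select product_id, history_id, comm
from product
where history_id in
(
  select history_id
  from history
  where type = 'applications'
)
on duplicate key update his_id = values(his_id);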
Now, we would have to know how many rows may be affected. If only 1% of all history rows are 'applications', then an index should be used. Preferably
create index idx1 on history (type, history_id);
… which finds rows by type and gets their history_id right away.
If, say, 20% of all history rows are 'applications', then reading the table sequentially might be more efficient.
Then, how many product rows may we get? Even with a single history row, we might get millions of related product rows. Or vice versa, with millions of history rows we might get no product row at all. Again, we can provide an index, which may or may not be used by the DBMS:
create index idx2 on product (history_id, product_id, comm);
This is about as fast as it gets: two indexes offered and a properly written query without an unnecessary join. There were times when MySQL had performance problems with IN; people rewrote the clause with EXISTS then. I don't think this is still necessary.
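For reference, this is what the historical EXISTS rewrite would look like here (a sketch; it is semantically identical to the IN version above):
select product_id, history_id, comm
from product p
where exists
(
  select 1
  from history h
  where h.history_id = p.history_id
    and h.type = 'applications'
);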
As of MySQL 8.0.3, you can create histogram statistics for tables.
analyze history update histogram on type;
analyze product update histogram on history_id;
This is an important step to help the optimizer to find the optimal way to select the data.

Indexes needed (assuming it is history_id, not his_id):
B: INDEX(type, history_id) -- in this order. Note: "covering"
A: INDEX(history_id, product_id, comm)
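As DDL, those suggestions would look something like this (a sketch using the question's table names; the index names are invented for illustration):
ALTER TABLE B ADD INDEX type_hist (type, history_id);
ALTER TABLE A ADD INDEX hist_prod_comm (history_id, product_id, comm);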
What column or combination of columns provides the uniqueness constraint that IODKU needs?
Really -- provide SHOW CREATE TABLE.

Related

MySQL: insert into select && where not exists

I need to insert some data into a table named 'queue', which is a patient queue for a particular date. Data for two fields will be inserted. The two fields are 'PatientID' and 'Visiting_date'. Table 'queue' looks like:
| QueueID | PatientID | Visiting_date |
|---------|-----------|---------------|
| 1       | 4         | Current date  |
table: queue
But while inserting a record there are two conditions:
Condition 1: patientID comes from the patient table (given below).
Condition 2: a record will only be inserted into the 'queue' table if it does not already exist, to prevent repetition; i.e. PatientID=4 will not be inserted again if it was already inserted.
| patientID | Patient Name | Contact no  |
|-----------|--------------|-------------|
| 4         | David        | 01245785874 |
table: patient
My SQL is: (it does not work)
INSERT INTO `queue`(`patientID`, `Visiting_date`)
SELECT patient.`patientID`, 'CURDATE()' FROM `patient`
WHERE NOT EXISTS (
  SELECT `patientID`, `visiting_date` FROM `queue`
  WHERE `patientID` = '4' AND `visting_date` = CURDATE()
) LIMIT 1;
You could set a foreign key to make sure the patient's id exists.
In the queue table you can set patientID as unique; this makes sure you can insert only unique ids into the queue table.
Also, if you would like to be able to insert the same patientID with different dates, you could specify a unique constraint over multiple columns in MySQL, as sketched below.
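In DDL form, that multi-column unique constraint would look like this (a sketch; the constraint name is invented; it allows the same patient on different dates while blocking duplicates within one date):
ALTER TABLE queue ADD CONSTRAINT uq_patient_date UNIQUE (patientID, Visiting_date);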
If you want to solve it with a MySQL query only, you can use this question.
I would use a separate query to check if there is a patient with that ID in the table:
SELECT * FROM queue WHERE PatientID = 4;
and then check the result of that query; if it returns a row, that means the patient is already queued and you don't do anything.
If the query doesn't return a row, you can then insert the patient, like this:
INSERT INTO queue (PatientID, Visiting_date) VALUES (4, CURDATE());
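For completeness, the check and the insert can also be combined into a single statement (a sketch based on the question's tables, using patient 4 and today's date as in the question):
INSERT INTO queue (patientID, Visiting_date)
SELECT p.patientID, CURDATE()
FROM patient p
WHERE p.patientID = 4
  AND NOT EXISTS (
    SELECT 1 FROM queue q
    WHERE q.patientID = p.patientID
      AND q.Visiting_date = CURDATE()
  );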

Can a unique key be used as an index?

I have this table:
// votes
+----+---------+---------+
| id | user_id | post_id |
+----+---------+---------+
|  1 |   12345 |      12 |
|  2 |   12345 |      13 |
|  3 |   52344 |      12 |
+----+---------+---------+
Also this is a part of my query:
EXISTS (select 1 from votes v where u.id = v.user_id and p.id = v.post_id)
To make my query more efficient, I have added a composite index on user_id and post_id:
ALTER TABLE `votes` ADD INDEX `user_id,post_id` (`user_id`, `post_id`);
What's my question? I also want to prevent duplicate votes from one user on one post, so I have to create a unique index on user_id and post_id too. Now I want to know: should I create another index, or is a unique index alone enough, meaning I should remove the previous one?
You do not need two indexes serving a similar purpose. Only one of them would be used during a select operation, and both would have to be modified on insert, update and delete. These are unnecessary overheads. Go with the unique index, since it serves both purposes. A range scan is almost guaranteed when using the uniquely indexed columns in a where clause.
EDIT :
The term used for the index does not matter. When you create an index, a B-tree structure is created, selecting a convenient root node and rearranging column values. If all entries in the given column are going to be unique, a normal index would be the same size as a unique index and would give the same performance.
A primary key is also a unique index, with the exception that it does not allow NULL values. NULL values are permitted in a unique index.
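A sketch of that change, assuming the index name from the ALTER TABLE in the question (the new index name user_post is invented for illustration):
ALTER TABLE votes DROP INDEX `user_id,post_id`;
ALTER TABLE votes ADD UNIQUE INDEX user_post (user_id, post_id);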
If you're trying to prevent multiple votes from the same user_id on the same post_id, then why don't you use a UNIQUE constraint?
ALTER TABLE votes
ADD CONSTRAINT uc_votes UNIQUE (user_id, post_id);
With regard to whether you should remove your index, you should review EXPLAIN concepts for query plan execution paths and performance. I suspect it will be better to keep them, but it will require testing.
In MySQL:
A PRIMARY KEY is a UNIQUE key.
A UNIQUE key is an INDEX.
"index" and "key" are synonyms.

How do I insert a column of values from one table to another, non-matching schemas?

I have two tables:
Table A: lastName,firstName,clientExtension
Table B: ~45 columns; however, lastName, firstName, and clientExtension are also in this table. The data types for these three columns match in each table: lastName VARCHAR(150), firstName VARCHAR(150), clientExtension INT(5) unsigned.
Table A has 31 rows, no NULL values. The records in Table A are already in Table B, but my objective is to update the clientExtension value in Table B to be the clientExtension value from Table A for each agent.
This is what I have tried so far, with no luck:
INSERT INTO table_A (lastName, firstName, clientExtension)
SELECT clientExtension
FROM tableB AS tb
WHERE lastName=tb.lastName
AND firstName=tb.firstName;
I've also tried using an UPDATE statement, but I can't seem to get it to work. It feels like what I'm trying to do is an INNER JOIN, except I'm not looking to create a new table with the output of the INNER JOIN; I'm looking to update existing records in Table B with the clientExtension values from Table A.
Any ideas?
This schema needs some help before you have more than a few dozen rows in those tables. If that is really your schema, then you have some problems when names change. It will take a few minutes to show a better approach, bear with me.
Then I will show the update/join pattern if you don't have it yet (on the better schema).
create table tableA
( -- assuming this is really a user table
  id int auto_increment primary key, -- if you don't have this, you are going to have problems
  firstName varchar(150) not null,
  lastName varchar(150) not null,
  clientExtension int not null -- sign, display width of no concern
);
insert tableA (firstName,lastName,clientExtension) values ('f1','l1',777),('f2','l2',888);
create table tableB
( -- assuming this is really a user table
  id int auto_increment primary key, -- if you don't have this, you are going to have problems
  firstName varchar(150) not null,
  lastName varchar(150) not null,
  clientExtension int not null
);
insert tableB (firstName,lastName,clientExtension) values ('f1','l1',0),('f2','l2',0);
update tableB b
join tableA a
on a.id=b.id
set b.clientExtension=a.clientExtension;
select * from tableA;
(same as below)
select * from tableB;
+----+-----------+----------+-----------------+
| id | firstName | lastName | clientExtension |
+----+-----------+----------+-----------------+
|  1 | f1        | l1       |             777 |
|  2 | f2        | l2       |             888 |
+----+-----------+----------+-----------------+
The long and short of it is that if you join on names that change in one table and not the other, you have problems. That is why you need a primary key that won't change (as opposed to, say, when Bob becomes Robert again).
Also, if your tables are not user tables, a PK of an int id is just as important. The id is used in other tables, without the de-normalized idea of dragging firstName and lastName over as keys into those non-user entity tables, if you will.
What do I mean by non-user entity tables? Well, I kinda just made that up, first phrase that came to my head. It is about data normalization and concepts like 2nd and 3rd Normal Form. Let's say you have a paystub table. A row needs to be identified by PayeeId (that is your user id from the above tables) plus other info such as pay period, etc. A horrible way of identifying the payee would be by first and last name.
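To make that concrete, here is a hypothetical paystub table keyed by the user id rather than by names (a sketch only; the table and its columns are invented for illustration):
create table paystub
(
  id int auto_increment primary key,
  payeeId int not null, -- references tableA.id (the user), never the name
  payPeriodStart date not null,
  payPeriodEnd date not null,
  amount decimal(10,2) not null,
  foreign key (payeeId) references tableA(id)
);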
Plan B: (I hold my nose at doing this, but here it is)
update tableB b
join tableA a
on a.firstName=b.firstName and a.lastName=b.lastName
set b.clientExtension=a.clientExtension;
-- 2 row(s) affected

Slow execution of a subquery when no matches

Please note that I have asked this question on dba.stackexchange.com, but I thought I'd post it here too:
In MySQL, I have two basic tables - Posts and Followers:
CREATE TABLE Posts (
  id int(11) NOT NULL AUTO_INCREMENT,
  posted int(11) NOT NULL,
  body varchar(512) NOT NULL,
  authorId int(11) NOT NULL,
  PRIMARY KEY (id),
  KEY posted (posted),
  KEY authorId (authorId,posted)
) ENGINE=InnoDB;
CREATE TABLE Followers (
  userId int(11) NOT NULL,
  followerId int(11) NOT NULL,
  PRIMARY KEY (userId,followerId),
  KEY followerId (followerId)
) ENGINE=InnoDB;
I have the following query, which seems to be optimized enough:
SELECT p.*
FROM Posts p
WHERE p.authorId IN (SELECT f.userId
FROM Followers f
WHERE f.followerId = 9
ORDER BY authorId)
ORDER BY posted
LIMIT 0, 20
EXPLAIN output:
+------+--------------------+-------+-----------------+--------------------+---------+---------+------------+------+--------------------------+
| id   | select_type        | table | type            | possible_keys      | key     | key_len | ref        | rows | Extra                    |
+------+--------------------+-------+-----------------+--------------------+---------+---------+------------+------+--------------------------+
|    1 | PRIMARY            | p     | index           | NULL               | posted  | 4       | NULL       |   20 | Using where              |
|    2 | DEPENDENT SUBQUERY | f     | unique_subquery | PRIMARY,followerId | PRIMARY | 8       | func,const |    1 | Using index; Using where |
+------+--------------------+-------+-----------------+--------------------+---------+---------+------------+------+--------------------------+
When followerId is a valid id (meaning, it actually exists in both tables), the query execution is almost immediate. However, when the id is not present in the tables, the query only returns results (empty set) after a 7 second delay.
Why is this happening? Is there some way to speed up this query for cases where there are no matches (without having to do a check ahead of time)?
Is there some way to speed up this query ...???
Yes. You should do two things.
First, you should use EXISTS instead of IN (cross-reference SQL Server IN vs. EXISTS Performance). It'll speed up the instances where there is a match, which will come in handy as your data set grows. (It may be fast enough now, but that doesn't mean you shouldn't follow best practices, and in this case EXISTS is a better practice than IN.)
Second, you should modify the keys on your second table just a little bit. You were off to a good start using the compound key on (userId,followerId), but in terms of optimizing this particular query, you need to keep in mind the "leftmost prefix" rule of MySQL indexes, e.g.
If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to find rows. (http://dev.mysql.com/doc/refman/5.6/en/multiple-column-indexes.html)
What your query execution plan from EXPLAIN is telling you is that MySQL thinks it makes more sense to join Followers to Posts (using the primary key on Posts) and filter the results for a given followerId off of that index. Think of it like saying "show me all the possible matches, then reduce that down to just the ones that match followerId = {}".
If you replace your followerId key with a compound key (followerId,userId), you should be able to quickly zoom in to just the user ids associated with a given followerID and do the existence check against those.
I wish I knew how to explain this better... it's kind of a tough concept to grasp until you have an "Aha!" moment and it clicks. But if you look into the leftmost prefix rules on indexes, and also change the key on followerId to be a key on (followerId,userId), I think it'll speed it up quite a bit. And if you use EXISTS instead of IN, that'll help you maintain that speed even as your data set grows.
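Putting both suggestions together, a sketch (the key name followerId comes from the CREATE TABLE above; the rest follows the question's schema):
ALTER TABLE Followers DROP INDEX followerId,
                      ADD INDEX followerId (followerId, userId);

SELECT p.*
FROM Posts p
WHERE EXISTS (SELECT 1
              FROM Followers f
              WHERE f.followerId = 9
                AND f.userId = p.authorId)
ORDER BY posted
LIMIT 0, 20;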
Try this one:
SELECT p.*
FROM Posts p
inner join Followers f On f.userId = p.authorId
WHERE f.followerId = 9
ORDER BY posted
LIMIT 0, 20

How to handle data set without unique ID

I'm working on a data import routine from one source into another, and I've got one table that doesn't have its own unique identifier. Instead it uses a combination of four fields to determine the record to be modified. My source table structure is below:
feed_hcp_leasenote table:
BLDGID varchar(255),
LEASID varchar(255),
NOTEDATE varchar(255),
REF1 varchar(255),
NOTETEXT varchar(8000),
tempid int PRIMARY, AUTONUMBER
The first four are the fields which, when evaluated altogether, make a record unique in the source database. I'm importing this data into two tables, one for the note and another for the other fields. Here is my structure for the new database:
lease_note table:
lnid int PRIMARY AUTONUMBER,
notetext longtext,
lid int (lease ID, links to lease table)
customfield_data table (holds other data):
cfdid int PRIMARY AUTONUMBER,
data_date datetime,
data_smtext varchar(1000),
linkid int (links the data to its source ID)
cfid int (links the data to its field type)
The problem I'm running into: when I try to identify the records that exist in the source database without a match in the new database, my query seems to duplicate records to the point that it never finishes and locks up my server. I can successfully query based on BLDGID and LEASID and limit the query to the proper records, but when I JOIN the customfield_data table aliased for the NOTEDATE and REF1 fields, it starts to duplicate records exponentially. Here's my query:
SELECT NOTEDATE, REF1, REF2, LASTDATE, USERID, NOTETEXT, lid
FROM feed_hcp_leasenote
JOIN customfield_data mrileaseid ON feed_hcp_leasenote.LEASID = mrileaseid.data_smtext AND mrileaseid.cfid = 36
JOIN leases ON mrileaseid.linkid = leases.lid
JOIN suites ON leases.sid = suites.sid
JOIN floors ON suites.fid = floors.fid
JOIN customfield_data coid ON floors.bid = coid.linkid AND coid.cfid = 1 AND coid.data_smtext = feed_hcp_leasenote.BLDGID
JOIN customfield_data status ON leases.lid = status.linkid AND status.cfid = 27 AND status.data_smtext <> 'I'
WHERE tempid NOT IN (
SELECT tempid
FROM feed_hcp_leasenote
JOIN customfield_data mrileaseid ON feed_hcp_leasenote.LEASID = mrileaseid.data_smtext AND mrileaseid.cfid = 36
JOIN leases ON mrileaseid.linkid = leases.lid
JOIN suites ON leases.sid = suites.sid
JOIN floors ON suites.fid = floors.fid
JOIN customfield_data coid ON floors.bid = coid.linkid AND coid.data_smtext = feed_hcp_leasenote.BLDGID AND coid.cfid = 1
JOIN customfield_data notedate ON STR_TO_DATE(feed_hcp_leasenote.NOTEDATE, '%e-%b-%Y') = notedate.data_date AND notedate.cfid = 55
JOIN customfield_data ref1 ON feed_hcp_leasenote.REF1 = ref1.data_smtext AND ref1.cfid = 56
JOIN lease_notes ON leases.lid = lease_notes.lid AND notedate.linkid = lease_notes.lnid AND ref1.linkid = lease_notes.lnid )
At the moment, I've narrowed the problem down to the NOT IN subquery -- running just that part crashes the server. I imagine the problem is that because there can be multiple notes with the same BLDGID, LEASID, NOTEDATE, and REF1 (but not all 4), the query keeps selecting back on itself and effectively creating an infinite loop.
Short of modifying the source database to include a unique ID (which I can't do) does anyone see a solution to this? Thanks in advance!
(Edits based on feedback)
Sorry for the lack of information, I was worried about that. Basically I'm importing the data in feed_hcp_leasenote from a CSV file dumped from another database that I have no control over. I add a tempid field once the data is imported into my server with the idea of using it in the SELECT WHERE tempid NOT IN query, though I'm not married to that approach.
My goal is to split the data in feed_hcp_leasenote into two tables: lease_note which holds the primary record (with a unique ID) and the note itself and; customfield_data which holds other data related to the record.
The source data feed consists of about 65,000 records, of which I'm importing about 25,000 since the remainder are connected to records that have been deactivated.
(2nd Edit)
Visual Schema of relevant tables: http://www.tentenstudios.com/clients/relynx/schema.png
EXPLAIN query:
+----+--------------------+--------------------+--------+-------------------------+-------------+---------+-----------------------------------+------+-------------+
| id | select_type        | table              | type   | possible_keys           | key         | key_len | ref                               | rows | Extra       |
+----+--------------------+--------------------+--------+-------------------------+-------------+---------+-----------------------------------+------+-------------+
|  1 | PRIMARY            | status             | ref    | data_smtext,linkid,cfid | cfid        | 4       | const                             |  928 | Using where |
|  1 | PRIMARY            | mrileaseid         | ref    | data_smtext,linkid,cfid | linkid      | 5       | rl_hpsi.status.linkid             |   19 | Using where |
|  1 | PRIMARY            | leases             | eq_ref | PRIMARY,sid             | PRIMARY     | 4       | rl_hpsi.mrileaseid.linkid         |    1 | Using where |
|  1 | PRIMARY            | suites             | eq_ref | PRIMARY,fid             | PRIMARY     | 4       | rl_hpsi.leases.sid                |    1 |             |
|  1 | PRIMARY            | floors             | eq_ref | PRIMARY,bid             | PRIMARY     | 4       | rl_hpsi.suites.fid                |    1 |             |
|  1 | PRIMARY            | feed_hcp_leasenote | ref    | BLDGID,LEASID           | LEASID      | 768     | rl_hpsi.mrileaseid.data_smtext    |   19 | Using where |
|  1 | PRIMARY            | coid               | ref    | data_smtext,linkid,cfid | data_smtext | 1002    | rl_hpsi.feed_hcp_leasenote.BLDGID |   10 | Using where |
|  2 | DEPENDENT SUBQUERY | feed_hcp_leasenote | eq_ref | PRIMARY,BLDGID,LEASID   | PRIMARY     | 4       | func                              |    1 |             |
|  2 | DEPENDENT SUBQUERY | mrileaseid         | ref    | data_smtext,linkid,cfid | data_smtext | 1002    | rl_hpsi.feed_hcp_leasenote.LEASID |   10 | Using where |
|  2 | DEPENDENT SUBQUERY | leases             | eq_ref | PRIMARY,sid             | PRIMARY     | 4       | rl_hpsi.mrileaseid.linkid         |    1 |             |
|  2 | DEPENDENT SUBQUERY | suites             | eq_ref | PRIMARY,fid             | PRIMARY     | 4       | rl_hpsi.leases.sid                |    1 |             |
|  2 | DEPENDENT SUBQUERY | floors             | eq_ref | PRIMARY,bid             | PRIMARY     | 4       | rl_hpsi.suites.fid                |    1 |             |
|  2 | DEPENDENT SUBQUERY | ref1               | ref    | data_smtext,linkid,cfid | data_smtext | 1002    | rl_hpsi.feed_hcp_leasenote.REF1   |   10 | Using where |
|  2 | DEPENDENT SUBQUERY | lease_notes        | eq_ref | PRIMARY                 | PRIMARY     | 4       | rl_hpsi.ref1.linkid               |    1 | Using where |
|  2 | DEPENDENT SUBQUERY | coid               | ref    | data_smtext,linkid,cfid | data_smtext | 1002    | rl_hpsi.feed_hcp_leasenote.BLDGID |   10 | Using where |
|  2 | DEPENDENT SUBQUERY | notedate           | ref    | linkid,cfid             | linkid      | 5       | rl_hpsi.ref1.linkid               |   19 | Using where |
+----+--------------------+--------------------+--------+-------------------------+-------------+---------+-----------------------------------+------+-------------+
doesn't have its own unique identifier. Instead it uses a combination of four fields to determine the record to be modified
No: if the four fields in combination constitute a unique key, then you have a unique identifier - just one with four parts.
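Declared as DDL on the import table, that four-part identifier would be (a sketch; the index name natural_key is invented, and note the caveat below about key width on these varchar(255) columns):
ALTER TABLE feed_hcp_leasenote
  ADD UNIQUE KEY natural_key (BLDGID, LEASID, NOTEDATE, REF1);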
BLDGID varchar(255),
LEASID varchar(255),
NOTEDATE varchar(255),
REF1 varchar(255),
NOTETEXT varchar(8000)
So you've no idea how the data is actually structured, or you got this from an MS Access programmer who doesn't know either.
SELECT NOTEDATE, REF1, REF2, LASTDATE, USERID, NOTETEXT, lid
FROM feed_hcp_leasenote
OMG. If that's the answer then you're asking the wrong questions.
Short of modifying the source database to include a unique ID (which I can't do) does anyone see a solution to this?
Find another job? Seriously. If you can't add a primary key to the import table / can't import it into a temporary table with a primary key defined, then you will spend a stupid amount of time trying to fix this.
BTW: While InnoDB will handle keys up to 3072 bytes (1024 on 32-bit), this will continue to run like a dog until you reduce your column sizes or use a hash of the actual PK data as the primary key.
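A sketch of that hash idea (the nk_hash column and index names are invented for illustration; MD5 gives a fixed-width 32-character key over the four natural-key columns):
ALTER TABLE feed_hcp_leasenote ADD COLUMN nk_hash CHAR(32) NULL;

UPDATE feed_hcp_leasenote
SET nk_hash = MD5(CONCAT_WS('|', BLDGID, LEASID, NOTEDATE, REF1));

ALTER TABLE feed_hcp_leasenote ADD UNIQUE KEY nk_hash (nk_hash);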
It's not clear from your question how many rows you are adding / how many rows are already in the database. Nor have you provided the structure of the other tables. Nor have you provided an explain plan which should be your starting point for any performance problems.
It might be possible to get this running a lot faster - it's impossible to say from the information you have provided. But given the ridiculous constraint that you have to make it faster without changing the schema, I wonder what other horrors await.
I did think that, without knowing the details of the current schema, it would be possible to break the current query down into several components and check each one, maintaining a score in the import table - then use the score to determine what had unmatched data - however, this requires schema changes too.
BTW, have a google for the DISTINCT keyword in SQL.