I am trying to normalize my database. I have broken out all redundant data and am now joining and inserting the new data. I have been porting 1 million rows at a time, and that worked well up until now. Now a million rows takes days instead of minutes, and the process seems to be stuck reading many millions of rows without ever getting to the inserting part.
I have this query:
INSERT IGNORE INTO bbointra_normalized.entry
  (DATE, keyword, url, POSITION, competition, searchEngine)
SELECT DATE(insDate) AS DATE, k.id AS kid, u.id AS uid, POSITION, competition, s.id AS sid
FROM oldSingleTabels.tempData
INNER JOIN bbointra_normalized.keyword k ON tempData.keyword = k.keyword
INNER JOIN bbointra_normalized.searchEngine s ON tempData.searchEngine = s.searchEngine
INNER JOIN bbointra_normalized.urlHash u ON tempData.url = u.url
GROUP BY k.id, s.id, u.id
ORDER BY k.id, s.id, u.id
EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
------ ----------- -------- ------ -------------------------------------------- ------------ ------- ---------------------------- ------ ----------------------------------------------
1 SIMPLE s index (NULL) searchEngine 42 (NULL) 539 Using index; Using temporary; Using filesort
1 SIMPLE k index (NULL) keyword 42 (NULL) 17652 Using index; Using join buffer
1 SIMPLE tempData ref keyword_url_insDate,keyword,searchEngine,url keyword 767 func 433 Using where
1 SIMPLE u ref url url 767 oldSingleTabels.tempData.url 1 Using index
SHOW INNODB STATUS:
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
1 read views open inside InnoDB
Main thread process no. 4245, id 140024097179392, state: waiting for server activity
Number of rows inserted 26193732, updated 0, deleted 0, read 3383512394
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 39676.56 reads/s
SQL for entry:
CREATE TABLE `entry` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`insDate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`date` int(11) NOT NULL,
`project` int(11) NOT NULL,
`keyword` int(11) NOT NULL,
`url` int(11) NOT NULL,
`position` int(11) NOT NULL,
`competition` int(11) NOT NULL,
`serachEngine` int(11) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `unikt` (`date`,`keyword`,`position`,`serachEngine`)
) ENGINE=InnoDB AUTO_INCREMENT=201 DEFAULT CHARSET=utf8 COLLATE=utf8_swedish_ci;
Try removing the GROUP BY and ORDER BY clauses; they are heavy to process and do not seem to add any value here.
If there are indexes on table bbointra_normalized.entry, try removing them temporarily, since updating the indexes is a heavy process when inserting many rows.
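A sketch of that idea, using the key name from the CREATE TABLE shown above (note the column name `serachEngine` is copied as spelled in the schema). One caveat worth stating loudly: while the unique key is gone, INSERT IGNORE no longer suppresses duplicate rows, so only do this if the source data is already deduplicated.

```sql
-- Sketch only: drop the secondary unique key before the bulk load, re-add it after.
ALTER TABLE bbointra_normalized.entry DROP KEY `unikt`;

-- ... run the big INSERT ... SELECT here ...

ALTER TABLE bbointra_normalized.entry
  ADD UNIQUE KEY `unikt` (`date`,`keyword`,`position`,`serachEngine`);
```

Re-adding the key rebuilds it in one pass instead of maintaining it row by row during the insert.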
On every INSERT/UPDATE, MySQL updates the indexes of your tables, and this is slow.
If you are doing a massive INSERT/UPDATE, you should disable the keys so the indexes are recalculated only once, instead of for each inserted/updated row.
Here is how:
SET FOREIGN_KEY_CHECKS=0;
-- Your INSERT/UPDATE statement here
SET FOREIGN_KEY_CHECKS=1;
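For an InnoDB table like `entry`, a few more session settings are commonly toggled around a bulk load. This is a sketch, not a drop-in: disabling `unique_checks` trusts the source data to contain no duplicates, and batching with `autocommit = 0` means the whole load rolls back together on failure.

```sql
-- Session-level settings for a bulk InnoDB load
SET unique_checks = 0;
SET foreign_key_checks = 0;
SET autocommit = 0;

-- Your INSERT ... SELECT here

COMMIT;
SET unique_checks = 1;
SET foreign_key_checks = 1;
SET autocommit = 1;
```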
You need to add some indexes: every field you are joining on needs its own index.
ALTER TABLE `entry` ADD KEY (`keyword`);
ALTER TABLE `entry` ADD KEY (`searchEngine`);
ALTER TABLE `entry` ADD KEY (`urlHash`);
It looks very much like the first one of these is the one needed the most.
As many people pointed out, since this was a read problem, I broke the SELECT query down and tested it minus one join at a time. I was expecting the huge URL table/keys to be the problem, but soon found that the main problem was a corrupt table/index on the keyword table. I don't know how this could happen, but after dropping and recreating that table, things magically started to work just fine again.
I later took @abasterfield's advice and added 3 more indexes to the entry table, which sped up the select even more.
Related
In MySQL, I have two innodb tables, a small mission critical table, that needs to be readily available at all times for reads/writes. Call this mission_critical. I have a larger table (>10s of millions of rows), called big_table. I need to update big_table, for instance:
update mission_critical c, big_table b
set b.something = c.something_else
where b.refID = c.id
This query could take more than an hour, but this creates a write-lock on the mission_critical table. Is there a way I can tell mysql, "I don't want a lock on mission_critical" so that that table can be written to?
I understand that this is not ideal from a transactional point of view. The only workaround I can think of right now is to make a copy of the small mission_critical table and do the update from that (which I don't care gets locked), but I'd rather not do that if there's a way to make MySQL natively deal with this more gracefully.
It is not the whole table that is locked, but all of the records in mission_critical, since they are basically all scanned by the update. I am not assuming this; the symptom is that when a user logs in to an online system, it tries to update a datetime column in mission_critical with the last time they logged in. These queries die with a "Lock wait timeout exceeded" error while the query above is running. If I kill the query above, all pending queries run immediately.
mission_critical.id and big_table.refID are both indexed.
The pertinent portions of the creation statements for each table are:
mission_critical:
CREATE TABLE `mission_critical` (
`intID` int(11) NOT NULL AUTO_INCREMENT,
`id` smallint(6) DEFAULT NULL,
`something_else` varchar(50) NOT NULL,
`lastLoginDate` datetime DEFAULT NULL,
PRIMARY KEY (`intID`),
UNIQUE KEY `id` (`id`),
`UNIQUE KEY `something_else` (`something_else`)
) ENGINE=InnoDB AUTO_INCREMENT=1432 DEFAULT CHARSET=latin1
big_table:
CREATE TABLE `big_table` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`postDate` date DEFAULT NULL,
`postTime` int(11) DEFAULT NULL,
`refID` smallint(6) DEFAULT NULL,
`something` varchar(50) NOT NULL,
`change` decimal(20,2) NOT NULL,
PRIMARY KEY (`id`),
KEY `refID` (`refID`),
KEY `postDate` (`postDate`)
) ENGINE=InnoDB AUTO_INCREMENT=138139125 DEFAULT CHARSET=latin1
The EXPLAIN output of the query is:
+----+-------------+------------------+------------+------+---------------+-------+---------+------------------------------------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------------+------------+------+---------------+-------+---------+------------------------------------+------+----------+-------------+
| 1 | SIMPLE | mission_critical | | ALL | id | | | | 774 | 100 | Using where |
| 1 | UPDATE | big_table | | ref | refID | refID | 3 | db.mission_critical.something_else | 7475 | 100 | |
+----+-------------+------------------+------------+------+---------------+-------+---------+------------------------------------+------+----------+-------------+
I first suggested a workaround with a subquery, to create a copy in an internal temporary table. But in my test the small table was still locked for writes. So I guess your best bet is to make a copy manually.
The reason for the lock is described in this bug report: https://bugs.mysql.com/bug.php?id=72005
This is what Sinisa Milivojevic wrote in an answer:
update table t1,t2 ....
any UPDATE with a join is considered a multiple-table update. In that
case, a referenced table has to be read-locked, because rows must not
be changed in the referenced table during UPDATE until it has
finished. There can not be concurrent changes of the rows, nor DELETE
of the rows, nor, much less, exempli gratia any DDL on the referenced
table. The goal is simple, which is to have all tables with consistent
contents when UPDATE finishes, particularly since multiple-table
UPDATE can be executed with several passes.
In short, this behavior is for a good reason.
Consider writing INSERT and UPDATE triggers, which will update the big_table on the fly. That would delay writes on the mission_critical table. But it might be fast enough for you, and wouldn't need the mass-update-query any more.
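A hedged sketch of that trigger idea, using the column names from the CREATE TABLE statements above (the trigger name is made up; you would need a matching AFTER INSERT trigger as well):

```sql
DELIMITER //

-- Sketch: whenever something_else changes on a mission_critical row,
-- propagate it to the matching big_table rows immediately.
CREATE TRIGGER mission_critical_au
AFTER UPDATE ON mission_critical
FOR EACH ROW
BEGIN
  -- NULL-safe comparison so NULL -> value transitions also propagate
  IF NOT (NEW.something_else <=> OLD.something_else) THEN
    UPDATE big_table
    SET something = NEW.something_else
    WHERE refID = NEW.id;
  END IF;
END//

DELIMITER ;
```

This shifts the cost onto each small write to mission_critical instead of one giant batch update.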
Also check whether it wouldn't be better to use char(50) instead of varchar(50). I'm not sure, but it's possible that this will improve the update performance because the row size wouldn't need to change. I was able to improve the update performance by about 50% in a test.
UPDATE will lock the rows that it needs to change. It may also lock the "gaps" after those rows.
You can run the update in a loop of small MySQL transactions, updating only 100 rows at a time:
BEGIN;
SELECT ... FOR UPDATE; -- arrange to have this select include the 100 rows
UPDATE ...; -- update the 100 rows
COMMIT;
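One way to sketch that loop in plain SQL (the id bounds are hypothetical; advance the range on each pass). A ranged WHERE on the primary key is used because multi-table UPDATE in MySQL does not support LIMIT:

```sql
-- Sketch: walk big_table in primary-key chunks, committing after each one
-- so locks on mission_critical rows are held only briefly.
BEGIN;
UPDATE mission_critical c
  JOIN big_table b ON b.refID = c.id
SET b.something = c.something_else
WHERE b.id BETWEEN 1 AND 100;   -- next pass: BETWEEN 101 AND 200, and so on
COMMIT;
```

Between commits, the pending login-time updates on mission_critical get a chance to run instead of hitting the lock wait timeout.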
It may be worth trying a correlated subquery to see if the optimiser comes up with a different plan, but performance may be worse.
update big_table b
set b.something = (select c.something_else from mission_critical c where b.refID = c.id)
I have a MySQL table Foo which has a primary key on id and 2 other non-primary indexes on different columns.
Fiddle "select" example
My actual table contains many millions of rows, so the behaviour of the EXPLAIN is different, i.e. it uses an index merge on the 2 non-primary indexes.
When I run the following Explain Update statement:
explain UPDATE Foo
SET readyState = 1
WHERE readyState = 0
AND productName = 'OpenAM'
LIMIT 30;
The Extra column contains "Using Temporary".
When I run the equivalent explain Select statement:
Explain Select id, productName, readyState
FROM Foo
WHERE readyState = 0
AND productName = 'OpenAM'
Limit 30;
The Extra column does not contain "Using Temporary".
The effect of this on my actual table is that when I update, a temporary table is created with several million rows, since they all match the conditions of the update before the LIMIT 30 kicks in. The update takes 4-5 seconds, whereas the select only takes ~0.001s, as it does not create the temp table of the merged index. I understand that the update will also need to update all 3 indexes (primary + 2 non-primary used in the query), but I would be shocked if it took 4 seconds to update 30 index rows in 3 indexes.
QUESTION: Is there a way to force the update not to use the unnecessary temporary table? I was under the impression that MySQL treated an UPDATE query plan the same way as a SELECT.
If the temp table is required for Update and not for Select, why?
EDIT:
Show Create Table (removed a heap of columns since it is a very wide table):
CREATE TABLE Item (
ID int(11) NOT NULL AUTO_INCREMENT,
ImportId int(11) NOT NULL,
MerchantCategoryName varchar(200) NOT NULL,
HashId int(11) DEFAULT NULL,
Processing varchar(36) DEFAULT NULL,
Status int(11) NOT NULL,
AuditWho varchar(200) NOT NULL,
AuditWhen datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (ID),
KEY idx_Processing_Item (Processing),
KEY idx_Status_Item (Status),
KEY idx_MerchantCategoryName_Item (MerchantCategoryName),
KEY fk_Import_Item (ImportId),
KEY fk_Hash_Item (HashId),
CONSTRAINT fk_Hash_Item FOREIGN KEY (HashId) REFERENCES Hash (ID),
CONSTRAINT fk_Import_Item FOREIGN KEY (ImportId) REFERENCES Import (ID)
) ENGINE=InnoDB AUTO_INCREMENT=12004589 DEFAULT CHARSET=utf8
Update statement
explain UPDATE Item
SET Processing = 'd53dbc91-eef4-11e5-a3a6-06f88beef4f3',
Status = 2,
AuditWho = 'name',
AuditWhen = now()
WHERE ImportId = 1
AND Processing is null
AND Status = 1
LIMIT 30;
Results:
'id','select_type','table','type','possible_keys','key','key_len','ref','rows','Extra',
'1','SIMPLE','Item','index_merge','idx_Processing_Item,idx_Status_Item,fk_Import_Item','idx_Processing_Item,idx_Status_Item,fk_Import_Item','111,4,4',\N,'1362610','Using intersect(idx_Processing_Item,idx_Status_Item,fk_Import_Item); Using where; Using temporary',
Select Query
explain select ID from Item where Status = 1 and Processing is null and ImportId = 1 limit 30;
Results:
'id','select_type','table','type','possible_keys','key','key_len','ref','rows','Extra',
'1','SIMPLE','Item','index_merge','idx_Processing_Item,idx_Status_Item,fk_ImportEvent_Item','idx_Processing_Item,idx_Status_Item,fk_Import_Item','111,4,4',\N,'1362610','Using intersect(idx_Processing_ItemPending,idx_Status_ItemPending,fk_ImportEvent_ItemPending); Using where; Using index',
A guess:
The UPDATE is changing an indexed value (readyState), correct? That means that the index in question is being changed as the UPDATE is using it? So, the UPDATE may be "protecting" itself by fetching the rows (in an inefficient way, apparently), tossing them into a tmp table, and only then performing the action.
"Index merge intersect" is almost always less efficient than a composite index: INDEX(readyState, productName) (in either order). Suggest you add that.
Since you have no ORDER BY, which 30 rows you get is unpredictable. Suggest you add ORDER BY on the primary key.
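A sketch of both suggestions against the Foo example from the question (the index name is made up; note that a single-table UPDATE in MySQL does accept ORDER BY and LIMIT):

```sql
-- Composite index replacing the index-merge intersect
ALTER TABLE Foo ADD INDEX idx_ready_product (readyState, productName);

-- Deterministic choice of which 30 rows get updated
UPDATE Foo
SET readyState = 1
WHERE readyState = 0
  AND productName = 'OpenAM'
ORDER BY id
LIMIT 30;
```

With the composite index, the WHERE clause is satisfied by one index range scan instead of intersecting two single-column indexes.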
I have two queries, the first one (inner join) is super fast, and the second one (left join) is super slow. How do I make the second query fast?
EXPLAIN SELECT saved.email FROM saved INNER JOIN finished ON finished.email = saved.email;
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE finished index NULL email 258 NULL 32168 Using index
1 SIMPLE saved ref email email 383 func 1 Using where; Using index
EXPLAIN SELECT saved.email FROM saved LEFT JOIN finished ON finished.email = saved.email;
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE saved index NULL email 383 NULL 40971 Using index
1 SIMPLE finished index NULL email 258 NULL 32168 Using index
Edit: I have added table info for both tables down below.
CREATE TABLE `saved` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`slug` varchar(255) DEFAULT NULL,
`email` varchar(127) NOT NULL,
[omitted fields include varchar, text, longtext, int],
PRIMARY KEY (`id`),
KEY `slug` (`slug`),
KEY `email` (`email`)
) ENGINE=MyISAM AUTO_INCREMENT=56329 DEFAULT CHARSET=utf8;
CREATE TABLE `finished` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`slug` varchar(255) DEFAULT NULL,
`submitted` int(11) DEFAULT NULL,
`status` int(1) DEFAULT '0',
`name` varchar(255) DEFAULT NULL,
`email` varchar(255) DEFAULT NULL,
[omitted fields include varchar, text, longtext, int],
PRIMARY KEY (`id`),
KEY `assigned_user_id` (`assigned_user_id`),
KEY `event_id` (`event_id`),
KEY `slug` (`slug`),
KEY `email` (`email`),
KEY `city_id` (`city_id`),
KEY `status` (`status`),
KEY `recommend` (`recommend`),
KEY `pending_user_id` (`pending_user_id`),
KEY `submitted` (`submitted`)
) ENGINE=MyISAM AUTO_INCREMENT=33063 DEFAULT CHARSET=latin1;
With INNER JOIN, MySQL generally will start with the table with the smallest number of rows. In this case, it starts with table finished and does a look up for the corresponding record in saved using the index on saved.email.
For a LEFT JOIN, (excluding some optimizations) MySQL generally joins the records in order (starting with the left most table). In this case, MySQL starts with the table saved, then attempts to find each corresponding record in finished. Since there is no usable index on finished.email, it must do a full scan for each look up.
Edit
Now that you posted your schema, I can see that MySQL is ignoring the index (finished.email) when going from utf8 to latin1 character set. You've not posted the character sets and collations for each column, so I'm going by the default character set for the table. The collations must be compatible in order for MySQL to use the index.
MySQL can coerce (upgrade) a latin1 collation, which is very limited, up to a utf8 collation such as unicode_ci (so the first query can use the index on saved.email by upgrading latin1 collation to utf8), but the opposite is not true (the second query can't use the index on finished.email since it can't downgrade a utf8 collation down to latin1).
The solution is to change both email columns to a compatible collation, perhaps most easily by making them identical character sets and collations.
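A sketch of that fix, based on the table defaults shown above (the exact target charset and collation are your choice; both columns just have to end up comparable):

```sql
-- Option 1: convert just the join column so it matches saved.email's charset
ALTER TABLE finished MODIFY `email` varchar(255) CHARACTER SET utf8;

-- Option 2: convert the whole table in one statement
ALTER TABLE finished CONVERT TO CHARACTER SET utf8;
```

After either change, the index on finished.email becomes usable for the join because no collation downgrade is needed.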
The LEFT JOIN query is slower than the INNER JOIN query because it's doing more work.
From the EXPLAIN output, it looks like MySQL is doing nested loop join. (There's nothing wrong with nested loops; I think that's the only join operation that MySQL uses in version 5.5 and earlier.)
For the INNER JOIN query, MySQL is using an efficient "ref" (index lookup) operation to locate the matching rows.
But for the LEFT JOIN query, it looks like MySQL is doing a full scan of the index to find the matching rows. So, with the nested loops join operation, MySQL is doing a full index scan for each row from the other table. That's on the order of tens of thousands of scans, and each of those scans is inspecting tens of thousands of rows.
Using the estimated row counts from the EXPLAIN output, that's going to require (40971*32168=) 1,317,955,128 string comparisons.
The INNER JOIN query avoids a lot of that work, so it's a lot faster. (It's avoiding all those string comparisons by using an index operation.)
-- LEFT JOIN
id select table type key key_len ref rows Extra
-- ------ -------- ----- ----- ------- ---- ----- ------------------------
1 SIMPLE saved index email 383 NULL 40971 Using index
1 SIMPLE finished index email 258 NULL 32168 Using index
-- INNER JOIN
id select table type key key_len ref rows Extra
-- ------ -------- ----- ----- ------- ---- ----- ------------------------
1 SIMPLE finished index email 258 NULL 32168 Using index
1 SIMPLE saved ref email 383 func 1 Using where; Using index
^^^^^ ^^^^ ^^^^^ ^^^^^^^^^^^^
NOTE: Markus Adams spied the difference in characterset in the email columns CREATE TABLE statements that were added to your question.
I believe that it's the difference in the characterset that's preventing MySQL from using an index for your query.
Q2: How do I make the LEFT JOIN query faster?
A: I don't believe it's going to be possible to get that specific query to run faster, without a schema change, such as changing the characterset of the two email columns to match.
The only effect of the outer join to the finished table appears to be to produce "duplicate" rows whenever more than one matching row is found. I'm not understanding why the outer join is needed. Why not just get rid of it altogether and do:
SELECT saved.email FROM saved
I'm afraid more info will probably be needed.
However, inner joins eliminate any item that has a null foreign key (no match, if you will). This means there are fewer rows to scan and associate.
For a left join, however, any non-match needs to be given a blank row, so all of the rows are scanned regardless; nothing can be eliminated.
This makes the data set larger and requires more resources to process. Also, when you write your select, don't do select *; instead, explicitly state which columns you want.
The data types of saved.email and finished.email differ in two respects. First, they have different lengths. Second, finished.email can be NULL. So, your LEFT JOIN operation can't exploit the index on finished.email.
Can you change the definition of finished.email to this, so it matches the field you're joining it with?
`email` varchar(127) NOT NULL
If you do you'll probably get a speedup.
I have the following query:
SELECT * FROM `alltrackers`
WHERE `deviceid`='FT_99000083401624'
AND `locprovider`!='none'
ORDER BY `id` DESC
This is the show create table:
CREATE TABLE `alltrackers` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`deviceid` varchar(50) NOT NULL,
`gpsdatetime` int(11) NOT NULL,
`locprovider` varchar(30) NOT NULL,
PRIMARY KEY (`id`),
KEY `statename` (`statename`),
KEY `gpsdatetime` (`gpsdatetime`),
KEY `locprovider` (`locprovider`),
KEY `deviceid` (`deviceid`(18))
) ENGINE=MyISAM AUTO_INCREMENT=8665045 DEFAULT CHARSET=utf8;
I've removed the columns which I thought were unnecessary for this question.
This is the EXPLAIN output for this query:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE alltrackers ref locprovider,deviceid deviceid 56 const 156416 Using where; Using filesort
This particular query is showing as taking several seconds in mytop (mtop). I'm a bit confused though, as the same query but with a different "deviceid" doesn't take as long. Although I only need the last row, I've already removed LIMIT 1 as that makes it take even longer. This table currently contains 3 million rows.
It is used for storing the locations from different GPS devices. Each GPS device has a unique device ID. Locations come in and are added to the table. For statistics I'm running the above query to find the time of the last received location from a certain device.
I'm open to advice on ways to further optimize the query or even the tables.
Many thanks in advance.
If you only need the last row, add an index on (deviceid, id, locprovider). It would be even faster with an index on (deviceid, id, locprovider, gpsdatetime):
ALTER TABLE alltrackers
ADD INDEX special_covering_IDX
(deviceid, id, locprovider, gpsdatetime) ;
Then try this out:
SELECT id, locprovider, gpsdatetime
FROM alltrackers
WHERE deviceid = 'FT_99000083401624'
AND locprovider <> 'none'
ORDER BY id DESC
LIMIT 1 ;
This is my first question on Stack Overflow; usually I'm used to searching the internet for an answer, but this time I could not find one.
Simply put, my problem is a query that takes TOO long to execute, while it's a simple join between 2 tables.
I'll post my query first, then I will post more details about my system:
SELECT * FROM tbl_item
LEFT JOIN (SELECT * FROM tbl_item_details) AS tbl_item_details
ON tbl_item.item_id = tbl_item_details.item_details_item_id
WHERE item_active = 1 ORDER BY item_views DESC LIMIT 0,5
Here is my tables structure:
CREATE TABLE `tbl_item` (
`item_id` int(11) NOT NULL AUTO_INCREMENT,
`item_views` int(11) NOT NULL,
`item_active` tinyint(1) NOT NULL DEFAULT '1',
PRIMARY KEY (`item_id`)
) ENGINE=InnoDB AUTO_INCREMENT=821 DEFAULT CHARSET=utf8
tbl_item_details:
CREATE TABLE `tbl_item_details` (
`item_details_id` int(11) NOT NULL AUTO_INCREMENT,
`item_details_title` varchar(255) NOT NULL,
`item_details_content` longtext NOT NULL,
`item_details_item_id` int(11) NOT NULL,
PRIMARY KEY (`item_details_id`),
KEY `itm_dt_itm_id` (`item_details_item_id`),
CONSTRAINT `tbl_item_details_ibfk_1` FOREIGN KEY (`item_details_item_id`) REFERENCES `tbl_item` (`item_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=364 DEFAULT CHARSET=utf8
here is EXPLAIN query output:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY tbl_item ALL NULL NULL NULL NULL 358 Using where; Using temporary; Using filesort
1 PRIMARY ALL NULL NULL NULL NULL 358
2 DERIVED tbl_item_details ALL NULL NULL NULL NULL 422
each table has ONLY ~350 rows, and the big table (tbl_item_details) is 1.5 MB, so you see the tables are pretty small.
BASICALLY, the above query takes almost 4 seconds to execute on the following system:
CPU: Intel(R) Pentium(R) 4 CPU 3.20GHz (2 CPUs), ~3.2GHz
RAM: 3 GB
Mysql: 5.1.33 (included with wamp)
BEFORE anyone suggests a solution, here is what I tried, what worked, and what did not.
Things that worked, where the query took much less time (0.06 seconds):
- removing ORDER BY
- removing item_details_content from the select
- using INNER JOIN instead
- switching the tables in the join so the inner table became outer and vice versa
I CANNOT use INNER JOIN because there might be rows in tbl_item that do not have matches in tbl_item_details, and I want those records.
Things I tried which DID NOT work:
- adding an index to item_views
- removing the foreign key constraints
- switching the table engine to MyISAM
Obviously the problem happens when MySQL is sorting the data and faces the (relatively) large data in item_details_content, so if we get rid of either of those things (the sorting or the column item_details_content) it works just fine.
But the thing is, this should not happen, because the table holds really very small data, considering it's just 350 rows totalling 1.5 MB! MySQL should be able to handle much more data than that.
PLEASE, before suggesting drastic changes to the query structure: I'm afraid that is not possible, because I've been working on this framework for a while and the query is dynamically generated; changes to the query might mean days of work. BUT your suggestions are always welcome.
P.S: I tried this query on a powerful server (core i7 and 8 GB RAM) and it took 0.3 seconds which is still too long for such a database
Thanks a million
Have you tried:
SELECT * FROM tbl_item
LEFT JOIN tbl_item_details
ON tbl_item.item_id = tbl_item_details.item_details_item_id
WHERE item_active = 1 ORDER BY item_views DESC LIMIT 0,5
I really don't think you need the nested subquery that just selects * from tbl_item_details -- just use the table.