This question is about legacy SQL code for a MySQL database.
It is known that in an INSERT ... ON DUPLICATE KEY UPDATE statement, the VALUES(col_name) function can be used to refer to column values from the INSERT portion instead of passing exact values there:
INSERT INTO table (a,b,c) VALUES (1,2,3)
ON DUPLICATE KEY UPDATE b=VALUES(b), c=VALUES(c)
My legacy code contains a lot of huge inserts written in parameterized style (they are used in batch inserts):
INSERT INTO table (a,b,c, <...dozens of params...>) VALUES (?,?,?,<...dozens of values...>)
ON DUPLICATE KEY UPDATE b=?, c=?, <...dozens of params...>
The question is: would it improve the performance of batch inserts if I changed all these queries to use the VALUES(col_name) function in the UPDATE portion?
My queries are executed from Java code using the JDBC driver. My guess is that for long text values this should significantly reduce the size of the queries. But what about MySQL itself? Would it actually give a speed increase in general?
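For illustration, the rewrite I have in mind would look something like this (using the same placeholder column names as above; each value would then be bound only once from the JDBC side):
INSERT INTO table (a, b, c, <...dozens of params...>) VALUES (?, ?, ?, <...dozens of values...>)
ON DUPLICATE KEY UPDATE b=VALUES(b), c=VALUES(c), <...same for the rest...>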
Batched inserts can run as much as 10 times as fast as inserting one row at a time. The reason for this is all the network and other per-statement overhead.
Another technique is to change from a single batched IODKU into two statements -- one to insert the new rows, one to do the updates. (I don't know if that will run any faster.) Here is a discussion of the two steps, in the context of "normalization".
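A rough sketch of that two-statement idea, assuming the batch is first loaded into a staging table (the names staging, target, and the key column a are only placeholders):
-- 1) insert only the rows that do not exist yet
INSERT INTO target (a, b, c)
SELECT s.a, s.b, s.c
FROM staging s
LEFT JOIN target t ON t.a = s.a
WHERE t.a IS NULL;

-- 2) update the rows that already exist
UPDATE target t
JOIN staging s ON t.a = s.a
SET t.b = s.b, t.c = s.c;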
Another thing to note: if there is an AUTO_INCREMENT involved (not as one of the columns mentioned), then IODKU may "burn" ids for the cases where it does an 'update'. That is, IODKU (and INSERT IGNORE and a few others) grabs all the auto_incs it might need, then proceeds to use the ones it does need and wastes the others.
You get into "diminishing returns" if you try to insert more than a few hundred rows in a batch. And you stress the rollback log.
Related
I have to issue about 1M SQL queries of the following form:
update table1 ta join table2 tr on ta.tr_id=tr.id
set start_date=null, end_date=null
where title_id='X' and territory_id='AG' and code='FREE';
The SQL statements are in a text document -- I can only copy-paste them in as-is.
What would be the fastest way to do this? Are there some checks I can disable so that it only applies them at the end? For example, something like:
start transaction;
copy/paste all sql statements here;
commit;
I tried the above approach but saw zero speed improvement on the inserts. Are there any other things I can try?
The performance cost is partly attributed to running 1M separate SQL statements, but it's also attributed to the cost of rewriting rows and the corresponding indexes.
What I mean is, there are several steps to executing an SQL statement, and each of them takes a non-zero amount of time:
Start a transaction.
Parse the SQL, validate the syntax, check your privileges to make sure you have permission to update those tables, etc.
Change the values you updated in the row.
Change the values you updated in each index on that table that contain the columns you changed.
Commit the transaction.
In autocommit mode, the start & commit of a transaction implicitly happen for every SQL statement, so that causes maximum overhead. Using explicit START and COMMIT as you showed reduces that overhead by doing each once.
Caveat: I don't usually run 1M updates in a single transaction. That causes other types of overhead, because MySQL needs to keep the original rows in case you ROLLBACK. As a compromise, I would execute maybe 1000 updates, then commit and start a new transaction. That at least reduces the START/COMMIT overhead by 99.9%.
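A minimal sketch of that compromise (the batch size of 1000 is only an example):
START TRANSACTION;
-- paste roughly 1000 UPDATE statements here
COMMIT;

START TRANSACTION;
-- next roughly 1000 UPDATE statements
COMMIT;
-- ...and so on until all statements have been applied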
In any case, the overhead of transactions isn't that large. It might be unnoticeable compared to the cost of updating indexes.
MyISAM tables have an option to DISABLE KEYS, which means it doesn't have to update non-unique indexes during the transaction. But this might not be a good optimization for you, because (a) you might need indexes to be active, to help performance of lookups in your WHERE clause and the joins; and (b) it doesn't work in InnoDB, which is the default storage engine, and it's a better idea to use InnoDB.
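For reference, the MyISAM syntax looks like this (shown against table1 purely as an illustration):
ALTER TABLE table1 DISABLE KEYS;
-- ...run the updates...
ALTER TABLE table1 ENABLE KEYS;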
You could also review if you have too many indexes or redundant indexes on your table. There's no sense having extra indexes you don't need, which only add cost to your updates.
There's also a possibility that you don't have enough indexes, and your UPDATE is slow because it's doing a table-scan for every statement. The table-scans might be so expensive that you'd be better off creating the needed indexes to optimize the lookups. You should use EXPLAIN to see if your UPDATE statement is well-optimized.
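For example, using the statement quoted above (note that EXPLAIN on UPDATE statements requires MySQL 5.6 or later):
EXPLAIN UPDATE table1 ta JOIN table2 tr ON ta.tr_id=tr.id
SET start_date=null, end_date=null
WHERE title_id='X' AND territory_id='AG' AND code='FREE';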
If you want me to review that, please run SHOW CREATE TABLE <tablename> for each of your tables in your update, and run EXPLAIN UPDATE ... for your example SQL statement. Add the output to your question above (please don't paste in a comment).
I have a table called research_words which has some hundred million rows.
Every day I have tens of millions of new rows to be added; about 5% of them are totally new rows, and 95% are updates that have to add to some columns in the existing row. I don't know which is which, so I use:
INSERT INTO research_words
(word1,word2,origyear,cat,numbooks,numpages,numwords)
VALUES
(34272,268706,1914,1,1,1,1)
ON DUPLICATE KEY UPDATE
numbooks=numbooks+1,numpages=numpages+1,numwords=numwords+1
This is an InnoDB table where the primary key is over word1,word2,origyear,cat.
The issue I'm having is that I have to insert the new rows each day, and it's taking longer than 24 hours to insert each day's rows! Obviously I can't have it taking longer than a day to insert the rows for the day. I have to find a way to make the inserts faster.
For other tables I've had great success with ALTER TABLE ... DISABLE KEYS; and LOAD DATA INFILE, which allows me to add billions of rows in less than an hour. That would be great, except that unfortunately I am incrementing columns in this table. I doubt disabling the keys would help either, because surely it will need them to check whether the row exists in order to add it.
My scripts are in PHP but when I add the rows I do so by an exec call directly to MySQL and pass it a text file of commands, instead of sending them with PHP, since it's faster this way.
Any ideas to fix the speed issue here?
Old question, but perhaps worth an answer all the same.
Part of the issue stems from the large number of inserts being run essentially one at a time, with a unique index update after each one.
In these instances, a better technique might be to select n rows to insert and put them in a temp table, left join them to the destination table, calculate their new values (in the OP's situation, IFNULL(dest.numpages+1,1) etc.), and then run two further commands: an insert for the rows where the calculated fields are 1 (i.e. brand-new rows) and an update for the rows where they're greater (i.e. existing rows). The updates don't require an index refresh, so they run much faster; the inserts don't require the same ON DUPLICATE KEY logic.
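A rough sketch of that approach for the OP's table, assuming the day's batch has already been loaded into a temporary table called batch with the same key columns (the batch table name and the fixed increments of 1 are assumptions for illustration):
-- 1) update the rows that already exist; the incremented counters are not indexed,
--    so no index maintenance is needed here
UPDATE research_words r
JOIN batch b
  ON r.word1 = b.word1 AND r.word2 = b.word2
 AND r.origyear = b.origyear AND r.cat = b.cat
SET r.numbooks = r.numbooks + 1,
    r.numpages = r.numpages + 1,
    r.numwords = r.numwords + 1;

-- 2) insert only the rows that did not exist yet, starting their counters at 1
INSERT INTO research_words
  (word1, word2, origyear, cat, numbooks, numpages, numwords)
SELECT b.word1, b.word2, b.origyear, b.cat, 1, 1, 1
FROM batch b
LEFT JOIN research_words r
  ON r.word1 = b.word1 AND r.word2 = b.word2
 AND r.origyear = b.origyear AND r.cat = b.cat
WHERE r.word1 IS NULL;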
I want to insert or update, then insert a new log into another table.
I'm running a nifty little query to pull information from a staging table into other tables, something like
Insert into
select
on duplicate key update
What I'd like to do, without PHP or triggers (the lead dev doesn't like them, and I'm not that familiar with them either), is insert a new record into a logging table. This is needed for reporting on what data was updated or inserted, and in which table.
Any hints or examples?
Note: I was doing this with PHP just fine, although it was taking about 4 hours to process 50K rows. Using the Laravel PHP framework, looping over each entry in staging to update 4 other tables with the data, plus a log for each one, came to 8 queries per row (this was using Laravel models, not raw SQL). I was able to optimise by pushing logs into an array and batch processing, but you can't beat a 15-second processing time in MySQL, achieved by bypassing all that round-tripping. Now I'm hooked on doing awesome things the SQL way.
If you need to execute more than one query statement, I prefer to use a transaction rather than a trigger to guarantee atomicity (part of ACID). The code below is a sample MySQL transaction:
START TRANSACTION;
UPDATE ...
INSERT ...
DELETE ...
Other query statement
COMMIT;
Statements inside the transaction will be executed all or nothing.
If you want to do two things (insert the base row and insert a log row), you'll need two statements. The second can (and should) be a trigger.
It would be better to use a trigger; triggers are often used for logging purposes.
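A minimal sketch of such triggers, with hypothetical table and column names; note that with INSERT ... ON DUPLICATE KEY UPDATE, the AFTER INSERT trigger fires for rows that are actually inserted and the AFTER UPDATE trigger fires for rows that hit the duplicate-key branch:
-- log newly inserted rows
CREATE TRIGGER target_table_ai AFTER INSERT ON target_table
FOR EACH ROW
  INSERT INTO change_log (table_name, action, row_key, changed_at)
  VALUES ('target_table', 'insert', NEW.id, NOW());

-- log rows changed via the duplicate-key update branch
CREATE TRIGGER target_table_au AFTER UPDATE ON target_table
FOR EACH ROW
  INSERT INTO change_log (table_name, action, row_key, changed_at)
  VALUES ('target_table', 'update', NEW.id, NOW());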
I have a table with 30M+ rows, and each index update is expensive.
I sometimes have to update and/or add 5000+ rows in a single insert.
Sometimes all rows are new, sometimes some are new.
I cannot use a plain UPDATE, since I don't know which rows are already in the table, so I use INSERT ... ON DUPLICATE KEY UPDATE for a single column.
This sometimes takes a lot of time, >5 sec.
Is there a better way to do it? Maybe I did not explain myself clearly enough :)
Are you issuing 5000+ separate insert statements? If so, lock the table while doing the inserts; it'll go a lot faster.
I added BEGIN TRANSACTION and COMMIT around the insert, and it improved performance by 4x to 10x.
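Both of those suggestions amount to something like the following (the table name my_table is a placeholder):
-- wrap the whole batch in one transaction
START TRANSACTION;
-- ...run the 5000+ INSERT ... ON DUPLICATE KEY UPDATE statements...
COMMIT;

-- or lock the table for the duration of the batch
LOCK TABLES my_table WRITE;
-- ...run the inserts...
UNLOCK TABLES;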
I have a MyISAM table with more than 10^7 rows. When adding data to it, I have to update ~10 rows at the end. Is it faster to delete them and then insert the new ones, or is it faster to update those rows? The data that needs updating is not part of the index. What about index/data fragmentation?
UPDATE is by far much faster.
When you UPDATE, the table records are just being rewritten with new data.
When you DELETE, the indexes have to be updated (remember, you delete the whole row, not only the columns you need to modify), and data blocks may be moved (if you hit the PCTFREE limit).
And all this must be done again on INSERT.
That's why you should always use
INSERT ... ON DUPLICATE KEY UPDATE
instead of REPLACE.
The former one is an UPDATE operation in case of a key violation, while the latter one is DELETE / INSERT.
It is faster to update. You can also use INSERT ... ON DUPLICATE KEY UPDATE:
INSERT INTO table (a,b,c) VALUES (1,2,3)
ON DUPLICATE KEY UPDATE c=c+1;
For more details, read the UPDATE documentation.
Rather than deleting or updating data for the sake of performance, I would consider partitioning.
http://dev.mysql.com/doc/refman/5.1/en/partitioning-range.html
This will allow you to retain the data historically and not degrade performance.
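A minimal sketch of range partitioning, with hypothetical table, column, and partition names (see the manual page linked above for the full syntax):
CREATE TABLE measurements (
  id INT NOT NULL,
  recorded_year INT NOT NULL,
  value DECIMAL(10,2)
)
PARTITION BY RANGE (recorded_year) (
  PARTITION p2009 VALUES LESS THAN (2010),
  PARTITION p2010 VALUES LESS THAN (2011),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);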
Logically, DELETE+ADD = 2 actions, UPDATE = 1 action. Also, deleting and adding new rows changes record IDs under AUTO_INCREMENT, so if those records have relationships, they would be broken or would need updates too. I'd go for UPDATE.
Using an UPDATE with WHERE Column='something' should use an index as long as the search criterion is in the index (whether it's a seek or a scan is a completely different issue).
If you are doing these updates a lot but don't have an index on the criteria column, I would recommend creating an index on the column that you are using. That should help speed things up.
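For example (assuming a hypothetical table my_table and a criteria column named status):
CREATE INDEX idx_my_table_status ON my_table (status);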