Issuing multiple SQL UPDATE statements in one go - MySQL

I have to issue about 1M SQL queries of the following form:
update table1 ta join table2 tr on ta.tr_id=tr.id
set start_date=null, end_date=null
where title_id='X' and territory_id='AG' and code='FREE';
The SQL statements are in a text document -- I can only copy/paste them in as-is.
What would be the fastest way to do this? Are there any checks I can disable so the changes are only applied at the end? For example, something like:
start transaction;
copy/paste all sql statements here;
commit;
I tried the above approach but saw zero speed improvement on the updates. Are there any other things I can try?

The performance cost comes partly from running 1M separate SQL statements, but also from the cost of rewriting rows and the corresponding indexes.
What I mean is, there are several steps to executing an SQL statement, and each of them takes a non-zero amount of time:
Start a transaction.
Parse the SQL, validate the syntax, check your privileges to make sure you have permission to update those tables, etc.
Change the values you updated in the row.
Change the values you updated in each index on that table that contains the columns you changed.
Commit the transaction.
In autocommit mode, the start and commit of a transaction implicitly happen for every SQL statement, so that causes the maximum overhead. Using an explicit START TRANSACTION and COMMIT as you showed reduces that overhead by doing each only once.
Caveat: I don't usually run 1M updates in a single transaction. That causes other types of overhead, because MySQL needs to keep the original rows in case you ROLLBACK. As a compromise, I would execute maybe 1000 updates, then commit and start a new transaction. That at least reduces the START/COMMIT overhead by 99.9%.
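As a rough sketch (the update statements here stand in for the ones pasted from your text file):
START TRANSACTION;
-- ...the first 1000 of your update statements pasted here...
COMMIT;
START TRANSACTION;
-- ...the next 1000 update statements...
COMMIT;
-- ...and so on, committing after every batch of roughly 1000 statements.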
In any case, the transaction overhead isn't that large. It might be unnoticeable compared to the cost of updating indexes.
MyISAM tables have an option to DISABLE KEYS, which means it doesn't have to update non-unique indexes during the transaction. But this might not be a good optimization for you, because (a) you might need indexes to be active, to help performance of lookups in your WHERE clause and the joins; and (b) it doesn't work in InnoDB, which is the default storage engine, and it's a better idea to use InnoDB.
You could also review if you have too many indexes or redundant indexes on your table. There's no sense having extra indexes you don't need, which only add cost to your updates.
There's also a possibility that you don't have enough indexes, and your UPDATE is slow because it's doing a table-scan for every statement. The table-scans might be so expensive that you'd be better off creating the needed indexes to optimize the lookups. You should use EXPLAIN to see if your UPDATE statement is well-optimized.
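For example, if title_id, territory_id, and code all belong to table1, an index like the following might help (a sketch only; the index name is made up, and you should adapt the columns to your actual schema):
ALTER TABLE table1 ADD INDEX idx_title_terr_code (title_id, territory_id, code);
EXPLAIN UPDATE table1 ta JOIN table2 tr ON ta.tr_id = tr.id
SET start_date = NULL, end_date = NULL
WHERE title_id = 'X' AND territory_id = 'AG' AND code = 'FREE';
(EXPLAIN for UPDATE statements requires MySQL 5.6 or later.)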
If you want me to review that, please run SHOW CREATE TABLE <tablename> for each of your tables in your update, and run EXPLAIN UPDATE ... for your example SQL statement. Add the output to your question above (please don't paste in a comment).

Related

Prepared Statements, MyISAM with 100 million records and Caching on MySQL

I have 10 large read-only tables. I make a lot of queries that have the same format with different parameters, for example:
SELECT 'count' FROM table1 WHERE x='SOME VARIABLE';
SELECT 'count' FROM table1 WHERE x='SOME OTHER VARIABLE';
...
SELECT 'count' FROM table2 WHERE x='' AND y='';
SELECT 'count' FROM table2 WHERE x='' AND y='';
...
SELECT 'count' FROM table3 WHERE x='' AND y='' AND z='';
The variables used are different each query so I almost never execute the same query twice. Is it correct that query caching and row caching on the MySQL side would be wasteful and they should be disabled? Table caching seems like it would be a good thing.
On the client side, I am using prepared statements, which I assume is good. If I enable prepared statement caching (via Slick), though, won't that hurt my performance since the parameters are so variable? Is there anything else I can do to optimize my performance?
Should auto-commit be off since I'm only doing selects and will never need to rollback?
Given that you are using the MyISAM engine and have tables with hundreds of millions of active rows, I would worry less about how the cache is queried (given the low complexity of your queries, that is most likely the smallest problem) and focus more on the proper organization of the data within the database:
Prepared statements are totally OK. It may be helpful not to prepare the statement over and over again; instead, just reuse the existing prepared statement (some environments even allow storing prepared statements on the client side) with a new set of parameter values. However, this mainly saves parsing time, and as the complexity of your queries is quite low, that is unlikely to be the biggest time consumer.
Key caching (also called key buffering), however, is - as the name already suggests - key for your case. Most MySQL configurations suffer greatly from wrong values in that area, as the buffers are way too small. In a nutshell, key caching makes sure that the references to the data (for instance in your indexes) can be accessed in main memory; if they are not in memory, they need to be retrieved from disk, which is slow. To see whether your key cache is efficient, watch the key hit ratio while your system is under load. Details are explained well at https://dba.stackexchange.com/questions/58182/tuning-key-reads-in-mysql.
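A sketch of how to check and adjust it (the size below is just an example; pick a value that fits your indexes and available RAM):
SHOW GLOBAL STATUS LIKE 'Key_read%';            -- compare Key_reads (disk) to Key_read_requests (cache)
SHOW GLOBAL VARIABLES LIKE 'key_buffer_size';
SET GLOBAL key_buffer_size = 4294967296;        -- e.g. 4 GB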
If the caches become large or are frequently displaced due to the use of other tables, it may be helpful to create dedicated key caches for your tables. For details, see https://dev.mysql.com/doc/refman/5.5/en/cache-index.html
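A sketch following that documentation (the cache name and size are placeholders):
SET GLOBAL hot_cache.key_buffer_size = 536870912;   -- 512 MB dedicated key cache
CACHE INDEX table1, table2 IN hot_cache;
LOAD INDEX INTO CACHE table1;                       -- optionally preload the index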
If you always access large portions of your table via the same attributes, it may make sense to change the ordering of the data storage on the disk by using ALTER TABLE ... ORDER BY expr1, expr2, .... For details on this approach see also https://dev.mysql.com/doc/refman/5.5/en/optimizing-queries-myisam.html
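For instance (a sketch; x and y stand for the attributes you usually filter on, and note that the ordering is not maintained for rows inserted later):
ALTER TABLE table2 ORDER BY x, y;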
Avoid using variable-length columns such as VARCHAR, BLOB or TEXT. They might help to save some space, but comparing their values in particular can become time-consuming. Note, however, that a single column of such a type is enough to make MySQL switch the table to the dynamic row format.
Run ANALYZE TABLE after huge data changes to keep the statistics up to date. If you have deleted large portions of a table, it may also help to run OPTIMIZE TABLE, which makes sure there are no large gaps left that need to be skipped when reading.
Use INSERT DELAYED to write changes asynchronously if you do not need the result. This can greatly improve performance if SELECT statements are running at the same point in time.
Alternatively, if you do need the result, you can use INSERT LOW_PRIORITY; execution of the concurrent SELECTs is then preferred over your INSERT. This eases the pain of MyISAM's table-level locking a little.
You can try providing index hints in your queries, especially if there are multiple overlapping indexes on your table. Prefer the index with the smallest width that still covers the most attributes.
However, note that in your case the impact should be quite small: you are not ordering, grouping, or joining, so the query optimizer should already be good at finding the best index. Simply check with EXPLAIN on your SELECT statement whether the chosen index is reasonable.
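A sketch of both (idx_x_y is a hypothetical index on (x, y)):
EXPLAIN SELECT 'count' FROM table2 WHERE x = 'A' AND y = 'B';
SELECT 'count' FROM table2 USE INDEX (idx_x_y) WHERE x = 'A' AND y = 'B';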
In short, prepared statements are totally OK, key caching is key, and there are a few other things you can do to help MySQL cope with the sheer bulk of data.

Is it better to use a 'deactivated' status column instead of deleting rows, for MySQL performance?

Recently I watched a video about CRUD operations in MySQL, and one thing caught my attention: the presenter claimed that deleting rows is bad for MySQL index performance and that we should use a status column instead.
So, is there really a difference between the two?
Deleting a row is indeed quite expensive, more expensive than setting a new value to a column. Some people don't ever delete a row from their databases (though it's sometimes due to preserving history, not performance considerations).
I usually do delayed deletions: when my app needs to delete a row, it doesn't actually delete, but sets a status instead. Then later, during low traffic period, I execute those deletions.
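A minimal sketch of that pattern (table and column names are hypothetical):
UPDATE orders SET status = 'deleted' WHERE id = 12345;     -- what the app does immediately
DELETE FROM orders WHERE status = 'deleted' LIMIT 10000;   -- run later, during low traffic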
Some database engines need their data files to be compacted every once in a while, since they cannot reuse the space from deleted records. I'm not sure if InnoDB is one of those.
I guess the strategy is that deleting a row affects all indexes, whereas modifying a 'status' column might not affect any indexes (since you probably wouldn't index that column due to the low cardinality).
Still, when deleting rows, the impact on indexes is minimal. Inserting affects index performance when it fills up a page and forces a page split; this doesn't happen with deletes. With deletes, the index records are merely marked for deletion.
MySQL will later (when load is low) purge the deleted rows from the indexes, so the cleanup is already deferred. Why double the effort?
Your deletes do need indexes, just like your selects and updates, in order to quickly find the record to delete. So don't blame slow deletes that are due to missing or bad indexes on MySQL index performance. Your delete statement's WHERE clause should be able to use an index. With InnoDB, this is also important to ensure that just a single index record is locked instead of having to lock all of the records or a range.

Problematic performance with continuous UPDATE / INSERT in Mysql

Currently we have a database and a script that performs 2 updates, 1 select, and 1 insert.
The problem is that 20,000 people run this script every hour, which causes MySQL to run at 100% CPU.
The insert is for logging; we want to log all the data to MySQL, but as the table scales up, the application becomes slower and slower. We are running on InnoDB, but some people say it should be MyISAM. What should we use? We do sometimes pull data out of this log table for statistical purposes, but only 40-50 times a day.
Our solution is to use Gearman [http://gearman.org/] to delay the inserts to the database. But what about the updates?
We need to update 2 tables: one to update the customer's balance (balance = balance - 1), and the other to update a count in another table.
How can we make this faster and more CPU-efficient?
Thank you
but as the table scales up, the application becomes slower and slower
This usually means that you're missing an index somewhere.
MyISAM is not good: in addition to being non-ACID-compliant, it locks the whole table to do an insert -- which kills concurrency.
Read the MySQL documentation carefully:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
Especially "innodb_flush_log_at_trx_commit" -
http://dev.mysql.com/doc/refman/5.0/en/innodb-parameters.html
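For example (a sketch; only do this if losing roughly the last second of committed transactions in a crash is acceptable):
SET GLOBAL innodb_flush_log_at_trx_commit = 2;   -- write the log at commit, flush it to disk about once per second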
I would stay away from MyISAM as it has concurrency issues when mixing SELECT and INSERT statements. If you can keep your insert tables small enough to stay in memory, they'll go much faster. Batching your updates in a transaction will help them go faster as well. Setting up a test environment and tuning for your actual job is important.
You may also want to look into partitioning to rotate your logs. You'd drop the old partition and create a new one for the current data. This is much faster than deleting the old rows.
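A sketch of that rotation (table name, column, and boundaries are placeholders; note that the partitioning column must be part of every unique key on the table):
ALTER TABLE request_log PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p2013_01 VALUES LESS THAN (TO_DAYS('2013-02-01')),
    PARTITION p2013_02 VALUES LESS THAN (TO_DAYS('2013-03-01')),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);
ALTER TABLE request_log DROP PARTITION p2013_01;   -- dropping a month of logs is far cheaper than DELETE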

mysql deletion efficiency

I have a table with a large amount of data. The data needs to be updated frequently: delete old data and add new data. I have two options:
whenever there is a deletion event, I delete the entry immediately
I mark the entries as deleted and use a cron job to delete them at off-peak time
Is there any efficiency difference between the two options? Or is there a better solution?
Both DELETE and UPDATE can fire triggers, which may affect performance (check whether that's the case for you).
Updating a row is usually faster than deleting (due to indexes etc.)
However, when deleting a single row per operation, the performance impact shouldn't be that big. If your measurements show that the database spends significant time deleting rows, then your mark-and-sweep approach might help. The key word here is measured: unless the single deletes are significantly slower than the updates, I wouldn't bother.
You should use low_priority_updates - http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html#sysvar_low_priority_updates. This will give higher priority to your selects than insert/delete/update operations. I used this in production and got a pretty decent speed improvement. The only problem I see with it is that you will lose more data in case of a crashing server.
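A sketch of both forms (this matters for engines with table-level locking, such as MyISAM; table and column names are hypothetical):
SET GLOBAL low_priority_updates = 1;                       -- prefer reads over writes globally
UPDATE LOW_PRIORITY items SET deleted = 1 WHERE id = 42;   -- or per statement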
With MySQL, deletes are simply marked for deletion internally, and when the CPU is (nearly) idle, MySQL then updates the indexes.
Still, if this is a problem and you are deleting many rows, consider using DELETE QUICK. This tells the storage engine not to merge index leaves during the delete; the entries are just left marked as deleted so the space can be reused.
To recover the unused index space, simply OPTIMIZE TABLE nightly.
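A sketch of that combination (table and column names are hypothetical):
DELETE QUICK FROM event_log WHERE created_at < '2013-01-01';   -- leave index entries merely marked as deleted
OPTIMIZE TABLE event_log;                                      -- nightly, to reclaim the unused index space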
In this case, there's no need to implement the functionality in your application that MySQL will do internally.

Dummies guide to locking in innodb

The typical documentation on locking in InnoDB is way too confusing. I think it would be of great value to have a "dummies guide to InnoDB locking".
I will start, and I will gather all responses as a wiki:
The column needs to be indexed before row level locking applies.
EXAMPLE: DELETE FROM t WHERE column1 = 10; will lock the whole table unless column1 is indexed.
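A sketch (t is a hypothetical table): with the index in place, InnoDB locks only the matching index records instead of everything it scans.
ALTER TABLE t ADD INDEX idx_column1 (column1);
DELETE FROM t WHERE column1 = 10;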
Here are my notes from working with MySQL support on a recent, strange locking issue (version 5.1.37):
All rows and index entries traversed to get to the rows being changed will be locked. It's covered at:
http://dev.mysql.com/doc/refman/5.1/en/innodb-locks-set.html
"A locking read, an UPDATE, or a DELETE generally set record locks on every index record that is scanned in the processing of the SQL statement. It does not matter whether there are WHERE conditions in the statement that would exclude the row. InnoDB does not remember the exact WHERE condition, but only knows which index ranges were scanned. ... If you have no indexes suitable for your statement and MySQL must scan the entire table to process the statement, every row of the table becomes locked, which in turn blocks all inserts by other users to the table."
That is a MAJOR headache if true.
It is. A workaround that is often helpful is to do:
UPDATE whichevertable SET whatever = something WHERE primarykey IN (SELECT primarykey FROM whichevertable WHERE constraints ORDER BY primarykey);
The inner select doesn't need to take locks, and the update then has less work to do. The ORDER BY clause ensures that the update is done in primary-key order to match InnoDB's physical order, which is the fastest way to do it.
Where large numbers of rows are involved, as in your case, it can be better to store the select result in a temporary table with a flag column added. Then select from the temporary table where the flag is not set to get each batch. Run updates with a limit of say 1000 or 10000 and set the flag for the batch after the update. The limits will keep the amount of locking to a tolerable level while the select work will only have to be done once. Commit after each batch to release the locks.
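A sketch of that batch pattern (all table, column, and constraint names are hypothetical; a plain work table is used here rather than a TEMPORARY table so it can be referenced twice in one statement):
CREATE TABLE todo_batch (pk INT PRIMARY KEY, done TINYINT NOT NULL DEFAULT 0);
INSERT INTO todo_batch (pk)
    SELECT primarykey FROM whichevertable WHERE some_column = 'some_value';

-- Repeat until the first UPDATE reports 0 rows affected:
START TRANSACTION;
UPDATE whichevertable w
JOIN (SELECT pk FROM todo_batch WHERE done = 0 ORDER BY pk LIMIT 10000) batch
    ON w.primarykey = batch.pk
SET w.whatever = 'something';
UPDATE todo_batch t
JOIN (SELECT pk FROM todo_batch WHERE done = 0 ORDER BY pk LIMIT 10000) batch
    ON t.pk = batch.pk
SET t.done = 1;
COMMIT;

DROP TABLE todo_batch;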
You can also speed this work up by doing a select sum of an unindexed column before doing each batch of updates. This will load the data pages into the buffer pool without taking locks. Then the locking will last for a shorter timespan because there won't be any disk reads.
This isn't always practical but when it is it can be very helpful. If you can't do it in batches you can at least try the select first to preload the data, if it's small enough to fit into the buffer pool.
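A sketch of that warming query (unindexed_amount and the constraint are hypothetical; pick an unindexed column from the rows the next batch will touch):
SELECT SUM(unindexed_amount) FROM whichevertable WHERE some_column = 'some_value';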
If possible use the READ COMMITTED transaction isolation mode. See:
http://dev.mysql.com/doc/refman/5.1/en/set-transaction.html
Getting that reduced locking requires the use of row-based binary logging (rather than the default statement-based binary logging).
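A sketch of the two settings together (changing binlog_format normally requires elevated privileges):
SET SESSION binlog_format = 'ROW';
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;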
Two known issues:
Subqueries are sometimes less than ideally optimised. In this case it produced an undesirable dependent subquery, so the subquery suggestion I made turned out to be unhelpful compared to the alternative because of that.
Deletes and updates do not have the same range of query plans as select statements so sometimes it's hard to properly optimise them without measuring the results to work out exactly what they are doing.
Both of these are gradually improving. This bug is one example where we've just improved the optimisations available for an update, though the changes are significant and it's still going through QA to be sure it doesn't have any great adverse effects:
http://bugs.mysql.com/bug.php?id=36569