I need a little help with SELECT FOR UPDATE (resp. LOCK IN SHARE MODE).
I have a table with around 400 000 records and I need to run two different processing functions on each row.
The table structure is approximately this:
data (
`id`,
`mtime`, -- When was data1 set last
`data1`,
`data2` DEFAULT NULL,
`priority1`,
`priority2`,
PRIMARY KEY `id`,
INDEX (`mtime`),
FOREIGN KEY ON `data2`
)
Functions are a little different:
first function - has to run in a loop over all records (it is pretty fast), should select records based on priority1; sets data1 and mtime
second function - has to run only once on each record (it is pretty slow), should select records based on priority2; sets data2
They shouldn't modify the same row at the same time, but the selects may return the same row in both of them (priority1 and priority2 have different values), and it's okay for a transaction to wait if that's the case (I'd expect that to be the only case when it blocks).
I'm selecting data based on following queries:
-- For the first function - not yet processed rows first, then the oldest;
-- rows of the same age are ordered by priority
SELECT id FROM data ORDER BY mtime IS NULL DESC, mtime, priority1 LIMIT 250 FOR UPDATE;
-- For the second function - only not yet processed rows, ordered by priority
SELECT id FROM data WHERE data2 IS NULL ORDER BY priority2 LIMIT 50 FOR UPDATE;
But what I am experiencing is that only one of the queries returns at a time; the other blocks until the first transaction finishes.
So my questions are:
Is it possible to acquire two separate locks in two separate transactions on separate bunch of rows (in the same table)?
Do I have that many collisions between the first and second query? (I have trouble debugging that; any hint on how to debug SELECT ... FROM (SELECT ...) WHERE ... IN (SELECT) would be appreciated.)
Can ORDER BY ... LIMIT ... cause any issues?
Can indexes and keys cause any issues?
Key things to check for before getting much further:
Ensure the table engine is InnoDB; otherwise "for update" isn't going to lock the rows, as there will be no transactions.
Make sure you're using the "for update" feature correctly. If you select something for update, it's locked to that transaction. While other transactions may be able to read the row, it can't be selected for update, updated or deleted by any other transaction until the lock is released by the original locking transaction.
To keep things clean, try explicitly starting a transaction using "START TRANSACTION", run your select "for update", do whatever you're going to do to the records that are returned, and finish up by explicitly executing a "COMMIT" to close out the transaction (a minimal sketch of that flow follows these points).
Order and limit will have no impact on the issue you're experiencing as far as I can tell, whatever was going to be returned by the Select will be the rows that get locked.
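A minimal sketch of that flow, reusing the first query from the question (the UPDATE inside is only a placeholder):
START TRANSACTION;
-- lock this worker's batch; rows already locked by another transaction will block here
SELECT id FROM data ORDER BY mtime IS NULL DESC, mtime, priority1 LIMIT 250 FOR UPDATE;
-- ... process the returned ids, e.g. UPDATE data SET data1 = ..., mtime = NOW() WHERE id = ... ;
COMMIT; -- releases the row locks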
To answer your questions:
Is it possible to acquire two separate locks in two separate transactions on separate bunch of rows (in the same table)?
Yes, but not on the same rows. A row-level lock can be held by only one transaction at a time.
Do I have that many collisions between the first and second query? (I have trouble debugging that; any hint on how to debug SELECT ... FROM (SELECT ...) WHERE ... IN (SELECT) would be appreciated.)
There could be a short period where the row lock is being calculated, which will delay the second query; however, unless you're running many hundreds of these select-for-updates at once, it shouldn't cause any significant or noticeable delays.
Can ORDER BY ... LIMIT ... cause any issues?
Not in my experience. They should work just as they always would on a normal select statement.
Can indexes and keys cause any issues?
Indexes should exist as always to ensure sufficient performance, but they shouldn't cause any issues with obtaining a lock.
All points in the accepted answer seem fine except for the following two:
"whatever was going to be returned by the Select will be the rows that get locked." &
"Can indexes and keys cause any issues?
but they shouldn't cause any issues with obtaining a lock."
In fact, all the rows that are internally read by the DB while deciding which rows to select and return will be locked. For example, the query below will lock all rows of the table but might select and return only a few:
select * from table where non_primary_non_indexed_column = ? for update
Since there is no index, the DB has to read the entire table to search for your desired rows, and hence it locks the entire table.
If you want to lock only one row, you either need to specify its primary key or use an indexed column in the WHERE clause. Indexing thus becomes very important when you want to lock only the appropriate rows.
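To illustrate with a hypothetical orders table (not from the question above):
-- locks only the matching row, because the lookup goes through the PRIMARY KEY
SELECT * FROM orders WHERE id = 42 FOR UPDATE;
-- locks every row it scans (potentially the whole table), because status has no index
SELECT * FROM orders WHERE status = 'pending' FOR UPDATE;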
This is a good reference - https://dev.mysql.com/doc/refman/5.7/en/innodb-locking-reads.html
Related
I have a table with more than 40 million records. I want to delete about 150,000 records with a SQL query:
DELETE
FROM t
WHERE date="2013-11-24"
but I get error 1206 (The total number of locks exceeds the lock table size).
I searched a lot and changed the buffer pool size:
innodb_buffer_pool_size=3GB
but it didn't work.
I also tried locking the table, but that didn't work either:
Lock Tables t write;
DELETE
FROM t
WHERE date="2013-11-24";
unlock tables;
I know one solution is to split the process of deleting, but I want that to be my last option.
I am using MySQL server, the server OS is CentOS, and the server RAM is 4 GB.
I'll appreciate any help.
You can use LIMIT on your DELETE and try deleting data in batches of, say, 10,000 records at a time:
DELETE
FROM t
WHERE date="2013-11-24"
LIMIT 10000
You can also include an ORDER BY clause so that rows are deleted in the order specified by the clause:
DELETE
FROM t
WHERE date="2013-11-24"
ORDER BY primary_key_column
LIMIT 10000
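Either form has to be re-run until nothing is left to delete; a rough sketch of the stopping condition, assuming the same table and date as above:
DELETE FROM t WHERE date="2013-11-24" ORDER BY primary_key_column LIMIT 10000;
SELECT ROW_COUNT(); -- number of rows just deleted; repeat the DELETE until this returns 0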
There are a lot of quirky ways this error can occur. I will try to list one or two, and perhaps the analogy will hold true for someone reading this at some point.
On larger datasets, even when changing innodb_buffer_pool_size to a larger value, you can hit this error when an adequate index is not in place to isolate the rows in the WHERE clause. Or, in some cases, with the primary index (see this) and the comment from Roger Gammans:
From the MySQL 5.0 documentation for InnoDB:
If you have no indexes suitable for your statement and MySQL must scan
the entire table to process the statement, every row of the table
becomes locked, which in turn blocks all inserts by other users to the
table. It is important to create good indexes so that your queries do
not unnecessarily scan many rows.
A visual of how this error can occur, and why it is difficult to solve, starts with this simple schema:
CREATE TABLE `students` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`thing` int(11) NOT NULL,
`campusId` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `ix_stu_cam` (`campusId`)
) ENGINE=InnoDB;
A table with 50 million rows. FKs are not shown; they are not the issue. The table was originally for showing query performance, which is also not important here. Yet, in initializing thing=id in blocks of 1M rows, I had to apply a limit during the block update to prevent other problems, by using:
update students
set thing=id
where thing!=id
order by id desc
limit 1000000 ; -- 1 Million
This was all well until it got down to, say, 600,000 rows left to update, as seen by
select count(*) from students where thing!=id;
The reason I was doing that count(*) was repeated occurrences of
Error 1206: The total number of locks exceeds the lock table size
I could keep lowering the LIMIT shown in the above update, but in the end I would be left with, say, 1200 unequal rows in the count, and the problem just continued.
Why did it continue? Because the system filled the lock table as it scanned this large table. Sure, in my mind it might have changed those last 1200 rows to equal within the implicit transaction, but because the lock table filled up, in reality it would abort the transaction with nothing set. And the process would stalemate.
Illustration 2:
In this example, let's say I have 288 rows of the 50 million row table shown above that could still be updated. Due to the end-game problem described, I would often find a problem running this query twice:
update students set thing=id where thing!=id order by id desc limit 200 ;
But I would not have a problem with these:
update students set thing=id where thing!=id order by id desc limit 200;
update students set thing=id where thing!=id order by id desc limit 88 ;
Solutions
There are many ways to solve this, including but not limited to:
A. The creation of another index on a column suggesting the data has been updated, perhaps a boolean, and incorporating that into the WHERE clause. Yet on huge tables, the creation of somewhat temporary indexes may be out of the question.
B. Populating a 2nd table with the ids yet to be cleaned could be another solution, coupled with an UPDATE ... JOIN pattern (see the sketch after this list).
C. Dynamically changing the LIMIT value so as to not cause an overrun of the lock table. The overrun can occur when there are simply no more rows to UPDATE or DELETE (your operation), the LIMIT has not been reached, and the lock table fills up in a fruitless scan for more rows that simply don't exist (seen above in Illustration 2).
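A rough sketch of option B, reusing the students table from above (pending_ids is just an assumed helper table name):
CREATE TABLE pending_ids (id INT NOT NULL PRIMARY KEY);
INSERT INTO pending_ids (id) SELECT id FROM students WHERE thing != id;
UPDATE students s JOIN pending_ids p ON p.id = s.id SET s.thing = s.id;
DROP TABLE pending_ids;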
The main point of this answer is to offer an understanding of why it is happening. And for any reader to craft an end-game solution that fits their needs (versus, at times, fruitless changes to system variables, reboots, and prayers).
The simplest way is to create an index on the date column. I had 170 million rows and was deleting 6.5 million rows. I ran into the same problem and solved it by creating a non-clustered index on the column I was using in the WHERE clause; then I executed the delete query and it worked.
Delete the index afterwards if you don't need it in the future.
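A sketch of that approach, assuming the table and date from the question (ix_t_date is just an assumed index name):
CREATE INDEX ix_t_date ON t (date);
DELETE FROM t WHERE date="2013-11-24";
DROP INDEX ix_t_date ON t; -- only if the index is not needed afterwards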
It is unclear to me (from reading the MySQL docs) whether the following query, run on InnoDB tables on MySQL 5.1, would create a WRITE LOCK for each of the rows the DB updates internally (5000 in total) or LOCK all the rows in the batch. As the database is under really heavy load, this is very important.
UPDATE `records`
INNER JOIN (
SELECT id, name FROM related LIMIT 0, 5000
) AS `j` ON `j`.`id` = `records`.`id`
SET `records`.`name` = `j`.`name`
I'd expect it to be per row, but as I do not know a way to make sure that it is, I decided to ask someone with deeper knowledge. If this is not the case and the DB would LOCK all the rows in the set, I'd be thankful if you could explain why.
The UPDATE is running in a transaction - it's an atomic operation, which means that if one of the rows fails (because of a unique constraint, for example) it won't update any of the 5000 rows. This is one of the ACID properties of a transactional database.
Because of this, the UPDATE holds a lock on all of the rows for the entire transaction. Otherwise another transaction could further update the value of a row based on its current value (let's say UPDATE records SET value = value * 2). That statement would produce a different result depending on whether the first transaction commits or rolls back, so it has to wait for the first transaction to complete all 5000 updates.
If you want to release the locks, just do the update in (smaller) batches.
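For example (a sketch only, using the tables from the question; an ORDER BY is added so the batches are deterministic):
UPDATE `records`
INNER JOIN (
SELECT id, name FROM related ORDER BY id LIMIT 0, 1000
) AS `j` ON `j`.`id` = `records`.`id`
SET `records`.`name` = `j`.`name`;
-- then repeat with LIMIT 1000, 1000 and so on, committing between batches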
P.S. autocommit controls whether each statement is issued in its own transaction, but it does not affect the execution of a single query.
My question is similar to:
Ignoring locked row in a MySQL query
except that I have already implemented logic close to what's suggested in the accepted answer. My question is how to set the process id initially. All servers run a query like this (the code is in Ruby on Rails, but the resulting MySQL query is):
UPDATE (some_table) SET process_id=(some process_id) WHERE (some condition on row_1) AND process_id is null ORDER BY (row_1) LIMIT 100
Now what happens is that all processes try to update the same rows; they get locked and they time out waiting for the lock. I would like the servers to ignore the rows that are locked (because after the lock is released the process_id won't be null anymore, so there is no point in waiting here).
I could try to randomize the batch of records to update but the problem is I want to prioritize the update based on row_1 as in the query above.
So my question is, is there a way in mysql to check if a record is locked and ignore it if it is?
No, there is no way to ignore already-locked rows. Your best bet will be to ensure that nothing locks any row for any extended period of time. That will ensure that any lock conflicts are very short in duration. That will generally mean "advisory" locking of rows by locking them within a transaction (using FOR UPDATE) and updating the row to mark it as "locked".
For example, first you want to find your candidate row(s) without locking anything:
SELECT id FROM t WHERE lock_expires IS NULL AND lock_holder IS NULL <some other conditions>;
Now lock only the row you want, very quickly:
START TRANSACTION;
SELECT * FROM t WHERE id = <id> AND lock_expires IS NULL AND lock_holder IS NULL FOR UPDATE;
UPDATE t SET lock_expires = <some time>, lock_holder = <me> WHERE id = <id>;
COMMIT;
(Technical note: If you are planning to lock multiple rows, always lock them in a specific order. Ascending order by primary key is a decent choice. Locking out-of-order or in random order will subject your program to deadlocks from competing processes.)
Now you can take as long as you want (less than lock_expires) to process your row(s) without blocking any other process (they won't match the row during the non-locking select, so will always ignore it). Once the row is processed, you can UPDATE or DELETE it by id, also without blocking anything.
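A variant of the same idea (my sketch, not part of the pattern above): let the claiming UPDATE re-check the lock columns itself and then test how many rows it actually changed:
UPDATE t SET lock_expires = <some time>, lock_holder = <me> WHERE id = <id> AND lock_expires IS NULL AND lock_holder IS NULL;
SELECT ROW_COUNT(); -- 1 means this process claimed the row; 0 means another process got there first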
If I SELECT IDs and then UPDATE using those IDs, the UPDATE query is faster than if I UPDATE using the same conditions as in the SELECT.
To illustrate:
SELECT id FROM table WHERE a IS NULL LIMIT 10; -- 0.00 sec
UPDATE table SET field = value WHERE id IN (...); -- 0.01 sec
The above is about 100 times faster than an UPDATE with the same conditions:
UPDATE table SET field = value WHERE a IS NULL LIMIT 10; -- 0.91 sec
Why?
Note: the a column is indexed.
Most likely the second UPDATE statement locks many more rows, while the first one uses the unique key and locks only the rows it's going to update.
The two queries are not identical. You only know that the IDs are unique in the table.
UPDATE ... LIMIT 10 will update at most 10 records.
UPDATE ... WHERE id IN (SELECT ... LIMIT 10) may update more than 10 records if there are duplicate ids.
I don't think there can be one straightforward answer to your "why?" without doing some sort of analysis and research.
The SELECT queries are normally cached, which means that if you run the same SELECT query multiple times, the execution time of the first query is normally greater than the following queries. Please note that this behavior can only be experienced where the SELECT is heavy and not in scenarios where even the first SELECT is much faster. So, in your example it might be that the SELECT took 0.00s because of the caching. The UPDATE queries are using different WHERE clauses and hence it is likely that their execution times are different.
Though the column a is indexed, it is not necessarily the case that MySQL uses the index when doing the SELECT or the UPDATE. Please study the EXPLAIN output. Also, see the output of SHOW INDEX and check whether the "Comment" column reads "disabled" for any indexes. You may read more here - http://dev.mysql.com/doc/refman/5.0/en/show-index.html and http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html.
Also, if we ignore the SELECT for a while and focus only on the UPDATE queries, it is obvious that they aren't both using the same WHERE condition - the first one runs on the id column and the latter on a. Though both columns are indexed, that does not necessarily mean that all the table's indexes perform alike. One index can be more efficient than another depending on the size of the index, the datatype of the indexed column, or whether it is a single- or multiple-column index. There sure might be other reasons, but I ain't an expert on it.
Also, I think that the second UPDATE is doing more work in the sense that it might be taking more row-level locks compared to the first UPDATE. It is true that both UPDATEs finally update the same number of rows. But whereas in the first update it is 10 rows that get locked, I think in the second UPDATE all rows with a NULL a (which is more than 10) are locked before doing the UPDATE. Perhaps MySQL first applies the locking and then runs the LIMIT clause to update only a limited number of records.
Hope the above explanation makes sense!
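If it helps, the locking difference guessed at above can usually be checked from the execution plans (a sketch; EXPLAIN accepts UPDATE statements from MySQL 5.6 on, and table/field/value are the placeholders used earlier):
EXPLAIN UPDATE `table` SET field = value WHERE id IN (...);
EXPLAIN UPDATE `table` SET field = value WHERE a IS NULL LIMIT 10;
SHOW INDEX FROM `table`;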
Do you have a composite index or separate indexes?
If it is a composite index of the id and a columns,
in the 2nd update statement the a column's part of the index would not be used. The reason is that only the left-most prefix of the index is used (unless a is the PRIMARY KEY).
So if you want the a column's index to be used, you need to include id in your WHERE clause as well, with id first and then a.
Also, it depends on what storage engine you are using, since MySQL handles indexes at the engine level, not at the server level.
You can try this:
UPDATE table SET field = value WHERE id IN (...) AND a IS NULL LIMIT 10;
By doing this, id is the left-most column of the index, followed by a.
Also, regarding your comments: the lookups are much faster because, if you are using InnoDB, updating indexed columns means that the InnoDB storage engine may have to move index entries to a different page node, or split a page if the page is already full, since InnoDB stores indexes in sequential order. This process is VERY slow and expensive, and it gets even slower if your indexes are fragmented or if your table is very big.
The comment by Michael J.V is the best description. This answer assumes a is a column that is not indexed and 'id' is.
The WHERE clause in the first UPDATE command is working off the primary key of the table, id
The WHERE clause in the second UPDATE command is working off a non-indexed column. This makes the finding of the columns to be updated significantly slower.
Never underestimate the power of indexes. A table will perform better if the indexes are used correctly than a table a tenth the size with no indexing.
Regarding "MySQL doesn't support updating the same table you're selecting from"
UPDATE table SET field = value
WHERE id IN (SELECT id FROM table WHERE a IS NULL LIMIT 10);
Just do this:
UPDATE table SET field = value
WHERE id IN (select id from (SELECT id FROM table WHERE a IS NULL LIMIT 10));
The accepted answer seems right but is incomplete; there are major differences.
As far as I understand it (and I'm not a SQL expert):
In the first query you SELECT N rows and UPDATE them using the primary key.
That's very fast, as you have direct access to all rows via the fastest possible index.
In the second query you UPDATE N rows using LIMIT.
That will lock all the rows and release them again after the update is finished.
The big difference is that you have a RACE CONDITION in case 1) and an atomic UPDATE in case 2)
If you have two or more simultaneous calls of the case 1) query, you'll have the situation that you select the SAME ids from the table.
Both calls will update the same IDs simultaneously, overwriting each other.
This is called "race condition".
The second case avoids that issue: MySQL will lock all the rows during the update.
If a second session is doing the same command, it will have to wait until the rows are unlocked.
So no race condition is possible, at the expense of lost time.
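If you want case 1) without the race, a common fix (sketched here, not part of the answer above) is to lock the selected rows inside a single transaction:
START TRANSACTION;
SELECT id FROM `table` WHERE a IS NULL LIMIT 10 FOR UPDATE; -- locks just these rows
UPDATE `table` SET field = value WHERE id IN (...); -- the ids returned above
COMMIT;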
If an INSERT and a SELECT are done simultaneously on a MySQL table, which one will go first?
Example: Suppose the "users" table row count is 0.
Then these two queries are run at the same time (assume it's at the same milli/microsecond):
INSERT into users (id) values (1)
and
SELECT COUNT(*) from users
Will the last query return 0 or 1?
Depends whether your users table is MyISAM or InnoDB.
If it's MyISAM, one statement or the other takes a lock on the table, and there's little you can do to control that, short of locking tables yourself.
If it's InnoDB, it's transaction-based. The multi-versioning architecture allows concurrent access to the table, and the SELECT will see the count of rows as of the instant its transaction started. If there's an INSERT going on simultaneously, the SELECT will see 0 rows. In fact you could even see 0 rows by a SELECT executed some seconds later, if the transaction for the INSERT has not committed yet.
There's no way for the two transactions to start truly simultaneously. Transactions are guaranteed to have some order.
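A sketch of what the reading session sees under InnoDB's default REPEATABLE READ isolation (session labels are mine):
-- session B
START TRANSACTION;
SELECT COUNT(*) FROM users; -- 0; the snapshot is taken at this first read
-- session A runs and commits in the meantime:
INSERT INTO users (id) VALUES (1);
-- session B again
SELECT COUNT(*) FROM users; -- still 0 inside the same transaction
COMMIT;
SELECT COUNT(*) FROM users; -- 1, once a new snapshot is taken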
It depends on which statement is executed first. If the INSERT executes first, the SELECT will return 1; if the SELECT executes first, it will return 0. Even if you are executing them on a computer with multiple physical cores, due to the locking mechanism they will never execute at exactly the same time stamp.