I have a MySQL table of Users, and a table of Actions performed by the Users (linked to the relevant User by its primary key, userid). The Actions table has an auto-incrementing key, indx. Whenever I add a new row to the Actions table, I then update the latest column of the relevant Users row with the indx of the row I just added. So something like:
INSERT INTO actions(indx,actionname,userid) VALUES(default, "myaction", 1);
UPDATE users SET latest=LAST_INSERT_ID() WHERE userid=1;
The idea being that I can check for updates for a User by seeing if latest is higher than the last time I checked.
My issue is that if more than one connection is open on the database and both try to add an Action for the same User at the same time, connection 2 could conceivably run its INSERT and UPDATE between the INSERT and UPDATE of connection 1, and the latest entry of the User they're both trying to update would no longer hold the indx of the most recent Action row.
I've been reading up on transactions, isolation levels, etc., but haven't really found a way around this (though my understanding of how these work is pretty shaky, so maybe I just misunderstood). I think I need a way to lock the Actions table until the Users table is updated. This application only gets used by a few hundred users at most, so I don't think the performance hit from momentarily locking the table will be too bad.
So is that something that can be done in MySQL? Is there a better solution? I imagine this general pattern must be pretty common: one table with many varieties of rows, and a second table that tracks metadata for each variety and needs to be updated atomically each time the first table changes. So I'm hoping there's a solution that isn't too complex.
Use SELECT ... FOR UPDATE to lock the row, in order to serialize access and prevent race conditions:
START TRANSACTION;
SELECT any_column FROM users WHERE userid=1 FOR UPDATE;
INSERT INTO actions(indx,actionname,userid) VALUES(default, "myaction", 1);
UPDATE users SET latest=LAST_INSERT_ID() WHERE userid=1;
COMMIT;
However, this will slow down your INSERT rate, because these transactions will be serialized across all sessions.
The better option is to not store the last ID in the users table at all. Just use SELECT MAX(indx) FROM actions WHERE userid = xxxx wherever this number is required. With an index on actions(userid) this query will be very fast (since indx is the primary key of the table), and the inserts will not be slowed down.
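A minimal sketch of that approach, assuming indx is the primary key of actions:

-- With InnoDB, a secondary index on userid implicitly includes the
-- primary key (indx), so this MAX() is resolved from the index alone.
CREATE INDEX idx_actions_userid ON actions (userid);

-- No users.latest column to maintain; derive the latest action on demand.
SELECT MAX(indx) AS latest FROM actions WHERE userid = 1;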
Related
I've been searching the internet for a couple of hours now and I'm not sure how to resolve this at all. Brief description: a customer posts orders to our system, and they can supply a Customer Reference that our system will reject if that Customer Reference already exists.
I can't make the column UNIQUE in MySQL, as different clients sometimes use the same Customer Reference, and since we don't require the Customer Reference it's sometimes just left blank.
Originally I was just checking whether the Customer Reference existed and then inserting the row if it did not. This works in 99.99% of cases, but I have a client that mass-sends orders, and those sometimes contain duplicates. Because they post so quickly, the SELECT for one request can happen before the INSERT of another, and duplicates arise.
I've switched to the code below (shortened for the example; this only runs if customerReference is not blank):
INSERT INTO ordersTable (clientID, customerReference, deliveryName)
SELECT clientID, customerReference, deliveryName
FROM (SELECT 'clientID' AS clientID, 'customerReference' AS customerReference, 'deliveryName' AS deliveryName) t
WHERE NOT EXISTS (SELECT 1 FROM ordersTable u
                  WHERE u.customerReference = t.customerReference
                    AND u.clientID = t.clientID);
This ends in deadlocks for any process that runs after the original row is inserted, and I was hoping to avoid deadlocks.
My options, it seems, are:
Live with the deadlocking, since I know that if it deadlocks the row already exists, and instead of checking affected_rows == 0 check affected_rows <= 0.
Add a column holding a unique hash per order, derived from the Client ID and Customer Reference, and then do an INSERT IGNORE against a unique index on that column.
I wasn't too confident in either solution so I thought it couldn't hurt to ask for advice first.
Have you tried using a transaction with a unique constraint on the uniqueID and clientID columns? This will prevent duplicates from being inserted, and you can catch the error that is raised when a duplicate insert is attempted and handle it as needed.
INSERT INTO ordersTable (clientID,uniqueID,deliveryName)
VALUES ('clientID', 'uniqueID', 'deliveryName')
ON DUPLICATE KEY UPDATE deliveryName = VALUES(deliveryName);
Alternatively, you can use the INSERT IGNORE statement. This tells the server to insert the new record, but if doing so would violate a UNIQUE index or PRIMARY KEY, to ignore the error and not insert it.
INSERT IGNORE INTO ordersTable (clientID,uniqueID,deliveryName)
VALUES ('clientID', 'uniqueID', 'deliveryName');
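Note that both statements only detect duplicates if a unique key actually covers the two columns. A sketch of adding one (the constraint name is illustrative):

-- Composite unique key; INSERT IGNORE and ON DUPLICATE KEY UPDATE
-- both rely on this (or a similar constraint) to spot the duplicate.
ALTER TABLE ordersTable
  ADD CONSTRAINT uq_client_reference UNIQUE (clientID, uniqueID);

One caveat for blank references: a MySQL unique index permits multiple NULLs but only one empty string per clientID, so blank Customer References would need to be stored as NULL for this to work.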
Short version
Would someone provide an example of this? There are 3 SQL tables. Using INSERT ... SELECT, take data from table 1 and insert it into table 2. Then INSERT rows into table 3, using the auto-increment id of each table-2 row created by that INSERT ... SELECT statement.
INSERT ... SELECT creates multiple rows, but you cannot obtain their auto-increment IDs for use in a subsequent INSERT statement.
Expanded version
I'm looking for an efficient way to use the auto increment IDs, created from an INSERT ... SELECT, in a second INSERT.
Imagine this scenario in a warehouse.
The warehouse receives a pallet of goods from a supplier. The pallet contains multiple individual items, which must be dispatched to different customers. The pallet is booked in, broken down and checked. Each item is then assigned to the correct customer and marked as "ready". At this point, each item is dispatched with the dispatch status recorded per customer. Each Customer's account balance is reduced by a given value based on the item.
The issue is linking the account reduction to the item dispatch. There are 3 tables:
GoodsIn: records the pallet arrival from the supplier
CREATE TABLE GoodsIn ('InID', 'CustomerID', 'ItemSKU_ID', 'HasBeenChecked')
GoodsOut: records the SKU dispatch to the Customer
CREATE TABLE GoodsOut ('OutID', 'CustomerID', 'ItemSKU_ID', 'DateDispatched')
Ledger: records each Customer transaction/balance
CREATE TABLE Ledger ('LedgerID', 'BalanceClose', 'AdjustmentAmount', 'CustomerID', 'ActionID')
(I've massively simplified this - please accept that GoodsIn and GoodsOut cannot be combined)
When an SKU is marked as ready for dispatch, I can use the following to automatically update the Ledger balance, taking the last balance row per customer and updating it:
INSERT INTO Ledger (BalanceClose, AdjustmentAmount, CustomerID)
SELECT Ledger.BalanceClose +
(SELECT @Price := ItemSKUData.ItemPrice FROM ItemSKUData WHERE ItemSKUData.ItemSKU_ID = GoodsIn.ItemSKU_ID) AS NEWBALANCECLOSE,
@Price AS ADJUSTMENTAMOUNT,
Ledger.CustomerID
FROM Ledger
INNER JOIN GoodsIn ON GoodsIn.CustomerID = Ledger.CustomerID
WHERE GoodsIn.HasBeenChecked = TRUE
AND Ledger.LedgerID IN (SELECT MAX(Ledger.LedgerID) FROM Ledger GROUP BY Ledger.CustomerID)
This all works absolutely fine - I get a new Ledger row, with the updated BalanceClose, for each GoodsIn row where GoodsIn.HasBeenChecked = TRUE. Each of these Ledger rows gets an auto-increment Ledger.LedgerID on INSERT.
I can then do pretty much the same code to INSERT into the GoodsOut table. Again as with Ledger, GoodsOut.OutID is an auto-increment ID.
I now need to link those Ledger rows (Ledger.ActionID) to the GoodsOut.OutID. This is the purpose of Ledger.ActionID - it needs to map to each GoodsOut.OutID, so that the reduction of the Ledger balance is linked to the action of sending the goods to the customer.
In theory, if this were a single INSERT and not an INSERT ... SELECT, I would simply take LAST_INSERT_ID() from the GoodsOut insert and use it in the INSERT INTO Ledger.
But because I'm using an INSERT ... SELECT, I can't get the auto-increment ID of each row.
The only way I can see to do this is to use a dummy column in the GoodsOut table, and store the GoodsIn.InID in it. I could then get the GoodsOut.OutID using a WHERE in the INSERT ... SELECT for the Ledger.
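Concretely, that workaround might look like the following, where SourceInID is an assumed extra column on GoodsOut; the Ledger INSERT ... SELECT could then join GoodsOut on SourceInID to pick up each OutID:

-- SourceInID (assumed column) carries the originating GoodsIn row,
-- so a later statement can find the matching GoodsOut.OutID.
INSERT INTO GoodsOut (CustomerID, ItemSKU_ID, DateDispatched, SourceInID)
SELECT CustomerID, ItemSKU_ID, NOW(), InID
FROM GoodsIn
WHERE HasBeenChecked = TRUE;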
It doesn't feel very elegant or safe, though.
So this is my question. I need to link table A to table B using table B's auto-increment ID, when all rows in BOTH table A and table B are created using INSERT ... SELECT.
You're right: when you do INSERT ... SELECT for batch inserts, you don't have easy access to the auto-increment IDs. LAST_INSERT_ID() returns only the first ID generated.
One documented behavior of bulk inserts is that the generated IDs are guaranteed to be consecutive, because bulk inserts hold an auto-increment table lock until the end of the statement.
https://dev.mysql.com/doc/refman/5.7/en/innodb-auto-increment-handling.html says:
innodb_autoinc_lock_mode = 1 (“consecutive” lock mode)
This is the default lock mode. In this mode, “bulk inserts” use the special AUTO-INC table-level lock and hold it until the end of the statement. This applies to all INSERT ... SELECT, REPLACE ... SELECT, and LOAD DATA statements. Only one statement holding the AUTO-INC lock can execute at a time.
This means that if you know the first value generated, the number of rows inserted (which you should be able to get from ROW_COUNT()), and the order in which the rows were inserted, then you can reliably know all the IDs generated.
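For example, a sketch using the GoodsOut insert from the question (this assumes innodb_autoinc_lock_mode = 1 and the columns shown):

INSERT INTO GoodsOut (CustomerID, ItemSKU_ID, DateDispatched)
SELECT CustomerID, ItemSKU_ID, NOW()
FROM GoodsIn
WHERE HasBeenChecked = TRUE;

SET @n = ROW_COUNT();              -- rows inserted by the statement above
SET @first_id = LAST_INSERT_ID();  -- first auto-increment ID it generated
-- The new GoodsOut.OutID values are @first_id .. @first_id + @n - 1,
-- in the order the SELECT produced the rows.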
The MySQL JDBC driver relies on this, for example. When you do a bulk insert, the full list of generated IDs is not returned to the client (that is, the JDBC driver), but the driver has a Java method to return the full list. This works by inferring the values in Java code, assuming they are consecutive.
We have various tables to represent various types of data. Each table has a corresponding revisions table to track history of this data. Each revision (entry in a revisions table) has a unique number. This number is stored in a change metadata table. Each of these tables references a parent_id. Before we make any changes to the tables we lock the parent row with SELECT … FOR UPDATE.
After making an update/insert we also increment the change number and write that number to the change metadata table. To do so we do a SELECT MAX on the change metadata number and then increment it.
The issue we’re seeing is that somehow a transaction is getting an old change number from the select max statement. To illustrate:
Transaction 1:
START TRANSACTION
lock with FOR UPDATE
do stuff...
Get Latest Change Number (9)
Insert Revision with Number 10
COMMIT
Transaction 2:
START TRANSACTION
lock with FOR UPDATE
do stuff...
Get Latest Change Number (7)
Insert Revision with Number 8
COMMIT
This causes the revision insert for transaction 2 to fail, as the change number is a unique key. I'm leaning towards it being an issue with repeatable reads, but I'm not sure how the old data can persist across transactions in this way. For each transaction there's a START TRANSACTION statement, and then the parent id is immediately locked with FOR UPDATE. We have a high-traffic site with multiple concurrent transactions, so it's possible many are waiting on the lock at any one time. I'd be happy to clarify any point and would appreciate any insight anyone could offer.
That "SELECT MAX on the change metadata number" needs FOR UPDATE, too.
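For example (table and column names are assumed, not from the original post):

START TRANSACTION;
-- Locking read: concurrent transactions block here until this one
-- commits, so no two transactions can read the same maximum.
SELECT MAX(change_number) FROM change_metadata WHERE parent_id = 42 FOR UPDATE;
-- ... insert the revision with that maximum + 1, then COMMIT.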
Another approach:
Have a "sequence number generator" table.
CREATE TABLE Sequence (
pk TINYINT NOT NULL,
seq INT UNSIGNED NOT NULL AUTO_INCREMENT,
PRIMARY KEY(pk), -- For ON DUPLICATE KEY UPDATE
INDEX(seq) -- Sufficient for AUTO_INCREMENT
);
The only action (once initialized) should be
INSERT INTO Sequence (pk, seq) VALUES (1, 0)
ON DUPLICATE KEY UPDATE seq = LAST_INSERT_ID(seq+1);
That will update the one row atomically. Then (in the same connection), do this to get the new seq:
SELECT LAST_INSERT_ID();
That statement is tied to the connection, so there is no chance of someone else getting your number.
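Putting the two statements together, a usage sketch (the revisions table and its columns are illustrative):

-- Draw the next number atomically...
INSERT INTO Sequence (pk, seq) VALUES (1, 0)
ON DUPLICATE KEY UPDATE seq = LAST_INSERT_ID(seq + 1);

-- ...then, in the same connection, use it as the unique change number.
INSERT INTO revisions (change_number, parent_id, payload)
VALUES (LAST_INSERT_ID(), 42, 'new revision data');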
I have two scripts; one of them inserts rows into the database, and the other processes newly entered, so-far-unprocessed rows.
CREATE TABLE `table` (id INT NOT NULL PRIMARY KEY AUTO_INCREMENT, col1 VARCHAR(32), col2 VARCHAR(32));
So the first script does several separate insert queries:
INSERT INTO `table` (id, col1, col2) VALUES (0, 'val1_1', 'val1_2');
INSERT INTO `table` (id, col1, col2) VALUES (0, 'val2_1', 'val2_2');
INSERT INTO `table` (id, col1, col2) VALUES (0, 'val3_1', 'val3_2');
...
Then the second script uses something like this to select the unprocessed rows:
SELECT * FROM `table` WHERE id > (SELECT MAX(id) FROM table_processed) ORDER BY id LIMIT 1000;
(do some processing)
(for each id processed from table: INSERT INTO table_processed (id) VALUES ({table.id});)
Sometimes the first script needs to insert something like 5000 rows. I noticed at least one instance where the processing script seemed to skip over many of the rows (it skipped about 3000 of them), and I'm wondering what could cause this and how to prevent it (if it skips them once, then on subsequent runs it will continue to skip them, since the query uses > MAX(id)).
Or is this not supposed to happen? (In which case I guess it'd have to be an error in the second script's query.)
If two insert transactions are running, and the later transaction (which gets higher auto-incremented ids) commits first, those higher auto-increment ids become visible to other transactions (e.g. your processing one) before the lower ones, which may still belong to a not-yet-committed transaction, or possibly even a rolled-back one. Every INSERT draws an id from the global sequence, so those two transactions need not even hold one contiguous range of ids each; they can end up with a striped use of the range. A good rule is to never rely on either the order or the value of auto-incremented ids; use them only as identifiers.
The most obvious solutions are:
Do not use that MAX(id); instead, LEFT JOIN table to table_processed and select the rows not yet present in table_processed (see the sketches below). This may be heavy on the selecting side, though.
Let the INSERTs take an exclusive LOCK on the table (undesirable in busy scenarios; you already seem to have multiple concurrent INSERTs).
Let the INSERTs set an indexed processed=0 column (possibly just as the column's default value, so you can omit it in the insert), and simply SELECT ... FROM table WHERE processed=0, setting it to 1 when done (see the sketches below).
A tempting mistake is to say: OK, I'll just COMMIT after every single insert so that each transaction finishes as soon as possible. That is still vulnerable to the race condition above, so don't rely on it.
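Sketches of the first and third options, with assumed column and index names:

-- Option 1: anti-join; no MAX(id) and no flag column.
SELECT t.*
FROM `table` t
LEFT JOIN table_processed p ON p.id = t.id
WHERE p.id IS NULL
ORDER BY t.id
LIMIT 1000;

-- Option 3: an indexed flag column, defaulting to "unprocessed".
ALTER TABLE `table`
  ADD COLUMN processed TINYINT NOT NULL DEFAULT 0,
  ADD INDEX idx_processed (processed);

-- The processing script sees only committed, unprocessed rows...
SELECT id, col1, col2 FROM `table` WHERE processed = 0 ORDER BY id LIMIT 1000;

-- ...and marks each row done once it has been handled.
UPDATE `table` SET processed = 1 WHERE id = 1;  -- id of a row just processed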
I'm converting a webapp from MySQL to SQL Server. Now I want to convert the following code (this is a simplified version):
LOCK TABLES media WRITE, deleted WRITE;
INSERT INTO deleted (stuff) SELECT stuff FROM media WHERE id=1 OR id=2;
DELETE FROM media WHERE id=1 OR id=2;
UNLOCK TABLES;
Because I'm copying the rows that are about to be deleted, I want to make sure any reads of 'media' or 'deleted' wait until this whole operation is finished. Otherwise those reads could see rows that won't be there a second later.
How can I replicate this behavior in SQL Server? I've read some pages on transactions and isolation levels, but I can't figure out whether I can block reads of the 'media' and 'deleted' tables (or do so at row level).
Thanks!
You could use lock hints in your query. If you specify a table lock and hold it until the end of the transaction, that should be equivalent.
BEGIN TRANSACTION;
INSERT INTO deleted (stuff)
SELECT stuff FROM media WITH (TABLOCK, HOLDLOCK)
WHERE id = 1 OR id = 2;
DELETE FROM media WHERE id = 1 OR id = 2;
COMMIT;
As a DB-agnostic approach, you might consider having a "deleted" or "inactive" column indicating whether or not results should be returned to users. For example, you could use an integer for the column, excluding the record from the user's view if its value is not zero. So, instead of the select and insert above, you could do (all examples are in the MySQL SQL dialect):
UPDATE media SET inactive=1 WHERE id=1 OR id=2;
This would exclude the records from user view. You could then copy the inactive records to the "deleted" table and delete them from the media table if desired, based on the time you last updated inactive records:
INSERT INTO deleted (stuff) SELECT stuff FROM media WHERE inactive = 1;
DELETE FROM media WHERE inactive = 1;
The integer could be used to identify the "inactive job" that "deleted" the records.
Based on how you've described the schema, this scenario doesn't quite match the locking approach, because the "media" table could be modified during execution of the UPDATE statement. That could be solved (or at least mitigated) by a column, such as a timestamp, that further restricts which records get marked inactive.