How to avoid duplicate entries on INSERT in MySQL?

My application generates the ID number when registering a new customer and then inserts it into the customer table.
The method for generating the ID is to read the last ID number, increment it by one, and then insert the new row into the table.
The application will be used in a network environment with more than 30 users, so there is a possibility (a probability, even) that at least two users read the same last ID number at the saving stage, which means both will get the same ID number.
Also, I'm using transactions. I need a logical solution that I couldn't find on other sites.
Please reply with a description so I can understand it very well.

Use an AUTO_INCREMENT column; you can get the last id issued with LAST_INSERT_ID() (mysql_insert_id in PHP).
If for some reason that's not doable, you can create another table to hold the last id used, then increment that in a transaction, and then use it as the key for your insert into the main table. It has to be two transactions though, otherwise you'll have the same issue you have now. That can get messy and is an extra level of maintenance. (Reset your next-id table to zero while there are still rows in the related table and things go wrong very quickly.)
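A minimal sketch of that counter-table idea, with assumed names (the customer_seq table, its last_id column, and the customer columns are made up for illustration):
-- one-row table holding the last id handed out
CREATE TABLE customer_seq (last_id INT NOT NULL);
INSERT INTO customer_seq VALUES (0);

-- claim the next id in its own short transaction; the row lock serialises concurrent claimers
START TRANSACTION;
UPDATE customer_seq SET last_id = LAST_INSERT_ID(last_id + 1);
COMMIT;

-- then use the claimed value (remembered per connection) in the second transaction that inserts the customer
INSERT INTO customer (customer_id, name) VALUES (LAST_INSERT_ID(), 'Example Name');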
Short of putting an exclusive lock on the table during the insert operation (not even slightly recommended), your current solution just can't work.
Okay expanded answer based on leaving schema as it is.
Option 1, in pseudo code:
StartTransaction
try
    NextId = GetNextId(...)
    AddRecord(NextId, ...)
    commit transaction
catch Primary Key Violation
    rollback transaction
    do the entire thing again
end
Obviously you could end up in an infinite loop here; it's unlikely but possible, and you'd probably run out of stack space first.
You could somehow queue the requests and then attempt to process them, removing each one from the queue on success.
BUT if you make customerid an AUTO_INCREMENT column, the entire problem disappears.
It will still be the primary key; you just don't have to work out what it needs to be any more. In fact you don't supply it in the insert statement at all; MySQL will take care of it for you.
The only thing you have to remember, if you need the id that has just been automatically created, is to request it in the same transaction (on the same connection).
So your insert query needs to be in the form
INSERT INTO SomeTable (SomeColumns) VALUES (SomeValues);
SELECT LAST_INSERT_ID();
or, if issuing multiple statements gets in the way, wrap the two statements in a START TRANSACTION ... COMMIT pair.
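A minimal sketch of the whole thing, with assumed table and column names (only customerid comes from the question; the rest is illustrative):
-- let MySQL assign customerid itself
CREATE TABLE customer (
  customerid INT NOT NULL AUTO_INCREMENT,
  name VARCHAR(100) NOT NULL,
  PRIMARY KEY (customerid)
);

START TRANSACTION;
-- no id supplied; MySQL generates it
INSERT INTO customer (name) VALUES ('Example Name');
-- the generated id, valid for this connection
SELECT LAST_INSERT_ID();
COMMIT;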

Related

How do I decrease an Integer in MySQL?

I have a server which uses a MySQL database for storing users. I have an integer as the primary key, with auto-increment.
The problem is that when registration fails (on the website provided by my server) the integer still increases by 1, which means: if a user succeeds in signing up, the user will get the id 1, just as it should be. However, if the next user fails to register (the username already being taken, for example) and then succeeds on a second attempt, that user will get the id 3, because the failed attempt consumed id 2.
I am looking for syntax like: UserId-- or UserId = Userid - 1.
Auto-increment values are not intended to be consecutive. They must be unique, that's all.
If you try to decrement the auto-increment value on error, you'll create a race condition in your app. For example:
Your session tries to register a user. Suppose this generates id 41.
A second session running at the same time also tries to register a user. This generates id 42.
Your session returns an error, because the username you tried to register already exists. So you try to force MySQL to decrement the auto-increment.
A third session registers another user. Because of the forced decrement it gets id 41, and the auto-increment advances so the next registration will use id 42.
The next session tries to register with id 42, but this id has already been used, resulting in mass hysteria, the stock market crashes, and dogs and cats start living together.
Lesson: Don't get obsessed with ids being consecutive. They're bound to have gaps from time to time: an insert fails, you roll back a transaction, you delete a record, and so on. There are even some bugs in MySQL that cause auto-increment to skip numbers sometimes. But there's no actual consequence to those bugs.
The way to avoid the race condition described above is that auto-increment must not decrement, even if an insert fails. It only increases. This does result in "lost" values sometimes.
I helped a company in exactly the same situation you are in, where registration of new usernames caused errors if the username was already taken. In their case, we found that someone using their site was creating a new account every day, and they were automating it. They would try "user1" and get an error, then try "user2" and "user3" and so on, until they found one that was available. But it was causing auto-increment values to be discarded, and every day the gap became larger and larger. Eventually, they ran out of integers, because the gaps were losing 1500+ id values for each registration.
(The site was an online gambling site, and we guessed that some user was trying to create daily throwaway accounts, and they had scripted a loop to register user1 through userN, but they start over at 1 each day, not realizing the consequences for the database.)
I recommended to them this fix: Change the registration code in their app to SELECT for the username first. If it exists, then return a friendly warning to the user and ask them to choose another name to register. This is done without attempting the INSERT.
If the SELECT finds there is no username, then try the INSERT, but be ready to handle the error anyway, in case some other session "stole" that username in the moment between the SELECT and the INSERT. Hopefully this will be rare, because it's a slim chance for someone to sneak their registration in between those two steps.
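A rough sketch of that flow (the users table, the idx_username index name, and the example username are just illustrative):
-- the unique index is what actually guarantees no duplicate usernames
ALTER TABLE users ADD UNIQUE INDEX idx_username (username);

-- friendly check first, without attempting the insert
SELECT 1 FROM users WHERE username = 'alice';

-- only if no row came back: try the insert, but still be prepared for a
-- duplicate-key error (1062), in case another session grabbed 'alice' in between
INSERT INTO users (username) VALUES ('alice');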
In any case, do not feel obliged to have consecutive id values.

Selecting and updating a row while dealing with race conditions?

We have a table of elements that can be issued to clients. These elements can only ever be given to a client once, and we have situations where many clients could be pulling elements all at the same time. We then need to return data associated with it (so there is an update, and then a select).
The current solution is that a random one is found and updated to issued=true, setting its id as LAST_INSERT_ID(); then immediately afterwards it makes the select call with where('id = LAST_INSERT_ID()'), which is unique per connection.
Since we're updating issued=false to issued=true and capturing the id via LAST_INSERT_ID() in the same single statement, that one call is small enough not to run into race condition issues.
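In raw SQL that trick looks roughly like this (assuming an elements table with id and issued columns):
-- claim one unissued element and remember its id for this connection
UPDATE elements
SET issued = true, id = LAST_INSERT_ID(id)
WHERE issued = false
LIMIT 1;

-- then read back the claimed row; LAST_INSERT_ID() is per connection
SELECT * FROM elements WHERE id = LAST_INSERT_ID();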
But all this is being done in raw SQL and feels very hackish. This doesn't seem like such a rare problem that there wouldn't be a more Railsy solution. Wrapping it in a transaction might work to prevent double-issues, but then we'd need retry logic in case the transaction failed.
What solution are we not thinking of?
You will want to use database-level locking to avoid race conditions.
One way to do this in MySQL is SELECT FOR UPDATE like this:
SELECT * FROM elements WHERE issued=false LIMIT 1 FOR UPDATE
In ActiveRecord (Rails), this is called pessimistic locking, and an implementation would look like this:
Element.transaction do
  element = Element.lock(true).where(issued: false).first
  element.issued = true
  # ... do other stuff to assign to a given client
  element.save!
end
If that got kicked off more than once at the same time, the 2nd call would be blocked until the first call finished, so by the time it executed, the first record would already be updated to issued=true and the 2nd call would return the next record instead of the same record.
You can read about SELECT ... FOR UPDATE in the MySQL documentation on locking reads.

Locking Mysql Transaction

We have a table (say, child) that has a relation to another table (say, parent). In a perfect world, it will always have a parent row, and sometimes a child row. It should never have more than one child row, but it may in the future (so a Unique index is not suitable long-term).
Right now, we use transactions and lock the rows. However, because MySQL bases each transaction's view of the table on the point in time at which the transaction starts, each transaction (if one starts before the other commits) is able to create its own row. Then each insert takes effect and we end up with two rows. I think each transaction locks only its own new row, and until it commits, that row is hidden from the other thread. Basically, the transactions have a chicken-and-egg type of problem.
How can we enforce a policy of at most one row? We can add a unique index, and the second insert will fail once the first transaction commits. But then we remove the ability to add multiple rows in the future (when one parent could have two children), which is problematic.
This has to be solved somehow. I just don't know how personally.
Edit 1: Updated Schema information (I'm using a job schema to represent the problem)
Table: job (the "Parent")
job_id
job_title
job_payment
Table: job_assignment (the "Child")
job_id
user_id (assigned worker)
est_hours
opt_insurance
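Roughly, in SQL (the column types and key definitions below are assumptions, added only to make the schema concrete):
CREATE TABLE job (
  job_id INT NOT NULL AUTO_INCREMENT,
  job_title VARCHAR(255) NOT NULL,
  job_payment DECIMAL(10,2) NOT NULL,
  PRIMARY KEY (job_id)
);

CREATE TABLE job_assignment (
  job_id INT NOT NULL,            -- the "parent" job
  user_id INT NOT NULL,           -- assigned worker
  est_hours DECIMAL(6,2),
  opt_insurance TINYINT(1),
  FOREIGN KEY (job_id) REFERENCES job (job_id)
);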
Our application is a SaaS-based product that helps manage workflows. We check whether everything necessary is okay beforehand (like whether the job is still in the right status, whether the person trying to accept the job was given access, and so on). Then, if that is true, we assign the worker (insert or update the row in the job_assignment table).
Our problem is that our system takes 2 to 3 seconds for the rest of the assignment to happen (place payment holds, insert the actual row, email the worker that they are assigned, move the status to assigned, and so on). During this time, another user also tries to accept the job; his thread runs the same up-front validation (where we check if it's still available), and it passes. We then start the process on him too, since each thread is a transaction and the changes haven't been committed.
Then we get two assignment rows. For us, that's bad right now since we only pay one worker.
We would use application locking with temp files or something, but we're in a load-balanced (HA) environment and cannot guarantee that both users hit the same server.
This seems really rudimentary, but I can't figure out how to solve it. Other than a unique index, the only other option I can see is to invest heavily in hardware for the DB and get that window as small as we can.
Does this clarify anything?

How to retrieve the new rows of a table every minute

I have a table, to which rows are only appended (not updated or deleted) with transactions (I'll explain why this is important), and I need to fetch the new, previously unfetched, rows of this table, every minute with a cron.
How am I going to do this? In any programming language (I use Perl but that's irrelevant.)
I'll list the ways I thought of to solve this problem, and ask you to show me the correct one (there HAS to be one...)
The first way that popped into my head was to save (in a file) the largest auto_incrementing id of the rows fetched, so in the next minute I can fetch with: WHERE id > $last_id. But that can miss rows. Because new rows are inserted in transactions, it's possible that the transaction that saves the row with id = 5 commits before the transaction that saves the row with id = 4. It's therefore possible that the cron script retrieves row 5 but not row 4, and when row 4 gets committed a split second later, it will never get fetched (because 4 is not > than 5, which is the $last_id).
Then I thought I could make the cron job fetch all rows that have a date field in the last TWO minutes, check which of these rows have been retrieved again in the previous run of the cron job (to do this I would need to save somewhere which row ids were retrieved), compare, and process only the new ones. Unfortunately this is complicated, and also doesn't solve the problem that will occur if a certain inserting transaction takes TWO AND A HALF minutes to commit for some weird database reason, which will cause the date to be too old for the next iteration of the cron job to fetch.
Then I thought of installing a message queue (MQ) like RabbitMQ or any other. The same process that does the inserting transaction, would notify RabbitMQ of the new row, and RabbitMQ would then notify an always-running process that processes new rows. So instead of getting a batch of rows inserted in the last minute, that process would get the new rows one-by-one as they are written. This sounds good, but has too many points of failure - RabbitMQ might be down for a second (in a restart for example) and in that case the insert transaction will have committed without the receiving process having ever received the new row. So the new row will be missed. Not good.
I just thought of one more solution: the receiving processes (there are 30 of them, doing the exact same job on exactly the same data, so each row gets processed 30 times, once by each receiving process) could write in another table that they have processed row X when they process it, and then when the time comes they can ask for all rows in the main table that don't exist in the "have_processed" table with an OUTER JOIN query. But I believe (correct me if I'm wrong) that such a query will consume a lot of CPU and disk on the DB server, since it will have to compare the entire list of ids of the two tables to find new entries (and the table is huge and getting bigger each minute). It would have been fast if there were only one receiving process - then I would have been able to add an indexed field named "have_read" in the main table that would make looking for new rows extremely fast and easy on the DB server.
What is the right way to do it? What do you suggest? The question is simple, but a solution seems hard (for me) to find.
Thank you.
I believe the 'best' way to do this would be to use one process that checks for new rows and delegates them to the thirty consumer processes. Then your problem becomes simpler to manage from a database perspective and a delegating process is not that difficult to write.
If you are stuck with communicating to the thirty consumer processes through the database, the best option I could come up with is to create a trigger on the table, which copies each row to a secondary table. Copy each row to the secondary table thirty times (once for each consumer process). Add a column to this secondary table indicating the 'target' consumer process (for example a number from 1 to 30). Each consumer process checks for new rows with its unique number and then deletes those. If you are worried that some rows are deleted before they are processed (because the consumer crashes in the middle of processing), you can fetch, process and delete them one by one.
Since the secondary table is kept small by continuously deleting processed rows, INSERTs, SELECTs and DELETEs would be very fast. All operations on this secondary table would also be indexed by the primary key (if you place the consumer ID as first field of the primary key).
In MySQL statements, this would look like this:
CREATE TABLE `consumer`(
`id` INTEGER NOT NULL,
PRIMARY KEY (`id`)
);
INSERT INTO `consumer`(`id`) VALUES
(1),
(2),
(3)
-- all the way to 30
;
CREATE TABLE `secondaryTable` LIKE `primaryTable`;
ALTER TABLE `secondaryTable` ADD COLUMN `targetConsumerId` INTEGER NOT NULL FIRST;
-- alter the secondary table further to allow several rows with the same primary key (by adding targetConsumerId to the primary key)
DELIMITER //
CREATE TRIGGER `mark_to_process` AFTER INSERT ON `primaryTable`
FOR EACH ROW
BEGIN
-- selecting from the consumer table automatically inserts one copy of the new row per consumer, so adding or deleting consumers is just a matter of adding or deleting rows in the consumer table
INSERT INTO `secondaryTable`(`targetConsumerId`, `id`, `field1`, `field2`) SELECT `consumer`.`id`, NEW.`id`, NEW.`field1`, NEW.`field2` FROM `consumer`;
END//
DELIMITER ;
-- loop over the following statements in each consumer until the SELECT doesn't return any more rows
START TRANSACTION;
SELECT * FROM secondaryTable WHERE targetConsumerId = MY_UNIQUE_CONSUMER_ID LIMIT 1;
-- here, do the processing (so before the COMMIT so that crashes won't let you miss rows)
DELETE FROM secondaryTable WHERE targetConsumerId = MY_UNIQUE_CONSUMER_ID AND id = PRIMARY_TABLE_ID_OF_ROW_JUST_SELECTED;
COMMIT;
I've been thinking about this for a while. So, let me see if I got it right. You have a HUGE table to which N processes (an amount which may vary over time) write; let's call them producers. Then there are M other processes (an amount which may also vary over time) that each need to process every one of those added records at least once; let's call them consumers.
The main issues detected are:
Making sure the solution will work with dynamic N and M
Keeping track of the unprocessed records for each consumer
The solution has to scale as well as possible due to the huge amount of records
In order to tackle those issues I thought of this. Create this table (the PK is both columns together):
PENDING_RECORDS(ConsumerID, HugeTableID)
Modify the producers so that each time they add a record to the HUGE_TABLE they also add M records to the PENDING_RECORDS table, pairing the new HugeTableID with each ConsumerID that exists at that time. Each time a consumer runs, it will query the PENDING_RECORDS table and will find a small number of matches for itself. It will then join against the HUGE_TABLE (note it will be an inner join, not a left join) and fetch the actual data it needs to process. Once the data is processed, the consumer will delete the processed records from the PENDING_RECORDS table, keeping it decently small.
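A rough sketch, assuming HUGE_TABLE has an id primary key and a hypothetical consumers table lists the ConsumerIDs that are currently active:
-- producer side: immediately after inserting into HUGE_TABLE on the same connection,
-- fan out one pending row per registered consumer
INSERT INTO PENDING_RECORDS (ConsumerID, HugeTableID)
SELECT ConsumerID, LAST_INSERT_ID() FROM consumers;

-- consumer side (say, consumer 7): fetch only its own pending work
SELECT h.*
FROM PENDING_RECORDS p
JOIN HUGE_TABLE h ON h.id = p.HugeTableID
WHERE p.ConsumerID = 7;

-- after processing a row, remove it so PENDING_RECORDS stays small
DELETE FROM PENDING_RECORDS WHERE ConsumerID = 7 AND HugeTableID = 12345;  -- 12345 = the row just processed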
Interesting, I must say :)
1) First of all - is it possible to add a field to the table that only has rows added to it (let's call it 'transactional_table')? I mean, is it a design decision that you have a reason not to do any sort of updates on this table, or is it "structurally" blocked (i.e. the user connecting to the db has no privileges to perform updates on this table)?
Because then the simplest way to do it is to add a "have_read" column to this table with default 0, and update this column to 1 on fetched rows (even if 30 processes do this simultaneously, you should be fine, as it would be very fast and it won't corrupt your data). Even if 30 processes mark the same 1000 rows as fetched - nothing is corrupted. Although if you do not operate on InnoDB, this might not be the best way as far as performance is concerned (MyISAM locks whole tables on updates, InnoDB only the rows that are updated).
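A sketch of that idea (the id list in the last statement is a placeholder for whatever ids the SELECT returned):
ALTER TABLE transactional_table
  ADD COLUMN have_read TINYINT NOT NULL DEFAULT 0,
  ADD INDEX idx_have_read (have_read);

-- each cron run: fetch unread rows...
SELECT * FROM transactional_table WHERE have_read = 0;

-- ...then mark exactly the rows that were fetched, by id,
-- so rows inserted in the meantime are not skipped
UPDATE transactional_table SET have_read = 1 WHERE id IN (101, 102, 103);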
2) If this is not something you can use - I would surely check out the solution you gave as your last one, with a little modification. Create a table (let's say: fetched_ids), and save fetched rows' ids in that table. Then you could use something like:
SELECT tt.* FROM transactional_table tt
LEFT JOIN fetched_ids fi ON tt.id = fi.row_id
WHERE fi.row_id IS NULL
This will return the rows from your transactional table that have not been saved as already fetched. As long as both tt.id and fi.row_id have (ideally unique) indexes, this should work just fine even on large sets of data. MySQL handles JOINs on indexed fields pretty well. Don't be afraid to try it out - create a new table, copy some ids into it, delete a few of them and run your query. You'll see the results and you'll know if they are satisfactory :)
P.S. Of course, adding rows to this 'fetched_ids' table should be done carefully so as not to create unnecessary duplicates (30 simultaneous processes could write 30 copies of the data you need - if you need performance, you should watch out for this case).
How about a second table with a structure like this:
source_fk - this would hold an ID of the data rows you want to read.
process_id - This would be a unique id for one of the 30 processes.
then do a LEFT JOIN and exclude items from your source that have entries matching the specified process_id.
once you get your results, just go back and add the source_fk and process_id for each result you get.
One plus about this is you can add more processes later on with no problem.
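A sketch of that, with hypothetical table names (source_table is the table being read, processed_log is the suggested second table):
CREATE TABLE processed_log (
  source_fk INT NOT NULL,     -- id of the source row that was handled
  process_id INT NOT NULL,    -- which of the 30 processes handled it
  PRIMARY KEY (source_fk, process_id)
);

-- rows that process 7 has not handled yet
SELECT s.*
FROM source_table s
LEFT JOIN processed_log p ON p.source_fk = s.id AND p.process_id = 7
WHERE p.source_fk IS NULL;

-- after handling a row, record it so it is excluded next time
INSERT INTO processed_log (source_fk, process_id) VALUES (12345, 7);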
I would try adding a timestamp column and using it as a reference when retrieving new rows.

How to properly avoid Mysql Race Conditions

I know this has been asked before, but I'm still confused and would like to avoid any problems before I go into programming if possible.
I plan on having an internal website with at least 100 users active at any given time. Users would post an item (inserted into the db with 0 as its value) and that item would be shown via a PHP site (db query). Users then get the option to press a button and lock that item as theirs (assigning their id as the value of that item).
How do I ensure that 2 or more users don't retrieve the same item at the same time? I know that in a language like C++ I would just use a plain old mutex lock. Is there an equivalent in MySQL that will lock just one item entry like that? I've seen references to LOCK TABLES and GET_LOCK and many others, so I'm still very confused about what would be best.
There is potential for many people all racing to press that one button and it would be disastrous if multiple people get a confirmation.
I know this is a prime example of a race condition, but mysql is foreign territory for me.
I obviously will query the value of the item before I update it and make sure it hasn't already been written, but what is the best way to ensure that this race condition is avoided?
Thanks in advance.
To achieve this, you will need to lock the record somehow.
Add a column LockedBy defaulting to 0.
When someone pushes the button execute a query resembling this:
UPDATE table SET LockedBy = <user_id> WHERE LockedBy = 0 AND id = <item_id>;
After the update, check the number of affected rows (in PHP, mysql_affected_rows). If the value is 0, the query did not update anything because the LockedBy column was no longer 0, i.e. the item is already locked by someone else.
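In plain SQL the whole check could look roughly like this (the items table and the id values are placeholders; ROW_COUNT() is the SQL-level counterpart of mysql_affected_rows):
-- add the lock column, defaulting to 0 = unlocked
ALTER TABLE items ADD COLUMN LockedBy INT NOT NULL DEFAULT 0;

-- user 7 tries to claim item 42; this only succeeds while the item is still unlocked
UPDATE items SET LockedBy = 7 WHERE id = 42 AND LockedBy = 0;

-- 1 means the claim succeeded, 0 means someone else already locked it
SELECT ROW_COUNT();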
Hope this helps
When you post a row, set the column to NULL, not 0.
Then when a user updates the row to make it their own, update it as follows:
UPDATE MyTable SET ownership = COALESCE(ownership, $my_user_id) WHERE id = ...
COALESCE() returns its first non-null argument. So even if you and I are updating concurrently, the first one to commit gets to set the value. The second one will not override that value.
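For example (the user id and row id are placeholders):
-- user 7 tries to take ownership of row 42; COALESCE keeps any existing owner
UPDATE MyTable SET ownership = COALESCE(ownership, 7) WHERE id = 42;

-- read back who actually owns it; if it isn't 7, another session won the race
SELECT ownership FROM MyTable WHERE id = 42;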
You may consider transactions:
START TRANSACTION;
SELECT ownership FROM ....;
UPDATE table .....; -- set the ownership if the row is not owned yet
COMMIT;
and you can also ROLLBACK all the queries in the transaction if you catch an error!