Best practice for verifying uniqueness in a database? - mysql

I'm creating a database which requires several fields to be unique, and I was wondering which method is least expensive for checking that uniqueness:
1. Query the database with a mysqli() call to check whether a value exists?
2. Use PHP to download a file of all entries, check against that file, and delete the file afterwards?
3. Set the columns to a UNIQUE index?
If the best option (which I'm assuming it is) is to set the columns to UNIQUE, then how do you go about handling the error that gets thrown when the value already exists, without breaking out of the function? Or is that even possible?

Querying the database first risks race conditions. That is, you SELECT to verify the value isn't already there, so you can INSERT it. But unfortunately, in the brief moment between your SELECT and your INSERT, someone else slips in and inserts the value you were going to add. So you end up having to catch the error anyway.
This may seem unlikely, but there's some old wisdom: "one in a million is next Tuesday." I.e. when we process millions of transactions per day, even a rare fluke is bound to happen sooner than we think.
Option 2 (downloading a file of all entries) is right out. What happens when the set of entries is 10 million long? 100 million? 1 billion? This solution doesn't scale, so just put it out of your mind immediately.
Yes, use a UNIQUE constraint. Attempt the INSERT and handle the error. This avoids a race condition, because your INSERT's unique check is atomic. That is, no one can slip in between the clock ticks to add a value before you can insert it.
One caveat of this: in MySQL's InnoDB storage engine, if an INSERT fails because it conflicts with a UNIQUE constraint (or fails for any other reason), InnoDB does not reverse its allocation of the next auto-increment value. The row is not inserted, but the auto-inc value is generated and discarded. So if such failures are frequent, you could end up skipping a lot of integers in your primary key. I had one case where my customer actually ran out of integers, because they were skipping 1500 id values for each row that was successfully inserted. In their case, I suggested using your option 1 as a pre-filter: try the insert only when they are "pretty sure" it's safe, but still handle the error anyway, just in case of the race condition.
Handling the error means checking the return value every time you execute an SQL query. I can't tell you how many questions I read on StackOverflow where programmers fail to check whether execute() returned false, and then wonder why their INSERT failed.
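To answer the "without breaking out of the function" part: treat the duplicate as an expected outcome, not a fatal error. A minimal sketch in PHP/mysqli (the users table and email column are hypothetical); MySQL reports duplicate-key violations as error 1062 (ER_DUP_ENTRY):

mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT); // make mysqli throw exceptions

// Returns true if the row was inserted, false if the value already existed.
function insert_unique(mysqli $db, string $email): bool
{
    $stmt = $db->prepare('INSERT INTO users (email) VALUES (?)');
    $stmt->bind_param('s', $email);
    try {
        $stmt->execute();
        return true;                      // inserted
    } catch (mysqli_sql_exception $e) {
        if ($e->getCode() === 1062) {     // ER_DUP_ENTRY: expected, not fatal
            return false;                 // value already exists; carry on
        }
        throw $e;                         // anything else is a real error
    }
}

If you run mysqli without exception mode, the equivalent is checking that execute() returned false and then inspecting $db->errno for 1062.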

The quick answer is let the database do it if at all possible.
The longer answer depends on how you want to handle exceptions to your uniqueness requirement.
If you never need to override the uniqueness requirement, you can use a UNIQUE index in MySQL. Then you can use INSERT ... ON DUPLICATE KEY UPDATE to handle the exceptions.
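For example, a minimal sketch (the table and columns are hypothetical):

INSERT INTO users (email, last_seen)
VALUES ('rupert@example.com', NOW())
ON DUPLICATE KEY UPDATE last_seen = VALUES(last_seen);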
However, if you sometimes need to allow a duplicate, you can't use a UNIQUE key; you're better off using a regular INDEX and doing a query first to see whether the value exists before you insert it.

Well, expense is one consideration; user experience is another.
I would personally go for a query (with a custom message if the key is found) AND a UNIQUE constraint (to keep the db consistent). So 1 + 3.
But if you want the least expensive route, just go with the UNIQUE constraint, and try to build a comprehensible error message from mysqli_error().
So 1 + 3, or 3 alone, but not 2.

Related

How does a lock work for two inserts in MySQL?

Let's say the isolation level is REPEATABLE READ, as it really is by default for MySQL.
I have two inserts (no checking, no unique columns).
a) Let's say these two inserts happen at the same moment. What will happen? Will it run the first insert and then the second, or both of them in different MySQL threads?
b) Let's say I have an insert statement and a column called vehicle_id that is unique, but before inserting, I check whether the value exists. If it doesn't exist, I go on and insert. Let's say two threads in my code arrive at the same moment, so they both pass the existence check.
Now they both try to insert the same vehicle_id. How does MySQL handle this? If it's asynchronous in some way, maybe both inserts happen so quickly that both get inserted, even though vehicle_id is a unique field. If it's not, one insert happens first and the second one waits; when the first is done, the second tries to insert, but fails because of the unique vehicle_id restriction. Which is it?
I am asking because locks under REPEATABLE READ seem to lose their meaning for INSERT. I know how it works for UPDATE/SELECT.
As I understand it, the situation is:
a) Threads are assigned per connection. If both inserts are received on the same connection, they will be executed in the same thread, one after the other, in the order in which they are received. If they arrive on different connections, it comes down to whichever thread is scheduled first, which is likely OS-determined and non-deterministic from your point of view.
b) if a column is defined as UNIQUE at the server, then you cannot insert a second row with the same value so the second insert must fail.
Trying to use a conflicting index in the way you describe looks like an application logic problem, not a MySQL problem. Whatever entity is responsible for your unique IDs (your application, in this case) needs to ensure that they are unique. One approach is to implement an application lock using MySQL, which allows applications running in isolation from each other to share a lock at the server; check the MySQL docs for GET_LOCK(). Its usual use is intended to be application-level, so it is not binding on the MySQL server itself. Another approach would be to use UUIDs for unique keys and rely on their uniqueness whenever you need to create a new one.
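A minimal sketch of that application-lock approach (the lock name and timeout are arbitrary placeholders; 10 is the wait in seconds):

SELECT GET_LOCK('vehicle:12345', 10);   -- returns 1 once the lock is acquired
-- check whether vehicle_id 12345 exists; INSERT it if not
SELECT RELEASE_LOCK('vehicle:12345');   -- always release, even on failure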

Setting MySQL unique key or checking for duplicate in application part?

Which one is more reliable and has better performance? Setting a MySQL unique key and using INSERT IGNORE, or first checking whether the data exists in the database and acting according to the result?
If the answer is the second one, is there any way to make a single SQL query instead of two?
UPDATE: I ask because my colleagues at the company where I work believe that such issues should be dealt with in the application layer, which is more reliable according to them.
Your application won't catch duplicates.
Two concurrent calls can insert the same data, because neither process sees the other while your application checks for uniqueness. Each process thinks it's OK to INSERT.
You can force some kind of serialisation, but then you have a bottleneck and a performance limit. And you will have other clients writing to the database anyway, even if it is just a release script.
This is why such things as unique indexes and constraints exist. Foreign keys, triggers, check constraints, NULL/NOT NULL and datatype constraints are all there to enforce data integrity.
There is also the arrogance of some code monkey thinking they can do better.
See programmers.se: Constraints in a relational database - why not remove them completely? and, on SO, Enforcing Database Constraints In Application Code.
Setting a unique key is better. It reduces the number of round-trips to MySQL needed for a single operation, and item uniqueness is ensured, reducing errors caused by your own logic.
You definitely should set a unique key in your MySQL table, no matter what you decide.
As for the other part of your question: definitely use INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE if that is what you intend for your application.
I.e. if you're going to load a bunch of data and you don't care what the old data was, you just want the new data, that is the way to go.
On the other hand, if there is some sort of decision branch that is based on whether the change is an update or a new value, I think you would have to go with option 2.
I.e. if changes to the table are recorded in some other table (e.g. table change_log with columns id, table, column, old_val, new_val), then you couldn't just use INSERT IGNORE, because you would never be able to tell which values were changed versus which were newly inserted.
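That said, there is a middle ground worth sketching: with INSERT ... ON DUPLICATE KEY UPDATE, MySQL's affected-rows count tells you which branch you hit. A hedged example with a hypothetical items table that has a UNIQUE key on sku:

INSERT INTO items (sku, qty) VALUES ('A-100', 5)
ON DUPLICATE KEY UPDATE qty = VALUES(qty);
SELECT ROW_COUNT();  -- 1: new row inserted; 2: existing row updated; 0: duplicate, nothing changed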

Concurrently retrieve (select) or create (insert) new row in generic SQL without conflicts

I have a system which has a complex primary key for interfacing with external systems, and a fast, small opaque primary key for internal use. For example: the external key might be a compound value - something like (given name (varchar), family name (varchar), zip code (char)) and the internal key would be an integer ("customer ID").
When I receive an incoming request with the external key, I need to look up the internal key - and here's the tricky part - allocate a new internal key if I don't already have one for the given external ID.
Obviously if I have only one client talking to the database at a time, this is fine. SELECT customer_id FROM customers WHERE given_name = 'foo' AND ..., then INSERT INTO customers VALUES (...) if I don't find a value. But, if there are potentially many requests coming in from external systems concurrently, and many may arrive for a previously unheard-of customer all at once, there is a race condition where multiple clients may try to INSERT the new row.
If I were modifying an existing row, that would be easy; simply SELECT FOR UPDATE first, to acquire the appropriate row-level lock, before doing an UPDATE. But in this case, I don't have a row that I can lock, because the row doesn't exist yet!
I've come up with several solutions so far, but each of them has some pretty significant issues:
Catch the error on INSERT, re-try the entire transaction from the top. This is a problem if the transaction involves a dozen customers, especially if the incoming data is potentially talking about the same customers in a different order each time. It's possible to get stuck in mutually recursive deadlock loops, where the conflict occurs on a different customer each time. You can mitigate this with an exponential wait time between re-try attempts, but this is a slow and expensive way to deal with conflicts. Also, this complicates the application code quite a bit as everything needs to be restartable.
Use savepoints. Start a savepoint before the SELECT, catch the error on INSERT, and then roll back to the savepoint and SELECT again. Savepoints aren't completely portable, and their semantics and capabilities differ slightly and subtly between databases; the biggest difference I've noticed is that, sometimes they seem to nest and sometimes they don't, so it would be nice if I could avoid them. This is only a vague impression though - is it inaccurate? Are savepoints standardized, or at least practically consistent? Also, savepoints make it difficult to do things in parallel on the same transaction, because you might not be able to tell exactly how much work you'll be rolling back, although I realize I might just need to live with that.
Acquire some global lock, like a table-level lock using a LOCK statement (Oracle, MySQL, Postgres). This obviously slows down these operations and results in a lot of lock contention, so I'd prefer to avoid it.
Acquire a more fine-grained, but database-specific lock. I'm only familiar with Postgres's way of doing this, which is very definitely not supported in other databases (the functions even start with "pg_") so again it's a portability issue. Also, postgres's way of doing this would require me to convert the key into a pair of integers somehow, which it may not neatly fit into. Is there a nicer way to acquire locks for hypothetical objects?
It seems to me that this has got to be a common concurrency problem with databases but I haven't managed to find a lot of resources on it; possibly just because I don't know the canonical phrasing. Is it possible to do this with some simple extra bit of syntax, in any of the tagged databases?
I'm not clear on why you can't use INSERT IGNORE, which will run without error, and you can check whether an insert occurred (affected rows). If the insert "fails", then you know the key already exists and you can do a SELECT. You could even do the INSERT first, then the SELECT.
Alternatively, if you are using MySQL, use InnoDB, which supports transactions. That would make it easier to roll back.
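A hedged sketch of that insert-first pattern, using the column names from the question (the values are placeholders, and a UNIQUE key across the external-key columns is assumed):

INSERT IGNORE INTO customers (given_name, family_name, zip_code)
VALUES ('foo', 'bar', '90210');
-- whether that inserted a new row or was ignored as a duplicate,
-- exactly one matching row now exists:
SELECT customer_id FROM customers
WHERE given_name = 'foo' AND family_name = 'bar' AND zip_code = '90210';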
Perform each customer's "lookup or maybe create" operations in autocommit mode, prior to and outside of the main, multi-customer transaction.
WRT generating an opaque primary key, there are a number of options, e.g. use a GUID or (at least with Oracle) a sequence. WRT ensuring the external key is unique, apply a UNIQUE constraint to the column. If the insert fails because the key exists, reattempt the fetch. You can also use an INSERT with WHERE NOT EXISTS or WHERE NOT IN. Use a stored procedure to reduce the round trips and improve performance.

Do numerical primary keys of deleted records in a database get reused for future new records?

For example, if I have an auto-numbered field, I add new records without specifying this field and let the DB engine pick the value for me.
So, will it pick the number of a deleted record? If yes, when?
// SQL Server, MySQL. //
Follow-up question: What happens when DB engine runs out of numbers to use for primary keys?
No. Numerical primary keys will not be reused, unless you specify them manually (you should really avoid this!).
AFAIK, this can happen in MySQL. From "How AUTO_INCREMENT Handling Works in InnoDB":
InnoDB uses the in-memory auto-increment counter as long as the server runs. When the server is stopped and restarted, InnoDB reinitializes the counter for each table for the first INSERT to the table, as described earlier.
In other words, after a server restart, InnoDB can reuse previously generated auto_increment values. The suggested fix in the corresponding bug report: an InnoDB table should not lose track of the next number for an auto_increment column after a restart.
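A minimal illustration of that older-InnoDB behaviour with a hypothetical table (MySQL 8.0 later made the counter persistent, so this applies to earlier versions):

CREATE TABLE t (id INT AUTO_INCREMENT PRIMARY KEY, v INT) ENGINE=InnoDB;
INSERT INTO t (v) VALUES (1), (2), (3);  -- rows get ids 1, 2, 3
DELETE FROM t WHERE id = 3;
-- after a server restart, InnoDB reinitializes the counter to MAX(id) + 1 = 3,
-- so the next insert reuses the deleted id:
INSERT INTO t (v) VALUES (4);            -- gets id 3 again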
Depends on the auto-numbering system. If you're using a sequence of any kind, the numbers of deleted records will not get reused, as the sequence does not know about them.
Generally, no, the numbers are not reused.
However, you can -- in products like Oracle -- specify a sequence generator which cycles around and will reuse numbers.
Whether those are the numbers of deleted records or not is your application's problem.
This question needs to be made more precise:
... "with Oracle Sequences"
... "with MySQL autonumber columns"
... etc...
As long as you create the table correctly you will not reuse numbers.
However, you can RESEED the identity column (in MSSQL, anyway) by using the following:
-- Enter the number of the last valid entry in the table not the next number to be used
DBCC CHECKIDENT ([TableName], RESEED, [NumberYouWantToStartAt])
This is of course insane... and should never be done :)
MySQL will not reuse IDs unless you truncate the table or delete from the table with no where clause (in which case MySQL, internally, simply does a truncate).
Not specifically. If the key is read from a sequence or auto-incrementing identity column, the sequence will just plug along and produce the next value. However, you can deactivate this (SET IDENTITY_INSERT ... ON in SQL Server) and put any number you want in the column, as long as it doesn't violate the uniqueness constraint.
Yeah, it really depends on the way you generate the id.
For example, if you are using a GUID as the primary key: most implementations of generating a new random GUID are unlikely to pick the same GUID again, though given enough time it will happen. If the GUID is not in the table, the insert statement will go fine; but if there is already a row with that GUID, you will get a primary key constraint violation.
I consider the MySQL "feature" of reusing id's a bug.
Consider something like processing file uploads. Using the database id as a filename is good practice: simple, with no risk of exploits from user-supplied filenames, etc.
You can't really make everything transactional when the filesystem is involved. You'll have to commit the database transaction and then write the file, or write the file and then commit the database transaction; if one or both fail, or you have a crash, or your network filesystem has a fit, you might have a valid record in the database and no file, or a file without a database record, since the combination is not atomic.
If such a problem happens, and the first thing the server does when coming back up is reuse the ids, and thus the filenames, of rolled-back transactions, it sucks. Those files could have been useful.
No. Imagine if your bank decided to re-use your account_id. Arghhhh!!

mysql insert race condition

How do you stop race conditions in MySQL? The problem at hand is caused by a simple algorithm:
1. select a row from the table
2. if it doesn't exist, insert it
and then either you get a duplicate row, or if you prevent it via unique/primary keys, an error.
Now normally I'd think transactions help here, but because the row doesn't exist yet, the transaction doesn't actually help (or am I missing something?).
LOCK TABLES sounds like overkill, especially if the table is updated multiple times per second.
The only other solution I can think of is GET_LOCK() for every different id, but isn't there a better way? Aren't there scalability issues with that as well? Also, doing it for every table seems a bit unnatural, as this sounds like a very common problem in high-concurrency databases.
What you want is LOCK TABLES.
Or, if that seems excessive, how about INSERT IGNORE with a check that the row was actually inserted?
If you use the IGNORE keyword, errors that occur while executing the INSERT statement are treated as warnings instead.
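A minimal sketch of that "was it actually inserted" check (table and values are hypothetical):

INSERT IGNORE INTO mytable (id, data) VALUES (42, 'x');
SELECT ROW_COUNT();  -- 1 if the row was inserted, 0 if a duplicate was skipped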
It seems to me you should have a unique index on your id column, so a repeated insert would trigger an error instead of being blindly accepted again.
That can be done by defining the id as a primary key or using a unique index by itself.
I think the first question you need to ask is why do you have many threads doing the exact SAME work? Why would they have to insert the exact same row?
After that being answered, I think that just ignoring the errors will be the most performant solution, but measure both approaches (GET_LOCK v/s ignore errors) and see for yourself.
There is no other way that I know of. Why do you want to avoid errors? You still have to code for the case when another type of error occurs.
As staticsan says, transactions do help; but since transactions are usually implicit, if two inserts are run by different threads, each will be inside its own implicit transaction and see a consistent view of the database.
Locking the entire table is indeed overkill. To get the effect that you want, you need something that the literature calls "predicate locks". No one has ever seen those except printed on the paper that academic studies are published on. The next best thing is locks on the "access paths" to the data (in some DBMSs: "page locks").
Some non-SQL systems allow you to do both (1) and (2) in one single statement, which more or less means that the potential race conditions arising from your OS suspending your execution thread right between (1) and (2) are entirely eliminated.
Nevertheless, in the absence of predicate locks such systems will still need to resort to some kind of locking scheme, and the finer the "granularity" (/"scope") of the locks it takes, the better for concurrency.
(And to conclude : some DBMS's - especially the ones you don't have to pay for - do indeed offer no finer lock granularity than "the entire table".)
On a technical level, a transaction will help here because other threads won't see the new row until you commit the transaction.
But in practice that doesn't solve the problem; it only moves it. Your application now needs to check whether the commit fails and decide what to do. I would normally have it roll back what it did and restart the transaction, because now the row will be visible. This is how transaction-based programming is supposed to work.
I ran into the same problem and searched the net for a moment :)
Finally I came up with a solution similar to the technique for securely creating temporary files in shared directories:
$exists = false;
$success = false;
do {
    $exists = check();                   // SELECT: does the row exist already?
    if (!$exists) {
        $success = create_record();      // INSERT: true on success, else an error code
        if ($success === true) {
            $exists = true;
        } elseif ($success !== ERROR_DUP_ROW) {
            log_error("failed to create row, and not because of DUP_ROW!");
            break;
        } else {
            // another process probably created the record between our check()
            // and create_record(); loop around and check again
        }
    }
} while (!$exists);
Don't be afraid of busy-loop - normally it will execute once or twice.
You prevent duplicate rows very simply by putting unique indexes on your tables. That has nothing to do with LOCKS or TRANSACTIONS.
Do you care if an insert fails because it's a duplicate? Do you need to be notified if it fails? Or is all that matters that the row was inserted, and it doesn't matter by whom or how many duplicates inserts failed?
If you don't care, then all you need is INSERT IGNORE. There is no need to think about transactions or table locks at all.
InnoDB has row level locking automatically, but that applies only to updates and deletes. You are right that it does not apply to inserts. You can't lock what doesn't yet exist!
You can explicitly LOCK the entire table. But if your purpose is to prevent duplicates, then you are doing it wrong. Again, use a unique index.
If there is a set of changes to be made and you want an all-or-nothing result (or even a set of all-or-nothing results within a larger all-or-nothing result), then use transactions and savepoints. Then use ROLLBACK or ROLLBACK TO SAVEPOINT *savepoint_name* to undo changes, including deletes, updates and inserts.
LOCK TABLES is not a replacement for transactions, but it is your only option with MyISAM tables, which do not support transactions. You can also use it with InnoDB tables if row-level locking isn't enough. See the MySQL manual for more information on using transactions with LOCK TABLES statements.
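A minimal sketch of the savepoint pattern mentioned above (table and names hypothetical):

START TRANSACTION;
-- ... other work in the larger transaction ...
SAVEPOINT before_row;
INSERT INTO mytable (id, data) VALUES (42, 'x');
-- if that INSERT fails with a duplicate-key error, undo only this piece:
ROLLBACK TO SAVEPOINT before_row;
-- the rest of the transaction is still intact
COMMIT;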
I have a similar issue. I have a table that under most circumstances should have a unique ticket_id value, but there are some cases where I will have duplicates; not the best design, but it is what it is.
User A checks to see if the ticket is reserved, it isn't
User B checks to see if the ticket is reserved, it isn't
User B inserts a 'reserved' record into the table for that ticket
User A inserts a 'reserved' record into the table for that ticket
User B checks for a duplicate? Yes; is my record the winner (earliest, i.e. lowest auto-increment id)? Yes, leave it
User A checks for a duplicate? Yes; is my record the winner? No, delete it
User B has reserved the ticket, User A reports back that the ticket has been taken by someone else.
The key in my instance is that you need a tie-breaker; in my case, it's the auto-increment id on the row.
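A hedged sketch of that tie-breaker cleanup (hypothetical reservations table; each client then checks whether its own row survived):

DELETE r
FROM reservations r
JOIN (SELECT MIN(id) AS winner_id
      FROM reservations
      WHERE ticket_id = 123) AS w
WHERE r.ticket_id = 123
  AND r.id <> w.winner_id;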
In case INSERT IGNORE doesn't fit for you, as suggested in the accepted answer, then according to the requirements in your question:
1. select a row from the table
2. if it doesn't exist, insert it
Another possible approach is to add a condition to the INSERT statement itself, e.g.:
INSERT INTO table_listnames (name, address, tele)
SELECT * FROM (SELECT 'Rupert' AS name, 'Somewhere' AS address, '022' AS tele) AS tmp
WHERE NOT EXISTS (
    SELECT name FROM table_listnames WHERE name = 'Rupert'
) LIMIT 1;
Reference:
https://stackoverflow.com/a/3164741/179744