Using MySQL unique index to prevent duplicates, instead of duplicate searching? - mysql

I have a large table (5 million rows) with a unique identifier column called 'unique_id'.
I'm running the INSERT query through Node.js (node-mysql bindings), and there's a chance that the application will attempt to insert duplicates.
The two solutions are:
1) Make 'unique_id' an index, and check the entire database for a duplicate record, prior to INSERT:
'SELECT unique_id FROM example WHERE unique_id = "'+unique_id+'" LIMIT 1'
2) Make 'unique_id' a unique index within MySQL, and perform the INSERT without checking for duplicates. Clearly, any duplicates would cause error and not be inserted into the table.
My hunch is that solution 2) is better, as it prevents a search of worst-case (5 million - 1) rows for a duplicate.
Are there any downsides to using solution 2)?

There are a number of advantages to defining a unique, primary index for the unique_id column:
Semantic correctness - currently the name does not reflect reality, as you can have duplicates in a column called 'unique_id',
Autogeneration of unique ids - you can delegate this job to the database and avoid conflicting ids (this would not be a problem if you were using UUIDs instead of integers),
Speed gain - to be reliable, solution 1 would require a blocking transaction (no new rows should be inserted between checking for a duplicate and inserting a row). Delegating this to MySQL will be much more efficient,
Following a common pattern - this is exactly what unique and primary indexes were designed to do. Your solution will be easy for other developers to understand,
Less code.
With the 2nd solution you might need to handle the attempt of inserting a duplicate (unless your unique ids are generated by MySQL).
Autoincremented primary index:
https://dev.mysql.com/doc/refman/5.7/en/example-auto-increment.html

Surprisingly, it makes little difference performance-wise. The search will use (and require) the same index.
What little performance difference there is, however, is to the advantage of your (2) solution.
Actually in MySQL you can get rid of the error altogether using the IGNORE keyword:
INSERT IGNORE INTO ... VALUES (1, 2, 3), (4, 5, 6), (7, 8, 9)...;
will always succeed (it will skip inserting duplicates). This lets you insert several values in a single statement, as above.
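The skip-duplicates behaviour can be tried without a MySQL server. The following is only a sketch: it uses Python's built-in sqlite3 module, where the equivalent of MySQL's INSERT IGNORE is spelled INSERT OR IGNORE. The table name example, the payload column and the helper insert_skipping_duplicates are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for runnability).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example (unique_id INTEGER NOT NULL, payload TEXT)")
conn.execute("CREATE UNIQUE INDEX idx_unique_id ON example (unique_id)")


def insert_skipping_duplicates(conn, rows):
    """Insert all rows in one statement; return how many were actually stored."""
    if not rows:
        return 0
    before = conn.total_changes
    placeholders = ", ".join(["(?, ?)"] * len(rows))
    flat = [value for row in rows for value in row]
    with conn:  # commits on success, rolls back on error
        # SQLite spelling; in MySQL this would be INSERT IGNORE INTO ...
        conn.execute(
            f"INSERT OR IGNORE INTO example (unique_id, payload) VALUES {placeholders}",
            flat,
        )
    return conn.total_changes - before


first = insert_skipping_duplicates(conn, [(1, "a"), (2, "b"), (3, "c")])
second = insert_skipping_duplicates(conn, [(3, "dup"), (4, "d")])
stored = [r[0] for r in conn.execute("SELECT unique_id FROM example ORDER BY unique_id")]
```

The second call tries to insert unique_id 3 again; the unique index rejects that row silently while the rest of the batch goes through.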
You might also be interested in the ON DUPLICATE KEY UPDATE family of tricks :-).
The real difference, as M.M. already stated, is in integrity. Using a UNIQUE index constraint you are sure of your data; otherwise, you need to LOCK the table between the moment you check it and the moment when you insert the new tuple, to avoid the risk of someone else inserting the same value.
Your (1) solution may have a place if the "duplicateness" of the data requires significant business logic work, that cannot be easily translated into a MySQL constraint. In that case you would
lock the table,
search for candidate duplicates (say you get 20 of them),
fetch the data and verify whether they are truly duplicates,
insert the new tuple if none conflict,
release the lock.
(It might be argued on good grounds that the need to do such a complicated merry-go-round stems from some error in the database design. Ideally you should be able to do everything in MySQL. But business reality has a way of being far from ideal sometimes).

Related

Can/should I make id column that is part of a composite key non-unique [duplicate]

I have a table which has an id (primary key with auto increment), a uid (a key referring to a user's id, for example) and something else which won't matter for my question.
I want to make, let's call them, different auto-increment keys on id for each uid entry.
So, I will add an entry with uid 10, and the id field for this entry will have a 1 because there were no previous entries with a value of 10 in uid. I will add a new one with uid 4 and its id will be 3 because there were already two entries with uid 4.
...Very obvious explanation, but I am trying to be as explanatory and clear as I can to demonstrate the idea... clearly.
What SQL engine can provide such a functionality natively? (non Microsoft/Oracle based)
If there is none, how could I best replicate it? Triggers perhaps?
Does this functionality have a more suitable name?
In case you know about a non-SQL database engine providing such functionality, name it anyway, I am curious.
Thanks.
MySQL's MyISAM engine can do this. See their manual, in section Using AUTO_INCREMENT:
For MyISAM tables you can specify AUTO_INCREMENT on a secondary column in a multiple-column index. In this case, the generated value for the AUTO_INCREMENT column is calculated as MAX(auto_increment_column) + 1 WHERE prefix=given-prefix. This is useful when you want to put data into ordered groups.
The docs go on after that paragraph, showing an example.
The InnoDB engine in MySQL does not support this feature, which is unfortunate because it's better to use InnoDB in almost all cases.
You can't emulate this behavior using triggers (or any SQL statements limited to transaction scope) without locking tables on INSERT. Consider this sequence of actions:
Mario starts transaction and inserts a new row for user 4.
Bill starts transaction and inserts a new row for user 4.
Mario's session fires a trigger to compute MAX(id)+1 for user 4. It gets 3.
Bill's session fires a trigger to compute MAX(id)+1 for user 4. It also gets 3.
Bill's session finishes his INSERT and commits.
Mario's session tries to finish his INSERT, but the row with (userid=4, id=3) now exists, so Mario gets a primary key conflict.
In general, you can't control the order of execution of these steps without some kind of synchronization.
The solutions to this are either:
Get an exclusive table lock. Before trying an INSERT, lock the table. This is necessary to prevent concurrent INSERTs from creating a race condition like in the example above. It's necessary to lock the whole table: since you're trying to restrict INSERT, there's no specific row to lock (if you were trying to govern access to a given row with UPDATE, you could lock just the specific row). But locking the table causes access to the table to become serial, which limits your throughput.
Do it outside transaction scope. Generate the id number in a way that won't be hidden from two concurrent transactions. By the way, this is what AUTO_INCREMENT does. Two concurrent sessions will each get a unique id value, regardless of their order of execution or order of commit. But tracking the last generated id per userid requires access to the database, or a duplicate data store. For example, a memcached key per userid, which can be incremented atomically.
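To make the second option concrete, here is a minimal sketch of a counter kept outside transaction scope. It is an in-process stand-in only: the class PerUserCounter and its next_id method are hypothetical names, and a real deployment would need a store shared by all sessions (such as the memcached key per userid mentioned above) rather than a Python dictionary.

```python
import threading
from collections import defaultdict


class PerUserCounter:
    """Hypothetical in-process stand-in for an atomically incremented key per userid."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last = defaultdict(int)

    def next_id(self, userid):
        # The lock makes read-increment-return one atomic step,
        # so two callers can never be handed the same value.
        with self._lock:
            self._last[userid] += 1
            return self._last[userid]


counter = PerUserCounter()
allocated = []
allocated_lock = threading.Lock()


def worker():
    # Each "session" allocates 100 ids for user 4.
    for _ in range(100):
        value = counter.next_id(4)
        with allocated_lock:
            allocated.append(value)


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every caller gets a distinct value regardless of scheduling order, but nothing here gives a value back if the INSERT that was going to use it later fails or is rolled back.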
It's relatively easy to ensure that inserts get unique values. But it's hard to ensure they will get consecutive ordinal values. Also consider:
What happens if you INSERT in a transaction but then roll back? You've allocated id value 3 in that transaction, and then I allocated value 4, so if you roll back and I commit, now there's a gap.
What happens if an INSERT fails because of other constraints on the table (e.g. another column is NOT NULL)? You could get gaps this way too.
If you ever DELETE a row, do you need to renumber all the following rows for the same userid? What does that do to your memcached entries if you use that solution?
SQL Server should allow you to do this. If you can't implement this using a computed column (probably not - there are some restrictions), surely you can implement it in a trigger.
MySQL also would allow you to implement this via triggers.
In a comment you ask the question about efficiency. Unless you are dealing with extreme volumes, storing an 8 byte DATETIME isn't much of an overhead compared to using, for example, a 4 byte INT.
It also massively simplifies your data inserts, as well as being able to cope with records being deleted without creating 'holes' in your sequence.
If you DO need this, be careful with the field names. If you have uid and id in a table, I'd expect id to be unique in that table, and uid to refer to something else. Perhaps, instead, use the field names property_id and amendment_id.
In terms of implementation, there are generally two options.
1). A trigger
Implementations vary, but the logic remains the same. As you don't specify an RDBMS (other than NOT MS/Oracle) the general logic is simple...
Start a transaction (often this is implicitly already started inside triggers)
Find the MAX(amendment_id) for the property_id being inserted
Update the newly inserted value with MAX(amendment_id) + 1
Commit the transaction
Things to be aware of are...
- multiple records being inserted at the same time
- records being inserted with amendment_id being already populated
- updates altering existing records
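The transaction / MAX + 1 / commit logic listed above can be sketched as follows. This is not a trigger: it performs the same steps from application code, using Python's sqlite3 module as a stand-in database. The table name amendment, the note column and the function add_amendment are assumptions for illustration, and BEGIN IMMEDIATE is SQLite's way of taking the write lock up front; other engines need their own locking.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode so that
# BEGIN / COMMIT / ROLLBACK can be issued explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    """
    CREATE TABLE amendment (
        property_id  INTEGER NOT NULL,
        amendment_id INTEGER NOT NULL,
        note         TEXT,
        PRIMARY KEY (property_id, amendment_id)
    )
    """
)


def add_amendment(conn, property_id, note):
    """Insert one row numbered MAX(amendment_id) + 1 within its property_id."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading MAX
    try:
        (next_id,) = conn.execute(
            "SELECT COALESCE(MAX(amendment_id), 0) + 1 "
            "FROM amendment WHERE property_id = ?",
            (property_id,),
        ).fetchone()
        conn.execute(
            "INSERT INTO amendment (property_id, amendment_id, note) VALUES (?, ?, ?)",
            (property_id, next_id, note),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return next_id


ids = [
    add_amendment(conn, 4, "first"),
    add_amendment(conn, 4, "second"),
    add_amendment(conn, 10, "first for property 10"),
]
```

The composite primary key acts as a safety net: if two writers ever computed the same number, the second INSERT would fail instead of silently storing a duplicate.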
2). A Stored Procedure
If you use a stored procedure to control writes to the table, you gain a lot more control.
Implicitly, you know you're only dealing with one record.
You simply don't provide a parameter for DEFAULT fields.
You know what updates / deletes can and can't happen.
You can implement all the business logic you like without hidden triggers
I personally recommend the Stored Procedure route, but triggers do work.
It is important to get your data types right.
What you are describing is a multi-part key. So use a multi-part key. Don't try to encode everything into a magic integer, you will poison the rest of your code.
If a record is identified by (entity_id,version_number) then embrace that description and use it directly instead of mangling the meaning of your keys. You will have to write queries which constrain the version number but that's OK. Databases are good at this sort of thing.
version_number could be a timestamp, as a_horse_with_no_name suggests. This is quite a good idea. There is no meaningful performance disadvantage to using timestamps instead of plain integers. What you gain is meaning, which is more important.
You could maintain a "latest version" table which contains, for each entity_id, only the record with the most-recent version_number. This will be more work for you, so only do it if you really need the performance.

MySQL query and insertion optimisation with varchar(255) UUIDs

I think this question has been asked in some way, shape or form, but I couldn't find a question that asked exactly what I wish to understand, so I thought I'd put the question here.
Problem statement
I have built a web application with a MySQL database of, say, customer records with an INT(11) id PK AI field and a VARCHAR(255) uuid field. The uuid field is neither indexed nor set as unique. The uuid field is used as a public identifier, so it's part of URLs etc., e.g. https://web.com/get_customer/[uuid]. This was done because the UUID is 'harder' to guess for a regular John Doe, though I understand that it is certainly not 'unguessable' in theory. But the issue now is that as the database grows larger, I have observed that the query to retrieve a particular customer record is taking longer to complete.
My thoughts on how to solve the issue
The solution that is coming to mind is to make the uuid field unique and also index the same. But I've been doing some reading in relation to this and various blog posts, StackOverflow answers on this have described putting indices on UUIDs as being really bad for performance. I also read that it will also increase the time it takes to insert a new customer record into the database as the MySQL database will take time to find the correct location in which to place the record as a part of the index.
The above-mentioned https://web.com/get_customer/[uuid] can be accessed without having to authenticate, which is why I'm not using the id field for the same. It is possible for me to consider moving to integer-based UUIDs (I don't need the UUIDs to be universally unique; they just need to be unique for that particular table). Will that improve the indexing performance and, in turn, the insertion and querying performance?
Is there a good blog post or information page on how to best set up a database for such a requirement - Need the ability to store a customer record which is 'hard' to guess, easy to insert and easy to query in a large data set.
Any assistance is most appreciated. Thank you!
The received wisdom you mention about putting indexes on UUIDs only comes up when you use them in place of autoincrementing primary keys. Why? The entire table (InnoDB) is built behind the primary key as a clustered index, and bulk loading works best when the index values are sequential.
You certainly can put an ordinary index on your UUID column. If you want your INSERT operations to fail in the astronomically unlikely event you get a random duplicate UUID value you can use an index like this.
ALTER TABLE customer ADD UNIQUE INDEX uuid_constraint (uuid);
But duplicate UUIDv4s are very rare indeed. They have 122 random bits, and most software generating them these days uses cryptographic-quality random number generators. Omitting the UNIQUE index is, I believe, an acceptable risk. (Don't use UUIDv1, 2, 3, or 5: they're not hard enough to guess to keep your data secure.)
If your UUID index isn't unique, you save time on INSERTs and UPDATEs: they don't need to look at the index to detect uniqueness constraint violations.
Edit. When UUID data is in a UNIQUE index, INSERTs are more costly than they are in a similar non-unique index. Should you use a UNIQUE index? Not if you have a high volume of INSERTs. If you have a low volume of INSERTs it's fine to use UNIQUE.
This is the index to use if you omit UNIQUE:
ALTER TABLE customer ADD INDEX uuid (uuid);
To make lookups very fast you can use covering indexes. If your most common lookup query is, for example,
SELECT uuid, givenname, surname, email
FROM customer
WHERE uuid = :uuid
you can create this so-called covering index.
ALTER TABLE customer
ADD INDEX uuid_covering (uuid, givenname, surname, email);
Then your query will be satisfied directly from the index and therefore be faster.
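One way to see whether a query is answered from the index alone is to ask the database for its plan. The sketch below uses Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN as a stand-in (in MySQL you would look for "Using index" in the EXPLAIN output instead). The notes column and the helper plan are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer (
        id        INTEGER PRIMARY KEY,
        uuid      TEXT NOT NULL,
        givenname TEXT,
        surname   TEXT,
        email     TEXT,
        notes     TEXT
    )
    """
)
conn.execute(
    "CREATE INDEX uuid_covering ON customer (uuid, givenname, surname, email)"
)


def plan(sql):
    """Return SQLite's query-plan description for a one-parameter query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, ("x",)).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the detail text


# Every selected column is in the index, so the table itself is never read.
covered = plan("SELECT uuid, givenname, surname, email FROM customer WHERE uuid = ?")

# notes is not in the index, so each match needs an extra lookup in the table.
not_covered = plan("SELECT uuid, notes FROM customer WHERE uuid = ?")
```

Both queries use the index to find the row; only the first one can skip the trip back to the table, which is the saving a covering index buys.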
There's always an extra cost to INSERT and UPDATE operations when you have more indexes. But the cost of a full table scan for a query is, in a large table, far far greater than the extra INSERT or UPDATE cost. That's doubly true if you do a lot of queries.
In computer science there's often a space / time tradeoff. SQL indexes use space to save time. It's generally considered a good tradeoff.
(There's all sorts of trickery available to you by using composite primary keys to speed things up. But that's a topic for when you have gigarows.)
(You can also save index and table space by storing UUIDs in BINARY(16) columns and using the UUID_TO_BIN() and BIN_TO_UUID() functions to convert them.)
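To illustrate the space saving, here is a small sketch of the same conversion done on the application side with Python's standard uuid module. The helper names uuid_to_bin and bin_to_uuid are chosen to mirror the MySQL functions and are assumptions; the byte order is assumed to match UUID_TO_BIN() called without its optional swap flag.

```python
import uuid


def uuid_to_bin(text):
    """Pack the 36-character text form into 16 raw bytes."""
    return uuid.UUID(text).bytes


def bin_to_uuid(raw):
    """Unpack 16 raw bytes back into the canonical lowercase text form."""
    return str(uuid.UUID(bytes=raw))


public_id = str(uuid.uuid4())    # random (version 4) UUID, as recommended above
packed = uuid_to_bin(public_id)  # value to store in a BINARY(16) column
round_trip = bin_to_uuid(packed)
```

The text form takes 36 characters per value, so the 16-byte form is less than half the size in both the table and every index that contains the column.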

MySQL insert between ID's

If I have a database with the following information, how can I setup my next INSERT query so that the ID is filled in? (so that it is 5 in this instance.)
Basically, once it gets to 24, it will continue inserting in order (ex: 30,31,32)
You don't. Not with an auto-incrementing integer anyway.
You could change the column to not be an auto-incrementing integer, but then you'll need to determine the next ID before performing each insert which would make all of your INSERT queries unnecessarily complex and the code more difficult to maintain. Not to mention introducing a significant point of failure if multiple threads try to insert and the operation to find the next ID and insert a record isn't fully atomic.
Why do you even need this? There's no reason for a database-generated primary key integer to be contiguous like that. Its purpose is to be unique, and as long as it serves that purpose it's working. There's no need to "fill in the holes" left by previously deleted records.
You could add a different column to the database and perform the logic for finding the next contiguous number when inserting records on that column. But you'd still run into the same aforementioned problems of race conditions and unnecessary complexity.
Change your filename to something more meaningful than the id.
I think something like files/uploads/20130515_170349.wv (for the first row) makes a lot of sense (assuming you don't have more than one file per second).
This also has the advantage that ordering the file names alphabetically is chronological order, making it easier to see the newer and older files.
You can just give it the id field and value:
INSERT INTO table (id, etc, etc) VALUES (5, etc, etc);
However, I don't think you can do it dynamically. If id is auto-increment, then it'll keep on incrementing whether or not previous tuples have been deleted etc.

Maintaining a list of unique values in a database

Let's say you have a random number generator spitting out numbers between 1 and 100 000 000 and you want to store them in a database (MySQL) with the timestamp when they were generated. If a number that has previously been seen comes up, it is discarded.
What would be the best algorithm to make this happen? SELECT then INSERT as necessary? Is there something more efficient?
You can go for a SEQUENCE:
+
no relations are being locked, thus best performance;
no race conditions;
portable.
-
it is possible to get “gaps” in the series of numbers.
You can do a SELECT ... then INSERT ...:
+
no gaps, you can also do some complicated math on your numbers.
-
it's possible for another parallel session to get in between the SELECT and the INSERT, so you end up with 2 equal numbers;
if there's a UNIQUE constraint, then the previous situation will lead to an exception;
to avoid such a situation, you might go for explicit table locks, but this will cause an immediate performance impact.
You can choose INSERT ON DUPLICATE KEY UPDATE, and by now it seems to be the best option (take a look at "INSERT IGNORE" vs "INSERT ... ON DUPLICATE KEY UPDATE"), at least in my view, with the only exception — not portable to other RDBMSes.
P.S. This article is not related to MySQL, but it is worth reading it to get an overview of all the catches that can happen on your way.
If you don't need to insert a new random value every time you can use INSERT IGNORE or REPLACE INTO. Otherwise you should SELECT to check and then INSERT.
This would normally be solved by creating a unique index on the random number column in the table. You could experiment to see if a b-tree versus a hash has better performance.
If you have lots of memory, you could pre-populate a table with 100,000,000 rows -- all possible values. Then, when you look to see if something is already created, you only need to see if the time stamp is non-null. However, this would require over a Gbyte of RAM to store the table in memory, and would only be the optimal solution if you are trying to maximize transactions per second.
If you put a UNIQUE index on the column with the extracted numbers any INSERT attempting to duplicate a UNIQUE key will fail.
Therefore the easiest and most portable version will be (PHP code, but you get the idea):
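function extraction() {
    do {
        $random = generate_random_number();
        // @ suppresses any PHP warning; mysql_query() returns false
        // when the UNIQUE index rejects a duplicate number
        $result = @mysql_query("INSERT INTO extractions(number) VALUES ($random)");
    } while (!$result); // retry with a new number until the INSERT succeeds
    return $random;
}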
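For the case in the question, where numbers arrive from outside and a repeat is simply discarded, the same unique-index idea can be sketched like this. It uses Python's sqlite3 module as a stand-in for MySQL; the seen_at column and the function record_number are assumptions for illustration, and with a MySQL driver the exception class to catch would be that driver's own integrity error.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE extractions (number INTEGER NOT NULL UNIQUE, seen_at REAL NOT NULL)"
)


def record_number(conn, number):
    """Store number with its timestamp; return False if it was seen before."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO extractions (number, seen_at) VALUES (?, ?)",
                (number, time.time()),
            )
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint rejected the row: the number is a repeat.
        return False


results = [record_number(conn, n) for n in (42, 7, 42, 99, 7)]
count = conn.execute("SELECT COUNT(*) FROM extractions").fetchone()[0]
```

There is no SELECT before the INSERT, so there is no window in which another session could slip in the same number; the index makes the check and the write a single operation.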

UPDATE vs INSERT performance

Am I correct to assume that an UPDATE query takes more resources than an INSERT query?
I am not a database guru, but here are my two cents:
Personally, I don't think there is much you can do in this regard: even if INSERT were faster (which remains to be proven), can you convert an UPDATE into an INSERT? Frankly, I don't think you can do it every time.
During an INSERT you don't usually have to use WHERE to identify which row to update, but depending on your indices on that table, the operation can have some cost.
During an UPDATE, if you do not change any column included in any index, you could have quick execution, provided the WHERE clause is simple and fast enough.
Nothing is set in stone, and I would imagine it really depends on the whole database setup, indices and so on.
Anyway, found this one as a reference:
Top 84 MySQL Performance Tips
If you plan to perform a large processing (such as rating or billing for a cellular company), this question has a huge impact on system performance.
Replacing large-scale updates with the creation of many new tables and indexes reduced my company's billing process from 26 hours to 1 hour!
I tried it on 2 million records for 100,000 customers.
In the first option, I created the billing table and then, for every customer's call summary, updated the billing table with the duration, price, discount... a total of 10 fields.
In the second option I created 4 phases.
Each phase reads the previous table(s) and creates an index (after the table insert has completed); using "insert into from select ..", I created the next table for the next phase.
Summary
Although the second alternative requires much more disk space (all views and temporary tables deleted at the end) there are 3 main advantages to this option:
It was 4 times faster than option 1.
In case there was a problem in the middle of the process, I could restart from the point where it failed, as all the tables for the beginning of the phase were ready. If the process fails while implementing the first option, you will need to start the whole process over again.
This made the development and QA work much faster, as they could work in parallel.
The key resource here is disk access (IOPS to be precise), and we should evaluate which operation results in the least of it.
I agree with others that it is impossible to give a generic answer, but here are some thoughts to lead you in the right direction: assume a simple key-value store where the key is indexed. Insertion is inserting a new key, and update is updating the value of an existing key.
If that is the case (a very common case), an update would be faster than an insertion, because an update involves an indexed lookup and changing an existing value without touching the index. You can assume that is one disk read to get the data and possibly one disk write. On the other hand, an insertion would involve two disk writes: one for the index, one for the data. But another hidden cost is the B-tree node splitting and new node creation that can happen in the background during insertion, leading to more disk access on average.
You cannot compare an INSERT and an UPDATE in general. Give us an example (with schema definition) and we will explain which one costs more and why. Also, you can compare a concrete INSERT and an UPDATE by checking their plan and execution time.
Some rules of thumb though:
if you update only one field, which is not indexed, and you update only one record, and you use the rowid/primary key to find that record, then this UPDATE will cost less than
an INSERT that also affects only one row but whose row has many NOT NULL-constrained, indexed fields, all of whose indexes have to be maintained (e.g. by adding a new leaf)
It depends. A simple UPDATE that uses a primary key in the WHERE clause and updates only a single non-indexed field would likely be less costly than an INSERT on the same table. But even that depends on the database engine involved. An UPDATE that involved modifying many indexed fields, however, might be more costly than the INSERT on that table because more index key modifications would be required. An UPDATE with a poorly constructed WHERE clause that required a table scan of millions of records would certainly be more expensive than an INSERT on that table.
These statements can take many forms, but if you limit the discussion to their "basic" forms that involve a single record, then the larger portion of the cost will usually be dedicated to modifying the indexes. Each indexed field that is modified during an UPDATE would typically involve two basic operations (delete the old key and add the new key) whereas the INSERT would require one (add the new key). Of course, a clustered index would then add some other dynamics as would locking issues, transaction isolation, etc. So, ultimately, the comparison between these statements in a general sense is not really possible and would probably require benchmarking of specific statements if it actually mattered.
Typically, though, it makes sense to just use the correct statement and not worry about it since it is usually not an option to choose between an UPDATE and an INSERT.
It depends. If the update doesn't require changes to the key, it will most likely cost only about as much as a search, and then it will probably cost less than an insert, unless the database is organized like a heap.
This is the only thing I can state, because performance depends greatly on the database organization used.
If, for example, you use MyISAM, which I suppose is organized like ISAM, an insert should cost generally the same in terms of database read accesses, but it will require some additional write operations.
On Sybase / SQL Server an update which impacts a column with a read-only index is internally replaced by a delete and then an insert, so this is obviously slower than insert. I do not know the implementation for other engines but I think this is a common strategy at least when indices are involved.
Now for tables without indices ( or for update requests not involving any index ) I suppose there are cases where the update can be faster, depending on the structure of the table.
In MySQL you can change your update to an insert with ON DUPLICATE KEY UPDATE:
INSERT INTO t1 (a,b,c) VALUES (1,2,3)
ON DUPLICATE KEY UPDATE c=c+1;
UPDATE t1 SET c=c+1 WHERE a=1;
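The insert-or-update behaviour of the first statement can be tried out with Python's sqlite3 module. This is a sketch under assumptions: SQLite has no ON DUPLICATE KEY UPDATE, so the closest equivalent, ON CONFLICT ... DO UPDATE (available from SQLite 3.24), is used instead, and the helper upsert is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (a INTEGER PRIMARY KEY, b INTEGER, c INTEGER)")


def upsert(conn, a, b, c):
    """Insert (a, b, c); if a already exists, increment the stored c instead."""
    with conn:
        conn.execute(
            # MySQL spelling: INSERT ... ON DUPLICATE KEY UPDATE c = c + 1
            "INSERT INTO t1 (a, b, c) VALUES (?, ?, ?) "
            "ON CONFLICT(a) DO UPDATE SET c = c + 1",
            (a, b, c),
        )


upsert(conn, 1, 2, 3)  # no row with a=1 yet: behaves like a plain INSERT
after_insert = conn.execute("SELECT a, b, c FROM t1").fetchall()

upsert(conn, 1, 2, 3)  # a=1 exists: behaves like UPDATE t1 SET c=c+1 WHERE a=1
after_update = conn.execute("SELECT a, b, c FROM t1").fetchall()
```

On the second call the values supplied for b and c are ignored; only the update clause runs against the existing row.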
A lot of people here are commenting that you can't compare an insert vs update but I disagree. People should understand that an update takes a lot more resources than insert or even possibly deleting and inserting.
As for how you can even compare the two when one doesn't directly replace the other: in certain cases you make an insert and then update the table with data from another table.
For instance I get a feed from an API which contains id1, but this table relates to another table and I would like to add table2_id. Instead of doing an update statement that takes a lot more resources, I can handle this in the backend which is faster and just do an insert statement instead of an insert and then an update. The update statement also locks the table causing a traffic jam so to speak.