UPDATE vs INSERT performance - mysql

Am I correct to assume that an UPDATE query takes more resources than an INSERT query?

I am not a database guru, but here are my two cents:
Personally I don't think there is much you can do in this regard. Even if INSERT were faster (which remains to be proven), can you convert an UPDATE into an INSERT? Frankly, I don't think you can do that every time.
During an INSERT you don't usually need a WHERE clause to identify a row, but depending on the indices on that table the operation can still have some cost.
During an UPDATE, if you do not change any column included in an index, execution can be quick, provided the WHERE clause is simple and fast enough.
Nothing is written in stone, and I would imagine it depends on the whole database setup, indices and so on.
Anyway, found this one as a reference:
Top 84 MySQL Performance Tips

If you plan to perform large-scale processing (such as rating or billing for a cellular company), this question has a huge impact on system performance.
Replacing large-scale updates with the creation of many new tables and indexes reduced my company's billing process from 26 hours to 1 hour!
I tried it on 2 million records for 100,000 customers.
In the first option, I created the billing table first and then, for every customer's call summary, updated the billing table with the duration, price, discount and so on: a total of 10 fields.
In the second option I created 4 phases.
Each phase reads the previous table(s) and creates the table for the next phase using "insert into from select ..", then creates the index once the table insert has completed.
Summary
Although the second alternative requires much more disk space (all views and temporary tables are deleted at the end), there are 3 main advantages to this option:
It was 4 times faster than option 1.
If there was a problem in the middle of the process, I could restart from the point where it failed, because all the tables needed at the beginning of that phase were ready. If the process fails with the first option, you need to start the whole process all over again.
This made the development and QA work much faster, as they could work in parallel.
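The phased approach above can be sketched roughly as follows. This is only an illustration under assumptions: it uses Python's built-in sqlite3 instead of MySQL, and the table names (calls, phase1_totals, phase2_bill) and the rate of 2 cents per second are hypothetical, not taken from the original billing process.

```python
import sqlite3

def run_phases(conn):
    """Build each phase table from the previous one, indexing only after the load."""
    cur = conn.cursor()
    # Phase 1: aggregate raw calls per customer with INSERT ... SELECT.
    cur.execute("CREATE TABLE phase1_totals (customer_id INTEGER, total_seconds INTEGER)")
    cur.execute(
        "INSERT INTO phase1_totals "
        "SELECT customer_id, SUM(seconds) FROM calls GROUP BY customer_id"
    )
    # The index is created after the bulk insert has finished.
    cur.execute("CREATE INDEX idx_phase1_customer ON phase1_totals (customer_id)")
    # Phase 2: price the totals, reading only the phase 1 table.
    # The rate of 2 cents per second is a made-up value for the example.
    cur.execute("CREATE TABLE phase2_bill (customer_id INTEGER, amount_cents INTEGER)")
    cur.execute(
        "INSERT INTO phase2_bill "
        "SELECT customer_id, total_seconds * 2 FROM phase1_totals"
    )
    cur.execute("CREATE INDEX idx_phase2_customer ON phase2_bill (customer_id)")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (customer_id INTEGER, seconds INTEGER)")
conn.executemany("INSERT INTO calls VALUES (?, ?)", [(1, 60), (1, 30), (2, 120)])
run_phases(conn)
bills = conn.execute(
    "SELECT customer_id, amount_cents FROM phase2_bill ORDER BY customer_id"
).fetchall()
print(bills)
```

Because every phase leaves its input table in place, a failed phase can be rerun from that table without repeating the earlier ones.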

The key resource here is disk access (IOPS, to be precise), and we should evaluate which operation results in the least of it.
I agree with the others that it is impossible to give a generic answer, but here are some thoughts to lead you in the right direction. Assume a simple key-value store where the key is indexed. An insertion adds a new key, and an update changes the value of an existing key.
If that is the case (a very common one), an update would be faster than an insertion, because an update involves an indexed lookup and changes an existing value without touching the index. You can assume that is one disk read to get the data and possibly one disk write. An insertion, on the other hand, would involve two disk writes: one for the index and one for the data. Another hidden cost is B-tree node splitting and new node creation, which happen in the background during insertion and lead to more disk access on average.

You cannot compare an INSERT and an UPDATE in general. Give us an example (with schema definition) and we will explain which one costs more and why. You can also compare a concrete INSERT and UPDATE by checking their plans and execution times.
Some rules of thumb, though:
if you update only one field, which is not indexed, in only one record, and you use the rowid/primary key to find that record, then this UPDATE will cost less than
an INSERT which also affects only one row, but where that row has many NOT NULL constrained, indexed fields, and all those indexes have to be maintained (e.g. by adding a new leaf)

It depends. A simple UPDATE that uses a primary key in the WHERE clause and updates only a single non-indexed field would likely be less costly than an INSERT on the same table. But even that depends on the database engine involved. An UPDATE that involved modifying many indexed fields, however, might be more costly than the INSERT on that table because more index key modifications would be required. An UPDATE with a poorly constructed WHERE clause that required a table scan of millions of records would certainly be more expensive than an INSERT on that table.
These statements can take many forms, but if you limit the discussion to their "basic" forms that involve a single record, then the larger portion of the cost will usually be dedicated to modifying the indexes. Each indexed field that is modified during an UPDATE would typically involve two basic operations (delete the old key and add the new key) whereas the INSERT would require one (add the new key). Of course, a clustered index would then add some other dynamics as would locking issues, transaction isolation, etc. So, ultimately, the comparison between these statements in a general sense is not really possible and would probably require benchmarking of specific statements if it actually mattered.
Typically, though, it makes sense to just use the correct statement and not worry about it since it is usually not an option to choose between an UPDATE and an INSERT.
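Since the answer above suggests benchmarking specific statements, here is a minimal timing sketch. It is an assumption-laden stand-in: it runs against an in-memory SQLite database through Python's sqlite3 module rather than MySQL, the table t and its columns are hypothetical, and the numbers it prints say nothing about any other engine or schema.

```python
import sqlite3
import time

def time_statements(n_rows=2000):
    """Time single-row INSERTs against primary-key UPDATEs on one small table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t (id INTEGER PRIMARY KEY, indexed_col INTEGER, plain_col INTEGER)"
    )
    conn.execute("CREATE INDEX idx_t_indexed ON t (indexed_col)")

    # INSERT: every row also adds one key to the secondary index.
    start = time.perf_counter()
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", ((i, i, 0) for i in range(n_rows)))
    insert_s = time.perf_counter() - start

    # UPDATE of a non-indexed column, located by primary key.
    start = time.perf_counter()
    conn.executemany(
        "UPDATE t SET plain_col = plain_col + 1 WHERE id = ?",
        ((i,) for i in range(n_rows)),
    )
    update_plain_s = time.perf_counter() - start

    # UPDATE of an indexed column: the old index key is removed and a new one added.
    start = time.perf_counter()
    conn.executemany(
        "UPDATE t SET indexed_col = indexed_col + ? WHERE id = ?",
        ((n_rows, i) for i in range(n_rows)),
    )
    update_indexed_s = time.perf_counter() - start

    rows = conn.execute("SELECT COUNT(*), SUM(plain_col) FROM t").fetchone()
    conn.close()
    return {
        "insert": insert_s,
        "update_plain": update_plain_s,
        "update_indexed": update_indexed_s,
        "rows": rows,
    }

timings = time_statements()
print(timings)
```

The same three-way comparison (insert, update of a plain column, update of an indexed column) is what you would want to repeat against your real engine and schema before drawing conclusions.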

It depends. If the update doesn't require changes to the key, it will most likely cost only about as much as a search, and then it will probably cost less than an insert, unless the database is organized like a heap.
This is the only thing I can state, because performance greatly depends on the database organization used.
If, for example, you use MyISAM, which I suppose is organized like an ISAM, an insert should generally cost the same in terms of database read accesses, but it will require some additional write operations.

On Sybase / SQL Server an update which impacts a column with a read-only index is internally replaced by a delete and then an insert, so this is obviously slower than insert. I do not know the implementation for other engines but I think this is a common strategy at least when indices are involved.
Now for tables without indices ( or for update requests not involving any index ) I suppose there are cases where the update can be faster, depending on the structure of the table.

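In MySQL you can change your UPDATE into an INSERT with ON DUPLICATE KEY UPDATE. If column a is declared UNIQUE and already contains the value 1, the following two statements have a similar effect:
INSERT INTO t1 (a,b,c) VALUES (1,2,3)
  ON DUPLICATE KEY UPDATE c=c+1;
UPDATE t1 SET c=c+1 WHERE a=1;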
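To try the insert-or-update behaviour without a MySQL server, here is a small sketch using Python's sqlite3. Note the assumptions: SQLite spells the feature ON CONFLICT ... DO UPDATE (available from SQLite 3.24), so this is an analogue of MySQL's ON DUPLICATE KEY UPDATE rather than the same syntax, and the upsert helper is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (a INTEGER PRIMARY KEY, b INTEGER, c INTEGER)")

def upsert(conn, a, b, c):
    # SQLite analogue of MySQL's ON DUPLICATE KEY UPDATE (assumes SQLite 3.24+).
    conn.execute(
        "INSERT INTO t1 (a, b, c) VALUES (?, ?, ?) "
        "ON CONFLICT(a) DO UPDATE SET c = c + 1",
        (a, b, c),
    )

upsert(conn, 1, 2, 3)  # no row with a=1 yet, so this is a plain insert
upsert(conn, 1, 2, 3)  # a=1 exists, so this acts like UPDATE t1 SET c=c+1 WHERE a=1
row = conn.execute("SELECT a, b, c FROM t1").fetchone()
print(row)
```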

A lot of people here are commenting that you can't compare an insert vs an update, but I disagree. People should understand that an update takes a lot more resources than an insert, or possibly even than deleting and inserting.
As for how you can even compare the two when one doesn't directly replace the other: in certain cases you make an insert and then update the table with data from another table.
For instance, I get a feed from an API which contains id1, but this table relates to another table and I would like to add table2_id. Instead of doing an update statement that takes a lot more resources, I can handle this in the backend, which is faster, and just do an insert statement instead of an insert followed by an update. The update statement also locks the table, causing a traffic jam so to speak.

Related

Can/should I make id column that is part of a composite key non-unique [duplicate]

I have got a table which has an id (primary key with auto increment), a uid (key referring to a user's id, for example) and something else which won't matter for my question.
I want to make, let's call it, a separate auto-increment sequence on id for each uid value.
So, I will add an entry with uid 10, and the id field for this entry will have a 1 because there were no previous entries with a value of 10 in uid. I will add a new one with uid 4 and its id will be 3 because there were already two entries with uid 4.
...A very obvious explanation, but I am trying to be as explanatory and clear as I can to demonstrate the idea.
What SQL engine can provide such a functionality natively? (non Microsoft/Oracle based)
If there is none, how could I best replicate it? Triggers perhaps?
Does this functionality have a more suitable name?
In case you know about a non-SQL database engine providing such functionality, name it anyway, I am curious.
Thanks.
MySQL's MyISAM engine can do this. See their manual, in section Using AUTO_INCREMENT:
For MyISAM tables you can specify AUTO_INCREMENT on a secondary column in a multiple-column index. In this case, the generated value for the AUTO_INCREMENT column is calculated as MAX(auto_increment_column) + 1 WHERE prefix=given-prefix. This is useful when you want to put data into ordered groups.
The docs go on after that paragraph, showing an example.
The InnoDB engine in MySQL does not support this feature, which is unfortunate because it's better to use InnoDB in almost all cases.
You can't emulate this behavior using triggers (or any SQL statements limited to transaction scope) without locking tables on INSERT. Consider this sequence of actions:
Mario starts transaction and inserts a new row for user 4.
Bill starts transaction and inserts a new row for user 4.
Mario's session fires a trigger to compute MAX(id)+1 for user 4. Mario gets 3.
Bill's session fires a trigger to compute MAX(id)+1 for user 4. Bill also gets 3, because Mario's uncommitted row is not visible to him.
Bill's session finishes his INSERT and commits.
Mario's session tries to finish his INSERT, but the row with (userid=4, id=3) now exists, so Mario gets a primary key conflict.
In general, you can't control the order of execution of these steps without some kind of synchronization.
The solutions to this are either:
Get an exclusive table lock. Before trying an INSERT, lock the table. This is necessary to prevent concurrent INSERTs from creating a race condition like in the example above. It's necessary to lock the whole table: since you're trying to restrict INSERT, there's no specific row to lock (if you were trying to govern access to a given row with UPDATE, you could lock just the specific row). But locking the table causes access to the table to become serial, which limits your throughput.
Do it outside transaction scope. Generate the id number in a way that won't be hidden from two concurrent transactions. By the way, this is what AUTO_INCREMENT does. Two concurrent sessions will each get a unique id value, regardless of their order of execution or order of commit. But tracking the last generated id per userid requires access to the database, or a duplicate data store. For example, a memcached key per userid, which can be incremented atomically.
It's relatively easy to ensure that inserts get unique values. But it's hard to ensure they will get consecutive ordinal values. Also consider:
What happens if you INSERT in a transaction but then roll back? You've allocated id value 3 in that transaction, and then I allocated value 4, so if you roll back and I commit, now there's a gap.
What happens if an INSERT fails because of other constraints on the table (e.g. another column is NOT NULL)? You could get gaps this way too.
If you ever DELETE a row, do you need to renumber all the following rows for the same userid? What does that do to your memcached entries if you use that solution?
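The locking approach described above can be sketched like this. It is a hedged illustration, not MySQL code: it uses Python's sqlite3, where BEGIN IMMEDIATE takes the database-wide write lock and so plays the role of the exclusive table lock, and the entries table and insert_entry helper are hypothetical names.

```python
import sqlite3

# isolation_level=None lets us issue BEGIN/COMMIT ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE entries (uid INTEGER, id INTEGER, PRIMARY KEY (uid, id))")

def insert_entry(conn, uid):
    """Allocate the next per-uid id and insert the row under one write lock."""
    # BEGIN IMMEDIATE acquires the write lock up front, so no other writer
    # can slip in between the MAX(id) lookup and the INSERT.
    conn.execute("BEGIN IMMEDIATE")
    try:
        (next_id,) = conn.execute(
            "SELECT COALESCE(MAX(id), 0) + 1 FROM entries WHERE uid = ?", (uid,)
        ).fetchone()
        conn.execute("INSERT INTO entries (uid, id) VALUES (?, ?)", (uid, next_id))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return next_id

ids = [insert_entry(conn, 10), insert_entry(conn, 4),
       insert_entry(conn, 4), insert_entry(conn, 4)]
print(ids)
```

This serializes all writers, which is exactly the throughput cost mentioned above, and it still leaves gaps behind if rows are later deleted.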
SQL Server should allow you to do this. If you can't implement this using a computed column (probably not - there are some restrictions), surely you can implement it in a trigger.
MySQL also would allow you to implement this via triggers.
In a comment you ask the question about efficiency. Unless you are dealing with extreme volumes, storing an 8 byte DATETIME isn't much of an overhead compared to using, for example, a 4 byte INT.
It also massively simplifies your data inserts, as well as being able to cope with records being deleted without creating 'holes' in your sequence.
If you DO need this, be careful with the field names. If you have uid and id in a table, I'd expect id to be unique in that table, and uid to refer to something else. Perhaps, instead, use the field names property_id and amendment_id.
In terms of implementation, there are generally two options.
1). A trigger
Implementations vary, but the logic remains the same. As you don't specify an RDBMS (other than NOT MS/Oracle) the general logic is simple...
Start a transaction (often this is implicitly already started inside triggers)
Find the MAX(amendment_id) for the property_id being inserted
Update the newly inserted value with MAX(amendment_id) + 1
Commit the transaction
Things to be aware of are...
- multiple records being inserted at the same time
- records being inserted with amendment_id being already populated
- updates altering existing records
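The trigger logic above might look roughly like this. This is a sketch using SQLite trigger syntax through Python's sqlite3, so it will differ from MySQL or other engines, and the amendments table and number_amendment trigger are hypothetical names. The WHEN clause skips rows whose amendment_id is already populated; it does nothing about concurrent sessions or later updates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE amendments (property_id INTEGER NOT NULL, amendment_id INTEGER, note TEXT)"
)
# After each insert without an explicit amendment_id, set it to
# MAX(amendment_id) + 1 among the rows of the same property_id.
conn.execute("""
    CREATE TRIGGER number_amendment AFTER INSERT ON amendments
    WHEN NEW.amendment_id IS NULL
    BEGIN
        UPDATE amendments
        SET amendment_id = (
            SELECT COALESCE(MAX(amendment_id), 0) + 1
            FROM amendments
            WHERE property_id = NEW.property_id
        )
        WHERE rowid = NEW.rowid;
    END
""")
for property_id, note in [(1, "a"), (1, "b"), (2, "c")]:
    conn.execute(
        "INSERT INTO amendments (property_id, note) VALUES (?, ?)", (property_id, note)
    )
numbered = conn.execute(
    "SELECT property_id, amendment_id FROM amendments ORDER BY rowid"
).fetchall()
print(numbered)
```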
2). A Stored Procedure
If you use a stored procedure to control writes to the table, you gain a lot more control.
Implicitly, you know you're only dealing with one record.
You simply don't provide a parameter for DEFAULT fields.
You know what updates / deletes can and can't happen.
You can implement all the business logic you like without hidden triggers
I personally recommend the Stored Procedure route, but triggers do work.
It is important to get your data types right.
What you are describing is a multi-part key. So use a multi-part key. Don't try to encode everything into a magic integer; you will poison the rest of your code.
If a record is identified by (entity_id,version_number) then embrace that description and use it directly instead of mangling the meaning of your keys. You will have to write queries which constrain the version number but that's OK. Databases are good at this sort of thing.
version_number could be a timestamp, as a_horse_with_no_name suggests. This is quite a good idea. There is no meaningful performance disadvantage to using timestamps instead of plain integers. What you gain is meaning, which is more important.
You could maintain a "latest version" table which contains, for each entity_id, only the record with the most-recent version_number. This will be more work for you, so only do it if you really need the performance.
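As a sketch of what querying a multi-part key looks like, the following uses Python's sqlite3 with a hypothetical documents table keyed by (entity_id, version_number). It computes the latest version with a query instead of maintaining a separate "latest version" table, which is the simpler option to start with.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE documents (
        entity_id INTEGER NOT NULL,
        version_number INTEGER NOT NULL,
        body TEXT,
        PRIMARY KEY (entity_id, version_number)
    )
""")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [(1, 1, "draft"), (1, 2, "final"), (2, 1, "only")],
)

def latest_versions(conn):
    """Return the row with the highest version_number for each entity_id."""
    return conn.execute("""
        SELECT d.entity_id, d.version_number, d.body
        FROM documents AS d
        WHERE d.version_number = (
            SELECT MAX(version_number) FROM documents WHERE entity_id = d.entity_id
        )
        ORDER BY d.entity_id
    """).fetchall()

latest = latest_versions(conn)
print(latest)
```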

Is select faster than insert

I have a big load file that I downloaded. This contains records that I will have to load into the database. Based on the size of the data, it will likely take 2 weeks or more to finish (since there is preprocessing etc). A coworker asked me to make what she called a delta file, which checks the current database to see if the data already exists based on a certain field in the database and IFF it exists then we will keep that in the load file, otherwise we will discard it.
I'm confused, because to implement this I would need to do a select query for every record in the load file to check whether it exists. A select would take O(n), I'm assuming, and then the insert (for a smaller data set) an additional O(1),
whereas an insert alone would just take O(1).
I'd like to 1) understand why this implementation is faster (If I don't understand things properly) and 2) a possible solution to implementation of this delta file if you can think of something smarter than what I suggested
Thanks
Databases make indexes for columns specified in the schema. The way your data is indexed can make a massive difference in performance. Without an index, a select operation may be O(n) but with an index it may be O(1).
Insert operations must maintain the index. For large data loading operations you may be well off to disable indexing until the end so you are doing a single index update on all the data instead of many index updates on each record you insert.
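To see the effect of an index on a lookup, and the load-first, index-afterwards order suggested above, here is a small sketch with Python's sqlite3. The records table, the record_key column and the plan helper are hypothetical, and the plan text shown by EXPLAIN QUERY PLAN is SQLite-specific; MySQL's EXPLAIN output looks different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (record_key TEXT, payload TEXT)")

def plan(conn):
    """Return SQLite's query plan text for a lookup by record_key."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT 1 FROM records WHERE record_key = ?", ("k42",)
    ).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " ".join(row[3] for row in rows)

# Bulk load first, with no index to maintain on every insert.
conn.executemany(
    "INSERT INTO records VALUES (?, ?)", ((f"k{i}", "x") for i in range(10000))
)
plan_before = plan(conn)  # a full table scan

# Build the index once, after the load.
conn.execute("CREATE INDEX idx_records_key ON records (record_key)")
plan_after = plan(conn)   # a search that uses idx_records_key

print(plan_before)
print(plan_after)
```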
Some measurements I did the other day indicate that selects are faster than inserts in my situation. I came across this question because I am trying to learn whether this is generally true or reflects something specific about the way I have it set up.

Maintaining a list of unique values in a database

Let's say you have a random number generator spitting out numbers between 1 and 100 000 000 and you want to store them in a database (MySQL) with the timestamp when they were generated. If a number that has previously been seen comes up, it is discarded.
What would be the best algorithm to make this happen? SELECT then INSERT as necessary? Is there something more efficient?
You can go for a SEQUENCE:
+
no relations are being locked, thus best performance;
no race conditions;
portable.
-
it is possible to get “gaps” in the series of numbers.
You can do a SELECT ... then INSERT ...:
+
no gaps, you can also do some complicated math on your numbers.
-
it's possible for another parallel session to get in between the SELECT and the INSERT, so that you end up with 2 equal numbers;
if there's a UNIQUE constraint, then the previous situation will lead to an exception;
to avoid such a situation, you might go for explicit table locks, but this will have an immediate performance impact.
You can choose INSERT ON DUPLICATE KEY UPDATE, and by now it seems to be the best option (take a look at "INSERT IGNORE" vs "INSERT ... ON DUPLICATE KEY UPDATE"), at least in my view, with the only exception — not portable to other RDBMSes.
P.S. This article is not related to MySQL, but it is worth reading it to get an overview of all the catches that can happen on your way.
If you don't need to insert a new random value every time you can use INSERT IGNORE or REPLACE INTO. Otherwise you should SELECT to check and then INSERT.
This would normally be solved by creating a unique index on the random number column in the table. You could experiment to see if a b-tree versus a hash has better performance.
If you have lots of memory, you could pre-populate a table with 100,000,000 rows -- all possible values. Then, to check whether a number has already been seen, you only need to check whether its timestamp is non-null. However, this would require over a Gbyte of RAM to store the table in memory, and would only be the optimal solution if you are trying to maximize transactions per second.
If you put a UNIQUE index on the column with the extracted numbers any INSERT attempting to duplicate a UNIQUE key will fail.
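Therefore the easiest and most portable version will be (PHP code, but you get the idea):
function extraction() {
    do {
        $random = generate_random_number();
        // @ suppresses PHP warnings; mysql_query() returns false when the
        // INSERT fails, e.g. because the UNIQUE index rejects a duplicate.
        $result = @mysql_query("INSERT INTO extractions(number) VALUES ($random)");
    } while (!$result); // retry with a fresh number until the INSERT succeeds
    return $random;
}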
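The same unique-index idea, for the original "discard numbers already seen" requirement, can be sketched in Python with sqlite3. This is an illustration under assumptions: SQLite stands in for MySQL, and the seen_at column and record_number helper are hypothetical names. A duplicate raises sqlite3.IntegrityError, which is treated as "already seen".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE extractions ("
    "number INTEGER NOT NULL UNIQUE, "
    "seen_at TEXT DEFAULT CURRENT_TIMESTAMP)"
)

def record_number(conn, number):
    """Store the number with a timestamp; return False if it was seen before."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO extractions (number) VALUES (?)", (number,))
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint rejected a duplicate: discard it.
        return False

results = [record_number(conn, n) for n in (7, 42, 7)]
stored = conn.execute("SELECT COUNT(*) FROM extractions").fetchone()[0]
print(results, stored)
```

Letting the constraint decide avoids the window between a separate SELECT and INSERT in which another session could store the same number.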

MySQL: is Partitioning a good way to handle deletes?

I have a MySQL table:
CREATE TABLE responses (
    id INT NOT NULL AUTO_INCREMENT,
    other_id INT NOT NULL,
    details TEXT,
    deleted BOOLEAN,
    PRIMARY KEY (id)
);
Users could delete records in responses.
My plan is to use the field deleted to implement deletion. Any time a user deletes a record, I'll set deleted to 1.
Every now and then I might want to get rid of all the deleted records or to archive them. I'm thinking of using partitioning to make that faster:
PARTITION BY LIST(deleted) (
    PARTITION pActive VALUES IN (0),
    PARTITION pDeleted VALUES IN (1)
);
My question is: would this make the act of deletion slower? When I change a record's "deleted" field, MySQL would need to move the record to an entirely different partition. That seems like it could be slow.
Any suggestions would be greatly appreciated.
Yes, I'd expect the transition between the two states to be slower, because the row has to be moved between partitions. Queries that filter on deleted/non-deleted status would be faster on existing rows, though queries that don't involve the deletion status wouldn't be improved.
It's all about what is the most common operation in the table, and accepting that there can be compromises.
I've been on a project in the past where this approach was used, and my personal feeling is that it's not the best approach. I think it's better to just delete the records. When you have a flag like this, everyone who uses your database has to understand that records that exist in the table might not be "real" records, depending on whether or not the deleted flag is set. It just makes the database less intuitive, and harder to use in my opinion.
If you're concerned about performance, I'd look at properly allocating your tablespaces, and you can still employ a partitioning scheme. You could partition the data by year and month (if you need that level of granularity) to help with the performance.
But I would avoid the deleted flag. On the project I worked on, it just became a real headache. For example, what if somebody tries to insert another record exactly the same as the one that was "deleted" (deleted here means deleted flag is true). Do you set deleted to false on the existing record, or do you insert another brand new record? If you insert a brand new record, how do you define your primary key on the table since now you have 2 records with the same key? Do you make deleted part of the key? The point is that you have to deal with all those types of non-trivial issues.

Best way to update record X when Y is inserted

I have a huge table that is mainly used for backup and administrative purposes. The only record that matters is the last inserted record.
Ordering by time inserted on every hit is just too slow. I want to keep a separate table with the last inserted id.
In PHP I now insert, get the last inserted id, and update the other table.
Is there a more efficient way to do this?
You could do this on the database end by using a trigger.
(Sorry about posting this as a separate answer, was a bit too long for a comment on Matti's answer.)
There is a small performance overhead associated with triggers, but if I recall correctly it's fairly negligible for normal use (depending on what you're doing with it of course). Mostly it'd only be a problem if you're performing bulk uploads (in which case you'd usually drop/disable the triggers for the duration of the task). Seems to me that the overhead here would be very minimal seeing as you're only really performing one INSERT/UPDATE on X in addition to the INSERT on Y.
Essentially, a trigger will scale a lot better compared to your current method because instead of having to perform a lookup to find the last updated record you can just perform the insert operation, then directly insert the primary key of the new record into the "last updated" table.
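A trigger of the kind described above could be sketched as follows. This uses SQLite trigger syntax via Python's sqlite3, so a MySQL version would be written differently, and the audit_log and last_inserted tables and the remember_last trigger are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audit_log (id INTEGER PRIMARY KEY AUTOINCREMENT, message TEXT)")
# A one-row table holding the id of the most recently inserted log record.
conn.execute(
    "CREATE TABLE last_inserted (slot INTEGER PRIMARY KEY CHECK (slot = 1), log_id INTEGER)"
)
conn.execute("INSERT INTO last_inserted (slot, log_id) VALUES (1, NULL)")
conn.execute("""
    CREATE TRIGGER remember_last AFTER INSERT ON audit_log
    BEGIN
        UPDATE last_inserted SET log_id = NEW.id WHERE slot = 1;
    END
""")
for message in ("first", "second", "third"):
    conn.execute("INSERT INTO audit_log (message) VALUES (?)", (message,))

last_id = conn.execute("SELECT log_id FROM last_inserted").fetchone()[0]
last_message = conn.execute(
    "SELECT message FROM audit_log WHERE id = ?", (last_id,)
).fetchone()[0]
print(last_id, last_message)
```

The application then only issues the INSERT; reading the latest record becomes a primary-key lookup instead of a sort over the huge table.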
Why don't you add an index on that field?
Quick search and sort is exactly what an index is for.
Updating your own 'pseudo-index' in a table amounts to re-inventing the wheel.
Besides, adding a trigger to a DB always feels very dubious (as in un-obvious) to me!