I have tried to find thoughts on this, but can't find quite what I am looking for.
In one topic on here, people seemed to agree that "almost" all of the time, an update would be much preferable to deleting a row and re-inserting it.
HOWEVER, my situation is different: which would be better, doing several hundred individual updates, or one mass delete followed by one bulk insert of those hundreds of rows?
Wouldn't the time saved by the bulk insert more than offset the extra work of the delete/insert method compared to updating?
No other table needs the ids from these rows, by the way.
I think the answer would depend on the exact update/delete query you are trying to run, and on the data. But, in general I would expect that just doing an update would be faster than deleting and re-inserting. The reason is that very similar lookup logic will have to run in either approach to target the records in question. In the case of delete/insert, you would then remove those records and bulk insert. However, in the case of update, you would have already found the records, and would just need to mutate some data.
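To make the comparison concrete, here is a rough sketch of the two approaches, assuming a hypothetical prices table keyed by id and a staging table that already holds the few hundred new values (none of these names come from the question):

-- Option A: one set-based UPDATE driven by a join
-- (usually preferable to hundreds of single-row UPDATE statements).
UPDATE prices p
JOIN staging s ON s.id = p.id
SET p.amount = s.amount;

-- Option B: the delete/re-insert approach from the question.
DELETE p FROM prices p JOIN staging s ON s.id = p.id;
INSERT INTO prices (id, amount)
SELECT id, amount FROM staging;

Either way the affected rows have to be located first; the update path just avoids rewriting rows that are already in place.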
I would like to know if there's a standard way to handle duplicates in the database without actually removing the duplicated rows. Or a specific name for what I'm trying to achieve, so I can look it up.
Why would I keep duplicates? Because I have to monitor them. I have to know that they're duplicates and are not, e.g., searchable, but at the same time I have to keep them, because I update the rows from an external source, and if I removed them, they'd come back into the database the next time I update from that source.
I have two ideas:
Have an additional boolean column "searchable"; but this feels like a partial solution that could turn out to be insufficient in the future
Have an additional column "duplicate_of". It would hold the id of the row this row is a duplicate of. It would be a foreign key into the same table, which is kind of weird, isn't it?
I know it's not a specific programming question, but I think someone must have handled a similar situation (Facebook Pages, for example, keep track of which pages are duplicates of others), and it would be great to know a proven solution.
EDIT: these are close duplicates, identified mainly by their location (lat, lng), so DISTINCT is probably not a solution here
I would create a view that exposes only the non-duplicate values. Having an additional "searchable" column sounds tedious. Your second idea is actually more feasible, and there is nothing weird about a self-referencing table.
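A rough sketch of that second idea (the table and column names are made up; the lat/lng columns come from the question):

-- Hypothetical table; duplicate_of points at the canonical row.
CREATE TABLE places (
    id           INT PRIMARY KEY AUTO_INCREMENT,
    name         VARCHAR(255),
    lat          DECIMAL(9,6),
    lng          DECIMAL(9,6),
    duplicate_of INT NULL,   -- NULL means this row is not a duplicate
    FOREIGN KEY (duplicate_of) REFERENCES places(id)
);

-- Only canonical (non-duplicate) rows are exposed for searching.
CREATE VIEW searchable_places AS
SELECT * FROM places WHERE duplicate_of IS NULL;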
The solution depends on several other factors. In particular, does the database support real deletes and updates (apart from setting the duplication information)?
You have a range of solutions. One is to place distinct values in a separate table, periodically. This works well if you have batch inserts, and no updates/deletes.
If you have a database that is being updated, then you might want to maintain a version number on the record. This lets you track it. Presumably, if it is a duplicate, there is another duplicate key inside it.
The problem with your second approach is that it can result in a tree-like structure of duplicates, where A --> B --> C and D --> C, so A and D are both duplicates of C, but this is not obvious. If you always put in the earliest value and there are no updates or deletes, then this solution is reasonable.
We are using InnoDB and have a table that will have many millions of rows. One of the columns will be a varchar(32) whose value will change fairly often. Doing updates to this varchar on tens of thousands of rows will take a long time, so we are toying with the idea of splitting this field off into its own table and then, instead of doing updates, doing a delete followed by a batch insert using LOAD DATA INFILE. It seems like this will greatly improve performance. Am I missing something, though? Is there an easier way to improve update performance? Has anybody done anything like this before?
If you can select the rows you want to update based on indices alone, this should in practice accomplish the same thing as your suggestion (and still keep a sane data organization, hence be preferable). Quite possibly this is even faster than doing it yourself.
You could create an index appropriate to the where clause of your update statement.
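For example (the column names here are just placeholders, since the actual WHERE clause wasn't shown):

-- Assuming an update along the lines of:
-- UPDATE items SET label = 'expired' WHERE status = 'pending' AND region_id = 7;
-- a composite index covering the WHERE clause lets the engine locate the rows
-- without a full table scan.
CREATE INDEX idx_items_status_region ON items (status, region_id);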
The idea of splitting it up may improve performance (I'm not sure), but only when all values change at once. When individual values change, this approach is slower than using one table.
Another precondition for being faster is that you must know the key->value mapping of the second table beforehand. If you have to look into the first table to decide how to store values in the second one, you are again slower than with one table.
So I have a checkbox form where users can select multiple values. They can then go back and select different values. Each value is stored as a row (UserID,value).
How do you do that INSERT when some rows might be duplicates of an already-existing row in the table?
Should I first delete the existing values and then INSERT the new values?
ON DUPLICATE KEY UPDATE seems tricky since I would be INSERTing multiple rows at once, so how would I define and separate just the ones that need UPDATING vs. the ones that need INSERTING?
For example, let's say a user makes his first-time selection:
INSERT INTO
Choices(UserID,value)
VALUES
('1','banana'),('1','apple'),('1','orange'),('1','cranberry'),('1','lemon')
What if the user goes back later and makes different choices which include SOME of the values in his original query which will thus cause duplicates?
How should I handle that best?
In my opinion, simply deleting the existing choices and then inserting the new ones is the best way to go. It may not be the most efficient overall, but it is simple to code and thus has a much better chance of being correct.
Otherwise it is necessary to find the intersection of the new choices and old choices. Then either delete the obsolete ones or change them to the new choices (and then insert/delete depending on whether the new set of choices is bigger or smaller than the original set). The added risk of the extra complexity does not seem worth it.
Edit As #Andrew points out in the comments, deleting the originals en masse may not be a good plan if these records happened to be "parent" records in a referential integrity definition. My thinking was that this seemed like an unlikely situation based on the OP's description. But it is definitely worth consideration.
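A minimal sketch of the delete-then-insert approach, using the Choices table from the question (the new selection shown is just an example); wrapping both statements in a transaction avoids other readers seeing an empty choice set in between:

START TRANSACTION;

DELETE FROM Choices WHERE UserID = '1';

INSERT INTO Choices (UserID, value)
VALUES ('1','banana'), ('1','orange'), ('1','mango');

COMMIT;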
It's not clear to me when you would ever need to update a record in the database in your case.
It sounds like you need to maintain a set of choices per user, which the user may on occasion change. Therefore, each time the user provides a new set of choices, any prior set of choices should be discarded. So you would delete all old records, then insert any new ones.
You might consider carrying out a comparison of the prior and new choices - either in the server or client code - in order to calculate the minimum set of deletes and/or inserts needed to reduce database writes. But that smells like premature optimisation.
Putting all that to one side: if you want a re-insert to be ignored, you should use INSERT IGNORE; existing rows will be quietly skipped and new ones will be inserted.
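For example (this assumes a unique key on (UserID, value), which the question doesn't state but which INSERT IGNORE needs in order to detect the duplicates):

-- One-time setup (assumption, not in the original schema):
-- ALTER TABLE Choices ADD UNIQUE KEY uq_user_value (UserID, value);

INSERT IGNORE INTO Choices (UserID, value)
VALUES ('1','banana'), ('1','apple'), ('1','mango');
-- 'banana' and 'apple' already exist and are silently skipped;
-- only 'mango' is inserted.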
I don't know much about MySQL, but in MS SQL 2000+ we can execute a stored proc with XML as one of its parameters. This XML would contain a list of identity-value pairs. We would open this XML as a table using OPENXML and figure out which rows need to be deleted or inserted using a left or right outer join. As of SQL 2008 (I think) there is a new MERGE statement that lets us perform delete, update and insert row operations in one statement on ONE table. This way we can take advantage of set operations in SQL instead of looping through arrays in the application code.
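A rough sketch of what that MERGE could look like in SQL Server (the variable names and the table-variable source are assumptions, not from the original post):

-- @NewChoices holds the desired end state for one user.
DECLARE @UserID INT = 1;
DECLARE @NewChoices TABLE (UserID INT, value VARCHAR(50));
INSERT INTO @NewChoices VALUES (1, 'banana'), (1, 'mango');

MERGE Choices AS target
USING @NewChoices AS source
    ON target.UserID = source.UserID AND target.value = source.value
WHEN NOT MATCHED BY TARGET THEN
    INSERT (UserID, value) VALUES (source.UserID, source.value)
WHEN NOT MATCHED BY SOURCE AND target.UserID = @UserID THEN
    DELETE;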
You can also keep your select list retrieved from the database in session and compare the "old list" to the "newly selected list" in your application code. You would need to figure out which rows need to be deleted or added. You probably don't need to worry about updates because you are probably only keeping foreign keys in this table and the descriptions are in some kind of a reference table.
There is another way in SQL 2008 that involves using user defined data-types as custom tables but I don't know much about it.
Personally, I prefer the XML route because you just send the end-state into the sp and your sp automatically figures out which rows need to be deleted or inserted.
Hope this helps.
Am I correct to assume that an UPDATE query takes more resources than an INSERT query?
I am not a database guru, but here are my two cents:
Personally, I don't think you have much choice in this regard. Even if INSERT were faster (which remains to be proven), can you convert an update into an insert?! Frankly, I don't think you can do it all the time.
During an INSERT you don't usually have a WHERE clause to identify which row to touch, but depending on the indices on that table, the operation can still have some cost.
During an UPDATE, if you do not change any column included in an index, execution can be quick, provided the WHERE clause is easy and fast enough.
Nothing is written in stone, and really I would imagine it depends on the whole database setup, indices and so on.
Anyway, found this one as a reference:
Top 84 MySQL Performance Tips
If you plan to perform large-scale processing (such as rating or billing for a cellular company), this question has a huge impact on system performance.
Switching from large-scale updates to creating new tables and indexes reduced my company's billing process from 26 hours to 1 hour!
I have tried it on 2 million records for 100,000 customers.
In the first option, I created the billing table and then, for each customer's summarized calls, updated it with the duration, price, discount and so on, a total of 10 fields.
In the second option I created 4 phases.
Each phase reads the previous table(s), creates an index (after the table insert completes) and, using "insert into ... select ...", creates the next table for the next phase.
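A sketch of what one such phase might look like (table and column names are invented for illustration):

-- Materialize the next phase's table.
CREATE TABLE billing_phase2 (
    customer_id    INT,
    total_duration INT,
    total_price    DECIMAL(12,2)
);

-- Bulk-load it from the previous phase ("insert into ... select").
INSERT INTO billing_phase2 (customer_id, total_duration, total_price)
SELECT c.customer_id,
       SUM(c.duration),
       SUM(c.duration * r.rate)
FROM   billing_phase1 c
JOIN   rates r ON r.plan_id = c.plan_id
GROUP  BY c.customer_id;

-- The index is created only after the bulk insert completes, as described above.
CREATE INDEX idx_phase2_customer ON billing_phase2 (customer_id);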
Summary
Although the second alternative requires much more disk space (all views and temporary tables are deleted at the end), there are 3 main advantages to this option:
It was 4 times faster than option 1.
If there was a problem in the middle of the process, I could restart from the point where it failed, since all the tables from the beginning of that phase were already in place. If the process fails with the first option, you need to start the whole process over again.
This made the development and QA work much faster, as they could work in parallel.
The key resource here is disk access (IOPS to be precise), and we should evaluate which option results in the minimum of that.
I agree with others that it is impossible to give a generic answer, but here are some thoughts to lead you in the right direction. Assume a simple key-value store where the key is indexed: insertion means inserting a new key, and update means updating the value of an existing key.
If that is the case (a very common one), an update would be faster than an insertion, because an update involves an indexed lookup and changing an existing value without touching the index. You can assume that is one disk read to get the data and possibly one disk write. An insertion, on the other hand, involves two disk writes: one for the index, one for the data. Another hidden cost is B-tree node splitting and new node creation, which happens in the background during insertion and leads to more disk accesses on average.
You cannot compare an INSERT and an UPDATE in general. Give us an example (with schema definition) and we will explain which one costs more and why. Also, you can compare a concrete INSERT and UPDATE by checking their plans and execution times.
Some rules of thumb, though:
if you update only one field that is not indexed, and you update only one record, and you use the rowid/primary key to find that record, then this UPDATE will cost less than
an INSERT that also affects only one row but goes into a table with many NOT NULL constrained, indexed fields, where all of those indexes have to be maintained (e.g. adding a new leaf entry); both cases are sketched below.
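To illustrate the contrast (table and column names are hypothetical):

-- The cheap case: a single row located by primary key, touching a non-indexed column.
UPDATE accounts SET note = 'reviewed' WHERE id = 42;

-- The potentially more expensive case: an insert into a table where several
-- columns are indexed, so every one of those indexes must gain a new entry.
INSERT INTO accounts (id, email, username, created_at, note)
VALUES (43, 'x@example.com', 'x', NOW(), NULL);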
It depends. A simple UPDATE that uses a primary key in the WHERE clause and updates only a single non-indexed field would likely be less costly than an INSERT on the same table. But even that depends on the database engine involved. An UPDATE that involved modifying many indexed fields, however, might be more costly than the INSERT on that table because more index key modifications would be required. An UPDATE with a poorly constructed WHERE clause that required a table scan of millions of records would certainly be more expensive than an INSERT on that table.
These statements can take many forms, but if you limit the discussion to their "basic" forms that involve a single record, then the larger portion of the cost will usually be dedicated to modifying the indexes. Each indexed field that is modified during an UPDATE would typically involve two basic operations (delete the old key and add the new key) whereas the INSERT would require one (add the new key). Of course, a clustered index would then add some other dynamics as would locking issues, transaction isolation, etc. So, ultimately, the comparison between these statements in a general sense is not really possible and would probably require benchmarking of specific statements if it actually mattered.
Typically, though, it makes sense to just use the correct statement and not worry about it since it is usually not an option to choose between an UPDATE and an INSERT.
It depends. If the update doesn't require changes to the key, it will most likely cost only about as much as a search, and will probably cost less than an insert, unless the database is organized like a heap.
This is the only thing I can state, because performance greatly depends on the database organization used.
If you use MyISAM, for example, which I suppose is organized like an ISAM, an insert should generally cost about the same in terms of database read accesses, but it will require some additional write operations.
On Sybase / SQL Server, an update which impacts a column with a read-only index is internally replaced by a delete and then an insert, so it is obviously slower than an insert. I do not know the implementation for other engines, but I think this is a common strategy, at least when indices are involved.
Now for tables without indices (or for update requests not involving any index), I suppose there are cases where the update can be faster, depending on the structure of the table.
In MySQL you can change your UPDATE into an INSERT with ON DUPLICATE KEY UPDATE:
INSERT INTO t1 (a,b,c) VALUES (1,2,3)
ON DUPLICATE KEY UPDATE c=c+1;
UPDATE t1 SET c=c+1 WHERE a=1;
Assuming column a is declared UNIQUE (or is the primary key), the first statement has the same effect as the second UPDATE whenever a row with a=1 already exists.
A lot of people here are commenting that you can't compare an insert vs. an update, but I disagree. People should understand that an update takes a lot more resources than an insert, or possibly even than deleting and re-inserting.
Now, regarding how you can even compare the two when one doesn't directly replace the other: in certain cases you make an insert and then update the table with data from another table.
For instance, I get a feed from an API which contains id1, but this table relates to another table and I would like to add table2_id. Instead of doing an update statement, which takes a lot more resources, I can resolve this in the backend code, which is faster, and do just an insert rather than an insert followed by an update. The update statement also locks the table, causing a traffic jam, so to speak.
I have a huge table that is mainly used for backup and administrative purposes. The only record that matters is the last inserted one.
Ordering by insertion time on every hit is just too slow, so I want to keep a separate table with the last inserted id.
In PHP I now insert, get last inserted id, and update the other table.
Is there a more efficient way to do this?
You could do this on the database end by using a trigger.
(Sorry about posting this as a separate answer, was a bit too long for a comment on Matti's answer.)
There is a small performance overhead associated with triggers, but if I recall correctly it's fairly negligible for normal use (depending on what you're doing with it of course). Mostly it'd only be a problem if you're performing bulk uploads (in which case you'd usually drop/disable the triggers for the duration of the task). Seems to me that the overhead here would be very minimal seeing as you're only really performing one INSERT/UPDATE on X in addition to the INSERT on Y.
Essentially, a trigger will scale a lot better compared to your current method because instead of having to perform a lookup to find the last updated record you can just perform the insert operation, then directly insert the primary key of the new record into the "last updated" table.
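A minimal MySQL sketch of such a trigger, assuming the large table is called big_table and the tracking table last_inserted holds a single row keyed by id (both names are made up):

DELIMITER //
CREATE TRIGGER trg_track_last_insert
AFTER INSERT ON big_table
FOR EACH ROW
BEGIN
    -- last_inserted(id PRIMARY KEY, big_table_id): keep exactly one row
    -- that always points at the most recently inserted record.
    INSERT INTO last_inserted (id, big_table_id)
    VALUES (1, NEW.id)
    ON DUPLICATE KEY UPDATE big_table_id = NEW.id;
END//
DELIMITER ;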
Why don't you add an index on that field?
Quick search and sort is exactly what an index is for.
Updating your own 'pseudo-index' in a table amounts to re-inventing the wheel.
Besides, adding a trigger to a DB always feels very dubious (as in un-obvious) to me!