We are using InnoDB and have a table that will have many millions of rows. One of the columns is a varchar(32) whose value changes fairly often. Updating this varchar on tens of thousands of rows takes a long time, so we are toying with the idea of splitting this field off into its own table; then, instead of doing updates, we can do a delete followed by a batch insert using LOAD DATA INFILE. It seems like this will greatly improve performance. Am I missing something, though? Is there an easier way to improve update performance? Has anybody done anything like this before?
If you can select the rows you want to update based on indexes alone, this should in practice achieve the same as your suggestion (and still keep a sane data organization, hence be preferable). Quite possibly it is even faster than doing it yourself.
You could create an index appropriate to the where clause of your update statement.
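For example, a minimal sketch (the table and column names here are invented, not taken from the question):

-- If the updates look like:
--   UPDATE items SET label = 'new value' WHERE category_id = 42;
-- an index on the column in the WHERE clause lets MySQL locate the
-- affected rows without scanning the whole table:
CREATE INDEX idx_items_category ON items (category_id);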
Splitting it up may improve performance (I'm not sure), but only when all the values change at once. When individual values change, this approach is slower than keeping one table.
Another precondition for being faster is that you must know the key-to-value mapping of the second table beforehand. If you have to look into the first table to decide how to store values in the second one, you are also slower than with one table.
Related
I have tried to find thoughts on this, but can't find quite what I am looking for.
In one topic on here, people seemed to agree that "almost" all of the time, an update would be much preferable to deleting a row and re-inserting it.
HOWEVER, my situation is deciding which would be better: doing several hundred individual updates, or one mass delete plus one bulk insert for those hundreds of rows.
Wouldn't all of the time saved from doing the bulk insert more than offset the extra work from doing the delete/insert method vs. update?
No other table needs the ids from these rows, by the way.
I think the answer would depend on the exact update/delete query you are trying to run, and on the data. But, in general I would expect that just doing an update would be faster than deleting and re-inserting. The reason is that very similar lookup logic will have to run in either approach to target the records in question. In the case of delete/insert, you would then remove those records and bulk insert. However, in the case of update, you would have already found the records, and would just need to mutate some data.
I have a big load file that I downloaded. It contains records that I will have to load into the database. Based on the size of the data, it will likely take 2 weeks or more to finish (since there is preprocessing etc.). A coworker asked me to make what she called a delta file, which checks the current database to see if the data already exists based on a certain field, and if and only if it exists do we keep that record in the load file; otherwise we discard it.
I'm confused, because to implement this I would need to run a SELECT query for every record in the load file to check whether it already exists. A SELECT would take O(n), I'm assuming, and then the insert (for a smaller data set) an additional O(1),
whereas a plain insert would just take O(1).
I'd like to 1) understand why this implementation is faster (if I don't understand things properly) and 2) hear a possible way to implement this delta file, if you can think of something smarter than what I suggested.
Thanks
Databases make indexes for columns specified in the schema. The way your data is indexed can make a massive difference in performance. Without an index, a select operation may be O(n) but with an index it may be O(1).
Insert operations must maintain the index. For large data loading operations you may be well off to disable indexing until the end so you are doing a single index update on all the data instead of many index updates on each record you insert.
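For example, a rough sketch in MySQL (the staging table name and file path are hypothetical; note that ALTER TABLE ... DISABLE KEYS only affects non-unique indexes on MyISAM tables, so for InnoDB the session flags below are the usual shortcut):

-- MyISAM: postpone non-unique index maintenance until the load finishes
ALTER TABLE staging DISABLE KEYS;
LOAD DATA INFILE '/tmp/load_file.csv' INTO TABLE staging;
ALTER TABLE staging ENABLE KEYS;

-- InnoDB: relax per-row checks for the duration of the session
SET unique_checks = 0;
SET foreign_key_checks = 0;
-- ... run the bulk inserts here ...
SET unique_checks = 1;
SET foreign_key_checks = 1;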
Some measurements I did the other day indicate that selects are faster than inserts in my situation. I came across this question because I am trying to learn if this is generally true or reflects something specific about the way I have it setup.
I'm going to create a table which will have between 1000 and 20000 rows, and it has a field whose value repeats a lot... about 60% of the rows will have this value, and roughly every 50-100 rows share the same value.
I've been concerned about efficiency lately, and I'm wondering whether it would be better to store this string on each row (it would be between 8 and 20 characters) or to create another table and link to it with a representative ID instead... so having ~1-50 rows in that table replacing about 300-5000 strings with ints?
Is this a good approach, or even necessary at all?
Yes, it's a good approach in most circumstances. It's called normalisation, and is mainly done for two reasons:
Removing repeated data
Avoiding repeating entities
I can't tell from your question what the reason would be in your case.
The difference between the two is that the first reuses values that just happen to look the same, while the second connects values that have the same meaning. The practical difference is in what should happen if a value changes, i.e. if the value changes for one record, should the value itself change so that it changes for all other records also using it, or should that record be connected to a new value so that the other records are left unchanged.
If it's for the first reason then you will save space in the database, but it will be more complicated to update records. If it's for the second reason you will not only save space, but you will also reduce the risk of inconsistency, as a value is only stored in one place.
It is a good approach to have a look-up table for the strings. That way you can build more efficient indexes on the integer values. It wouldn't be absolutely necessary, but as a good practice I would do it.
I would recommend using an int with a foreign key to a lookup table (like you describe in your second scenario). This will cause the index to be much smaller than indexing a VARCHAR so the storage required would be smaller. It should perform better, too.
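A minimal sketch of that layout (all names here are invented for illustration):

-- Lookup table holding each distinct string exactly once
CREATE TABLE labels (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(20)  NOT NULL,
  UNIQUE KEY uq_labels_name (name)
) ENGINE=InnoDB;

-- Main table stores only the small integer key
CREATE TABLE records (
  id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  label_id INT UNSIGNED NOT NULL,
  KEY idx_records_label (label_id),
  CONSTRAINT fk_records_label FOREIGN KEY (label_id) REFERENCES labels (id)
) ENGINE=InnoDB;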
Avitus is right, that it's generally a good practice to create lookups.
Think about the JOINs you will use this table in. 1000-20000 rows are not a lot for MySQL to handle. If you don't have any joins, I would not bother with the lookups; just index the column.
BUT as soon as you start joining the table with others (of the same size), that's where the performance loss comes in, which you can (most likely) compensate for by introducing lookups.
Am I correct to assume that an UPDATE query takes more resources than an INSERT query?
I am not a database guru, but here are my two cents:
Personally I don't think you have much room to maneuver here. Even if INSERT were faster (which remains to be proven), can you always convert an update into an insert?! Frankly, I don't think you can do it every time.
During an INSERT you don't usually have to use a WHERE clause to identify which row to change, but depending on the indexes on that table the operation can still have some cost.
During an UPDATE, if you do not change any column included in any index, you can get quick execution, provided the WHERE clause is easy and fast enough.
Nothing is written in stone, and it really depends on the whole database setup, the indexes and so on.
Anyway, found this one as a reference:
Top 84 MySQL Performance Tips
If you plan to perform a large processing (such as rating or billing for a cellular company), this question has a huge impact on system performance.
Performing large-scale updates vs. creating many new tables and indexes has proven to reduce my company's billing process from 26 hours to 1 hour!
I have tried it on 2 million records for 100,000 customers.
In the first option, I created the billing table and then, for every customer's call summary, updated the billing table with the duration, price, discount... a total of 10 fields.
In the second option I created 4 phases.
Each phase reads the previous table(s), creates an index (after the table's inserts have completed), and, using "INSERT INTO ... SELECT ...", builds the next table for the next phase.
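A rough sketch of what one phase might look like (the phase table names, customer_id column, and types are invented; the real billing schema is not shown in the answer):

-- Phase N: build the next table from the previous one in a single pass
CREATE TABLE billing_phase2 (
  customer_id INT NOT NULL,
  duration    INT NOT NULL,
  price       DECIMAL(10,2) NOT NULL,
  discount    DECIMAL(10,2) NOT NULL
);

INSERT INTO billing_phase2 (customer_id, duration, price, discount)
SELECT customer_id, SUM(duration), SUM(price), SUM(discount)
FROM billing_phase1
GROUP BY customer_id;

-- Create the index only after the bulk insert has completed
CREATE INDEX idx_phase2_customer ON billing_phase2 (customer_id);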
Summary
Although the second alternative requires much more disk space (all views and temporary tables are deleted at the end), there are 3 main advantages to this option:
It was 4 times faster than option 1.
In case there was a problem in the middle of the process, I could restart from the point where it failed, since all the tables built before that phase were ready. If the process fails with the first option, you need to start the whole process all over again.
This made the development and QA work much faster, as they could work in parallel.
The key resource here is disk access (IOPS, to be precise), and we should evaluate which option results in the minimum of it.
I agree with others that it is impossible to give a generic answer, but here are some thoughts to lead you in the right direction. Assume a simple key-value store in which the key is indexed: insertion means inserting a new key, and update means updating the value of an existing key.
If that is the case (a very common one), an update would be faster than an insertion, because an update involves an indexed lookup and changes an existing value without touching the index. You can assume that is one disk read to get the data and possibly one disk write. An insertion, on the other hand, involves two disk writes: one for the index, one for the data. Another hidden cost is the B-tree node splitting and new node creation that happen in the background during insertion, leading to more disk access on average.
You cannot compare an INSERT and an UPDATE in general. Give us an example (with the schema definition) and we will explain which one costs more and why. You can also compare a concrete INSERT and UPDATE by checking their plans and execution times.
Some rules of thumb, though:
if you update only one field, which is not indexed, and you only update one record and you use the rowid/primary key to find that record, then this UPDATE will cost less than
an INSERT which also affects only one row, but where that row has many NOT NULL constrained, indexed fields; all those indexes have to be maintained (e.g. adding a new leaf).
It depends. A simple UPDATE that uses a primary key in the WHERE clause and updates only a single non-indexed field would likely be less costly than an INSERT on the same table. But even that depends on the database engine involved. An UPDATE that involved modifying many indexed fields, however, might be more costly than the INSERT on that table because more index key modifications would be required. An UPDATE with a poorly constructed WHERE clause that required a table scan of millions of records would certainly be more expensive than an INSERT on that table.
These statements can take many forms, but if you limit the discussion to their "basic" forms that involve a single record, then the larger portion of the cost will usually be dedicated to modifying the indexes. Each indexed field that is modified during an UPDATE would typically involve two basic operations (delete the old key and add the new key) whereas the INSERT would require one (add the new key). Of course, a clustered index would then add some other dynamics as would locking issues, transaction isolation, etc. So, ultimately, the comparison between these statements in a general sense is not really possible and would probably require benchmarking of specific statements if it actually mattered.
Typically, though, it makes sense to just use the correct statement and not worry about it since it is usually not an option to choose between an UPDATE and an INSERT.
It depends. If the update doesn't require changing the key, it will most likely cost about the same as a search and will probably cost less than an insert, unless the database is organized as a heap.
This is the only thing I can state, because performance greatly depends on the database organization used.
If you use MyISAM, for example, which I suppose is organized like an ISAM, an insert should generally cost the same in terms of database read accesses, but it will require some additional write operations.
On Sybase / SQL Server an update which impacts a column with a read-only index is internally replaced by a delete and then an insert, so this is obviously slower than insert. I do not know the implementation for other engines but I think this is a common strategy at least when indices are involved.
Now for tables without indices ( or for update requests not involving any index ) I suppose there are cases where the update can be faster, depending on the structure of the table.
In MySQL you can turn your UPDATE into an INSERT with ON DUPLICATE KEY UPDATE. Assuming column a is a unique key and a row with a=1 already exists, the following two statements have a similar effect:
INSERT INTO t1 (a,b,c) VALUES (1,2,3)
ON DUPLICATE KEY UPDATE c=c+1;
UPDATE t1 SET c=c+1 WHERE a=1;
A lot of people here are commenting that you can't compare an INSERT vs. an UPDATE, but I disagree. People should understand that an update takes a lot more resources than an insert, or possibly even than a delete plus an insert.
As for how you can even compare the two when one doesn't directly replace the other: in certain cases you make an insert and then update the table with data from another table.
For instance, I get a feed from an API which contains id1, but this table relates to another table and I would like to add table2_id. Instead of running an UPDATE statement, which takes a lot more resources, I can resolve this in the backend, which is faster, and issue a single INSERT statement instead of an insert followed by an update. The UPDATE statement also locks the table, causing a traffic jam, so to speak.
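A rough sketch of the idea in SQL (feed_items, external_key, and the literal values are invented for illustration; id1, table2, and table2_id come from the paragraph above):

-- Instead of inserting first and patching the foreign key afterwards:
--   INSERT INTO feed_items (id1) VALUES ('abc');
--   UPDATE feed_items SET table2_id = 7 WHERE id1 = 'abc';
-- resolve table2_id up front and perform a single insert:
INSERT INTO feed_items (id1, table2_id)
SELECT 'abc', t2.id
FROM table2 AS t2
WHERE t2.external_key = 'abc';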
I have a huge table that is mainly used for backup and administrative purposes. The only record that matters is the last inserted one.
Ordering by insert time on every hit is just too slow, so I want to keep a separate table holding the last inserted id.
In PHP I currently insert, get the last inserted id, and update the other table.
Is there a more efficient way to do this?
You could do this on the database end by using a trigger.
(Sorry about posting this as a separate answer, was a bit too long for a comment on Matti's answer.)
There is a small performance overhead associated with triggers, but if I recall correctly it's fairly negligible for normal use (depending on what you're doing with it of course). Mostly it'd only be a problem if you're performing bulk uploads (in which case you'd usually drop/disable the triggers for the duration of the task). Seems to me that the overhead here would be very minimal seeing as you're only really performing one INSERT/UPDATE on X in addition to the INSERT on Y.
Essentially, a trigger will scale a lot better compared to your current method because instead of having to perform a lookup to find the last updated record you can just perform the insert operation, then directly insert the primary key of the new record into the "last updated" table.
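For illustration, a minimal MySQL trigger along those lines (big_table, last_inserted, and the column names are stand-ins for the asker's actual tables):

-- A one-row table that always holds the id of the newest row in big_table
CREATE TABLE last_inserted (
  row_key TINYINT NOT NULL PRIMARY KEY,
  last_id BIGINT  NOT NULL
);

CREATE TRIGGER trg_last_inserted
AFTER INSERT ON big_table
FOR EACH ROW
  -- REPLACE overwrites the single row keyed by row_key = 1
  REPLACE INTO last_inserted (row_key, last_id) VALUES (1, NEW.id);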
Why don't you add an index on that field?
Quick search and sort is exactly what an index is for.
Updating your own 'pseudo-index' in a table amounts to re-inventing the wheel.
Besides, adding a trigger to a DB always feels very dubious (as in un-obvious) to me!