Is SELECT faster than INSERT? - MySQL

I have a big load file that I downloaded. It contains records that I will have to load into the database. Based on the size of the data, it will likely take 2 weeks or more to finish (since there is preprocessing etc.). A coworker asked me to make what she called a delta file, which checks the current database to see whether the data already exists, based on a certain field in the database, and if and only if it exists then we will keep that record in the load file; otherwise we will discard it.
I'm confused, because to implement this I would need to do a SELECT query for every record in the load file to check whether it exists. A SELECT would take O(n), I'm assuming, and then the INSERT (for the smaller data set) an additional O(1),
whereas a plain INSERT would just take O(1).
I'd like to 1) understand why this implementation is faster (in case I'm not understanding things properly) and 2) hear a possible way to implement this delta file if you can think of something smarter than what I suggested.
Thanks

Databases build indexes for the columns specified in the schema. The way your data is indexed can make a massive difference in performance. Without an index, a select operation may be O(n), but with a B-tree index it is roughly O(log n), effectively constant for practical purposes.
Insert operations must maintain the index. For large data-loading operations you may be better off disabling indexing until the end, so that you do a single index build over all the data instead of many incremental index updates, one for each record you insert.
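For example, one way to avoid a per-record SELECT entirely (a rough sketch; the table, column, and file names here are assumptions, not your actual schema) is to bulk-load the file into a staging table and let the server do the existence check as a single set-based join:

-- Load the raw file into a staging table (structure copied from the target).
-- LOAD DATA LOCAL INFILE requires local_infile to be enabled.
CREATE TABLE staging LIKE main_table;
LOAD DATA LOCAL INFILE '/path/to/load_file.csv'
    INTO TABLE staging
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n';

-- Build the delta in one pass: keep only rows whose key already exists
-- in main_table (main_table.key_col is assumed to be indexed).
CREATE TABLE delta AS
    SELECT s.*
    FROM staging s
    JOIN main_table m ON m.key_col = s.key_col;

Each record is then checked against the index once by the join, instead of by one client round-trip per record.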
Some measurements I did the other day indicate that selects are faster than inserts in my situation. I came across this question because I am trying to learn whether this is generally true or reflects something specific about the way I have things set up.

Related

MySQL: what if there is too much data in a table?

Data is increasing in one table every day, which might lower performance. I was thinking I could create a trigger which moves table A into A1 and creates a new table A every so often, so that inserts or updates in table A stay fast. Is this the right way to preserve performance? If not, what should I do?
(For example, if I insert or update 1000 rows per second in table A, what will the performance be after 3 years?)
We are designing software for a factory. There are product lines on which PCB boards are made. We need to insert almost 60 PCB records per second for years. (1000 rows per second was an exaggeration.)
First, you are talking about several terabytes for a single table. Is your disk that big? Yes, MySQL can handle that big a table.
Will it slow down? It depends on:
The indexes. If you have 'random' indexes, the INSERTs will slow down to roughly one disk hit per insert. On a spinning HDD, that is only about 100 inserts per second. An SSD might be able to handle 1000/sec. Please provide SHOW CREATE TABLE.
Does the table have an AUTO_INCREMENT? If so, it needs to be BIGINT, not INT. But, if possible, get rid of it altogether (to save space). Again, let's see the SHOW.
"Point" queries (load one row via an index) are mostly unaffected by the size of the table. They will be about twice as slow in a trillion-row table as in a million-row table. A point query will take milliseconds or tens of milliseconds; no big deal.
A table scan will take hours or days; hopefully you are not doing that.
A billion-row scan of part of the table will take days or weeks unless you are using the PRIMARY KEY or have a "covering" index. Let's see the queries and the SHOW.
The best technique is not to store the data. Summarize it as it arrives, save the summaries, then toss the raw data. (OK, you might store the raw in a csv file just in case you need to build a new summary table or fix a bug in an existing one.)
Having a few summary tables instead of the raw data would shrink the data to under 1TB and allow the relevant queries to run 10 times as fast. (OK, point queries would be only slightly faster.) A sketch of such a summary table appears below, after the partitioning notes.
PARTITIONing (or otherwise splitting up the table)? It depends. Let's see the queries and the SHOW. In many situations, PARTITIONing does not speed up anything.
Will you be deleting or modifying existing rows? I hope not. That adds more dimensions of problems. If, on the other hand, you need to purge 'old' data, then that is an excellent use for PARTITIONing. For 3 years' worth of data, I would PARTITION BY RANGE(TO_DAYS(..)) and have monthly partitions. Then a monthly DROP PARTITION would be very fast.
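A minimal sketch of the summary-table and monthly-partitioning ideas above (the table, column, and partition names are invented for illustration, not taken from the question):

-- Raw measurements, partitioned by month so old data can be dropped cheaply.
CREATE TABLE pcb_log (
    board_id   BIGINT UNSIGNED NOT NULL,
    line_id    SMALLINT UNSIGNED NOT NULL,
    created_at DATETIME NOT NULL,
    status     TINYINT NOT NULL,
    PRIMARY KEY (board_id, created_at)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p202401 VALUES LESS THAN (TO_DAYS('2024-02-01')),
    PARTITION p202402 VALUES LESS THAN (TO_DAYS('2024-03-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);

-- Purging a month is a quick metadata operation, not a huge DELETE.
ALTER TABLE pcb_log DROP PARTITION p202401;

-- A small summary table maintained as the data arrives.
CREATE TABLE pcb_daily_summary (
    line_id  SMALLINT UNSIGNED NOT NULL,
    log_date DATE NOT NULL,
    boards   INT UNSIGNED NOT NULL,
    PRIMARY KEY (line_id, log_date)
);

INSERT INTO pcb_daily_summary (line_id, log_date, boards)
VALUES (3, CURRENT_DATE, 1)
ON DUPLICATE KEY UPDATE boards = boards + 1;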
Very large amounts of data may decrease server performance, so here is one way to handle it:
1) Create another table to store archived (old) data, using the ARCHIVE storage engine (https://dev.mysql.com/doc/refman/8.0/en/archive-storage-engine.html).
2) Create a MySQL event/scheduler job to move older records to the archive table; schedule it for a time slot when the server is mostly idle (a sketch of steps 1 and 2 follows after this list).
3) After moving the older records to the archive table, re-index the original table.
This will serve the purpose of keeping performance up.
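A minimal sketch of steps 1) and 2) (the table and column names and the 90-day cutoff are assumptions; the event scheduler must also be enabled with SET GLOBAL event_scheduler = ON):

-- 1) Archive table using the ARCHIVE storage engine.
CREATE TABLE orders_archive (
    id         BIGINT NOT NULL,
    created_at DATETIME NOT NULL,
    payload    TEXT
) ENGINE=ARCHIVE;

-- 2) Nightly event that moves old rows out of the hot table.
DELIMITER //
CREATE EVENT archive_old_orders
    ON SCHEDULE EVERY 1 DAY
    STARTS CURRENT_TIMESTAMP + INTERVAL 1 DAY
DO
BEGIN
    INSERT INTO orders_archive
        SELECT id, created_at, payload
        FROM orders
        WHERE created_at < NOW() - INTERVAL 90 DAY;

    DELETE FROM orders
        WHERE created_at < NOW() - INTERVAL 90 DAY;
END //
DELIMITER ;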
It is unlikely that tables like these perform sufficiently poorly that doing a table copy every once in a while is an overall net gain. And anyway, what would the new table have that the old one did not that would improve performance?
The key to having tables perform efficiently is intelligent table design and management of indexes. That is how zillion row tables are effective in geospatial work, library catalogs, astronomy, and how internet search engines find useful data, etc.
Each index you define does add overhead in MySQL, especially at row-insert time. Assuming there are more reads than inserts, this is an advantage, because most queries are rapidly completed thanks to a suitable index.
Indexes are best defined with a thorough understanding of the queries made against the table—both in quality and quantity. And, if there is any tendency for the nature of the queries to trend over months or years, then the indexes would need additions, modifications, or—yes—even deletions.
It seems to me there is something inherently wrong with the way you are using MySQL to begin with.
A database system is supposed to manage the data your application requires in order to work. If you think flushing the table every so often is acceptable, then that doesn't seem to be the case here.
Perhaps you are better off just using log files. Split them by date, and delete old ones if and when you decide they are no longer relevant or you need the disk space. It's even safer that way from a recovery perspective.
If you need a better suggestion, then improve your question to include exactly what you are trying to accomplish so we can help you with it.

Transforming one column into a new column in a minimal number of queries

I have a table on a server with fields docid (bigint(20)), msgtext (longtext) and others I don't care about.
What I want to do is run a function on the msgtext (which uses some regular expressions to clean msgtext) and then add the result back into the database in the newly created column cleanMsgtext. My goal is to optimize this, and ideally minimize my own memory requirements in the process.
To minimize the memory requirements, the easiest thing would be to do this one row at a time: pull down a row, run the function, then push back the result. Of course that's a little extreme, so it makes more sense for me to pull down the rows, say, 10,000 at a time, to minimize the number of select queries. I want to try to minimize the update queries as well. My select query is something like
select docid, msgtext
from db
where docid='1' or docid='2' or ... or docid='10000'
which I can easily do programmatically by pulling down the docids ahead of time or whatever (though maybe this is also stupid?). This alone causes an enormous speedup, which I'm happy with.
What I wonder is whether there is an equivalent way of doing this for updating the table. Can I just concatenate update statements in a similar fashion to create one query from multiple queries? I've read that this is possible if you pull down the entire row each time, but for me the docid is a unique identifier, so that seems like it shouldn't be necessary.
So in conclusion, I would like to select batches of size N, do my work on the data locally, then update again in batches of size N. Is this possible?
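For what it's worth, one form such a batched update could take (a sketch only; whether it fits your table depends on details not shown here) is a single UPDATE with a CASE expression keyed on docid, generated programmatically for each batch:

UPDATE db
SET cleanMsgtext = CASE docid
    WHEN 1 THEN 'cleaned text for doc 1'
    WHEN 2 THEN 'cleaned text for doc 2'
    -- one WHEN arm per row in the batch
    WHEN 10000 THEN 'cleaned text for doc 10000'
    END
WHERE docid IN (1, 2, 10000);

The WHERE clause restricts the statement to the batch, and each row picks up its own value from the CASE arm matching its docid.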

What's wrong with using a wildcard in your mySQL query? [duplicate]

This question already has answers here:
Which is faster/best? SELECT * or SELECT column1, colum2, column3, etc
(49 answers)
Closed 9 years ago.
Basically what's the difference in terms of security and speed in these 2 queries?
SELECT * FROM `myTable`
and
SELECT `id`, `name`, `location`, `place` etc... FROM `myTable`
Would using * add overhead to my query and make it perform slower than listing the columns explicitly?
There won't be much appreciable difference in performance if you also select all columns individually.
The idea is to select only the data you require and no more, which can improve performance if there are a lot of unneeded columns in your query, for example when you join several tables.
Of course, on the other side of the coin, using * makes life easier when you make changes to the table.
Security-wise, the less you select, the less potentially sensitive data can be inadvertently dumped to the user's browser. Imagine if * included the column social_security_number and somewhere in your debug code it gets printed out as an HTML comment.
Performance-wise, in many cases your database is on another server, so requesting the entire row when you only need a small part of it means a lot more data going over the network.
There is no single, simple answer, and your literal question cannot fully be answered without more detail of the specific table structure, but I'm going with the assumption that you aren't actually asking about a specific query against a specific table, but rather about selecting columns explicitly versus using *.
SELECT * is always wasteful of something unless you are actually going to use every column that exists in the rows you're reading... maybe network bandwidth, or CPU resources, or disk I/O, or a combination, but something is being unnecessarily used, though that "something" may be in very small and imperceptible quantities ... or it may not ... but it can add up over time.
The two big examples that come to mind where SELECT * can be a performance killer are cases like...
...tables with long VARCHAR and BLOB columns:
Columns such as BLOB and VARCHAR that are too long to fit on a B-tree page are stored on separately allocated disk pages called overflow pages. We call such columns off-page columns. The values of these columns are stored in singly-linked lists of overflow pages, and each such column has its own list of one or more overflow pages
— http://dev.mysql.com/doc/refman/5.6/en/innodb-row-format-overview.html
So if * includes columns that weren't stored on-page with the rest of the row data, you just took an I/O hit and/or wasted space in your buffer pool with accesses that could have been avoided had you selected only what you needed (the sketch after the next example illustrates this).
...also cases where SELECT * prevents the query from using a covering index:
If the index is a covering index for the queries and can be used to satisfy all data required from the table, only the index tree is scanned. In this case, the Extra column says Using index. An index-only scan usually is faster than ALL because the size of the index usually is smaller than the table data.
— http://dev.mysql.com/doc/refman/5.6/en/explain-output.html
When one or more columns are indexed, copies of the column data are stored, sorted, in the index tree, which also includes the primary key for finding the rest of the row data. When selecting from a table, if all of the columns you are selecting can be found within a single index, the optimizer will automatically choose to return the data to you by reading it directly from the index, instead of going to the time and effort of reading in all of the row data... and this, in some cases, makes a very significant difference in the performance of a query, because it can mean substantially smaller resource usage.
If EXPLAIN SELECT does not reveal the exact same query plan when selecting the individual columns you need compared with the plan used when selecting *, then you are looking at some fairly hard evidence that you are putting the server through unnecessary work by selecting things you aren't going to use.
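To make both cases concrete, here is a small sketch (the table, column, and index names are invented for illustration):

-- bio is a LONGTEXT that may be stored off-page, on overflow pages.
CREATE TABLE users (
    id    INT PRIMARY KEY,
    email VARCHAR(255) NOT NULL,
    bio   LONGTEXT,
    KEY idx_email (email)
);

-- Leaves the off-page bio column alone and can be satisfied from idx_email
-- (plus the primary key stored with it); EXPLAIN typically shows "Using index".
EXPLAIN SELECT id, email FROM users WHERE email LIKE 'a%';

-- Forces a lookup of the full row, including any overflow pages holding bio,
-- so the covering-index read is no longer possible.
EXPLAIN SELECT * FROM users WHERE email LIKE 'a%';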
In additional cases, such as with the information_schema tables, the columns you select can make a particularly dramatic and obvious difference in performance. The information_schema tables are not actually tables -- they're server internal structures exposed via the SQL interface... and the columns you select can significantly change the performance of the query because the server has to do more work to calculate the values of some columns, compared to others. A similar situation is true of FEDERATED tables, which actually fetch data from a remote MySQL server to make a foreign table appear logically to be local. The columns you select are actually transferred across the network between servers.
Explicitly selecting the columns you need can also lead to fewer sneaky bugs. If a column you were using in code is later dropped from a table, the place in your code's data structure -- in some languages -- is going to contain an undefined value, which in many languages is the same thing you would see if the column still existed but was null... so the code thinks "okay, that's null, so..." and a logical error follows. Had you explicitly selected the columns you wanted, subsequent executions of the query would throw a hard error instead of quietly misbehaving.
MySQL's C-client API, which some other client libraries are built on, supports two modes of fetching data, one of which is mysql_store_result, which buffers the data from the server on the client side before the application actually reads it into its internal structures... so as you are "reading from the server" you may have already implicitly allocated a lot of memory on the client side to store that incoming result-set even when you think you're fetching a row at a time. Selecting unnecessary columns means even more memory needed.
SELECT COUNT(*) is an exception. The COUNT() function counts the number of non-null values seen, and * merely means "count the rows"... it doesn't examine column data, so if you want a star there, go for it.
As a favor to your future self, unless you want to go back later and rewrite all of those queries when you're trying to get more performance out of your server, you should bite the bullet and do the extra typing, now.
As a bonus, when other people see your code, they won't accuse you of laziness or inexperience.

Execution time of insert and retrieval operations in MySQL

Which will take more execution time, an insert operation or a select operation, if both are single queries affecting only one row?
For example:
insert into example values('id','name','email')
or
select * from example where id='id';
For all benchmarking, you need to keep in mind:
It's not usually a simple matter, there are a large number of factors that can affect the figures.
Some of these factors are the number of indexes, historical patterns of read/write, database tuning, disk layout, fragmentation and so forth.
It rarely matters for small things like a one-row operation, provided you have an intelligent setup (correct indexes and so on).
The best way to tell, for a given setup, is to test.
In any case, this sort of question usually arises when you want to choose the fastest of two functionally identical choices.
In this case, there is zero crossover in functionality so I'm not sure what you will gain with this answer. If you want to insert information, use insert. If you want to extract it, use select.
It's not like you can use select (no matter how fast it may be) to insert data into your database (well, other than as part of insert into ... select ...).

UPDATE vs INSERT performance

Am I correct to assume that an UPDATE query takes more resources than an INSERT query?
I am not a database guru, but here are my two cents:
Personally I don't think you have much choice in this regard: even if INSERT were faster (which remains to be proven), can you convert an update into an insert?! Frankly, I don't think you can do it every time.
During an INSERT you don't usually have to use a WHERE clause to identify which row to change, but depending on the indexes on that table the operation can still have some cost.
During an UPDATE, if you do not change any column included in any index, execution can be quick, provided the WHERE clause is easy and fast enough.
Nothing is written in stone, and really I would imagine it depends on the whole database setup, the indexes and so on.
Anyway, found this one as a reference:
Top 84 MySQL Performance Tips
If you plan to perform large-scale processing (such as rating or billing for a cellular company), this question has a huge impact on system performance.
Performing large-scale updates versus creating many new tables and indexes reduced my company's billing process from 26 hours to 1 hour!
I tried it on 2 million records for 100,000 customers.
I first created the billing table, and then for every customer summary call I updated the billing table with the duration, price, discount... a total of 10 fields.
In the second option I created 4 phases.
Each phase reads the previous table(s), creates an index (after the table's inserts have completed), and uses "insert into ... select ..." to create the next table for the next phase.
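A minimal sketch of what one such phase might look like (the table and column names here are invented, not the actual billing schema):

-- Phase N: build the next table from the previous one, and index it
-- only after the bulk insert has finished.
CREATE TABLE billing_phase2 AS
    SELECT customer_id,
           SUM(duration) AS total_duration,
           SUM(price)    AS total_price
    FROM billing_phase1
    GROUP BY customer_id;

CREATE INDEX idx_phase2_customer ON billing_phase2 (customer_id);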
Summary
Although the second alternative requires much more disk space (all views and temporary tables are deleted at the end), there are 3 main advantages to this option:
It was 4 times faster than option 1.
If there was a problem in the middle of the process, I could restart it from the point where it failed, as all the tables from the beginning of that phase were ready. If the process fails with the first option, you need to start the whole process all over again.
This made the development and QA work much faster, as they could work in parallel.
The key resource here is disk access (IOPS, to be precise), and we should evaluate which of the two results in the minimum of that.
I agree with others that it is impossible to give a generic answer, but here are some thoughts to lead you in the right direction. Assume a simple key-value store where the key is indexed: insertion means inserting a new key, and update means updating the value of an existing key.
If that is the case (a very common one), an update would be faster than an insertion, because the update involves an indexed lookup and a change to an existing value without touching the index. You can assume that is one disk read to get the data and possibly one disk write. An insertion, on the other hand, involves two disk writes: one for the index and one for the data. Another hidden cost is B-tree node splitting and new node creation, which happens in the background during insertion and leads to more disk accesses on average.
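In SQL terms, the two operations being compared would be something like this (kv is a hypothetical key-value table with k as its primary key):

CREATE TABLE kv (
    k BIGINT PRIMARY KEY,
    v VARCHAR(255)
);

UPDATE kv SET v = 'new value' WHERE k = 42;  -- indexed lookup, value changed in place
INSERT INTO kv (k, v) VALUES (43, 'value');  -- both the index entry and the row must be written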
You cannot compare an INSERT and an UPDATE in general. Give us an example (with the schema definition) and we will explain which one costs more and why. Also, you can compare a concrete INSERT and UPDATE by checking their plans and execution times.
Some rules of thumb, though:
if you update only one field, which is not indexed, you update only one record, and you use the rowid/primary key to find that record, then this UPDATE will cost less than
an INSERT that also affects only one row, where the row has many NOT NULL constrained, indexed fields, because all those indexes have to be maintained (e.g. by adding a new leaf). The sketch below illustrates both cases.
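To make those two rules of thumb concrete (a sketch; the table is invented for illustration):

CREATE TABLE person (
    id       INT PRIMARY KEY,
    nickname VARCHAR(50),               -- not indexed
    email    VARCHAR(255) NOT NULL,
    country  CHAR(2) NOT NULL,
    KEY idx_email (email),
    KEY idx_country (country)
);

-- Cheap: finds the row by primary key and touches no secondary index.
UPDATE person SET nickname = 'Bob' WHERE id = 7;

-- Costlier: one new row plus a new entry in every secondary index.
INSERT INTO person (id, nickname, email, country)
VALUES (8, 'Ann', 'ann@example.com', 'SE');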
It depends. A simple UPDATE that uses a primary key in the WHERE clause and updates only a single non-indexed field would likely be less costly than an INSERT on the same table. But even that depends on the database engine involved. An UPDATE that involved modifying many indexed fields, however, might be more costly than the INSERT on that table because more index key modifications would be required. An UPDATE with a poorly constructed WHERE clause that required a table scan of millions of records would certainly be more expensive than an INSERT on that table.
These statements can take many forms, but if you limit the discussion to their "basic" forms that involve a single record, then the larger portion of the cost will usually be dedicated to modifying the indexes. Each indexed field that is modified during an UPDATE would typically involve two basic operations (delete the old key and add the new key) whereas the INSERT would require one (add the new key). Of course, a clustered index would then add some other dynamics as would locking issues, transaction isolation, etc. So, ultimately, the comparison between these statements in a general sense is not really possible and would probably require benchmarking of specific statements if it actually mattered.
Typically, though, it makes sense to just use the correct statement and not worry about it since it is usually not an option to choose between an UPDATE and an INSERT.
It depends. If the update doesn't require changes to the key, it will most likely only cost about as much as a search, and then it will probably cost less than an insert, unless the database is organized as a heap.
This is the only thing I can state, because performance greatly depends on the database organization used.
If, for example, you use MyISAM, which I suppose is organized like ISAM, an insert should generally cost the same in terms of database read accesses but will require some additional write operations.
On Sybase / SQL Server an update which impacts a column with a read-only index is internally replaced by a delete and then an insert, so this is obviously slower than insert. I do not know the implementation for other engines but I think this is a common strategy at least when indices are involved.
Now for tables without indices ( or for update requests not involving any index ) I suppose there are cases where the update can be faster, depending on the structure of the table.
In MySQL you can turn your UPDATE into an INSERT with ON DUPLICATE KEY UPDATE. If column a is a unique or primary key and a row with a=1 already exists, the two statements below have the same effect:
INSERT INTO t1 (a,b,c) VALUES (1,2,3)
ON DUPLICATE KEY UPDATE c=c+1;
UPDATE t1 SET c=c+1 WHERE a=1;
A lot of people here are commenting that you can't compare an insert with an update, but I disagree. People should understand that an update can take a lot more resources than an insert, or possibly even than a delete followed by an insert.
As for how you can even compare the two when one doesn't directly replace the other: in certain cases you do an insert and then update the table with data from another table.
For instance, I get a feed from an API which contains id1, but this table relates to another table and I would like to add table2_id. Instead of doing an update statement, which takes a lot more resources, I can resolve this in the back end, which is faster, and do just an insert statement instead of an insert followed by an update. The update statement also locks the table, causing a traffic jam, so to speak.