UPDATE vs COUNT vs SELECT performance - mysql

Is this statement true or false?
The performance of these queries
SELECT * FROM table;
UPDATE table SET field = 1;
SELECT COUNT(*) FROM table;
is identical.
Or is there ever a case in which the performance of one will greatly differ from the other?
Update:
I'm more interested in whether there's a large difference between the SELECT and the UPDATE. You can ignore the COUNT(*) if you want.
Assume the SELECT performs a full table scan. The UPDATE will also update every row in the table.
Assume the UPDATE changes only one field, though it will update all rows (it's an indexed field).
I know that they'll take different amounts of time and that they do different things. What I want to know is whether the difference will be significant. E.g., if the update takes 5 times longer than the select, that's significant. Use this as the threshold. There's no need to be precise; an approximation is enough.

There are different resource types involved:
disk I/O (this is the most costly part of every DBMS)
buffer pressure: fetching a row will cause fetching a page from disk, which will need buffer memory to be stored in
work/scratch memory for intermediate tables, structures and aggregates.
"terminal" I/O to the front-end process.
cost of locking, serialisation and versioning and journaling
CPU cost: this is negligible in most cases (compared to disk I/O)
The UPDATE query in the question is the hardest: it will cause all disk pages for the table to be fetched, put into buffers, altered into new buffers and written back to disk. In normal circumstances, it will also cause other processes to be locked out, with contention and even more buffer pressure as a result.
The SELECT * query needs all the pages, too; and it needs to convert/format them all into frontend-format and send them back to the frontend.
The SELECT COUNT(*) is the cheapest, on all resources. In the worst case all the disk pages have to be fetched. If an index is present, less disk I/O and fewer buffers are needed. The CPU cost is still negligible (IMHO) and the "terminal" output is marginal.
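For anyone who wants to check this on their own data rather than take it on faith, here is a minimal timing sketch. It is only an illustration under loud assumptions: it uses Python's built-in sqlite3 with an in-memory database as a stand-in, so disk I/O, locking and network transfer (the costs listed above) are mostly absent, and the table t, its columns and the row count are made up. Against a real MySQL server the ratios can look very different.

```python
import sqlite3
import time


def time_statement(conn, sql):
    """Run one statement to completion and return the elapsed seconds."""
    start = time.perf_counter()
    cur = conn.execute(sql)
    cur.fetchall()   # drain the result set so SELECT * pays for returning rows
    conn.commit()    # make the UPDATE pay for finishing its write
    return time.perf_counter() - start


def build_table(conn, n_rows):
    """Create a hypothetical table t with an indexed column 'field'."""
    conn.execute(
        "CREATE TABLE t (id INTEGER PRIMARY KEY, field INTEGER, payload TEXT)"
    )
    conn.execute("CREATE INDEX t_field ON t (field)")
    conn.executemany(
        "INSERT INTO t (field, payload) VALUES (?, ?)",
        ((0, "x" * 100) for _ in range(n_rows)),
    )
    conn.commit()


def compare(n_rows=50_000):
    """Time the three statements from the question on the same table."""
    conn = sqlite3.connect(":memory:")
    build_table(conn, n_rows)
    timings = {
        "select_all": time_statement(conn, "SELECT * FROM t"),
        "update_all": time_statement(conn, "UPDATE t SET field = 1"),
        "count_all": time_statement(conn, "SELECT COUNT(*) FROM t"),
    }
    conn.close()
    return timings
```

Calling compare() returns a dict of elapsed seconds per statement. Run it several times before drawing conclusions, since a single measurement is noisy.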

When you say "performance", do you mean "how long it takes them to execute"?
One of them is returning all data in all rows.
One of them (if you remove the "FROM") is writing data to the rows.
One is counting rows and returning none of the data in the rows.
All three of those queries are doing entirely different things. Therefore, it is reasonable to conclude that all three of them will take different amounts of time to complete.
Most importantly, why are you asking this question? What are you trying to solve? I have a bad feeling that you're going down a wrong path by asking this.

I have a large (granted, indexed) table here at work, and this is what I found:
select * from X (limited to the first 100,000 records): 12.5 seconds
select count(*) from X (the count was in the millions): 15.57 seconds
An update on an indexed table is very fast (less than a second)

The SELECT and UPDATE should be about the same (but they could easily vary, this depends on the database). COUNT(*) is cached in many databases, at some level, so that query could easily be O(1). Of course a lazy implementation of UPDATE could also be O(1), but I don't know of anyone doing that currently.
tl;dr: "False" or "it depends".

All three queries do vastly different things.
They each have their own performance characteristics and are not directly comparable.
Can you clarify what you are attempting to investigate?

Related

Surprising timing stats for sql queries

I have two queries whose timing parameters I want to analyze.
The first query is taking much longer than the second one, while in my opinion it should be the other way round. Any explanations?
First query:
select mrn
from EncounterInformation
limit 20;
Second query:
select enc.mrn, fl.fileallocationid
from EncounterInformation as enc inner join
FileAllocation as fl
on enc.encounterIndexId = fl.encounterid
limit 20;
The first query runs in 0.760 seconds on MySQL, while the second one, surprisingly, runs in 0.509 seconds.
There are many reasons why measured performance between two queries might be different:
The execution plans for the queries (the dominant factor)
The size of the data being returned (perhaps mrn is a string that is really long for the result set in the first query but not the second)
Other database activity, that locks tables and indexes
Other server activity
Data and index caches that are pre-loaded -- either in the database itself or in the underlying OS components
Your observation is correct. The first should be faster than the second. More important, though, is the observation that this timing simply does not make sense for your simple query:
The first query runs in 0.760 seconds
select mrn
from EncounterInformation
limit 20;
The work done for this would typically be to load one data page (or maybe a handful). That would only consistently take 0.760 seconds if:
You had really slow data storage (think "carrier pigeons").
EncounterInformation is a view and not a table.
You don't understand timings.
If the difference is between 0.760 milliseconds and 0.509 milliseconds, then the difference is really small and likely due to other issues -- warm caches, other activity on the server, other database activity.
It is also possible that you are measuring elapsed time and not database time, so network congestion could potentially be an issue.
If you are querying views, all bets are off without knowing what the views are. In fact, if you care about performance you should be including the execution plan in the question.
I can't explain the difference. What I can say is that your observation is reasonable, but your question lacks a lot of information, which suggests you need to learn more about how to interpret timings. I would suggest that you start by learning about EXPLAIN.
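One way to take warm caches and other one-off effects out of the comparison is to run each query several times and look at the first run separately from the rest. The sketch below does that. It is an assumption-heavy illustration: it uses Python's sqlite3 in memory instead of MySQL, the table and column names come from the question, but the rows are invented and the helper names (timed_runs, summarize, demo) are hypothetical. It measures elapsed time in the client, which is exactly the "elapsed time, not database time" caveat mentioned above.

```python
import sqlite3
import statistics
import time


def timed_runs(conn, sql, repeats=5):
    """Return a list of elapsed seconds, one per execution of sql."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Report the first (possibly cold) run apart from the later runs."""
    return {"first": samples[0], "median_rest": statistics.median(samples[1:])}


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE EncounterInformation "
        "(encounterIndexId INTEGER PRIMARY KEY, mrn TEXT)"
    )
    conn.execute(
        "CREATE TABLE FileAllocation "
        "(fileallocationid INTEGER PRIMARY KEY, encounterid INTEGER)"
    )
    # Invented sample rows, only there so the queries have something to read.
    conn.executemany(
        "INSERT INTO EncounterInformation VALUES (?, ?)",
        ((i, f"MRN{i}") for i in range(1, 1001)),
    )
    conn.executemany(
        "INSERT INTO FileAllocation VALUES (?, ?)",
        ((i, i) for i in range(1, 1001)),
    )
    q1 = "SELECT mrn FROM EncounterInformation LIMIT 20"
    q2 = (
        "SELECT enc.mrn, fl.fileallocationid "
        "FROM EncounterInformation AS enc "
        "INNER JOIN FileAllocation AS fl "
        "ON enc.encounterIndexId = fl.encounterid LIMIT 20"
    )
    report = {
        "single_table": summarize(timed_runs(conn, q1)),
        "join": summarize(timed_runs(conn, q2)),
    }
    conn.close()
    return report
```

If the first run is much slower than the median of the rest, caching is probably a bigger factor than the query shape.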

MySQL query speed or rows read

Sorry for lots of useless text. Most important stuff is told on last 3 paragraphs :D
Recently we had a MySQL problem on one of our client's servers. Out of the blue, something started sending the CPU usage of the mysql process sky-high. This problem led us to finding and optimizing bad queries, and here is the problem.
I thought that optimization meant speeding up queries (the total time needed for a query to execute). But after I had optimized several queries for speed, my colleague started complaining that some queries read too many rows, even all rows in the table (as shown by EXPLAIN).
After rewriting a query I noticed that if I want the query to read fewer rows, its speed suffers; if the query is written for speed, more rows are read.
And that didn't make sense to me: fewer rows read, but a longer execution time.
That made me wonder what should be done. Of course it would be perfect to have a fast query that reads the fewest rows. But since that doesn't seem to be possible for me, I'm searching for some answers. Which approach should I take: speed or fewer rows read? What are the pros and cons when a query is fast but reads more rows, and when it reads fewer rows but speed suffers? What happens on the server in each case?
After googling, all I could find were articles and discussions about how to improve speed, but none covered the different cases I mentioned above.
I'm looking forward to seeing even personal choices, with some reasoning of course.
Links that could point me in the right direction are welcome too.
I think your problem depends on how you are limiting the number of rows read. If you read fewer rows by adding more WHERE conditions that MySQL needs to evaluate, then yes, performance will take a hit.
I would look at perhaps indexing some of your columns that make your search more complex. Simple data types are faster to lookup than complex ones. See if you are searching toward indexed columns.
Without more data, I can give you some hints:
Be sure your tables are properly indexed. Create the appropriate indexes for each of your tables. Also drop the indexes that are not needed.
Decide the best approach for each query. For example, if you use group by only to deduplicate rows, you are wasting resources; it is better to use select distinct (on an indexed field).
"Divide and conquer". Can you split your process into two, three or more intermediate steps? If the answer is "yes", then: can you create temporary tables for some of these steps? I've split processes using temp tables, and they are very useful for speeding things up.
The count of rows read reported by EXPLAIN is an estimate anyway -- don't take it as a literal value. Notice that if you run EXPLAIN on the same query multiple times, the number of rows read changes each time. This estimate can even be totally inaccurate, as there have been bugs in EXPLAIN from time to time.
Another way to measure query performance is SHOW SESSION STATUS LIKE 'Handler%' as you test the query. This will tell you accurate counts of how many times the SQL layer made requests for individual rows to the storage engine layer. For examples, see my presentation, SQL Query Patterns, Optimized.
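To make the Handler approach concrete: take one snapshot of the counters before the query, one after, and subtract. The sketch below only does the subtraction, on hard-coded snapshots, so that it runs without a server. The numbers are hypothetical, and the helper names parse_status and handler_delta are made up for illustration. In practice the rows would come from running SHOW SESSION STATUS LIKE 'Handler%' through whatever MySQL client library you use.

```python
def parse_status(rows):
    """Turn (Variable_name, Value) rows into a dict of integer counters."""
    return {name: int(value) for name, value in rows}


def handler_delta(before, after):
    """Return only the counters that changed between two snapshots."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] != before.get(name, 0)
    }


# Hypothetical snapshots; real values come from
# SHOW SESSION STATUS LIKE 'Handler%' run before and after the query under test.
before = parse_status([
    ("Handler_read_first", "0"),
    ("Handler_read_key", "2"),
    ("Handler_read_rnd_next", "0"),
])
after = parse_status([
    ("Handler_read_first", "1"),
    ("Handler_read_key", "3"),
    ("Handler_read_rnd_next", "100001"),
])

delta = handler_delta(before, after)
for name, increase in sorted(delta.items()):
    print(f"{name}: +{increase}")
```

A large increase in Handler_read_rnd_next usually points at a table scan, while Handler_read_key counts index lookups, so the delta gives a more literal picture than the EXPLAIN estimate.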
There's also an issue of whether the rows requested were already in the buffer pool (I'm assuming you use InnoDB), or did the query have to read them from disk, incurring I/O operations. A small number of rows read from disk can be orders of magnitude slower than a large number of rows read from RAM. This doesn't necessarily account for your case, but it points out that such a scenario can occur, and "rows read" doesn't tell you if the query caused I/O or not. There might even be multiple I/O operations for a single row, because of InnoDB's multi-versioning.
Insight into the difference between logical row request vs. physical I/O reads is harder to get. In Percona Server, enhancements to the slow query log include the count of InnoDB I/O operations per query.

COUNT(*) WHERE vs. SELECT(*) WHERE performance

I am building a forum and I am trying to count all of the posts submitted by each user. Should I use COUNT(*) WHERE user_id = $user_id, or would it be faster if I kept a record of how many posts each user has each time he made a post and used a SELECT query to find it?
How much of a performance difference would this make? Would there be any difference between using InnoDB and MyISAM storage engines for this?
If you keep a record of how many posts a user has made, it will definitely be faster.
If you have an index on the user field of the posts table, you will also get decent query speeds. But it will hurt your database once your posts table is big enough. If you are planning to scale, then I would definitely recommend keeping a record of each user's post count in a dedicated field.
Storing precalculated values is a common and simple, but very efficient, sort of optimization.
So just add a column with the number of posts the user has made and maintain it with triggers or in your application.
The performance difference is:
With COUNT(*) you will always have an index lookup + counting of the results.
With an additional field you'll have an index lookup + returning a number (the answer is already there).
And there will be no significant difference between MyISAM and InnoDB in this case.
Store the post count. It seems that this is a scalability question, regardless of the storage engine. Would you recalculate the count each time the user submitted a post, or would you run a job to take care of this load somewhere outside of the webserver sphere? What is your post volume? What kind of load can your server(s) handle? I really don't think the storage engine will be the point of failure. I say store the value.
If you have the proper index on user_id, then COUNT(user_id) is trivial.
It's also the correct approach, semantically.
This is really one of those 'trade-off' questions.
Realistically, if your 'Posts' table has an index on the 'UserID' column and you truly only want to return the number of posts per user, then a query based on this column should perform perfectly well.
If you had another table, 'UserPosts' for example, yes, it would be quicker to query that table, but the real question would be: is your 'Posts' table really so large that you can't just query it for this count? The trade-off between the two approaches is obviously this:
1) with a separate audit table, there is an overhead when adding or updating a post
2) without a separate audit table, there is an overhead in querying the table directly
My gut instinct is always to design a system to record the data in a sensibly normalised fashion. I NEVER make tables based on the fact that it might be quicker to GET some data for reporting purposes. I would only create them if the need arose and it was essential to incorporate them.
At the end of the day, I think that unless your 'posts' table is ridiculously large (i.e. more than a few million records), there should be no problem in querying it for a per-user count, presuming it is indexed correctly, i.e. with an index on the 'UserID' column.
If you're using this information purely for display purposes (i.e. user jonny has posted 73 times), then it's easy enough to get the info out from the DB once, cache it, and then update it (the cache), when or if a change detection occurs.
Performance on post, or performance on count? From a data purist's perspective, a recorded count is not the same as an actual count. You can watch the front door of an auditorium, adding the people who come in and subtracting those who leave, but what if some sneak in through the back door? What if you bulk delete a problem topic? If you record the count, then each post is slowed down to calculate and record it. For me data integrity is everything and I will COUNT(*) every time. I just did a test on a table with 31 million rows: a COUNT(*) on an indexed column where the value had 424,887 rows took 1.4 seconds (on my P4 2 GB development machine; I intentionally underpower my development server so I get punished for slow queries. On the production 8-core 16 GB server that count takes less than 0.1 second). You can never guard your data from unexpected changes or errors in your program logic. COUNT(*) is the count and it is fast. If COUNT(*) is slow, you are going to have performance issues in other queries.
there are a whole pile of trade-offs, so no-one can give you the right answer. but here's an approach no-one else has mentioned:
you could use the "select where" query, but cache the result in a higher layer (memcache for example). so your code would look like:
count = memcache.get('article-count-' + user_id)
if count is None:
    count = database.execute('select ..... where user_id = ' + user_id)
    memcache.put('article-count-' + user_id, count)
and you would also need, when a user makes a new post
memcache.delete('article-count-' + user_id)
this will work best when the article count is used often, but updated rarely. it combines the advantage of efficient caching with the advantage of a normalized database. but it is not a good solution if the article count is needed only rarely (in which case, is optimisation necessary at all?). another unsuitable case is when someone's article count is needed often, but it is almost always a different person.
a further advantage of an approach like this is that you don't need to add the caching now. you can use the simplest database design and, if it turns out to be important to cache this data, add the caching later (without needing to change your schema).
more generally: you don't need to cache in your database. you could also put a cache "around" your database. something i have done with java is to use caching at the ibatis level, for example.
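for completeness, here is a runnable sketch of the caching pattern described above. it is only an illustration under stated assumptions: a plain dict stands in for memcache, Python's sqlite3 stands in for the real database, and the class name PostCountCache, the posts table and the db_hits counter are all invented for the example. it also uses a parameterised query instead of string concatenation, which avoids SQL injection.

```python
import sqlite3


class PostCountCache:
    """Cache-aside wrapper; a plain dict stands in for memcache here."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}
        self.db_hits = 0  # how many reads fell through to the database

    def _key(self, user_id):
        return f"article-count-{user_id}"

    def get_count(self, user_id):
        key = self._key(user_id)
        count = self.cache.get(key)
        if count is None:
            # Cache miss: ask the database and remember the answer.
            self.db_hits += 1
            row = self.conn.execute(
                "SELECT COUNT(*) FROM posts WHERE user_id = ?", (user_id,)
            ).fetchone()
            count = row[0]
            self.cache[key] = count
        return count

    def add_post(self, user_id, body):
        self.conn.execute(
            "INSERT INTO posts (user_id, body) VALUES (?, ?)", (user_id, body)
        )
        # Invalidate so the next read recounts (the memcache.delete step).
        self.cache.pop(self._key(user_id), None)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (post_id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)"
)
store = PostCountCache(conn)
store.add_post(7, "first")
first = store.get_count(7)    # miss: reads the database
second = store.get_count(7)   # hit: served from the cache
store.add_post(7, "second")
third = store.get_count(7)    # miss again after invalidation
```

the schema stays fully normalised here; dropping the cache later only means calling the COUNT(*) query directly.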

What implication does select * from foo have? [duplicate]

Possible Duplicate:
Can select * usage ever be justified?
Curious to hear this from folks with more DBA insight: what performance implications does an application face when it runs a query like:
select * from some_large_table;
You have to do a full table scan since no index is being hit, and I believe if we're talking O notation, we're speaking O(N) here where N is the size of the table. Is this typically considered not optimal behavior? What if you really do need everything from the table at certain times? Yes we have tools such as pagination etc, but I'm talking strictly from a database perspective here. Is this type of behavior normally frowned upon?
If you don't specify columns, the DB engine has to query the master table data for the column list. This query is really fast, but it adds a minor performance cost. As long as you're not doing a sloppy SELECT * with a JOIN statement or nested queries, you should be fine. However, note the small performance impact of letting the DB engine run a query to find the columns.
The MySQL server opens a server-side cursor to read that table. The client may read none or all of the records, and performance for the client will depend only on the number of records it actually fetches. Also, on the server side this query can actually be faster than a query with some conditions, since the latter also involves some index reading. Only if the client fetches all records will it be equivalent to a full table scan.
Selecting more columns than you need (select *) is always bad. Don't do more than you have to.
If you're selecting from the whole table, it doesn't matter if you have an index.
Another issue you're going to run into is how you want to lock the table. If this is a busy application you might not want to prevent locking entirely, because of the inconsistent data that might be returned. But if you lock too tightly it could slow the query further. O(n) is considered acceptable in most computer science applications. In databases, however, we measure in time and in number of reads/writes. This is a huge number of reads and will probably take a long time to execute, so it is unacceptable.

Begin Viewing Query Results Before Query Ends

Let's say I query a table with 500K rows. I would like to begin viewing any rows in the fetch buffer, which holds the result set, even though the query has not yet completed. I would like to scroll through the fetch buffer. If I scroll too far ahead, I want to display a message like: "REACHED LAST ROW IN FETCH BUFFER.. QUERY HAS NOT YET COMPLETED".
Could this be accomplished using fgets() to read the fetch buffer while the query continues building the result set? Doing this implies multi-threading.
Can a feature like this, other than the FIRST ROWS hint directive, be provided in Oracle, Informix, MySQL, or other RDBMS?
The whole idea is to have the ability to start viewing rows before a long query completes, while displaying a counter of how many rows are available for immediate viewing.
EDIT: What I'm suggesting may require a fundamental change in a DB server's architecture, as to the way they handle their internal fetch buffers, e.g. locking up the result set until the query has completed, etc. A feature like the one I am suggesting would be very useful, especially for queries which take a long time to complete. Why have to wait until the whole query completes, when you could start viewing some of the results while the query continues to gather more results!
Paraphrasing:
I have a table with 500K rows. An ad-hoc query without a good index to support it requires a full table scan. I would like to immediately view the first rows returned while the full table scan continues. Then I want to scroll through the next results.
It seems that what you would like is some sort of system where there can be two (or more) threads at work. One thread would be busy synchronously fetching the data from the database, and reporting its progress to the rest of the program. The other thread would be dealing with the display.
In the meantime, I would like to display the progress of the table scan, example: "Searching...found 23 of 500,000 rows so far".
It isn't clear that your query will return 500,000 rows (indeed, let us hope it does not), though it may have to scan all 500,000 rows (and may well have only found 23 rows that match so far). Determining the number of rows to be returned is hard; determining the number of rows to be scanned is easier; determining the number of rows already scanned is very difficult.
If I scroll too far ahead, I want to display a message like: "Reached last row in look-ahead buffer...query has not completed yet".
So, the user has scrolled past the 23rd row, but the query is not yet completed.
Can this be done? Maybe like: spawn/exec, declare scroll cursor, open, fetch, etc.?
There are a couple of issues here. The DBMS (true of most databases, and certainly of IDS) remains tied up as far as the current connection on processing the one statement. Obtaining feedback on how a query has progressed is difficult. You could look at the estimated rows returned when the query was started (information in the SQLCA structure), but those values are apt to be wrong. You'd have to decide what to do when you reach row 200 of 23, or you only get to row 23 of 5,697. It is better than nothing, but it is not reliable. Determining how far a query has progressed is very difficult. And some queries require an actual sort operation, which means that it is very hard to predict how long it will take because no data is available until the sort is done (and once the sort is done, there is only the time taken to communicate between the DBMS and the application to hold up the delivery of the data).
Informix 4GL has many virtues, but thread support is not one of them. The language was not designed with thread safety in mind, and there is no easy way to retrofit it into the product.
I do think that what you are seeking would be most easily supported by two threads. In a single-threaded program like an I4GL program, there isn't an easy way to go off and fetch rows while waiting for the user to type some more input (such as 'scroll down the next page full of data').
The FIRST ROWS optimization is a hint to the DBMS; it may or may not give a significant benefit to the perceived performance. Overall, it typically means that the query is processed less optimally from the DBMS perspective, but getting results to the user quickly can be more important than the workload on the DBMS.
Somewhere down below in a much down-voted answer, Frank shouted (but please don't SHOUT):
That's exactly what I want to do, spawn a new process to begin displaying first_rows and scroll through them even though the query has not completed.
OK. The difficulty here is organizing the IPC between the two client-side processes. If both are connected to the DBMS, they have separate connections, and therefore the temporary tables and cursors of one session are not available to the other.
When a query is executed, a temporary table is created to hold the query results for the current list. Does the IDS engine place an exclusive lock on this temp table until the query completes?
Not all queries result in a temporary table, though the result set for a scroll cursor usually does have something approximately equivalent to a temporary table. IDS does not need to place a lock on the temporary table backing a scroll cursor because only IDS can access the table. If it was a regular temp table, there'd still not be a need to lock it because it cannot be accessed except by the session that created it.
What I meant with the 500k rows, is nrows in the queried table, not how many expected results will be returned.
Maybe a more accurate status message would be:
Searching 500,000 rows...found 23 matching rows so far
I understand that an accurate count of nrows can be obtained in sysmaster:sysactptnhdr.nrows?
Probably; you can also get a fast and accurate count with 'SELECT COUNT(*) FROM TheTable'; this does not scan anything but simply accesses the control data - probably effectively the same data as in the nrows column of the SMI table sysmaster:sysactptnhdr.
So, spawning a new process is not clearly a recipe for success; you have to transfer the query results from the spawned process to the original process. As I stated, a multithreaded solution with separate display and database access threads would work after a fashion, but there are issues with doing this using I4GL because it is not thread-aware. You'd still have to decide how the client-side code is going to store the information for display.
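To illustrate the two-thread idea outside of I4GL, here is a minimal sketch in Python. It is an assumption-laden illustration, not something IDS or I4GL provides: sqlite3 with a temporary file stands in for the DBMS, and FetchBuffer, fetch_rows, demo and the table big are hypothetical names. One thread owns the database connection and appends rows to a shared buffer; the viewer reads from the buffer and gets a status message when it scrolls past what has been fetched so far.

```python
import os
import sqlite3
import tempfile
import threading

PENDING = "REACHED LAST ROW IN FETCH BUFFER.. QUERY HAS NOT YET COMPLETED"
NO_MORE = "NO MORE ROWS"


class FetchBuffer:
    """Rows fetched so far, shared between a fetcher thread and a viewer."""

    def __init__(self):
        self._rows = []
        self._lock = threading.Lock()
        self.done = threading.Event()

    def append(self, row):
        with self._lock:
            self._rows.append(row)

    def available(self):
        with self._lock:
            return len(self._rows)

    def view(self, index):
        """Return the row at index, or a status message if it is not there."""
        # Read the flag before the rows so the answer is never falsely final.
        finished = self.done.is_set()
        with self._lock:
            if index < len(self._rows):
                return self._rows[index]
        return NO_MORE if finished else PENDING


def fetch_rows(db_path, sql, buffer):
    """Runs in the fetcher thread, with its own connection."""
    conn = sqlite3.connect(db_path)
    try:
        for row in conn.execute(sql):
            buffer.append(row)
    finally:
        conn.close()
        buffer.done.set()


def demo(n_rows=5000):
    fd, db_path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE big (id INTEGER PRIMARY KEY, val INTEGER)")
        conn.executemany(
            "INSERT INTO big (val) VALUES (?)",
            ((i % 10,) for i in range(n_rows)),
        )
        conn.commit()
        conn.close()

        buffer = FetchBuffer()
        sql = "SELECT id, val FROM big WHERE val = 3 ORDER BY id"
        worker = threading.Thread(target=fetch_rows, args=(db_path, sql, buffer))
        worker.start()
        # While the worker runs, the viewer may call buffer.view(i) and
        # buffer.available() at any time to show rows and a running counter.
        worker.join()
        total = buffer.available()
        return total, buffer.view(0), buffer.view(total)
    finally:
        os.remove(db_path)
```

The counter from available() gives the "found 23 matching rows so far" style of message. It says nothing about how far the scan itself has progressed, which, as discussed above, is the hard part.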
There are three basic limiting factors:
The execution plan of the query. If the execution plan has a blocking operation at the end (such as a sort or an eager spool), the engine cannot return rows early in the query execution. It must wait until all rows are fully processed, after which it will return the data as fast as possible to the client. The time for this may itself be appreciable, so this part could be applicable to what you're talking about. In general, though, you cannot guarantee that a query will have much available very soon.
The database connection library. When returning recordsets from a database, the driver can use server-side paging or client-side paging. Which is used can and does affect which rows will be returned and when. Client-side paging forces the entire query to be returned at once, reducing the opportunity for displaying any data before it is all in. Careful use of the proper paging method is crucial to any chance to display data early in a query's lifetime.
The client program's use of synchronous or asynchronous methods. If you simply copy and paste some web example code for executing a query, you will be a bit less likely to be working with early results while the query is still running—instead the method will block and you will get nothing until it is all in. Of course, server-side paging (see point #2) can alleviate this, however in any case your application will be blocked for at least a short time if you do not specifically use an asynchronous method. For anyone reading this who is using .Net, you may want to check out Asynchronous Operations in .Net Framework.
If you get all of these right, and use the FAST FIRSTROW technique, you may be able to do some of what you're looking for. But there is no guarantee.
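As a small illustration of point 2, a client can pull rows in batches through the standard DB-API fetchmany call and display each batch as it arrives, rather than calling fetchall. This is only a sketch under assumptions: it uses Python's sqlite3, the table big and the helper fetch_in_batches are invented, and whether batches really arrive before the query finishes depends on the driver and on the execution plan (a blocking sort, per point 1, still delays the first row).

```python
import sqlite3


def fetch_in_batches(cursor, batch_size=100):
    """Yield lists of rows as they become available instead of one big list."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        yield batch


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (id INTEGER PRIMARY KEY, val INTEGER)")
conn.executemany("INSERT INTO big (val) VALUES (?)", ((i,) for i in range(250)))
cursor = conn.execute("SELECT id, val FROM big ORDER BY id")

shown = 0
batch_sizes = []
for batch in fetch_in_batches(cursor, batch_size=100):
    shown += len(batch)
    batch_sizes.append(len(batch))
    # A real client would hand the batch to the display here and
    # tell the user how many rows are available so far.
    print(f"{shown} rows available so far")
```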
It can be done, with an analytic function, but Oracle has to full scan the table to determine the count no matter what you do if there's no index. An analytic could simplify your query:
SELECT x,y,z, count(*) over () the_count
FROM your_table
WHERE ...
Each row returned will have the total count of rows returned by the query in the_count. As I said, however, Oracle will have to finish the query to determine the count before anything is returned.
Depending on how you're processing the query (e.g., a PL/SQL block in a form), you could use the above query to open a cursor, then loop through the cursor and display sets of records and give the user the chance to cancel.
I'm not sure how you would accomplish this, since the query has to complete prior to the results being known. No RDBMS (that I know of) offers any means of determining how many results to a query have been found prior to the query completing.
I can't speak factually for how expensive such a feature would be in Oracle because I have never seen the source code. From the outside in, however, I think it would be rather costly and could double (if not more) the length of time a query took to complete. It would mean updating an atomic counter after each result, which isn't cheap when you're talking millions of possible rows.
So I am putting up my comments into this answer-
In terms of Oracle.
Oracle maintains its own buffer cache inside the system global area (SGA) for each instance. The hit ratio on the buffer cache depends on the sizing and reaches 90% most of the time, which means 9 out of 10 hits will be satisfied without reaching the disk.
Considering the above, even if there is a "way" (so to speak) to access the buffer cache for a query you run, the results would depend heavily on the cache sizing. If the buffer cache is too small, the cache hit ratio will be low and more physical disk I/O will result, which makes the buffer cache unreliable as a source of temporary data. If the buffer cache is too big, parts of it will be under-utilized and memory will be wasted, which in turn means unnecessary processing when you try to peek into it for the data you want.
Also, depending on your cache sizing and SGA memory, it would be up to the ODBC driver / optimizer to determine when and how much to use what (cache buffering or direct disk I/O).
In terms of trying to access the "buffer cache" to find "the row" you are looking for, there might be a way (now or in the near future) to do it, but there would be no way to know whether what you are looking for ("the row") is there after all.
Also, full table scans of large tables usually result in physical disk reads and a lower buffer cache hit ratio. You can get an idea of full table scan activity at the data file level by querying v$filestat and joining to SYS.dba_data_files. Following is a query you can use:
SELECT A.file_name, B.phyrds, B.phyblkrd
FROM SYS.dba_data_files A, v$filestat B
WHERE B.file# = A.file_id
ORDER BY A.file_id;
Since this whole ordeal depends heavily on multiple parameters and statistics, the results of what you are looking for may remain a probability driven by those factors.