Surprising timing stats for SQL queries - MySQL

I have two queries whose timing parameters I want to analyze.
The first query is taking much longer than the second one, while in my opinion it should be the other way round. Any explanations?
First query:
select mrn
from EncounterInformation
limit 20;
Second query:
select enc.mrn, fl.fileallocationid
from EncounterInformation as enc inner join
FileAllocation as fl
on enc.encounterIndexId = fl.encounterid
limit 20;
The first query runs in 0.760 seconds on MySQL, while the second one surprisingly runs in 0.509 seconds.

There are many reasons why measured performance between two queries might be different:
The execution plans for the queries (the dominant factor)
The size of the data being returned (perhaps mrn is a string that is really long for the result set in the first query but not the second)
Other database activity that locks tables and indexes
Other server activity
Data and index caches that are pre-loaded -- either in the database itself or in the underlying OS components
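On that last point: one way to reduce cache effects when timing by hand is to bypass the query cache (present in MySQL versions before 8.0) and run each statement several times, comparing the later runs once the buffer pool is warm. Using the first query from the question:
SELECT SQL_NO_CACHE mrn FROM EncounterInformation LIMIT 20;  -- repeat 2-3 times and compare the later timings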
Your observation is correct: the first should be faster than the second. More important, though, is that the timing simply does not make sense for your simple query:
The first query runs in 0.760 seconds
select mrn
from EncounterInformation
limit 20;
The work done for this would typically be to load one data page (or maybe a handful). That would only consistently take 0.760 seconds if:
You had really slow data storage (think "carrier pigeons").
EncounterInformation is a view and not a table.
You don't understand timings.
If the difference is between 0.760 milliseconds and 0.509 milliseconds, then the difference is really small and likely due to other issues -- warm caches, other activity on the server, other database activity.
It is also possible that you are measuring elapsed time and not database time, so network congestion could potentially be an issue.
If you are querying views, all bets are off without knowing what the views are. In fact, if you care about performance you should be including the execution plan in the question.
I can't explain the difference. What I can say is that your observation is reasonable, but your question omits a lot of information, which suggests you need to learn more about how to interpret timings. I would suggest that you start by learning about EXPLAIN.
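For example (table and column names taken from your queries; the last statement also reveals whether EncounterInformation is really a view):
EXPLAIN SELECT mrn FROM EncounterInformation LIMIT 20;
EXPLAIN SELECT enc.mrn, fl.fileallocationid
FROM EncounterInformation AS enc
INNER JOIN FileAllocation AS fl ON enc.encounterIndexId = fl.encounterid
LIMIT 20;
SHOW CREATE TABLE EncounterInformation;  -- works on views too and shows the definition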

Related

Getting the cost of every MySQL query

We have a web application backed by MySQL serving hundreds of queries per second. I'm looking for a way to measure the "cost" of every query in production. I'm imagining some option where, for every query, MySQL returns the query results along with the CPU and I/O cost of executing that query.
The end goal is to aggregate those costs by endpoint (e.g. "/search") and by the logged-in user ID. That way, when we're having issues with the site, we can quickly see if there's a particular action or user ID that is using up a large chunk of our MySQL resources.
Close but not quite (AFAICT):
This answer comes close: https://stackoverflow.com/a/12880997/163832
It describes the precision and accuracy problems with EXPLAIN and recommends an alternative that measures what actually happened rather than estimating what will happen.
The alternative does seem better for my use case, but there are still problems:
I looked at the available stats and can't find ones that measure CPU or I/O.
I don't think I can afford to do FLUSH STATUS and then SHOW SESSION STATUS ... on every query.
This doesn't work when many queries are running concurrently.

UPDATE vs COUNT vs SELECT performance

Is this statement true or false:
The performance of these queries
SELECT * FROM table;
UPDATE table SET field = 1;
SELECT COUNT(*) FROM table;
is identical.
Or is there ever a case in which the performance of one will greatly differ from the others?
UPDATE
I'm more interested in whether there's a large difference between the SELECT and the UPDATE. You can ignore the COUNT(*) if you want.
Assume the SELECT performs a full table scan, and that the UPDATE also updates every row in the table.
Assume the UPDATE only changes one field, though it updates all rows (and it's an indexed field).
I know that they'll take different amounts of time and that they do different things. What I want to know is whether the difference will be significant or not. E.g., if the UPDATE takes 5 times longer than the SELECT, then it's significant; use that as the threshold. There's no need to be precise, just give an approximation.
There are different resource types involved:
disk I/O (this is the most costly part of every DBMS)
buffer pressure: fetching a row causes a page to be fetched from disk, and that page needs buffer memory to be stored in
work/scratch memory for intermediate tables, structures and aggregates.
"terminal" I/O to the front-end process.
cost of locking, serialisation, versioning and journaling
CPU cost: this is negligible in most cases (compared to disk I/O)
The UPDATE query in the question is the hardest: it will cause all disk pages for the table to be fetched, put into buffers, altered into new buffers and written back to disk. In normal circumstances, it will also cause other processes to be locked out, with contention and even more buffer pressure as a result.
The SELECT * query needs all the pages, too; and it needs to convert/format them all into frontend-format and send them back to the frontend.
The SELECT COUNT(*) is the cheapest on all resources. In the worst case all the disk pages have to be fetched; if an index is present, fewer disk I/Os and buffers are needed. The CPU cost is still negligible (IMHO) and the "terminal" output is marginal.
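A quick way to see these differences for yourself is to compare the execution plans; a minimal sketch, assuming a hypothetical InnoDB table t with an indexed column field:
EXPLAIN SELECT * FROM t;             -- full table scan; every column is read and returned
EXPLAIN SELECT COUNT(*) FROM t;      -- InnoDB can satisfy this from its smallest secondary index
EXPLAIN UPDATE t SET field = 1;      -- MySQL 5.6+ can explain DML; every row is read and rewritten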
When you say "performance", do you mean "how long it takes them to execute"?
One of them is returning all data in all rows.
One of them (if you remove the "FROM") is writing data to the rows.
One is counting rows and returning none of the data in the rows.
All three of those queries are doing entirely different things. Therefore, it is reasonable to conclude that all three of them will take different amounts of time to complete.
Most importantly, why are you asking this question? What are you trying to solve? I have a bad feeling that you're going down a wrong path by asking this.
I have a large (granted, indexed) table here at work, and this is what I found:
select * from X (limited to the first 100,000 records): 12.5 seconds
select count(*) from X (counted millions of records): 15.57 seconds
An update on an indexed table is very fast (less than a second).
The SELECT and UPDATE should be about the same (but they could easily vary; it depends on the database and storage engine). COUNT(*) is cached in some databases, at some level: for example, MyISAM stores an exact row count, so an unfiltered COUNT(*) is O(1) there, while InnoDB has to count the rows. Of course a lazy implementation of UPDATE could also be O(1), but I don't know of anyone doing that currently.
tl;dr: "False" or "it depends".
All three queries do vastly different things.
They each have their own performance characteristics and are not directly comparable.
Can you clarify what you are attempting to investigate?

MySQL query speed or rows read

Sorry for all the text; the most important part is in the last three paragraphs. :D
Recently we had a MySQL problem on one of our client servers. Out of the blue, the CPU usage of the mysql process started skyrocketing. This led us to finding and optimizing bad queries, and here is the problem.
I was thinking that optimization means speeding up queries (the total time needed for a query to execute). But after optimizing several queries with that in mind, my colleague started complaining that some queries read too many rows, even all rows from a table (as shown by EXPLAIN).
After rewriting a query I noticed that if I make it read fewer rows, its speed suffers; if I tune it for speed, it reads more rows.
That didn't make sense to me: fewer rows read, but a longer execution time.
It made me wonder what should be done. Of course it would be ideal to have a fast query that also reads the fewest rows, but since that doesn't seem possible for me, I'm looking for some answers. Which approach should I take: speed, or fewer rows read? What are the pros and cons when a query is fast but reads more rows, and when it reads fewer rows but is slower? What happens on the server in each case?
After googling, all I could find were articles and discussions about how to improve speed, but none covered the trade-off I described above.
I'm looking forward to seeing even personal preferences, of course with some reasoning.
Links that could point me in the right direction are welcome too.
I think your problem depends on how you are limiting the number of rows read. If you read fewer rows by adding more WHERE conditions that MySQL has to evaluate, then yes, performance will take a hit.
I would look at indexing some of the columns that make your search more complex. Simple data types are faster to look up than complex ones. Check whether the columns you are searching on are already indexed.
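For example, with hypothetical table and column names:
SHOW INDEX FROM orders;                               -- lists every index and the columns it covers
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;  -- the "key" column shows whether an index is actually used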
Without more data, I can give you some hints:
Be sure your tables are properly indexed. Create the appropriate indexes for each of your tables. Also drop the indexes that are not needed.
Decide the best approach for each query. For example, if you use group by only to deduplicate rows, you are wasting resources; it is better to use select distinct (on an indexed field).
"Divide and conquer". Can you split your process in two, three or more intermediate steps? If the answer is "yes", then: Can you create temporary tables for some of these steps? I've split proceses using temp tables, and they are very useful to speed things up.
The count of rows read reported by EXPLAIN is an estimate anyway -- don't take it as a literal value. Notice that if you run EXPLAIN on the same query multiple times, the number of rows read changes each time. This estimate can even be totally inaccurate, as there have been bugs in EXPLAIN from time to time.
Another way to measure query performance is SHOW SESSION STATUS LIKE 'Handler%' as you test the query. This will tell you accurate counts of how many times the SQL layer made requests for individual rows to the storage engine layer. For examples, see my presentation, SQL Query Patterns, Optimized.
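The pattern is roughly the following, run in a single session (the middle statement is whatever query you are testing, shown here with hypothetical names):
FLUSH STATUS;                                 -- reset the session counters
SELECT * FROM orders WHERE customer_id = 42;  -- the query under test (hypothetical)
SHOW SESSION STATUS LIKE 'Handler%';          -- Handler_read_* show how many row requests hit the storage engine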
There's also an issue of whether the rows requested were already in the buffer pool (I'm assuming you use InnoDB), or did the query have to read them from disk, incurring I/O operations. A small number of rows read from disk can be orders of magnitude slower than a large number of rows read from RAM. This doesn't necessarily account for your case, but it points out that such a scenario can occur, and "rows read" doesn't tell you if the query caused I/O or not. There might even be multiple I/O operations for a single row, because of InnoDB's multi-versioning.
Insight into the difference between logical row request vs. physical I/O reads is harder to get. In Percona Server, enhancements to the slow query log include the count of InnoDB I/O operations per query.

How can I select some amount of rows, like "get as many rows as possible in 5 seconds"?

The aim is to get as many rows as possible within 5 seconds, without requesting more rows than can actually be loaded in that time. The aim is not to create a timeout.
After months, I thought maybe this would work, but it didn't:
declare @d1 datetime2(7); set @d1 = getdate();
select c1, c2 from t1 where datediff(ss, @d1, getdate()) < 5
Although the trend in recent years for relational databases has moved more and more toward cost-based query optimization, there is no RDBMS I am aware of that inherently supports designating a maximum cost (in time or I/O) for a query.
The idea of "just let it time out and use the records collected so far" is a flawed solution. The flaw lies in the fact that a complex query may spend the first 5 seconds performing a hash on a subtree of the query plan, to generate data that will be used by a later part of the plan. So after 5 seconds, you may still have no records.
To get the most records possible in 5 seconds, you would need a query that had a known estimated execution plan, which could then be used to estimate the optimal number of records to request in order to make the query run for as close to 5 seconds as possible. In other words, knowing that the query optimizer estimates it can process 875 records per second, you could request 4,375 records. The query might run a bit longer than 5 seconds sometimes, but over time your average execution should fall close to 5 seconds.
So...how to make this happen?
In your particular situation, it's not feasible. The catch is "known estimated execution plan". To make this work reliably, you'd need a stored procedure with a known execution plan, not an ad-hoc query. Since you can't create stored procedures in your environment, that's a non-starter. For others who want to explore that solution, though, here's an academic paper by a team who implemented this concept in Oracle. I haven't read the full paper, but based on the abstract it sounds like their work could be translated to any RDBMS that has cost-based optimization (e.g. MS SQL, MySQL, etc.)
OK, So what can YOU do in your situation?
If you can't do it the "right" way, solve it with a hack.
My suggestion: keep your own "estimated cost" statistics.
Do some testing in advance and estimate how many rows you can typically get back in 4 seconds. Let's say that number is 18,000.
So you LIMIT your query to 18,000 rows. But you also track the execution time every time you run it and keep a moving average of, say, the last 50 executions. If that average is less than 4.5s, add 1% to the query size and reset the moving average. So now your app is requesting 18,180 rows every time. After 50 iterations, if the moving average is under 4.5s, add 1% again.
And if your moving average ever exceeds 4.75s, subtract 1%.
Over time, this method should converge on an optimized N-rows value for your particular query/environment/etc., and should adjust (slowly but steadily) when conditions change (e.g. high concurrency vs. low concurrency).
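If you want to keep that bookkeeping inside the database, a hedged sketch (hypothetical table and query names; the application measures elapsed time and maintains avg_seconds) might look like this:
-- hypothetical bookkeeping table maintained by the application
CREATE TABLE IF NOT EXISTS query_tuning (
  query_name  VARCHAR(64) PRIMARY KEY,
  row_limit   INT NOT NULL,
  avg_seconds DECIMAL(6,3) NOT NULL
);

-- after every 50 runs, nudge the LIMIT up or down based on the moving average
UPDATE query_tuning
SET row_limit = CASE
      WHEN avg_seconds < 4.50 THEN ROUND(row_limit * 1.01)  -- consistently fast: ask for 1% more rows
      WHEN avg_seconds > 4.75 THEN ROUND(row_limit * 0.99)  -- too slow: ask for 1% fewer rows
      ELSE row_limit
    END
WHERE query_name = 'big_report';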
Just one -- scratch that, two -- more things...
As a DBA, I have to say...it should be exceedingly rare for any query to take more than 5 seconds. In particular, if it's a query that runs frequently and is used by the front end application, then it absolutely should not ever run for 5 seconds. If you really do have a user-facing query that can't complete in 5 seconds, that's a sign that the database design needs improvement.
Jonathan VM's Law Of The Greenbar Report
I used to work for a company that still used a mainframe application that spit out reams of greenbar dot-matrix-printed reports every day. Most of these were ignored, and of the few that were used, most were never read beyond the first page. A report might have thousands of rows sorted by descending account age...and all that user needed was to see the 10 most aged. My law is this: The number of use cases that actually require seeing a vast number of rows is infinitesimally small. Think - really think - about the use case for your query, and whether having lots and lots of records is really what that user needs.
Your while loop idea won't solve the problem entirely. It is possible that the very first iteration through the loop could take longer than 5 seconds. Plus, it will likely result in retrieving far fewer rows in the allotted time than if you tried to do it with just a single query.
Personally, I wouldn't try to solve this exact problem. Instead, I would do some testing, and through trial and error identify a number of records that I am confident will load in under five seconds. Then, I would just place a LIMIT on the loading query.
Next, depending on the requirements I would either set a timeout on the DB call of five seconds or just live with the chance that some calls will exceed the time restriction.
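If the database itself should enforce the time cap, MySQL 5.7 and later offer the MAX_EXECUTION_TIME optimizer hint for SELECT statements. A sketch with hypothetical names (note that a query killed by the hint returns an error rather than the partial result set, so it enforces the budget but does not return "what it got so far"):
SELECT /*+ MAX_EXECUTION_TIME(5000) */ c1, c2  -- abort this SELECT after 5000 ms
FROM t1
LIMIT 18000;                                   -- row cap found by prior testing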
Lastly, consider that on most modern hardware, for most queries, you can return a very large number of records within five seconds. It's hard to imagine returning all of that data to the UI and still having it be usable, if that is your intention.
-Jason
I've never tried this, but if a script is running this query you could try running an unbuffered query (in PHP, this would be something like mysql_unbuffered_query()). You could then store the rows into an array while the query is running, and set the MySQL query timeout to five seconds. When the query is killed, if you've set your while() loop to check for a timeout response, it can terminate the loop and you'll have an array with all of the records returned in 5 seconds. Again, I'm not sure this would work, but I'd be interested to see if it would accomplish what you're looking to do.
You could approach the problem like this, though I doubt this logic is really something I'd recommend for real-world use.
You have a 10s interval. You try one query and it gets you a row in 0.1s. That would imply you could run at least 99 similar queries in the remaining 9.9s.
However, fetching 99 rows at once should prove faster than fetching them one by one (which is what your initial calculation assumes). So you fetch the 99 rows and check the time again.
Let's say the batch performed 1.5 times as fast as the single queries, because getting more rows at once is more efficient, leaving you with 100 rows after a total of 7.5s. You calculate that on average you have so far gotten 100 rows per 7.5s, compute a new number of rows you can fetch in the remaining time, and query again, and so on. You would, however, need to set a threshold for this loop, something like: don't issue any new queries after 9.9s.
This solution is obviously neither the smoothest nor something I'd really use, but maybe it helps solve the OP's problem.
Also, jmacinnes already pointed out: "It is possible that the very first iteration through the loop could take longer than 10[5] seconds."
I'd certainly be interested myself, if someone can come up with a proper solution to this problem.
To get data from the table you should do two things:
execute a query (SELECT something FROM table)
fetch the data (fill a client-side table or read the rows)
You are asking about the second one. I'm not that familiar with PHP, but I think it doesn't matter. We use fetching to get the first records quickly and show them to the user, then fetch more records as needed. In ADO.NET you could use IDataReader to get records one by one; in PHP I think you could use similar methods, for example mysqli_fetch_row in the mysqli extension or mysql_fetch_row in the mysql extension. In this case you could stop reading data at any moment.

What is the cost of indexing multiple db columns?

I'm writing an app with a MySQL table that indexes 3 columns. I'm concerned that after the table reaches a significant number of records, the time to save a new record will become slow. Please advise on how best to approach indexing these columns.
UPDATE
I am indexing a point_value, the user_id, and an event_id, all required for client-facing purposes. For an instance such as scoring baseball runs by player id and game id. What would be the cost of inserting about 200 new records a day, after the table holds records for two seasons, say 72,000 runs, and after 5 seasons, maybe a quarter million records? Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
Index what seems the most logical (that should hopefully be obvious, for example, a customer ID column in the CUSTOMERS table).
Then run your application and collect statistics periodically to see how the database is performing. RUNSTATS on DB2 is one example; MySQL's ANALYZE TABLE fills a similar role.
When you find some oft-run queries doing full table scans (or taking too long for other reasons), then, and only then, should you add more indexes. It does little good to optimise a once-a-month-run-at-midnight query so it can finish at 12:05 instead of 12:07. However, it's a huge improvement to reduce a customer-facing query from 5 seconds down to 2 seconds (that's still too slow, customer-facing queries should be sub-second if possible).
More indexes tend to slow down inserts and speed up queries. So it's always a balancing act. That's why you only add indexes in specific response to a problem. Anything else is premature optimization and should be avoided.
In addition, revisit the indexes you already have periodically to see if they're still needed. It may be that the queries that caused you to add those indexes are no longer run often enough to warrant it.
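For reference, a hedged sketch of what the three indexes from the question might look like on a hypothetical runs table, plus an EXPLAIN to confirm a typical client-facing query actually uses them:
CREATE TABLE runs (
  id          INT AUTO_INCREMENT PRIMARY KEY,
  user_id     INT NOT NULL,               -- player
  event_id    INT NOT NULL,               -- game
  point_value INT NOT NULL,
  KEY idx_user_event (user_id, event_id),  -- supports "runs by player id and game id"
  KEY idx_points (point_value)             -- only useful if you filter or sort on score
);

-- check that a typical client-facing query uses an index
EXPLAIN SELECT SUM(point_value) FROM runs WHERE user_id = 7 AND event_id = 42;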
To be honest, I don't believe indexing three columns on a table will cause you to suffer unless you plan on storing really huge numbers of rows :-) - indexing is pretty efficient.
After your edit which states:
I am indexing a point_value, the user_id, and an event_id, all required for client-facing purposes. For an instance such as scoring baseball runs by player id and game id. What would be the cost of inserting about 200 new records a day, after the table holds records for two seasons, say 72,000 runs, and after 5 seasons, maybe a quarter million records? Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
My response is that 200 records a day is an extremely small value for a database, you definitely won't have anything to worry about with those three indexes.
Just this week, I imported a day's worth of transactions into one of our database tables at work, and it contained 2.1 million records (we get at least one transaction per second across the entire day from 25 separate machines). And it has four separate composite keys, which is somewhat more intensive than your three individual keys.
Now granted, that's on a DB2 database, but I can't imagine IBM are so much better than the MySQL people that MySQL can only handle less than 0.01% of the DB2 load.
I made some simple tests using my real project and a real MySQL database.
My results are: adding an average index (1-3 columns in an index) to a table makes inserts slower by about 2.1%. So, if you add 20 indexes, your inserts will be slower by 40-50%. But your selects will be 10-100 times faster.
So is it OK to add many indexes? It depends :) I gave you my results; you decide!
Nothing for select queries, though updates and especially inserts will be orders of magnitude slower - which you won't really notice until you start inserting a LOT of rows at the same time...
In fact at a previous employer (single user, desktop system) we actually DROPPED indexes before starting our "import routine" - which would first delete all records before inserting a huge number of records into the same table...
Then when we were finished with the insertion job we would re-create the indexes...
We would save 90% of the time for this operation by dropping the indexes before starting the operation and re-creating the indexes afterwards...
This was a Sybase database, but the same principle applies to any database...
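On MySQL, the same pattern might look roughly like this (hypothetical table, index, column, and file names):
ALTER TABLE scores DROP INDEX idx_user, DROP INDEX idx_event;  -- drop secondary indexes first

DELETE FROM scores;                                            -- the routine cleared the table first
LOAD DATA INFILE '/tmp/scores.csv' INTO TABLE scores;          -- bulk load with no index maintenance

ALTER TABLE scores
  ADD INDEX idx_user (user_id),                                -- rebuild the indexes once, at the end
  ADD INDEX idx_event (event_id);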
So be careful with indexes, they're FAR from "free"...
Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
With that kind of insertion rate, the cost of indexing an extra column will be negligible.
Without more details about the expected usage of the data in your table, worrying about indexes slowing you down smells a lot like premature optimization that should be avoided.
If you are really concerned about it, then set up a test database and simulate performance in the worst-case scenarios. A test proving that it is or is not a problem will probably be much more useful than trying to guess and worry about what may happen. If there is a problem, you will be able to use your test setup to try different methods of fixing the issue.
The index is there to speed up retrieval of data, so the question should be "What data do I need to access quickly?". Without the index, some queries will do a full table scan (go through every row in the table) in order to find the data that you want. With a significant number of records this will be a slow and expensive operation. If it is for a report that you run once a month then maybe that's okay; if it is for frequently accessed data then you will need the index to give your users a better experience.
If you find that the insert operations are slow because of the index, then this is a problem you can solve at the hardware level by throwing more CPUs, RAM and better hard drive technology at the problem.
What Pax said.
For the dimensions you describe, the only significant concern I can imagine is "What is the cost of failing to index multiple db columns?"