Judging Length of SQL Query based on Explain - mysql

I am trying to get a ballpark estimate of how long a query is going to take, based on the output of EXPLAIN. I am using a MySQL database.
I know that you can't determine how long a query is going to take with any certainty. I am just looking for a rough estimate, e.g. 1 hour, 8 hours, 1 day, 2 weeks, etc. Thank you!

You can't; it depends entirely on your infrastructure. The explain plan is first of all only indicative, and on top of that very specific to your server.
Use TKPROF for better results anyway.
EDIT: MySQL doesn't have TKPROF, unlike Oracle (the rest is still valid). Instead, use PROFILING for more relevant information: http://dev.mysql.com/doc/refman/5.0/en/show-profiles.html
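A minimal profiling session, assuming an arbitrary placeholder table, looks something like this:
SET profiling = 1;                -- enable profiling for the current session
SELECT COUNT(*) FROM my_table;    -- run the query you want to measure
SHOW PROFILES;                    -- list recent statements with their durations
SHOW PROFILE FOR QUERY 1;         -- stage-by-stage timing for statement number 1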

Related

How to improve a simple MySQL-Query

There is this rather simple query that I have to run on a live system in order to get a count. The problem is that the table and database are rather inefficiently designed, and since it is a live system, altering it is not an option at this point.
So I have to figure out a query that runs fast and won't slow down the system too much, because for the duration of the query execution the system basically stops, which is not really what I would like a live system to do. So I need to streamline my query in order to make it perform in an acceptable time.
SELECT id1, count(id2) AS count FROM table GROUP BY id1 ORDER BY count DESC;
So here is the query; unfortunately it is so simple that I am out of ideas on how to improve it further. Maybe someone else has an idea...?
Application
Get "good enough" results via application changes. If you have access to the application, but not the database, then there are possibilities:
Periodically run that slow query and capture the results. Then use the cached results.
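One hedged sketch of that caching idea, assuming you are free to add a summary table and a scheduled job (all names here are invented):
CREATE TABLE id1_counts_cache (id1 INT PRIMARY KEY, cnt INT NOT NULL);
-- Refresh periodically, e.g. from cron or a MySQL event (event scheduler must be on):
CREATE EVENT refresh_id1_counts ON SCHEDULE EVERY 1 HOUR
DO REPLACE INTO id1_counts_cache SELECT id1, COUNT(*) FROM `table` GROUP BY id1;
-- The application then reads the cheap cached result instead of running the slow query:
SELECT id1, cnt FROM id1_counts_cache ORDER BY cnt DESC;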
Do you need all of them?
What is the goal? Find a few of the most common id1's? Rank all of them?
Back to the query
COUNT(id2) checks for id2 being not null; this is usually unnecessary, so COUNT(*) is better. However, the speedup is insignificant.
ORDER BY NULL is irrelevant if you are picking off the rows with the highest COUNT -- the sort needs to be done somewhere. Moving it to the application does not help; at least not much.
Adding LIMIT 10 would help only by cutting down on the time to send the data back to the client.
INDEX(id1) is the best index for the query (after changing to COUNT(*)); a sketch of both changes follows below. But the operation still requires:
a full index scan to do the COUNT and GROUP BY
a sort of the grouped results, for the ORDER BY
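A sketch of those two changes together (the index name is arbitrary, and the LIMIT only applies if you just need the top rows):
ALTER TABLE `table` ADD INDEX idx_id1 (id1);
SELECT id1, COUNT(*) AS count
FROM `table`
GROUP BY id1
ORDER BY count DESC
LIMIT 10;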
Zero or near-zero downtime
Do you have replication established? Galera Clustering?
Look into pt-online-schema-change and gh-ost.
What is the real goal?
We cannot fix the query as written. What things can we change? Better yet, what is the ultimate goal? Perhaps there is an approach that does not involve any query resembling the one you are trying to speed up.
Now I have just dumped the table and imported it into a MySQL Docker container and ran the query there. It took ages, and I actually had to move my entire Docker setup because the dump was so huge, but in the end I got my results, and now I know how many id2 values are associated with specific id1 values.
As was already pointed out, there wasn't much room left for improving the query itself.
FYI, suddenly the concern about stopping the system was gone, and now we are indexing the table; so far it has taken 6 hours, with no end in sight :D
Anyways, thanks for the help everyone.

MySQL code locks up server

I have code that I have tested on the same table, but with only a few records in it.
It worked great on a handful (30) records. Did exactly what I wanted it to do.
When I added 200 records to the table, it locks up. I have to restart Apache, and I have tried waiting forever for it to finish.
I could use some help figuring out why.
My table has the proper indexes and I am not having trouble in any other way.
Thanks in advance.
UPDATE `base_data_test_20000_rows` SET `NO_TOP_RATING` =
(SELECT COUNT(`ID`) FROM `base_data_test_20000_rows_2`
WHERE
`base_data_test_20000_rows_2`.`ID` != `base_data_test_20000_rows`.`ID`
AND
`base_data_test_20000_rows_2`.`ANALYST` = `base_data_test_20000_rows`.`ANALYST`
AND
`base_data_test_20000_rows_2`.`IRECCD` =
(SELECT COUNT(`ID`) FROM `base_data_test_20000_rows_2`
WHERE `IRECCD` =
(select MIN(`IRECCD`) FROM `base_data_test_20000_rows_2`
WHERE
`base_data_test_20000_rows_2`.`ANNDATS_CONVERTED` >= DATE_SUB(`base_data_test_20000_rows`.`ANNDATS_CONVERTED`,INTERVAL 1 YEAR)
AND
`base_data_test_20000_rows_2`.`ID` != `base_data_test_20000_rows`.`ID`
AND
`base_data_test_20000_rows_2`.`ESTIMID` = `base_data_test_20000_rows`.`ESTIMID`
)
)
)
WHERE `base_data_test_20000_rows`.`ANALYST` != ''
The code is just meant to look a year back for a particular brokerage, get the lowest rating, then count the number of times that analyst had the lowest rating, and write that value to the NO_TOP_RATING column.
I'm pretty sure I was wrong with my original suggestion; changing the SELECT COUNT to its own number won't help, since you have conditions on your query.
This is merely a hackish solution. The real way to solve this would be to optimize your query. But as a workaround you could set the record count in a MySQL variable and then reference that variable in the query.
This means that you will have to make sure you set the count into the variable before you run the query. It also means that if records are added between the time you set the variable and the time the query finishes, you will not have the right count.
http://dev.mysql.com/doc/refman/5.0/en/user-variables.html
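A rough sketch of the variable idea with your table names (subject to the caveat below that your per-row conditions may make it unusable):
SET @cnt := (SELECT COUNT(`ID`) FROM `base_data_test_20000_rows_2`);
-- The UPDATE would then reference @cnt instead of re-running that subquery,
-- accepting that rows added after SET runs are not reflected in the count.
SELECT @cnt;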
Further thoughts:
I took a closer look before submitting this answer. That might not actually be possible, since you have WHERE conditions that are individualized to each record.
It's slow because you are using a query that counts within a query that counts within a query that takes a MIN. You are effectively iterating through every row for each row you iterate through, nested three levels deep, so the work grows roughly with the cube of the table size. If the database had 10 records, you would be going through each record something like 10^3 times. At the number of rows you have, it's hellish.
I'm sure that there is a way to do what you are trying to do, but I can't actually tell what you are trying to do.
I would have to agree with DRapp that seeing dummy data could help us analyze what's really going on.
Since I can't wrap my head around it all, what I would try, without fully understanding what you are doing, would be to create a view for each of your subqueries and then run a query on those: http://dev.mysql.com/doc/refman/5.0/en/create-view.html
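For instance, one of the counting pieces could be wrapped roughly like this; this is only an illustration of the mechanics, since it does not reproduce your correlated conditions:
CREATE VIEW analyst_counts AS
SELECT `ANALYST`, COUNT(`ID`) AS cnt
FROM `base_data_test_20000_rows_2`
GROUP BY `ANALYST`;
SELECT cnt FROM analyst_counts WHERE `ANALYST` = 'some analyst';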
That probably won't eliminate the redundancy, but it might help with the speed. Since I don't fully understand what you are doing, though, it's probably not the best answer.
Another not-so-good answer: if you aren't running this on a mission-critical DB and it can go offline while you run the query, then you could just change your MySQL settings and let this query run for those hours you quoted and hope it doesn't crash. But that seems less than ideal, and I have no idea whether it requires additional disk space or memory to perform.
So really the best answer I can give you at this point is: try to see if you can approach your problem from a different angle, or post some dummy data of what the info in base_data_test_20000_rows looks like and what you expect it to look like after the query runs.
-Hope that helps point you in the right direction

How can I select some amount of rows, like "get as many rows as possible in 5 seconds"?

The aim is to get as many rows as possible within 5 seconds, without requesting more rows than have actually been loaded after those 5 seconds. The aim is not to create a timeout.
After months, I thought maybe this would work, but it didn't:
declare @d1 datetime2(7); set @d1=getdate();
select c1,c2 from t1 where (datediff(ss,@d1,getdate())<5)
Although the trend in recent years for relational databases has moved more and more toward cost-based query optimization, there is no RDBMS I am aware of that inherently supports designating a maximum cost (in time or I/O) for a query.
The idea of "just let it time out and use the records collected so far" is a flawed solution. The flaw lies in the fact that a complex query may spend the first 5 seconds performing a hash on a subtree of the query plan, to generate data that will be used by a later part of the plan. So after 5 seconds, you may still have no records.
To get the most records possible in 5 seconds, you would need a query that had a known estimated execution plan, which could then be used to estimate the optimal number of records to request in order to make the query run for as close to 5 seconds as possible. In other words, knowing that the query optimizer estimates it can process 875 records per second, you could request 4,375 records. The query might run a bit longer than 5 seconds sometimes, but over time your average execution should fall close to 5 seconds.
So...how to make this happen?
In your particular situation, it's not feasible. The catch is "known estimated execution plan". To make this work reliably, you'd need a stored procedure with a known execution plan, not an ad-hoc query. Since you can't create stored procedures in your environment, that's a non-starter. For others who want to explore that solution, though, here's an academic paper by a team who implemented this concept in Oracle. I haven't read the full paper, but based on the abstract it sounds like their work could be translated to any RDBMS that has cost-based optimization (e.g. MS SQL, MySQL, etc.)
OK, So what can YOU do in your situation?
If you can't do it the "right" way, solve it with a hack.
My suggestion: keep your own "estimated cost" statistics.
Do some testing in advance and estimate how many rows you can typically get back in 4 seconds. Let's say that number is 18,000.
So you LIMIT your query to 18,000 rows. But you also track the execution time every time you run it and keep a moving average of, say, the last 50 executions. If that average is less than 4.5s, add 1% to the query size and reset the moving average. So now your app is requesting 18,180 rows every time. After 50 iterations, if the moving average is under 4.5s, add 1% again.
And if your moving average ever exceeds 4.75s, subtract 1%.
Over time, this method should converge on an optimized N-rows solution for your particular query/environment/etc., and should adjust (slowly but steadily) when conditions change (e.g. high concurrency vs. low concurrency).
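The bookkeeping behind that moving average can be as small as one stats table (names invented here); the 1% adjustments themselves live in your application code:
CREATE TABLE query_stats (
  id INT AUTO_INCREMENT PRIMARY KEY,
  executed_at DATETIME NOT NULL,
  duration_ms INT NOT NULL
);
-- After each run, record how long the LIMITed query took:
INSERT INTO query_stats (executed_at, duration_ms) VALUES (NOW(), 4210);
-- Moving average over the last 50 executions; the app adds 1% to its LIMIT
-- while this stays under 4500 ms and subtracts 1% when it exceeds 4750 ms:
SELECT AVG(duration_ms) AS avg_ms
FROM (SELECT duration_ms FROM query_stats ORDER BY id DESC LIMIT 50) AS recent;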
Just one -- scratch that, two -- more things...
As a DBA, I have to say...it should be exceedingly rare for any query to take more than 5 seconds. In particular, if it's a query that runs frequently and is used by the front end application, then it absolutely should not ever run for 5 seconds. If you really do have a user-facing query that can't complete in 5 seconds, that's a sign that the database design needs improvement.
Jonathan VM's Law of the Greenbar Report
I used to work for a company that still used a mainframe application that spit out reams of greenbar dot-matrix-printed reports every day. Most of these were ignored, and of the few that were used, most were never read beyond the first page. A report might have thousands of rows sorted by descending account age... and all that user needed was to see the 10 most aged. My law is this: the number of use cases that actually require seeing a vast number of rows is infinitesimally small. Think - really think - about the use case for your query, and whether having lots and lots of records is really what that user needs.
Your while loop idea won't solve the problem entirely. It is possible that the very first iteration through the loop could take longer than 5 seconds. Plus, it will likely result in retrieving far fewer rows in the allotted time than if you tried to do it with just a single query.
Personally, I wouldn't try to solve this exact problem. Instead, I would do some testing, and through trial and error identify a number of records that I am confident will load in under five seconds. Then, I would just place a LIMIT on the loading query.
Next, depending on the requirements I would either set a timeout on the DB call of five seconds or just live with the chance that some calls will exceed the time restriction.
Lastly, consider that on most modern hardware, most queries can return a very large number of records within five seconds. It's hard to imagine returning all of that data to the UI and still having it be usable, if that is your intention.
-Jason
I've never tried this, but if a script is running this query you could try running an unbuffered query (in PHP, this would be something like mysql_unbuffered_query())... you could then store the rows into an array while the query is running. You could then set the MySQL query timeout to five seconds. When the query is killed, if you've set your while() loop to check for a timeout response, it can terminate the loop and you'll have an array with all of the records returned within those five seconds. Again, I'm not sure this would work, but I'd be interested to see if it would accomplish what you're looking to do.
You could approach the problem like this, though I doubt this logic is really something I'd recommend for real-world use.
You have a 10 s interval; you try one query and it gets you the row in 0.1 s. That would imply you could fit at least 99 similar queries into the remaining 9.9 s.
However, issuing 99 queries at once should prove faster than getting them one by one (which is what your initial calculation would suggest). So you issue the 99 queries and check the time again.
Let's say the batch ran 1.5 times as fast as the single queries, because getting more rows at once is more efficient, leaving you with 100 rows after a total of 7.5 s. You calculate that, on average, you have so far gotten 100 rows per 7.5 s, work out a new number of queries that fit into the remaining time, query again, and so on. You would, however, need to set a threshold to end this loop, something like: don't issue any new queries after 9.9 s.
This solution is obviously neither the smoothest nor something I'd really use, but maybe it helps solve the OP's problem.
Also, jmacinnes already pointed out: "It is possible that the very first iteration through the loop could take longer than 10[5] seconds."
I'd certainly be interested myself, if someone can come up with a proper solution to this problem.
To get data from the table you have to do two things:
execute a query (SELECT something FROM table)
fetch the rows (read the data)
You are asking about the second one. I'm not that familiar with PHP, but I don't think it matters. We use fetching to get the first records quickly and show them to the user, then fetch more records as needed. In ADO.NET you could use IDataReader to get records one by one; in PHP I think you could use similar methods, for example mysqli_fetch_row in the mysqli extension or mysql_fetch_row in the mysql extension. That way you can stop reading data at any moment.

Is there any way to tell how long a MySQL query will take to run?

Other than running it to completion...
Is there some sort of progress you can monitor to see what operations are happening as your query is being processed that would give you a sense of how long it's going to take, or at least what step it is on, what steps have happened, and which remain?
If yes, would this same tool help you identify the part of your query that is taking the longest?
I'm trying to get a better sense for what makes some queries take longer than others.
It's called profiling : http://dev.mysql.com/tech-resources/articles/using-new-query-profiler.html
MySQL has one built-in for you. : )
You can use EXPLAIN to ask MySQL to show you why a query takes as long as it does.
http://dev.mysql.com/doc/refman/5.0/en/explain.html
The idea is that it'll show you things like which indexes it uses, etc, which will help you to then optimise either the query or the table to make it quicker.
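Using it is as simple as prefixing the statement (the table and column names here are just placeholders):
EXPLAIN SELECT c1, c2 FROM t1 WHERE c1 = 'foo';
-- Look at the key column (which index, if any, will be used) and the rows
-- column (a rough estimate of how many rows MySQL expects to examine).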
That's not a direct answer to your question, because it won't tell you real-time progress of the query, which is what you're asking for, but it does directly answer the last sentence of your question, and so it may actually be more useful to you for what you really want to know than any real-time progress report would be.

How do I use EXPLAIN to *predict* performance of a MySQL query?

I'm helping maintain a program that's essentially a friendly read-only front-end for a big and complicated MySQL database -- the program builds ad-hoc SELECT queries from users' input, sends the queries to the DB, gets the results, post-processes them, and displays them nicely back to the user.
I'd like to add some form of reasonable/heuristic prediction for the constructed query's expected performance -- sometimes users inadvertently make queries that are inevitably going to take a very long time (because they'll return huge result sets, or because they're "going against the grain" of the way the DB is indexed) and I'd like to be able to display to the user some "somewhat reliable" information/guess about how long the query is going to take. It doesn't have to be perfect, as long as it doesn't get so badly and frequently out of whack with reality as to cause a "cry wolf" effect where users learn to disregard it;-) Based on this info, a user might decide to go get a coffee (if the estimate is 5-10 minutes), go for lunch (if it's 30-60 minutes), kill the query and try something else instead (maybe tighter limits on the info they're requesting), etc, etc.
I'm not very familiar with MySQL's EXPLAIN statement -- I see a lot of information around on how to use it to optimize a query or a DB's schema, indexing, etc, but not much on how to use it for my more limited purpose -- simply make a prediction, taking the DB as a given (of course if the predictions are reliable enough I may eventually switch to using them also to choose between alternate forms a query could take, but, that's for the future: for now, I'd be plenty happy just to show the performance guesstimates to the users for the above-mentioned purposes).
Any pointers...?
EXPLAIN won't give you any indication of how long a query will take.
At best you could use it to guess which of two queries might be faster, but unless one of them is obviously badly written then even that is going to be very hard.
You should also be aware that if you're using sub-queries, even running EXPLAIN can be slow (almost as slow as the query itself in some cases).
As far as I'm aware, MySQL doesn't provide any way to estimate the time a query will take to run. Could you log the time each query takes to run, then build an estimate based on the history of past similar queries?
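A minimal sketch of that logging idea, assuming you add a small bookkeeping table and that your application computes some fingerprint (e.g. a hash of the normalized query text) for each query it builds:
CREATE TABLE query_history (
  fingerprint VARCHAR(64) NOT NULL,  -- hash of the normalized query text
  started_at  DATETIME NOT NULL,
  duration_s  DOUBLE NOT NULL,
  KEY (fingerprint)
);
-- Before running a newly built query, look up past runs of similar queries:
SELECT AVG(duration_s) AS estimated_seconds, COUNT(*) AS samples
FROM query_history
WHERE fingerprint = 'fingerprint-of-the-new-query';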
I think if you want to have a chance of building something reasonably reliable out of this, what you should do is build a statistical model out of table sizes and broken-down EXPLAIN result components correlated with query processing times. Trying to build a query execution time predictor based on thinking about the contents of an EXPLAIN is just going to spend way too long giving embarrassingly poor results before it gets refined to vague usefulness.
MySQL EXPLAIN has a column called key. If there is something in this column, that is a very good sign: it means that the query will use an index.
Queries that use indexes are generally safe, since they were likely thought out by the database designer when (s)he designed the database.
However
There is another field called Extra. This field sometimes contains the text Using filesort.
This is very, very bad. Despite the name, it means MySQL cannot use an index to produce the required ordering and has to sort the result set itself; if that result set is larger than the available sort buffer, it will spill the data to disk in order to sort it.
Conclusion
Instead of trying to predict the time a query takes, simply look at these two indicators. If a query is using filesort, deny the user. And depending on how strict you want to be, if the query is not using any keys, you should deny it as well.
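A hypothetical pair of EXPLAIN results to illustrate the two indicators (table, column, and index names are invented):
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- key: idx_customer_id, Extra: NULL        -> uses an index, generally safe
EXPLAIN SELECT * FROM orders ORDER BY total_amount DESC;
-- key: NULL, Extra: Using filesort         -> full scan plus sort, deny or warn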
Read more about the resultset of the MySQL EXPLAIN statement