How to improve a simple MySQL-Query - mysql

I have to run a rather simple query on a live system in order to get a count. The problem is that the table and database are rather inefficiently designed, and since it is a live system, altering it is not an option at this point.
So I need a query that runs fast and won't slow the system down too much. While the query executes, the system basically stops, which is not what I want a live system to do, so I need to streamline the query until it runs in an acceptable time.
SELECT id1, count(id2) AS count FROM table GROUP BY id1 ORDER BY count DESC;
So that is the query. Unfortunately it is so simple that I am out of ideas on how to improve it further. Maybe someone else has an idea ... ?

Application
Get "good enough" results via application changes:
If you have access to the application, but not the database, then there are possibilities:
Periodically run that slow query and capture the results. Then use the cached results.
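A minimal sketch of the "capture the results, then use the cached results" idea, assuming the application is written in Python. The `run_query` callable is hypothetical: it stands for whatever code executes the slow query and returns its rows. The `ttl` value and the injectable `clock` are illustration choices, not something from the original answer.

```python
import time
from typing import Callable, Optional


class CachedQuery:
    """Serve a slow query's result from memory, refreshing at most every `ttl` seconds."""

    def __init__(self, run_query: Callable[[], list], ttl: float,
                 clock: Callable[[], float] = time.monotonic):
        self._run_query = run_query  # hypothetical: executes the slow query, returns rows
        self._ttl = ttl
        self._clock = clock          # injectable so the refresh logic can be tested
        self._result: Optional[list] = None
        self._fetched_at = 0.0

    def get(self) -> list:
        now = self._clock()
        # Re-run the slow query only on first use or once the cached copy is too old.
        if self._result is None or now - self._fetched_at >= self._ttl:
            self._result = self._run_query()
            self._fetched_at = now
        return self._result
```

Readers then call `get()` as often as they like, and the database sees the slow query at most once per `ttl` interval. The results can be up to `ttl` seconds stale, which is the "good enough" trade-off.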
Do you need all
What is the goal? Find a few of the most common id1's? Rank all of them?
Back to the query
COUNT(id2) checks for id2 being not null; this is usually unnecessary, so COUNT(*) is better. However, the speedup is insignificant.
ORDER BY NULL is irrelevant if you are picking off the rows with the highest COUNT -- the sort needs to be done somewhere. Moving it to the application does not help; at least not much.
Adding LIMIT 10 would only help because of cutting down on the time to send the data back to the client.
INDEX(id1) is the best index for the query (after changing to COUNT(*)). But the operation still requires:
a full index scan to do the COUNT and GROUP BY
a sort of the grouped results, for the ORDER BY
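To make the COUNT(id2) versus COUNT(*) point concrete, here is a small sketch. It uses Python's built-in sqlite3 module as a stand-in, not MySQL, and a made-up table `t` with made-up rows. It relies only on standard SQL behaviour: COUNT(column) skips NULLs, COUNT(*) does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id1 INTEGER, id2 INTEGER)")
# Made-up data: one row for id1 = 1 has a NULL id2.
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, 10), (1, None), (1, 12), (2, 20)])
conn.execute("CREATE INDEX idx_id1 ON t (id1)")

# COUNT(id2) ignores rows where id2 IS NULL.
count_col = conn.execute(
    "SELECT id1, COUNT(id2) AS cnt FROM t GROUP BY id1 ORDER BY cnt DESC"
).fetchall()

# COUNT(*) counts every row in the group.
count_star = conn.execute(
    "SELECT id1, COUNT(*) AS cnt FROM t GROUP BY id1 ORDER BY cnt DESC"
).fetchall()

print(count_col)   # [(1, 2), (2, 1)]
print(count_star)  # [(1, 3), (2, 1)]
```

If id2 can never be NULL, both forms return the same numbers, and the switch to COUNT(*) only saves the per-row NULL check.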
Zero or near-zero downtime
Do you have replication established? Galera Clustering?
Look into pt-online-schema-change and gh-ost.
What is the real goal?
We cannot fix the query as written. What things can we change? Better yet, what is the ultimate goal? Perhaps there is an approach that does not involve any query that looks the least bit like the one you are trying to speed up.

I have now dumped the table, imported it into a MySQL Docker container, and run the query there. It took ages, and I actually had to move my entire Docker setup because the dump was so huge, but in the end I got my results, and now I know how many id2s are associated with specific id1s (apostrophe to form a plural? You may want to double-check that ;) ).
As was already pointed out, there wasn't much room for improvement left in the query.
FYI, suddenly the concern about stopping the system was gone, and now we are indexing the table. So far it has taken 6 hours, with no end in sight :D
Anyways, thanks for the help everyone.

Related

AWS Athena Timing out / not enough resources at this scale factor

I ran into a problem with athena when I was trying to join and order two tables together. My query statement looks very similar to this:
SELECT *
FROM Table_1
LEFT JOIN Table_2
ON Table_1.id = Table_2.id AND Table_1.date = Table_2.date
ORDER BY Table_1.id, Table_1.date
My tables are potentially big depending on the dataset I am working with, with about a million rows or more. After doing some research, I realized that the ORDER BY could be slowing down my query, but even when I take it out, it still times out. At the same time, I need the ORDER BY to structure my data because I will be turning this into a CSV file. I have also read that I could split my query up in order to use different workers and take advantage of Athena's ability to do parallel work, but I don't know exactly how to do that in Athena, so if someone could explain how that could be done, that would be perfect. Another thing I was thinking of doing was partitioning my data based on columns, but I would like someone to explain the benefits of doing that, since I won't be selecting only a portion of my table, but the whole table every time.
I don't know if this is relevant, but my file sizes are usually around 100 MB or less. However, the other posts I see on here about the same problem are dealing with more than 10 GB, so I'm not sure if there is something fundamentally wrong with my use of Athena.
Edit: I was thinking of paginating my queries to see if that could fix my issue, such as using offset and limit in a loop and just appending the data together. Would that be a viable solution?
After doing some more tests on my code to see at what payload size it breaks, I realized that my total number of rows was waaaaay more than it should be. I found out that my select statement was not actually what I portrayed in my post: it was missing the AND Table_1.date = Table_2.date part due to a bug in the construction of the query (because I am creating it conditionally). This multiplied the number of rows by up to 10x, as far as I could tell, which is what was screwing up my query and eating up all of Athena's resources. So everything works fine now. However, I will leave this post up, mostly for learning purposes, to see if any of the questions about this potential problem get answered.
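The row multiplication described above is easy to reproduce on a small scale. The sketch below uses Python's sqlite3 module rather than Athena, with two made-up tables holding three dates for each of two ids. Joining on id alone pairs every date with every other date of the same id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_1 (id INTEGER, date TEXT, a TEXT);
CREATE TABLE table_2 (id INTEGER, date TEXT, b TEXT);
""")
# Made-up data: 2 ids x 3 dates = 6 rows in each table.
rows = [(i, d) for i in (1, 2)
        for d in ("2020-01-01", "2020-01-02", "2020-01-03")]
conn.executemany("INSERT INTO table_1 VALUES (?, ?, 'a')", rows)
conn.executemany("INSERT INTO table_2 VALUES (?, ?, 'b')", rows)

# Buggy join condition: the date comparison is missing.
id_only = conn.execute(
    "SELECT COUNT(*) FROM table_1 "
    "LEFT JOIN table_2 ON table_1.id = table_2.id"
).fetchone()[0]

# Intended join condition: id and date must both match.
id_and_date = conn.execute(
    "SELECT COUNT(*) FROM table_1 "
    "LEFT JOIN table_2 ON table_1.id = table_2.id "
    "AND table_1.date = table_2.date"
).fetchone()[0]

print(id_only, id_and_date)  # 18 6
```

With k rows per id on each side, the id-only join returns k times as many rows as intended, so a quick COUNT(*) on the join is a cheap sanity check before exporting anything.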

How to reset MySQL performance reports?

I'm trying to use the MySQL performance reports to find what is bottlenecking my read and write performance under special situations, but they are cluttered up with loads of old statistics from other queries against my tables.
I want to clear all the performance data so I can get a fresh look.
The Clear Event Tables button doesn't actually seem to clear anything.
How do I do this?
Using MySQL Workbench
Go to Performance schema setup
Click Clear Event Tables
Refresh Reports page. All events will be cleared
(This does not answer the question as asked. Instead, it addresses the broader question of "how do I improve query performance".)
Here's a simple-minded way to get useful metrics, even for fast queries:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
The numbers may match the table size or the resultset size. These are good indications of a table scan and of some action (e.g., a sort) on the resultset.
A table scan probably means that no index was being used. Figuring out why is something that metrics probably can't tell you. But, sometimes there is a way to find that out:
EXPLAIN FORMAT=JSON SELECT ...
This will give you (in newer versions) the "cost" of various options. This may help you understand why one index is used over another.
optimizer_trace is another tool.
But nothing will give you any clue that INDEX(a,b), which you don't have, would be better than INDEX(a), which you do have. And that is one of the main points in my index cookbook.
Here's another example of what is hard to deduce from numbers. There was a production server with MySQL chewing up 100% of the CPU. The slowlog pointed to a simple SELECT that was being performed a lot. It had
WHERE DATE(indexed_column) = '2011-11-11'
Changing that dropped the CPU to a mere 2%:
WHERE indexed_column >= '2011-11-11'
AND indexed_column < '2011-11-11' + INTERVAL 1 DAY
The table was fully cached in RAM. (Hence, high CPU, low I/O.) The query had to do a full table scan, applying the DATE function to that indexed column for each row. After changing the code, the index did what it is supposed to do.
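The same effect can be observed in miniature. The sketch below is an illustration under assumptions: it uses Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN instead of MySQL, a made-up `events` table, SQLite's date() in place of MySQL's DATE(), and a literal next-day bound in place of the INTERVAL arithmetic. The planner output format differs from MySQL's, but the principle is the same: wrapping the indexed column in a function forces a scan, while a half-open range on the bare column lets the index narrow the search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created TEXT)")
conn.execute("CREATE INDEX idx_created ON events (created)")


def plan(sql):
    # The fourth column of each EXPLAIN QUERY PLAN row is the readable detail text.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


# Function applied to the indexed column: every row has to be visited.
wrapped = plan("SELECT id FROM events WHERE date(created) = '2011-11-11'")

# Half-open range on the bare column: the index is used to seek.
ranged = plan("SELECT id FROM events "
              "WHERE created >= '2011-11-11' AND created < '2011-11-12'")

print(wrapped)  # contains "SCAN"
print(ranged)   # contains "SEARCH" and the index name
```

The exact wording of the plan text varies between SQLite versions, so only the SCAN versus SEARCH distinction should be relied on.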

MySQL code locks up server

I have code that I have tested on the same table, but with only a few records in it.
It worked great on a handful of (30) records. It did exactly what I wanted it to do.
When I added 200 records to the table, it locked up. I have to restart Apache, and I have tried waiting forever for it to finish.
I could use some help figuring out why.
My table has the proper indexes and I am not having trouble in any other way.
Thanks in advance.
UPDATE `base_data_test_20000_rows` SET `NO_TOP_RATING` =
  (SELECT COUNT(`ID`) FROM `base_data_test_20000_rows_2`
   WHERE
     `base_data_test_20000_rows_2`.`ID` != `base_data_test_20000_rows`.`ID`
     AND
     `base_data_test_20000_rows_2`.`ANALYST` = `base_data_test_20000_rows`.`ANALYST`
     AND
     `base_data_test_20000_rows_2`.`IRECCD` =
       (SELECT COUNT(`ID`) FROM `base_data_test_20000_rows_2`
        WHERE `IRECCD` =
          (SELECT MIN(`IRECCD`) FROM `base_data_test_20000_rows_2`
           WHERE
             `base_data_test_20000_rows_2`.`ANNDATS_CONVERTED` >= DATE_SUB(`base_data_test_20000_rows`.`ANNDATS_CONVERTED`,INTERVAL 1 YEAR)
             AND
             `base_data_test_20000_rows_2`.`ID` != `base_data_test_20000_rows`.`ID`
             AND
             `base_data_test_20000_rows_2`.`ESTIMID` = `base_data_test_20000_rows`.`ESTIMID`
          )
       )
  )
WHERE `base_data_test_20000_rows`.`ANALYST` != ''
The code is just meant to look a year back for a particular brokerage, get the lowest rating, count the number of times that analyst had the lowest rating, and write that value to the NO_TOP_RATING column.
I'm pretty sure I was wrong with my original suggestion: changing the SELECT COUNT to its own number won't help, since you have conditions on your query.
This is merely a hackish solution. The real way to solve this would be to optimize your query. But as a workaround you could set the record count to a MySQL variable, and then reference that variable in the query.
This means that you will have to make sure you set the variable before you run the query. It also means that if records are added between the time you set the variable and the time the query finishes, you will not have the right count.
http://dev.mysql.com/doc/refman/5.0/en/user-variables.html
further thoughts:
I took a closer look before submitting this answer. That might not actually be possible, since you have WHERE clauses that are individualized to each record.
It's slow because you are using a query that counts within a query that counts within a query that has a MIN. For every row you update, you are effectively iterating through every row again, three levels deep. That is polynomial (roughly cubic) growth rather than linear: if the database has 10 records, you are possibly going through records 10^3-ish times. At the number of rows you have, it's hellish.
I'm sure that there is a way to do what you are trying to do, but I can't actually tell what you are trying to do.
I would have to agree with DRapp that seeing dummy data could help us analyze what's really going on.
Since I can't wrap my head around it all, what I would try, without fully understanding what you are doing, would be to create a view for each of your subqueries and then run a query on those. http://dev.mysql.com/doc/refman/5.0/en/create-view.html
That probably won't remove the redundancy, but it might help with the speed. Since I don't fully understand what you are doing, it's probably not the best answer.
Another not-so-good answer: if you aren't running this on a mission-critical DB and it can go offline while you run the query, you could just change your MySQL settings, let this query run for those hours you quoted, and hope it doesn't crash. But that seems less than ideal, as I have no idea whether it requires additional disk space or memory to perform.
So really the best answer I can give you at this point is to see if you can approach your problem from a different angle. Or post some dummy data showing what the info in Base_data_test_20000_rows looks like and what you expect it to look like after the query runs.
-Hope that helps point you in the right direction

How can I select some amount of rows, like "get as many rows as possible in 5 seconds"?

The aim is to get as many rows as possible within 5 seconds, and no more rows than have been loaded by then. The aim is not to create a timeout.
After months, I thought maybe this would work, but it didn't:
declare @d1 datetime2(7); set @d1=getdate();
select c1,c2 from t1 where (datediff(ss,@d1,getdate())<5)
Although the trend in recent years for relational databases has moved more and more toward cost-based query optimization, there is no RDBMS I am aware of that inherently supports designating a maximum cost (in time or I/O) for a query.
The idea of "just let it time out and use the records collected so far" is a flawed solution. The flaw lies in the fact that a complex query may spend the first 5 seconds performing a hash on a subtree of the query plan, to generate data that will be used by a later part of the plan. So after 5 seconds, you may still have no records.
To get the most records possible in 5 seconds, you would need a query that had a known estimated execution plan, which could then be used to estimate the optimal number of records to request in order to make the query run for as close to 5 seconds as possible. In other words, knowing that the query optimizer estimates it can process 875 records per second, you could request 4,375 records. The query might run a bit longer than 5 seconds sometimes, but over time your average execution should fall close to 5 seconds.
So...how to make this happen?
In your particular situation, it's not feasible. The catch is "known estimated execution plan". To make this work reliably, you'd need a stored procedure with a known execution plan, not an ad-hoc query. Since you can't create stored procedures in your environment, that's a non-starter. For others who want to explore that solution, though, here's an academic paper by a team who implemented this concept in Oracle. I haven't read the full paper, but based on the abstract it sounds like their work could be translated to any RDBMS that has cost-based optimization (e.g. MS SQL, MySQL, etc.)
OK, So what can YOU do in your situation?
If you can't do it the "right" way, solve it with a hack.
My suggestion: keep your own "estimated cost" statistics.
Do some testing in advance and estimate how many rows you can typically get back in 4 seconds. Let's say that number is 18,000.
So you LIMIT your query to 18,000 rows. But you also track the execution time every time you run it and keep a moving average of, say, the last 50 executions. If that average is less than 4.5s, add 1% to the query size and reset the moving average. So now your app is requesting 18,180 rows every time. After 50 iterations, if the moving average is under 4.5s, add 1% again.
And if your moving average ever exceeds 4.75s, subtract 1%.
Over time, this method should converge to an optimized N-rows solution for your particular query/environment/etc. And should adjust (slowly but steadily) when conditions change (e.g. high-concurrency vs low-concurrency)
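The feedback loop described above can be sketched as a small controller. This is one possible reading of the scheme, not a definitive implementation. The class name `AdaptiveLimit` and its parameters are made up for illustration. Resetting the window after a decrease is my assumption (the text only mentions resetting after an increase). The caller is assumed to time each query itself and pass the duration in.

```python
from collections import deque


class AdaptiveLimit:
    """Nudge a row LIMIT up or down by 1% based on a moving average of run times."""

    def __init__(self, initial_rows=18_000, window=50,
                 grow_below=4.5, shrink_above=4.75, step=0.01):
        self.rows = initial_rows
        self.window = window
        self.grow_below = grow_below      # seconds: average under this grows the limit
        self.shrink_above = shrink_above  # seconds: average over this shrinks the limit
        self.step = step
        self._times = deque(maxlen=window)

    def record(self, seconds: float) -> int:
        """Record one execution time and return the LIMIT to use for the next run."""
        self._times.append(seconds)
        avg = sum(self._times) / len(self._times)
        if avg > self.shrink_above:
            # Too slow: back off right away (window reset here is an assumption).
            self.rows = max(1, round(self.rows * (1 - self.step)))
            self._times.clear()
        elif len(self._times) == self.window and avg < self.grow_below:
            # A full window of comfortably fast runs: ask for 1% more rows.
            self.rows = round(self.rows * (1 + self.step))
            self._times.clear()
        return self.rows
```

After 50 runs averaging under 4.5s the limit moves from 18,000 to 18,180, matching the numbers in the text. Because shrinking is checked on every sample, a single very slow run right after a reset lowers the limit immediately; requiring a minimum number of samples first would make it less jumpy.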
Just one -- scratch that, two -- more things...
As a DBA, I have to say...it should be exceedingly rare for any query to take more than 5 seconds. In particular, if it's a query that runs frequently and is used by the front end application, then it absolutely should not ever run for 5 seconds. If you really do have a user-facing query that can't complete in 5 seconds, that's a sign that the database design needs improvement.
Jonathan VM's Law Of The Greenbar Report
I used to work for a company that still used a mainframe application that spit out reams of greenbar dot-matrix-printed reports every day. Most of these were ignored, and of the few that were used, most were never read beyond the first page. A report might have thousands of rows sorted by descending account age...and all that user needed was to see the 10 most aged. My law is this: The number of use cases that actually require seeing a vast number of rows is infinitesimally small. Think - really think - about the use case for your query, and whether having lots and lots of records is really what that user needs.
Your while loop idea won't solve the problem entirely. It is possible that the very first iteration through the loop could take longer than 5 seconds. Plus, it will likely result in retrieving far fewer rows in the allotted time than if you tried to do it with just a single query.
Personally, I wouldn't try to solve this exact problem. Instead, I would do some testing, and through trial and error identify a number of records that I am confident will load in under five seconds. Then, I would just place a LIMIT on the loading query.
Next, depending on the requirements I would either set a timeout on the DB call of five seconds or just live with the chance that some calls will exceed the time restriction.
Lastly, consider that on most modern hardware for most queries, you can return a very large number of records within five seconds. It's hard to imagine returning all of that data to the UI and still have it be usable, if that is your intention.
-Jason
I've never tried this, but if a script is running this query you could try running an unbuffered query (in PHP, this would be something like mysql_unbuffered_query()) and store the rows in an array while the query is running. You could then set the MySQL query timeout to five seconds. When the query is killed, if your while() loop checks for a timeout response, it can terminate and you'll have an array with all of the records returned in 5 seconds. Again, I'm not sure this would work, but I'd be interested to see if it would accomplish what you're looking to do.
You could approach this problem like this, but I doubt that this logic is really what I'd recommend for real world use.
Say you have a 10s interval. You try one query, and it gets you one row in 0.1s. That would imply you could fetch at least 99 more rows the same way in the remaining 9.9s.
However, getting 99 rows at once should prove faster than getting them one by one (which is what your initial calculation assumes). So you fetch the 99 rows in one query and check the time again.
Let's say the operation performed 1.5 times as fast as the single query, because getting more rows at once is more efficient, leaving you with 100 rows after about 6.7s (0.1s + 9.9s / 1.5). You calculate that on average you have so far gotten 100 rows per 6.7s, calculate a new number of rows to request for the rest of the time, query again, and so on. You would, however, need to set a cutoff for this loop, something like: don't start any new queries after 9.9s.
This solution obviously is neither the most smooth nor something I'd really use, but maybe it serves to solve the OP's problem.
Also, jmacinnes already pointed out: "It is possible that the very first iteration through the loop could take longer than 10[5] seconds."
I'd certainly be interested myself, if someone can come up with a proper solution to this problem.
To get data from the table you should do two things:
execute a query (SELECT something FROM table)
fill the table or read data
You are asking about the second one. I'm not that familiar with PHP, but I think it does not matter. We use fetching to get the first records quickly and show them to the user, then fetch more records as needed. In ADO.NET you could use IDataReader to get records one by one; in PHP I think you could use similar methods, for example mysqli_fetch_row in the mysqli extension or mysql_fetch_row in the mysql extension. In this case you could stop reading data at any moment.
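The "stop reading at any moment" idea can be sketched as follows, assuming a Python DB-API cursor that has already executed the query. The function name `fetch_until` and its parameters are made up for illustration. It only bounds the fetch phase: if executing the query itself blocks before the first row is available, as discussed in the other answers, this loop cannot help.

```python
import time


def fetch_until(cursor, budget_seconds, clock=time.monotonic, batch_size=100):
    """Read rows from an already-executed DB-API cursor until the time budget runs out."""
    deadline = clock() + budget_seconds
    rows = []
    while clock() < deadline:
        batch = cursor.fetchmany(batch_size)
        if not batch:  # result set exhausted before the deadline
            break
        rows.extend(batch)
    return rows
```

A possible call would be `fetch_until(cur, 5.0)` right after `cur.execute(...)`. A smaller `batch_size` checks the clock more often at the cost of more round trips, and the last batch can overrun the budget by however long one `fetchmany` call takes.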

How do I use EXPLAIN to *predict* performance of a MySQL query?

I'm helping maintain a program that's essentially a friendly read-only front-end for a big and complicated MySQL database -- the program builds ad-hoc SELECT queries from users' input, sends the queries to the DB, gets the results, post-processes them, and displays them nicely back to the user.
I'd like to add some form of reasonable/heuristic prediction for the constructed query's expected performance -- sometimes users inadvertently make queries that are inevitably going to take a very long time (because they'll return huge result sets, or because they're "going against the grain" of the way the DB is indexed) and I'd like to be able to display to the user some "somewhat reliable" information/guess about how long the query is going to take. It doesn't have to be perfect, as long as it doesn't get so badly and frequently out of whack with reality as to cause a "cry wolf" effect where users learn to disregard it;-) Based on this info, a user might decide to go get a coffee (if the estimate is 5-10 minutes), go for lunch (if it's 30-60 minutes), kill the query and try something else instead (maybe tighter limits on the info they're requesting), etc, etc.
I'm not very familiar with MySQL's EXPLAIN statement -- I see a lot of information around on how to use it to optimize a query or a DB's schema, indexing, etc, but not much on how to use it for my more limited purpose -- simply make a prediction, taking the DB as a given (of course if the predictions are reliable enough I may eventually switch to using them also to choose between alternate forms a query could take, but, that's for the future: for now, I'd be plenty happy just to show the performance guesstimates to the users for the above-mentioned purposes).
Any pointers...?
EXPLAIN won't give you any indication of how long a query will take.
At best you could use it to guess which of two queries might be faster, but unless one of them is obviously badly written then even that is going to be very hard.
You should also be aware that if you're using sub-queries, even running EXPLAIN can be slow (almost as slow as the query itself in some cases).
As far as I'm aware, MySQL doesn't provide any way to estimate the time a query will take to run. Could you log the time each query takes to run, then build an estimate based on the history of past similar queries?
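A rough sketch of the "log the time, then estimate from similar past queries" suggestion. Everything here is an assumption for illustration: "similar" is taken to mean "same text once literals are replaced by placeholders", the estimate is the median of past runs, and the names `query_shape` and `QueryTimeHistory` are made up. The regular expression only handles single-quoted strings and plain numbers, so it is not a real SQL parser.

```python
import re
import statistics
from collections import defaultdict

# Single-quoted strings (with '' escapes) or standalone numbers.
_LITERAL = re.compile(r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b")


def query_shape(sql: str) -> str:
    """Collapse literals and whitespace so similar queries share one key."""
    return " ".join(_LITERAL.sub("?", sql).split()).lower()


class QueryTimeHistory:
    def __init__(self):
        self._durations = defaultdict(list)

    def record(self, sql: str, seconds: float) -> None:
        """Store the measured run time under the query's shape."""
        self._durations[query_shape(sql)].append(seconds)

    def estimate(self, sql: str):
        """Median of past runs with the same shape, or None if never seen."""
        past = self._durations.get(query_shape(sql))
        return statistics.median(past) if past else None
```

Two queries that differ only in their constants can still have very different run times (a one-day range versus a ten-year range), so the estimate is a guess and the shape key may need to include coarse information about the constants.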
I think if you want to have a chance of building something reasonably reliable out of this, what you should do is build a statistical model out of table sizes and broken-down EXPLAIN result components correlated with query processing times. Trying to build a query execution time predictor based on thinking about the contents of an EXPLAIN is just going to spend way too long giving embarrassingly poor results before it gets refined to vague usefulness.
MySQL EXPLAIN has a column called key. If there is something in this column, that is a very good indication: it means that the query will use an index.
Queries that use indices are generally safe to use, since they were likely thought out by the database designer when (s)he designed the database.
However
There is another field called Extra. This field sometimes contains the text Using filesort.
This is often bad. Despite the name, it does not necessarily mean the sort goes to disk; it means MySQL cannot use an index to satisfy the ordering and has to sort the result set in an extra pass, which spills to temporary files on disk when the data does not fit in the sort buffer.
Conclusion
Instead of trying to predict the time a query takes, simply look at these two indicators. If a query shows Using filesort, deny the user. And depending on how strict you want to be, if the query is not using any keys, you should also deny it.
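The rule above could be sketched as a small check over EXPLAIN output. This assumes the front-end is in Python and that each EXPLAIN row arrives as a dict keyed by column name, as a dict-style cursor would return it. The function name `rejection_reasons` is made up, and the sample rows in the usage note are hand-written for illustration, not real server output. Note that the text MySQL puts in the Extra column is "Using filesort".

```python
def rejection_reasons(explain_rows, require_index=True):
    """Return the reasons to refuse a query; an empty list means it may run.

    `explain_rows` is assumed to be a list of dicts, one per EXPLAIN row,
    with at least the "table", "key" and "Extra" columns.
    """
    reasons = []
    for row in explain_rows:
        table = row.get("table")
        extra = row.get("Extra") or ""  # Extra can be NULL
        if "Using filesort" in extra:
            reasons.append(f"{table}: Using filesort")
        if require_index and row.get("key") is None:
            reasons.append(f"{table}: no index used")
    return reasons
```

For example, a hand-written row such as `{"table": "orders", "key": None, "Extra": "Using where; Using filesort"}` yields both reasons, while `{"table": "orders", "key": "PRIMARY", "Extra": None}` yields none. Passing `require_index=False` gives the less strict variant mentioned in the conclusion.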
Read more about the resultset of the MySQL EXPLAIN statement