I apologize if this question is too general. I can post example code, but it may not be reproducible because there is no access to the actual database.
Suppose I have a big MySQL query with a lot of joins and unions, as well as functions like CONCAT, DATE, and various time and date conversion functions. It also uses a lot of tables, nested SELECT queries, etc. Let's suppose it is a SELECT query.
My question is: where would one start if they need to optimize this script to run faster? Currently it's taking hours to complete. Furthermore, if I run stand-alone chunks of code from it (including some nested queries, etc.) they run much faster. Therefore there are one or a few bottlenecks. Perhaps certain tables are not indexed properly.
I am aware of profiling and benchmarking, as well as the EXPLAIN functionality in MySQL. They all help us understand what MySQL does behind the scenes, but they all provide a summary for the entire script. What would be the best way to identify these bottlenecks without profiling each portion of the script separately? Are there best practices when faced with such a problem?
Again, I apologize for asking a question that may be too broad. I can post example code, but it may not be reproducible because there is no access to the actual database.
After using EXPLAIN and making sure proper indexes are in place, I would run the query on a subset of the data so it completes in seconds (much easier when tweaking the query).
I would run each subquery individually first and note how long it takes. Then run the query that calls that subquery/derived table and see how long it takes. Comment out some subqueries and see how the rest performs. Soon you will get a picture of which parts are your bottlenecks.
Then I would start experimenting with different techniques. Perhaps using a temporary table first, or maybe running a daily cron job that summarizes the data for me.
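For illustration, a minimal sketch of that workflow against hypothetical orders and customers tables (not from the original question):

```sql
-- Inspect the plan of the full query to spot full scans and missing keys
EXPLAIN
SELECT c.country, d.total
FROM customers c
JOIN (SELECT customer_id, SUM(total) AS total
      FROM orders
      WHERE created_at >= '2015-01-01'
      GROUP BY customer_id) AS d ON d.customer_id = c.id;

-- Time the derived table on its own, then the outer query,
-- then the outer query with this part commented out
SELECT customer_id, SUM(total) AS total
FROM orders
WHERE created_at >= '2015-01-01'
GROUP BY customer_id;
```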
Therefore there are one or a few bottlenecks. Perhaps certain tables are not indexed properly.
This sounds like something you could solve using EXPLAIN?
I don't use MySQL, but this is a fairly software-agnostic problem. Assuming that you are already doing some of the "right" things, such as filtering on indexed fields, there are two steps that might help.
First - move the code to a stored procedure. The benefit of this is that the code needs to be compiled only once. If your current query is not run frequently, it has to be compiled every time it runs, which takes time.
Second - use temporary tables. While it's not intuitive, I have found that this often improves execution time dramatically.
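A minimal, MySQL-flavored sketch of both suggestions combined (hypothetical procedure and table names; the answer above is intentionally database-agnostic):

```sql
DELIMITER //
CREATE PROCEDURE report_totals_by_country()
BEGIN
  -- Stage the heavy intermediate result in a temporary table
  DROP TEMPORARY TABLE IF EXISTS tmp_totals;
  CREATE TEMPORARY TABLE tmp_totals AS
  SELECT customer_id, SUM(total) AS total
  FROM orders
  GROUP BY customer_id;

  ALTER TABLE tmp_totals ADD INDEX (customer_id);

  -- The final join runs against the much smaller staged table
  SELECT c.country, SUM(t.total) AS total
  FROM customers c
  JOIN tmp_totals t ON t.customer_id = c.id
  GROUP BY c.country;
END //
DELIMITER ;

CALL report_totals_by_country();
```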
I see this error in phpMyAdmin:
The number of joins that do not use indexes
but I never use JOIN in my code.
Now I want to get a list of the queries that do not use indexes.
How can I get this list?
I tried enabling the slow query log, but I cannot tell which queries do not use indexes.
Can someone guide me?
There is no list of "joins not using indexes".
Certain admin queries use VIEWs that have JOINs; possibly that is where they came from.
There are valid cases in which not using an index is OK for a JOIN. A simple example is joining two tables where one of the tables has so few rows that an index would not help with performance.
The slowlog provides something more important -- a list of the "worst" queries (from a performance point of view). Any "slow" query with JOIN that needs an index will show up, even without that setting turned on. (I prefer to turn off that option, since it clutters the slowlog with unexciting queries.)
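For reference, a sketch of the server settings involved here, assuming the option being discussed is log_queries_not_using_indexes (the MySQL switch that adds index-less queries to the slow log):

```sql
-- Enable the slow query log and set the "slow" threshold in seconds
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';  -- hypothetical path
SET GLOBAL long_query_time = 2;

-- Optional: also log queries that use no index at all.
-- This is the setting that tends to clutter the log, as noted above.
SET GLOBAL log_queries_not_using_indexes = 'ON';
```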
I'll briefly mention that this isn't an error; it's not even really a warning. The Advisor tab makes broad, generic performance suggestions that are meant to guide you towards optimizing your system. Having some suggestions that aren't fixable or don't apply to your situation is pretty normal. In fact, my test system gives the same advice about the join without an index.
As Rick James alluded to, these queries might not come directly from code that you write; some administrative tasks may be triggering them (your operating system might run some housekeeping queries, MySQL runs some queries against itself, etc.).
The best way to look at the queries is to log them, which is answered in Log all queries in mysql. You could use the "Process" sub-tab of the "Status" area in phpMyAdmin (very similar to how you get to the Advisor tab) to look at active queries, but that's not going to give you a complete picture over time. Once you have that list of queries, you can analyze them to determine whether there are improvements you can make to the code.
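A minimal sketch of turning the general query log on temporarily, which captures every statement the server receives (hypothetical file path; switch it off once you have enough data, since it grows quickly):

```sql
SET GLOBAL general_log_file = '/var/log/mysql/general.log';
SET GLOBAL general_log = 'ON';

-- ...reproduce the workload, then:
SET GLOBAL general_log = 'OFF';
```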
Just interested, maybe someone might know this. If I use lazy loading to get all attributes, relations and so on, it makes ~350 queries to the database and takes about 2 sec to fully render the page. If I make one big query with multiple joins to get all the relations I need, it makes ~20 queries, one of which is really big. The problem is that this big query takes about 10 sec to execute the first time; after that it gets cached and goes much faster, and the whole page loads in ~1.5 sec. But every user has different parameters for that query, so for every user the first run takes 10 sec. Why does it take so long the first time?
May I ask if you are using a stored procedure? I have added a link with some advantages of using stored procedures: https://docs.oracle.com/cd/F49540_01/DOC/java.815/a64686/01_intr3.htm . Can you give some examples of your parameters for different users?
Thanks
As you gave no information on the database schema, the data size and other parameters, it is very difficult to determine the root cause of the bad performance. However, there is another answer here on Stack Overflow that might be a great starting point for further investigation.
In general, consider the following questions to start investigating / optimizing:
Do you really need all the information you fetch from the DB (at once)?
Is the database optimized for the queries you execute?
How often do you need to execute the queries, and if you cache the results, how often does the cache become outdated?
I have to perform some serious data mining on very large data sets stored in a MySQL DB. However, queries that require a bit more than a basic SELECT * FROM X WHERE ... tend to become rather inefficient, since they return results on the order of 10e6 rows or more, especially when a JOIN on one or more tables is introduced - think of joining 2 or more tables containing several tens of millions of rows (after filtering data), which is something that happens on pretty much every query. More often than not we'd like to run aggregate functions on these (SUM, AVG, COUNT, etc.), but this is impossible since MySQL simply chokes.
I should note that a lot of effort has gone into optimizing the current performance - all tables are indexed properly, queries are tuned, the hardware is top notch, the storage engine is configured, and so on. However, each query still takes a very long time - to the point of "let's run it before we go home and hope for the best when we come to work tomorrow." Not good.
This has to be a solvable problem - many large companies perform very data- and computation-intensive mining and handle it well (without writing their own storage engines, Google). I'm willing to accept a time penalty to get the job done, but on the order of hours, not days. My question is: what do people use to counter problems like this? I've heard of storage engines geared to this type of problem (Greenplum, etc.), but I wanted to hear how this problem is typically approached. Our current data store is obviously relational and should probably remain so, but any thoughts or suggestions are welcome. Thanks.
I suggest PostgreSQL, which I've been working with quite successfully on tables with ~0.5B rows that required some complex join operations. Oracle should be good for that too, but I don't have much experience with it.
It should be noted that switching RDBMSs isn't a magic solution. If you want to scale to those sizes, there's a LOT of hard work to be done in optimizing your queries, optimizing the database structure and indexes, fine-tuning the database configuration, using the right hardware for your usage, replication, using materialized views (which are extremely powerful when used correctly; see here and here - it's Postgres-specific, but applies to other RDBMSs too)... and at some point, you just have to throw more money at the problem.
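For context, a minimal PostgreSQL sketch of a materialized view with hypothetical table and column names; the aggregate is computed once and refreshed on your own schedule instead of on every query:

```sql
CREATE MATERIALIZED VIEW order_totals_by_customer AS
SELECT customer_id, COUNT(*) AS order_count, SUM(total) AS total
FROM orders
GROUP BY customer_id;

CREATE INDEX ON order_totals_by_customer (customer_id);

-- Cheap to query; refresh when the underlying data has changed enough to matter
REFRESH MATERIALIZED VIEW order_totals_by_customer;
```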
We have used MS SQL Server to run analytics on financial data with tens of millions of rows and more, using complex JOINs and aggregation. Several things we have done, other than what you have mentioned, are listed below (a rough sketch follows the list):
We chunk the calculation into a lot of temporary tables instead of using sub-queries. We then apply proper keys, indexing and so on to these tables from the code. The same query written with sub-queries simply fails for us.
In the temporary tables, we often apply a clustered index that makes sense for us. Note that these temporary tables hold filtered results, so applying the index on the fly is not expensive compared to using a sub-query in place of the temporary table. Note that I am speaking from our experience, and this might not apply to all cases.
As we also run a lot of aggregation functions, we did a lot of indexing on the grouping columns.
We do a lot of query planning using SQL Query Analyzer, which shows us the execution plan. Based on the plan, we revise the query and change the indexes.
We provide hints to SQL Server where we think they could help the execution, such as the choice of JOIN algorithm (hash, merge or nested loops).
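A rough T-SQL sketch of the temporary-table approach described in this list, with hypothetical table names rather than the actual financial schema:

```sql
-- Stage the filtered subset into a temp table instead of a sub-query
SELECT t.account_id, t.trade_date, t.amount
INTO #recent_trades
FROM trades t
WHERE t.trade_date >= '2015-01-01';

-- A clustered index on the already-filtered data is cheap to build
CREATE CLUSTERED INDEX ix_recent_trades ON #recent_trades (account_id, trade_date);

-- Aggregate from the staged table; optionally hint the join algorithm
SELECT a.region, SUM(r.amount) AS total_amount
FROM accounts a
JOIN #recent_trades r ON r.account_id = a.account_id
GROUP BY a.region
OPTION (HASH JOIN);
```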
I'm trying to iteratively optimize a slow MySQL query, which means I run the query, get timing, tweak it, re-run it, get timing, etc. The problem is that the timing is non-stationary, and later executions of the query perform very differently from earlier executions.
I know to clear the query cache, or turn it off, between executions. I also know that, at some level, the OS will affect query performance in ways MySQL can't control or understand. But in general, what's the best I can do wrt this kind of iterative query optimization, so that I can compare apples to apples?
Your best tool for query optimization is EXPLAIN. It will take a bit to learn what the output means, but after doing so, you will understand how MySQL's (horrible, broken, backwards) query planner decides how best to retrieve the requested data.
Changes in the parameters to the query can result in wildly different query plans, so this may account for some of the problems you are seeing.
You might want to consider using the slow query log to capture all queries that might be running with low performance. Perhaps you'll find that the query in question only falls into the low performance category when it uses certain parameters?
Create a script that runs the query 1000 times, or whatever number of iterations causes the results to stabilize.
Then follow your process as described above, but make sure you aren't relying on a single execution but rather an average of multiple executions, because you're right: the results will not be stable as row counts change and your machine does other things.
Also, try to use a wide array of inputs to the query, if that makes sense for your use case.
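A rough MySQL sketch of the run-it-many-times-and-average idea, using a hypothetical query against orders/customers tables; SQL_NO_CACHE only matters on pre-8.0 servers that still have the query cache:

```sql
DELIMITER //
CREATE PROCEDURE bench_query(IN p_runs INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  DECLARE t_start DATETIME(6) DEFAULT NOW(6);
  WHILE i < p_runs DO
    -- The query under test; SELECT ... INTO discards the result cheaply
    SELECT SQL_NO_CACHE COUNT(*) INTO @dummy
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.created_at >= '2015-01-01';
    SET i = i + 1;
  END WHILE;
  SELECT TIMESTAMPDIFF(MICROSECOND, t_start, NOW(6)) / p_runs / 1000000 AS avg_seconds;
END //
DELIMITER ;

CALL bench_query(1000);
```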
I'm helping maintain a program that's essentially a friendly read-only front-end for a big and complicated MySQL database -- the program builds ad-hoc SELECT queries from users' input, sends the queries to the DB, gets the results, post-processes them, and displays them nicely back to the user.
I'd like to add some form of reasonable/heuristic prediction for the constructed query's expected performance -- sometimes users inadvertently make queries that are inevitably going to take a very long time (because they'll return huge result sets, or because they're "going against the grain" of the way the DB is indexed) and I'd like to be able to display to the user some "somewhat reliable" information/guess about how long the query is going to take. It doesn't have to be perfect, as long as it doesn't get so badly and frequently out of whack with reality as to cause a "cry wolf" effect where users learn to disregard it;-) Based on this info, a user might decide to go get a coffee (if the estimate is 5-10 minutes), go for lunch (if it's 30-60 minutes), kill the query and try something else instead (maybe tighter limits on the info they're requesting), etc, etc.
I'm not very familiar with MySQL's EXPLAIN statement -- I see a lot of information around on how to use it to optimize a query or a DB's schema, indexing, etc, but not much on how to use it for my more limited purpose -- simply make a prediction, taking the DB as a given (of course if the predictions are reliable enough I may eventually switch to using them also to choose between alternate forms a query could take, but, that's for the future: for now, I'd be plenty happy just to show the performance guesstimates to the users for the above-mentioned purposes).
Any pointers...?
EXPLAIN won't give you any indication of how long a query will take.
At best you could use it to guess which of two queries might be faster, but unless one of them is obviously badly written then even that is going to be very hard.
You should also be aware that if you're using sub-queries, even running EXPLAIN can be slow (almost as slow as the query itself in some cases).
As far as I'm aware, MySQL doesn't provide any way to estimate the time a query will take to run. Could you log the time each query takes to run, then build an estimate based on the history of past similar queries?
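A minimal sketch of what that history could look like, with a hypothetical table and fingerprint scheme; the front-end records each query's shape and duration, then averages past runs of similar queries to produce the estimate:

```sql
CREATE TABLE query_timing_log (
  id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  query_fingerprint VARCHAR(64) NOT NULL,  -- e.g. a hash of the normalized SQL
  duration_ms INT UNSIGNED NOT NULL,
  executed_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  KEY (query_fingerprint)
);

-- After each run the application inserts the measured duration; the estimate is then
SELECT AVG(duration_ms) AS estimated_ms, COUNT(*) AS samples
FROM query_timing_log
WHERE query_fingerprint = 'abc123';   -- hypothetical fingerprint value
```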
I think if you want to have a chance of building something reasonably reliable out of this, what you should do is build a statistical model out of table sizes and broken-down EXPLAIN result components correlated with query processing times. Trying to build a query execution time predictor based on thinking about the contents of an EXPLAIN is just going to spend way too long giving embarrassingly poor results before it gets refined to vague usefulness.
MySQL's EXPLAIN output has a column called key. If there is something in this column, that is a very good sign: it means the query will use an index.
Queries that use indices are generally safe, since the indexes were likely thought out by the database designer when (s)he designed the database.
However
There is another column called Extra. This column sometimes contains the text Using filesort.
This is a very bad sign. It means MySQL has to make an extra pass to sort the result set, and if the result is larger than the sort buffer it will spill to disk in order to sort it.
Conclusion
Instead of trying to predict the time a query takes, simply look at these two indicators. If a query is using filesort, deny the user. And depending on how strict you want to be, if the query is not using any keys, you should also deny it.
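As a sketch of how the front-end could apply those two checks: run EXPLAIN on the generated statement and inspect the key and Extra columns of each row it returns (hypothetical query; the checks themselves are described above):

```sql
EXPLAIN
SELECT c.country, COUNT(*)
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.country;

-- In the application, walk the returned rows and apply the two checks:
--   * key is NULL                      -> that table is read without an index
--   * Extra contains 'Using filesort'  -> an extra sort pass is required
-- and refuse the query (or warn the user) accordingly.
```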
Read more about the resultset of the MySQL EXPLAIN statement