How Long Is Acceptable to Run a MySQL Query

When I test this query it takes around 17 - 20 seconds to complete.
UPDATE ex_hotel_temp
SET specialoffer='1'
WHERE hid IN
(SELECT hid
FROM ex_dates
WHERE offer_id IS NOT NULL
OR xfory_id IS NOT NULL
OR long_id IS NOT NULL
OR early_id IS NOT NULL
GROUP BY hid)
Although this is a cronjob running at night to do some housekeeping on the database (there is no site visitor sitting waiting for the result), it seems to me to be an unacceptable load on the server. Am I right, or am I fussing over nothing?
When I run each element of the query individually it takes about 0.001 sec. Should I therefore break it up into a series of simple queries instead?
LATER EDIT:
With the assistance of the comments and answers received, I decided to split the query into two. The result is this:
$query_hotel = "SELECT hid FROM ex_dates WHERE offer_id IS NOT NULL OR xfory_id IS NOT NULL OR long_id IS NOT NULL OR early_id IS NOT NULL GROUP BY hid";
$hotel = mysql_query($query_hotel, $MySQL_XXX) or die(mysql_error());
$row_hotel = mysql_fetch_assoc($hotel);
$totalRows_hotel = mysql_num_rows($hotel);
$hid_array = array();
do {
array_push($hid_array,$row_hotel['hid']);
}while ($row_hotel = mysql_fetch_assoc($hotel)) ;
$hid_list = implode("','",$hid_array);
$hid_list = "'$hid_list'";
// Mark the hotels as having a special offer
$query_update = "UPDATE ex_hotel_temp SET specialoffer='1' WHERE hid IN ($hid_list)";
$result = mysql_query($query_update, $MySQL_XXX) or die(mysql_error());
It's not pretty, but it works.
As there are two queries with a bit of PHP thrown in, I can't get an accurate measure of how long it takes to run, but just by looking at the time for the page to load it is obviously much closer to fractions of a second than to 20 seconds.
Thanks to all.

You say that this runs overnight in a CRON job, and you say this supports a "site" - if this is a public-facing website, yes, you should worry.
There's no such thing as business hours on the interwebs - there will be visitors interacting with your website, hopefully trying to buy stuff, at all hours of the day; even "national" sites tend to see traffic through the night in my experience (though typically at only a small rate compared to peak hours).
It's possible that your CRON job is causing other queries to run slowly too - it depends on what's causing the query to run slowly, and whether you're using transactions. The problem with web sites is that users tend to get impatient when the site is slow, refreshing the page, often creating more traffic to the database, and if there are other slow queries on the site, it's not impossible that the site becomes unusable for a while, even with a fairly limited number of users.
So, if there may be users of your site while the script runs, it's definitely worth tidying up.
The other reason you might worry is that in my experience, database performance is not linear - queries don't slow down in linear proportion to the number of records in your table. Instead, they tend to be hockey-stick like - everything is fine, until you reach a tipping point, and everything grinds to a halt. You may be riding that hockey-stick curve, and it could easily escalate from 17-20 seconds to 17-20 minutes.
The fix looks simple - the group by is redundant, and splitting the query into smaller queries should help the subselect use indices.
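For illustration only (a sketch, reusing the connection handle from the question's PHP and assuming hid is indexed in both tables; older MySQL versions often execute a join like this better than an IN (SELECT ...) subselect), the whole thing can also stay as one statement with the GROUP BY dropped and the subselect turned into a DISTINCT derived table:
$query_update = "UPDATE ex_hotel_temp AS h
                 JOIN (SELECT DISTINCT hid
                         FROM ex_dates
                        WHERE offer_id IS NOT NULL
                           OR xfory_id IS NOT NULL
                           OR long_id IS NOT NULL
                           OR early_id IS NOT NULL) AS d ON d.hid = h.hid
                  SET h.specialoffer = '1'";
mysql_query($query_update, $MySQL_XXX) or die(mysql_error());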

I wouldn't worry about it; just make sure that the cron job does not time out halfway through the process.
I personally have had queries in the past that ran for several minutes in cron jobs without any problems.
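If the housekeeping is driven by a PHP script (as in the question), one belt-and-braces precaution is to lift PHP's own execution limit at the top of that script; this is only an illustration, since the CLI usually has no limit by default:
set_time_limit(0);   // remove PHP's max execution time for this housekeeping run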

Related

Surprising timing stats for SQL queries

I have two queries whose timing parameters I want to analyze.
The first query is taking much longer than the second one, while in my opinion it should be the other way round. Any explanations?
First query:
select mrn
from EncounterInformation
limit 20;
Second query:
select enc.mrn, fl.fileallocationid
from EncounterInformation as enc inner join
FileAllocation as fl
on enc.encounterIndexId = fl.encounterid
limit 20;
The first query runs in 0.760 seconds on MySQL while, surprisingly, the second one runs in 0.509 seconds.
There are many reasons why measured performance between two queries might be different:
The execution plans for the queries (the dominant factor)
The size of the data being returned (perhaps mrn is a string that is really long for the result set in the first query but not the second)
Other database activity, that locks tables and indexes
Other server activity
Data and index caches that are pre-loaded -- either in the database itself or in the underlying OS components
Your observation is correct. The first should be faster than the second. More important, though, is the observation that this timing simply does not make sense for your simple query:
The first query runs in 0.760 seconds
select mrn
from EncounterInformation
limit 20;
The work done for this would typically be to load one data page (or maybe a handful). That would only consistently take 0.760 seconds if:
You had really slow data storage (think "carrier pigeons").
EncounterInformation is a view and not a table.
You don't understand timings.
If the difference is between 0.760 milliseconds and 0.509 milliseconds, then the difference is really small and likely due to other issues -- warm caches, other activity on the server, other database activity.
It is also possible that you are measuring elapsed time and not database time, so network congestion could potentially be an issue.
If you are querying views, all bets are off without knowing what the views are. In fact, if you care about performance you should be including the execution plan in the question.
I can't explain the difference. What I can say is that your observation is reasonable, but your question lacks a lot of the information needed to understand the timings, which suggests you need to learn more about how to interpret them. I would suggest that you start by learning about EXPLAIN.
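For instance (just a sketch, reusing the table names from the question and the PHP/MySQL style used elsewhere on this page), prefixing each query with EXPLAIN shows the plan MySQL chose, which says far more than the raw timings:
$plan = mysql_query("EXPLAIN SELECT mrn FROM EncounterInformation LIMIT 20") or die(mysql_error());
while ($row = mysql_fetch_assoc($plan)) {
    print_r($row);   // look at type, key, rows and Extra for each table in the plan
}

$plan = mysql_query("EXPLAIN SELECT enc.mrn, fl.fileallocationid
                       FROM EncounterInformation AS enc
                      INNER JOIN FileAllocation AS fl
                         ON enc.encounterIndexId = fl.encounterid
                      LIMIT 20") or die(mysql_error());
while ($row = mysql_fetch_assoc($plan)) {
    print_r($row);
}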

Best and most efficient way for ELO-score calculation for users in database

I'm having a hard time wrapping my head around the issue of an ELO-score-like calculation for a large number of users on our platform.
For example: for every user in a large set of users, a complex formula, based on variable numbers of "things done", will result in a score for each user for a match-making-like principle.
For our situation, it's based on the number of posts posted, connections accepted, messages sent, number of sessions in a time period of one month, other things done, etc.
I had two ideas to go about doing this:
Real-time: On every post, message, .. run the formula for that user
Once a week: Run the script to calculate everything for all users.
The concerns about these two I have:
Real-time: Would be an overkill of queries and calculations for each action a user performs. If, let's say, 500 users are active and all of them are performing actions, the database would be having a hard time, I think. We would then also need to run a script to re-calculate the score for inactive users (to lower their score).
Once a week: If we have, for example, 5,000 users (for our first phase), that would mean running the calculation formula 5,000 times, which could take a long time and will only grow as more users join.
The calculation queries for a single variable in the entire formula of about 12 variables are mostly a simple 'COUNT FROM table', but a few are like counting "all connections of my connections", which takes a few joins.
I started by "logging" every action into a table for this purpose, just the counter values, increasing/decreasing them with every action and running the formula with these values (a record per week). This works but can't be applied to every variable (like the connections of connections).
Note: Our server-side is based on PHP with MySQL.
We're also running Redis, but I'm not sure if this could improve those bits and pieces.
We have the option to export/push data to other servers/databases if needed.
My main example is the app 'Tinder' which uses a sort-like algorithm for match making (maybe with less complex data variables because they're not using groups and communities that you can join)
I'm wondering if they run that in real time on every swipe, every setting change, or if they have a script that runs continuously for a small batch of users each time.
Where it all comes down to: what would be the most efficient, non-table-locking way to do this, keeping in mind that there will be a moment when we have 50,000 users, for example?
The way I would handle this:
Implement the realtime algorithm.
Measure. Is it actually slow? Try optimizing.
Still slow? Move the algorithm to a separate asynchronous process. Have the process run whenever there's an update. Really this is the same thing as 1, but it doesn't slow down PHP requests and if it gets busy, it can take more time to catch up.
Still slow? Now you might be able to optimize by batching several changes.
If you have 5,000 users right now, make sure it runs well with 5,000 users. You're not going to grow to 50,000 overnight, so adjust and invest in this as your problem changes. You might be surprised where your performance problems are.
Measuring is key though. If you really want to support 50K users right now, simulate and measure.
I suspect you should use the database as the "source of truth" aka "persistent storage".
Then fetch whatever is needed from the dataset when you update the ratings. Even lots of games by 5000 players should not take more than a few seconds to fetch and compute on.
Bottom line: Implement "realtime"; come back with table schema and SELECTs if you find that the table fetching is a significant fraction of the total time. Do the "math" in a programming language, not SQL.
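A minimal sketch of that split (every table, column and weight below is invented for illustration, and the real formula has about 12 variables; $db is assumed to be an open mysqli connection): fetch the raw counters with simple aggregate queries, then combine them into the score in PHP rather than in SQL.
// Hypothetical schema: posts(user_id), messages(sender_id), connections(user_id), users(id, score)
function recalculate_score($db, $userId) {
    $userId  = (int)$userId;
    $counts  = array();
    $queries = array(
        'posts'       => "SELECT COUNT(*) FROM posts WHERE user_id = $userId",
        'messages'    => "SELECT COUNT(*) FROM messages WHERE sender_id = $userId",
        'connections' => "SELECT COUNT(*) FROM connections WHERE user_id = $userId",
    );
    foreach ($queries as $name => $sql) {
        $res = mysqli_query($db, $sql) or die(mysqli_error($db));
        $row = mysqli_fetch_row($res);
        $counts[$name] = (int)$row[0];
    }

    // Do the "math" in PHP -- placeholder weights standing in for the real formula
    $score = 3 * $counts['posts'] + 1 * $counts['messages'] + 5 * $counts['connections'];

    mysqli_query($db, "UPDATE users SET score = $score WHERE id = $userId");
    return $score;
}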

Inconsistent response time in mysql select query

We have a lot of MySQL select queries for reporting needs. Most are a little complex, and they generally include 1. five or six join statements and 2. three or four inner queries in the select clause.
All the indexes are properly in place in the production environment. We have checked with the EXPLAIN syntax multiple times and they look OK.
Some of the queries behave very strangely in terms of response time. The same query returns in less than 500 milliseconds at times (which suggests the indexes are working fine), yet when we run it a minute or so later it returns with a much higher response time (varying from five or six seconds up to 30 seconds). Sometimes (around once in 20 runs) it gives a timeout error.
This might be due to server load, but the variance is so frequent that we think there is something else we need to tune to solve it.
Can someone please point me in some direction on what else to do?
Thanks,
Sumit
This kind of behaviour is usually caused by a bottleneck in your stack.
It's like a rotating door in a building - the door can handle 1 person at a time, and each person takes 3 seconds; as long as people don't arrive at a rate over 1 person every 3 seconds, you don't know it's a bottleneck. If people arrive at a faster rate for a short period of time, the queue grows a little but disappears quickly. If people arrive at a rate of 1 person every 2.5 seconds for an hour, the queue becomes unmanageable, and can take far longer than that 1 hour to disappear.
Your database system is made up of a long corridor with rotating doors - most doors can operate in parallel, but they are all limited.
(Sorry for the rubbish analogy, but I find it helps to visualize these things with real-world images).
If the queries are showing a high degree of variance in their performance profile, I'd look at the system performance monitor (top in Linux, Perfmon in Windows) and try to correlate slow performance with the behaviour of the system. If you see a sudden spike in CPU utilization when the queries slow down, that's likely to be your bottleneck; if you see a sudden spike in disk throughput, you might look there instead.
Once you have a hypothesis about the bottleneck, you can look at ways of resolving them - throwing hardware at the problem is usually cheapest.
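One cheap way to get data to correlate against (only a sketch; the one-second threshold, the function name and the logging are illustrative) is to time the reporting queries on the application side and log the slow ones, so their timestamps can be lined up with the monitor's graphs:
function timed_query($link, $sql) {
    $start   = microtime(true);
    $result  = mysql_query($sql, $link) or die(mysql_error());
    $elapsed = microtime(true) - $start;
    if ($elapsed > 1.0) {   // anything slower than a second gets logged with a timestamp
        error_log(sprintf("[%s] %.3fs  %s", date('c'), $elapsed, substr($sql, 0, 120)));
    }
    return $result;
}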

Getting top line metrics FAST from a large MySQL DB?

I'm painfully aware there probably isn't a magic bullet to this, but it's becoming a problem. Each user has hundreds of thousands of rows of metrics data across 3 tables, and this is updated on a second-by-second basis.
When a user logs in, I want to quickly deliver them top line stats for a number of their assets (i.e. alongside each asset in navi they have top level stats).
I've tried a number of ideas, but please, if someone has some advice or experience in this area, it'd be great. Stuff tried or looked into so far:
Produce static versions of top line stats every hour or so - This is intensive across all users and all assets. So how this can be done regularly, I'm not sure.
Call stats via AJAX, so they can be processed and fill in (getting top level stats right now can take up to 10 seconds for a larger user) once page has loaded. This could also cache stats in session to save redoing queries each page load.
Query run at 30 min intervals, i.e. you log on, it'll query and then it'll hopefully use query cache every time it's loaded (only 1/2 seconds) until the next 30min interval.
The first one seems to have most legs, but I'm not sure how to do this, given only a small number of users will be needing those stats - it seems awfully expensive to do it for everyone all the time.
Your options 1 and 3 are what other databases call a materialized view. MySQL doesn't currently support them natively, but the concept can be implemented by hand; the linked article provides examples.
Hundreds of thousands of records isn't that much. Good indexes and the use of analytic queries will get you quite far. Sadly that concept isn't implemented in full either, but there are workarounds, as indicated in the link provided.
It really depends on the top line stats: do you want real-time data down to the second, or are 10, 20 or even 30 minute intervals acceptable? Using the event scheduler you can schedule the creation/update of reporting tables that contain summarized data which is much faster to query. That data is then available with fractions-of-a-second delivery times, as all the heavy lifting has already been completed. Your focus can then be on indexing these tables to improve performance, without worrying about impacts on production tables.
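For illustration, the reporting-table idea might look something like this (every name below is hypothetical, and the same REPLACE could just as well live inside a CREATE EVENT so the event scheduler runs it instead of cron):
// Rebuild the per-asset summary; the heavy lifting happens here, not at page-load time.
// Assumes asset_stats_summary has asset_id as its primary key so REPLACE acts as an upsert.
$query_summary = "REPLACE INTO asset_stats_summary (asset_id, metric_count, metric_total, generated_at)
                  SELECT asset_id, COUNT(*), SUM(metric_value), NOW()
                    FROM asset_metrics
                   GROUP BY asset_id";
mysql_query($query_summary, $link) or die(mysql_error());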
You are in the data-warehousing domain with your setup. This means that not all the normal form (NF1) rules apply. So my approach would be to use triggers to fill a separate stats table.
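A sketch of that trigger variant (all names invented for illustration; a single-statement trigger body avoids any DELIMITER juggling):
mysql_query("
    CREATE TRIGGER asset_metrics_ai AFTER INSERT ON asset_metrics
    FOR EACH ROW
      UPDATE asset_stats
         SET metric_total = metric_total + NEW.metric_value,
             metric_count = metric_count + 1
       WHERE asset_id = NEW.asset_id
", $link) or die(mysql_error());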

How can I select some amount of rows, like "get as many rows as possible in 5 seconds"?

The aim is to get the highest number of rows, without asking for more rows than have actually been loaded, after 5 seconds. The aim is not to create a timeout.
After months, I thought maybe this would work, but it didn't:
declare @d1 datetime2(7); set @d1 = getdate();
select c1, c2 from t1 where (datediff(ss, @d1, getdate()) < 5)
Although the trend in recent years for relational databases has moved more and more toward cost-based query optimization, there is no RDBMS I am aware of that inherently supports designating a maximum cost (in time or I/O) for a query.
The idea of "just let it time out and use the records collected so far" is a flawed solution. The flaw lies in the fact that a complex query may spend the first 5 seconds performing a hash on a subtree of the query plan, to generate data that will be used by a later part of the plan. So after 5 seconds, you may still have no records.
To get the most records possible in 5 seconds, you would need a query that had a known estimated execution plan, which could then be used to estimate the optimal number of records to request in order to make the query run for as close to 5 seconds as possible. In other words, knowing that the query optimizer estimates it can process 875 records per second, you could request 4,375 records. The query might run a bit longer than 5 seconds sometimes, but over time your average execution should fall close to 5 seconds.
So...how to make this happen?
In your particular situation, it's not feasible. The catch is "known estimated execution plan". To make this work reliably, you'd need a stored procedure with a known execution plan, not an ad-hoc query. Since you can't create stored procedures in your environment, that's a non-starter. For others who want to explore that solution, though, here's an academic paper by a team who implemented this concept in Oracle. I haven't read the full paper, but based on the abstract it sounds like their work could be translated to any RDBMS that has cost-based optimization (e.g. MS SQL, MySQL, etc.)
OK, So what can YOU do in your situation?
If you can't do it the "right" way, solve it with a hack.
My suggestion: keep your own "estimated cost" statistics.
Do some testing in advance and estimate how many rows you can typically get back in 4 seconds. Let's say that number is 18,000.
So you LIMIT your query to 18,000 rows. But you also track the execution time every time you run it and keep a moving average of, say, the last 50 executions. If that average is less than 4.5s, add 1% to the query size and reset the moving average. So now your app is requesting 18,180 rows every time. After 50 iterations, if the moving average is under 4.5s, add 1% again.
And if your moving average ever exceeds 4.75s, subtract 1%.
Over time, this method should converge to an optimized N-rows solution for your particular query/environment/etc. And should adjust (slowly but steadily) when conditions change (e.g. high-concurrency vs low-concurrency)
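A rough sketch of that feedback loop (persisting the statistics in a JSON file just to keep the example self-contained; table and column names come from the question's snippet, the thresholds from the paragraphs above, and $db is assumed to be an open mysqli connection):
$stateFile = 'limit_state.json';
$state = file_exists($stateFile) ? json_decode(file_get_contents($stateFile), true) : null;
if (!is_array($state)) {
    $state = array('limit' => 18000, 'sum' => 0.0, 'n' => 0);   // starting point from advance testing
}

$start   = microtime(true);
$res     = mysqli_query($db, "SELECT c1, c2 FROM t1 LIMIT " . (int)$state['limit']);
$elapsed = microtime(true) - $start;

// Keep a running average over 50 executions, then nudge the LIMIT by 1% and reset
$state['sum'] += $elapsed;
$state['n']   += 1;
if ($state['n'] >= 50) {
    $avg = $state['sum'] / $state['n'];
    if ($avg < 4.5)  { $state['limit'] = (int)round($state['limit'] * 1.01); }
    if ($avg > 4.75) { $state['limit'] = (int)round($state['limit'] * 0.99); }
    $state['sum'] = 0.0;
    $state['n']   = 0;
}
file_put_contents($stateFile, json_encode($state));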
Just one -- scratch that, two -- more things...
As a DBA, I have to say...it should be exceedingly rare for any query to take more than 5 seconds. In particular, if it's a query that runs frequently and is used by the front end application, then it absolutely should not ever run for 5 seconds. If you really do have a user-facing query that can't complete in 5 seconds, that's a sign that the database design needs improvement.
Jonathan VM's Law of the Greenbar Report: I used to work for a company that still used a mainframe application that spit out reams of greenbar dot-matrix-printed reports every day. Most of these were ignored, and of the few that were used, most were never read beyond the first page. A report might have thousands of rows sorted by descending account age...and all that user needed was to see the 10 most aged. My law is this: the number of use cases that actually require seeing a vast number of rows is infinitesimally small. Think - really think - about the use case for your query, and whether having lots and lots of records is really what that user needs.
Your while loop idea won't solve the problem entirely. It is possible that the very first iteration through the loop could take longer than 5 seconds. Plus, it will likely result in retrieving far fewer rows in the allotted time than if you tried to do it with just a single query.
Personally, I wouldn't try to solve this exact problem. Instead, I would do some testing, and through trial and error identify a number of records that I am confident will load in under five seconds. Then, I would just place a LIMIT on the loading query.
Next, depending on the requirements I would either set a timeout on the DB call of five seconds or just live with the chance that some calls will exceed the time restriction.
Lastly, consider that on most modern hardware for most queries, you can return a very large number of records within five seconds. It's hard to imagine returning all of that data to the UI and still have it be usable, if that is your intention.
-Jason
I've never tried this, but if a script is running this query you could try running an unbuffered query (in PHP, this would be something like mysql_unbuffered_query())... you could then store the rows into an array while the query is running. You could then set the MySQL query timeout to five minutes. When the query is killed, if you've set your while() loop to check for a timeout response, it can then terminate the loop and you'll have an array with all of the records returned in those 5 minutes. Again, I'm not sure this would work, but I'd be interested to see if it would accomplish what you're looking to do.
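The shape of that loop might look roughly like this (untested; it uses a client-side wall-clock check rather than waiting for the server to kill the query, reuses the table from the question, and assumes $link is an open connection):
$start = microtime(true);
$rows  = array();

// Unbuffered: rows stream in as the server produces them instead of being fetched up front
$res = mysql_unbuffered_query("SELECT c1, c2 FROM t1", $link);

while ($row = mysql_fetch_assoc($res)) {
    $rows[] = $row;
    if (microtime(true) - $start >= 5) {   // stop collecting once ~5 seconds have passed
        break;
    }
}
mysql_free_result($res);   // throw away whatever the server hadn't sent yet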
You could approach this problem like this, but I doubt that this logic is really what I'd recommend for real world use.
You have a 10s interval; you try one query, and it gets you the row in 0.1s. That would imply you could run at least 99 similar queries in the remaining 9.9s.
However, getting 99 queries' worth at once should prove faster than getting them one by one (which your initial calculation would suggest). So you get the 99 queries and check the time again.
Let's say the operation performed 1.5 times as fast as the single query, because getting more at once is more efficient, leaving you with 100 rows at the 7.5s mark. You calculate that on average you have so far gotten 100 rows per 7.5s, calculate a new number of possible queries for the rest of the time, and query again, and so on. You would, however, need to set a threshold limit for this loop, say something like: don't issue any new queries after 9.9s.
This solution obviously is neither the most smooth nor something I'd really use, but maybe it serves to solve the OP's problem.
Also, jmacinnes already pointed out: "It is possible that the very first iteration through the loop could take longer than 10[5] seconds."
I'd certainly be interested myself, if someone can come up with a proper solution to this problem.
To get data from the table you should do two things:
execute a query (SELECT something FROM table)
fill the table or read data
You are asking about the second one. I'm not that familiar with PHP, but I don't think it matters. We use fetching to get the first records quickly and show them to the user, then fetch more records as needed. In ADO.NET you could use IDataReader to get records one by one; in PHP I think you could use similar methods, for example mysqli_fetch_row in the mysqli extension or mysql_fetch_row in the mysql extension. In this case you could stop reading data at any moment.
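A sketch of that idea with mysqli (connection details and the query are placeholders; MYSQLI_USE_RESULT makes it an unbuffered read, so fetching can stop at any moment):
$db  = mysqli_connect('localhost', 'user', 'pass', 'dbname');
$res = mysqli_query($db, "SELECT c1, c2 FROM t1", MYSQLI_USE_RESULT);

$deadline = microtime(true) + 5;   // stop reading after roughly 5 seconds
$rows = array();
while ($row = mysqli_fetch_row($res)) {
    $rows[] = $row;
    if (microtime(true) >= $deadline) {
        break;                      // remaining rows are simply never read
    }
}
mysqli_free_result($res);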