How can I select some amount of rows, like "get as many rows as possible in 5 seconds"? - mysql

The aim is: getting the highest number of rows and not getting more rows than rows loaded, after 5 seconds. The aim is not creating a timeout.
after months, I thought maybe this would work and it didn't:
declare #d1 datetime2(7); set #d1=getdate();
select c1,c2 from t1 where (datediff(ss,#d1,getdate())<5)

Although the trend in recent years for relational databases has moved more and more toward cost-based query optimization, there is no RDBMS I am aware of that inherently supports designating a maximum cost (in time or I/O) for a query.
The idea of "just let it time out and use the records collected so far" is a flawed solution. The flaw lies in the fact that a complex query may spend the first 5 seconds performing a hash on a subtree of the query plan, to generate data that will be used by a later part of the plan. So after 5 seconds, you may still have no records.
To get the most records possible in 5 seconds, you would need a query that had a known estimated execution plan, which could then be used to estimate the optimal number of records to request in order to make the query run for as close to 5 seconds as possible. In other words, knowing that the query optimizer estimates it can process 875 records per second, you could request 4,375 records. The query might run a bit longer than 5 seconds sometimes, but over time your average execution should fall close to 5 seconds.
So...how to make this happen?
In your particular situation, it's not feasible. The catch is "known estimated execution plan". To make this work reliably, you'd need a stored procedure with a known execution plan, not an ad-hoc query. Since you can't create stored procedures in your environment, that's a non-starter. For others who want to explore that solution, though, here's an academic paper by a team who implemented this concept in Oracle. I haven't read the full paper, but based on the abstract it sounds like their work could be translated to any RDBMS that has cost-based optimization (e.g. MS SQL, MySQL, etc.)
OK, So what can YOU do in your situation?
If you can't do it the "right" way, solve it with a hack.
My suggestion: keep your own "estimated cost" statistics.
Do some testing in advance and estimate how many rows you can typically get back in 4 seconds. Let's say that number is 18,000.
So you LIMIT your query to 18,000 rows. But you also track the execution time every time you run it and keep a moving average of, say, the last 50 executions. If that average is less than 4.5s, add 1% to the query size and reset the moving average. So now your app is requesting 18,180 rows every time. After 50 iterations, if the moving average is under 4.5s, add 1% again.
And if your moving average ever exceeds 4.75s, subtract 1%.
Over time, this method should converge to an optimized N-rows solution for your particular query/environment/etc. And should adjust (slowly but steadily) when conditions change (e.g. high-concurrency vs low-concurrency)
Just one -- scratch that, two -- more things...
As a DBA, I have to say...it should be exceedingly rare for any query to take more than 5 seconds. In particular, if it's a query that runs frequently and is used by the front end application, then it absolutely should not ever run for 5 seconds. If you really do have a user-facing query that can't complete in 5 seconds, that's a sign that the database design needs improvement.
Jonathan VM's Law Of The Greenbar Report I used to work for a company that still used a mainframe application that spit out reams of greenbar dot-matrix-printed reports every day. Most of these were ignored, and of the few that were used, most were never read beyond the first page. A report might have thousands of rows sorted by descending account age...and all that user needed was to see the 10 most aged. My law is this: The number of use cases that actually require seeing a vast number of rows is infinitesimally small. Think - really think - about the use case for your query, and whether having lots and lots of records is really what that user needs.

Your while loop idea won't solve the problem entirely. It is possible that the very first iteration through the loop could take longer than 5 seconds. Plus, it will likely result in retrieving far fewer rows in the allotted time than if you tried to do it with just a single query.
Personally, I wouldn't try to solve this exact problem. Instead, I would do some testing, and through trial and error identify a number of records that I am confident will load in under five seconds. Then, I would just place a LIMIT on the loading query.
Next, depending on the requirements I would either set a timeout on the DB call of five seconds or just live with the chance that some calls will exceed the time restriction.
Lastly, consider that on most modern hardware for most queries, you can return a very large number of records within five seconds. It's hard to imagine returning all of that data to the UI and still have it be usable, if that is your intention.
-Jason

I've never tried this, but if a script is running this query you could try running an unbuffered query (in php, this would be something like mysql_unbuffered_query())... you could then store these into an array while the query is running. You could then set the mysql query timeout to five minutes. When the query is killed, if you've set your while() loop to check for a timeout response it can then terminate the loop and you'll have an array with all of the records returned in 5 minutes. Again, I'm not sure this would work, but I'd be interested to see if it would accomplish what you're looking to do.

You could approach this problem like this, but I doubt that this logic is really what I'd recommend for real world use.
You have a 10s interval, you try one query, it gets you the row in 0.1s. That would imply you could get at least 99 similar queries still in the remaining 9.9s.
However, getting 99 queries at once should proove faster than getting them one-by-one (which your initial calculation would suggest). So you get the 99 queries and check the time again.
Let's say the operation performed 1.5 times as fast as the single query, because getting more queries at once is more efficient, leaving you with 100rows at a time of 7.5s. You calculate that by average you have so far gotten 100rows per 7.5s, calculate a new amount of possible queries for the rest of the time and query again, and so on. You would, however, need to set a threshold limit for this loop, let's say something like: Don't get any new queries any more after 9.9s.
This solution obviously is neither the most smooth nor something I'd really use, but maybe it serves to solve the OP's problem.
Also, jmacinnes already pointed out: "It is possible that the very first iteration through the loop could take longer than 10[5] seconds."
I'd certainly be interested myself, if someone can come up with a proper solution to this problem.

To get data from the table you should do two things:
execute a query (SELECT something FROM table)
fill the table or read data
You are asking about second one. I'm not that familiar with php, but I think it does not matter. We use fetching to get first records quickly and show them to the user, then fetch records as needed. In ADO.NET you could use IDataReader to get records one by one, in php I think you could use similar methods, for example - mysqli_fetch_row in mysqli extension or mysql_fetch_row in mysql extension. In this case you could stop reading data at any moment.

Related

Running a cron to update 1 million records in every hour fails

We have an E-commerce system with more than 1 million users with a total or 4 to 5 million records in order table. We use codeigniter framework as back end and Mysql as database.
Due to this excessive number of users and purchases, we use cron jobs to update the order details and referral bonus points in every hour to make the things work.
Now we have a situation that these data updates exceeds one hour and the next batch of updates reach before finishing the previous one, there by leading into a deadlock and failure of the system.
I'd like to know about the different possible architectural and database scaling options and suggestions to get rid of this situation. We are using only the monolithic architecture to run this application.
Don't use cron. Have a single process that starts over when it finishes. If one pass lasts more than an hour, the next one will start late. (Checking PROCESSLIST is clumsy and error-prone. OTOH, this continually-running approach needs a "keep-alive" cronjob.)
Don't UPDATE millions of rows. Instead, find a way to put the desired info in a separate table that the user joins to. Presumably, that extra table would only 1 row (if everyone is controlled by the same game) or a small number of rows (if there are only a small number of patterns to handle).
Do have the slowlog turned on, with a small value for long_query_time (possibly "1.0", maybe lower). Use pt-query-digest to summarize it to find the "worst" queries. Then we can help you make them take less time, thereby helping to calm your busy system and improve the 'user experience'.
Do use batched INSERT. (A one INSERT with 100 rows runs about 10 times as fast as 100 single-row INSERTs.) Batching UPDATEs is tricky, but can be done with IODKU.
Do use batches of 100-1000 rows. (This is somewhat optimal considering the various things that can happen.)
Do use transactions judiciously. Do check for errors (including deadlocks) at every step.
Do tell us what you are doing in the hourly update. We might be able to provide more targeted advice than that 15-year-old book.
Do realize that you have scaled beyond the capabilities of the typical 3rd-party package. That is, you will have to learn the details of SQL.
I have some ideas here for you - mixed up with some questions.
Assuming you are limited in what you can do (i.e. you can't re-architect you way out of this) and that the database can't be tuned further:
Make the list of records to be processed as small as possible
i.e. Does the job have to run over all records? These 4-5 million records - are they all active orders, or that's how many you have in total for all time? Obviously just process the bare minimum.
Split and parallel process
You mentioned "batches" but never explained what that meant - can you elaborate?
Can you get multiple instances of the cron job to run at once, each covering a different segment of the records?
Multi-Record Operations
The easy (lazy) way to program updates is to do it in a loop that iterates through each record and processes it individually, but relational databases can do updates over multiple records at once. I'm pretty sure there's a proper term for that but I can't recall it. Are you processing each row individually or doing multi-record updates?
How does the cron job query the database? Have you hand-crafted the most efficient queries possible, or are you using some ORM / framework to do stuff for you?

Best and most efficient way for ELO-score calculation for users in database

I'm having a hard time wrapping my head around the issue of an ELO-score-like calculation for a large amount of users on our platform.
For example. For every user in a large set of users, a complex formule, based on variable amounts of "things done", will result in a score for each user for a match-making-like principle.
For our situation, it's based on the amount of posts posted, connections accepted, messages sent, amount of sessions in a time period of one month, .. other things done etc.
I had two ideas to go about doing this:
Real-time: On every post, message, .. run the formula for that user
Once a week: Run the script to calculate everything for all users.
The concerns about these two I have:
Real-time: Would be an overkill of queries and calculations for each action a user performs. If let's say, 500 users are active, all of them are performing actions, the database would be having a hard time I think. There would them also run a script to re-calculate the score for inactive users (to lower their score)
Once a week: If we have for example 5.000 users (for our first phase), than that would result into running the calculation formula 5.000 times and could take a long time and will increase in time when more users join.
The calculation-queries for a single variable in a the entire formula of about 12 variables are mostly a simple 'COUNT FROM table', but a few are like counting "all connections of my connections" which takes a few joins.
I started with "logging" every action into a table for this purpose, just the counter values and increase/decrease them with every action and running the formula with these values (a record per week). This works but can't be applied for every variable (like the connections of connections).
Note: Our server-side is based on PHP with MySQL.
We're also running Redis, but I'm not sure if this could improve those bits and pieces.
We have the option to export/push data to other servers/databases if needed.
My main example is the app 'Tinder' which uses a sort-like algorithm for match making (maybe with less complex data variables because they're not using groups and communities that you can join)
I'm wondering if they run that real-time on every swipe, every setting change, .. or if they have like a script that runs continiously for a small batch of users each time.
Where it all comes down to. What would be the most efficient/non-database-table-locking way to do this, with keeping the idea in mind that there will be a moment that we're having 50.000 users for example?
The way I would handle this:
Implement the realtime algorithm.
Measure. Is it actually slow? Try optimizing
Still slow? Move the algorithm to a separate asynchronous process. Have the process run whenever there's an update. Really this is the same thing as 1, but it doesn't slow down PHP requests and if it gets busy, it can take more time to catch up.
Still slow? Now you might be able to optimize by batching several changes.
If you have 5000 users right now, make sure it runs well with 5000 users. You're not going to grow to 50.000 overnight, so adjust and invest in this as your problem changes. You might be surprised where your performance problems are.
Measuring is key though. If you really want to support 50K users right now, simulate and measure.
I suspect you should use the database as the "source of truth" aka "persistent storage".
Then fetch whatever is needed from the dataset when you update the ratings. Even lots of games by 5000 players should not take more than a few seconds to fetch and compute on.
Bottom line: Implement "realtime"; come back with table schema and SELECTs if you find that the table fetching is a significant fraction of the total time. Do the "math" in a programming language, not SQL.

How does pagination of results in databases work?

This is a general question that applies to MySQL, Oracle DB or whatever else might be out there.
I know for MySQL there is LIMIT offset,size; and for Oracle there is 'ROW_NUMBER' or something like that.
But when such 'paginated' queries are called back to back, does the database engine actually do the entire 'select' all over again and then retrieve a different subset of results each time? Or does it do the overall fetching of results only once, keeps the results in memory or something, and then serves subsets of results from it for subsequent queries based on offset and size?
If it does the full fetch every time, then it seems quite inefficient.
If it does full fetch only once, it must be 'storing' the query somewhere somehow, so that the next time that query comes in, it knows that it has already fetched all the data and just needs to extract next page from it.
In that case, how will the database engine handle multiple threads? Two threads executing the same query?
I am very confused :(
I desagree with #Bill Karwin. First of all, do not make assumptions in advance whether something will be quick or slow without taking measurements, and complicate the code in advance to download 12 pages at once and cache them because "it seems to me that it will be faster".
YAGNI principle - the programmer should not add functionality until deemed necessary.
Do it in the simplest way (ordinary pagination of one page), measure how it works on production, if it is slow, then try a different method, if the speed is satisfactory, leave it as it is.
From my own practice - an application that retrieves data from a table containing about 80,000 records, the main table is joined with 4-5 additional lookup tables, the whole query is paginated, about 25-30 records per page, about 2500-3000 pages in total. Database is Oracle 12c, there are indexes on a few columns, queries are generated by Hibernate.
Measurements on production system at the server side show that an average time (median - 50% percentile) of retrieving one page is about 300 ms. 95% percentile is less than 800 ms - this means that 95% of requests for retrieving a single page is less that 800ms, when we add a transfer time from the server to the user and a rendering time of about 0.5-1 seconds, the total time is less than 2 seconds. That's enough, users are happy.
And some theory - see this answer to know what is purpose of Pagination pattern
Yes, the query is executed over again when you run it with a different OFFSET.
Yes, this is inefficient. Don't do that if you have a need to paginate through a large result set.
I'd suggest doing the query once, with a large LIMIT — enough for 10 or 12 pages. Then save the result in a cache. When the user wants to advance through several pages, then your application can fetch the 10-12 pages you saved in the cache and display the page the user wants to see. That is usually much faster than running the SQL query for each page.
This works well if, like most users, your user reads only a few pages and then changes their query.
Re your comment:
By cache I mean something like Memcached or Redis. A high-speed, in-memory key/value store.
MySQL views don't store anything, they're more like a macro that runs a predefined query for you.
Oracle supports materialized views, so that might work better, but querying the view would have the overhead of interpreting an SQL query.
A simpler in-memory cache should be much faster.

Reliably select from a database table at fixed time intervals

I have a fairly 'active' CDR table I want to select records from it every say 5 minutes for those last 5 minutes. The problem is it has a SHA IDs generated on a few of the other columns so all I have to lean on is a timestamp field by which I filter by date to select the time window of records I want.
The next problem is that obviously I cannot guarantee my script will run on the second precisely every time, or that the wall clocks of the server will be correct (which doesn't matter) and most importantly there almost certainly will be more than one record per second say 3 rows '2013-08-08 14:57:05' and before the second expired one more might be inserted.
By the time for '2013-08-08 14:57:05' and get records BETWEEN '2013-08-08 14:57:05' AND '2013-08-08 15:02:05' there will be more records for '2013-08-08 14:57:05' which I would have missed.
Essentially:
imprecise wall clock time
no sequential IDs
multiple records per second
query execution time
unreliable frequency of running the query
Are all preventing me from getting a valid set of rows in a specified rolling time window. Any suggestions for how I can go around these?
If you are using the same clock then i see no reason why things would be wrong. a resolution you would want to consider is a datetime table. So that way, every time you updated the start and stop times based on the server time.... then as things are added it would be guarenteed to be within that timeframe.
I mean, you COULD do it by hardcoding, but my way would sort of forcibly store a start and stop point in the database to use.
I would use Cron to handle the intervals and timing somewhat. Not use the time from that, but just to not lock up the database by checking all the time.
I probably not got all the details but to answer to your question title "Reliably select from a database table at fixed time intervals"...
I don't think you could even hope for a query to be run at "second precise" time.
One key problem with that approach is that you will have to deal with concurrent access and lock. You might be able to send the query at fixed time maybe, but your query might be waiting on the DB server for several seconds (or being executed seeing fairly outdated snapshot of the db). Especially in your case since the table is apparently "busy".
As a suggestion, if I were you, I would spend some time to think about queue messaging systems (like http://www.rabbitmq.com/ just to cite one, not presaging it is somehow "your" solution). Anyway those kind of tools are probably more suited to your needs.

How do I use EXPLAIN to *predict* performance of a MySQL query?

I'm helping maintain a program that's essentially a friendly read-only front-end for a big and complicated MySQL database -- the program builds ad-hoc SELECT queries from users' input, sends the queries to the DB, gets the results, post-processes them, and displays them nicely back to the user.
I'd like to add some form of reasonable/heuristic prediction for the constructed query's expected performance -- sometimes users inadvertently make queries that are inevitably going to take a very long time (because they'll return huge result sets, or because they're "going against the grain" of the way the DB is indexed) and I'd like to be able to display to the user some "somewhat reliable" information/guess about how long the query is going to take. It doesn't have to be perfect, as long as it doesn't get so badly and frequently out of whack with reality as to cause a "cry wolf" effect where users learn to disregard it;-) Based on this info, a user might decide to go get a coffee (if the estimate is 5-10 minutes), go for lunch (if it's 30-60 minutes), kill the query and try something else instead (maybe tighter limits on the info they're requesting), etc, etc.
I'm not very familiar with MySQL's EXPLAIN statement -- I see a lot of information around on how to use it to optimize a query or a DB's schema, indexing, etc, but not much on how to use it for my more limited purpose -- simply make a prediction, taking the DB as a given (of course if the predictions are reliable enough I may eventually switch to using them also to choose between alternate forms a query could take, but, that's for the future: for now, I'd be plenty happy just to show the performance guesstimates to the users for the above-mentioned purposes).
Any pointers...?
EXPLAIN won't give you any indication of how long a query will take.
At best you could use it to guess which of two queries might be faster, but unless one of them is obviously badly written then even that is going to be very hard.
You should also be aware that if you're using sub-queries, even running EXPLAIN can be slow (almost as slow as the query itself in some cases).
As far as I'm aware, MySQL doesn't provide any way to estimate the time a query will take to run. Could you log the time each query takes to run, then build an estimate based on the history of past similar queries?
I think if you want to have a chance of building something reasonably reliable out of this, what you should do is build a statistical model out of table sizes and broken-down EXPLAIN result components correlated with query processing times. Trying to build a query execution time predictor based on thinking about the contents of an EXPLAIN is just going to spend way too long giving embarrassingly poor results before it gets refined to vague usefulness.
MySQL EXPLAIN has a column called Key. If there is something in this column, this is a very good indication, it means that the query will use an index.
Queries that use indicies are generally safe to use since they were likely thought out by the database designer when (s)he designed the database.
However
There is another field called Extra. This field sometimes contains the text using_filesort.
This is very very bad. This literally means MySQL knows that the query will have a result set larger than the available memory, and therefore will start to swap the data to disk in order to sort it.
Conclusion
Instead of trying to predict the time a query takes, simply look at these two indicators. If a query is using_filesort, deny the user. And depending on how strict you want to be, if the query is not using any keys, you should also deny it.
Read more about the resultset of the MySQL EXPLAIN statement