Begin Viewing Query Results Before Query Ends - mysql

Lets say I query a table with 500K rows. I would like to begin viewing any rows in the fetch buffer, which holds the result set, even though the query has not yet completed. I would like to scroll thru the fetch buffer. If I scroll too far ahead, I want to display a message like: "REACHED LAST ROW IN FETCH BUFFER.. QUERY HAS NOT YET COMPLETED".
Could this be accomplished using fgets() to read the fetch buffer while the query continues building the result set? Doing this implies multi-threading*
Can a feature like this, other than the FIRST ROWS hint directive, be provided in Oracle, Informix, MySQL, or other RDBMS?
The whole idea is to have the ability to start viewing rows before a long query completes, while displaying a counter of how many rows are available for immediate viewing.
EDIT: What I'm suggesting may require a fundamental change in a DB server's architecture, as to the way they handle their internal fetch buffers, e.g. locking up the result set until the query has completed, etc. A feature like the one I am suggesting would be very useful, especially for queries which take a long time to complete. Why have to wait until the whole query completes, when you could start viewing some of the results while the query continues to gather more results!

Paraphrasing:
I have a table with 500K rows. An ad-hoc query without a good index to support it requires a full table scan. I would like to immediately view the first rows returned while the full table scan continues. Then I want to scroll through the next results.
It seems that what you would like is some sort of system where there can be two (or more) threads at work. One thread would be busy synchronously fetching the data from the database, and reporting its progress to the rest of the program. The other thread would be dealing with the display.
In the meantime, I would like to display the progress of the table scan, example: "Searching...found 23 of 500,000 rows so far".
It isn't clear that your query will return 500,000 rows (indeed, let us hope it does not), though it may have to scan all 500,000 rows (and may well have only found 23 rows that match so far). Determining the number of rows to be returned is hard; determining the number of rows to be scanned is easier; determining the number of rows already scanned is very difficult.
If I scroll too far ahead, I want to display a message like: "Reached last row in look-ahead buffer...query has not completed yet".
So, the user has scrolled past the 23rd row, but the query is not yet completed.
Can this be done? Maybe like: spawn/exec, declare scroll cursor, open, fetch, etc.?
There are a couple of issues here. The DBMS (true of most databases, and certainly of IDS) remains tied up as far as the current connection on processing the one statement. Obtaining feedback on how a query has progressed is difficult. You could look at the estimated rows returned when the query was started (information in the SQLCA structure), but those values are apt to be wrong. You'd have to decide what to do when you reach row 200 of 23, or you only get to row 23 of 5,697. It is better than nothing, but it is not reliable. Determining how far a query has progressed is very difficult. And some queries require an actual sort operation, which means that it is very hard to predict how long it will take because no data is available until the sort is done (and once the sort is done, there is only the time taken to communicate between the DBMS and the application to hold up the delivery of the data).
Informix 4GL has many virtues, but thread support is not one of them. The language was not designed with thread safety in mind, and there is no easy way to retrofit it into the product.
I do think that what you are seeking would be most easily supported by two threads. In a single-threaded program like an I4GL program, there isn't an easy way to go off and fetch rows while waiting for the user to type some more input (such as 'scroll down the next page full of data').
The FIRST ROWS optimization is a hint to the DBMS; it may or may not give a significant benefit to the perceived performance. Overall, it typically means that the query is processed less optimally from the DBMS perspective, but getting results to the user quickly can be more important than the workload on the DBMS.
Somewhere down below in a much down-voted answer, Frank shouted (but please don't SHOUT):
That's exactly what I want to do, spawn a new process to begin displaying first_rows and scroll through them even though the query has not completed.
OK. The difficulty here is organizing the IPC between the two client-side processes. If both are connected to the DBMS, they have separate connections, and therefore the temporary tables and cursors of one session are not available to the other.
When a query is executed, a temporary table is created to hold the query results for the current list. Does the IDS engine place an exclusive lock on this temp table until the query completes?
Not all queries result in a temporary table, though the result set for a scroll cursor usually does have something approximately equivalent to a temporary table. IDS does not need to place a lock on the temporary table backing a scroll cursor because only IDS can access the table. If it was a regular temp table, there'd still not be a need to lock it because it cannot be accessed except by the session that created it.
What I meant with the 500k rows, is nrows in the queried table, not how many expected results will be returned.
Maybe a more accurate status message would be:
Searching 500,000 rows...found 23 matching rows so far
I understand that an accurate count of nrows can be obtained in sysmaster:sysactptnhdr.nrows?
Probably; you can also get a fast and accurate count with 'SELECT COUNT(*) FROM TheTable'; this does not scan anything but simply accesses the control data - probably effectively the same data as in the nrows column of the SMI table sysmaster:sysactptnhdr.
So, spawning a new process is not clearly a recipe for success; you have to transfer the query results from the spawned process to the original process. As I stated, a multithreaded solution with separate display and database access threads would work after a fashion, but there are issues with doing this using I4GL because it is not thread-aware. You'd still have to decide how the client-side code is going store the information for display.

There are three basic limiting factors:
The execution plan of the query. If the execution plan has a blocking operation at the end (such as a sort or an eager spool), the engine cannot return rows early in the query execution. It must wait until all rows are fully processed, after which it will return the data as fast as possible to the client. The time for this may itself be appreciable, so this part could be applicable to what you're talking about. In general, though, you cannot guarantee that a query will have much available very soon.
The database connection library. When returning recordsets from a database, the driver can use server-side paging or client-side paging. Which is used can and does affect which rows will be returned and when. Client-side paging forces the entire query to be returned at once, reducing the opportunity for displaying any data before it is all in. Careful use of the proper paging method is crucial to any chance to display data early in a query's lifetime.
The client program's use of synchronous or asynchronous methods. If you simply copy and paste some web example code for executing a query, you will be a bit less likely to be working with early results while the query is still running—instead the method will block and you will get nothing until it is all in. Of course, server-side paging (see point #2) can alleviate this, however in any case your application will be blocked for at least a short time if you do not specifically use an asynchronous method. For anyone reading this who is using .Net, you may want to check out Asynchronous Operations in .Net Framework.
If you get all of these right, and use the FAST FIRSTROW technique, you may be able to do some of what you're looking for. But there is no guarantee.

It can be done, with an analytic function, but Oracle has to full scan the table to determine the count no matter what you do if there's no index. An analytic could simplify your query:
SELECT x,y,z, count(*) over () the_count
FROM your_table
WHERE ...
Each row returned will have the total count of rows returned by the query in the_count. As I said, however, Oracle will have to finish the query to determine the count before anything is returned.
Depending on how you're processing the query (e.g., a PL/SQL block in a form), you could use the above query to open a cursor, then loop through the cursor and display sets of records and give the user the chance to cancel.

I'm not sure how you would accomplish this, since the query has to complete prior to the results being known. No RDBMS (that I know of) offers any means of determining how many results to a query have been found prior to the query completing.
I can't speak factually for how expensive such a feature would be in Oracle because I have never seen the source code. From the outside in, however, I think it would be rather costly and could double (if not more) the length of time a query took to complete. It would mean updating an atomic counter after each result, which isn't cheap when you're talking millions of possible rows.

So I am putting up my comments into this answer-
In terms of Oracle.
Oracle maintains its own buffer cache inside the system global area (SGA) for each instance. The hit ratio on the buffer cache depends on the sizing and reaches 90% most of the time, which means 9 out of 10 hits will be satisfied without reaching the disk.
Considering the above, even if there is a "way" (so to speak) to access the buffer chache for a query you run, the results would highly depend on the cache sizing factor. If a buffer cache is too small, the cache hit ratio will be small and more physical disk I/O will result, which will render the buffer cache un-reliable in terms of temp-data content. If a buffer cache is too big, then parts of the buffer cache will be under-utilized and memory resources will be wasted, which in terms would render too much un-necessary processing trying to access the buffer cache while in order to peek in it for the data you want.
Also depending on your cache sizing and SGA memory it would be upto the ODBC driver / optimizer to determine when and how much to use what (cache buffering or Direct Disk I/O).
In terms of trying to access the "buffer cache" to find "the row" you are looking for, there might be a way (or in near future) to do it, but there would be no way to know if what you are looking for ("The row") is there or not after all.
Also, full table scans of large tables usually result in physical disk reads and a lower buffer cache hit ratio.You can get an idea of full table scan activity at the data file level by querying v$filestat and joining to SYS.dba_data_files. Following is a query you can use and sample results:
SELECT A.file_name, B.phyrds, B.phyblkrd
FROM SYS.dba_data_files A, v$filestat B
WHERE B.file# = A.file_id
ORDER BY A.file_id;
Since this whole ordeal is highly based on multiple parameters and statistics, the results of what you are looking for may remain a probability driven off of those facotrs.

Related

Getting the cost of every MySQL query

We have a web application backed by MySQL serving hundreds of queries per second. I'm looking for a way to measure the "cost" of every query in production. I'm imagining some option where, for every query, MySQL returns the query results along with the CPU and I/O cost of executing that query.
The end goal is to aggregate those costs by endpoint (e.g. "/search") and by the logged-in user ID. That way, when we're having issues with site, we can quickly see if there's a particular action or user ID that is using up a large chunk of our MySQL resources.
Close but not quite (AFAICT):
This answer comes close: https://stackoverflow.com/a/12880997/163832
It describes the precision and accuracy problems with EXPLAIN and recommends an alternative that measures what actually happened rather than estimating what will happen.
The alternative does seem better for my use case, but there are still problems:
I looked at the available stats and can't find ones that measure CPU or I/O.
I don't think I can afford to do FLUSH STATUS and then SHOW SESSION STATUS ... on every query.
This doesn't work when many queries are running concurrently.

How does pagination of results in databases work?

This is a general question that applies to MySQL, Oracle DB or whatever else might be out there.
I know for MySQL there is LIMIT offset,size; and for Oracle there is 'ROW_NUMBER' or something like that.
But when such 'paginated' queries are called back to back, does the database engine actually do the entire 'select' all over again and then retrieve a different subset of results each time? Or does it do the overall fetching of results only once, keeps the results in memory or something, and then serves subsets of results from it for subsequent queries based on offset and size?
If it does the full fetch every time, then it seems quite inefficient.
If it does full fetch only once, it must be 'storing' the query somewhere somehow, so that the next time that query comes in, it knows that it has already fetched all the data and just needs to extract next page from it.
In that case, how will the database engine handle multiple threads? Two threads executing the same query?
I am very confused :(
I desagree with #Bill Karwin. First of all, do not make assumptions in advance whether something will be quick or slow without taking measurements, and complicate the code in advance to download 12 pages at once and cache them because "it seems to me that it will be faster".
YAGNI principle - the programmer should not add functionality until deemed necessary.
Do it in the simplest way (ordinary pagination of one page), measure how it works on production, if it is slow, then try a different method, if the speed is satisfactory, leave it as it is.
From my own practice - an application that retrieves data from a table containing about 80,000 records, the main table is joined with 4-5 additional lookup tables, the whole query is paginated, about 25-30 records per page, about 2500-3000 pages in total. Database is Oracle 12c, there are indexes on a few columns, queries are generated by Hibernate.
Measurements on production system at the server side show that an average time (median - 50% percentile) of retrieving one page is about 300 ms. 95% percentile is less than 800 ms - this means that 95% of requests for retrieving a single page is less that 800ms, when we add a transfer time from the server to the user and a rendering time of about 0.5-1 seconds, the total time is less than 2 seconds. That's enough, users are happy.
And some theory - see this answer to know what is purpose of Pagination pattern
Yes, the query is executed over again when you run it with a different OFFSET.
Yes, this is inefficient. Don't do that if you have a need to paginate through a large result set.
I'd suggest doing the query once, with a large LIMIT — enough for 10 or 12 pages. Then save the result in a cache. When the user wants to advance through several pages, then your application can fetch the 10-12 pages you saved in the cache and display the page the user wants to see. That is usually much faster than running the SQL query for each page.
This works well if, like most users, your user reads only a few pages and then changes their query.
Re your comment:
By cache I mean something like Memcached or Redis. A high-speed, in-memory key/value store.
MySQL views don't store anything, they're more like a macro that runs a predefined query for you.
Oracle supports materialized views, so that might work better, but querying the view would have the overhead of interpreting an SQL query.
A simpler in-memory cache should be much faster.

How to optimize and Fast run SQL query

I have following SQL query that taking too much time to fetch data.
Customer.joins("LEFT OUTER JOIN renewals ON customers.id = renewals.customer_id").where("renewals.customer_id IS NULL && customers.status_id = 4").order("created_at DESC").select('first_name, last_name, customer_state, customers.created_at, customers.customer_state, customers.id, customers.status_id')
Above query takes 230976.6ms to execute.
I added indexing on firstname, lastname, customer_state and status_id.
How can I execute query within less then 3 sec. ?
Try this...
Everyone wants faster database queries, and both SQL developers and DBAs can turn to many time-tested methods to achieve that goal. Unfortunately, no single method is foolproof or ironclad. But even if there is no right answer to tuning every query, there are plenty of proven do's and don'ts to help light the way. While some are RDBMS-specific, most of these tips apply to any relational database.
Do use temp tables to improve cursor performance
I hope we all know by now that it’s best to stay away from cursors if at all possible. Cursors not only suffer from speed problems, which in itself can be an issue with many operations, but they can also cause your operation to block other operations for a lot longer than is necessary. This greatly decreases concurrency in your system.
However, you can’t always avoid using cursors, and when those times arise, you may be able to get away from cursor-induced performance issues by doing the cursor operations against a temp table instead. Take, for example, a cursor that goes through a table and updates a couple of columns based on some comparison results. Instead of doing the comparison against the live table, you may be able to put that data into a temp table and do the comparison against that instead. Then you have a single UPDATE statement against the live table that’s much smaller and holds locks only for a short time.
Sniping your data modifications like this can greatly increase concurrency. I’ll finish by saying you almost never need to use a cursor. There’s almost always a set-based solution; you need to learn to see it.
Don’t nest views
Views can be convenient, but you need to be careful when using them. While views can help to obscure large queries from users and to standardize data access, you can easily find yourself in a situation where you have views that call views that call views that call views. This is called nesting views, and it can cause severe performance issues, particularly in two ways. First, you will very likely have much more data coming back than you need. Second, the query optimizer will give up and return a bad query plan.
I once had a client that loved nesting views. The client had one view it used for almost everything because it had two important joins. The problem was that the view returned a column with 2MB documents in it. Some of the documents were even larger. The client was pushing at least an extra 2MB across the network for every single row in almost every single query it ran. Naturally, query performance was abysmal.
And none of the queries actually used that column! Of course, the column was buried seven views deep, so even finding it was difficult. When I removed the document column from the view, the time for the biggest query went from 2.5 hours to 10 minutes. When I finally unraveled the nested views, which had several unnecessary joins and columns, and wrote a plain query, the time for that same query dropped to subseconds.
Do use table-valued functions
RESOURCES
VIDEO/WEBCAST
Sponsored
Discover your Data Dilemma
WHITE PAPER
Best Practices when Designing a Digital Workplace
SEE ALL
Search Resources
Go
This is one of my favorite tricks of all time because it is truly one of those hidden secrets that only the experts know. When you use a scalar function in the SELECT list of a query, the function gets called for every single row in the result set. This can reduce the performance of large queries by a significant amount. However, you can greatly improve the performance by converting the scalar function to a table-valued function and using a CROSS APPLY in the query. This is a wonderful trick that can yield great improvements.
Want to know more about the APPLY operator? You'll find a full discussion in an excellent course on Microsoft Virtual Academy by Itzik Ben-Gan.
Do use partitioning to avoid large data moves
Not everyone will be able to take advantage of this tip, which relies on partitioning in SQL Server Enterprise, but for those of you who can, it’s a great trick. Most people don’t realize that all tables in SQL Server are partitioned. You can separate a table into multiple partitions if you like, but even simple tables are partitioned from the time they’re created; however, they’re created as single partitions. If you're running SQL Server Enterprise, you already have the advantages of partitioned tables at your disposal.
This means you can use partitioning features like SWITCH to archive large amounts of data from a warehousing load. Let’s look at a real example from a client I had last year. The client had the requirement to copy the data from the current day’s table into an archive table; in case the load failed, the company could quickly recover with the current day’s table. For various reasons, it couldn’t rename the tables back and forth every time, so the company inserted the data into an archive table every day before the load, then deleted the current day’s data from the live table.
This process worked fine in the beginning, but a year later, it was taking 1.5 hours to copy each table -- and several tables had to be copied every day. The problem was only going to get worse. The solution was to scrap the INSERT and DELETE process and use the SWITCH command. The SWITCH command allowed the company to avoid all of the writes because it assigned the pages to the archive table. It’s only a metadata change. The SWITCH took on average between two and three seconds to run. If the current load ever fails, you SWITCH the data back into the original table.
YOU MIGHT ALSO LIKE
Microsoft Dynamics AX ERP
Microsoft Dynamics AX: A new ERP is born, this time in the cloud
Joseph Sirosh
Why Microsoft’s data chief thinks current machine learning tools are like...
Urs Holzle Structure
Google's infrastructure czar predicts cloud business will outpace ads in 5...
This is a case where understanding that all tables are partitions slashed hours from a data load.
If you must use ORMs, use stored procedures
This is one of my regular diatribes. In short, don’t use ORMs (object-relational mappers). ORMs produce some of the worst code on the planet, and they’re responsible for almost every performance issue I get involved in. ORM code generators can’t possibly write SQL as well as a person who knows what they're doing. However, if you use an ORM, write your own stored procedures and have the ORM call the stored procedure instead of writing its own queries. Look, I know all the arguments, and I know that developers and managers love ORMs because they speed you to market. But the cost is incredibly high when you see what the queries do to your database.
Stored procedures have a number of advantages. For starters, you’re pushing much less data across the network. If you have a long query, then it could take three or four round trips across the network to get the entire query to the database server. That's not including the time it takes the server to put the query back together and run it, or considering that the query may run several -- or several hundred -- times a second.
Using a stored procedure will greatly reduce that traffic because the stored procedure call will always be much shorter. Also, stored procedures are easier to trace in Profiler or any other tool. A stored procedure is an actual object in your database. That means it's much easier to get performance statistics on a stored procedure than on an ad-hoc query and, in turn, find performance issues and draw out anomalies.
In addition, stored procedures parameterize more consistently. This means you’re more likely to reuse your execution plans and even deal with caching issues, which can be difficult to pin down with ad-hoc queries. Stored procedures also make it much easier to deal with edge cases and even add auditing or change-locking behavior. A stored procedure can handle many tasks that trouble ad-hoc queries. My wife unraveled a two-page query from Entity Framework a couple of years ago. It took 25 minutes to run. When she boiled it down to its essence, she rewrote that huge query as SELECT COUNT(*) from T1. No kidding.
OK, I kept it as short as I could. Those are the high-level points. I know many .Net coders think that business logic doesn’t belong in the database, but what can I say other than you’re outright wrong. By putting the business logic on the front end of the application, you have to bring all of the data across the wire merely to compare it. That’s not good performance. I had a client earlier this year that kept all of the logic out of the database and did everything on the front end. The company was shipping hundreds of thousands of rows of data to the front end, so it could apply the business logic and present the data it needed. It took 40 minutes to do that. I put a stored procedure on the back end and had it call from the front end; the page loaded in three seconds.
Of course, the truth is that sometimes the logic belongs on the front end and sometimes it belongs in the database. But ORMs always get me ranting.
Don’t do large ops on many tables in the same batch
This one seems obvious, but apparently it's not. I’ll use another live example because it will drive home the point much better. I had a system that suffered tons of blocking. Dozens of operations were at a standstill. As it turned out, a delete routine that ran several times a day was deleting data out of 14 tables in an explicit transaction. Handling all 14 tables in one transaction meant that the locks were held on every single table until all of the deletes were finished. The solution was to break up each table's deletes into separate transactions so that each delete transaction held locks on only one table. This freed up the other tables and reduced the blocking and allowed other operations to continue working. You always want to split up large transactions like this into separate smaller ones to prevent blocking.
Don't use triggers
This one is largely the same as the previous one, but it bears mentioning. Don’t use triggers unless it’s unavoidable -- and it’s almost always avoidable.
The problem with triggers: Whatever it is you want them to do will be done in the same transaction as the original operation. If you write a trigger to insert data into another table when you update a row in the Orders table, the lock will be held on both tables until the trigger is done. If you need to insert data into another table after the update, then put the update and the insert into a stored procedure and do them in separate transactions. If you need to roll back, you can do so easily without having to hold locks on both tables. As always, keep transactions as short as possible and don’t hold locks on more than one resource at a time if you can help it.
Don’t cluster on GUID
After all these years, I can't believe we’re still fighting this issue. But I still run into clustered GUIDs at least twice a year.
A GUID (globally unique identifier) is a 16-byte randomly generated number. Ordering your table’s data on this column will cause your table to fragment much faster than using a steadily increasing value like DATE or IDENTITY. I did a benchmark a few years ago where I inserted a bunch of data into one table with a clustered GUID and into another table with an IDENTITY column. The GUID table fragmented so severely that the performance degraded by several thousand percent in a mere 15 minutes. The IDENTITY table lost only a few percent off performance after five hours. This applies to more than GUIDs -- it goes toward any volatile column.
Don’t count all rows if you only need to see if data exists
It's a common situation. You need to see if data exists in a table or for a customer, and based on the results of that check, you’re going to perform some action. I can't tell you how often I've seen someone do a SELECT COUNT(*) FROM dbo.T1 to check for the existence of that data:
SET #CT = (SELECT COUNT(*) FROM dbo.T1);
If #CT > 0
BEGIN
END
It’s completely unnecessary. If you want to check for existence, then do this:
If EXISTS (SELECT 1 FROM dbo.T1)
BEGIN
END
Don’t count everything in the table. Just get back the first row you find. SQL Server is smart enough to use EXISTS properly, and the second block of code returns superfast. The larger the table, the bigger difference this will make. Do the smart thing now before your data gets too big. It’s never too early to tune your database.
In fact, I just ran this example on one of my production databases against a table with 270 million rows. The first query took 15 seconds, and included 456,197 logical reads, while the second one returned in less than one second and included only five logical reads. However, if you really do need a row count on the table, and it's really big, another technique is to pull it from the system table. SELECT rows from sysindexes will get you the row counts for all of the indexes. And because the clustered index represents the data itself, you can get the table rows by adding WHERE indid = 1. Then simply include the table name and you're golden. So the final query is SELECT rows from sysindexes where object_name(id) = 'T1' and indexid = 1. In my 270 million row table, this returned sub-second and had only six logical reads. Now that's performance.
Don’t do negative searches
Take the simple query SELECT * FROM Customers WHERE RegionID <> 3. You can’t use an index with this query because it’s a negative search that has to be compared row by row with a table scan. If you need to do something like this, you may find it performs much better if you rewrite the query to use the index. This query can easily be rewritten like this:
SELECT * FROM Customers WHERE RegionID < 3 UNION ALL SELECT * FROM Customers WHERE RegionID
This query will use an index, so if your data set is large it could greatly outperform the table scan version. Of course, nothing is ever that easy, right? It could also perform worse, so test this before you implement it. There are too many factors involved for me to tell you that it will work 100 percent of the time. Finally, I realize this query breaks the “no double dipping” tip from the last article, but that goes to show there are no hard and fast rules. Though we're double dipping here, we're doing it to avoid a costly table scan.
Ref:http://www.infoworld.com/article/2604472/database/10-more-dos-and-donts-for-faster-sql-queries.html
http://www.infoworld.com/article/2628420/database/database-7-performance-tips-for-faster-sql-queries.html

Is there / would be feasible a service providing random elements from a given SQL table?

ABSTRACT
Talking with some colleagues we came accross the "extract random row from a big database table" issue. It's a classic one and we know the naive approach (also on SO) is usually something like:
SELECT * FROM mytable ORDER BY RAND() LIMIT 1
THE PROBLEM
We also know a query like that is utterly inefficient and actually usable only with very few rows. There are some approaches that could be taken to attain better efficiency, like these ones still on SO, but they won't work with arbitrary primary keys and the randomness will be skewed as soon as you have holes in your numeric primary keys. An answer to the last cited question links to this article which has a good explanation and some bright solutions involving an additional "equal distribution" table that must be maintained whenever the "master data" table changes. But then again if you have frequent DELETEs on a big table you'll probably be screwed up by the constant updating of the added table. Also note that many solutions rely on COUNT(*) which is ridiculously fast on MyISAM but "just fast" on InnoDB (I don't know how it performs on other platforms but I suspect the InnoDB case could be representative of other transactional database systems).
In addition to that, even the best solutions I was able to find are fast but not Ludicrous Speed fast.
THE IDEA
A separate service could be responsible to generate, buffer and distribute random row ids or even entire random rows:
it could choose the best method to extract random row ids depending on how the original PKs are structured. An ordered list of keys could be maintained in ram by the service (shouldn't take too many bytes per row in addition to the actual size of the PK, it's probably ok up to 100~1000M rows with standard PCs and up to 1~10 billion rows with a beefy server)
once the keys are in memory you have an implicit "row number" for each key and no holes in it so it's just a matter of choosing a random number and directly fetch the corresponding key
a buffer of random keys ready to be consumed could be maintained to quickly respond to spikes in the incoming requests
consumers of the service will connect and request N random rows from the buffer
rows are returned as simple keys or the service could maintain a (pool of) db connection(s) to fetch entire rows
if the buffer is empty the request could block or return EOF-like
if data is added to the master table the service must be signaled to add the same data to its copy too, flush the buffer of random picks and go on from that
if data is deleted from the master table the service must be signaled to remove that data too from both the "all keys" list and "random picks" buffer
if data is updated in the master table the service must be signaled to update corresponding rows in the key list and in the random picks
WHY WE THINK IT'S COOL
does not touch disks other than the initial load of keys at startup or when signaled to do so
works with any kind of primary key, numerical or not
if you know you're going to update a large batch of data you can just signal it when you're done (i.e. not at every single insert/update/delete on the original data), it's basically like having a fine grained lock that only blocks requests for random rows
really fast on updates of any kind in the original data
offloads some work from the relational db to another, memory only process: helps scalability
responds really fast from its buffers without waiting for any querying, scanning, sorting
could easily be extended to similar use cases beyond the SQL one
WHY WE THINK IT COULD BE A STUPID IDEA
because we had the idea without help from any third party
because nobody (we heard of) has ever bothered to do something similar
because it adds complexity in the mix to keep it updated whenever original data changes
AND THE QUESTION IS...
Does anything similar already exists? If not, would it be feasible? If not, why?
The biggest risk with your "cache of eligible primary keys" concept is keeping the cache up to date, when the origin data is changing continually. It could be just as costly to keep the cache in sync as it is to run the random queries against the original data.
How do you expect to signal the cache that a value has been added/deleted/updated? If you do it with triggers, keep in mind that a trigger can fire even if the transaction that spawned it is rolled back. This is a general problem with notifying external systems from triggers.
If you notify the cache from the application after the change has been committed in the database, then you have to worry about other apps that make changes without being fitted with the signaling code. Or ad hoc queries. Or queries from apps or tools for which you can't change the code.
In general, the added complexity is probably not worth it. Most apps can tolerate some compromise and they don't need an absolutely random selection all the time.
For example, the inequality lookup may be acceptable for some needs, even with the known weakness that numbers following gaps are chosen more often.
Or you could pre-select a small number of random values (e.g. 30) and cache them. Let app requests choose from these. Every 60 seconds or so, refresh the cache with another set of randomly chosen values.
Or choose a random value evenly distributed between MIN(id) and MAX(id). Try a lookup by equality, not inequality. If the value corresponds to a gap in the primary key, just loop and try again with a different random value. You can terminate the loop if it's not successful after a few tries. Then try another method instead. On average, the improved simplicity and speed of an equality lookup may make up for the occasional retries.
It appears you are basically addressing a performance issue here. Most DB performance experts recommend you have as much RAM as your DB size, then disk is no longer a bottleneck - your DB lives in RAM and flushes to disk as required.
You're basically proposing a custom developed in-RAM CDC Hashing system.
You could just build this as a standard database only application and lock your mapping table in RAM, if your DB supports this.
I guess I am saying that you can address performance issues without developing custom applications, just use already existing performance tuning methods.

How can I select some amount of rows, like "get as many rows as possible in 5 seconds"?

The aim is: getting the highest number of rows and not getting more rows than rows loaded, after 5 seconds. The aim is not creating a timeout.
after months, I thought maybe this would work and it didn't:
declare #d1 datetime2(7); set #d1=getdate();
select c1,c2 from t1 where (datediff(ss,#d1,getdate())<5)
Although the trend in recent years for relational databases has moved more and more toward cost-based query optimization, there is no RDBMS I am aware of that inherently supports designating a maximum cost (in time or I/O) for a query.
The idea of "just let it time out and use the records collected so far" is a flawed solution. The flaw lies in the fact that a complex query may spend the first 5 seconds performing a hash on a subtree of the query plan, to generate data that will be used by a later part of the plan. So after 5 seconds, you may still have no records.
To get the most records possible in 5 seconds, you would need a query that had a known estimated execution plan, which could then be used to estimate the optimal number of records to request in order to make the query run for as close to 5 seconds as possible. In other words, knowing that the query optimizer estimates it can process 875 records per second, you could request 4,375 records. The query might run a bit longer than 5 seconds sometimes, but over time your average execution should fall close to 5 seconds.
So...how to make this happen?
In your particular situation, it's not feasible. The catch is "known estimated execution plan". To make this work reliably, you'd need a stored procedure with a known execution plan, not an ad-hoc query. Since you can't create stored procedures in your environment, that's a non-starter. For others who want to explore that solution, though, here's an academic paper by a team who implemented this concept in Oracle. I haven't read the full paper, but based on the abstract it sounds like their work could be translated to any RDBMS that has cost-based optimization (e.g. MS SQL, MySQL, etc.)
OK, So what can YOU do in your situation?
If you can't do it the "right" way, solve it with a hack.
My suggestion: keep your own "estimated cost" statistics.
Do some testing in advance and estimate how many rows you can typically get back in 4 seconds. Let's say that number is 18,000.
So you LIMIT your query to 18,000 rows. But you also track the execution time every time you run it and keep a moving average of, say, the last 50 executions. If that average is less than 4.5s, add 1% to the query size and reset the moving average. So now your app is requesting 18,180 rows every time. After 50 iterations, if the moving average is under 4.5s, add 1% again.
And if your moving average ever exceeds 4.75s, subtract 1%.
Over time, this method should converge to an optimized N-rows solution for your particular query/environment/etc. And should adjust (slowly but steadily) when conditions change (e.g. high-concurrency vs low-concurrency)
Just one -- scratch that, two -- more things...
As a DBA, I have to say...it should be exceedingly rare for any query to take more than 5 seconds. In particular, if it's a query that runs frequently and is used by the front end application, then it absolutely should not ever run for 5 seconds. If you really do have a user-facing query that can't complete in 5 seconds, that's a sign that the database design needs improvement.
Jonathan VM's Law Of The Greenbar Report I used to work for a company that still used a mainframe application that spit out reams of greenbar dot-matrix-printed reports every day. Most of these were ignored, and of the few that were used, most were never read beyond the first page. A report might have thousands of rows sorted by descending account age...and all that user needed was to see the 10 most aged. My law is this: The number of use cases that actually require seeing a vast number of rows is infinitesimally small. Think - really think - about the use case for your query, and whether having lots and lots of records is really what that user needs.
Your while loop idea won't solve the problem entirely. It is possible that the very first iteration through the loop could take longer than 5 seconds. Plus, it will likely result in retrieving far fewer rows in the allotted time than if you tried to do it with just a single query.
Personally, I wouldn't try to solve this exact problem. Instead, I would do some testing, and through trial and error identify a number of records that I am confident will load in under five seconds. Then, I would just place a LIMIT on the loading query.
Next, depending on the requirements I would either set a timeout on the DB call of five seconds or just live with the chance that some calls will exceed the time restriction.
Lastly, consider that on most modern hardware for most queries, you can return a very large number of records within five seconds. It's hard to imagine returning all of that data to the UI and still have it be usable, if that is your intention.
-Jason
I've never tried this, but if a script is running this query you could try running an unbuffered query (in php, this would be something like mysql_unbuffered_query())... you could then store these into an array while the query is running. You could then set the mysql query timeout to five minutes. When the query is killed, if you've set your while() loop to check for a timeout response it can then terminate the loop and you'll have an array with all of the records returned in 5 minutes. Again, I'm not sure this would work, but I'd be interested to see if it would accomplish what you're looking to do.
You could approach this problem like this, but I doubt that this logic is really what I'd recommend for real world use.
You have a 10s interval, you try one query, it gets you the row in 0.1s. That would imply you could get at least 99 similar queries still in the remaining 9.9s.
However, getting 99 queries at once should proove faster than getting them one-by-one (which your initial calculation would suggest). So you get the 99 queries and check the time again.
Let's say the operation performed 1.5 times as fast as the single query, because getting more queries at once is more efficient, leaving you with 100rows at a time of 7.5s. You calculate that by average you have so far gotten 100rows per 7.5s, calculate a new amount of possible queries for the rest of the time and query again, and so on. You would, however, need to set a threshold limit for this loop, let's say something like: Don't get any new queries any more after 9.9s.
This solution obviously is neither the most smooth nor something I'd really use, but maybe it serves to solve the OP's problem.
Also, jmacinnes already pointed out: "It is possible that the very first iteration through the loop could take longer than 10[5] seconds."
I'd certainly be interested myself, if someone can come up with a proper solution to this problem.
To get data from the table you should do two things:
execute a query (SELECT something FROM table)
fill the table or read data
You are asking about second one. I'm not that familiar with php, but I think it does not matter. We use fetching to get first records quickly and show them to the user, then fetch records as needed. In ADO.NET you could use IDataReader to get records one by one, in php I think you could use similar methods, for example - mysqli_fetch_row in mysqli extension or mysql_fetch_row in mysql extension. In this case you could stop reading data at any moment.