Is there an easy way to page large datasets using the Access database via straight SQL? Let's say my query would normally return 100 rows, but I want the query to page through the results so that it only retrieves (let's say) the first 10 rows. Not until I request the next 10 rows would it query for rows 11-20.
If you run a ranking query, you will get a column containing ascending numbers in your output. You can then run a query against this column using a BETWEEN...AND clause to perform your paging.
So, for example, if your pages each contain 10 records and you want the third page, you would do:
SELECT * FROM MyRankingQuery WHERE MyAscendingField BETWEEN 21 AND 30
See "How to Rank Records Within a Query", Microsoft support KB208946.
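For illustration, a minimal sketch of what MyRankingQuery itself might look like, using the correlated-subquery technique that KB article describes (MyTable and ID are placeholder names for a base table with a unique key; the rank starts at 1):

SELECT t1.*,
       (SELECT COUNT(*)
        FROM MyTable AS t2
        WHERE t2.ID <= t1.ID) AS MyAscendingField
FROM MyTable AS t1
ORDER BY t1.ID;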
The Access Database Engine doesn’t handle this very well: the proprietary TOP N syntax returns ties and the N cannot be parameterized; the optimizer doesn't handle the equivalent subquery construct very well at all :(
But, to be fair, this is something SQL in general doesn't handle very well. This is one of the few scenarios where I would consider dynamic SQL (shudder). But first I would consider using an ADO classic recordset, which has properties for AbsolutePage, PageCount, and PageSize (which, incidentally, the DAO libraries lack).
You could also consider using the Access Database Engine's little-known LIMIT TO nn ROWS syntax. From the Access 2003 help:
You may want to use ANSI-92 SQL for the following reasons...
...Using the LIMIT TO nn ROWS clause to limit the number of rows returned by a query
Could come in handy?
... my tongue is firmly embedded in my cheek :) This syntax doesn't exist in the Access Database Engine and never has. Instead, it's yet another example of the appalling state of the Access documentation on the engine side of the house.
Is the product fit for purpose if the documentation has massive holes and its content cannot be trusted? Caveat emptor.
I'm not certain how ranking answers your question. Also, I'm having trouble imagining why you would need this -- this is usually something you do on a website in order to break down the data retrieved into small chunks. But a Jet/ACE database is not a very good candidate for a website back end, unless it's strictly read-only.
One other SQL solution would use nested TOP N, but it usually requires on-the-fly procedural code to write the SQL.
It also has the ties problem: unless you include a unique field in your ORDER BY, a TOP 10 can return 11 records if two records tie on the values in the ORDER BY clause.
I'm not suggesting this is a better solution, just a different one.
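For illustration, a rough sketch of the nested TOP N approach for the third page of 10 rows (MyTable and ID are placeholder names, ID is assumed unique, and the 30 would be computed as page number times page size by the procedural code mentioned above):

SELECT * FROM (
    SELECT TOP 10 * FROM (
        SELECT TOP 30 * FROM MyTable ORDER BY ID
    ) AS FirstThirty ORDER BY ID DESC
) AS LastTen ORDER BY ID;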
The problem is that I need to do pagination. I want to use ORDER BY and LIMIT, but my colleague told me MySQL will return records in the same order anyway, and since this job doesn't care in which order the records are shown, we don't need ORDER BY.
So I want to ask: is what he said correct? Of course, assuming that no records are updated or inserted between the two queries.
You don't show your query here, so I'm going to assume that it's something like the following (where ID is the primary key of the table):
select *
from TABLE
where ID >= :x:
limit 100
If this is the case, then with MySQL you will probably get rows in the same order every time. This is because the only predicate in the query involves the primary key, which is a clustered index in MySQL, so scanning it is usually the most efficient way to retrieve the rows.
However, "probably" may not be good enough for you, and if your actual query is any more complex than this one, "probably" no longer applies. Even though you may think that nothing changes between queries (i.e., no rows inserted or deleted), so you'll get the same optimization plan, that is not true.
For one thing, the block cache will have changed between queries, which may cause the optimizer to choose a different query plan. Or maybe not. But I wouldn't take the word of anyone other than one of the MySQL maintainers that it won't.
Bottom line: use an order by on whatever column(s) you're using to paginate. And if you're paginating by the primary key, that might actually improve your performance.
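In other words, the query above with the ordering made explicit (same placeholder table and parameter as before):

select *
from TABLE
where ID >= :x:
order by ID
limit 100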
The key point here is that database engines need to handle potentially large datasets and need to care (a lot!) about performance. MySQL is never going to waste any resource (CPU cycles, memory, whatever) doing an operation that doesn't serve any purpose. Sorting result sets that aren't required to be sorted is a pretty good example of this.
When issuing a given query, MySQL will try hard to return the requested data as quickly as possible. When you insert a bunch of rows and then run a simple SELECT * FROM my_table query, you'll often see that the rows come back in the same order as they were inserted. That makes sense because the obvious way to store the rows is to append them as inserted, and the obvious way to read them back is from start to end. However, this simplistic scenario won't apply everywhere, every time:
Physical storage changes. You won't just be appending new rows at the end forever. You'll eventually update values, delete rows. At some point, freed disk space will be reused.
Most real-life queries aren't as simple as SELECT * FROM my_table. The query optimizer will try to leverage indexes, which can have a different order. Or it may decide that the fastest way to gather the required information is to perform internal sorts (that's typical for GROUP BY queries).
You mention paging. Indeed, I can think of some ways to create a paginator that doesn't require sorted results. For instance, you can assign page numbers in advance and keep them in a hash map or dictionary: items within a page may appear in random locations, but paging will be consistent. This is of course pretty suboptimal: it's hard to code and requires constant updating as data mutates. ORDER BY is basically the easiest way. What you can't do is base your paginator on the assumption that SQL data sets are ordered sets, because they aren't; neither in theory nor in practice.
As an anecdote, I once used a major framework that implemented pagination using the ORDER BY and LIMIT clauses. (I won't say the name because it isn't relevant to the question... well, dammit, it was CakePHP/2.) It worked fine when sorting by ID. But it also allowed users to sort by arbitrary columns, which were often not unique, and I once found an item that was shown on two different pages because the framework was naively sorting by a single non-unique column, and that row made its way into both ORDER BY type LIMIT 10 and ORDER BY type LIMIT 10, 10, because both orderings complied with the requested condition.
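The usual fix, sketched here with made-up names (items is a placeholder table; id is assumed unique), is to append a unique column as a tiebreaker so that page boundaries are deterministic:

SELECT * FROM items ORDER BY type, id LIMIT 10
SELECT * FROM items ORDER BY type, id LIMIT 10, 10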
I'm not very knowledgeable about databases. I would want to retrieve, say the "newest" 10 rows with owner ID matching something, and then perhaps paginate to retrieve the next "newest" 10 rows with that owner, and so on. But say I'm adding more and more rows into a database table -- at some point, would such a query become unbearably slow, or are databases generally good enough that this won't be a worry?
I imagine it would be an issue because to get the "newest" 10 rows you'd have to order by date, which is O(n log n). With this assumption, I sought a possible solution from SQL Server SELECT LAST N Rows.
It pointed me to http://www.sqlservercurry.com/2009/02/retrieve-last-n-rows-based-on-condition.html where I found that there is a PARTITION BY option for a query. I imagine this means first selecting all the rows that match the owner ID, and THEN ordering them, which would be significantly faster, and fast enough to not worry about for most applications. Is this the correct understanding?
Otherwise, is there some better way to get the "newest" N rows (the question I linked seems to suggest there is)?
I'm developing the app in Django if anyone knows a convenient way, but otherwise Django also allows raw database queries.
Okay, if you are using Django, then you don't have to worry about the DB's complexity; the ORM is there to resolve your worries.
A simple fact: Django querysets are lazy, so they reduce your DB hits and improve system performance.
So, according to your initial part of question, you can simply run this query:
queryset = YourModel.objects.filter(**lookup_condition).order_by('-id')  # descending, so the newest rows come first
It will get a queryset with the objects which match the condition from database of that Model class. For details, check this: https://docs.djangoproject.com/en/1.9/ref/models/querysets/#django.db.models.query.QuerySet.filter
And to paginate over it, run like this:
first_ten_values = queryset[0:10]    # slice bounds are exclusive, so this is rows 1-10 (the newest ten)
second_ten_values = queryset[10:20]  # rows 11-20
...
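Under the hood, those slices translate into LIMIT/OFFSET queries, roughly like the following (the table and column names here are just placeholders for whatever your model maps to):

SELECT * FROM yourmodel_table WHERE owner_id = 42 ORDER BY id DESC LIMIT 10 OFFSET 0;
SELECT * FROM yourmodel_table WHERE owner_id = 42 ORDER BY id DESC LIMIT 10 OFFSET 10;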
I know SQL is able to return only a certain number of rows like so:
MySQL:
select ... order by num desc limit 10
Oracle SQL:
WHERE ROWNUM <= 10 and whatever_else
but I'm under the impression that those execute by finding all the entries that meet your "where" conditions and then only returning a subset of them.
What I want is to tell it, "Give me the first N entries you come across that meet my conditions and stop executing," so that my query will execute really fast if I only want an example of some data in the DB and not all of it.
Does anyone know how to do this in MySQL and/or Oracle SQL? Oracle SQL preferred but any help is appreciated.
Also, what is the correct term for this? The term "short circuiting" describes what I'm looking for, but I'm not sure if it is the official term in regards to databases.
A simple select ... where ... limit ... will stop processing once it has found the necessary items; this optimisation is built into MySQL and other engines.
A common optimisation technique is to use LIMIT 1 when you know there will be only one match; this prevents the database from scanning the whole table once that match is found.
However, when you include ... order by ..., the engine has no choice but to iterate over all items to find the right elements. Even so, there are optimisations over not limiting: in your example, the database engine may keep only a list of 10 items and pop items out as it finds elements that should be ordered above them.
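So, for the "give me the first N rows you come across" case, simply omit the ORDER BY. A sketch with placeholder table and column names, for both engines mentioned in the question:

-- MySQL: can stop scanning as soon as 10 matching rows are found
SELECT * FROM my_table WHERE some_column = 'some value' LIMIT 10;

-- Oracle: ROWNUM is assigned as rows qualify, so this can also stop early
SELECT * FROM my_table WHERE some_column = 'some value' AND ROWNUM <= 10;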
In my JSP application I have a search box that lets users search for user names in the database. I send an AJAX request on each keystroke and fetch 5 random names starting with the entered string.
I am using the below query:
select userid,name,pic from tbl_mst_users where name like 'queryStr%' order by rand() limit 5
But this is very slow, as I have more than 2000 records in my table.
Is there any better approach that takes less time and lets me achieve the same result? I need random values.
How slow is "very slow", in seconds?
The reason why your query could be slow is most likely that you didn't place an index on name. 2000 rows should be a piece of cake for MySQL to handle.
The other possible reason is that you have many columns in the SELECT clause. I assume in this case the MySQL engine first copies all this data to a temp table before sorting this large result set.
I advise the following, so that you work only with indexes, for as long as possible:
SELECT userid, name, pic
FROM tbl_mst_users
JOIN (
    -- here, MySQL works on indexes only
    SELECT userid
    FROM tbl_mst_users
    WHERE name LIKE 'queryStr%'
    ORDER BY RAND() LIMIT 5
) AS sub USING (userid); -- join the other columns only after picking the rows in the sub-query
This method is a bit better, but still does not scale well. However, it should be sufficient for small tables (2000 rows is, indeed, small).
The link provided by #user1461434 is quite interesting. It describes a solution with almost constant performance. Only drawback is that it returns only one random row at a time.
1. Does the table have an index on name? If not, add one.
2. MediaWiki uses an interesting trick (for Wikipedia's Special:Random feature): the table with the articles has an extra column with a random number (generated when the article is created). To get a random article, generate a random number and get the article with the next larger or smaller (don't recall which) value in the random-number column. With an index, this can be very fast. (And MediaWiki is written in PHP and developed for MySQL.)
This approach can cause a problem if the resulting numbers are badly distributed; IIRC, this has been fixed in MediaWiki, so if you decide to do it this way you should take a look at the code to see how it's currently done (probably they periodically regenerate the random-number column).
3. http://jan.kneschke.de/projects/mysql/order-by-rand/
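A rough sketch of the trick in point 2, assuming you add an indexed rand_val column (the name is made up) that is filled with RAND() when a row is inserted:

SET @r = RAND();  -- pick the random threshold once
SELECT userid, name, pic
FROM tbl_mst_users
WHERE name LIKE 'queryStr%'
  AND rand_val >= @r
ORDER BY rand_val
LIMIT 5;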
I have only used SQL rarely until recently when I began using it daily. I notice that if no "order by" clause is used:
When selecting part of a table, the rows returned appear to be in the same order as they appear if I select the whole table.
The order of rows returned by selecting from a join seems to be determined by the left-most member of the join.
Is this behaviour a standard thing one can count on in the most common databases (MySQL, Oracle, PostgreSQL, SQLite, SQL Server)? (I don't really even know whether one can truly count on it in SQLite.) How strictly is it honored if so (e.g., if one uses "group by", would the individual groups each have that ordering)?
If no ORDER BY clause is included in the query, the returned order of rows is undefined.
Whilst some RDBMSes will return rows in specific orders in some situations even when an ORDER BY clause is omitted, such behaviour should never be relied upon.
Section 20.2 <direct select statement: multiple rows>, subsection "General Rules", of the SQL-92 specification:
4) If an <order by clause> is not specified, then the ordering of the rows of Q is implementation-dependent.
If you want order, include an ORDER BY. If you don't include an ORDER BY, you're telling SQL Server:
I don't care what order you return the rows, just return the rows
Since you don't care, SQL Server is going to return the rows in whatever manner it deems most efficient right now (or according to the last time the plan for this specific query was cached). Therefore you should not rely on the behavior you observe. It can change from one run of the query to the next, with data changes, statistics changes, index changes, service packs, cumulative updates, upgrades, etc. etc. etc.
For PostgreSQL, if you omit the ORDER BY clause you could run the exact same query 100 times while the database is not being modified, and get one run in the middle in a different order than the others. In fact, each run could be in a different order.
One reason this could happen is that if the plan chosen involves a sequential scan of a table's heap, and there is already a seqscan of that table's heap in progress, your query will start its scan at whatever point the other scan has already reached, to reduce the need for disk access.
As other answers have pointed out, if you want the data in a certain order, specify that order. PostgreSQL will take the requested order into consideration in choosing a plan, and may use an index that provides data in that order, if that works out to be cheaper than getting the rows some other way and then sorting them.
GROUP BY provides no guarantee of order; PostgreSQL might sort the data to do the grouping, or it might use a hash table and return the rows in order of the number generated by the hashing algorithm (i.e., pretty random). And that might change from one run to the next.
It never ceased to amaze me when I was a DBA that this feature of SQL was so often thought of as quirky. Consider a simple program that runs against a text file and produces some output. If the program never changes, and the data never changes, you'd expect the output to never change.
As for this:
If no ORDER BY clause is included in the query, the returned order of rows is undefined.
Not strictly true - on every RDBMS I've ever worked on (Oracle, Informix, SQL Server, DB2 to name a few), a DISTINCT clause has much the same effect as an ORDER BY, since finding unique values involves a sort by definition.
EDIT (6/2/14):
Create a simple table
For DISTINCT and ORDER BY, both the plan and the cost are the same, since it is ostensibly the same operation being performed
And not surprisingly, the effect is thus the same
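A minimal sketch of that experiment (the table and column names are made up); the point is to compare the execution plans of the two queries:

CREATE TABLE t (val INT);
INSERT INTO t (val) VALUES (3), (1), (2), (1);

SELECT DISTINCT val FROM t;      -- plan may show a sort/distinct step
SELECT val FROM t ORDER BY val;  -- plan shows a sort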