I have a REST service which returns rows from a database table depending on the current page and the number of results per page.
When not filtering the results, it's pretty easy to do: I just do a SELECT WHERE id >= (page - 1) * perPage + 1 and LIMIT to perPage.
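For example, for page 3 with 20 results per page, that works out to something like the following (MyTable is just a placeholder name, and this only works because I'm assuming the ids are consecutive):
SELECT * FROM MyTable WHERE id >= 41 LIMIT 20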
The problem is when trying to use pagination on filtered results, e.g. if I choose to filter only the rows WHERE type = someType.
In that case, the first match of the first page might be at id 7 and the last at id 5046; the first match of the second page might then be at id 7302 and the last at id 12430, and so on.
For the first page of filtered results, I could simply start from id 1 and LIMIT to perPage, but for the second page and beyond I need to know the id of the last matched row of the previous page, or better yet the first matched row of the current page, or some other indication.
How do I do it efficiently? I need to be able to do it on tables with millions of rows, so obviously fetching all the rows and taking it from there is not an option.
The idea is something like this:
SELECT ... FROM ... WHERE filterKey = filterValue AND id >= id_of_first_match_in_current_page
with id_of_first_match_in_current_page being the mystery.
You can't know what the first id on a given page is, because id values are not necessarily sequential. In other words, there can be gaps in the sequence, so the fifth page of 100 rows doesn't necessarily start at id 401. It could start at id 527, for example; it's impossible to know.
Stated yet another way: id is a value, not a row number.
One possible solution, if your client is advancing through pages in ascending order, is for each REST request to fetch a page, note the greatest id value on that page, and then pass that value to the next REST request so it queries only id values that are larger.
SELECT ... FROM ... WHERE filterKey = filterValue
  AND id > id_of_last_match_of_previous_page
ORDER BY id
LIMIT <perPage>
But if your REST request can fetch any random page, this solution doesn't work. It depends on having fetched the prior page already.
Another solution is to use the LIMIT <x> OFFSET <y> syntax. This allows you to request any arbitrary page. LIMIT <y>, <x> works the same, but for some reason x and y are reversed in the two different syntax forms, so keep that in mind.
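For example, to fetch the 100 rows of page 6 of the filtered results (i.e. skip the first 500 matching rows), these two statements are equivalent; this is just a sketch using the table and column names from the question as placeholders:
SELECT * FROM MyTable WHERE type = 'someType' ORDER BY id LIMIT 100 OFFSET 500
SELECT * FROM MyTable WHERE type = 'someType' ORDER BY id LIMIT 500, 100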
Using LIMIT...OFFSET isn't very efficient when you request a page that is many pages into the result. Say you request the 5,000th page: MySQL has to generate the first 5,000 pages of the result on the server side, then discard the first 4,999 and return only the last page. Sorry, but that's how it works.
Re your comment:
You must understand that WHERE applies conditions on values in rows, but pages are defined by the position of rows. These are two different ways of determining rows!
If you have a column that is guaranteed to be a row-number, then you can use that value like a row position. You can even put an index on it, or use it as the primary key.
But primary key values may change, and may not be consecutive, for example if you update or delete rows, or roll back some transactions, and so on. Renumbering primary key values is a bad idea because other tables or external data may reference primary key values.
So you could add another column that is not the primary key, but only a row-number.
ALTER TABLE MyTable ADD COLUMN row_number BIGINT UNSIGNED, ADD KEY (row_number);
Then fill the values when you need to renumber the rows.
SET @row := 0;
UPDATE MyTable SET row_number = (@row := @row + 1) ORDER BY id;
You'd have to re-number the rows if you ever delete some, for example. It's not efficient to do this frequently, depending on the size of the table.
Also, new inserts cannot create correct row number values without locking the table. This is necessary to prevent race conditions.
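A rough sketch of what that locking would look like (the name column and the single-row insert are just hypothetical examples):
LOCK TABLES MyTable WRITE;
-- compute the next row number while holding the write lock,
-- so no concurrent insert can claim the same value
SELECT COALESCE(MAX(row_number), 0) + 1 INTO @next FROM MyTable;
INSERT INTO MyTable (name, row_number) VALUES ('example', @next);
UNLOCK TABLES;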
If you have a guarantee that row_number is a sequence of consecutive values, then it's both a value and a row position, so you can use it for high-performance index lookups for any arbitrary page of rows.
SELECT * FROM MyTable WHERE row_number BETWEEN 401 AND 500;
At least until the next time the sequence of row numbers is put into doubt by a delete or by new inserts.
You're using the ID column for the wrong purpose. ID is the identifier of a record, not the sequence number of a record for any given set of results.
The LIMIT keyword extends to basic pagination. If you just wanted the first 10 records, you'd do something like:
LIMIT 10
To paginate, if you wanted the second 10 records, you'd do:
LIMIT 10,10
The 10 after that:
LIMIT 20,10
And so on.
The LIMIT clause is independent of the WHERE clause. Use WHERE to filter your results, use LIMIT to paginate them.
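For example, here is a sketch of page 3 of a filtered result at 10 rows per page (the table and column names are placeholders taken from the original question):
SELECT *
FROM MyTable
WHERE type = 'someType'   -- WHERE filters the result
ORDER BY id               -- keep the page boundaries deterministic
LIMIT 20, 10              -- skip two pages of 10 rows, return the next 10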
Related
I need to fetch data in batches, for example rows 1 to 1000, then 1001 to 2000.
Query: Select * from Employee limit 1, 1000
Select * from Employee limit 1001, 1000
No ORDER BY is used in these queries. Will the second query return duplicate data, or will it follow some sorting order?
This question was previously called a "duplicate" of The order of a SQL Select statement without Order By clause. That is inappropriate as a "duplicate" link because it refers to engines other than MySQL. However, the effect is "correct". That is, you must use ORDER BY; do not assume the table is in some order.
I brought this question back to life because of a more subtle part of the question, referring to a common cause of duplicates.
This
Select * from Employee limit 1001, 1000
has two problems:
LIMIT without an ORDER BY is asking for trouble (as discussed in the link)
You appear to be doing "pagination" and you mentioned "returns duplicate data". I bring this up because you can get dups even if you have an ORDER BY. To elaborate...
OFFSET is implemented by stepping over rows.
Between getting N rows and getting the next N rows, some rows could be INSERTed or DELETEd in the 'previous' rows. This messes up the OFFSET, leading to either "duplicate" or "missing" rows.
More discussion, plus an alternative to OFFSET: Pagination. It involves "remembering where you left off".
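A rough sketch of that alternative for this question, assuming Employee has a unique emp_id column (a hypothetical name):
SELECT * FROM Employee ORDER BY emp_id LIMIT 1000;
-- remember the largest emp_id returned by that batch (say it was 1000), then:
SELECT * FROM Employee WHERE emp_id > 1000 ORDER BY emp_id LIMIT 1000;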
Specific to InnoDB:
The data's BTree is ordered by the PRIMARY KEY. That is predictable, but
The query does not necessarily use the "table" to fetch the rows. It might use a 'covering' INDEX, whose BTree is sorted by a secondary key!
For grins... MyISAM:
The data is initially ordered by when the rows were inserted.
That order may change as Inserts and Deletes, and even Updates, occur.
And the query may use a covering index (Primary or secondary).
If we use the traditional approach with LIMIT m, n, we may have these steps:
SELECT COUNT(*) FROM table WHERE condition to get the total number of rows and calculate how many pages there are;
SELECT columns FROM table WHERE condition LIMIT 0, 100 to show the first page (assuming 100 rows per page);
SELECT columns FROM table WHERE condition LIMIT 100, 100 when 'next page' is clicked;
...
This may be very expensive because the WHERE condition may require a full table scan, and MySQL may repeat that slow scan on every page if caching is turned off.
So I have another way to implement this paged query:
SELECT id FROM table WHERE condition to get all IDs that match my condition. If the table is huge (e.g. 1,000,000+ rows), we can limit the result size (e.g. to 10,000 at most, with LIMIT 10000). These IDs are sent to the front-end (via dynamic pages or Ajax, as JavaScript code or JSON data);
In the front-end, JavaScript chooses the first 100 IDs as the first page and requests those rows, so we run SELECT columns FROM table WHERE id IN (id_0, id_1, ..., id_99) in the back-end;
We run SELECT columns FROM table WHERE id IN (id_100, id_101, ..., id_199) when 'next page' is clicked;
...
In this approach, the WHERE condition is evaluated only once, and the full rows are then fetched by primary-key lookups.
This is implemented in my part-time project: (http://) www.chess-wizard.com/base/ (First page data is stored in JSP for SEO).
I shared this idea with my team members, but they don't agree with me :-(
Why must LIMIT m, n be the standard/only way to implement a paged query?
My website has more than 20,000,000 entries; entries have categories (FK) and tags (M2M). Even for a query like SELECT id FROM table ORDER BY id LIMIT 1000000, 10, MySQL needs to scan 1,000,010 rows, which is unacceptably slow (and primary keys, indexes, joins, etc. don't help much here; it's still 1,000,010 rows). So I am trying to speed up pagination by storing a row count and row number with triggers like this:
DELIMITER //
CREATE TRIGGER trigger_name
BEFORE INSERT    -- must be BEFORE INSERT so that NEW can still be modified
ON entry_table FOR EACH ROW
BEGIN
    -- increment the category's row count and capture it in a user variable
    UPDATE category_table SET row_count = (@rc := row_count + 1)
    WHERE id = NEW.category_id;
    -- store that count as this entry's position within its category
    SET NEW.row_number_in_category = @rc;
END //
DELIMITER ;
And then I can simply:
SELECT *
FROM entry_table
WHERE row_number_in_category > 10
ORDER BY row_number_in_category
LIMIT 10
(now only 10 rows are scanned, so selects are blazing fast; inserts are slower, but they are rare compared to selects, so that's OK)
Is it a bad approach and are there any good alternatives?
Although I like the solution in the question, it may present some issues if data in entry_table is changed, perhaps deleted or assigned to different categories over time.
It also limits the ways in which the data can be sorted; the method assumes the data is only ever sorted by insert order. Covering multiple sort orders requires additional triggers and summary data.
One alternative way of paginating is to pass in an offset of the field you are sorting/paginating by, instead of an offset to the LIMIT clause.
Instead of this:
SELECT id FROM table ORDER BY id LIMIT 1000000, 10
Do this, assuming in this scenario that the last result viewed had an id of 1000000:
SELECT id FROM table WHERE id > 1000000 ORDER BY id LIMIT 0, 10
By tracking the pagination position as a value, it can be passed to subsequent queries, and the database avoids sorting rows that are never going to be part of the end result.
If you really only wanted 10 rows out of 20 million, you could go further and guess that the next 10 matching rows will occur within the next 1000 overall results, perhaps with some logic to repeat the query with a larger allowance if this is not the case.
SELECT id FROM table WHERE id BETWEEN 1000000 AND 1001000 ORDER BY id LIMIT 0, 10
This should be significantly faster because the sort will probably be able to limit the result in a single pass.
Good Morning,
I have a table that contains a couple of million rows, and I need to view the data ordered by the TimeStamp.
When I tried to do this:
SELECT * FROM table ORDER BY date DESC LIMIT 200 OFFSET 0
MySQL orders all the data and only then responds with the 200 rows, and this is a performance issue, because it's not wise to order everything each time I want to scroll the page!
Do you have any idea how we could improve the performance?
Firstly, you need to create an index on the date field. This allows the rows to be retrieved in order without having to sort the entire table every time a request is made.
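For example (a sketch; mytable is the placeholder table name used in the examples further down):
CREATE INDEX idx_date ON mytable (date);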
Secondly, paging based on index gets slower the deeper you delve into the result set. To illustrate:
ORDER BY indexedcolumn LIMIT 0, 200 is very fast because it only has to scan 200 rows of the index.
ORDER BY indexedcolumn LIMIT 200, 200 is relatively fast, but requires scanning 400 rows of the index.
ORDER BY indexedcolumn LIMIT 660000, 200 is very slow because it requires scanning 660,200 rows of the index.
Note: even so, this may still be significantly faster than not having an index at all.
You can fix this in a few different ways.
Implement value-based paging, so you're paging based on the value of the last result on the previous page. For example:
WHERE indexedcolumn > [lastval] ORDER BY indexedcolumn LIMIT 200, replacing [lastval] with the value of the last result on the current page. The index allows random access to a particular value and proceeding forwards or backwards from that value.
Only allow users to view the first X rows (e.g. 1000). This is no good if the value they want is the 2529th value.
Think of some logical way of breaking up your large table, for example by the first letter, the year, etc so users never have to encounter the entire result set of millions of rows, instead they need to drill down into a specific subset first, which will be a smaller set and quicker to sort.
If you're combining a WHERE and an ORDER BY you'll need to reflect this in the design of your index to enable MySQL to continue to benefit from the index for sorting. For example if your query is:
SELECT * FROM mytable WHERE year='2012' ORDER BY date LIMIT 0, 200
Then your index will need to be on two columns (year, date) in that order.
If your query is:
SELECT * FROM mytable WHERE firstletter='P' ORDER BY date LIMIT 0, 200
Then your index will need to be on the two columns (firstletter, date) in that order.
The idea is that an index on multiple columns allows sorting by a column as long as you specify the previous columns as constants (single values) in a condition. So an index on A, B, C, D and E allows sorting by C if you specify A and B as constants in a WHERE condition; A and B cannot be ranges.
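As a sketch, the matching composite indexes for the two example queries above would be (the index names are arbitrary):
ALTER TABLE mytable ADD INDEX idx_year_date (year, date);
ALTER TABLE mytable ADD INDEX idx_letter_date (firstletter, date);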
This is going to be one of those questions but I need to ask it.
I have a large table which may or may not have one unique row. I therefore need a MySQL query that will just tell me TRUE or FALSE.
With my current knowledge, I see two options (pseudo code):
[id = primary key]
OPTION 1:
SELECT id FROM table WHERE x=1 LIMIT 1
... and then determine in PHP whether a result was returned.
OPTION 2:
SELECT COUNT(id) FROM table WHERE x=1
... and then just use the count.
Is either of these preferable for any reason, or is there perhaps an even better solution?
Thanks.
If the selection criterion is truly unique (i.e. yields at most one result), you are going to see massive performance improvement by having an index on the column (or columns) involved in that criterion.
create index my_unique_index on table(x)
If you want to enforce the uniqueness, a non-unique index is not even an option; you must have
create unique index my_unique_index on table(x)
Having this index, querying on the unique criterion will perform very well, regardless of minor SQL tweaks like count(*), count(id), count(x), limit 1 and so on.
For clarity, I would write
select count(*) from table where x = ?
I would avoid LIMIT 1 for two other reasons:
It is non-standard SQL. I am not religious about that, use the MySQL-specific stuff where necessary (e.g. for paging data), but it is not necessary here.
If for some reason, you have more than one row of data, that is probably a serious bug in your application. With LIMIT 1, you are never going to see the problem. This is like counting dinosaurs in Jurassic Park with the assumption that the number can only possibly go down.
AFAIK, if you have an index on your ID column, both queries will have more or less equal performance. The second query will need one less line of code in your program, but that's not going to make any performance impact either.
Personally, I typically do the first one: selecting the id from the row and limiting to 1 row. I like this better from a coding perspective: instead of having to actually retrieve the data, I just check the number of rows returned.
If I were to compare speeds, I would say not doing a count in MySQL would be faster. I don't have any proof, but my guess would be that MySQL has to get all of the rows and then count how many there are. Although, on second thought, it would have to do that in the first option as well, so the code knows how many rows there are. But since you have COUNT(id) vs COUNT(*), I would say it might be slightly slower.
Intuitively, the first one could be faster since it can abort the table (or index) scan when it finds the first value. But you should retrieve x, not id, since if the engine is using an index on x, it doesn't need to go to the block where the row actually is.
Another option could be:
select exists(select 1 from mytable where x = ?) from dual
Which already returns a boolean.
Typically, you use a GROUP BY ... HAVING clause to determine if there are duplicate rows in a table. Say you have a table with an id and a name (assuming id is the primary key, and you want to know whether name is unique or repeated). You would use
select name, count(*) as total from mytable group by name having total > 1;
The above will return the names that are repeated and how many times each one occurs.
If you just want one query to get your answer as true or false, you can use a nested query, e.g.
select if(count(*) >= 1, True, False) from (select name, count(*) as total from mytable group by name having total > 1) a;
The above should return true if your table has duplicate rows, otherwise false.