I have a table, let's call it mytable, which holds a huge amount of data that I need to query based on the values of some columns of types varchar and datetime (none of these columns is indexed, and I cannot use the primary key for this query).
I need to fetch the data with pagination, for which I am using the variables varLimit and varOffset. What I have noticed after much experimentation is that although LIMIT varLimit speeds up the query when the result count is high, it severely reduces performance when varLimit is greater than the result count. If the query returns 0 rows, then with LIMIT 20 applied it takes 30 seconds longer than it does with the LIMIT removed!
Here's my query:
SELECT `data`
FROM mytable
WHERE (conditions...)
ORDER BY `heure` desc LIMIT varLimit OFFSET varOffset;
To optimize this, I first have to recalculate varLimit, setting it to the minimum of the result count and its current value (varLimit = 20, but if the query returns 10 rows, it should become varLimit = 10). The final code becomes:
SELECT COUNT(*) INTO varCount
FROM mytable
WHERE (conditions...);
SELECT LEAST(varLimit, varCount - varOffset) INTO varLimit; -- Assume varOffset <= varCount
SELECT `data`
FROM mytable
WHERE (conditions...)
ORDER BY `heure` desc LIMIT varLimit OFFSET varOffset;
Is there any way to do it in a single query, or a better way to achieve the same?
Unfortunately, you cannot use variables in LIMIT and OFFSET clauses. They must be constants, so you must do this limit/offset computation either in application code or by building the statement text with string concatenation and executing it as a MySQL prepared statement.
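For example, a minimal sketch of the prepared-statement route, reusing the question's varLimit and varOffset (the WHERE clause is the question's placeholder):
-- build the statement text, splicing in the computed limit and offset
SET @sql = CONCAT('SELECT `data` FROM mytable WHERE (conditions...) ',
                  'ORDER BY `heure` DESC LIMIT ', varLimit, ' OFFSET ', varOffset);
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;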
I have a table (>500GB) from which I need to select 5000 random rows where table.condition = True and 5000 random rows where table.condition = False. My attempts until now used tablesample, but, unfortunately, any WHERE clause is only applied after the sample has been generated. So the only way I see this working is by doing the following:
Generate 2 empty temporary_tables -- temporary_table_true and temporary_table_false -- with the structure of the main table, so I can add rows iteratively.
create temp table temporary_table_true as select
table.condition, table.b, table.c, ... table.z
from table LIMIT 0
create temp table temporary_table_false as select
table.condition, table.b, table.c, ... table.z
from table LIMIT 0
Create a loop that only stops when both of my temporary_tables contain 5000 rows.
Inside that loop, in each iteration, I generate a batch of 100 random samples from table. From those random rows I insert the ones with table.condition = True into my temporary_table_true and the ones with table.condition = False into my temporary_table_false.
Could you guys give me some help here?
Are there any better approaches?
If not, any idea on how I could code parts 2. and 3.?
Add a column to your table and populate it with random numbers.
ALTER TABLE `table` ADD COLUMN rando FLOAT DEFAULT NULL;
UPDATE `table` SET rando = RAND() WHERE rando IS NULL;
Then do
SELECT *
FROM `table`
WHERE rando > RAND() * 0.9
AND condition = 0
ORDER BY rando
LIMIT 5000
Do it again for condition = 1 and Bob's your uncle. It will pull rows in random order starting from a random row.
A couple of notes:
0.9 is there to improve the chances you'll actually get 5000 rows and not some lesser number.
You may have to add LIMIT 1000 to the UPDATE statement and run it a whole bunch of times to populate the complete rando column (see the sketch after these notes): trying to update all the rows in a big table can generate a huge transaction and swamp your server for a long time.
If you need to generate another random sample, run the UPDATE or UPDATEs again.
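A sketch of that batched population (the batch size of 1000 comes from the note above):
-- run repeatedly; stop once it reports 0 rows affected
UPDATE `table`
SET rando = RAND()
WHERE rando IS NULL
LIMIT 1000;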
The textbook solution would be to run two queries, one for rows with true and one for rows with false:
SELECT * FROM mytable WHERE `condition`=true ORDER BY RAND() LIMIT 5000;
SELECT * FROM mytable WHERE `condition`=false ORDER BY RAND() LIMIT 5000;
The WHERE clause applies first, to reduce the matching rows, then it sorts the subset of rows randomly and picks up to 5000 of them. The result is a random subset.
This solution has an advantage that it returns a pretty evenly distributed set of random rows, and it automatically handles cases like there being an unknown proportion of true to false in the table, and even handles if one of the condition values matches fewer than 5000 rows.
The disadvantage is that it's incredibly expensive to sort such a large set of rows, and an index does not help you sort by a nondeterministic expression like RAND().
You could do this with window functions if you need it to be a single SQL query, but it would still be very expensive.
SELECT t.*
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY `condition` ORDER BY RAND()) AS rownum
    FROM mytable
) AS t
WHERE t.rownum <= 5000;
Another alternative that does not use a random sort operation would be to do a table-scan, and pick a random subset of rows. But you need to know roughly how many rows match each condition value, so that you can estimate the fraction of these that would make ~5000 rows. Say for example there are 1 million rows with the true value and 500k rows with the false value:
SELECT * FROM mytable WHERE `condition`=true AND RAND()*1000000 < 5000;
SELECT * FROM mytable WHERE `condition`=false AND RAND()*500000 < 5000;
This is not guaranteed to return exactly 5000 rows, because of the randomness. But probably pretty close. And a table-scan is still quite expensive.
The answer from O.Jones gives me another idea. If you can add a column, then you can add an index on that column.
ALTER TABLE `table`
ADD COLUMN rando FLOAT DEFAULT NULL,
ADD INDEX (`condition`, rando);
UPDATE `table` SET rando = RAND() WHERE rando IS NULL;
Then you can use indexed searches. Again, you need to know how many rows match each value to do this.
SELECT * FROM mytable
WHERE `condition`=true AND rando < 5000/1000000
ORDER BY `condition`, rando
LIMIT 5000;
SELECT * FROM mytable
WHERE `condition`=false AND rando < 5000/500000
ORDER BY `condition`, rando
LIMIT 5000;
The ORDER BY in this case should be a no-op if the index I added is used. The rows will be read in index order anyway, and MySQL's optimizer will not do any work to sort them.
This solution will be much faster, because it doesn't have to sort anything, and doesn't have to do a table-scan. MySQL has an optimization to bail out of a query once the LIMIT has been satisfied.
But the disadvantage is that it doesn't return a different random result when you run the SELECT again, or if different clients run the query. You would have to use UPDATE to re-randomize the whole table to get a different result. This might not be suitable depending on your needs.
Let's say I have a black-box query whose inner workings I don't really understand, something along the lines of:
SELECT ... FROM ... JOIN ... (denoted as A)
Let's say A returns 500 rows.
I want to get the count of the rows (500 in this case), and then return only 50 of them.
How can I write a query built around A that would return the number '500' and 50 rows of data?
You can use window functions (available from MySQL 8.0 onwards) and a row-limiting clause:
select a.*, count(*) over() total_rows
from ( <your query> ) a
order by ??
limit 50
Note that I added an order by clause to the query. Although this is not technically required, it is a best practice: without an order by clause on a column (or set of columns) that uniquely identifies each row, it is undefined which 50 rows the database will return, and the results may not be consistent across consecutive executions of the same query.
This is what SELECT SQL_CALC_FOUND_ROWS is intended to do.
SELECT SQL_CALC_FOUND_ROWS * FROM tbl_name WHERE id > 100 LIMIT 10;
SELECT FOUND_ROWS();
The first query returns the limited set of rows.
The second query calls FOUND_ROWS(), which returns the number of rows matched by the most recent query, i.e. the number of rows that would have been returned if that query had not used LIMIT.
See https://dev.mysql.com/doc/refman/8.0/en/information-functions.html#function_found-rows
However, keep in mind that using SQL_CALC_FOUND_ROWS incurs a significant performance cost. Benchmarks show that it's usually faster to just run two queries:
SELECT COUNT(*) FROM tbl_name WHERE id > 100; -- the count of matching rows
SELECT * FROM tbl_name WHERE id > 100 LIMIT 10; -- the limited result
See https://www.percona.com/blog/2007/08/28/to-sql_calc_found_rows-or-not-to-sql_calc_found_rows/
There are a few ways you can do this (assuming that I am understanding your question correctly). You can run two queries (and point a cursor to each) and then open and return both cursors, or you can run a stored procedure in which the count query is run first, the result is stored in a variable, and then that variable is used in another query.
Let me know if you would like an example of either of these.
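For the record, here is a minimal sketch of the stored-procedure variant; the table name tbl_name and the id > 100 filter are hypothetical stand-ins for your black-box query A:
DELIMITER //
CREATE PROCEDURE count_then_limit()
BEGIN
  DECLARE v_total INT;
  -- run the count query first and store the result in a variable
  SELECT COUNT(*) INTO v_total FROM tbl_name WHERE id > 100;
  -- first result set: the total; second result set: the limited rows
  SELECT v_total AS total_rows;
  SELECT * FROM tbl_name WHERE id > 100 LIMIT 50;
END //
DELIMITER ;
CALL count_then_limit();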
Scenario in short: a table with more than 16 million records [2GB in size]. The higher the LIMIT offset in a SELECT, the slower the query becomes when using ORDER BY *primary_key*.
So
SELECT * FROM large ORDER BY `id` LIMIT 0, 30
takes far less time than
SELECT * FROM large ORDER BY `id` LIMIT 10000, 30
Both queries order only 30 records, the same either way, so the overhead is not from ORDER BY.
Now, fetching the latest 30 rows takes around 180 seconds. How can I optimize this simple query?
I had the exact same problem myself. Given that you want to collect a large amount of this data and not a specific set of 30 rows, you'll probably be running a loop and incrementing the offset by 30.
So what you can do instead is:
Hold the last id of the previous set of 30 rows (e.g. lastId = 530)
Add the condition WHERE id > lastId LIMIT 0, 30
So you can always have a ZERO offset. You will be amazed by the performance improvement.
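A quick sketch of one step of that loop, using the lastId = 530 example from above:
-- previous page ended at id 530, so the next page starts right after it
SELECT * FROM large WHERE id > 530 ORDER BY id LIMIT 0, 30;
-- remember the highest id of this result set and reuse it for the next page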
It's normal that higher offsets slow the query down, since the query needs to count off the first OFFSET + LIMIT records (and return only LIMIT of them). The higher this value is, the longer the query runs.
The query cannot go right to OFFSET because, first, the records can be of different length, and, second, there can be gaps from deleted records. It needs to check and count each record on its way.
Assuming that id is the primary key of a MyISAM table, or a unique non-primary key field on an InnoDB table, you can speed it up by using this trick:
SELECT t.*
FROM (
    SELECT id
    FROM mytable
    ORDER BY id
    LIMIT 10000, 30
) q
JOIN mytable t
ON t.id = q.id
See this article:
MySQL ORDER BY / LIMIT performance: late row lookups
MySQL cannot go directly to the 10000th record (or the 80000th byte as you're suggesting) because it cannot assume that the records are packed/ordered like that (or that the ids run continuously from 1 to 10000). Although it might be that way in actuality, MySQL cannot assume that there are no holes/gaps/deleted ids.
So, as bobs noted, MySQL will have to fetch 10000 rows (or traverse the first 10000 entries of the index on id) before finding the 30 to return.
EDIT: To illustrate my point:
Note that although
SELECT * FROM large ORDER BY id LIMIT 10000, 30
would be slow(er),
SELECT * FROM large WHERE id > 10000 ORDER BY id LIMIT 30
would be fast(er), and would return the same results provided that there are no missing ids (i.e. gaps).
I found an interesting way to optimize SELECT ... ORDER BY id LIMIT X, Y queries.
I have 35 million rows, so it took about 2 minutes to find a range of rows.
Here is the trick:
SELECT id, name, address, phone
FROM customers
WHERE id > 990
ORDER BY id LIMIT 1000;
Just adding a WHERE clause with the last id you fetched increases performance a lot. For me it went from 2 minutes to 1 second :)
Other interesting tricks here : http://www.iheavy.com/2013/06/19/3-ways-to-optimize-for-paging-in-mysql/
It works with strings too.
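For instance, a sketch of the same trick keyed on a string column; 'Dupont' is a made-up last-seen value, and an index on name is assumed:
SELECT id, name, address, phone
FROM customers
WHERE name > 'Dupont'  -- the last name fetched on the previous page (hypothetical)
ORDER BY name
LIMIT 1000;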
The time-consuming part of the two queries is retrieving the rows from the table. Logically speaking, in the LIMIT 0, 30 version, only 30 rows need to be retrieved. In the LIMIT 10000, 30 version, 10000 rows are evaluated and 30 rows are returned. Some optimization of the data-reading process is possible, but consider the following:
What if you had a WHERE clause in the queries? The engine must find all rows that qualify, then sort the data, and finally get the 30 rows.
Also consider the case where rows are not processed in the ORDER BY sequence. All qualifying rows must be sorted to determine which rows to return.
For those who are interested in a comparison and figures :)
Experiment 1: The dataset contains about 100 million rows. Each row contains several BIGINT, TINYINT, as well as two TEXT fields (deliberately) containing about 1k chars.
Blue := SELECT * FROM post ORDER BY id LIMIT {offset}, 5
Orange := @Quassnoi's method. SELECT t.* FROM (SELECT id FROM post ORDER BY id LIMIT {offset}, 5) AS q JOIN post t ON t.id = q.id
Of course, the third method, ... WHERE id>xxx LIMIT 0,5, does not appear here since it should be constant time.
Experiment 2: A similar thing, except that each row has only three BIGINTs.
green := the blue query from before
red := the orange query from before
The below statement does not work, but I can't seem to figure out why:
select AVG(delay_in_seconds) from A_TABLE ORDER by created_at DESC GROUP BY row_type limit 1000;
I want to get the averages of the most recent 1000 rows for each row_type. created_at is of type DATETIME and row_type is of type VARCHAR.
If you only want the 1000 most recent rows, regardless of row_type, and then get the average of delay_in_seconds for each row_type, that's a fairly straightforward query. For example:
SELECT t.row_type
, AVG(t.delay_in_seconds)
FROM (
SELECT r.row_type
, r.delay_in_seconds
FROM A_table r
ORDER BY r.created_at DESC
LIMIT 1000
) t
GROUP BY t.row_type
I suspect, however, that this query does not satisfy the requirements that were specified. (I know it doesn't satisfy what I understood as the specification.)
If what we want is the average of the most recent 1000 rows for each row_type, that would also be fairly straightforward... if we were using a database that supported analytic functions.
Unfortunately, MySQL (before version 8.0) doesn't provide support for analytic functions. It is possible to emulate one, but the syntax is a bit involved and depends on behavior that is not guaranteed.
As an example:
SELECT s.row_type
     , AVG(s.delay_in_seconds)
  FROM (
         SELECT @row_ := IF(@prev_row_type = t.row_type, @row_ + 1, 1) AS row_
              , @prev_row_type := t.row_type AS row_type
              , t.delay_in_seconds
           FROM A_table t
          CROSS
           JOIN (SELECT @prev_row_type := NULL, @row_ := NULL) i
          ORDER BY t.row_type DESC, t.created_at DESC
       ) s
 WHERE s.row_ <= 1000
 GROUP
    BY s.row_type
NOTES:
The inline view query is going to be expensive for large sets. What it's effectively doing is assigning a row number to each row. The ORDER BY sorts the rows in descending sequence by created_at; what we want is for the most recent row to be assigned a value of 1, the next most recent 2, and so on. This numbering of rows is repeated for each distinct value of row_type.
For performance, we'd want a suitable index with leading columns (row_type, created_at, delay_in_seconds) to avoid an expensive "Using filesort" operation; a sketch of that index follows these notes. We need at least the first two columns for that; including delay_in_seconds makes it a covering index (the query can be satisfied entirely from the index).
The outer query then runs against the resultset returned from the inline view (a "derived table"). The predicate in the WHERE clause filters out all rows that were assigned a row number greater than 1000; the rest is a straightforward GROUP BY and an AVG aggregate.
A LIMIT clause is entirely unnecessary. It may be possible to incorporate some additional predicates for a further performance gain, e.g. the most recent 1000 rows, but only those with created_at within the past 30 or 90 days.
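A sketch of the covering index described in the notes above (the index name is made up):
ALTER TABLE A_table
  ADD INDEX ix_rowtype_created_delay (row_type, created_at, delay_in_seconds);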
(I'm not entirely sure this answers the question that OP was asking. What this answers is: Is there a query that can return the specified resultset, making use of AVG aggregate and GROUP BY, ORDER BY and LIMIT clauses.)
N.B. This query is dependent on a behavior of MySQL user-defined variables which is not guaranteed.
The query above shows one approach, but there is also another: it's possible to use a join operation (of A_table with A_table) to get a row number assigned, by getting a COUNT of the number of rows that are "more recent" than each row. With large sets, however, that can produce a humongous intermediate result if we aren't careful to limit it.
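A rough sketch of that self-join approach, assuming created_at values are unique within each row_type (ties would need a tiebreaker):
SELECT s.row_type
     , AVG(s.delay_in_seconds)
  FROM (
         -- for each row, count how many rows of the same type are at least as recent
         SELECT t.row_type
              , t.delay_in_seconds
              , COUNT(*) AS row_
           FROM A_table t
           JOIN A_table u
             ON u.row_type = t.row_type
            AND u.created_at >= t.created_at
          GROUP BY t.row_type, t.created_at, t.delay_in_seconds
       ) s
 WHERE s.row_ <= 1000
 GROUP BY s.row_type;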
Write the ORDER BY at the end of the statement.
SELECT AVG(delay_in_seconds) from A_TABLE GROUP BY row_type ORDER by created_at DESC limit 1000;
Read the MySQL dev site for details.
I need to get the total number of rows when using LIMIT in my query, to avoid querying twice.
Is it possible?
Use FOUND_ROWS():
For a SELECT with a LIMIT clause, the number of rows that would be returned were there no LIMIT clause
Use the statement right after your SELECT query, which needs the SQL_CALC_FOUND_ROWS keyword. Example from the manual:
SELECT SQL_CALC_FOUND_ROWS * FROM tbl_name
WHERE id > 100 LIMIT 10;
SELECT FOUND_ROWS();
Note that this puts additional strain on the database, because it has to find out the size of the full result set every time. Use SQL_CALC_FOUND_ROWS only when you need it.