Optimization of my Custom RAND() query - mysql

I use the following query to get a random row in MySQL. I think it should be faster than ORDER BY RAND(), since it just returns the row at a random offset and doesn't require sorting the rows.
SELECT COUNT(ID) FROM TABLE_NAME
-- GENERATE A RANDOM NUMBER BETWEEN 0 AND COUNT(ID)-1
SELECT x FROM TABLE_NAME LIMIT RANDOM_NUMBER,1
But I need to know whether it can be optimized any further, and whether there is a faster method.
I would also be grateful to know whether I can combine the two queries, since LIMIT doesn't accept such sub-queries (as far as I know).
EDIT - My query does not generate a random ID. Instead, it generates a random number between 0 and the total number of rows, and then uses that number as the offset to fetch the row at that position.
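One way to combine the two steps without a subquery in LIMIT is a prepared statement; a minimal sketch, assuming the TABLE_NAME/ID/x names from the question and a MySQL version that accepts placeholders in LIMIT within PREPARE/EXECUTE (the statement name pick_row and the CAST are only illustrative):
-- compute a random zero-based offset, then pass it to LIMIT via a placeholder
SELECT @offset := CAST(FLOOR(RAND() * COUNT(ID)) AS UNSIGNED) FROM TABLE_NAME;
PREPARE pick_row FROM 'SELECT x FROM TABLE_NAME LIMIT ?, 1';
EXECUTE pick_row USING @offset;
DEALLOCATE PREPARE pick_row;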

EDIT: My answer assumes MySQL < 5.5.6, where you cannot pass a variable to LIMIT and OFFSET. Otherwise, the OP's method is the best.
The most reliable solution, imo, would be to rank your results to eliminate the gaps. My solution might not be optimal since I'm not used to MySQL, but the logic works (or worked in my SQLFiddle).
SET @total = 0;
SELECT @total := COUNT(1) FROM test;
SET @random = FLOOR(RAND() * @total) + 1;
SET @rank = 0;
SELECT * from
  (SELECT @rank := @rank + 1 as rank, id, name
   FROM test
   order by id) derived_table
where rank = @random;
I'm not sure how well this structure will hold up if you use it on a massive table, but as long as you're within a few hundred rows it should be near-instant.
Basically, you generate a random row number (this is probably the place with the most room for optimization) with:
SET @total = 0;
SELECT @total := COUNT(1) FROM test;
SET @random = FLOOR(RAND() * @total) + 1;
Then, you rank all of your rows to eliminate the gaps:
SELECT @rank := @rank + 1 as rank, id, name
FROM test
order by id
And then you select the randomly chosen row:
SELECT * from
  (ranked derived table) derived_table
where rank = @random;
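For reference, the same ranking approach can be condensed by seeding the counter inside a derived table; a sketch, assuming the same test table (the rank alias is backquoted only to stay safe on newer MySQL versions):
SET @total := (SELECT COUNT(*) FROM test);
SET @random := FLOOR(RAND() * @total) + 1;
SELECT id, name
FROM (SELECT @rank := @rank + 1 AS `rank`, id, name
      FROM test, (SELECT @rank := 0) init
      ORDER BY id) ranked
WHERE `rank` = @random;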

I think the query you want is:
select x.*
from tablename x
where x.id >= random_number
order by x.id
limit 1;
This should use an index on x.id and should be quite fast. You can combine them as:
select x.*
from tablename x cross join
(select cast(max(id) * rand() as unsigned) as random_number from tablename
) c
where x.id >= random_number
order by x.id
limit 1;
Note that you should use max(id) rather than count(*), because there can be gaps in the ids. The subquery should also make use of an index on id.
EDIT:
I won't be defensive about the above solution. It returns a random id, but the id is not uniformly distributed.
My preferred method, in any case, is:
select x.*
from tablename x cross join
(select count(*) as cnt from tablename) cnt
where rand() < 100 / cnt
order by rand()
limit 1;
It is possible, but highly unlikely, that the where condition will return no rows. The final order by rand() only processes about 100 rows, so it should go pretty fast.

There are 5 techniques in http://mysql.rjweb.org/doc.php/random . None of them have to look at the entire table.
Do you have an AUTO_INCREMENT? With or without gaps? These and other questions need answering to know which technique in that link is even applicable.
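For instance, if the table has a gap-free AUTO_INCREMENT id, a single indexed lookup is enough; a sketch (not quoted from the link; tbl and id are placeholder names):
SELECT t.*
FROM tbl t
JOIN (SELECT FLOOR(1 + RAND() * MAX(id)) AS rid FROM tbl) r ON t.id = r.rid;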

Try caching the result of the first query and then using it in the second query. Running both in the same query will be very heavy on the system.
As for the second query, try the following:
SELECT x FROM TABLE_NAME WHERE ID = RANDOM_NUMBER
The above query is much faster than yours (assuming ID is indexed)
Of course, the above query assumes that you are using sequential IDs (no gaps). If there are gaps, then you will need to create another sequential field (maybe call it ID2) and then execute the above query on that field.
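A sketch of how such a sequential field could be filled and then used (ID2 is the hypothetical column name from above; the index name is illustrative):
ALTER TABLE TABLE_NAME ADD COLUMN ID2 INT, ADD INDEX idx_id2 (ID2);
-- one-off renumbering; needs re-running (or application upkeep) after deletes
SET @n := 0;
UPDATE TABLE_NAME SET ID2 = (@n := @n + 1) ORDER BY ID;
-- now a random row is a single indexed read
SET @pick := (SELECT FLOOR(1 + RAND() * MAX(ID2)) FROM TABLE_NAME);
SELECT x FROM TABLE_NAME WHERE ID2 = @pick;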

Related

SQL Optimization on SELECT random id (with WHERE clause)

I'm currently working on a multi-threaded program (in Java) that needs to select random rows in a database in order to update them. This is working well, but I started to encounter some performance issues with my SELECT request.
I tried multiple solutions before finding this website:
http://jan.kneschke.de/projects/mysql/order-by-rand/
I tried with the following solution :
SELECT * FROM Table
JOIN (SELECT FLOOR( COUNT(*) * RAND() ) AS Random FROM Table)
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL
LIMIT 1;
It selects only one row whose id is just above the randomly generated number. This works pretty well (an average of less than 100 ms per request on 150k rows). But after my program processes a row, its FOREIGNKEY_ID will no longer be NULL (it will be updated with some value).
The problem is, my SELECT will "forget" some rows that have an id below the randomly generated one, and I won't be able to process them.
So I tried to adapt my request, doing this :
SELECT * FROM Table
JOIN (SELECT FLOOR(
(SELECT COUNT(id) FROM Table WHERE FOREIGNKEY_ID IS NULL) * RAND() )
AS Random FROM Table)
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL
LIMIT 1;
With that request there are no more skipped rows, but performance decreases drastically (an average of 1 s per request on 150k rows).
I could simply execute the fast one while I still have a lot of rows to process, and switch to the slow one when only a few rows remain, but that would be a "dirty" fix in the code, and I would prefer an elegant SQL request that can do the job.
Thank you for your help, please let me know if I'm not clear or if you need more details.
For your method to work more generally, you want max(id) rather than count(*):
SELECT t.*
FROM Table t JOIN
(SELECT FLOOR(MAX(id) * RAND() ) AS Random FROM Table) r
ON t.ID > r.Random
WHERE t.FOREIGNKEY_ID IS NULL
ORDER BY t.ID
LIMIT 1;
The ORDER BY is usually added to be sure that the "next" id is returned. Without it, MySQL could in theory always return the maximum id in the table.
The problem is gaps in the ids. And it is easy to create distributions where some rows are almost never chosen . . . say that the four ids are 1, 2, 3, and 1000. Your method will almost never get 1000. The above will almost always get it.
Perhaps the simplest solution to your problem is to run the first query multiple times until it returns a valid row. The next suggestion would be an index on (FOREIGNKEY_ID, ID), which the subquery can use. That might speed up the query.
I tend to favor something more along these lines:
SELECT t.id
FROM Table t
WHERE t.FOREIGNKEY_ID IS NULL AND
RAND() < 1.0 / 1000
ORDER BY RAND()
LIMIT 1;
The purpose of the WHERE clause is to reduce the volume considerably, so the ORDER BY doesn't take much time.
Unfortunately, this will require scanning the table, so you probably won't get responses in the 100 ms range on a 150k table. You can reduce that to an index scan with an index on t(FOREIGNKEY_ID, ID).
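For reference, the suggested index could be created like this (the index name is illustrative; Table as in the question):
CREATE INDEX idx_fk_id ON `Table` (FOREIGNKEY_ID, ID);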
EDIT:
If you want a reasonable chance of a uniform distribution and performance that does not increase as the table gets larger, here is another idea, which -- alas -- requires a trigger.
Add a new column to the table called random, which is initialized with rand(). Build an index on `random`. Then run a query such as:
select t.*
from ((select t.*
       from t
       where random >= @random
       order by random
       limit 10
      ) union all
      (select t.*
       from t
       where random < @random
       order by random desc
       limit 10
      )
     ) t
order by rand()
limit 1;
The idea is that the subqueries can use the index to choose a set of 20 rows that are pretty arbitrary -- 10 before and after the chosen point. The rows are then sorted (some overhead, which you can control with the limit number). These are randomized and returned.
The idea is that if you choose random numbers, there will be arbitrary gaps, and these would make the chosen numbers not quite uniform. However, by taking a larger sample around the value, the probability of any one value being chosen should approach a uniform distribution. The uniformity would still have edge effects, but these should be minor on a large amount of data.
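A sketch of the supporting schema changes described above (column, index, and trigger names are illustrative; t as in the query). Before running the query, seed the probe point with something like SET @random := RAND();
-- add the random column, backfill it, and index it
ALTER TABLE t ADD COLUMN random DOUBLE;
UPDATE t SET random = RAND();
CREATE INDEX idx_random ON t (random);
-- keep it populated for new rows
DELIMITER //
CREATE TRIGGER t_bi_random BEFORE INSERT ON t
FOR EACH ROW
BEGIN
  SET NEW.random = RAND();
END//
DELIMITER ;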
Your IDs are probably going to contain gaps. Anything that works with COUNT(*) is not going to be able to find all the IDs.
A table with records with IDs 1, 2, 3, 10, 11, 12, 13 has only 7 records. Doing a random pick with COUNT(*) will often result in a miss, as records 4, 5 and 6 do not exist, and it will then pick the nearest ID, which is 3. This is not only unbalanced (it will pick 3 far too often) but it will also never pick records 10-13.
To get a fair, uniformly distributed random selection of records, I would suggest loading the IDs of the table first. Even for 150k rows, loading a set of integer ids will not consume a lot of memory (< 1 MB):
SELECT id FROM table;
You can then use a function like Collections.shuffle to randomize the order of the IDs. To get the rest of the data, you can select records one at a time, or for example 10 at a time:
SELECT * FROM table WHERE id = :id
Or:
SELECT * FROM table WHERE id IN (:id1, :id2, :id3)
This should be fast if the id column has an index, and it will give you a proper random distribution.
If prepared statements can be used, then this should work:
SELECT @skip := FLOOR(RAND() * COUNT(*)) FROM Table WHERE FOREIGNKEY_ID IS NULL;
PREPARE STMT FROM 'SELECT * FROM Table WHERE FOREIGNKEY_ID IS NULL LIMIT ?, 1';
EXECUTE STMT USING @skip;
LIMIT in a SELECT statement can be used to skip rows.

How to query the row with the lowest value, while also getting the highest value?

Consider these two queries:
SELECT *, 'b' AS b FROM someTable ORDER BY a ASC LIMIT 1;
SELECT *, MAX(a) AS maxA FROM someTable ORDER BY a ASC LIMIT 1;
The former query returns the row with the lowest value of a, as expected. The latter query returns the first row stored on disk (usually the row with the lowest value for the primary key). How can I work around this? My intention is to get the full row with the lowest a value (if there is more than one I only need one, it does not matter which), and additionally I do need the value of the highest a. In a perfect world I would run two queries, but due to the way that objects are serialised in this application I cannot do that without refactoring a lot of code that isn't mine. I actually don't mind if the MySQL engine itself must query twice; the important bit is that the output be returned in a single row. I cannot write a stored procedure for this query, unfortunately. And yes, the * operator is important, I cannot list the needed fields. And there are too many rows to return them all!
Note that this question is superficially similar to a previous question, however the question asked there was ill-formed and ambiguous, therefore all the answers addressed the issue that was not my intention (however useful, I did learn much and I'm happy that it turned out that way). This question asks the intended question more clearly and so should attract different answers.
Why not just run this:
SELECT MIN(a) as minA, MAX(a) AS maxA FROM someTable
Unfortunately, MySQL (before 8.0) doesn't support window functions. So if you really want to select * along with the min/max values, I guess you'll have to resort to a JOIN:
SELECT * FROM
(
SELECT * FROM someTable ORDER BY a ASC LIMIT 1
) t1
CROSS JOIN
(
SELECT MIN(a) as minA, MAX(a) AS maxA FROM someTable
) t2
Or to a subselect, as given in Imre L's answer
Use a subquery in the SELECT part:
SELECT *, 'b' AS b,
(SELECT MAX(a) FROM someTable) AS maxA
FROM someTable ORDER BY a ASC LIMIT 1;

Quickly select random ID from mysql table with millions of non-sequential records

I've looked around and there doesn't seem to be any easy way to do this. It almost looks like it's easier to just grab a subset of records and do all the randomizing in code (Perl). The methods I've seen online seem geared to at most hundreds of thousands of rows, but certainly not millions.
The table I'm working with has 6 million records (and growing); the IDs are auto-incremented, but not always present in the table (it's not gapless).
I've tried the LIMIT 1 query that's been recommended, but the query takes forever to run -- is there a quick way to do this, given that there are gaps in the records? I can't just take the max and randomize over the range.
Update:
One idea I had was to grab the max, randomize a limit based on the max, then grab a range of 10 records from random_limit_1 to random_limit_2, and take the first record found in that range.
Or, if I know the max, is there a way I can just pick, say, the 5th record of the table, without having to know which ID it is? Then I'd just grab the id of that record.
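For the second idea, a LIMIT offset picks the Nth row without knowing its id; a minimal sketch (the offset is zero-based, so 4 means the 5th row in id order):
SELECT id FROM `table` ORDER BY id LIMIT 4, 1;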
Update:
This query is somewhat faster-ish. Still not fast enough =/
SELECT t.id
FROM table t
JOIN (SELECT FLOOR(max(id) * rand()) as maxid FROM table) as tt
  ON t.id >= tt.maxid
LIMIT 1
SELECT * FROM TABLE ORDER BY RAND() LIMIT 1;
Ok, this is slow. If you search for ORDER BY RAND() MySQL, you will find a lot of results saying that it is very slow, and that is the case. I did a little research and found this alternative: MySQL rand() is slow on large datasets
I hope this is better
Yeah, the idea seems good:
select min(ID), max(ID) from table into @min, @max;
set @range = @max - @min;
set @mr = @min + ((@range / 1000) * (rand() * 1000));
select ID from table
where ID >= @mr and ID <= @mr + 1000
order by rand()
limit 1
-- into @result
;
You may change 1000 to 10000 or whatever is needed to scale...
EDIT: you could also try this:
select ID from table
where (ID % 1000) = floor(rand() * 1000)
order by rand()
limit 1
;
Splits it along different lines...
EDIT 2:
See: What is the best way to pick a random row from a table in MySQL?
This is probably the fastest way:
select @row := floor(count(*) * rand()) from some_tbl;
select some_ID from some_tbl limit @row, 1;
Unfortunately, variables can't be used in the LIMIT clause, so you'd have to use a dynamic query, either writing the query string in code or using PREPARE and EXECUTE. Also, LIMIT n, 1 still requires scanning n rows into the table, so on average it's only about twice as fast as the second method listed above. (Though it is probably more uniform and guarantees that a matching row will always be found.)
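A sketch of the PREPARE/EXECUTE variant mentioned above, reusing the names from the snippet (the CAST simply keeps the offset an integer, which LIMIT requires):
SELECT @row := CAST(FLOOR(COUNT(*) * RAND()) AS UNSIGNED) FROM some_tbl;
PREPARE stmt FROM 'SELECT some_ID FROM some_tbl LIMIT ?, 1';
EXECUTE stmt USING @row;
DEALLOCATE PREPARE stmt;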
SELECT ID
FROM YourTable
ORDER BY RAND() LIMIT 1;

How to optimize a MySQL query so that a selected value in a WHERE clause is only computed once?

I need to randomly select, in an efficient way, 10 rows from my table.
I found out that the following works nicely (after the query, I just select 10 random elements in PHP from the 10 to 30 I get from the query):
SELECT * FROM product WHERE RAND() <= (SELECT 20 / COUNT(*) FROM product)
However, the subquery, though relatively cheap, is computed for every row in the table. How can I prevent that? With a variable? A join?
Thanks!
A variable would do it. Something like this:
SELECT @myvar := (SELECT 20 / COUNT(*) FROM product);
SELECT * FROM product WHERE RAND() <= @myvar;
Or, from the MySQL mathematical functions documentation:
You cannot use a column with RAND() values in an ORDER BY clause, because ORDER BY would evaluate the column multiple times. However, you can retrieve rows in random order like this:
mysql> SELECT * FROM tbl_name ORDER BY RAND();
ORDER BY RAND() combined with LIMIT is useful for selecting a random sample from a set of rows:
mysql> SELECT * FROM table1, table2 WHERE a=b AND c<d ORDER BY RAND() LIMIT 1000;
RAND() is not meant to be a perfect random generator. It is a fast way to generate random numbers on demand that is portable between platforms for the same MySQL version.
It's a highly MySQL-specific trick, but by wrapping it in another subquery MySQL will treat it as a constant table and compute it only once.
SELECT * FROM product WHERE RAND() <= (
  select * from ( SELECT 20 / COUNT(*) FROM product ) as const_table
)
SELECT * FROM product ORDER BY RAND() LIMIT 10
Don't use order by rand(). This will result in a table scan. If you have much data at all in your table, this will not be efficient. First determine how many rows are in the table:
select count(*) from table might work for you, though you should probably cache this value for some time, since it can be slow for large datasets.
explain select * from table will give you the optimizer's statistics for the table (how many rows the statistics think are in the table). This is much faster, but it is less accurate, and less accurate still for InnoDB.
once you have the number of rows, you should write some code like:
pseudo code:
StringBuilder sql = new StringBuilder("SELECT * FROM product WHERE id IN (");
for (int i = 0; i < numResults; i++) {
    sql.append((int) (Math.random() * tableRows));
    if (i < numResults - 1) {
        sql.append(", ");
    }
}
sql.append(")");
this will give you fast lookup on PK and avoid the table scan.
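The loop above would produce a statement along these lines (the ids are illustrative):
SELECT * FROM product WHERE id IN (48213, 907, 331745, 12006);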

Selecting last row WITHOUT any kind of key

I need to get the last (newest) row in a table (using MySQL's natural order - i.e. what I get without any kind of ORDER BY clause), however there is no key I can ORDER BY on!
The only 'key' in the table is an indexed MD5 field, so I can't really ORDER BY on that. There's no timestamp, autoincrement value, or any other field that I could easily ORDER on either. This is why I'm left with only the natural sort order as my indicator of 'newest'.
And, unfortunately, changing the table structure to add a proper auto_increment is out of the question. :(
Anyone have any ideas on how this can be done w/ plain SQL, or am I SOL?
If it's MyISAM you can do it in two queries
SELECT COUNT(*) FROM yourTable;
SELECT * FROM yourTable LIMIT useTheCountHere - 1,1;
This is unreliable, however, because:
It assumes rows are only added to this table and never deleted.
It assumes no other writes are performed to this table in the meantime (you can lock the table)
MyISAM tables can be reordered using ALTER TABLE, so that the insert order is no longer preserved.
It's not reliable at all in InnoDB, since this engine can reorder the table at will.
Can I ask why you need to do this?
In Oracle, and possibly MySQL too, the optimiser will choose the quickest access path / order in which to return your results. So even if your data were static, there is potential to run the same query twice and get a different answer.
You can assign row numbers using the ROW_NUMBER window function (available in MySQL 8.0+) and then sort by this value using the ORDER BY clause.
SELECT *,
ROW_NUMBER() OVER() AS rn
FROM table
ORDER BY rn DESC
LIMIT 1;
Basically, you can't do that.
Normally I'd suggest adding a surrogate primary key with auto-increment and ORDER BY that:
SELECT *
FROM yourtable
ORDER BY id DESC
LIMIT 1
But in your question you write...
changing the table structure to add a proper auto_increment is out of the question.
So another, less pleasant option I can think of is a simulated ROW_NUMBER using variables:
SELECT * FROM
(
SELECT T1.*, #rownum := #rownum + 1 AS rn
FROM yourtable T1, (SELECT #rownum := 0) T2
) T3
ORDER BY rn DESC
LIMIT 1
Please note that this has serious performance implications: it requires a full scan, and the results are not guaranteed to be returned in any particular order in the subquery - you might get them in sort order, but then again you might not; when you don't specify the order, the server is free to choose any order it likes. Now, it will probably choose the order they are stored on disk, in order to do as little work as possible, but relying on this is unwise.
Without an order by clause you have no guarantee of the order in which you will get your result. The SQL engine is free to choose any order.
But if for some reason you still want to rely on this order, then the following will indeed return the last record from the result (MySQL only):
select *
from (select *,
             @rn := @rn + 1 rn
      from mytable,
           (select @rn := 0) init
     ) numbered
where rn = @rn
In the subquery the records are retrieved without ORDER BY and are given a sequential number. The outer query then selects only the one that got the last attributed number.
We can use HAVING for that kind of problem:
SELECT MAX(id) AS last_id, column1, column2 FROM table HAVING id = last_id;