I'm currently working on a multithreaded Java program that needs to select random rows from a database in order to update them. This works well, but I have started to run into performance issues with my SELECT query.
I tried multiple solutions before finding this website:
http://jan.kneschke.de/projects/mysql/order-by-rand/
I tried the following solution:
SELECT * FROM Table
JOIN (SELECT FLOOR( COUNT(*) * RAND() ) AS Random FROM Table)
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL
LIMIT 1;
It selects a single row whose id is greater than the randomly generated number. This works pretty well (an average of less than 100 ms per request on 150k rows). But once my program has processed a row, its FOREIGNKEY_ID is no longer NULL (it is updated with some value).
The problem is that my SELECT will "forget" some rows that have an id below the randomly generated number, and I won't be able to process them.
So I tried to adapt my query like this:
SELECT * FROM Table
JOIN (SELECT FLOOR(
(SELECT COUNT(id) FROM Table WHERE FOREIGNKEY_ID IS NULL) * RAND() )
AS Random FROM Table)
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL
LIMIT 1;
With that query, rows are no longer skipped, but performance drops drastically (an average of 1 s per request on 150k rows).
I could simply run the fast query while I still have a lot of rows to process and switch to the slow one when only a few rows remain, but that would be a "dirty" fix in the code, and I would prefer an elegant SQL query that does the job.
Thank you for your help, please let me know if I'm not clear or if you need more details.
For your method to work more generally, you want max(id) rather than count(*):
SELECT t.*
FROM Table t JOIN
(SELECT FLOOR(MAX(id) * RAND() ) AS Random FROM Table) r
ON t.ID > r.Random
WHERE t.FOREIGNKEY_ID IS NULL
ORDER BY t.ID
LIMIT 1;
The ORDER BY is usually added to be sure that the "next" id is returned. In theory, MySQL could always return the maximum id in the table.
The problem is gaps in the ids. It is easy to create distributions where a row is never chosen. Say that the four ids are 1, 2, 3, 1000: your method will never get 1000. The above will almost always get it.
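To make the effect of gaps concrete, here is a small simulation of the MAX(id)-based pick on the ids 1, 2, 3, 1000. It is only a sketch in plain Python, not MySQL; the function names are made up for the illustration.

```python
import random
from collections import Counter

def pick_next_id(ids, rng):
    # Mimics FLOOR(MAX(id) * RAND()) followed by
    # "WHERE id > threshold ORDER BY id LIMIT 1".
    threshold = int(max(ids) * rng.random())
    return min(i for i in ids if i > threshold)

def simulate(ids, draws=10_000, seed=0):
    # Count how often each id is returned over many draws.
    rng = random.Random(seed)
    return Counter(pick_next_id(ids, rng) for _ in range(draws))

counts = simulate([1, 2, 3, 1000])
for row_id in sorted(counts):
    print(row_id, counts[row_id])
```

The id that sits after the big gap absorbs nearly every draw, so this pick is far from uniform whenever the ids are unevenly spaced.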
Perhaps the simplest solution to your problem is to run the first query multiple times until it gets a valid row. The next suggestion would be an index on (FOREIGNKEY_ID, ID), which the subquery can use. That might speed the query.
I tend to favor something more along these lines:
SELECT t.id
FROM Table t
WHERE t.FOREIGNKEY_ID IS NULL AND
RAND() < 1.0 / 1000
ORDER BY RAND()
LIMIT 1;
The purpose of the WHERE clause is to reduce the volume considerably, so the ORDER BY doesn't take much time.
Unfortunately, this will require scanning the table, so you probably won't get responses in the 100 ms range on a 150k table. You can reduce that to an index scan with an index on t(FOREIGNKEY_ID, ID).
EDIT:
If you want a reasonable chance of a uniform distribution and performance that does not increase as the table gets larger, here is another idea, which -- alas -- requires a trigger.
Add a new column to the table called random, which is initialized with rand(). Build an index on random. Then run a query such as:
select t.*
from ((select t.*
from t
where random >= @random
order by random
limit 10
) union all
(select t.*
from t
where random < @random
order by random desc
limit 10
)
) t
order by rand()
limit 1;
The idea is that the subqueries can use the index to choose a set of 20 rows that are pretty arbitrary -- 10 before and after the chosen point. The rows are then sorted (some overhead, which you can control with the limit number). These are randomized and returned.
The idea is that if you choose random numbers, there will be arbitrary gaps and these would make the chosen numbers not quite uniform. However, by taking a larger sample around the value, then the probability of any one value being chosen should approach a uniform distribution. The uniformity would still have edge effects, but these should be minor on a large amount of data.
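A rough sketch of the same idea, using Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere. The column is named rnd here, the index name and helper names are invented, and the final random pick is done in Python rather than with ORDER BY rand().

```python
import random
import sqlite3

def build_table(n_rows, seed=0):
    # Every row gets a random value once, at insert time, and the column is indexed.
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, rnd REAL)")
    conn.executemany("INSERT INTO t (id, rnd) VALUES (?, ?)",
                     [(i, rng.random()) for i in range(1, n_rows + 1)])
    conn.execute("CREATE INDEX idx_t_rnd ON t (rnd)")
    return conn

def pick_random_id(conn, rng, window=10):
    # Take up to `window` rows on each side of a random point, then choose one.
    point = rng.random()
    above = conn.execute(
        "SELECT id FROM t WHERE rnd >= ? ORDER BY rnd LIMIT ?",
        (point, window)).fetchall()
    below = conn.execute(
        "SELECT id FROM t WHERE rnd < ? ORDER BY rnd DESC LIMIT ?",
        (point, window)).fetchall()
    candidates = [row[0] for row in above + below]
    return rng.choice(candidates)

conn = build_table(1000)
rng = random.Random(1)
picked = pick_random_id(conn, rng)
print(picked)
```

Both range queries can be answered from the index, so the cost depends on the window size rather than on the size of the table.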
Your IDs are probably going to contain gaps. Anything that works with COUNT(*) is not going to be able to find all the IDs.
A table with records with IDs 1, 2, 3, 10, 11, 12, 13 has only 7 records. Doing a random with COUNT(*) will often result in a miss, as records 4, 5 and 6 do not exist, and it will then pick the nearest ID, which is 3. This is not only unbalanced (it will pick 3 far too often), but it will also never pick records 10-13.
To get a fair, uniformly distributed random selection of records, I would suggest loading the IDs of the table first. Even for 150k rows, loading a set of integer IDs will not consume a lot of memory (<1 MB):
SELECT id FROM table;
You can then use a function like Collections.shuffle to randomize the order of the ID's. To get the rest of the data, you can select records one at a time or for example 10 at a time:
SELECT * FROM table WHERE id = :id
Or:
SELECT * FROM table WHERE id IN (:id1, :id2, :id3)
This should be fast if the id column has an index, and it will give you a proper random distribution.
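A minimal sketch of that flow, written in Python with sqlite3 instead of Java and MySQL so that it is self-contained; random.shuffle plays the role of Collections.shuffle, and the items table with its payload column is invented for the example.

```python
import random
import sqlite3

def shuffled_batches(conn, batch_size=10, seed=None):
    # Load only the ids, shuffle them in memory, then fetch full rows in batches.
    ids = [row[0] for row in conn.execute("SELECT id FROM items")]
    random.Random(seed).shuffle(ids)
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        placeholders = ", ".join("?" * len(batch))
        yield conn.execute(
            f"SELECT id, payload FROM items WHERE id IN ({placeholders})",
            batch).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, f"row {i}") for i in (1, 2, 3, 10, 11, 12, 13)])

seen = [row[0]
        for batch in shuffled_batches(conn, batch_size=3, seed=42)
        for row in batch]
print(sorted(seen))
```

Every id is visited exactly once regardless of gaps, which is what makes the distribution fair.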
If a prepared statement can be used, then this should work:
SELECT @skip := Floor(Rand() * Count(*)) FROM Table WHERE FOREIGNKEY_ID IS NULL;
PREPARE STMT FROM 'SELECT * FROM Table WHERE FOREIGNKEY_ID IS NULL LIMIT ?, 1';
EXECUTE STMT USING @skip;
LIMIT in a SELECT statement can be used to skip rows.
Related
I want to return rows in random order from a table with a large number of rows to be scanned.
Tried:
1) select * from table order by rand() limit 1
2) select * from table where id in (select id from table order by rand() limit 1)
2 is faster than 1 but still too slow on a table with a large number of rows.
Update:
The query is used in a real-time app. Inserts, selects and updates run at roughly 10/sec, so caching is not an ideal solution. This specific case needs only 1 row, but I am also looking for a general solution where the query is fast and more than 1 row is required.
The fastest way is to use a prepared statement in MySQL with LIMIT:
select @offset:=floor(rand()*total_rows_in_table);
PREPARE STMT FROM 'select id from table limit ?,1';
EXECUTE STMT USING @offset;
total_rows_in_table = the total number of rows in the table.
It is much faster than the two queries above.
Limitation: fetching more than 1 row this way is not truly random.
Generate a random set of IDs before doing the query (you can also get MAX(id) very quickly if you need it). Then do the query as id IN (your, list). This will use the index to look only at the IDs you requested, so it will be very fast.
Limitation: if some of your randomly chosen IDs don't exist, the query will return fewer results, so you'll need to repeat these operations in a loop until you have enough results.
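The loop could look roughly like this. It is a sketch using Python and sqlite3 as a stand-in for MySQL; the items table, the oversampling factor of 2 and the max_rounds safety cap are all assumptions made for the example.

```python
import random
import sqlite3

def random_rows(conn, wanted, rng, max_rounds=100):
    # Guess ids in 1..MAX(id), look them up via the primary key,
    # and repeat until enough existing rows were hit.
    max_id = conn.execute("SELECT MAX(id) FROM items").fetchone()[0]
    if max_id is None:
        return []  # empty table
    found = {}
    for _ in range(max_rounds):
        if len(found) >= wanted:
            break
        guesses = [rng.randint(1, max_id) for _ in range(wanted * 2)]
        placeholders = ", ".join("?" * len(guesses))
        query = f"SELECT id, payload FROM items WHERE id IN ({placeholders})"
        for row_id, payload in conn.execute(query, guesses):
            found[row_id] = payload
    return list(found.items())[:wanted]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
# Only every third id exists, so about two out of three guesses miss.
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, f"row {i}") for i in range(1, 1001, 3)])

rows = random_rows(conn, wanted=5, rng=random.Random(7))
print(rows)
```

Guesses that fall into a gap are simply discarded, so each existing id is equally likely to be hit; the price is extra round trips when the table is sparse.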
If you can run two queries in the same "call", you can do something like this. Sadly, this assumes there are no deleted records in your database; if there were, some queries would not return anything.
I tested with some local records, and the fastest I could do was this. That said, I tested it on a table with no deleted rows.
SET @randy = CAST(rand()*(SELECT MAX(id) FROM yourtable) as UNSIGNED);
SELECT *
FROM yourtable
WHERE id = @randy;
Another solution that came from modifying a little the answer to this question, and from your own solution:
Using variables as OFFSET in SELECT statments inside mysql's stored functions
SET @randy = CAST(rand()*(SELECT MAX(id) FROM yourtable) as UNSIGNED);
SET @q1 = CONCAT('SELECT * FROM yourtable LIMIT 1 OFFSET ', @randy);
PREPARE stmt1 FROM @q1;
EXECUTE stmt1;
I imagine a table with, say, a million entries. You want to pick a row randomly, so you generate one random number per row, i.e. a million random numbers, and then seek the row with the minimum generated number. There are two tasks involved:
generating all those numbers
finding the minimum number
and then accessing the record of course.
If you wanted more than one row, the DBMS could sort all records and then return n records, but hopefully it would rather apply some part-sort operation where it only detects the n minimum numbers. Quite some task anyway.
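The part-sort idea can be sketched outside the database with Python's heapq.nsmallest, which keeps track of only the n smallest keys instead of sorting everything. This illustrates the principle only; it is not a claim about what MySQL does internally, and the function name is made up.

```python
import heapq
import random

def sample_by_min_random(row_ids, n, rng):
    # Attach one random number to every row, then keep only the n rows
    # with the smallest numbers; no full sort of all rows is needed.
    keyed = ((rng.random(), row_id) for row_id in row_ids)
    return [row_id for _, row_id in heapq.nsmallest(n, keyed)]

sample = sample_by_min_random(range(1, 1_000_001), 5, random.Random(0))
print(sample)
```

Even so, a random number is still generated for every single row, which is the part that cannot be avoided with this approach.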
There is no thorough way to circumvent this, I guess. If you want random access, this is the way to go.
If you are ready to live with a less random result, however, I'd suggest making ID buckets. Imagine ID buckets 000000-099999, 100000-199999, ... Then randomly choose one bucket and pick your random rows from it. Well, admittedly, this doesn't look very random, and you would get either only old or only new records with such buckets; but it illustrates the technique.
Instead of creating the buckets by value, you'd create them with a modulo function. id % 1000 would give you 1000 buckets. The first with IDs xxx000, the second with IDs xxx001. This would solve the new/old records thing and get the buckets balanced. As IDs are a mere technical thing, it doesn't matter at all that the drawn IDs look so similar. And even if that bothers you, then don't make 1000 buckets, but say 997.
Now create a computed column:
alter table mytable add column bucket int generated always as (id % 997) stored;
Add an index:
create index idx on mytable(bucket);
And query the data:
select *
from mytable
where bucket = floor(rand() * 997)
order by rand()
limit 10;
Only about 0.1% of the table gets into the sorting here. So this should be rather fast. But I suppose that only pays with a very large table and a high number of buckets.
Disadvantages of the technique:
It can happen that you don't get as many rows as you want and you'd have to query again then.
You must choose the modulo number wisely. If there are just two thousand records in the table, you wouldn't make 1000 buckets of course, but maybe 100 and never demand more than, say, ten rows at a time.
If the table grows and grows, a once chosen number may no longer be optimal and you might want to alter it.
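Here is a small runnable sketch of the bucket technique, using Python with sqlite3 in place of MySQL. To stay portable, the bucket column is filled in at insert time instead of being a generated column, the bucket number is drawn once in application code, and 97 buckets are used because the demo table is small; all of these are choices made for the illustration.

```python
import random
import sqlite3

BUCKETS = 97  # hypothetical value for a small demo table

def build(n_rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, bucket INTEGER)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?)",
                     [(i, i % BUCKETS) for i in range(1, n_rows + 1)])
    conn.execute("CREATE INDEX idx ON mytable (bucket)")
    return conn

def pick(conn, wanted, rng):
    # Choose one bucket (0 .. BUCKETS - 1) once, then shuffle only that bucket.
    bucket = rng.randrange(BUCKETS)
    rows = conn.execute(
        "SELECT id FROM mytable WHERE bucket = ? ORDER BY RANDOM() LIMIT ?",
        (bucket, wanted)).fetchall()
    return bucket, [row[0] for row in rows]

conn = build(10_000)
bucket, ids = pick(conn, 10, random.Random(3))
print(bucket, ids)
```

Only the rows of one bucket (roughly 1/97 of the table here) reach the random sort. If a bucket holds fewer rows than requested, the result is simply shorter and the query has to be repeated.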
Rextester link: http://rextester.com/VDPIU7354
UPDATE: It just dawned on me that the buckets would be really random if the generated column were based not on a modulo of the ID but on a RAND value instead:
alter table mytable add column bucket int generated always as (floor(rand() * 1000)) stored;
but MySQL throws an error "Expression of generated column 'bucket' contains a disallowed function". This doesn't seem to make sense, as a non-deterministic function should be okay with the STORED option, but at least in version 5.7.12 this doesn't work. Maybe in some later version?
Scenario in short: A table with more than 16 million records [2GB in size]. The higher LIMIT offset with SELECT, the slower the query becomes, when using ORDER BY *primary_key*
So
SELECT * FROM large ORDER BY `id` LIMIT 0, 30
takes far less than
SELECT * FROM large ORDER BY `id` LIMIT 10000, 30
Both queries return only 30 records, so it's not the overhead from ORDER BY.
Now when fetching the latest 30 rows it takes around 180 seconds. How can I optimize that simple query?
I had the exact same problem myself. Given that you want to collect a large amount of this data and not a specific set of 30, you'll probably be running a loop and incrementing the offset by 30.
So what you can do instead is:
Hold the last id of a set of data(30) (e.g. lastId = 530)
Add the condition WHERE id > lastId limit 0,30
So you can always have a ZERO offset. You will be amazed by the performance improvement.
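As a sketch, the loop might look like this in Python with sqlite3 standing in for MySQL. The payload column and the demo data are invented, and ids are assumed to be positive so that 0 works as the starting value.

```python
import sqlite3

def iter_pages(conn, page_size=30):
    # Keyset pagination: remember the last id seen instead of using an OFFSET.
    last_id = 0  # assumes all ids are positive
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM large WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size)).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE large (id INTEGER PRIMARY KEY, payload TEXT)")
# 100 rows whose ids have gaps (multiples of 7).
conn.executemany("INSERT INTO large VALUES (?, ?)",
                 [(i * 7, f"row {i}") for i in range(1, 101)])

pages = list(iter_pages(conn))
print([len(page) for page in pages])
```

Each page starts with an index seek on id, so the last page costs about the same as the first one, and gaps in the ids do not matter.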
It's normal that higher offsets slow the query down, since the query needs to count off the first OFFSET + LIMIT records (and take only LIMIT of them). The higher this value is, the longer the query runs.
The query cannot go right to OFFSET because, first, the records can be of different length, and, second, there can be gaps from deleted records. It needs to check and count each record on its way.
Assuming that id is the primary key of a MyISAM table, or a unique non-primary key field on an InnoDB table, you can speed it up by using this trick:
SELECT t.*
FROM (
SELECT id
FROM mytable
ORDER BY
id
LIMIT 10000, 30
) q
JOIN mytable t
ON t.id = q.id
See this article:
MySQL ORDER BY / LIMIT performance: late row lookups
MySQL cannot go directly to the 10000th record (or the 80000th byte, as you're suggesting) because it cannot assume that the data is packed/ordered like that (or that it has continuous values from 1 to 10000). Although it might be that way in actuality, MySQL cannot assume that there are no holes/gaps/deleted ids.
So, as bobs noted, MySQL will have to fetch 10000 rows (or traverse 10000 entries of the index on id) before finding the 30 to return.
EDIT : To illustrate my point
Note that although
SELECT * FROM large ORDER BY id LIMIT 10000, 30
would be slow(er),
SELECT * FROM large WHERE id > 10000 ORDER BY id LIMIT 30
would be fast(er), and would return the same results provided that there are no missing ids (i.e. gaps).
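The "provided that there are no gaps" caveat is easy to check on a tiny table. The following sketch uses Python with sqlite3 as a stand-in for MySQL; the ids are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE large (id INTEGER PRIMARY KEY)")
# Ids 4 to 9 are missing.
conn.executemany("INSERT INTO large VALUES (?)",
                 [(i,) for i in (1, 2, 3, 10, 11, 12, 13)])

# Skip four rows by position...
by_offset = [row[0] for row in conn.execute(
    "SELECT id FROM large ORDER BY id LIMIT 2 OFFSET 4")]
# ...versus skip every id up to and including 4.
by_range = [row[0] for row in conn.execute(
    "SELECT id FROM large WHERE id > 4 ORDER BY id LIMIT 2")]

print(by_offset)  # [11, 12]
print(by_range)   # [10, 11]
```

So the WHERE form is only a drop-in replacement for the offset when the boundary value is the last id actually fetched, not a number computed from the page index.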
I found an interesting way to optimize SELECT queries that use ORDER BY id LIMIT X,Y.
I have 35 million rows, so it took about 2 minutes to find a range of rows.
Here is the trick:
select id, name, address, phone
FROM customers
WHERE id > 990
ORDER BY id LIMIT 1000;
Putting the last id you fetched into the WHERE clause improves performance a lot. For me it went from 2 minutes to 1 second :)
Other interesting tricks here : http://www.iheavy.com/2013/06/19/3-ways-to-optimize-for-paging-in-mysql/
It works with strings too.
The time-consuming part of the two queries is retrieving the rows from the table. Logically speaking, in the LIMIT 0, 30 version, only 30 rows need to be retrieved. In the LIMIT 10000, 30 version, 10000 rows are evaluated and 30 rows are returned. Some optimization can be done by the data-reading process, but consider the following:
What if you had a WHERE clause in the queries? The engine must return all rows that qualify, and then sort the data, and finally get the 30 rows.
Also consider the case where rows are not processed in the ORDER BY sequence. All qualifying rows must be sorted to determine which rows to return.
For those who are interested in a comparison and figures :)
Experiment 1: The dataset contains about 100 million rows. Each row contains several BIGINT, TINYINT, as well as two TEXT fields (deliberately) containing about 1k chars.
Blue := SELECT * FROM post ORDER BY id LIMIT {offset}, 5
Orange := @Quassnoi's method. SELECT t.* FROM (SELECT id FROM post ORDER BY id LIMIT {offset}, 5) AS q JOIN post t ON t.id = q.id
Of course, the third method, ... WHERE id>xxx LIMIT 0,5, does not appear here since it should be constant time.
Experiment 2: Similar thing, except that one row only has 3 BIGINTs.
green := the blue before
red := the orange before
I have been looking on the web on how to select a random row on big tables, I have found various results, but then I analyzed my data and figured out that the best way for me to go is to count the rows and select a random one of those with LIMIT
While testing I start to wonder why this works:
SET @t = CEIL(RAND()*(SELECT MAX(id) FROM logo));
SELECT id
FROM logo
WHERE
current_status_id=29 AND
logo_type_id=4 AND
active='y' AND
id>=@t
ORDER BY id
LIMIT 1;
and gives random results, but this always returns the same 4 or 5 results ?
SELECT id
FROM logo
WHERE
current_status_id=29 AND
logo_type_id=4 AND
active='y' AND
id>=CEIL(RAND()*(SELECT MAX(id) FROM logo))
ORDER BY id
LIMIT 1;
the table has MANY fields (almost 100) and quite a few indexes. over 14 Million records and counting. When I select a random it is almost NEVER that I have to select it from the table, I always have to select depending on various fields values (all indexed).
Could it be a bug of my MySQL server version (5.6.13-log Source distribution)?
One possibility is that this statement in the documentation:
RAND() in a WHERE clause is re-evaluated every time the WHERE is executed.
is simply not always true. It is true when you do:
where rand() < 0.01
to get an approximate 1% sample of the rows. Perhaps the MySQL optimizer says something like "Oh, I'll evaluate the subquery to get one value back. And, just to be more efficient, I'll multiply that value by rand() before defining the constant."
If I had to guess, that would be the case.
Another possibility is that the data is arranged so that the values you are looking for have one row with a large id. Or, it could be that there are lots of rows with small ids at the very beginning, and then a very large gap.
Your method of getting a random row, by the way is not guaranteed to return a result when you are doing filtering. I don't know if that is important to you.
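For comparison, here is a plain-Python simulation of what the quoted documentation sentence would imply if it does apply to the second query: the threshold is redrawn for every row examined, rows are scanned in id order, and the first row that passes is returned. This is only a model under those assumptions, with made-up function names, and it does not settle which explanation is the right one.

```python
import random

def pick_fixed_threshold(ids, rng):
    # Threshold computed once, as with the SET @t variant.
    threshold = rng.random() * max(ids)
    return next((i for i in ids if i >= threshold), None)

def pick_per_row_threshold(ids, rng):
    # Threshold redrawn for every row that is examined.
    top = max(ids)
    return next((i for i in ids if i >= rng.random() * top), None)

ids = list(range(1, 1001))  # scanned in ascending id order
rng = random.Random(0)
fixed = [pick_fixed_threshold(ids, rng) for _ in range(2000)]
per_row = [pick_per_row_threshold(ids, rng) for _ in range(2000)]

print("mean id, threshold fixed once:", sum(fixed) / len(fixed))
print("mean id, threshold redrawn per row:", sum(per_row) / len(per_row))
```

In this model the per-row variant is heavily skewed toward the lowest ids, because every early row gets its own chance to pass. That would at least be consistent with seeing the same handful of results over and over.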
EDIT:
Check to see if this version works as you expect:
SELECT id
FROM logo cross join
(SELECT MAX(id) as maxid FROM logo) c
WHERE current_status_id = 29 AND
logo_type_id = 4 AND
active = 'y' AND
id >= RAND() * maxid
ORDER BY id
LIMIT 1;
If so, the problem is that the max id is being calculated and then there is an extra step of multiplying it by rand() as execution of the query begins.
Currently I am using:
SELECT *
FROM
table AS t1
JOIN (
SELECT (RAND() * (SELECT MAX(id) FROM table where column_x is null)) AS id
) AS t2
WHERE
t1.id >= t2.id
and column_x is null
ORDER BY t1.id ASC
LIMIT 1
This is normally extremely fast; however, when I include the highlighted condition on column_x (is null), it gets slow.
What would be the fastest random querying solution where the records' column X is null?
ID is PK, column X is int(4). Table contains about a million records and over 1 GB in total size doubling itself every 24 hours currently.
column_x is indexed.
Column ID may not be consecutive.
The DB engine used in this case is InnoDB.
Thank you.
Getting a genuinely random record can be slow. There's not really much getting around this fact; if you want it to be truly random, then the query has to load all the relevant data in order to know which records it has to choose from.
Fortunately however, there are quicker ways of doing it. They're not properly random, but if you're happy to trade a bit of pure randomness for speed, then they should be good enough for most purposes.
With that in mind, the fastest way to get a "random" record is to add an extra column to your DB, which is populated with a random value. Perhaps a salted MD5 hash of the primary key? Whatever. Add appropriate indexes on this column, and then simply add the column to your ORDER BY clause in the query, and you'll get your records back in a random order.
To get a single random record, simply specify LIMIT 1 and add a WHERE random_field > $random_value where random value would be a value in the range of your new field (say an MD5 hash of a random number, for example).
Of course the down side here is that although your records will be in a random order, they'll be stuck in the same random order. I did say it was trading perfection for query speed. You can get around this by updating them periodically with fresh values, but I guess that could be a problem for you if you need to keep it fresh.
The other down-side is that adding an extra column might be too much to ask if you have storage constraints and your DB is already massive in size, or if you have a strict DBA to get past before you can add columns. But again, you have to trade off something; if you want the query speed, you need this extra column.
Anyway, I hope that helped.
I don't think you need a join, nor an order by, nor a limit 1 (providing the ids are unique).
SELECT *
FROM myTable
WHERE column_x IS NULL
AND id = ROUND(RAND() * (SELECT MAX(Id) FROM myTable), 0)
Have you run EXPLAIN on the query? What was the output?
Why not store or cache the value of SELECT MAX(id) FROM table where column_x is null and use that as a variable? Your query would then become:
$rand = rand(0, $storedOrCachedMaxId);
SELECT *
FROM
table AS t1
WHERE
t1.id >= $rand
and column_x is null
ORDER BY t1.id ASC
LIMIT 1
A simpler query will likely be easier on the db.
Be aware that if your data contains sizable holes, you aren't going to get consistently random results with this kind of query.
I'm new to MySQL syntax, but digging a little further I think a dynamic query might work. We select the Nth row, where the Nth is random:
SELECT @r := CAST(COUNT(1)*RAND() AS UNSIGNED) FROM table WHERE column_x is null;
PREPARE stmt FROM
'SELECT *
FROM table
WHERE column_x is null
LIMIT 1 OFFSET ?';
EXECUTE stmt USING @r;
I've looked around and there doesn't seem to be any easy way to do this. It almost looks like it's easier just to grab a subset of records and do all the randomizing in code (Perl). The methods I've seen online seem like they're geared toward at most hundreds of thousands of records, but certainly not millions.
The table I'm working with has 6 million records (and growing). The IDs are auto-incremented, but not every ID is still present in the table (there are gaps).
I've tried the LIMIT 1 query that's been recommended, but the query takes forever to run. Is there a quick way to do this, given that there are gaps in the records? I can't just take the max and randomize over the range.
Update:
One idea I had maybe was to grab the max, randomize a limit based on the max, and then grab a range of 10 records from random_limit_1 to random_limit_2 and then taking the first record found in that range.
Or, if I know the max, is there a way I can just pick, say, the 5th record of the table without having to know which ID it is, and then just grab the id of that record?
Update:
This query is somewhat faster-ish. Still not fast enough =/
SELECT t.id FROM table t JOIN (SELECT(FLOOR(max(id) * rand())) as maxid FROM table) as tt on t.id >= tt.maxid LIMIT 1
SELECT * FROM TABLE ORDER BY RAND() LIMIT 1;
OK, this is slow. If you search for ORDER BY RAND() MYSQL, you will find a lot of results saying that this is very slow, and that is the case. I did a little research and found this alternative: MySQL rand() is slow on large datasets
I hope this is better
Yeah, idea seems good:
select min(ID), max(ID) from table into @min, @max;
set @range = @max - @min;
set @mr = @min + ((@range / 1000) * (rand() * 1000));
select ID from table
where ID >= @mr and ID <= @mr + 1000
order by rand()
limit 1
-- into @result
;
May change 1000 to 10000 or whatever as needed to scale...
EDIT: you could also try this:
select ID from table
where (ID % 1000) = floor(rand() * 1000)
order by rand()
limit 1
;
Splits it along different lines...
EDIT 2:
See: What is the best way to pick a random row from a table in MySQL?
This is probably the fastest way:
select @row := floor(count(*) * rand()) from some_tbl;
select some_ID from some_tbl limit @row, 1;
Unfortunately, variables can't be used in a LIMIT clause, so you'd have to use a dynamic query, either writing the query string in code or using PREPARE and EXECUTE. Also, limit n, 1 still requires scanning n items into the table, so on average it's only about twice as fast as the second method listed above. (It is probably more uniform, though, and guarantees that a matching row will always be found.)
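When the offset is computed in application code, it can be passed as an ordinary bound parameter. A minimal sketch with Python and sqlite3 as a stand-in for MySQL follows; the demo data is invented, and whether a given MySQL driver accepts a placeholder inside LIMIT is something to verify for your setup.

```python
import random
import sqlite3

def random_row_by_offset(conn, rng):
    # Count the rows, draw an offset in 0 .. count - 1, then skip that many rows.
    total = conn.execute("SELECT COUNT(*) FROM some_tbl").fetchone()[0]
    if total == 0:
        return None
    offset = rng.randrange(total)
    row = conn.execute(
        "SELECT some_ID FROM some_tbl LIMIT 1 OFFSET ?", (offset,)).fetchone()
    return row[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_tbl (some_ID INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO some_tbl VALUES (?)",
                 [(i,) for i in (1, 2, 3, 10, 11, 12, 13)])

picked = random_row_by_offset(conn, random.Random(5))
print(picked)
```

Because the offset counts rows rather than id values, gaps do not bias the result, but the database still has to step over offset rows to get there.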
SELECT ID
FROM YourTable
ORDER BY RAND() LIMIT 1;