Currently I am using:
SELECT *
FROM
table AS t1
JOIN (
SELECT (RAND() * (SELECT MAX(id) FROM table where column_x is null)) AS id
) AS t2
WHERE
t1.id >= t2.id
and column_x is null
ORDER BY t1.id ASC
LIMIT 1
This is normally extremely fast; however, when I include the column_x IS NULL condition, it gets slow.
What would be the fastest way to query a random record where column_x is null?
ID is PK, column_x is int(4). The table contains about a million records and is over 1 GB in total size, currently doubling every 24 hours.
column_x is indexed.
Column ID may not be consecutive.
The DB engine used in this case is InnoDB.
Thank you.
Getting a genuinely random record can be slow. There's not really much getting around this fact; if you want it to be truly random, then the query has to load all the relevant data in order to know which records it has to choose from.
Fortunately however, there are quicker ways of doing it. They're not properly random, but if you're happy to trade a bit of pure randomness for speed, then they should be good enough for most purposes.
With that in mind, the fastest way to get a "random" record is to add an extra column to your DB, which is populated with a random value. Perhaps a salted MD5 hash of the primary key? Whatever. Add appropriate indexes on this column, and then simply add the column to your ORDER BY clause in the query, and you'll get your records back in a random order.
To get a single random record, simply specify LIMIT 1 and add WHERE random_field > $random_value, where $random_value is a value in the range of your new field (an MD5 hash of a random number, for example).
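A minimal sketch of that approach, assuming a table named mytable with primary key id; the column, index, and salt names here are illustrative:
ALTER TABLE mytable ADD COLUMN random_field CHAR(32) NOT NULL DEFAULT '';
UPDATE mytable SET random_field = MD5(CONCAT('some_salt', id));
CREATE INDEX idx_mytable_random_field ON mytable (random_field);
-- $random_value is generated by the application, e.g. MD5(RAND())
SELECT *
FROM mytable
WHERE random_field > '$random_value'
ORDER BY random_field
LIMIT 1;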
Of course the down side here is that although your records will be in a random order, they'll be stuck in the same random order. I did say it was trading perfection for query speed. You can get around this by updating them periodically with fresh values, but I guess that could be a problem for you if you need to keep it fresh.
The other down-side is that adding an extra column might be too much to ask if you have storage constraints and your DB is already massive in size, or if you have a strict DBA to get past before you can add columns. But again, you have to trade off something; if you want the query speed, you need this extra column.
Anyway, I hope that helped.
I don't think you need a join, nor an order by, nor a limit 1 (providing the ids are unique).
SELECT *
FROM myTable
WHERE column_x IS NULL
AND id = ROUND(RAND() * (SELECT MAX(Id) FROM myTable), 0)
Have you run EXPLAIN on the query? What was the output?
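For reference, that just means prefixing the query from the question with EXPLAIN (table is the placeholder name from the question):
EXPLAIN
SELECT *
FROM table AS t1
JOIN (
SELECT (RAND() * (SELECT MAX(id) FROM table WHERE column_x IS NULL)) AS id
) AS t2
WHERE t1.id >= t2.id
AND column_x IS NULL
ORDER BY t1.id ASC
LIMIT 1;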
Why not store or cache the value of SELECT MAX(id) FROM table WHERE column_x IS NULL and use that as a variable? Your query would then become:
$rand = rand(0, $storedOrCachedMaxId);
SELECT *
FROM
table AS t1
WHERE
t1.id >= $rand
and column_x is null
ORDER BY t1.id ASC
LIMIT 1
A simpler query will likely be easier on the db.
Know that if your data contains sizable holes, you aren't going to get consistently random results with these kinds of queries.
I'm new to MySQL syntax, but digging a little further I think a dynamic query might work. We select the Nth row, where N is random:
SELECT @r := CAST(COUNT(1)*RAND() AS UNSIGNED) FROM table WHERE column_x is null;
PREPARE stmt FROM
'SELECT *
FROM table
WHERE column_x is null
LIMIT 1 OFFSET ?';
EXECUTE stmt USING @r;
I have a table (>500 GB) from which I need to select 5000 random rows where table.condition = True and 5000 random rows where table.condition = False. My attempts so far used TABLESAMPLE, but unfortunately any WHERE clause is only applied after the sample has been generated. So the only way I see this working is by doing the following:
Generate 2 empty temporary_tables -- temporary_table_true and temporary_table_false -- with the structure of the main table, so I can add rows iteratively.
create temp table temporary_table_true as select
table.condition, table.b, table.c, ... table.z
from table LIMIT 0;
create temp table temporary_table_false as select
table.condition, table.b, table.c, ... table.z
from table LIMIT 0;
Create a loop that only stops when both of my temporary_tables contain 5000 rows.
Inside that loop, each iteration generates a batch of 100 random rows from table. From those random rows I insert the ones with table.condition = True into my temporary_table_true and the ones with table.condition = False into my temporary_table_false.
Could you guys give me some help here?
Are there any better approaches?
If not, any idea on how I could code parts 2. and 3.?
Add a column to your table and populate it with random numbers.
ALTER TABLE `table` ADD COLUMN rando FLOAT DEFAULT NULL;
UPDATE `table` SET rando = RAND() WHERE rando IS NULL;
Then do
SELECT *
FROM `table`
WHERE rando > RAND() * 0.9
AND condition = 0
ORDER BY rando
LIMIT 5000
Do it again for condition = 1 and Bob's your uncle. It will pull rows in random order starting from a random row.
A couple of notes:
0.9 is there to improve the chances you'll actually get 5000 rows and not some lesser number.
You may have to add LIMIT 1000 to the UPDATE statement and run it a whole bunch of times to populate the complete rando column: trying to update all the rows of a big table in one go can generate a huge transaction and swamp your server for a long time. A batched version is sketched after these notes.
If you need to generate another random sample, run the UPDATE or UPDATEs again.
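A minimal sketch of that batched population, reusing the rando column from above (the batch size of 1000 is arbitrary):
-- Run repeatedly until it reports 0 rows affected.
UPDATE `table` SET rando = RAND() WHERE rando IS NULL LIMIT 1000;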
The textbook solution would be to run two queries, one for rows where the condition is true and one for rows where it is false:
SELECT * FROM mytable WHERE `condition`=true ORDER BY RAND() LIMIT 5000;
SELECT * FROM mytable WHERE `condition`=false ORDER BY RAND() LIMIT 5000;
The WHERE clause applies first, to reduce the matching rows, then it sorts the subset of rows randomly and picks up to 5000 of them. The result is a random subset.
This solution has an advantage that it returns a pretty evenly distributed set of random rows, and it automatically handles cases like there being an unknown proportion of true to false in the table, and even handles if one of the condition values matches fewer than 5000 rows.
The disadvantage is that it's incredibly expensive to sort such a large set of rows, and an index does not help you sort by a nondeterministic expression like RAND().
You could do this with window functions if you need it to be a single SQL query, but it would still be very expensive.
SELECT t.*
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY `condition` ORDER BY RAND()) AS rownum
FROM mytable
) AS t
WHERE t.rownum <= 5000;
Another alternative that does not use a random sort operation would be to do a table-scan, and pick a random subset of rows. But you need to know roughly how many rows match each condition value, so that you can estimate the fraction of these that would make ~5000 rows. Say for example there are 1 million rows with the true value and 500k rows with the false value:
SELECT * FROM mytable WHERE `condition`=true AND RAND()*1000000 < 5000;
SELECT * FROM mytable WHERE `condition`=false AND RAND()*500000 < 5000;
This is not guaranteed to return exactly 5000 rows, because of the randomness. But probably pretty close. And a table-scan is still quite expensive.
The answer from O.Jones gives me another idea. If you can add a column, then you can add an index on that column.
ALTER TABLE `table`
ADD COLUMN rando FLOAT DEFAULT NULL,
ADD INDEX (`condition`, rando);
UPDATE `table` SET rando = RAND() WHERE rando IS NULL;
Then you can use indexed searches. Again, you need to know how many rows match each value to do this.
SELECT * FROM mytable
WHERE `condition`=true AND rando < 5000/1000000
ORDER BY `condition`, rando
LIMIT 5000;
SELECT * FROM mytable
WHERE `condition`=false AND rando < 5000/500000
ORDER BY `condition`, rando
LIMIT 5000;
The ORDER BY in this case should be a no-op if the index I added is used. The rows will be read in index order anyway, and MySQL's optimizer will not do any work to sort them.
This solution will be much faster, because it doesn't have to sort anything, and doesn't have to do a table-scan. MySQL has an optimization to bail out of a query once the LIMIT has been satisfied.
But the disadvantage is that it doesn't return a different random result when you run the SELECT again, or if different clients run the query. You would have to use UPDATE to re-randomize the whole table to get a different result. This might not be suitable depending on your needs.
I'm currently working on a multi-threaded program (in Java) that needs to select random rows from a database in order to update them. This works well, but I've started to encounter some performance issues with my SELECT query.
I tried multiple solutions before finding this website:
http://jan.kneschke.de/projects/mysql/order-by-rand/
I tried the following solution:
SELECT * FROM Table
JOIN (SELECT FLOOR( COUNT(*) * RAND() ) AS Random FROM Table)
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL
LIMIT 1;
It selects a single row just above the randomly generated id number. This works pretty well (an average of less than 100 ms per request on 150k rows). But after my program has processed a row, its FOREIGNKEY_ID will no longer be NULL (it will be updated with some value).
The problem is, my SELECT will "forget" some rows that have an id below the randomly generated id, and I won't be able to process them.
So I tried to adapt my query, doing this:
SELECT * FROM Table
JOIN (SELECT FLOOR(
(SELECT COUNT(id) FROM Table WHERE FOREIGNKEY_ID IS NULL) * RAND() )
AS Random FROM Table)
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL
LIMIT 1;
With that query there are no more skipped rows, but performance decreases drastically (an average of 1 s per request on 150k rows).
I could simply execute the fast one while I still have a lot of rows to process and switch to the slow one when only a few rows remain, but that would be a "dirty" fix in the code, and I would prefer an elegant SQL query that does the work.
Thank you for your help, please let me know if I'm not clear or if you need more details.
For your method to work more generally, you want max(id) rather than count(*):
SELECT t.*
FROM Table t JOIN
(SELECT FLOOR(MAX(id) * RAND() ) AS Random FROM Table) r
ON t.ID > R.Random
WHERE t.FOREIGNKEY_ID IS NULL
ORDER BY t.ID
LIMIT 1;
The ORDER BY is usually added to be sure that the "next" id is returned. In theory, MySQL could always return the maximum id in the table.
The problem is gaps in ids. And it is easy to create distributions where certain rows are essentially never chosen... say that the four ids are 1, 2, 3, 1000000. Your method will never get 1000000. The above will almost always get it.
Perhaps the simplest solution to your problem is to run the first query multiple times until it gets a valid row. The next suggestion would be an index on (FOREIGNKEY_ID, ID), which the subquery can use. That might speed the query.
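That suggested index might be created like this (the index name is illustrative):
ALTER TABLE `Table` ADD INDEX idx_foreignkey_id (FOREIGNKEY_ID, ID);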
I tend to favor something more along these lines:
SELECT t.id
FROM Table t
WHERE t.FOREIGNKEY_ID IS NULL AND
RAND() < 1.0 / 1000
ORDER BY RAND()
LIMIT 1;
The purpose of the WHERE clause is to reduce the volume considerably, so the ORDER BY doesn't take much time.
Unfortunately, this will require scanning the table, so you probably won't get responses in the 100 ms range on a 150k table. You can reduce that to an index scan with an index on t(FOREIGNKEY_ID, ID).
EDIT:
If you want a reasonable chance of a uniform distribution and performance that does not increase as the table gets larger, here is another idea, which -- alas -- requires a trigger.
Add a new column to the table called random, which is initialized with rand(). Build an index on random. Then run a query such as:
select t.*
from ((select t.*
from t
where random >= @random
order by random
limit 10
) union all
(select t.*
from t
where random < @random
order by random desc
limit 10
)
) t
order by rand()
limit 1;
The idea is that the subqueries can use the index to choose a set of 20 rows that are pretty arbitrary -- 10 before and after the chosen point. The rows are then sorted (some overhead, which you can control with the limit number). These are randomized and returned.
The idea is that if you choose random numbers, there will be arbitrary gaps and these would make the chosen numbers not quite uniform. However, by taking a larger sample around the value, then the probability of any one value being chosen should approach a uniform distribution. The uniformity would still have edge effects, but these should be minor on a large amount of data.
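The column, index, and trigger pieces mentioned above might look roughly like this; it is only a sketch, assuming the table is named t and all object names are illustrative:
ALTER TABLE t ADD COLUMN random DOUBLE;
UPDATE t SET random = RAND();
CREATE INDEX idx_t_random ON t (random);
-- Keep the column populated for newly inserted rows.
CREATE TRIGGER t_random_bi BEFORE INSERT ON t
FOR EACH ROW SET NEW.random = RAND();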
Your ID's are probably gonna contain gaps. Anything that works with COUNT(*) is not going to be able to find all the ID's.
A table with records with ID's 1, 2, 3, 10, 11, 12, 13 has only 7 records. Doing a random with COUNT(*) will often result in a miss as records 4, 5 and 6 do not exist, and it will then pick the nearest ID, which is 3. This is not only unbalanced (it will pick 3 far too often) but it will also never pick records 10-13.
To get a fair, uniformly distributed random selection of records, I would suggest loading the ID's of the table first. Even for 150k rows, loading a set of integer ID's will not consume a lot of memory (<1 MB):
SELECT id FROM table;
You can then use a function like Collections.shuffle to randomize the order of the ID's. To get the rest of the data, you can select records one at a time or for example 10 at a time:
SELECT * FROM table WHERE id = :id
Or:
SELECT * FROM table WHERE id IN (:id1, :id2, :id3)
This should be fast if the id column has an index, and it will give you a proper random distribution.
If prepared statement can be used, then this should work:
SELECT @skip := Floor(Rand() * Count(*)) FROM Table WHERE FOREIGNKEY_ID IS NULL;
PREPARE STMT FROM 'SELECT * FROM Table WHERE FOREIGNKEY_ID IS NULL LIMIT ?, 1';
EXECUTE STMT USING @skip;
LIMIT in a SELECT statement can be used to skip rows.
I have been looking on the web for how to select a random row from a big table. I found various results, but after analyzing my data I figured that the best way for me is to count the rows and select a random one of them with LIMIT.
While testing, I started to wonder why this works:
SET #t = CEIL(RAND()*(SELECT MAX(id) FROM logo));
SELECT id
FROM logo
WHERE
current_status_id=29 AND
logo_type_id=4 AND
active='y' AND
id>=#t
ORDER BY id
LIMIT 1;
and gives random results, but this always returns the same 4 or 5 results?
SELECT id
FROM logo
WHERE
current_status_id=29 AND
logo_type_id=4 AND
active='y' AND
id>=CEIL(RAND()*(SELECT MAX(id) FROM logo))
ORDER BY id
LIMIT 1;
The table has MANY fields (almost 100) and quite a few indexes, with over 14 million records and counting. When I select a random row, it is almost never from the whole table; I always have to filter on various field values (all indexed).
Could it be a bug in my MySQL server version (5.6.13-log, source distribution)?
One possibility is that this statement in the documentation:
RAND() in a WHERE clause is re-evaluated every time the WHERE is executed.
is simply not always true. It is true when you do:
where rand() < 0.01
to get an approximate 1% sample of the rows. Perhaps the MySQL optimizer says something like "Oh, I'll evaluate the subquery to get one value back. And, just to be more efficient, I'll multiply that value by rand() before defining the constant."
If I had to guess, that would be the case.
Another possibility is that the data is arranged so that the values you are looking for have one row with a large id. Or, it could be that there are lots of rows with small ids at the very beginning, and then a very large gap.
Your method of getting a random row, by the way, is not guaranteed to return a result when you are filtering. I don't know if that is important to you.
EDIT:
Check to see if this version works as you expect:
SELECT id
FROM logo cross join
(SELECT MAX(id) as maxid FROM logo) c
WHERE current_status_id = 29 AND
logo_type_id = 4 AND
active = 'y' AND
id >= RAND() * maxid
ORDER BY id
LIMIT 1;
If so, the problem is that the max id is being calculated and then there is an extra step of multiplying it by rand() as execution of the query begins.
I'm trying to optimize a query.
My question seems to be similar to MySQL, Union ALL and LIMIT and the answer might be the same (I'm afraid). However in my case there's a stricter limit (1) as well as an index on a datetime column.
So here we go:
For simplicity, let's have just one table with three columns:
md5 (varchar)
value (varchar)
lastupdated (datetime)
There's an index on (md5, lastupdated), so selecting on an md5 key, ordering by lastupdated and limiting to 1 will be optimized.
The search shall return a maximum of one record matching one of 10 md5 keys. The keys have a priority. So if there's a record with prio 1 it will be preferred over any record with prio 2, 3 etc.
Currently UNION ALL is used:
select * from
(
(
select 0 prio, value
from mytable
where md5 = '7b76e7c87e1e697d08300fd9058ed1db'
order by lastupdated desc
limit 1
)
union all
(
select 1 prio, value
from mytable
where md5 = 'eb36cd1c563ffedc6adaf8b74c259723'
order by lastupdated desc
limit 1
)
) x
order by prio
limit 1;
It works, but the UNION seems to execute all 10 queries if 10 keys are provided.
However, from a business perspective, it would be ok to run the selects sequentially and stop after the first match.
Is that possible through plain SQL?
Or would the only option be a stored procedure?
There's a much better way to do this that doesn't need UNION. You really want the groupwise max for each key, with a custom ordering.
Groupwise Max
Order by FIELD()
There's no way the optimizer for UNION ALL can figure out what you're up to.
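A hedged sketch of that combination, reusing the table and the two md5 keys from the question (a groupwise-max derived table picks the latest row per key, then ORDER BY FIELD() applies the priority):
SELECT mt.md5, mt.value
FROM mytable mt
JOIN (
SELECT md5, MAX(lastupdated) AS lastupdated
FROM mytable
WHERE md5 IN ('7b76e7c87e1e697d08300fd9058ed1db', 'eb36cd1c563ffedc6adaf8b74c259723')
GROUP BY md5
) latest ON latest.md5 = mt.md5 AND latest.lastupdated = mt.lastupdated
ORDER BY FIELD(mt.md5, '7b76e7c87e1e697d08300fd9058ed1db', 'eb36cd1c563ffedc6adaf8b74c259723')
LIMIT 1;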
I don't know if you can do this, but suppose you had an md5prio table with the list of hash codes you know you're looking for in it. For example:
prio md5
0 '7b76e7c87e1e697d08300fd9058ed1db'
1 'eb36cd1c563ffedc6adaf8b74c259723'
etc
Then your query could be:
select mytable.*
from mytable
join md5prio on mytable.md5 = md5prio.md5
order by md5prio.prio, mytable.lastupdated desc
limit 1
This might save the repeated queries. You'll definitely need your index on mytable.md5. I am not sure whether your compound index on lastupdated will help; you'll need to try it.
In your case, the most efficient solution may be to build an index on (md5, lastupdated). This index should be used to resolve each subquery very efficiently (looking up the values in the index and then looking up one data page).
Unfortunately, the groupwise max referenced by Gavin will produce multiple rows when there are duplicate lastupdated values (admittedly, perhaps not a concern in your case).
There is, actually, a MySQL way to get this answer, using group_concat and substring_index:
select p.prio,
substring_index(group_concat(mt.value order by mt.lastupdated desc), ',', 1)
from mytable mt join
(select 0 as prio, '7b76e7c87e1e697d08300fd9058ed1db' as md5 union all
select 1 as prio, 'eb36cd1c563ffedc6adaf8b74c259723' as md5 union all
. . .
) p
on mt.md5 = p.md5
group by p.prio;
I'm getting performance problems when LIMITing a mysql SELECT with a large offset:
SELECT * FROM table LIMIT m, n;
If the offset m is, say, larger than 1,000,000, the operation is very slow.
I do have to use limit m, n; I can't use something like id > 1,000,000 limit n.
How can I optimize this statement for better performance?
Perhaps you could create an indexing table which provides a sequential key relating to the key in your target table. Then you can join this indexing table to your target table and use a where clause to more efficiently get the rows you want.
#create table to store sequences
CREATE TABLE seq (
seq_no int not null auto_increment,
id int not null,
primary key(seq_no),
unique(id)
);
#create the sequence
TRUNCATE seq;
INSERT INTO seq (id) SELECT id FROM mytable ORDER BY id;
#now get 1000 rows from offset 1000000
SELECT mytable.*
FROM mytable
INNER JOIN seq USING(id)
WHERE seq.seq_no BETWEEN 1000000 AND 1000999;
If records are large, the slowness may be coming from loading the data. If the id column is indexed, then just selecting it will be much faster. You can then do a second query with an IN clause for the appropriate ids (or you could formulate a WHERE clause using the min and max ids from the first query).
slow:
SELECT * FROM table ORDER BY id DESC LIMIT 10 OFFSET 50000
fast:
SELECT id FROM table ORDER BY id DESC LIMIT 10 OFFSET 50000
SELECT * FROM table WHERE id IN (1,2,3...10)
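A hedged sketch of the min/max variant mentioned above (the id values are purely illustrative):
SELECT id FROM table ORDER BY id DESC LIMIT 10 OFFSET 50000;
-- suppose the ids returned ranged from 123450 (min) to 123459 (max)
SELECT * FROM table WHERE id BETWEEN 123450 AND 123459 ORDER BY id DESC;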
There's a blog post somewhere on the internet about how the selection of the rows to show should be as compact as possible (just the ids), and how producing the complete results should then fetch all the data you want for only the rows you selected.
Thus, the SQL might be something like (untested, I'm not sure it actually will do any good):
select A.* from table A
inner join (select id from table order by whatever limit m, n) B
on A.id = B.id
order by A.whatever
If your SQL engine is too primitive to allow this kind of SQL statement, or if, against hope, it doesn't improve anything, it might be worthwhile to break this single statement into multiple statements and capture the ids in a data structure.
Update: I found the blog post I was talking about: it was Jeff Atwood's "All Abstractions Are Failed Abstractions" on Coding Horror.
I don't think there's any need to create a separate indexing table if your table already has a primary key. If so, then you can order by this primary key and use values of the key to step through:
SELECT * FROM myBigTable WHERE id > :OFFSET ORDER BY id ASC;
Another optimisation would be not to use SELECT * but just the ID, so that the query can simply read the index and doesn't have to then locate all the data (less IO overhead). If you need some of the other columns, then perhaps you could add these to the index so that they are read with the primary key (which will most likely be held in memory and therefore not require a disc lookup), although this will not be appropriate for all cases, so you will have to have a play.
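A minimal sketch of that covering-index idea, assuming a name column you also need (the column and index names are illustrative):
-- id is the primary key; adding name makes the index covering for this query
ALTER TABLE myBigTable ADD INDEX idx_id_name (id, name);
SELECT id, name
FROM myBigTable
WHERE id > :OFFSET
ORDER BY id ASC
LIMIT 1000;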
Paul Dixon's answer is indeed a solution to the problem, but you'll have to maintain the sequence table and ensure that there are no row gaps.
If that's feasible, a better solution would be to simply ensure that the original table has no row gaps, and starts from id 1. Then grab the rows using the id for pagination.
SELECT * FROM table A WHERE id >= 1 AND id <= 1000;
SELECT * FROM table A WHERE id >= 1001 AND id <= 2000;
and so on...
I ran into this problem recently. The fix had two parts. First I had to use an inner select in my FROM clause that did my limiting and offsetting on the primary key only:
$subQuery = DB::raw("( SELECT id FROM titles WHERE id BETWEEN {$startId} AND {$endId} ORDER BY title ) as t");
Then I could use that as the from part of my query:
// The start of the builder chain is not shown in the original; this opening is an assumption.
$query = DB::query()->select(
'titles.id',
'title_eisbns_concat.eisbns_concat',
'titles.pub_symbol',
'titles.title',
'titles.subtitle',
'titles.contributor1',
'titles.publisher',
'titles.epub_date',
'titles.ebook_price',
'publisher_licenses.id as pub_license_id',
'license_types.shortname',
$coversQuery
)
->from($subQuery)
->leftJoin('titles', 't.id', '=', 'titles.id')
->leftJoin('organizations', 'organizations.symbol', '=', 'titles.pub_symbol')
->leftJoin('title_eisbns_concat', 'titles.id', '=', 'title_eisbns_concat.title_id')
->leftJoin('publisher_licenses', 'publisher_licenses.org_id', '=', 'organizations.id')
->leftJoin('license_types', 'license_types.id', '=', 'publisher_licenses.license_type_id')
->get(); // chain terminator assumed; the original snippet ends mid-chain
The first time I created this query I used OFFSET and LIMIT in MySQL. This worked fine until I got past page 100; then the offset started getting unbearably slow. Changing that to BETWEEN in my inner query sped it up for any page. I'm not sure why MySQL hasn't sped up OFFSET, but BETWEEN seems to reel it back in.