Getting random data from a MySQL database but not repeating data

I have a list of products (thousands of them), each with an ID, and I am bringing them up randomly. I would like the items not to be repeated. Currently what I am doing is the following:
select * from products where product_id <> previous_product_id order by rand() limit 1;
This ensures that the same product will not appear twice in a row. But a repeat product usually appears a lot sooner than I would like (I believe I am right in saying this is the birthday problem). I have no idea what the most effective way is to get random data in a non-repeating fashion. I have thought of one way that I assume is highly inefficient:
I would assign the user an id (e.g. foo) and then, when they have seen an item, add it to a string that would be product_id_1 AND product_id_2 AND product_id_3 AND product_id_n. I would store this data with a timestamp (explained further on).
+---------+-------------------+-----------------------------------------------------------------+
| user_id | timestamp         | product_seen_string                                             |
+---------+-------------------+-----------------------------------------------------------------+
| foo     | 01-01-14 12:00:00 | product_id_1 AND product_id_2 AND product_id_3 AND product_id_n |
+---------+-------------------+-----------------------------------------------------------------+
With this product_seen_string I would keep adding to the seen products (and also update the timestamp). In the query I would first look up this list of seen ids by user_id, then plug the returned list into the main query that obtains the random product, like so:
select * from products where product_id not in (product_id_1, product_id_2, product_id_3, product_id_n) order by rand() limit 1;
I would also arrange it so that if no products were returned, the stored data would be deleted and the cycle could start again. In addition, a cron job would run every ten minutes and delete any row whose timestamp is older than an hour.
The scripting language I am using is PHP

Selecting random rows is always tricky, and there are no perfect solutions that don't involve some compromise: either you compromise performance, or you compromise even random distribution, or you compromise on the chance of selecting duplicates, etc.
As @Giacomo1968 mentions in their answer, any solution with ORDER BY RAND() does not scale well. As the number of rows in your table gets larger, the cost of sorting the whole table in a filesort gets worse and worse. Giacomo1968 is correct that the query cannot be cached when the sort order is random. But I don't care about that so much because I usually disable the query cache anyway (it has its own scalability problems).
Here's a solution to pre-randomize the rows in the table, by creating a rownum column and assigning unique consecutive values:
ALTER TABLE products ADD COLUMN rownum INT UNSIGNED, ADD KEY (rownum);
SET @rownum := 0;
UPDATE products SET rownum = (@rownum:=@rownum+1) ORDER BY RAND();
Now you can get a random row by an index lookup, without sorting:
SELECT * FROM products WHERE rownum = 1;
Or you can get the next random row:
SELECT * FROM products WHERE rownum = 2;
Or you can get 10 random rows at a time, or any other number you want, with no duplicates:
SELECT * FROM products WHERE rownum BETWEEN 11 and 20;
You can re-randomize anytime you want:
SET @rownum := 0;
UPDATE products SET rownum = (@rownum:=@rownum+1) ORDER BY RAND();
It's still costly to do the random sorting, but now you don't have to do it on every SELECT query. You can do it on a schedule, hopefully at off-peak times.
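If you want that schedule to run inside MySQL itself, one option is the event scheduler (it must be enabled, e.g. SET GLOBAL event_scheduler = ON). A minimal sketch, with the event name and interval as assumptions:
DELIMITER //
CREATE EVENT reshuffle_products
ON SCHEDULE EVERY 1 DAY
DO
BEGIN
    -- reassign consecutive rownum values in a fresh random order
    SET @rownum := 0;
    UPDATE products SET rownum = (@rownum := @rownum + 1) ORDER BY RAND();
END //
DELIMITER ;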

You might consider a pagination-like solution. Rather than ordering by RAND() (not good from a performance standpoint anyway), why not simply use a LIMIT clause and randomize the offset?
For example:
SELECT product_id FROM products ORDER BY product_id LIMIT X, 1
Where X is the offset you want to use. You could easily keep track of the offsets used in the application and randomize amongst the available remaining values.
PHP code to call this might look like this:
if (!isset($_SESSION['available_offsets']) || count($_SESSION['available_offsets']) === 0) {
    $record_count = ...; // total number of records likely derived from query against table in question
    // this creates array with numerical keys matching available offsets
    // we don't care about the values
    $available_offsets = array_fill(0, $record_count, '');
} else {
    $available_offsets = $_SESSION['available_offsets'];
}
// pick offset from available offsets
$offset = array_rand($available_offsets);
// unset this as available offset
unset($available_offsets[$offset]);
// set remaining offsets to session for next page load
$_SESSION['available_offsets'] = $available_offsets;
$query = 'SELECT product_id FROM products ORDER BY product_id LIMIT ' . $offset . ', 1';
// make your query now

You could try adding a seed to RAND() to avoid repeats:
select * from products where product_id <> previous_product_id order by rand(7) limit 1;
From http://www.techonthenet.com/mysql/functions/rand.php :
The syntax for the RAND function in MySQL is:
RAND( [seed] )
Parameters or Arguments
seed
Optional. If specified, it will produce a repeatable sequence of random numbers each time that seed value is provided.

First, it is unclear what scripting language you will use to piece this together, so I will answer conceptually. I have also uppercased your MySQL keywords for readability. So first let's look at this query:
SELECT * FROM products WHERE product_id <> previous_product_id ORDER BY RAND() LIMIT 1;
Superficially this does what you want it to do, but if your database has thousands of items, RAND() is not recommended: it defeats all MySQL query caching and is a resource hog. More details are in the MySQL documentation on the query cache, specifically the area that reads:
A query cannot be cached if it contains any of the functions shown in
the following table.
That’s just not good.
But that said you might be able to improve it by just returning the product_id:
SELECT product_id FROM products WHERE product_id <> previous_product_id ORDER BY RAND() LIMIT 1;
And then you would have to do another query for the actual product data based on the product_id, but that would be much less taxing on the server than grabbing a whole set of randomized data.
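For example, if the first query returned a product_id of 123 (a made-up value), the follow-up is a cheap primary-key lookup:
SELECT * FROM products WHERE product_id = 123;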
But still, RAND() will inherently bog down your system due to the lack of caching, and the problem will only get worse as your database grows.
Instead I would recommend some kind of file-based solution that grabs a random list of items like this:
SELECT product_id FROM products WHERE product_id <> previous_product_id ORDER BY RAND() LIMIT 0,100;
You will strictly be grabbing product ids and then saving them to, let's say, a JSON file. The logic being: as random as you want this to be, you have to be realistic, and grabbing a slice of 100 items at a time will give a user a nice selection of items.
Then you would load that file in as an array, perhaps randomize the array, and grab one item off the top; a quick way to ensure that the item won't be chosen again. If you wish, you could shorten the array with each access by re-saving the JSON file, and then when it gets down to, let's say, fewer than 10 items, fetch a new batch and start again.
But in general, RAND() is a neat trick that is useful for small datasets. Anything past that should be refactored into code where you realistically weigh what you want users to see against what can realistically be done in a scalable manner.
EDIT: Just another thing I noticed in your code about keeping track of product IDs. If you want to go down that route instead of saving items to a JSON file, that would be fine as well. But what you are describing is not efficient:
+---------+-------------------+-----------------------------------------------------------------+
| user_id | timestamp         | product_seen_string                                             |
+---------+-------------------+-----------------------------------------------------------------+
| foo     | 01-01-14 12:00:00 | product_id_1 AND product_id_2 AND product_id_3 AND product_id_n |
+---------+-------------------+-----------------------------------------------------------------+
A better structure would be to simply store the IDs in a MySQL table like this; let’s call it products_seen:
+---------+-------------------+-----------------+
| user_id | timestamp         | product_seen_id |
+---------+-------------------+-----------------+
| foo     | 01-01-14 12:00:00 | product_id_1    |
| foo     | 01-01-14 12:00:00 | product_id_2    |
| foo     | 01-01-14 12:00:00 | product_id_3    |
+---------+-------------------+-----------------+
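For concreteness, a minimal sketch of how such a table could be defined; the column names and types here are assumptions (seen_at stands in for the timestamp column above):
CREATE TABLE products_seen (
    user_id         VARCHAR(64)  NOT NULL,
    seen_at         TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    product_seen_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (user_id, product_seen_id)
);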
And then use a query like this:
SELECT * FROM products
WHERE product_id NOT IN (
    SELECT product_seen_id FROM products_seen WHERE user_id = 'foo'
)
ORDER BY RAND()
LIMIT 1;
Basically, you would be cross-referencing products_seen with products and making sure the products already seen by that user_id are not shown again.
This is all pseudo-code, so please adjust to fit the concept.

Related

SQL Optimization on SELECT random id (with WHERE clause)

I'm currently working on a multi-threaded program (in Java) that needs to select random rows in a database in order to update them. This is working well, but I started to encounter some performance issues with my SELECT request.
I tried multiple solutions before finding this website:
http://jan.kneschke.de/projects/mysql/order-by-rand/
I tried with the following solution :
SELECT * FROM Table
JOIN (SELECT FLOOR( COUNT(*) * RAND() ) AS Random FROM Table)
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL
LIMIT 1;
It selects only one row below the randomly generated id number. This works pretty well (an average of less than 100 ms per request on 150k rows). But after my program processes a row, its FOREIGNKEY_ID will no longer be NULL (it will be updated with some value).
The problem is, my SELECT will "forget" some rows that have an id below the randomly generated id, and I won't be able to process them.
So I tried to adapt my request, doing this :
SELECT * FROM Table
JOIN (SELECT FLOOR(
(SELECT COUNT(id) FROM Table WHERE FOREIGNKEY_ID IS NULL) * RAND() )
AS Random FROM Table)
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL
LIMIT 1;
With that request there are no more problems of skipping rows, but performance decreases drastically (an average of 1 s per request on 150k rows).
I could simply execute the fast one when I still have a lot of rows to process, and switch to the slow one when it remains just a few rows, but it will be a "dirty" fix in the code, and I would prefer an elegant SQL request that can do the work.
Thank you for your help, please let me know if I'm not clear or if you need more details.
For your method to work more generally, you want max(id) rather than count(*):
SELECT t.*
FROM Table t JOIN
(SELECT FLOOR(MAX(id) * RAND() ) AS Random FROM Table) r
ON t.ID > R.Random
WHERE t.FOREIGNKEY_ID IS NULL
ORDER BY t.ID
LIMIT 1;
The ORDER BY is usually added to be sure that the "next" id is returned. In theory, without it, MySQL could always return the maximum matching id in the table.
The problem is gaps in ids, and it is easy to create distributions where some ids are almost never chosen. Say the four ids are 1, 2, 3, 1000: your COUNT(*)-based method will rarely get 1000, while the MAX(id) version above will almost always get it.
Perhaps the simplest solution to your problem is to run the first query multiple times until it gets a valid row. The next suggestion would be an index on (FOREIGNKEY_ID, ID), which the subquery can use. That might speed the query.
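Creating that index is a one-line change; backticks are needed because Table, the placeholder name used here, is a reserved word (the index name is arbitrary):
ALTER TABLE `Table` ADD INDEX idx_fk_id (FOREIGNKEY_ID, ID);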
I tend to favor something more along these lines:
SELECT t.id
FROM Table t
WHERE t.FOREIGNKEY_ID IS NULL AND
RAND() < 1.0 / 1000
ORDER BY RAND()
LIMIT 1;
The purpose of the WHERE clause is to reduce the volume considerably, so the ORDER BY doesn't take much time.
Unfortunately, this will require scanning the table, so you probably won't get responses in the 100 ms range on a 150k table. You can reduce that to an index scan with an index on t(FOREIGNKEY_ID, ID).
EDIT:
If you want a reasonable chance of a uniform distribution and performance that does not increase as the table gets larger, here is another idea, which -- alas -- requires a trigger.
Add a new column to the table called random, which is initialized with rand(). Build an index on random. Then run a query such as:
select t.*
from ((select t.*
       from t
       where random >= @random
       order by random
       limit 10
      ) union all
      (select t.*
       from t
       where random < @random
       order by random desc
       limit 10
      )
     ) t
order by rand()
limit 1;
The idea is that the subqueries can use the index to choose a set of 20 rows that are pretty arbitrary -- 10 before and after the chosen point. The rows are then sorted (some overhead, which you can control with the limit number). These are randomized and returned.
The idea is that if you choose random numbers, there will be arbitrary gaps, and these would make the chosen numbers not quite uniform. However, by taking a larger sample around the value, the probability of any one row being chosen approaches a uniform distribution. The uniformity would still have edge effects, but these should be minor on a large amount of data.
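The trigger this answer alludes to is not shown; a minimal sketch, assuming the table is named t and the new column is named random:
CREATE TRIGGER t_random_before_insert
BEFORE INSERT ON t
FOR EACH ROW
SET NEW.random = RAND();
Existing rows would be backfilled once with UPDATE t SET random = RAND();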
Your IDs are probably going to contain gaps. Anything that works with COUNT(*) is not going to be able to find all the IDs.
A table with records with IDs 1, 2, 3, 10, 11, 12, 13 has only 7 records. Doing a random pick with COUNT(*) will often generate a number that misses, as records 4, 5 and 6 do not exist, and the query will then fall through to the next existing ID, which is 10. This is not only unbalanced (it will pick 10 far too often) but it will also never pick records 11-13.
To get a fair, uniformly distributed random selection of records, I would suggest loading the IDs of the table first. Even for 150k rows, loading a set of integer IDs will not consume a lot of memory (<1 MB):
SELECT id FROM table;
You can then use a function like Collections.shuffle to randomize the order of the IDs. To get the rest of the data, you can select records one at a time, or for example 10 at a time:
SELECT * FROM table WHERE id = :id
Or:
SELECT * FROM table WHERE id IN (:id1, :id2, :id3)
This should be fast if the id column has an index, and it will give you a proper random distribution.
If a prepared statement can be used, then this should work:
SELECT @skip := Floor(Rand() * Count(*)) FROM Table WHERE FOREIGNKEY_ID IS NULL;
PREPARE STMT FROM 'SELECT * FROM Table WHERE FOREIGNKEY_ID IS NULL LIMIT ?, 1';
EXECUTE STMT USING @skip;
LIMIT in a SELECT statement can be used to skip rows.

Is it a bad idea to store row count and row number to speed up pagination?

My website has more than 20,000,000 entries; entries have categories (FK) and tags (M2M). Even for a query like SELECT id FROM table ORDER BY id LIMIT 1000000, 10, MySQL needs to scan 1,000,010 rows, which is unacceptably slow (PKs, indexes, joins, etc. don't help much here; it is still 1,000,010 rows). So I am trying to speed up pagination by storing the row count and row number with triggers like this:
DELIMITER //
CREATE TRIGGER trigger_name
BEFORE INSERT
ON entry_table FOR EACH ROW
BEGIN
    UPDATE category_table SET row_count = (@rc := row_count + 1)
    WHERE id = NEW.category_id;
    SET NEW.row_number_in_category = @rc;
END //
And then I can simply:
SELECT *
FROM entry_table
WHERE row_number_in_category > 10
ORDER BY row_number_in_category
LIMIT 10
(now only 10 rows are scanned, so selects are blazing fast; inserts are slower, but they are rare compared to selects, so that is acceptable)
Is it a bad approach and are there any good alternatives?
Although I like the solution in the question, it may present some issues if data in the entry_table is changed - perhaps deleted or assigned to different categories over time.
It also limits the ways in which the data can be sorted: the method assumes that data is only sorted by the insert order. Covering multiple sort methods requires additional triggers and summary data.
One alternate way of paginating is to pass in the last-seen value of the field you are sorting/paginating by, instead of an offset to the LIMIT clause.
Instead of this:
SELECT id FROM table ORDER BY id LIMIT 1000000, 10
Do this - assuming in this scenario that the last result viewed had an id of 1000000.
SELECT id FROM table WHERE id > 1000000 ORDER BY id LIMIT 0, 10
By tracking the offset of the pagination, this can be passed to subsequent queries for data and avoids the database sorting rows that are not ever going to be part of the end result.
If you really only wanted 10 rows out of 20million, you could go further and guess that the next 10 matching rows will occur in the next 1000 overall results. Perhaps with some logic to repeat the query with a larger allowance if this is not the case.
SELECT id FROM table WHERE id BETWEEN 1000000 AND 1001000 ORDER BY id LIMIT 0, 10
This should be significantly faster because the sort will probably be able to limit the result in a single pass.

Fast rand() order search

We have a table of 50k items and we display them on a search page with a random sort and 10 items per page. We also need to apply some filters.
RAND() with or without a seed is very slow. Note that items have three categories. The first category should be displayed first with random order, and then the second category, also with random order.
Generating a random number between 0 and max_id is not working because of the paging and the previously mentioned constraints.
Randomizing the records with PHP makes items always display on the same page.
Is there a better solution to speed up this random search?
Here are a few tips; hopefully they help:
Put indexes on the main fields on which you are filtering.
Reduce the number of columns in your SELECT query (only select the columns you need).
Recheck your joins.
Recheck your conditions.
Recheck your GROUP BY/HAVING/ORDER BY clauses.
Tip: Don't seed your RAND() call unless you're trying to test with a reproducible sequence of items.
This is tricky to do nearly perfectly without a lot of programming. In the meantime here are a couple of things to do.
First, try this. Instead of doing SELECT * FROM t ORDER BY RAND() LIMIT 10 use the following kind of subquery:
SELECT * FROM t
WHERE id IN (
SELECT id FROM t WHERE category = 1 ORDER BY RAND() LIMIT 10
UNION ALL
SELECT id FROM t WHERE category = 2 ORDER BY RAND() LIMIT 10
)
ORDER BY RAND()
This should save some time on the ORDER BY RAND() LIMIT 10 operation because it only has to shuffle the id values, not the whole record. But it's not an algorithmic change, just a volume-of-data change: it still has to shuffle the whole list of id values. So it's a quick patch, not a real fix.
Second, if you can write a PHP function that will generate a text string with, let's say, 100 random numbers between 1 and max_id, you could try this to get your first category.
SELECT * FROM t WHERE id IN
( SELECT DISTINCT id FROM t
WHERE category = 1 AND id IN (num, num, num, ..., num, num)
LIMIT 10 )
ORDER BY RAND()
This will give you ten, or fewer, randomly chosen records in the named category, pretty cheaply. Notice that you must provide many more than ten random numbers in your (num, num, num, num) list because not all the num values will be valid for rows with category = 1.
If you need more than one category, just use a similar query in a UNION to get the other category.
Both these approaches' performance will be improved by a compound index on (category, id).
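Assuming the table really is named t, that index would be:
ALTER TABLE t ADD INDEX idx_category_id (category, id);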
Notice there's an extra ORDER BY RAND() at the end of each of those approaches' queries. That's because the lists of id values generated by the subqueries are likely to be in a non-random order.

MySQL - RANDOMLY choose a row in a 14-million-row table - testing does not make sense

I have been looking on the web for how to select a random row from a big table. I have found various results, but after analyzing my data I figured out that the best way for me to go is to count the rows and select a random one of those with LIMIT.
While testing I started to wonder why this works:
SET @t = CEIL(RAND()*(SELECT MAX(id) FROM logo));
SELECT id
FROM logo
WHERE
current_status_id=29 AND
logo_type_id=4 AND
active='y' AND
id>=@t
ORDER BY id
LIMIT 1;
and gives random results, but this one always returns the same 4 or 5 results?
SELECT id
FROM logo
WHERE
current_status_id=29 AND
logo_type_id=4 AND
active='y' AND
id>=CEIL(RAND()*(SELECT MAX(id) FROM logo))
ORDER BY id
LIMIT 1;
The table has MANY fields (almost 100) and quite a few indexes, with over 14 million records and counting. When I select a random row it is almost never from the whole table; I always have to select depending on various field values (all indexed).
Could it be a bug in my MySQL server version (5.6.13-log Source distribution)?
One possibility is that this statement in the documentation:
RAND() in a WHERE clause is re-evaluated every time the WHERE is executed.
is simply not always true. It is true when you do:
where rand() < 0.01
to get an approximate 1% sample of the rows. Perhaps the MySQL optimizer says something like "Oh, I'll evaluate the subquery to get one value back. And, just to be more efficient, I'll multiply that value by rand() once before defining the constant."
If I had to guess, that would be the case.
Another possibility is that the data is arranged so that the values you are looking for have one row with a large id. Or, it could be that there are lots of rows with small ids at the very beginning, and then a very large gap.
Your method of getting a random row, by the way, is not guaranteed to return a result when you are filtering. I don't know if that is important to you.
EDIT:
Check to see if this version works as you expect:
SELECT id
FROM logo cross join
(SELECT MAX(id) as maxid FROM logo) c
WHERE current_status_id = 29 AND
logo_type_id = 4 AND
active = 'y' AND
id >= RAND() * maxid
ORDER BY id
LIMIT 1;
If so, the problem is that the max id is being calculated and then there is an extra step of multiplying it by rand() as execution of the query begins.

Randomizing a large dataset

I am trying to find a way to get a random selection from a large dataset.
We expect the set to grow to ~500K records, so it is important to find a way that keeps performing well while the set grows.
I tried a technique from: http://forums.mysql.com/read.php?24,163940,262235#msg-262235 But it's not exactly random, and it doesn't play well with a LIMIT clause: you don't always get the number of records that you want.
So I thought, since the PK is auto_increment, I could just generate a list of random ids and use an IN clause to select the rows I want. The problem with that approach is that sometimes I need a random set of data with records having a specific status, a status that is found in at most 5% of the total set. To make that work I would first need to find out which ids I can use that have that specific status, so that's not going to work either.
I am using mysql 5.1.46, MyISAM storage engine.
It might be important to know that the query to select the random rows is going to be run very often and the table it is selecting from is appended to frequently.
Any help would be greatly appreciated!
You could solve this with some denormalization:
Build a secondary table that contains the same pkeys and statuses as your data table
Add and populate a status group column which will be a kind of sub-pkey that you auto number yourself (1-based autoincrement relative to a single status)
Pkey   Status   StatusPkey
1      A        1
2      A        2
3      B        1
4      B        2
5      C        1
...    C        ...
n      C        m      (where m = # of C statuses)
When you don't need to filter you can generate rand #s on the pkey as you mentioned above. When you do need to filter then generate rands against the StatusPkeys of the particular status you're interested in.
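A sketch under assumed names (data_table for the data table, status_index for the secondary table); the random StatusPkey is computed first so that RAND() is evaluated only once, and the lookup itself is a cheap indexed equality:
-- pick a random 1-based StatusPkey among rows with the wanted status
SET @r := (SELECT FLOOR(1 + RAND() * COUNT(*)) FROM status_index WHERE status = 'C');
SELECT d.*
FROM data_table d
JOIN status_index s ON s.pkey = d.pkey
WHERE s.status = 'C' AND s.status_pkey = @r;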
There are several ways to build this table. You could have a procedure that you run on an interval, or you could do it live. The latter would be a performance hit though, since calculating the StatusPkey could get expensive.
Check out the article by Jan Kneschke at http://jan.kneschke.de/projects/mysql/order-by-rand/ ... It does a great job of explaining the pros and cons of different approaches to this problem...
You can do this efficiently, but you have to do it in two queries.
First get a random offset scaled by the number of rows that match your 5% conditions:
SELECT FLOOR(RAND() * (SELECT COUNT(*) FROM MyTable WHERE ...conditions...))
This returns an integer. Next, use the integer as an offset in a LIMIT expression:
SELECT * FROM MyTable WHERE ...conditions... LIMIT 1 OFFSET ?
Not every problem must be solved in a single SQL query.
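If you would rather keep both steps in SQL, a prepared statement can carry the offset; a sketch with a hypothetical status filter standing in for your conditions:
SET @offset := (SELECT FLOOR(RAND() * COUNT(*)) FROM MyTable WHERE status = 'pending');
PREPARE stmt FROM 'SELECT * FROM MyTable WHERE status = ''pending'' LIMIT 1 OFFSET ?';
EXECUTE stmt USING @offset;
DEALLOCATE PREPARE stmt;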