Mysql "where rand()" performance - mysql

I'm trying to pick random articles from my database, where high rating articles have a higher chance of getting picked
SELECT * FROM articles WHERE RAND()>0.9 ORDER BY rating DESC LIMIT 3
My question is:
Will it evaluate RAND() for the whole table, or only until it finds 3 articles whose random number is higher than 0.9?

If you have INDEX(rating), the optimizer can walk that index from the top and stop as soon as 3 rows pass the filter. Since RAND()>0.9 is true for only about 10% of rows, it will examine roughly 30 (3/0.1) rows on average before finding the 3.
But that does not give you "high rating articles have a higher chance of getting picked" at all. It merely gives you the top 3 of a random 10% sample of the rows, which will almost always be among the very highest rated articles.
This might give you what you want, but with a full table scan:
SELECT *
FROM articles
ORDER BY rating * RAND() DESC
LIMIT 3;
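rating * RAND() gives higher-rated rows an edge, but not in direct proportion to their rating. If you want the pick probability to be proportional to rating, one common alternative is an exponential sort key - still a full table scan, and the sketch below assumes ratings are positive:
SELECT *
FROM articles
WHERE rating > 0                     -- the exponential-key trick assumes positive weights
ORDER BY -LN(1 - RAND()) / rating    -- smallest key first: chance of selection is proportional to rating
LIMIT 3;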

Is there a way to be less random than ORDER BY RAND to increase speed?

I run this query 5 times, 5 seconds apart, on a table of 500,000 rows:
SELECT * FROM `apps` WHERE dev_name = '' ORDER BY RAND() LIMIT 10;
I'd like to get 50 rows that have a 90-95% chance of being unique. The query takes 10 seconds right now; I'd rather have it run faster, even if the result is less random.
Try
AND RAND() >= 0.90
(or 0.95 if you like) in your WHERE clause.
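Applied to the query from the question, that would look something like this (0.90 is just the suggested starting point):
SELECT *
FROM `apps`
WHERE dev_name = ''
  AND RAND() >= 0.90   -- randomly skip ~90% of the matching rows before the expensive sort
ORDER BY RAND()
LIMIT 10;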
You may want to study Fetching Random Rows from a Table, which is highlighted in a related question and in other articles on the subject.

MYSQL SELECT random on large table ORDER BY SCORE [duplicate]

I have a large MySQL table with about 25000 rows. There are 5 fields: ID, NAME, SCORE, AGE, SEX.
I need to select 5 random males, ordered by SCORE DESC.
For instance, if there are 100 men who score 60 each and another 100 who score 45 each, the script should return a random 5 from those first 200 men out of the 25000.
ORDER BY RAND()
is very slow
The real issue is that the 5 men should be a random selection within the first 200 records. Thanks for the help
To get something like this I would use a subquery; that way you only apply RAND() in the outer query, over far fewer rows, which is much less taxing.
From what I understood of your question, you want the 200 males from the table with the highest scores, so that would be something like this:
SELECT *
FROM table_name
WHERE sex = 'male'
ORDER BY score DESC
LIMIT 200
Now, to pick 5 random rows out of those 200, it would be something like this:
SELECT id, score, name, age, sex
FROM
( SELECT *
FROM table_name
WHERE sex = 'male'
ORDER BY score DESC
LIMIT 200
) t -- could also be written `AS t` or anything else you would call it
ORDER BY RAND()
LIMIT 5
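If the inner query itself is slow, a composite index should let MySQL find the top 200 males from the index instead of sorting the whole table. A sketch, reusing the column names from above:
-- equality on sex plus ordering on score, so ORDER BY score DESC LIMIT 200 can be read off the index
ALTER TABLE table_name ADD INDEX idx_sex_score (sex, score);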
I don't think that sorting by random can be "optimised out" in any way, since sorting is an N*log(N) operation and the query optimizer can only avoid a sort by using an index - and there is no index it can use for RAND().
ORDER BY RAND() assigns a random value to every row of the table and then sorts the whole set just to return a handful of rows. That takes a noticeable amount of processing time once a table grows past a few hundred rows, and with approximately 25000 rows it will definitely be felt.

Optimizing slow ORDER BY RAND() query

I have a query that uses ORDER BY RAND(), but it takes too long and it's getting worse as the data grows.
The query joins two tables and returns 5 random products with a random image of each product.
Table 1 - Products
product_id - pk auto-inc
name
description
Data
1 - product 1 - description
2 - product 2 - description
Table 2 - ProductImages
image_id - pk auto-inc
product_id - fk index
filename
Data
1 - 1 - product 1 image
2 - 1 - product 1 image
3 - 1 - product 1 image
4 - 2 - product 2 image
...
I've read a couple of related answers but cannot find a way to optimize the query, so I'm asking for help.
Thanks in advance.
ORDER BY RAND() is slow because the DBMS has to read all rows and sort them all just to keep a few. So the performance of this query heavily depends on the number of rows in the table, and degrades as the number of rows increases.
There is no way to optimize that.
There are alternatives, however:
You can implement "get 5 random rows" by doing 6 queries:
get number of rows in table (you can cache this one)
do 5 queries with LIMIT 1 OFFSET <random offset from 0 to $number_of_rows-1> (i.e. read and return only one row at some random offset)
For example: SELECT * FROM Products LIMIT 1 OFFSET 42 (note: without joining, for now)
Such queries are fast: MySQL still has to step over the skipped rows, but that is far cheaper than generating a random number for every row and sorting them all.
This should be much faster than ORDER BY RAND().
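Here is a sketch of what one of those per-row queries looks like if you stay entirely inside MySQL; LIMIT/OFFSET cannot take a user variable directly, so it goes through a prepared statement (from application code you would simply interpolate the computed offset):
SELECT COUNT(*) INTO @n FROM Products;   -- step 1: row count (cacheable)
SET @off = FLOOR(RAND() * @n);           -- random offset in [0, @n - 1]
SET @sql = CONCAT('SELECT * FROM Products LIMIT 1 OFFSET ', @off);
PREPARE stmt FROM @sql;
EXECUTE stmt;                            -- returns one random row; repeat from SET @off for the other 4
DEALLOCATE PREPARE stmt;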
Now, to get a random Image for each random Product:
SELECT *
FROM (
SELECT *
FROM Products
LIMIT 1 OFFSET 42
) p
JOIN ProductImages pi
ON pi.product_id = p.product_id
ORDER BY RAND()
LIMIT 1
The inner query is still fast, and the outer one only sorts a few rows (assuming there are only a few images per product), so it can still afford ORDER BY RAND().

MySQL Rating With Weight

I want to create a rating with a weight that depends on the number of votes.
So, a single vote of 5 can't be better than four votes of 4.
I found this math form:
bayesian = ( (avg_num_votes * avg_rating) + (this_num_votes * this_rating) ) / (avg_num_votes + this_num_votes)
How can I write a MySQL SELECT to get the ID of the best-rated image?
I got a table for IMAGE, and a table for VOTING
VOTING:
id
imageID
totalVotes
avgVote
I think I have to do this with a SELECT inside a SELECT, but how?
A first step is to calculate avg_num_votes and avg_rating:
SELECT
SUM(totalVotes)/COUNT(*) AS avg_num_votes,
SUM(avgVote)/COUNT(*) AS avg_rating
FROM voting;
If you can live with a small error, it might be good enough to calculate that once in a while.
Now, using your formula and the values above, you can run the weighting query. As a small optimization, I precalculate avg_num_votes * avg_rating and call it avg_summand:
SELECT
voting.*, -- or whatever fields you need
($avg_summand+totalVotes*avgVote)/($avg_num_votes+totalVotes) AS bayesian
FROM voting
ORDER BY bayesian DESC
LIMIT 1;
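If you run this straight from MySQL rather than substituting $avg_summand and $avg_num_votes from application code, the same thing can be sketched with session variables:
SELECT AVG(totalVotes), AVG(avgVote)
INTO @avg_num_votes, @avg_rating
FROM voting;                              -- same values as the SUM(...)/COUNT(*) query above
SET @avg_summand = @avg_num_votes * @avg_rating;
SELECT
voting.*, -- or whatever fields you need
(@avg_summand + totalVotes * avgVote) / (@avg_num_votes + totalVotes) AS bayesian
FROM voting
ORDER BY bayesian DESC
LIMIT 1;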
Edit
You could run this as a join:
SELECT
voting.*, -- or whatever fields you need
(avg_num_votes*avg_rating+totalVotes*avgVote)/(avg_num_votes+totalVotes) AS bayesian
FROM voting,
(
SELECT
SUM(totalVotes)/COUNT(*) AS avg_num_votes,
SUM(avgVote)/COUNT(*) AS avg_rating
FROM voting AS iv
) AS avg
ORDER BY bayesian DESC
LIMIT 1;
But this will recalculate the sum and average on every single query - call it a performance bomb.
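The "once in a while" idea from above can be made concrete with a small helper table that caches the averages (the table and column names here are made up); refresh it from a cron job or a MySQL event, and use its values instead of the subquery:
CREATE TABLE voting_stats (
id            TINYINT PRIMARY KEY,       -- single-row table
avg_num_votes DECIMAL(12,4) NOT NULL,
avg_rating    DECIMAL(12,4) NOT NULL
);
REPLACE INTO voting_stats (id, avg_num_votes, avg_rating)
SELECT 1, AVG(totalVotes), AVG(avgVote) FROM voting;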

Select a Portion of Vast Data Over Time with MySQL

I have hundreds of thousands of price points spanning 40 years plus. I would like to construct a query that will only return 3000 total data points, with the last 500 being the most recent data points, and the other 2500 being just a sample of the rest of the data, evenly distributed.
Is it possible to do this in one query? How would I select just a sample of the large amount of data? This is a small example of what I mean for getting just a sample of the other 2500 data points:
1
2
3
4
5
6
7
8
9
10
And I want to return something like this:
1
5
10
Here's the query for the last 500:
SELECT * FROM price ORDER BY time_for DESC LIMIT 500
I'm not sure how to go about getting the sample data from the other data points.
Try this:
(SELECT * FROM price ORDER BY time_for DESC LIMIT 500)
UNION ALL
(SELECT * FROM price
 WHERE time_for < (SELECT time_for FROM price ORDER BY time_for DESC LIMIT 499, 1)
 ORDER BY RAND() LIMIT 2500)
ORDER BY time_for
Note: It's probably going to be slow. How big is this table?
It might be faster to fetch only the primary ID of these rows first, then join back to the original table in a second query once the set is narrowed down. This is because ORDER BY RAND() ... LIMIT has to sort the entire table; if the table is large this can take a LONG time and a lot of temporary disk space, and sorting only the IDs keeps that work much smaller.
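A sketch of that idea, assuming the primary key of price is simply called id: collect only the ids first, then join back for the full rows:
SELECT p.*
FROM price p
JOIN (
(SELECT id FROM price ORDER BY time_for DESC LIMIT 500)
UNION ALL
(SELECT id FROM price
WHERE time_for < (SELECT time_for FROM price ORDER BY time_for DESC LIMIT 499, 1)
ORDER BY RAND() LIMIT 2500)
) AS picked ON picked.id = p.id
ORDER BY p.time_for;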
The previous answer is good, but you did specify that you want the results to be evenly distributed, so I'll add this possibility too. By running a counter over the rows you can use the modulo operator to take an evenly spaced sample. I don't have a MySQL install right now to test this, so apologies if the syntax isn't 100% spot on, but it should be close enough and may give you some ideas.
( SELECT 0 AS row_num, p1.*          -- dummy row_num so both UNION branches have the same columns
  FROM price p1
  ORDER BY p1.time_for DESC
  LIMIT 500 )
UNION ALL
( SELECT t.*
  FROM ( SELECT @i := @i + 1 AS row_num,   -- number the rows, newest first
                p2.*
         FROM price p2,
              (SELECT @i := 0) AS init
         ORDER BY p2.time_for DESC
       ) AS t
  WHERE t.row_num > 500
    AND (t.row_num % 500) = 0 )
The first query gives the 500 latest rows. The second query gives every 500th row after that, thus returning an even distribution from the rest of the data. Obviously you can tune this parameter to achieve the desired sample spacing. Or base it on the total number of rows in the table to calculate the necessary spacing to give exactly 2500 records.
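For that last point, a hypothetical way to size the spacing from the row count so the sampled branch lands near 2500 rows:
-- use this value in place of the hard-coded 500 in the modulo above
SELECT GREATEST(1, FLOOR((COUNT(*) - 500) / 2500)) AS spacing FROM price;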