I have a query that is using ORDER BY RAND() but it takes too long and it's getting worse as data is growing.
The query joins two tables and it returns 5 random products and a random image of each product
Table 1 - Products
product_id - pk auto-inc
name
description
Data
1 - product 1 - description
2 - product 2 - description
Table 2 - ProductImages
image_id - pk auto-inc
product_id - fk index
filename
Data
1 - 1 - product 1 image
2 - 1 - product 1 image
3 - 1 - product 1 image
4 - 2 - product 2 image
...
I've read this and this but cannot find a way to optimize the query, so I'm asking for help.
Thanks in advance.
ORDER BY RAND() is slow because the DBMS has to read all rows and sort them all, just to keep only a few rows. So the performance of this query depends heavily on the number of rows in the table, and it degrades as the number of rows increases.
There is no way to optimize that.
There are alternatives, however:
You can implement "get 5 random rows" by doing 6 queries:
get number of rows in table (you can cache this one)
do 5 queries with LIMIT 1 OFFSET <random offset from 0 to $number_of_rows-1> (i.e. read and return only one row from some random offset)
For example: SELECT * FROM Products LIMIT 1 OFFSET 42 (note: without joining, for now)
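Such queries avoid the sort and return a single row. The server still steps over the rows before the offset, so the cost grows with the offset rather than being fully independent of the table size.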
This should be much faster than ORDER BY RAND().
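The six-query approach above can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 module and an in-memory table as a stand-in for the MySQL Products table, and the helper name random_products and the sample data are assumptions made for the example. Drawing distinct offsets (so the same product is not returned twice) is also an addition that the answer does not spell out.

```python
import random
import sqlite3


def random_products(conn, how_many=5, rng=random):
    """Return `how_many` distinct random rows from Products using random offsets."""
    # Query 1: the row count (this value could be cached).
    (total,) = conn.execute("SELECT COUNT(*) FROM Products").fetchone()
    # Draw distinct offsets in [0, total - 1] so no product is returned twice.
    offsets = rng.sample(range(total), min(how_many, total))
    rows = []
    for offset in offsets:
        # Queries 2..6: read exactly one row at the chosen offset.
        row = conn.execute(
            "SELECT product_id, name FROM Products "
            "ORDER BY product_id LIMIT 1 OFFSET ?",
            (offset,),
        ).fetchone()
        rows.append(row)
    return rows


# Hypothetical sample data: 100 products.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Products (product_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO Products (name) VALUES (?)",
    [(f"product {i}",) for i in range(1, 101)],
)
picked = random_products(conn)
print(picked)
```

The ORDER BY product_id is there so that a given offset always refers to the same row; without it the row order is not guaranteed.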
Now, to get a random Image for each random Product:
SELECT *
FROM (
SELECT *
FROM Products
LIMIT 1 OFFSET 42
) p
JOIN ProductImages pi
ON pi.product_id = p.product_id
ORDER BY RAND()
LIMIT 1
The inner query is still fast, and the outer query only sorts a few rows (assuming there are few images per product), so it can still use ORDER BY RAND().
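A runnable sketch of this "random image for one product" query is shown here, again with sqlite3 standing in for MySQL. The helper name product_with_random_image, the alias img and the file names are made up for the example, and SQLite spells the function RANDOM() rather than RAND().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Products (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE ProductImages (
        image_id INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES Products(product_id),
        filename TEXT
    );
    INSERT INTO Products (name) VALUES ('product 1'), ('product 2');
    INSERT INTO ProductImages (product_id, filename) VALUES
        (1, 'product 1 image a'), (1, 'product 1 image b'),
        (1, 'product 1 image c'), (2, 'product 2 image a');
    """
)


def product_with_random_image(conn, offset):
    """Pick the product at `offset`, then one of its images at random."""
    return conn.execute(
        """
        SELECT p.product_id, p.name, img.filename
        FROM (SELECT * FROM Products ORDER BY product_id LIMIT 1 OFFSET ?) AS p
        JOIN ProductImages AS img ON img.product_id = p.product_id
        ORDER BY RANDOM()  -- SQLite spelling of MySQL's RAND()
        LIMIT 1
        """,
        (offset,),
    ).fetchone()


row = product_with_random_image(conn, 0)
print(row)
```

Only the handful of images belonging to the chosen product are shuffled, which is why the random sort stays cheap here.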
Related
I have the following sample MySQL table:
id | count_likes
-----------
1 | 30
2 | 95
3 | 60
4 | 60
5 | 22
I want to order the table by column count_likes descending and display 5 rows at a time (this is a sample table so assume thousands of rows).
To achieve this I run the following command:
SELECT * FROM table ORDER BY count_likes DESC, id DESC LIMIT 5
I want to give the option for users to load more rows like loading facebook comments for example (5 rows at a time).
To achieve this I run the following command:
SELECT * FROM table WHERE id NOT IN(values already loaded)
ORDER BY count_likes DESC, id DESC LIMIT 5
This could work well for a few pages, but I don't think it's advisable to have hundreds of values in the WHERE NOT IN clause.
If I make the command like this:
SELECT * FROM table WHERE count_likes < 'the last displayed count number'
I could miss some rows which have the same count as the last loaded row.
If I make the command like this:
SELECT * FROM table WHERE count_likes <= 'the last displayed count number'
I could get duplicate values that are already loaded.
If I make the command like this:
SELECT * FROM table ORDER BY count_likes DESC LIMIT offset,5
I may get disorganized or duplicate rows as the count_likes for any row may increase or decrease while other users are manipulating the same page.
What is the best way to load more rows in my case above?
The most accurate one would be WHERE NOT IN, but I don't know whether it causes performance issues when the list grows to hundreds or even thousands of values.
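One option that is not from the original thread, sketched here only to make the tie problem concrete: remember both count_likes and id of the last displayed row and use id as a tie-breaker in the WHERE clause. Rows with the same count are then neither skipped nor repeated. The table name posts, the helper load_more and the page size of 2 are assumptions for the example, which runs on sqlite3 rather than MySQL. It does not address rows whose count_likes changes between requests.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, count_likes INTEGER)")
conn.executemany(
    "INSERT INTO posts (id, count_likes) VALUES (?, ?)",
    [(1, 30), (2, 95), (3, 60), (4, 60), (5, 22)],
)


def load_more(conn, last=None, page_size=2):
    """Return the next page after `last`, a (count_likes, id) pair, or the first page."""
    if last is None:
        return conn.execute(
            "SELECT id, count_likes FROM posts "
            "ORDER BY count_likes DESC, id DESC LIMIT ?",
            (page_size,),
        ).fetchall()
    last_likes, last_id = last
    # Rows strictly "after" the last one shown: fewer likes, or the same
    # number of likes and a smaller id (the tie-breaker).
    return conn.execute(
        "SELECT id, count_likes FROM posts "
        "WHERE count_likes < ? OR (count_likes = ? AND id < ?) "
        "ORDER BY count_likes DESC, id DESC LIMIT ?",
        (last_likes, last_likes, last_id, page_size),
    ).fetchall()


page1 = load_more(conn)
page2 = load_more(conn, last=(page1[-1][1], page1[-1][0]))
page3 = load_more(conn, last=(page2[-1][1], page2[-1][0]))
print(page1, page2, page3)
```

With the sample data, the two rows that share 60 likes end up on different pages without being lost or duplicated.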
I have a table with a 'billnumber' column (INT 11)
The billnumber value may change every 3 or 4 records, so if I index 1M records I will have an index with about 250K distinct values.
What I want is a way to index every 1000 billnumber values together:
- from 1 to 1000
- from 1001 to 2000
Is there actually a problem?
INDEX(billnumber) is non-unique, so this will return 3 or 4 rows without difficulty:
SELECT ...
WHERE billnumber = 1234
If you want to select 1..1000, simply do
SELECT ...
WHERE billnumber BETWEEN 1 AND 1000;
Both are efficient.
Keep in mind, a database table has no order. To get the rows ordered, you must use ORDER BY.
Meanwhile, INDEXes try to make WHERE and ORDER BY efficient.
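A small sketch of both lookups on a plain, non-unique index follows, using sqlite3 as a stand-in for MySQL. The table name bills, the index name idx_billnumber and the generated data (each bill number shared by 4 records) are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bills (id INTEGER PRIMARY KEY, billnumber INTEGER)")
# Each bill number is shared by 4 consecutive records: 1,1,1,1,2,2,2,2,...
conn.executemany(
    "INSERT INTO bills (billnumber) VALUES (?)",
    [(i // 4 + 1,) for i in range(8000)],
)
# A plain (non-unique) index on the column.
conn.execute("CREATE INDEX idx_billnumber ON bills (billnumber)")

# Equality lookup: returns every record that carries this bill number.
one_bill = conn.execute(
    "SELECT id FROM bills WHERE billnumber = ?", (1234,)
).fetchall()
# Range lookup: all records for bill numbers 1..1000.
bill_range = conn.execute(
    "SELECT COUNT(*) FROM bills WHERE billnumber BETWEEN 1 AND 1000"
).fetchone()[0]
print(len(one_bill), bill_range)  # prints: 4 4000
```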
I'm trying to pick random articles from my database, where high rating articles have a higher chance of getting picked
SELECT * FROM articles WHERE RAND()>0.9 ORDER BY rating DESC LIMIT 3
My question is:
Will it generate a random number for every row in the table, or only until it has found 3 articles whose random number is higher than 0.9?
If you have INDEX(rating), that query will probably fetch about 30 rows (3/(1-0.9)) before finding the 3, because each row passes RAND()>0.9 with a probability of only 0.1.
But that does not give you "high rating articles have a higher chance of getting picked" at all. It merely gives you a random 10% of the highest-rated rows.
This might give you what you want, but with a full table scan:
SELECT *
FROM articles
ORDER BY rating * RAND() DESC
LIMIT 3;
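To see what ORDER BY rating * RAND() DESC does to the selection odds, here is a small pure-Python simulation. It is only a sketch: the function name pick_weighted and the ratings (ten articles whose rating equals their id) are invented for the illustration, and Python's random.random() plays the role of RAND().

```python
import random
from collections import Counter


def pick_weighted(ratings, k=3, rng=random):
    """Mimic ORDER BY rating * RAND() DESC LIMIT k over {article_id: rating}."""
    scored = sorted(
        ratings,
        key=lambda article_id: ratings[article_id] * rng.random(),
        reverse=True,
    )
    return scored[:k]


# Hypothetical data: article ids 1..10, where the rating equals the id.
ratings = {article_id: article_id for article_id in range(1, 11)}
rng = random.Random(0)
counts = Counter()
for _ in range(2000):
    counts.update(pick_weighted(ratings, rng=rng))
print(counts.most_common())
```

Every article can still be drawn, but an article with a higher rating shows up in the top 3 more often than one with a lower rating.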
I have this table,
person_id int(10) pk
points int(6) index
other columns not very important
I have this random function which is very fast on a table with 10M rows:
SELECT person_id
FROM persons AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(person_id)
FROM persons)) AS id)
AS r2
WHERE r1.person_id >= r2.id
ORDER BY r1.person_id ASC
LIMIT 1
This is all great but now I wish to show only people with points > 0. Example table:
PERSON_ID POINTS
1 4
2 6
3 0
4 3
When I append AND points > 0 to the WHERE clause, person_id 3 can't be selected, so a gap is created: whenever the random value lands on person_id 3, person_id 4 is selected instead. This gives person 4 a bigger chance of being chosen. Does anyone have suggestions for adjusting the query so that it works with the WHERE clause and gives all rows the same chance of being selected?
Info table: The table is uniform, no gaps in person_id's. About 90% will have 0 points. I want to make the query for where points = 0 and points > 0.
Before someone says "use ORDER BY RAND()": that is not a solution for tables with more than a few 100k rows.
Bonus question: will it be possible to select x random rows in 1 query, so I do not have to call this query a few times when i want more random rows?
Important note: performance is key, with 10M+ rows the query may not take much longer than the current query, which takes 0.0005 seconds, I prefer to stay under 0.05 second.
Last note: If you think the query will never be this fast with above requirements, but another solution is possible (like fetching 100 rows and showing x random which has more than 0 points), please tell :)
Really appreciate your help and all help is welcome :)
You could generate in-line, gap-free ids for the records that you really want to work with, and then generate the random selector from the total number of those records.
Try with this (props to the chosen answer here for the row_number generator):
SELECT r1.*
FROM
(SELECT person_id,
@curRow := @curRow + 1 AS row_num
FROM persons AS p,
(SELECT @curRow := 0) r0
WHERE points > 0) r1
, (SELECT COUNT(1) * RAND() id
FROM persons
WHERE points > 0) r2
WHERE r1.row_num >= r2.id
ORDER BY r1.row_num ASC
LIMIT 1;
You can mess with it in this sqlfiddle.
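The same idea (number only the rows with points > 0 without gaps, then pick one position using their count) can be checked with a short script. This sketch uses sqlite3 rather than MySQL and replaces the user-variable numbering with LIMIT 1 OFFSET on the filtered rows, which has the same effect; the helper name random_person_with_points is an assumption, and the data is the example table from the question.

```python
import random
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE persons (person_id INTEGER PRIMARY KEY, points INTEGER)")
conn.executemany(
    "INSERT INTO persons (person_id, points) VALUES (?, ?)",
    [(1, 4), (2, 6), (3, 0), (4, 3)],
)
conn.execute("CREATE INDEX idx_points ON persons (points)")


def random_person_with_points(conn, rng=random):
    """Pick one person with points > 0, each with the same probability."""
    (total,) = conn.execute(
        "SELECT COUNT(*) FROM persons WHERE points > 0"
    ).fetchone()
    # Position in the gap-free numbering 0 .. total-1 of the filtered rows.
    # (Assumes at least one matching row; randrange(0) would raise.)
    position = rng.randrange(total)
    (person_id,) = conn.execute(
        "SELECT person_id FROM persons WHERE points > 0 "
        "ORDER BY person_id LIMIT 1 OFFSET ?",
        (position,),
    ).fetchone()
    return person_id


rng = random.Random(1)
counts = Counter(random_person_with_points(conn, rng) for _ in range(3000))
print(counts)
```

Over many draws, persons 1, 2 and 4 come up about equally often, so person 4 no longer benefits from the gap left by person 3.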
I have hundreds of thousands of price points spanning 40 years plus. I would like to construct a query that will only return 3000 total data points, with the last 500 being the most recent data points, and the other 2500 being just a sample of the rest of the data, evenly distributed.
Is it possible to do this in one query? How would I select just a sample of the large amount of data? This is a small example of what I mean for getting just a sample of the other 2500 data points:
1
2
3
4
5
6
7
8
9
10
And I want to return something like this:
1
5
10
Here's the query for the last 500:
SELECT * FROM price ORDER BY time_for DESC LIMIT 500
I'm not sure how to go about getting the sample data from the other data points.
Try this:
(SELECT * FROM price ORDER BY time_for DESC LIMIT 500)
UNION ALL
(SELECT * FROM price WHERE time_for < (SELECT time_for FROM price ORDER BY time_for DESC LIMIT 499, 1) ORDER BY RAND() LIMIT 2500)
ORDER BY time_for
Note: It's probably going to be slow. How big is this table?
It might be faster to only get the primary ID from all these rows, then join it to the original in a secondary query once it's narrowed down. This is because ORDER BY rand() LIMIT has to sort the entire table. If the table is large this can take a LONG time, and a lot of disk space. Retrieving only the ID reduces the necessary disk space.
The previous answer is good, but you did specify that you want the results to be evenly distributed, so I'll add this possibility too. By iterating a counter over the rows you can use a MOD operator to sample an even distribution. I don't have a MySQL install right now to test this, so apologies if the syntax isn't 100% spot on. But it should be close enough and may give you some ideas.
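-- row_num is 0 here only so that both halves return the same columns
( SELECT 0 AS row_num,
p1.*
FROM price p1
ORDER BY p1.time_for DESC
LIMIT 500 )
UNION ALL
-- number the rows newest first, then keep every 500th one after the first 500
( SELECT numbered.*
FROM ( SELECT @i := @i + 1 AS row_num,
p2.*
FROM price p2,
(SELECT @i := 0) init
ORDER BY p2.time_for DESC ) numbered
WHERE numbered.row_num > 500
AND (numbered.row_num % 500) = 0 )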
The first query gives the 500 latest rows. The second query gives every 500th row after that, thus returning an even distribution from the rest of the data. Obviously you can tune this parameter to achieve the desired sample spacing. Or base it on the total number of rows in the table to calculate the necessary spacing to give exactly 2500 records.
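Deriving the spacing from the row count, as suggested in the last sentence, could look like the sketch below. It works on rows that have already been fetched newest first; the function name sample_prices and the integer "timestamps" are placeholders for the example, not something from the original posts.

```python
def sample_prices(rows, recent=500, sampled=2500):
    """Split `rows` (newest first) into the latest `recent` rows plus an
    evenly spaced sample of at most `sampled` rows from the remainder."""
    latest = rows[:recent]
    older = rows[recent:]
    if len(older) <= sampled:
        # Not enough older rows to thin out: keep them all.
        return latest + older
    # Spacing derived from the row count so that exactly `sampled` rows are kept.
    step = len(older) / sampled
    picked = [older[int(i * step)] for i in range(sampled)]
    return latest + picked


# Hypothetical data: 100,500 timestamps, newest first.
rows = list(range(100_500, 0, -1))
result = sample_prices(rows)
print(len(result), result[:3], result[500:503])
```

Here the 100,000 older rows are thinned to every 40th row, giving 500 + 2500 = 3000 points in total.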