I am trying to extract a random article that has a picture from a database.
SELECT FLOOR(MAX(id) * RAND()) FROM `table` WHERE `picture` IS NOT NULL
My table is 33 MB and has 1,006,394 articles, but only 816 with pictures.
My problem is that this query takes 0.4640 sec.
I need this to be much, much faster.
Any idea is welcome.
P.S.
1. Of course I have an index on id.
2. There is no index on the picture field. Should I add one?
3. The product name is unique, as is the product number, but that's out of the question.
RESULT OF TESTING SESSION:
#cHao's solution is faster when I use it to select one of the random entries with a picture (less than 0.1 sec), but it's slower if I try to do the opposite, selecting a random article without a picture: 2..3 sec.
#Kickstart's solution is a bit slower when trying to find an entry with a picture, but almost the same speed when trying to find an entry without one: 0.149 sec on average.
#bob-kruithof's solution doesn't work for me: when trying to find an entry with a picture, it selects an entry without one.
And #ganesh-bora, yes, you are right; in my case the speed difference is about 5..15 times.
I want to thank you all for your help, and I decided for #Kickstart.
You need to get a range of values with matching records and then find a matching record within that range.
Something like this:
SELECT r1.id
FROM `table` AS r1
INNER JOIN (
SELECT RAND( ) * ( MAX( id ) - MIN( id ) ) + MIN( id ) AS id
FROM `table`
WHERE `picture` IS NOT NULL
) AS r2
ON r1.id >= r2.id
WHERE `picture` IS NOT NULL
ORDER BY r1.id ASC
LIMIT 1
However, for any hope of efficiency you need an index on the field being checked (i.e., picture in your example).
Just an explanation of how this works.
The sub select finds a random id from the table between the min and max ids of the picture records. This random id may or may not belong to a picture.
The resulting id from this sub select is joined back against the main table, but using >= and with a WHERE clause specifying that the record is a picture record. Hence it joins against all picture records where the id is greater than or equal to the random id. The highest random id will be the one for the picture record with the highest id, so it will always find a record (if there are any picture records). The ORDER BY / LIMIT is then used to bring back that single id.
Note that there is an obvious flaw to this, but most of the time it will be irrelevant. The record retrieved may not be entirely random. The picture with the lowest id is unlikely to be returned (will only be returned if the RAND() returns exactly 0), but if this is important this is easy enough to fix by rounding the resulting random id. The other flaw is that if the ids are not vaguely equally distributed in the full range of ids then some will be returned more often than others. For example, take the situation where the first 1000 ids were pictures, then no more until the last (33 millionth) record. The random id could be any of those 33 million, but unless it is less than or equal to 1000 then it will be the 33 millionth record that will be returned.
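Since the two-step logic above is easiest to verify by running it, here is a minimal sketch of the same min/max range trick using Python's sqlite3 as a stand-in for MySQL (SQLite has no RAND() returning a 0..1 float, so the random id is drawn in the application; the `articles` table and its columns are placeholders for your schema):

```python
import random
import sqlite3

# In-memory stand-in for the MySQL table; most rows have no picture.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, picture TEXT)")
conn.executemany(
    "INSERT INTO articles (id, picture) VALUES (?, ?)",
    [(i, f"pic{i}.jpg" if i % 100 == 0 else None) for i in range(1, 1001)],
)

# Step 1 (the subselect): a random id between MIN(id) and MAX(id)
# of the picture rows.  It may or may not belong to a picture row.
lo, hi = conn.execute(
    "SELECT MIN(id), MAX(id) FROM articles WHERE picture IS NOT NULL"
).fetchone()
random_id = random.randint(lo, hi)

# Step 2 (the outer query): the first picture row at or above that id.
row = conn.execute(
    "SELECT id, picture FROM articles"
    " WHERE picture IS NOT NULL AND id >= ?"
    " ORDER BY id LIMIT 1",
    (random_id,),
).fetchone()
```

Because the random id never exceeds the highest picture id, the final lookup always finds a row when any picture rows exist.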
You might try attaching a random number to each row, then sorting by that. The row with the lowest number will be at the top.
SELECT `table`.`id`, RAND() as `order`
FROM `table`
WHERE `picture` IS NOT NULL
ORDER BY `order`
LIMIT 1;
This is of course slower than just magicking up an ID with RAND(), but (1) it'll always give you a valid ID (as long as there's a record with a non-null picture field in the table, anyway), and (2) the WTF ratio is pretty low; most people can tell what's going on here. :) Its performance rivals Kickstart's solution with a decently indexed table, when the number of items to select from is relatively small (around 1%). Definitely don't try to select from a whole huge table like this; limit it first with a WHERE clause on some indexed field(s).
Performancewise, if you have a long-running app (ie: not PHP; i'm talking about Java, .net, etc where the app is alive even between requests), you might try to keep a list of all the IDs of items with pictures, select a random ID from that list, and load the article. You could do that in PHP too, if you wanted. It might not work as well when you have to query all the IDs each time, but it could be very useful if you can cache the list of IDs in APC or something.
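The cached-ID-list idea might look like this as a Python sketch (sqlite3 stands in for the real database; the cache refresh policy, APC, etc. are left out):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, picture TEXT)")
conn.executemany(
    "INSERT INTO articles (id, picture) VALUES (?, ?)",
    [(i, "pic.jpg" if i % 10 == 0 else None) for i in range(1, 101)],
)

# Cache the (small) list of ids that actually have pictures once...
picture_ids = [r[0] for r in conn.execute(
    "SELECT id FROM articles WHERE picture IS NOT NULL")]

# ...then each request is an O(1) random choice plus a primary-key lookup.
article_id = random.choice(picture_ids)
article = conn.execute(
    "SELECT id, picture FROM articles WHERE id = ?", (article_id,)
).fetchone()
```

With only 816 picture ids, the cached list is tiny, and every picture is equally likely to be picked.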
For performance, you could first add an index on the picture column, so the 816 records with pictures can be located quickly, and then run your query.
How has someone else solved the problem?
I would suggest looking at this article about different possible ways of selecting random rows in MySQL.
Modified example from the article
SELECT name
FROM random JOIN
( SELECT CEIL( RAND() * (
SELECT MAX( id ) FROM random WHERE picture IS NOT NULL
) ) AS id ) AS r2 USING ( id );
This might work in your case.
Efficiency
As user Kickstart mentioned: Do you have an index on the column picture? This might help getting you the results a bit faster.
Are your tables optimized?
Related
Imagine I would like to develop something similar to Tinder. I have a database with roughly 170k rows (=persons) and I would like to present them on my website. After the user's response, the next person is shown etc.
Once a person has been shown, this is marked in the column 'seen' with a 1. The order in which the persons are shown should be random and only persons that have not been seen yet should be shown.
At the moment, I have this solution. However, this is rather slow and takes too much time for a smooth experience. What would be a more efficient approach to this problem? What is the gold standard for such problems?
SELECT * FROM data WHERE (seen = 0) ORDER BY RAND() LIMIT 1
Add a non-clustered index on the 'seen' column and the PK column, which will speed up queries filtering on them.
If the primary id is sequential and you know the limits of the records, you can get a random number between the max and min values and query like:
SELECT *
FROM data
WHERE seen = 0 AND id >= random_id
LIMIT 1
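A sketch of that range trick (SQLite via Python; an ORDER BY id is added so "the first unseen row at or above the random id" is well defined, and the random id is drawn in the application since SQLite lacks a 0..1 RAND()):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, seen INTEGER)")
conn.executemany(
    "INSERT INTO data (id, seen) VALUES (?, ?)",
    [(i, 1 if i % 2 == 0 else 0) for i in range(1, 201)],  # odd ids unseen
)
conn.execute("CREATE INDEX data_seen_id ON data (seen, id)")

# Random id within the range of unseen rows...
lo, hi = conn.execute(
    "SELECT MIN(id), MAX(id) FROM data WHERE seen = 0").fetchone()
random_id = random.randint(lo, hi)

# ...then the first unseen row at or above it: an index range scan,
# instead of ORDER BY RAND() over the whole table.
person = conn.execute(
    "SELECT id FROM data WHERE seen = 0 AND id >= ? ORDER BY id LIMIT 1",
    (random_id,),
).fetchone()
```
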
Say you have a table with n rows, what is the most efficient way to get the first row ever recorded on that table without sorting?
This is guaranteed to work, but becomes slower as the number of records increases:
SELECT * FROM posts ORDER BY created_at ASC LIMIT 1;
UPDATE:
This is even better in case there are multiple records with the same created_at value, but still needs sorting:
SELECT * FROM posts ORDER BY id ASC LIMIT 1;
Imagine a ledger book with 1 million pages and 1 billion lines of records, to get the first ever record, you'd simply turn to the first page and get the one on the top most, right? Regardless of the size of the ledger, you should get the first ever record with the same efficiency. I was hoping I could do the same in MySQL without doing any kind of sorting or ordering. For research purposes. I mean, why not? Why can't MySQL? Is it impossible by design?
This is possible in typical array structures in programming:
array = [1,2,3,4,5]
The first element is in array[0], the second in array[1] and so on. There is no sorting necessary. The last element is array[array_count(array)-1].
I can offer the following two queries to find the most recent record:
SELECT * FROM posts ORDER BY created_at DESC LIMIT 1
and
SELECT *
FROM posts
WHERE created_at = (SELECT MAX(created_at) FROM posts)
Both queries would suffer from performance degradation as the table gets larger, because the sorting operation needed to find the most recent created date would take more time.
But in both cases, adding the following index should improve the performance of the query:
ALTER TABLE posts ADD INDEX created_idx (created_at)
MySQL can use an index both in the ORDER BY clause and when finding the max. See the documentation for more information.
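A quick way to convince yourself that both forms return the same row (SQLite here; in MySQL, the index above lets both resolve without a full sort):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO posts (created_at) VALUES (?)",
    [("2024-01-%02d" % day,) for day in range(1, 31)],
)
conn.execute("CREATE INDEX created_idx ON posts (created_at)")

# Variant 1: sort descending and take the first row.
newest_by_sort = conn.execute(
    "SELECT id FROM posts ORDER BY created_at DESC LIMIT 1").fetchone()

# Variant 2: look up the row holding MAX(created_at).
newest_by_max = conn.execute(
    "SELECT id FROM posts"
    " WHERE created_at = (SELECT MAX(created_at) FROM posts)").fetchone()
```
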
I'm currently working on a multi-thread program (in Java) that will need to select random rows in a database, in order to update them. This is working well but I started to encounter some performance issue regarding my SELECT request.
I tried multiple solutions before finding this website :
http://jan.kneschke.de/projects/mysql/order-by-rand/
I tried with the following solution :
SELECT * FROM Table
JOIN (SELECT FLOOR( COUNT(*) * RAND() ) AS Random FROM Table)
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL
LIMIT 1;
It selects only one row below the randomly generated id number. This works pretty well (an average of less than 100 ms per request on 150k rows). But after my program has processed a row, its FOREIGNKEY_ID will no longer be NULL (it is updated with some value).
The problem is, my SELECT will "forget" some rows that have an id below the randomly generated id, and I won't be able to process them.
So I tried to adapt my request, doing this :
SELECT * FROM Table
JOIN (SELECT FLOOR(
(SELECT COUNT(id) FROM Table WHERE FOREIGNKEY_ID IS NULL) * RAND() )
AS Random FROM Table)
AS R ON Table.ID > R.Random
WHERE Table.FOREIGNKEY_ID IS NULL
LIMIT 1;
With that request, no more problems of skipping some rows, but performances are decreasing drastically (an average of 1s per request on 150k rows).
I could simply execute the fast one when I still have a lot of rows to process, and switch to the slow one when it remains just a few rows, but it will be a "dirty" fix in the code, and I would prefer an elegant SQL request that can do the work.
Thank you for your help, please let me know if I'm not clear or if you need more details.
For your method to work more generally, you want max(id) rather than count(*):
SELECT t.*
FROM Table t JOIN
     (SELECT FLOOR(MAX(id) * RAND()) AS Random FROM Table) r
     ON t.ID > r.Random
WHERE t.FOREIGNKEY_ID IS NULL
ORDER BY t.ID
LIMIT 1;
The ORDER BY is usually added to be sure that the "next" id is returned. In theory, MySQL could always return the maximum id in the table.
The problem is gaps in the ids. And it is easy to create distributions where some ids are rarely chosen . . . say that the four ids are 1, 2, 3, and 1000000. Your method will rarely get 1000000. The query above will almost always get it.
Perhaps the simplest solution to your problem is to run the first query multiple times until it gets a valid row. The next suggestion would be an index on (FOREIGNKEY_ID, ID), which the subquery can use. That might speed the query.
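The "run the first query multiple times until it gets a valid row" suggestion is a few lines of application code. A sketch (the loop bound is just a safety valve, and the random id is drawn in the application since SQLite lacks a 0..1 RAND()):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, foreignkey_id INTEGER)")
conn.executemany(
    "INSERT INT" "O t (id, foreignkey_id) VALUES (?, ?)",
    [(i, None if i <= 100 else 1) for i in range(1, 301)],  # 100 unprocessed
)

row = None
for _ in range(100):  # safety bound; typically exits within a few tries
    max_id = conn.execute("SELECT MAX(id) FROM t").fetchone()[0]
    random_id = random.randint(1, max_id)
    row = conn.execute(
        "SELECT id FROM t WHERE foreignkey_id IS NULL AND id >= ?"
        " ORDER BY id LIMIT 1",
        (random_id,),
    ).fetchone()
    if row is not None:
        break  # a miss means random_id landed past all remaining NULL rows
```
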
I tend to favor something more along these lines:
SELECT t.id
FROM Table t
WHERE t.FOREIGNKEY_ID IS NULL AND
RAND() < 1.0 / 1000
ORDER BY RAND()
LIMIT 1;
The purpose of the WHERE clause is to reduce the volume considerably, so the ORDER BY doesn't take much time.
Unfortunately, this will require scanning the table, so you probably won't get responses in the 100 ms range on a 150k table. You can reduce that to an index scan with an index on t(FOREIGNKEY_ID, ID).
EDIT:
If you want a reasonable chance of a uniform distribution and performance that does not increase as the table gets larger, here is another idea, which -- alas -- requires a trigger.
Add a new column to the table called `random`, which is initialized with rand(). Build an index on `random`. Then run a query such as:
select t.*
from ((select t.*
       from t
       where random >= @random
       order by random
       limit 10
      ) union all
      (select t.*
       from t
       where random < @random
       order by random desc
       limit 10
      )
     ) t
order by rand()
limit 1;
The idea is that the subqueries can use the index to choose a set of 20 rows that are pretty arbitrary -- 10 before and after the chosen point. The rows are then sorted (some overhead, which you can control with the limit number). These are randomized and returned.
The idea is that if you choose random numbers, there will be arbitrary gaps and these would make the chosen numbers not quite uniform. However, by taking a larger sample around the value, then the probability of any one value being chosen should approach a uniform distribution. The uniformity would still have edge effects, but these should be minor on a large amount of data.
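A sketch of this random-column idea (SQLite via Python; the trigger is omitted and the column is simply populated at insert time, with the pivot drawn in the application):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, random REAL)")
conn.executemany(
    "INSERT INTO t (id, random) VALUES (?, ?)",
    [(i, random.random()) for i in range(1, 1001)],
)
conn.execute("CREATE INDEX t_random ON t (random)")

pivot = random.random()
# Up to 10 rows on each side of the pivot, via two index range scans...
candidates = conn.execute(
    "SELECT id FROM"
    " (SELECT id, random FROM t WHERE random >= ?"
    "  ORDER BY random LIMIT 10)"
    " UNION ALL "
    "SELECT id FROM"
    " (SELECT id, random FROM t WHERE random < ?"
    "  ORDER BY random DESC LIMIT 10)",
    (pivot, pivot),
).fetchall()
# ...then one of the ~20 candidates is picked at random in the application.
chosen = random.choice(candidates)[0]
```
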
Your ids are probably going to contain gaps. Anything that works with COUNT(*) is not going to be able to reach all the ids.
A table with records with ids 1, 2, 3, 10, 11, 12, 13 has only 7 records. Doing a random pick with COUNT(*) will often result in a miss, as records 4, 5 and 6 do not exist, and it will then pick the nearest id, which is 3. This is not only unbalanced (it will pick 3 far too often), but it will also never pick records 10-13.
To get a fair, uniformly distributed random selection of records, I would suggest loading the ids of the table first. Even for 150k rows, loading a set of integer ids will not consume a lot of memory (<1 MB):
SELECT id FROM table;
You can then use a function like Collections.shuffle to randomize the order of the ID's. To get the rest of the data, you can select records one at a time or for example 10 at a time:
SELECT * FROM table WHERE id = :id
Or:
SELECT * FROM table WHERE id IN (:id1, :id2, :id3)
This should be fast if the id column has an index, and it will give you a proper random distribution.
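Translated from the Java suggestion into a Python sketch (random.shuffle plays the role of Collections.shuffle; the batch size of 3 is arbitrary):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO t (id) VALUES (?)",
    [(i,) for i in (1, 2, 3, 10, 11, 12, 13)],  # ids with gaps, as above
)

# Load every id once and shuffle in the application.
ids = [r[0] for r in conn.execute("SELECT id FROM t")]
random.shuffle(ids)  # Python's counterpart of Collections.shuffle

# Fetch the actual rows a few ids at a time.
processed = []
for start in range(0, len(ids), 3):
    batch = ids[start:start + 3]
    placeholders = ",".join("?" * len(batch))
    rows = conn.execute(
        "SELECT id FROM t WHERE id IN (%s)" % placeholders, batch).fetchall()
    processed.extend(r[0] for r in rows)
```

Every row is visited exactly once, in a uniformly random order, and the gaps in the ids are irrelevant.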
If prepared statement can be used, then this should work:
SELECT @skip := FLOOR(RAND() * COUNT(*)) FROM Table WHERE FOREIGNKEY_ID IS NULL;
PREPARE STMT FROM 'SELECT * FROM Table WHERE FOREIGNKEY_ID IS NULL LIMIT ?, 1';
EXECUTE STMT USING @skip;
LIMIT in a SELECT statement can be used to skip rows.
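The same skip-a-random-number-of-rows idea, sketched with Python's sqlite3 (which can bind the OFFSET directly, so no PREPARE dance is needed; an ORDER BY is added so the offset is well defined):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, foreignkey_id INTEGER)")
conn.executemany(
    "INSERT INTO t (id, foreignkey_id) VALUES (?, ?)",
    [(i, None if i % 2 == 0 else 7) for i in range(1, 101)],  # 50 candidates
)

# Count the candidate rows, skip a random number of them, take the next one.
n = conn.execute(
    "SELECT COUNT(*) FROM t WHERE foreignkey_id IS NULL").fetchone()[0]
skip = random.randrange(n)
row = conn.execute(
    "SELECT id FROM t WHERE foreignkey_id IS NULL"
    " ORDER BY id LIMIT 1 OFFSET ?",
    (skip,),
).fetchone()
```

Every candidate row is equally likely and gaps in the ids do not matter; the trade-off is that the database still steps over `skip` rows, so each pick is O(n).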
SELECT DISTINCT `Stock`.`ProductNumber`,`Stock`.`Description`,`TComponent_Status`.`component`, `TComponent_Status`.`certificate`,`TComponent_Status`.`status`,`TComponent_Status`.`date_created`
FROM Stock , TBOM , TComponent_Status
WHERE `TBOM`.`Component` = `TComponent_Status`.`component`
AND `Stock`.`ProductNumber` = `TBOM`.`Product`
Basically, table TBOM has:
24,588,820 rows
The query is ridiculously slow, and I'm not too sure what I can do to make it better. I have indexed all the other tables in the query, but TBOM has a few duplicates in the columns, so I can't even run that command. I'm a little baffled.
To start, index the following fields:
TBOM.Component
TBOM.Product
TComponent_Status.component
Stock.ProductNumber
Not all of the above indexes may be necessary (e.g., the last two), but it is a good start.
Also, remove the DISTINCT if you don't absolutely need it.
The only thing I can really think of is having an index on your Stock table on
(ProductNumber, Description)
This can help in two ways. Since you are only using those two fields in the query, the engine won't need to go to the full data row of each stock record; both parts are in the index, so it can use that. Additionally, since you are doing DISTINCT, having the index available to help optimize the DISTINCT should also help.
Now, the other issue: time. Since you are doing a distinct from stock to product to product status, you are asking for all 24 million TBOM items (assuming a bill of materials), and since each BOM component could have multiple statuses created, you are getting every BOM for EVERY component change.
If what you are really looking for is something like the most recent change of any component item, you might want to do it in reverse... Something like...
SELECT DISTINCT
Stock.ProductNumber,
Stock.Description,
JustThese.component,
JustThese.certificate,
JustThese.`status`,
JustThese.date_created
FROM
( select DISTINCT
TCS.Component,
TCS.Certificate,
TCS.`status`,
TCS.date_created
from
TComponent_Status TCS
where
TCS.date_created >= 'some date you want to limit based upon' ) as JustThese
JOIN TBOM
on JustThese.Component = TBOM.Component
JOIN Stock
on TBOM.Product = Stock.ProductNumber
If this is the case, I would ensure an index on the component status table, something like (date_created, component, certificate, status). This way the WHERE clause would be optimized, and the DISTINCT would be too, since the pieces are already part of the index.
But as you currently have it, if you have 10 TBOM entries for a single "component", and that component has 100 changes, you now have 10 * 100 = 1,000 entries in your result set. Span that across 24 million rows, and it's definitely not going to look good.
I have two questions here, but I am asking them at once as I think they are inter-related.
I am working with a complex query (multiple joins + subqueries) and the table is pretty huge as well (around 200,000 records in this table).
A part of this query (a LEFT JOIN) is required to find the record which has the second lowest value in a certain column among all the records associated with the primary key of the first table. For now I have isolated this part and am thinking along the lines of:
SELECT id FROM tbl ORDER BY `myvalue` ASC LIMIT 1,1;
But there is a case where, if there is only 1 record in the table, it must return that record instead of NULL. So my first question is: how do I write a query for this?
Secondly, considering the size of the table and the time its already taking to run even after creating indexes, I understand that adding any more complexity to it in order to achieve the above part might affect the querying time dramatically.
I cannot decompose joins because I need to get some of the columns for the ORDER BY clause (the application has an option to sort the result by these columns, the above column "myvalue" being one of them)
What would be the way(s) to approach this problem ?
Thanks
Something like this might work
SELECT COALESCE(
    (SELECT id FROM tbl ORDER BY `myvalue` ASC LIMIT 1,1),
    (SELECT id FROM tbl ORDER BY `myvalue` ASC LIMIT 0,1));
It selects the first non-null value from the list provided.
As for the complexity of the query, post the whole thing so we can take a look at it.
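A sketch of the fallback behaviour of that COALESCE (SQLite via Python; LIMIT 1 OFFSET 1 is SQLite's spelling of MySQL's LIMIT 1,1):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, myvalue INTEGER)")
conn.executemany(
    "INSERT INTO tbl (id, myvalue) VALUES (?, ?)",
    [(1, 50), (2, 20), (3, 80)],
)

query = (
    "SELECT COALESCE("
    " (SELECT id FROM tbl ORDER BY myvalue ASC LIMIT 1 OFFSET 1),"
    " (SELECT id FROM tbl ORDER BY myvalue ASC LIMIT 1))"
)

# Three rows: the second lowest myvalue (50) belongs to id 1.
second_lowest = conn.execute(query).fetchone()[0]

# One row: the first subquery yields NULL, so the fallback returns that row.
conn.execute("DELETE FROM tbl WHERE id <> 2")
only_row = conn.execute(query).fetchone()[0]
```
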