How do I select a random row from the database based on the probability assigned to each row?
Example:
Make Chance Value
ALFA ROMEO 0.0024 20000
AUDI 0.0338 35000
BMW 0.0376 40000
CHEVROLET 0.0087 15000
CITROEN 0.016 15000
........
How do I select a random make name and its value based on the probability that it is chosen?
Would a combination of rand() and ORDER BY work? If so what is the best way to do this?
You can do this by using rand() and then using a cumulative sum. Assuming they add up to 100%:
select t.*
from (select t.*, (@cumep := @cumep + chance) as cumep
      from t cross join
           (select @cumep := 0, @r := rand()) params
     ) t
where @r between cumep - chance and cumep
limit 1;
Notes:
rand() is called once in a subquery to initialize a variable. Multiple calls to rand() are not desirable.
There is a remote chance that the random number will be exactly on the boundary between two values. The limit 1 arbitrarily chooses 1.
This could be made more efficient by stopping the subquery when cumep > @r.
The values do not have to be in any particular order.
This can be modified to handle chances where the sum is not equal to 1, but that would be another question.
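The cumulative-sum idea can also be sketched outside the database. This is a minimal Python sketch, assuming the rows have already been fetched as (make, chance, value) tuples; the function name pick_weighted and its optional r parameter are hypothetical, introduced only for illustration. It scales the single random number by the total, so the chances do not have to sum to 1.

```python
import random
from itertools import accumulate

def pick_weighted(rows, r=None):
    """Pick one (make, chance, value) row with probability proportional to chance."""
    total = sum(chance for _, chance, _ in rows)
    if total <= 0:
        raise ValueError("chances must sum to a positive number")
    # Draw a single random number and scale it to the total weight.
    if r is None:
        r = random.random()
    target = r * total
    # Walk the running total and return the first row whose interval contains target.
    cumulative = accumulate(chance for _, chance, _ in rows)
    for row, cum in zip(rows, cumulative):
        if target < cum:
            return row
    return rows[-1]  # guard against floating-point rounding at the upper edge

rows = [
    ("ALFA ROMEO", 0.0024, 20000),
    ("AUDI", 0.0338, 35000),
    ("BMW", 0.0376, 40000),
]
```

Calling pick_weighted(rows) draws a fresh random number; passing r explicitly makes the choice reproducible, which is convenient for checking the boundaries.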
Related
The SQL query is :
Select ProductName from Products;
The above query returns 5000 rows.
How can the result of 5000 rows be divided into two result sets of 2500 rows each, i.e., one result set from 1 to 2500 and the other from 2501 to 5000?
Note:
Here ProductName is the primary key. No ProductID column is present in the table.
It can be done either in the back end or in the front end.
An approach that works for MySQL (based on this answer https://stackoverflow.com/a/4741301/14015737):
Upper half
SELECT *
FROM (
    SELECT test.*, @counter := @counter + 1 counter
    FROM (select @counter := 0) initvar, test
    ORDER BY num
) X
WHERE counter <= round(50/100 * @counter)
ORDER BY num;
Lower half
Invert the sort order and remove the rounding
SELECT *
FROM (
    SELECT test.*, @counter := @counter + 1 counter
    FROM (select @counter := 0) initvar, test
    ORDER BY num DESC
) X
WHERE counter <= (50/100 * @counter)
ORDER BY num;
In case of an uneven number of records, the middle record is added to the upper half in this example. If you want it the other way around, move the round() to the other statement. If you don't want it at all, remove round().
Dbfiddle example: https://dbfiddle.uk/?rdbms=mysql_5.7&fiddle=fb70eae0f7f1434a24099b5bb19f0878
If you know the numbers that you want, just use limit:
select ProductName
from Products
order by id
And then either:
limit 2500               -- rows 1 to 2500
limit 2500 offset 2500   -- rows 2501 to 5000
If you simply want the results split into half, then you can use:
select t.*
from (select t.*,
ntile(2) over (order by <primary key>) as tile
from t
) t
where tile = 1; -- or 2 for the other half
The easiest and probably fastest approach is to use the table's primary key if you are fine with getting the rows in its order.
Run
select productname, id from products order by id;
and fetch 2500 rows. Then with the last ID, say ID 3456, run
select productname, id from products where id > 3456 order by id;
and fetch 2500 rows again. Etc.
UPDATE: Seeing I got a downvote for this, I'd better explain :-)
The query returns 5000 rows now and the OP doesn't want that many rows, so they want to cut this in halves. But the query may well return 10000 rows next year. Will the OP suddenly be fine with getting 5000 rows at once? This doesn't seem likely. It is more likely that there is an amount of rows that shall not be surpassed. This is why I cut the amount into slices of 2500.
The other approach, numbering all rows and returning the first n of them, has a severe drawback: all rows must be read again for every chunk. Even if you decide to cut the result into chunks of 100 rows each, every time all rows must be read, sorted, and numbered before the chunk can be fetched. Reading and sorting all the rows of a table is a lot of work for a DBMS.
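The chunked fetch described above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, it pages on ProductName because the question says that is the primary key, and the function name fetch_in_chunks and the sample data are hypothetical.

```python
import sqlite3

def fetch_in_chunks(conn, chunk_size):
    """Yield lists of product names, chunk_size at a time, in key order."""
    last = None
    while True:
        if last is None:
            # First chunk: start at the beginning of the key order.
            cur = conn.execute(
                "SELECT ProductName FROM Products "
                "ORDER BY ProductName LIMIT ?",
                (chunk_size,),
            )
        else:
            # Later chunks: continue after the last key already seen.
            cur = conn.execute(
                "SELECT ProductName FROM Products WHERE ProductName > ? "
                "ORDER BY ProductName LIMIT ?",
                (last, chunk_size),
            )
        chunk = [name for (name,) in cur]
        if not chunk:
            return
        yield chunk
        last = chunk[-1]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Products (ProductName TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO Products VALUES (?)",
    [(f"product{i:04d}",) for i in range(10)],
)
chunks = list(fetch_in_chunks(conn, 4))
```

Because each query seeks to the last key through the primary-key index, no chunk requires numbering or re-reading the rows that came before it.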
I've got a table:
player_id|player_name|play_with_id|play_with_name|
I made this table for a game.
Everyone who wants to play can sign up to it.
When they sign up the table stores player_id and player_name
When the sign-up period expires, I want to randomly assign every player_name a play_with_name.
So, for example, my table would look like this during the sign-up period:
player_id|player_name|play_with_id|play_with_name|
1 someone1
2 someone2
3 someone3
4 someone4
5 someone5
And this when the period expires:
player_id|player_name|play_with_id|play_with_name|
1 someone1 2 someone2
2 someone2 1 someone1
3 someone3 4 someone4
4 someone4 3 someone3
5 someone5 - -
I can't test this since I don't have a MySQL database handy and SQLFiddle seems to take forever to run anything, but this hopefully gets you there or at least close:
SET @row_num = 0;
SET @last_player_id = 0;
UPDATE P
SET
    play_with_id =
        CASE
            WHEN P.player_id = SQ.player_id THEN SQ.last_player_id
            ELSE player_id
        END
FROM
    Players P
LEFT OUTER JOIN
    (
        SELECT
            @row_num := @row_num + 1 row_num,
            @last_player_id last_player_id,
            @last_player_id := player_id player_id
        FROM
            Players
        WHERE
            MOD(@row_num, 2) = 0
        ORDER BY
            RAND()
    ) SQ ON SQ.player_id = P.player_id OR SQ.last_player_id = P.player_id
The code (hopefully) sorts the players randomly then it pairs them based on that order. Every other player in the randomly sorted result is paired with the person right before them.
In MS SQL Server RAND() would only be evaluated once here and wouldn't end up affecting the ORDER BY, but I think that MySQL handles RAND() differently and generates a new value for each row in the result set.
I'm not sure why some client code isn't doing this as opposed to having this operation be done at the database level, but I suppose if you get the strategy for retrieving a randomized row set based on your DB from here, you could then write a stored procedure with a cursor or iterator to loop through the result set of something like:
select player_id, player_name from players order by RAND()
and then loop through all the table rows to update the play_with_id and play_with_name, where the previously selected player_id <> play_with_id.
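If the pairing is done in client code, as suggested above, a minimal Python sketch could look like this. It assumes the players have already been fetched as (player_id, player_name) tuples; the function name pair_players and the rng parameter are hypothetical. It shuffles the ids and pairs each player with the neighbour in the shuffled order, leaving the last one unpaired when the count is odd.

```python
import random

def pair_players(players, rng=random):
    """Return {player_id: partner_id or None} for a random pairing."""
    ids = [player_id for player_id, _ in players]
    rng.shuffle(ids)  # random order, like ORDER BY RAND()
    partner = {}
    # Pair positions 0-1, 2-3, ... of the shuffled list, in both directions.
    for a, b in zip(ids[0::2], ids[1::2]):
        partner[a] = b
        partner[b] = a
    if len(ids) % 2:
        partner[ids[-1]] = None  # odd player out stays unpaired
    return partner

players = [(i, f"someone{i}") for i in range(1, 6)]
pairs = pair_players(players, random.Random(42))
```

The resulting mapping can then be written back with one UPDATE per player, setting play_with_id and looking up play_with_name from the same list.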
I have a SQL table with periodic measurements. I'd like to be able to return some summary method (say SUM) over the value column, for an arbitrary number of rows at a time. So if I had
id | reading
1 10
5 14
7 10
11 12
13 18
14 16
I could sum over 2 rows at a time, getting (24, 22, 34), or I could sum 3 rows at a time and get (34, 46), if that makes sense. Note that the ID might not be contiguous -- I just want to operate by row count, in sort order.
In the real world, the identifier is a timestamp, but I figure that (maybe after applying a unix_timestamp() call) anything that works for the simple case above should be applicable. If it matters, I'm trying to gracefully scale the number of results returned for a plot query -- maybe there's a smarter way to do this? I'd like the solution to be general, and not impose a particular storage mechanism/schema on the data.
You can resequence the query result and then group it:
SET @seq = 0;
SELECT SUM(data), MIN(ts) AS ts FROM (
    SELECT @seq := @seq + 1 AS seq, data, ts FROM `table` ORDER BY ts LIMIT 50
) AS tmp GROUP BY FLOOR((tmp.seq - 1) / 3);  -- seq starts at 1, so subtract 1 to get full groups of 3
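The same resequence-and-group step can be checked against the numbers in the question with a small Python sketch. It assumes the rows have been fetched as (id, reading) tuples; the function name chunked_sums is hypothetical.

```python
def chunked_sums(readings, size):
    """Sum the readings in consecutive groups of `size` rows, ordered by id."""
    ordered = sorted(readings)  # sort by id; ids need not be contiguous
    values = [reading for _, reading in ordered]
    # Slice the ordered values into consecutive groups and sum each one.
    return [sum(values[i:i + size]) for i in range(0, len(values), size)]

readings = [(1, 10), (5, 14), (7, 10), (11, 12), (13, 18), (14, 16)]
by_two = chunked_sums(readings, 2)
by_three = chunked_sums(readings, 3)
```

Note that a trailing group with fewer than `size` rows is still summed, which matches what the GROUP BY does for a final partial group.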
Let's say I have a list of values, like this:
id value
----------
A 53
B 23
C 12
D 72
E 21
F 16
..
I need the top 10 percent of this list - I tried:
SELECT id, value
FROM list
ORDER BY value DESC
LIMIT COUNT(*) / 10
But this doesn't work. The problem is that I don't know the number of records before I run the query. Any ideas?
Best answer I found:
SELECT *
FROM (
    SELECT list.*, @counter := @counter + 1 AS counter
    FROM (select @counter := 0) AS initvar, list
    ORDER BY value DESC
) AS X
WHERE counter <= (10/100 * @counter)
ORDER BY value DESC;
Change the 10 to get a different percentage.
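To see what the counter comparison keeps, here is a minimal Python sketch of the same rule, assuming the rows are (id, value) tuples already in memory; the function name top_percent is hypothetical. Like the query, it rounds the cut-off down, so a small table can yield no rows at all.

```python
def top_percent(rows, percent):
    """Return the top `percent` percent of (id, value) rows by value, rounded down."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    keep = len(ordered) * percent // 100  # same cut-off as counter <= percent/100 * total
    return ordered[:keep]

rows = [("A", 53), ("B", 23), ("C", 12), ("D", 72), ("E", 21), ("F", 16)]
top_half = top_percent(rows, 50)
top_tenth = top_percent(rows, 10)
```

With only the six sample rows, 10 percent rounds down to zero rows, while 50 percent returns D, A and B.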
If you want a random 10 percent of the rows rather than the top 10 percent, I've started using the following style:
SELECT id, value FROM list HAVING RAND() > 0.9
If you need it to be random but controllable you can use a seed (example with PHP):
SELECT id, value FROM list HAVING RAND($seed) > 0.9
Lastly - if this is a sort of thing that you need full control over you can actually add a column that holds a random value whenever a row is inserted, and then query using that
SELECT id, value FROM list HAVING `rand_column` BETWEEN 0.8 AND 0.9
Since this does not require sorting, or ORDER BY - it is O(n) rather than O(n lg n)
You can also try with that:
SET @amount = (SELECT COUNT(*) FROM page) / 10;
PREPARE STMT FROM 'SELECT * FROM page LIMIT ?';
EXECUTE STMT USING @amount;
This works around a MySQL bug described here: http://bugs.mysql.com/bug.php?id=19795
Hope it'll help.
I realize this is VERY old, but it still pops up as the top result when you google SQL limit by percent, so I'll try to save you some time. This is pretty simple to do these days in SQL Server (TOP ... PERCENT is T-SQL syntax and is not available in MySQL). The following would give the OP the results they need:
SELECT TOP 10 PERCENT
id,
value
FROM list
ORDER BY value DESC
To get a quick and dirty random 10 percent of your table, the following would suffice:
SELECT TOP 10 PERCENT
id,
value
FROM list
ORDER BY NEWID()
I have an alternative which hasn't been mentioned in the other answers: if you access the database from any language where you have full access to the MySQL API (i.e. not the MySQL CLI), you can launch the query, ask how many rows there will be, and then break out of the loop when you have enough.
E.g. in Python:
...
maxnum = cursor.execute(query)  # assumes execute() returns the row count, as MySQLdb does
for num, row in enumerate(cursor):
    if num >= .1 * maxnum:  # Here I break the loop once I have 10% of the rows.
        break
    do_stuff...
This works only with mysql_store_result(), not with mysql_use_result(), as the latter requires that you always accept all needed rows.
OTOH, the traffic for my solution might be too high - all rows have to be transferred.
For the last two days, I have been asking questions about rank queries in MySQL. So far, I have working queries to:
query all the rows from a table and order by their rank.
query ONLY one row with its rank
Here is a link for my question from last night
How to get a row rank?
As you might notice, btilly's query is pretty fast.
Here is a query for getting ONLY one row with its rank that I made based on btilly's query.
set @points = -1;
set @num = 0;
select * from (
    SELECT id
         , points
         , @num := if(@points = points, @num, @num + 1) as point_rank
         , @points := points as dummy
    FROM points
    ORDER BY points desc, id asc
) as test where test.id = 3
The above query uses a subquery, so I am worried about its performance.
Are there any faster queries that I can use?
Table points
id points
1 50
2 50
3 40
4 30
5 30
6 20
Don't get into a panic about subqueries. Subqueries aren't always slow - only in some situations. The problem with your query is that it requires a full scan.
Here's an alternative that should be faster:
SELECT COUNT(DISTINCT points) + 1
FROM points
WHERE points > (SELECT points FROM points WHERE id = 3)
Add an index on id (I'm guessing that you probably want a primary key here) and another index on points to make this query perform efficiently.
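The counting query can be tried against the sample table with a short Python sketch. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the function name rank_of and the index name idx_points are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (id INTEGER PRIMARY KEY, points INTEGER)")
conn.execute("CREATE INDEX idx_points ON points (points)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(1, 50), (2, 50), (3, 40), (4, 30), (5, 30), (6, 20)],
)

def rank_of(conn, row_id):
    """Rank of one row: 1 + number of distinct scores strictly above its own."""
    (rank,) = conn.execute(
        "SELECT COUNT(DISTINCT points) + 1 FROM points "
        "WHERE points > (SELECT points FROM points WHERE id = ?)",
        (row_id,),
    ).fetchone()
    return rank

rank_of_three = rank_of(conn, 3)
```

Rows with equal points share a rank, and the next distinct score gets the next consecutive rank, which matches the point_rank produced by the variable-based query.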