Select random row from MySQL (with probability) - mysql

I have a MySQL table that has a row called cur_odds which is a percent number with the percent probability that that row will get selected. How do I make a query that will actually select the rows in approximately that frequency when you run through 100 queries for example?
I tried the following, but a row that has a probability of 0.35 ends up getting selected around 60-70% of the time.
SELECT * FROM table ORDER BY RAND()*cur_odds DESC
All the values of cur_odds in the table add up to 1 exactly.

If cur_odds is changed rarely you could implement the following algorithm:
1) Create another column prob_sum, for which
prob_sum[0] := cur_odds[0]
for 1 <= i <= row_count - 1:
prob_sum[i] := prob_sum[i - 1] + cur_odds[i]
2) Generate a random number from 0 to 1:
rnd := rand(0,1)
3) Find the first row for which prob_sum > rnd (if you create a BTREE index on the prob_sum, the query should work much faster):
CREATE INDEX prob_sum_ind ON <table> (prob_sum);
SET #rnd := RAND();
SELECT MIN(prob_sum) FROM <table> WHERE prob_sum > #rnd;

Given your above SQL statement, whatever numbers you have in cur_odds are not the probabilities that each row is selected, but is instead just an arbitrary weighting (relative to the "weights" of all the other rows) which could instead be best interpreted as a relative tendency to float towards the top of the sorted table. The actual value in each row is meaningless (e.g. you could have 4 rows with values of 0.35, 0.5, 0.75 and 0.99, or you could have values of 35, 50, 75 and 99, and the results would be the same).
Update: Here's what's going on with your query. You have one row with a cur_odds value of 0.35. For the sake of illustration, I'm going to assume that the other 9 rows all have the same value (0.072). Also for the sake of illustration, let's assume RAND() returns a value from 0.0 to 1.0 (it may actually).
Every time you run this SELECT statement, each row is assigned a sorting value by multiplying its cur_odds value by a RAND() value from 0.0 to 1.0. This means that the row with a 0.35 will have a sorting value between 0.0 and 0.35.
Every other row (with a value of 0.072) will have sorting values ranging between 0.0 and 0.072. This means that there is an approximately 80% chance that your one row will have a sorting value greater than 0.072, which would mean that there is no possible chance that any other row could be sorted higher. This is why your row with the cur_odds value of 0.35 is coming up first more often than you expect.
I incorrectly described the cur_odds value as a relative change weighting. It actually functions as a maximum relative weighting, which would then involve some complex math to determine the actual relative probabilities involved.
I'm not sure what you need can be done with straight T-SQL. I've implemented a weighted probability picker many times (I was even going to ask a question about best methods for this this morning, ironically) but always in code.

Related

SQL generate a random positive or negative value

I am looking for a way to change a value randomly from positive to negative. (I am creating a distortion on a lat/long location, so I would like to offset a given location with +/- some degrees)
I already created the following query which give me a number between -1 and +1, the idea is to multiply my distortion with this number to get a random negative or positive number.
SELECT round(-1+3*RAND(),0);
The only problem is, this also generates the value 0.0 which can't be multiplied. How do I get -1 or +1 only?
TIA
ABBOV
Maybe:
start by rounding, to give 0 or 1
then multiply to give 0 or 2
then subtract, to give -1 or 1
i.e.:
SELECT ROUND(RAND()) * 2 - 1;

How can I make my select statement deterministically match only 1/n of my dataset?

I'm processing data from a MySQL table where each row has a UUID associated with it. EDIT: the "UUID" is in fact an MD5 hash (VARCHAR) of the job text.
My select query looks something like:
SELECT * FROM jobs ORDER BY priority DESC LIMIT 1
I am only running one worker node right now, but would like to scale it out to several nodes without altering my schema.
The issue is that the jobs take some time, and scaling out beyond one right now would introduce a race condition where several nodes are working on the same job before it completes and the row is updated.
Is there an elegant way to effectively "shard" the data on the client-side, by specifying some modifier config value per worker node? My first thought was to use the MOD function like this:
SELECT * FROM jobs WHERE UUID MOD 2 = 0 ORDER BY priority DESC LIMIT 1
and SELECT * FROM jobs WHERE UUID MOD 2 = 1 ORDER BY priority DESC LIMIT 1
In this case I would have two workers configured as "0" and "1". But this isn't giving me an even distribution (not sure why) and feels clunky. Is there a better way?
The problem is you're storing the ID as a hex string like acbd18db4cc2f85cedef654fccc4a4d8. MySQL will not convert the hex for you. Instead, if it starts with a letter you get 0. If it starts with a number, you get the starting numbers.
select '123abc' + 0 = 123
select 'abc123' + 0 = 0
6 out of 16 will start with a letter so they will all be 0 and 0 mod anything is 0. The remaining 10 of 16 will be some number so will be distributed properly, 5 of 16 will be 0, 5 of 16 will be 1. 6/16 + 5/16 = 69% will be 0 which is very close to your observed 72%.
To do this right we need to convert the 128 hex string into a 64 bit unsigned integer.
Slice off 64 bits with either left(uuid, 16) or right(uuid, 16).
Convert the hex (base 16) into decimal (base 10) using conv.
cast the result to an unsigned bigint. If we skip this step MySQL appears to use a float which loses accurracy.
select cast(conv(right(uuid, 16), 16, 10) as unsigned) mod 2
Beautiful.
That will only use 64 bits of the 128 bit checksum, but for this purpose that should be fine.
Note this technique works with an MD5 checksum because it is pseudorandom. It will not work with the default MySQL uuid() function which is a UUID version 1. UUIDv1 is a timestamp + a fixed ID and will always mod the same.
UUIDv4, which is a random number, will work.
Convert the hex string to decimal before modding:
where CONV(substring(uuid, 1, 8), 16, 10) mod 2 = 1
A reasonable hashing function should distribute evenly enough for this purpose.
Use substring to convert only a small part so the conv doesn't overflow decimal range and maybe behave badly. Any subset of bits should also be well distributed.

Get value between from to dataset columns ssrs

I have a data set like that:
Data Set Contents
From To Comment
----+---+--------
0 50 Bad
50 70 Good
70 100 Excellent
If I have a value of 75, I need to get Excellent by searching the Dataset.
I know about the lookup function but it is not what I want. How can I do that?
The values should be in percentage.
Note : the value (75) is Average of a column (Calculated) it
calculate student grade from max and student mark Version SQL Server
2016
Note 2 : the dataset is from database not static values
Thank You
Assuming you only ever have a fixed number of 'grades' then this will work. However, I would strongly recommend doing this type of work on the server where possible.
Here we go...
I created two datasets
dsGradeRange with the following sql to recreate your example (more or less)
DECLARE #t TABLE (low int, high int, comment varchar(20))
INSERT INTO #t VALUES
(0,49,'Bad'),
(50,69,'Good'),
(70,100, 'Excellent')
SELECT * FROM #t
dsRandomNumbers This just creates 30 random numbers between 0 and 100
SELECT *
FROM (SELECT top 30 ABS(CHECKSUM(NEWID()) % 100) as myNumber FROM sys.objects) x
ORDER BY myNumber
I added a table to the report to show the grades (just for reference).
I then added a table to show the dsRandomNumbers
Finally I set the expression of the 2nd column to the following expression.
=SWITCH
(
Fields!myNumber.Value < LOOKUP("Bad", Fields!comment.Value, Fields!high.Value, "dsGradeRange"), "Bad",
Fields!myNumber.Value < LOOKUP("Good", Fields!comment.Value, Fields!high.Value, "dsGradeRange"), "Good",
True, "Excellent"
)
This gives the following results
As you can see we only need to compare to the high value of each case, the first match will return the correct comment.
Right click on your dataset and add a calculated field. Go to Field Properties > Fields > Add and add the following expression, which descripes your scenario:
=IIF(Fields!Number.Value < 50, "Bad", "Good")

How to create query with simple formula?

Hey is there any way to create query with simple formula ?
I have a table data with two columns value_one and value_two both are decimal values. I want to select this rows where difference between value_one and value_two is grater then 5. How can i do this?
Can i do something like this ?
SELECT * FROM data WHERE (MAX(value_one, value_two) - MIN(value_one, value_two)) > 5
Example values
value_one, value_two
1,6
9,3
2,3
3,2
so analogical difs are: 5, 6, 1, 1 so the selected row would be only first and second.
Consider an example where smaller number is subtracted with a bigger number:
2 - 5 = -3
So, the result is a difference of two numbers with a negation sign.
Now, consider the reverse scenario, when bigger number is subtracted with the smaller number:
5 - 2 = 3
Pretty simple right.
Basically, the difference of two number remains same, if you just ignore the sign. This is in other words called absolute value of a number.
Now, the question arises how to find the absolute value in MySQL?
Answer to this is the built-in method of MySQL i.e. abs() function which returns an absolute value of a number.
ABS(X):
Returns the absolute value of X.
mysql> SELECT ABS(2);
-> 2
mysql> SELECT ABS(-32);
-> 32
Therefore, without worrying about finding min and max number, we can directly focus on the difference of two numbers and then, retrieving the absolute value of the result. Finally, check if it is greater than 5.
So, the final query becomes:
SELECT *
FROM data
WHERE abs(value_one - value_two) > 5;
You can also do complex operations once the absolute value is calculated like adding or dividing with the third value. Check the code below:
SELECT *
FROM
data
WHERE
(abs(value_one - value_two) / value_three) + value_four > 5;
You can also add multiple conditions using logical operators like AND, OR, NOT to do so. Click here for logical operators.
SELECT *
FROM
data
WHERE
((abs(value_one - value_two) / value_three) + value_four > 5)
AND (value_five != 0);
Here is the link with various functions available in MySQL:
https://dev.mysql.com/doc/refman/5.0/en/mathematical-functions.html
No, you would just use a simple where clause:
select *
from data
where abs(value_one - value_two) > 5;

SAS : Eliminate duplicates if a condition is satisfied

I want to eliminate duplicates from a database, based on an identifier, an order and a condition.
More precisely, I have data with several observations. I have sometimes a condition that makes me want to keep that observation anyway (let fix it condition=1), but then also keep the observation with the same identifier even if this condition does not hold (condition=0).
But if I have for one identifier several observations where condition=0 then I want to elminate duplicates, with criterion being having the greatest order.
Without the condition I can do that
proc sort data=have;
by identifier descending order;
run;
proc sort nudopkey data=have;
by identifier;
run;
But how to incorporate my condition in this ?
Edit 1 : add a database example :
data Test;
input identifier $ order condition;
datalines;
1023 1 0
1023 2 0
1064 2 0
1064 1 0
1098 1 0
1098 1 1
;
Then I want to keep
1023 2 0
1064 2 0
1098 1 0
1098 1 1
Edit 2 : tried to precise my conditions
I presume you want to eliminate duplicates only when the condition for all records for an identifier is set to 0. In that case you want to keep the record with the maximum order and eliminate all other records with the same identifier.
Proc sql;
create table want as
select *
from test
group by identifier
having max (condition) ne 0
or order eq max (order)
;
Quit;
This will keep all rows for an Identifier where the maximum condition = 1,
or in the case of those where maximum condition = 0, select the row with the maximum order.
Is that what you want?
Some of this depends on how you define 'condition'. Is your condition easily verifiable on every record for that identifier? Then you can do something like this.
Evaluate the condition.
For records where it is true (you want to remove the duplicate), set flag=0. For records where it is not true, increment the condition flag by one.
If the condition is true for all records in that ID, all will have the same value (flag=0) and nodupkey on by identifier flag; will remove extras. If the condition is false for all records, those will not be removed. If it's true for some and false for some, and you want to remove only some of the records with that identifier (only the duplicates where it is true), then you have to make sure that either it's sorted to have all of the condition=true records at top, or have a separate flag counter that determines what value the flag will be (since it sometimes will go to 0 in the middle, so 0 0 0 1 2 3 0 4 5 6 is what you want, not 0 0 0 1 2 3 0 0 1 2 ).
Perhaps easier to see is to do it within a datastep. After sorting by identifier descending order:
data want;
set have;
by identifier descending order;
if (condition=true) and not (first.identifier) then delete;
run;
This will, again, work if either condition=true is always at the top, or if it's always consistent within one ID group. If it's inconsistent and mixed, then you need to keep track of whether you've kept one where it was true (assuming you want to), or it might delete all records where it is true; use a separate variable to keep track of how many you've kept. first.identifier will be 1/TRUE for the first record for that identifier only, not taking into account the condition. You could also create the flag, then sort by identifier flag descending order; and guarantee the condition=true are at the top (either by making flag=0 for true, or sorting by descending flag.)