MySQL random record with a bias

I would like to select a random record from a table but with a bias toward higher values in a particular field -- I don't want any record to have a 0% chance of getting selected, just less likely to get selected.
From this article, I know that random selects can be slow and you can speed them up:
http://wanderr.com/jay/order-by-slow/2008/01/30/
But what about when you are dealing with a few tables with joins and a where statement, and want to use one of the fields as a way to bias the randomness (the higher this field's value, the more likely to get selected)? For example:
SELECT a.id, a.date, a.userid, b.points FROM table_a AS a INNER JOIN table_b AS b ON (a.userid = b.userid) WHERE DATE_SUB(CURDATE(), INTERVAL 60 DAY) <= a.date
How could I turn the above into an efficient but not truly random query that would be biased toward higher values of b.points?

My 2 cents: a bias can be implemented like this.
Assume the score is between 0 and 100.
Randomly choose 5 records with a score > 75, 3 records with a score > 50, 2 records with a score > 25, and 1 record with a score > 0.
If you then pick randomly again from these 11 records, the result is biased toward higher scores.
To put this in SQL, call your joined table "abc":
SELECT * FROM (
(SELECT * FROM abc WHERE points > 75 ORDER BY RAND() LIMIT 5)
UNION ALL
(SELECT * FROM abc WHERE points > 50 AND points <= 75 ORDER BY RAND() LIMIT 3)
UNION ALL
(SELECT * FROM abc WHERE points > 25 AND points <= 50 ORDER BY RAND() LIMIT 2)
UNION ALL
(SELECT * FROM abc WHERE points > 0 AND points <= 25 ORDER BY RAND() LIMIT 1)
) AS result
ORDER BY RAND() LIMIT 3
Performance-wise, I'll have a look at your link and update this answer.
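The bucket idea above can be sketched outside the database to see how the two-stage pick behaves. This is only an illustration in Python, not the answer's SQL: the function name biased_pick, the rows list and the 1-100 score range are assumptions made for the example.

```python
import random

def biased_pick(rows, rng, final_count=1):
    """Two-stage pick: sample more rows from high-score buckets, then pick from that pool."""
    # (exclusive lower bound, inclusive upper bound, rows to sample) per bucket
    buckets = [(75, 100, 5), (50, 75, 3), (25, 50, 2), (0, 25, 1)]
    pool = []
    for low, high, size in buckets:
        candidates = [r for r in rows if low < r["points"] <= high]
        # A bucket may hold fewer rows than requested, so cap the sample size
        pool.extend(rng.sample(candidates, min(size, len(candidates))))
    picked = rng.sample(pool, min(final_count, len(pool)))
    return pool, picked

# Hypothetical data: 1000 rows with scores cycling through 1..100
rng = random.Random(42)
rows = [{"id": i, "points": i % 100 + 1} for i in range(1000)]
pool, picked = biased_pick(rows, rng, final_count=3)
```

With every bucket populated, 5 of the 11 pooled rows come from the top bucket, so a row drawn from the pool has a 5/11 chance of scoring above 75 even though only a quarter of the table does.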

Related

cumulative sum for each user and sorting

I'm trying to get the cumulative sum for each user.
related tables(just example):
[user]
id nickname
A AA
B BB
[pointTable] user_id -> [user]id
id user_id point
piA A 10
piB B 8
[pointHistoryTable] point_id -> [pointTable]id
id point_id gain use
phi1 piA 25 0
phi2 piB 10 0
phi3 piA 0 10
phi4 piB 0 9
phi5 piB 7 0
(For gain-use column, only one of them has a value.)
The result I want:
nickname current cGainSum cUseSum
AA 10 25 10
BB 8 17 9
The query I used(mysql v5.7):
#1
SELECT
user.nickname AS nickname,
pointTable.point AS current,
sub.cGainSum AS cGainSum,
sub.cUseSum AS cUseSum
FROM
(SELECT
point_id, SUM(gain) AS cGainSum, SUM(`use`) AS cUseSum
FROM
pointHistoryTable
GROUP BY point_id) sub
INNER JOIN
pointTable ON pointTable.id = sub.point_id
INNER JOIN
user ON user.id = pointTable.user_id
ORDER BY cGainSum DESC
LIMIT 20 OFFSET 0;
#2
SELECT
user.nickname AS nickname,
pointTable.id AS pointId,
pointTable.point AS current,
(SELECT
IFNULL(SUM(gain), 0)
FROM
pointHistoryTable
WHERE
point_id = pointId AND gain > 0) AS cGainSum,
(SELECT
IFNULL(SUM(`use`), 0)
FROM
pointHistoryTable
WHERE
point_id = pointId AND `use` > 0) AS cUseSum
FROM
pointTable
INNER JOIN
user ON user.id = pointTable.user_id
ORDER BY cGainSum DESC
LIMIT 20 OFFSET 0;
Both work. But sorting takes a long time. (20,000 users)
When sorting with current, #1 takes about 25s and #2 takes about 300ms.
However, when sorting by cumulative sum(cGainSum or cUseSum), #1 takes about 25s again and #2 takes about 50s.
So #1 always causes a slow query, and #2 causes a slow query when sorting by cumulative sum.
Any other suggestions?
++
I'm using this query in node api. The data is sorted by the request query. The request query can be current, cGainSum, or cUseSum.
like this...
SELECT (...) ORDER BY ${query} DESC LIMIT 20 OFFSET 0;
The offset uses the pagination related request query.
(included in the details)

I imagine it would run faster without all those subqueries, using plain joins and aggregate functions. If I misunderstood something, you are welcome to correct me; this is what I came up with:
SELECT user.nickname, pointTable.point AS current, SUM(pointHistoryTable.gain) AS cGainSum, SUM(pointHistoryTable.`use`) AS cUseSum FROM user
LEFT JOIN pointTable
ON user.id=pointTable.user_id
LEFT JOIN pointHistoryTable
ON pointHistoryTable.point_id=pointTable.id
GROUP BY user.id, pointTable.id
ORDER BY ${query} DESC LIMIT 20 OFFSET 0;
EDIT:
Besides improving the query (for example, avoiding overly complicated subqueries), a very easy way to improve performance is to use indexes. For starters, I would index the id columns used in the joins in all the tables. You do this with CREATE INDEX indexname ON tablename (column1, column2, ...). If id is already the primary key it is indexed, so for pointHistoryTable the join column point_id is the one to add, and the statement would look something like this:
CREATE INDEX pointHistoryIndex ON pointHistoryTable (point_id)
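A join-and-aggregate query of this shape can be checked against the sample data from the question. The sketch below uses Python with an in-memory SQLite database as a stand-in for MySQL, so syntax details differ slightly (for example, "use" is quoted with double quotes). The names ALLOWED_SORT and top_users are hypothetical, and the whitelist for the sort column is an addition of this sketch, intended to avoid interpolating the raw request value into ORDER BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (id TEXT PRIMARY KEY, nickname TEXT);
CREATE TABLE pointTable (id TEXT PRIMARY KEY, user_id TEXT, point INTEGER);
CREATE TABLE pointHistoryTable (
    id TEXT PRIMARY KEY, point_id TEXT, gain INTEGER, "use" INTEGER);
CREATE INDEX pointHistoryIndex ON pointHistoryTable (point_id);
INSERT INTO user VALUES ('A', 'AA'), ('B', 'BB');
INSERT INTO pointTable VALUES ('piA', 'A', 10), ('piB', 'B', 8);
INSERT INTO pointHistoryTable VALUES
    ('phi1', 'piA', 25, 0), ('phi2', 'piB', 10, 0), ('phi3', 'piA', 0, 10),
    ('phi4', 'piB', 0, 9), ('phi5', 'piB', 7, 0);
""")

# Map each accepted sort key to its column position in the SELECT list
ALLOWED_SORT = {"current": 2, "cGainSum": 3, "cUseSum": 4}

def top_users(sort_key):
    """Return (nickname, current, cGainSum, cUseSum) rows sorted by a whitelisted key."""
    if sort_key not in ALLOWED_SORT:
        raise ValueError(f"unsupported sort key: {sort_key!r}")
    sql = f"""
        SELECT u.nickname,
               p.point AS "current",
               COALESCE(SUM(h.gain), 0) AS cGainSum,
               COALESCE(SUM(h."use"), 0) AS cUseSum
        FROM user AS u
        LEFT JOIN pointTable AS p ON u.id = p.user_id
        LEFT JOIN pointHistoryTable AS h ON h.point_id = p.id
        GROUP BY u.id, p.id
        ORDER BY {ALLOWED_SORT[sort_key]} DESC
        LIMIT 20 OFFSET 0
    """
    return conn.execute(sql).fetchall()

rows_by_gain = top_users("cGainSum")
```

On the sample data this returns the two rows from the "result I want" table. Whether it is faster than the original queries on 20,000 users would still need to be measured with EXPLAIN on the real tables.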

Get subtraction done through mysql

I have the table below, called myData1:
ID Value
1 150
2 120
3 100
I could get the last two values using below query:
SELECT value from myData1 order by ID desc limit 2;
I need to get the difference between these two values (for this result set, the result should be 100 - 120 = -20).
I would appreciate it if someone could help me get this result.

Approach 1
Use Correlated Subquery to get the first and second last value as two separate columns.
You can then use the result-set as Derived Table, to compute the difference.
Try (DB Fiddle DEMO #1):
SELECT dt.last_value - dt.second_last_value
FROM
(
SELECT
t1.Value as last_value,
(SELECT t2.Value
FROM myData1 AS t2
WHERE t2.ID < t1.ID
ORDER BY t2.ID DESC LIMIT 1) AS second_last_value
FROM myData1 AS t1
ORDER BY t1.ID DESC LIMIT 1
) AS dt
Approach 2
Break into two different Select Queries; Use Limit with Offset. For last item, use a factor of 1. For second last, use the factor of -1.
Combine these results using UNION ALL into a Derived Table.
Eventually, sum the values using respective factors.
You can do the following (DB Fiddle DEMO #2):
SELECT SUM(dt.factor * dt.Value)
FROM
(
(SELECT Value, 1 AS factor
FROM myData1
ORDER BY ID DESC LIMIT 0,1)
UNION ALL
(SELECT Value, -1 AS factor
FROM myData1
ORDER BY ID DESC LIMIT 1,1)
) AS dt
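Both approaches can be checked against the three sample rows. The sketch below runs the Approach 1 query in an in-memory SQLite database (a stand-in for MySQL, chosen only so the example is self-contained) and reproduces the factor idea of Approach 2 in plain Python rather than with UNION ALL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myData1 (ID INTEGER PRIMARY KEY, Value INTEGER)")
conn.executemany("INSERT INTO myData1 VALUES (?, ?)",
                 [(1, 150), (2, 120), (3, 100)])

# Approach 1: correlated subquery fetches the second-last value next to the last one
query = """
SELECT dt.last_value - dt.second_last_value
FROM (
    SELECT t1.Value AS last_value,
           (SELECT t2.Value
            FROM myData1 AS t2
            WHERE t2.ID < t1.ID
            ORDER BY t2.ID DESC LIMIT 1) AS second_last_value
    FROM myData1 AS t1
    ORDER BY t1.ID DESC LIMIT 1
) AS dt
"""
difference = conn.execute(query).fetchone()[0]

# Approach 2, emulated client-side: weight the last value by 1 and the second-last by -1
last_two = [row[0] for row in
            conn.execute("SELECT Value FROM myData1 ORDER BY ID DESC LIMIT 2")]
factor_sum = sum(factor * value for factor, value in zip((1, -1), last_two))
```

Both variables hold 100 - 120, matching the result requested in the question.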

LIMIT by random number between 1 and 10

Essentially, I want to return X number of records from the last 21 days, with an upper limit of 10 records.
How do I add a random LIMIT to a query in MySQL?
Here's my query, with X for the random number 1-10.
SELECT releases.id, COUNT(charts_extended.release_id) as cnt FROM releases
INNER JOIN charts_extended
ON charts_extended.release_id=releases.id
WHERE DATEDIFF(NOW(), releases.date) < 21
GROUP BY releases.id
ORDER BY RAND()
LIMIT 0, X
I tried using RAND() * 10 + 1, but it gives a syntax error.
Is there any way to do this using pure SQL, i.e. without using an application language to "build" the query as a string and fill in X programmatically?

Eureka...
In pseudo code:
execute a query to select 10 random rows
select from that, assigning a row number 0-9 using a user-defined variable
cross join with a single hit on rand() to create a number between 0 and 10, and select all rows with a row number less than or equal to that number
Here's the essence of the solution (you can adapt your query to work with it):
select * from (
select *, (@rn := coalesce(@rn + 1, 0)) rn from (
-- your query here, except simply LIMIT 10
select * from mytable
order by rand()
limit 10
) x
) y
cross join (select rand() * 10 rand) z
where rn <= rand
See SQLFiddle. Run it a few times and you'll see you get 1-10 random rows.
If you don't want to see the row number, you can change the outer select * to select only the specific columns from the inner query that you want in your result.
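The three pseudo-code steps can be sketched in Python to show why the result always contains between 1 and 10 rows. The function name random_count_sample and the integer rows are assumptions made for the illustration; the logic mirrors the query: number the sampled rows from 0, draw one cutoff in [0, 10), and keep rows numbered at or below it.

```python
import random

def random_count_sample(rows, rng, max_rows=10):
    """Return between 1 and max_rows randomly chosen rows."""
    # Step 1: take up to max_rows random rows (ORDER BY RAND() LIMIT 10)
    sample = rng.sample(rows, min(max_rows, len(rows)))
    # Step 3: a single random cutoff, a float in [0, max_rows)
    cutoff = rng.random() * max_rows
    # Step 2 and filter: number rows from 0 and keep those at or below the cutoff
    return [row for number, row in enumerate(sample) if number <= cutoff]

rng = random.Random(7)
rows = list(range(100))
sizes = {len(random_count_sample(rows, rng)) for _ in range(1000)}
```

Row 0 always passes because the cutoff is never negative, which is what guarantees at least one row; row 9 passes only when the cutoff is 9 or more.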

Your query is correct but you need to update limit clause.
$query = "SELECT releases.id, COUNT(charts_extended.release_id) as cnt FROM releases
INNER JOIN charts_extended
ON charts_extended.release_id=releases.id
WHERE DATEDIFF(NOW(), releases.date) < 21
GROUP BY releases.id
ORDER BY RAND()
LIMIT 0,".rand(1,10);
and then execute this query.

MySQL get two random rows (complex), performance

I need to get 2 random rows, but not with a plain ORDER BY rand(), because that performs very badly with 10k+ rows, so I got this code from another question here:
SELECT b.*
FROM bilder b CROSS JOIN
(SELECT COUNT(*) as cnt FROM bilder) v
WHERE rand() <= 5 / cnt
ORDER BY rand()
LIMIT 2
So I get 2 random rows from the table bilder and the performance is much better now. But I need to narrow it down a bit more: I need only the rows where the field geschlecht has the value female, so I tried:
SELECT b.*
FROM bilder b CROSS JOIN
(SELECT COUNT(*) as cnt FROM bilder) v
WHERE rand() <= 5 / cnt AND geschlecht = 'female'
ORDER BY rand()
LIMIT 2
But now I sometimes get only one row and sometimes none. How can I do this right?

Suppose you have 100 rows in bilder but only 10 of those rows have geschlecht='female'.
The first query tests its random selection 100 times, and each time it has a 5/100 chance of selecting the row. The odds of picking no rows are therefore 0.95^100 (for an explanation why, see the Birthday Problem), in other words only about a 0.6% chance of picking no rows.
The second query tests its random selection only 10 times, and each time it still has a 5/100 chance of selecting the row. The odds of picking no rows are 0.95^10, which is a 59.87% chance!
It would be better if you apply the condition to the count subquery as well:
SELECT b.*
FROM bilder b CROSS JOIN
(SELECT COUNT(*) as cnt FROM bilder WHERE geschlecht = 'female') v
WHERE rand() <= 5 / cnt AND geschlecht = 'female'
ORDER BY rand()
LIMIT 2
Now cnt is only 10, and the random chance is therefore a 5/10 chance of selecting the row. So the odds of picking no rows are 0.50^10, or less than 0.1%.
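The three probabilities quoted above follow from one formula: if each of n candidate rows is kept independently with probability p, the chance that none is kept is (1 - p)^n. A short Python check, with a helper name chosen just for this example:

```python
def chance_of_no_rows(candidates, per_row_probability):
    """Probability that none of the candidate rows passes the rand() filter."""
    return (1 - per_row_probability) ** candidates

# 100 rows, no filter, each kept with probability 5/100
unfiltered = chance_of_no_rows(100, 5 / 100)
# 10 female rows, but cnt still counts all 100 rows
filtered_old = chance_of_no_rows(10, 5 / 100)
# 10 female rows, cnt restricted to the same 10 rows
filtered_fixed = chance_of_no_rows(10, 5 / 10)

print(f"{unfiltered:.4%} {filtered_old:.2%} {filtered_fixed:.4%}")
```

The same reasoning explains the single-row results in the question: with only 10 candidates at 5% each, the expected number of surviving rows is 0.5, well below the 2 requested by LIMIT.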

Mysql: Order by max N values from subquery

I'm about to throw in the towel with this.
Preface: I want to make this work with any N, but for the sake of simplicity, I'll set N to be 3.
I've got a query (MySQL, specifically) that needs to pull in data from a table and sort based on top 3 values from that table and after that fallback to other sort criteria.
So basically I've got something like this:
SELECT maintable.id
FROM
tbl1 AS maintable
LEFT JOIN
tbl2 AS othertable
ON
maintable.id = othertable.id
ORDER BY
othertable.timestamp DESC,
maintable.timestamp DESC
Which is all basic textbook stuff. But the issue is I need the first ORDER BY clause to only get the three biggest values in othertable.timestamp and then fallback on maintable.timestamp.
Also, doing a LIMIT 3 subquery to othertable and join it is a no go as this needs to work with an arbitrary number of WHERE conditions applied to maintable.
I was almost able to make it work with a user variable based approach like this, but it fails since it doesn't take into account ordering, so it'll take the FIRST three othertable values it finds:
ORDER BY
(
IF(othertable.timestamp IS NULL, 0,
IF(
(@rank:=@rank+1) > 3, null, othertable.timestamp
)
)
) DESC
(with a @rank:=0 preceding the statement)
So... any tips on this? I'm losing my mind with the problem. Another parameter I have for this is that since I'm only altering an existing (vastly complicated) query, I can't do a wrapping outer query. Also, as noted, I'm on MySQL so any solutions using the ROW_NUMBER function are unfortunately out of reach.
Thanks to all in advance.
EDIT. Here's some sample data with timestamps dumbed down to simpler integers to illustrate what I need:
maintable
id timestamp
1 100
2 200
3 300
4 400
5 500
6 600
othertable
id timestamp
4 250
5 350
3 550
1 700
=>
1
3
5
6
4
2
And if for whatever reason we add WHERE NOT maintable.id = 5 to the query, here's what we should get:
1
3
4
6
2
...because now 4 is among the top 3 values in othertable referring to this set.
So as you see, the row with id 4 from othertable is not included in the ordering as it's the fourth in descending order of timestamp values, thus it falls back into getting ordered by the basic timestamp.
The real world need for this is this: I've got content in "maintable" and "othertable" is basically a marker for featured content with a timestamp of "featured date". I've got a view where I'm supposed to float the last 3 featured items to the top and the rest of the list just be a reverse chronologic list.

Maybe something like this.
SELECT
id
FROM
(SELECT
maintable.id,
CASE WHEN othertable.timestamp IS NULL THEN
0
ELSE
@i := @i + 1
END AS num,
othertable.timestamp as othertimestamp,
maintable.timestamp as maintimestamp
FROM
tbl1 AS maintable
CROSS JOIN (select @i := 0) i
LEFT JOIN tbl2 AS othertable
ON maintable.id = othertable.id
ORDER BY
othertable.timestamp DESC) t
ORDER BY
CASE WHEN num > 0 AND num <= 3 THEN
othertimestamp
ELSE
maintimestamp
END DESC

Modified answer:
select ilv.* from
(select sq.*, @i:=@i+1 rn from
(select @i := 0) i
CROSS JOIN
(select m.*, o.id o_id, o.timestamp o_t
from maintable m
left join othertable o
on m.id = o.id
where 1=1
order by o.timestamp desc) sq
) ilv
order by case when o_t is not null and rn <=3 then rn else 4 end,
timestamp desc
SQLFiddle here.
Amend where 1=1 condition inside subquery sq to match required complex selection conditions, and add appropriate limit criteria after the final order by for paging requirements.

Can you use a union query as below?
(SELECT id,timestamp,1 AS isFeatured FROM tbl2 ORDER BY timestamp DESC LIMIT 3)
UNION ALL
(SELECT id,timestamp,2 AS isFeatured FROM tbl1 WHERE id NOT IN (SELECT id FROM (SELECT id FROM tbl2 ORDER BY timestamp DESC LIMIT 3) AS top3))
ORDER BY isFeatured,timestamp DESC
This might be somewhat redundant, but it is semantically closer to the question you are asking. This would also allow you to parameterize the number of featured results you want to return.
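The ordering rule itself (top N featured rows first, everything else by its own timestamp) can be stated compactly in Python and checked against the sample data from the question. This is a sketch of the intended semantics, not a replacement for the SQL; the function name order_with_featured and the dictionary representation of the two tables are assumptions for the example.

```python
def order_with_featured(main, featured, top_n=3, exclude=()):
    """Return ids: the top_n most recently featured first, the rest by main timestamp."""
    # Apply the WHERE conditions on maintable first
    visible = {row_id: ts for row_id, ts in main.items() if row_id not in exclude}
    # Rank only the featured rows that survived the filter, newest featured first
    ranked = sorted((row_id for row_id in featured if row_id in visible),
                    key=lambda row_id: featured[row_id], reverse=True)
    top = ranked[:top_n]
    # Everything else falls back to reverse chronological order
    rest = sorted((row_id for row_id in visible if row_id not in top),
                  key=lambda row_id: visible[row_id], reverse=True)
    return top + rest

maintable = {1: 100, 2: 200, 3: 300, 4: 400, 5: 500, 6: 600}   # id -> timestamp
othertable = {4: 250, 5: 350, 3: 550, 1: 700}                  # id -> featured timestamp

full_order = order_with_featured(maintable, othertable)
without_5 = order_with_featured(maintable, othertable, exclude={5})
```

Filtering before ranking is the important design choice: it is what lets id 4 move into the top three once id 5 is excluded, as the question requires.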
