How to sample rows in MySQL using RAND(seed)?

I need to fetch a repeatable random set of rows from a table using MySQL. I implemented this with the MySQL RAND() function, using the bigint primary key of the row as the seed. Interestingly, this produces numbers that don't look random at all. Can anyone tell me what's going on here and how to get it to work properly?
select id from foo where rand(id) < 0.05 order by id desc limit 100
In one example, out of 600 rows not a single one was returned. When I changed the select to include id, rand(id) and dropped the RAND() condition from the WHERE clause, this is what I got:
| 163345 | 0.315191733944408 |
| 163343 | 0.814825518815616 |
| 163337 | 0.313726862253367 |
| 163334 | 0.563177533972242 |
| 163333 | 0.312994424545201 |
| 163329 | 0.312261986837035 |
| 163327 | 0.811895771708242 |
| 163322 | 0.560980224573035 |
| 163321 | 0.310797115145994 |
| 163319 | 0.810430896291911 |
| 163318 | 0.560247786864869 |
| 163317 | 0.310064677437828 |
Look how many 0.31xxx lines there are. Not at all random.
PS: I know this is slow, but in my app the WHERE clause limits the number of rows to a few thousand.

Use the same seed for all rows instead, like:
select id from foo where rand(42) < 0.05 order by id desc limit 100
See the RAND() docs for why it works that way: with a constant seed, the generator is initialized once and each row gets the next value of a single repeatable sequence. Seeding with the id, as in your query, gives every row the first value of a freshly seeded sequence, and for nearby seeds those first values are strongly correlated, which is exactly the 0.31xxx pattern you observed. Change the seed if you want a different set of values.
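If you want the sample tied to each row's id rather than to row order, one workaround (a sketch, not something the RAND() docs promise, so test it on your data) is to decorrelate the per-row seeds by hashing the id first, e.g. with CRC32():
-- Hashing spreads consecutive ids across the 32-bit seed space,
-- so the first value of each freshly seeded sequence no longer
-- tracks the id. Still repeatable, since CRC32(id) is deterministic.
select id from foo where rand(crc32(id)) < 0.05 order by id desc limit 100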

Alternatively, multiply the decimal number returned by RAND() by the id (note that an unseeded RAND() gives a different sample on every run, so this variant is not repeatable):
select id from foo where rand() * id < 5 order by id desc limit 100

Related

MySQL RAND() function causes unexpected multi-row results

When I try to get a random row from a table by id using the RAND() function, I get unexpected, unstable results. The following query (where the id column is the primary key) returns 1, 2, or even more rows; I tried other variants as well, which produce the same result:
SELECT id, word FROM words WHERE id = FLOOR(RAND() * 1000)
I found another solution for my task:
SELECT id, word FROM words ORDER BY RAND() LIMIT 1
But I want to know why MySQL behaves so unexpectedly with such elementary functionality. It scares me.
I experimented in different IDEs with the same results.
The behavior is not unexpected: RAND() is evaluated once per row:
SELECT RAND() FROM sometable LIMIT 10
+--------------------+
| RAND()             |
+--------------------+
| 0.7383128467372738 |
| 0.6141578719151746 |
| 0.8558508500976961 |
| 0.4367806654766022 |
| 0.6163508078235674 |
| 0.7714120734216757 |
| 0.0080079743713214 |
| 0.7258036823252251 |
| 0.6049945192458057 |
| 0.8475615799869984 |
+--------------------+
Keeping this in mind, this query:
SELECT * FROM words WHERE id = FLOOR(RAND() * 1000)
means that each row whose id is between 0 and 999 independently has a 1/1000 probability of being selected, because the comparison is evaluated afresh for every row, so the query can match zero, one, or several rows!
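To pick exactly one random row with RAND() evaluated only once, a common trick is to move it into a derived table and join against that. A sketch using the question's table (assuming ids start at 1; the >= comparison plus LIMIT 1 tolerates gaps in id, at the cost of slightly favoring ids that follow a gap):
SELECT w.id, w.word
FROM words w
JOIN (SELECT FLOOR(1 + RAND() * (SELECT MAX(id) FROM words)) AS rid) r
  ON w.id >= r.rid
ORDER BY w.id
LIMIT 1;
The derived table references no base table, so MySQL materializes it once and RAND() fires a single time instead of once per row.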

Sorting varchar numerically when also sorting by other fields

I need to sort a query's results by a varchar that may contain numbers, but also sort by other fields beforehand.
Say my table looks like:
+------------------------+-----------------+
| AnotherField           | VarCharWithNums |
+------------------------+-----------------+
| Same Data in Every Row | Numbers 5-10    |
| Same Data in Every Row | Numbers 10-13   |
| Same Data in Every Row | Numbers 13-15   |
+------------------------+-----------------+
This query:
SELECT VarCharWithNums, AnotherField
FROM MyTable
ORDER BY CAST(VarCharWithNums AS UNSIGNED) ASC
Gives me this:
+------------------------+-----------------+
| AnotherField           | VarCharWithNums |
+------------------------+-----------------+
| Same Data in Every Row | Numbers 5-10    |
| Same Data in Every Row | Numbers 10-13   |
| Same Data in Every Row | Numbers 13-15   |
+------------------------+-----------------+
This query:
SELECT VarCharWithNums, AnotherField
FROM MyTable
ORDER BY AnotherField ASC, CAST(VarCharWithNums AS UNSIGNED) ASC
Gives me this:
+------------------------+-----------------+
| AnotherField           | VarCharWithNums |
+------------------------+-----------------+
| Same Data in Every Row | Numbers 10-13   |
| Same Data in Every Row | Numbers 5-10    |
| Same Data in Every Row | Numbers 13-15   |
+------------------------+-----------------+
It doesn't matter what priority I give the fields in the ORDER BY clause, it never sorts VarCharWithNums correctly when I order it alongside other fields.
You mentioned it in your last paragraph, but truly the only error in what you've shown is that, to get the sort you described, CAST(VarCharWithNums AS UNSIGNED) ASC needs to come first in the ORDER BY clause. Beyond that it should work, and it does work when I create a contrived example on my machine.
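For reference, a contrived reproduction along those lines (a hypothetical table; note it assumes the strings are purely numeric, because CAST(... AS UNSIGNED) only reads leading digits and returns 0 for a value like 'Numbers 5-10', which would make the numeric sort key useless):
CREATE TABLE MyTable (
  AnotherField    VARCHAR(32),
  VarCharWithNums VARCHAR(16)
);

INSERT INTO MyTable VALUES
  ('Same Data in Every Row', '13'),
  ('Same Data in Every Row', '5'),
  ('Same Data in Every Row', '10');

-- Returns 5, 10, 13 even with AnotherField first in the ORDER BY,
-- because the CAST supplies a numeric key for the second sort level.
SELECT VarCharWithNums, AnotherField
FROM MyTable
ORDER BY AnotherField ASC, CAST(VarCharWithNums AS UNSIGNED) ASC;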

Only return an ordered subset of the rows from a joined table

Given a structure like this in a MySQL database
#data_table
(id) | user_id | time | (...)
#relations_table
(id) | user_id | user_coach_id | (...)
we can select all data_table rows belonging to a certain user_coach_id (let's say 1) with
SELECT rel.`user_coach_id`, dat.*
FROM `relations_table` rel
LEFT JOIN `data_table` dat ON rel.`user_id` = dat.`user_id`
WHERE rel.`user_coach_id` = 1
ORDER BY dat.`time` DESC
returning something like
| user_coach_id | id | user_id | time | data1 | data2 | ...
| 1             | 9  | 4       | 15   | foo   | bar   | ...
| 1             | 7  | 3       | 12   | oof   | rab   | ...
| 1             | 6  | 4       | 11   | ofo   | abr   | ...
| 1             | 4  | 4       | 5    | foo   | bra   | ...
(And so on. Of course the time values aren't integers in reality, but let's keep it simple.)
But now I would like to query (ideally) only up to an arbitrary number of rows from data_table per distinct user_id, while still keeping them ordered (i.e. newest first). Is that even possible?
I know I can use GROUP BY user_id to return only one row per user, but then the ordering doesn't work and it seems rather unpredictable which row ends up in the result. I guess it's doable with a subquery, but I haven't figured it out yet.
Limiting the number of rows in each group is complicated. It is probably best done with a user variable (@variable) to count rows within each group, plus an outer query to throw out the rows beyond the limit.
My blog on Groupwise Max gives some hints on how to do this.
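A sketch of that approach against the question's tables, with the per-user limit set to 3 (the @-variable trick relies on evaluation order inside the derived table, which works in practice on MySQL 5.x but is deprecated in MySQL 8, where ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY `time` DESC) does the same job cleanly):
SELECT user_coach_id, id, user_id, `time`
FROM (
    SELECT rel.user_coach_id, dat.id, dat.user_id, dat.`time`,
           @rn := IF(dat.user_id = @prev, @rn + 1, 1) AS rn,
           @prev := dat.user_id AS prev_id
    FROM relations_table rel
    JOIN data_table dat ON rel.user_id = dat.user_id
    CROSS JOIN (SELECT @rn := 0, @prev := NULL) init
    WHERE rel.user_coach_id = 1
    -- group rows per user, newest first, so the counter resets correctly
    ORDER BY dat.user_id, dat.`time` DESC
) ranked
WHERE rn <= 3          -- keep at most 3 rows per user_id
ORDER BY `time` DESC;  -- then order the survivors newest first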

Efficient MySQL Table Setup and Queries

Suppose I have the following database setup (a simplified version from what I actually have):
Table: news_posting (500,000+ entries)
+------------+-----------+-----------+---------------+-------+
| posting_id | name      | is_active | released_date | token |
+------------+-----------+-----------+---------------+-------+
| 1          | posting_1 | 1         | 2013-01-10    | 123   |
| 2          | posting_2 | 1         | 2013-01-11    | 124   |
| 3          | posting_3 | 0         | 2013-01-12    | 125   |
+------------+-----------+-----------+---------------+-------+
PRIMARY posting_id
INDEX sorting ON (is_active, released_date, token)
Table: news_category (500 entries)
+-------------+------------+
| category_id | name       |
+-------------+------------+
| 1           | category_1 |
| 2           | category_2 |
| 3           | category_3 |
+-------------+------------+
PRIMARY category_id
Table: news_cat_match (1,000,000+ entries)
+-------------+------------+
| category_id | posting_id |
+-------------+------------+
| 1           | 1          |
| 2           | 1          |
| 3           | 1          |
| 2           | 2          |
| 3           | 2          |
| 1           | 3          |
| 2           | 3          |
+-------------+------------+
UNIQUE idx (category_id, posting_id)
My task is as follows. I must get a list of the 50 latest news postings (at some offset) that are active, whose released_date is before today, and that are in one of the 20 or so categories specified in the request. Before I choose the 50 news postings to return, I must sort the matching news postings by token in descending order. My query is currently similar to the following:
SELECT DISTINCT posting_id
FROM news_posting np
INNER JOIN news_cat_match ncm ON (ncm.posting_id = np.posting_id AND ncm.category_id IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20))
WHERE np.is_active = 1
AND np.released_date < '2013-01-28'
ORDER BY np.token DESC LIMIT 50
With just one specified category_id the query avoids a filesort and is reasonably fast, because it does not have to remove duplicate results. However, running EXPLAIN on the above query with multiple category_ids shows that a filesort is needed, and the query is extremely slow on my data set.
Is there any way to optimize the table setup and/or the query?
I was able to get the above query to run even faster than the single-category version by rewriting it as follows:
SELECT posting_id
FROM news_posting np
WHERE np.is_active = 1
AND np.released_date < '2013-01-28'
AND EXISTS (
    SELECT ncm.posting_id
    FROM news_cat_match ncm
    WHERE ncm.posting_id = np.posting_id
    AND ncm.category_id IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
    LIMIT 1
)
ORDER BY np.token DESC LIMIT 50
This now takes under a second on my data set.
The odd part is that this is even faster than with just one category_id specified. That's because with 20 categories the matching subset of news items is bigger, so a scan in token order finds 50 qualifying rows more quickly.
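If the plan is indeed walking postings in token order and probing news_cat_match per row, the existing sorting index cannot supply that order, because the range on released_date sits between is_active and token. A hypothetical index (my suggestion, not from the original post) that serves the equality on is_active and the ORDER BY directly, with released_date checked from the index row, would be:
-- Hypothetical index; with it, MySQL can read is_active = 1 rows in
-- descending token order and stop after 50 matches instead of
-- collecting and sorting them.
CREATE INDEX sorting_by_token ON news_posting (is_active, token, released_date);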
Now my next question is whether this can be optimized for the case where a category has only a few postings spread out over time. The following is still pretty slow on my development machine. Although it's fast enough on the production server, I would like to optimize it if possible.
SELECT DISTINCT posting_id
FROM news_posting np
INNER JOIN news_cat_match ncm ON (ncm.posting_id = np.posting_id AND ncm.category_id = 1)
WHERE np.is_active = 1
AND np.released_date < '2013-01-28'
ORDER BY np.token DESC LIMIT 50
Does anyone have any further suggestions?

Sum a generated list in MySQL, in a single query

I need to sum a result that I'm getting from an existing query, and it has to extend the current query while remaining a single query
(by this I mean NOT DO 1; DO 2; DO 3;).
My current query is:
SELECT SUM((count)/(SELECT COUNT(*)
                    FROM mobile_site_statistics
                    WHERE campaign_id='1201'
                      AND start_time BETWEEN CURDATE()-1 AND CURDATE())*100) AS percentage
FROM mobile_site_statistics
WHERE device NOT LIKE '%Pingdom%'
  AND campaign_id='1201'
  AND start_time BETWEEN (CURDATE()-1) AND CURDATE()
GROUP BY device
ORDER BY 1 DESC
LIMIT 10;
This returns:
+------------+
| percentage |
+------------+
| 47.3813 |
| 19.7940 |
| 5.6672 |
| 5.0801 |
| 3.9603 |
| 3.8500 |
| 3.1294 |
| 2.9924 |
| 2.9398 |
| 2.7136 |
+------------+
What I need is the total of that result set (the total percentage used by the top 10 devices), but it has to be a single query that includes the initial query, because another program consumes it.
Is this possible? Every way I have tried so far has failed. We tried temporary tables, but that turned into multiple queries.
Just do a
SELECT SUM(percentage) AS total FROM (<YOUR_QUERY>) a
and replace the sub-query <YOUR_QUERY> with your initial query. Note that the derived table needs an alias (a here).
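Spelled out with the query from the question dropped in:
SELECT SUM(percentage) AS total
FROM (
    SELECT SUM((count)/(SELECT COUNT(*)
                        FROM mobile_site_statistics
                        WHERE campaign_id='1201'
                          AND start_time BETWEEN CURDATE()-1 AND CURDATE())*100) AS percentage
    FROM mobile_site_statistics
    WHERE device NOT LIKE '%Pingdom%'
      AND campaign_id='1201'
      AND start_time BETWEEN (CURDATE()-1) AND CURDATE()
    GROUP BY device
    ORDER BY 1 DESC
    LIMIT 10
) a;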