MySQL equally distributed random rows with WHERE clause - mysql

I have this table,
person_id int(10) pk
points int(6) index
other columns not very important
I have this random function which is very fast on a table with 10M rows:
SELECT person_id
FROM persons AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(person_id)
FROM persons)) AS id)
AS r2
WHERE r1.person_id >= r2.id
ORDER BY r1.person_id ASC
LIMIT 1
This is all great but now I wish to show only people with points > 0. Example table:
PERSON_ID POINTS
1 4
2 6
3 0
4 3
When I append AND points > 0 to the where clause, person_id 3 can't be selected, so a gap is created and when the random select person_id 3, person_id 4 will be selected. This gives person 4 a bigger chance to be chosen. Any one got suggestions how I can adjust the query to make it work with the where clause and give all rows same % of chance to be selected.
Info table: The table is uniform, no gaps in person_id's. About 90% will have 0 points. I want to make the query for where points = 0 and points > 0.
Before someone will say, use rand(): this is not solution for tables with more than a few 100k rows.
Bonus question: will it be possible to select x random rows in 1 query, so I do not have to call this query a few times when i want more random rows?
Important note: performance is key, with 10M+ rows the query may not take much longer than the current query, which takes 0.0005 seconds, I prefer to stay under 0.05 second.
Last note: If you think the query will never be this fast with above requirements, but another solution is possible (like fetching 100 rows and showing x random which has more than 0 points), please tell :)
Really appreciate your help and all help is welcome :)

You could generate in-line gap-free id's for the records that you really want to work with, and generate then the random selector using the total number of records available.
Try with this (props to the chosen answer here for the row_number generator):
SELECT r1.*
FROM
(SELECT person_id,
#curRow := #curRow + 1 AS row_number
FROM persons as p,
(SELECT #curRow := 0) r0
WHERE points>0) r1
, (SELECT COUNT(1) * RAND() id
FROM persons
WHERE points>0) r2
WHERE r1.person_id>=r2.id
ORDER BY r1.person_id ASC
LIMIT 1;
You can mess with it in this sqlfiddle.

Related

Get Top n data from a set of grouped data

I'm on MySQL 5.7 and have a dataset as the table, bundles.
bundle_id
single_amount
item_count
1
100
1
2
20
3
3
15
2
SQL Fiddle: http://sqlfiddle.com/#!9/823c91/1/0
e.g. The table means that for bundle1, a customer bought one item that is $100 and another customer bought 3 items that the individual one is $20.
I want to get the top n data by individual item, not by a bundle. The straightforward idea is to flatten the data as the table below.
single_amount
item_count
100
1
20
3
20
3
20
3
15
2
15
2
It's easy to do it on service but considering the potential size of the dataset, is there a way to do it on MySQL side?
The simplest method is to create one table having one number column and fill it with 1 to maximum numbers that is available in the column item_count.
Let's say, you have created table as numbers_table (number_col)
Then you can use that numbers table as follows:
select single_amount, item_count
from your_table t
join numbers_table n on t.item_count >= n.number_col
MySQL 8+ supports recursive CTEs (and actually CTEs in general). But earlier versions do not. To solve this, you need a numbers table of some sort. MySQL doesn't have one built-in, but you can generate one.
Based on your description of the problem, the bundles table is big enough, so you can use that:
SELECT n.n, b.*
FROM bundles b JOIN
(SELECT (#rn := #rn + 1) as n
FROM bundles b CROSS JOIN
(SELECT #rn := 0) params
LIMIT 100
) n
ON n.n <= b.item_count;
The LIMIT 100 is just for performance. You obviously need as many rows at the maximum item_count.
Here is a SQL Fiddle.

get distinct user rows from table while counting a different row

I am trying to count rows from a SQL Table named "reports", sorted by "working" being 0 / 1 / 2.
Currently I have this SQL query which works okay to give me three rows, each with a counter of how many there are that had "working" as either 0 / 1 / 2.
SELECT `working`, COUNT(`working`) AS `total` FROM `reports`
WHERE `appid` = 379720
GROUP BY `working`
ORDER BY `report_id` DESC LIMIT 30
So it currently (correctly as per the SQL) gives me something like:
Working
Total
0
12
1
34
2
18
What I want to do though, is have only one row per user counted, which I can't quite wrap my head around. I can't use a distinct select on an "author_id" field as that ends up included and I can't group by it since I need it grouped by the working int.
To be clear: I want the same results display, but only count one per unique "author_id" from each row.
Any pointers?
You seem to want count(distinct):
SELECT `working`, COUNT(DISTINCT author_id) AS `total`
FROM `reports`
WHERE `appid` = 379720
GROUP BY `working`
ORDER BY `report_id` DESC

SQL Query - selecting higher priority rows more often

I am trying to do SQL code in mysqli query to select rows with higher priority more often. I have a DB where all posts are sorted by priority, but I want it select like this (10 - the highest priority):
**Priority**
10
3
10
9
7
10
9
1
10
How can I do this? I have tried that to solve by more ways but no result. Thank you.
If you want to sample your data with preference to higher priorities, you could do something like this:
SELECT *
FROM (
SELECT OrderDetailID
,mod(OrderDetailID, 10) + 1 AS priority
,rand() * 10 AS rand_priority
FROM OrderDetails
) A
WHERE rand_priority < priority
ORDER BY OrderDetailID
This query runs in MySQL Tryit from W3Schools.
mod(OrderDetailID, 10) + 1 simulates a 1-10 priority - your table just has this value in it already
rand() * 10 gives you a random number between 0 and 10
Then by filtering to only ones where the random number is less than the priority, you get a result set where the higher priorities are more likely.
You may use rank function if your MySQL version supports it. It will order your data by priority in descending order and ranks each row. If the two rows have same priority then both rows will have same ranking. Then you can filter out the first rank data which will give you highest priority rows always.
Select * FROM
(
SELECT
col1,
col2,
priority,
RANK() OVER w AS 'rank'
FROM MyTable
WINDOW w AS (ORDER BY priority)
) MyQuery
Where rank = 1
Note : Syntax might be incorrect, please feel to edit the query.
This post might help you for ranking if your MySql version doesn't support Rank.

MYSQL SELECT random on large table ORDER BY SCORE [duplicate]

This question already has answers here:
Optimizing my mysql statement! - RAND() TOO SLOW
(6 answers)
Closed 8 years ago.
I have a large mysql table with about 25000 rows. There are 5 table fields: ID, NAME, SCORE, AGE,SEX
I need to select random 5 MALES order BY SCORE DESC
For instance, if there are 100 men that score 60 each and another 100 that score 45 each, the script should return random 5 from the first 200 men from the list of 25000
ORDER BY RAND()
is very slow
The real issue is that the 5 men should be a random selection within the first 200 records. Thanks for the help
so to get something like this I would use a subquery.. that way you are only putting the RAND() on the outer query which will be much less taxing.
From what I understood from your question you want 200 males from the table with the highest score... so that would be something like this:
SELECT *
FROM table_name
WHERE age = 'male'
ORDER BY score DESC
LIMIT 200
now to randomize 5 results it would be something like this.
SELECT id, score, name, age, sex
FROM
( SELECT *
FROM table_name
WHERE age = 'male'
ORDER BY score DESC
LIMIT 200
) t -- could also be written `AS t` or anything else you would call it
ORDER BY RAND()
LIMIT 5
I dont think that sorting by random can be "optimised out" in any way as sorting is N*log(N) operation. Sorting is avoided by query analyzer by using indexes.
The ORDER BY RAND() operation actually re-queries each row of your table, assigns a random number ID and then delivers the results. This takes a large amount of processing time for table of more than 500 rows. And since your table is containing approx 25000 rows then it will definitely take a good amount of time.

Select a Portion of Vast Data Over Time with MySQL

I have hundreds of thousands of price points spanning 40 years plus. I would like to construct a query that will only return 3000 total data points, with the last 500 being the most recent data points, and the other 2500 being just a sample of the rest of the data, evenly distributed.
Is it possible to do this in one query? How would I select just a sample of the large amount of data? This is a small example of what I mean for getting just a sample of the other 2500 data points:
1
2
3
4
5
6
7
8
9
10
And I want to return something like this:
1
5
10
Here's the query for the last 500:
SELECT * FROM price ORDER BY time_for DESC LIMIT 500
I'm not sure how to go about getting the sample data from the other data points.
Try this:
(SELECT * FROM price ORDER BY time_for DESC LIMIT 500)
UNION ALL
(SELECT * FROM price WHERE time_for < (SELECT time_for FROM price ORDER BY time_for LIMIT 500, 1) ORDER BY rand() LIMIT 2500)
ORDER BY time_for
Note: It's probably going to be slow. How big is this table?
It might be faster to only get the primary ID from all these rows, then join it to the original in a secondary query once it's narrowed down. This is because ORDER BY rand() LIMIT has to sort the entire table. If the table is large this can take a LONG time, and a lot of disk space. Retrieving only the ID reduces the necessary disk space.
The previous answer is good, but you did specify that you want the results to be evenly distributed so I'll add this possibility too. By iterating a counter over the rows you can use a MOD operator to sample an even distribution. I don't have a MYSQL install right now to test this so apologies if the syntax isn't 100% spot on. But it should be close enough and may give you some ideas.
( SELECT p1.*
FROM price p1
ORDER BY p1.time_for DESC
LIMIT 500 )
UNION ALL
( SELECT #i := #i + 1 AS row_num,
p2.*
FROM price p2,
(SELECT #i: = 0)
WHERE row_num > 500
AND (row_num % 500) = 0
ORDER BY time_for DESC )
The first query gives the 500 latest rows. The second query gives every 500th row after that, thus returning an even distribution from the rest of the data. Obviously you can tune this parameter to achieve the desired sample spacing. Or base it on the total number of rows in the table to calculate the necessary spacing to give exactly 2500 records.