Randomly select rows from a table based on weight and probability - mysql

I'm using MySQL. I have a table that looks like this:
id: primary key
name: varchar
weight: int (this can be either 1,2 or 3)
What I want to do is randomly select one row at a time until I have a list of 400 selected rows, taken from a table like the one below (which has 500 rows) and taking the weight into account.
For example, if I have 3 rows:
id, name, weight
1, "some content", 2
2, "other content", 1
3, "something", 3
When creating the list, rows with a weight of 2 should appear 30% of the time, rows with a weight of 1 should appear 20% of the time, and rows with a weight of 3 should appear 50% of the time.
Duplicates are permitted but not back to back.
Is there a way to do that?
If you don't understand something please feel free to ask.
Thanks in advance.

I still haven't solved the repetition part, but this will give you a start.
SQL Fiddle Demo
The innermost select assigns a random number to each row.
The middle select uses variables to assign a row number to each row, partitioned by Weight.
The last select filters the rows to match the ratio. In this case it generates a list of size 50.
The original data has an even distribution of about 30 rows per weight category, so a list size of 60 is the limit for reaching 50% with Weight = 3.
SELECT `ID`,`Name`,`Weight`, RowNumber
FROM (
SELECT *,
@row_num := IF(@prev_value = `Weight`,
@row_num + 1,
IF(@prev_value:=`Weight`,
1,
1)
) AS RowNumber
FROM (
SELECT `ID`,`Name`,`Weight`, rand() as rng
FROM `myTable`
ORDER BY `Weight`, rng
) X
CROSS JOIN (SELECT @row_num := 1, @prev_value := 0) y
) T
WHERE ( Weight = 3 and RowNumber <= 50 * 0.5 )
OR ( Weight = 2 and RowNumber <= 50 * 0.3 )
OR ( Weight = 1 and RowNumber <= 50 * 0.2 )
ORDER BY Weight, RowNumber
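The same quota idea can be sketched outside the database. This is only an illustration under assumptions: the function name `pick_by_quota`, the `(id, weight)` tuples and the generated test rows are hypothetical, not part of the fiddle.

```python
import random


def pick_by_quota(rows, size, shares, rng=None):
    """Randomly keep round(size * share) rows from each weight group.

    `rows` is a list of (id, weight) tuples and `shares` maps a weight
    to its fraction of the final list, e.g. {3: 0.5, 2: 0.3, 1: 0.2}.
    """
    rng = rng or random.Random()
    by_weight = {}
    for row_id, weight in rows:
        by_weight.setdefault(weight, []).append(row_id)
    picked = []
    for weight, share in shares.items():
        ids = by_weight.get(weight, [])
        rng.shuffle(ids)                 # random order inside the group
        quota = round(size * share)      # same role as RowNumber <= 50 * 0.5
        picked.extend((row_id, weight) for row_id in ids[:quota])
    return picked


# Hypothetical data: 90 rows, 30 per weight, similar to the distribution described above.
rows = [(i, 1 + i % 3) for i in range(90)]
result = pick_by_quota(rows, size=50, shares={3: 0.5, 2: 0.3, 1: 0.2},
                       rng=random.Random(1))
```

Like the query, this never repeats a row, so a weight group with fewer rows than its quota simply contributes fewer rows. That is the list-size limit mentioned above.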

I suggest you make a temporary table where all records with weight 1 are repeated 2 times, all records with weight 2 are repeated 3 times, and all records with weight 3 are repeated 5 times. Then make random selections from the temporary table among all the records. This should statistically end up with a distribution very near your target if the total is large enough (e.g. 400).
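A minimal sketch of that idea in Python, with the "no back-to-back duplicates" rule handled by re-drawing. The names `COPIES` and `weighted_list` are made up for illustration, and the 20/30/50 split only holds if the three weight groups contain a similar number of rows.

```python
import random

# Repetition factors from the suggestion above: weight 1 -> 2 copies,
# weight 2 -> 3 copies, weight 3 -> 5 copies.
COPIES = {1: 2, 2: 3, 3: 5}


def weighted_list(rows, size, rng=None):
    """Pick `size` row ids at random, never the same id twice in a row.

    `rows` is a list of (id, weight) tuples; it must contain at least
    two distinct ids, otherwise back-to-back repeats cannot be avoided.
    """
    rng = rng or random.Random()
    if len({row_id for row_id, _ in rows}) < 2:
        raise ValueError("need at least two distinct ids")
    # The "temporary table": every id repeated according to its weight.
    pool = [row_id for row_id, weight in rows for _ in range(COPIES[weight])]
    picked = []
    while len(picked) < size:
        candidate = rng.choice(pool)
        # Re-draw when the candidate equals the previous pick.
        if picked and picked[-1] == candidate:
            continue
        picked.append(candidate)
    return picked


sample = weighted_list([(1, 2), (2, 1), (3, 3)], size=400, rng=random.Random(0))
```

Rejecting immediate repeats shifts the resulting frequencies somewhat away from the raw weights, and the effect is strongest when the table has very few rows, as in this three-row example.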

In my other answer I showed how to assign a row number within each weight. Here I will show you how to create a list that handles the duplicates.
I use tables to show the whole process, so you can run selects on the demo to validate each step. With some work this can be combined into a single query, but it won't be easy to read.
SQL FIDDLE DEMO
First we need to create a table to store which rows will participate in the list.
CREATE TABLE `incr` (
`weight` mediumint,
`row` mediumint
);
Using a stored procedure, we fill the table.
CREATE PROCEDURE dowhile(IN Size INT)
BEGIN
DECLARE v1 INT DEFAULT Size * 0.5;
WHILE v1 >= 0 DO
IF v1 <= (Size - 1) * 0.5 THEN
INSERT incr VALUES (3, v1);
END IF;
IF v1 <= (Size - 1) * 0.3 THEN
INSERT incr VALUES (2, v1);
END IF;
IF v1 <= (Size - 1) * 0.2 THEN
INSERT incr VALUES (1, v1);
END IF;
SET v1 = v1 - 1;
END WHILE;
END//
CALL dowhile(300); -- Indicate List Size
Now create a new table to know the size of each weight category in our sample.
CREATE TABLE maxWeight
SELECT `Weight`, COUNT(*) as mw
FROM `myTable`
GROUP BY `Weight`;
Using % operator we can repeat the rows to fill the required size
CREATE TABLE rowList
SELECT i.weight,
CASE WHEN i.row >= w.mw then i.row % w.mw
ELSE i.row
END newrow
FROM incr i
JOIN maxWeight w
ON i.weight = w.weight;
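The wrap-around performed by the CASE expression above can be sketched in a few lines of Python (the helper name `wrap_rows` is hypothetical). It also shows that the modulo alone is enough, since `row % available` already equals `row` whenever `row < available`.

```python
def wrap_rows(wanted, available):
    """Map row numbers 0..wanted-1 onto 0..available-1 by wrapping around."""
    if available <= 0:
        raise ValueError("available must be positive")
    return [row % available for row in range(wanted)]


# Seven slots to fill from a group that has only three rows.
wrapped = wrap_rows(wanted=7, available=3)
```

Here `wrapped` is `[0, 1, 2, 0, 1, 2, 0]`: once the group runs out of rows, numbering starts again from its first row.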
As you can see here, even though the source table has only about 100 rows, the final list has 300:
SELECT weight, count(*)
FROM rowList
GROUP BY weight;
| weight | count(*) |
|--------|----------|
| 1 | 60 |
| 2 | 90 |
| 3 | 150 |
Now join both tables together
CREATE TABLE finalResult
SELECT `ID`,`Name`, T.`Weight`, RowNumber
FROM (
SELECT *,
@row_num := IF(@prev_value = `Weight`,
@row_num + 1,
IF(@prev_value:=`Weight`,
0,
0)
) AS RowNumber
FROM (
SELECT `ID`,`Name`,`Weight`, rand() as rng
FROM `myTable`
ORDER BY `Weight`, rng
) X
CROSS JOIN (SELECT @row_num := 0, @prev_value := 0) y
) T
JOIN rowList
ON T.`RowNumber` = rowList.`newrow`
AND T.`Weight` = rowList.`weight`;
The final result has the desired ratio, achieved by repeating names:
SELECT `Weight`, COUNT(*) total, COUNT(DISTINCT `Name`) d_name
FROM finalResult
GROUP BY `Weight`;
| Weight | total | d_name |
|--------|-------|--------|
| 1 | 60 | 36 |
| 2 | 90 | 32 |
| 3 | 150 | 30 |
Even though the original table has 37 rows with weight = 1, the tool I used to generate random values duplicated one Name, so d_name = 36.

Related

mysql how to select count of rows group by sun_calendar_date and div to periodic by every x day

I think this question needs a function, but any solution is acceptable.
I have a table like the one below.
sun_calendar_date is an integer, and it's easy for me to convert it to a string.
answerset:
id sun_calendar_date data
-------------------------------------------
1 13980120 something
2 13980122 something
3 13980129 something
4 13980130 something
5 13980131 something(end of month)
6 13980201 something
7 13980202 something
8 13980203 something
9 13980203 something
I want to select the count of rows grouped by sun_calendar_date, divided into periods of x days.
For example, for a period of 5 days I had the code below, but it does not work across month boundaries or for empty days:
SELECT COUNT(answerset.id) as val,sun_calendar_date FROM answerset
WHERE id group by SUBSTRING(sun_calendar_date,7,2) div 5;
I need this:
val sun_calendar_date
-------------------------------------
2 13980120 20-24=> 2 rows
1 13980129 25-29=> 1 rows
5 13980130 30-03=> 5 rows (next month)
You can use the below to solve your problem:
DELIMITER ;
DROP TABLE IF EXISTS answerset;
CREATE TABLE answerset
(
id INTEGER,
sun_calendar_date DATE,
data VARCHAR(100)
);
INSERT INTO answerset VALUES (1,'13980120','something'),
(2,'13980122','something'),
(3,'13980129','something'),
(4,'13980130','something'),
(5,'13980131','something(end of month)'),
(6,'13980201','something'),
(7,'13980202','something'),
(8,'13980203','something'),
(9,'13980203','something');
-- We need a starting point for the periods, so store the earliest date in a user variable.
-- You could also set this to any fixed date instead.
SELECT MIN(sun_calendar_date) INTO @minDate FROM answerset;
-- DATEDIFF gives the number of days since the minimum date, and modulo (%) returns the remainder
-- of dividing that by 5, i.e. how far each date is into its 5-day period. Subtracting the remainder
-- gives the period start. You could also replace 5 with a variable if you need to change the size of your groups.
SELECT DATE_SUB(sun_calendar_date, INTERVAL DATEDIFF(sun_calendar_date, @minDate) % 5 DAY) AS PeriodStart,
MIN(sun_calendar_date) AS Period,
COUNT(*) AS Val
FROM answerset
GROUP BY PeriodStart
ORDER BY PeriodStart;
You need an auxiliary calendar table. I tried to get this table from the information_schema.columns table.
select
min(a.sun_calendar_date) qnt,
count(a.sun_calendar_date) sun_calendar_date
from (
select
@seq beg,
@seq := adddate(@seq, 5) fin
from (
select
max(sun_calendar_date) x,
@seq := adddate(min(sun_calendar_date),
-(day(min(sun_calendar_date)) % 5))
from answerset
) init
cross join information_schema.columns c1
cross join information_schema.columns c2
where @seq <= init.x
) calendar
join answerset a
on a.sun_calendar_date >= calendar.beg and
a.sun_calendar_date < calendar.fin
group by calendar.beg;
Output:
| qnt | sun_calendar_date |
+------------+-------------------+
| 1398-01-20 | 2 |
| 1398-01-29 | 1 |
| 1398-01-30 | 5 |
Test it online with SQL Fiddle.
MySQL 8.0 with recursive CTEs:
with recursive
init as (
select
adddate(min(sun_calendar_date),
-(day(min(sun_calendar_date)) % 5)) beg,
max(sun_calendar_date) x
from answerset
),
calendar(beg, fin, x) as (
select beg, adddate(beg, 5), x from init
union all
select fin, adddate(fin, 5), x from calendar where fin <= x
)
select
min(a.sun_calendar_date) qnt,
count(a.sun_calendar_date) sun_calendar_date
from answerset a
join calendar c
on a.sun_calendar_date >= c.beg and a.sun_calendar_date < c.fin
group by c.beg;
Test it online with db<>fiddle.

Find two closest elements from one table to other element from another table

I have two tables:
DROP TABLE IF EXISTS `left_table`;
CREATE TABLE `left_table` (
`l_id` INT(11) NOT NULL AUTO_INCREMENT,
`l_curr_time` INT(11) NOT NULL,
PRIMARY KEY(l_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
DROP TABLE IF EXISTS `right_table`;
CREATE TABLE `right_table` (
`r_id` INT(11) NOT NULL AUTO_INCREMENT,
`r_curr_time` INT(11) NOT NULL,
PRIMARY KEY(r_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO left_table(l_curr_time) VALUES
(3),(4),(6),(10),(13);
INSERT INTO right_table(r_curr_time) VALUES
(1),(5),(7),(8),(11),(12);
I want to map (if exists) two closest r_curr_time from right_table to each l_curr_time from left_table such that r_curr_time must be greater or equal to l_curr_time.
The expected result for given values should be:
+------+-------------+-------------+
| l_id | l_curr_time | r_curr_time |
+------+-------------+-------------+
| 1 | 3 | 5 |
| 1 | 3 | 7 |
| 2 | 4 | 5 |
| 2 | 4 | 7 |
| 3 | 6 | 7 |
| 3 | 6 | 8 |
| 4 | 10 | 11 |
| 4 | 10 | 12 |
+------+-------------+-------------+
I have the following solution, which works for the one closest value. But I do not like it very much because it silently relies on the fact that GROUP BY keeps the first occurrence from each group:
SELECT l_id, l_curr_time, r_curr_time, time_diff FROM
(
SELECT *, ABS(r_curr_time - l_curr_time) AS time_diff
FROM left_table
JOIN right_table ON 1=1
WHERE r_curr_time >= l_curr_time
ORDER BY l_id ASC, time_diff ASC
) t
GROUP BY l_id;
The output is following:
+------+-------------+-------------+-----------+
| l_id | l_curr_time | r_curr_time | time_diff |
+------+-------------+-------------+-----------+
| 1 | 3 | 5 | 2 |
| 2 | 4 | 5 | 1 |
| 3 | 6 | 7 | 1 |
| 4 | 10 | 11 | 1 |
+------+-------------+-------------+-----------+
4 rows in set (0.00 sec)
As you can see, I am doing JOIN ON 1=1. Is this OK for large data too (e.g. if both left_table and right_table have 10000 rows, the Cartesian product will be 10^8 rows long)? Despite this drawback, I think JOIN ON 1=1 is the only possible solution, because first I need to create all possible combinations from the existing tables and then pick the ones that satisfy the condition. If I'm wrong, please correct me. Thanks.
This question is not trivial. In SQL Server or PostgreSQL it would be very easy because of the row_number() over x statement. This is not present in MySQL before version 8.0, so there you have to deal with variables and chained select statements.
To solve this problem you have to combine multiple concepts. I will try to explain them one after the other to arrive at a solution that fits your question.
Let's start easy: How to build a table that contains the information of left_table and right_table?
Use a join. In this particular problem we use a left join, and as the join condition we require that l_curr_time is smaller than or equal to r_curr_time. To make the rest easier we order this table by l_curr_time and r_curr_time. The statement looks like this:
SELECT l_id, l_curr_time, r_curr_time
FROM left_table l
LEFT JOIN right_table r ON l.l_curr_time<=r.r_curr_time
ORDER BY l.l_curr_time, r.r_curr_time;
Now we have a table that is ordered and contains the information we want... but too much of it ;) Because the table is ordered, it would be great if MySQL could select only the first two rows for each value of l_curr_time. This is not possible, so we have to do it ourselves.
Mid part: How to number rows?
Use a variable! If you want to number the rows of a table you can use a MySQL variable. There are two things to do: first we have to declare and define the variable, second we have to increment it. Let's say we have a table with names and we want to know the position of each name when we order them by name:
SELECT name, @num:=@num+1 /* increment */
FROM table t, (SELECT @num:=0) as c
ORDER BY name ASC;
Hard part: How to number subsets of rows depending on the value of one field?
Use a variable to count (see above) and a second variable as a state holder. We use the same principle as above, but now we use a variable to save the value of the field we depend on. If the value changes, we reset the counter variable. Again: this second variable has to be declared and defined. New part: resetting a different variable depending on the content of the state variable:
SELECT
l_id,
l_curr_time,
r_curr_time,
@num := IF( /* (re)set num (the counter)... */
@l_curr_time = l_curr_time,
@num:= @num + 1, /* increment if the variable equals the actual l_curr_time field value */
1 /* reset to 1 if the values are not equal */
) as row_num,
@l_curr_time:=l_curr_time as lct /* state variable that holds the l_curr_time value */
FROM ( /* table from Step 1 of the explanation */
SELECT l_id, l_curr_time, r_curr_time
FROM left_table l
LEFT JOIN right_table r ON l.l_curr_time<=r.r_curr_time
ORDER BY l.l_curr_time, r.r_curr_time
) as joinedTable
Now we have a table that holds all combinations we want (but too many of them), and all rows are numbered depending on the value of the l_curr_time field. In other words: each subset is numbered from 1 to the number of matching r_curr_time values that are greater than or equal to l_curr_time.
Again the easy part: select all the values we want, depending on the row number
This part is easy. Because the table we created in the previous step is ordered and numbered, we can filter by the number (it has to be smaller than or equal to 2). Furthermore, we select only the columns we're interested in:
SELECT l_id, l_curr_time, r_curr_time, row_num
FROM ( /* table from step 3. */
SELECT
l_id,
l_curr_time,
r_curr_time,
@num := IF(
@l_curr_time = l_curr_time,
@num:= @num + 1,
1
) as row_num,
@l_curr_time:=l_curr_time as lct
FROM (
SELECT l_id, l_curr_time, r_curr_time
FROM left_table l
LEFT JOIN right_table r ON l.l_curr_time<=r.r_curr_time
ORDER BY l.l_curr_time, r.r_curr_time
) as joinedTable
) as numberedJoinedTable,(
SELECT @l_curr_time:='',@num:=0 /* define the state variable and the number variable */
) as counterTable
HAVING row_num<=2; /* the number has to be smaller or equal to 2 */
That's it. This statement returns exactly what you want. You can see this statement in action in this sqlfiddle.
JoshuaK has the right idea. I just think it could be expressed a little more succinctly...
How about:
SELECT n.l_id
, n.l_curr_time
, n.r_curr_time
FROM
( SELECT a.*
, CASE WHEN @prev = l_id THEN @i:=@i+1 ELSE @i:=1 END i
, @prev := l_id prev
FROM
( SELECT l.*
, r.r_curr_time
FROM left_table l
JOIN right_table r
ON r.r_curr_time >= l.l_curr_time
) a
JOIN
( SELECT @prev := null,@i:=0) vars
ORDER
BY l_id,r_curr_time
) n
WHERE i<=2;
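On the worry about the 10^8-row Cartesian product: if the lookup can be done in application code, sorting the right-hand values once and binary-searching them avoids comparing every pair. This is only a sketch of that alternative, not a replacement for the queries above, and the function name `two_closest` is made up.

```python
from bisect import bisect_left


def two_closest(left_times, right_times, how_many=2):
    """For each left value, return up to `how_many` right values >= it."""
    ordered = sorted(right_times)
    result = []
    for l_time in left_times:
        first = bisect_left(ordered, l_time)   # index of first value >= l_time
        for r_time in ordered[first:first + how_many]:
            result.append((l_time, r_time))
    return result


pairs = two_closest([3, 4, 6, 10, 13], [1, 5, 7, 8, 11, 12])
```

With the sample data, `pairs` contains the same eight (l_curr_time, r_curr_time) combinations as the expected result in the question, and 13 produces no rows because no right-hand value is large enough.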

Enumerate records sequentially, grouped and by date, in MySQL

This seems like such a simple question, and I'm terrified that I might be bashed with the duplicate question hammer, but here's what I have:
ID Date
1 1/11/01
1 3/3/03
1 2/22/02
2 1/11/01
2 2/22/02
All I need to do is enumerate the records, based on the date, and grouped by ID! As such:
ID Date Num
1 1/11/01 1
1 3/3/03 3
1 2/22/02 2
2 1/11/01 1
2 2/22/02 2
This is very similar to this question, but it's not working for me. This would be great but it's not MySQL.
I've tried to use group by but it doesn't work, as in
SELECT ta.*, count(*) as Num
FROM temp_a ta
GROUP BY `ID` ORDER BY `ID`;
which clearly doesn't work, since GROUP BY always collapses each group to a single row.
Any advice greatly appreciated.
Let's assume the table to be as follows:
CREATE TABLE q43381823(id INT, dt DATE);
INSERT INTO q43381823 VALUES
(1, '2001-01-11'),
(1, '2003-03-03'),
(1, '2002-02-22'),
(2, '2001-01-11'),
(2, '2002-02-22');
Then, one of the ways in which the query to get the desired output could be written is:
SELECT q.*,
CASE WHEN (
IF(@id != q.id, @rank := 0, @rank := @rank + 1)
) >=1 THEN @rank
ELSE @rank := 1
END as rank,
@id := q.id AS buffer_id
FROM q43381823 q
CROSS JOIN (
SELECT @rank:= 0,
@id := (SELECT q2.id FROM q43381823 AS q2 ORDER BY q2.id LIMIT 1)
) x
ORDER BY q.id, q.dt
Output:
id | dt | rank | buffer_id
-------------------------------------------------
1 | 2001-01-11 | 1 | 1
1 | 2002-02-22 | 2 | 1
1 | 2003-03-03 | 3 | 1
2 | 2001-01-11 | 1 | 2
2 | 2002-02-22 | 2 | 2
You can ignore the buffer_id column in the output; it's irrelevant to the result, but required for resetting rank.
SQL Fiddle Demo
Explanation:
The @id variable keeps track of every id in the row, based on the sorted order of the output. In the initial iteration, we set it to the id of the first record that may be obtained in the final result. See the sub-query SELECT q2.id FROM q43381823 AS q2 ORDER BY q2.id LIMIT 1
@rank is set to 0 initially and is by default incremented for every subsequent row in the result set. However, when the id changes, we reset it back to 1. Please see the CASE - WHEN - ELSE construct in the query for this.
The final output is sorted first by id and then by dt. This ensures that @rank is set incrementally for every subsequent dt field within the same id, but gets reset to 1 whenever a new id group begins to show up in the result set.
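On MySQL 8.0 and later the same numbering can be written with the ROW_NUMBER() window function instead of variables. The sketch below runs that query against an in-memory SQLite database through Python's sqlite3 module purely as a stand-in so it can be tried without a MySQL server. It assumes SQLite 3.25 or newer for window function support, and the column alias `num` is my own choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE q43381823 (id INT, dt DATE)")
conn.executemany(
    "INSERT INTO q43381823 VALUES (?, ?)",
    [(1, "2001-01-11"), (1, "2003-03-03"), (1, "2002-02-22"),
     (2, "2001-01-11"), (2, "2002-02-22")],
)
# Number the rows of each id separately, ordered by date.
numbered = conn.execute(
    """
    SELECT id, dt,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY dt) AS num
    FROM q43381823
    ORDER BY id, dt
    """
).fetchall()
conn.close()
```

PARTITION BY id restarts the counter for each id and ORDER BY dt decides the numbering inside the group, so no helper column like buffer_id is needed.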

Return the k rows that appear the most

Let's say I have this table
photo_id user_id tag
0 0 Car
0 0 Bridge
0 0 Sky
20 1 Car
20 1 Bridge
2 2 Bridge
2 2 Cat
1 3 Cat
I need to return the k tags that appear the most, WITHOUT USING LIMIT.
The tie-breaker for tags that appear the same number of times is lexicographic order (the smallest tag gets the highest rank).
For each tag I also need the number of times it appears.
for example, for the table above with k=2 the output should be:
Tag #
Bridge 3
Car 2
and for k=4:
Tag #
Bridge 3
Car 2
Cat 2
Sky 1
Try this :
SELECT t1.tag, COUNT(*) as mycount FROM table t1
GROUP BY t1.tag
ORDER BY mycount DESC
LIMIT 2;
Replace the limit amount with your k value.
Inserting data into table:
INSERT INTO new_table VALUES
(0,0,'Car'),
(0,0,'Bridge'),
(0,0,'Sky'),
(20,1,'Car'),
(20,1,'Bridge'),
(0,0,'bottle');
To query:
SELECT tag, COUNT(1) FROM new_table
GROUP BY tag HAVING COUNT(1) = (
SELECT MIN(c) FROM
(
SELECT COUNT(1) AS c FROM new_table GROUP BY tag
) AS temp
)
Output:
+--------+----------+
| tag | count(1) |
+--------+----------+
| bottle | 1 |
| Sky | 1 |
+--------+----------+
Note: this returns the tags with the smallest count.
Although this is homework and we are not supposed to answer such questions (not until you've shown that you attempted to solve it and didn't get the desired result), I got a little curious about not using LIMIT in this question, so I am posting here.
The idea is to rank the result and then select only the rows whose rank is less than or equal to k. The rank column is like adding an S.No. (serial number) column to your result and selecting up to the desired number.
DDL statements:
CREATE TABLE new_table(
photo_id INTEGER,
user_id INTEGER,
tag VARCHAR(10)
);
INSERT INTO new_table VALUES
(0, 0, 'Car'),
(0, 0, 'Bridge'),
(0, 0, 'Sky'),
(20, 1, 'Car'),
(20, 1, 'Bridge'),
(2, 2, 'Bridge'),
(2, 2, 'Cat'),
(1, 3, 'Cat');
Query:
SELECT
tag, tag_count,
@k := @k + 1 AS k
FROM (
SELECT
tag,
COUNT(*) AS tag_count
FROM new_table
GROUP BY tag
-- tag ASC breaks ties between equal counts alphabetically
ORDER BY tag_count DESC, tag ASC
) AS temp, (SELECT @k := 0) AS k
WHERE @k < 2;
Check this SQLFiddle.
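The ranking rule (highest count first, alphabetical order on ties, keep the first k) is easy to check with a short Python sketch. The function name `top_k_tags` is hypothetical; the tag list is the tag column of the table in the question.

```python
from collections import Counter


def top_k_tags(tags, k):
    """Return the k most frequent tags; ties go to the alphabetically smaller tag."""
    counts = Counter(tags)
    # Sort by descending count, then ascending tag name.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


tags = ["Car", "Bridge", "Sky", "Car", "Bridge", "Bridge", "Cat", "Cat"]
top_two = top_k_tags(tags, 2)
top_four = top_k_tags(tags, 4)
```

Here `top_two` is Bridge (3) and Car (2), and `top_four` adds Cat (2) and Sky (1), matching the expected output in the question.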

Select rows until a total amount is met in a column (mysql)

I have seen this issue in SF, but me being a noob I just can't get my fried brain around them. So please forgive me if this feels like repetition.
My Sample Table
--------------------------
ID | Supplier | QTY
--------------------------
1 1 2
2 1 2
3 2 5
4 3 2
5 1 3
6 2 4
I need to get the rows "UNTIL" the cumulative total for "QTY" is equal or greater than 5 in descending order for a particular supplier id.
In this example, for supplier 1, it will be rows with the ids of 5 and 2.
Id - unique primary key
Supplier - foreign key, there is another table for supplier info.
Qty - double
It ain't pretty, but I think this does it and maybe it can be the basis of something less cumbersome. Note that I use a "fake" INNER JOIN just to get some variable initialized for the first time--it serves no other role.
SELECT ID,
supplier,
qty,
cumulative_qty
FROM
(
SELECT
ID,
supplier,
qty,
-- next line keeps a running total quantity by supplier id
@cumulative_quantity := if (@sup <> supplier, qty, @cumulative_quantity + qty) as cumulative_qty,
-- next is 0 for running total < 5 by supplier, 1 the first time >= 5, and ++ after
@reached_five := if (@cumulative_quantity < 5, 0, if (@sup <> supplier, 1, @reached_five + 1)) as reached_five,
-- next takes note of changes in supplier being processed
@sup := if(@sup <> supplier, supplier, @sup) as sup
FROM
(
-- this subquery is key for getting things in supplier order, by descending id
SELECT *
FROM `sample_table`
ORDER BY supplier, ID DESC
) reverse_order_by_id
INNER JOIN
(
-- initialize the variables used to their first ever values
SELECT @cumulative_quantity := 0, @sup := 0, @reached_five := 0
) only_here_to_initialize_variables
) t_alias
where reached_five <= 1 -- only get things up through the time we first get to 5 or above.
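The intended behaviour (walk a supplier's rows by descending id and stop as soon as the running total reaches the target) can be written down in a few lines of Python for comparison. The function name `rows_until_total` and the tuple layout `(id, supplier, qty)` are assumptions made for this sketch.

```python
def rows_until_total(rows, supplier, target):
    """Walk one supplier's rows by descending id until qty sums to >= target."""
    selected = []
    running_total = 0
    candidates = sorted(
        (row for row in rows if row[1] == supplier),
        key=lambda row: row[0],
        reverse=True,
    )
    for row in candidates:
        selected.append(row)
        running_total += row[2]
        if running_total >= target:
            break          # the row that crosses the target is still included
    return selected


sample_table = [(1, 1, 2), (2, 1, 2), (3, 2, 5), (4, 3, 2), (5, 1, 3), (6, 2, 4)]
picked = rows_until_total(sample_table, supplier=1, target=5)
```

For supplier 1 this returns the rows with ids 5 and 2, as the question expects. If the supplier's quantities never add up to the target, all of its rows are returned.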
How about this? Using two variables.
SQLFIDDLE DEMO
Query:
set @tot:=0;
set @sup:=0;
select x.id, x.supplier, x.ctot
from (
select id, supplier, qty,
@tot:= (case when @sup = supplier then
@tot + qty else qty end) as ctot,
@sup:=supplier
from demo
order by supplier asc, id desc) x
where x.ctot >=5
;
| ID | SUPPLIER | CTOT |
------------------------
| 2 | 1 | 5 |
| 1 | 1 | 7 |
| 3 | 2 | 5 |
Standard SQL has no concept of 'what row number am I up to', so this can only be implemented using something called a cursor. Writing code with cursors is something like writing code with for loops in other languages.
An example of how to use cursors is here:
http://dev.mysql.com/doc/refman/5.0/en/cursors.html
Here is a rough demo of a cursor; maybe it's helpful. Note that it is written in SQL Server (T-SQL) syntax, not MySQL.
CREATE TABLE #t
(
ID INT IDENTITY,
Supplier INT,
QTY INT
);
TRUNCATE TABLE #t;
INSERT INTO #t (Supplier, QTY)
VALUES (1, 2),
(1, 2),
(2, 5),
(3, 2),
(1, 3);
DECLARE @sum AS INT;
DECLARE @qty AS INT;
DECLARE @totalRows AS INT;
DECLARE curSelectQTY CURSOR
FOR SELECT QTY
FROM #t
ORDER BY QTY DESC;
OPEN curSelectQTY;
SET @sum = 0;
SET @totalRows = 0;
FETCH NEXT FROM curSelectQTY INTO @qty;
WHILE @@FETCH_STATUS = 0
BEGIN
SET @sum = @sum + @qty;
SET @totalRows = @totalRows + 1;
IF @sum >= 5
BREAK;
-- fetch the next row; without this the loop would never advance
FETCH NEXT FROM curSelectQTY INTO @qty;
END
SELECT TOP (@totalRows) *
FROM #t
ORDER BY QTY DESC;
CLOSE curSelectQTY;
DEALLOCATE curSelectQTY;
SELECT x.*
FROM supplier_stock x
JOIN supplier_stock y
ON y.supplier = x.supplier
AND y.id >= x.id
GROUP
BY x.supplier
, x.id
HAVING SUM(y.qty) <=5;