SQL - Select one record randomly per person - mysql

I currently have one table containing details of 100 people's first transactions with each product in the transaction as a separate record. Some people only bought 1 product in their first transaction while others may have bought up to 4 different products in their first transaction.
I'd like to create a table containing one distinct record per person: if a person purchased multiple different products in their first transaction, one of those products should be selected at random. How would I go about doing this in MySQL?

Suppose your table is like this:
CREATE TABLE MyTable (
person_id INT,
product_id INT,
transaction_date DATE
);
It might have other columns too, but I'm not going to guess at them.
I would do this by "shuffling" the rows for each person. Sort the table by person first, then randomize the rows for the given person. You can do this by using ORDER BY person_id, RAND().
Then pick the first row for each person. Here is one way to do it, using MySQL session variables to track the row number and restart the numbering from 1 for each distinct person_id:
SELECT person_id, product_id, transaction_date
FROM (
SELECT IF(@p=person_id, @r:=@r+1, @r:=1) AS row_num,
@p:=person_id AS person_id, product_id, transaction_date
FROM (SELECT @p:=0, @r:=0) AS _init
CROSS JOIN MyTable
ORDER BY person_id, RAND()
) AS T
WHERE row_num = 1;
MySQL 8.0 and later implement window functions, which make this a little more standard:
SELECT person_id, product_id, transaction_date
FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY RAND()) AS row_num,
person_id, product_id, transaction_date
FROM MyTable
) AS T
WHERE row_num = 1;
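The window-function query can be tried without a MySQL server. Below is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL, so RAND() becomes SQLite's RANDOM(). The sample rows are made up for illustration, and it assumes the bundled SQLite is version 3.25 or newer (the first release with window functions).

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable (person_id INT, product_id INT, transaction_date DATE)"
)

# Made-up sample data: person 1 bought three products, person 2 one, person 3 two.
rows = [
    (1, 10, "2017-01-05"), (1, 11, "2017-01-05"), (1, 12, "2017-01-05"),
    (2, 20, "2017-02-01"),
    (3, 30, "2017-03-09"), (3, 31, "2017-03-09"),
]
conn.executemany("INSERT INTO MyTable VALUES (?, ?, ?)", rows)

# Same shape as the MySQL 8.0 query; RANDOM() is SQLite's counterpart of RAND().
query = """
SELECT person_id, product_id, transaction_date
FROM (
  SELECT ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY RANDOM()) AS row_num,
         person_id, product_id, transaction_date
  FROM MyTable
) AS T
WHERE row_num = 1
"""
picked = conn.execute(query).fetchall()

# One row per person; which product appears can differ from run to run.
for person_id, product_id, transaction_date in sorted(picked):
    print(person_id, product_id, transaction_date)
```

Each run should print exactly one line per person, and the chosen product for persons 1 and 3 may change between runs.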
P.S.: By the way, when you ask questions about SQL, you should run SHOW CREATE TABLE MyTable and include the result in your question. It helps people not have to guess about your table, columns, indexes, data types, etc. They may be able to write an example that is closer to what you need.

How to use a column in select statement which is not in aggregate function nor in group by clause? [duplicate]

In a past interview I was given a table named office (with the columns customerID, ordervalue, and orderdate used in the query below) and asked the following question.
Q. The most recent order value for each customer?
Answer which I have given in interview:
select customerID, ordervalue, max(orderdate)
from office
group by customerID;
I know that because ordervalue is neither inside an aggregate function nor in the GROUP BY clause, this query will throw an error in SQL, but I want to know how to answer this question.
Interviewers have often asked me questions where I need a column in the SELECT list that is neither aggregated nor in the GROUP BY. So I want to know, in general and with an example, what the workaround is, so that I can answer these types of questions.
The workaround depends on what is being asked. For the requirements you have above, I think it makes sense to first build (customerid, MAX(orderdate)) pairs.
SELECT customerid, MAX(orderdate)
FROM office
GROUP BY customerid;
Then you can use them to match the row you need from the table.
SELECT customerid, ordervalue, orderdate
FROM office
WHERE (customerid, orderdate) IN
(SELECT customerid, MAX(orderdate)
FROM office
GROUP BY customerid);
Note that this assumes there is only one order per customer per day. If there were more than one, you would see all of the most recent orders for that customer. If needed, you could also add a GROUP BY to the outer query to collapse them into one row:
SELECT customerid, MAX(ordervalue), orderdate
FROM office AS tt
WHERE (customerid, orderdate) IN
(SELECT customerid, MAX(orderdate)
FROM office
GROUP BY customerid)
GROUP BY customerid, orderdate;
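To see how the two queries differ when a customer has several orders on the latest day, here is a small sketch using Python's sqlite3 module in place of MySQL. The sample rows are invented, and it assumes a SQLite build that accepts row values in IN (version 3.15 or newer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE office (customerid INT, ordervalue INT, orderdate TEXT)")

# Invented data: customer 3 has two orders on their most recent day.
conn.executemany("INSERT INTO office VALUES (?, ?, ?)", [
    (1, 100, "2020-01-01"), (1, 250, "2020-03-15"),
    (2, 80, "2020-02-10"),
    (3, 40, "2020-01-20"), (3, 60, "2020-04-02"), (3, 75, "2020-04-02"),
])

# Match each row against the (customerid, MAX(orderdate)) pairs.
with_ties = conn.execute("""
  SELECT customerid, ordervalue, orderdate
  FROM office
  WHERE (customerid, orderdate) IN
        (SELECT customerid, MAX(orderdate) FROM office GROUP BY customerid)
  ORDER BY customerid, ordervalue
""").fetchall()

# Same filter, plus an outer GROUP BY that keeps the largest value per day.
collapsed = conn.execute("""
  SELECT customerid, MAX(ordervalue), orderdate
  FROM office
  WHERE (customerid, orderdate) IN
        (SELECT customerid, MAX(orderdate) FROM office GROUP BY customerid)
  GROUP BY customerid, orderdate
  ORDER BY customerid
""").fetchall()

print(with_ties)   # customer 3 appears twice
print(collapsed)   # one row per customer
```

Whether MAX(ordervalue) is the right way to break the tie depends on what "most recent order value" should mean when two orders share a date; it is only one possible choice.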
If the non-aggregate column you need in the SELECT is functionally dependent on the column in the GROUP BY, you can add a subquery in the SELECT.
We can extend your example by adding a name column, where the name of different customers could be the same. If you wanted name instead of ordervalue, just match the customerid of the outer query to get name.
SELECT customerid,
(SELECT name FROM office WHERE customerid=o.customerid LIMIT 1) AS name,
MAX(orderdate)
FROM office AS o
GROUP BY customerid;
You are approaching the task as follows: Aggregate all rows to get one result line per customer, showing the maximum order date and its order value. The problem with this: you'd need an aggregate function to get the value for the maximum order date. The only DBMS I know of featuring such a function is Oracle with KEEP FIRST/LAST.
So look at the task from a different angle. Don't think aggregation-wise where you could count and add up values for a group and get the minimum or maximum value over all the group's rows, because after all you just want to pick single rows. (That is, pick the top 1 row per customer.) In order to pick rows, you'll use a WHERE clause.
One option has been shown by Steve in his answer:
select *
from office
where (customerid, orderdate) in
(
select customerid, max(orderdate)
from office
group by customerid
);
This is a good, straightforward approach. (Some DBMSs, though, don't support tuples in IN clauses.)
Another way to get the "best" row for a customer is to pick those rows for which no better row exists:
select *
from office
where not exists
(
select null
from office better
where better.customerid = office.customerid
and better.orderdate > office.orderdate
);
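As a quick check of the NOT EXISTS idea, the following sketch runs it against a few invented rows using Python's sqlite3 module instead of MySQL. A row survives only if no other row for the same customer has a later orderdate, so ties on the latest date are all kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE office (customerid INT, ordervalue INT, orderdate TEXT)")

# Invented data; ISO-formatted date strings compare correctly as text.
conn.executemany("INSERT INTO office VALUES (?, ?, ?)", [
    (1, 100, "2020-01-01"), (1, 250, "2020-03-15"),
    (2, 80, "2020-02-10"),
    (3, 40, "2020-01-20"), (3, 60, "2020-04-02"), (3, 75, "2020-04-02"),
])

# Keep a row only when no "better" (later) row exists for the same customer.
best_rows = conn.execute("""
  SELECT customerid, ordervalue, orderdate
  FROM office
  WHERE NOT EXISTS (
    SELECT NULL
    FROM office AS better
    WHERE better.customerid = office.customerid
      AND better.orderdate > office.orderdate
  )
  ORDER BY customerid, ordervalue
""").fetchall()

print(best_rows)
```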
And then there is the option to use a window function (aka analytic function) in order to get those rows. One example is to get the maximum dates along with the rows' data:
select customerid, ordervalue, orderdate
from
(
select
customerid, ordervalue, orderdate,
max(orderdate) over (partition by customerid) as max_orderdate
from office
) t
where orderdate = max_orderdate;
And with ROW_NUMBER, RANK, and DENSE_RANK there are window functions to assign numbers to your rows in the order you want. You number them such that the best rows get number 1 and pick them. The big advantage here: you can apply any order, deal with ties and not only get the top 1, but the top n rows.
select customerid, ordervalue, orderdate
from
(
select
customerid, ordervalue, orderdate,
row_number() over (partition by customerid order by orderdate desc) as rn
from office
) t
where rn = 1;
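The point about ties and top-n can be illustrated with a small sketch, again using Python's sqlite3 module as a stand-in (window functions need SQLite 3.25 or newer). The helper `numbered` and the sample rows are made up for this example: ROW_NUMBER with a tie-breaking sort key returns exactly one row per customer, RANK keeps tied rows, and raising the limit gives the top n.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE office (customerid INT, ordervalue INT, orderdate TEXT)")
conn.executemany("INSERT INTO office VALUES (?, ?, ?)", [
    (1, 100, "2020-01-01"), (1, 250, "2020-03-15"),
    (2, 80, "2020-02-10"),
    (3, 40, "2020-01-20"), (3, 60, "2020-04-02"), (3, 75, "2020-04-02"),
])

def numbered(function_name, order_by, limit):
    """Number each customer's rows with the given window function and
    return the rows whose number is at most `limit` (hypothetical helper)."""
    sql = f"""
      SELECT customerid, ordervalue, orderdate
      FROM (
        SELECT customerid, ordervalue, orderdate,
               {function_name}() OVER (
                 PARTITION BY customerid ORDER BY {order_by}
               ) AS rn
        FROM office
      ) AS t
      WHERE rn <= ?
      ORDER BY customerid, orderdate DESC, ordervalue DESC
    """
    return conn.execute(sql, (limit,)).fetchall()

# Exactly one row per customer: ties on the date are broken by ordervalue.
top1 = numbered("ROW_NUMBER", "orderdate DESC, ordervalue DESC", 1)

# RANK gives tied rows the same number, so both of customer 3's rows stay.
top1_with_ties = numbered("RANK", "orderdate DESC", 1)

# The two most recent orders per customer.
top2 = numbered("ROW_NUMBER", "orderdate DESC, ordervalue DESC", 2)

print(top1)
print(top1_with_ties)
print(top2)
```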

Selecting Data from Normalized Tables

I'm stuck trying to write this query; I think my brain is just a little fried tonight. I have a table that stores a row whenever a person performs an action (clocking in, clocking out, going to lunch, returning from lunch), and I need to return a list of all the primary IDs for the people whose last action is not clock_out. The catch is that it needs to be a reasonably fast query.
Table Structure:
ID | person_id | status | datetime | shift_type
ID = Primary Key for this table
person_id = The ID I want to return if their status does not equal clock_out
status = clock_in, lunch_start, lunch_end, break_start, break_end, clock_out
datetime = The time the record was added
shift_type = Not Important
Previously I ran this query to find people who were still clocked in during a specific time period, but now I need it to work for any point in time. The queries I have tried scan thousands and thousands of records, which makes them far too slow.
I need to return a list of all the primary ID's for the people whose last action is not clock_out.
One option uses window functions, available in MySQL 8.0:
select id
from (
select t.*, row_number() over(partition by person_id order by datetime desc) rn
from mytable t
) t
where rn = 1 and status <> 'clock_out'
In earlier versions, one option uses a correlated subquery:
select id
from mytable t
where
datetime = (select max(t1.datetime) from mytable t1 where t1.person_id = t.person_id)
and status <> 'clock_out'
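Here is a minimal sketch of the correlated-subquery version, run through Python's sqlite3 module rather than MySQL. The rows are invented, the shift_type column is left out, and the index name `idx_person_time` is hypothetical. A composite index on (person_id, datetime) is an assumption about what would likely help the MAX lookup on a large table, not something measured here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
  CREATE TABLE mytable (
    id INTEGER PRIMARY KEY,
    person_id INT,
    status TEXT,
    datetime TEXT
  )
""")
# Hypothetical index: lets the subquery find each person's latest row quickly.
conn.execute("CREATE INDEX idx_person_time ON mytable (person_id, datetime)")

# Invented data: person 1 clocked out, person 2 is at lunch, person 3 is clocked in.
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)", [
    (1, 1, "clock_in",    "2021-05-01 08:00:00"),
    (2, 1, "clock_out",   "2021-05-01 17:00:00"),
    (3, 2, "clock_in",    "2021-05-01 09:00:00"),
    (4, 2, "lunch_start", "2021-05-01 12:00:00"),
    (5, 3, "clock_in",    "2021-05-01 10:00:00"),
])

# Latest row per person, kept only when that row is not a clock_out.
still_in = conn.execute("""
  SELECT id, person_id
  FROM mytable AS t
  WHERE datetime = (SELECT MAX(t1.datetime)
                    FROM mytable AS t1
                    WHERE t1.person_id = t.person_id)
    AND status <> 'clock_out'
  ORDER BY id
""").fetchall()

print(still_in)
```

The sketch returns person_id next to id so it is easy to see which people are still on the clock.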
After looking through it further, this was my solution -
SELECT * FROM (
SELECT `status`,`person_id` FROM `timeclock` ORDER BY `datetime` DESC
) AS tmp_table GROUP BY `person_id`
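The idea is to order the rows by datetime, newest first, and then group them by person_id so that each group keeps its most recent row. Note that this relies on MySQL's non-standard handling of GROUP BY: it fails when ONLY_FULL_GROUP_BY is enabled, and MySQL does not guarantee which row of each group is returned, so the result may not always be the most recent one.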

How can I get a column value from another table outside of UNION ALL?

In my SQL I am getting transactions relating to a user and a business. However, I also need to get the name of the business. It is found in column business_name under table Businesses. In my example SQL, I would want to get the business name for business_id=1. My current code works aside from not getting the business name.
(SELECT TRUNCATE(code_reward_amount, 2) AS amount, UNIX_TIMESTAMP(code_redeemed_date) AS date, 0 AS action_number
FROM CodesRedeemed
WHERE code_redeemed_by_user_id=191 AND code_business_id=1)
UNION ALL
(SELECT TRUNCATE(action_amount, 2) AS amount, UNIX_TIMESTAMP(action_date) AS date, action_number
FROM BusinessAccountActions
WHERE action_user_id=191 AND action_business_id=1)
ORDER BY date DESC
LIMIT 100
In my second code attempt, it does get the business name, however, it is not efficient to do the select in every row since the business name would be the same for each row. How can I do it once and apply it to each row? Perhaps somewhere outside of the UNION ALL? Here is my working code, however, I would like to optimize it so it doesn't SELECT from Businesses for the business_name in every single row (since the business_name is guaranteed to be the same for all rows since they share the same business_id).
(SELECT TRUNCATE(code_reward_amount, 2) AS amount, UNIX_TIMESTAMP(code_redeemed_date) AS date, 0 AS action_number, (SELECT business_name FROM Businesses WHERE business_id=1) AS business_name
FROM CodesRedeemed
WHERE code_redeemed_by_user_id=191 AND code_business_id=1)
UNION ALL
(SELECT TRUNCATE(action_amount, 2) AS amount, UNIX_TIMESTAMP(action_date) AS date, action_number, (SELECT business_name FROM Businesses WHERE business_id=1) AS business_name
FROM BusinessAccountActions
WHERE action_user_id=191 AND action_business_id=1)
ORDER BY date DESC
LIMIT 100
business_id would change depending on the business. I am just testing it for business_id 1 right now. How would I optimize (mainly not checking for business_name in every single row)? Thank you.
Use a JOIN.
SELECT u.amount, u.date, b.business_name, u.action_number
FROM (
(SELECT TRUNCATE(code_reward_amount, 2) AS amount, UNIX_TIMESTAMP(code_redeemed_date) AS date, 0 AS action_number
FROM CodesRedeemed
WHERE code_redeemed_by_user_id=191 AND code_business_id=1)
UNION ALL
(SELECT TRUNCATE(action_amount, 2) AS amount, UNIX_TIMESTAMP(action_date) AS date, action_number
FROM BusinessAccountActions
WHERE action_user_id=191 AND action_business_id=1)
ORDER BY date DESC
LIMIT 100) AS u
CROSS JOIN Businesses AS b
WHERE b.business_id = 1
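A stripped-down sketch of the same join, using Python's sqlite3 module as a stand-in, shows the business name being attached to every row of the UNION ALL result. The data is invented, dates are stored as plain integers, and the MySQL-specific TRUNCATE and UNIX_TIMESTAMP calls are left out because SQLite does not have them; SQLite also does not accept parentheses around the individual SELECTs of a UNION, so they are dropped here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
  CREATE TABLE Businesses (business_id INT, business_name TEXT);
  CREATE TABLE CodesRedeemed (
    code_reward_amount REAL, code_redeemed_date INT,
    code_redeemed_by_user_id INT, code_business_id INT);
  CREATE TABLE BusinessAccountActions (
    action_amount REAL, action_date INT, action_number INT,
    action_user_id INT, action_business_id INT);

  -- Invented sample rows.
  INSERT INTO Businesses VALUES (1, 'Acme Coffee'), (2, 'Other Shop');
  INSERT INTO CodesRedeemed VALUES
    (5.0, 1000, 191, 1), (7.5, 3000, 191, 1),
    (9.0, 2000, 191, 2), (1.0, 1500, 42, 1);
  INSERT INTO BusinessAccountActions VALUES
    (2.25, 2500, 7, 191, 1), (4.0, 500, 8, 191, 2);
""")

# The UNION ALL runs once as a derived table; the single matching
# Businesses row is then joined onto every transaction row.
transactions = conn.execute("""
  SELECT u.amount, u.date, b.business_name, u.action_number
  FROM (
    SELECT code_reward_amount AS amount, code_redeemed_date AS date,
           0 AS action_number
    FROM CodesRedeemed
    WHERE code_redeemed_by_user_id = 191 AND code_business_id = 1
    UNION ALL
    SELECT action_amount, action_date, action_number
    FROM BusinessAccountActions
    WHERE action_user_id = 191 AND action_business_id = 1
    ORDER BY date DESC
    LIMIT 100
  ) AS u
  CROSS JOIN Businesses AS b
  WHERE b.business_id = 1
  ORDER BY u.date DESC
""").fetchall()

print(transactions)
```

The outer ORDER BY is an addition in this sketch so the final row order does not depend on how the join is executed.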
Using a JOIN as Bamar suggested is a perfectly acceptable way to do it, and is how I would most likely do it.
However, you could use a user-defined variable and replace that additional select with it.
SELECT business_name FROM Businesses WHERE business_id=1 LIMIT 1 INTO @bname;
(SELECT TRUNCATE(code_reward_amount, 2) AS amount, UNIX_TIMESTAMP(code_redeemed_date) AS date, 0 AS action_number, @bname AS business_name
FROM CodesRedeemed
WHERE code_redeemed_by_user_id=191 AND code_business_id=1)
UNION ALL
(SELECT TRUNCATE(action_amount, 2) AS amount, UNIX_TIMESTAMP(action_date) AS date, action_number, @bname AS business_name
FROM BusinessAccountActions
WHERE action_user_id=191 AND action_business_id=1)
ORDER BY date DESC
LIMIT 100

MySQL query for weighted voting - how to calculate with values assigned to different columns

I have a voting application that writes values to a MySQL database table. It is a preference/weighted voting system, so people choose a first option, second option, and third option. These all go into separate fields in the table. I'm looking for a way to write a query that will assign numerical values to the responses (3 for a first response, 2 for a second, 1 for a third) and then display each option with its summed score. I've been able to do this for the total number of votes
select count(name) as votes,name
from (select 1st_option as name from votes
union all
select 2nd_option from votes
union all
select 3rd_option from votes) as tbl
group by name
having count(name) > 0
order by 1 desc;
but haven't quite figured out how to assign values to response in each column and then pull them together. Any help is much appreciated. Thanks!
You could do something like this:
select sum(score) as votes,name
from (select 1st_option as name, 3 as score from votes
union all
select 2nd_option as name, 2 as score from votes
union all
select 3rd_option as name, 1 as score from votes) as tbl
group by name;
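To check the arithmetic, here is a small sketch with three made-up ballots, run through Python's sqlite3 module instead of MySQL. SQLite needs column names that start with a digit to be quoted, so they appear in double quotes; the ORDER BY is an addition to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE votes ("1st_option" TEXT, "2nd_option" TEXT, "3rd_option" TEXT)'
)

# Three invented ballots, each listing first, second, and third choice.
conn.executemany("INSERT INTO votes VALUES (?, ?, ?)", [
    ("A", "B", "C"),
    ("B", "A", "C"),
    ("A", "C", "B"),
])

# Tag each column with its weight (3, 2, 1), stack them, then sum per name.
scores = conn.execute("""
  SELECT SUM(score) AS votes, name
  FROM (SELECT "1st_option" AS name, 3 AS score FROM votes
        UNION ALL
        SELECT "2nd_option" AS name, 2 AS score FROM votes
        UNION ALL
        SELECT "3rd_option" AS name, 1 AS score FROM votes) AS tbl
  GROUP BY name
  ORDER BY votes DESC, name
""").fetchall()

print(scores)
```

With these ballots, A scores 3 + 2 + 3 = 8, B scores 2 + 3 + 1 = 6, and C scores 1 + 1 + 2 = 4.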

Determine total amount of top result returned

I would like to determine two things from a single query:
The most prevalent value of a column in a table
The number of times that value occurs in the table
Example Table:
user_id some_field
1 data
2 data
1 data
The above would return user_id 1 as the most prevalent value in the table, and it would return 2 as the total number of times it occurs in the table.
I have done my research and I came across two types of queries.
GROUP BY user_id ORDER BY COUNT(*) DESC
SUM
The problem is that I can't figure out how to use these two queries in conjunction with one another. For example, consider the following query which successfully returns the most prevalent column.
$top_user = "SELECT user_id FROM table_name GROUP BY user_id ORDER BY COUNT(*) DESC";
The above query returns "1" based on the example table shown above. Now, I would like to be able to return "2" for the total amount of times the user_id (1) was found in the table.
Is this by any chance possible?
Thanks,
Evan
You can include count(*) in the SELECT list:
SELECT user_id, count(*) as totaltimes from table_name
GROUP BY user_id ORDER BY count(*) DESC;
If you want only the first one:
SELECT user_id, count(*) as totaltimes from table_name
GROUP BY user_id ORDER BY count(*) DESC LIMIT 1;
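Using the example table from the question, a quick sketch with Python's sqlite3 module (standing in for MySQL) shows both values coming back from the single query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (user_id INT, some_field TEXT)")

# The three example rows from the question.
conn.executemany("INSERT INTO table_name VALUES (?, ?)", [
    (1, "data"),
    (2, "data"),
    (1, "data"),
])

# Count rows per user_id, sort by that count, and keep only the top group.
top_user = conn.execute("""
  SELECT user_id, COUNT(*) AS totaltimes
  FROM table_name
  GROUP BY user_id
  ORDER BY COUNT(*) DESC
  LIMIT 1
""").fetchone()

user_id, totaltimes = top_user
print(user_id, totaltimes)
```

If two user_id values are tied for the highest count, LIMIT 1 returns just one of them, and which one is not defined unless a further sort key is added.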