Joining table to itself with multiple join criteria logic - mysql

I'm trying to understand the logic behind the syntax below. Based on the following question, table and syntax:
Write a query that'll identify returning active users. A returning active user is a user that has made a second purchase within 7 days of any other of their purchases. Output a list of user_ids of these returning active users.
Column + Data Type:
id: int | user_id: int | item: varchar |created_at: datetime | revenue: int
SELECT DISTINCT(a1.user_id)
FROM amazon_transactions a1
JOIN amazon_transactions a2 ON a1.user_id=a2.user_id
AND a1.id <> a2.id
AND a2.created_at::date-a1.created_at::date BETWEEN 0 AND 7
ORDER BY a1.user_id
Why does the table need to be joined to itself in this case?
How does 'AND a1.id <> a2.id' portion of syntax contribute to the join?

You are looking for users that have 2 records on that table whose date distance is lower (or equal) than 7 days
To accomplish this, you treat the table as if it were 2 different (but equal tables) because you have to match a row on the first table with a row on the second table
Of course you don't want to match a row with itself, so
AND a1.id <> a2.id
accomplishes that

The table needs to be joined with itself because, you just have one table, and you want to find out returning users (by comparing the duration between transaction dates for the same user).
AND a1.id <> a2.id portion of the syntax removes the same transactions, i.e. prevents the transactions with the same id to be included in the joined table.

There are two scenarios I can think of based on the id column values. Are id column values generated based on timely sequence ? If so, to answer your first question ,we can but don't have to use join syntax. Here is how to achieve your goal using a correlated subquery , with sample data created.
create table amazon_transactions(id int , user_id int , item varchar(20),created_at datetime , revenue int);
insert amazon_transactions (id,user_id,created_at) values
(1,1,'2020-01-05 15:33:22'),
(2,2,'2020-01-05 16:33:22'),
(3,1,'2020-01-08 18:33:22'),
(4,1,'2020-01-22 17:33:22'),
(5,2,'2020-02-05 15:33:22'),
(6,2,'2020-03-05 15:33:22');
select * from amazon_transactions;
-- sample set:
| id | user_id | item | created_at | revenue |
+------+---------+------+---------------------+---------+
| 1 | 1 | NULL | 2020-01-05 15:33:22 | NULL |
| 2 | 2 | NULL | 2020-01-05 16:33:22 | NULL |
| 3 | 1 | NULL | 2020-01-08 18:33:22 | NULL |
| 4 | 1 | NULL | 2020-01-22 17:33:22 | NULL |
| 5 | 2 | NULL | 2020-02-05 15:33:22 | NULL |
| 6 | 2 | NULL | 2020-03-05 15:33:22 | NULL |
-- Here is the answer using a correlated subquery:
select distinct user_id
from amazon_transactions t
where datediff(
(select created_at from amazon_transactions where user_id=t.user_id and id-t.id>=1 limit 1 ),
created_at
)<=7
;
-- result:
| user_id |
+---------+
| 1 |
However,what if the id values are NOT transaction time based? Then the id values are not at all helpful in our requirement. In this case, a JOIN is more capable than a correlated subquery and we need to arrange the order based on transaction time for each user in order to make the necessary join condition. And to answer your second question, the AND a1.id <> a2.id portion of syntax contribute by excluding two of the same transaction making a pair. However, to my understanding the matching scope is too high to be effective. We only care if CONSECUTIVE transactions have a within-7-day gap, but the AND a1.id <> a2.id overdoes the job. For instance, we want to check the gap between transaction1 and transaction2,transaction2 and transaction3, NOT transaction1 and transaction3
Note: by using the user variable row_id trick, we can produce the row id which is used to match consecutive transactions for each user, thus eliminating the wasteful job of random transaction check.
select distinct t1.user_id
from
(select user_id,created_at,#row_id:=#row_id+1 as row_id
from amazon_transactions ,(select #row_id:=0) t
order by user_id,created_at)t1
join
(select user_id,created_at,#row_num:=#row_num+1 as row_num
from amazon_transactions ,(select #row_num:=0) t
order by user_id,created_at)t2
on t1.user_id=t2.user_id and t2.row_num-t1.row_id=1 and datediff(t2.created_at,t1.created_at)<=7
;
-- result
| user_id |
+---------+
| 1 |

Related

Query membership of group on date with open-ended memberships

Given the following table structure for tracking membership of given groups:
+----+----------+----------------+--------------+
| id | group_id | in_group_begin | in_group_end |
+----+----------+----------------+--------------+
| 1 | 10 | 2019-01-01 | 2019-02-01 |
| 1 | 11 | 2019-02-02 | 2019-03-01 |
| 1 | 12 | 2019-03-01 | NULL |
| 2 | 10 | 2019-01-01 | NULL |
+----+----------+----------------+--------------+
(Where in_group_end being NULL signifies this is their current group)
How would I form a query that would tell me, for example, what group_id each member was associated with on a given date?
... in_group_end IS NULL will give me their current group, not necessarily the group they were in
... in_group_end IS NULL OR in_group_end >= '{$date_str}' could give me multiple options
Ideally I would like something I can use in a joined query, e.g. with a table storing a persons name, address, etc. from which I expect only one row back.
Would some kind of IF stmt in the JOIN do it? or GROUP in a sub-query?
Consider the following logic, which would find all matches for 2019-01-15:
SELECT group_id
FROM yourTable
WHERE
'2019-01-15' >= in_group_begin AND
('2019-01-15' <= in_group_end OR in_group_end IS NULL);
The WHERE clause considers an input date a match if it lies in between the start and end dates or it is greater than the start date and there is no end date. Also, the WHERE clause as written can make use of an index.
Let's say you want to search for a date 2019-01-03.
SELECT
id,
group_id
FROM membership
WHERE '2019-01-03' BETWEEN in_group_begin AND IFNULL(in_group_end, CURRENT_DATE);
If you have another users table which stores details of users and id of that table is used in membership table using id field. You can do following query.
SELECT
u.id,
u.name,
u.address,
m.group_id
FROM users u
INNER JOIN membership m ON u.id = m.id
WHERE '2019-01-03' BETWEEN in_group_begin AND IFNULL(in_group_end, CURRENT_DATE);
Assuming there is a table users from which you want the user's details returned, join it to your table tablename like this:
select u.*, t.group_id
from users u inner join (
select
id, group_id, in_group_begin,
coalesce(in_group_end, current_date) in_group_end
from tablename
) t on t.id = u.id and #date between t.in_group_begin and t.in_group_end
Replace #date with the date you search for.

Order by a column in another table

I have two MySQL tables:
web_forums_threads
| tid | title |
|=================|
| 1 | News Post |
web_forums_posts
| pid | tid | content | date_created | date_modified |
|===========================================================|
| 1 | 1 | Today,.. | unix timestamp | null or timestamp |
| 2 | 1 | I agree! | unix timestamp | null or timestamp |
I want to SELECT * from web_forums_threads, and order by the most recent date_created value from web_forums_posts with the correct corresponding TID.
I feel as though the results I've found from Google may be incorrect for my case, because multiple rows can exist with the threads' TID.
The example that I tried (with no success):
SELECT * FROM web_forums_threads WHERE fid = :postfid ORDER BY (SELECT date_created FROM web_forums_posts WHERE web_forum_posts.tid = web_forums_threads.tid) DESC;
The syntax might be wrong but the concept is there. I don't want to add another column to the threads table because I'd just be storing info twice (the first post acts as the threads content).
You have to make the join between the two tables
SELECT * FROM web_forums_threads AS WFT, web_forums_posts AS WFP
WHERE WFT.tid=WFP.tid
ORDER BY WFP.date_created
It would be something like that
You can use rather JOIN.
SELECT T.tid, T.title
FROM web_forums_threads T
JOIN web_forums_posts P ON T.tid = P.tid
WHERE fid = :postfid
ORDER BY P.date_created;

How to merge column data using the last updated value in MySQL?

Somewhat confusing so its easier if I put down example and expected output to begin.
I have a table that could look like this: (Unit1 - Unit2 columns could span up to 30 columns in the same general format)
| ID | Name | Unit1_left | Unit2_left |
| 1 | Tom | 50 | NULL |
| 2 | Tom | NULL | 1 |
| 3 | Tom | 45 | NULL |
| 4 | Dan | NULL | NULL |
What I am trying to select is a table like this:
| Name | Unit1_left | Unit2_left |
| Tom | 45 | 1 |
| Dan | NULL | NULL |
What that is doing is grouping by name and attempting to find the last values in the 2 other columns if they exist (if not then it returns NULL).
I have looked at various other questions and they all say to use Max() however this will not work since it selects the highest value (incorrect). I have seen that in MsSQL there is a Last() function which looks vaguely like what I want it to do but its not implemented in MySQL and isn't exactly what I need anyway.
What I am trying to ask is, does anyone know of a possible method of selecting the data like this or if I will have to use a separate programming language to do this?
This will produce the result set you've described
SELECT dname.name,
l1value.unit1_left,
l2value.unit2_left
FROM (SELECT DISTINCT `name`
FROM table1) `DName`
LEFT JOIN (SELECT `name`,
Max(id) id
FROM table1
WHERE unit1_left IS NOT NULL
GROUP BY `name`) l1
ON dname.`name` = l1.`name`
LEFT JOIN table1 l1value
ON l1.id = l1value.id
LEFT JOIN (SELECT `name`,
Max(id) id
FROM table1
WHERE unit2_left IS NOT NULL
GROUP BY `name`) l2
ON dname.`name` = l2.`name`
LEFT JOIN table1 l2value
ON l2.id = l2value.id ;
DEMO
I did it by creating 2 inline views to the highest id for non-null values for both unit1_left and unit2_left (l1 and l2). Then joined it back to original table to get the values (l1value and l2value). We then join that back to a third inline view (dname) that creates the distinct names.
It's quite messy and it might make more sense just to keep your data in a more sensible manner.
You can use subqueries in you select statement. Using SqlFidlle I came up with this.
select o.name,
(select o2.Unit1_left
from original as o2
where o.name = o2.name
and o2.Unit1_left is not null
order by o2.id desc
LIMIT 1) as Unit1_left,
(select o3.Unit2_left
from original as o3
where o.name = o3.name
and o3.Unit2_left is not null
order by o3.id desc
LIMIT 1) as Unit2_left
from original as o
group by o.name
order by id;

MySQL conditionally populate column 3 based on DISTINCT involving 2 other columns in one table

Had a good read through similar topics but I can't quite a) find one to match my scenario, or b) understand others enough to fit / tailor / tweek to my situation.
I have a table, the important fields being;
+------+------+--------+--------+
| ID | Name | Price |Status |
+------+------+--------+--------+
| 1 | Fred | 4.50 | |
| 2 | Fred | 4.50 | |
| 3 | Fred | 5.00 | |
| 4 | John | 7.20 | |
| 5 | John | 7.20 | |
| 6 | John | 7.20 | |
| 7 | Max | 2.38 | |
| 8 | Max | 2.38 | |
| 9 | Sam | 21.00 | |
+------+------+--------+--------+
ID is an auto-incrementing value as records get added throughout the day.
NAME is a Primary Key field, which can repeat 1 to 3 times in the whole table.
Each NAME will have a PRICE value, which may or may not be the same per NAME.
There is also a STATUS field that need to be populated based on the following, which is actually the part I am stuck on.
Status = 'Y' if each DISTINCT name has only one price attached to it.
Status = 'N' if each DISTINCT name has multiple prices attached to it.
Using the table above, ID's 1, 2 and 3 should be 'N', whilst 4, 5, 6, 7, 8 and 9 should be 'Y'.
I think this may well involve some form of combination of JOINs, GROUPs, and DISTINCTs but I am at a loss on how to put that into the right order for SQL.
In order to get the count of distinct Price values per name, we must use a GROUP BY on the Name field, but since you also want to display all names ungrouped but with an additional Status field, we must first create a subselect in the FROM clause which groups by the name and determines whether the name has multiple price values or not.
When we GROUP BY Name in the subselect, COUNT(DISTINCT price) will count the number of distinct price values for each particular name. Without the DISTINCT keyword, it would simply count the number of rows where price is not null.
In conjunction with that, we use a CASE expression to insert N into the Status column if there is more than one distinct Price value for the particular name, otherwise, it will insert Y.
The subselect only returns one row per Name, so to get all names ungrouped, we join that subselect to the main table on the condition that the subselect's Name = the main table's Name:
SELECT
b.ID,
b.Name,
b.Price,
a.Status
FROM
(
SELECT Name, CASE WHEN COUNT(DISTINCT Price) > 1 THEN 'N' ELSE 'Y' END AS Status
FROM tbl
GROUP BY Name
) a
INNER JOIN
tbl b ON a.Name = b.Name
Edit: In order to facilitate an update, you can incorporate this query using JOINs in the UPDATE like so:
UPDATE
tbl a
INNER JOIN
(
SELECT Name, CASE WHEN COUNT(DISTINCT Price) > 1 THEN 'N' ELSE 'Y' END AS Status
FROM tbl
GROUP BY Name
) b ON a.Name = b.Name
SET
a.Status = b.Status
Assuming you have an unfilled Status column in your table.
If you want to update the status column, you could do:
UPDATE mytable s
SET status = (
SELECT IF(COUNT(DISTINCT price)=1, 'Y', 'N') c
FROM (
SELECT *
FROM mytable
) s1
WHERE s1.name = s.name
GROUP BY name
);
Technically, it should not be necessary to have this:
FROM (
SELECT *
FROM mytable
) s1
but there is a mysql limitation that prevents you to select from the table you're updating. By wrapping it in parenthesis, we force mysql to create a temporary table and then it suddenly is possible.

How to `SELECT` and manufacture missing rows from previous values?

I have the following (simplified) result from SELECT * FROM table ORDER BY tick,refid:
tick refid value
----------------
1 1 11
1 2 22
1 3 33
2 1 1111
2 3 3333
3 3 333333
Note the "missing" rows for refid 1 (tick 3) and refid 2 (ticks 2 and 3)
If possible, how can I make a query to add these missing rows using the most recent prior value for that refid? "Most recent" means the value for the row with the same refid as the missing row and largest tick such that the tick is less than the tick for the missing row. e.g.
tick refid value
----------------
1 1 11
1 2 22
1 3 33
2 1 1111
2 2 22
2 3 3333
3 1 1111
3 2 22
3 3 333333
Additional conditions:
All refids will have values at tick=1.
There may be many 'missing' ticks for a refid in sequence, (as above for refid 2).
There are many refids and it's not known which will have sparse data where.
There will be many ticks beyond 3, but all sequential. In the correct result, each refid will have a result for each tick.
Missing rows are not known in advance - this will be run on multiple databases, all with the same structure, and different "missing" rows.
I'm using MySQL and cannot change db just now. Feel free to post answer in another dialect, to help discussion, but I'll select an answer in MySQL dialect over others.
Yes, I know this can be done in the code, which I've implemented. I'm just curious if it can be done with SQL.
What value should be returned when a given tick-refid combination does not exist? In this solution, I simply returned the lowest value for that given refid.
Revision
I've updated the logic to determine what value to use in the case of a null. It should be noted that I'm assuming that ticks+refid is unique in the table.
Select Ticks.tick
, Refs.refid
, Case
When Table.value Is Null
Then (
Select T2.value
From Table As T2
Where T2.refid = Refs.refId
And T2.tick = (
Select Max(T1.tick)
From Table As T1
Where T1.tick < Ticks.tick
And T1.refid = T2.refid
)
)
Else Table.value
End As value
From (
Select Distinct refid
From Table
) As Refs
Cross Join (
Select Distinct tick
From Table
) As Ticks
Left Join Table
On Table.tick = Ticks.tick
And Table.refid = Refs.refid
If you know in advance what your 'tick' and 'refid' values are,
Make a helper table that contains all possible tick and refid values.
Then left join from the helper table on tick and refid to your data table.
If you don't know exactly what your 'tick' and 'refid' values are, you maybe could still use this method, but instead of a static helper table, it would have to be dynamically generated.
The following has too many sub-selects for my taste, but it generates the desired result in MySQL, as long as every tick and every refid occurs separately at least once in the table.
Start with a query that generates every pair of tick and refid. The following uses the table to generate the pairs, so if any tick never appears in the underlying table, it will also be missing from the generated pairs. The same holds true for refids, though the restriction that "All refids will have values at tick=1" should ensure the latter never happens.
SELECT tick, refid FROM
(SELECT refid FROM chadwick WHERE tick=1) AS r
JOIN
(SELECT DISTINCT tick FROM chadwick) AS t
Using this, generate every missing tick, refid pair, along with the largest tick that exists in the table by equijoining on refid and θ≥-joining on tick. Group by the generated tick, refid since only one row for each pair is desired. The key to filtering out existing tick, refid pairs is the HAVING clause. Strictly speaking, you can leave out the HAVING; the resulting query will return existing rows with their existing values.
SELECT tr.tick, tr.refid, MAX(c.tick) AS ctick
FROM
(SELECT tick, refid FROM
(SELECT refid FROM chadwick WHERE tick=1) AS r
JOIN
(SELECT DISTINCT tick FROM chadwick) AS t
) AS tr
JOIN chadwick AS c ON tr.tick >= c.tick AND tr.refid=c.refid
GROUP BY tr.tick, tr.refid
HAVING tr.tick > MAX(c.tick)
One final select from the above as a sub-select, joined to the original table to get the value for the given ctick, returns the new rows for the table.
INSERT INTO chadwick
SELECT missing.tick, missing.refid, c.value
FROM (SELECT tr.tick, tr.refid, MAX(c.tick) AS ctick
FROM
(SELECT tick, refid FROM
(SELECT refid FROM chadwick WHERE tick=1) AS r
JOIN
(SELECT DISTINCT tick FROM chadwick) AS t
) AS tr
JOIN chadwick AS c ON tr.tick >= c.tick AND tr.refid=c.refid
GROUP BY tr.tick, tr.refid
) AS missing
JOIN chadwick AS c ON missing.ctick = c.tick AND missing.refid=c.refid
;
Performance on the sample table, along with (tick, refid) and (refid, tick) indices:
+----+-------------+------------+-------+-------------------+----------+---------+----------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+-------------------+----------+---------+----------+------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 3 | |
| 1 | PRIMARY | c | ALL | tick_ref,ref_tick | NULL | NULL | NULL | 6 | Using where; Using join buffer |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 9 | Using temporary; Using filesort |
| 2 | DERIVED | c | ref | tick_ref,ref_tick | ref_tick | 5 | tr.refid | 1 | Using where; Using index |
| 3 | DERIVED | <derived4> | ALL | NULL | NULL | NULL | NULL | 3 | |
| 3 | DERIVED | <derived5> | ALL | NULL | NULL | NULL | NULL | 3 | Using join buffer |
| 5 | DERIVED | chadwick | index | NULL | tick_ref | 10 | NULL | 6 | Using index |
| 4 | DERIVED | chadwick | ref | tick_ref | tick_ref | 5 | | 2 | Using where; Using index |
+----+-------------+------------+-------+-------------------+----------+---------+----------+------+---------------------------------+
As I said, too many sub-selects. A temporary table may help matters.
To check for missing ticks:
SELECT clo.tick+1 AS missing_tick
FROM chadwick AS chi
RIGHT JOIN chadwick AS clo ON chi.tick = clo.tick+1
WHERE chi.tick IS NULL;
This will return at least one row with tick equal to 1 + the largest tick in the table. Thus, the largest value in this result can be ignored.
In order to have the list of pairs (tick, refid) to insert get a whole list:
SELECT a.tick, b.refid
FROM ( SELECT DISTINCT tick FROM t) a
CROSS JOIN ( SELECT DISTINCT refid FROM t) b
Now substract from that query the existing ones:
SELECT a.tick tick, b.refid refid
FROM ( SELECT DISTINCT tick FROM t) a
CROSS JOIN ( SELECT DISTINCT refid FROM t) b
MINUS
SELECT DISTINCT tick, refid FROM t
Now you can join with t to obtain the final query (note that I use inner join + left join to obtain previous result but you could adapt):
INSERT INTO t(tick, refid, value)
SELECT c.tick, c.refid, t1.value
FROM ( SELECT a.tick tick, b.refid refid
FROM ( SELECT DISTINCT tick FROM t) a
CROSS JOIN ( SELECT DISTINCT refid FROM t) b
MINUS
SELECT DISTINCT tick, refid FROM t
) c
INNER JOIN t t1 ON t1.refid = c.refid and t1.tick < c.tick
LEFT JOIN t t2 ON t2.refid = c.refid AND t1.tick < t2.tick AND t2.tick < c.tick
WHERE t2.tick IS NULL