How to `SELECT` and manufacture missing rows from previous values? - mysql

I have the following (simplified) result from SELECT * FROM table ORDER BY tick,refid:
tick refid value
----------------
1 1 11
1 2 22
1 3 33
2 1 1111
2 3 3333
3 3 333333
Note the "missing" rows for refid 1 (tick 3) and refid 2 (ticks 2 and 3).
If possible, how can I make a query to add these missing rows using the most recent prior value for that refid? "Most recent" means the value for the row with the same refid as the missing row and largest tick such that the tick is less than the tick for the missing row. e.g.
tick refid value
----------------
1 1 11
1 2 22
1 3 33
2 1 1111
2 2 22
2 3 3333
3 1 1111
3 2 22
3 3 333333
Additional conditions:
All refids will have values at tick=1.
There may be many 'missing' ticks for a refid in sequence, (as above for refid 2).
There are many refids and it's not known which will have sparse data where.
There will be many ticks beyond 3, but all sequential. In the correct result, each refid will have a result for each tick.
Missing rows are not known in advance - this will be run on multiple databases, all with the same structure, and different "missing" rows.
I'm using MySQL and cannot change db just now. Feel free to post answer in another dialect, to help discussion, but I'll select an answer in MySQL dialect over others.
Yes, I know this can be done in the code, which I've implemented. I'm just curious if it can be done with SQL.

What value should be returned when a given tick-refid combination does not exist? In this solution, I simply returned the lowest value for that given refid.
Revision
I've updated the logic to determine what value to use in the case of a null. It should be noted that I'm assuming that ticks+refid is unique in the table.
Select Ticks.tick
    , Refs.refid
    , Case
        When T.value Is Null
            Then (
                Select T2.value
                From `Table` As T2
                Where T2.refid = Refs.refid
                    And T2.tick = (
                        Select Max(T1.tick)
                        From `Table` As T1
                        Where T1.tick < Ticks.tick
                            And T1.refid = T2.refid
                        )
                )
        Else T.value
        End As value
From (
    Select Distinct refid
    From `Table`
    ) As Refs
    Cross Join (
        Select Distinct tick
        From `Table`
        ) As Ticks
    Left Join `Table` As T
        On T.tick = Ticks.tick
            And T.refid = Refs.refid
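For anyone who wants to try the idea without a MySQL server, here is a minimal sketch that runs the same kind of query through Python's built-in sqlite3 module. SQLite is only a stand-in for MySQL, the table name `readings` is my own placeholder, and the correlated subquery uses `tick <= ...` with `ORDER BY ... LIMIT 1` instead of the CASE expression above, so treat it as a variation under those assumptions rather than the exact query.

```python
import sqlite3

# SQLite stands in for MySQL; `readings` is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (tick INTEGER, refid INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(1, 1, 11), (1, 2, 22), (1, 3, 33),
     (2, 1, 1111), (2, 3, 3333), (3, 3, 333333)],
)

# Build every (tick, refid) pair, then pick the value of the latest row
# at or before that tick for the same refid.
filled = conn.execute(
    """
    SELECT t.tick, r.refid,
           (SELECT d.value
              FROM readings AS d
             WHERE d.refid = r.refid AND d.tick <= t.tick
             ORDER BY d.tick DESC
             LIMIT 1) AS value
      FROM (SELECT DISTINCT tick FROM readings) AS t
     CROSS JOIN (SELECT DISTINCT refid FROM readings) AS r
     ORDER BY t.tick, r.refid
    """
).fetchall()

for row in filled:
    print(row)
```

Because the subquery accepts the row at the current tick as well as earlier ones, existing values pass through unchanged and only the gaps are filled from the past.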

If you know in advance what your 'tick' and 'refid' values are, make a helper table that contains all possible tick and refid combinations, then left join from the helper table to your data table on tick and refid.
If you don't know the 'tick' and 'refid' values in advance, you can probably still use this method, but the helper table would have to be generated dynamically instead of being static.
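A rough sketch of the dynamically generated helper table, again using Python's sqlite3 as a stand-in for MySQL. The names `readings` and `all_pairs` are placeholders I made up; the helper is built from the ticks and refids that actually occur in the data, and the left join then exposes the combinations that have no row.

```python
import sqlite3

# SQLite stands in for MySQL; `readings` and `all_pairs` are hypothetical names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (tick INTEGER, refid INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(1, 1, 11), (1, 2, 22), (1, 3, 33),
     (2, 1, 1111), (2, 3, 3333), (3, 3, 333333)],
)

# Helper table: every combination of an observed tick and an observed refid.
conn.execute(
    """
    CREATE TEMP TABLE all_pairs AS
    SELECT t.tick, r.refid
      FROM (SELECT DISTINCT tick FROM readings) AS t
     CROSS JOIN (SELECT DISTINCT refid FROM readings) AS r
    """
)

# Left join from the helper table; pairs with no matching data row are the gaps.
gaps = conn.execute(
    """
    SELECT p.tick, p.refid
      FROM all_pairs AS p
      LEFT JOIN readings AS d
        ON d.tick = p.tick AND d.refid = p.refid
     WHERE d.tick IS NULL
     ORDER BY p.tick, p.refid
    """
).fetchall()

print(gaps)
```

Dropping the WHERE clause would return all nine pairs, with NULL in the data columns for the three gaps.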

The following has too many sub-selects for my taste, but it generates the desired result in MySQL, as long as every tick and every refid occurs separately at least once in the table.
Start with a query that generates every pair of tick and refid. The following uses the table to generate the pairs, so if any tick never appears in the underlying table, it will also be missing from the generated pairs. The same holds true for refids, though the restriction that "All refids will have values at tick=1" should ensure the latter never happens.
SELECT tick, refid FROM
(SELECT refid FROM chadwick WHERE tick=1) AS r
JOIN
(SELECT DISTINCT tick FROM chadwick) AS t
Using this, generate every missing tick, refid pair, along with the largest tick that exists in the table by equijoining on refid and θ≥-joining on tick. Group by the generated tick, refid since only one row for each pair is desired. The key to filtering out existing tick, refid pairs is the HAVING clause. Strictly speaking, you can leave out the HAVING; the resulting query will return existing rows with their existing values.
SELECT tr.tick, tr.refid, MAX(c.tick) AS ctick
FROM
(SELECT tick, refid FROM
(SELECT refid FROM chadwick WHERE tick=1) AS r
JOIN
(SELECT DISTINCT tick FROM chadwick) AS t
) AS tr
JOIN chadwick AS c ON tr.tick >= c.tick AND tr.refid=c.refid
GROUP BY tr.tick, tr.refid
HAVING tr.tick > MAX(c.tick)
One final select from the above as a sub-select, joined to the original table to get the value for the given ctick, returns the new rows for the table.
INSERT INTO chadwick
SELECT missing.tick, missing.refid, c.value
FROM (SELECT tr.tick, tr.refid, MAX(c.tick) AS ctick
FROM
(SELECT tick, refid FROM
(SELECT refid FROM chadwick WHERE tick=1) AS r
JOIN
(SELECT DISTINCT tick FROM chadwick) AS t
) AS tr
JOIN chadwick AS c ON tr.tick >= c.tick AND tr.refid=c.refid
GROUP BY tr.tick, tr.refid
HAVING tr.tick > MAX(c.tick)
) AS missing
JOIN chadwick AS c ON missing.ctick = c.tick AND missing.refid=c.refid
;
Performance on the sample table, along with (tick, refid) and (refid, tick) indices:
+----+-------------+------------+-------+-------------------+----------+---------+----------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+-------------------+----------+---------+----------+------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 3 | |
| 1 | PRIMARY | c | ALL | tick_ref,ref_tick | NULL | NULL | NULL | 6 | Using where; Using join buffer |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 9 | Using temporary; Using filesort |
| 2 | DERIVED | c | ref | tick_ref,ref_tick | ref_tick | 5 | tr.refid | 1 | Using where; Using index |
| 3 | DERIVED | <derived4> | ALL | NULL | NULL | NULL | NULL | 3 | |
| 3 | DERIVED | <derived5> | ALL | NULL | NULL | NULL | NULL | 3 | Using join buffer |
| 5 | DERIVED | chadwick | index | NULL | tick_ref | 10 | NULL | 6 | Using index |
| 4 | DERIVED | chadwick | ref | tick_ref | tick_ref | 5 | | 2 | Using where; Using index |
+----+-------------+------------+-------+-------------------+----------+---------+----------+------+---------------------------------+
As I said, too many sub-selects. A temporary table may help matters.
To check for missing ticks:
SELECT clo.tick+1 AS missing_tick
FROM chadwick AS chi
RIGHT JOIN chadwick AS clo ON chi.tick = clo.tick+1
WHERE chi.tick IS NULL;
This will return at least one row with tick equal to 1 + the largest tick in the table. Thus, the largest value in this result can be ignored.

To find the (tick, refid) pairs to insert, first build the full list of pairs:
SELECT a.tick, b.refid
FROM ( SELECT DISTINCT tick FROM t) a
CROSS JOIN ( SELECT DISTINCT refid FROM t) b
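Now subtract the existing pairs from that query (MINUS is Oracle syntax; MySQL does not support it, so there you would use an anti-join such as LEFT JOIN ... IS NULL instead):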
SELECT a.tick tick, b.refid refid
FROM ( SELECT DISTINCT tick FROM t) a
CROSS JOIN ( SELECT DISTINCT refid FROM t) b
MINUS
SELECT DISTINCT tick, refid FROM t
Now you can join with t to obtain the final query (note that I use an inner join plus a left join to find the most recent prior row, but you could adapt this):
INSERT INTO t(tick, refid, value)
SELECT c.tick, c.refid, t1.value
FROM ( SELECT a.tick tick, b.refid refid
FROM ( SELECT DISTINCT tick FROM t) a
CROSS JOIN ( SELECT DISTINCT refid FROM t) b
MINUS
SELECT DISTINCT tick, refid FROM t
) c
INNER JOIN t t1 ON t1.refid = c.refid and t1.tick < c.tick
LEFT JOIN t t2 ON t2.refid = c.refid AND t1.tick < t2.tick AND t2.tick < c.tick
WHERE t2.tick IS NULL
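Since the query above relies on MINUS, here is a hedged sketch of the same INSERT with the subtraction rewritten as a LEFT JOIN anti-join. It runs against SQLite through Python's sqlite3 module purely for illustration; the alias `e` is my own addition and I have not run this exact statement on MySQL.

```python
import sqlite3

# SQLite stands in for MySQL; table `t` mirrors the name used in the answer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (tick INTEGER, refid INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 1, 11), (1, 2, 22), (1, 3, 33),
     (2, 1, 1111), (2, 3, 3333), (3, 3, 333333)],
)

conn.execute(
    """
    INSERT INTO t (tick, refid, value)
    SELECT c.tick, c.refid, t1.value
      FROM (-- all pairs minus existing pairs, written as an anti-join
            SELECT a.tick AS tick, b.refid AS refid
              FROM (SELECT DISTINCT tick FROM t) AS a
             CROSS JOIN (SELECT DISTINCT refid FROM t) AS b
              LEFT JOIN t AS e ON e.tick = a.tick AND e.refid = b.refid
             WHERE e.tick IS NULL) AS c
     -- t1: any earlier row for the same refid
     INNER JOIN t AS t1 ON t1.refid = c.refid AND t1.tick < c.tick
     -- t2: a row strictly between t1 and the gap; keep t1 only if none exists
      LEFT JOIN t AS t2 ON t2.refid = c.refid
                       AND t1.tick < t2.tick AND t2.tick < c.tick
     WHERE t2.tick IS NULL
    """
)

rows = conn.execute("SELECT tick, refid, value FROM t ORDER BY tick, refid").fetchall()
for row in rows:
    print(row)
```

The t1/t2 pair is what selects the most recent prior value: a candidate t1 survives only when no other row for that refid sits between it and the missing tick.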

Related

Joining table to itself with multiple join criteria logic

I'm trying to understand the logic behind the syntax below. Based on the following question, table and syntax:
Write a query that'll identify returning active users. A returning active user is a user that has made a second purchase within 7 days of any other of their purchases. Output a list of user_ids of these returning active users.
Column + Data Type:
id: int | user_id: int | item: varchar | created_at: datetime | revenue: int
SELECT DISTINCT(a1.user_id)
FROM amazon_transactions a1
JOIN amazon_transactions a2 ON a1.user_id=a2.user_id
AND a1.id <> a2.id
AND a2.created_at::date-a1.created_at::date BETWEEN 0 AND 7
ORDER BY a1.user_id
Why does the table need to be joined to itself in this case?
How does 'AND a1.id <> a2.id' portion of syntax contribute to the join?
You are looking for users that have two records in that table whose dates are at most 7 days apart.
To accomplish this, you treat the table as if it were two different (but identical) tables, because you have to match a row in the first copy with a row in the second copy.
Of course you don't want to match a row with itself, so
AND a1.id <> a2.id
accomplishes that.
The table needs to be joined with itself because you have just one table and you want to find returning users by comparing the time between transaction dates for the same user.
The AND a1.id <> a2.id portion of the syntax prevents a transaction from being paired with itself in the joined result.
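To see what the a1.id <> a2.id condition changes, here is a small sketch that runs the self-join with and without it. It uses Python's sqlite3 module, so the PostgreSQL `::date` casts are replaced by SQLite's `date()` and `julianday()` functions; the helper `returning_users` is a name I made up, and the six sample rows are the ones used in another answer on this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE amazon_transactions (id INTEGER, user_id INTEGER, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO amazon_transactions VALUES (?, ?, ?)",
    [(1, 1, "2020-01-05 15:33:22"), (2, 2, "2020-01-05 16:33:22"),
     (3, 1, "2020-01-08 18:33:22"), (4, 1, "2020-01-22 17:33:22"),
     (5, 2, "2020-02-05 15:33:22"), (6, 2, "2020-03-05 15:33:22")],
)


def returning_users(exclude_self):
    """Run the self-join, optionally dropping pairs of a row with itself."""
    extra = "AND a1.id <> a2.id" if exclude_self else ""
    sql = f"""
        SELECT DISTINCT a1.user_id
          FROM amazon_transactions AS a1
          JOIN amazon_transactions AS a2
            ON a1.user_id = a2.user_id
           {extra}
           AND julianday(date(a2.created_at)) - julianday(date(a1.created_at))
               BETWEEN 0 AND 7
         ORDER BY a1.user_id
    """
    return [row[0] for row in conn.execute(sql)]


with_condition = returning_users(exclude_self=True)
without_condition = returning_users(exclude_self=False)
print(with_condition, without_condition)
```

Without the condition every transaction matches itself at a distance of 0 days, so every user with at least one purchase would be reported as returning.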
There are two scenarios I can think of, based on the id column values. Are the id values generated in chronological order? If so, to answer your first question: we can, but don't have to, use join syntax. Here is how to achieve your goal using a correlated subquery, with sample data created.
create table amazon_transactions(id int , user_id int , item varchar(20),created_at datetime , revenue int);
insert amazon_transactions (id,user_id,created_at) values
(1,1,'2020-01-05 15:33:22'),
(2,2,'2020-01-05 16:33:22'),
(3,1,'2020-01-08 18:33:22'),
(4,1,'2020-01-22 17:33:22'),
(5,2,'2020-02-05 15:33:22'),
(6,2,'2020-03-05 15:33:22');
select * from amazon_transactions;
-- sample set:
| id | user_id | item | created_at | revenue |
+------+---------+------+---------------------+---------+
| 1 | 1 | NULL | 2020-01-05 15:33:22 | NULL |
| 2 | 2 | NULL | 2020-01-05 16:33:22 | NULL |
| 3 | 1 | NULL | 2020-01-08 18:33:22 | NULL |
| 4 | 1 | NULL | 2020-01-22 17:33:22 | NULL |
| 5 | 2 | NULL | 2020-02-05 15:33:22 | NULL |
| 6 | 2 | NULL | 2020-03-05 15:33:22 | NULL |
-- Here is the answer using a correlated subquery:
select distinct user_id
from amazon_transactions t
where datediff(
(select created_at from amazon_transactions where user_id=t.user_id and id-t.id>=1 order by id limit 1 ),
created_at
)<=7
;
-- result:
| user_id |
+---------+
| 1 |
However, what if the id values are NOT based on transaction time? Then they do not help with our requirement. In this case a JOIN is more capable than a correlated subquery, and we need to order each user's transactions by time in order to build the join condition. To answer your second question, the AND a1.id <> a2.id condition prevents a transaction from being paired with itself. However, to my understanding, it matches more pairs than necessary. We only care whether CONSECUTIVE transactions are within 7 days of each other, but AND a1.id <> a2.id pairs every transaction with every other one. For instance, we want to check the gap between transaction1 and transaction2, and between transaction2 and transaction3, NOT between transaction1 and transaction3.
Note: using the user-variable row_id trick, we can produce a row number that is used to match consecutive transactions for each user, which avoids checking arbitrary pairs of transactions.
select distinct t1.user_id
from
(select user_id,created_at,@row_id:=@row_id+1 as row_id
from amazon_transactions ,(select @row_id:=0) t
order by user_id,created_at)t1
join
(select user_id,created_at,@row_num:=@row_num+1 as row_num
from amazon_transactions ,(select @row_num:=0) t
order by user_id,created_at)t2
order by user_id,created_at)t2
on t1.user_id=t2.user_id and t2.row_num-t1.row_id=1 and datediff(t2.created_at,t1.created_at)<=7
;
-- result
| user_id |
+---------+
| 1 |
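If you would rather avoid user variables, one possible alternative is to look up each transaction's next purchase by created_at with a correlated subquery. The sketch below runs on SQLite through Python's sqlite3 module and deliberately shuffles the ids of the sample data so that they are not in time order (that shuffling is my own assumption for the demo). It also assumes a user never has two purchases with exactly the same timestamp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE amazon_transactions (id INTEGER, user_id INTEGER, created_at TEXT)"
)
# Same dates as the sample set above, but ids are intentionally out of time order.
conn.executemany(
    "INSERT INTO amazon_transactions VALUES (?, ?, ?)",
    [(6, 1, "2020-01-05 15:33:22"), (2, 2, "2020-01-05 16:33:22"),
     (1, 1, "2020-01-08 18:33:22"), (4, 1, "2020-01-22 17:33:22"),
     (5, 2, "2020-02-05 15:33:22"), (3, 2, "2020-03-05 15:33:22")],
)

# For each transaction, find the same user's next purchase by timestamp and
# keep the user if that purchase falls within 7 calendar days.
returning = [
    row[0]
    for row in conn.execute(
        """
        SELECT DISTINCT t.user_id
          FROM amazon_transactions AS t
         WHERE julianday(date((SELECT MIN(n.created_at)
                                 FROM amazon_transactions AS n
                                WHERE n.user_id = t.user_id
                                  AND n.created_at > t.created_at)))
               - julianday(date(t.created_at)) <= 7
         ORDER BY t.user_id
        """
    )
]

print(returning)
```

When a transaction has no later purchase, the subquery yields NULL, the comparison is NULL, and the row is simply filtered out.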

Does MySQL automatically use the coalesce function during a join between tables?

During a table join, when does MySQL use this function?
The single result column that replaces two common columns is defined
using the coalesce operation. That is, for two t1.a and t2.a the
resulting single join column a is defined as a = COALESCE(t1.a, t2.a),
where:
COALESCE(x, y) = (CASE WHEN x IS NOT NULL THEN x ELSE y END)
https://dev.mysql.com/doc/refman/8.0/en/join.html
I know what the function does, but I want to know when it is used during the join operation. This just makes no sense to me! Can someone show me an example?
That is in reference to redundant column elimination during a natural join or a join with USING; it describes which columns are removed from the displayed result.
The order in which the columns are displayed is described just above the section you referenced:
First, coalesced common columns of the two joined tables, in the order in which they occur in the first table
Second, columns unique to the first table, in order in which they occur in that table
Third, columns unique to the second table, in order in which they occur in that table
Example
t1
| a | b | c |
| 1 | 1 | 1 |
t2
| a | b | d |
| 1 | 1 | 1 |
The join with using
SELECT * FROM t1 JOIN t2 USING (b);
This results in b being coalesced (due to USING), followed by the remaining columns of the first table, followed by the remaining columns of the second table.
| b | a | c | a | d |
| 1 | 1 | 1 | 1 | 1 |
Whereas a natural join
SELECT * FROM t1 NATURAL JOIN t2;
This results in the common columns of both tables (a and b) being coalesced, followed by the columns unique to the first table, followed by those unique to the second table.
| a | b | c | d |
| 1 | 1 | 1 | 1 |
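In the inner-join examples above both sides always hold the same value, so the COALESCE is invisible. As far as I understand it, the definition only makes a visible difference in an outer join, where one side can be NULL. The sketch below writes the COALESCE out by hand instead of relying on USING, and runs it on SQLite through Python's sqlite3 module; the extra row (2, 2, 2) in t2 is my own addition to create an unmatched row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (a INTEGER, b INTEGER, c INTEGER)")
conn.execute("CREATE TABLE t2 (a INTEGER, b INTEGER, d INTEGER)")
conn.execute("INSERT INTO t1 VALUES (1, 1, 1)")
# The second t2 row has no partner in t1 (added for this illustration).
conn.executemany("INSERT INTO t2 VALUES (?, ?, ?)", [(1, 1, 1), (2, 2, 2)])

# Equivalent of keeping every t2 row: t1.a is NULL for the unmatched row,
# so COALESCE(t1.a, t2.a) falls back to t2.a.
merged = conn.execute(
    """
    SELECT COALESCE(t1.a, t2.a) AS a, t1.a AS t1_a, t2.a AS t2_a
      FROM t2
      LEFT JOIN t1 ON t1.a = t2.a
     ORDER BY t2.a
    """
).fetchall()

print(merged)
```

The second row shows the point of the definition: the single join column still carries 2 even though t1 contributed nothing.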

remove duplicate rows based on one column value

I have the table below, and I need to delete the rows that have a duplicate refID while keeping at least one row for each refID, i.e. I need to remove rows 4 and 5. Please help me with this.
+----+-------+--------+--+
| ID | refID | data | |
+----+-------+--------+--+
| 1 | 1023 | aaaaaa | |
| 2 | 1024 | bbbbbb | |
| 3 | 1025 | cccccc | |
| 4 | 1023 | ffffff | |
| 5 | 1023 | gggggg | |
| 6 | 1022 | rrrrrr | |
+----+-------+--------+--+
This is similar to Gordon Linoff's query, but without the subquery:
DELETE t1 FROM table t1
JOIN table t2
ON t2.refID = t1.refID
AND t2.ID < t1.ID
This uses an inner join to only delete rows where there is another row with the same refID but lower ID.
The benefit of avoiding a subquery is being able to utilize an index for the search. This query should perform well with a multi-column index on refID + ID.
I would do:
delete from t where
ID not in (select min(ID) from table t group by refID having count(*) > 1)
and refID in (select refID from table t group by refID having count(*) > 1)
The criteria are that refID is among the duplicated values and ID differs from the minimum ID of those duplicates. It works better if refID is indexed.
Otherwise, provided you can run it repeatedly, issue the following query until it no longer deletes anything:
delete from t
where
ID in (select max(ID) from table t group by refID having count(*) > 1)
Another variant, in some cases a bit faster than Marcus's and NJ73's answers:
DELETE ourTable
FROM ourTable JOIN
(SELECT ID,targetField
FROM ourTable
GROUP BY targetField HAVING COUNT(*) > 1) t2
ON ourTable.targetField = t2.targetField AND ourTable.ID != t2.ID;
Hope that helps someone. On big tables Marcus's answer stalls.
In MySQL, you can do this with a join in delete:
delete t
from table t left join
(select min(id) as id
from table t
group by refId
) tokeep
on t.id = tokeep.id
where tokeep.id is null;
For each RefId, the subquery calculates the minimum of the id column (presumed to be unique over the whole table). It uses a left join for the match, so anything that doesn't match has a NULL value for tokeep.id. These are the ones that are deleted.
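The keep-the-lowest-ID rule can be checked quickly with a small sketch. It uses SQLite through Python's sqlite3 module, which does not support MySQL's multi-table DELETE syntax, so the rule is expressed as NOT IN over the per-refID minimum instead. The table name `items` is a placeholder of mine.

```python
import sqlite3

# SQLite stands in for MySQL; `items` is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (ID INTEGER PRIMARY KEY, refID INTEGER, data TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(1, 1023, "aaaaaa"), (2, 1024, "bbbbbb"), (3, 1025, "cccccc"),
     (4, 1023, "ffffff"), (5, 1023, "gggggg"), (6, 1022, "rrrrrr")],
)

# Keep the lowest ID for each refID and delete every other row.
conn.execute(
    "DELETE FROM items WHERE ID NOT IN (SELECT MIN(ID) FROM items GROUP BY refID)"
)

remaining = [row[0] for row in conn.execute("SELECT ID FROM items ORDER BY ID")]
print(remaining)
```

On MySQL this exact statement would likely be rejected because the subquery reads the table being deleted from, which is why the answers above use joins or derived tables.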

How to merge column data using the last updated value in MySQL?

This is somewhat confusing, so it's easier if I start with an example and the expected output.
I have a table that could look like this: (Unit1 - Unit2 columns could span up to 30 columns in the same general format)
| ID | Name | Unit1_left | Unit2_left |
| 1 | Tom | 50 | NULL |
| 2 | Tom | NULL | 1 |
| 3 | Tom | 45 | NULL |
| 4 | Dan | NULL | NULL |
What I am trying to select is a table like this:
| Name | Unit1_left | Unit2_left |
| Tom | 45 | 1 |
| Dan | NULL | NULL |
What that is doing is grouping by name and attempting to find the last values in the 2 other columns if they exist (if not then it returns NULL).
I have looked at various other questions and they all say to use Max(); however, this will not work since it selects the highest value (incorrect). I have seen that in MsSQL there is a Last() function which looks vaguely like what I want, but it's not implemented in MySQL and isn't exactly what I need anyway.
What I am trying to ask is, does anyone know of a possible method of selecting the data like this or if I will have to use a separate programming language to do this?
This will produce the result set you've described
SELECT dname.name,
l1value.unit1_left,
l2value.unit2_left
FROM (SELECT DISTINCT `name`
FROM table1) `DName`
LEFT JOIN (SELECT `name`,
Max(id) id
FROM table1
WHERE unit1_left IS NOT NULL
GROUP BY `name`) l1
ON dname.`name` = l1.`name`
LEFT JOIN table1 l1value
ON l1.id = l1value.id
LEFT JOIN (SELECT `name`,
Max(id) id
FROM table1
WHERE unit2_left IS NOT NULL
GROUP BY `name`) l2
ON dname.`name` = l2.`name`
LEFT JOIN table1 l2value
ON l2.id = l2value.id ;
I did it by creating 2 inline views to the highest id for non-null values for both unit1_left and unit2_left (l1 and l2). Then joined it back to original table to get the values (l1value and l2value). We then join that back to a third inline view (dname) that creates the distinct names.
It's quite messy and it might make more sense just to keep your data in a more sensible manner.
You can use subqueries in your SELECT statement. Using SQL Fiddle, I came up with this.
select o.name,
(select o2.Unit1_left
from original as o2
where o.name = o2.name
and o2.Unit1_left is not null
order by o2.id desc
LIMIT 1) as Unit1_left,
(select o3.Unit2_left
from original as o3
where o.name = o3.name
and o3.Unit2_left is not null
order by o3.id desc
LIMIT 1) as Unit2_left
from original as o
group by o.name
order by min(o.id);
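Since the question mentions up to 30 such columns, writing one subquery per column by hand gets tedious. Here is a sketch that generates the subqueries from a list of column names, run against SQLite through Python's sqlite3 module. The `columns` list and the `first_id` alias are illustrative, and the column names come from a fixed list in the code rather than from user input, which is why string formatting is acceptable here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE original (id INTEGER PRIMARY KEY, name TEXT, "
    "Unit1_left INTEGER, Unit2_left INTEGER)"
)
conn.executemany(
    "INSERT INTO original VALUES (?, ?, ?, ?)",
    [(1, "Tom", 50, None), (2, "Tom", None, 1),
     (3, "Tom", 45, None), (4, "Dan", None, None)],
)

# Fixed list of value columns; extend it to cover Unit3_left and so on.
columns = ["Unit1_left", "Unit2_left"]

# One correlated subquery per column: latest non-NULL value by id for that name.
selects = ",\n       ".join(
    f"(SELECT o.{col} FROM original AS o"
    f" WHERE o.name = n.name AND o.{col} IS NOT NULL"
    f" ORDER BY o.id DESC LIMIT 1) AS {col}"
    for col in columns
)

sql = f"""
SELECT n.name,
       {selects}
  FROM (SELECT name, MIN(id) AS first_id FROM original GROUP BY name) AS n
 ORDER BY n.first_id
"""

latest = conn.execute(sql).fetchall()
print(latest)
```

A name with no non-NULL value in a column simply gets NULL there, because the subquery returns no row.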

MySQL conditionally populate column 3 based on DISTINCT involving 2 other columns in one table

I've had a good read through similar topics, but I can't quite a) find one that matches my scenario, or b) understand the others well enough to fit / tailor / tweak them to my situation.
I have a table, the important fields being;
+------+------+--------+--------+
| ID | Name | Price |Status |
+------+------+--------+--------+
| 1 | Fred | 4.50 | |
| 2 | Fred | 4.50 | |
| 3 | Fred | 5.00 | |
| 4 | John | 7.20 | |
| 5 | John | 7.20 | |
| 6 | John | 7.20 | |
| 7 | Max | 2.38 | |
| 8 | Max | 2.38 | |
| 9 | Sam | 21.00 | |
+------+------+--------+--------+
ID is an auto-incrementing value as records get added throughout the day.
NAME is a Primary Key field, which can repeat 1 to 3 times in the whole table.
Each NAME will have a PRICE value, which may or may not be the same per NAME.
There is also a STATUS field that needs to be populated based on the following, which is the part I am stuck on.
Status = 'Y' if each DISTINCT name has only one price attached to it.
Status = 'N' if each DISTINCT name has multiple prices attached to it.
Using the table above, ID's 1, 2 and 3 should be 'N', whilst 4, 5, 6, 7, 8 and 9 should be 'Y'.
I think this may well involve some form of combination of JOINs, GROUPs, and DISTINCTs but I am at a loss on how to put that into the right order for SQL.
In order to get the count of distinct Price values per name, we must use a GROUP BY on the Name field, but since you also want to display all names ungrouped but with an additional Status field, we must first create a subselect in the FROM clause which groups by the name and determines whether the name has multiple price values or not.
When we GROUP BY Name in the subselect, COUNT(DISTINCT price) will count the number of distinct price values for each particular name. Without the DISTINCT keyword, it would simply count the number of rows where price is not null.
In conjunction with that, we use a CASE expression to insert N into the Status column if there is more than one distinct Price value for the particular name, otherwise, it will insert Y.
The subselect only returns one row per Name, so to get all names ungrouped, we join that subselect to the main table on the condition that the subselect's Name = the main table's Name:
SELECT
b.ID,
b.Name,
b.Price,
a.Status
FROM
(
SELECT Name, CASE WHEN COUNT(DISTINCT Price) > 1 THEN 'N' ELSE 'Y' END AS Status
FROM tbl
GROUP BY Name
) a
INNER JOIN
tbl b ON a.Name = b.Name
Edit: In order to facilitate an update, you can incorporate this query using JOINs in the UPDATE like so:
UPDATE
tbl a
INNER JOIN
(
SELECT Name, CASE WHEN COUNT(DISTINCT Price) > 1 THEN 'N' ELSE 'Y' END AS Status
FROM tbl
GROUP BY Name
) b ON a.Name = b.Name
SET
a.Status = b.Status
Assuming you have an unfilled Status column in your table.
If you want to update the status column, you could do:
UPDATE mytable s
SET status = (
SELECT IF(COUNT(DISTINCT price)=1, 'Y', 'N') c
FROM (
SELECT *
FROM mytable
) s1
WHERE s1.name = s.name
GROUP BY name
);
Technically, it should not be necessary to have this:
FROM (
SELECT *
FROM mytable
) s1
but there is a MySQL limitation that prevents you from selecting from the table you're updating. By wrapping it in a subquery in the FROM clause, we force MySQL to create a temporary table, and then it becomes possible.
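To double-check the Y/N logic against the sample table, here is a small sketch using SQLite through Python's sqlite3 module. SQLite lets a correlated subquery read the table being updated, so the derived-table workaround is not needed there; the table name `tbl` follows the first answer, and the rest is an illustration rather than MySQL syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (ID INTEGER PRIMARY KEY, Name TEXT, Price REAL, Status TEXT)"
)
conn.executemany(
    "INSERT INTO tbl (ID, Name, Price) VALUES (?, ?, ?)",
    [(1, "Fred", 4.50), (2, "Fred", 4.50), (3, "Fred", 5.00),
     (4, "John", 7.20), (5, "John", 7.20), (6, "John", 7.20),
     (7, "Max", 2.38), (8, "Max", 2.38), (9, "Sam", 21.00)],
)

# 'N' when the name has more than one distinct price, otherwise 'Y'.
conn.execute(
    """
    UPDATE tbl
       SET Status = CASE
                      WHEN (SELECT COUNT(DISTINCT p.Price)
                              FROM tbl AS p
                             WHERE p.Name = tbl.Name) > 1
                      THEN 'N' ELSE 'Y'
                    END
    """
)

statuses = dict(conn.execute("SELECT ID, Status FROM tbl ORDER BY ID"))
print(statuses)
```

IDs 1 to 3 come out as 'N' and the rest as 'Y', matching the expectation stated in the question.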