I am trying to implement logic in Redshift Spectrum where my original table looks like this:
Records in the student table:
1 || student1 || Boston || 2019-01-01
2 || student2 || New York || 2019-02-01
3 || student3 || Chicago || 2019-03-01
1 || student1 || Dallas || 2019-03-01
Records in the incremental table studentinc look like this:
1 || student1 || SFO || 2019-04-01
4 || student4 || Detroit || 2019-04-01
By joining the student and studentinc tables, I am trying to get the latest set of records, which should look like this:
2 || student2 || New York || 2019-02-01
3 || student3 || Chicago || 2019-03-01
1 || student1 || SFO || 2019-04-01
4 || student4 || Detroit || 2019-04-01
I got this result by taking the UNION of student and studentinc and then querying the union based on max(modified_ts). However, this solution isn't good for huge tables. Is there a better solution that works by joining the two tables?
1. Using Spark SQL, you can achieve this with NOT IN and UNION:
scala> var df1 = Seq((1 ,"student1","Boston " , "2019-01-01" ),(2 ,"student2","New York" , "2019-02-01"),(3 ,"student3","Chicago " , "2019-03-01" ),(1 ,"student1","Dallas " , "2019-03-01")).toDF("id","name","country","_date")
Register it as a temporary view (registerTempTable is deprecated since Spark 2.0):
scala> df1.createOrReplaceTempView("temp1")
scala> sql("select * from temp1") .show
+---+--------+--------+----------+
| id| name| country| _date|
+---+--------+--------+----------+
| 1|student1|Boston |2019-01-01|
| 2|student2|New York|2019-02-01|
| 3|student3|Chicago |2019-03-01|
| 1|student1|Dallas |2019-03-01|
+---+--------+--------+----------+
Second DataFrame:
scala> var df3 = Seq((1 , "student1", "SFO", "2019-04-01"),(4 , "student4", "Detroit", "2019-04-01")).toDF("id","name","country","_date")
scala> df3.show
+---+--------+-------+----------+
| id| name|country| _date|
+---+--------+-------+----------+
| 1|student1| SFO|2019-04-01|
| 4|student4|Detroit|2019-04-01|
+---+--------+-------+----------+
Register the second DataFrame as temp2, then perform NOT IN followed by a union:
scala> df3.createOrReplaceTempView("temp2")
scala> sql("select * from temp1 where id not in (select id from temp2)").union(df3).show
+---+--------+--------+----------+
| id| name| country| _date|
+---+--------+--------+----------+
| 2|student2|New York|2019-02-01|
| 3|student3|Chicago |2019-03-01|
| 1|student1| SFO|2019-04-01|
| 4|student4| Detroit|2019-04-01|
+---+--------+--------+----------+
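2. Using the Spark DataFrame API with a left anti join. This can be faster than the NOT IN query, because NOT IN with a subquery has to be null-aware, which can keep Spark from using an efficient hash join.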
scala> df1.join(df3,Seq("id"),"left_anti").union (df3).show
+---+--------+--------+----------+
| id| name| country| _date|
+---+--------+--------+----------+
| 2|student2|New York|2019-02-01|
| 3|student3|Chicago |2019-03-01|
| 1|student1| SFO|2019-04-01|
| 4|student4| Detroit|2019-04-01|
+---+--------+--------+----------+
Hope this helps. Let me know if you have any questions about it.
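For readers who want the same "left anti join plus union" idea in plain SQL rather than Spark, here is a minimal sketch run through Python's sqlite3 module as a stand-in engine. The column names (id, name, city, modified_ts) are assumptions, since the question does not list them, and the query assumes every row in studentinc is newer than the matching rows in student.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student    (id INTEGER, name TEXT, city TEXT, modified_ts TEXT);
CREATE TABLE studentinc (id INTEGER, name TEXT, city TEXT, modified_ts TEXT);
INSERT INTO student VALUES
  (1, 'student1', 'Boston',   '2019-01-01'),
  (2, 'student2', 'New York', '2019-02-01'),
  (3, 'student3', 'Chicago',  '2019-03-01'),
  (1, 'student1', 'Dallas',   '2019-03-01');
INSERT INTO studentinc VALUES
  (1, 'student1', 'SFO',     '2019-04-01'),
  (4, 'student4', 'Detroit', '2019-04-01');
""")

# Keep base rows whose id is absent from the increment (anti join),
# then append every incremental row.
LATEST_SQL = """
SELECT s.id, s.name, s.city, s.modified_ts
FROM student s
WHERE NOT EXISTS (SELECT 1 FROM studentinc i WHERE i.id = s.id)
UNION ALL
SELECT i.id, i.name, i.city, i.modified_ts
FROM studentinc i
"""

latest = sorted(conn.execute(LATEST_SQL).fetchall())
for row in latest:
    print(row)
```

Like the Spark version, this does not deduplicate ids that appear several times in student but never in studentinc; those would all be kept.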
I would recommend window functions:
select s.*
from (select s.*,
             row_number() over (partition by studentid order by modified_ts desc) as seqnum
      from ((select st.*
             from student st
            ) union all
            (select i.*
             from studentinc i
            )
           ) s
     ) s
where seqnum = 1;
Note: The union all requires that the columns be exactly the same and in the same order. You may need to list out the columns if they are not the same.
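The window-function approach can be tried end to end with the question's sample data. The sketch below uses Python's sqlite3 module (window functions need SQLite 3.25 or later) and assumes the columns are named id, name, city and modified_ts; the real Redshift Spectrum table may use different names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student    (id INTEGER, name TEXT, city TEXT, modified_ts TEXT);
CREATE TABLE studentinc (id INTEGER, name TEXT, city TEXT, modified_ts TEXT);
INSERT INTO student VALUES
  (1, 'student1', 'Boston',   '2019-01-01'),
  (2, 'student2', 'New York', '2019-02-01'),
  (3, 'student3', 'Chicago',  '2019-03-01'),
  (1, 'student1', 'Dallas',   '2019-03-01');
INSERT INTO studentinc VALUES
  (1, 'student1', 'SFO',     '2019-04-01'),
  (4, 'student4', 'Detroit', '2019-04-01');
""")

# Number the rows of each id from newest to oldest and keep only the newest.
DEDUP_SQL = """
SELECT id, name, city, modified_ts
FROM (
    SELECT u.*,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY modified_ts DESC) AS seqnum
    FROM (
        SELECT id, name, city, modified_ts FROM student
        UNION ALL
        SELECT id, name, city, modified_ts FROM studentinc
    ) AS u
) AS ranked
WHERE seqnum = 1
ORDER BY id
"""

rows = conn.execute(DEDUP_SQL).fetchall()
for row in rows:
    print(row)
```

Listing the columns explicitly in both branches of the UNION ALL sidesteps the column-order caveat mentioned in the note above.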
Related
How can I find a value that appears several times in a row?
ID |1_Jan|2_Jan|3_Jan|4_Jan|
12 | 2 | 3 | 2 | 4 |
31 | 3 | 4 | 3 | 1 |
25 | 1 | 1 | 1 | 1 |
26 | 3 | 3 | 3 | 3 |
In this table, I want to get IDs 25 and 26, because the values 1 and 3 are used 3 or more times within a single record.
I was also wondering how I can, for example, get only ID 25 even though 26 also has a value repeated 3 or more times?
You can select the rows with all equal column values, and then order by common column value:
with cte as (select t.id, t.1_Jan r, (t.1_Jan = t.2_Jan) and (t.2_Jan = t.3_Jan) and (t.3_Jan = t.4_Jan) val from test_table t)
select c.id from cte c where c.val = 1 order by c.r limit 1;
Output:
id
25
See demo.
This answers the original version of the question.
One way is to unpivot and aggregate:
select id, val, count(*)
from ((select id, 1_jan as val from t) union all
(select id, 2_jan as val from t) union all
(select id, 3_jan as val from t) union all
(select id, 4_jan as val from t)
) t
group by id, val
having count(*) >= 3;
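The unpivot-and-aggregate query can be checked against the sample table. This is a small sketch using Python's sqlite3 module; the table name t comes from the answer, and the column names are quoted because they start with a digit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript('''
CREATE TABLE t (id INTEGER, "1_jan" INTEGER, "2_jan" INTEGER,
                "3_jan" INTEGER, "4_jan" INTEGER);
INSERT INTO t VALUES
  (12, 2, 3, 2, 4),
  (31, 3, 4, 3, 1),
  (25, 1, 1, 1, 1),
  (26, 3, 3, 3, 3);
''')

# Turn the four date columns into rows, then count each value per id.
UNPIVOT_SQL = '''
SELECT id, val, COUNT(*) AS n
FROM (
    SELECT id, "1_jan" AS val FROM t
    UNION ALL SELECT id, "2_jan" FROM t
    UNION ALL SELECT id, "3_jan" FROM t
    UNION ALL SELECT id, "4_jan" FROM t
) AS unpivoted
GROUP BY id, val
HAVING COUNT(*) >= 3
ORDER BY id
'''

repeated = conn.execute(UNPIVOT_SQL).fetchall()
for row in repeated:
    print(row)
```

To return only ID 25, one option is to add a condition on val (for example val = 1) to the query.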
Please take a look at the question described here: MySQL ONLY IN() equivalent clause , regarding Relational Division in MySQL.
My database structure is very similar to the one described, but in the "Chocolate Boys Table", I have an additional ID field - let's call it milk ID.
Chocolates Boys Table
+----+--------------+---------+--------+
| id | chocolate_id | milk_id | boy_id |
+----+--------------+---------+--------+
|  1 |         1000 |    2000 |  10007 |
|  2 |         1003 |    2001 |  10007 |
|  3 |         1006 |    2005 |  10007 |
|  4 |         1000 |    2001 |  10009 |
|  5 |         1001 |    2000 |  10009 |
|  6 |         1005 |    2008 |  10009 |
+----+--------------+---------+--------+
The objective is to run a query that retrieves the boy ID that contains the exact chocolate and milk IDs that I pass in. Here are some examples of my expected results:
Example #1:
Chocolate IDs Passed In (in order) - 1000,1003,1006.
Milk IDs Passed In (in order) - 2000,2001,2005.
Expected Result: Query returns boy ID of 10007.
Example #2:
Chocolate IDs Passed In (in order) - 1000,1003.
Milk IDs Passed In (in order) - 2000,2001.
Expected Result: Empty result set.
Example #3:
Chocolate IDs Passed In (in order) - 1003,1000,1006.
Milk IDs Passed In (in order) - 2000,2001,2005.
Expected Result: Empty result set - The passed in IDs are included in boy ID 10007, but the order is wrong. The values of Chocolate ID and Milk ID don't match up if examined on a row by row basis.
I am attempting to use a slightly modified version of John Woo's solution in order to incorporate the added ID field:
SELECT boy_id
FROM boys_chocolates a
WHERE chocolate_id IN (1003,1000,1006) AND milk_id IN (2000,2001,2005) AND
EXISTS
(
SELECT 1
FROM boys_chocolates b
WHERE a.boy_ID = b.boy_ID
GROUP BY boy_id
HAVING COUNT(DISTINCT chocolate_id) = 3
)
GROUP BY boy_id
HAVING COUNT(*) = 3
The problem that I'm having is that the IN function does not enforce order, as seen in example #3. I would like the above query to return an empty result set. What needs to be changed in order to address this problem? Thank you!
Try this approach:
SELECT a.boy_id
FROM
(SELECT id, boy_id FROM boys_chocolates WHERE chocolate_id = 1000) a
JOIN
(
(SELECT id, boy_id FROM boys_chocolates WHERE chocolate_id = 1003) b,
(SELECT id, boy_id FROM boys_chocolates WHERE chocolate_id = 1006) c,
(SELECT id, boy_id FROM boys_chocolates WHERE milk_id = 2000) d,
(SELECT id, boy_id FROM boys_chocolates WHERE milk_id = 2001) e,
(SELECT id, boy_id FROM boys_chocolates WHERE milk_id = 2005) f
)
ON a.boy_id = b.boy_id AND a.boy_id = c.boy_id AND a.boy_id = d.boy_id
AND a.boy_id = e.boy_id AND a.boy_id = f.boy_id AND b.id > a.id
AND c.id > b.id AND e.id > d.id AND f.id > e.id;
Replace 1000, 1003 and 1006 with your first, second and third chocolate_id respectively. Likewise, replace 2000, 2001 and 2005 with your first, second and third milk_id.
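A different way to read the requirement is that the passed-in lists describe (chocolate_id, milk_id) pairs that must match row by row, and that the boy must have no other rows. Under that assumption, which is an interpretation and not the method of the answer above, the check can be sketched as a single GROUP BY. The helper name find_boys is hypothetical, and the sketch uses Python's sqlite3 module with the question's sample rows.

```python
import sqlite3

def find_boys(conn, chocolate_ids, milk_ids):
    """Return boy_ids whose rows are exactly the given (chocolate, milk) pairs."""
    pairs = list(zip(chocolate_ids, milk_ids))
    if not pairs:
        return []
    # One "(chocolate_id = ? AND milk_id = ?)" test per requested pair.
    pair_test = " OR ".join(["(chocolate_id = ? AND milk_id = ?)"] * len(pairs))
    sql = f"""
        SELECT boy_id
        FROM boys_chocolates
        GROUP BY boy_id
        HAVING COUNT(*) = ?           -- the boy has no extra rows
           AND SUM({pair_test}) = ?   -- and every row matches a requested pair
        ORDER BY boy_id
    """
    params = [len(pairs)] + [v for pair in pairs for v in pair] + [len(pairs)]
    return [row[0] for row in conn.execute(sql, params)]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE boys_chocolates (id INTEGER, chocolate_id INTEGER,
                              milk_id INTEGER, boy_id INTEGER);
INSERT INTO boys_chocolates VALUES
  (1, 1000, 2000, 10007), (2, 1003, 2001, 10007), (3, 1006, 2005, 10007),
  (4, 1000, 2001, 10009), (5, 1001, 2000, 10009), (6, 1005, 2008, 10009);
""")

print(find_boys(conn, [1000, 1003, 1006], [2000, 2001, 2005]))  # example #1
print(find_boys(conn, [1000, 1003], [2000, 2001]))              # example #2
print(find_boys(conn, [1003, 1000, 1006], [2000, 2001, 2005]))  # example #3
```

This sketch assumes the requested pairs are distinct and that a boy never has the same pair twice; it ignores the id column, so it does not check the order in which the rows were inserted.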
I have two different tables in my database.
For example, the first table, named table1, has the following data:
||===============================||
|| ID | DATE ||
===================================
|| 1 | 2nd Jan ||
===================================
|| 2 | 4th Apr ||
===================================
And let's say the second table, named table2, has the following data:
||===============================||
|| ID | NAME ||
===================================
|| 1 | John ||
===================================
|| 2 | Pam ||
===================================
Note that the ID columns of the two tables are NOT related to each other.
What I want to display is:
||===============================||===============================||
|| ID | NAME || ID | DATE ||
====================================================================
|| 1 | John || NULL | NULL ||
====================================================================
|| 2 | Pam || NULL | NULL ||
====================================================================
|| NULL | NULL || 1 | 2nd Jan ||
====================================================================
|| NULL | NULL || 2 | 4th Apr ||
====================================================================
So I tried this MySQL statement:
select a.id, a.date, b.id, b.name from table1 a, table2 b
But this doesn't give me the display I want; it combines the rows of both tables.
I also tried a left join, but that combines the rows as well.
What am I doing wrong? Please help me.
Thanks for reading.
select a.id, a.date, NULL id, NULL name from table1 a
UNION ALL
select NULL id, NULL date, b.id, b.name from table2 b
Try the code above.
Hope this helps.
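The NULL-padded UNION ALL can be checked quickly with the question's sample data. The sketch below uses Python's sqlite3 module as a stand-in for MySQL and puts the name columns first to match the layout the question asks for; the output aliases name_id and date_id are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER, date TEXT);
CREATE TABLE table2 (id INTEGER, name TEXT);
INSERT INTO table1 VALUES (1, '2nd Jan'), (2, '4th Apr');
INSERT INTO table2 VALUES (1, 'John'), (2, 'Pam');
""")

# Each branch fills the other table's columns with NULL, so no rows are combined.
STACKED_SQL = """
SELECT b.id AS name_id, b.name, NULL AS date_id, NULL AS date FROM table2 b
UNION ALL
SELECT NULL, NULL, a.id, a.date FROM table1 a
"""

rows = conn.execute(STACKED_SQL).fetchall()
for row in rows:
    print(row)
```

Both branches must produce the same number of columns, which is why each side carries two NULL placeholders.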
You can accomplish that with a 'fake' outer join, in databases that support FULL OUTER JOIN (MySQL does not):
select a.id, a.date, b.id, b.name
from table1 a
full outer join
table2 b
on 1 = 0;
I have two tables
one is business_details
+--------------+--------------++--------------+
| msisdn | total_counts || id |
+--------------+--------------++--------------+
| 919999999999 | 0 || 2323232 |
| 918888888888 | 0 || 2323231 |
| 917777777777 | 15 || 2323233 |
+--------------+--------------++--------------+
and another table is: users_details
+--------------++--------------+
| msisdn || id |
+--------------++--------------+
| 919999999999 || 2323232 |
| 918888888888 || 2323231 |
| 917777777777 || 2323233 |
+--------------++--------------+
I want to update the table business_details and set total_counts = total_counts + (the count of rows in users_details where business_details.msisdn = users_details.msisdn and business_details.id = users_details.id).
Can anyone help me increment the counts in one table by matching two conditions against another table?
The basic answer to your question is to use join in an update statement.
I am going to guess that total counts in the user table refers to the number of rows. This requires aggregation before the join:
update business_details bd join
       (select ud.msisdn, ud.id, count(*) as cnt
        from users_details ud
        group by ud.msisdn, ud.id
       ) ud
       on bd.msisdn = ud.msisdn and bd.id = ud.id
    set bd.total_counts = bd.total_counts + ud.cnt;
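To try the "count, then add" idea without a MySQL server, here is a sketch using Python's sqlite3 module. SQLite does not accept MySQL's UPDATE ... JOIN syntax, so the sketch swaps in a correlated subquery that counts the matching users_details rows. The duplicate row for 919999999999 is hypothetical, added only so that one count is greater than 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE business_details (msisdn INTEGER, total_counts INTEGER, id INTEGER);
CREATE TABLE users_details    (msisdn INTEGER, id INTEGER);
INSERT INTO business_details VALUES
  (919999999999, 0,  2323232),
  (918888888888, 0,  2323231),
  (917777777777, 15, 2323233);
INSERT INTO users_details VALUES
  (919999999999, 2323232),
  (919999999999, 2323232),  -- hypothetical extra row, not in the question
  (918888888888, 2323231),
  (917777777777, 2323233);
""")

# Add the number of users_details rows that match on both msisdn and id.
conn.execute("""
UPDATE business_details
SET total_counts = total_counts + (
    SELECT COUNT(*)
    FROM users_details ud
    WHERE ud.msisdn = business_details.msisdn
      AND ud.id = business_details.id
)
""")

result = conn.execute(
    "SELECT msisdn, total_counts FROM business_details ORDER BY msisdn"
).fetchall()
for row in result:
    print(row)
```

COUNT(*) yields 0 when nothing matches, so rows without a counterpart in users_details keep their current total_counts.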
I have a table which looks like this:
+--------+------------+
| id     | first_name |
+--------+------------+
| AC0089 | John       |
| AC0015 | Dan        |
| AC0017 | Marry      |
| AC0003 | Andy       |
| AC0001 | Trent      |
| AC0006 | Smith      |
+--------+------------+
I need a query that splits the rows into ranges of 3 and also displays the id at the start of each range, i.e.
+------------+----------+--------+
| startrange | endrange | id     |
+------------+----------+--------+
|          1 |        3 | AC0089 |
|          4 |        6 | AC0003 |
+------------+----------+--------+
I am pretty new to SQL. I tried the query below, but I don't think I am anywhere near the correct solution. Here is the query:
select startrange, endrange, id from table inner join (select 1 startRange, 3 endrange union all select 4 startRange, 6 endRange) r group by r.startRange, r.endRange;
It gives the same id every time, and I am not able to come up with any other solution. How can I get the required output?
Try this:
SET @ct := 0;
select startrange, (startrange + 2) as endrange, id from
(select c.st as startrange, c.* from
(select (@ct := @ct + 1) as st, b.* from <table_name> as b) c
having startrange mod 3 = 1) as cc;
Sorry for the formatting.
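The same "number the rows, keep every third one" idea can be sketched with ROW_NUMBER() instead of a user variable. The example below uses Python's sqlite3 module (SQLite 3.25 or later), a hypothetical table name people, and SQLite's rowid as a stand-in for the table's insertion order; a real table would need an explicit ordering column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id TEXT, first_name TEXT);
INSERT INTO people VALUES
  ('AC0089', 'John'), ('AC0015', 'Dan'),   ('AC0017', 'Marry'),
  ('AC0003', 'Andy'), ('AC0001', 'Trent'), ('AC0006', 'Smith');
""")

# Number the rows 1..N, then keep rows 1, 4, 7, ... as the start of each range.
RANGE_SQL = """
SELECT rn AS startrange, rn + 2 AS endrange, id
FROM (
    SELECT id, ROW_NUMBER() OVER (ORDER BY rowid) AS rn
    FROM people
) AS numbered
WHERE (rn - 1) % 3 = 0
ORDER BY rn
"""

ranges = conn.execute(RANGE_SQL).fetchall()
for row in ranges:
    print(row)
```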
I'm not completely sure what you're trying to do, but if you're trying to convert a table of IDs into ranges, use CASE WHEN:
CASE WHEN startrange in(1,2,3) THEN 1
ELSE NULL
END as startrange,
CASE WHEN endrange in(1,2,3) THEN 3
ELSE NULL
END as endrange,
CASE WHEN ID in(1,2,3) THEN id
WHEN ID in(4,5,6) THEN id
ELSE id
END AS ID