Spark scala joining with subquery with limit - mysql

I need to join two tables on fake_id, but table 2 contains more than one matching record per fake_id, so I need to match the record where table2.end_time >= table1.event_time and table2.start_time <= table1.event_time.
If more than one record in table 2 matches this condition, I need to consider only the latest one by updated_time.
Here is what I tried:
spark.sql("select t1.fake_id, t1.attribute_1,t1.event_time,t22.end_time from table1 t1 left outer join (
select fake_id, end_time from table2 t2 where t2.fake_id=t1.fake_id and t2.end_time >= t1.event_time and t2.start_time <= t1.event_time order by t2.updated_time desc limit 1)
as t22 on t1.fake_id=t22.fake_id")
For the above statement, Spark throws an error for an unknown column t1.fake_id.
Table.1 -
------------------------------------------
fake_id  attribute_1  event_time
------------------------------------------
1        attr_val_11  2020-08-01 05:00:00
2        attr_val_12  2020-08-01 15:00:00
3        attr_val_31  2020-08-03 07:00:00
4        attr_val_41  2020-08-01 05:00:00
Table.2 -
------------------------------------------------------------------------
fake_id  start_time           end_time             updated_time
------------------------------------------------------------------------
1        2020-08-01 02:00:00  2020-08-01 08:00:00  2020-08-01 00:00:00
2        2020-08-01 04:00:00  2020-08-01 23:00:00  2020-08-01 00:00:00
3        2020-08-03 02:00:00  2020-08-03 08:00:00  2020-08-03 08:00:00
3        2020-08-03 05:00:00  2020-08-03 10:00:00  2020-08-03 12:00:00
3        2020-08-04 05:00:00  2020-08-04 10:00:00  2020-08-04 12:00:00
4        2020-08-01 08:00:00  2020-08-01 18:00:00  2020-08-01 18:00:00
4        2020-08-01 02:00:00  2020-08-01 05:00:00  2020-08-01 22:00:00
Result :
-------------------------------------------------------------------------------------
fake_id  attribute_1  event_time           start_time           end_time
-------------------------------------------------------------------------------------
1        attr_val_11  2020-08-01 05:00:00  2020-08-01 02:00:00  2020-08-01 08:00:00
2        attr_val_12  2020-08-01 15:00:00  2020-08-01 04:00:00  2020-08-01 23:00:00
3        attr_val_31  2020-08-03 07:00:00  2020-08-03 05:00:00  2020-08-03 10:00:00
4        attr_val_41  2020-08-01 05:00:00  2020-08-01 02:00:00  2020-08-01 05:00:00

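Join with BETWEEN, number the matches for each table1 record with row_number() ordered by updated_time descending, and keep only the first row (the one with the latest updated_time).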
spark.sql('''
    select
        fake_id,
        attribute_1,
        event_time,
        start_time,
        end_time
    from (
        select
            t1.fake_id,
            t1.attribute_1,
            t1.event_time,
            t2.start_time,
            t2.end_time,
            row_number() OVER (PARTITION BY t1.fake_id, t1.attribute_1 ORDER BY t2.updated_time DESC) as rank
        from
            table1 t1
        left join
            table2 t2
        on
            t1.fake_id = t2.fake_id and
            t1.event_time between t2.start_time and t2.end_time) t
    where
        rank = 1
    order by
        fake_id
''').show()
+-------+-----------+-------------------+-------------------+-------------------+
|fake_id|attribute_1| event_time| start_time| end_time|
+-------+-----------+-------------------+-------------------+-------------------+
| 1|attr_val_11|2020-08-01 05:00:00|2020-08-01 02:00:00|2020-08-01 08:00:00|
| 2|attr_val_12|2020-08-01 15:00:00|2020-08-01 04:00:00|2020-08-01 23:00:00|
| 3|attr_val_31|2020-08-03 07:00:00|2020-08-03 05:00:00|2020-08-03 10:00:00|
| 4|attr_val_41|2020-08-01 05:00:00|2020-08-01 02:00:00|2020-08-01 05:00:00|
+-------+-----------+-------------------+-------------------+-------------------+
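The query can also be checked without a Spark session. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for Spark SQL (window functions need SQLite 3.25 or later). The alias rn and the extra event_time column in the partition are my own choices for illustration, not part of the answer above; the data is the sample from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (fake_id INTEGER, attribute_1 TEXT, event_time TEXT);
    CREATE TABLE table2 (fake_id INTEGER, start_time TEXT, end_time TEXT, updated_time TEXT);
    INSERT INTO table1 VALUES
        (1, 'attr_val_11', '2020-08-01 05:00:00'),
        (2, 'attr_val_12', '2020-08-01 15:00:00'),
        (3, 'attr_val_31', '2020-08-03 07:00:00'),
        (4, 'attr_val_41', '2020-08-01 05:00:00');
    INSERT INTO table2 VALUES
        (1, '2020-08-01 02:00:00', '2020-08-01 08:00:00', '2020-08-01 00:00:00'),
        (2, '2020-08-01 04:00:00', '2020-08-01 23:00:00', '2020-08-01 00:00:00'),
        (3, '2020-08-03 02:00:00', '2020-08-03 08:00:00', '2020-08-03 08:00:00'),
        (3, '2020-08-03 05:00:00', '2020-08-03 10:00:00', '2020-08-03 12:00:00'),
        (3, '2020-08-04 05:00:00', '2020-08-04 10:00:00', '2020-08-04 12:00:00'),
        (4, '2020-08-01 08:00:00', '2020-08-01 18:00:00', '2020-08-01 18:00:00'),
        (4, '2020-08-01 02:00:00', '2020-08-01 05:00:00', '2020-08-01 22:00:00');
""")

# Range join first, then number the candidates per table1 row, newest update first.
query = """
    SELECT fake_id, attribute_1, event_time, start_time, end_time
    FROM (
        SELECT t1.fake_id, t1.attribute_1, t1.event_time,
               t2.start_time, t2.end_time,
               ROW_NUMBER() OVER (
                   PARTITION BY t1.fake_id, t1.attribute_1, t1.event_time
                   ORDER BY t2.updated_time DESC
               ) AS rn
        FROM table1 t1
        LEFT JOIN table2 t2
               ON t1.fake_id = t2.fake_id
              AND t1.event_time BETWEEN t2.start_time AND t2.end_time
    ) t
    WHERE rn = 1
    ORDER BY fake_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because the correlation lives in the join condition rather than inside a derived table, the query never needs to reference t1 from within a subquery in the FROM clause, which is what the original attempt tripped over.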

Related

Group By 3 columns (JobId, StartTime, EndTime) for continuous days in MySQL

I want to group by JobId, StartTime & EndTime, but only for continuous days. If a specific row doesn't form part of a range, it should be discarded. The Ids should also be pivoted into a single column per grouping.
Id  Date        StartTime  EndTime   JobId
1   2021-08-23  08:30:00   19:00:00  1
2   2021-08-24  08:30:00   19:00:00  1
3   2021-08-24  12:30:00   14:30:00  2
4   2021-08-24  15:30:00   19:00:00  1
5   2021-08-25  08:30:00   19:00:00  1
6   2021-08-25  12:30:00   14:30:00  2
7   2021-08-25  15:45:00   19:00:00  1
8   2021-08-26  08:30:00   09:30:00  1
9   2021-08-26  15:30:00   19:00:00  1
10  2021-08-26  10:30:00   11:00:00  1
11  2021-08-26  12:00:00   14:30:00  1
12  2021-08-27  08:30:00   09:30:00  1
13  2021-08-27  11:00:00   11:15:00  1
14  2021-08-27  11:30:00   14:30:00  1
15  2021-08-28  08:30:00   09:30:00  1
Using the above sample data you can see 3 groupings that can form such a continuous range.
Range 1 consists of Ids 1, 2 & 5 - 2021-08-23 to 2021-08-25, 08:30:00 to 19:00:00
Range 2 consists of Ids 3 & 6 - 2021-08-24 to 2021-08-25, 12:30:00 to 14:30:00
Range 3 consists of Ids 8, 12 & 15 - 2021-08-26 to 2021-08-28, 08:30:00 to 09:30:00
The end result should be:
JobId  StartDate   EndDate     StartTime  EndTime   Ids
1      2021-08-23  2021-08-25  08:30:00   19:00:00  1,2,5
2      2021-08-24  2021-08-25  12:30:00   14:30:00  3,6
1      2021-08-26  2021-08-28  08:30:00   09:30:00  8,12,15
MySQL 8.0.23
Assuming that the combination (JobId, `Date`, StartTime, EndTime) is unique, you may use:
SELECT JobId,
MIN(`Date`) StartDate,
MAX(`Date`) EndDate,
StartTime,
EndTime,
GROUP_CONCAT(Id) Ids
FROM test
GROUP BY JobId,
StartTime,
EndTime
HAVING COUNT(*) > 1
AND DATEDIFF(EndDate, StartDate) = COUNT(*) - 1
ORDER BY StartDate, StartTime
https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=fce8590f72ac1d50cd9e89add3ed01e7
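To see why the HAVING clause works: when the dates in a group are unique, they form one unbroken run exactly when the span between the first and last date equals the row count minus one. Here is a rough sketch of that check on the question's sample data, assuming SQLite (via Python's sqlite3) in place of MySQL; julianday() stands in for DATEDIFF, and the Ids are sorted in Python because SQLite does not guarantee GROUP_CONCAT order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test (Id INTEGER, `Date` TEXT, StartTime TEXT, EndTime TEXT, JobId INTEGER)"
)
conn.executemany("INSERT INTO test VALUES (?, ?, ?, ?, ?)", [
    (1, "2021-08-23", "08:30:00", "19:00:00", 1),
    (2, "2021-08-24", "08:30:00", "19:00:00", 1),
    (3, "2021-08-24", "12:30:00", "14:30:00", 2),
    (4, "2021-08-24", "15:30:00", "19:00:00", 1),
    (5, "2021-08-25", "08:30:00", "19:00:00", 1),
    (6, "2021-08-25", "12:30:00", "14:30:00", 2),
    (7, "2021-08-25", "15:45:00", "19:00:00", 1),
    (8, "2021-08-26", "08:30:00", "09:30:00", 1),
    (9, "2021-08-26", "15:30:00", "19:00:00", 1),
    (10, "2021-08-26", "10:30:00", "11:00:00", 1),
    (11, "2021-08-26", "12:00:00", "14:30:00", 1),
    (12, "2021-08-27", "08:30:00", "09:30:00", 1),
    (13, "2021-08-27", "11:00:00", "11:15:00", 1),
    (14, "2021-08-27", "11:30:00", "14:30:00", 1),
    (15, "2021-08-28", "08:30:00", "09:30:00", 1),
])

# Same shape as the MySQL query; julianday() replaces DATEDIFF (SQLite-specific).
query = """
    SELECT JobId,
           MIN(`Date`) AS StartDate,
           MAX(`Date`) AS EndDate,
           StartTime,
           EndTime,
           GROUP_CONCAT(Id) AS Ids
    FROM test
    GROUP BY JobId, StartTime, EndTime
    HAVING COUNT(*) > 1
       AND julianday(MAX(`Date`)) - julianday(MIN(`Date`)) = COUNT(*) - 1
    ORDER BY StartDate, StartTime
"""
ranges = [
    (job, start, end, st, et, sorted(int(i) for i in ids.split(",")))
    for job, start, end, st, et, ids in conn.execute(query)
]
for r in ranges:
    print(r)
```

Note that the check is applied to each (JobId, StartTime, EndTime) group as a whole. Ids 4 and 9 share a job and time slot but are two days apart, so the pair is dropped; by the same logic, a group containing one valid run plus a stray day would be dropped entirely rather than split.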

ranking based on multiple tables

select a.no, a.Dtime, count(b.Dtime)+1 as Rank
from table1 a
left join table1 b on a.Dtime>b.Dtime and a.no=b.no
group by a.no, a.Dtime
order by a.no, a.Dtime
table1 Input:
NO Dtime
1 08:10:00
1 09:10:00
1 09:40:00
1 10:10:00
2 09:30:00
2 10:15:00
3 09:00:00
Output:
NO Dtime Rank
1 08:10:00 1
1 09:10:00 2
1 09:40:00 3
1 10:10:00 4
2 09:30:00 1
2 10:15:00 2
3 09:00:00 1
But I am looking for output in MySQL where Rank is based on table2, linked to table1 by NO and Dtime, i.e. table1.Dtime > table2.Dtime:
table2 Input
NO Dtime
1 08:30:00
1 09:15:00
1 09:50:00
2 08:30:00
2 09:45:00
3 09:50:00
Output:
NO table1.Dtime Rank table2.Dtime
1 08:10:00 0 00:00:00
1 09:10:00 1 08:30:00
1 09:40:00 2 09:15:00
1 10:10:00 3 09:50:00
2 09:30:00 1 08:30:00
2 10:15:00 2 09:45:00
3 09:00:00 0 00:00:00
You can use the same approach as in your initial query; just left join to table2 instead. To get the Dtime from table2, you can use a correlated subquery:
select a.no, a.Dtime,
count(b.Dtime) as Rank,
coalesce((select c.Dtime
from table2 as c
where c.no = a.no and a.Dtime > c.Dtime
order by c.Dtime desc limit 1), '00:00:00') as t2Dtime
from table1 a
left join table2 b on a.Dtime > b.Dtime and a.no = b.no
group by a.no,a.Dtime
order by a.no, a.Dtime
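As a quick check of this query against the sample tables, here is a small sketch that assumes SQLite (through Python's sqlite3) behaves like MySQL for this statement. Two details are my own adjustments: the alias is rnk instead of Rank, since RANK is a reserved word in MySQL 8.0, and the no column is backtick-quoted because NO is an SQLite keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (`no` INTEGER, Dtime TEXT)")
conn.execute("CREATE TABLE table2 (`no` INTEGER, Dtime TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [
    (1, "08:10:00"), (1, "09:10:00"), (1, "09:40:00"), (1, "10:10:00"),
    (2, "09:30:00"), (2, "10:15:00"), (3, "09:00:00"),
])
conn.executemany("INSERT INTO table2 VALUES (?, ?)", [
    (1, "08:30:00"), (1, "09:15:00"), (1, "09:50:00"),
    (2, "08:30:00"), (2, "09:45:00"), (3, "09:50:00"),
])

# rnk counts the earlier table2 rows; the subquery fetches the latest of them.
query = """
    SELECT a.`no`, a.Dtime,
           COUNT(b.Dtime) AS rnk,
           COALESCE((SELECT c.Dtime
                     FROM table2 AS c
                     WHERE c.`no` = a.`no` AND a.Dtime > c.Dtime
                     ORDER BY c.Dtime DESC
                     LIMIT 1), '00:00:00') AS t2Dtime
    FROM table1 AS a
    LEFT JOIN table2 AS b ON a.Dtime > b.Dtime AND a.`no` = b.`no`
    GROUP BY a.`no`, a.Dtime
    ORDER BY a.`no`, a.Dtime
"""
ranked = conn.execute(query).fetchall()
for row in ranked:
    print(row)
```

COUNT(b.Dtime) ignores the NULLs produced by the left join, which is why table1 rows with no earlier table2 row get rank 0 and fall back to '00:00:00'.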

How to Group the duplicate items in MySQL separately

I have a request table..
user_id no:of_mach time_start req_time
11 3 2012-12-12 09:00:00 2012-12-11 09:00:00
12 4 2012-12-14 08:00:00 2012-12-14 06:00:00
13 4 2012-12-12 09:00:00 2012-12-12 02:00:00
14 2 2013-12-12 07:00:00 2012-12-12 03:00:00
15 2 2012-12-14 08:00:00 2012-12-14 05:00:00
From the above table, I need to get the req_time of the users who have requested the same time_start.
The duplicate time_start are
2012-12-12 09:00:00 by user_id 11,13.
2012-12-14 08:00:00 by user_id 12,15.
Each of their request times is different.
I want a query that gives me the result as:
req_time of user requested for the time_start 2012-12-12 09:00:00 are:-
2012-12-11 09:00:00
2012-12-12 02:00:00
req_time of user requested for the time_start 2012-12-14 08:00:00 are:-
2012-12-14 06:00:00
2012-12-14 05:00:00
I have used a query:-
SELECT req_time
FROM user_req
WHERE user_id IN (
    SELECT o.user_id
    FROM user_req o
    INNER JOIN (
        SELECT starttime, COUNT(*) AS dupeCount
        FROM user_req
        GROUP BY starttime
        HAVING COUNT(*) > 1
    ) oc ON o.starttime = oc.starttime
)
ORDER BY req_time ASC;
And this prints all the req_time together for all the duplicate time_start values..
The output will be :-
2012-12-11 09:00:00
2012-12-12 02:00:00
2012-12-14 06:00:00
2012-12-14 05:00:00
Can I have a query that groups these req_time values by each duplicate time_start, as explained above?
Then I can call it from Java and use it in my program.
Please help me.
Try this:
select * from user_req where time_start in
(select time_start
from user_req
group by time_start
having count(time_start) > 1)
order by time_start, req_time
This will return the records whose time_start occurs more than once in the table, ordered by time_start and req_time. If you only want those two columns, replace the select * with the appropriate column names.
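Since the rows come back sorted by time_start, splitting them into one list per time_start is a single pass in application code. A minimal sketch follows, written in Python rather than the Java the question mentions and using sqlite3 as a stand-in for MySQL; the column name no_of_mach is a hypothetical spelling of the question's "no:of_mach".

```python
import sqlite3
from itertools import groupby

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_req (user_id INTEGER, no_of_mach INTEGER, time_start TEXT, req_time TEXT)"
)
conn.executemany("INSERT INTO user_req VALUES (?, ?, ?, ?)", [
    (11, 3, "2012-12-12 09:00:00", "2012-12-11 09:00:00"),
    (12, 4, "2012-12-14 08:00:00", "2012-12-14 06:00:00"),
    (13, 4, "2012-12-12 09:00:00", "2012-12-12 02:00:00"),
    (14, 2, "2013-12-12 07:00:00", "2012-12-12 03:00:00"),
    (15, 2, "2012-12-14 08:00:00", "2012-12-14 05:00:00"),
])

query = """
    SELECT time_start, req_time
    FROM user_req
    WHERE time_start IN (SELECT time_start
                         FROM user_req
                         GROUP BY time_start
                         HAVING COUNT(time_start) > 1)
    ORDER BY time_start, req_time
"""
rows = conn.execute(query).fetchall()

# groupby only merges adjacent rows, so it relies on the ORDER BY above.
grouped = {
    time_start: [req_time for _, req_time in group]
    for time_start, group in groupby(rows, key=lambda r: r[0])
}
for time_start, req_times in grouped.items():
    print(time_start, req_times)
```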

Double ORDER BY sort with UNION statement

(
SELECT *
FROM (
SELECT d
FROM myTable
WHERE id = "4h"
AND d < "2011-12-08 12:00:00"
ORDER BY d DESC
LIMIT 10
)tmp
ORDER BY d ASC
)
UNION (
SELECT d
FROM myTable
WHERE id = "4h"
AND d >= "2011-12-08 12:00:00"
ORDER BY d ASC
LIMIT 10
)
I'm trying to get the 10 results before and after a particular ID by using two SELECT statements and a UNION. The first SELECT uses ORDER BY DESC to get the 10 preceding rows, and then I attempt to wrap that in a second ORDER BY ASC to get all the results in ASC order, but for some reason it does not work.
Here is what I get currently for a result:
d
2011-12-08 08:00:00
2011-12-08 04:00:00
2011-12-08 00:00:00
2011-12-07 20:00:00
2011-12-07 16:00:00
2011-12-07 12:00:00
2011-12-07 08:00:00
2011-12-07 04:00:00
2011-12-07 00:00:00
2011-12-06 20:00:00 <- These top 10 results should be ASC!
2011-12-08 12:00:00
2011-12-08 16:00:00
2011-12-08 20:00:00
2011-12-09 00:00:00
2011-12-09 04:00:00
2011-12-09 08:00:00
2011-12-09 12:00:00
2011-12-09 16:00:00
2011-12-09 20:00:00
2011-12-11 20:00:00
And here is what I want:
d
2011-12-06 20:00:00
2011-12-07 00:00:00
2011-12-07 04:00:00
2011-12-07 08:00:00
2011-12-07 12:00:00
2011-12-07 16:00:00
2011-12-07 20:00:00
2011-12-08 00:00:00
2011-12-08 04:00:00
2011-12-08 08:00:00
2011-12-08 12:00:00
2011-12-08 16:00:00
2011-12-08 20:00:00
2011-12-09 00:00:00
2011-12-09 04:00:00
2011-12-09 08:00:00
2011-12-09 12:00:00
2011-12-09 16:00:00
2011-12-09 20:00:00
2011-12-11 20:00:00
(
SELECT d
FROM myTable
WHERE id = '4h' AND d < '2011-12-08 12:00:00'
ORDER BY d DESC
LIMIT 10
) UNION ALL (
SELECT d
FROM myTable
WHERE id = '4h' AND d >= '2011-12-08 12:00:00'
ORDER BY d ASC
LIMIT 10
)
ORDER BY d ASC
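The idea in the query above is that an ORDER BY inside a parenthesized branch only decides which rows LIMIT keeps; it says nothing about the order of the combined result, so the final ordering has to be a single ORDER BY after the last parenthesis. Below is a rough sketch of the same pattern, assuming SQLite through Python's sqlite3. SQLite does not accept parenthesized branches in a UNION, so each branch is wrapped in a subquery instead, and the rows are generated sample data at 4-hour steps, not the asker's table.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (id TEXT, d TEXT)")

# Hypothetical data: 48 timestamps, 4 hours apart, starting 2011-12-05 00:00:00.
start = datetime(2011, 12, 5)
conn.executemany("INSERT INTO myTable VALUES (?, ?)", [
    ("4h", (start + timedelta(hours=4 * i)).strftime("%Y-%m-%d %H:%M:%S"))
    for i in range(48)
])

# Inner ORDER BY + LIMIT pick the rows; the outer ORDER BY sorts the union.
query = """
    SELECT d FROM (
        SELECT d FROM myTable
        WHERE id = '4h' AND d < :pivot
        ORDER BY d DESC
        LIMIT 10
    )
    UNION ALL
    SELECT d FROM (
        SELECT d FROM myTable
        WHERE id = '4h' AND d >= :pivot
        ORDER BY d ASC
        LIMIT 10
    )
    ORDER BY d ASC
"""
result = [row[0] for row in conn.execute(query, {"pivot": "2011-12-08 12:00:00"})]
for d in result:
    print(d)
```

UNION ALL is enough here because the two branches cover disjoint ranges of d, so there are no duplicates for a plain UNION to remove.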

MySQL Count Numbers Are Off

I am not sure why my numbers are drastically off from each other.
A query with no max id:
SELECT id, DATE_FORMAT(t_stamp, '%Y-%m-%d %H:00:00') as date, COUNT(*) as count
FROM test_ips
WHERE id > 0
AND viewip != ""
GROUP BY HOUR(t_stamp)
ORDER BY t_stamp ASC;
I get:
1 2012-07-18 19:00:00 1313
106 2012-07-18 20:00:00 1567
107 2012-07-19 09:00:00 847
225 2012-07-19 10:00:00 5095
421 2012-07-19 11:00:00 205
423 2012-07-19 12:00:00 900
461 2012-07-19 13:00:00 619
490 2012-07-20 15:00:00 729
575 2012-07-20 16:00:00 1682
1060 2012-07-20 17:00:00 2063
2260 2012-07-20 18:00:00 1417
5859 2012-07-20 21:00:00 1303
7060 2012-07-20 22:00:00 1340
8280 2012-07-20 23:00:00 1211
9149 2012-07-21 00:00:00 1675
10418 2012-07-21 01:00:00 721
11127 2012-07-21 02:00:00 825
But if I add a max id:
AND id <= 8279
I get:
1 2012-07-18 19:00:00 1313
106 2012-07-18 20:00:00 1201
107 2012-07-19 09:00:00 118
225 2012-07-19 10:00:00 196
421 2012-07-19 11:00:00 2
423 2012-07-19 12:00:00 38
461 2012-07-19 13:00:00 20
490 2012-07-20 15:00:00 85
575 2012-07-20 16:00:00 483
1060 2012-07-20 17:00:00 1200
2260 2012-07-20 18:00:00 1200
5859 2012-07-20 21:00:00 1201
7060 2012-07-20 22:00:00 1220
The numbers are WAY off from each other. Something is goofy.
EDIT: Here is my table structure:
id t_stamp bID viewip unique
1 2012-07-18 19:22:20 5 192.168.1.1 1
2 2012-07-18 19:22:21 1 192.168.1.1 1
3 2012-07-18 19:22:22 5 192.168.1.1 0
4 2012-07-18 19:22:22 3 192.168.1.1 1
You are not grouping by ID and I think you intend to.
Try:
SELECT id, DATE_FORMAT(t_stamp, '%Y-%m-%d %H:00:00') as date, COUNT(*) as count
FROM test_ips
WHERE id > 0
AND viewip != ""
GROUP BY id, DATE_FORMAT(t_stamp, '%Y-%m-%d %H:00:00')
ORDER BY t_stamp;
Your query is not consistent.
In your select statement you are displaying the full date.
But you are grouping your data by the hour. So your count statement is taking the count of all the data for each hour of the day.
As an example take your first result:
1 2012-07-18 19:00:00 1313
The count of 1313 contains the records for all of your dates (7/18, 7/19, 7/20, 7/21, 7/22, etc.) that have an hour of 19:00.
But the way you have your query set up, it looks like it should be the count of all records for 2012-07-18 19:00:00.
So when you add AND id <= 8279, the rows from 7/21 and some of those from 7/20 are no longer counted, and your count values are lower.
I'm guessing you mean to group by the date and hour, not just the hour.
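The effect is easy to reproduce on a handful of rows. The sketch below assumes SQLite (via Python's sqlite3), with strftime() standing in for MySQL's HOUR() and DATE_FORMAT(); the five rows are made-up illustration data, not the asker's table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_ips (id INTEGER, t_stamp TEXT, viewip TEXT)")
conn.executemany("INSERT INTO test_ips VALUES (?, ?, ?)", [
    (1, "2012-07-18 19:22:20", "192.168.1.1"),
    (2, "2012-07-18 19:40:00", "192.168.1.1"),
    (3, "2012-07-19 19:05:00", "192.168.1.1"),
    (4, "2012-07-20 19:10:00", "192.168.1.1"),
    (5, "2012-07-20 20:00:00", "192.168.1.1"),
])

def counts(group_expr, max_id=None):
    """Count rows per group_expr, optionally capped at max_id."""
    where = "WHERE id > 0 AND viewip != ''"
    params = ()
    if max_id is not None:
        where += " AND id <= ?"
        params = (max_id,)
    sql = (f"SELECT {group_expr} AS bucket, COUNT(*) FROM test_ips "
           f"{where} GROUP BY {group_expr} ORDER BY bucket")
    return conn.execute(sql, params).fetchall()

# Hour only: every day's 19:00 rows land in the same bucket.
by_hour = counts("strftime('%H', t_stamp)")
# The same grouping with an id cap silently drops the later days.
by_hour_capped = counts("strftime('%H', t_stamp)", max_id=2)
# Date and hour: one bucket per calendar hour.
by_date_hour = counts("strftime('%Y-%m-%d %H:00:00', t_stamp)")

print(by_hour)
print(by_hour_capped)
print(by_date_hour)
```

Grouping by the full date-and-hour expression keeps each count tied to the timestamp shown next to it, so adding an id limit only affects the hours that actually contain the excluded rows.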