Efficiently finding the next event in SQL - with all columns - mysql

I found solutions that find the next event date, but none that also return all the data from that event. With cheating I can get it done, but that only works in MySQL and fails in Vertica.
Here is the problem I am trying to solve:
I want to show every event of type a together with the data from the first event X that follows it and is not of type a. Here is a cut-and-paste example so you can play with it and see what actually works:
CREATE TABLE events (user_id int ,created_at int, event varchar(20));
INSERT INTO events values (0,0, 'a');
INSERT INTO events values (0,1, 'b');
INSERT INTO events values (0,2, 'c');
INSERT INTO events values (0,3, 'a');
INSERT INTO events values (0,4, 'c');
INSERT INTO events values (0,5, 'b');
INSERT INTO events values (0,6, 'a');
INSERT INTO events values (0,7, 'a');
INSERT INTO events values (0,8, 'd');
SELECT * FROM events;
+---------+------------+-------+
| user_id | created_at | event |
+---------+------------+-------+
| 0 | 0 | a |
| 0 | 1 | b |
| 0 | 2 | c |
| 0 | 3 | a |
| 0 | 4 | c |
| 0 | 5 | b |
| 0 | 6 | a |
| 0 | 7 | a |
| 0 | 8 | d |
+---------+------------+-------+
9 rows in set (0.00 sec)
Here is the result I know how to get in both, but I cannot seem to be able to get the event info in it as well:
SELECT user_id, MAX(purchased) AS purchased, spent
FROM (
SELECT
e1.user_id AS user_id, e1.created_at AS purchased,
MIN(e2.created_at) AS spent
FROM events e1, events e2
WHERE
e1.user_id = e2.user_id AND e1.created_at <= e2.created_at AND
e1.Event = 'a' AND e2.Event != 'a'
GROUP BY e1.user_id, e1.created_at
) e3 GROUP BY user_id, spent;
user_id | purchased | spent
---------+-----------+-------
0 | 0 | 1
0 | 3 | 4
0 | 7 | 8
Now if I want the event type in there as well, the query above does not work, because the event field would have to go either into the GROUP BY (not what we want) or into an aggregate (not what we want either). Funnily enough it works in MySQL, but I consider that cheating, and since I have to use Vertica for this, it won't help me:
SELECT user_id, MAX(purchased) as purchased, spent, event FROM (
SELECT
e1.User_ID AS user_id,
e1.created_at AS purchased,
MIN(e2.created_at) AS spent,
e2.event AS event
FROM events e1, events e2
WHERE
e1.user_id = e2.user_id AND e1.created_at <= e2.created_at AND
e1.Event = 'a' AND e2.Event != 'a'
GROUP BY
e1.user_id,e1.created_at
) e3 GROUP BY user_id, spent;
+---------+-----------+-------+-------+
| user_id | purchased | spent | event |
+---------+-----------+-------+-------+
| 0 | 0 | 1 | b |
| 0 | 3 | 4 | c |
| 0 | 7 | 8 | d |
+---------+-----------+-------+-------+
3 rows in set (0.00 sec)
In Vertica the same query throws an error:
ERROR 2640: Column "e2.event" must appear in the GROUP BY clause or be used in an aggregate function
What's an elegant solution to get the two events paired up with all their columns, without cheating, so that I get the same result as shown above when executing in Vertica or some other database that does not allow the cheat? In the sample data there is exactly one additional column I want, the event type, but in the real world it would be two or three columns.
Please try it out with the sample data posted before answering :)

Ok, while I'm not 100% sure I understand what you're trying to do, see if this won't work:
SELECT e3.user_id, MAX(e3.purchased) AS purchased, e3.spent, e.event
FROM (
SELECT
e1.user_id AS user_id, e1.created_at AS purchased,
MIN(e2.created_at) AS spent
FROM events e1, events e2
WHERE
e1.user_id = e2.user_id AND e1.created_at <= e2.created_at AND
e1.Event = 'a' AND e2.Event != 'a'
GROUP BY e1.user_id, e1.created_at
) e3
JOIN events e on e3.user_id = e.user_id and e3.spent = e.created_at
GROUP BY e3.user_id, e3.spent, e.event
Essentially I'm just joining on the event table again assuming user_id and created_at are your primary key.
And here is the SQL Fiddle.
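For anyone who wants to try the join-back idea without a MySQL or Vertica instance, here is a minimal sketch that runs it against an in-memory SQLite database from Python. SQLite is only a stand-in here (an assumption on my part, not what the question uses), and the query relies on (user_id, created_at) being unique, as assumed above.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL/Vertica (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INT, created_at INT, event VARCHAR(20))")

# Same sample data as in the question: events a,b,c,a,c,b,a,a,d at times 0..8.
rows = [(0, t, e) for t, e in enumerate("abcacbaad")]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

query = """
SELECT e3.user_id, MAX(e3.purchased) AS purchased, e3.spent, e.event
FROM (
    -- for every 'a' event, the time of the first later non-'a' event
    SELECT e1.user_id, e1.created_at AS purchased, MIN(e2.created_at) AS spent
    FROM events e1
    JOIN events e2
      ON e1.user_id = e2.user_id AND e1.created_at <= e2.created_at
    WHERE e1.event = 'a' AND e2.event <> 'a'
    GROUP BY e1.user_id, e1.created_at
) e3
-- join back to pick up the remaining columns of the "next" event
JOIN events e ON e.user_id = e3.user_id AND e.created_at = e3.spent
GROUP BY e3.user_id, e3.spent, e.event
ORDER BY purchased
"""
pairs = conn.execute(query).fetchall()
print(pairs)
```

Every non-aggregated column appears in the GROUP BY, which is what stricter engines insist on. Any further columns of the following event can be added to both the SELECT list and the GROUP BY in the same way.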

Try this...
With Cte As
(
Select Row_Number() Over (Partition By [user_id] Order By [created_at]) As row_num,
[user_id],
[created_at],
[event]
From [events]
)
Select c1.[user_id],
c1.[created_at] As purchased,
c2.[created_at] As spent,
c2.[event]
From Cte c1
Left Join Cte c2
On c1.row_num = c2.row_num - 1
Where c1.event = 'a'
And c2.event <> 'a'

I usually do the "next" calculations using correlated subqueries, and then join back to the original table. In this case, I am assuming that (user_id, created_at) uniquely defines a row.
Here is the query:
SELECT user_id, MAX(purchased) AS purchased, spent, event
FROM (
    SELECT e.user_id, e.created_at AS purchased,
           MIN(enext.created_at) AS spent,
           MIN(enext.event) AS event
    FROM (SELECT e.*,
                 (SELECT MIN(e2.created_at)
                  FROM events e2
                  WHERE e2.user_id = e.user_id AND e2.created_at > e.created_at AND e2.event <> 'a'
                 ) AS nextcreatedat
          FROM events e
          WHERE e.event = 'a'
         ) e LEFT OUTER JOIN
         events enext
         ON e.user_id = enext.user_id AND
            e.nextcreatedat = enext.created_at
    GROUP BY e.user_id, e.created_at
) e3
GROUP BY user_id, spent, event;
The aggregation GROUP BY e.user_id, e.created_at is not necessary, but I've left it in to remain similar to the original query.
Because Vertica supports cumulative sums, there is a way to do this more efficiently, but it would not work in MySQL.

Related

How to get entries which are grouped and satisfy restriction within the group?

In the table REPORT there are following 3 columns:
RID - report id, which is not unique
TYPE - can be either new or cancel, report is identified by RID, so one report can have multiple cancellations and multiple new entries
TIMESTAMP - is just a timestamp of when entry arrived
Example below:
RID | Type | TIMESTAMP |
-----+---------+-------------------+
4 | New | 2019-10-27 10:35 |
4 | Cancel | 2019-10-27 09:35 |
3 | Cancel | 2019-10-27 07:35 |
2 | New | 2019-10-27 07:35 |
1 | Cancel | 2019-10-27 09:35 |
1 | Cancel | 2019-10-27 08:35 |
1 | New | 2019-10-27 07:35 |
I'd like to get all reports which at some point were created and then canceled, so that the latest state is canceled.
It is possible to have cancellations of non-existent reports, or entries that were canceled first and created afterwards; all of those should be excluded.
My attempt so far uses a nested query to get all cancellations that have a corresponding new entry, but I do not know how to take the timestamps into account in order to exclude entries with the sequence cancel->new:
SELECT
RID
FROM
REPORT
WHERE
TYPE = 'Cancel'
AND RID IN (
SELECT
RID
FROM
REPORT
WHERE
TYPE = 'New'
);
My expectation from the example is to get RID 1, I'm interested in only RIDs.
Using: MySQL 5.7.27 InnoDB
With EXISTS:
select distinct r.rid
from report r
where r.type = 'Cancel'
and exists (
select 1 from report
where rid = r.rid and type='New' and timestamp < r.timestamp
)
See the demo.
Or:
select rid
from report
group by rid
having
min(case when type = 'New' then timestamp end) <
min(case when type = 'Cancel' then timestamp end)
See the demo.
Results:
| rid |
| --- |
| 1 |
I would get the latest type using a correlated subquery. Then check for "cancel":
select t.*
from t
where t.timestamp = (select max(t2.timestamp)
from t t2
where t2.rid = t.rid
) and
t.type = 'Cancel';
If you just want the rid and date, then another fun way uses aggregation:
select rid, max(timestamp)
from t
group by rid
having max(timestamp) = max(case when type = 'Cancel' then timestamp end);
The logic here is to get timestamps where the maximum date is also the maximum cancel date.
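If cancellations of reports that never had a New entry must be excluded too (RID 3 in the sample), one possible variation adds a second HAVING condition. This is only a sketch, run against an in-memory SQLite database from Python rather than MySQL 5.7; the column name ts and the lowercase table name are assumptions made for the example.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE report (rid INT, type TEXT, ts TEXT)")
conn.executemany("INSERT INTO report VALUES (?, ?, ?)", [
    (4, "New",    "2019-10-27 10:35"),
    (4, "Cancel", "2019-10-27 09:35"),
    (3, "Cancel", "2019-10-27 07:35"),
    (2, "New",    "2019-10-27 07:35"),
    (1, "Cancel", "2019-10-27 09:35"),
    (1, "Cancel", "2019-10-27 08:35"),
    (1, "New",    "2019-10-27 07:35"),
])

query = """
SELECT rid
FROM report
GROUP BY rid
-- the latest entry is a cancellation ...
HAVING MAX(ts) = MAX(CASE WHEN type = 'Cancel' THEN ts END)
-- ... and some New entry exists before that latest entry
   AND MIN(CASE WHEN type = 'New' THEN ts END) < MAX(ts)
ORDER BY rid
"""
cancelled = [row[0] for row in conn.execute(query)]
print(cancelled)
```

When a RID has no New rows, the CASE expression yields only NULLs, the comparison evaluates to NULL, and the group is dropped. Ties between a New and a Cancel with the same timestamp are not handled specially here.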
Off the top of my head, probably with a few typos, and probably not the best, but it should be easy to understand:
SELECT n.RID FROM (
    SELECT RID, TYPE, MIN(TIMESTAMP) AS FirstAdd
    FROM REPORT
    WHERE TYPE = 'New'
    GROUP BY RID, TYPE) n INNER JOIN (
    SELECT RID, TYPE, MAX(TIMESTAMP) AS LastCancel
    FROM REPORT
    WHERE TYPE = 'Cancel'
    GROUP BY RID, TYPE) c ON n.RID = c.RID
    AND n.FirstAdd < c.LastCancel

Converting IN to EXISTS for counting users who have taken multiple kinds of actions

I have a table full of users, timestamps, and different types of actions. Let's call them type A, B, and C:
| User ID | Date | ActionType |
--------------------------------------
| 1 | 10/2/14 | A |
| 2 | 10/12/14 | A |
| 3 | 11/1/14 | B |
| 1 | 11/15/14 | B |
| 2 | 12/2/14 | C |
I'm trying to get counts of the number of users who have taken combinations of different action types within a time period -- for example, the number of users who have done both action A and action B between October and December.
This code works (for one combination of actions at a time), but takes forever to run:
SELECT
COUNT(DISTINCT `cm0`.`User ID`) AS `Users`
FROM `mytable` AS `cm0`
WHERE
(`cm0`.`User ID` IN (SELECT `cm1`.`User ID` FROM `mytable` AS `cm1` WHERE
(`cm1`.`ActionType` = 'A' AND (`cm1`.`Date` BETWEEN dateA AND
dateB)))
AND (`cm0`.`ActionType` = 'B')
AND (`cm0`.`Date` BETWEEN dateA AND dateB))
I researched ways to do this using common table expressions, and then realized I couldn't use those in MySQL. Now I'm trying to figure out how to optimize with EXISTS instead of IN, but I'm having trouble fitting the examples to what I need. Any help would be much appreciated!
Try this:
SELECT COUNT(*) AS Users
FROM (
    SELECT cm0.User_ID
    FROM mytable AS cm0
    WHERE cm0.ActionType IN ('A', 'B') AND cm0.Date BETWEEN dateA AND dateB
    GROUP BY cm0.User_ID
    HAVING COUNT(DISTINCT cm0.ActionType) = 2
) AS both_actions;
The above query will return the number of users who have done both action A and action B between October and December, but if you still want to use EXISTS then check below query:
SELECT COUNT(DISTINCT cm0.User_ID) AS Users
FROM mytable AS cm0
WHERE EXISTS (SELECT 1 FROM mytable AS cm1
WHERE cm0.User_ID = cm1.User_ID AND cm1.ActionType = 'A' AND cm1.Date BETWEEN dateA AND dateB
) AND cm0.ActionType = 'B' AND cm0.Date BETWEEN dateA AND dateB
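To check that the grouped form and the EXISTS form agree, here is a small sketch that runs both on the sample rows in an in-memory SQLite database from Python. SQLite instead of MySQL, the snake_case column names, the ISO date strings and the concrete date range are all assumptions made for the example.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (user_id INT, date TEXT, action_type TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", [
    (1, "2014-10-02", "A"),
    (2, "2014-10-12", "A"),
    (3, "2014-11-01", "B"),
    (1, "2014-11-15", "B"),
    (2, "2014-12-02", "C"),
])

# Users with both action types in the range: group per user, then count the groups.
grouped = """
SELECT COUNT(*) FROM (
    SELECT user_id
    FROM mytable
    WHERE action_type IN ('A', 'B') AND date BETWEEN :lo AND :hi
    GROUP BY user_id
    HAVING COUNT(DISTINCT action_type) = 2
) AS both_actions
"""

# Same question with a correlated EXISTS instead of IN.
exists = """
SELECT COUNT(DISTINCT b.user_id)
FROM mytable AS b
WHERE b.action_type = 'B' AND b.date BETWEEN :lo AND :hi
  AND EXISTS (SELECT 1 FROM mytable AS a
              WHERE a.user_id = b.user_id AND a.action_type = 'A'
                AND a.date BETWEEN :lo AND :hi)
"""

params = {"lo": "2014-10-01", "hi": "2014-12-31"}
(by_group,) = conn.execute(grouped, params).fetchone()
(by_exists,) = conn.execute(exists, params).fetchone()
print(by_group, by_exists)
```

Only user 1 has both an A and a B in the range, so both queries count one user. The grouped form scans the table once, which is often the cheaper plan, although that depends on the engine and the available indexes.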

MySQL - Count Values occurring between other values

I'd like to count how many occurrences of a value happen before a specific value
Below is my starting table
+-----------------+--------------+------------+
| Id | Activity | Time |
+-----------------+--------------+------------+
| 1 | Click | 1392263852 |
| 2 | Error | 1392263853 |
| 3 | Finish | 1392263862 |
| 4 | Click | 1392263883 |
| 5 | Click | 1392263888 |
| 6 | Finish | 1392263952 |
+-----------------+--------------+------------+
I'd like to count how many clicks happen before a finish happens.
I've got a very roundabout way of doing it where I write a function to find the last finished activity and query the clicks between the finishes. I also repeat this for Error.
What I'd like to achieve is the below table
+-----------------+--------------+------------+--------------+------------+
| Id | Activity | Time | Clicks | Error |
+-----------------+--------------+------------+--------------+------------+
| 3 | Finish | 1392263862 | 1 | 1 |
| 6 | Finish | 1392263952 | 2 | 0 |
+-----------------+--------------+------------+--------------+------------+
This table is very long so I'm looking for an efficient solution.
If anyone has any ideas.
Thanks heaps!
This is a complicated problem. Here is an approach to solving it. The rows belonging to each "finish" record need to be identified as one group, by assigning a group identifier to them. This identifier can be calculated by counting the number of "finish" records whose id is greater than or equal to the row's id, so each "finish" row lands in the same group as the rows that precede it.
Once this is assigned, your results can be calculated using an aggregation.
The group identifier can be calculated using a correlated subquery:
select max(id) as id, 'Finish' as Activity, max(time) as Time,
       sum(Activity = 'Click') as Clicks, sum(activity = 'Error') as Error
from (select s.*,
             (select sum(s2.activity = 'Finish')
              from starting s2
              where s2.id >= s.id
             ) as FinishCount
      from starting s
     ) s
group by FinishCount;
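Here is a minimal sketch of the same grouping idea, run against an in-memory SQLite database from Python so it can be tried without a MySQL server. SQLite is an assumption on my part; like MySQL, it evaluates a comparison to 0 or 1, so SUM(activity = '...') works as a conditional count. The aliases last_id, finish_time and finish_count are made up for the example.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE starting (id INT, activity TEXT, time INT)")
conn.executemany("INSERT INTO starting VALUES (?, ?, ?)", [
    (1, "Click",  1392263852),
    (2, "Error",  1392263853),
    (3, "Finish", 1392263862),
    (4, "Click",  1392263883),
    (5, "Click",  1392263888),
    (6, "Finish", 1392263952),
])

query = """
SELECT MAX(id) AS last_id, 'Finish' AS label, MAX(time) AS finish_time,
       SUM(activity = 'Click') AS clicks, SUM(activity = 'Error') AS errors
FROM (
    SELECT s.*,
           -- group identifier: number of Finish rows at or after this row
           (SELECT SUM(s2.activity = 'Finish')
            FROM starting s2
            WHERE s2.id >= s.id) AS finish_count
    FROM starting s
) grouped
GROUP BY finish_count
ORDER BY last_id
"""
result = conn.execute(query).fetchall()
print(result)
```

Rows 1 to 3 get finish_count 2 and rows 4 to 6 get finish_count 1, so each group ends with its Finish row. The correlated subquery is evaluated once per row, so on a very long table an index on id (or a window function, where available) is worth considering.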
A version that leverages user(session) variables
SELECT MAX(id) id,
       MAX(activity) activity,
       MAX(time) time,
       SUM(activity = 'Click') clicks,
       SUM(activity = 'Error') error
FROM
(
    SELECT t.*, @g := IF(activity <> 'Finish' AND @a = 'Finish', @g + 1, @g) g, @a := activity
    FROM table1 t CROSS JOIN (SELECT @g := 0, @a := NULL) i
    ORDER BY time
) q
GROUP BY g
Output:
| ID | ACTIVITY | TIME | CLICKS | ERROR |
|----|----------|------------|--------|-------|
| 3 | Finish | 1392263862 | 1 | 1 |
| 6 | Finish | 1392263952 | 2 | 0 |
Here is SQLFiddle demo
Try:
select x.id
, x.activity
, x.time
, sum(case when y.activity = 'Click' then 1 else 0 end) as clicks
, sum(case when y.activity = 'Error' then 1 else 0 end) as errors
from tbl x, tbl y
where x.activity = 'Finish'
and y.time < x.time
and (y.time > (select max(z.time) from tbl z where z.activity = 'Finish' and z.time < x.time)
or x.time = (select min(z.time) from tbl z where z.activity = 'Finish'))
group by x.id
, x.activity
, x.time
order by x.id
Here's another method of using variables, which is somewhat different to @peterm's:
SELECT
Id,
Activity,
Time,
Clicks,
Errors
FROM (
SELECT
t.*,
@clicks := @clicks + (activity = 'Click') AS Clicks,
@errors := @errors + (activity = 'Error') AS Errors,
@clicks := @clicks * (activity <> 'Finish'),
@errors := @errors * (activity <> 'Finish')
FROM
`starting` t
CROSS JOIN
(SELECT @clicks := 0, @errors := 0) i
ORDER BY
time
) AS s
WHERE Activity = 'Finish'
;
What's similar to Peter's query is that this one uses a subquery that's returning all the rows, setting some variables along the way and returning the variables' values as columns. That may be common to most methods that use variables, though, and that's where the similarity between these two queries ends.
The difference is in how the accumulated results are calculated. Here all the accumulation is done in the subquery, and the main query merely filters the derived dataset on Activity = 'Finish' to return the final result set. In contrast, the other query uses grouping and aggregation at the outer level to get the accumulated results, which may make it slower than mine in comparison.
At the same time Peter's suggestion is more easily scalable in terms of coding. If you happen to have to extend the number of activities to account for, his query would only need expansion in the form of adding one SUM(activity = '...') AS ... per new activity to the outer SELECT, whereas in my query you would need to add a variable and several expressions, as well as a column in the outer SELECT, per every new activity, which would bloat the resulting code much more quickly.

How to avoid two results of the same row in join

I have three VirtueMart tables: one for orders, one for order items, and one for order user info (order_userinfos).
To get the user's first name I need to join the order_userinfos table,
but when I join it, the result is doubled. The query and the result are below; please advise how I can avoid the doubled result.
This is the join I use for the first name:
LEFT JOIN `urbanite_virtuemart_order_userinfos` as Uinfo ON Uinfo.virtuemart_order_id=i.virtuemart_order_id
And this is the complete query:
SELECT SQL_CALC_FOUND_ROWS
    o.created_on AS intervals,
    CAST( i.`created_on` AS DATE ) AS created_on,
    Uinfo.`first_name`,
    o.`order_number`,
    SUM(DISTINCT i.product_item_price * product_quantity) as order_subtotal_netto,
    SUM(DISTINCT i.product_subtotal_with_tax) as order_subtotal_brutto,
    COUNT(DISTINCT i.virtuemart_order_id) as count_order_id,
    SUM(i.product_quantity) as product_quantity
FROM `urbanite_virtuemart_order_items` as i
LEFT JOIN `urbanite_virtuemart_orders` as o ON o.virtuemart_order_id=i.virtuemart_order_id
LEFT JOIN `urbanite_virtuemart_order_userinfos` as Uinfo ON Uinfo.virtuemart_order_id=i.virtuemart_order_id AND Uinfo.created_on = i.created_on AND Uinfo.virtuemart_user_id = o.virtuemart_user_id
WHERE (`i`.`order_status` = "S") AND i.virtuemart_vendor_id = "63" AND DATE( o.created_on ) BETWEEN "2013-06-01 05:00:00" AND "2013-06-30 05:00:00"
GROUP BY intervals
ORDER BY created_on DESC LIMIT 0, 400
The result I get without the join looks like this:
intervals | Created_on | order_no | order_subtotalnetto | order_subtotalbruto | count_order_id | product_quantity
2013-06-12 09:47:16 | 2013-06-12 | 43940624 | 200.00000 | 200.00000 | 1 | 2
The result I get with the join for the first name looks like this:
intervals | Created_on | f_name | order_no | order_subtotalnetto | order_subtotalbruto | count_order_id | product_quantity
2013-06-12 09:47:16 | 2013-06-12 | Fatin Bokhari | 43940624 | 200.00000 | 200.00000 | 1 | 4
Without the join for the first name, product_quantity is 2, but with the join the value is doubled. I tried DISTINCT, but I can't go that way because it shows product_quantity = 1 every time.
Any help would be appreciated!
It turned out that each order has two rows in the urbanite_virtuemart_order_userinfos table, so I added a condition on address_type to the WHERE clause and it works:
WHERE (`i`.`order_status` = "S") AND i.virtuemart_vendor_id = "63" AND DATE( o.created_on ) BETWEEN "2013-06-01 05:00:00" AND "2013-06-30 05:00:00" AND Uinfo.`address_type` = 'BT'
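The doubling can be reproduced in miniature. The sketch below uses an in-memory SQLite database from Python with two hypothetical, heavily simplified tables (items and userinfos are made-up names, not the real VirtueMart schema), and it assumes the second userinfos row carries address type 'ST'.

```python
import sqlite3

# Hypothetical, simplified stand-ins for the order item and user info tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (order_id INT, quantity INT);
CREATE TABLE userinfos (order_id INT, first_name TEXT, address_type TEXT);
INSERT INTO items VALUES (1, 2);
-- two userinfos rows for the same order ('ST' is an assumed second type)
INSERT INTO userinfos VALUES (1, 'Ann', 'BT'), (1, 'Ann', 'ST');
""")

base = """
SELECT SUM(i.quantity)
FROM items i
LEFT JOIN userinfos u ON u.order_id = i.order_id
"""
# Each item row matches two userinfos rows, so its quantity is summed twice.
(doubled,) = conn.execute(base).fetchone()
# Restricting the join to one address type leaves one match per item row.
(fixed,) = conn.execute(base + " AND u.address_type = 'BT'").fetchone()
print(doubled, fixed)
```

Here the filter sits in the ON clause rather than in WHERE. For orders that do have a 'BT' row both give the same sum, but the ON form keeps the LEFT JOIN from silently turning into an inner join when an order has no matching userinfos row.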

Count number of distinct rows for multiple values

Let's consider this table specifying how many times a person bought a property.
+--------+----------+
| user | property |
+--------+----------+
| john | car |
| john | car |
| john | house |
| peter | car |
| peter | car |
| amanda | house |
| amanda | house |
+--------+----------+
I need to know how many times a car was bought once, how many times a house was bought once, etc. Something like this:
+----------+---+---+
| property | 1 | 2 |
+----------+---+---+
| cars | 4 | 2 |
| house | 3 | 1 |
+----------+---+---+
How many times a car was bought? Four, two for peter and two for john.
How many times a car was bought twice? Two, for the same guys.
How many times a house was bought? Three, two for amanda and once for john.
How many times a house was bought twice? Only once, for amanda
Is this possible to do this only using SQL queries?
I don't care about performance or hackish ways.
There are more than two frequencies.
A person can buy a property at most a fixed number of times (5), so it's no problem to specify the columns manually in the query. I mean there is no problem doing something like:
SELECT /* ... */ AS 1, /* ... */ AS 2, /* ... */ AS 3 /* ... */
SELECT DISTINCT @pr := prop,
    (SELECT COUNT(1) FROM tbl WHERE prop = @pr LIMIT 1),
    (SELECT COUNT(1) FROM
        (SELECT *, COUNT(*) cnt
         FROM tbl
         GROUP BY usr, prop
         HAVING cnt = 2) as tmp
     WHERE `tmp`.prop = @pr LIMIT 1)
FROM tbl;
Yes, it is not the best method; but hey, you get the answers as desired.
Also, it'll generate the results for any kind of property in your table.
The fiddle link lies here.
P.S.: 60 tries O_O
I am here since you posted the question. Good one...
Here is a way to do it exactly as you asked for, with just groups and counts.
The trick is that I concatenate the user and property columns to produce a unique "id" for each, if we could call it that. It should work independently of the count of purchases.
SELECT C.`property`, COUNT(C.`property`), D.`pcount` from `purchases` C
LEFT JOIN(
SELECT A.`property`, B.`pcount` FROM `purchases` A
LEFT JOIN (
SELECT `property`,
CONCAT(`user`, `property`) as conc,
COUNT(CONCAT(`user`, `property`)) as pcount
FROM `purchases` GROUP BY CONCAT(`user`, `property`)
) B
ON A.`property` = B.`property`
GROUP BY B.pcount
) D
ON C.`property` = D.`property`
GROUP BY C.`property`
SQL Fiddle
MySQL 5.5.30 Schema Setup:
CREATE TABLE Table1
(`user` varchar(6), `property` varchar(5))
;
INSERT INTO Table1
(`user`, `property`)
VALUES
('john', 'car'),
('john', 'car'),
('john', 'house'),
('peter', 'car'),
('peter', 'car'),
('amanda', 'house'),
('amanda', 'house')
;
Query 1:
select t.property, t.total, c1.cnt as c1, c2.cnt as c2, c3.cnt as c3
from
(select
t.property ,
count(t.property) as total
from Table1 t
group by t.property
) as t
left join (
select property, count(*) as cnt
from (
select
property, user, count(*) as cnt
from table1
group by property, user
having count(*) = 1
) as i1
group by property
) as c1 on t.property = c1.property
left join (
select property, count(*) as cnt
from (
select
property, user, count(*) as cnt
from table1
group by property, user
having count(*) = 2
) as i2
group by property
) as c2 on t.property = c2.property
left join (
select property, count(*) as cnt
from (
select
property, user, count(*) as cnt
from table1
group by property, user
having count(*) = 3
) as i3
group by property
) as c3 on t.property = c3.property
Results:
| PROPERTY | TOTAL | C1 | C2 | C3 |
-------------------------------------------
| car | 4 | (null) | 2 | (null) |
| house | 3 | 1 | 1 | (null) |
You may try following.
SELECT COUNT(TABLE1.PROPERTY) AS COUNT, PROPERTY.USER FROM TABLE1
INNER JOIN (SELECT DISTINCT PROPERTY, USER FROM TABLE1) AS PROPERTY
ON PROPERTY.PROPERTY = TABLE1.PROPERTY
AND PROPERTY.USER = TABLE1.USER
GROUP BY TABLE1.USER, PROPERTY.PROPERTY
tested similar in MySQL
try this
SELECT property , count(property) as bought_total , count(distinct(user)) bought_per_user
FROM Table1
GROUP BY property
the output will be like that
PROPERTY | BOUGHT_TOTAL | BOUGHT_PER_USER
________________________________________________________
car | 4 | 2
house | 3 | 2
DEMO SQL FIDDLE HERE
You should be able to do this with sub-selects.
SELECT property, user, COUNT(*) FROM purchases GROUP BY property, user;
will return you the full set of grouped data that you want. You then need to look at the different frequencies:
SELECT property, freq, COUNT(*) FROM (SELECT property, user, COUNT(*) freq FROM purchases GROUP BY property, user) AS foo GROUP BY property, freq;
It's not quite in the format that you illustrated but it returns the data
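To get closer to the layout in the question, the per-user frequencies can be pivoted with conditional sums. This is a sketch only, run against an in-memory SQLite database from Python; SQLite instead of MySQL and the column aliases bought_once and bought_twice are assumptions made for the example.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user TEXT, property TEXT)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", [
    ("john", "car"), ("john", "car"), ("john", "house"),
    ("peter", "car"), ("peter", "car"),
    ("amanda", "house"), ("amanda", "house"),
])

query = """
SELECT property,
       SUM(freq)     AS total,         -- all purchases of this property
       SUM(freq = 1) AS bought_once,   -- users who bought it exactly once
       SUM(freq = 2) AS bought_twice   -- users who bought it exactly twice
FROM (
    SELECT property, user, COUNT(*) AS freq
    FROM purchases
    GROUP BY property, user
) AS per_user
GROUP BY property
ORDER BY property
"""
pivot = conn.execute(query).fetchall()
print(pivot)
```

Since the number of possible frequencies is fixed, further columns such as SUM(freq = 3) can be added by hand in the same pattern.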
I hope this helps. Let us create a table first:
create table prop(user varchar(max), property varchar(max))
insert into prop values ('john','car'), ('john','car'),
('john','house'), ('peter','car'),
('peter','car'), ('amanda','house'),
('amanda','house')
1)how many times car was bought?
ANS: select count(property) from prop where property = 'car'
(4)
2)How many times a car was bought twice?
ANS: select user,COUNT(property) from prop where property = 'car' group by user
having COUNT(property) = 2
2-john
2-peter
3)How many times a house was bought?
ANS: select COUNT(property) from prop where property = 'house'
(3)
4)How many times a house was bought twice?
ANS: select user,COUNT(property) from prop where property='house' group by user
having COUNT(property) <= 2
2-amanda
1-john