How to exclude rows based on certain values and duplicates - mysql

We have data related to subscription events - create, update, delete, etc. I want to be able to query on this data based on certain values to determine if a given user was active on a given date based on the events logged. I have the following table: (SQL fiddle here)
CREATE TABLE events (
eid varchar(45) NOT NULL,
cid varchar(45) DEFAULT NULL,
sid varchar(45) DEFAULT NULL,
event_type varchar(45) DEFAULT NULL,
period_start datetime DEFAULT NULL,
period_end datetime DEFAULT NULL,
date date DEFAULT NULL,
datetime datetime DEFAULT NULL,
PRIMARY KEY (eid)
);
with the following example data:
INSERT INTO events
(eid, cid, sid, event_type, period_start, period_end, date, datetime)
VALUES
('event_1', 'customer_1', 'subscription_456', 'created', '2016-03-11 17:38:50', '2016-09-11 18:38:50', '2016-03-11', '2016-03-11 17:38:51'),
('event_2', 'customer_1', 'subscription_456', 'updated', '2016-09-11 18:38:50', '2017-03-11 17:38:50', '2016-09-11', '2016-09-11 18:46:04'),
('event_3', 'customer_1', 'subscription_456', 'deleted', '2016-09-11 18:38:50', '2017-03-11 17:38:50', '2016-09-11', '2016-09-11 22:39:43');
I am looking for a query where I could enter in any date to see if this user was active during this time based on the period_start, period_end, and event_type.
Basically, if a row exists with event_type = 'deleted', then it should exclude that row and any other rows with the same sid, period_start and period_end values. I have tried:
SELECT e.* FROM events e
JOIN (SELECT sid, event_type, period_start, period_end
FROM events) e2
ON
(e2.sid = e.sid AND e2.event_type = "deleted"
AND e2.period_start = e.period_start
AND e2.period_end = e.period_end)
WHERE
(e.event_type = 'created' OR e.event_type = 'updated')
AND date(e.period_start) <= '2016-04-01'
AND date(e.period_end) >= '2016-04-01';
which should return the created event (but isn't returning anything), while using the dates 2016-09-01 or 2017-01-01 should return nothing. I'm not sure what to try next. I'd really like to be able to accomplish this in a query rather than having to process the data in PHP or JS.

As you have now added a description of how you want deleted information excluded I suggest the following:
SELECT e.eid, e.cid, e.sid, e.event_type, e.period_start, e.period_end, e2.eid, e2.event_type
FROM events e
left join events e2
on e.sid = e2.sid and e.event_type <> 'deleted' and e2.event_type = 'deleted'
AND e.period_start = e2.period_start AND e.period_end = e2.period_end
WHERE e.event_type <> 'deleted'
AND e2.eid IS NULL
AND '2016-04-01' between e.period_start and e.period_end
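The anti-join above can be tried outside MySQL with a small harness. This is only a sketch: it assumes Python's built-in sqlite3 module as a stand-in for MySQL, stores the datetimes as text, keeps only the columns the query needs, and uses a hypothetical helper named active_events.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (eid TEXT PRIMARY KEY, sid TEXT, event_type TEXT,"
    " period_start TEXT, period_end TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
    [
        ("event_1", "subscription_456", "created",
         "2016-03-11 17:38:50", "2016-09-11 18:38:50"),
        ("event_2", "subscription_456", "updated",
         "2016-09-11 18:38:50", "2017-03-11 17:38:50"),
        ("event_3", "subscription_456", "deleted",
         "2016-09-11 18:38:50", "2017-03-11 17:38:50"),
    ],
)


def active_events(conn, day):
    """Non-deleted events covering `day` whose (sid, period) has no 'deleted' row."""
    sql = """
        SELECT e.eid, e.event_type
        FROM events e
        LEFT JOIN events d
               ON d.sid = e.sid
              AND d.event_type = 'deleted'
              AND d.period_start = e.period_start
              AND d.period_end = e.period_end
        WHERE e.event_type <> 'deleted'
          AND d.eid IS NULL          -- anti-join: no matching deleted row found
          AND ? BETWEEN e.period_start AND e.period_end
        ORDER BY e.eid
    """
    return conn.execute(sql, (day,)).fetchall()


print(active_events(conn, "2016-04-01"))
print(active_events(conn, "2017-01-01"))
```

With the sample rows, the first call finds the created event and the second finds nothing, because the updated row shares its sid and period with the deleted row.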
Previous answer:
I really don't know what you want. Perhaps it would help if you listed a set of parameters, and then listed the expected result for those?
In the absence of that perhaps this will help:
Query 1:
SELECT e.*
FROM events e
WHERE '2016-04-01' between e.period_start and e.period_end
Results:
| eid | cid | sid | event_type | period_start | period_end | date | datetime |
|---------|------------|------------------|------------|-------------------------|-----------------------------|-------------------------|-------------------------|
| event_1 | customer_1 | subscription_456 | created | March, 11 2016 17:38:50 | September, 11 2016 18:38:50 | March, 11 2016 00:00:00 | March, 11 2016 17:38:51 |
Query 2:
SELECT e.eid, e.cid, e.sid, e.event_type, e.period_start, e.period_end, e2.eid, e2.event_type
FROM events e
left join events e2
on e.sid = e2.sid and e.event_type <> 'deleted' and e2.event_type = 'deleted'
WHERE '2016-04-01' between e.period_start and e.period_end
Results:
| eid | cid | sid | event_type | period_start | period_end | eid | event_type |
|---------|------------|------------------|------------|-------------------------|-----------------------------|---------|------------|
| event_1 | customer_1 | subscription_456 | created | March, 11 2016 17:38:50 | September, 11 2016 18:38:50 | event_3 | deleted |

It is hard to tell what is broken from the given sample data.
However, it is important to find out at which line the SQL stops returning rows.
In the following example, the highlighted part of the SQL still returns one row.
If you then run the two additional lines at the end, there is no result, and you need to check why: no data matches after SQL line #11.
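One way to apply this kind of step-by-step check to the query from the question is sketched below. It assumes Python's sqlite3 module in place of MySQL and a cut-down events table; BASE, STEPS and row_counts are hypothetical names used only for illustration. Each step adds one more predicate and counts the surviving rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (eid TEXT PRIMARY KEY, sid TEXT, event_type TEXT,"
    " period_start TEXT, period_end TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
    [
        ("event_1", "subscription_456", "created",
         "2016-03-11 17:38:50", "2016-09-11 18:38:50"),
        ("event_2", "subscription_456", "updated",
         "2016-09-11 18:38:50", "2017-03-11 17:38:50"),
        ("event_3", "subscription_456", "deleted",
         "2016-09-11 18:38:50", "2017-03-11 17:38:50"),
    ],
)

# The inner join from the question, without any WHERE clause yet.
BASE = """
    SELECT COUNT(*)
    FROM events e
    JOIN events e2
      ON e2.sid = e.sid
     AND e2.event_type = 'deleted'
     AND e2.period_start = e.period_start
     AND e2.period_end = e.period_end
"""

# Each step appends one more predicate to the previous one.
TYPE_FILTER = " WHERE e.event_type IN ('created', 'updated')"
DATE_FILTER = (
    " AND date(e.period_start) <= '2016-04-01'"
    " AND date(e.period_end) >= '2016-04-01'"
)
STEPS = [
    ("join only", ""),
    ("+ event_type filter", TYPE_FILTER),
    ("+ date filter", TYPE_FILTER + DATE_FILTER),
]


def row_counts(conn):
    """Return (label, row count) for each cumulative step."""
    return [(label, conn.execute(BASE + where).fetchone()[0])
            for label, where in STEPS]


for label, n in row_counts(conn):
    print(f"{label}: {n} row(s)")
```

On this sample data the join alone already drops event_1, since an inner join keeps only rows that have a matching deleted row.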

Related

Query to find active records on a given date in a table recording "start" and "stop" dates

I have a list of "start / stop" activity logged in a table, each one associated with a date. I need to determine which users had "started" on a particular date - i.e. were in progress with the task. My current setup and query can be represented by this simplistic view of it:
CREATE TABLE `registration_statuses` (
`status_id` INT(11) NOT NULL AUTO_INCREMENT,
`status_user_id` INT(10) UNSIGNED NOT NULL DEFAULT '0',
`status_activity` ENUM('start','stop') DEFAULT 'start',
`status_date` DATE NULL DEFAULT NULL,
PRIMARY KEY (`status_id`),
INDEX `status_user_id` (`status_user_id`)
);
INSERT INTO `registration_statuses` (`status_user_id`, `status_activity`, `status_date`)
VALUES (1, 'start', '2020-01-01'),
(2, 'start', '2020-01-02'),
(1, 'stop', '2020-01-19'),
(1, 'start', '2020-01-25'),
(2, 'stop', '2020-01-31'),
(1, 'stop', '2020-01-31');
I am then running this query:
SELECT `rs`.`status_user_id`
FROM `registration_statuses` `rs`
INNER JOIN (
SELECT `status_user_id`, MAX(status_date) `last_date`
FROM `registration_statuses`
WHERE `status_date` <= '2020-01-03'
GROUP BY `status_user_id`
) `srs` ON `rs`.`status_user_id` = `srs`.`status_user_id`
AND `rs`.`status_date` = `srs`.`last_date`
WHERE `status_activity` = 'start';
(See http://sqlfiddle.com/#!9/c8d371/1/0)
By changing the date in the query, this query returns a list of user ids that tell me who is engaged (i.e. has started a task) on that specific date. However, the users are considered (in real life) to have been engaged with the task on the actual date that they stop it. This query doesn't allow for this in that if you were to change the date in the query to reflect 2020-01-19, the day on which user 1 stopped, the query would only return user 2.
I tried changing the <= condition to a strict <, and while this solves that part of the problem, users are then not considered busy on the day that they start. With a strict <, only user 2 is returned for '2020-01-25', whereas I want both users to appear.
My only "viable" solution, at this point, is to merge the results of the two versions of the queries (in the form of a DISTINCT / UNION query), but I can't help but think that there must be a more efficient way of obtaining the results I need.
Does this help?
SELECT a.status_id
, a.status_user_id
, a.status_date start
, MIN(b.status_date) stop
FROM registration_statuses a
LEFT
JOIN registration_statuses b
ON b.status_user_id = a.status_user_id
AND b.status_id > a.status_id
AND b.status_activity = 'stop'
WHERE a.status_activity = 'start'
GROUP
BY a.status_id;
+-----------+----------------+------------+------------+
| status_id | status_user_id | start | stop |
+-----------+----------------+------------+------------+
| 1 | 1 | 2020-01-01 | 2020-01-19 |
| 2 | 2 | 2020-01-02 | 2020-01-31 |
| 4 | 1 | 2020-01-25 | 2020-01-31 |
+-----------+----------------+------------+------------+
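Building on the start/stop pairs above, one possible way to answer "who was engaged on a given day" with both the start day and the stop day counted is sketched here. It assumes Python's sqlite3 module instead of MySQL, explicit status_id values matching the insert order, and a hypothetical helper engaged_on; a start with no later stop is treated as still open.

```python
import sqlite3

ROWS = [
    (1, 1, "start", "2020-01-01"),
    (2, 2, "start", "2020-01-02"),
    (3, 1, "stop", "2020-01-19"),
    (4, 1, "start", "2020-01-25"),
    (5, 2, "stop", "2020-01-31"),
    (6, 1, "stop", "2020-01-31"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE registration_statuses (status_id INTEGER PRIMARY KEY,"
    " status_user_id INTEGER NOT NULL, status_activity TEXT, status_date TEXT)"
)
conn.executemany("INSERT INTO registration_statuses VALUES (?, ?, ?, ?)", ROWS)


def engaged_on(conn, day):
    """User ids with a start on or before `day` whose next stop is on or after it."""
    sql = """
        SELECT DISTINCT a.status_user_id
        FROM registration_statuses a
        WHERE a.status_activity = 'start'
          AND a.status_date <= :day
          AND COALESCE(
                (SELECT MIN(b.status_date)          -- first stop after this start
                 FROM registration_statuses b
                 WHERE b.status_user_id = a.status_user_id
                   AND b.status_id > a.status_id
                   AND b.status_activity = 'stop'),
                '9999-12-31'                        -- no stop yet: still open
              ) >= :day
        ORDER BY a.status_user_id
    """
    return [row[0] for row in conn.execute(sql, {"day": day})]


for day in ("2020-01-19", "2020-01-20", "2020-01-25"):
    print(day, engaged_on(conn, day))
```

Both comparisons are inclusive, so a user counts as engaged on the day they start and on the day they stop.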
One method is a correlated subquery:
select rs.*
from registration_statuses rs
where rs.status_date = (select max(rs2.status_date)
from registration_statuses rs2
where rs2.status_user_id = rs.status_user_id and
rs2.status_date <= ?
) and
rs.status_activity = 'start';
For performance, you want an index on registration_statuses(status_user_id, status_date).
There are other interesting methods. If you just want the user_id, here is an approach only using aggregation:
select rs.status_user_id
from registration_statuses rs
where rs.status_date <= ?
group by rs.status_user_id
having max(rs.status_date) = max(case when rs.status_activity = 'start' then rs.status_date end);
That is, select users whose most recent status as of a particular date is a "start".
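The aggregation idea can be checked against the question's sample rows with a short sketch. It assumes Python's sqlite3 module rather than MySQL, the question's column names (status_user_id) and the 'start' value from its ENUM, and a hypothetical helper latest_is_start.

```python
import sqlite3

ROWS = [
    (1, 1, "start", "2020-01-01"),
    (2, 2, "start", "2020-01-02"),
    (3, 1, "stop", "2020-01-19"),
    (4, 1, "start", "2020-01-25"),
    (5, 2, "stop", "2020-01-31"),
    (6, 1, "stop", "2020-01-31"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE registration_statuses (status_id INTEGER PRIMARY KEY,"
    " status_user_id INTEGER NOT NULL, status_activity TEXT, status_date TEXT)"
)
conn.executemany("INSERT INTO registration_statuses VALUES (?, ?, ?, ?)", ROWS)


def latest_is_start(conn, day):
    """User ids whose most recent status on or before `day` is a 'start'."""
    sql = """
        SELECT status_user_id
        FROM registration_statuses
        WHERE status_date <= ?
        GROUP BY status_user_id
        HAVING MAX(status_date) =
               MAX(CASE WHEN status_activity = 'start' THEN status_date END)
        ORDER BY status_user_id
    """
    return [row[0] for row in conn.execute(sql, (day,))]


for day in ("2020-01-03", "2020-01-19", "2020-01-25"):
    print(day, latest_is_start(conn, day))
```

Note that on a stop day the latest status is the stop itself, so this form alone does not count the stop day as engaged.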

Counting events before a specific event

Let's say, I have a table with the following columns:
date | event | user_id | unit_id |cost | ad_id | spend
03-15 | impression | 2353 | 3436 | 0.15 | NULL | NULL
03-15 | impression | 2353 | 3436 | 0.12 | NULL | NULL
03-15 | impression | 1234 | 5678 | 0.10 | NULL | NULL
03-15 | click | 1234 | 5678 | NULL | NULL | NULL
03-15 | create_ad | 1234 | 5678 | NULL | 6789 | 10
I want to calculate how many impressions on average it takes before a user creates an ad.
In this particular scenario, it took one impression for user 1234 to create an ad.
I'm not sure that I can somehow use date to discriminate events (but logically all these events should happen at different moments). However, you can see that impressions have NULLs in ad_id and spend, while create_ad does have a number in spend.
This one doesn't work:
select i.user_id
, i.unit_id
, count(i.event) impressions_n
, count(c.event) as ads_n
from add4ad i
left
join add4ad c
on i.user_id = c.user_id
and i.unit_id = c.unit_id
where i.event in ('impression')
and c.spend <> NULL
group
by i.user_id
, i.unit_id
I have created a SQLFiddle with this data
I went to SQL Fiddle and ran the test via MS SQL engine.
CREATE TABLE add4ad (date date, event varchar(10), user_id int,
unit_id int, cost float, ad_id float, spend float);
INSERT INTO add4ad (date, Event, user_id,unit_id,cost,ad_id,spend)
VALUES
('2018-03-15','impression','2353','3436','0.15',NULL,NULL),
('2018-03-15','impression','2353','3436','0.12',NULL,NULL),
('2018-03-15','impression','2353','3436','0.10',NULL,NULL),
('2018-03-15','click','1234','5678', NULL, NULL,NULL),
('2018-03-15','create_ad','2353','5678', NULL, 6789,10);
My query
with e10 as (select user_id, event, date, rowid=row_number() over (Partition by user_id order by date)
from add4ad
where event='create_ad'
),
e20 as ( -- get the first create_ad event
select user_id, date
from e10
where rowid=1
)
select a.user_id, count(1) as N
from e20 inner join add4ad a
on e20.user_id=a.user_id
and a.date<=e20.date
and a.event='impression'
group by a.user_id
If I got it right, you need to count distinct ads
CREATE TABLE add4ad (`date` date, `event` varchar(10), `user_id` int,
`unit_id` int, `cost` float, `ad_id` float, `spend` float);
INSERT INTO add4ad (`date`, `Event`, `user_id`,`unit_id`,`cost`,`ad_id`,`spend`)
VALUES
('2018-03-15','impression','2353','3436','0.15',NULL,NULL),
('2018-03-15','impression','2353','3436','0.12',NULL,NULL),
('2018-03-15','impression','2353','3436','0.10',NULL,NULL),
('2018-03-15','impression','1234','5678','0.10',NULL,NULL),
('2018-03-15','click','1234','5678', NULL, NULL,NULL),
('2018-03-15','create_ad','1234','5678', NULL, 6789,10),
('2018-03-16','impression','8765','8871','0.10',NULL,NULL),
('2018-03-16','impression','8765','8871','0.10',NULL,NULL),
('2018-03-16','impression','8765','8871','0.2',NULL,NULL),
('2018-03-16','impression','8765','8871','0.23',NULL,NULL),
('2018-03-16','click','8765','8871', NULL, NULL,NULL),
('2018-03-16','create_ad','8765','8871', NULL, 6789,10);
select i.user_id, i.unit_id, count(i.event) as impressions_n,
count(distinct c.event) as ads_n
from add4ad i
join add4ad c
on i.user_id = c.user_id and i.unit_id = c.unit_id
where i.event in ('impression')
and c.event in ('create_ad') and c.spend is not NULL
group by i.user_id, i.unit_id
Returns
user_id unit_id impressions_n ads_n
1234 5678 1 1
8765 8871 4 1
I've replaced the left join with a join because the where clause, as written, effectively makes your join an inner join anyway. If you still need a left join, move the predicates to the ON clause or handle NULLs in the where clause.
fiddle
The issue is that to check for NULLs you have to use IS NULL or IS NOT NULL. Also, the data in your fiddle is incorrect: it has no impression for user 1234.
select i.user_id, i.unit_id, count(i.event) as impressions_n,
count(c.event) as ads_n
from add4ad i
left join add4ad c
on i.user_id = c.user_id and i.unit_id = c.unit_id
where i.event in ('impression')
/*and c.event in ('create_ad')*/ and c.spend is not NULL
group by i.user_id, i.unit_id
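The NULL comparison point from this answer can be demonstrated in isolation. The sketch below assumes Python's sqlite3 module (the three-valued NULL logic shown is standard SQL and behaves the same way in MySQL), a three-row cut-down table, and a hypothetical helper count_where.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE add4ad (event TEXT, user_id INTEGER, spend REAL)")
conn.executemany(
    "INSERT INTO add4ad VALUES (?, ?, ?)",
    [
        ("impression", 1234, None),
        ("click", 1234, None),
        ("create_ad", 1234, 10.0),
    ],
)


def count_where(conn, predicate):
    """Count rows matching a hard-coded predicate (demo only, not for user input)."""
    return conn.execute(f"SELECT COUNT(*) FROM add4ad WHERE {predicate}").fetchone()[0]


# Any comparison with NULL evaluates to NULL, which WHERE treats as not true.
print("spend <> NULL     ->", count_where(conn, "spend <> NULL"))
print("spend IS NOT NULL ->", count_where(conn, "spend IS NOT NULL"))
print("spend IS NULL     ->", count_where(conn, "spend IS NULL"))
```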
Seems this is the solution:
select sum(c.impressions_n) / count(1) as average_num_of_impressions from (
select count(i.event) as impressions_n
from add4ad i
join add4ad c
on i.user_id = c.user_id and i.unit_id = c.unit_id
where i.event in ('impression') and c.event in ('create_ad')
group by i.user_id, i.unit_id ) c
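A runnable version of this averaging query, using the extended sample rows from the earlier answer, might look as follows. It assumes Python's sqlite3 module instead of MySQL and a table reduced to the three columns the query touches; AVG is used in place of sum / count because SQLite would otherwise do integer division. The helper name average_impressions is hypothetical.

```python
import sqlite3

ROWS = (
    [("impression", 2353, 3436)] * 3
    + [("impression", 1234, 5678), ("click", 1234, 5678), ("create_ad", 1234, 5678)]
    + [("impression", 8765, 8871)] * 4
    + [("click", 8765, 8871), ("create_ad", 8765, 8871)]
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE add4ad (event TEXT, user_id INTEGER, unit_id INTEGER)")
conn.executemany("INSERT INTO add4ad VALUES (?, ?, ?)", ROWS)


def average_impressions(conn):
    """Average impression count over (user, unit) pairs that have a create_ad row."""
    sql = """
        SELECT AVG(impressions_n)
        FROM (
            SELECT COUNT(*) AS impressions_n
            FROM add4ad i
            JOIN add4ad c
              ON i.user_id = c.user_id AND i.unit_id = c.unit_id
            WHERE i.event = 'impression' AND c.event = 'create_ad'
            GROUP BY i.user_id, i.unit_id
        ) per_user
    """
    return conn.execute(sql).fetchone()[0]


print(average_impressions(conn))
```

User 2353 never creates an ad, so the inner join leaves that user out of the average. If one pair had several create_ad rows, each impression would be counted once per ad, so this sketch assumes at most one create_ad per pair.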

How to get entries which are grouped and satisfy restriction within the group?

In the table REPORT there are following 3 columns:
RID - report id, which is not unique
TYPE - can be either new or cancel, report is identified by RID, so one report can have multiple cancellations and multiple new entries
TIMESTAMP - is just a timestamp of when entry arrived
Example below:
RID | Type | TIMESTAMP |
-----+---------+-------------------+
4 | New | 2019-10-27 10:35 |
4 | Cancel | 2019-10-27 09:35 |
3 | Cancel | 2019-10-27 07:35 |
2 | New | 2019-10-27 07:35 |
1 | Cancel | 2019-10-27 09:35 |
1 | Cancel | 2019-10-27 08:35 |
1 | New | 2019-10-27 07:35 |
I'd like to get all reports which at some point were created and then canceled, so that the latest state is canceled.
It is possible to have cancellations of non-existent reports, or entries with cancellations first and new entries afterwards; all of those should be excluded.
My attempt so far was to use a nested query to get all cancellations that have a corresponding new entry, but I do not know how to take their timestamps into consideration to exclude entries with the sequence cancel->new.
SELECT
RID
FROM
REPORT
WHERE
TYPE = 'Cancel'
AND RID IN (
SELECT
RID
FROM
REPORT
WHERE
TYPE = 'New'
);
My expectation from the example is to get RID 1, I'm interested in only RIDs.
Using: MySQL 5.7.27 InnoDB
With EXISTS:
select distinct r.rid
from report r
where r.type = 'Cancel'
and exists (
select 1 from report
where rid = r.rid and type='New' and timestamp < r.timestamp
)
See the demo.
Or:
select rid
from report
group by rid
having
min(case when type = 'New' then timestamp end) <
min(case when type = 'Cancel' then timestamp end)
See the demo.
Results:
| rid |
| --- |
| 1 |
I would get the latest type using a correlated subquery. Then check for "cancel":
select t.*
from t
where t.timestamp = (select max(t2.timestamp)
from t t2
where t2.rid = t.rid
) and
t.type = 'Cancel';
If you just want the rid and date, then another fun way uses aggregation:
select rid, max(timestamp)
from t
group by rid
having max(timestamp) = max(case when type = 'Cancel' then timestamp end);
The logic here is to get timestamps where the maximum date is also the maximum cancel date.
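The aggregation logic above can be combined with the question's other requirement, that a New entry exists before the cancellation. The following is a sketch only, assuming Python's sqlite3 module in place of MySQL 5.7, timestamps stored as text, and a hypothetical helper created_then_cancelled.

```python
import sqlite3

ROWS = [
    (4, "New", "2019-10-27 10:35"),
    (4, "Cancel", "2019-10-27 09:35"),
    (3, "Cancel", "2019-10-27 07:35"),
    (2, "New", "2019-10-27 07:35"),
    (1, "Cancel", "2019-10-27 09:35"),
    (1, "Cancel", "2019-10-27 08:35"),
    (1, "New", "2019-10-27 07:35"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE report (rid INTEGER, type TEXT, timestamp TEXT)")
conn.executemany("INSERT INTO report VALUES (?, ?, ?)", ROWS)


def created_then_cancelled(conn):
    """RIDs with a New before a Cancel and whose latest entry is a Cancel."""
    sql = """
        SELECT rid
        FROM report
        GROUP BY rid
        HAVING MIN(CASE WHEN type = 'New' THEN timestamp END)
                 < MAX(CASE WHEN type = 'Cancel' THEN timestamp END)
           AND MAX(timestamp)
                 = MAX(CASE WHEN type = 'Cancel' THEN timestamp END)
        ORDER BY rid
    """
    return [row[0] for row in conn.execute(sql)]


print(created_then_cancelled(conn))
```

RIDs with only cancellations or only new entries drop out because one side of the first comparison is NULL.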
Off the top of my head, probably with a few typos, and probably not the best, but it should be easy to understand...
SELECT n.RID FROM (
SELECT RID, TYPE, MIN(TIMESTAMP) AS FirstAdd
FROM REPORT
WHERE TYPE = 'New'
GROUP BY RID, TYPE) n INNER JOIN (
SELECT RID, TYPE, MAX(TIMESTAMP) AS LastCancel
FROM REPORT
WHERE TYPE = 'Cancel'
GROUP BY RID, TYPE) c ON n.RID = c.RID
AND n.FirstAdd < c.LastCancel

Difference between 2 records timestamp sql

I have this table:
CREATE TABLE result (
id bigint(20) NOT NULL AUTO_INCREMENT,
tag int(11) NOT NULL,
timestamp timestamp NULL DEFAULT NULL,
value double NOT NULL,
PRIMARY KEY (id),
UNIQUE KEY nasudnBBEby333412dsa (timestamp, tag)
) ENGINE=InnoDB AUTO_INCREMENT=115 DEFAULT CHARSET=utf8mb4;
I would like to calculate the difference between two consecutive days that have the same column tag. For example, in timestamp:
| 1 | 1 | 2017-06-18 00:00:00 | 7.3 |
| 2 | 1 | 2017-06-17 00:00:00 | 7.4 |
I want the result: -0.1
Which query should I write?
You can try this
1) Use join to select value of next consecutive day.
2) then calculate difference
SELECT r1.id, r1.tag, r1.value AS CURRENT_VALUE, r2.value AS NEXT_VALUE,
(r1.value - r2.value) AS DIFF, r1.timestamp
FROM `result` r1
LEFT JOIN `result` r2 ON r2.tag = r1.tag AND r2.`timestamp` = r1.`timestamp` + INTERVAL 1 DAY
-- No GROUP BY is needed: (timestamp, tag) is unique, so each r1 row matches at most one r2 row.
WHERE r2.value IS NOT NULL
First, if you want to store date values, you can use date, so there is no time component.
Second, you can do this with join:
select r.*, (r.value - rprev.value) as diff
from result r left join
result rprev
on r.tag = rprev.tag and
r.timestamp = rprev.timestamp + interval 1 day;
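The self-join on the previous day can be tried with the two sample rows from the question. This sketch assumes Python's sqlite3 module instead of MySQL, so SQLite's datetime(..., '+1 day') stands in for + interval 1 day, and an inner join is used so that only days with a predecessor are returned. The extra tag 2 row and the helper daily_diffs are hypothetical additions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE result (id INTEGER PRIMARY KEY, tag INTEGER NOT NULL,"
    " timestamp TEXT, value REAL NOT NULL, UNIQUE (timestamp, tag))"
)
conn.executemany(
    "INSERT INTO result VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2017-06-18 00:00:00", 7.3),
        (2, 1, "2017-06-17 00:00:00", 7.4),
        (3, 2, "2017-06-18 00:00:00", 5.0),  # hypothetical row: no previous day
    ],
)


def daily_diffs(conn):
    """(id, tag, timestamp, value minus previous day's value) per matching row."""
    sql = """
        SELECT r.id, r.tag, r.timestamp,
               ROUND(r.value - rprev.value, 1) AS diff
        FROM result r
        JOIN result rprev
          ON rprev.tag = r.tag
         AND r.timestamp = datetime(rprev.timestamp, '+1 day')
        ORDER BY r.tag, r.timestamp
    """
    return conn.execute(sql).fetchall()


for row in daily_diffs(conn):
    print(row)
```

The ROUND call only hides floating-point noise in the subtraction; drop it if full precision is wanted.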

Efficiently finding the next event in SQL - with all columns

I found solutions that find the next event date, but not ones that would include all the data from the event. With cheating, I can get it done, but that only works in mysql and fails in vertica.
Here is the problem I am trying to solve:
I want to show all events a with data from the first event X that follows a and is not of type a. So here is the cut and paste example so you can play with it to see what actually works:
CREATE TABLE events (user_id int ,created_at int, event varchar(20));
INSERT INTO events values (0,0, 'a');
INSERT INTO events values (0,1, 'b');
INSERT INTO events values (0,2, 'c');
INSERT INTO events values (0,3, 'a');
INSERT INTO events values (0,4, 'c');
INSERT INTO events values (0,5, 'b');
INSERT INTO events values (0,6, 'a');
INSERT INTO events values (0,7, 'a');
INSERT INTO events values (0,8, 'd');
SELECT * FROM events;
+---------+------------+-------+
| user_id | created_at | event |
+---------+------------+-------+
| 0 | 0 | a |
| 0 | 1 | b |
| 0 | 2 | c |
| 0 | 3 | a |
| 0 | 4 | c |
| 0 | 5 | b |
| 0 | 6 | a |
| 0 | 7 | a |
| 0 | 8 | d |
+---------+------------+-------+
9 rows in set (0.00 sec)
Here is the result I know how to get in both, but I cannot seem to be able to get the event info in it as well:
SELECT user_id, MAX(purchased) AS purchased, spent
FROM (
SELECT
e1.user_id AS user_id, e1.created_at AS purchased,
MIN(e2.created_at) AS spent
FROM events e1, events e2
WHERE
e1.user_id = e2.user_id AND e1.created_at <= e2.created_at AND
e1.Event = 'a' AND e2.Event != 'a'
GROUP BY e1.user_id, e1.created_at
) e3 GROUP BY user_id, spent;
user_id | purchased | spent
---------+-----------+-------
0 | 0 | 1
0 | 3 | 4
0 | 7 | 8
Now if I want the event type in there as well, it does not work with the query above,
because you either have to use the event field in the group-by (not what we want) or with an aggregate (not what we want either). Funny enough in mysql it works, but I consider that cheating and since I have to use vertica for this, it won't help me:
SELECT user_id, MAX(purchased) as purchased, spent, event FROM (
SELECT
e1.User_ID AS user_id,
e1.created_at AS purchased,
MIN(e2.created_at) AS spent,
e2.event AS event
FROM events e1, events e2
WHERE
e1.user_id = e2.user_id AND e1.created_at <= e2.created_at AND
e1.Event = 'a' AND e2.Event != 'a'
GROUP BY
e1.user_id,e1.created_at
) e3 GROUP BY user_id, spent;
+---------+-----------+-------+-------+
| user_id | purchased | spent | event |
+---------+-----------+-------+-------+
| 0 | 0 | 1 | b |
| 0 | 3 | 4 | c |
| 0 | 7 | 8 | d |
+---------+-----------+-------+-------+
3 rows in set (0.00 sec)
For vertica the same query throws an error:
ERROR 2640: Column "e2.event" must appear in the GROUP BY clause or be used in an aggregate function
What's an elegant solution to get the two events paired up with all their columns and without cheating, so that I can get the same result as shown above when executing in vertica or some other database that does not allow the cheat? In the sample data I have exactly one additional column I want, which is the event type, but in the real world it would be two or three columns.
Please try it out with the sample data posted before answering :)
Ok, while I'm not 100% sure I understand what you're trying to do, see if this won't work:
SELECT e3.user_id, MAX(e3.purchased) AS purchased, e3.spent, e.event
FROM (
SELECT
e1.user_id AS user_id, e1.created_at AS purchased,
MIN(e2.created_at) AS spent
FROM events e1, events e2
WHERE
e1.user_id = e2.user_id AND e1.created_at <= e2.created_at AND
e1.Event = 'a' AND e2.Event != 'a'
GROUP BY e1.user_id, e1.created_at
) e3
JOIN events e on e3.user_id = e.user_id and e3.spent = e.created_at
GROUP BY e3.user_id, e3.spent, e.event
Essentially I'm just joining on the event table again assuming user_id and created_at are your primary key.
And here is the SQL Fiddle.
Try this...
With Cte As
(
Select Row_Number() Over (Partition By [user_id] Order By [created_at]) As row_num,
[user_id],
[created_at],
[event]
From [events]
)
Select c1.[user_id],
c1.[created_at] As purchased,
c2.[created_at] As spent,
c2.[event]
From Cte c1
Left Join Cte c2
On c1.[user_id] = c2.[user_id]
And c1.row_num = c2.row_num - 1
Where c1.event = 'a'
And c2.event <> 'a'
I usually do the "next" calculations using correlated subqueries, and then join back to the original table. In this case, I am assuming that (user_id, created_at) uniquely defines a row.
Here is the query:
SELECT user_id, MAX(purchased) as purchased, spent, event
FROM (
SELECT e.User_ID, e.created_at AS purchased,
MIN(enext.created_at) AS spent,
min(enext.event) AS event
FROM (select e.*,
(select MIN(e2.created_at)
from events e2
where e2.user_id = e.user_id and e2.created_at > e.created_at and e2.event <> 'a'
) nextcreatedat
from events e
where e.event = 'a'
) e left outer join
events enext
on e.user_id = enext.user_id and
e.nextcreatedat = enext.created_at
GROUP BY e.user_id, e.created_at
) e3
GROUP BY user_id, spent;
The aggregation GROUP BY e.user_id, e.created_at is not necessary, but I've left it in to remain similar to the original query.
Because Vertica supports cumulative sums, there is a way to do this more efficiently, but it would not work in MySQL.
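The correlated-subquery-then-join-back approach can be run against the question's sample data as a quick check. The sketch assumes Python's sqlite3 module as a neutral engine (neither MySQL nor Vertica), that (user_id, created_at) is unique, and a hypothetical helper next_non_a_events; every selected column is either grouped or aggregated, so it does not rely on MySQL's lenient GROUP BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, created_at INTEGER, event TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(0, i, ev) for i, ev in enumerate("abcacbaad")],
)


def next_non_a_events(conn):
    """For each run of 'a' events, pair the last 'a' with the next non-'a' event."""
    sql = """
        SELECT e3.user_id, MAX(e3.purchased) AS purchased, e3.spent, en.event
        FROM (
            SELECT e.user_id,
                   e.created_at AS purchased,
                   (SELECT MIN(e2.created_at)       -- next non-'a' event time
                    FROM events e2
                    WHERE e2.user_id = e.user_id
                      AND e2.created_at > e.created_at
                      AND e2.event <> 'a') AS spent
            FROM events e
            WHERE e.event = 'a'
        ) e3
        JOIN events en                              -- join back for full row
          ON en.user_id = e3.user_id AND en.created_at = e3.spent
        GROUP BY e3.user_id, e3.spent, en.event
        ORDER BY e3.spent
    """
    return conn.execute(sql).fetchall()


for row in next_non_a_events(conn):
    print(row)
```

Further columns from the following event can be added by selecting them from en and listing them in the GROUP BY.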