MySQL Group By Consecutive Rows - mysql

I have a feed application that I am trying to group results from consecutively.
My table looks like this:
postid | posttype | target | action | date | title | content
1 | userid | NULL | upgrade | 0000-01-00 00:00:00 | Upgraded 1 | exmple
1 | userid | NULL | upgrade | 0000-01-00 00:00:01 | Upgraded 2 | exmple
1 | userid | NULL | downgrade | 0000-01-00 00:00:02 | Downgraded | exmple
1 | userid | NULL | upgrade | 0000-01-00 00:00:03 | Upgraded | exmple
What I would like the outcome to be is:
postid | posttype | target | action | date | title | content
1 | userid | NULL | upgrade | 0000-01-00 00:00:01 | Upgrade 1 | exmple,exmple
1 | userid | NULL | downgrade | 0000-01-00 00:00:02 | Downgraded | exmple
1 | userid | NULL | upgrade | 0000-01-00 00:00:03 | Upgraded | exmple
So as you can see, because Upgrade 1 & Upgrade 2 were sent consecutively, it groups them together. The "action" column is a reference, and should be used for the consecutive grouping along with postid & posttype.
I looked around on SO but didn't see anything quite like mine. Thanks in advance for any help.
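For anyone wanting to reproduce this, here is a minimal sketch of the schema and sample data the answers below run against (the table name t is taken from the answers; real timestamps stand in for the question's placeholder dates):
CREATE TABLE t (
  postid   INT,
  posttype VARCHAR(20),
  target   VARCHAR(20),
  action   VARCHAR(20),
  date     DATETIME,
  title    VARCHAR(50),
  content  VARCHAR(50)
);
INSERT INTO t (postid, posttype, target, action, date, title, content) VALUES
  (1, 'userid', NULL, 'upgrade',   '2012-01-01 00:00:00', 'Upgraded 1', 'exmple'),
  (1, 'userid', NULL, 'upgrade',   '2012-01-01 00:00:01', 'Upgraded 2', 'exmple'),
  (1, 'userid', NULL, 'downgrade', '2012-01-01 00:00:02', 'Downgraded', 'exmple'),
  (1, 'userid', NULL, 'upgrade',   '2012-01-01 00:00:03', 'Upgraded',   'exmple');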

Here's another version that works with MySQL variables and doesn't require nesting three levels deep. The inner query pre-sorts the records by postID and date and assigns them a sequential group number that increments whenever the post ID, type and/or action changes. From that, it's a simple GROUP BY; there's no comparing record alias T to T2 to T3. What if you wanted 4 or 5 criteria: would you have to nest even more levels, or just add 2 more @sql variables to the comparison test?
Your call on which is more efficient...
select
      PreQuery.postID,
      PreQuery.PostType,
      PreQuery.Target,
      PreQuery.Action,
      PreQuery.Title,
      min( PreQuery.Date ) as FirstActionDate,
      max( PreQuery.Date ) as LastActionDate,
      count(*) as ActionEntries,
      group_concat( PreQuery.content ) as Content
   from
      ( select
              t.*,
              @lastSeq := if( t.action = @lastAction
                          AND t.postID = @lastPostID
                          AND t.postType = @lastPostType, @lastSeq, @lastSeq + 1 ) as ActionSeq,
              @lastAction := t.action,
              @lastPostID := t.postID,
              @lastPostType := t.PostType
           from
              t,
              ( select @lastAction := ' ',
                       @lastPostID := 0,
                       @lastPostType := ' ',
                       @lastSeq := 0 ) sqlVars
           order by
              t.postid,
              t.date ) PreQuery
   group by
      PreQuery.postID,
      PreQuery.ActionSeq,
      PreQuery.PostType,
      PreQuery.Action
Here's my link to a SQLFiddle sample.
For the title, you might want to adjust the line...
group_concat( distinct PreQuery.Title ) as Titles,
At least this will give DISTINCT titles concatenated... it's much tougher to get just the one title associated with the max date per group without nesting this entire query one more level.
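If you'd rather not nest another level just for the title, one common trick is to order the concatenated titles by date and take the first one with SUBSTRING_INDEX; a sketch (it assumes titles never contain the '||' separator):
substring_index( group_concat( PreQuery.Title order by PreQuery.Date desc separator '||' ), '||', 1 ) as LastTitle,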

There is no primary key in your table so for my example I used date. You should create an auto increment value and use that instead of the date in my example.
This is a solution (view on SQL Fiddle):
SELECT
  postid,
  posttype,
  target,
  action,
  COALESCE((
    SELECT MAX(date)
    FROM t t2
    WHERE t2.postid = t.postid
      AND t2.posttype = t.posttype
      AND t2.action = t.action
      AND t2.date > t.date
      AND NOT EXISTS (
        SELECT TRUE
        FROM t t3
        WHERE t3.date > t.date
          AND t3.date < t2.date
          AND (t3.postid != t.postid OR t3.posttype != t.posttype OR t3.action != t.action)
      )
  ), t.date) AS group_criterion,
  MAX(title),
  GROUP_CONCAT(content)
FROM t
GROUP BY 1,2,3,4,5
ORDER BY group_criterion
It basically reads:
For each row create a group criterion and in the end group by it.
This criterion is the highest date among the rows that follow the current one and have the same postid, posttype and action, with the restriction that no row with a different postid, posttype or action may occur between them.
In other words, the group criterion is the highest date occurring in a group of consecutive entries.
If you use proper indexes it shouldn't be terribly slow, but if you have a lot of rows you should think about caching this information.
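As a sketch of what "proper indexes" could mean here (column names taken from the question; adjust for your real schema), a composite index covering the correlated lookups plus one on the date used for the range checks:
ALTER TABLE t
  ADD INDEX idx_group_run (postid, posttype, action, date),
  ADD INDEX idx_date (date);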

Related

Update last row in group with data from first row in group

I'm currently in the process of converting data from one structure to another, and in the process I have to take a status id from the first entry in the group and apply it to the last entry in that same group. I am able to target and update the last item in the group just fine when using a hard-coded value, but I'm hitting a wall when trying to use the status_id from the first entry. Here is an example of the data structure.
-----------------------------------------------------------
| id | ticket_id | status_id | new_status_id | created_at |
-----------------------------------------------------------
| 1  | 10        | NULL      | 3             | 2018-06-20 |
| 2  | 10        | 1         | 1             | 2018-06-22 |
| 3  | 10        | 1         | 1             | 2018-06-23 |
| 4  | 10        | 1         | 1             | 2018-06-26 |
-----------------------------------------------------------
So the idea would be to take the new_status_id of ID 1 and apply it to the same field for ID 4.
Here is the query that works when using a hard-coded value
UPDATE Communications_History as ch
JOIN
(
SELECT communication_id, MAX(created_at) max_time, new_status_id
FROM Communications_History
GROUP BY communication_id
) ch2
ON ch.communication_id = ch2.communication_id AND ch.created_at = ch2.max_time
SET ch.new_status_id = 3
But when I use the following query, I get "Unknown column 'ch.communication_id' in 'where clause'"
UPDATE Communications_History as ch
JOIN
(
SELECT communication_id, MAX(created_at) max_time, new_status_id
FROM Communications_History
GROUP BY communication_id
) ch2
ON ch.communication_id = ch2.communication_id AND ch.created_at = ch2.max_time
SET ch.new_status_id = (
SELECT nsi FROM
(
SELECT new_status_id FROM Communications_History WHERE communication_id = ch.communication_id AND status_id IS NULL
) as ch3
)
Thanks!
So I just figured it out using variables. It turns out the original "solution" only worked when there was one ticket's worth of history in the table, but when all the data was imported, it no longer worked. However, this tweak did seem to fix the issue.
UPDATE Communications_History as ch
JOIN
(
SELECT communication_id, MAX(created_at) max_time, new_status_id
FROM Communications_History
GROUP BY communication_id
) ch2
ON ch.communication_id = ch2.communication_id AND ch.created_at = ch2.max_time
SET ch.new_status_id = ch2.new_status_id;
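Note that the new_status_id returned by that GROUP BY subquery is not guaranteed to come from any particular row of each group. If the server is ever on MySQL 8.0+, a window-function version makes the intent explicit; this is only a sketch and assumes id is the table's primary key:
UPDATE Communications_History ch
JOIN (
  SELECT id,
         FIRST_VALUE(new_status_id) OVER (PARTITION BY communication_id ORDER BY created_at) AS first_status,
         ROW_NUMBER() OVER (PARTITION BY communication_id ORDER BY created_at DESC) AS rn
  FROM Communications_History
) w ON w.id = ch.id
SET ch.new_status_id = w.first_status
WHERE w.rn = 1;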

MySQL: how to assign same ID for records with close timestamp

I have a MySQL table with timestamp column t. I need to create another integer column (groupId) which will have the same value for records whose timestamps are less than 3 seconds apart. My version of MySQL has no window function support. This is the expected output in the 2nd column:
+---------------------+---------+
| t                   | groupId |
+---------------------+---------+
| 2017-06-17 18:15:13 |       1 |
| 2017-06-17 18:15:14 |       1 |
| 2017-06-17 20:30:06 |       2 |
| 2017-06-17 20:30:07 |       2 |
| 2017-06-17 22:44:58 |       3 |
| 2017-06-17 22:44:59 |       3 |
| 2017-06-17 23:59:50 |       4 |
| 2017-06-17 23:59:51 |       4 |
+---------------------+---------+
I tried to use self-join and TIMESTAMPDIFF(SECOND,t1,t2) <3
but I do not know how to generate the unique groupId.
P.S.
It is guaranteed by the nature of the data that there is no continuous range which spans > 3 sec
You can do this using variables.
select tm
      ,@diff := timestampdiff(second, @prev, tm)
      ,@prev := tm
      ,@grp := case when @diff < 3 or @diff is null then @grp else @grp + 1 end as groupID
from t
cross join (select @prev := '', @diff := 0, @grp := 1) r
order by tm
For this, I believe you would need to create a stored procedure that first sorts your table by the column t (timestamp) and then goes through it, grouping and assigning the groupId accordingly... in this case you can use your own counter as the groupId.
What is important here is how you split the time into frames; you could end up with different results depending on your point of reference...
This query puts every record in the same group when the previous record is just 3 seconds before:
UPDATE t
JOIN (
  SELECT
    t.*
    , @gid := IF(TIMESTAMPDIFF(SECOND, @prev, t) > 3, @gid + 1, @gid) AS gid
    , @prev := t
  FROM t
  , (SELECT @prev := NULL, @gid := 1) v
  ORDER BY t
) sq ON t.t = sq.t
SET t.groupId = sq.gid;
see it working live in an sqlfiddle
learn more about user-defined variables here
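If the server ever gets window function support (MySQL 8.0+), the same grouping can be expressed without variables by flagging rows that start a new group with LAG() and taking a running sum of those flags; a sketch, keeping the question's table and column both named t:
SELECT t,
       SUM(new_grp) OVER (ORDER BY t) AS groupId
FROM (
  SELECT t,
         IF(LAG(t) OVER (ORDER BY t) IS NULL
            OR TIMESTAMPDIFF(SECOND, LAG(t) OVER (ORDER BY t), t) > 3, 1, 0) AS new_grp
  FROM t
) flagged
ORDER BY t;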
This query will work in Oracle SQL:
select *
from (
select e.*,
rank() over (partition by trunc(hiredate,'mi') order by trunc(hiredate,'mi') desc) MINu
from emp e
)

How to deduplicate mysql rows but keep max view

I have MySQL rows like this
id | title | desc | view
1 | i'm a title | i'm a desc | 0
2 | i'm a title | i'm a desc | 0
3 | i'm a title | i'm a desc | 5
4 | i'm a title | i'm a desc | 0
5 | i'm a title | i'm a desc | 0
6 | i'm a title | i'm a desc | 3
8 | i'm a title | i'm a desc | 0
And i would like to keep only
3 | i'm a title | i'm a desc | 5
because this record has the max view and the others are duplicates
If your data is not too big, you can use delete like this:
delete t from yourtable t join
(select title, `desc`, max(view) as maxview
from yourtable t
group by title, `desc`
) tt
on t.title = tt.title and
t.`desc` = tt.`desc` and
t.view < tt.maxview;
Note: if there are multiple rows with the same maximum number of views, this will keep all of them. Also, desc is a lousy name for a column because it is a SQL (and MySQL) reserved word.
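If you need to keep exactly one row even when several rows tie on the maximum view count, you can break ties on id; a sketch, assuming id is unique:
delete t from yourtable t join
       yourtable keep
       on t.title = keep.title and
          t.`desc` = keep.`desc` and
          (t.view < keep.view or (t.view = keep.view and t.id > keep.id));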
EDIT:
If you have a large amount of data, it is often faster to use the truncate/re-insert approach:
create table temp_t as
select t.*
from yourtable t join
(select title, `desc`, max(view) as maxview
from yourtable t
group by title, `desc`
) tt
on t.title = tt.title and
t.`desc` = tt.`desc` and
t.view = tt.maxview;
truncate table yourtable;
insert into yourtable
select *
from temp_t;
I could not understand what the specific question is. Possible solutions follow...
1) Use an UPDATE instead of an INSERT statement in MySQL. Just write UPDATE your_table_name SET view = view + 1
2) Or you can run a cron job (if using PHP) to delete the duplicate rows having the lower value
3) If an INSERT is necessary then you should use ON DUPLICATE KEY UPDATE (see the sketch below). Refer to the documentation: http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html
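A sketch of option 3, assuming a UNIQUE key over (title, `desc`) so that inserting a duplicate collides and bumps the counter instead of adding a new row:
INSERT INTO yourtable (title, `desc`, view)
VALUES ('i''m a title', 'i''m a desc', 1)
ON DUPLICATE KEY UPDATE view = view + 1;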

What is SQL to select a property and the max number of occurrences of a related property?

I have a table like this:
Table: p
+----+------+
| id | w_id |
+----+------+
|  5 |    8 |
|  5 |   10 |
|  5 |    8 |
|  5 |   10 |
|  5 |    8 |
|  6 |    5 |
|  6 |    8 |
|  6 |   10 |
|  6 |   10 |
|  7 |    8 |
|  7 |   10 |
+----+------+
What is the best SQL to get the following result? :
+----+----------------+
| id | most_used_w_id |
+----+----------------+
|  5 |              8 |
|  6 |             10 |
|  7 |              8 |
+----+----------------+
In other words, to get, per id, the most frequent related w_id.
Note that on the example above, id 7 is related to 8 once and to 10 once.
So, either (7, 8) or (7, 10) will do as result. If it is not possible to
pick up one, then both (7, 8) and (7, 10) on result set will be ok.
I have come up with something like:
select counters2.p_id as id, counters2.w_id as most_used_w_id
from (
select p.id as p_id,
w_id,
count(w_id) as count_of_w_ids
from p
group by id, w_id
) as counters2
join (
select p_id, max(count_of_w_ids) as max_counter_for_w_ids
from (
select p.id as p_id,
w_id,
count(w_id) as count_of_w_ids
from p
group by id, w_id
) as counters
group by p_id
) as p_max
on p_max.p_id = counters2.p_id
and p_max.max_counter_for_w_ids = counters2.count_of_w_ids
;
but I am not sure at all whether this is the best way to do it, and I had to repeat the same sub-query twice.
Any better solution?
Try using user-defined variables:
select id, w_id
FROM
  ( select T.*,
           if(@id <> id, 1, 0) as row,
           @id := id
    FROM
      ( select id, W_id, Count(*) as cnt FROM p Group by ID, W_id ) as T,
      (SELECT @id := 0) as T1
    ORDER BY id, cnt DESC
  ) as T2
WHERE Row = 1
SQLFiddle demo
Formal SQL
In fact, your solution is correct in terms of standard SQL. Why? Because you have to join the grouped values back to the original data, so your query cannot really be simplified. MySQL allows mixing non-grouped columns and group functions, but that is totally unreliable, so I do not recommend relying on that behavior.
MySQL
Since you're using MySQL, you can use variables. I'm not a big fan of them, but for your case they may be used to simplify things:
SELECT
  c.*,
  IF(@id != id, @i := 1, @i := @i + 1) AS num,
  @id := id AS gid
FROM
  (SELECT id, w_id, COUNT(w_id) AS w_count
   FROM t
   GROUP BY id, w_id
   ORDER BY id DESC, w_count DESC) AS c
CROSS JOIN (SELECT @i := -1, @id := -1) AS init
HAVING
  num = 1;
So for your data result will look like:
+------+------+---------+------+------+
| id | w_id | w_count | num | gid |
+------+------+---------+------+------+
| 7 | 8 | 1 | 1 | 7 |
| 6 | 10 | 2 | 1 | 6 |
| 5 | 8 | 3 | 1 | 5 |
+------+------+---------+------+------+
Thus, you've found your id and the corresponding w_id. The idea is to count and enumerate rows, paying attention to the fact that we order them in the subquery, so we only need the first row per id (because it represents the data with the highest count).
This could be replaced with a single GROUP BY id, but then, again, the server is free to choose any row within each group (in practice it takes the first row, so it will appear to work, but the documentation guarantees nothing about that in the general case).
One nice thing about this approach is that you can just as easily select the 2nd or 3rd most frequent value; it's very flexible.
Performance
To increase performance, you can create an index on (id, w_id); obviously, it will be used for ordering and grouping the records. The variables and HAVING, however, will produce a row-by-row scan of the set derived by the inner GROUP BY. That isn't as bad as a full scan of the original data, but it still isn't a great aspect of doing this with variables. On the other hand, doing it with JOIN and a subquery as in your query won't be much different, because a temporary table is created for the subquery result set too.
But to be certain, you'll have to test. And keep in mind: you already have a valid solution which, by the way, isn't bound to DBMS-specific features and is good in terms of common SQL.
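For completeness: on MySQL 8.0+ the variable trick can be replaced by a window function, which is deterministic and arguably clearer; a sketch against the question's table p:
SELECT id, w_id AS most_used_w_id
FROM (
  SELECT id, w_id,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY COUNT(*) DESC) AS rn
  FROM p
  GROUP BY id, w_id
) ranked
WHERE rn = 1;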
Try this query
select p_id, ccc , w_id from
(
select p.id as p_id,
w_id, count(w_id) ccc
from p
group by id,w_id order by id,ccc desc) xxx
group by p_id having max(ccc)
here is the sqlfiddle link
You can also use this code if you do not want to rely on the first record of non-grouping columns
select p_id, ccc , w_id from
(
select p.id as p_id,
w_id, count(w_id) ccc
from p
group by id,w_id order by id,ccc desc) xxx
group by p_id having ccc=max(ccc);

MySQL: group by consecutive days and count groups

I have a database table which holds each user's checkins in cities. I need to know how many days a user has been in a city, and then, how many visits a user has made to a city (a visit consists of consecutive days spent in a city).
So, consider I have the following table (simplified, containing only the DATETIMEs - same user and city):
datetime
-------------------
2011-06-30 12:11:46
2011-07-01 13:16:34
2011-07-01 15:22:45
2011-07-01 22:35:00
2011-07-02 13:45:12
2011-08-01 00:11:45
2011-08-05 17:14:34
2011-08-05 18:11:46
2011-08-06 20:22:12
The number of days this user has been to this city would be 6 (30.06, 01.07, 02.07, 01.08, 05.08, 06.08).
I thought of doing this using SELECT COUNT(id) FROM table GROUP BY DATE(datetime)
Then, for the number of visits this user has made to this city, the query should return 3 (30.06-02.07, 01.08, 05.08-06.08).
The problem is that I have no idea how shall I build this query.
Any help would be highly appreciated!
You can find the first day of each visit by finding checkins where there was no checkin the day before.
select count(distinct date(start_of_visit.datetime))
from checkin start_of_visit
left join checkin previous_day
on start_of_visit.user = previous_day.user
and start_of_visit.city = previous_day.city
and date(start_of_visit.datetime) - interval 1 day = date(previous_day.datetime)
where previous_day.id is null
There are several important parts to this query.
First, each checkin is joined to any checkin from the previous day. But since it's an outer join, if there was no checkin the previous day the right side of the join will have NULL results. The WHERE filtering happens after the join, so it keeps only those checkins from the left side where there are none from the right side. LEFT OUTER JOIN/WHERE IS NULL is really handy for finding where things aren't.
Then it counts distinct checkin dates to make sure it doesn't double-count if the user checked in multiple times on the first day of the visit. (I actually added that part on edit, when I spotted the possible error.)
Edit: I just re-read your proposed query for the first question. Your query would get you the number of checkins on a given date, instead of a count of dates. I think you want something like this instead:
select count(distinct date(datetime))
from checkin
where user='some user' and city='some city'
Try to apply this code to your task -
CREATE TABLE visits(
user_id INT(11) NOT NULL,
dt DATETIME DEFAULT NULL
);
INSERT INTO visits VALUES
(1, '2011-06-30 12:11:46'),
(1, '2011-07-01 13:16:34'),
(1, '2011-07-01 15:22:45'),
(1, '2011-07-01 22:35:00'),
(1, '2011-07-02 13:45:12'),
(1, '2011-08-01 00:11:45'),
(1, '2011-08-05 17:14:34'),
(1, '2011-08-05 18:11:46'),
(1, '2011-08-06 20:22:12'),
(2, '2011-08-30 16:13:34'),
(2, '2011-08-31 16:13:41');
SET @i = 0;
SET @last_dt = NULL;
SET @last_user = NULL;
SELECT v.user_id,
       COUNT(DISTINCT(DATE(dt))) number_of_days,
       MAX(days) number_of_visits
FROM
  (SELECT user_id, dt,
          @i := IF(@last_user IS NULL OR @last_user <> user_id, 1, IF(@last_dt IS NULL OR (DATE(dt) - INTERVAL 1 DAY) > DATE(@last_dt), @i + 1, @i)) AS days,
          @last_dt := DATE(dt),
          @last_user := user_id
   FROM
     visits
   ORDER BY
     user_id, dt
  ) v
GROUP BY
  v.user_id;
----------------
Output:
+---------+----------------+------------------+
| user_id | number_of_days | number_of_visits |
+---------+----------------+------------------+
| 1 | 6 | 3 |
| 2 | 2 | 1 |
+---------+----------------+------------------+
Explanation:
To understand how it works, let's check the subquery:
SET @i = 0;
SET @last_dt = NULL;
SET @last_user = NULL;
SELECT user_id, dt,
       @i := IF(@last_user IS NULL OR @last_user <> user_id, 1, IF(@last_dt IS NULL OR (DATE(dt) - INTERVAL 1 DAY) > DATE(@last_dt), @i + 1, @i)) AS days,
       @last_dt := DATE(dt) lt,
       @last_user := user_id lu
FROM
  visits
ORDER BY
  user_id, dt;
As you see, the query returns all rows and performs a ranking for the number of visits. This is a known ranking method based on variables; note that the rows are ordered by the user and date fields. The query calculates user visits and outputs the data set below, where the days column provides the rank for the number of visits -
+---------+---------------------+------+------------+----+
| user_id | dt | days | lt | lu |
+---------+---------------------+------+------------+----+
| 1 | 2011-06-30 12:11:46 | 1 | 2011-06-30 | 1 |
| 1 | 2011-07-01 13:16:34 | 1 | 2011-07-01 | 1 |
| 1 | 2011-07-01 15:22:45 | 1 | 2011-07-01 | 1 |
| 1 | 2011-07-01 22:35:00 | 1 | 2011-07-01 | 1 |
| 1 | 2011-07-02 13:45:12 | 1 | 2011-07-02 | 1 |
| 1 | 2011-08-01 00:11:45 | 2 | 2011-08-01 | 1 |
| 1 | 2011-08-05 17:14:34 | 3 | 2011-08-05 | 1 |
| 1 | 2011-08-05 18:11:46 | 3 | 2011-08-05 | 1 |
| 1 | 2011-08-06 20:22:12 | 3 | 2011-08-06 | 1 |
| 2 | 2011-08-30 16:13:34 | 1 | 2011-08-30 | 2 |
| 2 | 2011-08-31 16:13:41 | 1 | 2011-08-31 | 2 |
+---------+---------------------+------+------------+----+
Then we group this data set by user and use aggregate functions:
'COUNT(DISTINCT(DATE(dt)))' - counts the number of days
'MAX(days)' - the number of visits; it is the maximum value of the days field from our subquery.
That is all;)
Working from the data sample provided by Devart, the inner "PreQuery" works with SQL variables. By defaulting @LUser to -1 (a probably non-existent user ID), the IF() test checks for any difference between the last user and the current one. As soon as a new user appears, it gets a value of 1. Additionally, if the last date is more than 1 day before the new check-in date, it also gets a value of 1. Then the subsequent columns reset @LUser and @LDate to the values of the record just tested, ready for the next cycle. The outer query then just sums and counts them for the final results against the Devart data set:
+---------+-----------------+------------+
| User ID | Distinct Visits | Total Days |
+---------+-----------------+------------+
|       1 |               3 |          9 |
|       2 |               1 |          2 |
+---------+-----------------+------------+
select PreQuery.User_ID,
       sum( PreQuery.NextVisit ) as DistinctVisits,
       count(*) as TotalDays
from
   ( select v.user_id,
            if( @LUser <> v.User_ID OR @LDate < ( date( v.dt ) - Interval 1 day ), 1, 0 ) as NextVisit,
            @LUser := v.user_id,
            @LDate := date( v.dt )
     from
        Visits v,
        ( select @LUser := -1, @LDate := date(now()) ) AtVars
     order by
        v.user_id,
        v.dt ) PreQuery
group by
   PreQuery.User_ID
For the first sub-task:
select count(*)
from (
select TO_DAYS(p.d)
from p
group by TO_DAYS(p.d)
) t
I think you should consider changing the database structure. You could add a table visits and a visit_id column to your checkins table. Each time you want to register a new checkin, you check whether there is any checkin from a day back. If yes, then you add the new checkin with the visit_id from yesterday's checkin. If not, then you add a new visit to visits and a new checkin with the new visit_id.
Then you could get your data in one query with something like this:
SELECT COUNT(id) AS number_of_days, COUNT(DISTINCT visit_id) number_of_visits FROM checkin GROUP BY user, city
It's not very optimal, but it is still better than anything you can do with the current structure, and it will work. Also, if the results can be separate queries it will be very fast.
But of course the drawbacks are that you will need to change the database structure, do some more scripting, and convert the current data to the new structure (i.e. you will need to add visit_id to the existing data).
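A rough sketch of what that insert flow could look like (every table and column name here is an assumption, and in practice you would drive this from application code or a trigger; visits.visit_id is assumed to be AUTO_INCREMENT):
-- new checkin arrives as (@user, @city, @now)
SET @visit_id = (
  SELECT visit_id FROM checkin
  WHERE user = @user AND city = @city
    AND DATE(datetime) >= DATE(@now) - INTERVAL 1 DAY
  ORDER BY datetime DESC
  LIMIT 1
);
-- no checkin yesterday or today: open a new visit
INSERT INTO visits (user, city)
SELECT @user, @city FROM DUAL WHERE @visit_id IS NULL;
SET @visit_id = COALESCE(@visit_id, LAST_INSERT_ID());
-- record the checkin with the resolved visit_id
INSERT INTO checkin (user, city, datetime, visit_id)
VALUES (@user, @city, @now, @visit_id);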