SQL: count sequential visits - mysql

I have the following table (visits):
id(int) | fb_id(varchar)| flipbook(varchar) |
---- ---------- ---------
1 1123 november 2014
2 1124 november 2014
3 1123 december 2014
4 1124 december 2014
5 1123 december 2014
6 1123 january 2015
7 1126 january 2015
8 1125 february 2015
9 1123 february 2015
10 1124 march 2015
11 1125 march 2015
11 1123 march 2015
After the query runs, I want to get the following results:
sequence count
5 1 (1 user visited 5 flipbooks in a row: 1123)
2 2 (2 users visited 2 flipbooks in a row: 1124, 1125)
1 1 (1 user visited only 1 flipbook: 1126)
Any ideas how to achieve this?

fb_id=1124 visited only 1 flipbook "in a row", unless row id=4 is supposed to be "december 2014" and not "december 2015".
Yes, this is possible to accomplish in MySQL, using user-defined variables. The MySQL Reference Manual cautions that the behavior with this usage of user-defined variables within the same statement is undefined.
sequence count info
-------- ------ ---------------------------------------------
5 1 (1 user visited 5 flipbooks in a row: 1123
2 1 (1 user visited 2 flipbooks in a row: 1125
1 2 (2 users visited only 1 flipbook: 1124,1126
That result set is produced from the following SQL query:
SELECT d.seq AS `sequence`
, COUNT(1) AS `count`
, CONCAT('('
,COUNT(1)
,' user'
,IF(COUNT(1)>1,'s','')
,' visited'
,IF(d.seq>1,'',' only')
,' '
,d.seq
,' flipbook'
,IF(d.seq>1,'s in a row: ',': ')
,GROUP_CONCAT(d.fb_id ORDER BY d.fb_id)
) AS `info`
FROM ( SELECT c.fb_id
, MAX(c.cnt) AS seq
FROM ( SELECT @cnt := IF(@prev_fb_id = v.fb_id AND PERIOD_DIFF(v.yyyymm,@prev_yyyymm)=1, @cnt + 1, 1) AS cnt
, @prev_yyyymm := v.yyyymm AS yyyymm
, @prev_fb_id := v.fb_id AS fb_id
FROM ( SELECT @prev_fb_id := NULL
, @prev_yyyymm := NULL
, @cnt := 0
) i
CROSS
JOIN ( SELECT t.fb_id
, DATE_FORMAT(STR_TO_DATE(CONCAT('01 ',t.flipbook),'%d %M %Y'),'%Y%m') AS yyyymm
FROM t
GROUP BY t.fb_id, yyyymm
ORDER BY t.fb_id, yyyymm
) v
) c
GROUP BY c.fb_id
) d
GROUP BY d.seq
ORDER BY d.seq DESC
FOLLOWUP
The table name of the source table goes into the query in the innermost inline view, aliased as v. (In the example above, the table name is t.)
To understand how this works, you can run just the query for that innermost inline view, to see what it returns. Its big job is reformatting the flipbook column into a YYYYMM format, and ordering the rows. (We will later use the PERIOD_DIFF function to calculate the number of months between flipbook values.)
The inline view i is only there to initialize the user-defined variables we are going to use. We do it in an innermost inline view query so that gets done before those variables are referenced in the outer query. It's essentially equivalent to running separate SET statements immediately before the query runs. (We don't want any left-over values of the variables mucking with our results.)
Once the v and i views are materialized (as derived tables), the query in the inline view aliased as c can run. (The view queries in the FROM clause essentially serve like tables.)
This query is where the magic is. We're using user-defined variables to preserve the values of the "previous" row, so we can compare that to the current row. If the current row is for the same user, and is exactly one month after the previous row, we increment the sequence count by 1; otherwise, we set it to 1.
Once that query completes, we have a derived table we can use as the row source for yet another query. In this query, we want to find the "maximum" value of that sequence counter for each user. That will give us the longest sequence for each user.
With that set, the outermost query is almost trivial... order by the longest sequence in descending order, and collapse the rows to get a count of the number of users that had the same maximum sequence value.
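For readers who find the nested views hard to follow, here is a minimal sketch of the same logic in plain Python rather than SQL. It uses the table exactly as shown in the question, so it reproduces the result the question asks for. The names `month_index`, `longest_runs` and `histogram` are made up for this illustration and are not part of the query above.

```python
from collections import defaultdict

MONTHS = ["january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"]

def month_index(flipbook):
    """Turn 'november 2014' into a running month number (year * 12 + month)."""
    name, year = flipbook.split()
    return int(year) * 12 + MONTHS.index(name.lower())

def longest_runs(visits):
    """Map each fb_id to its longest run of consecutive months."""
    months_by_user = defaultdict(set)   # a set drops repeat visits in one month
    for fb_id, flipbook in visits:
        months_by_user[fb_id].add(month_index(flipbook))
    best = {}
    for fb_id, months in months_by_user.items():
        run = longest = 0
        prev = None
        for m in sorted(months):
            # same test as PERIOD_DIFF(...) = 1 in the query
            run = run + 1 if prev is not None and m - prev == 1 else 1
            longest = max(longest, run)
            prev = m
        best[fb_id] = longest
    return best

def histogram(best):
    """Group users by their longest run, longest run first."""
    by_length = defaultdict(list)
    for fb_id, length in best.items():
        by_length[length].append(fb_id)
    return {n: sorted(ids) for n, ids in sorted(by_length.items(), reverse=True)}

visits = [
    ("1123", "november 2014"), ("1124", "november 2014"),
    ("1123", "december 2014"), ("1124", "december 2014"),
    ("1123", "december 2014"), ("1123", "january 2015"),
    ("1126", "january 2015"), ("1125", "february 2015"),
    ("1123", "february 2015"), ("1124", "march 2015"),
    ("1125", "march 2015"), ("1123", "march 2015"),
]
result = histogram(longest_runs(visits))
print(result)
```

With row id=4 as "december 2014", user 1124 gets a run of 2, which matches the expected output in the question.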
To get the highest number of visits by an fb_id within a sequence, we can accumulate the number of visits in that innermost view query. A COUNT(1) or a SUM(1) will give us the number of visits in each month.
That can feed into the next query. We can do the same check we do for accumulating the contiguous months. Instead of incrementing by 1, we'll accumulate the total number of visits.
The next query has to be modified. We can't just wrap a MAX() around the tot, because we wouldn't be guaranteed that the total visits would be from that same longest sequence. We might have 6 visits in a 5 month sequence, but that same user might have visited 8 times in 3 months. So we scrap the MAX() function, and instead use ordering (from highest to lowest). We'll keep the value for the first row for a fb_id, and set the other values to NULL. Then, on the outermost query, we can use MAX() aggregate which will ignore the NULLs, and return the highest total visits from among all the users that had the same sequence value.
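The "keep the total from the longest sequence" idea can also be sketched in plain Python. This is only an illustration under the assumption that the table looks exactly as in the question; `best_runs` and `summarize` are hypothetical helper names. Comparing `(run, total)` tuples plays the role of the ORDER BY on cnt and tot.

```python
from collections import Counter, defaultdict

MONTHS = ["january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"]

def month_index(flipbook):
    name, year = flipbook.split()
    return int(year) * 12 + MONTHS.index(name.lower())

def best_runs(visits):
    """For each fb_id return (longest run in months, visits inside that run)."""
    per_month = Counter((fb_id, month_index(f)) for fb_id, f in visits)
    best = {}
    prev_id = prev_month = None
    run = total = 0
    for (fb_id, month), n in sorted(per_month.items()):
        if fb_id == prev_id and month - prev_month == 1:
            run, total = run + 1, total + n      # sequence continues
        else:
            run, total = 1, n                    # new user or a gap
        # tuple comparison: longest run first, then highest total
        best[fb_id] = max(best.get(fb_id, (0, 0)), (run, total))
        prev_id, prev_month = fb_id, month
    return best

def summarize(best):
    """Rows of (sequence, count of users, highest total visits)."""
    rows = defaultdict(lambda: [0, 0])
    for run, total in best.values():
        rows[run][0] += 1
        rows[run][1] = max(rows[run][1], total)
    return [(run, c, t) for run, (c, t) in sorted(rows.items(), reverse=True)]

visits = [
    ("1123", "november 2014"), ("1124", "november 2014"),
    ("1123", "december 2014"), ("1124", "december 2014"),
    ("1123", "december 2014"), ("1123", "january 2015"),
    ("1126", "january 2015"), ("1125", "february 2015"),
    ("1123", "february 2015"), ("1124", "march 2015"),
    ("1125", "march 2015"), ("1123", "march 2015"),
]
summary = summarize(best_runs(visits))
print(summary)
```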
We can get this result:
sequence count highest_tot
-------- ------ -----------
5 1 6
2 1 2
1 2 1
From a query like this:
SELECT d.seq AS `sequence`
, COUNT(1) AS `count`
, MAX(FLOOR(d.tot)) AS `highest_tot`
FROM ( SELECT IF(@c_fb_id=c.fb_id,NULL,c.cnt) AS seq
, IF(@c_fb_id=c.fb_id,NULL,c.tot) AS tot
, @c_fb_id := c.fb_id AS fb_id
FROM ( SELECT @cnt := IF(@prev_fb_id = v.fb_id AND PERIOD_DIFF(v.yyyymm,@prev_yyyymm)=1, @cnt + 1, 1) AS cnt
, @tot := IF(@prev_fb_id = v.fb_id AND PERIOD_DIFF(v.yyyymm,@prev_yyyymm)=1, @tot + v.tot, v.tot) AS tot
, @prev_yyyymm := v.yyyymm AS yyyymm
, @prev_fb_id := v.fb_id AS fb_id
FROM ( SELECT @prev_fb_id := NULL
, @prev_yyyymm := NULL
, @cnt := 0
, @tot := 0
, @c_fb_id := NULL
) i
CROSS
JOIN ( SELECT t.fb_id
, DATE_FORMAT(STR_TO_DATE(CONCAT('01 ',t.flipbook),'%d %M %Y'),'%Y%m') AS yyyymm
, SUM(1) AS tot
FROM t
GROUP BY t.fb_id, yyyymm
ORDER BY t.fb_id, yyyymm
) v
) c
ORDER BY c.fb_id DESC, c.cnt DESC, c.tot DESC
) d
WHERE d.seq IS NOT NULL
GROUP BY d.seq
ORDER BY d.seq DESC

It is hard to work out a user's sequential visits in pure SQL.
However, it is easier with some other method, e.g. create a table, or reuse one of your existing tables:
users(fb_id, best_sequence, current_sequence, last_modified)
-------------------------------------------------------
1123 5 3 2016-01
Loop over your flipbooks, e.g.
SELECT * FROM visits WHERE fb_id=1123 AND flipbook >= ....
(you might redesign the data to make the SQL easier here)
Update current_sequence to 4 if the sequence continues, or to 0 if a gap is found.
If current_sequence > best_sequence, set best_sequence = current_sequence.
You can do this via a cron job, a trigger, or whatever other method you feel most comfortable with.
This is only an idea; you will need to write your own code.
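A rough sketch of that bookkeeping in Python might look like the following. Everything here is an assumption made for illustration: the class `UserStreak`, the function `record_visit`, and the rule that visits arrive in chronological order. It also resets the counter to 1 rather than 0 after a gap, on the reasoning that the visit that follows the gap starts a new run.

```python
class UserStreak:
    """One row of the hypothetical users table."""
    def __init__(self):
        self.best_sequence = 0
        self.current_sequence = 0
        self.last_month = None      # running month number of the last visit

def record_visit(streak, year, month):
    """Update the counters for one visit; visits are assumed to arrive in order."""
    m = year * 12 + (month - 1)
    gap = None if streak.last_month is None else m - streak.last_month
    if gap is None or gap > 1:
        streak.current_sequence = 1          # first visit, or a gap was found
    elif gap == 1:
        streak.current_sequence += 1         # next month: the sequence continues
    else:
        return streak                        # same (or older) month: nothing to do
    streak.last_month = m
    if streak.current_sequence > streak.best_sequence:
        streak.best_sequence = streak.current_sequence
    return streak

s = UserStreak()
for year, month in [(2014, 11), (2014, 12), (2014, 12),
                    (2015, 1), (2015, 2), (2015, 3)]:
    record_visit(s, year, month)
print(s.best_sequence, s.current_sequence)

record_visit(s, 2015, 6)                     # a gap resets the current run only
print(s.best_sequence, s.current_sequence)
```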

Related

SQL Query Sequential Month Logins

I have the following SQL table
username | Month
292 10
123 12
123 1
123 2
123 4
345 6
345 7
I want to query it to get each username's login streak, i.e. the count of sequential months, meaning the end result I am looking for looks like this:
username | Streak
292 1
123 3
345 2
How can I achieve this, taking into account the Month 12 --> Month 1 issue?
I appreciate your help.
This would give you the result you want:
select username, count(*)
from (
select
username
, month_1
, coalesce(nullif(lead(month_1)
over (partition by username
order by coalesce(nullif(month_1,12),0))
- coalesce(nullif(month_1,12),0),-1),1) as MonthsTillNext
from login_tab
) Step1
where MonthsTillNext=1
group by username
It works by calculating the difference from the next row, where the next row is defined as the next month_no in ascending order, treating 12 as 0 (refer to the ambiguity I mentioned in my comment). It then keeps only the rows for consecutive months and counts them.
Beware though, in addition to the anomaly around month:12, there is another case not considered: if the months for the user are 1,2,3 and 6,7,8 this would count as Streak:6; is it what you wanted?
One way would be with a recursive CTE, like
WITH RECURSIVE cte (username, month, cnt) AS
(
SELECT username, month, 1
FROM test
UNION ALL
SELECT test.username, test.month, cte.cnt+1
FROM cte INNER JOIN test
ON cte.username = test.username AND CASE WHEN cte.month = 12 THEN 1 ELSE cte.month + 1 END = test.month
)
SELECT username, MAX(cnt)
FROM cte
GROUP BY username
ORDER BY username
The idea is that the CTE (named cte in my example) recursively joins back to the table on a condition where the user is the same and the month is the next one. So for user 345, you have:
Username | Month | Cnt
345 6 1
345 7 1
345 7 2
The rows with cnt=1 are from the original table (with the extra cnt column hardcoded to 1), the row with cnt=2 is from the recursive part of the query (which found a match and used cnt+1 for its cnt). The query then selects the maximum for each user.
The join uses a CASE statement to handle 12 being followed by 1.
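To make the wrap-around rule concrete, here is a small Python sketch of the same idea (follow "next month" links, with 12 followed by 1). It is not a translation of the CTE, just an illustration; `longest_streak` is a made-up name, and since the table has no year column it shares the same ambiguity: month 12 followed by month 1 is always treated as consecutive.

```python
def longest_streak(months):
    """Longest chain of consecutive month numbers, where 12 is followed by 1."""
    present = set(months)
    best = 0
    for start in present:
        length, m = 1, start
        # stop once every month has been used, so a full year cannot loop forever
        while length < len(present) and (m % 12 + 1) in present:
            m = m % 12 + 1
            length += 1
        best = max(best, length)
    return best

logins = {292: [10], 123: [12, 1, 2, 4], 345: [6, 7]}
streaks = {user: longest_streak(months) for user, months in logins.items()}
print(streaks)
```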
You can see it working with your sample data in this fiddle.
The one shared by @EdmCoff is quite elegant.
Here is another one, without recursion, just using conditional logic:
with data_cte as
(
select username, month_1,
case when (count(month_1) over (partition by username) = 1) then 1
when (lead(month_1) over (partition by username order by username) - month_1) = 1 OR (month_1 - lag(month_1) over (partition by username order by username)) = 1 then 1
when (month_1 = 12 and min (month_1) over (partition by username) =1) then 1
end cnt
from login_tab
)
select username, count(cnt) from data_cte group by username
DB Fiddle here.

SQL querying - counting from two tables by weekday

I have the following two (MySQL) tables called "Jobs" and "Employees_Jobs":
Jobs:
job_id job_creation_date
1 2016-01-01
2 2016-01-02
Employees_Jobs (job applications):
EJ_job_id EJ_creation_date
1 2016-01-02
2 2016-01-02
2 2016-01-03
I want MySQL to return the number of jobs created and the number of job applications created per day of the week; given the above data it should return:
weekday num_of_jobs_entered num_of_applications_entered
Friday 1 0
Saturday 1 2 // corrected from 1
Sunday 0 1 // 2
I now have the following query:
SELECT
DAYNAME(job_creation_date) as weekday,
(SELECT COUNT(*) FROM Jobs) as num_of_jobs_entered,
(SELECT COUNT(*) FROM Employees_Jobs) as num_of_applications_entered
FROM
dual
GROUP BY
weekday
ORDER BY
weekday;
What am I doing wrong?
Thanks!
Try this:
SELECT wd, COUNT(jcnt), COUNT(ecnt) FROM
(SELECT DAYNAME(job_creation_date) wd, 1 jcnt, null ecnt FROM Jobs UNION ALL
SELECT DAYNAME(EJ_creation_date), null, 1 FROM Employees_Jobs ) a
GROUP BY wd
See here for a working example.
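If it helps to see what the UNION ALL plus COUNT is doing, the same tally can be sketched in Python on the sample data. This is only an illustration; `counts_by_weekday` is a hypothetical name, and the weekday names are hard-coded so the output does not depend on the locale.

```python
from collections import Counter
from datetime import date

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def counts_by_weekday(job_dates, application_dates):
    """Map weekday name to (number of jobs, number of applications)."""
    jobs = Counter(DAYS[d.weekday()] for d in job_dates)
    apps = Counter(DAYS[d.weekday()] for d in application_dates)
    # keep only weekdays that occur in at least one of the two tables
    return {day: (jobs[day], apps[day])
            for day in DAYS if jobs[day] or apps[day]}

job_dates = [date(2016, 1, 1), date(2016, 1, 2)]
application_dates = [date(2016, 1, 2), date(2016, 1, 2), date(2016, 1, 3)]
per_day = counts_by_weekday(job_dates, application_dates)
print(per_day)
```

Each source is counted on its own and the two tallies are only lined up by weekday at the end, which is the part the original query with two unrelated COUNT(*) subqueries was missing.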
EDIT
If I understand you correctly you want to divide the above calculated counts of job offers and requirements by the current week number of the current year (which will make sense of course only if all considered job counts have also occurred in the current year -> some filtering might become necessary).
However, without the filtering you could do
SELECT week_day, Week(CURDATE()) weekCurrdate,
COUNT(num_of_jobs_entered)/Week(CURDATE()) avg_of_jobs_entered,
COUNT(num_of_applications_entered)/Week(CURDATE()) avg_num_of_applications_entered
FROM (
SELECT DAYNAME(job_creation_date) week_day, 1 num_of_jobs_entered,
null num_of_applications_entered FROM Jobs UNION ALL
SELECT DAYNAME(EJ_creation_date), null, 1 FROM Employees_Jobs ) A
GROUP BY week_day;
The IFNULL() function is no longer needed, but you have to use COUNT() in the outer select instead. Since the current week number is 1 at the moment, the result from this query will (presently!) be identical to that of the previous query; see the modified fiddle here.

Cumulative sum query on foreign key

I want to write a query for a cumulative sum in MySQL. I have a foreign key in my table and I want to add their hours as a cumulative sum.
Table 1
id(not primary key) Hours
1 4
2 4
1 5
I have tried this query
select @spent_hours := @spent_hours + hours as spent
from time
join (select @spent_hours := 0) s
I am getting this
id(not primary key) hours spent
1 4 4
2 4 8
1 5 13
But I want this result:
id(not primary key) Hours spent
1 4 4
2 4 4
1 5 9
Since you have an autoincrement field (let's assume for this case it's called record_id) you can use this little trick to achieve what you want:
SELECT Main.id, Main.spentHours,
(
SELECT SUM(spentHours)
FROM Table1 WHERE Table1.id = Main.id
AND Table1.record_id <= Main.record_id
) as totalSpentHours
FROM Table1 Main
ORDER BY Main.record_id ASC
This fetches the id and the current spent hours and, using a subselect, the sum of all hours for that id up to and including the current record.
You need an additional variable to keep track of the cumulative sum within each id:
select t.id, t.hours,
(@h := if(@i = t.id, @h + t.hours,
if(@i := t.id, t.hours, t.hours)
)
) as spent
from time t cross join
(select @h := 0, @i := 0) params
order by t.id, ??;
Note: you need an additional column to specify the order for the cumulative sum (indicated by ?? in the order by clause). Remember that SQL tables represent unordered sets, so you need a column to explicitly represent ordering.
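As a sanity check on the expected output, here is a minimal Python sketch of a running total that is kept separately per id. The function name `running_totals` is made up, and the input list order stands in for the ordering column that the SQL version needs.

```python
from collections import defaultdict

def running_totals(rows):
    """rows: (id, hours) pairs in the desired order; adds a per-id running sum."""
    totals = defaultdict(int)           # one accumulator per id
    out = []
    for id_, hours in rows:
        totals[id_] += hours
        out.append((id_, hours, totals[id_]))
    return out

rows = [(1, 4), (2, 4), (1, 5)]
spent = running_totals(rows)
print(spent)
```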

How to write an extensible SQL query to find the users who continuously log in for n days

If I have a table (Oracle or MySQL) which stores the dates on which users log in,
how can I write SQL (or something else) to find the users who have logged in continuously for n days?
For example:
userID | logindate
1000 2014-01-10
1000 2014-01-11
1000 2014-02-01
1000 2014-02-02
1001 2014-02-01
1001 2014-02-02
1001 2014-02-03
1001 2014-02-04
1001 2014-02-05
1002 2014-02-01
1002 2014-02-03
1002 2014-02-05
.....
We can see that user 1000 logged in continuously for two days in 2014, user 1001 logged in continuously for 5 days, and user 1002 never logged in on consecutive days.
The SQL should be extensible, which means I can pick any value of n and, by modifying it a little or passing a new parameter, get the expected results.
Thank you!
As we don't know what DBMS you are using (you named both MySQL and Oracle), here are two solutions, both doing the same thing: order the rows and subtract rownumber days from the login date (so if the 6th record is 2014-02-12 and the 7th is 2014-02-13 they both result in 2014-02-06). Then we group by user and that groupday and count the days. Finally we group by user to find the longest series.
Here is a solution for a dbms with analytic window functions (e.g. Oracle):
select userid, max(days)
from
(
select userid, groupday, count(*) as days
from
(
select
userid, logindate - row_number() over (partition by userid order by logindate) as groupday
from mytable
)
group by userid, groupday
)
group by userid
--having max(days) >= 3
And here is a MySQL query (untested, because I don't have MySQL available):
select
userid, max(days)
from
(
select
userid, date_add(logindate, interval -row_num day) as groupday, count(*) as days
from
(
select
userid, logindate,
@row_num := @row_num + 1 as row_num
from mytable
cross join (select @row_num := 0) r
order by userid, logindate
) numbered
group by userid, groupday
) grouped
group by userid
-- having max(days) >= 3
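The "login date minus row number" trick is easier to see outside SQL. Below is a minimal Python sketch on the sample data from the question; the function names are invented for this example and it assumes at most one row per user per day (duplicates are dropped by the set).

```python
from collections import Counter
from datetime import date, timedelta

def longest_login_runs(logins):
    """Map each user to the length of their longest run of consecutive days."""
    days_by_user = {}
    for user, day in logins:
        days_by_user.setdefault(user, set()).add(day)
    result = {}
    for user, days in days_by_user.items():
        # consecutive days share the same (day - row number) "groupday"
        groups = Counter(day - timedelta(days=i)
                         for i, day in enumerate(sorted(days), start=1))
        result[user] = max(groups.values())
    return result

def users_with_run(logins, n):
    """Users who logged in on at least n consecutive days."""
    return sorted(u for u, run in longest_login_runs(logins).items() if run >= n)

logins = (
    [(1000, date(2014, 1, 10)), (1000, date(2014, 1, 11)),
     (1000, date(2014, 2, 1)), (1000, date(2014, 2, 2))]
    + [(1001, date(2014, 2, d)) for d in range(1, 6)]
    + [(1002, date(2014, 2, d)) for d in (1, 3, 5)]
)
runs = longest_login_runs(logins)
print(runs)
print(users_with_run(logins, 3))
```

Here `n` is an ordinary parameter, which corresponds to changing the number in the commented-out HAVING clause.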
I think the following query will give you a very extensible parametrization:
select z.userid, count(*) continuous_login_days
from
(
with max_dates as
( -- Get max date for every user ID
select t.userid, max(t.logindate) max_date
from test t
group by t.userid
),
ranks as
( -- Get ranks for login dates per user
select t.*,
row_number() over
(partition by t.userid order by t.logindate desc) rnk
from test t
)
-- So here, we select continuous days by checking if rank inside group
-- (per user ID) matches login date compared to max date
select r.userid, r.logindate, r.rnk, m.max_date
from ranks r, max_dates m
where m.userid = r.userid
and r.logindate + r.rnk - 1 = m.max_date -- here is the key
) z
-- Then we only group by user ID to get the number of continuous days
group by z.userid
;
Here is the result:
USERID CONTINUOUS_LOGIN_DAYS
1 1000 2
2 1001 5
3 1002 1
So you can just choose by querying field CONTINUOUS_LOGIN_DAYS.
EDIT : If you want to choose from all ranges (not only the last one), my query structure no longer works because it relied on the last range. But here is a workaround:
with w as
( -- Parameter
select 2 nb_cont_days from dual
)
select *
from
(
select t.*,
-- Get number of days around
(select count(*) from test t2
where t2.userid = t.userid
and t2.logindate between t.logindate - nb_cont_days + 1
and t.logindate) m1,
-- Get also number of days more in the past, and in the future
(select count(*) from test t2
where t2.userid = t.userid
and t2.logindate between t.logindate - nb_cont_days
and t.logindate + 1) m2,
w.nb_cont_days
from w, test t
) x
-- If these 2 fields match, then we have what we want
where x.m1 = x.nb_cont_days
and x.m2 = x.nb_cont_days
order by 1, 2
You just have to change the parameter in the WITH clause, so you can even create a function from this query to call it with this parameter.
SELECT userID,count(userID) as numOfDays FROM LOGINTABLE WHERE logindate between '2014-01-01' AND '2014-02-28'
GROUP BY userID
In this case you can check the login days per user, in a specific period

mysql moving average of N rows

I have a simple MySQL table like below, used to compute MPG for a car.
+-------------+-------+---------+
| DATE | MILES | GALLONS |
+-------------+-------+---------+
| JAN 25 1993 | 20.0 | 3.00 |
| FEB 07 1993 | 55.2 | 7.22 |
| MAR 11 1993 | 44.1 | 6.28 |
+-------------+-------+---------+
I can easily compute the Miles Per Gallon (MPG) for the car using a select statement, but because the MPG varies widely from fillup to fillup (i.e. you don't fill the exact same amount of gas each time), I would like to compute a 'MOVING AVERAGE' as well. So for any row the MPG is MILES/GALLONS for that row, and the MOVINGMPG is the SUM(MILES)/SUM(GALLONS) for the last N rows. If fewer than N rows exist by that point, just SUM(MILES)/SUM(GALLONS) up to that point.
Is there a single SELECT statement that will fetch the rows with MPG and MOVINGMPG by substituting N into the select statement?
Yes, it's possible to return the specified resultset with a single SQL statement.
Unfortunately, MySQL does not support analytic functions, which would make for a fairly simple statement. Even though MySQL does not have syntax to support them, it is possible to emulate some analytic functions using MySQL user variables.
One of the ways to achieve the specified result set (with a single SQL statement) is to use a JOIN operation, assigning a unique ascending integer value (rownum, derived and assigned within the query) to each row.
For example:
SELECT q.rownum AS rownum
, q.date AS latest_date
, q.miles/q.gallons AS latest_mpg
, COUNT(1) AS cnt_rows
, MIN(r.date) AS earliest_date
, SUM(r.miles) AS rtot_miles
, SUM(r.gallons) AS rtot_gallons
, SUM(r.miles)/SUM(r.gallons) AS rtot_mpg
FROM ( SELECT @s_rownum := @s_rownum + 1 AS rownum
, s.date
, s.miles
, s.gallons
FROM mytable s
JOIN (SELECT @s_rownum := 0) c
ORDER BY s.date
) q
JOIN ( SELECT @t_rownum := @t_rownum + 1 AS rownum
, t.date
, t.miles
, t.gallons
FROM mytable t
JOIN (SELECT @t_rownum := 0) d
ORDER BY t.date
) r
ON r.rownum <= q.rownum
AND r.rownum > q.rownum - 2
GROUP BY q.rownum
Your desired value of "n" to specify how many rows to include in each rollup row is specified in the predicate just before the GROUP BY clause. In this example, up to "2" rows in each running total row.
If you specify a value of 1, you will get (basically) the original table returned.
To eliminate any "incomplete" running total rows (consisting of fewer than "n" rows), that value of "n" would need to be specified again, adding:
HAVING COUNT(1) >= 2
sqlfiddle demo: http://sqlfiddle.com/#!2/52420/2
Followup:
Q: I'm trying to understand your SQL statement. Does your solution do a select of twenty rows for each row in the db? In other words, if I have 1000 rows will your statement perform 20000 selects? (I'm worried about performance)...
A: You are right to be concerned with performance.
To answer your question, no, this does not perform 20,000 selects for 1,000 rows.
The performance hit comes from the two (essentially identical) inline views (aliased as q and r). What MySQL does with these (basically) is create temporary MyISAM tables (MySQL calls them "derived tables"), which are basically copies of mytable, with an extra column, each row assigned a unique integer value from 1 to the number of rows.
Once the two "derived" tables are created and populated, MySQL runs the outer query, using those two "derived" tables as a row source. Each row from q, is matched with up to n rows from r, to calculate the "running total" miles and gallons.
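To illustrate what "each row from q is matched with up to n rows from r" computes, here is a small Python sketch of the same moving MPG over the last n fillups. It is not how MySQL executes the join; `moving_mpg` is a hypothetical name and the rows are assumed to be already sorted by date.

```python
from collections import deque

def moving_mpg(fillups, n):
    """fillups: (miles, gallons) pairs ordered by date.

    Returns (mpg for the row, SUM(miles)/SUM(gallons) over the last n rows).
    """
    window = deque(maxlen=n)            # keeps only the n most recent rows
    out = []
    for miles, gallons in fillups:
        window.append((miles, gallons))
        total_miles = sum(m for m, _ in window)
        total_gallons = sum(g for _, g in window)
        out.append((miles / gallons, total_miles / total_gallons))
    return out

fillups = [(20.0, 3.00), (55.2, 7.22), (44.1, 6.28)]
for mpg, moving in moving_mpg(fillups, 2):
    print(round(mpg, 3), round(moving, 3))
```

With n = 1 the moving value equals the per-row MPG, which mirrors the remark above that a value of 1 basically returns the original table.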
For better performance, you could use a column already in the table, rather than having the query assign unique integer values. For example, if the date column is unique, then you could calculate "running total" over a certain period of days.
SELECT q.date AS latest_date
, SUM(q.miles)/SUM(q.gallons) AS latest_mpg
, COUNT(1) AS cnt_rows
, MIN(r.date) AS earliest_date
, SUM(r.miles) AS rtot_miles
, SUM(r.gallons) AS rtot_gallons
, SUM(r.miles)/SUM(r.gallons) AS rtot_mpg
FROM mytable q
JOIN mytable r
ON r.date <= q.date
AND r.date > q.date + INTERVAL -30 DAY
GROUP BY q.date
(For performance, you would want an appropriate index defined with date as a leading column in the index.)
For the first query, any predicates included (in the inline view definition queries) to reduce the number of rows returned (for example, return only date values in the past year) would reduce the number of rows to be processed, and would also likely improve performance.
Again, to your question about running 20,000 selects for 1,000 rows... a nested loops operation is another way to get the same result set. For a large number of rows, this can exhibit slower performance. (On the other hand, this approach can be fairly efficient when only a few rows are being returned.)
SELECT q.date AS latest_date
, q.miles/q.gallons AS latest_mpg
, ( SELECT SUM(r.miles)/SUM(r.gallons)
FROM mytable r
WHERE r.date <= q.date
AND r.date >= q.date + INTERVAL -90 DAY
) AS rtot_mpg
FROM mytable q
ORDER BY q.date
Something like this should work:
SELECT Date, Miles, Gallons, Miles/Gallons as MilesPerGallon,
@Miles:=@Miles+Miles overallMiles,
@Gallons:=@Gallons+Gallons overallGallons,
@RunningTotal:=@Miles/@Gallons runningTotal
FROM YourTable
JOIN (SELECT @Miles:= 0) t
JOIN (SELECT @Gallons:= 0) s
SQL Fiddle Demo
Which produces the following:
DATE MILES GALLONS MILESPERGALLON RUNNINGTOTAL
January, 25 1993 20 3 6.666667 6.666666666667
February, 07 1993 55.2 7.22 7.645429 7.358121330724
March, 11 1993 44.1 6.28 7.022293 7.230303030303
--EDIT--
In response to the comment, you can add another Row Number to limit your results to the last N rows:
SELECT *
FROM (
SELECT Date, Miles, Gallons, Miles/Gallons as MilesPerGallon,
@Miles:=@Miles+Miles overallmiles,
@Gallons:=@Gallons+Gallons overallGallons,
@RunningTotal:=@Miles/@Gallons runningTotal,
@RowNumber:=@RowNumber+1 rowNumber
FROM (SELECT * FROM YourTable ORDER BY Date DESC) u
JOIN (SELECT @Miles:= 0) t
JOIN (SELECT @Gallons:= 0) s
JOIN (SELECT @RowNumber:= 0) r
) t
WHERE rowNumber <= 3
Just change your ORDER BY clause accordingly. And here is the updated fiddle.