I have some SQL that looks like this:
SELECT
stageName,
count(*) as `count`
FROM x2production.contact_stages
WHERE FROM_UNIXTIME(createDate) between '2016-05-01' AND DATE_ADD('2016-08-31', INTERVAL 1 DAY)
AND (stageName = 'DI-Whatever' OR stageName = 'DI-Quote' or stageName = 'DI-Meeting')
Group by stageName
Order by field(stageName, 'DI-Quote', 'DI-Meeting', 'DI-Whatever')
This produces a table that looks like:
+-------------+-------+
| stageName | count |
+-------------+-------+
| DI-quote | 1230 |
| DI-Meeting | 985 |
| DI-Whatever | 325 |
+-------------+-------+
Question:
I would like a percentage from one row to the next. For example, the percentage of DI-Meeting to DI-quote: the math would be 100*985/1230 = 80.0%.
So in the end the table would look like so:
+-------------+-------+------+
| stageName | count | perc |
+-------------+-------+------+
| DI-quote | 1230 | 0 |
| DI-Meeting | 985 | 80.0 |
| DI-Whatever | 325 | 32.9 |
+-------------+-------+------+
Is there any way to do this in mysql?
Here is an SQL fiddle to mess w/ the data: http://sqlfiddle.com/#!9/61398/1
The query
select stageName,count,if(rownum=1,0,round(count/toDivideBy*100,3)) as percent
from
( select stageName,count,greatest(@rn:=@rn+1,0) as rownum,
coalesce(if(@rn=1,count,@prev),null) as toDivideBy,
@prev:=count as dummy2
from
( SELECT
stageName,
count(*) as `count`
FROM Table1
WHERE FROM_UNIXTIME(createDate) between '2016-05-01' AND DATE_ADD('2016-08-31', INTERVAL 1 DAY)
AND (stageName = 'DI-Underwriting' OR stageName = 'DI-Quote' or stageName = 'DI-Meeting')
Group by stageName
Order by field(stageName, 'DI-Quote', 'DI-Meeting', 'DI-Underwriting')
) xDerived1
cross join (select @rn:=0,@prev:=-1) as xParams1
) xDerived2;
Results
+-----------------+-------+---------+
| stageName | count | percent |
+-----------------+-------+---------+
| DI-Quote | 16 | 0 |
| DI-Meeting | 13 | 81.250 |
| DI-Underwriting | 4 | 30.769 |
+-----------------+-------+---------+
Note, you want a 0 as the percent for the first row. That is easily changed to 100.
The cross join brings in the variables for use and initializes them. The greatest and coalesce are used for safety in variable use, as spelled out well in this article and in clues from the MySQL Manual page on Operator Precedence. The derived table names are just that: every derived table needs a name.
If you do not adhere to the principles in those referenced articles, then the use of variables is unsafe. I am not saying I nailed it, but that safety is always my focus.
The assignment of variables needs to follow a safe form, such as the @rn variable being set inside a function like greatest or least. We know that @rn is always greater than 0, so we use the greatest function to force our will on the query. The same trick applies to coalesce: the null branch will never happen, and := has lower precedence than the operators in the column that follows it, that is, the last one, @prev :=, which follows the coalesce.
That way, a variable is set before other columns in that select row attempt to use its value.
So, just getting the expected results does not mean you did it safely or that it will work with your real data.
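Distilled to its skeleton, the pattern looks like this (a minimal sketch; the table t and its column val are hypothetical, not from the question, and row order would still need to be fixed by an ordered derived table as in the full query above):
select greatest(@rn := @rn + 1, 0) as rownum,                   -- @rn is assigned first, inside greatest
       coalesce(if(@rn = 1, t.val, @prev), null) as toDivideBy, -- safe to read @rn and @prev here
       @prev := t.val as dummy                                  -- assigned last; read again on the next row
from t
cross join (select @rn := 0, @prev := null) as vars;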
What you need is a LAG function; since older versions of MySQL don't support it, you have to mimic it this way:
select stageName,
cnt,
IF(valBefore is null,0,((100*cnt)/valBefore)) as perc
from (SELECT tb.stageName,
tb.cnt,
@ct AS valBefore,
(@ct := cnt)
FROM (SELECT stageName,
count(*) as cnt
FROM Table1,
(SELECT @_stage := NULL,
@ct := NULL) vars
WHERE FROM_UNIXTIME(createDate) between '2016-05-01'
AND DATE_ADD('2016-08-31', INTERVAL 1 DAY)
AND stageName in ('DI-Underwriting', 'DI-Quote', 'DI-Meeting')
Group by stageName
Order by field(stageName, 'DI-Quote', 'DI-Meeting', 'DI-Underwriting')
) tb
WHERE (CASE WHEN @_stage IS NULL OR @_stage <> tb.stageName
THEN @ct := NULL
ELSE NULL END IS NULL)
) as final
See it working here: http://sqlfiddle.com/#!9/61398/35
EDIT: I've actually edited it to remove an unnecessary step (a subquery).
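For reference, on MySQL 8.0+ LAG is available directly, so the same result can be written without variables. A rough, untested sketch against the same Table1 data:
SELECT stageName,
       cnt,
       IF(prev_cnt IS NULL, 0, ROUND(100 * cnt / prev_cnt, 3)) AS percent
FROM (
    SELECT stageName,
           cnt,
           LAG(cnt) OVER (ORDER BY FIELD(stageName, 'DI-Quote', 'DI-Meeting', 'DI-Underwriting')) AS prev_cnt
    FROM (
        SELECT stageName, COUNT(*) AS cnt
        FROM Table1
        WHERE FROM_UNIXTIME(createDate) BETWEEN '2016-05-01' AND DATE_ADD('2016-08-31', INTERVAL 1 DAY)
          AND stageName IN ('DI-Underwriting', 'DI-Quote', 'DI-Meeting')
        GROUP BY stageName
    ) grouped
) lagged
ORDER BY FIELD(stageName, 'DI-Quote', 'DI-Meeting', 'DI-Underwriting');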
Related
I'm importing data where groups of rows need to be given an id, but there is nothing unique and common to them in the incoming data. What there is is a known indicator of the first row of a group, and the data is in order, so we can step through row by row, setting an id and then incrementing that id whenever the indicator is found. I've done this, however it's incredibly slow, so is there a better way to do this in MySQL, or am I better off pre-processing the text data line by line to add the id?
Example of the data coming in; I need to increment an id whenever we see "NEW":
id,linetype,number,text
1,NEW,1234,sometext
2,CONTINUE,2412,anytext
3,CONTINUE,1,hello
4,NEW,2333,bla bla
5,CONTINUE,333,hello
6,NEW,1234,anything
So I'll end up with:
id,linetype,number,text,group_id
1,NEW,1234,sometext,1
2,CONTINUE,2412,anytext,1
3,CONTINUE,1,hello,1
4,NEW,2333,bla bla,2
5,CONTINUE,333,hello,2
6,NEW,1234,anything,3
I've tried a stored procedure where I go row by row, updating as I go, but it's super slow.
select count(*) from mytable into n;
set l_id = 0;
set i = 1;
while i <= n do
    select linetype into l_linetype from mytable where id = i;
    if l_linetype = "NEW" then
        set l_id = l_id + 1;
    end if;
    update mytable set group_id = l_id where id = i;
    set i = i + 1;  -- advance to the next row
end while;
No errors; it's just slow. Going line by line reading and writing the text file, I could do this in a second, while in MySQL it's taking 100 seconds. It'd be nice if there was a way within MySQL to do this reasonably fast, so that separate pre-processing was not needed.
In the absence of MySQL 8+ (no availability of window functions), you can use a correlated subquery instead:
EDIT: As pointed out by @Paul in the comments:
SELECT t1.*,
(SELECT COUNT(*)
FROM your_table t2
WHERE t2.id <= t1.id
AND t2.linetype = 'NEW'
) group_id
FROM your_table t1
The above query can be made more performant if we define the following composite index: (linetype, id). The order of the columns is important, because we have a range condition on id.
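A sketch of that index definition (the index name is arbitrary; your_table is the placeholder table name used above):
ALTER TABLE your_table ADD INDEX idx_linetype_id (linetype, id);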
Previously:
SELECT t1.*,
(SELECT SUM(t2.linetype = 'NEW')
FROM your_table t2
WHERE t2.id <= t1.id
) group_id
FROM your_table t1
The above query requires an index on id.
Another approach using User-defined Variables (Session variables) would be:
SELECT
t1.*,
@g := IF(t1.linetype = 'NEW', @g + 1, @g) AS group_id
FROM your_table t1
CROSS JOIN (SELECT @g := 0) vars
ORDER BY t1.id
It is like a looping technique, where we use session variables whose previous value is accessible during the next row's calculation in the SELECT. So we initialize the variable @g to 0, and then compute it row by row. Whenever we encounter a row with the NEW linetype, we increment it; otherwise we use the previous row's value. You can also check https://stackoverflow.com/a/53465139/2469308 for more discussion and caveats to take care of when using this approach.
For MySQL 8.0+ you can use the SUM() window function:
select *,
sum(linetype = 'NEW') over (order by id) group_id
from tablename
See the demo.
For previous versions you can simulate this functionality with the use of a variable:
set @group_id := 0;
select *,
@group_id := @group_id + (linetype = 'NEW') group_id
from tablename
order by id
See the demo.
Results:
| id | linetype | number | text | group_id |
| --- | -------- | ------ | -------- | -------- |
| 1 | NEW | 1234 | sometext | 1 |
| 2 | CONTINUE | 2412 | anytext | 1 |
| 3 | CONTINUE | 1 | hello | 1 |
| 4 | NEW | 2333 | bla bla | 2 |
| 5 | CONTINUE | 333 | hello | 2 |
| 6 | NEW | 1234 | anything | 3 |
My data looks like this:
CreateTime | mobile
-----------+--------
2017/01/01 | 111
2017/01/01 | 222
2017/01/05 | 111
2017/01/08 | 333
2017/03/09 | 111
What I am trying to do is to add a variable indicating whether it is the first time that this mobile number occurred:
CreateTime | mobile | FirstTime
-----------+--------+----------
2017/01/01 | 111 | 1
2017/01/01 | 222 | 1
2017/01/05 | 111 | 0
2017/01/08 | 333 | 1
2017/03/09 | 111 | 0
2017/03/15 | 222 | 0
2017/03/18 | 444 | 1
Basically, we need to add a true/false column indicating whether it is the first time (based on createtime and some other fields, which may or may not be sorted) that this specific mobile number occurred.
Ideally, this adjusted table will then be able to give me the following results when queried:
Select Month(createtime) as month,
count(mobile) as received,
sum(Firsttime) as Firsttimers
from ABC
Group by month(createtime)
Result:
Month | Received | FirstTimers
--------+----------+------------
2017/01 | 4 | 3
2017/03 | 3 | 1
If I can get to the RESULTS without needing to create the additional step, then that will be even better.
I do, however, need the query to run fast, hence my thinking of perhaps creating the middle table, but I stand to be corrected.
This is my current code and it works but it is not as fast as I'd like nor is it elegant.
SELECT Month(InF1.createtime) as 'Month',
Count(InF1.GUID) AS Received,
Sum(coalesce(Unique_lead,1)) As FirstTimers
FROM MYDATA_TABLE as InF1
Left Join
( SELECT createtime, mobile, GUID, 0 as Unique_lead
FROM MYDATA_TABLE as InF2
WHERE createtime = (SELECT min(createtime)
FROM MYDATA_TABLE as InF3
WHERE InF2.mobile=InF3.mobile
)
) as InF_unique
On InF1.GUID = InF_unique.GUID
group by month(createtime)
(Apologies if the question is incorrectly posted; it is my first post.)
You could use a subquery to get the first date per mobile, outer join it on the actual mobile and date, and count matches. Make sure to count distinct mobile numbers so you don't double count the same number when it occurs with the same date twice:
select substr(createtime, 1, 7) month,
count(*) received,
count(distinct grp.mobile) firsttimers
from abc
left join (
select mobile,
min(createtime) firsttime
from abc
group by mobile
) grp
on abc.mobile = grp.mobile
and abc.createtime = grp.firsttime
group by month
Here is an alternative using variables, which can give you a row number:
select substr(createtime, 1, 7) month,
count(*) received,
sum(rn = 1) firsttimers
from (
select createtime,
@rn := if(@mob = mobile, @rn + 1, 1) rn,
@mob := mobile mobile
from (select * from abc order by mobile, createtime) ordered,
(select @rn := 1, @mob := null) init
order by mobile, createtime
) numbered
group by month;
NB: If you have MySQL 8+, then use window functions.
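For example, a rough sketch of the MySQL 8+ form with ROW_NUMBER(), assuming the same abc table as above (untested):
select substr(createtime, 1, 7) month,
       count(*) received,
       sum(rn = 1) firsttimers
from (
    select createtime,
           row_number() over (partition by mobile order by createtime) rn
    from abc
) numbered
group by month;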
I have a problem with a MySQL query. I have two tables, table currency and table currency_detail. Table currency contains currency codes, such as USD, EUR, IDR, etc. Table currency_detail contains date and rate. After joining the tables to get all USD rates for 1 year, I have data which looks like this:
Date | Rate
-------------------
2015-10-20 | 14463
2015-10-19 | 14452
2015-10-18 | 14442
2015-10-15 | 14371
2015-10-14 | 14322
2015-10-10 | 14306
2015-10-08 | 14322
I need to compute a value for every current row using the next row (current row + 1). Is it possible to get results that look like this?
Date | Rate | PX
------------------------------
2015-10-20 | 14463 | 0.000761 -> LN(14463/14452)
2015-10-19 | 14452 | 0.000692 -> LN(14452/14442)
2015-10-18 | 14442 | 0.004928 -> LN(14442/14371)
2015-10-15 | 14371 | 0.003415 -> LN(14371/14322)
2015-10-14 | 14322 | 0.001118 -> LN(14322/14306)
2015-10-10 | 14306 | -0.00112 -> LN(14306/14322)
2015-10-08 | 14322 | 0 -> 0 (because no data after this row)
I have tried many ways, but still can't find a solution. Can anyone help with the query? Thanks in advance.
In standard SQL you would simply use LAG to read the value from the previous record. In MySQL (before 8.0) you need a workaround. The easiest way might be to select all rows twice and number them on-the-fly; then you can join by row number:
select
this.rdate, this.rate,
ln(this.rate / prev.rate) as px
from
(
select @rownum1 := @rownum1 + 1 as rownum, rates.*
from rates
cross join (select @rownum1 := 0) init
order by rdate
) this
left join
(
select @rownum2 := @rownum2 + 1 as rownum, rates.*
from rates
cross join (select @rownum2 := 0) init
order by rdate
) prev on prev.rownum = this.rownum - 1
order by this.rdate desc;
I had to use different rownum variable names in the two subqueries, by the way, as MySQL got confused otherwise. I consider this a flaw, but I must admit MySQL's variables-in-SQL thing is still kind of alien to me :-)
SQL fiddle: http://www.sqlfiddle.com/#!9/341c4/7
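For reference, on MySQL 8+ the workaround is unnecessary; a rough, untested sketch using LAG over the same rates(rdate, rate) table:
select rdate,
       rate,
       ln(rate / lag(rate) over (order by rdate)) as px  -- NULL for the earliest date; wrap in COALESCE(..., 0) if 0 is preferred
from rates
order by rdate desc;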
Background before we begin...
Table schema:
UserId | ActivityDate | Time_diff
where "ActivityDate" is the timestamp of an activity by the user, and
"Time_diff" is the timestampdiff, in seconds, between the next activity and the current activity;
for the last recorded activity of a user, since there is no next activity, I set the Time_diff to -999.
Ex:
UserId | ActivityDate | Time_diff
| 1 | 2012-11-10 11:19:04 | 12 |
| 1 | 2012-11-10 11:19:16 | 11 |
| 1 | 2012-11-10 11:19:27 | 3 |
| 1 | 2012-11-10 11:19:30 | 236774 |
| 1 | 2012-11-13 05:05:44 | 39 |
| 1 | 2012-11-13 05:06:23 | 77342 |
| 1 | 2012-11-14 02:35:25 | 585888 |
| 1 | 2012-11-20 21:20:13 | 1506130 |
...
| 1 | 2013-06-13 06:32:48 | 1616134 |
| 1 | 2013-07-01 23:28:22 | 5778459 |
| 1 | 2013-09-06 20:36:01 | -999 |
| 2 | 2008-08-01 04:59:33 | 622 |
| 2 | 2008-08-01 05:09:55 | 38225 |
| 2 | 2008-08-01 15:47:00 | 31108 |
| 2 | 2008-08-02 00:25:28 | 28599 |
| 2 | 2008-08-02 08:22:07 | 163789 |
| 2 | 2008-08-04 05:51:56 | 1522915 |
| 2 | 2008-08-21 20:53:51 | 694678 |
| 2 | 2008-08-29 21:51:49 | 2945291 |
| 2 | 2008-10-03 00:00:00 | 172800 |
| 2 | 2008-10-05 00:00:00 | 776768 |
| 2 | 2008-10-13 23:46:08 | 3742999 |
I have just added the field "session_id"
alter table so_time_diff add column session_id int(11) not null;
My actual question...
I would like to update this field for each of the above records based on the following logic:
for first record: set session_id = 1
from second record:
if previous_record.UserId == this_record.UserId AND previous_record.time_diff <=3600
set this_record.session_id = previous_record.session_id
else if previous_record.UserId == this_record.UserId AND previous_record.time_diff >3600
set this_record.session_id = previous_record.session_id + 1
else if previous_record.UserId <> this_record.UserId
set session_id = 1 ## for a different user, restart
In simple words,
if two records of the same user are within a time interval of 3600 seconds, assign the same sessionid; if not, increment the sessionid; if it's a different user, restart the sessionid count.
I've never written logic in an update query before. Is this possible? Any guidance is greatly appreciated!
Yes, this is possible. It would be easier if the time_diff was on the later record, rather than the previous record, but we can make it work. (We don't really need the stored time_diff.)
The "trick" to getting this to work is really writing a SELECT statement. If you've got a SELECT statement that returns the key of the row to be updated, and the values to be assigned, making that into an UPDATE is trivial.
The "trick" to getting a SELECT statement is to make use of MySQL user variables, and is dependent on non-guaranteed behavior of MySQL.
This is the skeleton of the statement:
SELECT @prev_userid AS prev_userid
, @prev_activitydate AS prev_activitydate
, @sessionid AS sessionid
, @prev_userid := t.userid AS userid
, @prev_activitydate := t.activitydate AS activitydate
FROM (SELECT @prev_userid := NULL, @prev_activitydate := NULL, @sessionid := 1) i
JOIN so_time_diff t
ORDER BY t.userid, t.activitydate
(We hope there's an index ON mytable (userid, activitydate), so the query can be satisfied from the index, without a need for an expensive "Using filesort" operation.)
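If that index does not exist yet, a sketch of adding it (the table name is the one from the question; the index name is arbitrary):
ALTER TABLE so_time_diff ADD INDEX idx_userid_activitydate (userid, activitydate);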
Let's unpack the skeleton query a bit. Firstly, the three MySQL user variables get initialized by the inline view aliased as i. We don't really care about what that returns, we only really care that it initializes the user variables. Because we're using it in a JOIN operation, we also care that it returns exactly one row.
When the first row is processed, we have the values that were previously assigned to the user variable, and we assign the values from the current row to them. When the next row is processed, the values from the previous row are in the user variables, and we assign the current row values to them, and so on.
The "ORDER BY" on the query is important; it's vital that we process the rows in the correct order.
But that's just a start.
The next step is comparing the userid and activitydate values of the current and previous rows, and deciding whether we're in the same sessionid, or whether it's a different session and we need to increment the sessionid by 1.
SELECT @sessionid := @sessionid +
IF( t.userid = @prev_userid AND
TIMESTAMPDIFF(SECOND,@prev_activitydate,t.activitydate) <= 3600
,0,1) AS sessionid
, @prev_userid := t.userid AS userid
, @prev_activitydate := t.activitydate AS activitydate
FROM (SELECT @prev_userid := NULL, @prev_activitydate := NULL, @sessionid := 1) i
JOIN so_time_diff t
ORDER BY t.userid, t.activitydate
You could make use of the value stored in the existing time_diff column, but you need the value from the previous row when checking the current row, so that would just be another MySQL user variable, a check of @prev_time_diff, rather than calculating the timestamp difference (as in my example above). (We can add other expressions to the select list, to make debugging/verification easier...
, @prev_userid=t.userid
, TIMESTAMPDIFF(SECOND,#prev_activitydate,t.activitydate)
N.B. The ORDER of the expressions in the SELECT list is important; the expressions are evaluated in the order they appear... this wouldn't work if we were to assign the userid value from the current row to the user variable BEFORE we checked it... that's why those assignments come last in the SELECT list.
Once we have a query that looks good, that's returning a "sessionid" value that we want to assign to the row with a matching userid and activitydate, we can use that in a multitable update statement.
UPDATE (
-- query that generates sessionid for userid, activityid goes here
) s
JOIN so_time_diff t
ON t.userid = s.userid
AND t.activitydate = s.activitydate
SET t.sessionid = s.sessionid
(If there's a lot of rows, this could crank a very long time. With versions of MySQL prior to 5.6, I believe the derived table (aliased as s) won't have any indexes created on it. Hopefully, MySQL will use the derived table s as the driving table for the JOIN operation, and do index lookups to the target table.)
FOLLOWUP
I entirely missed the requirement to restart sessionid at 1 for each user. To do that, I'd modify the expression that's assigned to @sessionid, and just split the condition tests of userid and activitydate. If the userid is different than the previous row's, then return 1. Otherwise, based on the comparison of activitydate, return either the current value of @sessionid, or the current value incremented by 1.
Like this:
SELECT @sessionid :=
IF( t.userid = @prev_userid
, IF( TIMESTAMPDIFF(SECOND,@prev_activitydate,t.activitydate) <= 3600
, @sessionid
, @sessionid + 1 )
, 1 )
AS sessionid
, @prev_userid := t.userid AS userid
, @prev_activitydate := t.activitydate AS activitydate
FROM (SELECT @prev_userid := NULL, @prev_activitydate := NULL, @sessionid := 1) i
JOIN so_time_diff t
ORDER BY t.userid, t.activitydate
N.B. None of these statements is tested, these statements have only been desk checked; I've successfully used this pattern innumerable times.
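For reference, on MySQL 8+ the same session numbering can be computed with window functions instead of user variables. A rough, untested sketch against the so_time_diff table; the result could then be joined back for the UPDATE just as above:
SELECT UserId,
       ActivityDate,
       Time_diff,
       SUM(new_session) OVER (PARTITION BY UserId ORDER BY ActivityDate) AS session_id
FROM (
    SELECT t.*,
           CASE
             WHEN TIMESTAMPDIFF(SECOND,
                                LAG(ActivityDate) OVER (PARTITION BY UserId ORDER BY ActivityDate),
                                ActivityDate) <= 3600
               THEN 0
             ELSE 1   -- the first row of each user (LAG is NULL) also lands here, so numbering starts at 1
           END AS new_session
    FROM so_time_diff t
) x
ORDER BY UserId, ActivityDate;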
Here is what I wrote, and this worked!!!
SELECT @sessionid := @sessionid +
CASE WHEN @prev_userid IS NULL THEN 0
WHEN t.UserId <> @prev_userid THEN 1-@sessionid
WHEN t.UserId = @prev_userid AND
TIMESTAMPDIFF(SECOND,@prev_activitydate,t.ActivityDate) <= 3600
THEN 0 ELSE 1
END
AS sessionid
, @prev_userid := t.UserId AS UserId
, @prev_activitydate := t.ActivityDate AS ActivityDate,
time_diff
FROM (SELECT @prev_userid := NULL, @prev_activitydate := NULL, @sessionid := 1) i
JOIN example t
ORDER BY t.UserId, t.ActivityDate;
Thanks again to @spencer7593 for your very descriptive answer giving me the right direction!
I'd like to count how many occurrences of a value happen before a specific value
Below is my starting table
+-----------------+--------------+------------+
| Id | Activity | Time |
+-----------------+--------------+------------+
| 1 | Click | 1392263852 |
| 2 | Error | 1392263853 |
| 3 | Finish | 1392263862 |
| 4 | Click | 1392263883 |
| 5 | Click | 1392263888 |
| 6 | Finish | 1392263952 |
+-----------------+--------------+------------+
I'd like to count how many clicks happen before a finish happens.
I've got a very roundabout way of doing it where I write a function to find the last finished activity and query the clicks between the finishes.
Also repeat this for Error.
What I'd like to achieve is the below table
+-----------------+--------------+------------+--------------+------------+
| Id | Activity | Time | Clicks | Error |
+-----------------+--------------+------------+--------------+------------+
| 3 | Finish | 1392263862 | 1 | 1 |
| 6 | Finish | 1392263952 | 2 | 0 |
+-----------------+--------------+------------+--------------+------------+
This table is very long so I'm looking for an efficient solution.
If anyone has any ideas, thanks heaps!
This is a complicated problem. Here is an approach to solving it. The groups between the "Finish" records need to be identified as being the same, by assigning a group identifier to them. This identifier can be calculated by counting the number of "Finish" records with an id greater than or equal to the current row's id.
Once this is assigned, your results can be calculated using an aggregation.
The group identifier can be calculated using a correlated subquery:
select max(id) as id, 'Finish' as Activity, max(time) as Time,
sum(Activity = 'Click') as Clicks, sum(activity = 'Error') as Error
from (select s.*,
(select sum(s2.activity = 'Finish')
from starting s2
where s2.id >= s.id
) as FinishCount
from starting s
) s
group by FinishCount;
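On MySQL 8+, a rough sketch of the same idea using a window function in place of the correlated subquery (untested, same starting table):
select max(id) as id, 'Finish' as Activity, max(time) as Time,
       sum(Activity = 'Click') as Clicks, sum(Activity = 'Error') as Error
from (select s.*,
             sum(Activity = 'Finish') over (order by id desc) as FinishCount
      from starting s
     ) s
group by FinishCount;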
A version that leverages user(session) variables
SELECT MAX(id) id,
MAX(activity) activity,
MAX(time) time,
SUM(activity = 'Click') clicks,
SUM(activity = 'Error') error
FROM
(
SELECT t.*, @g := IF(activity <> 'Finish' AND @a = 'Finish', @g + 1, @g) g, @a := activity
FROM table1 t CROSS JOIN (SELECT @g := 0, @a := NULL) i
ORDER BY time
) q
GROUP BY g
Output:
| ID | ACTIVITY | TIME | CLICKS | ERROR |
|----|----------|------------|--------|-------|
| 3 | Finish | 1392263862 | 1 | 1 |
| 6 | Finish | 1392263952 | 2 | 0 |
Here is SQLFiddle demo
Try:
select x.id
, x.activity
, x.time
, sum(case when y.activity = 'Click' then 1 else 0 end) as clicks
, sum(case when y.activity = 'Error' then 1 else 0 end) as errors
from tbl x, tbl y
where x.activity = 'Finish'
and y.time < x.time
and (y.time > (select max(z.time) from tbl z where z.activity = 'Finish' and z.time < x.time)
or x.time = (select min(z.time) from tbl z where z.activity = 'Finish'))
group by x.id
, x.activity
, x.time
order by x.id
Here's another method of using variables, which is somewhat different from @peterm's:
SELECT
Id,
Activity,
Time,
Clicks,
Errors
FROM (
SELECT
t.*,
@clicks := @clicks + (activity = 'Click') AS Clicks,
@errors := @errors + (activity = 'Error') AS Errors,
@clicks := @clicks * (activity <> 'Finish'),
@errors := @errors * (activity <> 'Finish')
FROM
`starting` t
CROSS JOIN
(SELECT @clicks := 0, @errors := 0) i
ORDER BY
time
) AS s
WHERE Activity = 'Finish'
;
What's similar to Peter's query is that this one uses a subquery that's returning all the rows, setting some variables along the way and returning the variables' values as columns. That may be common to most methods that use variables, though, and that's where the similarity between these two queries ends.
The difference is in how the accumulated results are calculated. Here all the accumulation is done in the subquery, and the main query merely filters the derived dataset on Activity = 'Finish' to return the final result set. In contrast, the other query uses grouping and aggregation at the outer level to get the accumulated results, which may make it slower than mine in comparison.
At the same time Peter's suggestion is more easily scalable in terms of coding. If you happen to have to extend the number of activities to account for, his query would only need expansion in the form of adding one SUM(activity = '...') AS ... per new activity to the outer SELECT, whereas in my query you would need to add a variable and several expressions, as well as a column in the outer SELECT, per every new activity, which would bloat the resulting code much more quickly.
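For illustration, here is a rough sketch of that kind of extension applied to his grouped query, with a hypothetical 'Scroll' activity added (the extra activity and its column are made up for the example; untested):
SELECT MAX(id) id,
       MAX(activity) activity,
       MAX(time) time,
       SUM(activity = 'Click') clicks,
       SUM(activity = 'Error') error,
       SUM(activity = 'Scroll') scrolls   -- the only line added for the new activity
FROM
(
  SELECT t.*, @g := IF(activity <> 'Finish' AND @a = 'Finish', @g + 1, @g) g, @a := activity
  FROM table1 t CROSS JOIN (SELECT @g := 0, @a := NULL) i
  ORDER BY time
) q
GROUP BY g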