MySQL cumulative sum grouped by date - mysql

I know there have been a few posts related to this, but my case is a little bit different and I wanted to get some help on this.
I need to pull some data out of the database that is a cumulative count of interactions by day. currently this is what i have
SELECT
e.Date AS e_date,
count(e.ID) AS num_interactions
FROM example AS e
JOIN example e1 ON e1.Date <= e.Date
GROUP BY e.Date;
The output of this is close to what I want but not exactly what I need.
The problem I'm having is the dates are stored with the hour minute and second that the interaction happened, so the group by is not grouping days together.
This is what the output looks like.
On 12-23 theres 5 interactions but its not grouped because the time stamp is different. So I need to find a way to ignore the timestamp and just look at the day.
If I try GROUP BY DAY(e.Date) it groups the data by the day only (i.e everything that happened on the 1st of any month is grouped into one row) and the output is not what I want at all.
GROUP BY DAY(e.Date), MONTH(e.Date) is splitting it up by month and the day of the month, but again the count is off.
I'm not a MySQL expert at all so I'm puzzled on what i'm missing

New Answer
At first, I didn't understand you were trying to do a running total. Here is how that would look:
SET #runningTotal = 0;
SELECT
e_date,
num_interactions,
#runningTotal := #runningTotal + totals.num_interactions AS runningTotal
FROM
(SELECT
DATE(eDate) AS e_date,
COUNT(*) AS num_interactions
FROM example AS e
GROUP BY DATE(e.Date)) totals
ORDER BY e_date;
Original Answer
You could be getting duplicates because of your join. Maybe e1 has more than one match for some rows which is inflating your count. Either that or the comparison in your join is also comparing the seconds, which is not what you expect.
Anyhow, instead of chopping the datetime field into days and months, just strip the time from it. Here is how you do that.
SELECT
DATE(e.Date) AS e_date,
count(e.ID) AS num_interactions
FROM example AS e
JOIN example e1 ON DATE(e1.Date) <= DATE(e.Date)
GROUP BY DATE(e.Date);

I figured out what I needed to do last night... but since I'm new to this I couldn't post it then... what I did that worked was this:
SELECT
DATE(e.Date) AS e_date,
count(e.ID) AS num_daily_interactions,
(
SELECT
COUNT(id)
FROM example
WHERE DATE(Date) <= e_date
) as total_interactions_per_day
FROM example AS e
GROUP BY e_date;
Would that be less efficient than your query? I may just do the calculation in python after pulling out the count per day if its more efficient, because this will be on the scale of thousands to hundred of thousands of rows returned.

Related

What is difference for below there two sql queries

select
substr(insert_date, 1, 14),
device, count(1)
from
abc.xyztable
where
insert_date >= DATE_SUB(NOW(), INTERVAL 10 DAY)
group by
device, substr(insert_date, 1, 14) ;
and then I am trying to get average of the same rows count which I got above.
SELECT
date, device, AVG(count)
FROM
(SELECT
substr(insert_date, 1, 14) AS date,
device,
COUNT(1) AS count
FROM
abc.xyztable
WHERE
insert_date >= DATE_SUB(NOW(), INTERVAL 10 DAY)
GROUP BY
device, substr(insert_date, 1, 14)) a
GROUP BY
device, date;
AS I found both queries return the same results, I tried for last 10 days data.
My purpose is to get the average rows count for last 10 days which I get from the above 1st query.
I'm not entirely sure what you're asking, the "difference" between the two queries is that the first one is valid but the second does not appear to be, as per HoneyBadger's comment. They also seem to be trying to achieve two different goals.
However, I think what you are trying to do is produce a query based on the data from the first query, which returns the date, device, and an average of the count column. If so, I believe the following query would calculate this:
WITH
dataset AS (
select substr(insert_date,1,14) AS theDate, device, count(*) AS
theCount
from abc.xyztable
where insert_date >=DATE_SUB(NOW(), INTERVAL 10 DAY)
group by device,substr(insert_date,1,14)
)
SELECT theDate, device, (SELECT ROUND(AVG(CAST(theCount
AS FLOAT)), 2) FROM
dataset) AS Average
FROM dataset
GROUP BY theDate, device
I have referenced the accepted answers of this question to calculate the average: How to calculate average of a column and then include it in a select query in oracle?
And this question to tidy up the query: Formatting Clear and readable SQL queries
Without having a sample of your data, or any proper context, I can't see how this would be especially useful, so if it was not what you were looking for, please edit your question and clarify exactly what you need.
EDIT: Based on what extra information you have provided, I've made a tweak to my solution to increase the precision of the average column. It now calculates the average to two decimal places. You have stated that this returns the same result as your original query, but the two queries are not formulating the same thing. If the count column is consistently the same number with little variation, the AVG function will round this, which in turn could produce results which look the same, especially if you only compare a small sample, so I have amended my answer to demonstrate this. Again, we'd all be able to help you much easier if you would provide more information, such as a sample of your data.
If you want an average you need to change the last GROUP BY
to get an average per device
GROUP BY device;
to get an average per date
GROUP BY date;
or remove it completely to get an average for all rows in the sub-query
Update
Below is a full example for getting the average per device
SELECT device, avg(count)
FROM (SELECT substr(insert_date,1,14) as date, device, count(1) as count
FROM abc.xyztable
WHERE insert_date >=DATE_SUB(NOW(), INTERVAL 10 DAY)
GROUP BY device,substr(insert_date,1,14)) a
GROUP BY device;

MySQL query to select all hostels with at least X spaces between start and end dates?

I have 2 tables, one with hostels (effectively a single-room hotel with lots of beds), and the other with bookings.
Hostel table: unique ID, total_spaces
Bookings table: start_date, end_date, num_guests, hostel_ID
I need a (My)SQL query to generate a list of all hostels that have at least num_guests free spaces between start_date and end_date.
Logical breakdown of what I'm trying to achieve:
For each hostel:
Get all bookings that overlap start_date and end_date
For each day between start_date and end_date, sum the total bookings for that day (taking into account num_guests for each booking) and compare with total_spaces, ensuring that there are at least num_guests spaces free on that day (if there aren't on any day then that hostel can be discounted from the results list)
Any suggestions on a query that would do this please? (I can modify the tables if necessary)
I built an example for you here, with more comments, which you can test out:
http://sqlfiddle.com/#!9/10219/9
What's probably tricky for you is to join ranges of overlapping dates. The way I would approach this problem is with a DATES table. It's kind of like a tally table, but for dates. If you join to the DATES table, you basically break down all the booking ranges into bookings for individual dates, and then you can filter and sum them all back up to the particular date range you care about. Helpful code for populating a DATES table can be found here: Get a list of dates between two dates and that's what I used in my example.
Other than that, the query basically follows the logical steps you've already outlined.
Ok, if you are using mysql 8.0.2 and above, then you can use window functions. In such case you can use the solution bellow. This solution does not need to compute the number of quests for each day in the query interval, but only focuses on days when there is some change in the number of hostel guests. Therefore, there is no helping table with dates.
with query as
(
select * from bookings where end_date > '2017-01-02' and start_date < '2017-01-05'
)
select hostel.*, bookingsSum.intervalMax
from hostel
join
(
select tmax.id, max(tmax.intervalCount) intervalMax
from
(
select hostel.id, t.dat, sum(coalesce(sum(t.gn),0)) over (partition by t.id order by t.dat) intervalCount
from hostel
left join
(
select id, start_date dat, guest_num as gn from query
union all
select id, end_date dat, -1 * guest_num as gn from query
) t on hostel.id = t.id
group by hostel.id, t.dat
) tmax
group by tmax.id
) bookingsSum on hostel.id = bookingsSum.id and hostel.total_spaces >= bookingsSum.intervalMax + <num_of_people_you_want_accomodate>
demo
It uses a simple trick, where each start_date represents +guest_num to the overall number of quests and each 'end_date' represents -guest_num to the overall number of quests. We than do the necessary sumarizations in order to find peak number of quests (intervalMax) in the query interval.
You change '2017-01-05' in my query to '2017-01-06' (then only two hostels are in the result) and if you use '2017-01-07' then just hostel id 3 is in the result, since it does not have any bookings yet.

Dealing with the count function

My Boss want to know how many times each of these shipping number of days occurred. Ordered by number of days DESC.
So far have :
SELECT DateDiff(shippedDate,orderDate) As '#Days', COUNT(*)
FROM datenumtest
I think I need condition, can someone help me out with this?
Calculate the DateDiff across all records first in a data set r, then you can do the grouping on that data set which becomes data set r1 then sort the r1 data set:
select r.NumDays, count(1) as the_count
from (
SELECT DateDiff(shippedDate,orderDate) as 'NumDays'
FROM datenumtest
) r
group by r.NumDays
order by r.NumDays desc;

MySQL Group By Order and Count(Distinct)

What is the best way to think about the Group By function in MySQL?
I am writing a MySQL query to pull data through an ODBC connection in a pivot table in Excel so that users can easily access the data.
For example, I have:
Select
statistic_date,
week(statistic_date,4),
year(statistic_date),
Emp_ID,
count(distict Emp_ID),
Site
Cost_Center
I'm trying to count the number of unique employees we have by site by week. The problem I'm running into is around year end, the calendar years don't always match up so it is important to have them by date so that I can manually filter down to the correct dates using a pivot table (2013/2014 had a week were we had to add week 53 + week 1).
I'm experimenting by using different group by statements but I'm not sure how the order matters and what changes when I switch them around.
i.e.
Group by week(statistic_date,4), Site, Cost_Center, Emp_ID
vs
Group by Site, Cost_Center, week(statistic_date,4), Emp_ID
Other things to note:
-Employees can work any number of days. Some are working 4 x 10's, others 5 x 8's with possibly a 6th day if they sign up for OT. If I sum the counts by week, I get anywhere between 3-7 per Emp_ID. I'm hoping to get 1 for the week.
-There are different pay code per employee so the distinct count helps when we are looking by day (VTO = Voluntary Time Off, OT = Over Time, LOA = Leave of Absence, etc). The distinct count will show me 1, where often times I will have 2-3 for the same emp in the same day (hits 40 hours and starts accruing OT then takes VTO or uses personal time in the same day).
I'm starting with a query I wrote to understand our paid hours by week. I'm trying to adapt it for this application. Actual code is below:
SELECT
dkh.STATISTIC_DATE AS 'Date'
,week(dkh.STATISTIC_DATE,4) as 'Week'
,month(dkh.STATISTIC_DATE) as 'Month'
,year(dkh.STATISTIC_DATE) as 'Year'
,dkh.SITE AS 'Site ID Short'
,aep.LOC_DESCR as 'Site Name'
,dkh.EMPLOYEE_ID AS 'Employee ID'
,count(distinct dkh.EMPLOYEE_ID) AS 'Distinct Employee ID'
,aep.NAME AS 'Employee Name'
,aep.BUSINESS_TITLE AS 'Business_Ttile'
,aep.SPRVSR_NAME AS 'Manager'
,SUBSTR(aep.DEPTID,1,4) AS 'Cost_Center'
,dkh.PAY_CODE
,dkh.PAY_CODE_SHORT
,dkh.HOURS
FROM metrics.DAT_KRONOS_HOURS dkh
JOIN metrics.EMPLOYEES_PUBLIC aep
ON aep.SNAPSHOT_DATE = SUBDATE(dkh.STATISTIC_DATE, DAYOFWEEK(dkh.STATISTIC_DATE) + 1)
AND aep.EMPLID = dkh.EMPLOYEE_ID
WHERE dkh.STATISTIC_DATE BETWEEN adddate(now(), interval -1 year) AND DATE(now())
group by dkh.SITE, SUBSTR(aep.DEPTID,1,4), week(dkh.STATISTIC_DATE,4), dkh.STATISTIC_DATE, dkh.EMPLOYEE_ID
The order you use in group by doesn't matter. Each unique combination of the values gets a group of its own. Selecting columns you don't group by gives you somewhat arbitrary results; you'd probably want to use some aggregation function on them, such as SUM to get the group total.
Grouping by values you derive from other values that you already use in group by, like below, isn't very useful.
week(dkh.STATISTIC_DATE,4), dkh.STATISTIC_DATE
If two rows have different weeks, they'll also have different dates, right?

Group by date from multiple columns?

first of all sorry for that title, but I have no idea how to describe it:
I'm saving sessions in my table and I would like to get the count of sessions per hour to know how many sessions were active over the day. The sessions are specified by two timestamps: start and end.
Hopefully you can help me.
Here we go:
http://sqlfiddle.com/#!2/bfb62/2/0
While I'm still not sure how you'd like to compare the start and end dates, looks like using COUNT, YEAR, MONTH, DAY, and HOUR, you could come up with your desired results.
Possibly something similar to this:
SELECT COUNT(ID), YEAR(Start), HOUR(Start), DAY(Start), MONTH(Start)
FROM Sessions
GROUP BY YEAR(Start), HOUR(Start), DAY(Start), MONTH(Start)
And the SQL Fiddle.
What you want to do is rather hard in MySQL. You can, however, get an approximation without too much difficulty. The following counts up users who start and stop within one day:
select date(start), hour,
sum(case when hours.hour between hour(start) and hours.hour then 1 else 0
end) as GoodEstimate
from sessions s cross join
(select 0 as hour union all
select 1 union all
. . .
select 23
) hours
group by date(start), hour
When a user spans multiple days, the query is harder. Here is one approach, that assumes that there exists a user who starts during every hour:
select thehour, count(*)
from (select distinct date(start), hour(start),
(cast(date(start) as datetime) + interval hour(start) hour as thehour
from sessions
) dh left outer join
sessions s
on s.start <= thehour + interval 1 hour and
s.end >= thehour
group by thehour
Note: these are untested so might have syntax errors.
OK, this is another problem where the index table comes to the rescue.
An index table is something that everyone should have in their toolkit, preferably in the master database. It is a table with a single id int primary key indexed column containing sequential numbers from 0 to n where n is a number big enough to do what you need, 100,000 is good, 1,000,000 is better. You only need to create this table once but once you do you will find it has all kinds of applications.
For your problem you need to consider each hour and, if I understand your problem you need to count every session that started before the end of the hour and hasn't ended before that hour starts.
Here is the SQL fiddle for the solution.
What it does is use a known sequential number from the indextable (only 0 to 100 for this fiddle - just over 4 days - you can see why you need a big n) to link with your data at the top and bottom of the hour.