I'm back on Stack Overflow with another headache that I have been trying to get to the bottom of with no success at all, no matter how many AVG(DATEDIFF(...)) variations I try.
I have an SQL table like the one below:
ID | PersonID | Start | End | Status
1 | 1 | 2006-03-21 00:00:00 | 2007-05-19 00:00:00 | Active
2 | 1 | 2007-05-19 00:00:00 | 2007-05-20 00:00:00 | Active
3 | 2 | 2016-08-24 00:00:00 | 2016-08-25 00:00:00 | Active
4 | 2 | 2016-08-25 00:00:00 | 2016-08-28 00:00:00 | Active
5 | 2 | 2016-08-28 00:00:00 | 2017-10-05 00:00:00 | Active
I'm trying to find the average active stay (in days) across all unique people.
That is, the average number of days based on each person's EARLIEST start date and LATEST end date (since a single PersonID can have multiple active statuses).
For example, person ID 1: their earliest start date was 2006-03-21 and their latest end date is 2007-05-20, so their stay has been 425 days.
Repeating this for person ID 2 gives a stay of 407 days.
After doing this for everyone in the table, I want the average length of stay. The average for the five rows above, covering two unique people, is 416 days. A simple DATEDIFF average across all rows would give me a very inaccurate average of 102.
Hope this makes sense. As always, any help you can give is very much appreciated.
So why not try this:
SELECT
    AVG(DATEDIFF(PersonEnd, PersonStart))
FROM
    (SELECT
         MIN(`Start`) AS PersonStart,
         MAX(`End`) AS PersonEnd
     FROM
         `table`   -- backticks needed: TABLE is a reserved word (Start/End quoted defensively)
     GROUP BY
         PersonID) PeriodsPerPerson
Of course, you should have proper indexes so that MySQL can compute MIN and MAX quickly and group quickly as well, which means an index covering at least PersonID, Start and End.
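For illustration, a minimal sketch of such an index on the assumed table from above (the index name is hypothetical):

ALTER TABLE `table`
    ADD INDEX idx_person_dates (PersonID, `Start`, `End`);

Because this composite index covers every column the query touches, MySQL can group on the leading PersonID column and answer the whole query from the index alone.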
Please note that you really need the alias for the inner query even though it isn't referenced anywhere. If you leave it out, you'll run into an error, at least with MySQL 5.5 (I don't know about later versions).
If you have millions or even billions of rows, you might be better off moving the calculation into a stored procedure or a back-end application instead of doing it as shown above.
I'm trying to add a column to a production hours dataset that will tell if a provider who worked last week was also working three weeks earlier. The current dataset looks something like this:
RowID | ProviderID | ClientID | DOS | DOS (Week) | Hours
1 | 1111111111 | 22222222 | 11/2/2020 | 11/1/2020 | 2.5
2 | 1111111111 | 33333333 | 11/5/2020 | 11/1/2020 | 1
3 | 1111111111 | 44444444 | 10/13/2020 | 10/11/2020 | 3
I'm trying to add an extra column, 'Active 3 Weeks Prior', with y/n or 1/0 values. For the above table, let's assume the provider started on 10/13/20. The new column would ideally populate like this:
RowID | ProviderID | ClientID | DOS | DOS (Week) | Hours | Active 3 Weeks Prior
1 | 1111111111 | 22222222 | 11/2/2020 | 11/1/2020 | 2.5 | Yes
2 | 1111111111 | 33333333 | 11/5/2020 | 11/1/2020 | 1 | Yes
3 | 1111111111 | 44444444 | 10/13/2020 | 10/11/2020 | 3 | No
A couple of extra tidbits: our org uses Sunday as the start of the week, so DOS (Week) is the Sunday prior to the date of service. From what I've been reading so far, it seems the solution here is some kind of self-join, where the base production records are aggregated into weekly hours and compared with the same ProviderID's records for DOS (Week) - 21.
The trouble I'm having is twofold: whether I'm on the right track with the self-join in the first place, and how to generate the y/n values based on the success or failure to find a matching record. Also, I suspect that joining on a concatenation of ProviderID and DOS (Week) might be flawed? This is what I've been playing with so far.
Please let me know if I can clarify the question at all or am missing something very obvious. I truly appreciate any help, as I've been trying to figure out the right search terms to get a clue on the answer for a few days now.
If you are running MySQL 8.0, you can use window functions and a range specification:
select t.*,
       (
           max(providerid) over(
               partition by providerid
               order by dos
               range between interval 3 week preceding and interval 3 week preceding
           ) is not null
       ) as active_3_weeks_before
from mytable t
It is not really clear from your explanation and data what you mean by was also working three weeks earlier. What the query does is, for each row, check whether another row exists for the same provider with a dos exactly 3 weeks before the dos of the current row. This can easily be adapted to other requirements.
Edit: if you want to check for any record within the last 3 weeks, you would change the window range to:
range between interval 3 week preceding and interval 1 day preceding
And if you want this in MySQL < 8.0, where window functions are not available, then you would use a correlated subquery:
select t.*,
       exists (
           select 1
           from mytable t1
           where t1.providerid = t.providerid
             and t1.dos >= t.dos - interval 3 week
             and t1.dos < t.dos
       ) as active_3_weeks_before
from mytable t
I have a table where I store historical data and add a record for items I'm tracking every 5 mins.
This is an example using just 2 items:
+----+-------------+
| id | timestamp |
+----+-------------+
| 1 | 1533209426 |
| 2 | 1533209426 |
| 1 | 1533209726 |
| 2 | 1533209726 |
| 1 | 1533210026 |
| 2 | 1533210026 |
+----+-------------+
The problem is that I'm actually tracking 4k items and the table keeps getting bigger; also, I don't need 5-minute granularity when I look at the last month. What I'm trying to understand is whether there's a way to keep 5-minute records for the last 24 hours, 1-hour records for the last 7 days, etc. Maybe every hour I could take the first 12 records from the 5-minute table and store their average in the 1-hour table? But what if some records are missing because there were errors? Is this the correct way to solve this problem, or are there better alternatives?
You are on the right track.
There are multiple issues to decide how to handle -- missing entries, timestamps skewed by a second (or whatever), and so on.
By including a count (which should always be 12), you can discover some of those hiccups:
SELECT  FLOOR(timestamp / 3600) AS hr,  -- MEDIUMINT UNSIGNED
        COUNT(*),                       -- TINYINT UNSIGNED
        AVG(metric)                     -- FLOAT
    FROM tbl
    GROUP BY 1;
Yes, every hour, process the previous hour's worth of data. Add WHERE timestamp BETWEEN ... AND ... + 3599 to constrain the range in question, then purge the same set of data.
The table would have PRIMARY KEY(hr).
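A minimal sketch of that hourly job, assuming hypothetical names (tbl for the 5-minute data, tbl_hourly for the summary, metric for the tracked value; add the item id to the GROUP BY and the primary key if you need per-item averages):

CREATE TABLE tbl_hourly (
    hr         MEDIUMINT UNSIGNED NOT NULL,  -- FLOOR(timestamp / 3600)
    cnt        TINYINT UNSIGNED NOT NULL,    -- should be 12; anything else signals a hiccup
    avg_metric FLOAT NOT NULL,
    PRIMARY KEY (hr)
);

-- Roll up the previous hour; @hr_start is the Unix timestamp of that hour's first second
INSERT INTO tbl_hourly (hr, cnt, avg_metric)
    SELECT FLOOR(timestamp / 3600), COUNT(*), AVG(metric)
        FROM tbl
        WHERE timestamp BETWEEN @hr_start AND @hr_start + 3599
        GROUP BY 1;

-- Then purge the same range from the 5-minute table
DELETE FROM tbl
    WHERE timestamp BETWEEN @hr_start AND @hr_start + 3599;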
Unless you are talking about millions of rows in a table, I would not recommend any use of PARTITION.
I need to create a date range in a table that houses transaction information. The table updates sporadically throughout the week from a manual process. Each time the table is updated, transactions are added up to the previous Sunday. For instance, the upload took place yesterday, so transactions were loaded through last Sunday (Feb 26th); if it had been loaded on Wednesday, it would still be dated for that Sunday. The point is that I have a moving target, both in my transactions and in when the data is loaded to the table. I am trying to fix my look-back period to the date of the latest transaction and then go three weeks back. Here is the query that I came up with:
SELECT DISTINCT TB.TransactionDate
FROM TransactionTABLE TB
INNER JOIN (
    SELECT DISTINCT TOP 21 TransactionDate
    FROM TransactionTABLE
    ORDER BY TransactionDate DESC
) A ON TB.TransactionDate = A.TransactionDate
ORDER BY TB.TransactionDate DESC
Technically this code works. The problem I am running into now is that when there were no transactions on a given date, such as a bank holiday (in this case Martin Luther King Day), the query looks back one day too far.
I have tried a few different options, including MAX(TransactionDate), but if I use that in a sub-query or CTE and then reference the resulting value in a WHERE clause, I only get a single value back: the max, or the max minus whatever I subtract. For instance, if I say WHERE TransactionDate >= MAX(TransactionDate)-21 and the max date is Feb 26th, the result is just Feb 2nd instead of the range of dates from Feb 2nd through Feb 26th.
IN SUMMARY, what I need is a date range looking three weeks back from the latest transaction date. This is for a daily report, so I cannot hardcode the date in. Since I am also using Excel connections, the use of DECLARE statements is prohibited.
Thank you StackOverflow gurus in advance!
You could use something like this:
;with n as (select n from (values(0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) t(n))
, dates as (
    select top (21)
        [Date] = convert(date, dateadd(day
                   , row_number() over (order by (select 1)) - 1
                   , dateadd(day, -20, (select max(TransactionDate) from t))))
    from n as deka
    cross join n as hecto
    order by [Date]
)
select [Date] = convert(varchar(10), dates.[Date], 120)
from dates
rextester demo: http://rextester.com/ZFYV25543
returns:
+------------+
| Date |
+------------+
| 2017-02-06 |
| 2017-02-07 |
| 2017-02-08 |
| 2017-02-09 |
| 2017-02-10 |
| 2017-02-11 |
| 2017-02-12 |
| 2017-02-13 |
| 2017-02-14 |
| 2017-02-15 |
| 2017-02-16 |
| 2017-02-17 |
| 2017-02-18 |
| 2017-02-19 |
| 2017-02-20 |
| 2017-02-21 |
| 2017-02-22 |
| 2017-02-23 |
| 2017-02-24 |
| 2017-02-25 |
| 2017-02-26 |
+------------+
I just found this for looking up dates that fall within a given week. The code can be manipulated to change the week start date.
select convert(datetime,
       dateadd(dd,
               -datepart(dw,
                   convert(datetime,
                       convert(varchar(10),
                           dateadd(dd, -1 /* this # changes the week start day */, getdate()),
                           101)))
               + 1, /* this # is used to change the week start date */
               convert(datetime, convert(varchar(10), getdate(), 21))))
       /* also can enter # here to change the week start date */
I've included a screenshot of the results as they would look combined with a full query, so you can see how it behaves across a range of dates. I did a little manipulation so that the week starts on Monday and references Monday's date.
Since I am only looking back three weeks, a simple GETDATE()-21 is sufficient: as the query moves forward through the week, it still looks back 21 days and picks the Monday at the beginning of the week as my start date.
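For completeness, a minimal sketch of the look-back anchored to the latest transaction rather than counting dates, using the table name from the question (untested against the real schema, and no DECLARE needed):

SELECT DISTINCT TransactionDate
FROM TransactionTABLE
WHERE TransactionDate >= (SELECT DATEADD(DAY, -21, MAX(TransactionDate))
                          FROM TransactionTABLE)
ORDER BY TransactionDate DESC

Because the range is anchored to MAX(TransactionDate) in a sub-query, days with no transactions no longer push the window back too far.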
I have a table with the following structure:
id | workerID | materialID | date | materialGathered
Different workers contribute different amounts of different materials per day. A single worker can contribute only once a day, but not necessarily every day.
What I need to do is figure out which of them was the most productive and which was the least productive, measured as the AVG() material gathered per day.
I honestly have no idea how to do that, so I'd appreciate any help.
EDIT1:
Some sample data (columns: id | workerID | date | materialGathered):
1 | 1 | 2013-01-20 | 25
2 | 1 | 2013-01-21 | 15
3 | 1 | 2013-01-22 | 17
4 | 1 | 2013-01-25 | 28
5 | 2 | 2013-01-20 | 23
6 | 2 | 2013-01-21 | 21
7 | 3 | 2013-01-22 | 17
8 | 3 | 2013-01-24 | 15
9 | 3 | 2013-01-25 | 19
It doesn't really matter how the output looks, to be honest. Maybe a simple table like this:
workerID | avgMaterialGatheredPerDay
And I didn't really attempt anything because I literally have no idea, haha.
EDIT2:
Any time period that is in the table (from earliest to latest date in the table) is considered.
Material doesn't matter at the moment. Only the arbitrary units in the materialGathered column matter.
As you say in your comments that we should look at each worker and consider their average daily output, rather than checking who gathered the most within a given period, the answer is rather easy: group by workerid to get one result record per worker, and use AVG to get the average amount:
select workerid, avg(materialgathered) as avg_gathered
from work
group by workerid;
Now to the best and worst workers. There can be more than two of them, so you cannot just take the first or last record; you need to know the maximum and the minimum avg_gathered.
select max(avg_gathered) as max_avg_gathered, min(avg_gathered) as min_avg_gathered
from
(
  select avg(materialgathered) as avg_gathered
  from work
  group by workerid
) as worker_avgs;  -- MySQL requires an alias on every derived table
Now join the two queries to get all workers that hit the average minimum or maximum:
select worker.*
from
(
  select workerid, avg(materialgathered) as avg_gathered
  from work
  group by workerid
) as worker
inner join
(
  select max(avg_gathered) as max_avg_gathered, min(avg_gathered) as min_avg_gathered
  from
  (
    select avg(materialgathered) as avg_gathered
    from work
    group by workerid
  ) as worker_avgs
) as worked on worker.avg_gathered in (worked.max_avg_gathered, worked.min_avg_gathered)
order by worker.avg_gathered;
There are other ways to do this, for example with HAVING avg(materialgathered) IN (select min(avg_gathered)...) OR avg(materialgathered) IN (select max(avg_gathered)...) instead of a join; a sketch of that variant follows below. The join is very effective though, because you need just one select for both min and max.
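For illustration, a minimal sketch of that HAVING variant (the derived-table aliases mins and maxs are arbitrary names added here, since MySQL requires them; untested):

select workerid, avg(materialgathered) as avg_gathered
from work
group by workerid
having avg(materialgathered) in
       (select min(avg_gathered)
        from (select avg(materialgathered) as avg_gathered
              from work group by workerid) as mins)
    or avg(materialgathered) in
       (select max(avg_gathered)
        from (select avg(materialgathered) as avg_gathered
              from work group by workerid) as maxs)
order by avg_gathered;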
I have this existing schema where a "schedule" table looks like this (very simplified).
CREATE TABLE schedule (
id int(11) NOT NULL AUTO_INCREMENT,
name varchar(45),
start_date date,
availability int(3),
PRIMARY KEY (id)
);
For each person it specifies a start date and the percentage of work time available to spend on this project. That availability percentage implicitly continues until a newer value is specified.
For example take a project that lasts from 2012-02-27 to 2012-03-02:
id | name | start_date | availability
-------------------------------------
1 | Tom | 2012-02-27 | 100
2 | Tom | 2012-02-29 | 50
3 | Ben | 2012-03-01 | 80
So Tom starts on Feb. 27th, full time, until Feb. 29th, from which point on he'll be available for only 50% of his work time.
Ben only starts on March 1st, and with only 80% of his time.
Now the goal is to "normalize" this sparse data, so that there is a result row for each person for each day with the availability coming from the last specified day:
name | start_date | availability
--------------------------------
Tom | 2012-02-27 | 100
Tom | 2012-02-28 | 100
Tom | 2012-02-29 | 50
Tom | 2012-03-01 | 50
Tom | 2012-03-02 | 50
Ben | 2012-02-27 | 0
Ben | 2012-02-28 | 0
Ben | 2012-02-29 | 0
Ben | 2012-03-01 | 80
Ben | 2012-03-02 | 80
Think of a chart showing the availability of each person over time, or of calculating the "resource" values in a burndown diagram.
I can easily do this with procedural code in the app layer, but would prefer a nicer, faster solution.
To make this remotely efficient, I recommend creating a calendar table: one that contains each and every date of interest. You then use that as a template on which to join your data.
Equally, things improve further if you have a person table to act as the template for the name dimension of your results.
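For illustration, a minimal sketch of those two template tables, populated for the example project period (the names match the query below; adjust to your schema):

CREATE TABLE calendar (
    `date` DATE NOT NULL PRIMARY KEY
);
INSERT INTO calendar (`date`) VALUES
    ('2012-02-27'), ('2012-02-28'), ('2012-02-29'),
    ('2012-03-01'), ('2012-03-02');

CREATE TABLE person (
    name VARCHAR(45) NOT NULL PRIMARY KEY
);
INSERT INTO person (name)
    SELECT DISTINCT name FROM schedule;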
You can then use a correlated sub-query in your join to pick which record in schedule matches the calendar/person template you have created.
SELECT
    *
FROM
    calendar
CROSS JOIN
    person
LEFT JOIN
    schedule
        ON  schedule.name = person.name
        AND schedule.start_date = (SELECT MAX(start_date)
                                     FROM schedule
                                    WHERE name = person.name
                                      AND start_date <= calendar.date)
WHERE
    calendar.date >= <yourStartDate>
    AND calendar.date <= <yourEndDate>
etc
Often, however, it is more efficient to deal with it in one of two other ways...
Don't allow gaps in the data in the first place. Have a nightly batch process, or some other business logic, that ensures all relevant data points are populated.
Or deal with it in your client. Return each dimension in your report (date and name) as separate data sets to act as your templates, and then return the data as your final data set. Your client can iterate over the data and fill in the blanks as appropriate. It's more code, but can actually use less resource overall than trying to fill the gaps with SQL.
(If your client-side code does this slowly, post another question examining that code. Provided that the data is sorted, this is actually quite quick to do in most languages.)