I am working on a query and I did a DateDiff in my Select statement to create three columns creating minute differences between two columns in my query. What I need to do now is put the data from those DateDiff results into buckets, but I am stuck on how to get this done.
These are the calculations from my Select statement:
,DATEDIFF (minute, ORD_MSG_MST.ORD_RCV_DTTM, PF_MST.ISSU_DTTM) AS 'Order_Issue'
,DATEDIFF (minute, ORD_MSG_MST.ORD_RCV_DTTM, UNIT_HIST.OCCR_DTTM) AS 'Order_XM'
,DATEDIFF (minute, UNIT_HIST.OCCR_DTTM, PF_MST.ISSU_DTTM) AS 'XM_IS'
I was going to try adding this as a subquery in my FROM statement:
LEFT OUTER JOIN
SELECT
count(CASE WHEN 'Order_XM'>= 0 AND 'Order_XM' < 10 THEN 1 END) AS '0 - 10',
count(CASE WHEN 'Order_XM'>= 11 AND 'Order_XM' < 20 THEN 1 END) AS '11 - 20',
count(CASE WHEN 'Order_XM'>= 21 AND 'Order_XM' < 30 THEN 1 END) AS '21 - 30',
count(CASE WHEN 'Order_XM'>= 31 AND 'Order_XM' < 40 THEN 1 END) AS '31 - 40',
FROM ____) )
But I don't know what table I need to put in to my FROM statement. And I'm not sure if this is really the correct way to do this.
Any thoughts on how you would get a calculated column in a query into buckets within the same query so I can create a Histogram in Report Builder?
I've done a lot of searching on this and haven't found anything where these are values that have been calculated.
TIA
You can take your initial query with the calculated DateDiff columns and use that as a subquery -- this will allow you then to have a primary query that does the bucketing.
You did not post most of your SQL, making it difficult to show an example that is perfectly relevant. So instead I'll use the Northwind sample database from MS to show an example that should be easy enough to follow and fix your own query.
Northwind has an Orders table with two DATETIME columns: RequiredDate and ShippedDate. My subquery calculates a DateDiff (with days rather than minutes) between these two columns and calls it Diff. The primary query then calculates a COUNT for each of three buckets covering ranges of values.
SELECT
COUNT(CASE WHEN Diff >= -50 AND Diff < -25 THEN 1 END) AS '-50 through -25',
COUNT(CASE WHEN Diff >= -25 AND Diff < 0 THEN 1 END) AS '-25 through 0',
COUNT(CASE WHEN Diff >= 0 AND Diff < 25 THEN 1 END) AS '0 through 25'
FROM
(
SELECT DATEDIFF(day, RequiredDate, ShippedDate) AS 'Diff'
FROM [Northwind].[dbo].[Orders]
) a
Using this example I think you should be able to make similar changes to your own query to get the result you are looking for.
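To make the subquery-then-bucket pattern easy to try without Northwind, here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its column names and the four sample rows are made up for illustration, and julianday() is only SQLite's stand-in for DATEDIFF(day, ...); the shape of the query is the same as above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for Northwind's Orders table.
conn.execute("CREATE TABLE orders (required_date TEXT, shipped_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("2024-03-10", "2024-02-01"),  # shipped 38 days early -> diff = -38
        ("2024-03-10", "2024-03-01"),  # shipped 9 days early  -> diff = -9
        ("2024-03-10", "2024-03-10"),  # shipped on time       -> diff = 0
        ("2024-03-10", "2024-03-20"),  # shipped 10 days late  -> diff = 10
    ],
)

# Inner query: compute the difference once per row.
# Outer query: count rows per bucket using half-open ranges [low, high).
buckets = conn.execute(
    """
    SELECT
        COUNT(CASE WHEN diff >= -50 AND diff < -25 THEN 1 END) AS bucket_1,
        COUNT(CASE WHEN diff >= -25 AND diff < 0   THEN 1 END) AS bucket_2,
        COUNT(CASE WHEN diff >= 0   AND diff < 25  THEN 1 END) AS bucket_3
    FROM (
        SELECT CAST(julianday(shipped_date) - julianday(required_date) AS INTEGER) AS diff
        FROM orders
    ) AS a
    """
).fetchone()

print(buckets)  # (1, 1, 2)
```

Using half-open ranges means every value lands in exactly one bucket, so nothing falls into a gap between neighbouring buckets.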
BACKGROUND:
I have data that looks like this
date src subsrc subsubsrc param1 param2
2020-02-01 src1 ksjd dfd8 47 31
2020-02-02 src1 djsk zmnc 44 95
2020-02-03 src2 skdj awes 92 100
2020-02-04 src2 mxsf kajs 80 2
2020-02-05 src3 skdj asio 46 53
2020-02-06 src3 dekl jdqo 19 18
2020-02-07 src3 dskl dqqq 69 18
2020-02-08 src4 sqip riow 64 46
2020-02-09 src5 ss01 qwep 34 34
I am trying to aggregate for all time, last 30 days and last 90 days (no rolling sum)
So my final data would look like this:
src subsrc subsubsrc p1_all p1_30 p1_90 p2_all p2_30 p2_90
src1 ksjd dfd8 7 1 7 98 7 98
src1 djsk zmnc 0 0 0 0 0 0
src2 skdj awes 12 12 12 4 4 4
src2 mxsf kajs 6 6 6 31 31 31
src3 skdj asio 0 0 0 0 0 0
src3 dekl jdqo 20 20 20 17 17 17
src3 dskl dqqq 3 3 3 4 4 4
src4 sqip qwep 0 0 0 0 0 0
src5 ss01 qwes 15 15 15 2 2 2
ABOUT DATA:
This is only dummy data and therefore incorrect.
There are tens of thousands of rows in my data.
There are a dozen src columns that make up the key for the table.
There are a dozen param columns that I have to sum for 30 days, 90 days and all time.
Also there are null values in the param columns.
Also there might be multiple rows for the same day and src column.
New data is being added every day and the query is probably going to be run every day to get the latest 30, 90, all time data.
WHAT I HAVE TRIED:
This is what I have come up with:
SELECT src, subsrc, subsubsrc,
SUM(param1) as param1_all,
SUM(CASE WHEN DATE_DIFF(CURRENT_DATE,date,day) <= 30 THEN param1 END) as param1_30,
SUM(CASE WHEN DATE_DIFF(CURRENT_DATE,date,day) <= 90 THEN param1 END) as param1_90,
SUM(param2) as param2_all,
SUM(CASE WHEN DATE_DIFF(CURRENT_DATE,date,day) <= 30 THEN param2 END) as param2_30,
SUM(CASE WHEN DATE_DIFF(CURRENT_DATE,date,day) <= 90 THEN param2 END) as param2_90,
FROM `MY_TABLE`
GROUP BY src
ORDER BY src
This actually works but I can anticipate how long this query is going to become for multiple sources and even more param columns.
I have been trying something called "Filtered aggregate functions (or manual pivot)" explained HERE. But I am unable to understand/implement it for my case.
Also I have looked at dozens of answers and most of them are running sums for each day OR are complicated cases of this basic calculation. Maybe I am not searching it correctly.
As you can see I am newbie in SQL and would really appreciate any help.
Your query looks quite good; conditional aggregation is the canonical method to pivot a dataset.
One way to possibly increase performance would be to change the date filter in the conditional expressions: using a date function precludes the use of an index.
Instead, you could phrase this as:
select
src,
subsrc,
subsubsrc,
sum(param1) as param1_all,
sum(case when date >= current_date - interval 30 day then param1 end) as param1_30,
sum(case when date >= current_date - interval 90 day then param1 end) as param1_90,
sum(param2) as param2_all,
sum(case when date >= current_date - interval 30 day then param2 end) as param2_30,
sum(case when date >= current_date - interval 90 day then param2 end) as param2_90
from my_table
group by src, subsrc, subsubsrc
order by src, subsrc, subsubsrc
For performance, the following index may be helpful: (src, subsrc, subsubsrc, date).
Note that I included all three non-aggregated columns (src, subsrc, subsubsrc) in the group by clause: starting with MySQL 5.7, this is mandatory by default (although you can change that behavior through SQL modes), and most other databases implement the same constraint.
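A small way to try the conditional aggregation above is the following sketch with Python's sqlite3 module. The table layout, the column name day (instead of date), the sample rows and the fixed "today" are assumptions for illustration; the cutoff dates are computed once in Python and passed as bound parameters, so the date column is only ever compared against constants.

```python
import sqlite3
from datetime import date, timedelta

today = date(2020, 3, 1)  # fixed "today" so the example is reproducible
cutoffs = {
    "c30": (today - timedelta(days=30)).isoformat(),  # 2020-01-31
    "c90": (today - timedelta(days=90)).isoformat(),  # 2019-12-02
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (day TEXT, src TEXT, subsrc TEXT, param1 INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?)",
    [
        ("2020-02-20", "src1", "a", 5),     # inside the last 30 days
        ("2020-01-10", "src1", "a", 7),     # inside the last 90 days only
        ("2019-06-01", "src1", "a", 11),    # counts for "all time" only
        ("2020-02-25", "src2", "b", 3),
        ("2020-02-26", "src2", "b", None),  # NULL values are skipped by SUM
    ],
)

# ISO-formatted dates compare correctly as strings, so no date function
# has to be applied to the column itself.
rows = conn.execute(
    """
    SELECT src, subsrc,
           SUM(param1) AS param1_all,
           SUM(CASE WHEN day >= :c30 THEN param1 END) AS param1_30,
           SUM(CASE WHEN day >= :c90 THEN param1 END) AS param1_90
    FROM my_table
    GROUP BY src, subsrc
    ORDER BY src, subsrc
    """,
    cutoffs,
).fetchall()

for row in rows:
    print(row)
```

With a dozen param columns, the SELECT list could be generated in a loop over the column names rather than typed by hand.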
Your first approach isn't a bad one if you are able to build the query programmatically. One alternative might be to create side tables for the 30 and 90 day cases first so you can effectively select all columns from each. This could also be done in sub-queries but there are performance considerations.
Some untested pseudo code to hopefully clarify:
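SELECT
a.src,
a.subsrc,
a.subsubsrc,
a.param1_all,
-- other "all" sums here
t30.param1_30,
-- other "30" sums here
t90.param1_90
-- other "90" sums here
FROM (
SELECT src, subsrc, subsubsrc, SUM(param1) AS param1_all
FROM MY_TABLE
GROUP BY src, subsrc, subsubsrc
) AS a
-- each side table is aggregated before joining, so the joins cannot multiply rows
LEFT JOIN (
SELECT src, subsrc, subsubsrc, SUM(param1) AS param1_30
FROM MY_TABLE
WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY src, subsrc, subsubsrc
) AS t30 USING (src, subsrc, subsubsrc)
LEFT JOIN (
SELECT src, subsrc, subsubsrc, SUM(param1) AS param1_90
FROM MY_TABLE
WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY src, subsrc, subsubsrc
) AS t90 USING (src, subsrc, subsubsrc)
ORDER BY a.src, a.subsrc, a.subsubsrc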
Note the date conditions have been switched to not use a function on the date column but instead compare to a date value. Your original approach would defeat any index on date (which you will want to make this more efficient).
If you first put these sub-queries into side tables that have a key on src the joins will be more efficient too. You could even group/sum directly into those side tables first rather than creating larger copies of your data, and then join the aggregated data together.
Your code looks good. Your RDBMS has to loop over all records under the hood and do some calculations. One thing you can improve is that you are calculating date differences for every record. It would make sense to calculate the dates 30 days ago and 90 days ago beforehand and only compare the date column against those.
Since you already know that the number of rows and parameters will increase in the future, it makes sense to create a cron job which daily computes this in the following manner:
the first time it calculates the values, it should store all the results along with the date it was running at (maybe into a table dedicated for this analytics)
on subsequent days you can calculate the all time sum by loading the items which were created since the last check
you will still need to calculate the 30 and 90 day stuff, but that would be much less of a problem than calculating this for all time
If you do this properly and have daily information, then later on you will be able to analyze trends in history as well.
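The incremental part of this idea could look roughly like the sketch below, written with Python's sqlite3 module. The tables totals and job_state, the function refresh_totals and the column name day are hypothetical names, and the sketch assumes rows never arrive late (no rows are added for a day that has already been processed).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE my_table (day TEXT, src TEXT, param1 INTEGER);
    -- hypothetical analytics tables maintained by the daily job
    CREATE TABLE totals (src TEXT PRIMARY KEY, param1_all INTEGER NOT NULL);
    CREATE TABLE job_state (last_day TEXT NOT NULL);
    INSERT INTO job_state VALUES ('');
    """
)


def refresh_totals(conn, run_day):
    """Add rows newer than the last processed day (up to run_day) to the all-time totals."""
    (last_day,) = conn.execute("SELECT last_day FROM job_state").fetchone()
    new_sums = conn.execute(
        "SELECT src, COALESCE(SUM(param1), 0) FROM my_table "
        "WHERE day > ? AND day <= ? GROUP BY src",
        (last_day, run_day),
    ).fetchall()
    for src, amount in new_sums:
        updated = conn.execute(
            "UPDATE totals SET param1_all = param1_all + ? WHERE src = ?",
            (amount, src),
        )
        if updated.rowcount == 0:  # first time this src is seen
            conn.execute("INSERT INTO totals VALUES (?, ?)", (src, amount))
    conn.execute("UPDATE job_state SET last_day = ?", (run_day,))
    conn.commit()


# Day 1: initial load, then the first run of the job.
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [("2020-02-01", "src1", 47), ("2020-02-01", "src2", 92)],
)
refresh_totals(conn, "2020-02-01")

# Day 2: only the new rows are read by the job.
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [("2020-02-02", "src1", 44), ("2020-02-02", "src1", None)],
)
refresh_totals(conn, "2020-02-02")

totals = dict(conn.execute("SELECT src, param1_all FROM totals"))
print(totals)
```

The 30 and 90 day sums are not handled here; they would still be recomputed on each run over the recent rows only.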
I'd recommend you use 3 different queries for that:
Sum for all time
Sum for 30 days
Sum for 90 days
If you try to do everything in one query, you end up with a full table scan because of the CASE-WHEN-END expressions (BTW, MySQL has a more compact form, IF()). This is far from optimal.
If you split it into 3 different queries and add an index to the date column then it won't do full-scan for the 2nd and 3rd query. Only for the 1st query, which can be optimised separately (for example by caching).
Also this approach: DATE_DIFF(CURRENT_DATE,date,day) <= 90
should be changed to: date >= 'date-90-days-ago' (where 'date-90-days-ago' is a fixed date)
This way you don't have to compute the difference of two dates for every row. You only compute two dates, 30 days ago and 90 days ago, and compare all other dates to these two. This approach can also benefit from an index on the date column.
I am using MySQL 8 and need to create a stored procedure
I have a single table that has a DATE field and a value field which can be 0 or any other number. This value field represents the daily amount of rain for that day.
The table stores data between today and 10 years from now.
I need to find out how many periods of rain there will be in the next 10 years.
So, for example, if my table contains the following data:
Date - Value
2018-06-09 - 0
2018-06-10 - 50
2018-06-11 - 0
2018-06-12 - 15
2018-06-13 - 17
2018-06-14 - 0
2018-06-15 - 0
2018-06-16 - 12
2018-06-17 - 123
2018-06-18 - 17
Then the SP should return 3, because there were 3 periods of rain.
Any help in getting me closer to the answer will be appreciated!
You don't need to have a stored procedure for this.
Here is a solution using MySQL 8.0's LEAD function; it also supports dates with gaps.
The complete table needs to be scanned, but I don't think that is a huge problem with ~3650 records.
Query
SELECT
SUM(filter_match = 1) AS number
FROM (
SELECT
((t.value = 0) AND (LEAD(t.value) OVER (ORDER BY t.date ASC) != 0)) AS filter_match
FROM
t
) t
see demo https://www.db-fiddle.com/f/sev4NqgLsFPgtNgwzruwy/2
By the way, would you mind expanding your answer to understand how LEAD and SUM work together?
LEAD(t.value) OVER (ORDER BY t.date ASC) simply means: get the value from the next record, ordered by date.
This demo shows it nicely: https://www.db-fiddle.com/f/sev4NqgLsFPgtNgwzruwy/6
SUM(filter_match = 1) is a conditional sum: a row is only counted when the alias filter_match is true.
See what filter_match is in this demo: https://www.db-fiddle.com/f/sev4NqgLsFPgtNgwzruwy/8
In MySQL, the argument of an aggregate function can be an SQL expression such as 1 = 1 (which is always true, i.e. 1) or 1 = 0 (which is always false, i.e. 0).
The conditional sum only adds 1 for rows where the condition is true.
See demo: https://www.db-fiddle.com/f/sev4NqgLsFPgtNgwzruwy/7
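For readers who cannot open the fiddles, the same idea can be reproduced locally. This is a sketch using Python's sqlite3 module with the sample data from the question; it assumes the bundled SQLite is version 3.25 or newer (needed for window functions), and the column is named day instead of date purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (day TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("2018-06-09", 0), ("2018-06-10", 50), ("2018-06-11", 0),
        ("2018-06-12", 15), ("2018-06-13", 17), ("2018-06-14", 0),
        ("2018-06-15", 0), ("2018-06-16", 12), ("2018-06-17", 123),
        ("2018-06-18", 17),
    ],
)

# Step 1: look at each row next to the value of the following day.
# filter_match is 1 exactly where a dry day is followed by a rainy day.
steps = conn.execute(
    """
    SELECT day, value,
           LEAD(value) OVER (ORDER BY day) AS next_value,
           (value = 0) AND (LEAD(value) OVER (ORDER BY day) != 0) AS filter_match
    FROM t
    ORDER BY day
    """
).fetchall()
for row in steps:
    print(row)

# Step 2: summing the 0/1 flags counts the dry-to-rain transitions.
periods = conn.execute(
    """
    SELECT SUM(filter_match = 1)
    FROM (
        SELECT (value = 0) AND (LEAD(value) OVER (ORDER BY day) != 0) AS filter_match
        FROM t
    ) AS flagged
    """
).fetchone()[0]
print(periods)  # 3
```

One thing to keep in mind: a rain period that starts on the very first row has no preceding dry day, so this way of counting would not include it.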
Use MySQL join:
SELECT COUNT(*) Number_of_Periods
FROM yourTable A JOIN yourTable B
ON DATE(A.`DATE`)=DATE(B.`DATE` - INTERVAL 1 DAY)
AND A.`VALUE`=0 AND B.`VALUE`>0;
See Demo on DB Fiddle.
I need to retrieve values from a database to plot them in a graph. For that I need to get values based on criteria. Data matching different criteria has to be returned as different rows/columns by my query
(i.e.)
I have a table called TABLEA which has a column TIME. I need to get counts based on time criteria: the count of rows matching TIME>1 and TIME<10 as one result, TIME>11 and TIME<20 as another result, and so on. Is it possible to get the values in a single query? I use MySQL with JDBC.
I should plot all the counts in a graph
Thanks in advance.
select sum(case when `time` between 2 and 9 then 1 else 0 end) as count_1,
sum(case when `time` between 12 and 19 then 1 else 0 end) as count_2
from your_table
This can be done with CASE statements, but they can get kind of verbose. You may just want to rely on Boolean (true/false) logic:
SELECT
SUM(TIME BETWEEN 1 AND 10) as `1 to 10`,
SUM(TIME BETWEEN 11 and 20) as `11 to 20`,
SUM(TIME BETWEEN 21 and 30) as `21 to 30`
FROM
TABLEA
The expression TIME BETWEEN 1 AND 10 will return either TRUE or FALSE for each record. With TRUE being equivalent to 1 and FALSE being equivalent to 0, we then only need to sum the results and give our new field a name.
I also made the assumption that you wanted records where 1 <= TIME <= 10 instead of 1 < TIME < 10 which you stated since, as stated, it would drop values where the TIME was 1,10,20, etc. If that was your intended result, then you can just adjust the TIME BETWEEN 1 AND 10 to be TIME BETWEEN 2 AND 9 instead.
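The difference between the inclusive and the strict ranges is easy to check on a few rows. Below is a small sketch with Python's sqlite3 module (SQLite also treats comparison results as 0 or 1, like MySQL); the table contents are made-up sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablea (time INTEGER)")
conn.executemany(
    "INSERT INTO tablea VALUES (?)",
    [(v,) for v in (1, 5, 10, 11, 15, 20, 25)],
)

# Boolean form: each comparison yields 1 or 0, so SUM counts the matches.
inclusive = conn.execute(
    """
    SELECT SUM(time BETWEEN 1 AND 10),
           SUM(time BETWEEN 11 AND 20),
           SUM(time BETWEEN 21 AND 30)
    FROM tablea
    """
).fetchone()

# Equivalent, more verbose CASE form.
with_case = conn.execute(
    """
    SELECT SUM(CASE WHEN time BETWEEN 1 AND 10 THEN 1 ELSE 0 END),
           SUM(CASE WHEN time BETWEEN 11 AND 20 THEN 1 ELSE 0 END),
           SUM(CASE WHEN time BETWEEN 21 AND 30 THEN 1 ELSE 0 END)
    FROM tablea
    """
).fetchone()

# Strict ranges as literally stated in the question: boundary values drop out.
strict = conn.execute(
    """
    SELECT SUM(time > 1 AND time < 10),
           SUM(time > 11 AND time < 20)
    FROM tablea
    """
).fetchone()

print(inclusive)  # (3, 3, 1)
print(with_case)  # (3, 3, 1)
print(strict)     # (1, 1)
```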
I've been looking at several other SO questions but I could not make out a solution from these. First, the description, then what I'm missing from the other threads. (Heads up: I'm very well aware of the non-normalised structure of our database, which is something I have addressed in meetings before but this is what we have and what I have to work with.)
Background description
We have a machine that manufactures products in 25 positions. These products' production data is being logged in a table that among other things logs current and voltage for every position. This is only logged when the machine is actually producing products (i.e. has a product in the machine). The time where no product is present, nothing is being logged.
This machine can run in two different production modes: full production and R&D production. Full production means that products are being inserted continuously so that every instance has a product at all times (i.e. 25 products are present in the machine at all times). The second mode, R&D production, only produces one product at a time (i.e. one product enters the machine, goes through the 25 instances one by one and when this one is finished, the second product enters the machine).
To clarify: every position logs data once every second whenever a product is present, which means 25 instances per second when full production is running. When R&D mode is running, position 1 will have ~20 instances for 20 consecutive seconds, position 2 will have ~20 instances for the next 20 consecutive seconds and so on.
Table structure
Productiondata:
id (autoincrement)
productID
position
time (timestamp for logged data)
current (amperes)
voltage (volts)
Question
We want to calculate the uptime of the machine, but we want to separate the uptime for production mode and R&D mode, and we want to separate this data on a weekly basis.
Guessed solution
Since we have instances logged every second I can count the amount of DISTINCT instances of time values we have in the table to find out the total uptime for both production and R&D mode. To find the R&D mode, I can safely say that whenever there is a time instance that has only one entry, I'm running in R&D mode (production mode would have 25 instances).
Progress so far
I have the following query which sums up all distinct instances to find both production and R&D mode:
SELECT YEARWEEK(time) AS YWeek, COUNT(DISTINCT time) AS Time_Seconds, ROUND(COUNT(DISTINCT time)/3600, 1) AS Time_Hours
FROM Database.productiondata
WHERE YEARWEEK(time) >= YEARWEEK(curdate()) - 21
GROUP BY YWeek;
This query finds out how many DISTINCT time instances there are in the table and counts the number and groups that by the week.
Problem
The above query counts the amount of instances that exist in the table, but I want to find ONLY the UNIQUE instances. Basically, I'm trying to find something like IF count(time) = 1, then count that instance, IF count(time) > 1 then don't count it at all (DISTINCT still counts this).
I looked at several other SO threads, but almost all explain how to find unique values with DISTINCT, which only accomplishes half of what I'm looking for. The closest I got was this which uses a HAVING clause. I'm currently stuck at the following:
SELECT YEARWEEK(time) as YWeek, COUNT(Distinct time) As Time_Seconds, ROUND(COUNT(Distinct time)/3600, 1) As Time_Hours
FROM
(SELECT * FROM Database.productiondata
WHERE time > '2014-01-01 00:00:00'
GROUP BY time
HAVING count(time) = 1) as temptime
GROUP BY YWeek
ORDER BY YWeek;
The problem here is that we have a GROUP BY time inside the nested select clause which takes forever (~5 million rows only for this year so I can understand that). I mean, syntactically I think that this is correct but it takes forever to execute. Even EXPLAIN for this times out.
And that is where I am. Is this the correct approach or is there any other way that is smarter/requires less query time/avoids the group by time clause?
EDIT: As a sample, we have this table (apologies for formatting, don't know how to make a table format here on SO)
id position time
1 1 1
2 2 1
3 5 1
4 19 1
... ... ...
25 7 1
26 3 2
27 6 2
... ... ...
This table shows how it looks like when there is a production run going on. As you can see, there is no general structure for which position gets the first entry when logging the data in the table; what happens is that the 25 positions gets logged during every second and the data is then added to the table depending on how fast the PLC sends the data for every position. The following table shows how the table looks like when it runs in research mode.
id position time
245 1 1
246 1 2
247 1 3
... ... ...
269 1 25
270 2 26
271 2 27
... ... ...
Since all the data is consolidated into one single table, we want to find out how many instances there are when COUNT(time) is exactly equal to 1, or we could look for every instance when COUNT(time) is strictly larger than 1.
EDIT2: As a reply to Alan, the suggestion gives me
YWeek Time_Seconds Time_Hours
201352 1 0.0
201352 1 0.0
201352 1 0.0
... ... ...
201352 1 0.0 (1000 row limit)
Whereas my desired output is
Yweek Time_Seconds Time_Hours
201352 2146 35.8
201401 5789 96.5
... ... ...
201419 8924 148.7
EDIT3: I have gathered the tries and the results so far here with a description in gray above the queries.
You might achieve better results by eliminating your sub select:
SELECT YEARWEEK(time) as YWeek,
COUNT(time) As Time_Seconds,
ROUND(COUNT(time)/3600, 1) As Time_Hours
FROM Database.productiondata
WHERE time > '2014-01-01 00:00:00'
GROUP BY YWeek
HAVING count(time) = 1
ORDER BY YWeek;
I'm assuming time has an index on it, but if it does not you could expect a significant improvement in performance by adding one.
UPDATE:
Per the recently added sample data, I'm not sure your approach is correct. The time column appears to be an INT representing seconds while you're treating it as a DATETIME with YEARWEEK. Below I have a working example in T-SQL (SQL Server) that does exactly what you asked IF time is actually a DATETIME column:
DECLARE @table TABLE
(
id INT ,
[position] INT ,
[time] DATETIME
)
INSERT INTO @table
VALUES ( 1, 1, DATEADD(week, -1, GETDATE()) )
INSERT INTO @table
VALUES ( 1, 1, DATEADD(week, -2, GETDATE()) )
INSERT INTO @table
VALUES ( 1, 1, DATEADD(week, -2, GETDATE()) )
INSERT INTO @table
VALUES ( 1, 1, DATEADD(week, -2, GETDATE()) )
INSERT INTO @table
VALUES ( 1, 1, DATEADD(week, -2, GETDATE()) )
INSERT INTO @table
VALUES ( 1, 1, DATEADD(week, -3, GETDATE()) )
INSERT INTO @table
VALUES ( 1, 1, DATEADD(week, -3, GETDATE()) )
SELECT CAST(DATEPART(year, [time]) AS VARCHAR)
+ CAST(DATEPART(week, [time]) AS VARCHAR) AS YWeek ,
COUNT([time]) AS Time_Seconds ,
-- divide by 3600.0 to avoid integer division
ROUND(COUNT([time]) / 3600.0, 1) AS Time_Hours
FROM @table
WHERE [time] > '2014-01-01 00:00:00'
GROUP BY DATEPART(year, [time]) ,
DATEPART(week, [time])
HAVING COUNT([time]) > 0
ORDER BY YWeek;
SELECT pd1.*
FROM Database.productiondata pd1
LEFT JOIN Database.productiondata pd2 ON pd1.time=pd2.time AND pd1.id<pd2.id
WHERE pd1.time > '2014-01-01 00:00:00' AND pd2.time > '2014-01-01 00:00:00'
AND pd2.id IS NULL
You can LEFT JOIN to the same table and keep only the rows with no related row.
UPDATE: The following query works in the SQL Fiddle:
SELECT pd1.* From productiondata pd1
left Join productiondata pd2
ON pd1.time = pd2.time and pd1.id < pd2.id
Where pd1.time > '2014-01-01 00:00:00' and pd2.id IS NULL;
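To compare the approaches discussed in this thread on a tiny data set, here is a sketch using Python's sqlite3 module. The six sample rows are invented (three positions logged in the same second to mimic production mode, three single-row seconds to mimic R&D mode), and strftime('%Y-%W', ...) is only SQLite's rough stand-in for YEARWEEK. As far as I can tell, the self-join with pd1.id < pd2.id keeps one row per timestamp (the one with the highest id), while using <> instead keeps only timestamps that occur exactly once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE productiondata (id INTEGER PRIMARY KEY, position INTEGER, time TEXT)")
conn.executemany(
    "INSERT INTO productiondata VALUES (?, ?, ?)",
    [
        (1, 1, "2014-01-06 10:00:00"),  # production mode: several positions
        (2, 2, "2014-01-06 10:00:00"),  # logged in the same second
        (3, 5, "2014-01-06 10:00:00"),
        (4, 1, "2014-01-06 11:00:00"),  # R&D mode: one row per second
        (5, 1, "2014-01-06 11:00:01"),
        (6, 2, "2014-01-14 09:00:00"),
    ],
)

# Seconds that were logged exactly once, counted per week.
per_week = conn.execute(
    """
    SELECT strftime('%Y-%W', time) AS yweek, COUNT(*) AS time_seconds
    FROM (
        SELECT time FROM productiondata GROUP BY time HAVING COUNT(*) = 1
    ) AS single
    GROUP BY yweek
    ORDER BY yweek
    """
).fetchall()

anti_join = """
    SELECT COUNT(*)
    FROM productiondata pd1
    LEFT JOIN productiondata pd2
      ON pd1.time = pd2.time AND pd1.id {op} pd2.id
    WHERE pd2.id IS NULL
"""
# "<"  keeps the highest id per timestamp (one row per distinct second).
last_per_time = conn.execute(anti_join.format(op="<")).fetchone()[0]
# "<>" keeps only rows whose timestamp has no other row at all.
exactly_once = conn.execute(anti_join.format(op="<>")).fetchone()[0]

print(per_week)
print(last_per_time, exactly_once)  # 4 3
```

On a table with millions of rows, either variant would presumably need an index on time to be practical.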
I am executing a query that obviously contains a subquery in MySQL.
Let me just jump into the code:
SELECT DATEDIFF(CURDATE(),
(SELECT due FROM checkOut JOIN People ON checkOut.p_id = People.p_id
WHERE CASE WHEN DATE_SUB(date_add(CURDATE(),INTERVAL 4 MONTH), INTERVAL 3 MONTH)
>= checkOut.checkTime THEN 1 ELSE 0 END ORDER BY checkOut.due)
);
The main query is the SELECT DATEDIFF(). Within that is my subquery which essentially searches through the table to look for items that are overdue based on an interval. I know that there will be multiple rows returned from the query and that it will not work with how I currently have it set up.
What I want are multiple values to be returned from my SELECT DATEDIFF(), so that I can loop through it with php later. To elaborate, I want each of the rows returned in the subquery to have an associated value from DATEDIFF(). How can I modify this query to do what I want? Or if anyone has a better method, please let me know.
Any help is appreciated.
In case you are wondering the why there is a DATE_ADD() within the DATE_SUB(), it is to simply make the query work for today.
Get rid of the subquery; you can calculate the difference directly.
SELECT DATEDIFF(CURDATE(), due), due
FROM checkOut JOIN People
ON checkOut.p_id = People.p_id
WHERE CASE
WHEN DATE_SUB(date_add(CURDATE(),INTERVAL 4 MONTH), INTERVAL 3 MONTH)
>= checkOut.checkTime
THEN 1
ELSE 0
END
ORDER BY checkOut.due
Use the subquery as a derived table, e.g.:
SELECT DATEDIFF(CURDATE(), d.due)
FROM
(SELECT due FROM checkOut JOIN People ON checkOut.p_id = People.p_id
WHERE CASE WHEN DATE_SUB(date_add(CURDATE(),INTERVAL 4 MONTH), INTERVAL 3 MONTH)
>= checkOut.checkTime THEN 1 ELSE 0 END ORDER BY checkOut.due
) AS d;