MySQL - Count only unique instances between specific dates - mysql

I've been looking at several other SO questions but I could not make out a solution from these. First, the description, then what I'm missing from the other threads. (Heads up: I'm very well aware of the non-normalised structure of our database, which is something I have addressed in meetings before but this is what we have and what I have to work with.)
Background description
We have a machine that manufactures products in 25 positions. These products' production data is being logged in a table that among other things logs current and voltage for every position. This is only logged when the machine is actually producing products (i.e. has a product in the machine). The time where no product is present, nothing is being logged.
This machine can run in two different production modes: full production and R&D production. Full production means that products are being inserted continuously so that every instance has a product at all times (i.e. 25 products are present in the machine at all times). The second mode, R&D production, only produces one product at a time (i.e. one product enters the machine, goes through the 25 instances one by one and when this one is finished, the second product enters the machine).
To clarify: every position logs data once every second whenever a product is present, which means 25 instances per second when full production is running. When R&D mode is running, position 1 will have ~20 instances for 20 consecutive seconds, position 2 will have ~20 instances for the next 20 consecutive seconds and so on.
Table structure
Productiondata:
id (autoincrement)
productID
position
time (timestamp for logged data)
current (amperes)
voltage (volts)
Question
We want to calculate the uptime of the machine, but we want to separate the uptime for production mode and R&D mode, and we want to separate this data on a weekly basis.
Guessed solution
Since we have instances logged every second I can count the amount of DISTINCT instances of time values we have in the table to find out the total uptime for both production and R&D mode. To find the R&D mode, I can safely say that whenever there is a time instance that has only one entry, I'm running in R&D mode (production mode would have 25 instances).
Progress so far
I have the following query which sums up all distinct instances to find both production and R&D mode:
SELECT YEARWEEK(time) AS YWeek, COUNT(DISTINCT time) AS Time_Seconds, ROUND(COUNT(DISTINCT time)/3600, 1) AS Time_Hours
FROM Database.productiondata
WHERE YEARWEEK(time) >= YEARWEEK(curdate()) - 21
GROUP BY YWeek;
This query finds out how many DISTINCT time instances there are in the table and counts the number and groups that by the week.
Problem
The above query counts the amount of instances that exist in the table, but I want to find ONLY the UNIQUE instances. Basically, I'm trying to find something like IF count(time) = 1, then count that instance, IF count(time) > 1 then don't count it at all (DISTINCT still counts this).
I looked at several other SO threads, but almost all explain how to find unique values with DISTINCT, which only accomplishes half of what I'm looking for. The closest I got was this which uses a HAVING clause. I'm currently stuck at the following:
SELECT YEARWEEK(time) as YWeek, COUNT(Distinct time) As Time_Seconds, ROUND(COUNT(Distinct time)/3600, 1) As Time_Hours
FROM
(SELECT * FROM Database.productiondata
WHERE time > '2014-01-01 00:00:00'
GROUP BY time
HAVING count(time) = 1) as temptime
GROUP BY YWeek
ORDER BY YWeek;
The problem here is that the GROUP BY time inside the nested SELECT takes forever (~5 million rows for this year alone, so I can understand that). Syntactically I think this is correct, but it takes forever to execute. Even EXPLAIN for this times out.
And that is where I am. Is this the correct approach or is there any other way that is smarter/requires less query time/avoids the group by time clause?
EDIT: As a sample, we have this table (apologies for formatting, don't know how to make a table format here on SO)
id position time
1 1 1
2 2 1
3 5 1
4 19 1
... ... ...
25 7 1
26 3 2
27 6 2
... ... ...
This table shows what the data looks like during a production run. As you can see, there is no fixed order in which the positions are logged; all 25 positions are logged every second, and rows are added to the table in whatever order the PLC sends the data for each position. The following table shows what the data looks like in research mode.
id position time
245 1 1
246 1 2
247 1 3
... ... ...
269 1 25
270 2 26
271 2 27
... ... ...
Since all the data is consolidated into one single table, we want to find out how many instances there are when COUNT(time) is exactly equal to 1, or we could look for every instance when COUNT(time) is strictly larger than 1.
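The classification described above can be sketched as a two-level aggregation: first count the rows per timestamp, then count, per week, how many timestamps had exactly one row and how many had more. The sketch below is only an illustration: it uses Python's built-in sqlite3 module rather than MySQL, the rows are made up, and strftime('%Y%W', ...) is SQLite's rough stand-in for YEARWEEK(). In MySQL, SUM(n = 1) works the same way because a comparison evaluates to 0 or 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE productiondata (id INTEGER PRIMARY KEY, position INTEGER, time TEXT)"
)

rows = []
# Hypothetical production run: 3 seconds, all 25 positions logged each second.
for sec in range(3):
    for pos in range(1, 26):
        rows.append((pos, f"2014-01-06 08:00:{sec:02d}"))
# Hypothetical R&D run: 4 seconds, a single position logged each second.
for sec in range(4):
    rows.append((1, f"2014-01-06 09:00:{sec:02d}"))
conn.executemany("INSERT INTO productiondata (position, time) VALUES (?, ?)", rows)

query = """
SELECT strftime('%Y%W', time) AS yweek,
       SUM(n = 1) AS rnd_seconds,         -- timestamps logged exactly once
       SUM(n > 1) AS production_seconds   -- timestamps logged more than once
FROM (SELECT time, COUNT(*) AS n
      FROM productiondata
      GROUP BY time) AS per_second
GROUP BY yweek
ORDER BY yweek
"""
result = conn.execute(query).fetchall()
for yweek, rnd_seconds, production_seconds in result:
    print(yweek, rnd_seconds, production_seconds)
```

The inner query selects only time and a count, so the grouping step carries no extra columns.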
EDIT2: As a reply to Alan, the suggestion gives me
YWeek Time_Seconds Time_Hours
201352 1 0.0
201352 1 0.0
201352 1 0.0
... ... ...
201352 1 0.0 (1000 row limit)
Whereas my desired output is
Yweek Time_Seconds Time_Hours
201352 2146 35.8
201401 5789 96.5
... ... ...
201419 8924 148.7
EDIT3: I have gathered the tries and the results so far here with a description in gray above the queries.

You might achieve better results by eliminating your sub select:
SELECT YEARWEEK(time) as YWeek,
COUNT(time) As Time_Seconds,
ROUND(COUNT(time)/3600, 1) As Time_Hours
FROM Database.productiondata
WHERE time > '2014-01-01 00:00:00'
GROUP BY YWeek
HAVING count(time) = 1
ORDER BY YWeek;
I'm assuming time has an index on it, but if it does not you could expect a significant improvement in performance by adding one.
UPDATE:
Per the recently added sample data, I'm not sure your approach is correct. The time column appears to be an INT representing seconds while you're treating it as a DATETIME with YEARWEEK. Below I have a working example in SQL that does exactly what you asked IF time is actually a DATETIME column:
DECLARE @table TABLE
(
id INT ,
[position] INT ,
[time] DATETIME
)
INSERT INTO @table
VALUES ( 1, 1, DATEADD(week, -1, GETDATE()) )
INSERT INTO @table
VALUES ( 1, 1, DATEADD(week, -2, GETDATE()) )
INSERT INTO @table
VALUES ( 1, 1, DATEADD(week, -2, GETDATE()) )
INSERT INTO @table
VALUES ( 1, 1, DATEADD(week, -2, GETDATE()) )
INSERT INTO @table
VALUES ( 1, 1, DATEADD(week, -2, GETDATE()) )
INSERT INTO @table
VALUES ( 1, 1, DATEADD(week, -3, GETDATE()) )
INSERT INTO @table
VALUES ( 1, 1, DATEADD(week, -3, GETDATE()) )
SELECT CAST(DATEPART(year, [time]) AS VARCHAR)
+ CAST(DATEPART(week, [time]) AS VARCHAR) AS YWeek ,
COUNT([time]) AS Time_Seconds ,
-- divide by 3600.0 so the hours are not truncated by integer division
ROUND(COUNT([time]) / 3600.0, 1) AS Time_Hours
FROM @table
WHERE [time] > '2014-01-01 00:00:00'
GROUP BY DATEPART(year, [time]) ,
DATEPART(week, [time])
HAVING COUNT([time]) > 0
ORDER BY YWeek;

SELECT pd1.*
FROM Database.productiondata pd1
LEFT JOIN Database.productiondata pd2 ON pd1.time=pd2.time AND pd1.id<pd2.id
WHERE pd1.time > '2014-01-01 00:00:00' AND pd2.time > '2014-01-01 00:00:00'
AND pd2.id IS NULL
You can LEFT JOIN the table to itself and keep only the rows that have no related row.
UPDATE The query works using the SQL fiddle
SELECT pd1.* From productiondata pd1
left Join productiondata pd2
ON pd1.time = pd2.time and pd1.id < pd2.id
Where pd1.time > '2014-01-01 00:00:00' and pd2.id IS NULL;
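One thing worth checking with a self-join of this kind is which comparison on id is used. The sketch below is an illustration in Python's sqlite3 with made-up rows, not a test against the real table. It suggests that pd1.id < pd2.id keeps the last row of every timestamp, shared timestamps included, while pd1.id <> pd2.id keeps only timestamps that occur exactly once. The helper name anti_join is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE productiondata (id INTEGER PRIMARY KEY, position INTEGER, time INTEGER)"
)
# Hypothetical rows: time 1 is logged by three positions, times 2 and 3 by one each.
conn.executemany(
    "INSERT INTO productiondata (id, position, time) VALUES (?, ?, ?)",
    [(1, 1, 1), (2, 2, 1), (3, 5, 1), (4, 1, 2), (5, 1, 3)],
)


def anti_join(comparison):
    """Return the time values that survive the LEFT JOIN ... IS NULL filter.

    comparison must be "<" or "<>"; it is only ever passed as a fixed literal.
    """
    sql = f"""
        SELECT pd1.time
        FROM productiondata pd1
        LEFT JOIN productiondata pd2
               ON pd1.time = pd2.time AND pd1.id {comparison} pd2.id
        WHERE pd2.id IS NULL
        ORDER BY pd1.time
    """
    return [row[0] for row in conn.execute(sql)]


last_per_time = anti_join("<")   # one row per timestamp, shared ones included
unique_only = anti_join("<>")    # only timestamps that were logged exactly once
print(last_per_time)  # [1, 2, 3]
print(unique_only)    # [2, 3]
```

If the goal is to count R&D seconds only, the second form matches the "COUNT(time) = 1" requirement from the question.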

Related

SQL Query to get distinct values from a table and the difference between ordered rows

I have a real time data table with time stamps for different data points
Time_stamp, UID, Parameter1, Parameter2, ....
I have 400 UIDs so each time_stamp is repeated 400 times
I want to write a query that uses this table to check if the real time data flow to the SQL database is working as expected - new timestamp every 5 minute should be available
For this what I usually do is query the DISTINCT values of time_stamp in the table and order descending - do a visual inspection and copy to excel to calculate the difference in minutes between subsequent distinct time_stamp
Any difference over 5 min means I have a problem. I am trying to figure out how I can do something similar in SQL, maybe get a table that looks like this. Tried to use LEAD and DISTINCT together but could not write the code myself, im just getting started on SQL
Time_stamp, LEAD over last timestamp
Thank you for your help
You can use lag analytical function as follows:
select t.*
from (select t.*,
lag(Time_stamp) over (order by Time_stamp) as lg_ts
from your_Table t) t
where timestampdiff(minute, lg_ts, Time_stamp) > 5
Or you can also use the not exists as follows:
select t.*
from your_table t
where not exists
(select 1 from your_table tt
where tt.Time_stamp < t.Time_stamp
and timestampdiff(minute, tt.Time_stamp, t.Time_stamp) <= 5)
and t.Time_stamp <> (select min(tt.Time_stamp) from your_table tt)
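Because every timestamp is repeated once per UID, it may help to reduce the table to its distinct timestamps before applying lag(). The following is a minimal sketch of that idea using Python's sqlite3 (window functions need SQLite 3.25 or newer). The table name readings, the sample rows, and the use of strftime('%s', ...) to get seconds are assumptions for the illustration, not part of the original answers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (time_stamp TEXT, uid INTEGER)")
# Hypothetical feed: three UIDs per timestamp, with one 10-minute gap.
stamps = [
    "2021-01-01 00:00:00",
    "2021-01-01 00:05:00",
    "2021-01-01 00:15:00",
    "2021-01-01 00:20:00",
]
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(ts, uid) for ts in stamps for uid in range(1, 4)],
)

query = """
WITH distinct_ts AS (
    SELECT DISTINCT time_stamp FROM readings
),
with_prev AS (
    SELECT time_stamp,
           LAG(time_stamp) OVER (ORDER BY time_stamp) AS prev_ts
    FROM distinct_ts
)
SELECT prev_ts, time_stamp,
       CAST(strftime('%s', time_stamp) AS INTEGER)
       - CAST(strftime('%s', prev_ts) AS INTEGER) AS gap_seconds
FROM with_prev
WHERE CAST(strftime('%s', time_stamp) AS INTEGER)
      - CAST(strftime('%s', prev_ts) AS INTEGER) > 300
ORDER BY time_stamp
"""
gaps = conn.execute(query).fetchall()
for prev_ts, time_stamp, gap_seconds in gaps:
    print(prev_ts, "->", time_stamp, gap_seconds, "seconds")
```

The first timestamp has no predecessor, so its gap is NULL and the WHERE clause drops it without a special case.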
lead() or lag() is the right approach (depending on whether you want to see the row at the start or end of the gap).
For the time comparison, I recommend direct comparisons:
select t.*
from (select t.*,
lead(Time_stamp) over (partition by uid order by Time_stamp) as next_time_stamp
from t
) t
where next_time_stamp > time_stamp + interval 5 minute;
Note: exactly 5 minutes seems unlikely. You might want a fudge factor such as:
where next_time_stamp > time_stamp + interval 5*60 + 10 second;
timestampdiff() counts the number of "boundaries" between two values. So, the difference in minutes between 00:00:59 and 00:01:02 is 1. And the difference between 00:00:00 and 00:00:59 is 0.
So, a difference of "5 minutes" could really be 4 minutes and 1 second or could be 5 minutes and 59 seconds.
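The difference between counting minute boundaries and measuring elapsed time can be checked with plain arithmetic. The sketch below is in Python and is independent of any database; the function names are made up for the illustration. Whether a particular engine's date-difference function counts boundaries or whole elapsed units is worth confirming in its documentation, which is one more reason to prefer the direct comparison.

```python
from datetime import datetime


def minute_boundaries(start, end):
    """Count the minute boundaries crossed between start and end."""
    s = start.replace(second=0, microsecond=0)
    e = end.replace(second=0, microsecond=0)
    return int((e - s).total_seconds() // 60)


def elapsed_seconds(start, end):
    """Measure the actual elapsed time in seconds."""
    return int((end - start).total_seconds())


a = datetime(2021, 1, 1, 0, 0, 59)
b = datetime(2021, 1, 1, 0, 1, 2)
c = datetime(2021, 1, 1, 0, 0, 0)

print(minute_boundaries(a, b), elapsed_seconds(a, b))  # 1 boundary, 3 seconds
print(minute_boundaries(c, a), elapsed_seconds(c, a))  # 0 boundaries, 59 seconds
```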

SQL Query - Find out how many times a row changes from 0 to another value

I am using MySQL 8 and need to create a stored procedure
I have a single table that has a DATE field and a value field which can be 0 or any other number. This value field represents the daily amount of rain for that day.
The table stores data between today and 10 years.
I need to find out how many periods of rain there will be in the next 10 years.
So, for example, if my table contains the following data:
Date - Value
2018-06-09 - 0
2018-06-10 - 50
2018-06-11 - 0
2018-06-12 - 15
2018-06-13 - 17
2018-06-14 - 0
2018-06-15 - 0
2018-06-16 - 12
2018-06-17 - 123
2018-06-18 - 17
Then the SP should return 3, because there were 3 periods of rain.
Any help in getting me closer to the answer will be appreciated!
You don't need to have a stored procedure for this.
Here is a solution using the LEAD function from MySQL 8.0; it also supports dates with gaps.
The complete table needs to be scanned, but I don't think that is a huge problem with ~3650 records.
Query
SELECT
SUM(filter_match = 1) AS number
FROM (
SELECT
((t.value = 0) AND (LEAD(t.value) OVER (ORDER BY t.date ASC) != 0)) AS filter_match
FROM
t
) t
see demo https://www.db-fiddle.com/f/sev4NqgLsFPgtNgwzruwy/2
By the way, would you mind expanding your answer to understand how
LEAD and SUM work together?
LEAD(t.value) OVER (ORDER BY t.date ASC) simply means: get the value from the next record, ordered by date.
this demo shows it nicely https://www.db-fiddle.com/f/sev4NqgLsFPgtNgwzruwy/6
SUM(filter_match = 1) is a conditional sum: a row is counted only when filter_match is true.
see what filter_match is demo https://www.db-fiddle.com/f/sev4NqgLsFPgtNgwzruwy/8
In MySQL, a comparison such as 1 = 1 evaluates to 1 (true) and 1 = 0 evaluates to 0 (false), so an aggregate function can take such an expression as its argument.
The conditional sum therefore only adds 1 for the rows where the condition is true.
see demo https://www.db-fiddle.com/f/sev4NqgLsFPgtNgwzruwy/7
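To see the per-row flags and their sum side by side without opening the fiddles, here is a small sketch that runs the same idea on the sample data from the question. It uses Python's sqlite3 (SQLite 3.25 or newer for LEAD) instead of MySQL 8, so treat it as an approximation of the query above rather than a copy of it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (date TEXT PRIMARY KEY, value INTEGER)")
sample = [
    ("2018-06-09", 0), ("2018-06-10", 50), ("2018-06-11", 0),
    ("2018-06-12", 15), ("2018-06-13", 17), ("2018-06-14", 0),
    ("2018-06-15", 0), ("2018-06-16", 12), ("2018-06-17", 123),
    ("2018-06-18", 17),
]
conn.executemany("INSERT INTO t VALUES (?, ?)", sample)

# Per-row flag: 1 when a dry day is directly followed by a rainy day.
flags = [
    row[0]
    for row in conn.execute(
        """
        SELECT (value = 0) AND (LEAD(value) OVER (ORDER BY date) != 0) AS filter_match
        FROM t
        ORDER BY date
        """
    )
]

# Conditional sum: adding up the 0/1 flags counts the rain periods.
periods = conn.execute(
    """
    SELECT SUM(filter_match)
    FROM (SELECT (value = 0) AND (LEAD(value) OVER (ORDER BY date) != 0) AS filter_match
          FROM t) AS flagged
    """
).fetchone()[0]

print(flags)    # [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
print(periods)  # 3
```

Note that this counts transitions from a dry day to a rainy day, so a rain period that is already under way on the very first row would not be counted.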
Use MySQL join:
SELECT COUNT(*) Number_of_Periods
FROM yourTable A JOIN yourTable B
ON DATE(A.`DATE`)=DATE(B.`DATE` - INTERVAL 1 DAY)
AND A.`VALUE`=0 AND B.`VALUE`>0;
See Demo on DB Fiddle.

Count consecutive row occurrences

I have a MySQL table with three columns: takenOn (datetime - primary key), sleepDay (date), and type (int). This table contains my sleep data from when I go to bed to when I get up (at a minute interval).
As an example, if I go to bed on Oct 29th at 11:00pm and get up on Oct 30th at 6:00am, I will have 420 records (7 hours * 60 minutes). takenOn will range from 2016-10-29 23:00:00 to 2016-10-30 06:00:00. sleepDay will be 2016-10-30 for all 420 records. type is the "quality" of my sleep (1=asleep, 2=restless, 3=awake). I'm trying to get how many times I was restless/awake, which can be calculated by counting how many times I see type=2 (or type=3) consecutively.
So far, I have the following query, which works for one day only. Is this the correct/"efficient" way of doing this (as this method requires that I have the data without any "gaps" in takenOn)? Also, how can I expand it to calculate for all possible sleepDays?
SELECT
sleepDay,
SUM(CASE WHEN type = 2 THEN 1 ELSE 0 END) AS TimesRestless,
SUM(CASE WHEN type = 3 THEN 1 ELSE 0 END) AS TimesAwake
FROM
(SELECT s1.sleepDay, s1.type
FROM sleep s1
LEFT JOIN sleep s2
ON s2.takenOn = ADDTIME(s1.takenOn, '00:01:00')
WHERE
(s2.type <> s1.type OR s2.takenOn IS NULL)
AND s1.sleepDay = '2016-10-30'
ORDER BY s1.takenOn) a
I have created an SQL Fiddle - http://sqlfiddle.com/#!9/b33b4/3
Thank you!
Your own solution is quite alright, given the assumptions you are aware of.
I present here an alternative solution, that will deal well with gaps in the series, and can be used for more than one day at a time.
The downside is that it relies more heavily on non-standard MySQL features (inline use of variables):
select sleepDay,
sum(type = 2) TimesRestless,
sum(type = 3) TimesAwake
from (
select @lagDay as lagDay,
@lagType as lagType,
@lagDay := sleepDay as sleepDay,
@lagType := type as type
from (select * from sleep order by takenOn) s1,
(select @lagDay := '',
@lagType := '') init
) s2
where lagDay <> sleepDay
or lagType <> type
group by sleepDay
To see how it works it can help to select the second select statement on its own. The inner-most select must have the order by clause to make sure the middle query will process the records in that order, which is important for the variable assignments that happen there.
See your updated SQL fiddle.
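The logic behind the variables can also be pictured outside SQL: walk the rows in takenOn order and start a new run whenever the sleepDay or the type differs from the previous row. The sketch below expresses that with Python's itertools.groupby on made-up rows; it illustrates the counting rule and is not a replacement for the query. The function name count_runs is hypothetical.

```python
from itertools import groupby

# Hypothetical rows, already sorted by takenOn: (takenOn, sleepDay, type)
rows = [
    ("2016-10-29 23:00", "2016-10-30", 1),
    ("2016-10-29 23:01", "2016-10-30", 2),
    ("2016-10-29 23:02", "2016-10-30", 2),
    ("2016-10-29 23:03", "2016-10-30", 1),
    ("2016-10-29 23:04", "2016-10-30", 3),
    ("2016-10-29 23:05", "2016-10-30", 2),
    ("2016-10-30 23:00", "2016-10-31", 2),
    ("2016-10-30 23:01", "2016-10-31", 2),
    ("2016-10-30 23:02", "2016-10-31", 3),
]


def count_runs(sorted_rows):
    """Count consecutive runs of type 2 (restless) and type 3 (awake) per sleepDay."""
    counts = {}
    # groupby starts a new group whenever (sleepDay, type) changes,
    # which mirrors the "lagDay <> sleepDay or lagType <> type" filter.
    for (day, kind), _run in groupby(sorted_rows, key=lambda r: (r[1], r[2])):
        per_day = counts.setdefault(day, {"restless": 0, "awake": 0})
        if kind == 2:
            per_day["restless"] += 1
        elif kind == 3:
            per_day["awake"] += 1
    return counts


runs = count_runs(rows)
print(runs)
```

Because only neighbouring rows are compared, a missing minute in takenOn does not break the count, which is the property the answer above highlights.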

TSQL SELECT based on a condition

I am trying to do a select from CTE based on a condition.
There is a variable I've declared for today's period (@PRD). It holds the value of what period we are currently in.
Now I would like to do a selection from a table that will restrict what information is returned based on whether we are in the first half of the year or not.
For instance, we are in period 2 so I want everything returned from my CTE which falls between PRD 1 and 5. IF we were in say period 6 (after 5), then yes I'd want everything returned from the table.
This is the pseudocode of what I'm trying to accomplish:
SELECT
CASE
WHEN @PRD <= 5
THEN (SELECT * FROM DISPLAY WHERE PERIOD IN (1,2,3,4,5))
ELSE (SELECT * FROM DISPLAY)
END
I'm getting an error:
Only one expression can be specified in the select list when the subquery is not introduced with EXISTS.
Please any thoughts on how I can do this?
Thanks x
EDITED/UPDATED:
More of the code involves a CTE and is really long. Bottom line is lets say I have this CTE
;WITH DISPLAY as (
select * from lots_of_things
)
SELECT * FROM DISPLAY
Having done a regular select on this CTE, it returns data that looks like this:
PERIOD (INT) DEPARTMENT GROUP BUDGET
1 ENERGY HE 500
2 ENERGY HE 780
3 ENERGY HE 1500
4 ENERGY HE 4500
5 ENERGY HE 400
6 ENERGY HE 3500
7 ENERGY HE 940
8 ENERGY HE 1200
I want it to show me just the top 5 rows if the current period is 1, 2, 3, 4 or 5, but to display ALL table rows if we are in any other period, like 6, 7, 8, 9 and onwards. The current period is held in the variable @PRD, which is derived by comparing today's date with ranges held in a table. The value is accurate and is of type INT.
Hope this helps
SQL FIDDLE
This will work:
SELECT * FROM DISPLAY WHERE (@PRD > 5 OR PERIOD IN (1, 2, 3, 4, 5))
If this code confuses you, what's happening is that we check whether @PRD > 5; if that is true, the whole expression is true for every row, so we return all the rows.
If the variable is less than or equal to 5 (as in your example), the first check is false, so we then check whether the period is in the list.
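The behaviour of that single WHERE clause can be tried out quickly. The sketch below uses Python's sqlite3 with a named parameter standing in for the T-SQL period variable, and loads the sample rows from the question; the table layout and the helper name rows_for are assumptions for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE display (period INTEGER, department TEXT, grp TEXT, budget INTEGER)"
)
budgets = [500, 780, 1500, 4500, 400, 3500, 940, 1200]
# enumerate(..., start=1) yields (period, budget) pairs for periods 1..8.
conn.executemany(
    "INSERT INTO display VALUES (?, 'ENERGY', 'HE', ?)",
    list(enumerate(budgets, start=1)),
)


def rows_for(prd):
    """Return the periods selected when the current period is prd."""
    sql = """
        SELECT period
        FROM display
        WHERE (:prd > 5 OR period IN (1, 2, 3, 4, 5))
        ORDER BY period
    """
    return [row[0] for row in conn.execute(sql, {"prd": prd})]


print(rows_for(2))  # [1, 2, 3, 4, 5]
print(rows_for(6))  # [1, 2, 3, 4, 5, 6, 7, 8]
```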
This might be a solution:
IF @PRD <= 5
SELECT * FROM DISPLAY WHERE PERIOD IN (1,2,3,4,5)
ELSE
SELECT * FROM DISPLAY
UPD
In this case you should use variable instead of CTE, if it's possible.
DECLARE @PRD INT;
SELECT @PRD = PERIOD FROM SOME_TABLE WHERE ...

Query a MySQL Database and Group By Date Range to Create a Chart

I'm looking to create the following chart from a MySQL database. I know how to actually create the chart (using excel or similar program), my problem is how to get the data needed to create the chart. In this example, I can see that on January 1, 60 tickets were in the state illustrated by the green line.
I need to track the historical state of tickets of a project through a date range. The date range is determined by a project manager (in this case it's January 1st through January 9th).
For each ticket, I have the following set of historical data. Each time something changes in the ticket (state, description, assignee, customer update, and other attributes not shown in this problem), a "timestamp" entry is made in the database.
ticket_num status_changed_date from_state to_state
123456 2011-01-01 18:03:44 -- 1
123456 2011-01-01 18:10:26 1 2
123456 2011-01-01 14:37:10 2 2
123456 2011-01-02 07:55:44 2 3
123456 2011-01-03 06:12:18 3 2
123456 2011-01-04 19:03:43 3 3
123456 2011-01-05 02:05:24 3 4
123456 2011-01-06 18:13:28 4 4
123456 2011-01-07 13:14:48 4 5
123456 2011-01-09 01:35:39 5 5
How can I query the database for a given time (determined by my script) and find out what state each of the tickets are in?
For example: To produce the chart shown above, given the date 2011-01-02 12:00:00, how many tickets were in the state "2"?
I've tried querying the database with specific dates and ranges, but can't figure out the proper way to get the data to create the chart. Thanks in advance for any help.
I'm not exactly sure I know what you want. But . . .
Assuming a table definition like:
create table ticket_data (ticket_num int,
status_changed_date datetime,
from_state int,
to_state int);
The following, for example would give you the number of values per day:
select date(status_changed_date) as status_date, count(*)
from ticket_data
group by status_date;
Now, if you want just from_state = 2, add a where clause to that effect. If you want just the ones on Jan 2, then add where date(status_changed_date) = '2011-01-02'
Or, if you're looking for the distinct number of tickets per day, change count(*) to count(distinct ticket_num)
Is this what you're asking? SQL Fiddle here
Ok so if you are trying to get a count of records in a certain state at a certain time, I think a stored proc might be necessary.
CREATE PROCEDURE spStatesAtDate
@Date datetime,
@StateId int
AS
BEGIN
SET NOCOUNT ON;
SELECT COUNT(*) as Count
FROM ticket_table t1
WHERE to_state = @StateId AND status_changed_date < @Date
AND status_changed_date = (SELECT MAX(status_changed_date) FROM ticket_table t2 where t2.ticket_num=t1.ticket_num AND status_changed_date < @Date)
END
then to call this for the above example, your query would look like
EXEC spStatesAtDate @Date='2011-01-02 12:00:00', @StateId=2
You can use a subquery to select the last modification date before a given point grouped by ticket_num and then select the states at this time.
SELECT
ticket_num,
to_state,
status_changed_date
FROM
tickets
WHERE
status_changed_date IN (
SELECT MAX(status_changed_date)
FROM tickets
WHERE status_changed_date < '2012-02-01 01:00:00'
GROUP BY ticket_num
)
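A possible refinement, sketched here as an assumption rather than as part of the answer above: join on both ticket_num and the latest change date, so that a timestamp belonging to one ticket cannot accidentally match an older row of another ticket, and then count the tickets per state. The example uses Python's sqlite3 with made-up rows for three tickets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tickets (ticket_num INTEGER, status_changed_date TEXT, "
    "from_state INTEGER, to_state INTEGER)"
)
# Hypothetical history for three tickets.
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?, ?, ?)",
    [
        (1, "2011-01-01 18:03:44", None, 1),
        (1, "2011-01-01 18:10:26", 1, 2),
        (1, "2011-01-02 07:55:44", 2, 3),
        (2, "2011-01-01 09:00:00", None, 1),
        (2, "2011-01-02 07:55:44", 1, 2),
        (2, "2011-01-03 10:00:00", 2, 3),
        (3, "2011-01-02 11:00:00", None, 2),
    ],
)

query = """
SELECT t.to_state, COUNT(*) AS tickets
FROM tickets t
JOIN (SELECT ticket_num, MAX(status_changed_date) AS last_change
      FROM tickets
      WHERE status_changed_date < :at
      GROUP BY ticket_num) latest
  ON latest.ticket_num = t.ticket_num
 AND latest.last_change = t.status_changed_date
GROUP BY t.to_state
ORDER BY t.to_state
"""
state_counts = conn.execute(query, {"at": "2011-01-02 12:00:00"}).fetchall()
print(state_counts)  # [(2, 2), (3, 1)]
```

Running the same query once per point on the chart's x-axis would give the series for each state; the timestamps compare correctly as text here only because they are stored in ISO format.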
It all boils down to a common question: how to get a list of items and their most recent statuses. Given one issue, we can get its most recent status with this query:
select to_state
from ticket_states
where ticket_num = t.ticket_num
order by status_changed_date desc
limit 1
Next, we need to get all applicable distinct issue ids, which is a simple distinct select:
select distinct ticket_num from ticket_states
With these two subqueries we can already start building. For example, current list of issues and their latest statuses before specified date would be:
select t.ticket_num
, (select to_state
from ticket_states
where ticket_num = t.ticket_num
and status_changed_date <= '2012-01-01'
order by status_changed_date desc
limit 1) as last_state
from (select distinct ticket_num
from ticket_states) t;
All issues that did not exist at the specified time will have last_state set to null.
This probably isn't the best way of doing this, but it is the first that came to mind. I'll leave the rest to you. I should also mention that this is not a very efficient solution.