I have a stream of events being logged every 5 minutes throughout the day in a MySQL DB. I need to identify the first event of the day (the first row where logid > 0) as well as the last (the first row where logid = 0 again), but I'm struggling to find a simple SQL solution.
A 0 is stored in the logid field in every row from midnight until the first event is triggered, at which point the value changes to a number > 0. Various events then log numbers > 0 for the remainder of the day; after the last one, the field is logged as 0 again until midnight, when the process starts over.
Is there a quick and simple way to pull the row identifying when the events start, and another showing when they end?
CREATE TABLE logs(
id INT AUTO_INCREMENT,
date DATETIME,
logid INT,
PRIMARY KEY (id)
) ENGINE=INNODB;
This is the test data:
id date logid
1 2018-11-12 01:05:00 0
2 2018-11-12 01:10:00 0
3 2018-11-12 01:15:00 0
4 2018-11-12 01:20:00 0
5 2018-11-12 01:25:00 0
…
84 2018-11-12 06:35:00 0
85 2018-11-12 06:35:00 1
86 2018-11-12 06:40:00 1
87 2018-11-12 06:45:00 1
88 2018-11-12 06:50:00 1
…
164 2018-11-12 15:20:00 1
165 2018-11-12 15:25:00 0
166 2018-11-12 15:30:00 0
167 2018-11-12 15:35:00 0
Desired Result set:
85 2018-11-12 06:35:00 1
165 2018-11-12 15:25:00 0
I'm not concerned about logid up until the first instance where it is greater than 0. But I need to identify the first instance where logid > 0, and then the next chronological instance where logid = 0 again.
My primary attempt was to group and order on the date and logid (edit: failed attempt removed for clarity)
Here's my latest attempt:
(SELECT *
FROM logs
WHERE logid>0
GROUP BY date
ORDER BY date
limit 1
)UNION ALL(
SELECT *
FROM logs
WHERE logid>0
GROUP BY date
ORDER BY date DESC
limit 1)
Getting closer, but not quite there. This gives me the correct first row where logid = 1, but it gives me the last row where logid = 1 (id 164) rather than the following row where logid = 0 (id=165).
Is it possible to select the penultimate row of a set if I change limit 1 to 2?
Any other pointers to keep me moving forward?
This question doesn't seem to be a problem for others, but I thought I would post the answer I came up with in case anyone runs into a similar situation in the future.
SET @v1 := (SELECT date
            FROM logs
            WHERE logid > 0
            ORDER BY date
            LIMIT 1);

(SELECT *
 FROM logs
 WHERE date >= @v1 AND logid > 0
 ORDER BY date
 LIMIT 1
) UNION ALL (
 SELECT *
 FROM logs
 WHERE date > @v1 AND logid = 0
 ORDER BY date
 LIMIT 1
);
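For reference, the same two rows can also come from a single statement, replacing the session variable with a scalar subquery (a sketch against the logs table above; it assumes the table holds a single day, as in the test data):

(SELECT *
 FROM logs
 WHERE logid > 0
 ORDER BY date  -- first row where the events have started
 LIMIT 1)
UNION ALL
(SELECT *
 FROM logs
 WHERE logid = 0
   AND date > (SELECT MIN(date) FROM logs WHERE logid > 0)
 ORDER BY date  -- first 0 row after the events began
 LIMIT 1);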
I have a table arranged as shown below. I need to get the average time each step takes, and I am using the updated_at column to do so.
workflow_id  step_name         created_at           updated_at
25           data_capture      2022-03-21 10:20:34  2022-03-21 10:20:55
25           client_signature  2022-03-21 10:20:34  2022-03-22 17:25:15
25           pm_signature      2022-03-21 10:20:34  2022-03-22 23:05:12
105          data_capture      2022-03-24 05:20:34  2022-03-24 10:20:55
105          client_signature  2022-03-24 05:20:34  2022-03-24 17:25:15
105          pm_signature      2022-03-24 05:20:34  2022-03-24 23:05:12
My issue is that the query I have subtracts across rows 3 and 4, which belong to unrelated workflows, making the time spent on data_capture for workflow 105 incorrect.
I have tried using LAG, LEAD, and PARTITION BY to build the new column, but I have not found a way to keep the subtractions separated by workflow ID. Here is how the calculation should look:
updated_at from row 1 - created_at for row 1 = time spent on data_capture
updated_at from row 2 - value from step 1 = time spent on client_signature
updated_at from row 3 - value from step 2 = time spent on pm_signature
process restarts for workflow 105
Below are some of my attempts
WITH tb1 AS (
SELECT
workflow_id,
workflow_type,
step_name,
step_status,
step_created_at,
step_updated_at,
TIMESTAMPDIFF(SECOND, updated_time, step_updated_at) / 3600 elapsed_time
FROM (
SELECT *,
LAG(step_updated_at) OVER (ORDER BY step_updated_at) updated_time
FROM view_client_workflow_status
) s
)
SELECT
step_name,
AVG(elapsed_time)
FROM tb1
WHERE workflow_type = 'client_onboarding_fcc_ip'
GROUP BY 1;
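A hedged fix for the attempt above (a sketch, not a definitive answer): partition the LAG by workflow_id so rows from different workflows never feed each other's subtraction, and fall back to the row's own start time for the first step. The view name and step_* columns are taken from the attempt; it assumes MySQL 8+, which the use of LAG already implies.

WITH tb1 AS (
    SELECT
        workflow_id,
        step_name,
        TIMESTAMPDIFF(
            SECOND,
            -- previous step's end within the SAME workflow;
            -- the first step falls back to its own start time
            COALESCE(
                LAG(step_updated_at) OVER (PARTITION BY workflow_id ORDER BY step_updated_at),
                step_created_at
            ),
            step_updated_at
        ) / 3600 AS elapsed_time
    FROM view_client_workflow_status
    WHERE workflow_type = 'client_onboarding_fcc_ip'
)
SELECT step_name, AVG(elapsed_time) AS avg_hours
FROM tb1
GROUP BY step_name;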
I have two tables in my schema. The first contains a list of recurring appointments - default_appointments. The second table is actual_appointments - these can be generated from the defaults or created individually, so they are not necessarily linked to any default entry.
Example:
default_appointments

id  day_of_week  user_id  appointment_start_time  appointment_end_time
1   1            1        10:00:00                16:00:00
2   4            1        11:30:00                17:30:00
3   6            5        09:00:00                17:00:00
actual_appointments

id  default_appointment_id  user_id  appointment_start    appointment_end
1   1                       1        2021-09-13 10:00:00  2021-09-13 16:00:00
2   NULL                    1        2021-09-13 11:30:00  2021-09-13 13:30:00
3   6                       5        2021-09-18 09:00:00  2021-09-18 17:00:00
I'm looking to compare the total minutes that were scheduled (from the defaults) against the total that were actually created/generated. So ultimately I'd end up with a query result like this:
user_id  appointment_date  total_planned_minutes  total_actual_minutes
1        2021-09-13        360                    480
1        2021-09-16        360                    0
5        2021-09-18        480                    480
What would be the best approach here? Hopefully the above makes sense.
Edit
OK so the default_appointments table contains all appointments that are "standard" and are automatically generated. These are the appointments that "should" happen every week. So e.g. ID 1: this appointment should occur between 10am and 4pm every Monday. ID 2 should occur between 11:30am and 5:30pm every Thursday.
The actual_appointments table contains a list of all of the appointments which did actually occur. Basically, a default_appointment automatically generates an instance of itself in the actual_appointments table when initially set up. A matching default_appointment_id indicates that the row links to a default and has not been changed - the times on both will remain the same. The user is free to change these appointments that have been generated by a default, which sets the default_appointment_id to NULL - or they can add new appointments unrelated to a default.
So, if on a Monday (day_of_week = 1) I should normally have a default appointment at 10am - 4pm, the total minutes planned based on the defaults are 360, regardless of what's in the actual_appointments table; I should be planned for those 360 minutes every Monday without fail. If in the system I say that I actually didn't have an appointment from 10am - 4pm and instead change it to 10am - 2pm, the actual_appointments table will then contain the actual time for the day, and the actual minutes appointed would be 240.
What I need is to group each of these by the date and user to understand how much time the user had planned for appointments in the default_appointments table vs how much they actually appointed.
Adjusted based on new detail in the question.
Note: I used day_of_week values compatible with default MySQL behavior, where Monday = 2.
The first CTE term (args) provides the search parameters, start date and number of days. The second CTE term (drange) calculates the dates in the range to allow generation of the scheduled appointments within that range.
allrows combines the scheduled and actual appointments via UNION to prepare for aggregation. There are other ways to set this up.
Finally, we aggregate the results per user_id and date.
The working test case (updated):
WITH RECURSIVE args (startdate, days) AS (
SELECT DATE('2021-09-13'), 7
)
, drange (adate, days) AS (
SELECT startdate, days-1 FROM args UNION ALL
SELECT adate + INTERVAL '1' DAY, days-1 FROM drange WHERE days > 0
)
, allrows AS (
SELECT da.user_id
, dr.adate
, ROUND(TIME_TO_SEC(TIMEDIFF(da.appointment_end_time, da.appointment_start_time))/60, 0) AS planned
, 0 AS actual
FROM drange AS dr
JOIN default_appointments AS da
ON da.day_of_week = dayofweek(adate)
UNION
SELECT user_id
, DATE(appointment_start) AS xdate
, 0 AS planned
, TIMESTAMPDIFF(MINUTE, appointment_start, appointment_end)
FROM drange AS dr
JOIN actual_appointments aa
ON DATE(appointment_start) = dr.adate
)
SELECT user_id, adate
, SUM(planned) AS planned
, SUM(actual) AS actual
FROM allrows
GROUP BY adate, user_id
;
Result:
+---------+------------+---------+--------+
| user_id | adate | planned | actual |
+---------+------------+---------+--------+
| 1 | 2021-09-13 | 360 | 480 |
| 1 | 2021-09-16 | 360 | 0 |
| 5 | 2021-09-18 | 480 | 480 |
+---------+------------+---------+--------+
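If you keep the question's original convention instead (Monday = 1 through Saturday = 6), a hedged variant of the join condition in allrows could map MySQL's DAYOFWEEK() accordingly; note that Sunday (DAYOFWEEK() = 1) would need special handling:

ON da.day_of_week = DAYOFWEEK(dr.adate) - 1  -- Monday = 1 ... Saturday = 6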
I need to extract and migrate values from one table to another. The source table contains summarized values with a specific effective date. If any of the component values change, a new line is written, with the data valid starting at that effective date.
source_id  entity_id  effective_date  component_1  component_2  component_3
int(ai)    int        date            int          int          int
1          159        2020-01-01      100          0            90
2          159        2020-05-01      140          50           90
3          159        2020-08-01      0            30           90
5          159        2020-12-01      0            30           50
I now need to migrate this data to a new table like the one below. The goal is that, when selecting data for a given month, the result is the data valid for that month.
id       source_id  entity_id  startdate  enddate  component_type  value
int(ai)  int        int        date       date     int             int
Each row represents a value for one component, valid for a period of months.
I run the insert for each effective month by setting the month as a parameter.
I insert value changes as new rows to the table and prevent duplicates by using a unique key (entity_id, startdate, component_type).
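For reference, a DDL sketch of the target table as described above (an assumption on my part; the unique key is what prevents a month's run from inserting duplicates, and startdate carries the source's effective date):

CREATE TABLE component_final (
    id INT AUTO_INCREMENT,
    source_id INT,
    entity_id INT,
    startdate DATE,      -- carries the source row's effective_date
    enddate DATE,        -- NULL while the period is still open
    component_type INT,
    value INT,
    PRIMARY KEY (id),
    UNIQUE KEY uq_period (entity_id, startdate, component_type)
) ENGINE=INNODB;

Note that with a plain INSERT a duplicate key aborts the statement, so re-running a month would need INSERT IGNORE.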
SET @effective_date = '2020-01-01';
INSERT INTO component_final
SELECT NULL,
       source_id,
       entity_id,
       effective_date,
       NULL,
       1,
       component_1
FROM component_source
WHERE effective_date = @effective_date
  AND component_1 > 0;
After migrating the first row, the result should be:
id  source_id  entity_id  startdate   enddate  component_type  value
1   1          159        2020-01-01  NULL     1               100
2   1          159        2020-01-01  NULL     3               90
SET @effective_date = '2020-05-01';
INSERT INTO component_final
SELECT NULL,
       source_id,
       entity_id,
       effective_date,
       NULL,
       1,
       component_1
FROM component_source
WHERE effective_date = @effective_date
  AND component_1 > 0;
After migrating the second row, the result should be:
id  source_id  entity_id  startdate   enddate     component_type  value
1   1          159        2020-01-01  2020-04-30  1               100
2   1          159        2020-01-01  NULL        3               90
3   2          159        2020-05-01  NULL        1               140
4   2          159        2020-05-01  NULL        2               50
So if there is a value change in the future, an end date has to be set on the previous row.
I'm not able to do this second step - updating the existing data when a component changes in the future.
Maybe it is possible to do it with a trigger, after inserting a new row with the same entity and component - but I was not able to make it work.
Any ideas? I want to handle this entirely inside MySQL.
You do not need the column enddate in the table component_final, because its value depends on other values in the same table:
SELECT
id,
source_id,
entity_id,
startdate,
( SELECT DATE_ADD(MIN(cf2.startdate),INTERVAL -1 DAY)
FROM component_final cf2
WHERE cf2.startdate > cf1.startdate
AND cf2.source_id = cf1.source_id
AND cf2.entity_id = cf1.entity_id
) as enddate,
component_type,
value
FROM component_final cf1;
I understand that the core issue is how to find the source_ids where a component changes (0 means a removal, so we don't want these entries in the result) and how to assign the respective end dates at the same time. For the sake of illustration I simplify your example a bit:
There is only one component_type (I take into account that there might then be consecutive entries with unchanged value)
There is only one entity_id, so we can ignore it.
It should be easy to extend this simpler version to your real-world problem.
Here is an example input:
source_id  effective_date  value
1          2020-01-01      100
2          2020-01-03      100
3          2020-01-05      80
4          2020-01-10      0
5          2020-01-12      30
I would expect the following output to be generated:
source_id  start_date  end_date    value
1          2020-01-01  2020-01-04  100
3          2020-01-05  2020-01-09  80
5          2020-01-12  NULL        30
You can achieve this with one query by joining each row with the previous one, to check whether the value has changed (finding the start dates of periods), and with the first future row that has a different value (finding the start of the next period). If there is no previous row, it is considered a start as well. If there is no later change of the value, there is no end_date.
SELECT
main.source_id,
main.effective_date as start_date,
DATE_SUB(next_start.effective_date, INTERVAL 1 DAY) as end_date,
main.value
FROM source main
LEFT JOIN source prev ON prev.effective_date = (
SELECT MAX(effective_date)
FROM source
WHERE effective_date < main.effective_date
)
LEFT JOIN source next_start ON next_start.effective_date = (
SELECT MIN(effective_date)
FROM source
WHERE effective_date > main.effective_date AND value <> main.value
)
WHERE
    (ISNULL(prev.source_id) OR prev.value <> main.value)
    AND main.value <> 0
ORDER BY main.source_id
As I said: This will have to be adapted to your problem, e.g. by adding proper join conditions for the entity_id.
@Luuk pointed out that you don't need the end date because it can be derived from the data. That would only be the case if you had entries for the start of "0 periods" as well, i.e. rows marking that no value is set. If you don't have entries for these, you can't derive the end from the start of the respective next period, since there might be a gap in between.
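If you do want to materialize the end dates anyway, here is a sketch of the missing second step (my assumption: it runs right after each monthly insert, as in the question's flow, so the just-inserted rows are the only ones newer than any still-open row):

UPDATE component_final cf
JOIN component_final nxt
    ON  nxt.entity_id = cf.entity_id
    AND nxt.component_type = cf.component_type
    AND nxt.startdate > cf.startdate
SET cf.enddate = DATE_SUB(nxt.startdate, INTERVAL 1 DAY)  -- close the period one day before the new one starts
WHERE cf.enddate IS NULL;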
I have a MySQL table which has some records as follows:
unix_timestamp value
1001 2
1003 3
1012 1
1025 5
1040 0
1101 3
1105 4
1130 0
...
I want to compute the average for every 10 epochs to see the following results:
unix_timestamp_range avg_value
1001-1010 2.5
1011-1020 1
1021-1030 5
1031-1040 0
1041-1050 -1
1051-1060 -1
1061-1070 -1
1071-1080 -1
1081-1090 -1
1091-1100 -1
1101-1110 3.5
1111-1120 -1
1121-1130 0
...
I saw some similar answers, but they are not a solution for my specific question. How can I get the above results?
The easiest way to do this is to use a calendar table. Consider this approach:
SELECT
CONCAT(CAST(cal.ts AS CHAR(50)), '-', CAST(cal.ts + 9 AS CHAR(50))) AS unix_timestamp_range,
CASE WHEN COUNT(t.value) > 0 THEN AVG(t.value) ELSE -1 END AS avg_value
FROM
(
SELECT 1001 AS ts UNION ALL
SELECT 1011 UNION ALL
SELECT 1021 UNION ALL
...
) cal
LEFT JOIN yourTable t
ON t.unix_timestamp BETWEEN cal.ts AND cal.ts + 9
GROUP BY
cal.ts
ORDER BY
cal.ts;
In practice, if you have the need to do this sort of query often, instead of the inline subquery labelled as cal above, you might want to have a full dedicated table representing all timestamp ranges.
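On MySQL 8+, the calendar rows could also be generated with a recursive CTE rather than hand-written UNION ALL branches (a sketch; the 1001/1121 bounds are taken from the sample data and would normally be parameters):

WITH RECURSIVE cal (ts) AS (
    SELECT 1001                          -- first bucket start
    UNION ALL
    SELECT ts + 10 FROM cal
    WHERE ts + 10 <= 1121                -- last bucket start
)
SELECT
    CONCAT(cal.ts, '-', cal.ts + 9) AS unix_timestamp_range,
    CASE WHEN COUNT(t.value) > 0 THEN AVG(t.value) ELSE -1 END AS avg_value
FROM cal
LEFT JOIN yourTable t
    ON t.unix_timestamp BETWEEN cal.ts AND cal.ts + 9
GROUP BY cal.ts
ORDER BY cal.ts;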
I have a table that has many rows in it, with rows occurring at the rate of 400-500 per minute (I know this isn't THAT many), but I need to do some sort of 'trend' analysis on the data that has been collected over the last 1 minute.
Instead of pulling all records that have been entered and then processing each of those, I would really like to be able to select, say, 10 records which occur at a somewhat even distribution through the timeframe specified.
ID DEVICE_ID LA LO CREATED
-------------------------------------------------------------------
1 1 23.4 948.7 2018-12-13 00:00:01
2 2 22.4 948.2 2018-12-13 00:01:01
3 2 28.4 948.3 2018-12-13 00:02:22
4 1 26.4 948.6 2018-12-13 00:02:33
5 1 21.4 948.1 2018-12-13 00:02:42
6 1 22.4 948.3 2018-12-13 00:03:02
7 1 28.4 948.0 2018-12-13 00:03:11
8 2 23.4 948.8 2018-12-13 00:03:12
...
492 2 21.4 948.4 2018-12-13 00:03:25
493 1 22.4 948.2 2018-12-13 00:04:01
494 1 24.4 948.7 2018-12-13 00:04:02
495 2 27.4 948.1 2018-12-13 00:05:04
Considering this data set, instead of pulling all those rows, I would like to maybe pull a row from the set every 50 records (10 rows for roughly ~500 rows returned).
This does not need to be exact, I just need a sample in which to perform some sort of linear regression on.
Is this even possible? I can do it in my application code if need be, but I wanted to see if there was a function or something in MySQL that would handle this.
Edit
Here is the query I have tried, which works for now - but I would like the results more evenly distributed, not by RAND().
SELECT * FROM (
SELECT * FROM (
SELECT t.*, DATE_SUB(NOW(), INTERVAL 30 HOUR) as offsetdate
from tracking t
HAVING created > offsetdate) as parp
ORDER BY RAND()
LIMIT 10) as mastr
ORDER BY id ASC;
Do not ORDER BY RAND(): the random value is computed for every row, all rows are then reordered, and only then are a few records selected.
You can try something like this:
SELECT
*
FROM
(
SELECT
tracking.*
, #rownum := #rownum + 1 AS rownum
FROM
tracking
, (SELECT #rownum := 0) AS dummy
WHERE
created > DATE_SUB(NOW(), INTERVAL 30 HOUR)
) AS s
WHERE
(rownum % 10) = 0
An index on created is a must.
Also, you might consider using something like AND (UNIX_TIMESTAMP(created) % 60 = 0), which is slightly different from what you asked for but might be OK (it depends on your insert distribution).
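On MySQL 8+, the same sampling can be written with ROW_NUMBER() instead of user variables (a sketch; table and column names follow the question, the 1-minute window follows its prose, and the modulus of 50 yields roughly 10 rows out of ~500):

SELECT *
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (ORDER BY created) AS rn
    FROM tracking t
    WHERE created > DATE_SUB(NOW(), INTERVAL 1 MINUTE)
) s
WHERE rn % 50 = 0  -- keep every 50th row
ORDER BY id;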