MySQL: find average TIMEDIFF with GROUP BY and exclusion - mysql

I have a MySQL table which contains two datetime columns:
CREATE TABLE test (job_id int, dateCol1 DATETIME, datecol2 DATETIME);
It contains a series of rows representing the start and end times of jobs. I need the average duration of those jobs in minutes, but so far I can't see how to achieve it.
I have tried various things:
SELECT job_id, AVG(TIMEDIFF(datecol2,datecol1)) FROM test GROUP BY job_id;
SELECT job_id, SEC_TO_TIME(AVG(TIMEDIFF(datecol2,datecol1))) FROM test GROUP BY job_id;
SELECT job_id, SEC_TO_TIME(AVG(TIME_TO_SEC(datecol2)-TIME_TO_SEC(datecol1))) FROM test GROUP BY job_id;
SELECT job_id, SEC_TO_TIME(AVG(TIME_TO_SEC(FROM_UNIXTIME(UNIX_TIMESTAMP(datecol2)-UNIX_TIMESTAMP(datecol1))))) FROM test GROUP BY job_id;
I'm comparing the results to a list of all jobs which I printed out and averaged in Excel, but so far I am getting different results.
In addition, I will also need to exclude any jobs where the duration is greater than 1 hour.
I'm sure I am just overcomplicating this, and if someone can show me the way it would be appreciated. Otherwise I'll end up having to print out a list for each job_id and average them manually in Excel.

You could use UNIX_TIMESTAMP to get the seconds difference and then get the average.
SELECT job_id,
SEC_TO_TIME(AVG(UNIX_TIMESTAMP(datecol2) - UNIX_TIMESTAMP(datecol1)))
FROM test
GROUP BY job_id;
If you need to filter those with a higher difference than an hour...
SELECT job_id,
SEC_TO_TIME(AVG(UNIX_TIMESTAMP(datecol2) - UNIX_TIMESTAMP(datecol1)))
FROM test
WHERE UNIX_TIMESTAMP(datecol2) - UNIX_TIMESTAMP(datecol1) < 3600
GROUP BY job_id;
Br
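To cross-check the SQL results without Excel, the same arithmetic can be sketched in Python. This is only an illustration under assumptions: the sample rows and the helper names (average_minutes, hhmmss_as_number) are hypothetical, not from the original post. The second helper mimics what, as far as I know, happens when MySQL reads a TIME value as a number (00:01:30 becomes 130), which would explain why AVG(TIMEDIFF(...)) disagrees with a manual average.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (job_id, start, end). Not data from the original post.
ROWS = [
    (1, "2024-01-01 10:00:00", "2024-01-01 10:01:30"),  # 90 seconds
    (1, "2024-01-01 11:00:00", "2024-01-01 11:00:50"),  # 50 seconds
    (1, "2024-01-01 12:00:00", "2024-01-01 14:00:00"),  # 2 hours, excluded
    (2, "2024-01-01 23:50:00", "2024-01-02 00:10:00"),  # 20 minutes across midnight
]
FMT = "%Y-%m-%d %H:%M:%S"


def average_minutes(rows, max_seconds=3600):
    """Average duration per job in minutes, skipping runs longer than max_seconds.

    Runs of exactly max_seconds are kept; compare with < to drop them too.
    """
    durations = defaultdict(list)
    for job_id, start, end in rows:
        delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
        seconds = delta.total_seconds()
        if seconds <= max_seconds:
            durations[job_id].append(seconds)
    return {job: sum(vals) / len(vals) / 60 for job, vals in durations.items()}


def hhmmss_as_number(seconds):
    """Encode a duration the way HH:MM:SS reads as a plain number (00:01:30 -> 130)."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return hours * 10000 + minutes * 100 + secs


if __name__ == "__main__":
    print(average_minutes(ROWS))
    # Averaging the HHMMSS-style numbers gives 90, not the true mean of 70 seconds.
    print((hhmmss_as_number(90) + hhmmss_as_number(50)) / 2)
```

Working in seconds throughout (as the UNIX_TIMESTAMP query does) and converting to minutes only at the end avoids that mismatch.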

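The following query will exclude job runs of 1 hour or more. TIMEDIFF is converted to seconds before averaging, because averaging the TIME values directly gives misleading numbers:
SELECT job_id
,SEC_TO_TIME(AVG(TIME_TO_SEC(TIMEDIFF(datecol2,datecol1))))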
FROM test
WHERE datecol2 < DATE_ADD(datecol1, INTERVAL 1 HOUR)
GROUP BY
job_id;

Related

Optimizing a simple query with 70mil rows to fit into Tableau

Noobie to SQL. I have a simple query here that is 70 million rows, and my work laptop will not handle the capacity when I import it into Tableau. Usually 20 million rows and less seem to work fine. Here's my problem.
Table name: Table1
Fields: UniqueID, State, Date, claim_type
Query:
SELECT uniqueID, states, claim_type, date
FROM table1
WHERE date >= '11-09-2021'
This gives me what I want, BUT, I can limit the query significantly if I count the number of uniqueIDs that have been used in 3 or more different states. I use this query to do that.
SELECT unique_id, count(distinct states), claim_type, date
FROM table1
WHERE date >= '11-09-2021'
GROUP BY Unique_id, claim_type, date
HAVING COUNT(DISTINCT states) > 3
The only issue is, when I put this query into Tableau it only displays the FIRST state a unique_id showed up in, and the first date it showed up. A unique_id shows up in multiple states over multiple dates, so when I use this count aggregation it's only giving me the first result and not the whole picture.
Any ideas here? I am totally lost and spent a whole business day trying to fix this
Expected output would be something like
uniqueID | state | claim type | Date
123 Ohio C 01-01-2021
123 Nebraska I 02-08-2021
123 Georgia D 03-08-2021
If your table has only those four columns, and your queries are based on date ranges, you need an index to help optimize that. If 70 million records exist, how far back does that go... years? If your data since 2021-09-11 is only, say, 30k records, that should be all you are reading through for your results.
I would ensure you have an index on (and in this order)
(date, uniqueId, claim_type, states). Also, you mentioned you wanted a count of 3 OR MORE; your query's > 3 will result in 4 or more unless you change it to count(distinct states) >= 3.
Then, to get the entries you care about, you need
SELECT date, uniqueID, claim_type
FROM table1
WHERE date >= '2021-09-11'
group by date, uniqueID, claim_type
having count( distinct states ) >= 3
This would give just the 3-part qualifier for date/id/claim that HAD them. Then you would use THIS result set to get the other entries via
select distinct
date, uniqueID, claim_type, states
from
( SELECT date, uniqueID, claim_type
FROM table1
WHERE date >= '2021-09-11'
group by date, uniqueID, claim_type
having count( distinct states ) >= 3 ) PQ
JOIN Table1 t1
on PQ.date = t1.date
and PQ.UniqueID = t1.UniqueID
and PQ.Claim_Type = t1.Claim_Type
The "PQ" (preQuery) gets the qualified records. Then it joins back to the original table and grabs all records that qualified from the unique date/id/claim_type and returns all the states.
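The pre-query-then-join-back pattern can be tried on a small sample. The sketch below uses Python's sqlite3 module as a stand-in for MySQL, which is an assumption made only so the example runs anywhere. It also groups by uniqueID alone, which is a variation on the query above: the expected output in the question shows one ID spread across different dates and claim types. The function name and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical sample rows: (uniqueID, states, claim_type, date in ISO format).
SAMPLE = [
    (123, "Ohio", "C", "2021-01-01"),
    (123, "Nebraska", "I", "2021-02-08"),
    (123, "Georgia", "D", "2021-03-08"),
    (456, "Ohio", "C", "2021-01-05"),
    (456, "Ohio", "C", "2021-02-01"),
]


def rows_for_multi_state_ids(rows, min_states=3):
    """Return every row whose uniqueID appears in at least min_states distinct states."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE table1 (uniqueID INTEGER, states TEXT, claim_type TEXT, date TEXT)"
    )
    con.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?)", rows)
    result = con.execute(
        """
        SELECT t1.uniqueID, t1.states, t1.claim_type, t1.date
        FROM (SELECT uniqueID
              FROM table1
              GROUP BY uniqueID
              HAVING COUNT(DISTINCT states) >= ?) AS pq   -- the qualifying IDs
        JOIN table1 AS t1 ON t1.uniqueID = pq.uniqueID    -- join back for all rows
        ORDER BY t1.uniqueID, t1.date
        """,
        (min_states,),
    ).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in rows_for_multi_state_ids(SAMPLE):
        print(row)
```

Only the rows for ID 123 come back, one per state, which is the shape the question asks for and far fewer rows than the unfiltered table.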
Yes, you are grouping rows, so you lose information in the grouped result.
You won't get 70m records with your grouped query.
Why don't you split your imports in smaller chunks? Like limit the rows to chunks of, say 15m:
1st:
SELECT uniqueID, states, claim_type, date FROM table1 WHERE date >= '11-09-2021' LIMIT 15000000;
2nd:
SELECT uniqueID, states, claim_type, date FROM table1 WHERE date >= '11-09-2021' LIMIT 15000000 OFFSET 15000000;
3rd:
SELECT uniqueID, states, claim_type, date FROM table1 WHERE date >= '11-09-2021' LIMIT 15000000 OFFSET 30000000;
and so on..
I know it's not a perfect or very handy solution, but maybe it gets you to the desired outcome.
See this link for information about LIMIT and OFFSET:
https://www.bitdegree.org/learn/mysql-limit-offset
It is wise in the long run to use the DATE datatype. That requires dates to look like '2021-09-11', not '11-09-2021'. That will let > correctly compare dates that are in two different years.
If your data is coming from some source that formats it as '11-09-2021', use STR_TO_DATE() to convert it as it goes in; you can reconstruct that format on output via DATE_FORMAT().
Once you have done that, we can talk about optimizing
SELECT unique_id, count(distinct states), claim_type, date
FROM table1
WHERE date >= '2021-09-11'
GROUP BY Unique_id, claim_type, date
HAVING COUNT(DISTINCT states) > 3
Tentatively, I recommend this composite index to speed up the query:
INDEX(Unique_id, claim_type, date, states)
That will also help with your other query.
(I am assuming the ambiguous '11-09-2021' is DD-MM-YYYY.)
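To see why the date format matters for comparisons, here is a small Python sketch. It assumes, as the answer above does, that '11-09-2021' means DD-MM-YYYY; the helper name to_iso is hypothetical. It plays the role STR_TO_DATE() would play on the MySQL side.

```python
from datetime import datetime


def to_iso(ddmmyyyy):
    """Convert a 'DD-MM-YYYY' string to ISO 'YYYY-MM-DD' (assumed input format)."""
    return datetime.strptime(ddmmyyyy, "%d-%m-%Y").date().isoformat()


cutoff = "11-09-2021"   # 11 September 2021
earlier = "12-01-2020"  # 12 January 2020, clearly before the cutoff

# Compared as raw strings, the earlier date wrongly passes the >= filter,
# because '12...' sorts after '11...'.
wrong = earlier >= cutoff

# Compared in ISO form, string order matches calendar order.
right = to_iso(earlier) >= to_iso(cutoff)

if __name__ == "__main__":
    print(to_iso(cutoff), wrong, right)
```

With the column stored as DATE (or at least as ISO-formatted text), a range filter on it behaves as expected and can use an index.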

How do I SELECT a MySQL Table value that has not been updated on a given date?

I have a MySQL database named mydb in which I store daily share prices for
423 companies in a table named data. Table data has the following columns:
`epic`, `date`, `open`, `high`, `low`, `close`, `volume`
epic and date being primary key pairs.
I update the data table each day using a csv file which would normally have 423 rows
of data all having the same date. However, on some days prices may not be available
for all 423 companies and data for a particular epic and date pair will
not be updated. In order to determine the missing pair I have resorted
to comparing a full list of epics against the incomplete list of epics using
two simple SELECT queries with different dates and then using a file comparator, thus
revealing the missing epic(s). This is not a very satisfactory solution and so far
I have not been able to construct a query that would identify any epics that
have not been updated for any particular day.
SELECT `epic`, `date` FROM `data`
WHERE `date` IN ('2019-05-07', '2019-05-08')
ORDER BY `epic`, `date`;
Produces pairs of values:
`epic` `date`
"3IN" "2019-05-07"
"3IN" "2019-05-08"
"888" "2019-05-07"
"888" "2019-05-08"
"AA." "2019-05-07"
"AAL" "2019-05-07"
"AAL" "2019-05-08"
Where in this case AA. has not been updated on 2019-05-08. The problem with this is that it is not easy to spot a value that is not a pair.
Any help with this problem would be greatly appreciated.
You could do a COUNT on epic, with a GROUP BY epic for rows in that date range, and then select from this result where UpdateCount is less than 2. Forgive me if the syntax on the column names is not correct; I work in SQL Server, but the logic of the query should still work for you.
SELECT x.epic
FROM
(
SELECT COUNT(*) AS UpdateCount, epic
FROM data
WHERE date IN ('2019-05-07', '2019-05-08')
GROUP BY epic
) AS x
WHERE x.UpdateCount < 2
Assuming you only want to check the last date uploaded, the following will return every item not updated on 2019-05-08:
SELECT last_updated.epic, last_updated.date
FROM (
SELECT epic , max(`date`) AS date FROM `data`
GROUP BY `epic`
) AS last_updated
WHERE last_updated.date <> '2019-05-08'
ORDER BY last_updated.epic
;
Or, for any upload date, the following compares against the entire table, so you don't rely on '2019-05-07' having a row for every epic. That is, if the epic has ever been in the table, it will be listed when it was not updated on the given date:
SELECT d.epic, max(d.date)
FROM data as d
WHERE d.epic NOT IN (
SELECT d2.epic
FROM data as d2
WHERE d2.date = '2019-05-08'
)
GROUP BY d.epic
ORDER BY d.epic
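The NOT IN query can be checked against the sample rows from the question. The sketch below runs it with Python's sqlite3 module, which is an assumption standing in for MySQL; the function name missing_epics is hypothetical.

```python
import sqlite3

# The (epic, date) pairs shown in the question.
SAMPLE = [
    ("3IN", "2019-05-07"), ("3IN", "2019-05-08"),
    ("888", "2019-05-07"), ("888", "2019-05-08"),
    ("AA.", "2019-05-07"),
    ("AAL", "2019-05-07"), ("AAL", "2019-05-08"),
]


def missing_epics(rows, day):
    """Return (epic, last date seen) for every epic that has no row on the given day."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE data (epic TEXT, date TEXT, PRIMARY KEY (epic, date))")
    con.executemany("INSERT INTO data VALUES (?, ?)", rows)
    result = con.execute(
        """
        SELECT d.epic, MAX(d.date)
        FROM data AS d
        WHERE d.epic NOT IN (SELECT d2.epic FROM data AS d2 WHERE d2.date = ?)
        GROUP BY d.epic
        ORDER BY d.epic
        """,
        (day,),
    ).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    print(missing_epics(SAMPLE, "2019-05-08"))
```

On this sample only AA. is reported, together with the last date it was updated.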

How to return zero values if nothing was written in time interval?

I am using Graph Reports with the select below. The MySQL database only holds the active records, so if there are no records from hour X to hour Y, the select returns nothing for that interval. In my case I need the select to return zero values for PayPal as well, even when there was no activity in the database. I do not understand how to use UNION, or how to rewrite the select, in order to get zero values when nothing was recorded in a time interval. Could you please help?
select STR_TO_DATE ( DATE_FORMAT(`acctstarttime`,'%y-%m-%d %H'),'%y-%m-%d %H')
as '#date', count(*) as `Active Paid Accounts`
from radacct_history where `paymentmethod` = 'PayPal'
group by DATE_FORMAT(`#date`,'%y-%m-%d %H')
When I run the select the output is:
Current Output
But if there are no values between 2016-07-27 07:00:00 and 2016-07-28 11:00:00, I need every hour in that range to show zero active accounts, like this:
Needed output with no values every hour
I have created the select below, but it does not put a zero value in every hour as I need. It still shows a big gap between 12 Sep and 13 Sep, where there should be zero values for every hour.
(select STR_TO_DATE ( DATE_FORMAT(`acctstarttime`,'%y-%m-%d %H'),'%y-%m-%d %H')
as '#date', count(`paymentmethod`) as `Active Paid Accounts`
from radacct_history where `paymentmethod` <> 'PayPal'
group by DATE_FORMAT(`#date`,'%y-%m-%d %H'))
union ALL
(select STR_TO_DATE ( DATE_FORMAT(`acctstarttime`,'%y-%m-%d %H'),'%y-%m-%d %H')
as '#date', 0 as `Active Paid Accounts`
from radacct_history where `paymentmethod` <> 'PayPal'
group by DATE_FORMAT(`#date`,'%y-%m-%d %H')) ;
I guess you want to return 0 if there are no matching rows in MySQL. Here is an example:
(SELECT Col1,Col2,Col3 FROM ExampleTable WHERE ID='1234')
UNION (SELECT 'Def Val' AS Col1,'none' AS Col2,'' AS Col3) LIMIT 1;
Updated the post: judging by the output provided, you are trying to retrieve data that isn't present in the table. In that case you have to maintain a date table to supply the dates that are missing from the data. It's a little bit tricky; please refer to: SQL query that returns all dates not used in a table
You need an artificial table with all necessary time intervals. E.g. if you need daily data, create a table and add all day dates, e.g. from 1970 to 2100.
Then you can use that table and LEFT JOIN your radacct_history to it. For each desired interval you will then have a group item (the GROUP BY should be based on the intervals table).
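The intervals-table idea can be sketched as follows. This is an illustration under assumptions: Python's sqlite3 module stands in for MySQL (so strftime replaces DATE_FORMAT), and the hours table, its hour_start column, the function name, and the sample rows are all hypothetical. Note that the PayPal filter sits in the ON clause; putting it in WHERE would discard the empty hours again.

```python
import sqlite3
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical sample rows: (acctstarttime, paymentmethod).
SAMPLE = [
    ("2016-07-27 07:15:00", "PayPal"),
    ("2016-07-27 07:45:00", "PayPal"),
    ("2016-07-27 08:10:00", "Card"),
    ("2016-07-27 10:05:00", "PayPal"),
]


def hourly_paypal_counts(rows, first_hour, last_hour):
    """Count PayPal rows per hour, returning 0 for hours with no activity."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE radacct_history (acctstarttime TEXT, paymentmethod TEXT)")
    con.executemany("INSERT INTO radacct_history VALUES (?, ?)", rows)

    # The artificial intervals table: one row per hour in the requested range.
    con.execute("CREATE TABLE hours (hour_start TEXT PRIMARY KEY)")
    current = datetime.strptime(first_hour, FMT)
    end = datetime.strptime(last_hour, FMT)
    while current <= end:
        con.execute("INSERT INTO hours VALUES (?)", (current.strftime(FMT),))
        current += timedelta(hours=1)

    result = con.execute(
        """
        SELECT h.hour_start, COUNT(r.acctstarttime)
        FROM hours AS h
        LEFT JOIN radacct_history AS r
          ON strftime('%Y-%m-%d %H:00:00', r.acctstarttime) = h.hour_start
         AND r.paymentmethod = 'PayPal'
        GROUP BY h.hour_start
        ORDER BY h.hour_start
        """
    ).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in hourly_paypal_counts(SAMPLE, "2016-07-27 07:00:00", "2016-07-27 10:00:00"):
        print(row)
```

COUNT(r.acctstarttime) counts only matched rows, so hours with no match in the LEFT JOIN come out as 0 rather than disappearing.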

Group by date from multiple columns?

First of all, sorry for the title, but I have no idea how to describe it:
I'm saving sessions in my table and I would like to get the count of sessions per hour to know how many sessions were active over the day. The sessions are specified by two timestamps: start and end.
Hopefully you can help me.
Here we go:
http://sqlfiddle.com/#!2/bfb62/2/0
While I'm still not sure how you'd like to compare the start and end dates, looks like using COUNT, YEAR, MONTH, DAY, and HOUR, you could come up with your desired results.
Possibly something similar to this:
SELECT COUNT(ID), YEAR(Start), HOUR(Start), DAY(Start), MONTH(Start)
FROM Sessions
GROUP BY YEAR(Start), HOUR(Start), DAY(Start), MONTH(Start)
And the SQL Fiddle.
What you want to do is rather hard in MySQL. You can, however, get an approximation without too much difficulty. The following counts up users who start and stop within one day:
select date(start), hour,
sum(case when hours.hour between hour(s.start) and hour(s.end) then 1 else 0
end) as GoodEstimate
from sessions s cross join
(select 0 as hour union all
select 1 union all
. . .
select 23
) hours
group by date(start), hour
When a user spans multiple days, the query is harder. Here is one approach, that assumes that there exists a user who starts during every hour:
select thehour, count(*)
from (select distinct date(start), hour(start),
(cast(date(start) as datetime) + interval hour(start) hour) as thehour
from sessions
) dh left outer join
sessions s
on s.start <= thehour + interval 1 hour and
s.end >= thehour
group by thehour
Note: these are untested so might have syntax errors.
OK, this is another problem where the index table comes to the rescue.
An index table is something that everyone should have in their toolkit, preferably in the master database. It is a table with a single id int primary key indexed column containing sequential numbers from 0 to n where n is a number big enough to do what you need, 100,000 is good, 1,000,000 is better. You only need to create this table once but once you do you will find it has all kinds of applications.
For your problem you need to consider each hour and, if I understand your problem correctly, count every session that started before the end of that hour and hasn't ended before that hour starts.
Here is the SQL fiddle for the solution.
What it does is use a known sequential number from the indextable (only 0 to 100 for this fiddle - just over 4 days - you can see why you need a big n) to link with your data at the top and bottom of the hour.
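Since the fiddle itself is not reproduced here, the following is only a sketch of how an index table could be joined to the sessions, not the linked solution. It runs on Python's sqlite3 module as a stand-in for MySQL; the column names started_at and ended_at, the function name, and the sample sessions are assumptions made for the example.

```python
import sqlite3

# Hypothetical sample sessions: (started_at, ended_at).
SAMPLE = [
    ("2013-01-01 00:30:00", "2013-01-01 02:15:00"),
    ("2013-01-01 01:10:00", "2013-01-01 01:20:00"),
]


def active_sessions_per_hour(sessions, base, hours):
    """Count sessions active in each of `hours` consecutive hours starting at `base`."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sessions (started_at TEXT, ended_at TEXT)")
    con.executemany("INSERT INTO sessions VALUES (?, ?)", sessions)

    # The index table: sequential integers 0 .. hours-1.
    con.execute("CREATE TABLE indextable (id INTEGER PRIMARY KEY)")
    con.executemany("INSERT INTO indextable VALUES (?)", [(n,) for n in range(hours)])

    result = con.execute(
        """
        SELECT datetime(?, '+' || i.id || ' hours') AS hour_start,
               COUNT(s.started_at)
        FROM indextable AS i
        LEFT JOIN sessions AS s
          ON s.started_at < datetime(?, '+' || (i.id + 1) || ' hours')  -- began before the hour ends
         AND s.ended_at >= datetime(?, '+' || i.id || ' hours')         -- not over before it starts
        GROUP BY i.id
        ORDER BY i.id
        """,
        (base, base, base),
    ).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in active_sessions_per_hour(SAMPLE, "2013-01-01 00:00:00", 4):
        print(row)
```

Each integer in the index table becomes one hour bucket, so a session spanning several hours is counted once in every bucket it overlaps, and empty hours still appear with a count of 0.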

Difference between rows Mysql Query

I have one table which is having four fields:
trip_paramid, creation_time, fuel_content,vehicle_id
I want to find the difference between two rows. My table has a field fuel_content. Every two minutes I receive a packet and insert it into the database. From this I want to find the total refuel quantity. If the fuel content increases by more than 2 between two packets, I treat the increase as a refueling quantity. Multiple refuels may happen on the same day, so I want to find the total refuel quantity for a day for a vehicle. I created a table schema and sample data in sqlfiddle. Can anyone help me find a solution for this? Here is the link for the table schema: http://www.sqlfiddle.com/#!2/4cf36
Here is a query that does this.
The parameters (vehicle_id=13) and (date='2012-11-08') are hard-coded in the query, but they are meant to be replaced with your own values.
Note that I have chosen an expression using creation_time<.. and creation_time>.. instead of DATE(creation_time)='...'. This is because the first expression can use an index on "creation_time" while the second one cannot.
SELECT
SUM(fuel_content-prev_content) AS refuel_tot
, COUNT(*) AS refuel_nbr
FROM (
SELECT
p.trip_paramid
, fuel_content
, creation_time
, (
SELECT ps.fuel_content
FROM trip_parameters AS ps
WHERE (ps.vehicle_id=p.vehicle_id)
AND (ps.trip_paramid<p.trip_paramid)
ORDER BY trip_paramid DESC
LIMIT 1
) AS prev_content
FROM trip_parameters AS p
WHERE (p.vehicle_id=13)
AND (creation_time>='2012-11-08')
AND (creation_time<DATE_ADD('2012-11-08', INTERVAL 1 DAY))
ORDER BY p.trip_paramid
) AS log
WHERE (fuel_content-prev_content)>2
Test it:
select sum(t2.fuel_content-t1.fuel_content) TotalFuel,t1.vehicle_id,t1.trip_paramid as rowIdA,
t2.trip_paramid as rowIdB,
t1.creation_time as timeA,
t2.creation_time as timeB,
t2.fuel_content fuel2,
t1.fuel_content fuel1,
(t2.fuel_content-t1.fuel_content) diffFuel
from trip_parameters t1, trip_parameters t2
where t1.trip_paramid<t2.trip_paramid
and t1.vehicle_id=t2.vehicle_id
and t1.vehicle_id=13
and t2.fuel_content-t1.fuel_content>2
order by rowIdA,rowIdB
where (rowIdA,rowIdB) are all possible tuples without repetition, diffFuel is the difference between the fuel quantities, and TotalFuel is the sum of all refuel quantities.
The query compares all fuel content differences for the same vehicle (in this example, the vehicle with id=13) and only sums the fuel quantity when the difference is > 2.
Regards.
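The previous-packet approach from the first answer can be checked on a few rows. The sketch below runs a trimmed version of that query (without the date filter) on Python's sqlite3 module, which is an assumption standing in for MySQL; the function name and the sample packets are hypothetical.

```python
import sqlite3

# Hypothetical packets: (trip_paramid, creation_time, fuel_content, vehicle_id).
SAMPLE = [
    (1, "2012-11-08 08:00:00", 40, 13),
    (2, "2012-11-08 08:02:00", 39, 13),
    (3, "2012-11-08 08:04:00", 38, 13),
    (4, "2012-11-08 08:05:00", 10, 14),  # another vehicle, must be ignored
    (5, "2012-11-08 08:06:00", 60, 13),  # refuel: +22 over packet 3
    (6, "2012-11-08 08:08:00", 59, 13),
    (7, "2012-11-08 08:10:00", 70, 13),  # refuel: +11 over packet 6
]


def refuel_total(packets, vehicle_id, threshold=2):
    """Return (total refuel quantity, number of refuels) for one vehicle."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE trip_parameters ("
        "trip_paramid INTEGER PRIMARY KEY, creation_time TEXT, "
        "fuel_content REAL, vehicle_id INTEGER)"
    )
    con.executemany("INSERT INTO trip_parameters VALUES (?, ?, ?, ?)", packets)
    row = con.execute(
        """
        SELECT COALESCE(SUM(fuel_content - prev_content), 0), COUNT(*)
        FROM (
            SELECT p.fuel_content,
                   (SELECT ps.fuel_content          -- the same vehicle's previous packet
                    FROM trip_parameters AS ps
                    WHERE ps.vehicle_id = p.vehicle_id
                      AND ps.trip_paramid < p.trip_paramid
                    ORDER BY ps.trip_paramid DESC
                    LIMIT 1) AS prev_content
            FROM trip_parameters AS p
            WHERE p.vehicle_id = ?
        ) AS log
        WHERE fuel_content - prev_content > ?
        """,
        (vehicle_id, threshold),
    ).fetchone()
    con.close()
    return row


if __name__ == "__main__":
    print(refuel_total(SAMPLE, 13))
```

Only consecutive packets of the same vehicle are compared here, so a single refuel is counted once; a self-join over all pairs of rows would instead compare every earlier packet with every later one.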