How to get average difference of timestamp in hive - mysql

I have the table below, which contains two columns:
hive> select * from hivetable;
a 2016-09-16T03:01:12.367782Z
b 2016-09-16T03:01:12.300514Z
c 2016-09-16T03:01:12.241532Z
a 2016-09-16T03:01:12.138016Z
c 2016-09-16T03:01:12.136986Z
b 2016-09-16T03:01:10.512201Z
c 2016-09-16T03:01:12.235671Z
Time taken: 0.457 seconds, Fetched: 7 row(s)
Now I want to find the unique values in the first column and, for each one, the timestamp difference, or rather the average timestamp difference when there are more than 2 records, as in the case of c. So in my case the output should look like:
a 1 day 5 hr 30 min 20 sec
b 5 sec
c 30 minutes
Note: it is just a sample output and not the actual output
Is it possible to get this output or any similar one in hive?

You just need to use a window function to select the previous row in the grouping. I don't believe it can be compressed into just one query.
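select
  id,
  -- DATEDIFF returns whole days; AVG skips the NULL produced for each id's first row
  avg(DATEDIFF(time, prev_time)) as avg_time_diff_days
from (
  select id,
         time,
         -- partition by id only, so LAG sees the earlier rows of the same id
         LAG(time) OVER (PARTITION BY id ORDER BY time ASC) as prev_time
  from hivetable
) intervals
group by id;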
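Since DATEDIFF works in whole days, it reports 0 for every pair in the sample data, where all rows fall within a few seconds of each other. As a rough cross-check outside Hive, the sketch below computes the per-key average gap in seconds from the rows in the question. It is only an illustration in Python; the name `average_gap_seconds` is made up for this example and is not part of Hive.

```python
from collections import defaultdict
from datetime import datetime

# Sample rows copied from the question: (key, ISO-8601 UTC timestamp).
ROWS = [
    ("a", "2016-09-16T03:01:12.367782Z"),
    ("b", "2016-09-16T03:01:12.300514Z"),
    ("c", "2016-09-16T03:01:12.241532Z"),
    ("a", "2016-09-16T03:01:12.138016Z"),
    ("c", "2016-09-16T03:01:12.136986Z"),
    ("b", "2016-09-16T03:01:10.512201Z"),
    ("c", "2016-09-16T03:01:12.235671Z"),
]


def average_gap_seconds(rows):
    """Return {key: mean gap in seconds between consecutive timestamps}."""
    by_key = defaultdict(list)
    for key, text in rows:
        by_key[key].append(datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ"))
    result = {}
    for key, stamps in by_key.items():
        stamps.sort()
        # Pair each timestamp with the previous one, like LAG over ORDER BY time.
        gaps = [(cur - prev).total_seconds() for prev, cur in zip(stamps, stamps[1:])]
        if gaps:  # a key with a single row has no gap to average
            result[key] = sum(gaps) / len(gaps)
    return result


result = average_gap_seconds(ROWS)
```

For this data the averages come out well under two seconds per key, which is why a seconds-based difference is likely more useful here than a day-based one.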

Related

SQL Query to get distinct values from a table and the difference between ordered rows

I have a real time data table with time stamps for different data points
Time_stamp, UID, Parameter1, Parameter2, ....
I have 400 UIDs so each time_stamp is repeated 400 times
I want to write a query that uses this table to check whether the real-time data flow into the SQL database is working as expected: a new timestamp should be available every 5 minutes.
What I usually do is query the DISTINCT values of time_stamp in the table, order them descending, inspect them visually, and copy them to Excel to calculate the difference in minutes between consecutive distinct time_stamp values.
Any difference over 5 minutes means I have a problem. I am trying to figure out how to do something similar in SQL, maybe getting a table that looks like the one below. I tried to use LEAD and DISTINCT together but could not write the code myself; I'm just getting started with SQL.
Time_stamp, LEAD over last timestamp
Thank you for your help
You can use the lag analytic function as follows:
select t.* from
(select t.*,
lag(Time_stamp) over (order by Time_stamp) as lg_ts
from your_Table t) t
where timestampdiff(minute, lg_ts, Time_stamp) > 5
Or you can use not exists as follows:
select t.*
from your_table t
where not exists
(select 1 from your_table tt
where tt.Time_stamp < t.Time_stamp
and timestampdiff(minute, tt.Time_stamp, t.Time_stamp) <= 5)
and t.Time_stamp <> (select min(tt.Time_stamp) from your_table tt)
lead() or lag() is the right approach (depending on whether you want to see the row at the start or end of the gap).
For the time comparison, I recommend direct comparisons:
select t.*
from (select t.*,
lead(Time_stamp) over (partition by uid order by Time_stamp) as next_time_stamp
from t
) t
where next_time_stamp > time_stamp + interval 5 minute;
Note: exactly 5 minutes seems unlikely. You might want a fudge factor such as:
where next_time_stamp > time_stamp + interval (5*60 + 10) second;
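Functions such as timestampdiff() work in whole units, and how they round depends on the database. MySQL's timestampdiff() truncates to complete units, so the difference in minutes between 00:00:00 and 00:00:59 is 0. Functions that count the number of "boundaries" between two values, such as SQL Server's datediff(), report a difference of 1 minute between 00:00:59 and 00:01:02.
So, a reported difference of "5 minutes" could really be anything from 4 minutes and 1 second to 5 minutes and 59 seconds, depending on the function.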
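To make the direct-comparison idea concrete, here is a small Python sketch of the same check done outside the database: take the distinct timestamps, sort them, and flag consecutive pairs that are further apart than the expected interval plus a tolerance. The function name `find_gaps`, the 10-second slack, and the sample feed are assumptions made for this illustration only.

```python
from datetime import datetime, timedelta


def find_gaps(stamps, expected=timedelta(minutes=5), slack=timedelta(seconds=10)):
    """Return (previous, current) pairs of distinct timestamps that are
    more than expected + slack apart."""
    distinct = sorted(set(stamps))  # each timestamp repeats once per UID
    return [
        (prev, cur)
        for prev, cur in zip(distinct, distinct[1:])
        if cur - prev > expected + slack
    ]


# Hypothetical feed: duplicates stand in for multiple UIDs; 10:10 is missing.
base = datetime(2021, 1, 1, 10, 0)
feed = [base + timedelta(minutes=m) for m in (0, 0, 5, 5, 15, 20)]
gaps = find_gaps(feed)
```

Comparing the full timestamp difference against a threshold avoids the unit-rounding ambiguity, because nothing is collapsed to whole minutes before the comparison.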

Finding records in a range, rounding down when needed

This is a bit difficult to describe, and I'm not sure if this can be done in SQL. Using the following example data set:
ID Count Date
1 0 1/1/2015
2 3 1/5/2015
3 4 1/6/2015
4 3 1/9/2015
5 9 1/15/2015
I want to return records where the Date column falls into a range. But, if the "from" date doesn't exist in the table, I want to use the most recent date as my "From" select. For example, if my date range is between 1/5 and 1/9, I would expect to have records 2,3, and 4 returned. But, if I have a date range of 1/3 - 1/6 I want to return records 1,2,and 3. I want to include record 1 because, as 1/3 does not exist, I want the value of the Count that is rounded down.
Any thoughts on how this can be done? I'm using MySQL.
Basically, you need to replace the from date with the latest date before or on that date. Let me assume that the variables are @v_from and @v_to.
select e.*
from example e
where e.date >= (select max(e2.date) from example e2 where e2.date <= @v_from) and
e.date <= @v_to;
EDIT AFTER EDIT:
SELECT *
FROM your_table
WHERE Date BETWEEN (
SELECT Date
FROM your_table
WHERE Date <= @Start
ORDER BY Date DESC
LIMIT 1
)
AND @End
Or
SELECT *
FROM your_table
WHERE Date BETWEEN (
SELECT MAX(Date)
FROM your_table
WHERE Date <= @Start
)
AND @End
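The "round the from date down" rule can be checked against the sample data with a short Python sketch. It is not MySQL code; `rows_in_range` is a hypothetical helper, and its behaviour when no date lies on or before the start (it falls back to the start itself) is an assumption of this sketch. In the SQL versions above the subquery would return NULL in that case and no rows would match.

```python
from bisect import bisect_right
from datetime import date

# Rows from the question: (ID, Count, Date), already sorted by date.
ROWS = [
    (1, 0, date(2015, 1, 1)),
    (2, 3, date(2015, 1, 5)),
    (3, 4, date(2015, 1, 6)),
    (4, 3, date(2015, 1, 9)),
    (5, 9, date(2015, 1, 15)),
]


def rows_in_range(rows, start, end):
    """Return rows whose date lies between the rounded-down start and end."""
    dates = [row[2] for row in rows]
    i = bisect_right(dates, start)  # number of dates <= start
    # Latest date on or before start; assumption: keep start if there is none.
    effective_start = dates[i - 1] if i > 0 else start
    return [row for row in rows if effective_start <= row[2] <= end]


ids_3_to_6 = [r[0] for r in rows_in_range(ROWS, date(2015, 1, 3), date(2015, 1, 6))]
ids_5_to_9 = [r[0] for r in rows_in_range(ROWS, date(2015, 1, 5), date(2015, 1, 9))]
```

With the question's data, the range 1/3 to 1/6 picks up record 1 because 1/1 is the latest date on or before 1/3, while the range 1/5 to 1/9 starts exactly at record 2.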

DB, how to select data based on time and particular interval

I have a table in my database, and my program inserts data into it every 10 minutes.
The table has a field recording the insert date and time.
Now I want to retrieve the data, but I don't want hundreds of rows to come back.
I want to get 1 record for every half hour, based on the insert timestamp (so fewer than 50 in total per day).
That 1 record can be either a random pick or an average over each interval.
Sorry for the ambiguity; I just want to figure out how to select by intervals.
Let's say:
Table name: network_speed
----------------------------------
ID. ....... Speed ......... Insert_time
1 ....... 10 ......... 10:02am......
2 ....... 12 ......... 10:12am......
...
...
...
123 ....... 17 ........ 9:23am........
I want to get them all, but the output must be the average of the records in each half hour.
How can I write a query to achieve this?
Here is a query that calculates half hour intervals on a specific day ( 2013-09-04).
SELECT ID, Speed, Insert_time,
FLOOR(TIMESTAMPDIFF(MINUTE, '2013-09-04', Insert_time)/30) AS 'interval'
FROM network_speed
WHERE DATE(Insert_time) = '2013-09-04';
Use that in a nested query to get stats on the records in the intervals.
SELECT IT.interval, COUNT(ID), MIN(Insert_time), MAX(Insert_time), AVG(Speed)
FROM
(SELECT ID, Speed, Insert_time,
FLOOR(TIMESTAMPDIFF(MINUTE, '2013-09-04', Insert_time)/30) AS 'interval'
FROM network_speed
WHERE DATE(Insert_time) = '2013-09-04') AS IT
GROUP BY IT.interval;
Here it is used to get the first record in each interval.
SELECT NS.*
FROM
(SELECT IT.interval, MIN(ID) AS 'first_id'
FROM
(SELECT ID, Speed, Insert_time,
FLOOR(TIMESTAMPDIFF(MINUTE, '2013-09-04', Insert_time)/30) AS 'interval'
FROM network_speed
WHERE DATE(Insert_time) = '2013-09-04') AS IT
GROUP BY IT.interval) AS MI,
network_speed AS NS
WHERE MI.first_id = NS.ID;
Hope that helps.
Is this what you need?
SELECT HOUR(ts) as hr, fld1, fld2 from tbl group by hr
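This query extracts the hour from the timestamp and groups the result by it, so you get 1 row for each hour. Note that fld1 and fld2 are not aggregated, so MySQL returns their values from an arbitrary row of each group, and rejects the query when ONLY_FULL_GROUP_BY is enabled.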
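For reference, half-hour bucketing amounts to taking the minutes since midnight and integer-dividing by 30, which gives 48 slots per day. The Python sketch below shows that arithmetic on made-up readings; `half_hour_averages` and the sample values are hypothetical and only illustrate the grouping, not any particular MySQL query.

```python
from collections import defaultdict
from datetime import datetime


def half_hour_averages(samples):
    """Map each half-hour slot (0..47 within a day) to the mean speed of its samples."""
    buckets = defaultdict(list)
    for inserted_at, speed in samples:
        # Minutes since midnight, floored to a 30-minute slot.
        slot = (inserted_at.hour * 60 + inserted_at.minute) // 30
        buckets[slot].append(speed)
    return {slot: sum(v) / len(v) for slot, v in sorted(buckets.items())}


# Hypothetical readings for one day, roughly ten minutes apart.
samples = [
    (datetime(2013, 9, 4, 10, 2), 10),
    (datetime(2013, 9, 4, 10, 12), 12),
    (datetime(2013, 9, 4, 10, 22), 14),
    (datetime(2013, 9, 4, 10, 32), 20),
]
averages = half_hour_averages(samples)
```

Slot 20 covers 10:00 to 10:29 and slot 21 covers 10:30 to 10:59, so the first three readings are averaged together and the fourth stands alone.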

Optimize query on single table with group by, where, and 'flag'

I am struggling with the performance of a type of query that I use repeatedly. Any help would be greatly appreciated.
I have the following table:
item week type flag value
1 1 5 0 0.93
1 1 12 1 0.58
1 1 12 0 0.92
1 2 6 0 0.47
1 2 5 0 0.71
1 2 5 1 0.22
... ... ... ... ...
(the complete table has about 10k different items, 200 weeks, 1k types. flag is 0 or 1. Around 20M rows in total)
I would like to optimize the following query:
select item, week,
sum(value) as total_value,
sum(value * (1-least(flag, 1))) as unflagged_value
from test_table
where type in (5,10,14,22,114,116,121,123,124,2358,2363,2381)
group by item, week;
Currently, the fastest I could get was with an index on (type, item, week) and engine = MyISAM.
(I am using MySQL on a standard desktop.)
Do you have any suggestions (indexes, reformulation, etc.)?
As far as I know, a GROUP BY query can be fully optimized only with a covering index.
Add the following covering index to your table and check with EXPLAIN:
ALTER TABLE test_table ADD KEY ix1 (type, item, week, value, flag);
After adding the index, check the following query with EXPLAIN:
SELECT type, item, week,
SUM(value) AS total_value,
SUM(IF(flag = 0, value, 0)) AS unflagged_value
FROM test_table
WHERE type IN(5,10,14,22,114,116,121,123,124,2358,2363,2381)
GROUP BY type, item, week;
You may need to rewrite your query like this:
SELECT item, week,
SUM(total_value) AS total_value,
SUM(unflagged_value) AS unflagged_value
FROM(
SELECT type, item, week,
SUM(value) AS total_value,
SUM(IF(flag = 0, value, 0)) AS unflagged_value
FROM test_table
GROUP BY type, item, week
)a
WHERE type IN(5,10,14,22,114,116,121,123,124,2358,2363,2381)
GROUP BY item, week;
Check the query execution plan.
I think you should have only two indexes on the table:
1. An index on type (non clustered)
2. A composite index on (item, week) in the same order (non clustered)

MySQL: Average interval between records

Assume this table:
id date
----------------
1 2010-12-12
2 2010-12-13
3 2010-12-18
4 2010-12-22
5 2010-12-23
How do I find the average intervals between these dates, using MySQL queries only?
For instance, the calculation on this table will be
(
( 2010-12-13 - 2010-12-12 )
+ ( 2010-12-18 - 2010-12-13 )
+ ( 2010-12-22 - 2010-12-18 )
+ ( 2010-12-23 - 2010-12-22 )
) / 4
----------------------------------
= ( 1 DAY + 5 DAY + 4 DAY + 1 DAY ) / 4
= 2.75 DAY
Intuitively, what you are asking should be equivalent to the interval between the first and last dates, divided by the number of dates minus 1.
Let me explain more thoroughly. Imagine the dates are points on a line (+ are dates present, - are dates missing, the first date is the 12th, and I changed the last date to Dec 24th for illustration purposes):
++----+---+-+
Now, what you really want to do, is evenly space your dates out between these lines, and find how long it is between each of them:
+--+--+--+--+
To do that, you simply take the number of days between the last and first days, in this case 24 - 12 = 12, and divide it by the number of intervals you have to space out, in this case 4: 12 / 4 = 3.
With a MySQL query
SELECT DATEDIFF(MAX(dt), MIN(dt)) / (COUNT(dt) - 1) FROM a;
This works on this table (with your values it returns 2.75):
CREATE TABLE IF NOT EXISTS `a` (
`dt` date NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
INSERT INTO `a` (`dt`) VALUES
('2010-12-12'),
('2010-12-13'),
('2010-12-18'),
('2010-12-22'),
('2010-12-24');
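The shortcut works because the consecutive differences telescope: every intermediate date is added once and subtracted once, leaving only the last date minus the first. As a quick sanity check, the Python sketch below compares both calculations on the dates from the question; the function names are made up for this illustration.

```python
from datetime import date

# Dates from the question's table.
DATES = [
    date(2010, 12, 12),
    date(2010, 12, 13),
    date(2010, 12, 18),
    date(2010, 12, 22),
    date(2010, 12, 23),
]


def mean_gap_pairwise(dates):
    """Average of the differences between consecutive dates, in days."""
    ordered = sorted(dates)
    gaps = [(cur - prev).days for prev, cur in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


def mean_gap_endpoints(dates):
    """Same average via (max - min) / (count - 1), as in the DATEDIFF query."""
    return (max(dates) - min(dates)).days / (len(dates) - 1)


pairwise = mean_gap_pairwise(DATES)
endpoints = mean_gap_endpoints(DATES)
```

Both functions need at least two dates; with fewer, the division by zero mirrors the COUNT(dt) - 1 denominator in the query.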
If the ids are uniformly incremented without gaps, join the table to itself on id+1:
SELECT d.id, d.date, n.date, datediff(n.date, d.date)
FROM dates d
JOIN dates n ON(n.id = d.id + 1)
Then GROUP BY and average as needed.
If the ids are not uniform, do an inner query to assign ordered ids first.
I guess you'll also need to add a subquery to get the total number of rows.
Alternatively
Create an aggregate function that keeps track of the previous date, and a running sum and count. You'll still need to select from a subquery to force the ordering by date (actually, I'm not sure if that's guaranteed in MySQL).
Come to think of it, this is a much better way of doing it.
And Even Simpler
Just noting that Vegard's solution is much better.
The following query returns the correct result:
SELECT AVG(
DATEDIFF(i.date, (SELECT MAX(date)
FROM intervals WHERE date < i.date)
)
)
FROM intervals i
but it runs a dependent subquery which might be really inefficient with no index and on a larger number of rows.
You need to do a self join, compute the differences with the DATEDIFF function, and then take the average.