Optimize query on single table with group by, where, and 'flag' - mysql

I am struggling with the performance of a type of query that I use repeatedly. Any help would be greatly appreciated.
I have the following table:
item  week  type  flag  value
1     1     5     0     0.93
1     1     12    1     0.58
1     1     12    0     0.92
1     2     6     0     0.47
1     2     5     0     0.71
1     2     5     1     0.22
...   ...   ...   ...   ...
(The complete table has about 10k different items, 200 weeks, and 1k types; flag is 0 or 1. Around 20M rows in total.)
I would like to optimize the following query:
select item, week,
sum(value) as total_value,
sum(value * (1-least(flag, 1))) as unflagged_value
from test_table
where type in (5,10,14,22,114,116,121,123,124,2358,2363,2381)
group by item, week;
Currently, the fastest I could get was with an index on (type, item, week) and engine = MyISAM.
(I am using MySQL on a standard desktop.)
Do you have any suggestions (indexes, reformulation, etc.)?
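For reference, a DDL sketch of the index described above (the index name is made up; the table name comes from the query):
ALTER TABLE test_table ADD INDEX ix_type_item_week (type, item, week);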

As far as I know, a GROUP BY query can be fully optimized only with a covering index.
Add the following covering index on your table and check with EXPLAIN:
ALTER TABLE test_table ADD KEY ix1 (type, item, week, value, flag);
After adding the index, check the following query with EXPLAIN:
SELECT type, item, week,
       SUM(value) AS total_value,
       SUM(IF(flag = 0, value, 0)) AS unflagged_value
FROM test_table
WHERE type IN (5,10,14,22,114,116,121,123,124,2358,2363,2381)
GROUP BY type, item, week;
You may need to modify your query like this:
SELECT item, week,
       SUM(total_value) AS total_value,
       SUM(unflagged_value) AS unflagged_value
FROM (
  SELECT type, item, week,
         SUM(value) AS total_value,
         SUM(IF(flag = 0, value, 0)) AS unflagged_value
  FROM test_table
  GROUP BY type, item, week
) a
WHERE type IN (5,10,14,22,114,116,121,123,124,2358,2363,2381)
GROUP BY item, week;
See the query execution plan: SQLFIDDLE DEMO HERE

I think you should have only two indexes on the table (a DDL sketch follows below):
1. An index on type (non-clustered)
2. A composite index on (item, week), in that order (non-clustered)
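A sketch of those two indexes, assuming the test_table name from the question (the index names are made up; MySQL secondary indexes are non-clustered by default):
CREATE INDEX ix_type ON test_table (type);
CREATE INDEX ix_item_week ON test_table (item, week);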

Related

SQL Query to get distinct values from a table and the difference between ordered rows

I have a real-time data table with timestamps for different data points:
Time_stamp, UID, Parameter1, Parameter2, ....
I have 400 UIDs, so each time_stamp is repeated 400 times.
I want to write a query that uses this table to check whether the real-time data flow to the SQL database is working as expected - a new timestamp should be available every 5 minutes.
For this, what I usually do is query the DISTINCT values of time_stamp in the table, order them descending, do a visual inspection, and copy to Excel to calculate the difference in minutes between subsequent distinct time_stamp values.
Any difference over 5 minutes means I have a problem. I am trying to figure out how to do something similar in SQL, maybe get a table that looks like the one below. I tried to use LEAD and DISTINCT together but could not write the code myself; I'm just getting started with SQL.
Time_stamp, LEAD over last timestamp
Thank you for your help
You can use the LAG analytic function as follows:
select t.*
from (select t.*,
             lag(Time_stamp) over (order by Time_stamp) as lg_ts
      from your_Table t) t
where timestampdiff(minute, lg_ts, Time_stamp) > 5
Or you can use NOT EXISTS as follows:
select t.*
from your_table t
where not exists
      (select 1 from your_table tt
       where tt.Time_stamp < t.Time_stamp
         and timestampdiff(minute, tt.Time_stamp, t.Time_stamp) <= 5)
  and t.Time_stamp <> (select min(tt.Time_stamp) from your_table tt)
lead() or lag() is the right approach (depending on whether you want to see the row at the start or end of the gap).
For the time comparison, I recommend direct comparisons:
select t.*
from (select t.*,
             lead(Time_stamp) over (partition by uid order by Time_stamp) as next_time_stamp
      from t
     ) t
where next_time_stamp > time_stamp + interval 5 minute;
Note: exactly 5 minutes seems unlikely. You might want a fudge factor such as:
where next_time_stamp > time_stamp + interval 5*60 + 10 second;
timestampdiff() counts the number of "boundaries" between two values. So, the difference in minutes between 00:00:59 and 00:01:02 is 1. And the difference between 00:00:00 and 00:00:59 is 0.
So, a difference of "5 minutes" could really be 4 minutes and 1 second or could be 5 minutes and 59 seconds.
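If the minute-level granularity is a concern, here is a small sketch of the same check done in seconds, reusing the next_time_stamp alias from the query above (the table and column names are the same assumptions as before):
where timestampdiff(second, time_stamp, next_time_stamp) > 5*60 + 10;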

SQL Query - Find out how many times a row changes from 0 to another value

I am using MySQL 8 and need to create a stored procedure
I have a single table that has a DATE field and a value field which can be 0 or any other number. This value field represents the daily amount of rain for that day.
The table stores data from today up to 10 years into the future.
I need to find out how many periods of rain there will be in the next 10 years.
So, for example, if my table contains the following data:
Date - Value
2018-06-09 - 0
2018-06-10 - 50
2018-06-11 - 0
2018-06-12 - 15
2018-06-13 - 17
2018-06-14 - 0
2018-06-15 - 0
2018-06-16 - 12
2018-06-17 - 123
2018-06-18 - 17
Then the SP should return 3, because there were 3 periods of rain.
Any help in getting me closer to the answer will be appreciated!
You don't need a stored procedure for this.
Here is a solution with MySQL 8.0's LEAD function; it supports dates with gaps.
The complete table needs to be scanned, but I don't think that's a huge problem with ~3,650 records.
Query
SELECT
  SUM(filter_match = 1) AS number
FROM (
  SELECT
    ((t.value = 0) AND (LEAD(t.value) OVER (ORDER BY t.date ASC) != 0)) AS filter_match
  FROM
    t
) t
see demo https://www.db-fiddle.com/f/sev4NqgLsFPgtNgwzruwy/2
By the way, would you mind expanding your answer to understand how
LEAD and SUM work together?
LEAD(t.value) OVER (ORDER BY t.date ASC) simply means: get the value from the next record, ordered by date.
This demo shows it nicely: https://www.db-fiddle.com/f/sev4NqgLsFPgtNgwzruwy/6
SUM(filter_match = 1) is a conditional sum. In this case, the alias filter_match needs to be true.
See what filter_match is in this demo: https://www.db-fiddle.com/f/sev4NqgLsFPgtNgwzruwy/8
In MySQL, aggregate functions can take a SQL expression such as 1 = 1 (which is always true, i.e. 1) or 1 = 0 (which is always false, i.e. 0).
The conditional sum only sums up when the condition is true.
See the demo: https://www.db-fiddle.com/f/sev4NqgLsFPgtNgwzruwy/7
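As a minimal, self-contained illustration of the conditional-sum pattern (using an inline derived table rather than the table from the question):
SELECT SUM(v = 0) AS zero_days, SUM(v > 0) AS rain_days
FROM (SELECT 0 AS v UNION ALL SELECT 50 UNION ALL SELECT 0 UNION ALL SELECT 15) d;
-- returns zero_days = 2, rain_days = 2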
Use a MySQL self-join:
SELECT COUNT(*) Number_of_Periods
FROM yourTable A JOIN yourTable B
ON DATE(A.`DATE`)=DATE(B.`DATE` - INTERVAL 1 DAY)
AND A.`VALUE`=0 AND B.`VALUE`>0;
See Demo on DB Fiddle.

Query Database Accurately Based on Timestamp

I am currently having an accuracy issue when querying price vs. time in a Google BigQuery dataset. What I would like is the price of an asset every five minutes, yet some assets have an empty row for an exact minute.
For example, with VEN vs ICX, which are two cryptocurrencies, there might be a time at which price data is not available for a specific second. In my query, I query the database every 300 seconds and take the price data, yet some assets don't have a timestamp at exactly 5 minutes and 0 seconds. Thus, I would like to get the last known price: a good price to use would be the one at 4 minutes and 58 seconds.
My query right now is:
SELECT MIN(price) AS PRICE, timestamp
FROM [coin_data]
WHERE coin="BTCUSD" AND TIMESTAMP_TO_SEC(timestamp) % 300 = 0
GROUP BY timestamp
ORDER BY timestamp ASC
This query results in this sort of gap in specific places:
Row((10339.25, datetime.datetime(2018, 2, 26, 21, 55, tzinfo=<UTC>)))
Row((10354.62, datetime.datetime(2018, 2, 26, 22, 0, tzinfo=<UTC>)))
Row((10320.0, datetime.datetime(2018, 2, 26, 22, 10[should be 5 for 5 min], tzinfo=<UTC>)))
The last row should not show 10 in that column, as that is the minutes place; it should read 5 minutes.
In order to select the row that has a 5-minute mark/timestamp if it exists, or the closest existing entry, you can use "(analytic) window functions" (which use OVER()) instead of aggregate functions (which use GROUP BY), as follows:
group all rows into "separate" 5-minute groups
sort them by proximity to the desired time
select the first row from each partition.
Here I am using the OVER clause to create the "window frames" and sort the rows within them. Then RANK() numbers all the rows in each window frame in that sorted order.
Standard SQL
WITH data AS (
  SELECT *,
         CAST(FLOOR(UNIX_SECONDS(timestamp)/300) AS INT64) AS timegroup
  FROM `coin_data`
)
SELECT MIN(price) AS min_price, timestamp
FROM (
  SELECT *, RANK() OVER(PARTITION BY timegroup ORDER BY timestamp ASC) AS rank
  FROM data
)
WHERE rank = 1
GROUP BY timestamp
ORDER BY timestamp ASC
Legacy SQL
SELECT MIN(price) AS min_price, timestamp
FROM (
  SELECT *,
         RANK() OVER(PARTITION BY timegroup ORDER BY timestamp ASC) AS rank
  FROM (
    SELECT *,
           INTEGER(FLOOR(TIMESTAMP_TO_SEC(timestamp)/300)) AS timegroup
    FROM [coin_data]
  ) AS data
)
WHERE rank = 1
GROUP BY timestamp
ORDER BY timestamp ASC
It seems that you have many prices for the same timestamp, in which case you may want to add another field to the OVER clause:
OVER(PARTITION BY timegroup, exchange ORDER BY timestamp ASC)
Notes:
Consider migrating to Standard SQL, which is the preferred SQL dialect for querying data stored in BigQuery. You can do that on a per-query basis, so you don't have to migrate everything at the same time.
My idea was to provide a general query that illustrates the principle, so I don't filter out empty rows, because it's not clear whether they are NULL or empty strings, and it's not really necessary for the answer.

How to get average difference of timestamp in hive

I have the table below, which contains two columns:
hive> select * from hivetable;
a 2016-09-16T03:01:12.367782Z
b 2016-09-16T03:01:12.300514Z
c 2016-09-16T03:01:12.241532Z
a 2016-09-16T03:01:12.138016Z
c 2016-09-16T03:01:12.136986Z
b 2016-09-16T03:01:10.512201Z
c 2016-09-16T03:01:12.235671Z
Time taken: 0.457 seconds, Fetched: 7 row(s)
Now I want to find the unique values from the first column and the timestamp difference, or rather the average timestamp difference when there are more than 2 records, as in the case of c. So in my case the output should look like:
a 1 day 5 hr 30 min 20 sec
b 5 sec
c 30 minutes
Note: this is just a sample output and not the actual output.
Is it possible to get this output, or something similar, in Hive?
You just need to use a window function to select the previous row within each group. I don't believe it can be compressed into a single query without a subquery.
select
  id,
  avg(DATEDIFF(time, prev_time)) as avg_time_diff_days
from (
  select id,
         time,
         -- previous timestamp for the same id; NULL on the first row, which avg() ignores
         LAG(time) OVER (PARTITION BY id ORDER BY time ASC) as prev_time
  from hivetable
) intervals
group by id;
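Note that Hive's datediff() returns whole days, while the sample timestamps differ only by seconds, so the average above would come out as 0 days. A rough sketch that averages the gap in seconds instead, assuming the second column is (or can be cast to) a Hive TIMESTAMP and keeping the assumed column names id and time:
select id,
       -- casting a Hive TIMESTAMP to double gives seconds since epoch
       avg(cast(time as double) - cast(prev_time as double)) as avg_diff_seconds
from (
  select id,
         time,
         LAG(time) OVER (PARTITION BY id ORDER BY time ASC) as prev_time
  from hivetable
) intervals
group by id;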

DB, how to select data based on time and particular interval

I have a table in my database; my program inserts data into that table every 10 minutes.
The table has a field recording the insert date and time.
Now I want to retrieve that data, but I don't want hundreds of rows to come out.
I want to get 1 record for every half hour based on the insert timestamp (so less than 50 in total for a day).
That 1 record can be either a random pick or an average from each interval.
Sorry for the ambiguity; I just want to figure out the way to select from intervals.
Let's say:
Table name: network_speed
ID    Speed    Insert_time
1     10       10:02am
2     12       10:12am
...
123   17       9:23am
I want to get them all, but the output must be the average of each half-hour interval.
How can I write a query to achieve this?
Here is a query that calculates half-hour intervals on a specific day (2013-09-04):
SELECT ID, Speed, Insert_time,
       FLOOR(TIMESTAMPDIFF(MINUTE, '2013-09-04', Insert_time)/30) AS 'interval'
FROM network_speed
WHERE DATE(Insert_time) = '2013-09-04';
Use that in a nested query to get stats on the records in the intervals.
SELECT IT.interval, COUNT(ID), MIN(Insert_time), MAX(Insert_time), AVG(Speed)
FROM
  (SELECT ID, Speed, Insert_time,
          FLOOR(TIMESTAMPDIFF(MINUTE, '2013-09-04', Insert_time)/30) AS 'interval'
   FROM network_speed
   WHERE DATE(Insert_time) = '2013-09-04') AS IT
GROUP BY IT.interval;
Here it is used to get the first record in each interval.
SELECT NS.*
FROM
  (SELECT IT.interval, MIN(ID) AS 'first_id'
   FROM
     (SELECT ID, Speed, Insert_time,
             FLOOR(TIMESTAMPDIFF(MINUTE, '2013-09-04', Insert_time)/30) AS 'interval'
      FROM network_speed
      WHERE DATE(Insert_time) = '2013-09-04') AS IT
   GROUP BY IT.interval) AS MI,
  network_speed AS NS
WHERE MI.first_id = NS.ID;
Hope that helps.
Is this what you need?
SELECT HOUR(ts) as hr, fld1, fld2 from tbl group by hr
This query extracts only the hour from the timestamp and then groups the result on that hour, so you get 1 row for each hour.
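To match the half-hour averages asked for in the question, here is a sketch along the same lines, assuming the network_speed table and columns shown above and MySQL:
SELECT DATE(Insert_time) AS dt,
       HOUR(Insert_time) AS hr,
       FLOOR(MINUTE(Insert_time)/30) AS half_hour, -- 0 = minutes 00-29, 1 = minutes 30-59
       AVG(Speed) AS avg_speed
FROM network_speed
GROUP BY dt, hr, half_hour
ORDER BY dt, hr, half_hour;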