I have a table t_date_interval_30 that is the cartesian product of a full calendar year of dates (365 days) and a time field incremented in 30-minute intervals. I use this as a framework to hang call data on.
t_date_interval_30
DATE, DAYNAME, INTERVAL
'2013-01-01', 'Tuesday', '00:00:00'
'2013-01-01', 'Tuesday', '00:30:00'
'2013-01-01', 'Tuesday', '01:00:00'
'2013-01-01', 'Tuesday', '01:30:00'
'2013-01-01', 'Tuesday', '02:00:00'
'2013-01-01', 'Tuesday', '02:30:00'
ETC...
Next I have a view v_call_details that is a summarized view of the call data. Call data is summarized down to one row per call session initiated - the source for this can have multiple rows per call session; e.g., when a call rolls Ring No Answer from one target to another, each leg of the call adds a new record row.
v_call_details
CLIENT, CSQ, SESS_ID, DATE, CALL_START, CONT_DISP, MET_SLA
'Acme','ACME_CSQ','123-123456789-01','2013-01-01','2013-01-01 00:12:34','ABANDONED',TRUE
'Acme','ACME_CSQ','123-123456998-01','2013-01-01','2013-01-01 00:45:02','HANDLED',TRUE
'Acme','ACME_CSQ','123-123457291-01','2013-01-02','2013-01-02 13:31:58','HANDLED',FALSE
ETC...
So, when I run the below query it takes forever.
SELECT
cd.`client`,
cd.`csq`,
di.`date`,
di.`dayname`,
di.`interval`,
count(cd.`sess_id`) AS `calls`,
(count(cd.`sess_id`)
 - sum(IF(cd.`cont_disp` = 'ABANDONED' AND cd.`met_sla` > 0, 1, 0))) AS `presented`
FROM
t_date_interval_30 di
LEFT JOIN
v_call_details cd ON (di.`date` = cd.`date`
AND di.`interval` = SEC_TO_TIME((TIME_TO_SEC(cd.`call_start`) DIV 1800) * 1800))
WHERE
di.`date` BETWEEN '2013-05-01' AND '2013-05-02'
GROUP BY cd.`csq`, di.`date`, di.`interval`
I have never really worked with indexes (though I have tried adding a few on the DATE and CALL_START columns). When I run an EXPLAIN EXTENDED I get the below results.
id, select_type, table, type, possible_keys, key, key_len, ref, rows, filtered, Extra
1, PRIMARY, di, range, i_date, i_date, 3, , 96, 100.00, Using where; Using temporary; Using filesort
1, PRIMARY, <derived2>, ALL, , , , , 153419, 100.00,
2, DERIVED, t_cisco_csq_agent_details, ALL, , , , 161925, 100.00, Using temporary; Using filesort
2, DERIVED, t_lkp_clients, ALL, , , , 56, 100.00,
Any advice would be greatly appreciated. Right now if I run the query, returning results for 2 days worth of data takes roughly 70 seconds. At that rate, doing a 90 day report will take an hour and a half... I need to find a way to bring that down.
First, don't assume that 90 days worth of data will require 45 times the effort of 2 days. Your query is doing a full scan of the call details table, and this may account for much of the effort. MySQL can propagate the condition on date from di to cd through the equijoin. I'm not sure if it does in this case (because of the second condition).
Second, you are using a view. That might make it impossible to actually improve performance. You can try, but you should also attempt to rewrite the query without the view.
My next question is how long does this take to run:
select cd.csq, cd.`date`,
       SEC_TO_TIME((TIME_TO_SEC(cd.`call_start`) DIV 1800) * 1800) as `interval`,
       count(*)
from v_call_details cd
WHERE cd.`date` BETWEEN '2013-05-01' AND '2013-05-02'
GROUP BY cd.csq, cd.`date`, `interval`;
If this takes a reasonable amount of time, then test it for 90 days. If that works, then you can do the aggregation first and then join back to the di table. This is just an idea. I suspect the real performance problem is in the view.
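A hedged sketch of that aggregate-then-join approach, reusing only the tables and columns shown above (untested, so treat it as an outline rather than a drop-in replacement):
-- Aggregate the call details once, then hang the result on the calendar framework.
SELECT agg.client, agg.csq, di.`date`, di.dayname, di.`interval`,
       agg.calls, agg.presented
FROM t_date_interval_30 di
LEFT JOIN (
    SELECT cd.client, cd.csq, cd.`date`,
           SEC_TO_TIME((TIME_TO_SEC(cd.call_start) DIV 1800) * 1800) AS `interval`,
           COUNT(cd.sess_id) AS calls,
           COUNT(cd.sess_id)
             - SUM(IF(cd.cont_disp = 'ABANDONED' AND cd.met_sla > 0, 1, 0)) AS presented
    FROM v_call_details cd
    WHERE cd.`date` BETWEEN '2013-05-01' AND '2013-05-02'
    GROUP BY cd.client, cd.csq, cd.`date`, `interval`
) agg ON agg.`date` = di.`date` AND agg.`interval` = di.`interval`
WHERE di.`date` BETWEEN '2013-05-01' AND '2013-05-02';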
I have a MySQL table with around 600 K rows in it (Engine: InnoDB).
MySQL is running in a virtualbox machine with Ubuntu 16.04 LTS in it. MySQL server version is 5.7.23, if that's relevant.
The columns in the WHERE clauses (open_time and close_time) are both indexed and they are both DATETIME columns.
The column that I'm taking the sum of (volume) is a double.
This query returns instantly (0.000 seconds):
SELECT *
FROM klines
WHERE (open_time between '2018-01-01 00:00:00' AND '2018-01-01 12:00:00')
;
EXPLAIN output: (screenshot not reproduced here)
Whereas this one takes almost a second to fetch (varies between 0.640 and 0.703 seconds across 10 tries):
SELECT SUM(volume)
FROM klines
WHERE open_time >= '2018-01-01 00:00:00' AND close_time <= '2018-01-01 12:00:00'
;
EXPLAIN output: (screenshot not reproduced here)
Note that both queries return about the same rows (720 for the first, 721 for the second; the second query returns the same 720 rows as the first, plus one more).
So if I just want the rows, it does not matter whether the WHERE clause uses two columns or one. But if I want the SUM of a column, the query gets drastically slower when the WHERE clause uses two columns; with a single column it again returns instantly.
While I'm perfectly OK with using the query that filters on the two open_time bounds, I'm really curious about what's going on.
So, what would be the reason behind this?
open_time between '2018-01-01 00:00:00'
AND '2018-01-01 12:00:00'
can easily use INDEX(open_time) to touch only the interesting rows. But it is not possible to have an index that stops abruptly for this:
open_time >= '2018-01-01 00:00:00'
AND close_time <= '2018-01-01 12:00:00'
INDEX(open_time) could be used, but the last half of the table would be scanned. INDEX(close_time), similarly, would scan the first half of the table. And there is no way to do both.
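For reference, those two single-column indexes would look like this (the index names are just placeholders):
CREATE INDEX idx_open_time  ON klines (open_time);
CREATE INDEX idx_close_time ON klines (close_time);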
You probably have an additional constraint that is nowhere visible:
[open..close] time ranges don't overlap?
open is always < close?
These cannot be specified in standard SQL, nor is there any index formulation that would take advantage of either constraint.
Here are some rows that will mess up any optimization attempt:
INSERT INTO klines (open_time, close_time)
VALUES ('2018-01-01 06:00:00', '2037-12-31'),
       ('1971-01-01', '2018-01-01 06:00:00'),
       ('2037-01-01', '1971-01-01');
There are fixes, but they require either assuming non-overlapping ranges and then reworking the queries in severe ways, or playing with buckets.
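A hedged sketch of the first option, assuming open_time is always earlier than close_time and the ranges are short, so that bounding open_time on both sides does not drop any qualifying rows:
SELECT SUM(volume)
FROM klines
WHERE open_time >= '2018-01-01 00:00:00'
  AND open_time <  '2018-01-01 12:00:00'   -- lets INDEX(open_time) bound the scan
  AND close_time <= '2018-01-01 12:00:00'; -- residual check on the few remaining rows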
I have the query below
SELECT SUM(CAST(hd.value AS SIGNED)) as case_count
FROM historical_data hd
WHERE hd.tag_id IN (45,109,173,237,301,365,429)
AND hd.shift = 1
AND hd.timestamp BETWEEN '2018-04-10' AND '2018-04-11'
ORDER BY TIMESTAMP DESC
and with this I'm trying to select a SUM of the value for each of the IDs passed, during the time frame in the BETWEEN statement - but only the most recent value within that timeframe. So the end result would be a SUM of the case_count values for each ID passed in, taken at the last timestamp that ID has in that date range.
I am having trouble figuring out HOW to accomplish this. My historical_data table is HUGE, however I do have very specific indexing on it that allows the queries to function fairly well - as well as partitioning on the table by YEAR.
Can anyone provide a pointer on how to get the data I need? I'd rather not loop over the list of IDs and run this query without the SUM and a LIMIT 1, but I guess I can if that's the only way.
Here is one method:
SELECT SUM(CAST(hd.value AS SIGNED)) as case_count
FROM historical_data hd
WHERE hd.tag_id IN (45, 109, 173, 237, 301, 365, 429) AND
hd.shift = 1 AND
      hd.timestamp = (SELECT MAX(hd2.timestamp)
                      FROM historical_data hd2
                      WHERE hd2.tag_id = hd.tag_id AND
                            hd2.shift = hd.shift AND
                            hd2.timestamp BETWEEN '2018-04-10' AND '2018-04-11'
                     );
The optimal index for this query is on historical_data(shift, tag_id, timestamp).
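In MySQL syntax, that index could be created like this (the index name is just a placeholder):
CREATE INDEX idx_shift_tag_ts ON historical_data (shift, tag_id, timestamp);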
I am currently having an accuracy issue when querying price vs. time in a Google BigQuery dataset. What I would like is the price of an asset every five minutes, yet there are some assets that have an empty row for an exact minute.
For example, with VEN vs ICX, which are two cryptocurrencies, there might be a time at which price data is not available for a specific second. In my query I am querying the database every 300 seconds and taking the price data, yet some assets don't have a timestamp at exactly 5 minutes and 0 seconds. In that case I would like to get the last known price; a good price to use would be the one at 4 minutes and 58 seconds.
My query right now is:
SELECT MIN(price) AS PRICE, timestamp
FROM [coin_data]
WHERE coin="BTCUSD" AND TIMESTAMP_TO_SEC(timestamp) % 300 = 0
GROUP BY timestamp
ORDER BY timestamp ASC
This query results in this sort of gap in specific places:
Row((10339.25, datetime.datetime(2018, 2, 26, 21, 55, tzinfo=<UTC>)))
Row((10354.62, datetime.datetime(2018, 2, 26, 22, 0, tzinfo=<UTC>)))
Row((10320.0, datetime.datetime(2018, 2, 26, 22, 10[should be 5 for 5 min], tzinfo=<UTC>)))
The last row should read 5 in the minutes place, not 10.
In order to select a row that has a 5-minute mark/timestamp if it exists, or the closest existing entry, you can use "(analytic) window functions" (which use OVER()) instead of aggregate functions (which use GROUP BY), as follows:
group all rows into "separate" 5 minute groups
sort them by proximity to the desired time
select the first row from each partition.
Here I am using the OVER clause to create the "window frames" and sort the rows within them. RANK() then numbers the rows in each window frame according to that sort order.
Standard SQL
WITH data AS (
  SELECT *,
         CAST(FLOOR(UNIX_SECONDS(timestamp)/300) AS INT64) AS timegroup
  FROM `coin_data`
)
SELECT MIN(price) AS min_price, timestamp
FROM (
  SELECT *,
         RANK() OVER(PARTITION BY timegroup ORDER BY timestamp ASC) AS rank
  FROM data
)
WHERE rank = 1
GROUP BY timestamp
ORDER BY timestamp ASC
Legacy SQL
SELECT MIN(price) AS min_price, timestamp
FROM (
SELECT *,
RANK() OVER(PARTITION BY timegroup ORDER BY timestamp ASC) AS rank,
FROM (
SELECT *,
INTEGER(FLOOR(TIMESTAMP_TO_SEC(timestamp)/300)) AS timegroup
FROM [coin_data]) AS data )
WHERE rank = 1
GROUP BY timestamp
ORDER BY timestamp ASC
It seems that you have many prices for the same timestamp, in which case you may want to add another field to the OVER clause:
OVER(PARTITION BY timegroup, exchange ORDER BY timestamp ASC)
Notes:
Consider migrating to Standard SQL, which is the preferred SQL dialect for querying data stored in BigQuery. You can do that on a per-query basis, so you don't have to migrate everything at the same time.
My idea was to provide a general query that illustrates the principle, so I don't filter for empty rows; it's not clear whether they are NULL or empty strings, and it's not really necessary for the answer.
I have a table that has 1.6M rows. Whenever I use the query below, I get an average of 7.5 seconds.
select * from table
where pid = 170
and cdate between '2017-01-01 0:00:00' and '2017-12-31 23:59:59';
I tried adding a LIMIT of 1000 or 10000 and changing the date filter to a single month, but it still averages about 7.5 s. I tried adding a composite index on pid and cdate, but that came out about 1 second slower.
Here is the INDEX list
https://gist.github.com/primerg/3e2470fcd9b21a748af84746554309bc
Can I still make it faster? Is this an acceptable performance considering the amount of data?
Looks like the index is missing. Create this index and see if it helps:
CREATE INDEX cid_date_index ON table_name (pid, cdate);
Also modify your query as below:
select * from table
where pid = 170
and cdate between CAST('2017-01-01 0:00:00' AS DATETIME) and CAST('2017-12-31 23:59:59' AS DATETIME);
Please provide SHOW CREATE TABLE clicks.
How many rows are returned? If it is 100K rows, the effort to shovel that many rows is significant. And what will you do with that many rows? If you then summarize them, consider summarizing in SQL!
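For example, a hedged sketch of summarizing in SQL rather than shoveling raw rows to the client (the table name clicks and the year-long range are taken from this question; the daily grouping is just an illustration):
-- Daily counts instead of tens of thousands of raw rows
SELECT DATE(cdate) AS day, COUNT(*) AS clicks
FROM clicks
WHERE pid = 170
  AND cdate >= '2017-01-01'
  AND cdate <  '2017-01-01' + INTERVAL 1 YEAR
GROUP BY DATE(cdate);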
Do have cdate as DATETIME.
Do you use id for anything? Perhaps this would be better:
PRIMARY KEY (pid, cdate, id) -- to get benefit from clustering
INDEX(id) -- if still needed (and to keep AUTO_INCREMENT happy)
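A hedged sketch of that change, assuming the table is named clicks and currently has PRIMARY KEY(id) with id as AUTO_INCREMENT:
ALTER TABLE clicks
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (pid, cdate, id),  -- clusters rows by pid, then cdate
  ADD INDEX (id);                    -- keeps AUTO_INCREMENT satisfied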
This smells like Data Warehousing. DW benefits significantly from building and maintaining Summary table(s), such as one that has the daily click count (etc), from which you could very rapidly sum up 365 counts to get the answer.
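A minimal sketch of such a summary table; every name here is hypothetical:
CREATE TABLE clicks_daily (
  pid         INT NOT NULL,
  day         DATE NOT NULL,
  click_count INT UNSIGNED NOT NULL,
  PRIMARY KEY (pid, day)
);
-- Refreshed nightly (or hourly) for the previous day
INSERT INTO clicks_daily (pid, day, click_count)
SELECT pid, DATE(cdate), COUNT(*)
FROM clicks
WHERE cdate >= CURDATE() - INTERVAL 1 DAY
  AND cdate <  CURDATE()
GROUP BY pid, DATE(cdate)
ON DUPLICATE KEY UPDATE click_count = VALUES(click_count);
A yearly report then sums a few hundred summary rows per pid instead of scanning the raw table.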
CAST is unnecessary. Furthermore 0:00:00 is optional -- it can be included or excluded for either DATE or DATETIME. I prefer
cdate >= '2017-01-01'
AND cdate < '2017-01-01' + INTERVAL 1 YEAR
to avoid leap year, midnight, date arithmetic, etc.
I've been looking at several other SO questions but I could not make out a solution from these. First, the description, then what I'm missing from the other threads. (Heads up: I'm very well aware of the non-normalised structure of our database, which is something I have addressed in meetings before but this is what we have and what I have to work with.)
Background description
We have a machine that manufactures products in 25 positions. These products' production data is being logged in a table that among other things logs current and voltage for every position. This is only logged when the machine is actually producing products (i.e. has a product in the machine). The time where no product is present, nothing is being logged.
This machine can run in two different production modes: full production and R&D production. Full production means that products are being inserted continuously so that every instance has a product at all times (i.e. 25 products are present in the machine at all times). The second mode, R&D production, only produces one product at a time (i.e. one product enters the machine, goes through the 25 instances one by one and when this one is finished, the second product enters the machine).
To clarify: every position logs data once every second whenever a product is present, which means 25 instances per second when full production is running. When R&D mode is running, position 1 will have ~20 instances for 20 consecutive seconds, position 2 will have ~20 instances for the next 20 consecutive seconds and so on.
Table structure
Productiondata:
id (autoincrement)
productID
position
time (timestamp for logged data)
current (amperes)
voltage (volts)
Question
We want to calculate the uptime of the machine, but we want to separate the uptime for production mode and R&D mode, and we want to separate this data on a weekly basis.
Guessed solution
Since we have instances logged every second, I can count the number of DISTINCT time values in the table to find the total uptime across both production and R&D mode. To find the R&D mode, I can safely say that whenever a time instance has only one entry, the machine is running in R&D mode (production mode would have 25 instances).
Progress so far
I have the following query which sums up all distinct instances to find both production and R&D mode:
SELECT YEARWEEK(time) AS YWeek, COUNT(DISTINCT time) AS Time_Seconds, ROUND(COUNT(DISTINCT time)/3600, 1) AS Time_Hours
FROM Database.productiondata
WHERE YEARWEEK(time) >= YEARWEEK(curdate()) - 21
GROUP BY YWeek;
This query finds out how many DISTINCT time instances there are in the table and counts the number and groups that by the week.
Problem
The above query counts the distinct time instances that exist in the table, but I want to count ONLY the time values that appear exactly once. Basically, I'm trying to find something like: IF count(time) = 1, then count that instance; IF count(time) > 1, then don't count it at all (DISTINCT still counts it).
I looked at several other SO threads, but almost all explain how to find unique values with DISTINCT, which only accomplishes half of what I'm looking for. The closest I got was this which uses a HAVING clause. I'm currently stuck at the following:
SELECT YEARWEEK(time) as YWeek, COUNT(Distinct time) As Time_Seconds, ROUND(COUNT(Distinct time)/3600, 1) As Time_Hours
FROM
(SELECT * FROM Database.productiondata
WHERE time > '2014-01-01 00:00:00'
GROUP BY time
HAVING count(time) = 1) as temptime
GROUP BY YWeek
ORDER BY YWeek;
The problem here is the GROUP BY time inside the nested select, which takes forever (~5 million rows just for this year, so I can understand that). Syntactically I think this is correct, but it takes forever to execute. Even EXPLAIN for this times out.
And that is where I am. Is this the correct approach or is there any other way that is smarter/requires less query time/avoids the group by time clause?
EDIT: As a sample, we have this table (apologies for formatting, don't know how to make a table format here on SO)
id position time
1 1 1
2 2 1
3 5 1
4 19 1
... ... ...
25 7 1
26 3 2
27 6 2
... ... ...
This table shows what it looks like when a production run is going on. As you can see, there is no general structure for which position gets the first entry when logging the data; the 25 positions get logged every second and the data is added to the table depending on how fast the PLC sends the data for each position. The following table shows what the table looks like when it runs in R&D mode.
id position time
245 1 1
246 1 2
247 1 3
... ... ...
269 1 25
270 2 26
271 2 27
... ... ...
Since all the data is consolidated into one single table, we want to find out how many instances there are when COUNT(time) is exactly equal to 1, or we could look for every instance when COUNT(time) is strictly larger than 1.
EDIT2: As a reply to Alan, the suggestion gives me
YWeek Time_Seconds Time_Hours
201352 1 0.0
201352 1 0.0
201352 1 0.0
... ... ...
201352 1 0.0 (1000 row limit)
Whereas my desired output is
Yweek Time_Seconds Time_Hours
201352 2146 35.8
201401 5789 96.5
... ... ...
201419 8924 148.7
EDIT3: I have gathered the tries and the results so far here with a description in gray above the queries.
You might achieve better results by eliminating your subselect:
SELECT YEARWEEK(time) as YWeek,
COUNT(time) As Time_Seconds,
ROUND(COUNT(time)/3600, 1) As Time_Hours
FROM Database.productiondata
WHERE time > '2014-01-01 00:00:00'
GROUP BY YWeek
HAVING count(time) = 1
ORDER BY YWeek;
I'm assuming time has an index on it, but if it does not you could expect a significant improvement in performance by adding one.
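If it is missing, a hedged example of adding it (the index name is a placeholder):
CREATE INDEX idx_productiondata_time ON productiondata (time);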
UPDATE:
Per the recently added sample data, I'm not sure your approach is correct. The time column appears to be an INT representing seconds, while you're treating it as a DATETIME with YEARWEEK. Below is a working example (written in T-SQL) that does exactly what you asked IF time is actually a DATETIME column:
DECLARE @table TABLE
    (
      id INT ,
      [position] INT ,
      [time] DATETIME
    )

INSERT INTO @table VALUES ( 1, 1, DATEADD(week, -1, GETDATE()) )
INSERT INTO @table VALUES ( 1, 1, DATEADD(week, -2, GETDATE()) )
INSERT INTO @table VALUES ( 1, 1, DATEADD(week, -2, GETDATE()) )
INSERT INTO @table VALUES ( 1, 1, DATEADD(week, -2, GETDATE()) )
INSERT INTO @table VALUES ( 1, 1, DATEADD(week, -2, GETDATE()) )
INSERT INTO @table VALUES ( 1, 1, DATEADD(week, -3, GETDATE()) )
INSERT INTO @table VALUES ( 1, 1, DATEADD(week, -3, GETDATE()) )

SELECT  CAST(DATEPART(year, [time]) AS VARCHAR)
        + CAST(DATEPART(week, [time]) AS VARCHAR) AS YWeek ,
        COUNT([time]) AS Time_Seconds ,
        ROUND(COUNT([time]) / 3600.0, 1) AS Time_Hours
FROM    @table
WHERE   [time] > '2014-01-01 00:00:00'
GROUP BY DATEPART(year, [time]) ,
        DATEPART(week, [time])
HAVING  COUNT([time]) > 0
ORDER BY YWeek;
SELECT pd1.*
FROM Database.productiondata pd1
LEFT JOIN Database.productiondata pd2 ON pd1.time=pd2.time AND pd1.id<pd2.id
WHERE pd1.time > '2014-01-01 00:00:00' AND pd2.time > '2014-01-01 00:00:00'
AND pd2.id IS NULL
You can LEFT JOIN the table to itself and keep only the rows with no related match (i.e., no other row sharing the same time).
UPDATE: This query works in the SQL Fiddle:
SELECT pd1.*
FROM productiondata pd1
LEFT JOIN productiondata pd2
       ON pd1.time = pd2.time AND pd1.id < pd2.id
WHERE pd1.time > '2014-01-01 00:00:00' AND pd2.id IS NULL;
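A hedged side note: this self-join pattern benefits from a composite index on the join columns; something along these lines (the index name is a placeholder):
-- Lets pd1.time = pd2.time AND pd1.id < pd2.id be checked from the index
CREATE INDEX idx_productiondata_time_id ON productiondata (time, id);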