I have a big MySQL table that contains values for all kinds of data (each with a different data_id) and a timestamp (unix timestamp in ms). I'm trying to build a (real-time) plotter for all this data, and I want to be able to plot any data on the vertical axis against any other data on the horizontal axis. The problem I encounter is how to couple datapoints efficiently based on their timestamps.
The dataset is quite large, the logging frequency is about 10 Hz, and I want a datapoint for every 1-5 minutes. I already managed to make a (kinda) efficient SQL call to get an average value and an average timestamp for every minute:
SELECT AVG(value), AVG(timestamp)
FROM
(
    SELECT value, timestamp
    FROM database
    WHERE
        data_id = 100 AND
        timestamp < ... AND timestamp > ...
) AS data
GROUP BY timestamp DIV 60000
ORDER BY timestamp DESC;
However, now I want to be able to plot, for example, data_id 100 against data_id 200 instead of data_id 100 against time. So how do I couple the values for data_id 100 and 200 for a timestep of about 1 minute in a large dataset?
I already tried the following, but the SQL call took way too long:
SELECT a.timestamp, a.value, b.value
FROM
(
SELECT value, timestamp
FROM daq_test.data_f32
WHERE
data_id = 166 AND
timestamp < 1507720000000 AND
timestamp > 1507334400000
ORDER BY timestamp DESC
) a,
(
SELECT value, timestamp
FROM daq_test.data_f32
WHERE
data_id = 137 AND
timestamp < 1507720000000 AND
timestamp > 1507334400000
ORDER BY timestamp DESC
) b
WHERE a.timestamp DIV 60000 = b.timestamp DIV 60000
ORDER BY a.timestamp DESC;
Well, I have no idea what the point of this query is. But my suggestion is to create an index based on the parameters in your WHERE clause.
So, if you are searching for records by data_id and timestamp, it is a good idea to create a composite index on those two columns.
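For example, a minimal sketch against the table from your question (the index name is illustrative):
CREATE INDEX idx_data_id_timestamp ON daq_test.data_f32 (data_id, timestamp);
With data_id leading, MySQL can seek straight to one data_id and read its rows in timestamp order, which also serves the range filter and the ORDER BY.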
Also, the most significant slowdown is probably caused by the ORDER BY timestamp.
Could you run EXPLAIN SELECT and edit your question with the output, so I can update this answer with more specific advice?
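For example, using one of the subqueries from your question:
EXPLAIN SELECT value, timestamp
FROM daq_test.data_f32
WHERE data_id = 166
  AND timestamp < 1507720000000
  AND timestamp > 1507334400000;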
Related
I have written a query to select all rows where the value of a column 'gvA' is 0 in the previous row and non-zero in the current row. But my issue is that this query takes too long to execute.
My table has 40000 rows and the query takes about 60-65 seconds, which is too much for a query. How can I improve the query for better performance? Following is my query:
SELECT device_no,datetime
FROM (
SELECT
gvA,
(SELECT e2.gvA
FROM tyn_records e2
WHERE e2.tyn_id < e1.tyn_id
ORDER BY tyn_id DESC LIMIT 1) as previous_value,
datetime,
device_no
FROM tyn_records e1
WHERE gvA > 0 AND DATE(datetime) = CURDATE() - INTERVAL 2 DAY
) selected
WHERE selected.previous_value = 0
Following are my tables, Devices and tyn_records (schemas attached as screenshots).
I would do two things:
I would rephrase the query a bit, specifically to remove the DATE() function from the left side of the filtering condition, and to compute previous_value with the LAG() window function (available in MySQL 8.0+) instead of a correlated subquery:
select
device_no,
datetime
from (
select
gva,
lag(gva) over(order by tyn_id) as previous_value,
datetime,
device_no
from tyn_records
where gva > 0
and datetime between curdate() - interval 2 day
and curdate() - interval 1 day
) x
where previous_value = 0
With the function on the left side of the predicate removed, you can create an index suitable to optimize the query:
create index ix1 on tyn_records (datetime, gva);
As a side note, the way you compute previous_value may not be deterministic and could produce different results each time you run the query. This can happen if the column tyn_id is not unique.
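If tyn_id can repeat, one option (a sketch, assuming datetime is a suitable tie-breaker) is to extend the window ordering:
lag(gva) over(order by tyn_id, datetime) as previous_value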
I have a MySQL database named mydb in which I store daily share prices for 423 companies in a table named data. Table data has the following columns:
`epic`, `date`, `open`, `high`, `low`, `close`, `volume`
with `epic` and `date` forming the composite primary key.
I update the data table each day using a csv file which would normally have 423 rows of data, all with the same date. However, on some days prices may not be available for all 423 companies, and the data for a particular epic and date pair will not be updated. To determine the missing pairs I have resorted to comparing a full list of epics against the incomplete list of epics, using two simple SELECT queries with different dates and then a file comparator, thus revealing the missing epic(s). This is not a very satisfactory solution, and so far I have not been able to construct a query that identifies the epics that have not been updated on any particular day.
SELECT `epic`, `date` FROM `data`
WHERE `date` IN ('2019-05-07', '2019-05-08')
ORDER BY `epic`, `date`;
Produces pairs of values:
`epic` `date`
"3IN" "2019-05-07"
"3IN" "2019-05-08"
"888" "2019-05-07"
"888" "2019-05-08"
"AA." "2019-05-07"
"AAL" "2019-05-07"
"AAL" "2019-05-08"
Here, AA. has not been updated on 2019-05-08. The problem with this is that it is not easy to spot a value that is not part of a pair.
Any help with this problem would be greatly appreciated.
You could do a COUNT on epic, with a GROUP BY epic for items in that date range, then select from this result where UpdateCount is less than 2. Forgive me if the syntax on the column names is not correct; I work in SQL Server, but the logic of the query should still work for you.
SELECT x.epic
FROM
(
SELECT COUNT(*) AS UpdateCount, epic
FROM data
WHERE date IN ('2019-05-07', '2019-05-08')
GROUP BY epic
) AS x
WHERE x.UpdateCount < 2
Assuming you only want to check the last date uploaded, the following will return every item not updated on 2019-05-08:
SELECT last_updated.epic, last_updated.date
FROM (
    SELECT epic, MAX(`date`) AS date
    FROM `data`
    GROUP BY epic
) AS last_updated
WHERE last_updated.date <> '2019-05-08'
ORDER BY last_updated.epic;
Or, for any upload date, the following will compare against the entire database, so you don't rely on '2019-05-07' containing every epic row. I.e. if an epic has ever been in the database before, it will show up here when it has not been updated:
SELECT d.epic, max(d.date)
FROM data as d
WHERE d.epic NOT IN (
SELECT d2.epic
FROM data as d2
WHERE d2.date = '2019-05-08'
)
GROUP BY d.epic
ORDER BY d.epic
I have a table which currently has about 80 million rows, created as follows:
create table records
(
id int auto_increment primary key,
created int not null,
status int default '0' not null
)
collate = utf8_unicode_ci;
create index created_and_status_idx
on records (created, status);
The created column contains unix timestamps and status can be an integer between -10 and 10. The records are evenly distributed over the created date range, and around half of them have status 0 or -10.
I have a cron job that selects records that are between 32 and 8 days old, processes them, and then deletes them, for certain statuses. The query is as follows:
SELECT
records.id
FROM records
WHERE
(records.status = 0 OR records.status = -10)
AND records.created BETWEEN UNIX_TIMESTAMP() - 32 * 86400 AND UNIX_TIMESTAMP() - 8 * 86400
LIMIT 500
The query was fast when the records were at the beginning of the creation interval, but now that the cleanup has reached the records at the end of the interval it takes about 10 seconds to run. EXPLAIN says the query uses the index, but it examines about 40 million rows.
My question is whether there is anything I can do to improve the performance of this query, and if so, how exactly.
Thank you.
I think union all is your best approach:
(SELECT r.id
FROM records r
WHERE r.status = 0 AND
r.created BETWEEN UNIX_TIMESTAMP() - 32 * 86400 AND UNIX_TIMESTAMP() - 8 * 86400
LIMIT 500
) UNION ALL
(SELECT r.id
FROM records r
WHERE r.status = -10 AND
r.created BETWEEN UNIX_TIMESTAMP() - 32 * 86400 AND UNIX_TIMESTAMP() - 8 * 86400
LIMIT 500
)
LIMIT 500;
This can use an index on records(status, created, id).
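A minimal sketch of that index (the name is illustrative):
CREATE INDEX idx_status_created_id ON records (status, created, id);
Because the index also contains id, each branch of the UNION ALL can be answered from the index alone.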
Note: use union if records.id could have duplicates.
You are also using LIMIT with no ORDER BY. That is generally discouraged.
Your index is in the wrong order. You should put the IN column (status) first (you phrased it as an OR), and put the 'range' column (created) last:
INDEX(status, created)
(Don't give me any guff about "cardinality"; we are not looking at individual columns.)
Are there really only 3 columns in the table? Do you need id? If not, get rid of it and change to
PRIMARY KEY(status, created)
Other techniques for walking through large tables efficiently.
I am currently having an accuracy issue when querying price vs. time in a Google BigQuery dataset. What I would like is the price of an asset every five minutes, yet some assets have an empty row for an exact minute.
For example, with VEN vs ICX, which are two cryptocurrencies, there might be a time at which price data is not available for a specific second. In my query I select the rows whose timestamp falls on an exact 300-second boundary and take the price there, yet some assets don't have a row at exactly 5 minutes and 0 seconds. In that case I would like to get the last known price: a good price to use would be the one at 4 minutes and 58 seconds.
My query right now is:
SELECT MIN(price) AS PRICE, timestamp
FROM [coin_data]
WHERE coin="BTCUSD" AND TIMESTAMP_TO_SEC(timestamp) % 300 = 0
GROUP BY timestamp
ORDER BY timestamp ASC
This query results in this sort of gap in specific places:
Row((10339.25, datetime.datetime(2018, 2, 26, 21, 55, tzinfo=<UTC>)))
Row((10354.62, datetime.datetime(2018, 2, 26, 22, 0, tzinfo=<UTC>)))
Row((10320.0, datetime.datetime(2018, 2, 26, 22, 10, tzinfo=<UTC>)))
The last row should be for 22:05, not 22:10: the entry at the 5-minute mark is missing entirely.
In order to select the row that sits exactly on a 5-minute mark if it exists, or otherwise the closest existing entry, you can use (analytic) window functions (which use OVER()) instead of aggregate functions (which use GROUP BY), as follows:
1. group all rows into separate 5-minute groups,
2. sort each group by proximity to the desired time,
3. select the first row from each partition.
Here I am using the OVER clause to create the "window frames" and sort the rows within them. RANK() then numbers the rows in each frame according to that ordering.
Standard SQL
WITH
data AS (
SELECT *,
CAST(FLOOR(UNIX_SECONDS(timestamp)/300) AS INT64) AS timegroup
FROM
`coin_data` )
SELECT MIN(price) AS min_price, timestamp
FROM
  (SELECT *, RANK() OVER(PARTITION BY timegroup ORDER BY timestamp ASC) AS rank
   FROM data)
WHERE rank = 1
GROUP BY timestamp
ORDER BY timestamp ASC
Legacy SQL
SELECT MIN(price) AS min_price, timestamp
FROM (
SELECT *,
RANK() OVER(PARTITION BY timegroup ORDER BY timestamp ASC) AS rank
FROM (
SELECT *,
INTEGER(FLOOR(TIMESTAMP_TO_SEC(timestamp)/300)) AS timegroup
FROM [coin_data]) AS data )
WHERE rank = 1
GROUP BY timestamp
ORDER BY timestamp ASC
It seems that you have many prices for the same timestamp, in which case you may want to add another field to the OVER clause:
OVER(PARTITION BY timegroup, exchange ORDER BY timestamp ASC)
Notes:
Consider migrating to Standard SQL, which is the preferred SQL dialect for querying data stored in BigQuery. You can do that on a per-query basis, so you don't have to migrate everything at the same time.
My idea was to provide a general query that illustrates the principle, so I don't filter out empty rows; it's not clear whether they are NULL or empty strings, and it's not really necessary for the answer.
I've been trying to work this one out for a while now; maybe my problem is coming up with the correct search query. I'm not sure.
Anyway, the problem I'm having is that I have a table of data to which a new row is added every second (imagine the structure {id, timestamp (datetime), value}). I would like a single MySQL query that goes through the table and outputs only the first value of each minute.
I thought about doing this with multiple queries with LIMIT and datetime >= (beginning of minute), but with the volume of data I'm collecting that is a lot of queries, so it would be nicer to produce the data in a single query.
Sample data:
id datetime value
1 2015-01-01 00:00:00 128
2 2015-01-01 00:00:01 127
3 2015-01-01 00:00:04 129
4 2015-01-01 00:00:05 127
...
67 2015-01-01 00:00:59 112
68 2015-01-01 00:01:12 108
69 2015-01-01 00:01:13 109
Where I would want the result to select the rows:
1 2015-01-01 00:00:00 128
68 2015-01-01 00:01:12 108
Any ideas?
Thanks!
EDIT: Forgot to add: the data, whilst logged every second, does not reliably have a row on the first second of every minute - it may be :30 or :01 rather than :00 seconds past the minute.
EDIT 2: A nice-to-have (definitely not required for an answer) would be a query that is flexible enough to take an arbitrary number of minutes (rather than one row each minute).
SELECT t2.* FROM
( SELECT MIN(`datetime`) AS dt
FROM tbl
GROUP BY DATE_FORMAT(`datetime`,'%Y-%m-%d %H:%i')
) t1
JOIN tbl t2 ON t1.dt = t2.`datetime`
SQLFiddle
Or
SELECT *
FROM tbl
WHERE dt IN ( SELECT MIN(dt) AS dt
FROM tbl
GROUP BY DATE_FORMAT(dt,'%Y-%m-%d %H:%i'))
SQLFiddle
SELECT t1.*
FROM tbl t1
LEFT JOIN (
SELECT MIN(dt) AS dt
FROM tbl
GROUP BY DATE_FORMAT(dt,'%Y-%m-%d %H:%i')
) t2 ON t1.dt = t2.dt
WHERE t2.dt IS NOT NULL
SQLFiddle
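Regarding EDIT 2, a sketch of the first query adapted to an arbitrary bucket size (here 5 minutes), grouping on the unix timestamp instead of DATE_FORMAT():
SELECT t2.*
FROM ( SELECT MIN(`datetime`) AS dt
       FROM tbl
       GROUP BY FLOOR(UNIX_TIMESTAMP(`datetime`) / (5 * 60))
     ) t1
JOIN tbl t2 ON t1.dt = t2.`datetime`
The buckets are aligned to the unix epoch; change 5 * 60 to whatever interval in seconds you need.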
In MS SQL Server I would use CROSS APPLY, but as far as I know MySQL doesn't have it, so we can emulate it.
Make sure that you have an index on your datetime column.
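For example (the index name is illustrative):
CREATE INDEX ix_tbl_dt ON tbl (`dt`);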
Create a table of numbers, or in your case a table of minutes. If you have a table of numbers starting from 1 it is trivial to turn it into minutes in the necessary range.
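As a sketch, assuming a numbers table with a column n starting from 1, the minutes for a single hour could be generated like this:
SELECT TIMESTAMP('2015-01-01 00:00:00') + INTERVAL (n - 1) MINUTE AS MinuteValue
FROM numbers
WHERE n <= 60;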
SELECT
tbl.ID
,tbl.`dt`
,tbl.value
FROM
(
SELECT
MinuteValue
, (
SELECT tbl.id
FROM tbl
WHERE tbl.`dt` >= Minutes.MinuteValue
ORDER BY tbl.`dt`
LIMIT 1
) AS ID
FROM Minutes
) AS IDs
INNER JOIN tbl ON tbl.ID = IDs.ID
For each minute, find one row that has a timestamp greater than or equal to the minute. I don't know how to return the full row, rather than one column, from the nested SELECT in MySQL, so at first I'm making a temp table with two columns: the minute and the id from the original table. Then I explicitly look up the rows from the original table knowing their IDs.
SQL Fiddle
I've created a table of Minutes in the SQL Fiddle with the necessary values to keep the example simple. In real life you would have a more generic table.
Here is a SQL Fiddle that uses a table of numbers, just for illustration.
In any case, you do need to know in advance somehow the range of dates/numbers you are interested in.
It is trivial to make it work for any interval of minutes. If you need results every 5 minutes, just generate a table of minutes that has values not every 1 minute, but every 5 minutes. The main query would remain the same.
It may be more efficient, because here you don't join the big table to itself and you don't make calculations on the datetime column, so the server should be able to use the index on it.
The example that I made assumes that for each minute there is at least one row in the big table. If it is possible that some minutes have no data at all, you'd need to add an extra check in the WHERE clause to make sure that the found row is still within that minute.
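A sketch of that extra check, bounding the nested SELECT from above as well:
SELECT tbl.id
FROM tbl
WHERE tbl.`dt` >= Minutes.MinuteValue
  AND tbl.`dt` < Minutes.MinuteValue + INTERVAL 1 MINUTE
ORDER BY tbl.`dt`
LIMIT 1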
select * from table where timestamp LIKE "%-%-% %:%:00" could work.
This is similar to this question: Stack Overflow Date SQL Query Question
Edit: This probably would work better:
select *, date_format(timestamp, '%Y-%m-%d %H:%i') as the_minute, count(*)
from table
group by the_minute
order by the_minute
Similar to this question here: mysql select date format
I'm not really sure, but you could try this:
SELECT MIN(timestamp) FROM table WHERE YEAR(timestamp)=2015 GROUP BY DATE(timestamp), HOUR(timestamp), MINUTE(timestamp)