How to improve wind data SQL query performance - mysql

I'm looking for help on how to optimize (if possible) the performance of a SQL query used for reading wind information (see below), for example by changing the database structure, the query, or something else.
I use a hosted database to store a table with more than 800,000 rows with wind information (speed and direction). New data is added each minute from an anemometer. The database is accessed using a PHP script which creates a web page for plotting the data using Google's visualization API.
The web page takes approximately 15 seconds to load. I've added some time measurements in both the PHP and JavaScript parts to profile the code and find possible areas for improvement.
One part where I hope to improve is the following query which takes approximately 4 seconds to execute. The purpose of the query is to group 15 minutes of wind speed (min/max/mean) and calculate the mean value and total min/max during this period of measurements.
SELECT AVG(d_mean) AS group_mean,
MAX(d_max) as group_max,
MIN(d_min) AS group_min,
dir,
FROM_UNIXTIME(MAX(dt),'%Y-%m-%d %H:%i') AS group_dt
FROM (
SELECT @i:=@i+1,
FLOOR(@i/15) AS group_id,
CAST(mean AS DECIMAL(3,1)) AS d_mean,
CAST(min AS DECIMAL(3,1)) AS d_min,
CAST(max AS DECIMAL(3,1)) AS d_max,
dir,
UNIX_TIMESTAMP(STR_TO_DATE(dt, '%Y-%m-%d %H:%i')) AS dt
FROM table, (SELECT @i:=-1) VAR_INIT
ORDER BY id DESC
) AS T
GROUP BY group_id
LIMIT 0, 360
...
$oResult = mysql_query($sSQL);
The table has the following structure:
1 ID int(11) AUTO_INCREMENT
2 mean varchar(5) utf8_general_ci
3 max varchar(5) utf8_general_ci
4 min varchar(5) utf8_general_ci
5 dt varchar(20) utf8_general_ci // Date and time
6 dir varchar(5) utf8_general_ci
The following setup is used:
Database: MariaDB, 5.5.42-MariaDB-1~wheezy
Database client version: libmysql - 5.1.66
PHP version: 5.6
PHP extension: mysqli

I strongly agree with the comments so far -- Cleanse the data as you put it into the table.
Once you have done the cleansing, let's avoid the subquery by doing...
SELECT MIN(dt) as 'Start of 15 mins',
FORMAT(AVG(mean), 1) as 'Avg wind speed',
...
FROM table
GROUP BY FLOOR(UNIX_TIMESTAMP(dt) / 900)
ORDER BY FLOOR(UNIX_TIMESTAMP(dt) / 900);
I don't understand the purpose of the LIMIT. I'll guess that you want to see a few days at a time. For that, I recommend you add the following (after cleansing) between the FROM and the GROUP BY:
WHERE dt >= '2015-04-10'
AND dt < '2015-04-10' + INTERVAL 7 DAY
That would show 7 days, starting at midnight on '2015-04-10'.
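To make the grouping concrete, here is a minimal Python sketch of what GROUP BY FLOOR(UNIX_TIMESTAMP(dt) / 900) does to the rows. The function name `summarize` and the sample readings are made up for illustration, and the datetimes are assumed to be timezone-aware UTC values; it is not the query itself, just the same bucketing arithmetic.

```python
from collections import defaultdict
from datetime import datetime, timezone

BUCKET_SECONDS = 900  # 15 minutes, as in FLOOR(UNIX_TIMESTAMP(dt) / 900)

def summarize(rows, bucket_seconds=BUCKET_SECONDS):
    """rows: iterable of (aware datetime, mean speed).

    Returns [(start of bucket, average speed)] ordered by time,
    mirroring MIN(dt) and AVG(mean) per 15-minute group.
    """
    buckets = defaultdict(list)
    for dt, speed in rows:
        # Integer division of the epoch seconds gives the group key.
        key = int(dt.timestamp()) // bucket_seconds
        buckets[key].append((dt, speed))
    result = []
    for key in sorted(buckets):
        group = buckets[key]
        start = min(dt for dt, _ in group)
        average = sum(speed for _, speed in group) / len(group)
        result.append((start, round(average, 1)))
    return result

# Hypothetical readings: two fall in the first quarter hour, one in the next.
readings = [
    (datetime(2015, 4, 10, 0, 0, tzinfo=timezone.utc), 4.0),
    (datetime(2015, 4, 10, 0, 14, tzinfo=timezone.utc), 6.0),
    (datetime(2015, 4, 10, 0, 15, tzinfo=timezone.utc), 8.0),
]
summary = summarize(readings)
```

Rows that share the same integer quotient land in the same group, which is why no row counter or subquery is needed.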
In order to handle a table of 800K, you would decidedly need (again, after cleansing):
INDEX(dt)
To cleanse the 800K rows, there are multiple approaches. I suggest creating a new table, copying the data in, testing, and eventually swapping over. Something like...
CREATE TABLE new (
dt DATETIME,
mean FLOAT,
...
PRIMARY KEY(dt) -- assuming you have only one row per minute?
) ENGINE=InnoDB;
INSERT INTO new (dt, mean, ...)
SELECT str_to_date(...),
mean, -- I suspect that the CAST is not needed
...;
Write the new select and test it.
By now new is missing the newer rows. You can either rebuild it and hope to finish everything in your one minute window, or play some other game. Let us know if you want help there.
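If you prefer to cleanse in the script that inserts the data rather than in SQL, the per-row conversion could look roughly like this. It is only a sketch: `cleanse_row` is a hypothetical helper, the sample values are invented, and it assumes dt is stored as text in the '%Y-%m-%d %H:%i' layout used by STR_TO_DATE in the question (which is "%Y-%m-%d %H:%M" in Python's notation).

```python
from datetime import datetime

def cleanse_row(dt_text, mean_text, min_text, max_text):
    """Turn one raw VARCHAR row into typed values.

    Raises ValueError on malformed input, so bad rows are found
    before they reach the new table.
    """
    dt = datetime.strptime(dt_text.strip(), "%Y-%m-%d %H:%M")
    return dt, float(mean_text), float(min_text), float(max_text)

# Hypothetical raw row as it might come out of the old table.
clean = cleanse_row("2015-04-10 12:34", "3.5", "1.2", "6.0")
```

Doing the conversion once, at insert time, is what lets the later queries compare and index dt directly.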

Related

MySQL - group by interval query optimisation

Some background first. We have a MySQL database with a "live currency" table. We use an API to pull the latest currency values for different currencies, every 5 seconds. The table currently has over 8 million rows.
Structure of the table is as follows:
id (INT 11 PK)
currency (VARCHAR 8)
value (DECIMAL)
timestamp (TIMESTAMP)
Now we are trying to use this table to plot the data on a graph. We are going to have various different graphs, e.g: Live, Hourly, Daily, Weekly, Monthly.
I'm having a bit of trouble with the query. Using the Weekly graph as an example, I want to output data from the last 7 days, in 15 minute intervals. So here is how I have attempted it:
SELECT *
FROM currency_data
WHERE ((currency = 'GBP')) AND (timestamp > '2017-09-20 12:29:09')
GROUP BY UNIX_TIMESTAMP(timestamp) DIV (15 * 60)
ORDER BY id DESC
This outputs the data I want, but the query is extremely slow. I have a feeling the GROUP BY clause is the cause.
Also BTW I have switched off the sql mode 'ONLY_FULL_GROUP_BY' as it was forcing me to group by id as well, which was returning incorrect results.
Does anyone know of a better way of doing this query which will reduce the time taken to run the query?
You may want to create summary tables for each of the graphs you want to do.
If your data really is coming every 5 seconds, you can attempt something like:
SELECT *
FROM currency_data cd
WHERE currency = 'GBP' AND
timestamp > '2017-09-20 12:29:09' AND
UNIX_TIMESTAMP(timestamp) MOD (15 * 60) BETWEEN 0 AND 4
ORDER BY id DESC;
For both this query and your original query, you want an index on currency_data(currency, timestamp, id).
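To see why the MOD filter keeps one row per 15 minutes, here is a small Python sketch. It assumes readings arrive exactly every 5 seconds; the start value is an arbitrary epoch number and `on_quarter_hour` is a name made up for this example.

```python
def on_quarter_hour(epoch_seconds, period=15 * 60, tolerance=4):
    """True when the reading falls within the first few seconds of a
    15-minute period, like UNIX_TIMESTAMP(ts) MOD 900 BETWEEN 0 AND 4."""
    return epoch_seconds % period <= tolerance

# One hour of hypothetical readings, one every 5 seconds.
start = 1505910549
samples = range(start, start + 3600, 5)
picked = [t for t in samples if on_quarter_hour(t)]
```

Because 900 is a multiple of 5, exactly one reading per period lands in the 0 to 4 second window. If the feed skips or drifts, some periods would return no row at all, which is one reason a summary table is the more robust option.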

MySQL Comparing Times of Different Formats

I am working with a database full of songs, with titles and durations.
I need to return all songs with a duration greater than 29:59 (MM:SS).
The data is formatted in two different ways.
Format 1
Most of the data in the table is formatted as MM:SS, with some songs being greater than 60 minutes formatted for example as 72:15.
Format 2
Other songs in the table are formatted as HH:MM:SS, where the example given for Format 1 would instead be 01:12:15.
I have tried two different types of queries to solve this problem.
Query 1
The following query returns all of the values that I seek to return for Format 1, but I could not find a way to get values included for Format 2.
select title, duration from songs where
time(cast(duration as time)) >
time(cast('29:59' as time))
Query 2
With the next query, I hoped to use the format specifiers in str_to_date to locate those results with the format HH:MM:SS, but instead I received results such as 3:50. The interpreter is assuming that all of the data is of the form HH:MM, and I do not know how to tell it otherwise without ruining the results.
select title, duration from songs where
time(cast(str_to_date(duration, '%H:%i:%s') as time)) >
time(cast(str_to_date('00:29:59', '%H:%i:%s') as time))
I've tried changing the specifiers in the first call to str_to_date to %i:%s, which gives me all values greater than 29:59, but none greater than 59:59. This is worse than the original query. I've also tried 00:%i:%s and '00:' || duration, '%H:%i:%s'. These two in particular would ruin the results anyway, but I'm just fiddling at this point.
I'm thoroughly stumped, but I'm sure the solution is an easy one. Any help is appreciated.
EDIT: Here is some data requested from the comments below.
Results from show create table:
CREATE TABLE `songs` (
`song_id` int(11) NOT NULL,
`title` varchar(100) NOT NULL,
`duration` varchar(20) DEFAULT NULL,
PRIMARY KEY (`song_id`),
UNIQUE KEY `songs_uq` (`title`,`duration`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Keep in mind, there are more columns than I described above, but I left some out for the sake of simplicity. I will also leave them out in the sample data.
Sample Data
title duration
(Allegro Moderato) 3:50
Agatha 1:56
Antecessor Machine 06:16
Very Long Song 01:24:16
Also Very Long 2:35:22
You are storing unstructured data in a relational database. And that is making you unhappy. So structure it.
Either add a TIME column, or copy song_id into a parallel time table on the side that you can JOIN against. Select all the two-colon durations and trivially update TIME. Repeat, prepending '00:' to all the one-colon durations. Now you have parsed all rows, and can safely ignore the duration column.
Ok, fine, I suppose you could construct a VIEW that offers UNION ALL of those two queries, but that is slow and ugly. It is much better to fix the on-disk data.
Forget times. Convert to seconds. Here is one way:
select s.*
from (select s.*,
( substring_index(duration, ':', -1) + 0 +
substring_index(substring_index(duration, ':', -2), ':', 1) * 60 +
(case when duration like '%:%:%' then substring_index(duration, ':', 1) * 60*60
else 0
end)
) as duration_seconds
from songs s
) s
where duration_seconds > 29*60 + 59;
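The same arithmetic is easier to read outside SQL. This Python sketch (the function name `duration_seconds` is just for illustration) treats a one-colon value as MM:SS and a two-colon value as HH:MM:SS, which is the assumption the query above makes as well.

```python
def duration_seconds(text):
    """Convert 'MM:SS' or 'HH:MM:SS' text into a number of seconds."""
    parts = [int(p) for p in text.strip().split(":")]
    if len(parts) == 2:
        parts = [0] + parts  # no hours field, so treat it as 0 hours
    if len(parts) != 3:
        raise ValueError("unexpected duration format: %r" % text)
    hours, minutes, seconds = parts
    return hours * 3600 + minutes * 60 + seconds

THRESHOLD = 29 * 60 + 59  # 29:59 expressed in seconds

# Durations taken from the sample data in the question.
sample = ["3:50", "1:56", "06:16", "01:24:16", "2:35:22"]
long_songs = [d for d in sample if duration_seconds(d) > THRESHOLD]
```

Minutes above 59 (such as 72:15) need no special handling, because the fields are only multiplied and added.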
After some research I have come up with an answer of my own that I am happy with.
select title, duration from songs where
case
when length(duration) - length(replace(duration, ':', '')) = 1
then time_to_sec(duration) > time_to_sec('29:59')
else time_to_sec(duration) > time_to_sec('00:29:59')
end
Thank you to Gordon Linoff for suggesting that I convert the times to seconds. This made things much easier. I just found his solution a bit overcomplicated, and it reinvents the wheel by not using time_to_sec.
Output Data
title duration
21 Album Mix Tape 45:40
Act 1 1:20:25
Act 2 1:12:05
Agog Opus I 30:00
Among The Vultures 2:11:00
Anabasis 1:12:00
Avalanches Mixtape 60:00
Beautiful And Timeless 73:46
Beggars Banquet Tracks 76:07
Bonus Tracks 68:55
Chindogu 66:23
Spun 101:08
Note: Gordon mentioned his reason for not using time_to_sec was to account for songs greater than 23 hours long. After testing, I found that time_to_sec does support hours larger than 23, just as it supports minutes greater than 59.
It is also perfectly fine with other non-conforming formats such as 1:4:32 (i.e. 01:04:32).

How to count entries in a mysql table grouped by time

I've found lots of not quite the answers to this question, but nothing I can base my rather limited sql skills on...
I've got a gas meter, which gives a pulse every cm3 of gas used - the time the pulses happen is obtained by a pi and stored in a mysql db. I'm trying to graph the db. In order to graph the data, I want to sum how many pulses are received every n time period, where n may be 5 mins for a graph covering a day, or up to 24 hours for a graph covering a year.
The data are in a table which has two columns, a primary key/auto inc called "pulse_ref" and "pulse_time" which stores a unix timestamp of the time a pulse was received.
Can anyone suggest a sql query to count how many pulses occurred, grouped into, say, 5-minute intervals?
Create table:
CREATE TABLE `gas_pulse` (
`pulse_ref` int(11) NOT NULL AUTO_INCREMENT,
`pulse_time` int(11) DEFAULT NULL,
PRIMARY KEY (`pulse_ref`));
Populate some data:
INSERT INTO `gas_pulse` VALUES (1,1477978978),(2,1477978984),(3,1477978990),(4,1477978993),(5,1477979016),(6,1477979063),(7,1477979111),(8,1477979147),(9,1477979173),(10,1477979195),(11,1477979214),(12,1477979232),(13,1477979249),(14,1477979267),(15,1477979285),(16,1477979302),(17,1477979320),(18,1477979337),(19,1477979355),(20,1477979372),(21,1477979390),(22,1477979408),(23,1477979425),(24,1477979443),(25,1477979461),(26,1477979479),(27,1477979497),(28,1477979515),(29,1477979533),(30,1477979551),(31,1477979568),(32,1477979586),(33,1477980142),(34,1477980166),(35,1477981433),(36,1477981474),(37,1477981526),(38,1477981569),(39,1477981602),(40,1477981641),(41,1477981682),(42,1477981725),(43,1477981770),(44,1477981816),(45,1477981865),(46,1477981915),(47,1477981966),(48,1477982017),(49,1477982070),(50,1477982124),(51,1477982178),(52,1477982233),(53,1477988261),(54,1477988907),(55,1478001784),(56,1478001807),(57,1478002385),(58,1478002408),(59,1478002458),(60,1478002703),(61,1478002734),(62,1478002784),(63,1478002831),(64,1478002863),(65,1478002888),(66,1478002909),(67,1478002928),(68,1478002946),(69,1478002964),(70,1478002982),(71,1478003000),(72,1478003018),(73,1478003036),(74,1478003054),(75,1478003072),(76,1478003090),(77,1478003108),(78,1478003126),(79,1478003145),(80,1478003163),(81,1478003181),(82,1478003199),(83,1478003217),(84,1478003235),(85,1478003254),(86,1478003272),(87,1478003290),(88,1478003309),(89,1478003327),(90,1478003346),(91,1478003366),(92,1478003383),(93,1478003401),(94,1478003420),(95,1478003438),(96,1478003457),(97,1478003476),(98,1478003495),(99,1478003514),(100,1478003533),(101,1478003552),(102,1478003572),(103,1478003592),(104,1478003611),(105,1478003632),(106,1478003652),(107,1478003672),(108,1478003693),(109,1478003714),(110,1478003735),(111,1478003756),(112,1478003778),(113,1478003799),(114,1478003821),(115,1478003844),(116,1478003866),(117,1478003889),(118,1478003912),(119,1478003936),(120,1478003960),(121,1478003984),(122,1478004008),
(123,1478004033),(124,1478004058),(125,1478004084),(126,1478004109),(127,1478004135),(128,1478004161),(129,1478004187),(130,1478004214),(131,1478004241),(132,1478004269),(133,1478004296),(134,1478004324),(135,1478004353),(136,1478004381),(137,1478004410),(138,1478004439),(139,1478004469),(140,1478004498),(141,1478004528),(142,1478004558),(143,1478004589),(144,1478004619),(145,1478004651),(146,1478004682),(147,1478004714),(148,1478004746),(149,1478004778),(150,1478004811),(151,1478004844),(152,1478004877),(153,1478004911),(154,1478004945),(155,1478004979),(156,1478005014),(157,1478005049),(158,1478005084),(159,1478005120),(160,1478005156),(161,1478005193),(162,1478005231),(163,1478005268),(164,1478005306),(165,1478005344),(166,1478005383),(167,1478005422),(168,1478005461),(169,1478005501),(170,1478005541),(171,1478005582),(172,1478005622),(173,1478005663),(174,1478005704),(175,1478005746),(176,1478005788),(177,1478005831),(178,1478005873),(179,1478005917),(180,1478005960),(181,1478006004),(182,1478006049),(183,1478006094),(184,1478006139),(185,1478006186),(186,1478006231),(187,1478006277),(188,1478010694),(189,1478010747),(190,1478010799),(191,1478010835),(192,1478010862),(193,1478010884),(194,1478010904),(195,1478010924),(196,1478010942),(197,1478010961),(198,1478010980),(199,1478010999),(200,1478011018),(201,1478011037),(202,1478011056),(203,1478011075),(204,1478011094),(205,1478011113),(206,1478011132),(207,1478011151),(208,1478011170),(209,1478011189),(210,1478011208),(211,1478011227),(212,1478011246),(213,1478011265),(214,1478011285),(215,1478011304),(216,1478011324),(217,1478011344),(218,1478011363),(219,1478011383),(220,1478011403),(221,1478011423),(222,1478011443),(223,1478011464),(224,1478011485),(225,1478011506),(226,1478011528),(227,1478011549),(228,1478011571),(229,1478011593),(230,1478011616),(231,1478011638),(232,1478011662),(233,1478011685),(234,1478011708),(235,1478011732),(236,1478011757),(237,1478011782),(238,1478011807),(239,1478011832),
(240,1478011858),(241,1478011885),(242,1478011912),(243,1478011939),(244,1478011967),(245,1478011996),(246,1478012025),(247,1478012054),(248,1478012086),(249,1478012115),(250,1478012146),(251,1478012178),(252,1478012210),(253,1478012244),(254,1478012277),(255,1478012312),(256,1478012347),(257,1478012382),(258,1478012419),(259,1478012456),(260,1478012494),(261,1478012531),(262,1478012570),(263,1478012609),(264,1478012649),(265,1478012689),(266,1478012730),(267,1478012771),(268,1478012813),(269,1478012855),(270,1478012898),(271,1478012941),(272,1478012984),(273,1478013028),(274,1478013072),(275,1478013117),(276,1478013163),(277,1478013209),(278,1478013255),(279,1478013302),(280,1478013350),(281,1478013399),(282,1478013449),(283,1478013500),(284,1478013551),(285,1478013604),(286,1478013658),(287,1478013714),(288,1478013771),(289,1478013830),(290,1478013891),(291,1478013954),(292,1478014019),(293,1478014086),(294,1478014156),(295,1478014228),(296,1478014301),(297,1478014373),(298,1478014446),(299,1478014518),(300,1478014591),(301,1478014664),(302,1478014736),(303,1478014809),(304,1478014882),(305,1478015377),(306,1478015422),(307,1478015480),(308,1478015543),(309,1478015608),(310,1478015676),(311,1478015740),(312,1478015803),(313,1478015864),(314,1478015921),(315,1478015977),(316,1478016030),(317,1478016081),(318,1478016129),(319,1478016176);
I assume you need to get the pulse count in n-minute (in your case 5-minute) intervals. To achieve this, try the following query:
SELECT
COUNT(*) AS gas_pulse_count,
FROM_UNIXTIME(pulse_time - MOD(pulse_time, 5 * 60)) from_time,
FROM_UNIXTIME((pulse_time - MOD(pulse_time, 5 * 60)) + 5 * 60) to_time
FROM
gas_pulse
GROUP BY from_time
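For anyone who wants to check the bucketing by hand, this Python sketch applies the same pulse_time - MOD(pulse_time, 5 * 60) arithmetic to a few of the timestamps from the sample data. The function name `count_per_interval` is an assumption made for this example.

```python
from collections import Counter

def count_per_interval(pulse_times, interval=5 * 60):
    """Count pulses per interval, keyed by the interval's start time
    (pulse_time rounded down to a multiple of `interval` seconds)."""
    counts = Counter(t - t % interval for t in pulse_times)
    return sorted(counts.items())

# A handful of pulse_time values from the INSERT statement above.
pulses = [1477978978, 1477978984, 1477978990,
          1477979111, 1477979147, 1477980142]
per_five_minutes = count_per_interval(pulses)
```

As with the SQL GROUP BY, intervals without any pulse simply do not appear in the result, so a graph may need to fill those gaps with zero.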

Speed up SQL SELECT with arithmetic and geometric calculations

This is a follow-up to my previous post How to improve wind data SQL query performance.
I have expanded the SQL statement to also perform the first part in the calculation of the average wind direction using circular statistics. This means that I want to calculate the average of the cosines and sines of the wind direction. In my PHP script, I will then perform the second part and calculate the inverse tangent and add 180 or 360 degrees if necessary.
The wind direction is stored in my table as voltages read from the sensor in the field 'dirvolt' so I first need to convert it to radians.
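For reference, the second part of the calculation (the part done outside SQL) can be sketched as follows. This is shown in Python rather than PHP and uses atan2 instead of a plain inverse tangent, so the quadrant correction happens automatically; the function names are made up, and the 2.04 and 0.12 constants are the ones from the query in this post.

```python
import math

def volt_to_radians(dirvolt):
    """Sensor voltage to radians, using the linear conversion from the query."""
    return 2.04 * dirvolt - 0.12

def mean_direction_degrees(sin_mean, cos_mean):
    """Combine AVG(SIN(...)) and AVG(COS(...)) into a direction in [0, 360).

    atan2 picks the correct quadrant from the signs of both arguments,
    and the modulo maps negative angles into the 0..360 range.
    """
    return math.degrees(math.atan2(sin_mean, cos_mean)) % 360.0

# Example: averages that point to the second quadrant.
direction = mean_direction_degrees(0.5, -0.5)
```

If both averages are close to zero the directions cancel out and the resulting angle is not meaningful, so that case may deserve a separate check.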
The user can look at historical wind data by stepping backwards using a pagination function, hence the use of LIMIT, whose values are set dynamically in my PHP script.
My SQL statement currently looks like this:
SELECT ROUND(AVG(speed),1) AS speed_mean, MAX(speed) as speed_max,
MIN(speed) AS speed_min, MAX(dt) AS last_dt,
AVG(SIN(2.04*dirvolt-0.12)) as dir_sin_mean,
AVG(COS(2.04*dirvolt-0.12)) as dir_cos_mean
FROM table
GROUP BY FLOOR(UNIX_TIMESTAMP(dt) / 300)
ORDER BY FLOOR(UNIX_TIMESTAMP(dt) / 300) DESC
LIMIT 0, 72
The query takes about 3-8 seconds to run depending on what value I use to group the data (300 in the code above).
In order for me to learn, is there anything I can do to optimize or improve the SQL statement otherwise?
Please provide the output of SHOW CREATE TABLE table;
From that I can see if you already have INDEX(dt) (or equivalent). With that, we can modify the SELECT to be significantly faster.
But first, change the focus from 72*300 seconds worth of readings to datetime ranges, which is 6(?) hours.
Let's look at this query:
SELECT * FROM table
WHERE dt >= '...' - INTERVAL 6 HOUR
AND dt < '...';
The '...' would be the same datetime in both places. Does that run fast enough with the index?
If yes, then let's build the final query using that as a subquery:
SELECT FORMAT(AVG(speed), 1) AS speed_mean,
MAX(speed) as speed_max,
MIN(speed) AS speed_min,
MAX(dt) AS last_dt,
AVG(SIN(2.04*dirvolt-0.12)) as dir_sin_mean,
AVG(COS(2.04*dirvolt-0.12)) as dir_cos_mean
FROM
( SELECT * FROM table
WHERE dt >= '...' - INTERVAL 6 HOUR
AND dt < '...'
) AS x
GROUP BY FLOOR(UNIX_TIMESTAMP(dt) / 300)
ORDER BY FLOOR(UNIX_TIMESTAMP(dt) / 300) DESC;
Explanation: What you had could not use an index, hence had to scan the entire table (which is getting bigger and bigger). My subquery could use an index, hence was much faster. The effort for my outer query was not "too bad" since it worked with only N rows.

MySQL - SQLite How to improve this very simple query?

I have one simple but large table.
id_tick INTEGER eg: 1622911
price DOUBLE eg: 1.31723
timestamp DATETIME eg: '2010-04-28 09:34:23'
For 1 month of data, I have 2.3 million rows (150MB).
My query aims at returning the latest price at a given time.
I first set up a SQLite table and used the query:
SELECT max(id_tick), price, timestamp
FROM EURUSD
WHERE timestamp <='2010-04-16 15:22:05'
It is running in 1.6s.
As I need to run this query several thousand times, 1.6s is far too long...
I then set up a MySQL table and modified the query (the max function differs from MySQL to SQLite):
SELECT id_tick, price, timestamp
FROM EURUSD
WHERE id_tick = (SELECT MAX(id_tick)
FROM EURUSD WHERE timestamp <='2010-04-16 15:22:05')
Execution time got far worse: 3.6s.
(I know I can avoid the sub query using ORDER BY and LIMIT 1 but it does not improve the execution time.)
I am only using one month of data for now, but I will have to use several years at some point.
My questions are then the following:
is there a way to improve my query?
given the large dataset, should I use another database engine?
any tips ?
Thanks !
1) Make sure you have an index on timestamp
2) Assuming that id_tick is both the PRIMARY KEY and the clustered index, and that id_tick increases with time (since you are doing a MAX), you can try this:
SELECT id_tick, price, timestamp
FROM EURUSD
WHERE id_tick = (SELECT id_tick
FROM EURUSD WHERE timestamp <='2010-04-16 15:22:05'
ORDER BY id_tick DESC
LIMIT 1)
This should be similar to janmoesen's performance though, since there should be high page correlation between id_tick and timestamp in any event
Do you have any indexed fields ?
indexing timestamp and/or id_tick could change a lot of things.
Also why don't you use an interval for timestamp ?
WHERE timestamp >= '2010-04-15 15:22:05' AND timestamp <= '2010-04-16 15:22:05'
that would ease the burden of the MAX function.
Are you doing analysis using ALL the ticks for large intervals? I'd try to filter the data into minute/hour/day etc. graphs.
OK, I guess my index was corrupted somehow; reindexing greatly improved the performance.
The following is now executed in 0.0012s (non cached)
SELECT id_tick, price, timestamp
FROM EURUSD
WHERE timestamp <= '2010-05-11 05:30:10'
ORDER by id_tick desc
LIMIT 1
Thanks!
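For anyone trying this on the SQLite side, here is a minimal self-contained sketch using Python's sqlite3 module. The index name, the three sample rows and the `latest_price` helper are made up for illustration. It also orders by timestamp rather than id_tick, on the assumption that both increase together, so that the index on timestamp can serve both the filter and the ordering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EURUSD ("
    " id_tick INTEGER PRIMARY KEY,"
    " price DOUBLE,"
    " timestamp DATETIME)"
)
# The index is what turns the lookup into a seek instead of a full scan.
conn.execute("CREATE INDEX idx_eurusd_timestamp ON EURUSD (timestamp)")

# Hypothetical ticks, stored as 'YYYY-MM-DD HH:MM:SS' text so that
# string comparison matches chronological order.
conn.executemany(
    "INSERT INTO EURUSD VALUES (?, ?, ?)",
    [
        (1, 1.3170, "2010-04-16 15:22:00"),
        (2, 1.3172, "2010-04-16 15:22:05"),
        (3, 1.3175, "2010-04-16 15:22:10"),
    ],
)

def latest_price(connection, at):
    """Return (id_tick, price, timestamp) of the last tick at or before `at`."""
    return connection.execute(
        "SELECT id_tick, price, timestamp FROM EURUSD"
        " WHERE timestamp <= ?"
        " ORDER BY timestamp DESC, id_tick DESC"
        " LIMIT 1",
        (at,),
    ).fetchone()

row = latest_price(conn, "2010-04-16 15:22:05")
```

Passing the timestamp as a bound parameter also helps when the query is run thousands of times, since the statement text stays the same.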