This is a follow-up to my previous post, "How to improve wind data SQL query performance".
I have expanded the SQL statement to also perform the first part in the calculation of the average wind direction using circular statistics. This means that I want to calculate the average of the cosines and sines of the wind direction. In my PHP script, I will then perform the second part and calculate the inverse tangent and add 180 or 360 degrees if necessary.
The wind direction is stored in my table as voltages read from the sensor in the field 'dirvolt' so I first need to convert it to radians.
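As a rough sketch of that second step (written in Python here rather than the PHP the question mentions), the mean direction can be recovered from the two averages. It uses `atan2` instead of a plain inverse tangent, so the manual 180/360 degree correction is replaced by a modulo. The calibration `2.04 * dirvolt - 0.12` is copied from the query; the function names are hypothetical.

```python
import math

def volts_to_radians(dirvolt):
    # Same linear calibration as in the SQL statement: 2.04 * dirvolt - 0.12
    return 2.04 * dirvolt - 0.12

def mean_direction_degrees(sin_mean, cos_mean):
    # atan2 selects the correct quadrant from the signs of both arguments;
    # the modulo maps the result from (-180, 180] into [0, 360).
    return math.degrees(math.atan2(sin_mean, cos_mean)) % 360.0

def circular_mean_from_volts(readings):
    # Mirrors AVG(SIN(...)) and AVG(COS(...)) followed by the inverse tangent.
    angles = [volts_to_radians(v) for v in readings]
    sin_mean = sum(math.sin(a) for a in angles) / len(angles)
    cos_mean = sum(math.cos(a) for a in angles) / len(angles)
    return mean_direction_degrees(sin_mean, cos_mean)
```

For example, directions of 350 and 10 degrees average to roughly 0 (north) this way, whereas a plain arithmetic mean would give 180.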
The user can look at historical wind data by stepping backwards using a pagination function, hence the use of LIMIT, whose values are set dynamically in my PHP script.
My SQL statement currently looks like this:
SELECT ROUND(AVG(speed),1) AS speed_mean, MAX(speed) as speed_max,
MIN(speed) AS speed_min, MAX(dt) AS last_dt,
AVG(SIN(2.04*dirvolt-0.12)) as dir_sin_mean,
AVG(COS(2.04*dirvolt-0.12)) as dir_cos_mean
FROM table
GROUP BY FLOOR(UNIX_TIMESTAMP(dt) / 300)
ORDER BY FLOOR(UNIX_TIMESTAMP(dt) / 300) DESC
LIMIT 0, 72
The query takes about 3-8 seconds to run depending on what value I use to group the data (300 in the code above).
In order for me to learn, is there anything I can do to optimize or improve the SQL statement otherwise?
Please provide the output of:
SHOW CREATE TABLE table;
From that I can see if you already have INDEX(dt) (or equivalent). With that, we can modify the SELECT to be significantly faster.
But first, change the focus from 72 buckets of 300 seconds each to a datetime range, which is 6 hours (72 * 300 s = 21,600 s).
Let's look at this query:
SELECT * FROM table
WHERE dt >= '...' - INTERVAL 6 HOUR
AND dt < '...';
The '...' would be the same datetime in both places. Does that run fast enough with the index?
If yes, then let's build the final query using that as a subquery:
SELECT FORMAT(AVG(speed), 1) AS speed_mean,
MAX(speed) as speed_max,
MIN(speed) AS speed_min,
MAX(dt) AS last_dt,
AVG(SIN(2.04*dirvolt-0.12)) as dir_sin_mean,
AVG(COS(2.04*dirvolt-0.12)) as dir_cos_mean
FROM
( SELECT * FROM table
WHERE dt >= '...' - INTERVAL 6 HOUR
AND dt < '...'
) AS x
GROUP BY FLOOR(UNIX_TIMESTAMP(dt) / 300)
ORDER BY FLOOR(UNIX_TIMESTAMP(dt) / 300) DESC;
Explanation: What you had could not use an index, hence had to scan the entire table (which is getting bigger and bigger). My subquery could use an index, hence was much faster. The effort for my outer query was not "too bad" since it worked with only N rows.
Related
I have a table that looks like this:
id  slot              total
1   2022-12-01T12:00  100
2   2022-12-01T12:30  150
3   2022-12-01T13:00  200
There's an index on slot already. The table has ~100mil rows (and a bunch more columns not shown here)
I want to sum the total up to the current moment in time. (Edit: this wasn't clear initially. I will provide a lower slot bound, so the sum will be over some number of days or weeks, not over the full table.) Let's say the time is currently 2022-12-01T12:45. If I run select * from my_table where slot < CURRENT_TIMESTAMP(), then I get back records 1 and 2.
However, in my data, the records represent forecasted sales within a time slot. I want to find the forecasts as of 2022-12-01T12:45, and so I want to find the proportion of the half hour slot of record 2 that has elapsed, and return that proportion of the total.
As of 2022-12-01T12:45 (assuming minute granularity), 50% of row 2 has elapsed, so I would expect row 2 to contribute 150 / 2 = 75, giving 100 + 75 = 175 overall.
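The expected arithmetic can be sketched in Python to make the target result concrete. This is only an illustration of the proration, not a replacement for the SQL; the function name and the `slot_length` parameter are assumptions, and the rows are the three sample rows from the table.

```python
from datetime import datetime, timedelta

def elapsed_total(rows, now, slot_length=timedelta(minutes=30)):
    """Sum totals, counting only the elapsed fraction of each slot.

    rows is an iterable of (slot_start, total) pairs.
    """
    result = 0.0
    for slot_start, total in rows:
        if slot_start > now:
            continue  # slot has not started yet
        # Fraction of the slot that has elapsed, capped at 1 for finished slots.
        fraction = min((now - slot_start) / slot_length, 1.0)
        result += fraction * total
    return result

rows = [
    (datetime(2022, 12, 1, 12, 0), 100),
    (datetime(2022, 12, 1, 12, 30), 150),
    (datetime(2022, 12, 1, 13, 0), 200),
]
now = datetime(2022, 12, 1, 12, 45)
print(elapsed_total(rows, now))  # 175.0
```

Passing `slot_length` as a parameter is one way to avoid the hardcoded 30; irregular slots would need a per-row length instead, for instance derived from the start of the next slot.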
My current query works, but is slow. What are some ways I can optimise this, or other approaches I can take?
Also, how can we extend this solution to be generalised to any interval frequency? Maybe tomorrow we change our forecasting model and the data comes in sporadically. The hardcoded 30 would not work in that case.
select sum(fraction * total) as t from (
    select total,
           LEAST(
               timestampdiff(
                   minute,
                   slot,
                   current_timestamp()
               ),
               30
           ) / 30 as fraction
    from my_table
    where slot <= current_timestamp()
) as x;
Consider computing the full sum first, then subtracting the part of the last slot's total that has not elapsed yet. To keep the last slot's own total available, I'd apply a window function instead of an aggregation, and limit the output to the last row.
SET @current_time = CURRENT_TIMESTAMP();
WITH cte AS (
    SELECT slot,
           SUM(total) OVER(ORDER BY slot) AS total,
           total AS rowtotal
    FROM my_table
    WHERE slot < @current_time
    ORDER BY slot DESC
    LIMIT 1
)
SELECT slot,
       total - (30 - TIMESTAMPDIFF(MINUTE,
                                   slot,
                                   @current_time))
               /30 * rowtotal AS total
FROM cte;
Note1: Adding an index on the slot field is likely to boost this query performance.
Note2: If your query runs over millions of rows, the timestamp may change while it is running. You could store it in a variable before the query is run (or in another cte).
Create a B-tree index on the slot column, since it has high selectivity.
Some background first. We have a MySQL database with a "live currency" table. We use an API to pull the latest currency values for different currencies, every 5 seconds. The table currently has over 8 million rows.
Structure of the table is as follows:
id (INT 11 PK)
currency (VARCHAR 8)
value (DECIMAL)
timestamp (TIMESTAMP)
Now we are trying to use this table to plot the data on a graph. We are going to have various different graphs, e.g: Live, Hourly, Daily, Weekly, Monthly.
I'm having a bit of trouble with the query. Using the Weekly graph as an example, I want to output data from the last 7 days, in 15 minute intervals. So here is how I have attempted it:
SELECT *
FROM currency_data
WHERE ((currency = 'GBP')) AND (timestamp > '2017-09-20 12:29:09')
GROUP BY UNIX_TIMESTAMP(timestamp) DIV (15 * 60)
ORDER BY id DESC
This outputs the data I want, but the query is extremely slow. I have a feeling the GROUP BY clause is the cause.
Also BTW I have switched off the sql mode 'ONLY_FULL_GROUP_BY' as it was forcing me to group by id as well, which was returning incorrect results.
Does anyone know of a better way of doing this query which will reduce the time taken to run the query?
You may want to create summary tables for each of the graphs you want to do.
If your data really is coming every 5 seconds, you can attempt something like:
SELECT *
FROM currency_data cd
WHERE currency = 'GBP' AND
timestamp > '2017-09-20 12:29:09' AND
UNIX_TIMESTAMP(timestamp) MOD (15 * 60) BETWEEN 0 AND 4
ORDER BY id DESC;
For both this query and your original query, you want an index on currency_data(currency, timestamp, id).
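To illustrate how the two expressions behave, here is a small Python sketch of the same integer arithmetic. It assumes readings arrive exactly every 5 seconds, which is the assumption the MOD filter relies on; the sample timestamps are made up.

```python
BUCKET = 15 * 60  # bucket width in seconds, as in the queries above

def bucket_id(ts):
    # Python equivalent of UNIX_TIMESTAMP(timestamp) DIV (15 * 60)
    return ts // BUCKET

def is_bucket_start(ts, tolerance=5):
    # Equivalent of UNIX_TIMESTAMP(timestamp) MOD (15 * 60) BETWEEN 0 AND 4
    # when tolerance is 5 seconds.
    return ts % BUCKET < tolerance

# Hypothetical readings: one every 5 seconds for an hour,
# starting 2 seconds after a 15-minute boundary.
start = BUCKET * 1_673_232 + 2
readings = range(start, start + 3600, 5)

sampled = [ts for ts in readings if is_bucket_start(ts)]
print(len(sampled))                          # 4
print(len({bucket_id(ts) for ts in sampled}))  # 4
```

With perfectly regular data the filter keeps one reading per 15-minute bucket without any GROUP BY. If the polling interval drifts, a bucket could end up with zero or two matching rows, which is one reason summary tables may be the more robust option.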
How can I improve the performance of the query below? What indexes might help?
SELECT platform, country, Source, window,
Round(SUM(ProjectedARPI*PlayerCount) / SUM(PlayerCount), 2) AS ProjectedARPI,
Round(SUM(ProjectedARPIOrganicLow*PlayerCount) / SUM(PlayerCount), 2) AS ProjectedARPIOrganicLow,
Round(SUM(ProjectedARPIOrganicMed*PlayerCount) / SUM(PlayerCount), 2) AS ProjectedARPIOrganicMed,
Round(SUM(ProjectedARPIOrganicHigh*PlayerCount) / SUM(PlayerCount), 2) AS ProjectedARPIOrganicHigh,
SUM(PlayerCount) AS PlayerCount, SUM(PayerCount) AS PayerCount,
CASE WHEN(SUM(PlayerCount) > 500 AND SUM(PayerCount) > 10) THEN TRUE ELSE FALSE END AS isSignificant,
ProjectionDate,
min(CohortRangeLow) as CohortRangeLow,
max(CohortRangeHigh) as CohortRangeHigh
FROM web_synch.UI_data
WHERE PlayerCount > 0 AND ProjectionDate BETWEEN '2015-07-25' AND '2016-10-25' AND window = 365
GROUP BY Platform, country, source, ProjectionDate
ORDER BY Platform, source, ProjectionDate;
For this query, basically your only hope in using indexes is either: UI_data(window, ProjectionDate, PlayerCount) or UI_data(window, PlayerCount, ProjectionDate). Which is better depends on which selects fewer records . . . I would guess the first is better.
I suggest that this is the best index:
INDEX(window, -- first because "="
ProjectionDate -- range
) -- nothing after range will be looked at
This has a slight advantage over the 3-column index previously suggested, in that the index will be slightly smaller.
More discussion: Index cookbook.
I expect there will be two sorts -- one for GROUP BY, then one for ORDER BY. It would run a little faster if you made the ORDER BY and the GROUP BY list identical.
Possible bug: if ProjectionDate is a DATE datatype, then the range is fifteen months plus one day, because BETWEEN includes both endpoints. Recommend this pattern:
ProjectionDate >= '2015-07-25'
AND ProjectionDate < '2015-07-25' + INTERVAL 15 MONTH
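The extra day comes from BETWEEN including both endpoints, while the `>=` / `<` pattern excludes the upper one. A quick Python check with the two dates from the question shows the difference; the variable names are only for illustration.

```python
from datetime import date

low, high = date(2015, 7, 25), date(2016, 10, 25)

# Half-open range [low, high): what ">= low AND < high" covers.
half_open_days = (high - low).days

# Closed range [low, high]: what "BETWEEN low AND high" covers for a DATE column.
inclusive_days = half_open_days + 1

print(half_open_days, inclusive_days)  # 458 459
```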
I'm using CodeIgniter 2 and in my database model, I have a query that joins two tables and filters rows based on distance from a given geolocation.
SELECT users.id,
(3959 * acos(cos(radians(42.327612)) *
cos(radians(last_seen.lat)) * cos(radians(last_seen.lon) -
radians(-77.661591)) + sin(radians(42.327612)) *
sin(radians(last_seen.lat)))) AS distance
FROM users
JOIN last_seen ON users.id = last_seen.seen_id
WHERE users.age >= 18 AND users.age <= 30
HAVING distance < 50
I'm not sure if it's the distance that is making this query take especially long. I do have over 300,000 rows in my users table. The same amount in my last_seen table. I'm sure that plays a role.
But, the age column in the users table is indexed along with the id column.
The lat and lon columns in the last_seen table are also indexed.
Does anyone have ideas as to why this query takes so long and how I can improve it?
UPDATE
It turns out that this query actually runs pretty quickly. When I execute this query in PHPMyAdmin, it takes 0.56 seconds. Not too bad. But, when I try to execute this query with a third party SQL client like SequelPro, it takes at least 20 seconds and all of the other apps on my mac slow down. When the query is executed by loading the script via jQuery's load() method, it takes around the same amount of time.
Upon viewing my network tab in Google Chrome's developer tools, it seems that the reason it's taking so long to load is because of what's called TTFB or Time To First Byte. It's taking forever.
To make this query faster you need to limit the number of rows using an index before actually calculating the distance for each of them. To do so you can limit the rows from last_seen based on their lat/lon and a rough formula for the desired distance.
The idea is that the positions with the same latitude as the reference latitude would be in 50 miles distance if their longitude falls in a certain distance from the reference longitude and vice versa.
For 50 miles distance, RefLat+-1 and RefLon+-1 would be a good start to limit the rows before actually calculating the precise distance.
last_seen.lat BETWEEN 42.327612 - 1 AND 42.327612 + 1
AND last_seen.lon BETWEEN -77.661591 - 1 AND -77.661591 + 1
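A rough way to check that plus or minus 1 degree is wide enough is to convert the radius into degrees. The Python sketch below assumes about 69 miles per degree of latitude and shrinks the longitude scale by the cosine of the latitude; both are approximations, and the function name is hypothetical.

```python
import math

MILES_PER_DEGREE_LAT = 69.0  # approximate length of one degree of latitude

def bounding_box(lat, lon, radius_miles):
    """Return (lat_min, lat_max, lon_min, lon_max) roughly enclosing the radius."""
    dlat = radius_miles / MILES_PER_DEGREE_LAT
    # One degree of longitude is shorter away from the equator.
    dlon = radius_miles / (MILES_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return (lat - dlat, lat + dlat, lon - dlon, lon + dlon)

lat_min, lat_max, lon_min, lon_max = bounding_box(42.327612, -77.661591, 50)
print(lat_max - 42.327612, lon_max - (-77.661591))
```

For the reference point in the question both offsets come out below 1 degree, so the plus or minus 1 box is a safe prefilter there. The exact distance formula still has to run on the rows that pass the box.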
For this query:
SELECT users.id, (3959 * acos(cos(radians(42.327612)) * cos(radians(last_seen.lat)) * cos(radians(last_seen.lon) - radians(-77.661591)) + sin(radians(42.327612)) * sin(radians(last_seen.lat)))) AS distance
FROM users JOIN
last_seen
ON users.id = last_seen.seen_id
WHERE users.age >= 18 AND users.age <= 30
HAVING distance < 50;
The best indexes are users(age, id) and last_seen(seen_id). Unfortunately, the distance calculations are going to take a while, because they have to be calculated for every row. You might want to consider a GIS extension to MySQL to help with this type of query.
Query 1 works but query 2 doesn't:
Query #1:
SELECT * FROM `users` WHERE users.dob <= '1994-1-14' AND users.dob >= '1993-1-14' LIMIT 10
Query #2:
SELECT * FROM `users` WHERE users.dob BETWEEN '1994-1-14' AND '1993-1-14' LIMIT 10
The 2nd one should be able to do the same thing as the first but I don't understand why it's not working.
The dob (date of birth) field in the users table is a type date field with records that look like this:
1988-11-08
1967-11-14
1991-03-09
1958-03-08
1967-06-30
1988-10-19
1986-01-23
1965-09-20
YEAR - MONTH - DAY
With either query #1 or #2 I'm trying to get back all users who are between 18 and 19 years of age, because 1994-1-14 is exactly 18 years from today and 1993-1-14 is 19 years from today. So is there a way to get the between query to work?
By not working I mean it doesn't return any records from the db while the working query does.
Also is the between query more efficient or is the performance difference negligible?
To answer the first part: "expr BETWEEN min AND max". Try switching those 2 dates in the second query.
The usage is wrong. See the BETWEEN documentation:
expr BETWEEN min AND max is equivalent to (min <= expr AND expr <= max).
Therefore, users.dob BETWEEN '1994-1-14' AND '1993-1-14' is the same as ('1994-1-14' <= users.dob AND users.dob <= '1993-1-14'), of which there will never be more than 0 results.
Simply reverse the order :)
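The same logic can be reproduced outside the database. This small Python sketch models BETWEEN as the documented pair of comparisons; the sample date of birth is made up.

```python
from datetime import date

def between(expr, low, high):
    # SQL: expr BETWEEN low AND high  is  (low <= expr AND expr <= high)
    return low <= expr <= high

dob = date(1993, 6, 1)  # hypothetical user, between 18 and 19 on 2012-01-14

# Bounds in the order used by query #2: can never be true.
print(between(dob, date(1994, 1, 14), date(1993, 1, 14)))  # False

# Bounds reversed so that the smaller date comes first.
print(between(dob, date(1993, 1, 14), date(1994, 1, 14)))  # True
```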
There will be no performance difference when using either form, possibly subject to the note below. This transformation happens at the query planner level. However, if you have concerns, remember to profile, profile, profile. Then you can see for yourself and appease the premature-optimization demons.
Also note the ... note:
For best results when using BETWEEN with date or time values, use CAST() to explicitly convert the values to the desired data type.