I'm using a MySQL database to store values from an energy measurement system. The problem is that the DB contains millions of rows, and the queries take quite a while to complete. Are the queries optimal? What should I do to improve them?
The database table has 15 columns per row (t, UL1, UL2, UL3, PL1, PL2, PL3, P, Q1, Q2, Q3, CosPhi1, CosPhi2, CosPhi3, i), where t is the time, P is the total power and i is an identifier.
Since I display the data in graphs grouped into different intervals (15 minutes, 1 hour, 1 day, 1 month), I want to group the queries the same way.
As an example I have a graph that shows the kWh for every day in the current year. The query to gather the data goes like this:
SELECT t, SUM(P) as P
FROM table
WHERE i = 0 and t >= '2015-01-01 00:00:00'
GROUP BY DAY(t), MONTH(t)
ORDER BY t
The database has been gathering measurements for 13 days, and this query alone is already taking 2-3 seconds to complete. Those 13 days have added about 1-1.3 million rows to the db, as a new row gets added every second.
Is this query optimal?
I would actually create a secondary table with a row for each day and a column for the daily total. Then, via a trigger, your insert into the detail table can update the secondary aggregate table. This way you can sum the DAILY table, which will be much quicker, and yet still have the per-second table if you need to look at the granular details.
Aggregate tables like this are a common time-saver for querying, especially for read-only data or data you know won't be changing. Then, if you want more granular detail such as hourly or 15-minute intervals, go directly to the raw data.
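A minimal sketch of that approach, assuming the raw table is named measurements (the question doesn't give its real name), with illustrative names for the summary table and trigger:
-- Hypothetical daily summary table: one row per identifier per day
CREATE TABLE daily_totals (
    i   INT    NOT NULL,
    day DATE   NOT NULL,
    P   DOUBLE NOT NULL DEFAULT 0,
    PRIMARY KEY (i, day)
);

-- Keep the summary current as raw rows arrive
CREATE TRIGGER trg_measurements_ai
AFTER INSERT ON measurements
FOR EACH ROW
    INSERT INTO daily_totals (i, day, P)
    VALUES (NEW.i, DATE(NEW.t), NEW.P)
    ON DUPLICATE KEY UPDATE P = P + NEW.P;
The per-day graph query then scans at most 365 rows per identifier:
SELECT day, P
FROM daily_totals
WHERE i = 0 AND day >= '2015-01-01'
ORDER BY day;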
For this query:
SELECT t, SUM(P) as P
FROM table
WHERE i = 0 and t >= '2015-01-01 00:00:00'
GROUP BY DAY(t), MONTH(t)
ORDER BY t
The optimal index is a covering index: table(i, t, P).
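For example, using `table` as a stand-in for your real table name (the backticks are needed because TABLE is a reserved word):
CREATE INDEX ix_i_t_p ON `table` (i, t, P);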
2-3 seconds for 1+ million rows suggests that you already have an index.
You may want to consider DRapp's suggestion and use summary tables. In a few months, you will have so much data that historical queries could be taking a long time.
In the meantime, though, indexes and partitioning might provide sufficient performance for your needs.
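If you try partitioning, range-partitioning on the time column is the usual shape for this kind of append-only data. A sketch, assuming t is part of the primary key (MySQL requires the partitioning column to be part of every unique key):
ALTER TABLE `table`
PARTITION BY RANGE (TO_DAYS(t)) (
    PARTITION p201501 VALUES LESS THAN (TO_DAYS('2015-02-01')),
    PARTITION p201502 VALUES LESS THAN (TO_DAYS('2015-03-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);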
Related
I'm probably approaching this all wrong, but how can I best do this query? I only want the number of records within 1 day of the min timestamp for each StudyID. This query gets the answer I need, but it takes forever with a million records. The subquery runs by itself in seconds, and the main query without the subquery runs in seconds, but together they are terribly slow.
SELECT st.StudyID, COUNT(*)
FROM Studies st
JOIN (SELECT StudyID, DATE_ADD(MIN(TimeStamp), INTERVAL 1440 MINUTE) AS MaxTime
      FROM Studies GROUP BY StudyID) AS d1
  ON d1.StudyID = st.StudyID
WHERE st.TimeStamp <= d1.MaxTime
GROUP BY st.StudyID
The output should obviously be this:
ID COUNT(*)
2 3
4 4
In the absence of a query plan, I would guess your query is reading the whole table twice. There's not much you can do about that (this strategy already looks optimal) if the table really does have a massive number of rows.
A covering index may help, and I would try the following one:
create index ix1 on Studies (StudyID, TimeStamp);
For a task in a managerial accounting context, I built a relatively large SQL query on a MySQL database. The query is close to 600 lines long and produces a large result table with the economic analysis for different products.
This works fine so far, and the query takes only about 3 seconds.
But the outcome is the analysis for only one month. Now we would like to execute the query for several months and aggregate the results.
I could simply widen the query's time-period condition (currently just one month). But that would produce an incorrect (averaged) distribution of overhead costs, because larger monthly fluctuations in certain key figures would be ignored.
Therefore, I think I would have to generate one (sub-)table per month I want to analyze, and finally aggregate all these sub-tables with an overarching main query. That should work, but the query would then be really large: for 12 months, about 12 x 600 lines for the sub-queries plus roughly another 100 lines for the main query.
This leads to my question: is this how one would normally do it? Not knowing better, it seems to me an unusually large query that would also be cumbersome to maintain. What would be the best-practice way to accomplish this task?
Thank you
If the data is static once the month is over, you can run your SELECT at the beginning of each month (to calculate the previous month) and store the result in a table with an extra "month" column.
insert into monthly_aggregation (month, ...)
select ... <600 lines of SQL for specific month>
This can be triggered at the beginning of every month.
If historical data can change, you have to rebuild the whole table by executing the INSERT ... SELECT once per month.
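A minimal sketch of such a summary table, with illustrative columns matching the example query below (your real table would have one column per key figure):
-- Hypothetical summary table; append (or rebuild) one month at a time
CREATE TABLE monthly_aggregation (
    month           INT NOT NULL,
    product_id      INT NOT NULL,
    total_purchased INT NOT NULL,
    avg_price       DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (month, product_id)
);

INSERT INTO monthly_aggregation (month, product_id, total_purchased, avg_price)
SELECT 6, product_id, SUM(purchased), AVG(price)
FROM <many tables>
WHERE month = 6
GROUP BY product_id;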
Let's say this is your query showing products for a particular month:
select product_id, sum(purchased), avg(price), ...
from <many tables>
where month = 6
group by product_id;
Then you can change it thus to have it show product data per month:
select month, product_id, sum(purchased), avg(price), ...
from <many tables>
group by month, product_id;
You can then work on this with an outer query:
select ...
from
(
    select month, product_id, sum(purchased), avg(price), ...
    from <many tables>
    group by month, product_id
) product_and_month
group by ...;
I have a table, say "sample", which saves a new record every five minutes.
Users might ask for data collected at a specific sampling interval of, say, 10 minutes, 30 minutes, or an hour.
Since I have a record every five minutes, when a user asks for data with a one-hour sample interval, I have to group every 12 (60/5) records into one record (already sorted by timestamp), where the aggregation criterion can be the min, max, avg, or last value.
I was trying to do this in Java after fetching all the records, and I'm seeing pretty bad performance because I have to iterate through the collection multiple times. I have read of alternatives like jAgg and lambdaj, but I wanted to check whether this is possible in SQL (MySQL) itself.
The sampling interval is dynamic, and the aggregation function (min/max/avg/last) is also user-provided.
Any pointers?
You can do this in SQL, but you have to carefully construct the statement. Here is an example by hour for all four aggregations:
select min(datetime) as datetime,
       min(val) as minval, max(val) as maxval, avg(val) as avgval,
       -- "last" value per bucket: first element of the list ordered newest-first
       substring_index(group_concat(val order by datetime desc), ',', 1) as lastval
from sample t
group by floor(to_seconds(datetime) / (60*60));
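Since the interval is user-supplied, only the divisor changes; a sketch with the bucket size in a user variable that is SET beforehand (MySQL accepts a previously assigned user variable inside the GROUP BY expression):
-- Bucket size in seconds, chosen by the user: 600, 1800, 3600, ...
SET @interval_seconds = 1800;

select min(datetime) as datetime,
       min(val) as minval, max(val) as maxval, avg(val) as avgval,
       substring_index(group_concat(val order by datetime desc), ',', 1) as lastval
from sample t
group by floor(to_seconds(datetime) / @interval_seconds);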
I store top-views and 'likes' in a table called 'counts'. Once a night I run this query:
UPDATE `counts` SET rank=d7+d6+d5+d4+d3+d2+d1,d7=d6,d6=d5,d5=d4,d4=d3,d3=d2,d2=d1,d1=0
Each of the last seven days has its own column (d1-d7); each night we shift every value 'down' one column and recalculate the sum.
As my site has grown, this query now takes ~20 minutes.
I'm looking for suggestions on how to organize this more efficiently, as it seems like it might be a common pattern.
As the comments say, we need to see the schema. But I'll make a suggestion anyway. Don't have 7 different fields d1-d7. What if later you decide to keep the score over a year? Ouch.
I'm going to assume that counts has view_id as its PK. Then have another table ranks with columns view_id (an FK into counts), rank (generalizing d1-d7, whatever datatype they are) and rank_date, which is a date. Now every night you have:
UPDATE counts
SET rank = (SELECT SUM(rank) FROM ranks r
            WHERE r.view_id = counts.view_id
              AND r.rank_date >= DATE_SUB(CURDATE(), INTERVAL 1 WEEK));
[Some RDBMSs allow a JOIN-type syntax in UPDATE queries. I believe MySQL understands something similar to the following, but it isn't my usual RDBMS
UPDATE counts, (SELECT view_id, SUM(rank) AS srank FROM ranks r
WHERE r.rank_date>=DATE_SUB(CURDATE(), INTERVAL 1 WEEK)
GROUP BY r.view_id) AS q1
SET rank = srank
WHERE counts.view_id=q1.view_id;
]
If so, that will probably run faster than the first version.
Meanwhile, as optional cleanup, you can delete rows from ranks that are more than a week old; with this more flexible schema, though, you don't have to.
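For reference, a minimal sketch of the ranks table this answer assumes (datatypes are guesses, since the schema wasn't posted; rank is backtick-quoted because it became a reserved word in MySQL 8.0):
CREATE TABLE ranks (
    view_id   INT  NOT NULL,
    rank_date DATE NOT NULL,
    `rank`    INT  NOT NULL DEFAULT 0,
    PRIMARY KEY (view_id, rank_date),
    FOREIGN KEY (view_id) REFERENCES counts (view_id)
);

-- Optional cleanup of history older than a week
DELETE FROM ranks WHERE rank_date < DATE_SUB(CURDATE(), INTERVAL 1 WEEK);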
I have the following table in MySQL that records event counts of stuff happening each day:
event_date   event_count
2011-05-03   21
2011-05-04   12
2011-05-05   12
I want to be able to query this efficiently by date range AND by day of week. For example - "What is the event_count on Tuesdays in May?"
Currently the event_date field is a date type. Are there any functions in MySQL that let me query this column by day of week, or should I add another column to the table to store the day of week?
The table will hold hundreds of thousands of rows, so given a choice I'll choose the most efficient solution (as opposed to most simple).
Use DAYOFWEEK in your query, something like:
SELECT * FROM mytable WHERE MONTH(event_date) = 5 AND DAYOFWEEK(event_date) = 7;
This will find all info for Saturdays in May.
To get the fastest reads, store a denormalized column holding the day of the week (and whatever else you need). That way you can index those columns and avoid full table scans.
Just try the above first to see if it suits your needs; if it doesn't, add some extra columns and store the data on write. Just watch out for update anomalies (make sure you update the day_of_week column if you change event_date).
Note that denormalized columns increase write time, add calculations on write, and take up more space. Make sure you really need the benefit and can measure that it helps you.
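On MySQL 5.7+ you can also sidestep the update anomaly with a stored generated column, which the server keeps in sync with event_date for you; a sketch (the index name is illustrative):
ALTER TABLE mytable
    ADD COLUMN day_of_week TINYINT AS (DAYOFWEEK(event_date)) STORED;

CREATE INDEX ix_dow_date ON mytable (day_of_week, event_date);

-- Saturdays in May 2011, now answerable from the index
SELECT * FROM mytable
WHERE day_of_week = 7
  AND event_date >= '2011-05-01' AND event_date < '2011-06-01';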
Check the DAYOFWEEK() function.
If you want a textual representation of the day of week, use the DAYNAME() function.