MySQL query - optimizing an existing MAX-MIN query for a huge table

I have a more or less well-working query (as far as the result is concerned), but it takes about 45 seconds to be processed. That's definitely too long for presenting the data in a GUI.
So my goal is to find a much faster/more efficient query (something around a few milliseconds would be nice).
My data table originally had around 3000 entries; it now has ~2,619,395 and is still growing.
Schema:
num | station | fetchDate | exportValue | error
1 | PS1 | 2010-10-01 07:05:17 | 300 | 0
2 | PS2 | 2010-10-01 07:05:19 | 297 | 0
923 | PS1 | 2011-11-13 14:45:47 | 82771 | 0
Explanation
the exportValue is always incrementing
the exportValue represents the actual absolute value
in my case there are 10 stations
every ~15 minutes 10 new entries are written to the table
error is just an indicator for a proper working station
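For reference, a possible DDL for this table; the column names come from the sample above, but the types and sizes are my own guesses:
-- Illustrative only: types/sizes are assumptions based on the sample rows above.
CREATE TABLE registros (
  num         INT UNSIGNED NOT NULL AUTO_INCREMENT,  -- running row id
  station     VARCHAR(10)  NOT NULL,                 -- e.g. 'PS1' .. 'PS10'
  fetchDate   DATETIME     NOT NULL,                 -- time the value was fetched
  exportValue INT UNSIGNED NOT NULL,                 -- absolute, ever-increasing counter
  error       TINYINT      NOT NULL DEFAULT 0,       -- 0 = station working properly
  PRIMARY KEY (num)
);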
Working query:
select YEAR(fetchDate), station, MAX(exportValue) - MIN(exportValue)
from registros
where exportValue > 0 and error = 0
group by station, YEAR(fetchDate)
order by YEAR(fetchDate), station
Output:
Year | station | Max-Min
2008 | PS1 | 24012
2008 | PS2 | 23709
2009 | PS1 | 28102
2009 | PS2 | 25098
My thoughts on it:
writing several queries with BETWEEN clauses, like 'between 2008-01-01 and 2008-01-02' to fetch the MIN(exportValue) and 'between 2008-12-30 and 2008-12-31' to grab the MAX(exportValue) - problem: a lot of queries, plus the problem of having no data in a specified time range (it's not guaranteed that there will be data)
limiting the result set to my 10 stations only, using ORDER BY MIN(fetchDate) - problem: it also takes a long time to process the query
Additional Info:
I'm using the query in a Java application (JPA 2.0). That means it would be possible to do some post-processing on the result set if necessary.
Any help/approaches/ideas are very appreciated. Thanks in advance.

Adding suitable indexes will help.
2 compound indexes will speed things up significantly:
ALTER TABLE tbl_name ADD INDEX (error, exportValue);
ALTER TABLE tbl_name ADD INDEX (station, fetchDate);
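As a quick sanity check (my addition, not part of the original answer), you can prefix the query with EXPLAIN to confirm the new indexes are actually being picked up:
EXPLAIN
SELECT YEAR(fetchDate), station, MAX(exportValue) - MIN(exportValue)
FROM registros
WHERE exportValue > 0 AND error = 0
GROUP BY station, YEAR(fetchDate)
ORDER BY YEAR(fetchDate), station;
-- The "key" and "rows" columns of the EXPLAIN output show which index is used
-- and roughly how many rows MySQL expects to examine.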

This query running on 3000 records should be extremely fast.
Suggestions:
Do you have a PK set on this table? station, fetchDate?
Add indexes; you should experiment with indexes, as rich.okelly suggested in his answer.
Depending on your experiments with indexes, try breaking your query into multiple statements inside one stored procedure; this way you will not lose time on network traffic between multiple queries sent from the client to MySQL.
You mentioned that you tried separate queries and that there is a problem when there is no data for a particular month; that is a regular case in business applications, and you should handle it in a "master query" (stored procedure or application code).
I guess fetchDate is the current date and time at the moment of record insertion; consider keeping previous months' data in a sort of summary table with the fields year, month, station, max(exportValue), min(exportValue). This means you would insert summary records into the summary table at the end of each month; deleting, keeping or moving the detail records to a separate table is your choice (a rough sketch of this follows below).
Since your table is growing rapidly (every 15 minutes), you should take the last suggestion into account. There is probably no need to keep the detailed history in one place. Archiving data is a process that should be done as part of maintenance.
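A minimal sketch of that summary-table idea; the table name, column names and types here are mine, chosen only to illustrate the shape of the solution:
-- Summary table holding one row per station per month.
CREATE TABLE registros_monthly (
  yr        SMALLINT    NOT NULL,
  mo        TINYINT     NOT NULL,
  station   VARCHAR(10) NOT NULL,
  maxExport INT         NOT NULL,
  minExport INT         NOT NULL,
  PRIMARY KEY (yr, mo, station)
);

-- Run at the end of each month (example shown for October 2011):
INSERT INTO registros_monthly (yr, mo, station, maxExport, minExport)
SELECT YEAR(fetchDate), MONTH(fetchDate), station,
       MAX(exportValue), MIN(exportValue)
FROM registros
WHERE exportValue > 0 AND error = 0
  AND fetchDate >= '2011-10-01' AND fetchDate < '2011-11-01'
GROUP BY YEAR(fetchDate), MONTH(fetchDate), station;
The yearly MAX-MIN per station can then be computed from a few hundred summary rows instead of millions of detail rows.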

Related

Sorting query performance in PHP+MYSQL

I have a table with a huge number of records. When I query that table, especially when using ORDER BY, it takes too much execution time.
How can I optimize this table for sorting & searching?
Here is an example schema of my table (jobs):
+----+-----------------+---------------------+
| id | title           | created_at          |
+----+-----------------+---------------------+
| 1  | Web Developer   | 2018-04-12 10:38:00 |
| 2  | QA Engineer     | 2018-04-15 11:10:00 |
| 3  | Network Admin   | 2018-04-17 11:15:00 |
| 4  | System Analyst  | 2018-04-19 11:19:00 |
| 5  | UI/UX Developer | 2018-04-20 12:54:00 |
+----+-----------------+---------------------+
I have been searching for a while; I learned that creating an INDEX can help improve performance. Can someone please elaborate on how the performance can be increased?
Add "explain" word before ur query, and check result
explain select ....
There u can see what u need to improve, then add index on ur search and/or sorting field and run explain query again
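For example, assuming the jobs table from the question and a typical listing query (the LIMIT value is just an illustration):
EXPLAIN
SELECT id, title, created_at
FROM jobs
ORDER BY created_at DESC
LIMIT 50;
-- Before indexing, the Extra column typically shows "Using filesort".
-- After ALTER TABLE jobs ADD INDEX (created_at), the optimizer can usually
-- read rows in index order and avoid the sort for this kind of query.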
If you want to improve the performance of your query, one way is to paginate it. You can put a LIMIT on it (as large as you want) and specify the page you want to display.
For example: SELECT * FROM your_table LIMIT 50 OFFSET 0.
I don't know if this will solve your problem, but you can try it ;)
Indexes are the database's way of creating lookup trees (B-Trees in most cases) to more efficiently sort, filter, and find rows.
Indexes are used to find rows with specific column values quickly.
Without an index, MySQL must begin with the first row and then read
through the entire table to find the relevant rows. The larger the
table, the more this costs. If the table has an index for the columns
in question, MySQL can quickly determine the position to seek to in
the middle of the data file without having to look at all the data.
This is much faster than reading every row sequentially.
https://dev.mysql.com/doc/refman/5.5/en/mysql-indexes.html
You can use EXPLAIN to help identify how the query is currently running, and identify areas of improvement. It's important to not over-index a table, for reasons probably beyond the scope of this question, so it'd be good to do some research on efficient uses of indexes.
ALTER TABLE jobs
ADD INDEX(created_at);
(Yes, there is a CREATE INDEX syntax that does the equivalent.)
Then, in the query, do
ORDER BY created_at DESC
However, with 15M rows, it may still take a long time. Will you be filtering (WHERE)? LIMITing?
If you really want to return to the user 15M rows -- well, that is a lot of network traffic, and that will take a long time.
MySQL details
Regardless of the index declaration or version, the ASC/DESC in ORDER BY will be honored. However it may require a "filesort" instead of taking advantage of the ordering built into the index's BTree.
In some cases, the WHERE or GROUP BY is too messy for the Optimizer to make use of any index. But if it can, then...
(Before MySQL 8.0) While it is possible to declare an index DESC, the attribute is ignored. However, ORDER BY .. DESC is honored; it scans the data backwards. This also works for ORDER BY a DESC, b DESC, but not if you have a mixture of ASC and DESC in the ORDER BY.
MySQL 8.0 does create DESC indexes; that is, the BTree containing the index is stored 'backwards'.
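To make the difference concrete, here is a hypothetical example (the index name is mine); on MySQL 8.0 the DESC actually changes how the BTree is stored, while on older versions it is parsed but ignored:
CREATE INDEX idx_jobs_created_desc ON jobs (created_at DESC);

-- A query like this can then read the index in its stored (descending) order:
SELECT id, title, created_at
FROM jobs
ORDER BY created_at DESC
LIMIT 50;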

Aggregate data tables

I am building a front-end to a largish db (10's of millions of rows). The data is water usage for loads of different companies and the table looks something like:
id | company_id | datetime | reading | used | cost
=============================================================
1 | 1 | 2012-01-01 00:00:00 | 5000 | 5 | 0.50
2 | 1 | 2012-01-01 00:01:00 | 5015 | 15 | 1.50
....
On the front end, users can select how they want to view the data, e.g. 6-hourly increments, daily increments, monthly, etc. What would be the best way to do this quickly? Given how much the data changes and how rarely any one set of data will be seen, caching the query data in memcached or something similar is almost pointless, and there is no way to build the data beforehand as there are too many variables.
I figured using some kind of aggregate table would work, having tables such as readings, readings_6h, readings_1d with exactly the same structure, just already aggregated.
If this is a viable solution, what is the best way to keep the aggregate tables up to date and accurate? Apart from the data coming in from meters, the table is read-only; users never have to update or write to it.
A number of possible solutions include:
1) stick to doing queries with group / aggregate functions on the fly
2) doing a basic select and save
SELECT `company_id`, CONCAT_WS(' ', DATE(`datetime`), '23:59:59') AS datetime,
       MAX(`reading`) AS reading, SUM(`used`) AS used, SUM(`cost`) AS cost
FROM `readings`
WHERE `datetime` > '$lastUpdateDateTime'
GROUP BY `company_id`, DATE(`datetime`)
3) duplicate key update (not sure how the aggregation would be done here, or how to make sure that the data is accurate - not counted twice or missing rows):
INSERT INTO `readings_6h` ...
SELECT FROM `readings` ....
ON DUPLICATE KEY UPDATE .. calculate...
4) other ideas / recommendations?
I am currently doing option 2, which takes around 15 minutes to aggregate +- 100k rows into +- 30k rows across the aggregate tables (_6h, _1d, _7d, _1m, _1y).
TL;DR: What is the best way to view / store aggregate data for numerous reports that can't be cached effectively?
This functionality would be best served by a feature called materialized view, which MySQL unfortunately lacks. You could consider migrating to a different database system, such as PostgreSQL.
There are ways to emulate materialized views in MySQL using stored procedures, triggers, and events. You create a stored procedure that updates the aggregate data. If the aggregate data has to be updated on every insert you could define a trigger to call the procedure. If the data has to be updated every few hours you could define a MySQL scheduler event or a cron job to do it.
There is a combined approach, similar to your option 3, that does not depend on the dates of the input data; imagine what would happen if some new data arrives a moment too late and does not make it into the aggregation. (You might not have this problem, I don't know.) You could define a trigger that inserts new data into a "backlog," and have the procedure update the aggregate table from the backlog only.
All these methods are described in detail in this article: http://www.fromdual.com/mysql-materialized-views
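As a rough sketch of the event-based variant (my own illustration, not taken from the linked article): assuming readings_1d has a unique key on (company_id, datetime) and the event scheduler is enabled, a recurring event could upsert the daily aggregates:
CREATE EVENT ev_refresh_readings_1d
ON SCHEDULE EVERY 1 HOUR
DO
  INSERT INTO readings_1d (company_id, `datetime`, reading, used, cost)
  SELECT company_id,
         CONCAT_WS(' ', DATE(`datetime`), '23:59:59'),
         MAX(reading), SUM(used), SUM(cost)
  FROM readings
  WHERE `datetime` >= CURDATE() - INTERVAL 1 DAY
  GROUP BY company_id, DATE(`datetime`)
  ON DUPLICATE KEY UPDATE
    reading = VALUES(reading),
    used    = VALUES(used),
    cost    = VALUES(cost);
Because the whole recent window is recomputed on every run, late-arriving rows inside that window are picked up automatically; the backlog-trigger approach described above is the safer choice if data can arrive later than that.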

Database suggestions for storing a "count" for every hour of the day

I have had an online archive service for over a year now. Unfortunately, I didn't put in the infrastructure to keep statistics. All I have now are the archive access logs.
For every hour there are two audio files (0-30 min in one and 30-60 min in the other). I've currently used MySQL to store the counts. It looks something like this:
| DATE | TIME | COUNT |
| 2012-06-12 | 20:00 | 39 |
| 2012-06-12 | 20:30 | 26 |
| 2012-06-12 | 21:00 | 16 |
and so on...
That makes 365 days * 24 hrs * 2 (2 half-hour slots per hour) > 17,500 rows per year. This makes reads/writes slow, and I feel a lot of space is wasted storing it this way.
So do you know of any other database that will store this data more efficiently and is faster?
That's not too many rows. If it's properly indexed, reads should be pretty fast (writes will be a little slower, but even with tables of up to about half a million rows I hardly notice).
If you are selecting items from the database using something like
select * from my_table where date='2012-06-12'
Then you need to make sure that you have an index on the date column. You can also create multiple-column indexes if you are using more than one column in your WHERE clause. That will make your read statements very fast (as I said, up to on the order of a million rows).
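A minimal sketch, assuming the column names shown above (DATE and TIME); adjust to your actual schema:
ALTER TABLE my_table ADD INDEX (`date`);

-- Or, if you usually filter on both columns, one multiple-column index:
ALTER TABLE my_table ADD INDEX (`date`, `time`);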
If you're unacquainted with indexes, see here:
MySQL Indexes

creating a series of time periods as rows

I want to write a query that, for any given start date in the past, returns one row per complete week-long interval from that date up to the present.
For instance, given a start date of Nov 13th 2010 and a present date of 2010-12-16, I want a result set like:
+------------+------------+
| Start | End |
+------------+------------+
| 2010-11-15 | 2010-11-21 |
+------------+------------+
| 2010-11-22 | 2010-11-28 |
+------------+------------+
| 2010-11-29 | 2010-12-05 |
+------------+------------+
| 2010-12-06 | 2010-12-12 |
+------------+------------+
It doesn't go past 2010-12-12 because the week-long period that the present date falls in isn't complete yet.
I can't get a foothold on how I would even start to write this query.
Can I do this in a single query? Or should I use code for looping, and do multiple queries?
It's quite difficult (but not impossible) to create such a result set dynamically in MySQL as it doesn't yet support any of recursive CTEs, CONNECT BY, or generate_series that I would use to do this in other databases.
Here's an alternative approach you can use.
Create and prepopulate a table containing all the possible rows from some date far in the past to some date far in the future. Then you can easily generate the result you need by querying this table with a WHERE clause, using an index to make the query efficient.
The drawbacks of this approach are quite obvious:
It takes up storage space unnecessarily.
If you query outside of the range that you populated your table with you won't get any results, which means that you will either have to populate the table with enough dates to last the lifetime of your application or else you need a script to add more dates every so often.
See also this related question:
How do I make a row generator in MySQL
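A minimal sketch of that prepopulated-table approach; the table and column names are mine, chosen only for illustration:
-- One row per week, prepopulated far into the future (e.g. by a one-off script).
CREATE TABLE week_periods (
  start_date DATE NOT NULL PRIMARY KEY,
  end_date   DATE NOT NULL
);

-- Complete weeks between the example start date and the present date:
SELECT start_date, end_date
FROM week_periods
WHERE start_date >= '2010-11-15'
  AND end_date   <= '2010-12-16';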
Beware, this is just a concept idea: I do not have a MySQL installation right here, so I cannot test it.
However, I would base it on a table containing integers, in order to emulate a series.
Something like :
CREATE TABLE integers_table
(
id integer primary key
);
Followed by (warning, this is pseudo code)
INSERT INTO integers_table(0…32767);
(that should be enough weeks for the rest of our lives :-)
Then
FirstMondayInUnixTimeStamp_Utc = 3600 * 24 * 4
SecondPerDay = 3600 * 24
(since 1 Jan 1970 was a Thursday. Beware, I did not cross-check! I might be off a few hours!)
And then
CREATE VIEW weeks AS
SELECT integers_table.id AS week_id,
       FROM_UNIXTIME(FirstMondayInUnixTimeStamp_Utc + id * SecondPerDay * 7) AS week_start,
       FROM_UNIXTIME(FirstMondayInUnixTimeStamp_Utc + id * SecondPerDay * 7 + SecondPerDay * 6) AS week_end
FROM integers_table;
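Substituting the constants above gives a concrete (but untested, per the author's own caveat) version, plus a query that reproduces the example result set; the DATE() wrapping and the literal dates are my additions:
CREATE OR REPLACE VIEW weeks AS
SELECT id AS week_id,
       DATE(FROM_UNIXTIME(345600 + id * 86400 * 7)) AS week_start,           -- 345600 = 3600*24*4, i.e. Monday 1970-01-05
       DATE(FROM_UNIXTIME(345600 + id * 86400 * 7 + 86400 * 6)) AS week_end  -- 86400  = 3600*24, seconds per day
FROM integers_table;

-- Complete weeks between the start date and the present date of the example:
SELECT week_start, week_end
FROM weeks
WHERE week_start >= '2010-11-15'
  AND week_end   <= '2010-12-16';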

Compound index required to speed up join-ed query?

A colleague asked me to explain how indexes (indices?) boost performance; I tried to do so, but got confused myself.
I used the model below for explanation (an error/diagnostics logging database). It consists of three tables:
List of business systems, table "System" containing their names
List of different types of traces, table "TraceTypes", defining what kinds of error messages can be logged
Actual trace messages, having foreign keys from System and TraceTypes tables
I used MySQL for the demo, however I don't recall the table types I used. I think it was InnoDB.
System                           TraceTypes
------------------               -------------------------------------------
| ID | Name      |               | ID | Code    | Description           |
------------------               -------------------------------------------
| 1  | billing   |               | 1  | Info    | Informational message |
| 2  | hr        |               | 2  | Warning | Warning only          |
------------------               | 3  | Error   | Failure               |
                                 -------------------------------------------

Traces (System_ID and TraceTypes_ID are foreign keys to the tables above)
--------------------------------------------------
| ID | System_ID | TraceTypes_ID | Message        |
--------------------------------------------------
| 1  | 1         | 1             | Job starting   |
| 2  | 1         | 3             | System.nullr.. |
--------------------------------------------------
First, I added some records to all of the tables and demonstrated that the query below executes in 0.005 seconds:
select count(*) from Traces
inner join System on Traces.System_ID = System.ID
inner join TraceTypes on Traces.TraceTypes_ID = TraceTypes.ID
where
System.Name='billing' and TraceTypes.Code = 'Info'
Then I generated more data (no indexes yet)
"System" contained about 100 entries
"TraceTypes" contained about 50 entries
"Traces" contained ~10 million records.
Now the previous query took 8-10 seconds.
I created indexes on Traces.System_ID column and Traces.TraceTypes_ID column. Now this query executed in milliseconds:
select count(*) from Traces where System_id=1 and TraceTypes_ID=1;
This was also fast:
select count(*) from Traces
inner join System on Traces.System_ID = System.ID
where System.Name='billing' and TraceTypes_ID=1;
but the previous query, which joined all three tables, still took 8-10 seconds to complete.
Only when I created a compound index (both System_ID and TraceTypes_ID columns included in the index) did the speed go down to milliseconds.
The basic statement I was taught earlier is "all the columns you use for join-ing, must be indexed".
However, in my scenario I had indexes on both System_ID and TraceTypes_ID, yet MySQL didn't use them. The question is - why? My bet is that the item count ratio 100:10,000,000:50 makes the single-column indexes too large to be used. But is that true?
First, the correct, and the easiest, way to analyze a slow SQL statement is to EXPLAIN it. Find out how the optimizer chose its plan and ponder on why, and how to improve that. I'd suggest studying the EXPLAIN results with only the 2 separate indexes to see how MySQL executes your statement.
I'm not very familiar with MySQL, but it seems there was a restriction in MySQL 4 of using only one index per table involved in a query. There have been improvements on this since MySQL 5 (index merge), but I'm not sure whether they apply to your case. Again, EXPLAIN should tell you the truth.
Even where using 2 indexes per table is allowed (MySQL 5), using 2 separate indexes is generally slower than a compound index: 2 separate indexes require an index merge step, compared to the single pass over a compound index.
Multi Column indexes vs Index Merge might be helpful; it uses MySQL 5.4.2.
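To see the difference yourself, here is a sketch of the compound index described in the question, followed by EXPLAIN on the three-table query (the index name is mine):
ALTER TABLE Traces ADD INDEX idx_system_tracetype (System_ID, TraceTypes_ID);

EXPLAIN
SELECT COUNT(*) FROM Traces
INNER JOIN System     ON Traces.System_ID     = System.ID
INNER JOIN TraceTypes ON Traces.TraceTypes_ID = TraceTypes.ID
WHERE System.Name = 'billing' AND TraceTypes.Code = 'Info';
-- With the compound index, the Traces row of the plan should show key = idx_system_tracetype;
-- with only the two single-column indexes you would typically see either one index
-- plus a filter, or an index_merge plan.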
It's not the size of the indexes so much as the selectivity that determines whether the optimizer will use them.
My guess would be that it uses one index, then a traditional lookup to move to the other index, and then filters out rows. Please check the execution plan; in short, you might be looping through two indexes in a nested loop. As I understand it, we should build a composite index on the columns used for filtering or joining, and then use an INCLUDE clause for the columns in the SELECT. I have never worked with MySQL, so this understanding is based on SQL Server 2005.