Problem Description
I have an audit table that contains the change history of some objects. Each audit row has a unique audit event id, the id of the object being changed, the date of the change, the property that was changed, the before and after values, and a few other columns.
What I need to do is query the audit data and, for each entry, get the date the same field was previously changed on the same object. In other words, I need to look at the audit table a second time and, for each audit entry, attach the previous matching entry with its date as the previous change date.
Schema & Data
The table has the audit id (id) as the primary key and the object id (parent_id) indexed; nothing else is indexed. In my test case I have roughly 150 objects with around 80k audit entries between them.
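For reference, a minimal sketch of what such an audit table might look like, with column names taken from the queries below (the exact data types and index names are assumptions):
-- Hypothetical layout; column names come from the queries below, types are guesses.
CREATE TABLE `opportunities_audit` (
  `id` CHAR(36) NOT NULL,                -- unique audit event id
  `parent_id` CHAR(36) NOT NULL,         -- id of the object being changed
  `date_created` DATETIME NOT NULL,      -- date of the change
  `field_name` VARCHAR(100) NOT NULL,    -- the property that was changed
  `before_value_string` VARCHAR(255),    -- value before the change
  `after_value_string` VARCHAR(255),     -- value after the change
  PRIMARY KEY (`id`),
  KEY `idx_parent_id` (`parent_id`)
);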
Solution
There are two obvious solutions: a subquery and a LEFT JOIN.
In the LEFT JOIN version I join the audit table onto itself, with the join condition making sure the object, the field and the value changes correspond, and that the joined changes are older than the current change. I then select the maximum change date and, to pick up only the single latest previous change, group by id. If no previous change is found, I fall back to the creation date of the object itself.
LEFT JOIN SQL
SELECT `audit`.`id` AS `id`,
`audit`.`parent_id` AS `parent_id`,
`audit`.`date_created` AS `date_created`,
COALESCE(MAX(`audit_prev`.`date_created`), `audit_parent`.`date_entered`) AS `date_created_before`,
`audit`.`field_name` AS `field_name`,
`audit`.`before_value_string` AS `before_value_string`,
`audit`.`after_value_string` AS `after_value_string`
FROM `opportunities_audit` `audit`
LEFT JOIN `opportunities_audit` `audit_prev`
ON(`audit`.`parent_id` = `audit_prev`.`parent_id`
AND `audit_prev`.`date_created` < `audit`.`date_created`
AND `audit_prev`.`after_value_string` = `audit`.`before_value_string`
AND `audit`.`field_name` = `audit_prev`.`field_name`)
LEFT JOIN `opportunities` `audit_parent` ON(`audit`.`parent_id` = `audit_parent`.`id`)
GROUP BY `audit`.`id`;
The subquery logic is rather similar, but instead of grouping and using the MAX function I simply order by date DESC and take LIMIT 1:
SELECT `audit`.`id` AS `id`,
`audit`.`parent_id` AS `parent_id`,
`audit`.`date_created` AS `date_created`,
COALESCE((SELECT `audit_prev`.`date_created`
FROM `opportunities_audit` AS `audit_prev`
WHERE
(`audit_prev`.`parent_id` = `audit`.`parent_id`)
AND (`audit_prev`.`date_created` < `audit`.`date_created`)
AND (`audit_prev`.`after_value_string` = `audit`.`before_value_string`)
AND (`audit_prev`.`field_name` = `audit`.`field_name` )
ORDER BY `date_created` DESC
LIMIT 1
), `audit_parent`.`date_entered`) AS `date_created_before`,
`audit`.`field_name` AS `field_name`,
`audit`.`before_value_string` AS `before_value_string`,
`audit`.`after_value_string` AS `after_value_string`
FROM `opportunities_audit` `audit`
LEFT JOIN `opportunities` `audit_parent` ON(`audit`.`parent_id` = `audit_parent`.`id`);
Both queries produce identical result sets.
Issue
When I run the query in phpMyAdmin the solution with join takes roughly 2m30s to return the result. However, phpMyAdmin says the query took 0.04 seconds. When I run the subquery solution the result comes back immediately and the reported execution time by phpMyAdmin is something like 0.06 seconds.
So I have a hard time understanding where this difference in actual execution time comes from. My initial guess was that the problem was related to phpMyAdmin's automatic LIMIT on the returned data set: while the result has 80k rows, it only displays 25. But adding the LIMIT manually to the queries makes both of them execute fast.
Also, running the queries from the command-line mysql tool returns the full result set for both queries, the reported execution times correspond to the actual execution time, and the method using joins is still roughly 1.5x faster than the subquery.
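Comparing the execution plans of the two variants is a natural next step; prefixing each query with EXPLAIN shows how MySQL intends to resolve the self-join versus the dependent subquery. An abbreviated sketch for the join variant:
EXPLAIN
SELECT `audit`.`id`,
       COALESCE(MAX(`audit_prev`.`date_created`), `audit_parent`.`date_entered`) AS `date_created_before`
FROM `opportunities_audit` `audit`
LEFT JOIN `opportunities_audit` `audit_prev`
       ON `audit`.`parent_id` = `audit_prev`.`parent_id`
      AND `audit_prev`.`date_created` < `audit`.`date_created`
      AND `audit_prev`.`after_value_string` = `audit`.`before_value_string`
      AND `audit`.`field_name` = `audit_prev`.`field_name`
LEFT JOIN `opportunities` `audit_parent`
       ON `audit`.`parent_id` = `audit_parent`.`id`
GROUP BY `audit`.`id`;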
From the profiler data it seems that the bulk of that wait time is spent on 'Sending data': that stage takes on the order of minutes, while everything else is on the order of microseconds.
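For reference, this kind of per-stage breakdown can be reproduced with session profiling, roughly like this (SHOW PROFILE is deprecated in newer MySQL releases in favour of the Performance Schema, but still works where available):
SET profiling = 1;           -- enable profiling for this session
-- run the JOIN or subquery variant here
SHOW PROFILES;               -- list the profiled queries with their ids and total durations
SHOW PROFILE FOR QUERY 1;    -- per-stage breakdown; 'Sending data' is the dominant stage here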
Still, why would the behaviour of phpMyAdmin differ so greatly between the two queries?
Related
I'm saving data to a MySQL database every 5 seconds and I want to group this data into 5-minute averages.
The select is this:
SELECT MIN(revision) AS revrev,
       AVG(temperature),
       AVG(humidity)
FROM dht22_sens t
GROUP BY revision DIV 500
ORDER BY `revrev` DESC
Is it possible to save the aggregated data with a single query, possibly into the same table?
If it is about reducing the number of rows, then I think you have to insert new rows with the aggregated values and then delete the original, detailed rows. I don't know of any single SQL statement for inserting and deleting in one go (cf. also a similar answer from symcbean on StackOverflow, who additionally suggests packing these two statements into a procedure).
I'd suggest adding an extra column aggregationLevel and running two statements (with or without a procedure):
INSERT INTO dht22_sens (revision, temperature, humidity, aggregationLevel)
SELECT MIN(t.revision) AS revision,
       AVG(t.temperature) AS temperature,
       AVG(t.humidity) AS humidity,
       500 AS aggregationLevel
FROM dht22_sens t
WHERE t.aggregationLevel IS NULL
GROUP BY t.revision DIV 500;  -- one aggregated row per 500-revision block, as in the original SELECT

DELETE FROM dht22_sens WHERE aggregationLevel IS NULL;
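And if you want the "with procedure" variant, packing both statements into one callable unit might look roughly like this (the procedure name is made up; whether you also need a transaction or lock around it depends on how quickly new detail rows arrive):
DELIMITER //
CREATE PROCEDURE aggregate_dht22()
BEGIN
  -- one aggregated row per 500-revision block of not-yet-aggregated rows
  INSERT INTO dht22_sens (revision, temperature, humidity, aggregationLevel)
  SELECT MIN(t.revision), AVG(t.temperature), AVG(t.humidity), 500
  FROM dht22_sens t
  WHERE t.aggregationLevel IS NULL
  GROUP BY t.revision DIV 500;

  -- then drop the detail rows that were just rolled up
  DELETE FROM dht22_sens WHERE aggregationLevel IS NULL;
END //
DELIMITER ;

CALL aggregate_dht22();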
I have an SQL query (see below) that returns exactly what I need, but when run through phpMyAdmin it takes anywhere from 0.0009 seconds to 0.1149 seconds, and occasionally all the way up to 7.4983 seconds.
Query:
SELECT
e.id,
e.title,
e.special_flag,
CASE WHEN a.date >= '2013-03-29' THEN a.date ELSE '9999-99-99' END as date,
CASE WHEN a.date >= '2013-03-29' THEN a.time ELSE '99-99-99' END as time,
cat.lastname
FROM e_table as e
LEFT JOIN a_table as a ON (a.e_id=e.id)
LEFT JOIN c_table as c ON (e.c_id=c.id)
LEFT JOIN cat_table as cat ON (cat.id=e.cat_id)
LEFT JOIN m_table as m ON (cat.name=m.name AND cat.lastname=m.lastname)
JOIN (
SELECT DISTINCT innere.id
FROM e_table as innere
LEFT JOIN a_table as innera ON (innera.e_id=innere.id AND
innera.date >= '2013-03-29')
LEFT JOIN c_table as innerc ON (innere.c_id=innerc.id)
WHERE (
(
innera.date >= '2013-03-29' AND
innera.flag_two=1
) OR
innere.special_flag=1
) AND
innere.flag_three=1 AND
innere.flag_four=1
ORDER BY COALESCE(innera.date, '9999-99-99') ASC,
innera.time ASC,
innere.id DESC LIMIT 0, 10
) AS elist ON (e.id=elist.id)
WHERE (a.flag_two=1 OR e.special_flag) AND e.flag_three=1 AND e.flag_four=1
ORDER BY a.date ASC, a.time ASC, e.id DESC
Explain Plan:
The question is:
Which part of this query could be causing the wide range of difference in performance?
To specifically answer your question: it's not one specific part of the query that's causing the wide range in performance. That's MySQL doing what it's supposed to do: being a Relational Database Management System (RDBMS), not just a dumb SQL wrapper around comma-separated files.
When you execute a query, the following things happen:
The query is compiled into a 'parametrized' form, reducing all variables down to the pure structural SQL.
The compilation cache is checked for a recent, usable execution plan for the query.
If needed, the query is compiled into an execution plan (this is what EXPLAIN shows).
For each execution plan element, the memory caches are checked to see whether they contain fresh, usable data; otherwise the intermediate data is assembled from the master table data.
The final result is assembled by putting all the intermediate data together.
What you are seeing is that when the query costs 0.0009 seconds, the cache was fresh enough to supply all the data immediately; when it peaks at 7.5 seconds, either something changed in the queried tables, other queries pushed the in-memory cache data out, or the DBMS had some other reason to suspect it needed to recompile the query or fetch all the data again. The other variations probably come down to whether the indexes being used are still freshly cached in memory or not.
Concluding this: the query is ridiculously slow; you're just sometimes lucky that caching makes it appear fast.
To solve this, I'd recommend looking into 2 things:
First and foremost: a query this size should not have a single line in its execution plan reading "No possible keys". Research how indexes work, make sure you realize the impact of MySQL's limitation of using a single index per joined table, and tweak your database so that each line of the plan has an entry under 'key'.
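As a rough illustration only (the exact columns to index should come from your own EXPLAIN output, so treat the names below as placeholders):
-- Placeholder indexes covering the join key and the filter flags used above.
ALTER TABLE a_table ADD INDEX ix_a_eid_date (e_id, flag_two, `date`);
ALTER TABLE e_table ADD INDEX ix_e_flags (flag_three, flag_four, special_flag);
-- Re-run EXPLAIN on the full query afterwards; every line of the plan
-- should now show an entry in the 'key' column.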
Secondly, review the query itself. DBMSs are at their fastest when all they have to do is combine raw data. Programmatic elements like CASE and COALESCE are often useful, but they force the database to evaluate more at runtime than just raw table data. Try to eliminate such statements, or move them into the business logic as post-processing on the retrieved data.
Finally, never forget that MySQL is actually a rather stupid DBMS. It is optimized for performance in the simple data-fetching queries most websites require, and as such it is much faster than SQL Server and Oracle for most generic problems. Once you start complicating things with functions, CASE expressions, huge join or matching conditions and so on, the competitors are frequently better optimized and have better query compilers. So when MySQL starts becoming slow on a specific query, consider splitting it up into 2 or more smaller queries, just so it doesn't get confused, and do some post-processing in PHP or whatever language you are calling it from. I've seen many cases where this increased performance a LOT, simply by not confusing MySQL, especially where subqueries are involved (as in your case). The fact that your subquery is a derived table, and not just a subquery, is known to complicate things for MySQL beyond what it can cope with.
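A rough sketch of such a split, under the assumption that the derived table is what hurts: run the inner query on its own first, then feed its ids into a trimmed outer query.
-- Step 1: the former derived table, run on its own; it returns at most 10 ids.
SELECT DISTINCT innere.id
FROM e_table AS innere
LEFT JOIN a_table AS innera
       ON innera.e_id = innere.id AND innera.date >= '2013-03-29'
WHERE ((innera.date >= '2013-03-29' AND innera.flag_two = 1) OR innere.special_flag = 1)
  AND innere.flag_three = 1
  AND innere.flag_four = 1
ORDER BY COALESCE(innera.date, '9999-99-99') ASC, innera.time ASC, innere.id DESC
LIMIT 0, 10;

-- Step 2: fetch the display columns for just those ids,
-- substituting the ids collected in application code.
SELECT e.id, e.title, e.special_flag, a.date, a.time, cat.lastname
FROM e_table AS e
LEFT JOIN a_table   AS a   ON a.e_id = e.id
LEFT JOIN cat_table AS cat ON cat.id = e.cat_id
WHERE e.id IN (1, 2, 3)   -- ids returned by step 1
ORDER BY a.date ASC, a.time ASC, e.id DESC;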
Let's start with the fact that both your outer and inner queries work with the "e" table WITH a minimum requirement of flag_three = 1 AND flag_four = 1 (regardless of your inner query's ((x AND y) OR z) condition). Also, your outer WHERE clause refers to a.flag_two without allowing for NULL, which forces your LEFT JOIN to actually become an (INNER) JOIN. It also appears every "e" record MUST have a category, since you look up "cat.lastname" with no COALESCE() if none is found; that makes sense, as it appears to be a "lookup" table reference. As for the "m_table" and "c_table", you are not selecting or doing anything with them, so they can be removed completely.
Would the following query get you the same results?
select
e1.id,
e1.Title,
e1.Special_Flag,
e1.cat_id,
coalesce( a1.date, '9999-99-99' ) ADate,
coalesce( a1.time, '99-99-99' ) ATime,
cat.LastName
from
e_table e1
LEFT JOIN a_table as a1
ON e1.id = a1.e_id
AND a1.flag_two = 1
AND a1.date >= '2013-03-29'
JOIN cat_table as cat
ON e1.cat_id = cat.id
where
e1.flag_three = 1
and e1.flag_four = 1
and ( e1.special_flag = 1
OR a1.id IS NOT NULL )
order by
IF( a1.id is null, 2, 1 ),
ADate,
ATime,
e1.ID Desc
limit
0, 10
The main WHERE clause qualifies ONLY those records that have the "three" and "four" flags set to 1, PLUS EITHER the special flag is set OR there is a valid "a" record on/after the given date in question.
From that, it's a simple ORDER BY and LIMIT.
As for the date and time, it appears you only want records on/after the given date to be included; older ones are ignored (they are old and not applicable, so you don't want to see them).
In the ORDER BY, I test FIRST for a NULL "a" ID. If it is NULL, we know those rows will all be forced to a date of '9999-99-99' and a time of '99-99-99' and we want them pushed to the bottom (hence 2); otherwise there IS an "a" record and you want those first (hence 1). Then sort by the date and time respectively, and finally by the ID descending in case there are many within the same date/time.
Finally, to help on the indexes, I would ensure your "e" table has an index on
( id, flag_three, flag_four, special_flag ).
For the "a" table, index on
(e_id, flag_two, date)
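In DDL form, those two suggestions would be roughly the following (the index names are arbitrary):
ALTER TABLE e_table ADD INDEX ix_e_flags (id, flag_three, flag_four, special_flag);
ALTER TABLE a_table ADD INDEX ix_a_eid (e_id, flag_two, `date`);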
This question is very much related to my previous question: MySQL, return all results within X last hours, although with an additional significant constraint:
Now I have 2 tables, one for measurements and one for the classified results of part of the measurements.
Measurements arrive constantly, so results are constantly added as new measurements are classified.
Results will not necessarily be stored in the same order in which the measurements arrive and are stored!
I am interested in presenting only the last results. By "last" I mean: take the max time (the time is part of the measurement structure) of the last available result, call it Y, and a range of X seconds, and present the measurements together with the available results in the range between Y and Y-X.
These are the structures of the 2 tables:
event table:
CREATE TABLE `event_data` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`Feature` char(256) NOT NULL,
`UnixTimeStamp` int(10) unsigned NOT NULL,
`Value` double NOT NULL,
KEY `ix_filter` (`Feature`),
KEY `ix_time` (`UnixTimeStamp`),
KEY `id_index` (`id`)
) ENGINE=MyISAM
classified results table:
CREATE TABLE `event_results` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`level` enum('NORMAL','SUSPICIOUS') DEFAULT NULL,
`score` double DEFAULT NULL,
`eventId` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `eventId_index` (`eventId`)
) ENGINE=MyISAM
I can't query for the last measurement's timestamp first, since I want to present measurements for which results are currently available; because measurements arrive constantly, results may not yet exist for the newest ones.
Therefore I thought of joining the two tables using
event_results.eventId = event_data.id and then selecting the max time of event_data.UnixTimeStamp as maxTime. After I have maxTime, I need to do the same operation again (joining the 2 tables), adding a condition in the WHERE clause:
WHERE event_data.UnixTimeStamp >= maxTime + INTERVAL -X SECOND
It does not seem efficient to execute 2 joins only to achieve what I am asking. Do you have a more efficient approach?
From my understanding, you are using an aggregate function, MAX. This produces a record set of size one, the highest time, which you then work from. Therefore, it needs to be broken out into a sub query (as you say, a nested select). You HAVE to do 2 queries at some point. (Your answer to the last question has 2 queries in it, by way of subqueries/nested selects.)
The main time sub queries cause problems is when you put the subquery in the SELECT part of the query, because it is then executed once for every row, which makes the query run dramatically slower as the result set grows. Let's take the answer to your last question and write it in a horrible, inefficient way:
SELECT timeStart,
(SELECT MAX(timeStart) FROM events) AS maxTime
FROM events
-- the alias maxTime cannot be reused in WHERE, so the subquery is repeated here
WHERE timeStart > ((SELECT MAX(timeStart) FROM events) + INTERVAL -1 SECOND)
This performs a select for the max event time once for every row in the events table. It should produce the same result, but it is slow. This is where the fear of subqueries comes from.
It also evaluates the aggregate function MAX for each row, even though it returns the same answer every time. So instead, you should perform that sub query ONCE rather than once per row.
However, in the answer to your last question, the MAX sub query is run once and used to filter in the WHERE clause, and that outer select is also run once. So, in total, 2 queries are run.
2 super fast queries run one after the other are still faster than 1 super slow query.
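Put differently, keeping the MAX in a single sub query that the WHERE clause consumes once looks roughly like this (same events table as before):
-- The uncorrelated inner SELECT runs once; its single value then filters the outer query.
SELECT timeStart
FROM events
WHERE timeStart > ((SELECT MAX(timeStart) FROM events) + INTERVAL -1 SECOND);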
I'm not entirely sure what resultset you want returned, so I am going to make some assumptions. Please feel free to correct any assumptions I've made.
It sounds (to me) like you want ALL rows from event_data that are within an hour (or however many seconds) of the absolute "latest" timestamp, and along with those rows, you also want to return any related rows from event_results, if any matching rows are available.
If that's the case, then using an inline view to retrieve the maximum value of timestamp is the way to go. (That operation will be very efficient, since the query will be returning a single row, and it can be efficiently retrieved from an existing index.)
Since you want all rows from a specified period of time (from the "latest time" back to "latest time minus X seconds"), we can go ahead and calculate the starting timestamp of the period in that same query. Here we assume you want to "go back" one hour (=60*60 seconds):
SELECT MAX(UnixTimeStamp) - 3600 FROM event_data
NOTE: the expression in the SELECT list above relies on the UnixTimeStamp column being defined as an integer type, rather than as a DATETIME or TIMESTAMP datatype. If the column were defined as DATETIME or TIMESTAMP, we would likely express that with something like this:
SELECT MAX(mydatetime) + INTERVAL -3600 SECOND
(We could specify the interval units in minutes, hours, etc.)
We can use the result from that query in another query. To do that in the same query text, we simply wrap that query in parentheses and reference it as a rowsource, as if it were an actual table. This allows us to get all the rows from event_data that are within the specified time period, like this:
SELECT d.id
, d.Feature
, d.UnixTimeStamp
, d.Value
FROM ( SELECT MAX(l.UnixTimeStamp) - 3600 AS from_unixtimestamp
FROM event_data l
) m
JOIN event_data d
ON d.UnixTimeStamp >= m.from_unixtimestamp
In this particular case, there's no need for an upper bound predicate on UnixTimeStamp column in the outer query. This is because we already know there are no values of UnixTimeStamp that are greater than the MAX(UnixTimeStamp), which is the upper bound of the period we are interested in.
(We could add an expression to the SELECT list of the inline view, to return MAX(l.UnixTimeStamp) AS to_unixtimestamp, and then include a predicate like AND d.UnixTimeStamp <= m.to_unixtimestamp in the outer query, but that would be unnecessarily redundant.)
You also specified a requirement to return information from the event_results table.
I believe you said that you wanted any related rows that are "available". This suggests (to me) that if no matching row is "available" from event_results, you still want to return the row from the event_data table.
We can use a LEFT JOIN operation to get that to happen:
SELECT d.id
, d.Feature
, d.UnixTimeStamp
, d.Value
, r.id
, r.level
, r.score
, r.eventId
FROM ( SELECT MAX(l.UnixTimeStamp) - 3600 AS from_unixtimestamp
FROM event_data l
) m
JOIN event_data d
ON d.UnixTimeStamp >= m.from_unixtimestamp
LEFT
JOIN event_results r
ON r.eventId = d.id
Since there is no unique constraint on the eventId column in the event_results table, there is a possibility that more than one "matching" row from event_results will be found. Whenever that happens, the row from the event_data table will be repeated, once for each matching row from event_results.
If there is no matching row from event_results, then the row from event_data will still be returned, but with the columns from the event_results table set to NULL.
For performance, remove any columns from the SELECT list that you don't need returned, and be judicious in your choice of expressions in an ORDER BY clause. (The addition of a covering index may improve performance.)
For the statement as written above, MySQL is likely to use the ix_time index on the event_data table, and the eventId_index index on the event_results table.
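If a covering index on event_results does turn out to help, it might look like this (optional, and the column order is just an assumption; it lists every event_results column the query touches so those rows can be served from the index alone):
ALTER TABLE event_results ADD INDEX ix_results_cover (eventId, level, score, id);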
I have a table with ~3M rows. Each row has date, time, and msec columns, plus some other columns of int data. Some unknown fraction of these rows are considered 'invalid' based on their existence in a separate outages table (based on date ranges).
Currently the query does a SELECT * and then uses a huge WHERE clause to remove the invalid date ranges (lots of AND NOT (RecordDate > '2008-08-05' AND RecordDate < '2008-08-10') and so on). This blows away any chance of using an index.
I'm looking for a better way to limit the results. As it stands, the query takes several minutes to run.
DELETE b FROM bigtable b
INNER JOIN outages o ON (b.`date` BETWEEN o.datestart AND o.dateend)
WHERE (1=1) -- in some modes MySQL demands a WHERE clause or it will not run
Make sure you have an index on all fields involved in the query.
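For the DELETE above, that would mean something along these lines (a sketch; the index names are arbitrary and the column names come from the query):
ALTER TABLE bigtable ADD INDEX ix_big_date (`date`);
ALTER TABLE outages ADD INDEX ix_out_range (datestart, dateend);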
I'm using MySQL InnoDB tables. The query is simple but it just takes too long to run. It's bound to run even slower once I upload it to a server (not to mention that these two tables will keep growing in size).
The 'author' table size is 3,045 records.
The 'book' table size is 5,278 records.
SELECT author.*, count( auth_id ) AS author_count
FROM author
LEFT JOIN book ON author.id = book.auth_id
GROUP BY author.id
I'm wondering if there is some trick I can employ to make the query run at least twice as fast as it does now (it currently takes about 10.5 seconds to run, an eternity when displaying web content!).
1) Is this returning what you want?
2) You should list in the GROUP BY clause all the fields from the SELECT clause that are not inside an aggregate function (like COUNT in this case, but it could also be AVG or SUM).
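Concretely, spelling out the columns instead of author.* would look something like this (I'm assuming a name column on author purely for illustration); it is also worth making sure book.auth_id is indexed, since both the join and the COUNT lean on it:
SELECT author.id, author.name, COUNT(book.auth_id) AS author_count
FROM author
LEFT JOIN book ON author.id = book.auth_id
GROUP BY author.id, author.name;

-- Index the join column so the per-author count is cheap.
ALTER TABLE book ADD INDEX ix_book_auth_id (auth_id);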