Performance comparison of day-wise GROUP BY in MySQL

Let's consider the following table.
Table columns:
ID
epoch_time_in_millis
counter
Query #1:
SELECT
DATE_FORMAT(FROM_UNIXTIME(epoch_time_in_millis/1000),"%Y-%m-%d") date,
SUM(counter) totalCount
FROM my_table
GROUP BY date;
Query #2:
SELECT
(epoch_time_in_millis DIV 86400000 ) * 86400000 ms,
SUM(counter) totalCount
FROM my_table
GROUP BY (epoch_time_in_millis DIV 86400000) * 86400000;
My question is: will the above two queries show any performance difference?
If yes, please help me understand why.
If no, help me understand why too. :P
Thanks in advance.

The best way to check performance is on your hardware using your data.
But MySQL typically implements GROUP BY with a filesort algorithm (or a temporary table). That algorithm generally does not take advantage of indexes, and it certainly cannot here, because the grouping key is a computed expression. Hence, the work for the two queries is going to be in processing the aggregation.
The other operations are trivial by comparison. Whether the engine does the date arithmetic once or twice per row really isn't going to be relevant to the overall cost -- unless you have just a handful of rows, in which case performance isn't really an issue anyway.
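If you do want to verify this on your own data, here is a minimal sketch using EXPLAIN; the table and column names are taken from the question:
EXPLAIN
SELECT DATE_FORMAT(FROM_UNIXTIME(epoch_time_in_millis/1000), '%Y-%m-%d') date,
       SUM(counter) totalCount
FROM my_table
GROUP BY date;

EXPLAIN
SELECT (epoch_time_in_millis DIV 86400000) * 86400000 ms,
       SUM(counter) totalCount
FROM my_table
GROUP BY (epoch_time_in_millis DIV 86400000) * 86400000;

-- Both plans will typically report "Using temporary; Using filesort" in the
-- Extra column, i.e. the aggregation step dominates and the per-row
-- arithmetic is noise either way.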

Related

The efficiency of HAVING vs subquery and why

While studying a SQL tutorial on HAVING, I read: HAVING is the “clean” way to filter a query that has been aggregated, but this is also commonly done using a subquery.
Sometimes a HAVING clause is equivalent to a subquery, as in this pair:
select account_id, sum(total_amt_usd) as sum_amount
from demo.orders
group by account_id
having sum(total_amt_usd) >= 250000;
select *
from (
select account_id, sum(total_amt_usd) as sum_amount
from demo.orders
group by account_id
) as subtable
where sum_amount >= 250000;
I want to know which one is recommended, and why it is faster or more efficient than the other.
As with any performance question, you should try it on your data. But, the two should be essentially equivalent. If you are interested in such questions, then you should learn how to read execution plans.
Just one note about MySQL: it tends to materialize subqueries in the FROM clause. This can incur a little extra overhead from writing out the GROUP BY results before filtering them, but you would probably not notice the difference.
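If you want to see that materialization yourself, here is a sketch with EXPLAIN (demo.orders as in the question):
EXPLAIN
select account_id, sum(total_amt_usd) as sum_amount
from demo.orders
group by account_id
having sum(total_amt_usd) >= 250000;

EXPLAIN
select *
from (
  select account_id, sum(total_amt_usd) as sum_amount
  from demo.orders
  group by account_id
) as subtable
where sum_amount >= 250000;

-- In older MySQL versions the second plan typically shows an extra DERIVED
-- step (the materialized subquery); MySQL 8.0.22+ can push the outer WHERE
-- down into the derived table, closing the gap.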

MySQL SQL optimization

This query takes an hour:
select *,
unix_timestamp(finishtime)-unix_timestamp(submittime) timetaken
from joblog
where jobname like '%cas%'
and submittime>='2013-01-01 00:00:00'
and submittime<='2013-01-10 00:00:00'
order by id desc limit 300;
but the same query with one submittime finishes in about 0.03 seconds.
The table has 2.1 million rows.
Any idea what's causing the issue, or how to debug it?
Your first step should be to use MySQL's EXPLAIN to see what the query is doing; it will probably give you some insight into how to fix the issue.
My guess is that jobname LIKE '%cas%' is the slowest part, because you're doing a wildcard text search. Adding an index on jobname won't help, because the pattern has a leading wildcard. Is there any way to run this query without a leading wildcard? Separately, adding an index on submittime might improve the speed of this query.
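As a sketch of that last suggestion (the index name is just illustrative):
CREATE INDEX idx_joblog_submittime ON joblog (submittime);

-- Re-check the plan afterwards; the submittime range should now be resolved
-- via the index instead of a full table scan:
EXPLAIN
select *, unix_timestamp(finishtime)-unix_timestamp(submittime) timetaken
from joblog
where jobname like '%cas%'
  and submittime >= '2013-01-01 00:00:00'
  and submittime <= '2013-01-10 00:00:00'
order by id desc limit 300;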
You might also try adjusting the LIMIT on the query to see if that speeds up the results ...
Excerpt from http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
"Sometimes MySQL does not use an index, even if one is available. One circumstance under which this occurs is when the optimizer estimates that using the index would require MySQL to access a very large percentage of the rows in the table. (In this case, a table scan is likely to be much faster because it requires fewer seeks.) However, if such a query uses LIMIT to retrieve only some of the rows, MySQL uses an index anyway, because it can much more quickly find the few rows to return in the result. "
select *,unix_timestamp(finishtime)-unix_timestamp(submittime) timetaken
from joblog
where (submittime between '2013-01-10 00:00:00' and '2013-01-19 00:00:00')
and jobname is not null
and jobname like '%cas%';
This helped (0.93 seconds).

Very slow query, any other ways to format this with better performance?

I have this query (which I didn't write) that was working fine for a client until the table got more than a few thousand rows in it; now it takes 40+ seconds on only 4200 rows.
Any suggestions on how to optimize it while getting the same result?
I've tried a few other approaches, but they didn't return the correct result that this slower query does...
SELECT COUNT(*) AS num
FROM `fl_events`
WHERE id IN(
SELECT DISTINCT (e2.id)
FROM `fl_events` AS e1, fl_events AS e2
WHERE e1.startdate >= now() AND e1.startdate = e2.startdate
)
ORDER BY `startdate`
Any help would be greatly appreciated!
Apart from the obviously needed indexes, I don't really get why you are joining the table with itself just to build the IN condition. The ORDER BY is also not needed on a single-row COUNT. Are you sure your query can't be written simply as this?:
SELECT COUNT(*) AS num
FROM `fl_events` AS e1
WHERE e1.startdate >= now()
I don't think rewriting the query will help much. The key to your question is "until the table got more than a few thousand rows." This implies that important columns aren't indexed. Below a certain number of records, all the data fits in a single memory block; beyond that point it spills across blocks, and an index is the only way to speed up the search.
First, check that id in fl_events is actually marked as the primary key. That physically orders the records, and without it you can see data corruption and occasionally super-slow results. The use of DISTINCT in the query makes it look like id might NOT be a unique value, which would pose a problem.
Then, make sure to add an index on startdate.
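A sketch of those two suggestions (skip the first statement if id is already the primary key; the index name is illustrative):
ALTER TABLE fl_events ADD PRIMARY KEY (id);
CREATE INDEX idx_fl_events_startdate ON fl_events (startdate);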
The slowness is probably related to the join of the event table with itself, and possibly startdate not having an index.

Which is a less expensive query: count(id) or order by id?

I'd like to know which of the following would execute faster against a MySQL database. The table has 200-1000 entries.
SELECT id
from TABLE
order by id desc
limit 1
or
SELECT count(id)
from TABLE
The background is that the table is cached, so this query is executed before every cache retrieval to determine whether the cached data is stale, by comparing the result against the previous value.
So if there is an even less expensive query, please let me know. Thanks.
If you
start from 1,
never have any gaps,
use the MyISAM engine (not InnoDB, which keeps no such count), and
id is not nullable,
then the 2nd could run [ever so marginally] faster, because the engine never has to visit the table data at all (the row count is stored in the table metadata).
Otherwise,
if the table has NO index on ID (causing a SCAN), the 2nd one is faster
Barring both the above
the first one is faster
And if you actually meant to ask SELECT .. LIMIT 1 vs SELECT MAX(id).. then the answer is actually that they are the same for MySQL and most sane DBMS, whether or not there is an index.
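For the record, a sketch of that comparison, using the question's placeholder table name:
SELECT id FROM `TABLE` ORDER BY id DESC LIMIT 1;
SELECT MAX(id) FROM `TABLE`;

-- With an index on id, EXPLAIN for the MAX(id) form typically reports
-- "Select tables optimized away": the value is read straight off the end of
-- the index, just as the LIMIT 1 form reads a single index entry.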
I think the first query will run faster, as it is limited to fetching one row; that said, with 200-1000 rows the difference may not matter much.
As already pointed out in the comments, your table is so small that it really doesn't matter which solution you pick. For this reason select count(id) should be used, as it expresses the intent and doesn't need any further processing.
Now, select count(id) comes with an alternative: select count(*). These two are not synonyms. select count(*) counts the number of rows (and will use a cached value if possible), whereas select count(id) counts the number of non-NULL values in the column id. If the id column is declared NOT NULL, the two are equivalent and the cached row count may be used.
The choice between count(*) and count(id) depends, once again, on your intent. In the general case, count(*) describes the intent better.
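A tiny, hypothetical illustration of that difference, using a throwaway table:
CREATE TABLE count_demo (id INT NULL);
INSERT INTO count_demo VALUES (1), (2), (NULL);

SELECT COUNT(*)  FROM count_demo;  -- 3: counts rows
SELECT COUNT(id) FROM count_demo;  -- 2: counts non-NULL values of id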
Then there is the possibility of count(1), which is actually a synonym of count(*) in MySQL, but the interpretation may vary if you end up using a different RDBMS.
The performance of each type of count also varies depending on whether you are using MyISAM or InnoDB: the row count is cached by the former but not by the latter, if I've understood correctly.
In the end, you should rely on query plans and running tests and measuring their performance rather than these general ramblings.
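One way to run such a measurement in MySQL is the statement profiler (available in the versions current at the time; newer releases steer you toward the Performance Schema instead). A sketch, again with the question's placeholder table name:
SET profiling = 1;
SELECT id FROM `TABLE` ORDER BY id DESC LIMIT 1;
SELECT COUNT(id) FROM `TABLE`;
SHOW PROFILES;  -- lists each statement above with its wall-clock duration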

Should criteria be duplicated on subqueries

I have a query which actually runs two queries against a table. The main query scans the whole table and computes a DATEDIFF, and a subquery gives me the sum of the days (total_days) each unit spent in certain operational steps. The main query limits the results to the REP depot, so technically I don't need to put the same criterion on the subquery, since repair_order is unique.
Would it be faster, slower or no difference to apply the depot filter on the subquery?
SELECT
*,
DATEDIFF(date_shipped, date_received) as htg_days,
(SELECT SUM(t3.total_days)
 FROM report_tables.cycle_time_days AS t3
 WHERE t1.repair_order = t3.repair_order
   AND (operation = 'MFG' OR operation = 'ENG' OR operation = 'ENGH' OR operation = 'HOLD')
 GROUP BY t3.repair_order) AS subt_days
FROM
report_tables.cycle_time_days as t1
WHERE
YEAR(t1.date_shipped)=2010
AND t1.depot='REP'
GROUP BY
repair_order
ORDER BY
date_shipped;
I run into this with a lot of situations but I never know if it would be better to put the filter in the sub query, main query or both.
In this example, moving the WHERE clause that filters on REP into the subquery would actually change the query, so it wouldn't be about performance at that point, it would be about getting the same result set. In general, though, if you will get exactly the same result set by moving a WHERE clause elsewhere in a complex query, it is better to do so at the most atomic level possible, i.e., in the subquery. The subquery then returns a smaller result set to the main query before the main query has to process it.
The answer to your question will vary depending on your schema, the complexity of your queries, the reliability of your data, etc. A general rule of thumb is to try to process the least amount of data possible, which generally means filtering it at the lowest level possible as well.
When you want to optimize a query, the absolute number one place to start is the EXPLAIN output: see what optimizations the optimizer was able to figure out, and check what the weakest link in the query plan is. Resolve that, rinse, repeat.
You can also use EXPLAIN's "extended" keyword to see the actual query MySQL built to run, which reveals more about how your criteria are used. In some cases it will optimize away duplicate conditions between parent and subqueries; in other cases it may push conditions down from the parent into the subquery. For some (too) complex queries I've even seen it repeat a condition that was only specified once. Thankfully, you don't have to guess: MySQL's explain plan reveals all, albeit sometimes in cryptic ways.
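A sketch of that technique against the question's tables: run EXPLAIN EXTENDED, then SHOW WARNINGS prints the rewritten statement the optimizer will actually execute (in MySQL 5.7+ plain EXPLAIN includes the extended output and the keyword is deprecated):
EXPLAIN EXTENDED
SELECT t1.*,
       (SELECT SUM(t3.total_days)
        FROM report_tables.cycle_time_days AS t3
        WHERE t1.repair_order = t3.repair_order) AS subt_days
FROM report_tables.cycle_time_days AS t1
WHERE YEAR(t1.date_shipped) = 2010
  AND t1.depot = 'REP';

SHOW WARNINGS;  -- shows the query as rewritten by the optimizer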
I usually use a derived table as a "driver or aggregating" query, then join that result back onto whatever table I want to pull data from:
select
t1.*,
datediff(t1.date_shipped, t1.date_received) as htg_days,
subt_days.total_days
from
cycle_time_days as t1
inner join
(
-- aggregating/driver query
select
repair_order,
sum(total_days) as total_days
from
cycle_time_days
where
year(date_shipped) = 2010 and depot = 'REP' and
operation in ('MFG','ENG','ENGH','HOLD') -- covering index on date, depot, op ???
group by
repair_order -- indexed ??
having
total_days > 14 -- added for demonstration purposes
order by
total_days desc limit 10
) as subt_days on t1.repair_order = subt_days.repair_order
order by
t1.date_shipped;