Optimize performance of complex GROUP BY query - mysql

Is there a way to optimize the following query. It takes about 11s:
SELECT
concat(UNIX_TIMESTAMP(date), '000') as datetime,
TRUNCATE(SUM(royalty_price*conversion_to_usd*
(CASE WHEN sales_or_return = 'R' THEN -1 ELSE 1 END)*
(CASE WHEN royalty_currency = 'JPY' THEN .80
WHEN royalty_currency in ('AUD', 'NZD') THEN .95 ELSE 1 END) )
,2) as total_in_usd
FROM
sales_raw
GROUP BY
date
ORDER BY
date ASC
Doing an explain I get:
1 SIMPLE sales_raw index NULL date 5 NULL 735855 NULL

This is an answer to the question in the comment. It formats better here:
An example of filtering on a set of indexed dates means to do something like this:
where date >= AStartDateVariable
and date < TheDayAfterAnEndDateVariable
If there is no index on the date field, create one.

You may be able to speed this up. You seem to have an index on date. What is happening is that the rows are read in the index, then each row is looked up. If the data is not ordered by the date field, then this might not be optimal, because the reads will be on, essentially, random pages. In the case where the original table does not fit into memory, this results in a condition called "page thrashing". A record is needed, the page is read from memory (displacing another page in the memory cache), and the next read probably also results in a cache miss.
To see if this is occuring, I would suggest one of two things. (1) Try removing the index on date or switching the group by criterion to concat(UNIX_TIMESTAMP(date), '000'). Either of these should remove the index as a factor.
From your additional comment, this is not occuring, although the benefit of the index appears to be on the small side.
(2) You can also expand the index to include all the tables used in the query. Besides date, the index would need to contain royalty_price, conversion_to_usd, sales_or_return, and royalty_currency. This would allow the index to fully satisfy the query, without looking up additional infromation in the pages.
You can also check with your DBA to be aure that you have a large enough page cache that matches your hardware capabilities.

This is a simple group by query which does not even involve joins. I would expect the problem to lie on the functions you are using.
Please start with a simple query just retrieving date and the sum of conversion_to_usd. Check performance and build up the query step by step always checking performance. It should not take long to spot the culprit.
Concats are usually slow operations but I wonder if truncate after sum might be confusing the optimiser. The 2nd case could be replaced by relying on a join with a table of currency codes and respective percentages, but it's not obvious that it makes a big difference. First spot the culprit operation.
You could also store the values with the correct amount but that introduces a denormalisation.

Related

Should I avoid ORDER BY in queries for large tables?

In our application, we have a page that displays user a set of data, a part of it actually. It also allows user to order it by a custom field. So in the end it all comes down to query like this:
SELECT name, info, description FROM mytable
WHERE active = 1 -- Some filtering by indexed column
ORDER BY name LIMIT 0,50; -- Just a part of it
And this worked just fine, as long as the size of table is relatively small (used only locally in our department). But now we have to scale this application. And let's assume, the table has about a million of records (we expect that to happen soon). What will happen with ordering? Do I understand correctly, that in order to do this query, MySQL will have to sort a million records each time and give a part of it? This seems like a very resource-heavy operation.
My idea is simply to turn off that feature and don't let users select their custom ordering (maybe just filtering), so that the order would be a natural one (by id in descending order, I believe the indexing can handle that).
Or is there a way to make this query work much faster with ordering?
UPDATE:
Here is what I read from the official MySQL developer page.
In some cases, MySQL cannot use indexes to resolve the ORDER BY,
although it still uses indexes to find the rows that match the WHERE
clause. These cases include the following:
....
The key used to
fetch the rows is not the same as the one used in the ORDER BY:
SELECT * FROM t1 WHERE key2=constant ORDER BY key1;
So yes, it does seem like mysql will have a problem with such a query? So, what do I do - don't use an order part at all?
The 'problem' here seems to be that you have 2 requirements (in the example)
active = 1
order by name LIMIT 0, 50
The former you can easily solve by adding an index on the active field
The latter you can improve by adding an index on name
Since you do both in the same query, you'll need to combine this into an index that lets you resolve the active value quickly and then from there on fetches the first 50 names.
As such, I'd guess that something like this will help you out:
CREATE INDEX idx_test ON myTable (active, name)
(in theory, as always, try before you buy!)
Keep in mind though that there is no such a thing as a free lunch; you'll need to consider that adding an index also comes with downsides:
the index will make your INSERT/UPDATE/DELETE statements (slightly) slower, usually the effect is negligible but only testing will show
the index will require extra space in de database, think of it as an additional (hidden) special table sitting next to your actual data. The index will only hold the fields required + the PK of the originating table, which usually is a lot less data then the entire table, but for 'millions of rows' it can add up.
if your query selects one or more fields that are not part of the index, then the system will have to fetch the matching PK fields from the index first and then go look for the other fields in the actual table by means of the PK. This probably is still (a lot) faster than when not having the index, but keep this in mind when doing something like SELECT * FROM ... : do you really need all the fields?
In the example you use active and name but from the text I get that these might be 'dynamic' in which case you'd have to foresee all kinds of combinations. From a practical point this might not be feasible as each index will come with the downsides of above and each time you add an index you'll add supra to that list again (cumulative).
PS: I use PK for simplicity but in MSSQL it's actually the fields of the clustered index, which USUALLY is the same thing. I'm guessing MySQL works similarly.
Explain your query, and check, whether it goes for filesort,
If Order By doesnt get any index or if MYSQL optimizer prefers to avoid the existing index(es) for sorting, it goes with filesort.
Now, If you're getting filesort, then you should preferably either avoid ORDER BY or you should create appropriate index(es).
if the data is small enough, it does operations in Memory else it goes on the disk.
so you may try and change the variable < sort_buffer_size > as well.
there are always tradeoffs, one way to improve the preformance of order query is to set the buffersize and then the run the order by query which improvises the performance of the query
set sort_buffer_size=100000;
<>
If this size is further increased then the performance will start decreasing

What is the best way to save the status of a record in a MySQL table beyond simple 0 or 1?

If I have a table full of records, they could be payments, or bookings or a multitide of other entities, is there a best practice for saving the status of each record beyond a simple 0 for not active and 1 for active?
For example, a payment might have the status 'pending', 'completed' or 'failed'. The way I have previously done it, is to have another table with a series of definitions in value/text pairs ie. 0 = 'failed', 1 = 'pending' and 2 = 'completed'. I would then store 0, 1 or 2 in the payments table and use an inner join to read the text from the definitions table if needed.
This method sometime seems overly complicated and unnecessary, and I have been thinking of changing my method to simply saving the word 'completed' directly in the status field of the payments table for example.
Is this considered bad practice, and if so, what is the best practice?
These seem to be transaction records, so potentially there are many of them and query performance will be an issue. So, it's probably smart to organize your status column or columns in such a way that compound index access to the records you need will be straightforward.
It's hard to give you crisp "do this, don't do that" advice without knowing your query patterns, so here are a couple of scenarios.
Suppose you need to get all the active bookings this month. You'll want a query of the form
SELECT whatever
FROM xactions
WHERE active = 1 and type = 2 /*bookings*/
AND xaction_date >= CURDATE() - INTERVAL DAY(CURDATE()) DAY
This will perform great with a compound BTREE index on (active,type,xaction_date) . The query can be satisfied by random accessing the index to the first eligible record and then scanning it sequentially.
But if you have type=2 meaning active bookings and type=12 meaning inactive bookings, and you want all bookings both active and inactive this month, your query will look like this:
SELECT whatever
FROM xactions
WHERE type IN (2,12)
AND xaction_date >= CURDATE() - INTERVAL DAY(CURDATE()) DAY
This won't be able to scan a compound index quite so easily due to the IN(2,12) clause needing disjoint ranges of values.
tl;dr In MySQL it's easier to index separate columns for various items of status to get better query performance. But it's hard to know without understanding query patterns.
For the specific case you mention, MySQL supports ENUM datatypes.
In your example, an ENUM seems appropriate - it limits the range of valid options, it's translated back to human-readable text in results, and it creates legible code. It has some performance advantages at query time.
However, see this answer for possible drawbacks.
If the status is more than an on/off bool type, then I always have a lookup table as you describe. Apart from being (I believe) a better normalised design, it makes objects based on the data entities easier to code and use.

What difference does it make which column SQL COUNT() is run on?

Firstly, this is not asking In SQL, what's the difference between count(column) and count(*)?.
Say I have a users table with a primary key user_id and another field logged_in which describes if the user is logged in right now.
Is there a difference between running
SELECT COUNT(user_id) FROM users WHERE logged_in=1
and
SELECT COUNT(logged_in) FROM users WHERE logged_in=1
to see how many users are marked as logged in? Maybe a difference with indexes?
I'm running MySQL if there are DB-specific nuances to this.
In MySQL, the count function will not count null expressions, so the results of your two queries may be different. As mentioned in the comments and Remus' answer, this is as a general rule for SQL and part of the spec.
For example, consider this data:
user_id logged_in
1 1
null 1
SELECT COUNT(user_id) on this table will return 1, but SELECT COUNT(logged_in) will return 2.
As a practical matter, the results from the example in the question ought to always be the same, as long as the table is properly constructed, but the utilized indexes and query plans may differ, even though the results will be the same. Additionally, if that's a simplified example, counting on different columns may change the results as well.
See also this question: MySQL COUNT() and nulls
For the record: the two queries return different results. As the spec says:
Returns a count of the number of non-NULL values of expr in the rows
retrieved by a SELECT statement.
You may argue that given the condition for logged_in=1 the NULL logged_in rows are filtered out anyway, and user_id will not have NULLs in a table users. While this may be true, it does not change the fundamentals that the queries are different. You are asking the query optimizer to make all the logical deductions above, for you they may be obvious but for the optimizer may be is not.
Now, assuming that the results are in practice always identical between the two, the answer is simple: don't run such a query in production (and I mean either of them). Is a scan, no matter how you slice it. logged_in has too low cardinality to matter. Keep a counter, update it at each log in and each log out event. It will drift in time, refresh as often as needed (once a day, once an hour).
As for the question itself: SELECT COUNT(somefield) FROM sometable can use a narrow index on somefield resulting in less IO. The recommendation is to use * because this room for the optimizer to use any index it sees fit (this will vary from product to product though, depending on how smart a query optimizer are we dealing with, YMMV). But as you start adding WHERE clauses the possibile alternatives (=indexes to use) quickly vanish.

How to improve count performance without indexing more fields?

There are over 2 millions record in the table.
I want to count how many errors (with checked) in the table and how many has been checked.
I do two queries:
SELECT count(*) as CountError FROM table WHERE checked = 1 AND error != ''
-
SELECT count(*) as Checked FROM table WHERE checked = 1
The performance is really slow, it take about 5 mins to get the result. How to improve this?
I have already have index on status field for the UPDATE performance.
If I index on checked field - then UPDATE performance will be effected which I do not want that.
UPDATE happen more than SELECT.
The table are Innob
You can try if making both counts in the same query is faster:
select
count(*) as CountError,
sum(case when error != '' then 1 else 0 end) as Checked
from table
where checked = 1
However, the difference will probably not be much to talk about. If you really want a difference then you need to add an index. Consider what the impact really would mean, and make an actual test to get a feel for what the impact could really be. If the update gets 10% slower and the select gets 100000% faster, then it might still be worth it.
Your problem here is simply that your checked field is either 1 or 0 which means that MySQL needs to do a table scan even though you have a key as it's unable to efficiently determine where the split between 0 and 1 is, especially on large amounts of rows.
The main advisory I would offer is the one which you don't want, which is to index checked as then SELECT SUM(checked) AS Checked FROM table WHERE checked=1 would be able to use the index without hitting the table.
Ultimately though, that's not a trivial query. You may wish to look at some way of archiving counts. If you have a date or timestamp then you could set up a task daily which would could store the count(*)'s for the previous day. That in turn would leave you fewer rows to parse on-the-fly.
Without further information as to the exact purpose of this table, the reason why you won't allow an index on that column etc. it is hard to suggest anything more helpful than the above + throwing hardware at it.

Which performs better in a MySQL where clause: YEAR() vs BETWEEN?

I need to find all records created in a given year from a MySQL database. Is there any way that one of the following would be slower than the other?
WHERE create_date BETWEEN '2009-01-01 00:00:00' AND '2009-12-31 23:59:59'
or
WHERE YEAR(create_date) = '2009'
This:
WHERE create_date BETWEEN '2009-01-01 00:00:00' AND '2009-12-31 23:59:59'
...works better because it doesn't alter the data in the create_date column. That means that if there is an index on the create_date, the index can be used--because the index is on the actual value as it exists in the column.
An index can't be used on YEAR(create_date), because it's only using a portion of the value (that requires extraction).
Whenever you use a function against a column, it must perform the function on every row in order to see if it matches the constant. This prevents the use of an index.
The basic rule of thumb, then, is to avoid using functions on the left side of the comparison.
Sargable means that the DBMS can use an index. Use a column on the left side and a constant on the right side to allow the DBMS to utilize an index.
Even if you don't have an index on the create_date column, there is still overhead on the DBMS to run the YEAR() function for each row. So, no matter what, the first method is most likely faster.
I would expect the former to be quicker as it is sargable.
Ideas:
Examine the explain plans; if they are identical, query performance will probably be nearly the same.
Test the performance on a large corpus of test data (which has most of its rows in years other than 2009) on a production-grade machine (ensure that the conditions are the same, e.g. cold / warm caches)
But I'd expect BETWEEN to win. Unless the optimiser is clever enough to do the optimisation for YEAR(), in which case would be the same.
ANOTHER IDEA:
I don't think you care.
If you have only a few records per year, then the query would be fast even if it did a full table scan, because even with (say) 100 years' data, there are so few records.
If you have a very large number of records per year (say 10^8) then the query would be very slow in any case, because returning that many records takes a long time.
You didn't say how many years' data you keep. I guess if it's an archaeological database, you might have a few thousand, in which case you might care if you have a massive load of data.
I find it extremely unlikely that your application will actually notice the difference between a "good" explain plan (using an index range scan) and a "bad" explain plan (full table scan) in this case.