How to improve count performance without indexing more fields? - mysql

There are over 2 million records in the table.
I want to count how many checked rows have an error, and how many rows have been checked in total.
I run two queries:
SELECT count(*) as CountError FROM table WHERE checked = 1 AND error != ''
-
SELECT count(*) as Checked FROM table WHERE checked = 1
The performance is really slow; it takes about 5 minutes to get the result. How can I improve this?
I already have an index on the status field for UPDATE performance.
If I index the checked field, UPDATE performance will be affected, which I do not want.
UPDATEs happen more often than SELECTs.
The table is InnoDB.

You can check whether computing both counts in the same query is faster:
select
count(*) as Checked,
sum(case when error != '' then 1 else 0 end) as CountError
from table
where checked = 1
However, the difference will probably not be much to talk about. If you really want a difference then you need to add an index. Consider what the impact really would mean, and make an actual test to get a feel for what the impact could really be. If the update gets 10% slower and the select gets 100000% faster, then it might still be worth it.
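As a rough illustration of the single-pass idea, here is a minimal sketch that runs the combined query against a tiny in-memory table. It uses Python's sqlite3 module as a stand-in for MySQL, and the table name items and the sample rows are made up for the example.

```python
import sqlite3

# SQLite stands in for MySQL here; `items` and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, checked INTEGER, error TEXT)"
)
conn.executemany(
    "INSERT INTO items (checked, error) VALUES (?, ?)",
    [(1, ""), (1, "timeout"), (1, "bad row"), (0, "ignored"), (0, "")],
)

# One scan over the checked rows yields both numbers:
# COUNT(*) counts every checked row, the SUM counts only those with an error.
checked, count_error = conn.execute(
    """
    SELECT COUNT(*) AS Checked,
           SUM(CASE WHEN error != '' THEN 1 ELSE 0 END) AS CountError
    FROM items
    WHERE checked = 1
    """
).fetchone()

print(checked, count_error)
```

This only halves the number of scans; without an index the remaining scan still reads the whole table.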

Your problem here is simply that your checked field is either 1 or 0 which means that MySQL needs to do a table scan even though you have a key as it's unable to efficiently determine where the split between 0 and 1 is, especially on large amounts of rows.
The main advice I would offer is the one which you don't want, which is to index checked, as then SELECT SUM(checked) AS Checked FROM table WHERE checked=1 would be able to use the index without hitting the table.
Ultimately though, that's not a trivial query. You may wish to look at some way of archiving counts. If you have a date or timestamp column, you could set up a daily task which stores the count(*) values for the previous day. That in turn would leave you fewer rows to parse on the fly.
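A minimal sketch of the archiving idea, again using Python's sqlite3 as a stand-in for MySQL. The tables items and daily_counts, the created column, and both helper functions are hypothetical names invented for the example.

```python
import sqlite3

# SQLite stands in for MySQL; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE items (id INTEGER PRIMARY KEY, created TEXT, checked INTEGER);
    CREATE TABLE daily_counts (day TEXT PRIMARY KEY, checked_count INTEGER);
    """
)
conn.executemany(
    "INSERT INTO items (created, checked) VALUES (?, ?)",
    [
        ("2024-01-01", 1), ("2024-01-01", 1), ("2024-01-01", 0),
        ("2024-01-02", 1),
        ("2024-01-03", 1), ("2024-01-03", 0),
    ],
)


def archive_day(conn, day):
    """Daily task: store the checked count for one finished day."""
    conn.execute(
        "INSERT OR REPLACE INTO daily_counts (day, checked_count) "
        "SELECT ?, COUNT(*) FROM items WHERE created = ? AND checked = 1",
        (day, day),
    )


def total_checked(conn, today):
    """Archived totals for past days plus a live count over today's rows."""
    archived = conn.execute(
        "SELECT COALESCE(SUM(checked_count), 0) FROM daily_counts WHERE day < ?",
        (today,),
    ).fetchone()[0]
    live = conn.execute(
        "SELECT COUNT(*) FROM items WHERE created >= ? AND checked = 1",
        (today,),
    ).fetchone()[0]
    return archived + live


archive_day(conn, "2024-01-01")
archive_day(conn, "2024-01-02")
total = total_checked(conn, "2024-01-03")
print(total)
```

Note the assumption baked in: the archived numbers stay correct only if rows from past days are no longer updated. If old rows can still flip their checked flag, the affected days would have to be re-archived.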
Without further information as to the exact purpose of this table, the reason why you won't allow an index on that column etc. it is hard to suggest anything more helpful than the above + throwing hardware at it.

Faster counts with mysql by sampling table

I'm looking for a way I can get a count for records meeting a condition but my problem is the table is billions of records long and a basic count(*) is not possible as it times out.
I thought that maybe it would be possible to sample the table by doing something like selecting 1/4th of the records. I believe that older records will be more likely to match so I'd need a method which accounts for this (perhaps random sorting).
Is it possible or reasonable to query a certain percent of rows in mysql? And is this the smartest way to go about solving this problem?
The query I currently have which doesn't work is pretty simple:
SELECT count(*) FROM table_name WHERE deleted_at IS NOT NULL
SHOW TABLE STATUS will 'instantly' give an approximate Row count. (There is an equivalent SELECT ... FROM information_schema.tables.) However, this may be significantly far off.
On InnoDB, a count(*) that can use a secondary index (any index other than the PRIMARY KEY) will be faster because that index is smaller than the table data. But this still may not be fast enough.
There is no way to "sample". Or at least no way that is reliably better than SHOW TABLE STATUS. EXPLAIN SELECT ... with some simple query will do an estimate; again, not necessarily any better.
Please describe what kind of data you have; there may be some other tricks we can use.
See also Random . There may be a technique that will help you "sample". Be aware that all techniques are subject to various factors of how the data was generated and whether there has been "churn" on the table.
Can you periodically run the full COUNT(*) and save it somewhere? And then maintain the count after that?
I assume you don't have this case. (Else the solution is trivial.)
AUTO_INCREMENT id
Never DELETEd or REPLACEd or INSERT IGNOREd or ROLLBACKd any rows
Add an index on the deleted_at column to improve execution time,
and try counting id instead of *, if the table has an id column.

Should I avoid ORDER BY in queries for large tables?

In our application, we have a page that displays user a set of data, a part of it actually. It also allows user to order it by a custom field. So in the end it all comes down to query like this:
SELECT name, info, description FROM mytable
WHERE active = 1 -- Some filtering by indexed column
ORDER BY name LIMIT 0,50; -- Just a part of it
And this worked just fine, as long as the size of table is relatively small (used only locally in our department). But now we have to scale this application. And let's assume, the table has about a million of records (we expect that to happen soon). What will happen with ordering? Do I understand correctly, that in order to do this query, MySQL will have to sort a million records each time and give a part of it? This seems like a very resource-heavy operation.
My idea is simply to turn off that feature and don't let users select their custom ordering (maybe just filtering), so that the order would be a natural one (by id in descending order, I believe the indexing can handle that).
Or is there a way to make this query work much faster with ordering?
UPDATE:
Here is what I read from the official MySQL developer page.
In some cases, MySQL cannot use indexes to resolve the ORDER BY,
although it still uses indexes to find the rows that match the WHERE
clause. These cases include the following:
....
The key used to
fetch the rows is not the same as the one used in the ORDER BY:
SELECT * FROM t1 WHERE key2=constant ORDER BY key1;
So yes, it does seem like mysql will have a problem with such a query? So, what do I do - don't use an order part at all?
The 'problem' here seems to be that you have 2 requirements (in the example)
active = 1
order by name LIMIT 0, 50
The former you can easily solve by adding an index on the active field
The latter you can improve by adding an index on name
Since you do both in the same query, you'll need to combine this into an index that lets you resolve the active value quickly and then from there on fetches the first 50 names.
As such, I'd guess that something like this will help you out:
CREATE INDEX idx_test ON myTable (active, name)
(in theory, as always, try before you buy!)
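To "try before you buy", you can compare the query plan before and after creating the index. The sketch below does that with Python's sqlite3 module as a stand-in; SQLite's EXPLAIN QUERY PLAN output differs from MySQL's EXPLAIN (where you would look for "Using filesort" in the Extra column), and the sample data and the plan helper are made up for the example.

```python
import sqlite3

# SQLite stands in for MySQL; the sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, active INTEGER, "
    "name TEXT, info TEXT, description TEXT)"
)
conn.executemany(
    "INSERT INTO mytable (active, name, info, description) VALUES (?, ?, ?, ?)",
    [(i % 2, f"name{i:04d}", "info", "desc") for i in range(1000)],
)

QUERY = (
    "SELECT name, info, description FROM mytable "
    "WHERE active = 1 ORDER BY name LIMIT 0, 50"
)


def plan(conn, sql):
    """Join the detail column of SQLite's EXPLAIN QUERY PLAN into one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


plan_before = plan(conn, QUERY)  # full scan plus a separate sort step
conn.execute("CREATE INDEX idx_test ON mytable (active, name)")
plan_after = plan(conn, QUERY)   # index lookup, rows already in name order

rows = conn.execute(QUERY).fetchall()
print(plan_before)
print(plan_after)
```

With the composite index the engine can jump to active = 1 and read the first 50 names in index order, instead of sorting every matching row.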
Keep in mind though that there is no such a thing as a free lunch; you'll need to consider that adding an index also comes with downsides:
the index will make your INSERT/UPDATE/DELETE statements (slightly) slower; usually the effect is negligible, but only testing will show
the index will require extra space in the database; think of it as an additional (hidden) special table sitting next to your actual data. The index only holds the indexed fields plus the PK of the originating table, which is usually a lot less data than the entire table, but for 'millions of rows' it can add up.
if your query selects one or more fields that are not part of the index, then the system will have to fetch the matching PK values from the index first and then look up the other fields in the actual table by means of the PK. This is probably still (a lot) faster than not having the index, but keep it in mind when doing something like SELECT * FROM ... : do you really need all the fields?
In the example you use active and name, but from the text I gather that these might be 'dynamic', in which case you'd have to foresee all kinds of combinations. From a practical point of view this might not be feasible, as each index comes with the downsides listed above, and every index you add incurs them again (they are cumulative).
PS: I use PK for simplicity but in MSSQL it's actually the fields of the clustered index, which USUALLY is the same thing. I'm guessing MySQL works similarly.
Run EXPLAIN on your query and check whether it uses filesort.
If ORDER BY cannot use an index, or if the MySQL optimizer prefers not to use the existing index(es) for sorting, it falls back to filesort.
If you are getting filesort, you should preferably either avoid the ORDER BY or create appropriate index(es).
If the data is small enough, the sort is done in memory; otherwise it goes to disk.
So you may also try changing the sort_buffer_size variable.
There are always tradeoffs. One way to improve the performance of an ORDER BY query is to set the sort buffer size before running the query:
set sort_buffer_size=100000;
If this size is further increased then the performance will start decreasing

What is the best way to save the status of a record in a MySQL table beyond simple 0 or 1?

If I have a table full of records (they could be payments, bookings, or a multitude of other entities), is there a best practice for saving the status of each record beyond a simple 0 for not active and 1 for active?
For example, a payment might have the status 'pending', 'completed' or 'failed'. The way I have previously done it, is to have another table with a series of definitions in value/text pairs ie. 0 = 'failed', 1 = 'pending' and 2 = 'completed'. I would then store 0, 1 or 2 in the payments table and use an inner join to read the text from the definitions table if needed.
This method sometimes seems overly complicated and unnecessary, and I have been thinking of changing it to simply saving the word 'completed' directly in the status field of the payments table, for example.
Is this considered bad practice, and if so, what is the best practice?
These seem to be transaction records, so potentially there are many of them and query performance will be an issue. So, it's probably smart to organize your status column or columns in such a way that compound index access to the records you need will be straightforward.
It's hard to give you crisp "do this, don't do that" advice without knowing your query patterns, so here are a couple of scenarios.
Suppose you need to get all the active bookings this month. You'll want a query of the form
SELECT whatever
FROM xactions
WHERE active = 1 and type = 2 /*bookings*/
AND xaction_date >= CURDATE() - INTERVAL (DAY(CURDATE()) - 1) DAY
This will perform great with a compound BTREE index on (active,type,xaction_date) . The query can be satisfied by random accessing the index to the first eligible record and then scanning it sequentially.
But if you have type=2 meaning active bookings and type=12 meaning inactive bookings, and you want all bookings both active and inactive this month, your query will look like this:
SELECT whatever
FROM xactions
WHERE type IN (2,12)
AND xaction_date >= CURDATE() - INTERVAL (DAY(CURDATE()) - 1) DAY
This won't be able to scan a compound index quite so easily due to the IN(2,12) clause needing disjoint ranges of values.
tl;dr In MySQL it's easier to index separate columns for various items of status to get better query performance. But it's hard to know without understanding query patterns.
For the specific case you mention, MySQL supports ENUM datatypes.
In your example, an ENUM seems appropriate - it limits the range of valid options, it's translated back to human-readable text in results, and it creates legible code. It has some performance advantages at query time.
However, see this answer for possible drawbacks.
If the status is more than an on/off bool type, then I always have a lookup table as you describe. Apart from being (I believe) a better normalised design, it makes objects based on the data entities easier to code and use.
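For reference, the lookup-table layout described in the question can be sketched as follows. This uses Python's sqlite3 as a stand-in for MySQL; the table names payment_status and payments and the sample rows are hypothetical.

```python
import sqlite3

# SQLite stands in for MySQL; table names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE payment_status (
        id    INTEGER PRIMARY KEY,
        label TEXT NOT NULL UNIQUE
    );
    INSERT INTO payment_status (id, label)
    VALUES (0, 'failed'), (1, 'pending'), (2, 'completed');

    CREATE TABLE payments (
        id        INTEGER PRIMARY KEY,
        amount    REAL,
        status_id INTEGER NOT NULL REFERENCES payment_status(id)
    );
    INSERT INTO payments (amount, status_id)
    VALUES (10.0, 2), (25.5, 1), (7.0, 0);
    """
)

# The payments table stores only the small integer; the text is joined in
# when a human-readable status is needed.
rows = conn.execute(
    """
    SELECT p.id, p.amount, s.label
    FROM payments p
    INNER JOIN payment_status s ON s.id = p.status_id
    ORDER BY p.id
    """
).fetchall()
print(rows)
```

Queries that only filter by status can compare status_id directly and skip the join entirely.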

Best way to check for updated rows in MySQL

I am trying to see if there were any rows updated since the last time it was checked.
I'd like to know if there are any better alternatives to
"SELECT id FROM xxx WHERE changed > some_timestamp;"
However, as there are 200,000+ rows it can get heavy pretty fast... would a count be any better?
"SELECT count(*) FROM xxx WHERE changed > some_timestamp;"
I have thought of creating a unit test but I am not the best at this yet /:
Thanks for the help!
EDIT: Because in many cases there would not be any rows that changed, would it be better to always test with a MAX(xx) first, and if its greater than the old update timestamp given, then do a query?
If you just want to know if any rows have changed, the following query is probably faster than either of yours:
SELECT id FROM xxx WHERE changed > some_timestamp LIMIT 1
Just for the sake of completeness: Make sure you have an index on changed.
Edit: A tiny performance improvement
Now that I think about it, you should probably SELECT changed instead of selecting the id, because the query can then be answered from the index alone, without accessing the table at all. This query will tell you pretty quickly if any change was performed.
SELECT changed FROM xxx WHERE changed > some_timestamp LIMIT 1
It should be a tiny bit faster than my first query - not by a lot, though, since accessing a single table row is going to be very fast.
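A small sketch of the "did anything change?" check, using Python's sqlite3 as a stand-in for MySQL. The has_changes helper, the index name idx_changed, and the integer timestamps are assumptions made for the example.

```python
import sqlite3

# SQLite stands in for MySQL; timestamps are plain integers for simplicity.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xxx (id INTEGER PRIMARY KEY, changed INTEGER)")
conn.execute("CREATE INDEX idx_changed ON xxx (changed)")
conn.executemany(
    "INSERT INTO xxx (changed) VALUES (?)", [(t,) for t in (100, 200, 300)]
)


def has_changes(conn, since):
    """Return True if at least one row was changed after `since`."""
    row = conn.execute(
        "SELECT changed FROM xxx WHERE changed > ? LIMIT 1", (since,)
    ).fetchone()
    return row is not None


print(has_changes(conn, 250))  # a row with changed = 300 exists
print(has_changes(conn, 300))  # nothing is newer than 300
```

LIMIT 1 lets the engine stop at the first index entry past the threshold rather than collecting every changed row.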
Should I select MAX(changed) instead?
Selecting MAX(changed), as suggested by Federico should pretty much result in the same index access pattern. Finding the highest element in an index is a very cheap operation. Finding any element that is greater than some constant is potentially cheaper, so both should have approximately the same performance. In either case, both queries are extremely fast even on very large tables if - and only if - there is an index.
Should I first check if any rows were changed, and then retrieve the rows in a separate step
No. If there is no row that has changed, SELECT id FROM xxx WHERE changed > some_timestamp will be as fast as any such check making it pointless to perform it separately. It only turns into a slower operation when there are results. Unless you add expensive operations (such as ORDER BY), the performance should be (almost) linear to the number of rows retrieved.
Make an index on the changed column and run:
SELECT MAX(changed) FROM xxx;
If the table is MyISAM, the query will be immediate.

Optimize performance of complex GROUP BY query

Is there a way to optimize the following query. It takes about 11s:
SELECT
concat(UNIX_TIMESTAMP(date), '000') as datetime,
TRUNCATE(SUM(royalty_price*conversion_to_usd*
(CASE WHEN sales_or_return = 'R' THEN -1 ELSE 1 END)*
(CASE WHEN royalty_currency = 'JPY' THEN .80
WHEN royalty_currency in ('AUD', 'NZD') THEN .95 ELSE 1 END) )
,2) as total_in_usd
FROM
sales_raw
GROUP BY
date
ORDER BY
date ASC
Doing an explain I get:
id: 1, select_type: SIMPLE, table: sales_raw, type: index, possible_keys: NULL, key: date, key_len: 5, ref: NULL, rows: 735855, Extra: NULL
This is an answer to the question in the comment. It formats better here:
An example of filtering on a set of indexed dates means to do something like this:
where date >= AStartDateVariable
and date < TheDayAfterAnEndDateVariable
If there is no index on the date field, create one.
You may be able to speed this up. You seem to have an index on date. What is happening is that the rows are read in index order, then each row is looked up in the table. If the data is not physically ordered by the date field, this might not be optimal, because the reads will hit essentially random pages. In the case where the original table does not fit into memory, this results in a condition called "page thrashing": a record is needed, its page is read from disk (displacing another page in the memory cache), and the next read probably also results in a cache miss.
To see if this is occurring, I would suggest one of two things. (1) Try removing the index on date or switching the group by criterion to concat(UNIX_TIMESTAMP(date), '000'). Either of these should remove the index as a factor.
From your additional comment, this is not occurring, although the benefit of the index appears to be on the small side.
(2) You can also expand the index to include all the columns used in the query. Besides date, the index would need to contain royalty_price, conversion_to_usd, sales_or_return, and royalty_currency. This would allow the index to fully satisfy the query, without looking up additional information in the data pages.
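A minimal sketch of such a covering index, using Python's sqlite3 as a stand-in for MySQL (in MySQL's EXPLAIN you would look for "Using index" in the Extra column instead). The index name idx_cover, the extra title column, and the sample rows are hypothetical, and the output formatting (CONCAT, TRUNCATE) is left out.

```python
import sqlite3

# SQLite stands in for MySQL; the `title` column and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_raw (id INTEGER PRIMARY KEY, date TEXT, "
    "royalty_price REAL, conversion_to_usd REAL, sales_or_return TEXT, "
    "royalty_currency TEXT, title TEXT)"
)
conn.executemany(
    "INSERT INTO sales_raw (date, royalty_price, conversion_to_usd, "
    "sales_or_return, royalty_currency, title) VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2024-01-01", 10.0, 1.0, "S", "USD", "a"),
        ("2024-01-01", 4.0, 1.0, "R", "USD", "b"),
        ("2024-01-02", 100.0, 1.0, "S", "JPY", "c"),
    ],
)

# Every column the query touches is in the index, so the table rows
# themselves never need to be read.
conn.execute(
    "CREATE INDEX idx_cover ON sales_raw (date, royalty_price, "
    "conversion_to_usd, sales_or_return, royalty_currency)"
)

QUERY = """
SELECT date,
       SUM(royalty_price * conversion_to_usd
           * (CASE WHEN sales_or_return = 'R' THEN -1 ELSE 1 END)
           * (CASE WHEN royalty_currency = 'JPY' THEN .80
                   WHEN royalty_currency IN ('AUD', 'NZD') THEN .95
                   ELSE 1 END)) AS total_in_usd
FROM sales_raw
GROUP BY date
ORDER BY date ASC
"""

plan_text = " | ".join(
    row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)
)
totals = conn.execute(QUERY).fetchall()
print(plan_text)
print(totals)
```

The tradeoff is a wider index that costs more space and more work on every write, so it is worth measuring on real data first.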
You can also check with your DBA to be sure that you have a page cache large enough to match your hardware capabilities.
This is a simple group by query which does not even involve joins. I would expect the problem to lie on the functions you are using.
Please start with a simple query just retrieving date and the sum of conversion_to_usd. Check performance and build up the query step by step always checking performance. It should not take long to spot the culprit.
Concats are usually slow operations but I wonder if truncate after sum might be confusing the optimiser. The 2nd case could be replaced by relying on a join with a table of currency codes and respective percentages, but it's not obvious that it makes a big difference. First spot the culprit operation.
You could also store the values with the correct amount but that introduces a denormalisation.