MySQL: Optimize query with DISTINCT - mysql

In my Java application I have found a small performance issue, caused by this simple query:
SELECT DISTINCT a
FROM table
WHERE checked = 0
LIMIT 10000
I have an index on the checked column.
In the beginning, the query is very fast (i.e. while almost all rows have checked = 0). But as I mark more and more rows as checked, the query becomes very slow (up to several minutes).
How can I improve the performance of this query? Should I add a composite index on
a, checked
or rather on
checked, a?
My table has many millions of rows, which is why I do not want to test this by trial and error and hope for a lucky guess.

I would add an index on (checked, a). With it, the value you are returning is already in the index, so there is no need to go back to the table to fetch it. Secondly, if you are doing lots of individual updates to the table, there is a good chance that both the table and the index have become fragmented on disk. Rebuilding (compacting) the table and its indexes can significantly improve performance.
You can also use the query rewritten as (just in case the optimizer does not understand that it's equivalent):
SELECT a
FROM table
WHERE checked = 0
GROUP BY a
LIMIT 10000
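A minimal sketch of the suggestions in this answer, assuming a hypothetical table name my_table (the question only shows a placeholder) and a made-up index name idx_checked_a. EXPLAIN shows whether the new index is picked up (look for "Using index" in the Extra column). OPTIMIZE TABLE rebuilds the table and its indexes, which can take a long time on a table with many millions of rows, so it is worth trying on a copy first.

```sql
-- my_table and idx_checked_a are placeholder names, not taken from the question.
CREATE INDEX idx_checked_a ON my_table (checked, a);

-- Compare the plan before and after adding the index.
EXPLAIN
SELECT DISTINCT a
FROM my_table
WHERE checked = 0
LIMIT 10000;

-- Rebuild (defragment) the table and its indexes.
OPTIMIZE TABLE my_table;
```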

Add an index on the DISTINCT column (a in this case). MySQL is able to use this index for the DISTINCT.
MySQL may also benefit from a compound index on (a, checked) (the order matters: the DISTINCT column has to be at the start of the index). Try both and compare the results with your data and your queries.
(After adding this index you should see Using index for group-by in the EXPLAIN output.)
See GROUP BY optimization in the manual. (A DISTINCT is very similar to a GROUP BY.)
The most efficient way to process GROUP BY is when an index is used to directly retrieve the grouping columns. With this access method, MySQL uses the property of some index types that the keys are ordered (for example, BTREE). This property enables use of lookup groups in an index without having to consider all keys in the index that satisfy all WHERE conditions.
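To try both variants from this answer, something like the following could be used. The table name my_table and the index names are assumptions made for illustration. Whether the optimizer actually chooses the loose index scan depends on the data and the MySQL version, so treat the EXPLAIN output as the source of truth.

```sql
-- Placeholder names: my_table, idx_a, idx_a_checked.

-- Variant 1: index on the DISTINCT column only.
CREATE INDEX idx_a ON my_table (a);

-- Variant 2: compound index with the DISTINCT column first.
CREATE INDEX idx_a_checked ON my_table (a, checked);

-- The key column shows which index was chosen; Extra may show
-- "Using index for group-by" when the loose index scan is used.
EXPLAIN
SELECT DISTINCT a
FROM my_table
WHERE checked = 0
LIMIT 10000;

-- Drop whichever index turns out not to help, for example:
DROP INDEX idx_a ON my_table;
```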

My table has a lot of millions of rows <...> where almost all rows have
checked=0
In this case it seems that the best index would be a simple (a).
UPDATE:
It was not clear how many rows get checked. From your comment below the question:
At the beginning 0 is in 100% rows, but at the end of the day it will
become 0%
This changes everything. So @Ben has the correct answer.

I have found a completely different solution that does the trick. I will simply create a new table with all possible unique "a" values. This will allow me to avoid DISTINCT.

You don't state it, but are you updating the index regularly? As changes occur to the underlying data, the index becomes less and less accurate and processing gets worse and worse. If you have an index on checked, and checked is being updated over time, you need to make sure your index is updated accordingly on a regular basis.

Related

Should I avoid ORDER BY in queries for large tables?

In our application, we have a page that displays a set of data to the user (only a part of it, actually). It also allows the user to order it by a custom field. So in the end it all comes down to a query like this:
SELECT name, info, description FROM mytable
WHERE active = 1 -- Some filtering by indexed column
ORDER BY name LIMIT 0,50; -- Just a part of it
And this worked just fine as long as the table was relatively small (it was used only locally in our department). But now we have to scale this application. Let's assume the table has about a million records (we expect that to happen soon). What will happen with ordering? Do I understand correctly that, in order to run this query, MySQL will have to sort a million records each time and return only a part of them? This seems like a very resource-heavy operation.
My idea is simply to turn off that feature and not let users select their custom ordering (maybe just filtering), so that the order would be a natural one (by id in descending order; I believe the indexing can handle that).
Or is there a way to make this query work much faster with ordering?
UPDATE:
Here is what I read from the official MySQL developer page.
In some cases, MySQL cannot use indexes to resolve the ORDER BY,
although it still uses indexes to find the rows that match the WHERE
clause. These cases include the following:
....
The key used to
fetch the rows is not the same as the one used in the ORDER BY:
SELECT * FROM t1 WHERE key2=constant ORDER BY key1;
So yes, it does seem like MySQL will have a problem with such a query. So what do I do: drop the ORDER BY part altogether?
The 'problem' here seems to be that you have 2 requirements (in the example)
active = 1
order by name LIMIT 0, 50
The former you can easily solve by adding an index on the active field.
The latter you can improve by adding an index on name.
Since you do both in the same query, you'll need to combine this into an index that lets you resolve the active value quickly and then from there on fetches the first 50 names.
As such, I'd guess that something like this will help you out:
CREATE INDEX idx_test ON myTable (active, name)
(in theory, as always, try before you buy!)
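One way to "try before you buy" is to look at the plan. The sketch below assumes the idx_test index above has already been created on mytable; the column names come from the question.

```sql
-- Assumes idx_test (active, name) already exists on mytable.
EXPLAIN
SELECT name, info, description
FROM mytable
WHERE active = 1
ORDER BY name
LIMIT 0, 50;

-- If the index serves both the filter and the ordering, the key column
-- should show idx_test and Extra should no longer contain "Using filesort".
```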
Keep in mind, though, that there is no such thing as a free lunch; adding an index also comes with downsides:
the index will make your INSERT/UPDATE/DELETE statements (slightly) slower; usually the effect is negligible, but only testing will show
the index will require extra space in the database; think of it as an additional (hidden) special table sitting next to your actual data. The index only holds the required fields plus the PK of the originating table, which usually is a lot less data than the entire table, but for 'millions of rows' it can add up.
if your query selects one or more fields that are not part of the index, the system has to fetch the matching PK values from the index first and then look up the other fields in the actual table by means of the PK. This is probably still (a lot) faster than not having the index, but keep it in mind when doing something like SELECT * FROM ... : do you really need all the fields?
In the example you use active and name, but from the text I gather that these might be 'dynamic', in which case you would have to foresee all kinds of combinations. From a practical point of view this might not be feasible, as each index comes with the downsides listed above, and they accumulate with every index you add.
PS: I use PK for simplicity but in MSSQL it's actually the fields of the clustered index, which USUALLY is the same thing. I'm guessing MySQL works similarly.
Run EXPLAIN on your query and check whether it uses a filesort.
If ORDER BY has no usable index, or if the MySQL optimizer prefers not to use the existing index(es) for sorting, it falls back to a filesort.
If you are getting a filesort, you should preferably either avoid the ORDER BY or create the appropriate index(es).
If the data is small enough, the sort is done in memory; otherwise it goes to disk.
So you may also try changing the sort_buffer_size variable.
There are always tradeoffs. One way to improve the performance of an ORDER BY query is to set the sort buffer size before running the query:
set sort_buffer_size=100000;
If this size is increased further, performance will start to decrease.
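A hedged sketch of how the effect of sort_buffer_size could be checked rather than guessed. The value 1048576 is only an example, not a recommendation. Sort_merge_passes counts how many merge passes the sort had to make; if it keeps growing, the buffer was too small for the sort.

```sql
-- Current value for this session.
SHOW VARIABLES LIKE 'sort_buffer_size';

-- Counter before the test.
SHOW SESSION STATUS LIKE 'Sort_merge_passes';

-- Example value only (1 MiB); this affects the current session only.
SET SESSION sort_buffer_size = 1048576;

-- ... run the ORDER BY query here ...

-- Counter after the test; compare with the value seen before.
SHOW SESSION STATUS LIKE 'Sort_merge_passes';
```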

mysql IN clause not using possible keys

I have a fairly simple MySQL query which contains a few inner joins and then a WHERE clause. I have created indexes for all the columns that are used in the joins, as well as the primary keys. The WHERE clause contains an IN operator. When 5 or fewer ids are passed into the IN clause, the query optimizer uses one of my indexes and the query runs in a reasonable amount of time. When I use EXPLAIN I see that the type is range and the key is PRIMARY. My issue is that if I use more than 5 ids in the IN clause, the optimizer ignores all the available indexes and the query runs extremely slowly. When I use EXPLAIN I see that the type is ALL and the key is NULL.
Could someone please shed some light on what is happening here and how I could fix this?
Thanks
Regardless of the "primary key" indexes on the tables to optimize the JOINs, you should also have an index based on common criteria you are applying a WHERE against. More info needed on columns of your query, but you should have an index on your WHERE criteria TOO.
You could also try using MySQL index hints. They let you specify which index should be used during query execution.
Examples:
SELECT * FROM table1 USE INDEX (col1_index,col2_index)
WHERE col1=1 AND col2=2 AND col3=3;
-
SELECT * FROM table1 IGNORE INDEX (col3_index)
WHERE col1=1 AND col2=2 AND col3=3;
More information here: MySQL Index Hints
Found this while checking up on a similar problem I am having. Thought my findings might help anyone else with a similar issue in future.
I have a MyISAM table with about 30 rows (it contains common typos of similar words for a search, where both the possible original typo and the alternative word may be valid spellings; the table will slowly grow). The cutoff for me is that with 4 items in the IN clause the index is used, but with 5 items the index is ignored (note that I haven't tried alternative words, so the actual individual items in the IN clause might be a factor). So it is similar to the OP's case, but with a different number of items.
USE INDEX did not work: the index was still ignored. FORCE INDEX does work, although I would prefer to avoid naming indexes in queries (just in case someone deletes the index).
As a test, I padded out the table with an extra 1000 random unique rows; the query then used the relevant index even with 80 items in the IN clause.
So it seems MySQL decides whether to use the index based on the number of items in the IN clause compared to the number of rows in the table (there are probably other factors at play, though).
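For reference, a FORCE INDEX hint on an IN query could look like the sketch below. The table typo_words and its columns are hypothetical stand-ins, since neither poster shows a schema. Refreshing the index statistics with ANALYZE TABLE is a cheap thing to try first, although it may not change the optimizer's choice on a very small table.

```sql
-- typo_words, id and word are made-up names for illustration.

-- Refresh index statistics, then re-check the plan.
ANALYZE TABLE typo_words;

-- Force the primary key to be used for the IN lookup.
EXPLAIN
SELECT id, word
FROM typo_words FORCE INDEX (PRIMARY)
WHERE id IN (1, 2, 3, 4, 5, 6);
```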

Index optimised for specific lookup

I have a table (which could potentially grow large, ~ millions of rows) on which I regularly perform the query SELECT * from table WHERE somefield = 20, and I would like this query to run fast. At any time, I expect this query to return at most 10 rows out of possibly millions for this specific value 20 (no guarantees for any other values). What would be the proper way to index this? Is it sufficient to just place an index on somefield and make sure statistics are roughly up to date? Or are there any other tricks I could try to optimise this?
The ideal index for this query in isolation would be an index with key column somefield and included columns of all other columns in the table (either by making the index clustered or an NCI with the INCLUDE option).
This would allow the values to be seeked into directly and avoid the need for bookmark lookups.
But the maintenance overhead of an NCI with all those included columns would affect data modification operations, and you might prefer a CI defined on different key columns to benefit other queries or to avoid fragmentation anyway.
So for that reason you may well prefer to define an NCI on somefield alone and live with the 10 bookmark lookups. It is a balancing act.
Edit. Actually if you are only interested in optimising the query where somefield = 20 then you could just create a filtered index on that value. I likely would then include all columns in that index definition.
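The filtered index mentioned in the edit could be sketched as follows. This is SQL Server (T-SQL) syntax, which is an assumption based on the terms used in this answer (NCI, INCLUDE, filtered index); the question does not name the database. The table name dbo.mytable, the included columns col_a and col_b, and the index name are hypothetical.

```sql
-- SQL Server (T-SQL). dbo.mytable, col_a, col_b and the index name
-- are placeholders; list the columns the query actually returns.
CREATE NONCLUSTERED INDEX ix_mytable_somefield_20
ON dbo.mytable (somefield)
INCLUDE (col_a, col_b)
WHERE somefield = 20;
```

Because the index only contains the rows where somefield = 20, it stays tiny even if the table grows to millions of rows, at the cost of being useless for any other value.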
Is it sufficient to just place an index on somefield, and make sure statistics are roughly up to date?
Yes, quite simply. Make sure somefield is the right type (i.e. int). If somefield needs to contain text there's more you can do, but otherwise a normal index will be fine.
You can get small increases (and I do mean small) if you don't need every field returned by not using SELECT * (you won't need somefield, presumably, as you already know what it is).
Yes, you would want to add an index on somefield.
If you are not doing other queries, then you may want to make it a clustered index, but without context, it's hard to say conclusively.

mysql order by performance

Can I avoid a filesort in MySQL when the WHERE condition is on a field in one table and the ORDER BY is on a field in another table? Can an index be used in this situation? Both tables are large (more than 1 million records each).
You are dealing with 1 million records, so you definitely need to add indexes to gain some speed; otherwise the query will be too slow for anyone visiting your site.
You need to examine closely which fields to index. Thanks
You need to look carefully at the indexes using EXPLAIN.
If there are WHERE clauses, either add the column you are ordering by to the index already being used for that table and see if that gets rid of the filesort, or, if no index is currently being used for that table (which I doubt is the case with that many records), just create a new one.
Also worth flagging, and someone else may be able to offer more info on this, is that MySQL often (or always?) can't use an index to sort DESC. I've been in a situation before where it was performant to have a computed field that indexes in reverse order, and to order by that ASC.
E.g.: if you have an integer field that you want to order by DESC, add a field where you store the integer value subtracted from 1000000000 (or some other large number), index it, and order by that field ascending.
As I say, I can't remember the specifics; it may affect older MySQL versions only, but I have a feeling it's a current limitation.
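The reversed-column workaround described in this answer could be sketched like this. The table items and the columns id, score and score_rev are hypothetical names. The application (or a trigger) would have to keep score_rev in sync whenever score changes. Check with EXPLAIN first whether a plain ORDER BY score DESC already avoids the filesort on your MySQL version, since the workaround may be unnecessary.

```sql
-- items, id, score and score_rev are made-up names for illustration.

-- Add the reversed column and fill it from the existing values.
ALTER TABLE items ADD COLUMN score_rev INT NOT NULL DEFAULT 0;
UPDATE items SET score_rev = 1000000000 - score;

-- Index the reversed column.
CREATE INDEX idx_items_score_rev ON items (score_rev);

-- Ascending order on score_rev is descending order on score.
SELECT id, score
FROM items
ORDER BY score_rev ASC
LIMIT 50;
```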

MySQL: low cardinality/selectivity columns = how to index?

I need to add indexes to my table (columns) and stumbled across this post:
How many database indexes is too many?
Quote:
“Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.”
Is an Index really pointless if there are only two distinct values? Given a table as follows (MySQL Database, InnoDB)
Id (BIGINT)
fullname (VARCHAR)
address (VARCHAR)
status (VARCHAR)
Further conditions:
The Database contains 300 Million records
Status can only be “enabled” and “disabled”
150 Million records have status = enabled and 150 Million records have status = disabled
My understanding is that, without an index on status, a select with where status='enabled' would result in a full table scan with 300 Million records to process?
How efficient is the lookup when I use a BTREE index on status?
Should I index this column or not?
What alternatives (maybe other kinds of index) does MySQL InnoDB provide to efficiently look records up by the where status='enabled' clause in the given example, with its very low cardinality/selectivity?
The index that you describe is pretty much pointless. An index is best used when you need to select a small number of rows in comparison to the total rows.
The reason for this is related to how a database accesses a table. A table can be accessed either by a full table scan, where each block is read and processed in turn, or by a rowid or key lookup, where the database has a key/rowid and reads the exact row it requires.
In the case where you use a where clause based on the primary key or another unique index, e.g. where id = 1, the database can use the index to get an exact reference to where the row's data is stored. This is clearly more efficient than doing a full table scan and processing every block.
Now back to your example: with a where clause of where status = 'enabled', the index will return 150m rows and the database will have to read each row in turn using separate small reads, whereas accessing the table with a full table scan allows the database to make use of more efficient larger reads.
There is a point at which it is better to just do a full table scan rather than use the index. With MySQL you can use FORCE INDEX (idx_name) as part of your query to allow comparisons between each table access method.
Reference:
http://dev.mysql.com/doc/refman/5.5/en/how-to-avoid-table-scan.html
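A sketch of the comparison suggested above, assuming a hypothetical table name people with the columns from the question and a made-up index idx_status on status. Run both statements and compare the timings; EXPLAIN on each shows the access method that was used.

```sql
-- people and idx_status are placeholder names.

-- Access through the index on status.
SELECT id, fullname, address
FROM people FORCE INDEX (idx_status)
WHERE status = 'enabled';

-- Same query with the index disabled, forcing a full table scan.
SELECT id, fullname, address
FROM people IGNORE INDEX (idx_status)
WHERE status = 'enabled';
```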
I'm sorry to say that I do not agree with Mike. Adding an index is meant to limit the number of full records MySQL has to read, thereby limiting IO, which usually is the bottleneck.
This indexing is not free; you pay for it on inserts/updates, when the index has to be updated, and in the search itself, as it now needs to load the index file (the full index for 300M records is probably not in memory). So it might well be that you get extra IO instead of limiting it.
I do agree with the statement that a binary variable is best stored as one, a bool or tinyint, as that decreases the length of a row and can thereby limit disk IO, also comparisons on numbers are faster.
If you need speed and you seldom use the disabled records, you may wish to have 2 tables, one for enabled and one for disabled records and move the records when the status changes. As it increases complexity and risk this would be my very last choice of course. Definitely do the move in 1 transaction if you happen to go for it.
It just popped into my head that you can check whether an index is actually used by using the EXPLAIN statement. That should show you how MySQL is optimizing the query. I don't really know how MySQL optimizes queries, but from PostgreSQL I do know that you should explain a query on a database approximately the same (in size and data) as the real database. So if you have a copy of the database, create an index on the table and see whether it's actually used. As I said, I doubt it, but I most definitely don't know everything :)
If the data is distributed 50:50, a query like where status="enabled" will avoid scanning half of the table.
Whether an index on such a column helps depends entirely on the distribution of the data. For example, if 90% of the entries have status enabled and 10% have status disabled, a query with where status="disabled" scans only 10% of the table.
So whether to index such columns depends on the distribution of the data.
@a'r's answer is correct; however, it needs to be pointed out that the usefulness of an index is given not only by its cardinality but also by the distribution of data and the queries run on the database.
In OP's case, with 150M records having status='enabled' and 150M having status='disabled', the index is unnecessary and a waste of resources.
In case of 299M records having status='enabled' and 1M having status='disabled', the index is useful (and will be used) in queries of type SELECT ... where status='disabled'.
Queries of type SELECT ... where status='enabled' will still run with a full table scan.
You will hardly need all 150 million records at once, so I guess "status" will always be used in conjunction with other columns. Perhaps it would make more sense to use a compound index like (status, fullname).
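A sketch of such a compound index, assuming a hypothetical table name people and a made-up index name. The fullname prefix search is only an example of a second condition, not something stated in the question. If fullname is a long VARCHAR, the index may need a prefix length to stay within the key size limit.

```sql
-- people and idx_status_fullname are placeholder names.
CREATE INDEX idx_status_fullname ON people (status, fullname);

-- Example query that can use both parts of the index.
EXPLAIN
SELECT id, fullname
FROM people
WHERE status = 'enabled'
  AND fullname LIKE 'Smith%'
LIMIT 10;
```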
Jan, you should definitely index that column. I'm not sure of the context of the quote, but everything you said above is correct. Without an index on that column, you are most certainly doing a table scan on 300M rows, which is about the worst you can do for that data.
Jan, as asked, where your query involves simply "where status=enabled" without some other limiting factor, an index on that column apparently won't help (glad the SO community showed me what's up). If, however, there is a limiting factor, such as "limit 10", an index may help. Also, remember that indexes are also used in group by and order by optimizations. If you are doing "select count(*),status from table group by status", an index would be helpful.
You should also consider converting status to a tinyint where 0 would represent disabled and 1 would be enabled. You're wasting tons of space storing that string vs. a tinyint which only requires 1 byte per row!
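One possible way to do that conversion, as a sketch only. The table name people and the new column name is_enabled are assumptions. On a table with 300 million rows, each of these statements is expensive, so the UPDATE would probably need to be run in batches and the whole change rehearsed on a copy.

```sql
-- people and is_enabled are placeholder names.

-- 1. Add the compact column (0 = disabled, 1 = enabled).
ALTER TABLE people ADD COLUMN is_enabled TINYINT NOT NULL DEFAULT 0;

-- 2. Fill it from the existing string column.
UPDATE people SET is_enabled = IF(status = 'enabled', 1, 0);

-- 3. Once all application code reads is_enabled, drop the old column.
ALTER TABLE people DROP COLUMN status;
```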
I have a similar column in my MySQL database. Approximately 4 million rows, with the distribution of 90% 1 and 10% 0.
I've just discovered today that my queries (where column = 1) actually run significantly faster WITHOUT the index.
Foolishly I deleted the index. I say foolishly, because I now suspect the queries (where column = 0) may have still benefited from it. So, instead I should explicitly tell MySQL to ignore the index when I'm searching for 1, and to use it when I'm searching for 0. Maybe.