Optimizing a "distinct where equals" query and indices - mysql

I'm trying to optimize a query that looks something like
SELECT DISTINCT(some_attribute)
FROM some_table
WHERE soft_deleted=0
I already have indices on some_attribute and soft_deleted individually.
The table I am pulling from is relatively large (>100GB), so this query can take tens of minutes. Would a multi-column index on some_attribute and soft_deleted make a significant impact, or are there other optimizations I can make?

We are going to assume this table uses the InnoDB storage engine, that the soft_deleted column is an integer-ish datatype, and that the some_attribute column is a smallish datatype.
For the exact query text shown in the question, the optimal execution plan will likely make use of an index with soft_deleted and some_attribute as the leading columns, in that order, i.e.
... ON some_table (soft_deleted, some_attribute, ...)
The index will also contain the columns from the clustered index (even if they aren't listed), so we could also include the names of those columns in the index following the two leading columns. MySQL will also be able to make use of an index that includes additional columns, again following the two leading columns.
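For example, a concrete form of the suggested index (the index name is illustrative):
CREATE INDEX some_table_ix1 ON some_table (soft_deleted, some_attribute);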
Use EXPLAIN to see the execution plan.
I expect the optimal execution plan will include "Using index for group-by" in the Extra column, and avoid a "Using filesort" operation.
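For example:
EXPLAIN SELECT DISTINCT some_attribute FROM some_table WHERE soft_deleted = 0;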
With the index suggested above, compare the execution plan for this query:
SELECT t.some_attribute
FROM some_table t
WHERE t.soft_deleted = 0
GROUP BY t.soft_deleted, t.some_attribute
ORDER BY NULL
If we already have an index defined with some_attribute as the leading column, and also including the soft_deleted column, e.g.
... ON some_table (some_attribute, soft_deleted, ... )
(an index on just the some_attribute column would be redundant, and could be dropped)
we might rewrite the SQL and check the EXPLAIN output for a query like this:
SELECT t.some_attribute
FROM some_table t
GROUP BY t.some_attribute, IF(t.soft_deleted = 0, 1, 0)
HAVING t.soft_deleted = 0
ORDER BY NULL
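A concrete sketch of that alternative index shape, including dropping the now-redundant single-column index (both index names are hypothetical):
CREATE INDEX some_table_ix2 ON some_table (some_attribute, soft_deleted);
DROP INDEX some_attribute_idx ON some_table;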
If we have a guarantee that soft_deleted only has two distinct values, then we could simplify to just
SELECT t.some_attribute
FROM some_table t
GROUP BY t.some_attribute, t.soft_deleted
HAVING t.soft_deleted = 0
ORDER BY NULL
Optimal performance of a query against this table, to return the specified resultset, is likely to come from an execution plan that avoids a "Using filesort" operation and uses an index to satisfy the DISTINCT/GROUP BY operation.
Note that DISTINCT is a keyword, not a function. The parens around some_attribute have no effect and can be omitted. (Including the spurious parens almost makes it look like we think DISTINCT is a function.)
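That is, these two statements are equivalent, because DISTINCT applies to the whole select list rather than to a single column:
SELECT DISTINCT(some_attribute) FROM some_table WHERE soft_deleted = 0;
SELECT DISTINCT some_attribute FROM some_table WHERE soft_deleted = 0;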

Related

MySQL query index

I am using MySQL 5.6 and am trying to optimize the following query:
SELECT t1.field1,
...
t1.field30,
t2.field1
FROM Table1 t1
JOIN Table2 t2 ON t1.fk_int = t2.pk_int
WHERE t1.int_field = ?
AND t1.enum_filed != 'value'
ORDER BY t1.created_datetime desc;
The result can contain millions of records, and every row consists of 31 columns.
Currently, EXPLAIN says in Extra that the planner uses 'Using where'.
I tried adding the following index:
create index test_idx ON Table1 (int_field, enum_filed, created_datetime, fk_int);
After that, EXPLAIN says in Extra that the planner uses "Using index condition; Using filesort".
The "rows" value from EXPLAIN is lower with the index than without it, but in practice the execution time is longer.
So, my questions are:
What is the best index for this query?
Why EXPLAIN says that 'key_len' of query with index is 5. Shouldn't it be 4+1+8+4=17?
Should the fields from ORDER BY be in index?
Should the fields from JOIN be in index?
Try refactoring your index this way: avoid the created_datetime column (or move it to the right, after fk_int), and move fk_int before the enum_filed column. This way the three columns used for filtering should be used more effectively:
create index test_idx ON Table1 (int_field, fk_int, enum_filed);
Also be sure you have a specific index on the Table2 column pk_int. If you don't, add one:
create index test_idx2 ON Table2 (pk_int);
What is the best index for this query?
Maybe (int_field, created_datetime) (See next Q&A for reason.)
Why EXPLAIN says that 'key_len' of query with index is 5. Shouldn't it be 4+1+8+4=17?
enum_filed != defeats the optimizer. If there is only one other value for that enum (and it is NOT NULL), then use = and the other value. And try INDEX(int_field, enum_filed, created_datetime). The Optimizer is much happier with = than with any inequality.
"5" could be indicating 2 columns, or it could be indicating one INT that is Nullable. If int_field can be NULL, consider changing it to NOT NULL; then the "5" would drop to "4".
Should the fields from ORDER BY be in index?
Only if the index can completely handle the WHERE. This usually occurs only if all the WHERE tests are =. (Hence, my previous answer.)
Another case for including those columns is "covering"; see next Q&A.
Should the fields from JOIN be in index?
It depends. One thing that gives some performance benefit is to include all columns mentioned anywhere in the SELECT. This is called a "covering" index and is indicated in EXPLAIN by Using index (not Using index condition). There are too many columns in t1 to add a "covering" index. I think the practical limit is about 5 columns.
My guess for your question № 1:
create index my_idx on Table1(int_field, created_datetime desc, fk_int)
or one of these (but neither will probably be worthwhile):
create index my_idx on Table1(int_field, created_datetime desc, enum_filed, fk_int)
create index my_idx on Table1(int_field, created_datetime desc, fk_int, enum_filed)
I'm supposing 3 things:
Table2.pk_int is already a primary key, judging by the name
The where condition on Table1.int_field is only satisfied by a small subset of Table1
The inequality on Table1.enum_filed (I would fix the typo, if I were you) only excludes a small subset of Table1
Question № 2: key_len refers to the key parts actually used. Don't forget that there is one extra byte for nullable key parts. In your case, if int_field is nullable, it means that this is the only key part used; otherwise both int_field and enum_filed are used.
As for questions 3 and 4: if, as I suppose, it's more efficient to start the query plan from the WHERE condition on Table1.int_field, the composite index, in this case also with the correct sort order (desc), enables a scan of the index that yields the output rows in the correct order, without an extra sort step. Furthermore, also adding fk_int to the index makes the retrieval of any record of Table1 unnecessary unless a corresponding record is present in Table2. For a similar reason you could also add enum_filed to the index, but if this doesn't considerably reduce the output record count, the increase in index size will make things worse instead of better. In the end, you will have to try it out (with realistic data!).
Note that if you put another column between int_field and created_datetime in the index, the index won't provide the created_datetime (for a given int_field) in the desired output order.
The issue was fixed by adding more filters (to the WHERE clause) to the query.
Regarding indexes, two of the proposed indexes were helpful:
From @WalterTross, with the following index for the initial query:
(int_field, created_datetime desc, enum_filed, fk_int)
With my short comment: desc indexes are not supported in MySQL 5.6 - the keyword is parsed but ignored there.
From @RickJames, with the following index for the modified query:
(int_field, created_datetime)
Thanks to everyone who tried to help. I really appreciate it.

SQL gets slow on a simple query with ORDER BY

I have a problem with MySQL ORDER BY: it slows down the query and I really don't know why. My query was a little more complex, so I simplified it to a light query with no joins, but it still runs really slowly.
Query:
SELECT
W.`oid`
FROM
`z_web_dok` AS W
WHERE
W.`sent_eRacun` = 1 AND W.`status` IN(8, 9) AND W.`Drzava` = 'BiH'
ORDER BY W.`oid` ASC
LIMIT 0, 10
The table has 946,566 rows and takes about 500 MB; the fields I am selecting are all indexed, as follows:
oid - INT PRIMARY KEY AUTO_INCREMENT
status - INT INDEXED
sent_eRacun - TINYINT INDEXED
Drzava - VARCHAR(3) INDEXED
I posted screenshots of the EXPLAIN output, the executed query, and the timing after removing ORDER BY (screenshots not reproduced here).
I have also tried sorting by a DATETIME field, which is also indexed, but I get the same slow query as when ordering by the primary key. This started today; it has always been fast and light before.
What can cause something like this?
The kind of query you use here calls for a composite covering index. This one should handle your query very well.
CREATE INDEX someName ON z_web_dok (Drzava, sent_eRacun, status, oid);
Why does this work? You're looking for equality matches on the first three columns, and sorting on the fourth column. The query planner will use this index to satisfy the entire query. It can random-access the index to find the first row matching your query, then scan through the index in order to get the rows it needs.
Pro tip: Indexes on single columns are generally harmful to performance unless they happen to match the requirements of particular queries in your application, or are used for primary or foreign keys. You generally choose your indexes to match your most active, or your slowest, queries. Edit: You asked whether it's better to create specific indexes for each query in your application. The answer is yes.
There may be an even faster way. (Or it may not be any faster.)
The IN(8, 9) gets in the way of easily handling the WHERE..ORDER BY..LIMIT completely efficiently. The possible solution is to treat that as OR, then convert to UNION and do some tricks with the LIMIT, especially if you might also be using OFFSET.
( SELECT ... WHERE .. = 8 AND ... ORDER BY oid LIMIT 10 )
UNION ALL
( SELECT ... WHERE .. = 9 AND ... ORDER BY oid LIMIT 10 )
ORDER BY oid LIMIT 10
This will allow the covering index described by OJones to be fully used in each of the subqueries. Furthermore, each will provide up to 10 rows without any temp table or filesort. Then the outer part will sort up to 20 rows and deliver the 'correct' 10.
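Filled in with the table and columns from the question, the rewrite would look something like this (a sketch, untested):
( SELECT W.oid FROM z_web_dok AS W
  WHERE W.sent_eRacun = 1 AND W.status = 8 AND W.Drzava = 'BiH'
  ORDER BY W.oid ASC LIMIT 10 )
UNION ALL
( SELECT W.oid FROM z_web_dok AS W
  WHERE W.sent_eRacun = 1 AND W.status = 9 AND W.Drzava = 'BiH'
  ORDER BY W.oid ASC LIMIT 10 )
ORDER BY oid LIMIT 10;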
For OFFSET, see http://mysql.rjweb.org/doc.php/index_cookbook_mysql#or

Can a query only use one index per table?

I have a query like this:
( SELECT * FROM mytable WHERE author_id = ? AND seen IS NULL )
UNION
( SELECT * FROM mytable WHERE author_id = ? AND date_time > ? )
Also I have these two indexes:
(author_id, seen)
(author_id, date_time)
I read somewhere:
A query can generally only use one index per table when process the WHERE clause
As you see in my query, there are two separate WHERE clauses. So I want to know: does "only one index per table" mean my query can use just one of those two indexes, or can it use one of those indexes for each subquery, so that both indexes are useful?
In other word, is this sentence true?
"always one of those index will be used, and the other one is useless"
That statement about only using one index is no longer true of MySQL. For instance, it implements the index merge optimization, which can take advantage of two indexes for some WHERE clauses that have OR. Here is a description in the documentation.
You should try this form of your query and see if it uses index merge:
SELECT *
FROM mytable
WHERE author_id = ? AND (seen IS NULL OR date_time > ? );
This should be more efficient than the union version, because it does not incur the overhead of removing duplicates.
Also, depending on the distribution of your data, the above query with an index on mytable(author_id, date_time, seen) might work as well or better than your version.
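A sketch of that suggested index (the index name is illustrative):
CREATE INDEX mytable_author_dt_seen ON mytable (author_id, date_time, seen);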
UNION combines the results of subqueries. Each subquery is executed independently of the others, and then the results are merged. So, in this case, the WHERE conditions are applied to each subquery, not to the united result.
In answer to your question: yes, each subquery can use its own index.
There are cases when the database engine can use more than one index for a single SELECT statement; however, when filtering a single set of rows it is generally not possible. If you want to use indexing on two columns, then build one composite index on both columns instead of two separate indexes, as sketched below.
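A minimal illustration, with hypothetical table and column names:
-- Instead of two single-column indexes:
CREATE INDEX ix_a ON t (a);
CREATE INDEX ix_b ON t (b);
-- prefer one composite index when both columns are filtered together:
CREATE INDEX ix_a_b ON t (a, b);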
Every single subquery, or part of a composite query, is itself a query and can be evaluated as a single query for performance and index access. You can also force the use of a different index for each query. In your case you are using UNION, so these are two separate queries, united into one resulting query. For a brief guide on how MySQL uses indexes, see:
http://dev.mysql.com/doc/refman/5.7/en/mysql-indexes.html

How will I improve mysql query that retrieves 11.2m data?

select tblfarmerdetails.ncode,
tblfarmerdetails.region,tblfarmerdetails.province, tblfarmerdetails.municipality,
concat(tblfarmerdetails.farmerfname, ' ', tblfarmerdetails.farmerlname) as nameoffarmer,
concat(tblfarmerdetails.spousefname, ' ',tblfarmerdetails.spouselname) as nameofspouse, tblstatus.statusoffarmer from tblfarmerdetails
INNER Join
tblstatus on tblstatus.ncode = tblfarmerdetails.ncode where tblstatus.ncode = tblfarmerdetails.ncode order by tblfarmerdetails.region
It takes too long to retrieve the 11.2M rows. How can I improve this query?
Firstly, format the query so it is readable, or at least decipherable, by a human.
SELECT f.ncode
, f.region
, f.province
, f.municipality
, CONCAT(f.farmerfname,' ',f.farmerlname) AS nameoffarmer
, CONCAT(f.spousefname,' ',f.spouselname) AS nameofspouse
, s.statusoffarmer
FROM tblfarmerdetails f
JOIN tblstatus s
ON s.ncode = f.ncode
ORDER BY f.region
It's likely that a lot of time is spent to do a "Using filesort" operation, to sort all the rows in the order specified in the ORDER BY clause. For sure a sort operation is going to occur if there's not an index with a leading column of region.
Having a suitable index available, for example
... ON tblfarmerdetails (region, ... )
means that MySQL may be able to return the rows "in order", using the index, without having to do a sort operation.
If MySQL has a "covering index" available, i.e. an index that contains all of the columns of the table referenced in the query, MySQL can make use of that index to satisfy the query without needing to visit pages in the underlying table.
But given the number of columns, and the potential that some of these columns may be good-sized VARCHARs, this may not be possible or workable:
... ON tblfarmerdetails (region, ncode, province, municipality, farmerfname, farmerlname, spousefname, spouselname)
(MySQL does have some limitations on indexes. The goal of the "covering index" is to avoid lookups to pages in the table.)
And make sure that MySQL knows that ncode is UNIQUE in tblstatus. Either that's the PRIMARY KEY or there's a UNIQUE index.
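For example, if ncode is not already the primary key, a unique index would let MySQL know (the index name is illustrative):
CREATE UNIQUE INDEX tblstatus_ncode_uk ON tblstatus (ncode);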
We suspect the tblstatus table contains a small number of rows, so the join operation is probably not that expensive. But an appropriate covering index, with ncode as the leading column, wouldn't hurt:
... ON tblstatus (ncode, statusoffarmer)
If MySQL has to perform a "Using filesort" operation to get the rows ordered (to satisfy the ORDER BY clause) on a large set, that operation can spill to disk, and that can add (sometimes significantly) to the elapsed time.
The resultset produced by the query has to be transferred to the client. And that can also take some clock time.
And the client has to do something with the rows that are returned.
Are you sure you really need to return 11.2M rows? Or, are you only needing the first couple of thousand rows?
Consider adding a LIMIT clause to the query.
And how long are those lname and fname columns? Do you need MySQL to do the concatenation for you, or could that be done on the client as the rows are processed?
It's possible that MySQL is having to do a "Using temporary" to hold the rows with the concatenated results, and MySQL is likely allocating enough space for that return column to hold the maximum possible length of lname plus the maximum possible length of fname. And if that's a multibyte characterset, that will double or triple the storage over a single-byte characterset.
To really see what's going on, you'd need to take a look at the query execution plan. You get that by preceding your SELECT statement with the keyword EXPLAIN:
EXPLAIN SELECT ...
The output from that will show the operations that MySQL is going to do, what indexes it's going to use. And armed with knowledge about the operations the MySQL optimizer can perform, we can use that to make some pretty good guesses as to how to get the biggest gains.

Does the ORDER BY optimization take effect in the following SELECT statement?

I have a SELECT statement which I would like to optimize. The MySQL ORDER BY Optimization documentation says that in some cases the index cannot be used to optimize the ORDER BY. Specifically the point:
You use ORDER BY on nonconsecutive parts of a key
SELECT * FROM t1 WHERE key2=constant ORDER BY key_part2;
makes me think that this could be the case. I'm using the following indexes:
UNIQUE KEY `met_value_index1` (`RTU_NB`,`DATETIME`,`MP_NB`),
KEY `met_value_index` (`DATETIME`,`RTU_NB`)
With the following SQL statement:
SELECT * FROM met_value
WHERE rtu_nb=constant
AND mp_nb=constant
AND datetime BETWEEN constant AND constant
ORDER BY mp_nb, datetime
Would it be enough to delete the index met_value_index1 and recreate it with the new column order RTU_NB, MP_NB, DATETIME?
Do I have to include RTU_NB in the ORDER BY clause?
Outcome: I tried what @meriton suggested and added the index met_value_index2. The SELECT completed after 1.2 seconds; previously it completed after 5.06 seconds. The following doesn't belong to the question, but as a side note: after some other tries I switched the engine from MyISAM to InnoDB – with rtu_nb, mp_nb, datetime as the primary key – and the statement completed after 0.13 seconds!
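That final change might look something like this (a sketch; it assumes the table had no primary key beforehand):
ALTER TABLE met_value
ENGINE = InnoDB,
ADD PRIMARY KEY (rtu_nb, mp_nb, datetime);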
I don't get your query. If a row must match mp_nb = constant to be returned, all rows returned will have the same mp_nb, so including mp_nb in the ORDER BY clause has no effect. I recommend you use the semantically equivalent statement:
SELECT * FROM met_value
WHERE rtu_nb=constant
AND mp_nb=constant
AND datetime BETWEEN constant AND constant
ORDER BY datetime
to avoid needlessly confusing the query optimizer.
Now, to your question: A database can implement an order by clause without sorting if it knows that the underlying access will return the rows in proper order. In the case of indexes, that means that an index can assist with sorting if the rows matched by the where clause appear in the index in the order requested by the order by clause.
That is the case here, so the database could actually do an index range scan over met_value_index1 for the rows where rtu_nb=constant AND datetime BETWEEN constant AND constant, and then check whether mp_nb=constant for each of these rows, but that would amount to checking far more rows than necessary if mp_nb=constant has high selectivity. Put differently, an index is most useful if the matching rows are contiguous in the index, because that means the index range scan will only touch rows that actually need to be returned.
The following index will therefore be more helpful for this query:
UNIQUE KEY `met_value_index2` (`RTU_NB`,`MP_NB`, `DATETIME`),
as all matching rows will be right next to each other in the index, and the rows appear in the index in the order the ORDER BY clause requests. I cannot say whether the query optimizer is smart enough to get that, so you should check the execution plan.
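For example, with illustrative placeholder constants:
EXPLAIN SELECT * FROM met_value
WHERE rtu_nb = 1
AND mp_nb = 2
AND datetime BETWEEN '2015-01-01' AND '2015-02-01'
ORDER BY datetime;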
I do not think it will use any index for the ORDER BY. But you should look at the execution plan.
The columns tested in the WHERE clause must match the leading columns of the index. So with your current query you need one index with the fields in the order rtu_nb, mp_nb, datetime.