MySQL query tuning (large data set) and explain plan

I am using MySQL 5.1. I have a table which has about 15 lakh (1.5 million) records. This table holds records for different entities, i.e. child records for all the master entities.
There are 8 columns in this table, 6 of which are combined to make the primary key.
These columns could have been individual foreign keys, but for performance reasons we made this change.
Even a simple select statement with two conditions takes 6-8 seconds. Below is the explain plan for it.
Query
explain extended
select distinct location_code, Max(trial_number) as replication
from status_trait t
where t.status_id='N02'
and t.trial_data='orange'
group by location_code
The results of EXPLAIN EXTENDED:
id | select_type | table | type  | possible_keys | key                       | key_len | ref  | rows    | filtered | Extra
1  | SIMPLE      | t     | index | NULL          | FK_HYBRID_EXP_TRAIT_DTL_2 | 5       | NULL | 1481572 | 100.00   | Using where; Using index
I have these questions:
How should tables with this much data be handled?
Is the indexing fine for this table?

Two things might help you here.
First, SELECT DISTINCT is pointless in an aggregating query. Just use SELECT.
Second, you didn't disclose the indexes you have created. However, to satisfy this query efficiently, the following compound covering index will probably help a great deal.
(status_id, trial_data, location_code, trial_number)
Why is this the right index? Because MySQL indexes are organized as BTREE. This organization allows the server to random-access the index to find particular values. In your case you want particular values of status_id and trial_data. Once the server has random-accessed the index, it can then scan sequentially. In this case you hope to scan for various values of location_code. The server knows it will find those different values already in order. Finally, the server needs to pluck out values of trial_number to use in your MAX() function. Lo and behold, there they are in the index ready for the plucking.
(If you're doing a lot of aggregation and querying of large tables, it makes sense for you to learn how compound and covering indexes work.)
There's a cost to adding an index: when you INSERT or UPDATE rows, you have to update your index as well. But this kind of index will greatly accelerate your retrieval.
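For concreteness, here is a sketch of that index and the simplified query, using the table and column names from your question (the index name is just a placeholder):
ALTER TABLE status_trait
  ADD INDEX idx_status_trial (status_id, trial_data, location_code, trial_number);

SELECT location_code, MAX(trial_number) AS replication
FROM status_trait
WHERE status_id = 'N02'
  AND trial_data = 'orange'
GROUP BY location_code;
With that index in place, EXPLAIN should show ref access on the new key and "Using index" in the Extra column.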

Related

MySQL covering index optimization?

I've just heard the term covered index in some database discussion - what does it mean?
A covering index is an index that contains all of the columns you need for your query (and possibly more).
For instance, this:
SELECT *
FROM tablename
WHERE criteria
will typically use indexes to speed up the resolution of which rows to retrieve using criteria, but then it will go to the full table to retrieve the rows.
However, if the index contains the columns column1, column2 and column3, then this SQL:
SELECT column1, column2
FROM tablename
WHERE criteria
and, provided that particular index could be used to speed up the resolution of which rows to retrieve, the index already contains the values of the columns you're interested in, so it won't have to go to the table to retrieve the rows; it can produce the results directly from the index.
This can also be applied proactively: if you see that a typical query uses 1-2 columns to resolve which rows to return, and then typically fetches another 1-2 columns, it can be beneficial to append those extra columns to the index (if they're the same across queries), so that the query processor can get everything from the index itself.
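As a rough sketch of the idea, using the placeholder names from above and assuming the criteria filter on column3:
CREATE INDEX ix_cover ON tablename (column3, column1, column2);

SELECT column1, column2
FROM tablename
WHERE column3 = 42;
-- column3 locates the rows; column1 and column2 are read straight from the
-- index, so the base table is never touched.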
Here's an article on the subject: Index Covering Boosts SQL Server Query Performance.
A covering index is just an ordinary index. It's called "covering" if it can satisfy a query without the need to access the table data.
example:
CREATE TABLE MyTable
(
ID INT IDENTITY PRIMARY KEY,
Foo INT
)
CREATE NONCLUSTERED INDEX index1 ON MyTable(ID, Foo)
SELECT ID, Foo FROM MyTable -- All requested data are covered by index
This is one of the fastest methods to retrieve data from SQL Server.
Covering indexes are indexes which "cover" all columns needed from a specific table, removing the need to access the physical table at all for a given query/ operation.
Since the index contains the desired columns (or a superset of them), table access can be replaced with an index lookup or scan -- which is generally much faster.
Columns to cover:
parameterized or static conditions: columns restricted by a parameterized or constant condition;
join columns: columns dynamically used for joining;
selected columns: columns whose values the query returns.
While covering indexes can often provide a good benefit for retrieval, they do add somewhat to insert/update overhead, due to the need to write extra or larger index rows on every update.
Covering indexes for Joined Queries
Covering indexes are probably most valuable as a performance technique for joined queries. This is because joined queries are more costly and more likely than single-table retrievals to suffer high-cost performance problems.
In a joined query, covering indexes should be considered per-table.
Each covering index removes a physical table access from the plan and replaces it with index-only access.
Investigate the plan costs and experiment with which tables are most worthwhile to replace with a covering index.
By this means, the multiplicative cost of large join plans can be significantly reduced.
For example:
select oi.title, c.name, c.address
from porderitem poi
join porder po on po.id = poi.fk_order
join customer c on c.id = po.fk_customer
where po.orderdate > ? and po.status = 'SHIPPING';
create index porder_custitem on porder (orderdate, id, status, fk_customer);
See:
http://literatejava.com/sql/covering-indexes-query-optimization/
Let's say you have a simple table with the columns below, where only Id is indexed:
Id (Int), Telephone_Number (Int), Name (VARCHAR), Address (VARCHAR)
Imagine you have to run the query below and check whether it uses an index and performs efficiently without extra I/O calls. Remember, you have only created an index on Id.
SELECT Id FROM mytable WHERE Telephone_Number = '55442233';
When you check the performance of this query you will be disappointed: since Telephone_Number is not indexed, the rows have to be fetched from the table using I/O calls. So this is not a covering index, because a column used by the query is not in the index, which leads to frequent I/O calls.
To make it a covering index you need to create a composite index on (Id, Telephone_Number).
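A minimal sketch of that composite index (MySQL syntax; the index name is made up). Note that putting Telephone_Number first would additionally let the server seek directly on the phone number instead of scanning the whole index:
ALTER TABLE mytable ADD INDEX idx_id_phone (Id, Telephone_Number);
-- SELECT Id FROM mytable WHERE Telephone_Number = '55442233'
-- can now be answered entirely from the index, with no table I/O.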
For more details, please refer to this blog:
https://www.percona.com/blog/2006/11/23/covering-index-and-prefix-indexes/

Index on MySQL partitioned tables

I have a table with two partitions. The partitions are pactive = 1 and pinactive = 0. I understand that two partitions do not give much of a gain, but I have used them so I can truncate and load in one partition and do plain inserts in the other partition.
The problem comes when I create indexes.
Query goes this way
select partitionflag,companyid,activityname
from customformattributes
where companyid=47
and activityname = 'Activity 1'
and partitionflag=0
Created index -
create index idx_try on customformattributes(partitionflag,companyid,activityname,completiondate,attributename,isclosed)
There are around 200000 records that will be retrieved by the above query. But the query, along with the mentioned index, takes 30+ seconds. What is the reason for such a long time? Also, if I remove partitionflag from the mentioned index, the index is not even used.
And is this understanding correct:
Even with the partitions available, the optimizer needs the partition column mentioned in the index definition so that it only hits the required partition?
Any ideas on understanding this would be very helpful
You can optimize your index by reordering the columns in it. Usually the columns in an index are ordered by their cardinality (starting from the highest and going down to the lowest). Cardinality is the uniqueness of the data in the given column. So in your case I suppose there are many different values of companyid in the customformattributes table, while partitionflag has a cardinality of 2 (if the only values for this column are 1 and 0).
Your query will first filter all the rows with partitionflag=0, then it will filter by company id and so on.
When you removed partitionflag from the index, the query did not use the index, possibly because the optimizer decided it would be faster to do a full table scan instead of using the index (in most cases the optimizer is right).
For the given query:
select partitionflag,companyid,activityname
from customformattributes
where companyid=47
and activityname = 'Activity 1'
and partitionflag=0
the following index would probably be better:
create index idx_try on customformattributes(companyid,activityname, completiondate,attributename, partitionflag, isclosed)
For the query to use the index, the following rule must be met: the leftmost column in the index should be present in the WHERE clause. Depending on the MySQL version you are using, additional requirements may apply; for example, with old versions of MySQL you may need to list the columns in the WHERE clause in the same order they appear in the index, while in recent versions the query optimizer is responsible for putting the conditions in the correct order.
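A short illustration of the leftmost-prefix rule with the index proposed above (the queries are hypothetical):
-- Can use the index: the leftmost column (companyid) appears in the WHERE clause.
SELECT * FROM customformattributes WHERE companyid = 47 AND activityname = 'Activity 1';

-- Cannot seek on the index: companyid is missing, so the leftmost prefix is not satisfied.
SELECT * FROM customformattributes WHERE activityname = 'Activity 1';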
Your SELECT query took 30+ seconds because it returns 200k rows and because the index might not be optimal for the given query.
For the second question about the partitioning: the common rule is that the column you are partitioning by must be part of every UNIQUE key in the table (a primary key is also a unique key by definition, so the column should be added to the PK as well). If the table structure and logic allow you to add the partitioning column to all the UNIQUE indexes in the table, then you add it and partition the table.
When the partitioning is done correctly you can take advantage of partition pruning: the SELECT query searches only the partitions where the requested data is stored (otherwise it looks in all partitions).
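If you want to confirm that pruning is happening, MySQL 5.1 through 5.6 support EXPLAIN PARTITIONS (later versions show the partitions column in plain EXPLAIN). A sketch against the table from the question:
EXPLAIN PARTITIONS
SELECT partitionflag, companyid, activityname
FROM customformattributes
WHERE companyid = 47
  AND activityname = 'Activity 1'
  AND partitionflag = 0;
-- The partitions column of the output lists the partitions that will be read;
-- with correct pruning it should name only the partition holding partitionflag = 0.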
You can read more about partitioning here:
https://dev.mysql.com/doc/refman/5.6/en/partitioning-overview.html
The query is slow simply because disks are slow.
Cardinality is not important when designing an index.
The optimal index for that query is
INDEX(companyid, activityname, partitionflag) -- in any order
It is "covering" since it includes all the columns mentioned anywhere in the SELECT. This is indicated by "Using index" in the EXPLAIN.
Leaving off the other 3 columns makes the query faster because it will have to read less off the disk.
If you make any changes to the query (add columns, change from '=' to '>', add ORDER BY, etc), then the index may no longer be optimal.
"Also, if remove the partitionflag from the mentioned index, the index is not even used." -- That is because it was no longer "covering".
Keep in mind that there are two ways an index may be used -- "covering" versus being a way to look up the data. When you don't have a "covering" index, the optimizer chooses between using the index and bouncing between the index and the data versus simply ignoring the index and scanning the table.
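For reference, a minimal sketch of creating that covering index on the table from the question (the index name is a placeholder):
ALTER TABLE customformattributes
  ADD INDEX idx_cover (companyid, activityname, partitionflag);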

How to use indexes in MySQL and on what parameters

I'm trying to make my understanding of indexes in MySQL clearer. What I know is that indexes are used to make your queries faster. Beyond that, I have a couple of questions.
Let's say I have this query:
SELECT books.name, books.name2, books.id, books.image, books.faith, books.topic,
       books.downloaded, books.viewed, books.language, books.size,
       books.author AS author_id, authors.name AS author_name, authors.aid
FROM books
LEFT JOIN authors ON books.author = authors.aid
WHERE books.id = '".$id."' AND status = 1
Is any index applicable for this select query while it has a JOIN?
After making an index on a column, will the query for it be optimized or changed?
How do I make an index for this query, and against which column?
What are the other benefits or disadvantages of using indexes?
In which cases should indexes be avoided, and where should they be used more?
Are indexes applicable to random queries?
Are indexes more efficient on IDs?
Please advise, thank you in advance!
You can check details on below links.
https://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
http://mydbsolutions.in/query-optimization-2/
Your questions are addressed below.
Is any index applicable for this select query while it has a JOIN?
It will depend on various factors. For example, if your table has very little data, or roughly 70% of the data in the indexed column is the same value, then MySQL will prefer to scan the table instead of the index. In general, all your join columns should be indexed (they will be indexed automatically if you use foreign keys; otherwise you should index them yourself). Also, the column on which your query filters out the most data should be indexed. In your case you are filtering data on books.id, which should be the primary key and is therefore already indexed.
After making an index on a column, will the query for it be optimized or changed?
The index will start to be used automatically, but in some cases you may need to change your query. Suppose you use a filter condition such as date(order_date)='2015-10-15'; even after creating an index on order_date it will not be used, so you have to rewrite the condition as order_date >= '2015-10-15 00:00:00' AND order_date <= '2015-10-15 23:59:59', assuming your order_date column's data type is DATETIME or TIMESTAMP.
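A sketch of the two forms (the orders table name is hypothetical; order_date is assumed to be DATETIME and indexed):
-- The function wrapped around the column prevents the index from being used:
SELECT * FROM orders WHERE DATE(order_date) = '2015-10-15';

-- Rewritten as a range, the index on order_date can be used:
SELECT * FROM orders
WHERE order_date >= '2015-10-15 00:00:00'
  AND order_date <= '2015-10-15 23:59:59';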
How do I make an index for this query, and against which column?
Here I don't see any need to create an index, since your condition is on the books table's primary key, which is already indexed.
What are the other benefits or disadvantages of using indexes?
If you create indexes blindly, then every index has to be updated each time a record is inserted or updated, which slows those operations down. Very heavy indexes also perform slowly and consume more disk space.
In which cases should indexes be avoided, and where should they be used more?
If more than 70% of the data in a column is the same value, there is no need to create an index on it; for example, status or is_deleted type columns, where most rows will be active.
Are indexes applicable to random queries?
Yes, indexes work on random queries; for repeated queries you can use the query cache, which will be more efficient.
Are indexes more efficient on IDs?
Yes.

Distinct (or group by) using filesort and temp table

I know there are similar questions on this but I've got a specific query / question around why this query
EXPLAIN SELECT DISTINCT RSubdomain FROM R_Subdomains WHERE EmploymentState IN (0,1) AND RPhone='7853932120'
gives me this EXPLAIN output
id | select_type | table       | type  | possible_keys | key        | key_len | ref  | rows | Extra
1  | SIMPLE      | RSubdomains | index | NULL          | RSubdomain | 767     | NULL | 3278 | Using where
with an index on RSubdomain
but if I add in a composite index on EmploymentState/RPhone
I get this output from explain
id | select_type | table       | type  | possible_keys   | key             | key_len | ref  | rows | Extra
1  | SIMPLE      | RSubdomains | range | EmploymentState | EmploymentState | 67      | NULL | 2    | Using where; Using temporary
If I take away the DISTINCT on RSubdomain, it drops the "Using temporary" from the EXPLAIN output. But what I don't get is why, when I add in the composite key (while keeping the key on RSubdomain), the DISTINCT ends up using a temp table, and which index schema is better here. I see that the number of rows scanned with the combined key is far less, but the query is of type range and it's also slower.
Q: why ... does the distinct end up using a temp table?
MySQL is doing a range scan on the index (i.e. reading index blocks) to locate the rows that satisfy the predicates (WHERE clause). Then MySQL has to look up the value of the RSubdomain column from the underlying table (it's not available in the index). To eliminate duplicates, MySQL needs to scan the values of RSubdomain that were retrieved. The "Using temporary" indicates that MySQL is materializing a resultset, which is processed in a subsequent step. (Likely, that's the set of RSubdomain values that was retrieved; given the DISTINCT, it's likely that MySQL is actually creating a temporary table with RSubdomain as a primary or unique key, and only inserting non-duplicate values.)
In the first case, it looks like the rows are being retrieved in order by RSubdomain (likely, that's the first column in the cluster key). That means MySQL needn't compare all of the RSubdomain values against each other; it only needs to check whether the last retrieved value matches the currently retrieved value to determine whether the value can be "skipped."
Q: which index schema is better here?
The optimum index for your query is likely a covering index:
... ON R_Subdomains (RPhone, EmploymentState, RSubdomain)
But with only 3278 rows, you aren't likely to see any performance difference.
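For reference, a minimal sketch of creating that covering index (the index name is a placeholder):
ALTER TABLE R_Subdomains
  ADD INDEX idx_phone_state_sub (RPhone, EmploymentState, RSubdomain);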
FOLLOWUP
Unfortunately, MySQL does not provide the type of instrumentation provided in other RDBMS (like the Oracle event 10046 sql trace, which gives actual timings for resources and waits.)
Since MySQL is choosing to use the index when it is available, that is probably the most efficient plan. For the best efficiency, I'd perform an OPTIMIZE TABLE operation (for InnoDB tables and MyISAM tables with dynamic format, if there have been a significant number of DML changes, especially DELETEs and UPDATEs that modify the length of the row...) At the very least, it would ensure that the index statistics are up to date.
You might want to compare the plan of an equivalent statement that does a GROUP BY instead of a DISTINCT, i.e.
SELECT r.RSubdomain
FROM R_Subdomains r
WHERE r.EmploymentState IN (0,1)
AND r.RPhone='7853932120'
GROUP
    BY r.RSubdomain
For optimum performance, I'd go with a covering index with RPhone as the leading column; that's based on an assumption about the cardinality of the RPhone column (close to unique values), as opposed to only a few different values in the EmploymentState column. That covering index will give the best performance, i.e. the quickest elimination of rows that need to be examined.
But again, with only a couple thousand rows, it's going to be hard to see any performance difference. If the query was examining millions of rows, that's when you'd likely see a difference, and the key to good performance will be limiting the number of rows that need to be inspected.

The EXPLAIN shows that the query is awful (it doesn't use a single key), but I'm using LIMIT 1. Is this a problem?

The explain command with the query:
explain SELECT * FROM leituras
WHERE categorias_id=75 AND
textos_id=190304 AND
cookie='3f203349ce5ad3c67770ebc882927646' AND
endereco_ip='127.0.0.1'
LIMIT 1
The result:
id | select_type | table    | type | possible_keys | key    | key_len | ref    | rows    | Extra
1  | SIMPLE      | leituras | ALL  | (null)        | (null) | (null)  | (null) | 1022597 | Using where
Will it make any difference to add some keys to the table, even though the query will always return only one row?
In answer to your question, yes. You should add indexes where necessary (thanks #Col.Shrapnel) on the columns that appear in your WHERE clause - in this case, categorias_id, textos_id, cookie, and endereco_ip.
If you always perform a query using the same 3 columns in the WHERE clause, it may be beneficial to add an index which comprises the 3 columns in one go, rather than adding individual indexes.
It still has to do a linear search over the table until it finds that one row. So adding indexes could noticeably improve performance.
Yes, indexes are even more important when you want to return only one row.
If you are returning half of the rows and your database system has to scan the entire table, you're still at 50% efficiency.
However, if you want to return just one row, and your database system has to scan 1022597 rows to find your row, your efficiency is minuscule.
LIMIT 1 does offer some efficiency in that it stops as soon as it finds the first matching row, but it obviously has to scan an enormous number of records to find that first row.
Adding an index for each of the columns in your WHERE clause allows your database system to avoid scanning rows that don't match your criteria. With adequate indexes, you'll see that the rows column in the explain will get closer to the actual number of returned rows.
Using a compound index that covers all four of the columns in your WHERE clause allows even better performance and less scanning, as the index will provide full coverage. Compound indexes do use a lot of memory and negatively affect insert performance, so you might only want to add a compound index if a large percentage of your queries repeatedly do a look up on the same columns, or if you rarely insert records, or it's just that important to you for that particular query to be fast.
Another way to improve performance is to return only the columns that you need rather than using SELECT *. If you had a compound index on those four columns, and you returned only those four columns, your database system wouldn't need to hit your records at all. The database system could get everything it needed right from the indexes.
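As a rough sketch of both suggestions combined (the index name is a placeholder; adjust the column list if you need other columns back):
ALTER TABLE leituras
  ADD INDEX idx_leituras_lookup (categorias_id, textos_id, cookie, endereco_ip);

-- Selecting only the indexed columns lets the row be answered from the index alone:
SELECT categorias_id, textos_id, cookie, endereco_ip
FROM leituras
WHERE categorias_id = 75
  AND textos_id = 190304
  AND cookie = '3f203349ce5ad3c67770ebc882927646'
  AND endereco_ip = '127.0.0.1'
LIMIT 1;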