query with multiple left joins leading to query lock - mysql

I am trying to optimize this query as much as possible, but I am still getting query locks because of it. Can anyone provide some suggestions for improving it? The query fetches the last day's entries from the table.
The QUERY:
SELECT CR.id,
CR.servicecode,
CR.leadtime,
CR.redirecturl,
CRE.custemail,
CRE.custlname,
CRE.custfname,
CRE.duration,
CR.userid,
AA.lpintrotimearr,
AA.lpintrotimedep,
AA.landdatetimearr,
AA.landdatetimedep,
CR.newcustid,
CRE.custmobilephone,
CRE.brandname
FROM response CR
LEFT JOIN agreement AA
ON CR.id = AA.id
LEFT JOIN request CRE
ON CRE.id = CR.id
WHERE CR.id > '20120617145243'
AND CR.approved = 1
AND CR.chlapproved != 0
AND CR.chlapproved IS NOT NULL
AND AA.id IS NOT NULL
AND ( AA.stdsign != 'on'
OR AA.stdsign IS NULL )
AND ( AA.ivaflag = 0
OR AA.ivaflag IS NULL )
AND ( AA.opt IS NULL
OR AA.opt = 0 );
The EXPLAIN:
One way is to index all three columns (AA.stdsign, AA.ivaflag and AA.opt), but each of the three flags can take only three different values. Will indexing these reduce the query's run time?
All the ids are of varchar(60) data type.

There isn't much to be improved on the query itself.
On the other hand, setting an index on AA.stdsign, AA.ivaflag and AA.opt should help a lot.
As your EXPLAIN indicates, no suitable key is found for your AA table and all 534956 rows must be scanned to satisfy the WHERE clause.
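For instance, a minimal sketch of such an index as a single composite covering all three flags (the index name is illustrative):
CREATE INDEX ix_agreement_flags ON agreement (stdsign, ivaflag, opt);
Because each condition has an OR ... IS NULL branch, MySQL may not be able to use every column of the index, but it at least gives the optimizer the option.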
[edit]
One last tip: using large column types (such as VARCHAR(60)) for your primary keys is probably sub-optimal.
First reason: every time you need to reference a row (e.g. in a foreign key), you need another VARCHAR(60).
Second reason: comparisons on strings are slower than on integers (hence it may render a JOIN slower than necessary).
You may want to add an INT column to your tables, and use it as primary key.
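A hedged sketch of that change for one of the tables (column and key names are illustrative, and every referencing table would need a matching column):
ALTER TABLE response
  ADD COLUMN numeric_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  ADD UNIQUE KEY ux_response_numeric_id (numeric_id);  -- AUTO_INCREMENT requires a key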

Related

Alternative to COUNT for innodb to prevent table scan?

I've managed to put together a query that works for my needs, albeit more complicated than I was hoping. But for the size of the tables, the query is slower than it should be (0.17s). The reason, based on the EXPLAIN provided below, is a table scan on the meta_relationships table caused by the COUNT in the WHERE clause on an InnoDB engine.
Query:
SELECT
posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
RIGHT JOIN meta_relationships ON (posts.post_id = meta_relationships.object_id)
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
LEFT JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE meta.meta_name = 'computers' AND meta_relationships.object_id
NOT IN (SELECT meta_relationships.object_id FROM meta_relationships
GROUP BY meta_relationships.object_id HAVING count(*) > 1)
GROUP BY meta_relationships.object_id
This particular query selects posts which have ONLY the computers category. The purpose of count > 1 is to exclude posts that contain computers/hardware, computers/software, etc. The more categories that are selected, the higher the count would be.
Ideally, I'd like to get it functioning like this:
WHERE meta.meta_name IN ('computers') AND meta_relationships.meta_order IN (0)
or
WHERE meta.meta_name IN ('computers','software')
AND meta_relationships.meta_order IN (0,1)
etc..
But unfortunately this doesn't work, because it doesn't take into consideration that there may be a meta_relationships.meta_order = 2.
I've tried...
WHERE meta.meta_name IN ('computers')
GROUP BY meta_relationships.meta_order
HAVING meta_relationships.meta_order IN (0) AND meta_relationships.meta_order NOT IN (1)
but it doesn't return the correct amount of rows.
EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY meta ref PRIMARY,idx_meta_name idx_meta_name 602 const 1 Using where; Using index; Using temporary; Using filesort
1 PRIMARY meta_data ref PRIMARY,idx_meta_id idx_meta_id 8 database.meta.meta_id 1
1 PRIMARY meta_relationships ref idx_meta_data_id idx_meta_data_id 8 database.meta_data.meta_data_id 11 Using where
1 PRIMARY posts eq_ref PRIMARY PRIMARY 4 database.meta_relationships.object_id 1
2 MATERIALIZED meta_relationships index NULL idx_object_id 4 NULL 14679 Using index
Tables/Indexes:
meta
This table contains the category and tag names.
indexes:
PRIMARY KEY (meta_id), KEY idx_meta_name (meta_name)
meta_data
This table contains additional data about the categories and tags such as type (category or tag), description, parent, count.
indexes:
PRIMARY KEY (meta_data_id), KEY idx_meta_id (meta_id)
meta_relationships
This is a junction/lookup table. It contains a foreign key to the posts_id, a foreign key to the meta_data_id, and also contains the order of the categories.
indexes:
PRIMARY KEY (relationship_id), KEY idx_object_id (object_id), KEY idx_meta_data_id (meta_data_id)
The count allows me to only select the posts with that correct level of category. For example, the category computers has posts with only the computers category but it also has posts with computers/hardware. The count filters out posts that contain those extra categories. I hope that makes sense.
I believe the key to optimizing the query is to get away completely from doing the COUNT.
An alternative to the COUNT would possibly be using meta_relationships.meta_order or meta_data.parent instead.
The meta_relationships table will grow quickly and with the current size (~15K rows) I'm hoping to achieve an execution time in the 100th of seconds rather than the 10ths of seconds.
Since there need to be multiple conditions in the WHERE clause for each category/tag, any answer optimized for a dynamic query is preferred.
I have created an IDE with sample data.
How can I optimize this query?
EDIT :
I was never able to find an optimal solution to this problem. It was really a combination of smcjones' recommendation of improving the indexes, for which I would recommend running an EXPLAIN, looking at EXPLAIN Output Format, and then changing the indexes to whatever gives you the best performance.
Also, hpf's recommendation to add another column with the total count helped tremendously. In the end, after changing the indexes, I ended up going with this query.
SELECT posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
JOIN meta_relationships ON meta_relationships.object_id = posts.post_id
JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE posts.meta_count = 2
GROUP BY posts.post_id
HAVING category = 'category,subcategory'
After getting rid of the COUNT, the big performance killer was the GROUP BY and ORDER BY, but the indexes are your best friend. I learned that when doing a GROUP BY, the WHERE clause is very important; the more specific you can get, the better.
With a combination of optimized queries AND optimizing your tables, you will have fast queries. However, you cannot have fast queries without an optimized table.
I cannot stress this enough: If your tables are structured correctly with the correct amount of indexes, you should not be experiencing any full table reads on a query like GROUP BY... HAVING unless you do so by design.
Based on your example, I have created this SQLFiddle.
Compare that to SQLFiddle #2, in which I added indexes and added a UNIQUE index against meta.meta_name.
From my testing, Fiddle #2 is faster.
Optimizing Your Query
This query was driving me nuts, even after I made the argument that indexes would be the best way to optimize this. Even though I still hold that the table is your biggest opportunity to increase performance, it did seem that there had to be a better way to run this query in MySQL. I had a revelation after sleeping on this problem, and used the following query (seen in SQLFiddle #3):
SELECT posts.post_id,posts.post_name,posts.post_title,posts.post_description,posts.date,meta.meta_name
FROM posts
LEFT JOIN meta_relationships ON meta_relationships.object_id = posts.post_id
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
LEFT JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE meta.meta_name = 'animals'
GROUP BY meta_relationships.object_id
HAVING sum(meta_relationships.object_id) = min(meta_relationships.object_id);
HAVING sum() = min() on a GROUP BY should check to see if there is more than one record of each type. Obviously, each time the record shows up, it will add more to the sum. (Edit: on subsequent tests it seems like this has the same impact as count(meta_relationships.object_id) = 1. Oh well, the point is I believe you can remove the subquery and get the same result.)
I want to be clear that you won't notice much if any optimization in the query I provided unless the condition WHERE meta.meta_name = 'animals' is querying against an index (preferably a unique index, because I doubt you'll need more than one of these and it will prevent accidental duplication of data).
So, instead of a table that looks like this:
CREATE TABLE meta_data (
meta_data_id BIGINT,
meta_id BIGINT,
type VARCHAR(50),
description VARCHAR(200),
parent BIGINT,
count BIGINT);
You should make sure you add primary keys and indexes like this:
CREATE TABLE meta_data (
meta_data_id BIGINT,
meta_id BIGINT,
type VARCHAR(50),
description VARCHAR(200),
parent BIGINT,
count BIGINT,
PRIMARY KEY (meta_data_id,meta_id),
INDEX ix_meta_id (meta_id)
);
Don't overdo it, but every table should have a primary key, and any time you are aggregating or querying against a specific value, there should be indexes.
When indexes are not used, MySQL will walk through each row of the table until it finds what you want. In such a limited example as yours this doesn't take too long (even though it's still noticeably slower), but when you add thousands or more records, this will become extraordinarily painful.
In the future, when reviewing your queries, try to identify where your full table scans are occurring and see if there is an index on that column. A good place to start is wherever you are aggregating or filtering in the WHERE clause.
A note on the count column
I have not found putting count columns into the table to be helpful. It can lead to some pretty serious integrity issues. If a table is properly optimized, it should be very easy to use count() and get the current count. If you want to have it in a table, you can use a VIEW, although that will not be the most efficient way to make the pull.
The problem with putting count columns into a table is that you need to update that count, using either a TRIGGER or, worse, application logic. As your program scales out, that logic can either get lost or buried. Adding that column is a deviation from normalization, and when something like this occurs, there should be a VERY good reason.
Some debate exists as to whether there is ever a good reason to do this, but I think I'd be wise to stay out of that debate because there are great arguments on both sides. Instead, I will pick a much smaller battle and say that I see this causing you more headaches than benefits in this use case, so it is probably worth A/B testing.
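If you do want the count exposed as a relation without storing it, a minimal sketch of the VIEW approach mentioned above (the view name is illustrative):
CREATE VIEW post_meta_counts AS
SELECT object_id, COUNT(*) AS meta_count
FROM meta_relationships
GROUP BY object_id;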
Since the HAVING seems to be the issue, can you create a flag field in the posts table and use that instead? If I understand the query correctly, you're trying to find posts with only one meta_relationship link. If you created a field in your posts table that was either a count of the meta_relationships for that post, or a boolean flag for whether there was only one, and indexed it of course, that would probably be much faster. It would involve updating the field if the post was edited.
So, consider this:
Add a new field to the posts table called "num_meta_rel". It can be an unsigned tinyint as long as you'll never have more than 255 tags to any one post.
Update the field like this:
UPDATE posts
SET num_meta_rel=(SELECT COUNT(object_id) from meta_relationships WHERE object_id=posts.post_id);
This query will take some time to run, but once done you have all the counts precalculated. Note this can be done better with a join, but SQLite (Ideone) only allows subqueries.
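For reference, a hedged sketch of the join-based version, which is valid MySQL (though not SQLite):
UPDATE posts
JOIN (
    -- precompute each post's relationship count once
    SELECT object_id, COUNT(object_id) AS cnt
    FROM meta_relationships
    GROUP BY object_id
) mr ON mr.object_id = posts.post_id
SET posts.num_meta_rel = mr.cnt;
-- posts with no relationships are not touched here; use a LEFT JOIN plus COALESCE to zero them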
Now, you rewrite your query like this:
SELECT
posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
RIGHT JOIN meta_relationships ON (posts.post_id = meta_relationships.object_id)
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
LEFT JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE meta.meta_name = 'computers' AND posts.num_meta_rel=1
GROUP BY meta_relationships.object_id
If I've done this correctly, the runnable code is here: http://ideone.com/ZZiKgx
Note that this solution requires that you update the num_meta_rel (choose a better name, that one is terrible...) if the post has a new tag associated with it. But that should be much faster than scanning your entire table over and over.
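If you go this route, one hedged way to keep the column current is a pair of triggers on meta_relationships (trigger names are illustrative; this assumes num_meta_rel starts out correct, e.g. after the UPDATE above):
DELIMITER //
CREATE TRIGGER trg_meta_rel_ins AFTER INSERT ON meta_relationships
FOR EACH ROW
  -- a new relationship bumps the post's precalculated count
  UPDATE posts SET num_meta_rel = num_meta_rel + 1
  WHERE post_id = NEW.object_id;
//
CREATE TRIGGER trg_meta_rel_del AFTER DELETE ON meta_relationships
FOR EACH ROW
  -- a removed relationship decrements it
  UPDATE posts SET num_meta_rel = num_meta_rel - 1
  WHERE post_id = OLD.object_id;
//
DELIMITER ;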
See if this gives you the right answer, possibly faster:
SELECT p.post_id, p.post_name,
GROUP_CONCAT(IF(md.type = 'category', meta.meta_name, null)) AS category,
GROUP_CONCAT(IF(md.type = 'tag', meta.meta_name, null)) AS tag
FROM
( SELECT object_id
FROM meta_relation
GROUP BY object_id
HAVING count(*) = 1
) AS x
JOIN meta_relation AS mr ON mr.object_id = x.object_id
JOIN posts AS p ON p.post_id = mr.object_id
JOIN meta_data AS md ON mr.meta_data_id = md.meta_data_id
JOIN meta ON md.meta_id = meta.meta_id
WHERE meta.meta_name = ?
GROUP BY mr.object_id
Unfortunately I have no way to test performance, but try my query using your real data:
http://sqlfiddle.com/#!9/81b29/13
SELECT
posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
INNER JOIN (
SELECT meta_relationships.object_id
FROM meta_relationships
GROUP BY meta_relationships.object_id
HAVING count(*) < 3
) mr ON mr.object_id = posts.post_id
LEFT JOIN meta_relationships ON mr.object_id = meta_relationships.object_id
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
INNER JOIN (
SELECT *
FROM meta
WHERE meta.meta_name = 'health'
) meta ON meta_data.meta_id = meta.meta_id
GROUP BY posts.post_id
Use sum(1) instead of count(*).
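Applied to the derived table in the query above, that change looks like this (whether it actually avoids the scan is worth verifying with EXPLAIN):
SELECT meta_relationships.object_id
FROM meta_relationships
GROUP BY meta_relationships.object_id
HAVING sum(1) < 3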

Query optimization with multiple JOINs

I have a query on a fact table "foo_success" in a star schema, which has about 6 million rows. This table holds (integer) references to dimension tables and nothing else. We use MyISAM as the storage engine.
The query:
SELECT
hierarchy.level0name,
hierarchy.level1name,
hierarchy.level0,
hierarchy.level1,
date.date,
address.city,
user.emailAddress,
foo_object.name,
foo_object.type,
user_group.groupId,
COUNT(user.id) AS count_user_id,
SUM(foo_object_statistic.passes) AS sum_foo_object_statistic_passes,
SUM(foo_object_statistic.starts) AS sum_foo_object_statistic_starts,
SUM(foo_object_statistic.calls) AS sum_foo_object_statistic_calls
FROM
foo_success,
user,
user_group,
address,
hierarchy,
foo_object,
foo_object_statistic,
date
WHERE (foo_success.userDimensionId = user.id)
AND (foo_success.userGroupDimensionId = user_group.id)
AND (foo_success.addressDimensionId = address.id)
AND (foo_success.hierarchyDimensionId = hierarchy.id)
AND (foo_success.fooObjectDimensionId = foo_object.id)
AND (foo_success.fooObjectStatisticDimensionId = foo_object_statistic.id)
AND (foo_success.dateDimensionId=date.id)
AND hierarchy.level0 = 'XYZ'
AND hierarchy.level1 IS NOT NULL
AND hierarchy.level2 IS NOT NULL
AND hierarchy.level3 IS NOT NULL
AND hierarchy.level4 IS NOT NULL
AND hierarchy.level5 IS NOT NULL
AND hierarchy.level6 IS NULL
AND hierarchy.level7 IS NULL
GROUP BY hierarchy.level0, foo_object.fooObjectId
LIMIT 0, 25;
What I've tried so far:
This is the simple join version, which equals the INNER JOIN alternative in speed.
There are indices on all fields which are joined or which are part of a condition.
I did use EXPLAIN on this query and found that the query cost (# of processed rows) is 128596 for the table user and 77 for the table foo_success.
I tried to remove the dependency on the user table, which leads to a # of processed rows of over 6 million in the fact table foo_success.
It takes about 1.5 minutes to finish this query, which is far from my expectations for a data warehouse star schema optimized for read speed. Is there any way I can optimize this monster?
The inefficiency of the query mostly comes from transferring a lot of data you do not actually use: the fields hierarchy.level1name, hierarchy.level0name, hierarchy.level1, date.date, address.city, user.emailAddress, foo_object.name, foo_object.type and user_group.groupId are not included in the GROUP BY clause, which means the information is retrieved for each row, loaded into memory and then simply discarded.
What I would recommend is to concentrate the retrieval of all the needed ids and aggregation results in a subquery, and then join to the rest of the tables, so that each join produces no more than a single row (you can even move the LIMIT clause into the subquery to minimize the subsequent JOIN operations). After that, you may discover that you are missing some useful indexes.
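A hedged sketch of that restructuring, reusing the names from the question. Grouping here is by the dimension ids rather than by hierarchy.level0 and foo_object.fooObjectId, which should be equivalent if the ids determine those columns; the per-row fields the original selected without grouping on them (date.date, address.city, user.emailAddress, user_group.groupId) are dropped, since they are not well-defined per group:
SELECT
    hierarchy.level0name, hierarchy.level1name,
    hierarchy.level0, hierarchy.level1,
    foo_object.name, foo_object.type,
    agg.count_user_id,
    agg.sum_foo_object_statistic_passes,
    agg.sum_foo_object_statistic_starts,
    agg.sum_foo_object_statistic_calls
FROM (
    SELECT
        fs.hierarchyDimensionId,
        fs.fooObjectDimensionId,
        -- counting the FK is equivalent to COUNT(user.id) if every FK matches a user
        COUNT(fs.userDimensionId) AS count_user_id,
        SUM(fos.passes) AS sum_foo_object_statistic_passes,
        SUM(fos.starts) AS sum_foo_object_statistic_starts,
        SUM(fos.calls)  AS sum_foo_object_statistic_calls
    FROM foo_success fs
    JOIN foo_object_statistic fos ON fs.fooObjectStatisticDimensionId = fos.id
    JOIN hierarchy h ON fs.hierarchyDimensionId = h.id
    WHERE h.level0 = 'XYZ'
      AND h.level1 IS NOT NULL
      -- ... remaining level filters as in the original query
    GROUP BY fs.hierarchyDimensionId, fs.fooObjectDimensionId
    LIMIT 0, 25
) agg
JOIN hierarchy  ON agg.hierarchyDimensionId = hierarchy.id
JOIN foo_object ON agg.fooObjectDimensionId = foo_object.id;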

Dependant SubQuery v Left Join

This query displays the correct result but when doing an EXPLAIN, it lists it as a "Dependant SubQuery" which I'm led to believe is bad?
SELECT Competition.CompetitionID, Competition.CompetitionName, Competition.CompetitionStartDate
FROM Competition
WHERE CompetitionID NOT
IN (
SELECT CompetitionID
FROM PicksPoints
WHERE UserID =1
)
I tried changing the query to this:
SELECT Competition.CompetitionID, Competition.CompetitionName, Competition.CompetitionStartDate
FROM Competition
LEFT JOIN PicksPoints ON Competition.CompetitionID = PicksPoints.CompetitionID
WHERE UserID =1
and PicksPoints.PicksPointsID is null
but it displays 0 rows. What is wrong with the above compared to the first query that actually does work?
The second query cannot produce rows; it demands:
WHERE UserID =1
and PicksPoints.PicksPointsID is null
But to clarify, I rewrite as follows:
WHERE PicksPoints.UserID =1
and PicksPoints.PicksPointsID is null
So, on the one hand you are asking for rows in PicksPoints where UserID = 1, but at the same time you expect the row not to exist in the first place. Do you see the contradiction?
Outer joins are tricky that way! Usually you filter using columns from the "outer" table, Competition in this example. But that is not what you want here; you want to filter on the left-joined table. Try rewriting as follows:
SELECT Competition.CompetitionID, Competition.CompetitionName, Competition.CompetitionStartDate
FROM Competition
LEFT JOIN PicksPoints ON (Competition.CompetitionID = PicksPoints.CompetitionID AND UserID = 1)
WHERE
PicksPoints.PicksPointsID is null
For more on this, read this nice post.
But, as an additional note, performance-wise you're in some trouble, using either subquery or the left join.
With the subquery you're in trouble because up to 5.6 (where some good work has been done), MySQL is very bad at optimizing inner queries, and your subquery is expected to execute multiple times.
With the LEFT JOIN you are in trouble since a LEFT JOIN dictates the order of join from left to right. Yet your filtering is on the right table, which means you will not be able to use an index for filtering the UserID = 1 condition (or you would, and lose the index for the join).
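As a side note beyond both answers, an anti-join written with NOT EXISTS is a common third way to express the original requirement; a minimal sketch:
SELECT c.CompetitionID, c.CompetitionName, c.CompetitionStartDate
FROM Competition c
WHERE NOT EXISTS (
    SELECT 1
    FROM PicksPoints p
    WHERE p.CompetitionID = c.CompetitionID
      AND p.UserID = 1
);
On older MySQL versions this may be executed much like the dependent subquery, so compare the plans with EXPLAIN.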
These are two different queries. The first looks for competitions that are not associated with user id 1 (via the PicksPoints table), while the second joins competitions with those rows that are associated with user id 1 and that in addition have a null PicksPointsID.
The second query is coming out empty because you are joining against a table called PicksPoints and you are looking for rows in the join result that have PicksPointsID as null. This can only happen if
1. the second table had a row with a null PicksPointsID and a competition id that matched a competition id in the first table, or
2. all the columns in the second table's contribution to the join are null because there is a competition id in the first table that did not appear in the second.
Since PicksPointsID really sounds like a primary key, it's case 2 that is showing up. So all the columns from PicksPoints are null, your where clause (UserID=1 and PicksPoints.PicksPointsID is null) will always be false, and your result will be empty.
A plain left join should work for you
select c.CompetitionID, c.CompetitionName, c.CompetitionStartDate
from Competition c
left join PicksPoints p
on (c.CompetitionID = p.CompetitionID)
where p.UserID <> 1
Replacing the final where with an and (making a complex join clause) might also work. I'll leave it to you to analyze the plans for each query. :)
I'm not personally convinced of the need for the is null test. The article linked to by Shlomi Noach is excellent and you may find some tips in there to help you with this.

MySQL: simple schema, joining in a view and sorting on unrelated attribute causes unbearable performance hit

I'm creating a database model for use by a diverse set of applications and different kinds of database servers (though I'm mostly testing on MySQL and SQLite now). It's a really simple model that basically consists of one central matches table and many attribute tables that have the match_id as their primary key and one other field (the attribute value itself). In other words, every match has exactly one of every type of attribute, and every attribute is stored in a separate table. After experiencing some rather bad performance whilst sorting and filtering on these attributes (FROM matches LEFT JOIN attributes_i_want on the primary index), I decided to try to improve it. To this end I added an index on every attribute value column. Sorting and filtering performance increased a lot for easy queries.
This simple schema is basically a requirement for the application, so it is able to auto-discover and use attributes. Thus, to create more complex attributes that are actually based on other results, I decided to use VIEWs that turn one or more other tables that don't necessarily match up to the attribute-like schema into an attribute-schema. I call these meta-attributes (they aren't directly editable either). To the application this is all transparent, and so it happily joins in the VIEW as well when it wants to. The problem: it kills performance. When the VIEW is joined in without sorting on any attribute, performance is still acceptable, but combining a retrieval of the VIEW with sorting is unacceptably slow (on the order of 1s). Even after reading quite a few tutorials on indexing and some questions here on Stack Overflow, I can't seem to fix it.
Prerequisites for a solution: in one way or another, num_duplicates must exist as a table or view with the columns match_id and num_duplicates so that it looks like an attribute. I can't change the way attributes are discovered and used. So if I want num_duplicates to appear in the application, it'll have to be some kind of view or materialized table that provides a num_duplicates relation.
Relevant parts of the schema
Main table:
CREATE TABLE `matches` (
`match_id` int(11) NOT NULL,
`source_name` text,
`target_name` text,
`transformation` text,
PRIMARY KEY (`match_id`)
) ENGINE=InnoDB;
Example of a normal attribute (indexed):
CREATE TABLE `error` (
`match_id` int(11) NOT NULL,
`error` double DEFAULT NULL,
PRIMARY KEY (`match_id`),
KEY `error_index` (`error`)
) ENGINE=InnoDB;
(all normal attributes, like error, are basically the same)
Meta-attribute / VIEW:
CREATE VIEW num_duplicates
AS SELECT duplicate AS match_id, COUNT(duplicate) AS num_duplicates
FROM duplicate
GROUP BY duplicate
(this is the only meta-attribute I'm using right now)
Simple query with indexing on the attribute value columns (the part improved by indexes)
SELECT matches.match_id, source_name, target_name, transformation FROM matches
INNER JOIN error ON matches.match_id = error.match_id
ORDER BY error.error
(the performance on this query increased a lot because of the index on error)
(the runtime of this query is on the order of 0.0001 sec)
Slightly more complex queries and their runtimes including the meta-attribute (the still bad part)
SELECT
matches.match_id, source_name, target_name, transformation, STATUS , volume, error, COMMENT , num_duplicates
FROM matches
INNER JOIN STATUS ON matches.match_id = status.match_id
INNER JOIN error ON matches.match_id = error.match_id
LEFT JOIN num_duplicates ON matches.match_id = num_duplicates.match_id
INNER JOIN volume ON matches.match_id = volume.match_id
INNER JOIN COMMENT ON matches.match_id = comment.match_id
(runtime: 0.0263sec) <--- still acceptable
SELECT matches.match_id, source_name, target_name, transformation, STATUS , volume, error, COMMENT , num_duplicates
FROM matches
INNER JOIN STATUS ON matches.match_id = status.match_id
INNER JOIN error ON matches.match_id = error.match_id
LEFT JOIN num_duplicates ON matches.match_id = num_duplicates.match_id
INNER JOIN volume ON matches.match_id = volume.match_id
INNER JOIN COMMENT ON matches.match_id = comment.match_id
ORDER BY error.error
LIMIT 20, 20
(runtime: 0.8866 sec) <--- not acceptable (the query speed is exactly the same with the LIMIT as without it; note: if I could get the version with the LIMIT to be fast, that would already be a big win. I presume it has to scan the entire table, so the limit doesn't matter much)
EXPLAIN of the last query
Of course I tried to solve it myself before coming here, but I must admit I'm not that good at these things and haven't found a way to remove the offending performance killer yet. I know it's most likely the "Using filesort", but I don't know how to get rid of it.
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY error index PRIMARY,match_id error_index 9 NULL 53909 Using index; Using temporary; Using filesort
1 PRIMARY COMMENT eq_ref PRIMARY PRIMARY 4 tangbig4.error.match_id 1
1 PRIMARY STATUS eq_ref PRIMARY PRIMARY 4 tangbig4.COMMENT.match_id 1 Using where
1 PRIMARY matches eq_ref PRIMARY PRIMARY 4 tangbig4.COMMENT.match_id 1 Using where
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 2
1 PRIMARY volume eq_ref PRIMARY PRIMARY 4 tangbig4.matches.match_id 1 Using where
2 DERIVED duplicate index NULL duplicate_index 5 NULL 49222 Using index
By the way, the query without the sort, which still runs acceptably, is EXPLAIN'ed like this:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY COMMENT ALL PRIMARY NULL NULL NULL 49610
1 PRIMARY error eq_ref PRIMARY,match_id PRIMARY 4 tangbig4.COMMENT.match_id 1
1 PRIMARY matches eq_ref PRIMARY PRIMARY 4 tangbig4.COMMENT.match_id 1
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 2
1 PRIMARY STATUS eq_ref PRIMARY PRIMARY 4 tangbig4.COMMENT.match_id 1
1 PRIMARY volume eq_ref PRIMARY PRIMARY 4 tangbig4.matches.match_id 1 Using where
2 DERIVED duplicate index NULL duplicate_index 5 NULL 49222 Using index
Question
So, my question is whether someone who knows more about databases/MySQL can find me an approach that I can use or research to increase the performance of my last query.
I've been thinking quite a lot about materialized views, but they are not natively supported in MySQL, and since I'm going for as wide a range of SQL servers as possible this might not be ideal. I'm hoping a change to the queries or views might help, or possibly an extra index.
EDIT: Some random thoughts I've been having about the query:
VERY FAST: joining all tables, excluding the VIEW, sorting
ACCEPTABLE: joining all tables, including the VIEW, no sorting
DOG SLOW: joining all tables, including the VIEW, sorting
But: the VIEW has no influence at all on the sorting; none of its attributes, or even the attributes in its constituent tables, are used to sort. Why does including the sort impact performance that much, then? Is there any way I can convince the database to sort first and then just join up the VIEW? Or can I convince it that the VIEW is not important for sorting?
EDIT2: Following #ace's suggestion of creating a VIEW and then joining didn't seem to help at first:
DROP VIEW IF EXISTS `matches_joined`;
CREATE VIEW `matches_joined` AS (
SELECT matches.match_id, source_name, target_name, transformation, STATUS , volume, error, COMMENT
FROM matches
INNER JOIN STATUS ON matches.match_id = status.match_id
INNER JOIN error ON matches.match_id = error.match_id
INNER JOIN volume ON matches.match_id = volume.match_id
INNER JOIN COMMENT ON matches.match_id = comment.match_id
ORDER BY error.error
);
followed by:
SELECT matches_joined.*, num_duplicates
FROM matches_joined
LEFT JOIN num_duplicates ON matches_joined.match_id = num_duplicates.match_id
However, using LIMIT on the view did make a difference:
DROP VIEW IF EXISTS `matches_joined`;
CREATE VIEW `matches_joined` AS (
SELECT matches.match_id, source_name, target_name, transformation, STATUS , volume, error, COMMENT
FROM matches
INNER JOIN STATUS ON matches.match_id = status.match_id
INNER JOIN error ON matches.match_id = error.match_id
INNER JOIN volume ON matches.match_id = volume.match_id
INNER JOIN COMMENT ON matches.match_id = comment.match_id
ORDER BY error.error
LIMIT 0, 20
);
Afterwards, the query ran at an acceptable speed. This is already a nice result. However, I feel that I'm jumping through hoops to force the database to do what I want and the reduction in time is probably only caused by the fact that it now only has to sort 20 rows. What if I have more rows? Is there any other way to force the database to see that joining in the num_duplicates VIEW doesn't influence the sorting in the least? Could I perhaps change the query that makes the VIEW a bit?
Some things that can be tested if you haven't tried them yet.
Create a view for all joins with sorting.
DROP VIEW IF EXISTS `matches_joined`;
CREATE VIEW `matches_joined` AS (
SELECT matches.match_id, source_name, target_name, transformation, STATUS , volume, error, COMMENT
FROM matches
INNER JOIN STATUS ON matches.match_id = status.match_id
INNER JOIN error ON matches.match_id = error.match_id
INNER JOIN volume ON matches.match_id = volume.match_id
INNER JOIN COMMENT ON matches.match_id = comment.match_id
ORDER BY error.error
);
Then join them with num_duplicates
SELECT matches_joined.*, num_duplicates
FROM matches_joined
LEFT JOIN num_duplicates ON matches_joined.match_id = num_duplicates.match_id
I'm assuming that, as pointed out here, this query will utilize the ORDER BY clause in the view matches_joined.
Some information that may help on optimization.
MySQL :: MySQL 5.0 Reference Manual :: 7.3.1.11 ORDER BY Optimization
The problem was more or less solved by the "VIEW" suggestion that #ace made, but several other types of queries still had performance issues (notably with large OFFSETs). In the end, a large improvement on all queries of this form was achieved by simply forcing late-row lookup. Note that it is commonly claimed that this is only necessary for MySQL, because MySQL always performs early-row lookup and other databases like PostgreSQL don't suffer from this problem. However, extensive benchmarks of my application have shown that PostgreSQL benefits greatly from this approach as well.
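For readers unfamiliar with the term, late-row lookup (sometimes called a deferred join) means paginating on a narrow, indexed projection first and fetching the wide columns only for the final page. A hedged sketch applied to the schema above (the offset is illustrative):
SELECT m.match_id, m.source_name, m.target_name, m.transformation, e.error
FROM (
    -- this inner scan touches only the index on error.error
    SELECT match_id
    FROM error
    ORDER BY error
    LIMIT 10000, 20
) page
JOIN matches m ON m.match_id = page.match_id
JOIN error e   ON e.match_id = page.match_id;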

MySQL Query optimisation

I am currently working on a website which needs some optimisation ... since the front page takes about 15-20 seconds to load, I thought some optimisation would be nice.
Here is one query that appeared on the MySQL slow query log:
SELECT a.user,a.id
FROM `profil_perso` pp
INNER JOIN `acces` a ON pp.parrain = a.id
INNER JOIN `acces` ap ON ap.id = pp.id
WHERE pp.parrain_visibilite = '1'
AND a.actif = 1
GROUP BY a.id
ORDER BY ap.depuis DESC LIMIT 15;
On profil_perso (~207K rows -- contains emails and profiles), perso_id is the primary key; id (foreign key), parrain (referrer) and parrain_visibilite (whether the referrer is shown) are also indexed.
On acces, id is the primary key; depuis (registration date) is also indexed.
The benchmark currently shows:
First time : 1.94532990456
Last time : 1.94532990456
Average time : 0.0389438009262
I tried rewriting it this way:
SELECT DISTINCT a.id, a.user
FROM `profil_perso` pp
LEFT JOIN `acces` a ON pp.parrain = a.id
WHERE pp.parrain_visibilite = 1
AND a.actif = 1
AND pp.id != 0
ORDER BY pp.id DESC LIMIT 15;
Still the benchmark show this:
First time: 1.96376991272
Last time: 1.96376991272
Average time: 0.0393264245987
Any hints for lowering the query time?
Here's the full indexes:
acces :
id (primary)
derniere_visite -- last visit
pays_id -- country_id
depuis -- registration time
perso_id -- foreign key to profil_perso primary key
actif -- account status
compte_premium -- if account is premium
profil_perso :
perso_id (primary)
id -- foreign key to acces primary key
genre -- gender
parrain_visibilite -- visibility of referer
parrain -- referer
parrain_contexte
telephone
orientation
naissance -- birthdate
photo -- if it has a picture
Run EXPLAIN SELECT DISTINCT a.id .....;
This will help show you where you might be missing indexes etc.
The proper answer(s) depend as much on the distribution of the data (record counts, cardinality of fields and field combinations, etc.) and the schema as on the query expression. Even given that information, we could only provide suggestions for testing, which would only lead to more suggestions for testing.
But we could start with a first cut, given the schema of the tables involved plus the results of the current EXPLAIN (run twice; show both the second and first results).
Generally speaking, you need to ensure that your indices are set up correctly - not just for the primary keys, but for the foreign keys used in the table joins.
In addition, it is usually preferable to have indices defined for any field you filter or order on - so, again, make sure these are set appropriately.
However, I think the big performance hit here could be the fact you're sorting 207k records to retrieve the last 15 inserted - can you achieve the same in a different way?
Why do you have two JOINs here?
Create a composite index on acces (actif, depuis):
CREATE INDEX ix_acces_actif_depuis ON acces (actif, depuis)
Then create a composite index on profil_perso (parrain, parrain_visibilite):
CREATE INDEX ix_profilperso_parrain_parrainvisibilite ON profil_perso (parrain, parrain_visibilite)
and try this:
SELECT a.user, a.id
FROM acces a
JOIN profil_perso pp
ON pp.parrain = a.id
AND pp.parrain_visibilite = 1
WHERE a.actif = 1
ORDER BY
a.actif DESC, a.depuis DESC
LIMIT 15
This query will use the composite index on (actif, depuis) to avoid sorting, and the index on profil_perso to find and filter out the non-visible parrains.
Since you have a LIMIT 15 here, this query should be instant.
It would also help to know how selective your actif field is.
To figure this out, please run:
SELECT COUNT(DISTINCT actif) / COUNT(*)
FROM acces