mySQL Query optimisation LEFT JOIN - mysql

I need to acquire some data from a questions table and then LEFT JOIN product answers. I need the list of all questions in a particular category (a total of 16 categories out of approximately 200 categories) and then list the product answers next to the questions for a certain product id.
SELECT `questions`.`id`, `questions`.`text`, `questions`.`catalogue_id`, `productanswers`.`answer`
FROM `questions`
LEFT JOIN `productanswers` ON `productanswers`.`question_id` = `questions`.`id`
AND product_id = '2001682'
WHERE `catalogue_id` IN (1234912,1234913,1234914)
ORDER BY `catalogue_id`
which returns as I would expect approximately 17 results. Questions without an answer for this product are filled with Null, great!
The problem is that the query takes approximately 23 seconds to execute :-o making a full query with all catalogue questions impossible.
How can I optimise the query, or do you have any other ideas?
Thanks,
Taff

Add a composite (multiple column) index on the question_id and product_id together:
ALTER TABLE productanswers ADD KEY(question_id, product_id)
Note: You might want to switch the column order in the index, depending on the selectivity of question_id vs product_id
For more on composite indexes see the Docs

In addition you should have indexes on all the your id columns but I guess you have this. Another source for information is the explain statement: http://dev.mysql.com/doc/refman/5.0/en/using-explain.html

Related

Optimizing Inner Join Queries

I have this query and i want to know if i can optimize it in some way because currently it takes a long time to execute (like 4/5 seconds)
SELECT *
FROM `posts` ml INNER JOIN
posts_tag_one gt
ON gt.post_id = ml.id AND gt.tag_id = 15 INNER JOIN
posts_tag_two gg
ON gg.post_id = ml.id AND gg.tag_id = 5
WHERE active = '1' AND NOT ml.id = '639474'
ORDER BY ml.id DESC
LIMIT 5
I want to say the database it has like 600k+ posts, the posts_tag_one 5 milions records, the posts_tag_two 475k+ records.
That example i gave it's only with 2 joins but in some cases i have up to 4 joins so the other tables has like 300k-400k records.
I am using foregin keys and indexes for posts_tag_one, posts_tag_two tables but the query it's still slow.
Any advice would help. Thanks!
By means of Transitive property (if a=b and b=c, then a=c), your ML.ID = GT.Post_ID = GG.Post_ID. Since you are trying to pre-qualify specific tags, I would rewrite and try to see if cardinality of data may help by moving to a front position and using better indexes to optimize the query. Also, MySQL has a nice keyword "STRAIGHT_JOIN" that tells the engine query the data in the order I tell you, dont think for me. I have used many times and have seen significant improvement.
SELECT STRAIGHT_JOIN
*
FROM
posts_tag_two gg
INNER JOIN posts_tag_one gt
ON gg.post_id = gt.post_id
AND gt.tag_id = 15
INNER JOIN posts ml
ON gt.post_id = ml.id
AND ml.active = 1
WHERE
gg.tag_id = 5
AND NOT gg.post_id = 639474
ORDER BY
gg.post_id DESC
LIMIT 5
I would ensure the following table / multi-field indexes
table index
Posts_Tag_One ( tag_id, post_id )
Posts_Tag_Two ( tag_id, post_id )
posts ( id, active )
By starting with the Posts_Tag_Two table which you are pre-filtering for tag_id = 5, you are already cutting the list down to those pre-qualified FIRST. Not by starting with ALL posts and seeing which qualify with the tag.
Second level join is to the POSTS_TAG_ONE table on same ID, but that level filtered by its Tag_ID = 15.
Only then does it even care to get to the POSTS table for active.
Since the order is based on the ID descending, and the Posts_tag_two table "post_id" is the same value as Posts.id, the index from the posts_tag_two table should return the record already pre-sorted.
HTH, and would be interested to know final performance difference. Again, I have used STRAIGHT_JOIN many times with significant improvement in performance. I also typically do NOT do "Select *" for all tables / all columns. Get what you need.
FEEDBACK
#eshirvana, in MANY cases, yes, the optimizers do by default. But sometimes, the designer knows a better the makeup of the data. Lets take the scenario of POSTS in the lead-position. You have a room of boxes for posts. Each box contains say 10k records. You have to go through all 10k records, then to the next box until you get through 400k records... again, just for example. Once you find those, then it goes to the join on the filtered criteria for a specific tag. Those too are ordered by ID so you have to do a one-to-one- correlation. So which table stays in a primary position.
Now, by the index by tag, and one of the posts_tag tables (smaller by choice is #2).
Now, you have a room of boxes, but each box only has one tag within it. If you have 300 tag IDs available, you have already cut out x-amount of records giving you just the small sample you pre-qualify to.
So now, the second posts table similarly is a room of boxes. Their boxes are also broken down by tags. So now you only have to grab box for tag #15.
So now you have two very finite sets of records that the JOIN can match on the ID that exists in both cases. only once that is done do you ever need to go to the posts table, which by ID is going to be quick and direct. But having the active status in the index, the engine never needs to go to any actual data pages to retrieve the data until all conditions are met. Only then does it pull the record from the 3 respective tables being returned.
Sounds like posts_tags is a many-to-many mapping table? It need two indexes: (post_id, tag_id) and (tag_id, post_id). One of those should probably be the PRIMARY KEY (Having an auto_increment id is wasteful and slows things down). The other should be INDEX (not UNIQUE). More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
But, why have both posts_tag_two and posts_tag_one?
In addition to those 'composite' keys, do not also have the single-column (post_id) or (tag_id).
If tag is simply a short string, don't bother normalizing it; simply have it in the table.
For further discussion, please provide SHOW CREATE TABLE for each table. And EXPLAIN SELECT ....

Refinement to this MySQL query?

I've got a query which is taking a long time and I was wondering if there was a better way to do it? Perhaps with joins?
It's currently taking ~2.5 seconds which is way too long.
To explain the structure a little: I have products, "themes" and "categories". A product can be assigned any number of themes or categories. The themeitems and categoryitems tables are linking tables to link a category/theme ID to a product ID.
I want to get a list of all products with at least one theme and category. The query I've got at the moment is below:
SELECT *
FROM themes t, themeitems ti, products p, catitems ci, categories c
WHERE t.ID = ti.THEMEID
AND ti.PRODID = p.ID
AND p.ID = ci.PRODID
AND ci.CATID = c.ID
I'm only actually selecting the rows I need when performing the query but I've removed that to abstract a little.
Any help in the right direction would be great!
Edit: EXPLAIN below
Utilise correct JOINs and ensure there are indexes on the fields used in the JOIN is the standard response for this issue.
SELECT *
FROM themes t
INNER JOIN themeitems ti ON t.ID = ti.THEMEID
INNER JOIN products p ON ti.PRODID = p.ID
INNER JOIN catitems ci ON p.ID = ci.PRODID
INNER JOIN categories c ON ci.CATID = c.ID
The specification of the JOINs assists the query engine in working out what it needs to do, and the indexes on the columns used in the join, will enable more rapid joining.
Your query is slow because you don't have any indexes on your tables.
Try:
create unique index pk on themes (ID)
create index fk on themeitems(themeid, prodid)
create unique index pk on products (id)
create index fk catitems(prodid, catid)
create unique index pk on categories (id)
As #symcbean writes in the comments, the catitems and themeitems indices should probably be unique indices too - if there isn't another column to add to that index (e.g. "validityDate"), please add that to the create statement.
Your query is very simple. I do not think that your cost decreases with implementing joins. You can try putting indexes to appropriate columns
Simply selecting less data is the glaringly obvious solution here.
Why do you need to know every column and every row every time you run the query? Addressing any one of these 3 factors will improve performance.
I want to get a list of all products with at least one theme and category
That rather implies you don't care which theme and category, in which case.....
SELECT p.*
FROM themeitems ti, products p, catitems ci
WHERE p.ID = ti.PRODID
AND p.ID = ci.PRODID
It may be possible to make the query run significantly faster - but you've not provided details of the table structure, the indexes, the volume of data, the engine type, the query cache configuration, the frequency of data updates, the frequency with which the query is run.....
update
Now that you've provided the explain plan then it's obvious you've got very small amounts of data AND NO RELEVENT INDEXES!!!!!
As a minimum you should add indexes on the product foreign key in the themeitems and catitems tables. Indeed, the primary keys for these tables should be the product id and category id / theme id, and since it's likely that you will have more products than categories or themes then the fields should be in that order in the indexes. (i.e. PRODID,CATID rather than CATID, PRODID)
update2
Given the requirement "to get a list of all products with at least one theme and category", it might be faster still (but the big wins are reducing the number of joins and adding the right indexes) to....
SELECT p.*
FROM product p
INNER JOIN (
SELECT DISTINCT ti.PRODID
FROM themeitems ti, catitems ci
WHERE ti.PRODID=ci.PRODID
) i ON p.id=i.PRODID
Ive made an answer off this because i could not place it as an comment
Basic thumb off action if you want to remove FULL table scans with JOINS.
You should index first.
Note that this not always works with ORDER BY/GROUP BY in combination with JOINS, because often an Using temporary; using filesort is needed.
Extra because this is out off the scope off the question and how to fix slow query with ORDER BY/GROUP BY in combination with JOIN
Because the MySQL optimizer thinks it needs to access the smallest table first to get the best execution what will cause MySQL cant always use indexes to sort the result and needs to use an temporary table and the filesort the fix the wrong sort ordering
(read more about this here MySQL slow query using filesort this is how i fix this problem because using temporary really can kill performance when MySQL needs an disk based temporary table)

Database design to enable Multiple tags like Stackoverflow?

I have the following tables.
Articles table
a_id INT primary unique
name VARCHAR
Description VARCHAR
c_id INT
Category table
id INT
cat_name VARCHAR
For now I simply use
SELECT a_id,name,Description,cat_name FROM Articles LEFT JOIN Category ON Articles.a_id=Category.id WHERE c_id={$id}
This gives me all articles which belong to a certain category along with category name.
Each article is having only one category.
AND I use a sub category in a similar way(I have another table named sub_cat).But every article doesn't necessary have a sub category.It may belong to multiple categories instead.
I now think of tagging an article with more than one category just like the questions at stackoverflow are tagged(eg: with multiple tags like PHP,MYSQL,SQL etc).AND later I have to display(filter) all article with certain tags(eg: tagged with php,php +MySQL) and I also have to display the tags along with the article name,Description.
Can anyone help me redesign the database?(I am using php + MySQL at back-end)
Create a new table:
CREATE TABLE ArticleCategories(
A_ID INT,
C_ID INT,
Constraint PK_ArticleCategories Primary Key (Article_ID, Category_ID)
)
(this is the SQL server syntax, may be slightly different for MySQL)
This is called a "Junction Table" or a "Mapping Table" and it is how you express Many-to-Many relationships in SQL. So, whenever you want to add a Category to an Article, just INSERT a row into this table with the IDs of the Article and the Category.
For instance, you can initialize it like this:
INSERT Into ArticleCategories(A_ID,C_ID)
SELECT A_ID,C_ID From Articles
Now you can remove c_id from your Articles table.
To get back all of the Categories for a single Article, you would do use a query like this:
SELECT a_id,name,Description,cat_name
FROM Articles
LEFT JOIN ArticleCategories ON Articles.a_id=ArticleCategories.a_id
INNER JOIN Category ON ArticleCategories.c_id=Category.id
WHERE Articles.a_id={$a_id}
Alternatively, to return all articles that have a category LIKE a certain string:
SELECT a_id,name,Description
FROM Articles
WHERE EXISTS( Select *
From ArticleCategories
INNER JOIN Category ON ArticleCategories.c_id=Category.id
WHERE Articles.a_id=ArticleCategories.a_id
AND Category.cat_name LIKE '%'+{$match}+'%'
)
(You may have to adjust the last line, as I am not sure how string parameters are passed MySQL+PHP.)
Ok RBarryYoung you asked me about an reference/analyse you get one
This reference / analyse is based off the documention / source code analyse off the MySQL server
INSERT Into ArticleCategories(A_ID,C_ID)
SELECT A_ID,C_ID From Articles
On an large Articles table with many rows this copy will push one core off the CPU to 100% load and will create a disk based temporary table what will slow down the complete MySQL performance because the disk will be stress out with that copy.
If this is a one time process this is not that bad but do the math if you run this every time..
SELECT a_id,name,Description
FROM Articles
WHERE EXISTS( Select *
From ArticleCategories
INNER JOIN Category ON ArticleCategories.c_id=Category.id
WHERE Articles.a_id=ArticleCategories.a_id
AND Category.cat_name LIKE '%'+{$match}+'%'
)
Note dont take the Execution Times on sqlfriddle for real its an busy server and the times vary alot to make a good statement but look to what View Execution Plan has to say
see http://sqlfiddle.com/#!2/48817/21 for demo
Both querys always trigger an complete table scan on table Articles and two DEPENDENT SUBQUERYS thats not good if you have an large Articles table with many records.
This means the performance depends on the number of Articles rows even when you want only the articles that are in the category.
Select *
From ArticleCategories
INNER JOIN Category ON ArticleCategories.c_id=Category.id
WHERE Articles.a_id=ArticleCategories.a_id
AND Category.cat_name LIKE '%'+{$match}+'%'
This query is the inner subquery but when you try to run it, MySQL cant run because it depends on a value of the Articles table so this is correlated subquery. a subquery type that will be evaluated once for each row processed by the outer query. not good indeed
There are more ways off rewriting RBarryYoung query i will show one.
The INNER JOIN way is much more efficent even with the LIKE operator
Note ive made an habbit out off it that i start with the table with the lowest number off records and work my way up if you start with the table Articles the executing will be the same if the MySQL optimizer chooses the right plan..
SELECT
Articles.a_id
, Articles.name
, Articles.description
FROM
Category
INNER JOIN
ArticleCategories
ON
Category.id = ArticleCategories.c_id
INNER JOIN
Articles
ON
ArticleCategories.a_id = Articles.a_id
WHERE
cat_name LIKE '%php%';
;
see http://sqlfiddle.com/#!2/43451/23 for demo Note that this look worse because it looks like more rows needs to be checkt
Note if the Article table has low number off records RBarryYoung EXIST way and INNER JOIN way will perform more or less the same based on executing times and more proof the INNER JOIN way scales better when the record count become larger
http://sqlfiddle.com/#!2/c11f3/1 EXISTS oeps more Articles records needs to be checked now (even when they are not linked with the ArticleCategories table) so the query is less efficient now
http://sqlfiddle.com/#!2/7aa74/8 INNER JOIN same explain plan as the first demo
Extra notes about scaling it becomes even more worse when you also want to ORDER BY or GROUP BY the NOT EXIST way has an bigger chance it will create an disk based temporary table that will kill MySQL performance
Lets also analyse the LIKE '%php%' vs = 'php' for the EXIST way and INNER JOIN way
the EXIST way
http://sqlfiddle.com/#!2/48817/21 / http://sqlfiddle.com/#!2/c11f3/1 (more Articles) the explain tells me both patterns are more or less the same but 'php' should be little faster because off the const type vs ref in the TYPE column but LIKE %php% will use more CPU because an string compare algoritme needs to run.
the INNER JOIN way
http://sqlfiddle.com/#!2/43451/23 / http://sqlfiddle.com/#!2/7aa74/8 (more Articles) the explain tell me the LIKE '%php%' should be slower because 3 more rows need to be analysed but not shocking slower in this case (you can see the index is not really used on the best way).
RBarryYoung way works but doenst keep performance atleast not on a MySQL server
see http://sqlfiddle.com/#!2/b2bd9/1 or http://sqlfiddle.com/#!2/34ea7/1
for examples that will scale on large tables with lots of records this is what the topic starter needs

Could this MYSQL be optimized?

Hello I'm having a comments table on which I run a fulltext search.
c1 and c2 are aliases on the same table used
via criteria: c1.parent_id=0 I get the questions only(not the answers attached to them)
and via c2.parent_id<>0 I filter the questions that already have answers
SELECT DISTINCT c1.comment, c1.comment_id, MATCH(c1.comment) AGAINST ('keyword1 keyword2 keyword3') AS score
FROM comments AS c1
JOIN comments AS c2
ON c1.comment_id = c2.parent_id
WHERE c1.parent_id=0
and c2.parent_id <> 0
ORDER BY score DESC LIMIT 9
The problem is that when I run EXPLAIN SELECT... the search looks up through each and every row of the table - so the bigger it gets the slower this operation will be, instead of searching just the rows with parent_id=0.
I would like to ask: is it possible to optimize this kind of query any further?
add an index to all the id columns
alter table your_table add index(parent_id)
same with comment_id

MYSQL: SELECT users from one user group and exclude users from another group - optimisation

I have an optimisation question here.
Background
I have a 12000 users in a user table, on record per user. Each user can be in zero or more groups. I have a groups table with 45 groups and a groups_link table with 75000 records (to facilitate the many to many relationship).
I am making a querying screen which allows a user to filter users from the user list.
Aim
Specifically, I need help with: Selecting all users that are in one group but are not in another group.
DB Structure
Query
My current query which runs too slowly...
SELECT U.user_id,U.user_email
FROM (sc_module_users AS U)
JOIN sc_module_users_groups_links AS group_join ON group_join.user_id = U.user_id
LEFT JOIN sc_module_users_groups_links AS excluded_group_join ON group_join.user_id = U.user_id
WHERE group_join.group_id IN (27) AND excluded_group_join.group_id NOT IN (19) OR excluded_group_join.group_id IS NULL AND U.user_subscribed=1 AND U.user_active=1
GROUP BY U.user_id,U.user_id
This query takes 9 minutes to complete, it returns 11,000 records (out of 12,000).
Explain
Here's the explain on that query:
Click here for a closer look
Can anyone help me optimise this to below the 1 minute mark...?
After 3 revisions, I changed it to this
SELECT U.user_id,U.user_email FROM (sc_module_users AS U) WHERE ( user_country LIKE '%australia%' ) AND
EXISTS (SELECT group_id FROM sc_module_users_groups_links l WHERE group_id in (31) AND l.user_id=U.user_id) AND
NOT EXISTS (SELECT group_id FROM sc_module_users_groups_links l WHERE group_id in (27) AND l.user_id=U.user_id)
AND U.user_subscribed=1 AND U.user_active=1 GROUP BY U.user_id
'
mucccch faster
EDIT: removed my query suggestion but the index stuff should still apply:
The indexes on the sc_module_users_groups_links could be improved by creating a composite index just on user_id and group_id. The order of the columns in the index can have an impact - i believe having user_id first should perform better.
You could also try removing the link_id and just using a composite primary key since the link_id doesn't seem to serve any other purpose.
I believe the very first thing you need to do is to place parentheses:
// should be
.. AND ( excluded_group_join.group_id NOT IN (19)
OR excluded_group_join.group_id IS NULL) AND ....