I have this query and I want to know if I can optimize it in some way, because currently it takes a long time to execute (around 4-5 seconds).
SELECT *
FROM `posts` ml
INNER JOIN posts_tag_one gt
  ON gt.post_id = ml.id AND gt.tag_id = 15
INNER JOIN posts_tag_two gg
  ON gg.post_id = ml.id AND gg.tag_id = 5
WHERE active = '1' AND NOT ml.id = '639474'
ORDER BY ml.id DESC
LIMIT 5
To give a sense of scale: the posts table has 600k+ rows, posts_tag_one has about 5 million records, and posts_tag_two has 475k+ records.
The example I gave has only 2 joins, but in some cases I have up to 4 joins; those other tables have around 300k-400k records each.
I am using foreign keys and indexes on the posts_tag_one and posts_tag_two tables, but the query is still slow.
Any advice would help. Thanks!
By the transitive property (if a=b and b=c, then a=c), your ml.id = gt.post_id = gg.post_id. Since you are pre-qualifying specific tags, I would rewrite the query to see whether the cardinality of the data can help, by moving the most selective table to the front position and using better indexes. Also, MySQL has a nice keyword, STRAIGHT_JOIN, that tells the engine: query the tables in the order I wrote them; don't reorder for me. I have used it many times and have seen significant improvement.
SELECT STRAIGHT_JOIN
*
FROM
posts_tag_two gg
INNER JOIN posts_tag_one gt
ON gg.post_id = gt.post_id
AND gt.tag_id = 15
INNER JOIN posts ml
ON gt.post_id = ml.id
AND ml.active = 1
WHERE
gg.tag_id = 5
AND NOT gg.post_id = 639474
ORDER BY
gg.post_id DESC
LIMIT 5
I would ensure the following multi-column indexes exist:
table index
Posts_Tag_One ( tag_id, post_id )
Posts_Tag_Two ( tag_id, post_id )
posts ( id, active )
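In MySQL, those could be created with something like the following (the index names are my own placeholders, not from the question):

```sql
-- Composite indexes as listed above; index names are assumptions
ALTER TABLE posts_tag_one ADD INDEX idx_tag_post (tag_id, post_id);
ALTER TABLE posts_tag_two ADD INDEX idx_tag_post (tag_id, post_id);
ALTER TABLE posts         ADD INDEX idx_id_active (id, active);
```

With (tag_id, post_id) leading on tag_id, the tag filters can be satisfied entirely from the index.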
By starting with the posts_tag_two table, which you are pre-filtering on tag_id = 5, you cut the list down to the pre-qualified rows FIRST, rather than starting with ALL posts and checking which of them qualify for the tag.
The second-level join is to the posts_tag_one table on the same ID, with that level filtered by its tag_id = 15.
Only then does the query even need to reach the posts table to check active.
Since the ordering is by ID descending, and posts_tag_two.post_id holds the same value as posts.id, the index on the posts_tag_two table should return the records already pre-sorted.
HTH, and I would be interested to know the final performance difference. Again, I have used STRAIGHT_JOIN many times with significant performance improvements. I also typically do NOT do SELECT * for all tables / all columns. Get only what you need.
FEEDBACK
#eshirvana, in MANY cases, yes, the optimizer does this by default. But sometimes the designer knows the makeup of the data better. Take the scenario with POSTS in the lead position. You have a room of boxes for posts, each box holding say 10k records. You have to go through all 10k records, then on to the next box, until you get through 400k records (again, just as an example). Once those are found, the query moves on to the join on the filtered criteria for a specific tag. Those too are ordered by ID, so a one-to-one correlation has to be done. That is what determines which table should stay in the primary position.
Now, instead, lead with the index by tag on one of the posts_tag tables (the smaller one by choice, which is #2).
Now you have a room of boxes, but each box holds only one tag. If you have 300 tag IDs available, you have already cut out a large share of the records, leaving just the small sample you pre-qualified.
The second posts_tag table is similarly a room of boxes, also broken down by tag, so you only need to grab the box for tag #15.
So now you have two very finite sets of records that the JOIN can match on the ID that exists in both. Only once that is done do you ever need to go to the posts table, which, by ID, is quick and direct. And with the active status in the index, the engine never needs to touch any actual data pages until all conditions are met. Only then does it pull the records from the three respective tables being returned.
Sounds like posts_tags is a many-to-many mapping table? It needs two indexes: (post_id, tag_id) and (tag_id, post_id). One of those should probably be the PRIMARY KEY (having an auto_increment id is wasteful and slows things down). The other should be a plain INDEX (not UNIQUE). More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
But, why have both posts_tag_two and posts_tag_one?
In addition to those 'composite' keys, do not also keep the single-column indexes (post_id) or (tag_id).
If tag is simply a short string, don't bother normalizing it; simply have it in the table.
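Putting the points above together, a merged mapping table might look roughly like this (the single table name posts_tag and the column types are my assumptions, not from the question):

```sql
-- Sketch only: one mapping table instead of posts_tag_one/posts_tag_two,
-- with the two composite indexes described above and no auto_increment id
CREATE TABLE posts_tag (
  post_id INT UNSIGNED NOT NULL,
  tag_id  INT UNSIGNED NOT NULL,
  PRIMARY KEY (post_id, tag_id),        -- clusters rows by post
  INDEX idx_tag_post (tag_id, post_id)  -- covering index for tag lookups
) ENGINE=InnoDB;
```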
For further discussion, please provide SHOW CREATE TABLE for each table. And EXPLAIN SELECT ....
Related
I've managed to put together a query that works for my needs, albeit more complicated than I was hoping. But for the size of the tables, the query is slower than it should be (0.17s). The reason, based on the EXPLAIN provided below, is that there is a table scan on the meta_relationships table due to the COUNT in the WHERE clause on an InnoDB engine.
Query:
SELECT
posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
RIGHT JOIN meta_relationships ON (posts.post_id = meta_relationships.object_id)
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
LEFT JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE meta.meta_name = 'computers' AND meta_relationships.object_id
NOT IN (SELECT meta_relationships.object_id FROM meta_relationships
GROUP BY meta_relationships.object_id HAVING count(*) > 1)
GROUP BY meta_relationships.object_id
This particular query selects posts which have ONLY the computers category. The purpose of count > 1 is to exclude posts that contain computers/hardware, computers/software, etc. The more categories a post has, the higher its count would be.
Ideally, I'd like to get it functioning like this:
WHERE meta.meta_name IN ('computers') AND meta_relationships.meta_order IN (0)
or
WHERE meta.meta_name IN ('computers','software')
AND meta_relationships.meta_order IN (0,1)
etc..
But unfortunately this doesn't work, because it doesn't take into consideration that there may be a meta_relationships.meta_order = 2.
I've tried...
WHERE meta.meta_name IN ('computers')
GROUP BY meta_relationships.meta_order
HAVING meta_relationships.meta_order IN (0) AND meta_relationships.meta_order NOT IN (1)
but it doesn't return the correct amount of rows.
EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY meta ref PRIMARY,idx_meta_name idx_meta_name 602 const 1 Using where; Using index; Using temporary; Using filesort
1 PRIMARY meta_data ref PRIMARY,idx_meta_id idx_meta_id 8 database.meta.meta_id 1
1 PRIMARY meta_relationships ref idx_meta_data_id idx_meta_data_id 8 database.meta_data.meta_data_id 11 Using where
1 PRIMARY posts eq_ref PRIMARY PRIMARY 4 database.meta_relationships.object_id 1
2 MATERIALIZED meta_relationships index NULL idx_object_id 4 NULL 14679 Using index
Tables/Indexes:
meta
This table contains the category and tag names.
indexes:
PRIMARY KEY (meta_id), KEY idx_meta_name (meta_name)
meta_data
This table contains additional data about the categories and tags such as type (category or tag), description, parent, count.
indexes:
PRIMARY KEY (meta_data_id), KEY idx_meta_id (meta_id)
meta_relationships
This is a junction/lookup table. It contains a foreign key to the posts_id, a foreign key to the meta_data_id, and also contains the order of the categories.
indexes:
PRIMARY KEY (relationship_id), KEY idx_object_id (object_id), KEY idx_meta_data_id (meta_data_id)
The count allows me to only select the posts with that correct level of category. For example, the category computers has posts with only the computers category but it also has posts with computers/hardware. The count filters out posts that contain those extra categories. I hope that makes sense.
I believe the key to optimizing the query is to get away completely from doing the COUNT.
An alternative to the COUNT would possibly be using meta_relationships.meta_order or meta_data.parent instead.
The meta_relationships table will grow quickly, and at the current size (~15K rows) I'm hoping to achieve an execution time in the hundredths of a second rather than tenths of a second.
Since there needs to be multiple conditions in the WHERE clause for each category/tag, any answer optimized for a dynamic query is preferred.
I have created an IDE with sample data.
How can I optimize this query?
EDIT :
I was never able to find an optimal solution to this problem. It was really a combination of smcjones' recommendation of improving the indexes, for which I would recommend doing an EXPLAIN, looking at the EXPLAIN Output Format, and then changing the indexes to whatever gives you the best performance.
Also, hpf's recommendation to add another column with the total count helped tremendously. In the end, after changing the indexes, I ended up going with this query:
SELECT posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
JOIN meta_relationships ON meta_relationships.object_id = posts.post_id
JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE posts.meta_count = 2
GROUP BY posts.post_id
HAVING category = 'category,subcategory'
After getting rid of the COUNT, the big performance killers were the GROUP BY and ORDER BY, but the indexes are your best friend. I learned that when doing a GROUP BY, the WHERE clause is very important; the more specific you can get, the better.
With a combination of optimized queries AND optimizing your tables, you will have fast queries. However, you cannot have fast queries without an optimized table.
I cannot stress this enough: If your tables are structured correctly with the correct amount of indexes, you should not be experiencing any full table reads on a query like GROUP BY... HAVING unless you do so by design.
Based on your example, I have created this SQLFiddle.
Compare that to SQLFiddle #2, in which I added indexes and added a UNIQUE index against meta.meta_name.
From my testing, Fiddle #2 is faster.
Optimizing Your Query
This query was driving me nuts, even after I made the argument that indexes would be the best way to optimize this. Even though I still hold that the table is your biggest opportunity to increase performance, it did seem that there had to be a better way to run this query in MySQL. I had a revelation after sleeping on this problem, and used the following query (seen in SQLFiddle #3):
SELECT posts.post_id,posts.post_name,posts.post_title,posts.post_description,posts.date,meta.meta_name
FROM posts
LEFT JOIN meta_relationships ON meta_relationships.object_id = posts.post_id
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
LEFT JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE meta.meta_name = 'animals'
GROUP BY meta_relationships.object_id
HAVING sum(meta_relationships.object_id) = min(meta_relationships.object_id);
HAVING sum() = min() on a GROUP BY checks that there is no more than one record in each group; each additional matching record pushes the sum above the minimum. (Edit: on subsequent tests this seems to have the same impact as count(meta_relationships.object_id) = 1. Oh well; the point is I believe you can remove the subquery and get the same result.)
I want to be clear that you won't notice much, if any, optimization in the query I provided unless the WHERE meta.meta_name = 'animals' condition is querying against an index (preferably a unique index, because I doubt you'll need more than one of these and it will prevent accidental duplication of data).
So, instead of a table that looks like this:
CREATE TABLE meta_data (
meta_data_id BIGINT,
meta_id BIGINT,
type VARCHAR(50),
description VARCHAR(200),
parent BIGINT,
count BIGINT);
You should make sure you add primary keys and indexes like this:
CREATE TABLE meta_data (
meta_data_id BIGINT,
meta_id BIGINT,
type VARCHAR(50),
description VARCHAR(200),
parent BIGINT,
count BIGINT,
PRIMARY KEY (meta_data_id,meta_id),
INDEX ix_meta_id (meta_id)
);
Don't overdo it, but every table should have a primary key, and any time you are aggregating or querying against a specific value, there should be indexes.
When indexes are not used, MySQL will walk through each row of the table until it finds what you want. In a limited example like yours this doesn't take too long (even though it's still noticeably slower), but when you add thousands or more records, it becomes extraordinarily painful.
In the future, when reviewing your queries, try to identify where full table scans are occurring and check whether there is an index on that column. A good place to start is wherever you are aggregating or filtering with WHERE.
A note on the count column
I have not found putting count columns into a table to be helpful. It can lead to some pretty serious integrity issues. If a table is properly optimized, it should be very easy to use COUNT() and get the current count. If you want to have it in a table, you can use a VIEW, although that will not be the most efficient way to make the pull.
The problem with putting count columns into a table is that you need to update that count, using either a TRIGGER or, worse, application logic. As your program scales out, that logic can get lost or buried. Adding such a column is a deviation from normalization, and when something like this is done, there should be a VERY good reason.
Some debate exists as to whether there is ever a good reason to do this, but I think I'd be wise to stay out of that debate because there are great arguments on both sides. Instead, I will pick a much smaller battle and say that I see this causing you more headaches than benefits in this use case, so it is probably worth A/B testing.
Since the HAVING seems to be the issue, can you instead create a flag field in the posts table and use that? If I understand the query correctly, you're trying to find posts with only one meta_relationship link. If you created a field in your posts table that was either a count of the meta_relationships for that post, or a boolean flag for whether there was only one, and indexed it of course, that would probably be much faster. It would involve updating the field whenever the post is edited.
So, consider this:
Add a new field to the posts table called "num_meta_rel". It can be an unsigned TINYINT as long as you'll never have more than 255 tags on any one post.
Update the field like this:
UPDATE posts
SET num_meta_rel=(SELECT COUNT(object_id) from meta_relationships WHERE object_id=posts.post_id);
This query will take some time to run, but once done you have all the counts precalculated. Note this can be done better with a join, but SQLite (Ideone) only allows subqueries.
Now, you rewrite your query like this:
SELECT
posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
RIGHT JOIN meta_relationships ON (posts.post_id = meta_relationships.object_id)
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
LEFT JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE meta.meta_name = 'computers' AND posts.num_meta_rel=1
GROUP BY meta_relationships.object_id
If I've done this correctly, the runnable code is here: http://ideone.com/ZZiKgx
Note that this solution requires that you update num_meta_rel (choose a better name, that one is terrible...) whenever a new tag is associated with the post. But that should be much faster than scanning your entire table over and over.
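One way to keep that column in sync automatically, rather than relying on application logic, is a pair of triggers on meta_relationships. This is only a sketch; the trigger names are my own, and it assumes the column names from the schema above:

```sql
-- Sketch: keep posts.num_meta_rel in step with meta_relationships
DELIMITER //
CREATE TRIGGER trg_meta_rel_ai AFTER INSERT ON meta_relationships
FOR EACH ROW
BEGIN
  -- a new relationship was added for this post
  UPDATE posts SET num_meta_rel = num_meta_rel + 1
  WHERE post_id = NEW.object_id;
END//
CREATE TRIGGER trg_meta_rel_ad AFTER DELETE ON meta_relationships
FOR EACH ROW
BEGIN
  -- a relationship was removed from this post
  UPDATE posts SET num_meta_rel = num_meta_rel - 1
  WHERE post_id = OLD.object_id;
END//
DELIMITER ;
```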
See if this gives you the right answer, possibly faster:
SELECT p.post_id, p.post_name,
GROUP_CONCAT(IF(md.type = 'category', meta.meta_name, null)) AS category,
GROUP_CONCAT(IF(md.type = 'tag', meta.meta_name, null)) AS tag
FROM
( SELECT object_id
FROM meta_relation
GROUP BY object_id
HAVING count(*) = 1
) AS x
JOIN meta_relation AS mr ON mr.object_id = x.object_id
JOIN posts AS p ON p.post_id = mr.object_id
JOIN meta_data AS md ON mr.meta_data_id = md.meta_data_id
JOIN meta ON md.meta_id = meta.meta_id
WHERE meta.meta_name = ?
GROUP BY mr.object_id
Unfortunately I have no way to test performance, but try my query using your real data:
http://sqlfiddle.com/#!9/81b29/13
SELECT
posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
INNER JOIN (
SELECT meta_relationships.object_id
FROM meta_relationships
GROUP BY meta_relationships.object_id
HAVING count(*) < 3
) mr ON mr.object_id = posts.post_id
LEFT JOIN meta_relationships ON mr.object_id = meta_relationships.object_id
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
INNER JOIN (
SELECT *
FROM meta
WHERE meta.meta_name = 'health'
) meta ON meta_data.meta_id = meta.meta_id
GROUP BY posts.post_id
Use sum(1) instead of count(*).
I'm building a system that has items and tags, with a many-to-many relationship (via an intermediate table), in MySQL. As I've scaled it up, one query has become unacceptably slow, but I'm struggling to make it more efficient.
The query in question amounts to "select all tags that have an item of type x associated with them". Here's a very slightly simplified version:
SELECT DISTINCT(t.id)
FROM tags t
INNER JOIN items_tags it ON it.tag_id = t.id
INNER JOIN items i ON it.item_id = i.id
WHERE i.type = 10
I have unique primary indexes on t.id, items.id, and (it.tag_id, it.item_id). The problem I'm having is that the items_tags table is at a size (~1,400,000 rows) where the query takes too long (one thing that puzzles me here is that phpMyAdmin seems to think the query is fast, timing it at a few ms, but in practice it seems to take 6 or 7 seconds).
It feels to me as if there might be a way of joining the items_tags table to itself to reduce the size of the result set (and perhaps remove the need for that DISTINCT), but I can't figure it out... Alternatively, it occurs to me that there might be a better way of indexing things. Any help or suggestions would be much appreciated!
Well, for the record, here's what worked for me (though I'd still be interested if anyone has any other suggestions).
It was pointed out (in the comments above - thanks #Turophile!) that since the tag id is available in the items_tags table, I could leave the tags table out. I actually did need other fields (e.g. name) from the tags table (I simplified the query a little for the question), but I found that removing the tags table from the above query and then joining the tags table onto its results was significantly faster (EXPLAIN showed that fewer rows were scanned). That made the query look more like this:
SELECT
tags.id,
tags.name
FROM tags
INNER JOIN (
SELECT DISTINCT(it.tag_id) AS tag_id
FROM items_tags it
JOIN items i ON it.item_id = i.id
WHERE i.type = 10
) it ON tags.id = it.tag_id
This was about 10x faster than the previous version of the query (reduced the average time from about 27s to ~2.5s).
On top of that, adding an index to items.type improved things further (reduced the average time from ~2.5s to ~1.2s).
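For reference, adding that index is a one-liner (the index name here is my own, not from the original post):

```sql
-- Index on the filtered column; with InnoDB the primary key (id) is
-- implicitly appended, so this also covers the join back to items.id
ALTER TABLE items ADD INDEX idx_type (type);
```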
I've got a query which is taking a long time and I was wondering if there was a better way to do it? Perhaps with joins?
It's currently taking ~2.5 seconds which is way too long.
To explain the structure a little: I have products, "themes" and "categories". A product can be assigned any number of themes or categories. The themeitems and catitems tables are linking tables that link a theme/category ID to a product ID.
I want to get a list of all products with at least one theme and category. The query I've got at the moment is below:
SELECT *
FROM themes t, themeitems ti, products p, catitems ci, categories c
WHERE t.ID = ti.THEMEID
AND ti.PRODID = p.ID
AND p.ID = ci.PRODID
AND ci.CATID = c.ID
I'm only actually selecting the rows I need when performing the query but I've removed that to abstract a little.
Any help in the right direction would be great!
Edit: EXPLAIN below
Using explicit JOINs, and ensuring there are indexes on the fields used in the JOINs, is the standard response to this issue.
SELECT *
FROM themes t
INNER JOIN themeitems ti ON t.ID = ti.THEMEID
INNER JOIN products p ON ti.PRODID = p.ID
INNER JOIN catitems ci ON p.ID = ci.PRODID
INNER JOIN categories c ON ci.CATID = c.ID
The specification of the JOINs assists the query engine in working out what it needs to do, and the indexes on the joined columns enable more rapid joining.
Your query is slow because you don't have any indexes on your tables.
Try:
create unique index pk on themes (ID)
create index fk on themeitems(themeid, prodid)
create unique index pk on products (id)
create index fk on catitems(prodid, catid)
create unique index pk on categories (id)
As #symcbean writes in the comments, the catitems and themeitems indexes should probably be unique indexes too; unless there is another column that needs adding to the index (e.g. "validityDate"), add UNIQUE to those create statements.
Your query is very simple. I do not think your cost will decrease by rewriting it with explicit JOINs. You can try putting indexes on the appropriate columns.
Simply selecting less data is the glaringly obvious solution here.
Why do you need to know every column and every row every time you run the query? Addressing any one of these 3 factors will improve performance.
I want to get a list of all products with at least one theme and category
That rather implies you don't care which theme and category, in which case.....
SELECT p.*
FROM themeitems ti, products p, catitems ci
WHERE p.ID = ti.PRODID
AND p.ID = ci.PRODID
It may be possible to make the query run significantly faster - but you've not provided details of the table structure, the indexes, the volume of data, the engine type, the query cache configuration, the frequency of data updates, the frequency with which the query is run.....
update
Now that you've provided the explain plan, it's obvious you've got very small amounts of data AND NO RELEVANT INDEXES!!!!
As a minimum you should add indexes on the product foreign key in the themeitems and catitems tables. Indeed, the primary keys for these tables should be the product id plus the category id / theme id, and since you will likely have more products than categories or themes, the fields should be in that order in the indexes (i.e. PRODID, CATID rather than CATID, PRODID).
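A sketch of that advice, assuming the linking tables currently have no primary keys of their own:

```sql
-- Product id leads each composite key, per the reasoning above
ALTER TABLE themeitems ADD PRIMARY KEY (PRODID, THEMEID);
ALTER TABLE catitems   ADD PRIMARY KEY (PRODID, CATID);
```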
update2
Given the requirement "to get a list of all products with at least one theme and category", it might be faster still (but the big wins are reducing the number of joins and adding the right indexes) to....
SELECT p.*
FROM product p
INNER JOIN (
SELECT DISTINCT ti.PRODID
FROM themeitems ti, catitems ci
WHERE ti.PRODID=ci.PRODID
) i ON p.id=i.PRODID
I've made this an answer because I could not post it as a comment.
The basic rule of thumb, if you want to remove FULL table scans from JOINs: index first.
Note that this does not always work with ORDER BY/GROUP BY in combination with JOINs, because often a 'Using temporary; Using filesort' step is needed.
As an extra, because this is out of the scope of the question: how to fix a slow query with ORDER BY/GROUP BY in combination with a JOIN.
The MySQL optimizer thinks it needs to access the smallest table first to get the best execution plan, which means MySQL can't always use indexes to sort the result and instead needs a temporary table and a filesort to fix the wrong sort ordering.
(Read more about this here: MySQL slow query using filesort. This is how I fix the problem, because 'Using temporary' can really kill performance when MySQL needs a disk-based temporary table.)
I have the following tables.
Articles table
a_id INT primary unique
name VARCHAR
Description VARCHAR
c_id INT
Category table
id INT
cat_name VARCHAR
For now I simply use
SELECT a_id,name,Description,cat_name FROM Articles LEFT JOIN Category ON Articles.c_id=Category.id WHERE c_id={$id}
This gives me all articles which belong to a certain category along with category name.
Each article is having only one category.
And I use a sub category in a similar way (I have another table named sub_cat). But not every article necessarily has a sub category; it may belong to multiple categories instead.
I now think of tagging an article with more than one category, just like questions at Stack Overflow are tagged (e.g. with multiple tags like PHP, MySQL, SQL etc.). Later I have to display (filter) all articles with certain tags (e.g. tagged with php, or php + MySQL), and I also have to display the tags along with the article name and description.
Can anyone help me redesign the database? (I am using PHP + MySQL at the back end.)
Create a new table:
CREATE TABLE ArticleCategories(
A_ID INT,
C_ID INT,
Constraint PK_ArticleCategories Primary Key (A_ID, C_ID)
)
(this is SQL Server syntax; it may be slightly different for MySQL)
This is called a "Junction Table" or a "Mapping Table" and it is how you express Many-to-Many relationships in SQL. So, whenever you want to add a Category to an Article, just INSERT a row into this table with the IDs of the Article and the Category.
For instance, you can initialize it like this:
INSERT Into ArticleCategories(A_ID,C_ID)
SELECT A_ID,C_ID From Articles
Now you can remove c_id from your Articles table.
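Once the junction table has been populated as above, removing the old column is a single statement:

```sql
-- Drop the single-category column now that ArticleCategories holds the links
ALTER TABLE Articles DROP COLUMN c_id;
```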
To get back all of the Categories for a single Article, you would do use a query like this:
SELECT a_id,name,Description,cat_name
FROM Articles
LEFT JOIN ArticleCategories ON Articles.a_id=ArticleCategories.a_id
INNER JOIN Category ON ArticleCategories.c_id=Category.id
WHERE Articles.a_id={$a_id}
Alternatively, to return all articles that have a category LIKE a certain string:
SELECT a_id,name,Description
FROM Articles
WHERE EXISTS( Select *
From ArticleCategories
INNER JOIN Category ON ArticleCategories.c_id=Category.id
WHERE Articles.a_id=ArticleCategories.a_id
AND Category.cat_name LIKE '%'+{$match}+'%'
)
(You may have to adjust the last line, as I am not sure how string parameters are passed MySQL+PHP.)
OK RBarryYoung, you asked me for a reference/analysis, so here is one.
This reference/analysis is based on the documentation and source code of the MySQL server.
INSERT Into ArticleCategories(A_ID,C_ID)
SELECT A_ID,C_ID From Articles
On a large Articles table with many rows, this copy will push one core of the CPU to 100% load and will create a disk-based temporary table, which will slow down overall MySQL performance because the disk is stressed by the copy.
If this is a one-time process that is not so bad, but do the math if you run it every time...
SELECT a_id,name,Description
FROM Articles
WHERE EXISTS( Select *
From ArticleCategories
INNER JOIN Category ON ArticleCategories.c_id=Category.id
WHERE Articles.a_id=ArticleCategories.a_id
AND Category.cat_name LIKE '%'+{$match}+'%'
)
Note: don't take the execution times on SQL Fiddle at face value; it is a busy server and the times vary a lot, which makes them a poor basis for a statement. Look instead at what View Execution Plan has to say.
See http://sqlfiddle.com/#!2/48817/21 for a demo.
Both queries always trigger a complete table scan on the Articles table plus two DEPENDENT SUBQUERYs; that's not good if you have a large Articles table with many records.
This means the performance depends on the number of Articles rows even when you only want the articles that are in the category.
Select *
From ArticleCategories
INNER JOIN Category ON ArticleCategories.c_id=Category.id
WHERE Articles.a_id=ArticleCategories.a_id
AND Category.cat_name LIKE '%'+{$match}+'%'
This query is the inner subquery, but when you try to run it on its own, MySQL can't, because it depends on a value from the Articles table. This makes it a correlated subquery: a subquery type that will be evaluated once for each row processed by the outer query. Not good indeed.
There are more ways of rewriting RBarryYoung's query; I will show one.
The INNER JOIN way is much more efficient, even with the LIKE operator.
Note: I've made a habit of starting with the table with the lowest number of records and working my way up; if you start with the Articles table, the execution will be the same provided the MySQL optimizer chooses the right plan.
SELECT
Articles.a_id
, Articles.name
, Articles.description
FROM
Category
INNER JOIN
ArticleCategories
ON
Category.id = ArticleCategories.c_id
INNER JOIN
Articles
ON
ArticleCategories.a_id = Articles.a_id
WHERE
cat_name LIKE '%php%';
See http://sqlfiddle.com/#!2/43451/23 for a demo. Note that this looks worse, because it seems as if more rows need to be checked.
Note that if the Articles table has a low number of records, RBarryYoung's EXISTS way and the INNER JOIN way will perform more or less the same based on execution times. As further proof that the INNER JOIN way scales better when the record count becomes larger:
http://sqlfiddle.com/#!2/c11f3/1 EXISTS: oops, more Articles records need to be checked now (even when they are not linked with the ArticleCategories table), so the query is less efficient now.
http://sqlfiddle.com/#!2/7aa74/8 INNER JOIN: same explain plan as the first demo.
Extra notes about scaling: it becomes even worse when you also want to ORDER BY or GROUP BY; the EXISTS way then has a bigger chance of creating a disk-based temporary table, which will kill MySQL performance.
Let's also analyse LIKE '%php%' vs = 'php' for the EXISTS way and the INNER JOIN way.
The EXISTS way:
http://sqlfiddle.com/#!2/48817/21 / http://sqlfiddle.com/#!2/c11f3/1 (more Articles): the explain tells me both patterns are more or less the same, but = 'php' should be a little faster because of the const type vs ref in the TYPE column, while LIKE '%php%' will use more CPU because a string-comparison algorithm needs to run.
The INNER JOIN way:
http://sqlfiddle.com/#!2/43451/23 / http://sqlfiddle.com/#!2/7aa74/8 (more Articles): the explain tells me LIKE '%php%' should be slower because 3 more rows need to be analysed, but not shockingly slower in this case (you can see the index is not really used in the best way).
RBarryYoung's way works but doesn't keep its performance, at least not on a MySQL server.
See http://sqlfiddle.com/#!2/b2bd9/1 or http://sqlfiddle.com/#!2/34ea7/1
for examples that will scale on large tables with lots of records; this is what the topic starter needs.
I am hoping some of you who are experts in mysql can help me to optimize my mysql search query...
First, some background:
I am working on a small exercise mysql application that has a search feature.
Each exercise in the database can belong to an arbitrary number of nested categories, and each exercise can also have an arbitrary number of searchtags associated with it.
Here is my data structure (simplified for readability)
TABLE exercises
ID
title
TABLE searchtags
ID
title
TABLE exerciseSearchtags
exerciseID -> exercises.ID
searchtagID -> searchtags.ID
TABLE categories
ID
parentID -> ID
title
TABLE exerciseCategories
exerciseID -> exercises.ID
categoryID -> categories.ID
All tables are InnoDB (no full-text searching).
The ID columns for exercises, searchtags and categories have been indexed.
"exerciseSearchtags" and "exerciseCategories" are many-to-many join tables expressing the relationships between exercises and searchtags, and between exercises and categories, respectively. Both the exerciseID & searchtagID columns have been indexed in exerciseSearchtags, and both the exerciseID and categoryID columns have been indexed in exerciseCategories.
Here are some examples of what exercise title, category title and searchtag title data might look like. All three types can have multiple words in the title.
Exercises
(ID - title)
1 - Concentric Shoulder Internal Rotation in Prone
2 - Straight Leg Raise Dural Mobility (Sural)
3 - Push-Ups
Categories
(ID - title)
1 - Flexion
2 - Muscles of Mastication
3 - Lumbar Plexus
Searchtags
(ID - title)
1 - Active Range of Motion
2 - Overhead Press
3 - Impingement
Now, on to the search query:
The search engine accepts an arbitrary number of user inputted keywords.
I would like to rank search results based on the number of keyword/category title matches, keyword/searchtag title matches, and keyword/exercise title matches.
To accomplish this, I am using the following dynamically generated SQL:
SELECT
exercises.ID AS ID,
exercises.title AS title,
(
-- for each keyword, the following
-- 3 subqueries are generated
(
SELECT COUNT(1)
FROM categories
LEFT JOIN exerciseCategories
ON exerciseCategories.categoryID = categories.ID
WHERE categories.title RLIKE CONCAT('[[:<:]]',?)
AND exerciseCategories.exerciseID = exercises.ID
) +
(
SELECT COUNT(1)
FROM searchtags
LEFT JOIN exerciseSearchtags
ON exerciseSearchtags.searchtagID = searchtags.ID
WHERE searchtags.title RLIKE CONCAT('[[:<:]]',?)
AND exerciseSearchtags.exerciseID = exercises.ID
) +
(
SELECT COUNT(1)
FROM exercises AS exercises2
WHERE exercises2.title RLIKE CONCAT('[[:<:]]',?)
AND exercises2.ID = exercises.ID
)
-- end subqueries
) AS relevance
FROM
exercises
LEFT JOIN exerciseCategories
ON exerciseCategories.exerciseID = exercises.ID
LEFT JOIN categories
ON categories.ID = exerciseCategories.categoryID
LEFT JOIN exerciseSearchtags
ON exerciseSearchtags.exerciseID = exercises.ID
LEFT JOIN searchtags
ON searchtags.ID = exerciseSearchtags.searchtagID
WHERE
-- for each keyword, the following
-- 3 conditions are generated
categories.title RLIKE CONCAT('[[:<:]]',?) OR
exercises.title RLIKE CONCAT('[[:<:]]',?) OR
searchtags.title RLIKE CONCAT('[[:<:]]',?)
-- end conditions
GROUP BY
exercises.ID
ORDER BY
relevance DESC
LIMIT
$start, $results
All of this works just fine. It returns relevant search results based on user input.
However, I am worried that my solution may not scale well. For example, if a user enters a seven-keyword search string, that will result in a query with 21 subqueries in the relevance calculation, which might start to slow things down if the tables get big.
Does anyone have any suggestions as to how I can optimize the above? Is there a better way to accomplish what I want? Am I making any glaring errors in the above?
Thanks in advance for your help.
I might be able to provide a better answer if you also provided some data, particularly some example keywords and example titles from each of your tables, so we can get a sense of what you're actually trying to match on. But I will try to answer with what you have provided.
First let me put in English what I think your query will do and then I'll break down the reasons why and ways to fix it.
Perform a full table scan of all instances of `exercises`
For each row in `exercises`:
    Find all categories attached via exerciseCategories
    For each combination of exercise and category:
        Perform a full table scan of all instances of exerciseCategories
            Look up the corresponding category
            Perform the RLIKE match on its title
        Perform a full table scan of all instances of exerciseSearchtags
            Look up the corresponding searchtag
            Perform the RLIKE match on its title
        Join back to the exercises table to re-look-up itself
            Perform the RLIKE match on its title
Assuming that you have at least a few sane indexes, this will work out to be E x C x (C + S + 1), where E is the number of exercises, C is the average number of categories for a given exercise, and S is the average number of searchtags for a given exercise. If you don't have indexes on at least the IDs you listed, then it will perform far worse. So part of the answer depends particularly on the relative sizes of C and S, which I can currently only guess at. If E is 1,000 and C and S are each about 2-3, then you'll be scanning 8,000-21,000 rows. If E is 1 million, C is 2-3, and S is 10-15, you'll be scanning 26-57 million rows. If E is 1 million and C or S is about 1,000, then you'll be scanning well over 1 trillion rows. So no, this won't scale well at all.
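To make the formula concrete, here is a quick sketch of the estimate; the inputs are the rough guesses from the paragraph above, not measured figures:

```python
# Worst-case scan estimate from the E x C x (C + S + 1) formula above.
# E = number of exercises, C = avg categories per exercise,
# S = avg searchtags per exercise.
def scan_estimate(E, C, S):
    return E * C * (C + S + 1)

print(scan_estimate(1_000, 3, 3))        # upper end of the small case
print(scan_estimate(1_000_000, 3, 15))   # upper end of the large case
```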
1) The LEFT JOINs inside your subqueries are ignored, because the WHERE clauses on those same queries force them to be normal (inner) JOINs. This doesn't affect performance much, but it does obfuscate your intent.
2) RLIKE (and its alias REGEXP) does not ever utilize indexes, AFAIK, so it will never scale. I can only guess without sample data, but I would say that if your searches require matching on word boundaries, you need to normalize your data. Even if your titles seem like natural strings to store, searching through part of them means you're really treating them as a collection of words. So you should either make use of MySQL's full-text search capabilities, or else break your titles out into separate tables that store one word per row. The one-row-per-word approach will obviously increase your storage, but it would make your queries almost trivial, since you appear to only be doing whole-word matches (as opposed to similar words, word roots, etc.).
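A minimal sketch of the one-word-per-row idea, again using sqlite3 as a stand-in for MySQL (the `searchtagWords` table and index names are invented for the demo): once titles are split into a word table, a whole-word search becomes an indexable equality lookup instead of an RLIKE scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE searchtags (ID INTEGER PRIMARY KEY, title TEXT);
    -- one row per word of each title; invented for this example
    CREATE TABLE searchtagWords (searchtagID INTEGER, word TEXT);
    CREATE INDEX idx_word ON searchtagWords (word, searchtagID);
""")

tags = [(1, "Active Range of Motion"), (2, "Overhead Press")]
conn.executemany("INSERT INTO searchtags VALUES (?, ?)", tags)
for tag_id, title in tags:
    # naive tokenizer: lowercase and split on whitespace
    conn.executemany("INSERT INTO searchtagWords VALUES (?, ?)",
                     [(tag_id, w.lower()) for w in title.split()])

# Whole-word match on 'press' is now a plain equality that can use idx_word.
hits = [t for (t,) in conn.execute(
    "SELECT s.title FROM searchtags s "
    "JOIN searchtagWords w ON w.searchtagID = s.ID "
    "WHERE w.word = ?", ("press",))]
print(hits)
```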
3) The final LEFT JOINs you have are what cause the E x C part of my formula: you will be doing the same work C times for every exercise. Now, admittedly, under most query plans the subqueries will be cached for each category, so in practice it's not quite as bad as I'm suggesting, but that will not be true in every case, so I'm giving you the worst-case scenario. Even if you could verify that you have the proper indexes in place and the query optimizer has avoided all those extra table scans, you will still be returning lots of redundant data, because your results will look something like this:
Exercise 1 info
Exercise 1 info
Exercise 1 info
Exercise 2 info
Exercise 2 info
Exercise 2 info
etc
because each exercise row is duplicated for each exerciseCategories entry, even though you're not returning anything from exerciseCategories or categories (and the categories.ID in your first subquery actually references the categories table joined in that subquery, NOT the one from the outer query).
4) Since most search engines return results using paging, I would guess you only really need the first X results. Adding a LIMIT X to your query, or better yet LIMIT Y, X (where Y is the row offset of the current page, not the page number itself, and X is the number of results returned per page), will greatly help optimize your query if the search keywords return lots of results.
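Since the first argument of LIMIT Y, X is a row offset rather than a page number, the conversion is usually done in application code; a tiny hypothetical helper:

```python
# Turn a 1-based page number into the (offset, count) pair for LIMIT Y, X.
def page_to_limit(page, per_page):
    return (page - 1) * per_page, per_page

# page 3 with 10 results per page skips the first 20 rows
offset, count = page_to_limit(3, 10)
print(offset, count)
```

The resulting pair would then be bound into the query, e.g. `... LIMIT ?, ?` with `(offset, count)`.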
If you can provide us with a little more information on your data, I can update my answer to reflect that.
UPDATE
Based on your responses, here is my suggested query. Unfortunately, without full-text search or indexed words, there are still going to be scaling problems if either your categories table or your searchtags table is very large.
SELECT exercises.ID AS ID,
       exercises.title AS title,
       IF(exercises.title RLIKE CONCAT('[[:<:]]',?), 1, 0)
       +
       (SELECT COUNT(*)
        FROM categories
        JOIN exerciseCategories ON exerciseCategories.categoryID = categories.ID
        WHERE exerciseCategories.exerciseID = exercises.ID
          AND categories.title RLIKE CONCAT('[[:<:]]',?))
       +
       (SELECT COUNT(*)
        FROM searchtags
        JOIN exerciseSearchtags ON exerciseSearchtags.searchtagID = searchtags.ID
        WHERE exerciseSearchtags.exerciseID = exercises.ID
          AND searchtags.title RLIKE CONCAT('[[:<:]]',?)) AS relevance
FROM exercises
HAVING relevance > 0
ORDER BY relevance DESC
LIMIT $start, $results
I wouldn't normally recommend a HAVING clause, but it's not going to be any worse than your RLIKE ... OR RLIKE ... conditions.
This addresses my issues #1, #3, and #4, but leaves #2 still remaining. Given your example data, I would imagine that each table has at most a few dozen entries. In that case, the inefficiency of RLIKE might not be painful enough to be worth the one-word-per-row optimization, but you did ask about scaling. Only an exact-equality query (title = ?) or a starts-with query (title LIKE 'foo%') can use indexes, which are an absolute necessity if you are going to scale up the rows in any table. RLIKE and REGEXP don't fit those criteria, no matter the regular expression used (and yours is a 'contains'-style query, which is the worst case). (It's important to note that title LIKE CONCAT(?, '%') is NOT good enough, because MySQL sees that it has to calculate something and ignores its index. You need to append the '%' in your application.)
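The last point just means the pattern is assembled in application code, so the parameter arrives as a ready-made prefix and the SQL stays a plain `title LIKE ?`. A small sketch (sqlite3 as a stand-in again; it won't reproduce MySQL's index behaviour, only the query shape):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE searchtags (ID INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO searchtags VALUES (?, ?)",
                 [(1, "Impingement"), (2, "Overhead Press")])

keyword = "Imp"
pattern = keyword + "%"   # '%' appended here, not via CONCAT(?, '%') in SQL
rows = [t for (t,) in conn.execute(
    "SELECT title FROM searchtags WHERE title LIKE ?", (pattern,))]
print(rows)
```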
Try running EXPLAIN on the query and look at the rows of the plan that currently do not use an index. Add indexes strategically for those tables.
Also, if possible, reduce the number of RLIKE calls in the query, as those will be expensive.
Consider caching results to reduce database load using something like memcached in front of the database.
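A cache-aside sketch of that idea, with a plain dict standing in for memcached and a stubbed-out query function (all names here are invented for the example): key on the normalized keywords plus the page, so repeated searches skip the database entirely.

```python
cache = {}

def search(keywords, page, run_query):
    # normalize the key so "Overhead press" and "press overhead" share an entry
    key = (" ".join(sorted(k.lower() for k in keywords)), page)
    if key not in cache:
        cache[key] = run_query(keywords, page)   # only hit the DB on a miss
    return cache[key]

calls = []
def fake_db_query(keywords, page):
    calls.append(keywords)                        # count real DB hits
    return ["Push-Ups"]

first  = search(["press", "Overhead"], 1, fake_db_query)
second = search(["Overhead", "press"], 1, fake_db_query)  # same key: cached
print(first, len(calls))
```

In production, the dict would be replaced by a memcached (or similar) client, with an expiry so stale results age out.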