Is there anything I can do to optimize this MySQL query?

I am hoping some of you who are experts in mysql can help me to optimize my mysql search query...
First, some background:
I am working on a small exercise mysql application that has a search feature.
Each exercise in the database can belong to an arbitrary number of nested categories, and each exercise can also have an arbitrary number of searchtags associated with it.
Here is my data structure (simplified for readability)
TABLE exercises
ID
title
TABLE searchtags
ID
title
TABLE exerciseSearchtags
exerciseID -> exercises.ID
searchtagID -> searchtags.ID
TABLE categories
ID
parentID -> ID
title
TABLE exerciseCategories
exerciseID -> exercises.ID
categoryID -> categories.ID
All tables are InnoDB (no full-text searching).
The ID columns for exercises, searchtags and categories have been indexed.
"exerciseSearchtags" and "exerciseCategories" are many-to-many join tables expressing the relationship between exercises and searchtags, and exercises and categories, respectively. Both the exerciseID and searchtagID columns have been indexed in exerciseSearchtags, and both the exerciseID and categoryID columns have been indexed in exerciseCategories.
Here are some examples of what exercise title, category title and searchtag title data might look like. All three types can have multiple words in the title.
Exercises
(ID - title)
1 - Concentric Shoulder Internal Rotation in Prone
2 - Straight Leg Raise Dural Mobility (Sural)
3 - Push-Ups
Categories
(ID - title)
1 - Flexion
2 - Muscles of Mastication
3 - Lumbar Plexus
Searchtags
(ID - title)
1 - Active Range of Motion
2 - Overhead Press
3 - Impingement
Now, on to the search query:
The search engine accepts an arbitrary number of user inputted keywords.
I would like to rank search results based on the number of keyword/category title matches, keyword/searchtag title matches, and keyword/exercise title matches.
To accomplish this, I am using the following dynamically generated SQL:
SELECT
    exercises.ID AS ID,
    exercises.title AS title,
    (
        -- for each keyword, the following
        -- 3 subqueries are generated
        (
            SELECT COUNT(1)
            FROM categories
            LEFT JOIN exerciseCategories
                ON exerciseCategories.categoryID = categories.ID
            WHERE categories.title RLIKE CONCAT('[[:<:]]',?)
            AND exerciseCategories.exerciseID = exercises.ID
        ) +
        (
            SELECT COUNT(1)
            FROM searchtags
            LEFT JOIN exerciseSearchtags
                ON exerciseSearchtags.searchtagID = searchtags.ID
            WHERE searchtags.title RLIKE CONCAT('[[:<:]]',?)
            AND exerciseSearchtags.exerciseID = exercises.ID
        ) +
        (
            SELECT COUNT(1)
            FROM exercises AS exercises2
            WHERE exercises2.title RLIKE CONCAT('[[:<:]]',?)
            AND exercises2.ID = exercises.ID
        )
        -- end subqueries
    ) AS relevance
FROM
    exercises
    LEFT JOIN exerciseCategories
        ON exerciseCategories.exerciseID = exercises.ID
    LEFT JOIN categories
        ON categories.ID = exerciseCategories.categoryID
    LEFT JOIN exerciseSearchtags
        ON exerciseSearchtags.exerciseID = exercises.ID
    LEFT JOIN searchtags
        ON searchtags.ID = exerciseSearchtags.searchtagID
WHERE
    -- for each keyword, the following
    -- 3 conditions are generated
    categories.title RLIKE CONCAT('[[:<:]]',?) OR
    exercises.title RLIKE CONCAT('[[:<:]]',?) OR
    searchtags.title RLIKE CONCAT('[[:<:]]',?)
    -- end conditions
GROUP BY
    exercises.ID
ORDER BY
    relevance DESC
LIMIT
    $start, $results
All of this works just fine. It returns relevant search results based on user input.
However, I am worried that my solution may not scale well. For example, if a user enters a seven-keyword search string, that will result in a query with 21 subqueries in the relevance calculation, which might start to slow things down if the tables get big.
Does anyone have any suggestions as to how I can optimize the above? Is there a better way to accomplish what I want? Am I making any glaring errors in the above?
Thanks in advance for your help.

I might be able to provide a better answer if you also provided some data, particularly some example keywords and example titles from each of your tables, so we can get a sense of what you're actually trying to match on. But I will try to answer with what you have provided.
First let me put in English what I think your query will do and then I'll break down the reasons why and ways to fix it.
Perform a full table scan of all instances of `exercises`
    For each row in `exercises`
        Find all categories attached via exerciseCategories
        For each combination of exercise and category
            Perform a full table scan of all instances of exerciseCategories
                Look up corresponding category
                Perform RLIKE match on title
            Perform a full table scan of all instances of exerciseSearchtags
                Look up corresponding searchtag
                Perform RLIKE match on title
            Join back to exercises table to re-lookup self
                Perform RLIKE match on title
Assuming that you have at least a few sane indexes, this will work out to be E x C x (C + S + 1), where E is the number of exercises, C is the average number of categories for a given exercise, and S is the average number of search tags for a given exercise. If you don't have indexes on at least the IDs you listed, then it will perform far worse. So part of the answer depends on the relative sizes of C and S, which I can currently only guess at. If E is 1,000 and C and S are each about 2-3, then you'll be scanning 10,000-21,000 rows. If E is 1 million, C is 2-3 and S is 10-15, you'll be scanning 26-57 million rows. If E is 1 million and C and S are each about 1,000, then you'll be scanning well over 1 trillion rows. So no, this won't scale well at all.
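To make the estimate concrete, here is a small sketch that simply evaluates E x C x (C + S + 1) for the sizes quoted above. The function name rows_scanned is made up for illustration, and the result is only the rough worst-case figure from this answer, not something measured on a real server.

```python
def rows_scanned(e: int, c: int, s: int) -> int:
    """Worst-case estimate E x C x (C + S + 1) from the answer above."""
    return e * c * (c + s + 1)


# Small data set: 1,000 exercises, 2-3 categories and tags each.
small_low = rows_scanned(1_000, 2, 2)
small_high = rows_scanned(1_000, 3, 3)

# Medium data set: 1 million exercises, 2-3 categories, 10-15 tags.
medium_low = rows_scanned(1_000_000, 2, 10)
medium_high = rows_scanned(1_000_000, 3, 15)

# Large fan-out: 1 million exercises, about 1,000 categories and tags each.
huge = rows_scanned(1_000_000, 1_000, 1_000)

print(small_low, small_high)    # 10000 21000
print(medium_low, medium_high)  # 26000000 57000000
print(huge)                     # 2001000000000
```

The C factor outside the parentheses is the one that hurts: it comes from the outer joins repeating the same subquery work once per category row.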
1) The LEFT JOINs inside of your subqueries are ignored because the WHERE clauses on those same queries force them to be normal JOINs. This doesn't affect performance much but it does obfuscate your intent.
2) RLIKE (and its alias REGEXP) do not ever utilize indexes AFAIK, so they will not ever scale. I can only guess without sample data, but I would say that if your searches require matching on word boundaries, you are in need of normalizing your data. Even if your titles seem like natural strings to store, searching through part of them means you're really treating them as a collection of words. So you should either make use of MySQL's full-text search capabilities or else break your titles out into separate tables that store one word per row. One row per word will obviously increase your storage, but it would make your queries almost trivial, since you appear to only be doing whole-word matches (as opposed to similar words, word roots, etc).
3) The final left joins you have are what cause the E x C part of my formula: you will be doing the same work C times for every exercise. Now, admittedly, under most query plans the subqueries will be cached for each category, so it's not in practice quite as bad as I'm suggesting, but that will not be true in every case, so I'm giving you the worst-case scenario. Even if you could verify that you have the proper indexes in place and the query optimizer has avoided all those extra table scans, you will still be returning lots of redundant data, because your results will look something like this:
Exercise 1 info
Exercise 1 info
Exercise 1 info
Exercise 2 info
Exercise 2 info
Exercise 2 info
etc
because each exercise row is duplicated for each exerciseCategories entry, even though you're not returning anything from exerciseCategories or categories (and the categories.ID in your first subquery is actually referencing the categories joined in that subquery, NOT the one from the outer query).
4) Since most search engines return results using paging, I would guess you only really need the first X results. Adding a LIMIT X to your query, or better yet LIMIT Y, X, where Y is the offset of the current page (the number of rows to skip) and X is the number of results returned per page, will greatly help optimize your query if the search keywords return lots of results.
If you can provide us with a little more information on your data, I can update my answer to reflect that.
UPDATE
Based on your responses, here is my suggested query. Unfortunately, without full text search or indexed words, there are still going to be scaling problems if either your category table or your search tag table is very large.
SELECT exercises.ID AS ID,
    exercises.title AS title,
    (
        IF(exercises.title RLIKE CONCAT('[[:<:]]',?), 1, 0)
        +
        (SELECT COUNT(*)
         FROM categories
         JOIN exerciseCategories ON exerciseCategories.categoryID = categories.ID
         WHERE exerciseCategories.exerciseID = exercises.ID
         AND categories.title RLIKE CONCAT('[[:<:]]',?))
        +
        (SELECT COUNT(*)
         FROM searchtags
         JOIN exerciseSearchtags ON exerciseSearchtags.searchtagID = searchtags.ID
         WHERE exerciseSearchtags.exerciseID = exercises.ID
         AND searchtags.title RLIKE CONCAT('[[:<:]]',?))
    ) AS relevance
FROM exercises
HAVING relevance > 0
ORDER BY relevance DESC
LIMIT $start, $results
I wouldn't normally recommend a HAVING clause, but it's not going to be any worse than your RLIKE ... OR RLIKE ..., etc.
This addresses my issues #1, #3 and #4 but leaves #2 still remaining. Given your example data, I would imagine that each table only has at most a few dozen entries. In that case, the inefficiency of RLIKE might not be painful enough to be worth the optimization of one word per row, but you did ask about scaling. Only an exact equality query (title = ?) or a starts-with query (title LIKE 'foo%') can use indexes, which are an absolute necessity if you are going to scale up the rows in any table. RLIKE and REGEXP don't fit those criteria, no matter the regular expression used (and yours is a 'contains'-like query, which is the worst case). (It's important to note that title LIKE CONCAT(?, '%') is NOT good enough, because MySQL sees that it has to calculate something and ignores its index. You need to add the '%' in your application. Confirm this with EXPLAIN on your own MySQL version, since optimizer behavior varies.)
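The one-word-per-row idea can be sketched as follows. This is only an illustration under assumptions: it uses Python's sqlite3 module as a stand-in for MySQL, and the exerciseWords table, its index name and the search helper are hypothetical names that do not appear in the question. The point is that each keyword becomes an exact equality lookup on an indexed column instead of an RLIKE scan.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exercises (ID INTEGER PRIMARY KEY, title TEXT);
    -- Hypothetical helper table: one row per (exercise, word).
    CREATE TABLE exerciseWords (
        exerciseID INTEGER REFERENCES exercises(ID),
        word TEXT
    );
    CREATE INDEX idx_exerciseWords_word ON exerciseWords (word, exerciseID);
""")

titles = {
    1: "Concentric Shoulder Internal Rotation in Prone",
    2: "Straight Leg Raise Dural Mobility (Sural)",
    3: "Push-Ups",
}
for ex_id, title in titles.items():
    conn.execute("INSERT INTO exercises VALUES (?, ?)", (ex_id, title))
    # Lowercase and split on non-alphanumerics, so "(Sural)" is stored as "sural".
    for word in set(re.findall(r"[a-z0-9]+", title.lower())):
        conn.execute("INSERT INTO exerciseWords VALUES (?, ?)", (ex_id, word))


def search(keywords):
    """Rank exercises by how many keywords equal a whole word of the title."""
    placeholders = ", ".join("?" for _ in keywords)
    sql = f"""
        SELECT e.ID, e.title, COUNT(*) AS relevance
        FROM exerciseWords AS w
        JOIN exercises AS e ON e.ID = w.exerciseID
        WHERE w.word IN ({placeholders})
        GROUP BY e.ID, e.title
        ORDER BY relevance DESC, e.ID
    """
    return conn.execute(sql, [k.lower() for k in keywords]).fetchall()


results = search(["shoulder", "rotation", "leg"])
print(results)
```

The same pattern would be repeated for category and searchtag titles. If keywords should match the start of a word rather than the whole word, the equality test could be swapped for a starts-with LIKE built in the application, which is the other index-friendly form mentioned above.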

Try running EXPLAIN on the query and look at the rows of its output where no index is used (the key column is NULL). Add indexes strategically for those tables.
Also, if possible, reduce the number of RLIKE calls in the query, as those will be expensive.
Consider caching results to reduce database load using something like memcached in front of the database.

Related

Optimizing Inner Join Queries

I have this query and I want to know if I can optimize it in some way, because it currently takes a long time to execute (around 4-5 seconds).
SELECT *
FROM `posts` ml INNER JOIN
posts_tag_one gt
ON gt.post_id = ml.id AND gt.tag_id = 15 INNER JOIN
posts_tag_two gg
ON gg.post_id = ml.id AND gg.tag_id = 5
WHERE active = '1' AND NOT ml.id = '639474'
ORDER BY ml.id DESC
LIMIT 5
The database has 600k+ posts; posts_tag_one has 5 million records and posts_tag_two has 475k+ records.
The example I gave has only 2 joins, but in some cases I have up to 4 joins, where the other tables have 300k-400k records each.
I am using foreign keys and indexes on the posts_tag_one and posts_tag_two tables, but the query is still slow.
Any advice would help. Thanks!
By means of the transitive property (if a=b and b=c, then a=c), your ML.ID = GT.Post_ID = GG.Post_ID. Since you are trying to pre-qualify specific tags, I would rewrite the query to move the most selective table to the front position and use better indexes, to see whether the cardinality of the data helps. Also, MySQL has a nice keyword, STRAIGHT_JOIN, that tells the engine "query the data in the order I tell you; don't think for me". I have used it many times and have seen significant improvement.
SELECT STRAIGHT_JOIN
*
FROM
posts_tag_two gg
INNER JOIN posts_tag_one gt
ON gg.post_id = gt.post_id
AND gt.tag_id = 15
INNER JOIN posts ml
ON gt.post_id = ml.id
AND ml.active = 1
WHERE
gg.tag_id = 5
AND NOT gg.post_id = 639474
ORDER BY
gg.post_id DESC
LIMIT 5
I would ensure the following table / multi-field indexes:
table            index
Posts_Tag_One    ( tag_id, post_id )
Posts_Tag_Two    ( tag_id, post_id )
posts            ( id, active )
By starting with the Posts_Tag_Two table which you are pre-filtering for tag_id = 5, you are already cutting the list down to those pre-qualified FIRST. Not by starting with ALL posts and seeing which qualify with the tag.
Second level join is to the POSTS_TAG_ONE table on same ID, but that level filtered by its Tag_ID = 15.
Only then does it even care to get to the POSTS table for active.
Since the order is based on the ID descending, and the Posts_tag_two table "post_id" is the same value as Posts.id, the index from the posts_tag_two table should return the record already pre-sorted.
HTH, and would be interested to know final performance difference. Again, I have used STRAIGHT_JOIN many times with significant improvement in performance. I also typically do NOT do "Select *" for all tables / all columns. Get what you need.
FEEDBACK
@eshirvana, in MANY cases, yes, the optimizers do this by default. But sometimes the designer knows the makeup of the data better. Let's take the scenario of POSTS in the lead position. You have a room of boxes for posts. Each box contains, say, 10k records. You have to go through all 10k records, then on to the next box, until you get through 400k records (again, just for example). Once you find those, it goes to the join on the filtered criteria for a specific tag. Those too are ordered by ID, so you have to do a one-to-one correlation. So the question is which table should sit in the primary position.
Now consider the index by tag on one of the posts_tag tables (the smaller one, #2, by choice).
Now you have a room of boxes, but each box only has one tag within it. If you have 300 tag IDs available, you have already cut out x-amount of records, giving you just the small sample you pre-qualify to.
The second posts_tag table is similarly a room of boxes, also broken down by tag. So now you only have to grab the box for tag #15.
So now you have two very finite sets of records that the JOIN can match on the ID that exists in both. Only once that is done do you ever need to go to the posts table, which by ID is going to be quick and direct. And with the active status in the index, the engine never needs to go to any actual data pages to retrieve the data until all conditions are met. Only then does it pull the record from the 3 respective tables being returned.
Sounds like posts_tags is a many-to-many mapping table? It needs two indexes: (post_id, tag_id) and (tag_id, post_id). One of those should probably be the PRIMARY KEY (having an auto_increment id is wasteful and slows things down). The other should be an INDEX (not UNIQUE). More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
But, why have both posts_tag_two and posts_tag_one?
In addition to those 'composite' keys, do not also have the single-column (post_id) or (tag_id).
If tag is simply a short string, don't bother normalizing it; simply have it in the table.
For further discussion, please provide SHOW CREATE TABLE for each table. And EXPLAIN SELECT ....
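As a rough illustration of that mapping-table layout, here is a sketch using Python's sqlite3 module as a stand-in for MySQL. It assumes a single combined posts_tags table (the question actually has posts_tag_one and posts_tag_two), and the index name idx_tag_post and the sample rows are made up. The composite primary key serves "tags of a post" lookups, and the reversed index serves "posts with a tag" lookups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts_tags (
        post_id INTEGER NOT NULL,
        tag_id  INTEGER NOT NULL,
        PRIMARY KEY (post_id, tag_id)   -- lookup direction: tags of a post
    );
    -- Reverse direction: posts carrying a given tag.
    CREATE INDEX idx_tag_post ON posts_tags (tag_id, post_id);
""")
conn.executemany(
    "INSERT INTO posts_tags VALUES (?, ?)",
    [(1, 5), (1, 15), (2, 5), (3, 15), (4, 5), (4, 15)],
)

# Posts that carry both tag 5 and tag 15, highest post id first.
sql = """
    SELECT a.post_id
    FROM posts_tags AS a
    JOIN posts_tags AS b ON b.post_id = a.post_id AND b.tag_id = 15
    WHERE a.tag_id = 5
    ORDER BY a.post_id DESC
    LIMIT 5
"""
both = [row[0] for row in conn.execute(sql)]
print(both)  # [4, 1]

# The last column of each EXPLAIN QUERY PLAN row describes the access path.
plan = " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))
print(plan)
```

SQLite's planner is not MySQL's, so the printed plan is only indicative; on MySQL the equivalent check is EXPLAIN SELECT ..., as requested above.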

Database design to enable Multiple tags like Stackoverflow?

I have the following tables.
Articles table
a_id INT primary unique
name VARCHAR
Description VARCHAR
c_id INT
Category table
id INT
cat_name VARCHAR
For now I simply use
SELECT a_id,name,Description,cat_name FROM Articles LEFT JOIN Category ON Articles.c_id=Category.id WHERE c_id={$id}
This gives me all articles which belong to a certain category along with category name.
Each article is having only one category.
I also use a sub-category in a similar way (I have another table named sub_cat). But not every article necessarily has a sub-category; it may belong to multiple categories instead.
I am now thinking of tagging an article with more than one category, just as questions on Stack Overflow are tagged (e.g. with multiple tags like PHP, MYSQL, SQL). Later I have to display (filter) all articles with certain tags (e.g. tagged with php, or php + MySQL), and I also have to display the tags along with the article name and Description.
Can anyone help me redesign the database? (I am using PHP + MySQL at the back end.)
Create a new table:
CREATE TABLE ArticleCategories(
A_ID INT,
C_ID INT,
Constraint PK_ArticleCategories Primary Key (A_ID, C_ID)
)
(this is the SQL server syntax, may be slightly different for MySQL)
This is called a "Junction Table" or a "Mapping Table" and it is how you express Many-to-Many relationships in SQL. So, whenever you want to add a Category to an Article, just INSERT a row into this table with the IDs of the Article and the Category.
For instance, you can initialize it like this:
INSERT Into ArticleCategories(A_ID,C_ID)
SELECT A_ID,C_ID From Articles
Now you can remove c_id from your Articles table.
To get back all of the Categories for a single Article, you would use a query like this:
SELECT Articles.a_id,name,Description,cat_name
FROM Articles
LEFT JOIN ArticleCategories ON Articles.a_id=ArticleCategories.a_id
INNER JOIN Category ON ArticleCategories.c_id=Category.id
WHERE Articles.a_id={$a_id}
Alternatively, to return all articles that have a category LIKE a certain string:
SELECT a_id,name,Description
FROM Articles
WHERE EXISTS( Select *
From ArticleCategories
INNER JOIN Category ON ArticleCategories.c_id=Category.id
WHERE Articles.a_id=ArticleCategories.a_id
AND Category.cat_name LIKE '%'+{$match}+'%'
)
(You may have to adjust the last line, as I am not sure how string parameters are passed MySQL+PHP.)
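For the "tagged with php + MySQL" kind of filter from the question, one possible sketch on top of this junction table is shown below. It uses Python's sqlite3 module as a stand-in for MySQL; the sample rows and the helper names articles_with_all_tags and tags_of are invented for illustration. The idea is to keep only the rows for the wanted categories and then require that an article matched every one of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Articles (a_id INTEGER PRIMARY KEY, name TEXT, Description TEXT);
    CREATE TABLE Category (id INTEGER PRIMARY KEY, cat_name TEXT);
    CREATE TABLE ArticleCategories (
        a_id INTEGER NOT NULL REFERENCES Articles(a_id),
        c_id INTEGER NOT NULL REFERENCES Category(id),
        PRIMARY KEY (a_id, c_id)
    );
    -- Made-up sample data.
    INSERT INTO Articles VALUES (1, 'Intro to PDO', 'd1'), (2, 'Joins', 'd2'), (3, 'Sessions', 'd3');
    INSERT INTO Category VALUES (1, 'php'), (2, 'mysql'), (3, 'sql');
    INSERT INTO ArticleCategories VALUES (1, 1), (1, 2), (2, 2), (2, 3), (3, 1);
""")


def articles_with_all_tags(tags):
    """Return (a_id, name) for articles tagged with every category in tags."""
    tags = sorted(set(tags))  # duplicates would break the count comparison
    placeholders = ", ".join("?" for _ in tags)
    sql = f"""
        SELECT a.a_id, a.name
        FROM Articles AS a
        JOIN ArticleCategories AS ac ON ac.a_id = a.a_id
        JOIN Category AS c ON c.id = ac.c_id
        WHERE c.cat_name IN ({placeholders})
        GROUP BY a.a_id, a.name
        HAVING COUNT(DISTINCT c.id) = ?
        ORDER BY a.a_id
    """
    return conn.execute(sql, [*tags, len(tags)]).fetchall()


def tags_of(a_id):
    """Return the category names attached to one article, sorted by name."""
    sql = """
        SELECT c.cat_name
        FROM ArticleCategories AS ac
        JOIN Category AS c ON c.id = ac.c_id
        WHERE ac.a_id = ?
        ORDER BY c.cat_name
    """
    return [row[0] for row in conn.execute(sql, (a_id,))]


print(articles_with_all_tags(["php"]))
print(articles_with_all_tags(["php", "mysql"]))
print(tags_of(1))
```

Passing the tag names as bound parameters also sidesteps the string-concatenation question raised about the LIKE line above.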
OK RBarryYoung, you asked me for a reference/analysis, so here is one.
This analysis is based on the documentation and source code of the MySQL server.
INSERT Into ArticleCategories(A_ID,C_ID)
SELECT A_ID,C_ID From Articles
On a large Articles table with many rows, this copy will push one CPU core to 100% load and will create a disk-based temporary table, which will slow down overall MySQL performance because the disk is stressed by the copy.
If this is a one-time process this is not that bad, but do the math if you run it every time.
SELECT a_id,name,Description
FROM Articles
WHERE EXISTS( Select *
From ArticleCategories
INNER JOIN Category ON ArticleCategories.c_id=Category.id
WHERE Articles.a_id=ArticleCategories.a_id
AND Category.cat_name LIKE '%'+{$match}+'%'
)
Note: don't take the execution times on SQL Fiddle at face value; it is a busy server and the times vary too much to draw conclusions. Look instead at what View Execution Plan has to say.
See http://sqlfiddle.com/#!2/48817/21 for a demo.
Both queries always trigger a complete table scan on the Articles table and two DEPENDENT SUBQUERYs, which is not good if you have a large Articles table with many records.
This means the performance depends on the number of Articles rows, even when you want only the articles that are in the category.
Select *
From ArticleCategories
INNER JOIN Category ON ArticleCategories.c_id=Category.id
WHERE Articles.a_id=ArticleCategories.a_id
AND Category.cat_name LIKE '%'+{$match}+'%'
This is the inner subquery, but when you try to run it on its own, MySQL can't, because it depends on a value from the Articles table. So this is a correlated subquery: a subquery type that is evaluated once for each row processed by the outer query. Not good indeed.
There are more ways of rewriting RBarryYoung's query; I will show one.
The INNER JOIN way is much more efficient, even with the LIKE operator.
Note: I have made a habit of starting with the table with the lowest number of records and working my way up. If you start with the Articles table, the execution will be the same, provided the MySQL optimizer chooses the right plan.
SELECT
Articles.a_id
, Articles.name
, Articles.description
FROM
Category
INNER JOIN
ArticleCategories
ON
Category.id = ArticleCategories.c_id
INNER JOIN
Articles
ON
ArticleCategories.a_id = Articles.a_id
WHERE
cat_name LIKE '%php%';
See http://sqlfiddle.com/#!2/43451/23 for a demo. Note that this looks worse, because it looks like more rows need to be checked.
Note: if the Articles table has a low number of records, RBarryYoung's EXISTS way and the INNER JOIN way will perform more or less the same in terms of execution time. As further proof, the INNER JOIN way scales better when the record count becomes larger:
http://sqlfiddle.com/#!2/c11f3/1 EXISTS: oops, more Articles records need to be checked now (even when they are not linked in the ArticleCategories table), so the query is less efficient now.
http://sqlfiddle.com/#!2/7aa74/8 INNER JOIN: same explain plan as the first demo.
Extra notes about scaling: it becomes even worse when you also want to ORDER BY or GROUP BY; the EXISTS way has a bigger chance of creating a disk-based temporary table that will kill MySQL performance.
Let's also analyse LIKE '%php%' vs = 'php' for the EXISTS way and the INNER JOIN way.
The EXISTS way:
http://sqlfiddle.com/#!2/48817/21 / http://sqlfiddle.com/#!2/c11f3/1 (more Articles): the explain tells me both patterns are more or less the same, but = 'php' should be a little faster because of the const type vs ref in the TYPE column, while LIKE '%php%' will use more CPU because a string comparison algorithm needs to run.
The INNER JOIN way:
http://sqlfiddle.com/#!2/43451/23 / http://sqlfiddle.com/#!2/7aa74/8 (more Articles): the explain tells me LIKE '%php%' should be slower because 3 more rows need to be analysed, but not shockingly slower in this case (you can see the index is not really used in the best way).
RBarryYoung's way works but doesn't keep its performance, at least not on a MySQL server.
See http://sqlfiddle.com/#!2/b2bd9/1 or http://sqlfiddle.com/#!2/34ea7/1
for examples that will scale on large tables with lots of records; this is what the topic starter needs.

Efficiently selecting from many-to-many relation in H2

I'm using H2, and I have a database of books (table Entries) and authors (table Persons), connected through a many-to-many relationship, itself stored in a table Authorship.
The database is fairly large (900'000+ persons and 2.5M+ books).
I'm trying to efficiently select the list of all books authored by at least one author whose name matches a pattern (LIKE '%pattern%'). The trick here is that the pattern should severely restrict the number of matching authors, and each author has a reasonably small number of associated books.
I tried two queries:
SELECT p.*, e.title FROM (SELECT * FROM Persons WHERE name LIKE '%pattern%') AS p
INNER JOIN Authorship AS au ON au.authorId = p.id
INNER JOIN Entries AS e ON e.id = au.entryId;
and:
SELECT p.*, e.title FROM Persons AS p
INNER JOIN Authorship AS au ON au.authorId = p.id
INNER JOIN Entries AS e ON e.id = au.entryId
WHERE p.name like '%pattern%';
I expected the first one to be much faster, as I'm joining a much smaller (sub)table of authors, however they both take as long. So long in fact that I can manually decompose the query into three selects and find the result I want faster.
When I try to EXPLAIN the queries, I observe that indeed they are very similar (a full join on the tables and only then a WHERE clause), so my question is: how can I achieve a fast select, that relies on the fact that the filter on authors should result in a much smaller join with the other two tables?
Note that I tried the same queries with MySQL and got results in line with what I expected (selecting first is much faster).
Thank you.
OK, here is something that finally worked for me.
Instead of running the query:
SELECT p.*, e.title FROM (SELECT * FROM Persons WHERE name LIKE '%pattern%') AS p
INNER JOIN Authorship AS au ON au.authorId = p.id
INNER JOIN Entries AS e ON e.id = au.entryId;
...I ran:
SELECT title FROM Entries e WHERE id IN (
SELECT entryId FROM Authorship WHERE authorId IN (
SELECT id FROM Persons WHERE name LIKE '%pattern%'
)
)
It's not exactly the same query, because now I don't get the author id as a column in the result, but that does what I wanted: take advantage of the fact that the pattern restricts the number of authors to a very small value to search only through a small number of entries.
What is interesting is that this worked great with H2 (much, much faster than the join), but with MySQL it is terribly slow. (This has nothing to do with the LIKE '%pattern%' part, see comments in other answers.) I suppose queries are optimized differently.
SELECT * FROM Persons WHERE name LIKE '%pattern%' will always take LONG on a 900,000+ row table no matter what you do, because when your pattern starts with a %, MySQL can't use any indexes and must do a full table scan. You should look into full-text indexes and functions.
Well, since the LIKE condition starts with a wildcard, it will result in a full table scan, which is always slow; no internal caching can take place.
If you want to do full-text searches, MySQL is not the best bet you have. Look into other software (Solr, for instance) to solve this kind of problem.

MySQL Full text boolean search with tags

I've never done searching from MySQL before, but I need to implement a search. I have three tables: articles, articles_tags, and tags.
The table articles holds the first thing I would like to search on, the title field.
The table articles_tags is a pivot table which relates articles and tags together. articles_tags has two fields, that are: articles_id and tag_id.
And, the table tags holds the second thing I would like to search on, the name field.
My problem is, I need a way to search the title field, and each of the tags that relate to that article (tags.name) and return a relevancy (or sort by relevancy) for the specific article.
What would be a good way to implement this? I'm pretty sure it can't be done from just one query so two queries, and then mixing the relevancies together, would be ok.
Thanks.
Edit: Forgot to say, if I could give more weighting to matching a tag than matching a word in the title, that would be awesome. I'm not really asking for anyone to write the thing, but give me some direction. I'm a bit of a newbie in both PHP and MySQL.
Starting from the answer given by @james.c.funk but making some changes.
SELECT a.id, a.title,
MATCH (a.title) AGAINST (?) AS relevance
FROM articles AS a
LEFT OUTER JOIN (articles_tags AS at
JOIN tags AS t ON (t.id = at.tag_id AND t.name = ?))
ON (a.id = at.article_id)
WHERE MATCH (a.title) AGAINST (? IN BOOLEAN MODE)
ORDER BY IF(t.name IS NOT NULL, 1.0, relevance) DESC;
I assume you want tag matches to match against the full string, instead of using a fulltext search.
Also using one left outer join instead of two, because if a join to articles_tags is satisfied, then surely there is a tag. Put the tag name comparison inside the join condition instead of in the WHERE clause.
Boolean mode makes MATCH() return 1.0 on a match, which makes it useless as a measure of relevance. So do an extra comparison in the select-list to calculate the relevance. This value is between 0.0 and 1.0. Now we can make a tag match sort higher by treating it as having relevance of 1.0.
Is it worth, at this point, recommending that you look at offloading the job of search to something that is actually written just for that purpose?
In our products, we use MySQL to store data, but index all of our data with Lucene (via Solr - but that's irrelevant).
It's worth giving it a look into, because it's relatively simple to set up, it's very powerful and it's a lot easier than trying to manipulate the database into doing what you want.
Sorry this isn't a direct answer to the question, I just feel that this kind of thing is always worth mentioning in this scenario :)
Here is how I have done this in the past. It looks slow, but I think you will find it is not.
I added a little complexity to show what else can easily be done. In this example an article will get 1 point for a partial title match, 2 points for a partial tag match, 3 points for an exact tag match, and 4 points for an exact title match. It then adds those up and sorts by the score.
SELECT
a.*,
SUM(
CASE WHEN a.title LIKE '%keyword%' THEN 1 ELSE 0 END
+
CASE WHEN t.name LIKE '%keyword%' THEN 2 ELSE 0 END
+
CASE WHEN t.name = 'keyword' THEN 3 ELSE 0 END
+
CASE WHEN a.title = 'keyword' THEN 4 ELSE 0 END
) AS score
FROM articles a, articles_tags at, tags t
WHERE a.id = at.article_id
AND at.tag_id=t.id
AND (a.title LIKE '%keyword%' OR t.name LIKE '%keyword%')
GROUP BY a.id
ORDER BY score DESC;
NOTES: This will not return articles without tags. I used simple joins to reduce the noise in the query and highlight just what is doing the scoring. To include articles without tags just make the joins left joins.
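Here is a runnable sketch of the same point scheme with the joins turned into left joins, so untagged articles are kept. It uses Python's sqlite3 module as a stand-in for MySQL, so string concatenation is written with || where MySQL would use CONCAT; the sample rows and the alias names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE articles_tags (article_id INTEGER, tag_id INTEGER);
    -- Made-up sample data; article 3 has no tags at all.
    INSERT INTO articles VALUES
        (1, 'mysql'), (2, 'Tuning mysql joins'),
        (3, 'Untagged mysql notes'), (4, 'Cooking');
    INSERT INTO tags VALUES (1, 'mysql'), (2, 'mysql-performance'), (3, 'food');
    INSERT INTO articles_tags VALUES (1, 1), (2, 2), (4, 3);
""")

SQL = """
    SELECT a.id,
           SUM(
               CASE WHEN a.title LIKE '%' || :kw || '%' THEN 1 ELSE 0 END
             + CASE WHEN t.name  LIKE '%' || :kw || '%' THEN 2 ELSE 0 END
             + CASE WHEN t.name  = :kw THEN 3 ELSE 0 END
             + CASE WHEN a.title = :kw THEN 4 ELSE 0 END
           ) AS score
    FROM articles AS a
    LEFT JOIN articles_tags AS link ON link.article_id = a.id
    LEFT JOIN tags AS t ON t.id = link.tag_id
    WHERE a.title LIKE '%' || :kw || '%' OR t.name LIKE '%' || :kw || '%'
    GROUP BY a.id
    ORDER BY score DESC, a.id
"""

ranked = conn.execute(SQL, {"kw": "mysql"}).fetchall()
print(ranked)  # [(1, 10), (2, 3), (3, 1)]
```

Article 1 collects all four bonuses (1 + 2 + 3 + 4), article 2 gets the partial title and partial tag points, and the untagged article 3 still appears with its single title point. Because the SUM runs over one row per tag, an article with several tags earns its title points once per tag, which is worth keeping in mind when comparing scores.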
You might want to look into sphinx, http://www.sphinxsearch.com/
This quick demo query is far from optimized but should be a good starting point
SELECT * FROM
(SELECT a.id, a.title,
MATCH (a.title) AGAINST ('$s_search_term') AS title_score,
SUM(MATCH (t.name) AGAINST ('$s_search_term')
) AS tag_score
FROM articles AS a
LEFT JOIN articles_tags AS at
ON a.id = at.article_id
LEFT JOIN tags AS t
ON t.id = at.tag_id
WHERE MATCH (a.title) AGAINST ('$s_search_term')
OR MATCH (t.name) AGAINST ('$s_search_term')
GROUP BY a.id) AS table1
ORDER BY 2*tag_score + title_score DESC
You may want to normalize tag_score by dividing it by COUNT(t.id). Sorry but it's easier to give the query than to explain how to make it.
Funny it is the 3rd question about pretty much the same problem I see in 2 days, check out these two posts: 1, 2

MySQL -- joining then joining then joining again

MySQL setup: step by step.
programs -> linked to --> speakers (by program_id)
At this point, it's easy for me to query all the data:
SELECT *
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
Nice and easy.
The trick for me is this. My speakers table is also linked to a third table, "books." So in the "speakers" table, I have "book_id" and in the "books" table, the book_id is linked to a name.
I've tried this (including a WHERE you'll notice):
SELECT *
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
JOIN books on speakers.book_id = books.book_id
WHERE programs.category_id = 1
LIMIT 5
No results.
My questions:
What am I doing wrong?
What's the most efficient way to make this query?
Basically, I want to get back all the programs data and the books data, but instead of the book_id, I need it to come back as the book name (from the 3rd table).
Thanks in advance for your help.
UPDATE:
(rather than opening a brand new question)
The left join worked for me. However, I have a new problem. Multiple books can be assigned to a single speaker.
Using the left join, returns two rows!! What do I need to add to return only a single row, but separate the two books.
Is there any chance that the books table doesn't have any matching rows for speakers.book_id?
Try using a left join which will still return the program/speaker combinations, even if there are no matches in books.
SELECT *
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
LEFT JOIN books on speakers.book_id = books.book_id
WHERE programs.category_id = 1
LIMIT 5
Btw, could you post the table schemas for all tables involved, and exactly what output (or reasonable representation) you'd expect to get?
Edit: Response to op author comment
you can use group by and group_concat to put all the books on one row.
e.g.
SELECT speakers.speaker_id,
speakers.speaker_name,
programs.program_id,
programs.program_name,
group_concat(books.book_name)
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
LEFT JOIN books on speakers.book_id = books.book_id
WHERE programs.category_id = 1
GROUP BY speakers.speaker_id, speakers.speaker_name, programs.program_id, programs.program_name
LIMIT 5
Note: since I don't know the exact column names, these may be off
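To show the effect of group_concat, here is a small sketch using Python's sqlite3 module as a stand-in for MySQL (SQLite also provides group_concat). The schema is a guess, as in the answer: column names such as speaker_name and book_name are assumptions, and the speakers table is assumed to hold one row per speaker and book, which is what makes the plain join return two rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE programs (program_id INTEGER PRIMARY KEY, program_name TEXT, category_id INTEGER);
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, book_name TEXT);
    -- Assumed layout: one row per (speaker, book); book_id is NULL without a book.
    CREATE TABLE speakers (speaker_id INTEGER, speaker_name TEXT, program_id INTEGER, book_id INTEGER);
    INSERT INTO programs VALUES (1, 'Morning Show', 1);
    INSERT INTO books VALUES (10, 'First Book'), (11, 'Second Book');
    INSERT INTO speakers VALUES (100, 'Ann', 1, 10), (100, 'Ann', 1, 11), (101, 'Bob', 1, NULL);
""")

rows = conn.execute("""
    SELECT s.speaker_id, s.speaker_name, p.program_name,
           group_concat(b.book_name) AS book_names
    FROM programs AS p
    JOIN speakers AS s ON s.program_id = p.program_id
    LEFT JOIN books AS b ON b.book_id = s.book_id
    WHERE p.category_id = 1
    GROUP BY s.speaker_id, s.speaker_name, p.program_id, p.program_name
    ORDER BY s.speaker_id
    LIMIT 5
""").fetchall()

for row in rows:
    print(row)
```

Ann's two books collapse into one comma-separated value on a single row, while Bob, who has no book, is kept by the LEFT JOIN with an empty (NULL) book list. The order of names inside the concatenated value is not guaranteed here; MySQL's GROUP_CONCAT accepts an ORDER BY inside the call if a fixed order is needed.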
That's typically efficient. There is some kind of assumption you are making that isn't true. Do your speakers have books assigned? If they don't that last JOIN should be a LEFT JOIN.
This kind of query is typically pretty efficient, since you almost certainly have primary keys as indexes. The main issue would be whether your indexes are covering (which is more likely to occur if you don't use SELECT *, but instead select only the columns you need).