MySQL FULLTEXT Search Across >1 Table

As a more general case of this question, because I think it may be of interest to more people: what's the best way to perform a fulltext search on two tables? Assume there are three tables: one for programs (with submitter_id), and one each for tags and descriptions, each with an object_id foreign key referring to records in programs. We want the submitter_id of programs with certain text in their tags OR descriptions. We have to use MATCH AGAINST for reasons that I won't go into here. Don't get hung up on that aspect.
programs
id
submitter_id
tags_programs
object_id
text
descriptions_programs
object_id
text
The following works and executes in 20 ms or so:
SELECT p.submitter_id
FROM programs p
WHERE p.id IN
(SELECT t.object_id
FROM tags_programs t
WHERE MATCH (t.text) AGAINST ('china')
UNION ALL
SELECT d.object_id
FROM descriptions_programs d
WHERE MATCH (d.text) AGAINST ('china'))
But when I tried to rewrite this as a JOIN as follows, it ran for a very long time. I had to kill it after 60 seconds.
SELECT p.id
FROM descriptions_programs d, tags_programs t, programs p
WHERE (d.object_id=p.id AND MATCH (d.text) AGAINST ('china'))
OR (t.object_id=p.id AND MATCH (t.text) AGAINST ('china'))
Just out of curiosity I replaced the OR with AND. That also runs in a few milliseconds, but it's not what I need. What's wrong with the second query above? I can live with the UNION and subselects, but I'd like to understand.

Join after the filters (e.g. join the results), don't try to join and then filter.
The reason is that you lose use of your fulltext index.
Clarification in response to the comment: I'm using the word join generically here, not as JOIN but as a synonym for merge or combine.
I'm essentially saying you should use the first (faster) query, or something like it. The reason it's faster is that each of the subqueries is sufficiently uncluttered that the db can use that table's full text index to do the select very quickly. Joining the two (presumably much smaller) result sets (with UNION) is also fast. This means the whole thing is fast.
The slow version winds up walking through lots of data testing it to see if it's what you want, rather than quickly winnowing the data down and only searching through rows you are likely to actually want.
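As a rough sketch of "filter first, then combine", the fast query can also be written with the two fulltext searches in a derived table that is joined to programs afterwards. This assumes FULLTEXT indexes exist on the text column of both tags_programs and descriptions_programs; the alias matched is made up for illustration.

```sql
-- Assumption: FULLTEXT indexes exist on tags_programs(text)
-- and descriptions_programs(text).
SELECT p.submitter_id
FROM (
    -- Each branch touches one table only, so its own fulltext index can be used.
    SELECT t.object_id
    FROM tags_programs t
    WHERE MATCH (t.text) AGAINST ('china')
    UNION   -- UNION (not UNION ALL) drops ids that match more than once
    SELECT d.object_id
    FROM descriptions_programs d
    WHERE MATCH (d.text) AGAINST ('china')
) AS matched   -- hypothetical alias for the combined, already-filtered id list
JOIN programs p ON p.id = matched.object_id;
```

Because the derived table holds each matching id once, this should return one row per matching program, like the IN version. Whether it is faster or slower than the IN version depends on the MySQL version and data, so it is worth comparing both.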

Just in case you don't know: MySQL has a built-in statement called EXPLAIN that can be used to see what's going on under the surface. There are a lot of articles about this, so I won't go into any detail, but for each table it provides an estimate of the number of rows it will need to process. If you look at the "rows" column in the EXPLAIN result for the second query you'll probably see that the number of rows is quite large, and certainly a lot larger than for the first one.
The net is full of warnings about using subqueries in MySQL, but it turns out that many times the developer is smarter than the MySQL optimizer. Filtering results in some manner before joining can cause major performance boosts in many cases.
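To make the comparison concrete, EXPLAIN can be put in front of both queries from the question. This is only a sketch using the table names from the question's schema; the actual output depends on the data, the indexes, and the MySQL version, so no expected output is shown.

```sql
-- Fast form: each fulltext search runs against a single table.
EXPLAIN
SELECT p.submitter_id
FROM programs p
WHERE p.id IN
    (SELECT t.object_id
     FROM tags_programs t
     WHERE MATCH (t.text) AGAINST ('china')
     UNION ALL
     SELECT d.object_id
     FROM descriptions_programs d
     WHERE MATCH (d.text) AGAINST ('china'));

-- Slow form: three tables combined first, OR applied afterwards.
EXPLAIN
SELECT p.id
FROM descriptions_programs d, tags_programs t, programs p
WHERE (d.object_id = p.id AND MATCH (d.text) AGAINST ('china'))
   OR (t.object_id = p.id AND MATCH (t.text) AGAINST ('china'));
```

In the results, compare the "type", "key", and "rows" columns per table. A fulltext access type on the tag and description tables in the first plan, against large row estimates with no key in the second, would support the explanation given above.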

If you join both tables you end up having lots of records to inspect. Just as an example, if both tables have 100,000 records, fully joining them gives you 10,000,000,000 records (10 billion!).
If you replace the OR with AND, you allow the engine to filter out all records from table descriptions_programs which don't match 'china', and only then join with tags_programs.
Anyway, that's not what you need, so I'd recommend sticking to the UNION way.

The union is the proper way to go. The join will pull in both full text indexes at once and can multiply the number of checks actually performed.

Related

Clarification of join order for creation of temporary tables

I have a large query in mysql that involves joining multiple tables together. It's too slow, so I've done "explain" and see that it's creating a temporary table, which I suspect of taking most of the execution time. I found some related information:
1. The MySQL docs describe conditions when a temporary table might be created. ("The server creates temporary tables under conditions such as these..." [Emphasis added])
2. This related SO question, "Using index, using temporary, using filesort - how to fix this?", which provides a link to the doc and applies it in a specific case.
3. This related SO question, "Order of join conditions important?", which talks about the order in which joins are evaluated.
My query does not appear to meet any of the conditions listed in the docs #1, in the order that the joins were written by me. By experimentation, however, I find that if I remove my order by clause, the temporary table is not created. That makes me look at this rule from the doc:
Evaluation of statements that contain an ORDER BY clause and a different GROUP BY clause, or for which the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue.
This is the same rule that played in the example at #2 above, but in #2, the OP explicitly had columns from multiple tables in the order by clause, so that's at least superficially different.
Moreover, when I look at the output from explain, it appears that the table that I listed first is not used first by the optimizer. Putting down a pseudo-query for example:
select * from A
join B on A.c1=B.c1
join C on A.c2=C.c2
where A.c3='value'
order by A.c4
I would say that my order by clause does use only columns from the "first table in the join queue" based on the order that I've written the query. On the other hand, the output from explain suggests that it first considers table B then A.
So here are the questions:
1. Does the quoted rule above for use of temporary tables refer to the order that I write the tables or the order that the software chooses to evaluate them?
2. If it's the order that I write them, does this mean that the order of the joins does impact performance? (Seems to contradict the claims at #3 above.)
3. If it's the order that the software chooses to evaluate them, is there any way to coerce or trick it into selecting an order that doesn't use the temporary table?
It refers to the order in which the optimiser evaluates them (join queue). The optimiser may not even be aware of the order of the tables in your sql statement.
No, it does not contradict what's been written in #3, since the answer explicitly writes (emphasis is mine):
has no effect on the result
The result and performance are two different things. Actually, there is an upvoted comment to the answer saying that
But it might affect the query plan (=> performance)
You can tell the optimiser which table to process first by using straight_join:
STRAIGHT_JOIN is similar to JOIN, except that the left table is always read before the right table. This can be used for those (few) cases for which the join optimizer puts the tables in the wrong order.
However, you need to be careful with that because you tie the optimiser's hand. See this SO topic on discussing advantages and disadvantages of straight_join.
Number of records, where criteria, indexes - they all play their part in the optimiser's decision of the processing order of tables. There is no magic bullet, you need to play around a bit and probably you can trick the optimiser to change the order of the tables.
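As an illustration only, the pseudo-query from the question could be pinned to the written table order like this. STRAIGHT_JOIN works for inner joins; the table and column names are the placeholders from the question.

```sql
-- Forces A to be read before B, and the A/B result before C.
SELECT *
FROM A
STRAIGHT_JOIN B ON A.c1 = B.c1
STRAIGHT_JOIN C ON A.c2 = C.c2
WHERE A.c3 = 'value'
ORDER BY A.c4;
```

Running EXPLAIN on this and on the plain JOIN version side by side shows whether the forced order actually removes "Using temporary" from the plan. If it does not, the hint only ties the optimiser's hands for no gain.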
select * from A
join B on A.c1=B.c1
join C on A.c2=C.c2
where A.c3='value'
order by A.c4
The optimizer will use various heuristics to decide which order to look at the tables. In this case, it will start with A because of the filters (WHERE...).
This "composite" index on A should eliminate the tmp table and filesort for ORDER BY: INDEX(c3, c4). No, that is not the same as INDEX(c3), INDEX(c4).
After getting the rows from A, either B or C could be reached into ("Nested Loop Join"). These indexes are important: B: (c1) and C: (c2).
STRAIGHT_JOIN and FORCE INDEX are usually a bad idea, and should be used only as a last resort. They may help today's query but hurt tomorrow's.
EXPLAIN FORMAT=JSON SELECT ... gives more information, sometimes even pointing out that two tmp tables are needed.
More tips: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
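A minimal sketch of the indexes suggested in this answer, using the placeholder names from the question. The index names are made up, and the statements assume none of these indexes exist yet.

```sql
-- Composite index: equality column (c3) first, then the ORDER BY column (c4).
ALTER TABLE A ADD INDEX idx_a_c3_c4 (c3, c4);

-- Indexes that let the nested loop join reach into B and C cheaply.
ALTER TABLE B ADD INDEX idx_b_c1 (c1);
ALTER TABLE C ADD INDEX idx_c_c2 (c2);

-- Check the plan again; the JSON form spells out temporary tables and filesorts.
EXPLAIN FORMAT=JSON
SELECT *
FROM A
JOIN B ON A.c1 = B.c1
JOIN C ON A.c2 = C.c2
WHERE A.c3 = 'value'
ORDER BY A.c4;
```

The column order in the composite index matters: with c3 first, the rows for c3 = 'value' are already stored in c4 order, which is what lets the sort be skipped when A is read first.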

SQL Optimization: how to JOIN a table with itself

I'm trying to optimize a SQL query and I am not sure if further optimization is possible.
Here's my query:
SELECT someColumns
FROM (((smaller_table))) AS a
INNER JOIN (((smaller_table))) AS b
ON a.someColumnProperty = b.someColumnProperty
...the problem with this approach is that my table has half a trillion records in it. In my query, you'll notice (((smaller_table))). I wrote that as an abbreviation for a SELECT statement being run on MY_VERY_LARGE_TABLE to reduce its size.
(((smaller_table))) appears twice, and the code within is exactly the same both times. There's no reason for me to run the same sub-query twice. This table is several TB and I shouldn't scan through it twice just to get the same results.
Do you have any suggestions on how I can NOT run the exact same reduction twice? I tried replacing the INNER JOIN line with INNER JOIN a AS b but got an "unrecognized table a" warning. Is there any way to store the value of a so I can reuse it?
Thoughts:
Make sure there is an index on userid and dayid.
I would ask you to define better what it is you are trying to find out.
Examples:
What is the busiest time of the week?
Who are the top 25 people who come to the gym the most often?
Who are the top 25 people who utilize the gym the most? (This is different from the one above because maybe I have a user that comes 5 times a month, but stays 5 hours per session vs a user that comes 30 times a month and stays .5 hour per session.)
Maybe doing all days in a horizontal method (day1, day2, day3) would be better visually to try to find out what you are looking for. You could easily put this into excel or libreoffice and color the days that are populated to get a visual "picture" of people who come consecutively.
It might be interesting to run this for multiple months to see what the seasonality looks like.
Alas, CTEs are not available in MySQL before version 8.0. The rough equivalent on older versions is
CREATE TABLE tmp (
INDEX(someColumnProperty)
)
SELECT ...;
But...
You can't use CREATE TEMPORARY TABLE because a temporary table can't be referenced twice in the same query. (No, I don't know why.)
Adding the INDEX (or PK or ...) during the CREATE (or afterwards) provides the very necessary key for doing the self join.
You still need to worry about DROPping the table (or otherwise dealing with it).
The choice of ENGINE for tmp depends on a number of factors. If you are sure it will be "small" and has no TEXT/BLOB, then MEMORY may be optimal.
In a Replication topology, there are additional considerations.
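Spelled out a little further, the materialize-once approach might look like the sketch below. The columns id and event_date and the WHERE filter are hypothetical stand-ins for whatever the real reducing SELECT does; only MY_VERY_LARGE_TABLE and someColumnProperty come from the question.

```sql
-- 1. Run the expensive reduction once and index the join column as it is built.
CREATE TABLE tmp (
    INDEX (someColumnProperty)
) ENGINE=InnoDB
SELECT id, someColumnProperty          -- hypothetical column list
FROM MY_VERY_LARGE_TABLE
WHERE event_date >= '2020-01-01';      -- hypothetical reducing filter

-- 2. Self-join the small table; a regular table may be referenced twice.
SELECT a.id AS id_a, b.id AS id_b
FROM tmp AS a
INNER JOIN tmp AS b
        ON a.someColumnProperty = b.someColumnProperty
WHERE a.id < b.id;                     -- skip self-pairs and mirrored pairs

-- 3. Clean up.
DROP TABLE tmp;
```

On MySQL 8.0 or later (an assumption about the server version, since this answer predates it), a common table expression can express the same thing in one statement, and the server may materialize it once even though it is referenced twice:

```sql
WITH smaller_table AS (
    SELECT id, someColumnProperty
    FROM MY_VERY_LARGE_TABLE
    WHERE event_date >= '2020-01-01'   -- same hypothetical filter as above
)
SELECT a.id AS id_a, b.id AS id_b
FROM smaller_table AS a
INNER JOIN smaller_table AS b
        ON a.someColumnProperty = b.someColumnProperty
WHERE a.id < b.id;
```

The explicit table gives control over the index and engine; the CTE avoids the cleanup step. Checking EXPLAIN is the way to confirm how the CTE is actually handled.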

query optimizer operator choice - nested loops vs hash match (or merge)

One of my stored procedures was taking too long to execute. Taking a look at the query execution plan, I was able to locate the operation that was taking too long. It was a nested loop physical operator that had an outer table (65991 rows) and an inner table (19223 rows). On the nested loop it showed estimated rows = 1,268,544,993 (multiplying 65991 by 19223).
I read a few articles on physical operators used for joins and got a bit confused whether nested loop or hash match would have been better for this case. From what I could gather:
Hash Match - is used by the optimizer when no useful indexes are available, one table is substantially smaller than the other, or the tables are not sorted on the join columns. Also, a hash match might indicate that a more efficient join method (nested loops or merge join) could be used.
Question: Would hash match be better than nested loops in this scenario?
Thanks
ABSOLUTELY. A hash match would be a huge improvement. Creating the hash on the smaller 19,223 row table then probing into it with the larger 65,991 row table is a much smaller operation than the nested loop requiring 1,268,544,993 row comparisons.
The only reason the server would choose the nested loops is that it badly underestimated the number of rows involved. Do your tables have statistics on them, and if so, are they being updated regularly? Statistics are what enable the server to choose good execution plans.
If you've properly addressed statistics and are still having a problem you could force it to use a HASH join like so:
SELECT *
FROM
TableA A -- the smaller table
LEFT HASH JOIN TableB B -- the larger table
ON A.JoinKey = B.JoinKey -- JoinKey is a placeholder for the real join column(s)
Please note that the moment you do this it will also force the join order. This means you have to arrange all your tables correctly so that their join order makes sense. Generally you would examine the execution plan the server already has and alter the order of your tables in the query to match. If you're not familiar with how to do this, the basics are that each "left" input comes first, and in graphical execution plans, the left input is the upper one. A complex join involving many tables may have to group joins together inside parentheses, or use RIGHT JOIN in order to get the execution plan to be optimal (swap left and right inputs, but introduce the table at the correct point in the join order).
It is generally best to avoid using join hints and forcing join order, so do whatever else you can first! You could look into the indexes on the tables, fragmentation, reducing column sizes (such as using varchar instead of nvarchar where Unicode is not required), or splitting the query into parts (insert to a temp table first, then join to that).
I would not recommend trying to "fix" the plan by forcing the hints in one direction or another. Instead, you need to look to your indexes, statistics and the T-SQL code to understand why you have a Table Spool loading up 1.2 billion rows from 19,000.
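In the spirit of "statistics first, hints last", a cautious sequence in T-SQL might look like the following sketch. The dbo schema and the JoinKey column are assumptions; TableA and TableB are the placeholder names used in the answer above.

```sql
-- Step 1: refresh statistics so the row estimates have a chance of being right.
UPDATE STATISTICS dbo.TableA WITH FULLSCAN;
UPDATE STATISTICS dbo.TableB WITH FULLSCAN;

-- Step 2 (last resort): a query-level hint asks for hash joins
-- without also fixing the join order the way an inline join hint does.
SELECT *
FROM dbo.TableA AS A
LEFT JOIN dbo.TableB AS B
       ON A.JoinKey = B.JoinKey   -- hypothetical join column
OPTION (HASH JOIN);
```

If the plan changes to a hash match after step 1 alone, the hint in step 2 is unnecessary and is better left out.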

How does JOIN work in MySQL?

Although the question title duplicates many discussions, I did not find an answer to this question:
Consider a simple join for normalized tables of tags as
SELECT tags.tag
FROM tags
INNER JOIN tag_map
ON tags.tag_id=tag_map.tag_id
WHERE article_id=xx
Does JOIN work with the entire tables of tags and tag_map then filter the created (JOINed) table to find rows with WHERE clause for the article id
OR JOIN will only join rows of tag_map table in which article_id=xx ?
The latter method should be quite faster!
It will do the former, to my knowledge WHERE's are explicitly performed on the resulting JOINed table. (Disclaimer: MySQL may optimize this in some cases, I don't know).
To force the latter behaviour and execute the WHERE first, you can add an extra filter to your JOIN ON statement:
SELECT tags.tag
FROM tags
INNER JOIN tag_map
ON tags.article_id=xx
AND tags.tag_id=tag_map.tag_id
WHERE article_id=xx
The joins work on ONLY those records qualified by the WHERE clause of the first table returning records. That said, you are doing a join to tag_map, but your WHERE clause does not specify which alias the "Article_ID" is associated with. It's typically better to always qualify your fields with either the table name or the alias they are coming from.
So, if article_id is coming from TAGS, then it will first look at that list as the primary set of records, and optimized with index if one so exists and return a small set. From that, the join is applied to the tag_map and will grab all records that match the join "ON" condition.
Just to clarify something. If the JOIN was applied FIRST, before the WHERE clause optimization, queries would take forever. The join basically PREPARES the relationship before the record selection actually occurs. Hence, the execution plan that shows the indexes that would be used.
It depends on the engine. Earlier versions of many database engines would generate the join results first and then filter. Newer versions generate an execution plan that achieves the fastest results. You would have to test with your db engine, reviewing execution plans for your version/database, to find "what is best".
Assuming it is simple or inner join:
The answer is: in the relational model, the first answer is correct. It creates a table that contains every row from the first table crossed with every row from the second table, so if you have N rows in the first and M in the second, it will create a table with NxM rows and then eliminate those where the conditions do not match.
Now, that is the mathematical model, but in implementation, depending on the engine, it will use some smarter way, typically choosing the table that seems faster and joining from there using, hopefully, an indexed join field. But this differs between engines: there is a lot of documentation on that (google it), and some people, the poster of this answer included, are paid to optimize join queries...
In case of MYSQL (just noticed the tag) you can use following syntax:
EXPLAIN [EXTENDED] SELECT select_options
as explained here, and MySQL will tell you how it would execute such a query. It is faster than reading the documentation.
You can always check Execution plan to see how your query gets executed step by step. In MySQL I don't know whether it can be presented graphically using any third party tools (as you can on MS SQL out of the box with Management Studio) but you can still check it using explain language constructs. Check documentation.
Not knowing your table schema
If article_id is of table tags then tag_map table isn't scanned at all unless join column in FK table is nullable.
If article_id is indexed (ie. primary key) then index is being scanned...
etc...
What I'd like to say is that we'd need your table schema definition to tell you some details. We can't know how your schema works.
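Since the answers turn on which table article_id belongs to and which indexes exist, here is one way to check on a real server. It is a sketch under two assumptions: article_id is a column of tag_map, and tags.tag_id is the primary key of tags. The index name and the value 42 are made up.

```sql
-- Assumed index: lets MySQL find the rows for one article, then read tag_id from the index.
ALTER TABLE tag_map ADD INDEX idx_article_tag (article_id, tag_id);

-- Qualify article_id with its table so there is no ambiguity.
EXPLAIN
SELECT tags.tag
FROM tags
INNER JOIN tag_map ON tags.tag_id = tag_map.tag_id
WHERE tag_map.article_id = 42;
```

If the plan lists tag_map first with a "ref" lookup on idx_article_tag and then tags with an "eq_ref" lookup on its primary key, the filter is being applied before the join rather than after a full cross product.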

Sorting items filtered by tag

I want to implement a very common feature - filtering some items by tag. There are many tutorials on the internet with examples of how to do it. The query is quite simple and fast (assuming proper indexes exist).
But usually the filtered items need to be sorted by some field. For example, when you filter questions by tag on SO, you get your results sorted.
To accomplish this task (assuming we need to sort by rating), one could write:
SELECT item.id FROM item
INNER JOIN taggeditem ON taggeditem.item_id = item.id
WHERE
taggeditem.tag_id = 1234
ORDER BY item.rating DESC
We have indexes (taggeditem.tag_id), (item.id), (item.rating)
The problem with this query is that mysql can't use index on item.rating, because the key used to fetch the rows is not the same as the one used in the ORDER BY (MySQL: ORDER BY Optimization). This leads to using a temporary table and filesort, which in turn leads to slow execution time.
The solution I came up with is to denormalize sort field to the taggeditem table, so that I could create index (tag_id, item_rating) on taggeditem.
I've searched for similar questions at SO, and found only this one: Mysql slow query: INNER JOIN + ORDER BY causes filesort. The solution was the same.
So, I want to ask, is this a common solution to this problem? Is it a good practice to denormalize a bunch of sort fields to taggeditem, such as created, rating? At SO you can sort using 4 different parameters (newest, hot, votes, active) - does it mean that they denormalized fields which are used to sort results?
Are there any alternatives to this solution?
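For reference, the denormalization described in the question could be sketched as follows. The INT type for the copied rating, the index name, and the LIMIT are assumptions; table and column names otherwise follow the question.

```sql
-- Copy the sort field next to the filter field.
ALTER TABLE taggeditem ADD COLUMN item_rating INT NOT NULL DEFAULT 0;

UPDATE taggeditem ti
JOIN item i ON i.id = ti.item_id
SET ti.item_rating = i.rating;

-- One index now serves both the WHERE and the ORDER BY.
ALTER TABLE taggeditem ADD INDEX idx_tag_rating (tag_id, item_rating);

-- No join is needed to get the ids in rating order.
SELECT ti.item_id
FROM taggeditem ti
WHERE ti.tag_id = 1234
ORDER BY ti.item_rating DESC
LIMIT 20;
```

The cost is that item_rating has to be kept in sync whenever item.rating changes, for example from application code or a trigger, and that every additional sort field needs its own copied column and index.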
There is a standard alternative - change server system variables.
For example, you can experiment with sort_buffer_size value (default 2MB).
More about it.
As soon as you're using a JOIN, and filter out on the joined table, you're stuck with bad performance.
As you said, the only way to avoid this is to create a denormalized table.
For SO's sorts, I think they have no such issue: they just have to sort answers by a column of the answers table (something like SELECT * FROM answers WHERE question_id = 1234 ORDER BY answer_date, with an index on (question_id, answer_date)).
I'm also looking for such solutions, with multi-valued columns, and that's really difficult (denormalized data would be huge, as it needs to cross all values in the multi-valued columns)