MySQL Match Against w/ additional condition: Use subquery? - mysql

Suppose we have a table such as
CREATE TABLE test{
title VARCHAR(32),
city VARCHAR(32),
description TEXT
...
And in a query say we have
SELECT * FROM test WHERE MATCH(title, description) AGAINST('xyz' IN NATURAL LANGUAGE MODE) AND city = 'ABC';
Will MySQL know to use the "city" condition first, or should we be more explicit and use a subquery?

From looking at the code, the MATCH expression will be evaluated already in the query optimization phase. This means that all the rows containing 'xyz' will be identified before the condition on city is considered. (At least, this is how I understand it to for work for InnoDB. I do not know the details of how this is implemented in MyISAM.) During query execution, when the WHERE clause is evaluated, expressions are evaluated from left to right. (This is the current implementation, and may change in future versions.) Since MATCH scores have already been computed, at this point one just evaluates whether they are non-zero.
If your city column is indexed, the query optimizer may choose to use this index to scan only rows from the given city, and compare the MATCH scores for just these rows. However, all rows containing 'xyz' are still first identified. The EXPLAIN output for the query will show if the index is used.
I doubt that using a subquery will help anything. If the subquery is correlated, you may even risk that the full-text search is performed multiple times.

Related

Index when using OR in query

What is the best way to create index when I have a query like this?
... WHERE (user_1 = '$user_id' OR user_2 = '$user_id') ...
I know that only one index can be used in a query so I can't create two indexes, one for user_1 and one for user_2.
Also could solution for this type of query be used for this query?
WHERE ((user_1 = '$user_id' AND user_2 = '$friend_id') OR (user_1 = '$friend_id' AND user_2 = '$user_id'))
MySQL has a hard time with OR conditions. In theory, there's an index merge optimization that #duskwuff mentions, but in practice, it doesn't kick in when you think it should. Besides, it doesn't give as performance as a single index when it does.
The solution most people use to work around this is to split up the query:
SELECT ... WHERE user_1 = ?
UNION
SELECT ... WHERE user_2 = ?
That way each query will be able to use its own choice for index, without relying on the unreliable index merge feature.
Your second query is optimizable more simply. It's just a tuple comparison. It can be written this way:
WHERE (user_1, user_2) IN (('$user_id', '$friend_id'), ('$friend_id', '$user_id'))
In old versions of MySQL, tuple comparisons would not use an index, but since 5.7.3, it will (see https://dev.mysql.com/doc/refman/5.7/en/row-constructor-optimization.html).
P.S.: Don't interpolate application code variables directly into your SQL expressions. Use query parameters instead.
I know that only one index can be used in a query…
This is incorrect. Under the right circumstances, MySQL will routinely use multiple indexes in a query. (For example, a query JOINing multiple tables will almost always use at least one index on each table involved.)
In the case of your first query, MySQL will use an index merge union optimization. If both columns are indexed, the EXPLAIN output will give an explanation along the lines of:
Using union(index_on_user_1,index_on_user_2); Using where
The query shown in your second example is covered by an index on (user_1, user_2). Create that index if you plan on running those queries routinely.
The two cases are different.
At the first case both columns needs to be searched for the same value. If you have a two column index (u1,u2) then it may be used at the column u1 as it cannot be used at column u2. If you have two indexes separate for u1 and u2 probably both of them will be used. The choice comes from statistics based on how many rows are expected to be returned. If returned rows expected few an index seek will be selected, if the appropriate index is available. If the number is high a scan is preferable, either table or index.
At the second case again both columns need to be checked again, but within each search there are two sub-searches where the second sub-search will be upon the results of the first one, due to the AND condition. Here it matters more and two indexes u1 and u2 will help as any field chosen to be searched first will have an index. The choice to use an index is like i describe above.
In either case however every OR will force 1 more search or set of searches. So the proposed solution of breaking using union does not hinder more as the table will be searched x times no matter 1 select with OR(s) or x selects with union and no matter index selection and type of search (seek or scan). As a result, since each select at the union get its own execution plan part, it is more likely that (single column) indexes will be used and finally get all row result sets from all parts around the OR(s). If you do not want to copy a large select statement to many unions you may get the primary key values and then select those or use a view to be sure the majority of the statement is in one place.
Finally, if you exclude the union option, there is a way to trick the optimizer to use a single index. Create a double index u1,u2 (or u2,u1 - whatever column has higher cardinality goes first) and modify your statement so all OR parts use all columns:
... WHERE (user_1 = '$user_id' OR user_2 = '$user_id') ...
will be converted to:
... WHERE ((user_1 = '$user_id' and user_2=user_2) OR (user_1=user_1 and user_2 = '$user_id')) ...
This way a double index (u1,u2) will be used at all times. Please not that this will work if columns are nullable and bypassing this with isnull or coalesce may cause index not to be selected. It will work with ansi nulls off however.

Is grouped case/when priority always first to last case?

If you have a parameter that fulfills more than one WHEN in a CASE statement, regardless of query order, is the first THEN which fulfills the condition always delivered and the rest of the comparison disregarded? In other words, is each further WHEN clause in the structure equivalent to an ELSEIF ternary? Are there ever exceptions to this depending on random result order?
CREATE TABLE test (non_unique_id INT(11) UNSIGNED, a BOOL, b BOOL);
INSERT INTO test VALUES(1,0,1),(1,1,0),(2,0,1);
SELECT non_unique_id,
CASE
WHEN MAX(a=1) THEN 'a'
WHEN MAX(b=1) THEN 'b'
END AS ttype
FROM test GROUP BY non_unique_id;
Returns
non_unique_id ttype
1 a
2 b
That's the expected result. My question is whether CASE/WHEN can effectively be used as a way to order a subquery prior to grouping, reliably, under those conditions.
It is not entirely clear what you are asking, but I sense that you are concerned that perhaps the MAX function would encounter e.g. a=1, and then report 'a', before checking the rest of the column to see if perhaps there is an a=2 somewhere.
As far as I know, the aggregate functions in SQL must examine all records for each group (or for the entire table, if GROUP BY is not used). The exception to this might be if the database can use an index, in such a way that not all records needs to be examined. But even in the case of an index, the database would only avoid examining all records if it could be certain that doing so were not necessary.

Check if MySQL Table is empty: COUNT(*) is zero vs. LIMIT(0,1) has a result?

This is a simple question about efficiency specifically related to the MySQL implementation. I want to just check if a table is empty (and if it is empty, populate it with the default data). Would it be best to use a statement like SELECT COUNT(*) FROM `table` and then compare to 0, or would it be better to do a statement like SELECT `id` FROM `table` LIMIT 0,1 then check if any results were returned (the result set has next)?
Although I need this for a project I am working on, I am also interested in how MySQL works with those two statements and whether the reason people seem to suggest using COUNT(*) is because the result is cached or whether it actually goes through every row and adds to a count as it would intuitively seem to me.
You should definitely go with the second query rather than the first.
When using COUNT(*), MySQL is scanning at least an index and counting the records. Even if you would wrap the call in a LEAST() (SELECT LEAST(COUNT(*), 1) FROM table;) or an IF(), MySQL will fully evaluate COUNT() before evaluating further. I don't believe MySQL caches the COUNT(*) result when InnoDB is being used.
Your second query results in only one row being read, furthermore an index is used (assuming id is part of one). Look at the documentation of your driver to find out how to check whether any rows have been returned.
By the way, the id field may be omitted from the query (MySQL will use an arbitrary index):
SELECT 1 FROM table LIMIT 1;
However, I think the simplest and most performant solution is the following (as indicated in Gordon's answer):
SELECT EXISTS (SELECT 1 FROM table);
EXISTS returns 1 if the subquery returns any rows, otherwise 0. Because of this semantic MySQL can optimize the execution properly.
Any fields listed in the subquery are ignored, thus 1 or * is commonly written.
See the MySQL Manual for more info on the EXISTS keyword and its use.
It is better to do the second method or just exists. Specifically, something like:
if exists (select id from table)
should be the fastest way to do what you want. You don't need the limit; the SQL engine takes care of that for you.
By the way, never put identifiers (table and column names) in single quotes.

MySQL, Selecting based on subquery using MIN() and GROUP BY

First, to describe my data set. I am using SNOMED CT codes and trying to make a usable list out of them. The relevant columns are rowId, conceptID, and Description. rowId is unique, the other two are not. I want to select a very specific subset of those codes:
SELECT *
FROM SnomedCode
WHERE LENGTH(Description)=MIN(LENGTH(Description))
GROUP BY conceptID
The result should be a list of 400,000 unique conceptIDs (out of 1.4 million) and the shortest applicable description for each code. The query above is obviously malformed (and would only return rows where LENGTH(description)=1 because the shortest description in the table is 1 character long.) What am I missing?
SELECT conceptID, MAX(Description)
FROM SnomedCode A
WHERE LENGTH(Description)=(SELECT MIN(LENGTH(B.Description))
FROM SnomedCode B
WHERE B.conceptID = A.conceptID)
GROUP BY conceptID
The "GROUP BY" and "MAX(Description)" are not really necessary, but were added as a tiebreaker for different descriptions with same length for a conceptID, as the requirements include unique conceptIDs.
MAX was chosen to penalize possible leading spaces. Otherwise MIN(Description) works as well.
BTW, this query takes quite some time if you have over million records. Test it with "AND conceptID in (list-of-conceptIDs-to-test)" added in the WHERE clause.
The table SnomedCode must have an index on conceptID. If not, the query will take forever.

MySQL: SELECT(x) WHERE vs COUNT WHERE?

This is going to be one of those questions but I need to ask it.
I have a large table which may or may not have one unique row. I therefore need a MySQL query that will just tell me TRUE or FALSE.
With my current knowledge, I see two options (pseudo code):
[id = primary key]
OPTION 1:
SELECT id FROM table WHERE x=1 LIMIT 1
... and then determine in PHP whether a result was returned.
OPTION 2:
SELECT COUNT(id) FROM table WHERE x=1
... and then just use the count.
Is either of these preferable for any reason, or is there perhaps an even better solution?
Thanks.
If the selection criterion is truly unique (i.e. yields at most one result), you are going to see massive performance improvement by having an index on the column (or columns) involved in that criterion.
create index my_unique_index on table(x)
If you want to enforce the uniqueness, that is not even an option, you must have
create unique index my_unique_index on table(x)
Having this index, querying on the unique criterion will perform very well, regardless of minor SQL tweaks like count(*), count(id), count(x), limit 1 and so on.
For clarity, I would write
select count(*) from table where x = ?
I would avoid LIMIT 1 for two other reasons:
It is non-standard SQL. I am not religious about that, use the MySQL-specific stuff where necessary (i.e. for paging data), but it is not necessary here.
If for some reason, you have more than one row of data, that is probably a serious bug in your application. With LIMIT 1, you are never going to see the problem. This is like counting dinosaurs in Jurassic Park with the assumption that the number can only possibly go down.
AFAIK, if you have an index on your ID column both queries will be more or less equal performance. The second query will need 1 less line of code in your program but that's not going to make any performance impact either.
Personally I typically do the first one of selecting the id from the row and limiting to 1 row. I like this better from a coding perspective. Instead of having to actually retrieve the data, I just check the number of rows returned.
If I were to compare speeds, I would say not doing a count in MySQL would be faster. I don't have any proof, but my guess would be that MySQL has to get all of the rows and then count how many there are. Altough...on second thought, it would have to do that in the first option as well so the code will know how many rows there are as well. But since you have COUNT(id) vs COUNT(*), I would say it might be slightly slower.
Intuitively, the first one could be faster since it can abort the table(or index) scan when finds the first value. But you should retrieve x not id, since if the engine it's using an index on x, it doesn't need to go to the block where the row actually is.
Another option could be:
select exists(select 1 from mytable where x = ?) from dual
Which already returns a boolean.
Typically, you use group by having clause do determine if there are duplicate rows in a table. If you have a table with id and a name. (Assuming id is the primary key, and you want to know if name is unique or repeated). You would use
select name, count(*) as total from mytable group by name having total > 1;
The above will return the number of names which are repeated and the number of times.
If you just want one query to get your answer as true or false, you can use a nested query, e.g.
select if(count(*) >= 1, True, False) from (select name, count(*) as total from mytable group by name having total > 1) a;
The above should return true, if your table has duplicate rows, otherwise false.