Is grouped case/when priority always first to last case? - mysql

If you have a parameter that fulfills more than one WHEN in a CASE statement, regardless of query order, is the first THEN which fulfills the condition always delivered and the rest of the comparison disregarded? In other words, is each further WHEN clause in the structure equivalent to an ELSEIF ternary? Are there ever exceptions to this depending on random result order?
CREATE TABLE test (non_unique_id INT(11) UNSIGNED, a BOOL, b BOOL);
INSERT INTO test VALUES(1,0,1),(1,1,0),(2,0,1);
SELECT non_unique_id,
CASE
WHEN MAX(a=1) THEN 'a'
WHEN MAX(b=1) THEN 'b'
END AS ttype
FROM test GROUP BY non_unique_id;
Returns
non_unique_id ttype
1 a
2 b
That's the expected result. My question is whether CASE/WHEN can effectively be used as a way to order a subquery prior to grouping, reliably, under those conditions.

It is not entirely clear what you are asking, but I sense that you are concerned that perhaps the MAX function would encounter e.g. a=1, and then report 'a', before checking the rest of the column to see if perhaps there is an a=2 somewhere.
As far as I know, the aggregate functions in SQL must examine all records for each group (or for the entire table, if GROUP BY is not used). The exception to this might be if the database can use an index in such a way that not all records need to be examined. But even in the case of an index, the database would only avoid examining all records if it could be certain that doing so was not necessary.
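As for the ordering question itself: a searched CASE returns the result of the first WHEN whose condition is true, and the aggregates are computed over the whole group before the CASE is evaluated, so row order should not matter. Here is a minimal sketch that checks this with the question's data inserted in two different orders. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example runs anywhere; the helper name ttype_by_id is made up for illustration).

```python
import sqlite3


def ttype_by_id(rows):
    """Load rows into a fresh in-memory table and run the question's query."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in, not MySQL
    conn.execute(
        "CREATE TABLE test (non_unique_id INTEGER, a INTEGER, b INTEGER)"
    )
    conn.executemany("INSERT INTO test VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT non_unique_id,
               CASE
                   WHEN MAX(a = 1) THEN 'a'   -- checked first
                   WHEN MAX(b = 1) THEN 'b'   -- only reached if the first WHEN is false
               END AS ttype
        FROM test
        GROUP BY non_unique_id
        ORDER BY non_unique_id
        """
    ).fetchall()
    conn.close()
    return result


rows = [(1, 0, 1), (1, 1, 0), (2, 0, 1)]
forward = ttype_by_id(rows)                   # same insert order as the question
backward = ttype_by_id(list(reversed(rows)))  # reversed insert order
print(forward)
print(backward)
```

Group 1 satisfies both WHEN conditions, yet it is reported as 'a' in both runs because that WHEN comes first.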

Related

MySQL Match Against w/ additional condition: Use subquery?

Suppose we have a table such as
CREATE TABLE test (
title VARCHAR(32),
city VARCHAR(32),
description TEXT
...
And in a query say we have
SELECT * FROM test WHERE MATCH(title, description) AGAINST('xyz' IN NATURAL LANGUAGE MODE) AND city = 'ABC';
Will MySQL know to use the "city" condition first, or should we be more explicit and use a subquery?
From looking at the code, the MATCH expression is evaluated already in the query optimization phase. This means that all the rows containing 'xyz' will be identified before the condition on city is considered. (At least, this is how I understand it to work for InnoDB. I do not know the details of how this is implemented in MyISAM.) During query execution, when the WHERE clause is evaluated, expressions are evaluated from left to right. (This is the current implementation, and may change in future versions.) Since the MATCH scores have already been computed, at this point one just evaluates whether they are non-zero.
If your city column is indexed, the query optimizer may choose to use this index to scan only rows from the given city, and compare the MATCH scores for just these rows. However, all rows containing 'xyz' are still first identified. The EXPLAIN output for the query will show if the index is used.
I doubt that using a subquery will help anything. If the subquery is correlated, you may even risk that the full-text search is performed multiple times.

MySQL - Poor performance in a select from a simple table

I have a very simple table with three columns:
- A BigINT,
- Another BigINT,
- A string.
The first two columns are defined as INDEX and there are no repetitions. Moreover, both columns have values in a growing order.
The table has nearly 400K records.
I need to select the string when a value lies between the values of columns 1 and 2; in other words:
SELECT MyString
FROM MyTable
WHERE Col_1 <= Test_Value
AND Test_Value <= Col_2 ;
The result may be either a NOT FOUND or a single value.
The query takes nearly a whole second while, intuitively (imagining a binary search throughout an array), it should take just a small fraction of a second.
I checked the index type and it is BTREE for both columns (1 and 2).
Any idea how to improve performance?
Thanks in advance.
EDIT:
The explain reads:
Select type: Simple,
Type: Range,
Possible Keys: PRIMARY
Key: Primary,
Key Length: 8,
Rows: 441,
Filtered: 33.33,
Extra: Using where.
If I understand your obfuscation correctly, you have a start and end value such as a datetime or an ip address in a pair of columns? And you want to see if your given datetime/ip is in the given range?
Well, there is no way to generically optimize such a query on such a table. The optimizer does not know whether a given value could be in multiple ranges. Or, put another way, whether the ranges are disjoint.
So, the optimizer will, at best, use an index starting with either start or end and scan half the table. Not efficient.
Are the ranges non-overlapping? IP Addresses
What can you say about the result? Perhaps a kludge like this will work: SELECT ... WHERE Col_1 <= Test_Value ORDER BY Col_1 DESC LIMIT 1.
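To illustrate that kludge: if the ranges never overlap, the row with the largest Col_1 that does not exceed the test value is the only possible candidate, so a single index lookup finds it, and checking Col_2 afterwards tells you whether the value really falls inside the range. A rough sketch, again with Python's sqlite3 standing in for MySQL; the sample ranges and the lookup helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not MySQL
conn.execute(
    "CREATE TABLE MyTable (Col_1 INTEGER PRIMARY KEY, Col_2 INTEGER, MyString TEXT)"
)
# Hypothetical, non-overlapping ranges (note the gap between 29 and 40).
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?)",
    [(10, 19, "first"), (20, 29, "second"), (40, 49, "third")],
)


def lookup(test_value):
    """Return MyString for the range containing test_value, or None."""
    row = conn.execute(
        "SELECT Col_2, MyString FROM MyTable "
        "WHERE Col_1 <= ? ORDER BY Col_1 DESC LIMIT 1",
        (test_value,),
    ).fetchone()
    # The candidate only counts if the value is not past the end of its range.
    if row is None or test_value > row[0]:
        return None
    return row[1]


print(lookup(25), lookup(35), lookup(5))
```

This only works when the ranges are disjoint; with overlapping ranges the closest start value is not necessarily the only match.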
Your query, rewritten with shorter identifiers, is this
SELECT s FROM t WHERE t.low <= v AND v <= t.high
To satisfy this query using indexes would go like this: First we must search a table or index for all rows matching the first of these criteria
t.low <= v
We can think of that as a half-scan of a BTREE index. It starts at the beginning and stops when it gets to v.
It requires another half-scan in another index to satisfy v <= t.high. It then requires a merge of the two resultsets to identify the rows matching both criteria. The problem is, the two resultsets to merge are large, and they're almost entirely non-overlapping.
So, the query planner probably should just choose a full table scan instead to satisfy your criteria. That's especially true in the case of MySQL, where the query planner isn't very good at using more than one index.
You may, or may not, be able to speed up this exact query with a compound index on (low, high, s) -- with your original column names (Col_1, Col_2, MyString). This is called a covering index and allows MySQL to satisfy the query completely from the index. It sometimes helps performance. (It would be easier to guess whether this will help if the exact definition of your table were available; the efficiency of covering indexes depends on stuff like other indexes, primary keys, column size, and so forth. But you've chosen minimal disclosure for that information.)
What will really help here? Rethinking your algorithm could do you a lot of good. It seems you're trying to retrieve rows where a test point v lies in the range [t.low, t.high]. Does your application offer an a-priori limit on the width of the range? That is, is there a known maximum value of t.high - t.low? If so, let's call that value maxrange. Then you can rewrite your query like this:
SELECT s
FROM t
WHERE t.low BETWEEN v-maxrange AND v
AND t.low <= v AND v <= t.high
When maxrange is available we can add the col BETWEEN const1 AND const2 clause. That turns into an efficient range scan on an index on low. In that case, the covering index I mentioned above will certainly accelerate this query.
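A small sketch of the maxrange rewrite, using Python's sqlite3 as a stand-in for MySQL. The sample rows are made up, and maxrange is computed from the data here only for convenience; in a real application it would have to be a limit you can actually guarantee.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not MySQL
conn.execute("CREATE TABLE t (low INTEGER, high INTEGER, s TEXT)")
# Compound (covering) index as suggested above.
conn.execute("CREATE INDEX t_low_high_s ON t (low, high, s)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(10, 19, "first"), (20, 29, "second"), (40, 49, "third")],  # hypothetical data
)

# Assumption for the demo: the widest range in the table is the known limit.
maxrange = conn.execute("SELECT MAX(high - low) FROM t").fetchone()[0]


def find(v):
    """Rows whose [low, high] range contains v, bounded by maxrange on low."""
    return conn.execute(
        "SELECT s FROM t "
        "WHERE low BETWEEN :v - :maxrange AND :v "  # narrow range scan on low
        "AND :v <= high",                           # then check the upper bound
        {"v": v, "maxrange": maxrange},
    ).fetchall()


print(maxrange, find(25), find(35))
```

The BETWEEN clause limits the scan to starting points within maxrange of v instead of everything at or below v.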
Read this. http://use-the-index-luke.com/
Well... I found a suitable solution for me (not sure you guys will like it but, as stated, it works for me).
I simply partitioned my 400K records into a number of tables and created a simple table that serves as a selector:
The selector table holds the minimal value of the first column for each partition along with a simple index (i.e. 1, 2, ...).
I then use the following to get the index of the table that is supposed to contain the searched-for range:
SELECT Table_Index
FROM tbl_selector
WHERE start_range <= Test_Val
ORDER BY start_range DESC LIMIT 1 ;
This will give me the Index of the table I wish to select from.
I then have a CASE on the retrieved Index to select the correct partition table from which to perform the actual search.
(I guess that more elegant would be to use Dynamic SQL, but will take care of that later; for now just wanted to test the approach).
The result is that I get the response well below a second (~0.08) and it is uniform regardless of the number being used for test. This, by the way, was not the case with the previous approach: there, if the number was "close" to the beginning of the table, the result was produced quite fast; if, on the other hand, the record was near the end of the table, it would take several seconds to complete.
[By the way, I assume you understand what I mean by beginning and end of the table]
Again, I'm sure people might dislike this, but it does the job for me.
Thank you all for the effort to assist!!

Where clause with one column and multiple criteria returning one row instead of 13

I have a simple query with a few rows and multiple criteria in the where clause but it is only returning one row instead of 13. No joins and the syntax was triple checked and appears to be free of errors.
Query:
select column1, column2, column3
from mydb
where onecolumn in (number1, number2....number13)
Results:
returns one row of data associated with a random number in the where clause
spent a big part of the day trying to figure this one out and am now out of ideas. Please help...
Absent a more detailed test case, and the actual SQL statement that is actually running, this question cannot be answered. Here are some "ideas"...
Our first guess is that the rows you think are going to satisfy the predicates aren't actually satisfying all of the conditions.
Our second guess is that you've got an aggregate expression (COUNT(), MAX(), SUM()) in the SELECT list that's causing an implicit GROUP BY. This is a common "gotcha"... the non-standard MySQL extension to GROUP BY which allows non-aggregates to appear in the SELECT list, which are not also included as expressions in the GROUP BY clause. This same gotcha appears when the GROUP BY clause is omitted entirely, and an aggregate is included in the SELECT list.
But the question doesn't make any mention of an aggregate expression in the SELECT list.
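For reference, this is roughly what the second guess looks like in practice. The sketch uses Python's sqlite3 as a stand-in, since SQLite also tolerates bare columns next to an aggregate; MySQL behaves similarly only when ONLY_FULL_GROUP_BY is disabled and otherwise rejects the query. The table contents are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not MySQL
conn.execute("CREATE TABLE mydb (onecolumn INTEGER, column1 TEXT)")
conn.executemany(
    "INSERT INTO mydb VALUES (?, ?)",
    [(1, "x"), (2, "y"), (3, "z")],  # hypothetical rows
)

# No aggregate: one output row per matching row.
plain = conn.execute(
    "SELECT onecolumn, column1 FROM mydb WHERE onecolumn IN (1, 2, 3)"
).fetchall()

# An aggregate in the SELECT list and no GROUP BY: everything collapses into
# one row, and the non-aggregated columns come from an arbitrary matching row.
collapsed = conn.execute(
    "SELECT onecolumn, column1, COUNT(*) FROM mydb WHERE onecolumn IN (1, 2, 3)"
).fetchall()

print(len(plain), len(collapsed), collapsed[0][2])
```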
Our third guess is another issue that beginners frequently overlook: the order of precedence of operations, especially AND and OR. For example, consider the expressions:
a AND b OR c
a AND ( b OR c )
( a AND b ) OR c
consider those while we sing-along, Sesame Street style,...: "One of these things is not like the others, one of these things just doesn't belong..."
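A quick way to check which one is the odd one out is to plug in values. The sketch below uses a = 0, b = 0, c = 1 and runs the three expressions through Python's sqlite3 (a stand-in here; AND binds tighter than OR in MySQL as well).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not MySQL

# a = 0, b = 0, c = 1
unparenthesized, and_first, or_first = conn.execute(
    "SELECT 0 AND 0 OR 1, (0 AND 0) OR 1, 0 AND (0 OR 1)"
).fetchone()

print(unparenthesized, and_first, or_first)
```

The unparenthesized form agrees with ( a AND b ) OR c, so a AND ( b OR c ) is the one that does not belong.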
A fourth guess... if it wasn't for the row being returned having a value of onecolumn as a random number in the IN list... if it was instead the first number in the IN list, we'd be very suspicious that the IN list actually contains a single string value that looks like a list of values, but is actually not.
The two expressions in the SELECT list look very similar, but they are very different:
SELECT t.n IN (2,3,5,7) AS n_in_list
, t.n IN ('2,3,5,7') AS n_in_string
FROM ( SELECT 2 AS n
UNION ALL SELECT 3
UNION ALL SELECT 5
) t
The first expression is comparing n to each value in a list of four values.
The second expression is equivalent to t.n IN (2).
This is a frequent trip up when neophytes are dynamically creating SQL text, thinking that they can pass in a string value and that MySQL will see the commas within the string as part of the SQL statement.
(But this doesn't explain how the one row returned could match a seemingly random value in the list.)
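If the SQL text is being built dynamically, the usual way around this is to generate one placeholder per value rather than binding a single comma-separated string. A minimal sketch with Python's sqlite3 as a stand-in; the table and values are invented, and the placeholder syntax is an assumption that varies by driver (many MySQL drivers use %s instead of ?).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not MySQL
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(2,), (3,), (5,), (11,)])

wanted = [2, 3, 5, 7]

# One placeholder per value: "?, ?, ?, ?"
placeholders = ", ".join("?" for _ in wanted)
as_list = conn.execute(
    f"SELECT n FROM t WHERE n IN ({placeholders}) ORDER BY n", wanted
).fetchall()

# A single bound string is one value, not four.
as_string = conn.execute(
    "SELECT n FROM t WHERE n IN (?) ORDER BY n", ("2,3,5,7",)
).fetchall()

print(as_list)
print(as_string)
```

Only the first query can match every value in the list; the second compares n against one string, whatever commas it contains.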
Those are all just guesses. Those are some of the most frequent trip ups we see, but we're just guessing. It could be something else entirely. In its current form, there is no definitive "answer" to the question.

MySQL: SELECT(x) WHERE vs COUNT WHERE?

This is going to be one of those questions but I need to ask it.
I have a large table which may or may not have one unique row. I therefore need a MySQL query that will just tell me TRUE or FALSE.
With my current knowledge, I see two options (pseudo code):
[id = primary key]
OPTION 1:
SELECT id FROM table WHERE x=1 LIMIT 1
... and then determine in PHP whether a result was returned.
OPTION 2:
SELECT COUNT(id) FROM table WHERE x=1
... and then just use the count.
Is either of these preferable for any reason, or is there perhaps an even better solution?
Thanks.
If the selection criterion is truly unique (i.e. yields at most one result), you are going to see massive performance improvement by having an index on the column (or columns) involved in that criterion.
create index my_unique_index on table(x)
If you want to enforce the uniqueness, the index is not even optional; you must have
create unique index my_unique_index on table(x)
Having this index, querying on the unique criterion will perform very well, regardless of minor SQL tweaks like count(*), count(id), count(x), limit 1 and so on.
For clarity, I would write
select count(*) from table where x = ?
I would avoid LIMIT 1 for two other reasons:
- It is non-standard SQL. I am not religious about that, use the MySQL-specific stuff where necessary (i.e. for paging data), but it is not necessary here.
- If for some reason, you have more than one row of data, that is probably a serious bug in your application. With LIMIT 1, you are never going to see the problem. This is like counting dinosaurs in Jurassic Park with the assumption that the number can only possibly go down.
AFAIK, if you have an index on your ID column, both queries will perform more or less equally. The second query will need one less line of code in your program, but that's not going to have any performance impact either.
Personally I typically do the first one of selecting the id from the row and limiting to 1 row. I like this better from a coding perspective. Instead of having to actually retrieve the data, I just check the number of rows returned.
If I were to compare speeds, I would say not doing a count in MySQL would be faster. I don't have any proof, but my guess would be that MySQL has to get all of the rows and then count how many there are. Although... on second thought, it would have to do that in the first option as well so the code will know how many rows there are. But since you have COUNT(id) vs COUNT(*), I would say it might be slightly slower.
Intuitively, the first one could be faster since it can abort the table (or index) scan when it finds the first value. But you should retrieve x, not id, since if the engine is using an index on x, it doesn't need to go to the block where the row actually is.
Another option could be:
select exists(select 1 from mytable where x = ?) from dual
Which already returns a boolean.
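A sketch of how that could look from application code, with Python's sqlite3 standing in for MySQL. The table contents and the has_row helper are invented, and FROM dual is left out because SQLite has no DUAL table (MySQL accepts the statement without it as well).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not MySQL
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, x INTEGER)")
conn.execute("CREATE INDEX mytable_x ON mytable (x)")
conn.executemany("INSERT INTO mytable (x) VALUES (?)", [(1,), (2,), (2,)])


def has_row(x):
    """True if at least one row has this x; always returns exactly one flag."""
    (flag,) = conn.execute(
        "SELECT EXISTS(SELECT 1 FROM mytable WHERE x = ?)", (x,)
    ).fetchone()
    return bool(flag)


print(has_row(1), has_row(2), has_row(9))
```

The query returns one row whether or not a match exists, so the caller never has to inspect a row count.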
Typically, you use a GROUP BY with a HAVING clause to determine if there are duplicate rows in a table. Say you have a table with an id and a name (assuming id is the primary key, and you want to know if name is unique or repeated). You would use
select name, count(*) as total from mytable group by name having total > 1;
The above will return the names which are repeated and the number of times each one appears.
If you just want one query to get your answer as true or false, you can use a nested query, e.g.
select if(count(*) >= 1, True, False) from (select name, count(*) as total from mytable group by name having total > 1) a;
The above should return true, if your table has duplicate rows, otherwise false.

Calculated Column Based on Two Calculated Columns

I'm trying to do a rather complicated SELECT computation that I will generalize:
- Main query is a wildcard select for a table
- One subquery does a COUNT() of all items based on a condition (this works fine)
- Another subquery does a SUM() of numbers in a column based on another condition. This also works correctly, except when no records meet the conditions, it returns NULL.
I initially wanted to add up the two subqueries, something like (subquery1)+(subquery2) AS total which works fine unless subquery2 is null, in which case total becomes null, regardless of what the result of subquery1 is. My second thought was to try to create a third column that was to be a calculation of the two subqueries (ie, (subquery1) AS count1, (subquery2) AS count2, count1+count2 AS total) but I don't think it's possible to calculate two calculated columns, and even if it were, I feel like the same problem applies.
Does anyone have an elegant solution to this problem outside of just getting the two subquery values and totalling them in my program?
Thanks!
Two issues going on here:
You can't use one column alias in another expression in the same SELECT list.
However, you can establish aliases in a derived table subquery and use them in an outer query.
You can't do arithmetic with NULL, because NULL is not zero.
However, you can "default" NULL to a non-NULL value using the COALESCE() function. This function returns its first non-NULL argument.
Here's an example:
SELECT *, count1+count2 AS total
FROM (SELECT *, COALESCE((subquery1), 0) AS count1,
COALESCE((subquery2), 0) AS count2
FROM ... ) t;
(remember that a derived table must be given a table alias, "t" in this example)
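To make that concrete, here is a self-contained sketch with Python's sqlite3 standing in for MySQL. The owners and items tables, their columns, and the two conditions are all invented; the original question does not say what its subqueries look like.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not MySQL
conn.execute("CREATE TABLE owners (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE items (owner_id INTEGER, qty INTEGER, flagged INTEGER)")
conn.executemany("INSERT INTO owners VALUES (?, ?)", [(1, "ann"), (2, "bob")])
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(1, 5, 1), (1, 7, 0), (2, 3, 0)],  # bob has no flagged items
)

# Naive sum for bob: COUNT(*) + NULL is NULL.
naive = conn.execute(
    "SELECT (SELECT COUNT(*) FROM items WHERE owner_id = 2)"
    " + (SELECT SUM(qty) FROM items WHERE owner_id = 2 AND flagged = 1)"
).fetchone()[0]

# Aliases are defined in the derived table t and reused in the outer query.
rows = conn.execute(
    """
    SELECT t.name, t.count1, t.count2, t.count1 + t.count2 AS total
    FROM (SELECT o.name,
                 COALESCE((SELECT COUNT(*) FROM items i
                           WHERE i.owner_id = o.id), 0) AS count1,
                 COALESCE((SELECT SUM(i.qty) FROM items i
                           WHERE i.owner_id = o.id AND i.flagged = 1), 0) AS count2
          FROM owners o) AS t
    ORDER BY t.name
    """
).fetchall()

print(naive)
print(rows)
```

For bob the SUM() subquery matches no rows and yields NULL, which COALESCE turns into 0 before the addition.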
First off, the COALESCE function should help you take care of any null problems.
Could you use a union to merge those two queries into a single result set, then treat it as a subquery for further analysis?
Or maybe I did not completely understand your question?
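I would try (for the second query) something like: SELECT IFNULL(SUM(myColumn), 0) //In MySQL the two-argument function is IFNULL(); ISNULL(expr, 0) is SQL Server syntax. Please verify the syntax before you use it, though...
This should return 0 instead of NULL when no rows match the condition (SUM() over zero rows is NULL, so the wrapper has to go around the SUM, not inside it).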
It might be unnecessary to say, but since you're using it inside a program, you might prefer to use program logic to sum the two results (NULL and a number), due to portability issues.
Who knows when COALESCE function is deprecated or if another DBMS supports it or not.