Using multiple ='s for IN? - mysql

What would be the difference between doing:
SELECT person FROM population WHERE id = 1 or id = 2 or id = 3
and -
SELECT person FROM population WHERE id IN (1,2,3)
Are they executed the exact same way? What difference is there? Would there ever be a reason why one would use IN rather than multiple ='s?

They return the same results. IN mainly shortens the query string, and a list of constants is something the optimizer can handle efficiently (see the manual excerpt below).
One difference in these two comparison operators would be that IN uses a SET of values to compare, unlike the "=" or "<>" which takes a single value.
According to the manual:
Returns 1 if expr is equal to any of the values in the IN list, else returns 0.
If all values are constants, they are evaluated according to the type
of expr and sorted. The search for the item then is done using a
binary search. This means IN is very quick if the IN value list
consists entirely of constants.
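To make the manual excerpt concrete, here is a minimal Python sketch of the idea: sort the constant list once, then answer each membership test with a binary search. This is only an illustration of the technique the manual describes, not MySQL's actual implementation, and the function names (make_in_list, in_list, or_chain) are made up for this example.

```python
from bisect import bisect_left

def make_in_list(constants):
    """Sort and de-duplicate the constant list once, up front."""
    return sorted(set(constants))

def in_list(value, sorted_constants):
    """Binary-search membership test against the pre-sorted list."""
    i = bisect_left(sorted_constants, value)
    return i < len(sorted_constants) and sorted_constants[i] == value

def or_chain(value, constants):
    """The equivalent chain of equality tests: id = c1 OR id = c2 OR ..."""
    return any(value == c for c in constants)

ids = make_in_list([3, 1, 2])          # stands in for IN (1,2,3)
rows = [0, 1, 2, 3, 4]                 # stands in for the id column
matches_in = [r for r in rows if in_list(r, ids)]
matches_or = [r for r in rows if or_chain(r, [1, 2, 3])]
```

Both lists contain the same rows; the only difference is how many comparisons each membership test needs as the list of constants grows.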

Related

Index not used against MySQL SET column

I have a large data table containing details by date and across 3 independent criteria, with around 12 discrete values for each criterion. That is, each criterion field in the table is defined as a 12-value ENUM. Users pull summary data by date and any filtering across the three criteria, including none at all. To make a single-criterion lookup efficient, 3 separate indexes are required: (date,CriteriaA), (date,CriteriaB), (date,CriteriaC). 4 indexes if you want to look up against any of the 3: (date,A,B,C), (date,A,C), (date,B,C), (date,C).
In an attempt to be more efficient in the lookup, I built a SET column containing all 36 values from the 3 criteria. All values across the criteria are unique and none are a subset of any other. I added an index to this set (date, set_col). Queries against this table using a set lookup fail to take advantage of the index, however. Neither FIND_IN_SET('Value',set_col), set_col LIKE '%Value%', nor set_col & [pos. in set] triggers the index (according to EXPLAIN and overall result set return speed).
Is there a trick to indexing SET columns?
I tried queries like
Select Date, count(*)
FROM tbl
where DATE between [Start] and [End]
and FIND_IN_SET('Value',set_col)
group by Date
I would expect it to run nearly as fast as a lookup against an individual criterion column that has an index on it. Instead, it runs only as fast as when just an index on DATE exists. EXPLAIN reports the same number of rows processed.
It's not possible to index SET columns for arbitrary queries.
A SET type is basically a bitfield, with one bit set for each of the values defined for your set. You could search for a specific bit pattern in such a bitfield, or you could search for a range of specific bit patterns, or an inequality, etc. But searching for rows where one specific bit is set in the bitfield is not going to be indexable.
FIND_IN_SET() is really searching for a specific bit set in the bitfield. It will not use an index for this predicate. The best you can hope to do for optimization is to have an index that narrows down the examined rows based on the other search term on date. Then among the rows matching the date range, the FIND_IN_SET() will be applied row-by-row.
It's the same problem as searching for substrings. The following predicates will not use an index on the column:
SELECT ... WHERE SUBSTRING(mytext, 5, 8) = 'word'
SELECT ... WHERE LOCATE('word', mytext) > 0
SELECT ... WHERE mytext LIKE '%word%'
A conventional index on the data would be alphabetized from the start of the string, not from some arbitrary point in the middle of the string. This is why fulltext indexing was created as an alternative to a simple B-tree index on the whole string value. But there's no special index type for bitfields.
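To see why a single-bit test cannot benefit from an ordered index, here is a small Python sketch. It assumes a hypothetical 4-member SET (the member names are invented), encodes each row as a bitmask with one bit per member, and sorts the values the way an index on the column would order them. The rows that have one particular bit set end up scattered through that order instead of forming one contiguous range.

```python
# Hypothetical 4-member SET; each member gets one bit, in definition order.
MEMBERS = ["a", "b", "c", "d"]
BIT = {name: 1 << i for i, name in enumerate(MEMBERS)}

def encode(values):
    """Combine the members present in a row into a single integer bitmask."""
    mask = 0
    for v in values:
        mask |= BIT[v]
    return mask

def is_contiguous(flags):
    """True if all True flags form one unbroken run."""
    first = flags.index(True)
    last = len(flags) - 1 - flags[::-1].index(True)
    return all(flags[first:last + 1])

# An ordered index sorts rows by the whole integer value of the column.
rows = [["a"], ["b"], ["a", "b"], ["c"], ["a", "c"], ["d"], ["b", "d"]]
stored = sorted(encode(r) for r in rows)

# Which index entries have the "b" bit set (what FIND_IN_SET('b', col) asks)?
has_b = [bool(m & BIT["b"]) for m in stored]
scattered = not is_contiguous(has_b)
```

An exact bit pattern (for example, the value for 'a,b') is a single point in the sorted order and can be located directly, while the "has b" matches are interleaved with non-matches, so every entry in the date range has to be tested.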
I don't think the SET data type is helping in your case.
You should use your multi-column indexes with permutations of the columns.
Go back to 3 ENUMs. Then have
INDEX(A, date),
INDEX(B, date),
INDEX(C, date)
Those should significantly help with queries like
WHERE A = 'foo' AND date BETWEEN...
and somewhat help for
WHERE A = 'foo' AND date BETWEEN...
AND B = 'bar'
If you will also have queries without A/B/C, then add
INDEX(date)
Note: INDEX(date, A) is no better than INDEX(date) when using a "range". That is, I recommend against the indexes you mentioned.
FIND_IN_SET(), like virtually all other function calls, is not sargable. However, enum=const is sargable since it is implemented as a simple integer.
You did not mention
WHERE A IN ('x', 'y') AND ...
That is virtually un-indexable. However, my suggestions are better than nothing.

Do I need to cast integer param to string for varchar index to be used?

When I query an indexed varchar column against an integer number, it runs extremely slowly. I thought MySQL was able to infer this and cast the parameter to a string, but when I use an integer for filtering it avoids the index.
Is this OK? Is it a collation problem? Should I always manually cast integers to strings for the varchar index to work?
Running MySQL 5.7.
The varchar column is an external id; we do not control whether it's an integer or alphanumeric. Sometimes the user wants to find an object by our internal integer id, sometimes by their id, so we use: WHERE id = ? or external_id = ?
It says here that:
In all other cases, the arguments are compared as floating-point
(double-precision) numbers. For example, a comparison of string and
numeric operands takes place as a comparison of floating-point
numbers.
Since you're comparing string column with an integer constant, MySQL will have to convert each value in the column to float for comparison and index might not be used (things might be different if it were the other way round i.e. integer column compared to string constant).
But more importantly, such comparison will produce unexpected results:
select '123abc' = 123 -- true
That being said, it is not very difficult to change this:
select '123abc' = 123 -- true
to this:
select '123abc' = '123' -- false
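The surprising '123abc' = 123 result follows from the rule quoted above: when a string meets a number, both sides are compared as doubles, and a string contributes only its leading numeric part. The Python sketch below is a rough approximation of that behaviour for illustration only; to_double and loose_equal are hypothetical helpers, the regular expression is my own simplification, and string-to-string comparison here ignores collation.

```python
import re

# Leading numeric prefix: optional sign, digits with optional fraction,
# optional exponent. Anything after the prefix is ignored.
_NUMERIC_PREFIX = re.compile(r"\s*[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?")

def to_double(value):
    """Approximate a string-to-double conversion: use the numeric prefix, else 0."""
    if isinstance(value, (int, float)):
        return float(value)
    match = _NUMERIC_PREFIX.match(value)
    return float(match.group(0)) if match else 0.0

def loose_equal(a, b):
    """String vs string compares as strings; any other mix compares as doubles."""
    if isinstance(a, str) and isinstance(b, str):
        return a == b
    return to_double(a) == to_double(b)

string_vs_number = loose_equal("123abc", 123)    # like: select '123abc' = 123
string_vs_string = loose_equal("123abc", "123")  # like: select '123abc' = '123'
```

Under this approximation the first comparison is true and the second is false, matching the two SELECT statements above. It also shows why the conversion has to be applied to every stored string when the column is the string side.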
The problem is not the data type or the collation, it's the fact that you're using OR to search two different columns.
Consider the analogy: suppose I ask you to find people named Gammel in the telephone book. You ask, "is that the last name or the first name?" I answer, "please find all cases, whether it's the first name or the last name."
Now you have a problem.
SELECT ... FROM telephone_book
WHERE last_name = 'Gammel' OR first_name = 'Gammel';
The book is sorted by last name, so it should be quick to find the entries that match the last name. But I also asked for all those that match the first name. Those may be distributed randomly all through the book, for people with any last name. You will now have to search the book the hard way, one page at a time.
A common solution to the OR optimization problem is to use UNION with two separate queries that search on respective indexes.
SELECT ... FROM telephone_book
WHERE last_name = 'Gammel'
UNION
SELECT ... FROM telephone_book
WHERE first_name = 'Gammel';
Supposing there is a different index on first_name, the latter query in this union will use it to find the entries matching by first name in an optimized way. We already know it can do that for last name.
Then once the (hopefully small) subset of rows matching either condition are found, these sets are unioned together into one result.
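As a quick check that the UNION rewrite returns the same rows as the OR version, here is a small sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for MySQL here, so this demonstrates the equivalence of the results, not MySQL's query plans; the table contents and index names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE telephone_book (
        id INTEGER PRIMARY KEY, last_name TEXT, first_name TEXT);
    -- One index per searched column, so each UNION branch has its own.
    CREATE INDEX idx_last ON telephone_book (last_name);
    CREATE INDEX idx_first ON telephone_book (first_name);
    INSERT INTO telephone_book (last_name, first_name) VALUES
        ('Gammel', 'Anna'), ('Smith', 'Gammel'),
        ('Gammel', 'Gammel'), ('Jones', 'Bob');
""")

name = "Gammel"
or_rows = conn.execute(
    "SELECT id FROM telephone_book "
    "WHERE last_name = ? OR first_name = ? ORDER BY id",
    (name, name),
).fetchall()

union_rows = conn.execute(
    "SELECT id FROM telephone_book WHERE last_name = ? "
    "UNION "
    "SELECT id FROM telephone_book WHERE first_name = ? ORDER BY id",
    (name, name),
).fetchall()
```

Note that the row matching on both columns appears only once, because UNION (unlike UNION ALL) removes duplicates.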
There are 4 combinations of comparing a column (VARCHAR or INT) to a value (string or numeric). One case is inefficient because no index is useful:
WHERE varchar_col = 123
MySQL must convert each string in the column to a number to do the test, so it cannot use the index.
The other cases can use an index on the column. (For other reasons, such an index may or may not actually be used.)
This is not slow:
WHERE int_col = "123"
because the "123" is converted to a simple number as it is parsed.
And, of course, these work "as expected". That is, a string comparison or a numeric comparison is used.
WHERE varchar_col = "123"
WHERE int_col = 123
(FLOAT is another topic.)

MySQL using IN()

IN() is usually applied like this:
SELECT eid FROM comments WHERE id IN (1,2,3,4,5,6)
Would this generate an error or is it just syntactically bad?
SELECT eid FROM comments WHERE id IN (6)
It will work as expected. Most probably under the hood it will be optimised as WHERE id = 6 anyway.
No, it won't generate an error; it will work correctly,
because the point of the IN clause is to check whether a value exists in a defined list.
In your case this list contains only one value (6).
No, it will not error. The MySQL optimizer knows that id IN (6) is equivalent to id = 6 and will handle it that way.
SELECT eid FROM comments WHERE id IN (6)
Will be rewritten/handled after optimizing as
/* select#1 */ select test.comments.eid AS eid from test.comments where (
test.comments.id = 6)
MySQL IN() function finds a match in the given arguments.
Syntax:
expr IN (value,...)
The function returns 1 if expr is equal to any of the values in the IN
list, otherwise, returns 0. If all values are constants, they are
evaluated according to the type of expr and sorted. The search for the
item then is done using a binary search. This means IN is very quick
if the IN value list consists entirely of constants. Otherwise, type
conversion takes place according to the rules.
In your case, if you are concerned about the performance of IN() with one element vs. =, there is no significant difference between the two statements: the MySQL optimizer will transform the IN into an = when the IN list has just one element.
Something like-
SELECT eid FROM comments WHERE id IN (6)
to
SELECT eid FROM comments WHERE id = 6
A performance issue may arise if IN() contains multiple elements. You can use EXPLAIN to see the difference.

Select distinct column and then count 2 columns that relate to that column in MySQL

So I have an error log that I need to analyze.
In that error log there are fields called
EVENT_ATTRIBUTE that displays the name of the device that collected that information.
EVENT_SEVERITY that displays a number from 1 to 5. In this column I need to find the number of 4's and 5's.
The problem is I need to get the distinct EVENT_ATTRIBUTES and then count all the 4's and 5's related to that specific EVENT_ATTRIBUTE and output the count.
Basically the sensors(event_attribute) detect different errors. I need to analyze how many 4's and 5's each of the sensors picks up so that I can analyze them.
I am having problems taking the distinct sensors and linking the counts to each specific sensor. I have tried this so far, but it returns the same number for 4 and 5, so I don't think I am doing it correctly.
SELECT DISTINCT LEFT(EVENT_ATTRIBUTE, locate('(', EVENT_ATTRIBUTE, 1)-1) AS
SensorName,
COUNT(CASE WHEN 'EVENT_SEVERITY' <>5 THEN 1 END) AS ERROR5,
COUNT(CASE WHEN 'EVENT_SEVERITY' <>4 THEN 1 END) AS ERROR4
FROM nodeapp.disc_event
WHERE EVENT_SEVERITY IN (5,4)
Group BY SensorName;
Here is the table that I am looking at.
Event Error Table
I'm truncating the event attribute because the IP address doesn't matter. Basically I want to make the unique event_attribute act as a primary key and count the number of 4's and 5's connected to that primary key.
With the code above I get this output: Event Result Table
Thank you for all your help!
You're very close.
DISTINCT is unnecessary when you're grouping.
You want SUM(). COUNT() simply counts everything that's not null. You can exploit the hack that a boolean expression evaluates to either 1 or 0.
SELECT LEFT(EVENT_ATTRIBUTE, LOCATE('(', EVENT_ATTRIBUTE, 1)-1) AS SensorName,
SUM(EVENT_SEVERITY = 5) ERROR_5,
SUM(EVENT_SEVERITY = 4) ERROR_4,
COUNT(*) ALL_ERRORS
FROM nodeapp.disc_event
GROUP BY LEFT(EVENT_ATTRIBUTE, LOCATE('(', EVENT_ATTRIBUTE, 1)-1);
Even if EVENT_SEVERITY values are stored as strings in your DBMS, expressions like EVENT_SEVERITY = 4 implicitly coerce them to integers.
It's generally good practice to include batch totals like COUNT(*) especially when you're debugging; they form a good sanity check that you're handling your data correctly.
The query is interpreting 'EVENT_SEVERITY' as a string; try using backticks (`) to delimit the field instead. Double quotes are the "standard" identifier delimiter, but I tend to shy away from them because they look like they should be for strings (and MySQL does treat them as strings unless the ANSI_QUOTES SQL mode is enabled).
Edit (for clarity): I mean it is literally interpreting 'EVENT_SEVERITY' as the string "EVENT_SEVERITY", not the underlying value of the field as a string.
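Both points above (the quoted column name and SUM() over a boolean) can be reproduced in a few lines with Python's built-in sqlite3 module. SQLite stands in for MySQL here, and the table is simplified: the sensor column is assumed to already hold the truncated sensor name and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE disc_event (sensor TEXT, event_severity INTEGER);
    INSERT INTO disc_event VALUES
        ('s1', 5), ('s1', 5), ('s1', 4), ('s2', 4), ('s2', 3);
""")

# Quoting the column name turns it into a string literal, so both CASE
# conditions are true for every row and the two counts are identical.
buggy = conn.execute("""
    SELECT sensor,
           COUNT(CASE WHEN 'EVENT_SEVERITY' <> 5 THEN 1 END) AS error5,
           COUNT(CASE WHEN 'EVENT_SEVERITY' <> 4 THEN 1 END) AS error4
    FROM disc_event
    WHERE event_severity IN (5, 4)
    GROUP BY sensor
    ORDER BY sensor
""").fetchall()

# Summing a boolean expression (1 or 0) counts only the matching rows.
fixed = conn.execute("""
    SELECT sensor,
           SUM(event_severity = 5) AS error_5,
           SUM(event_severity = 4) AS error_4,
           COUNT(*) AS all_errors
    FROM disc_event
    GROUP BY sensor
    ORDER BY sensor
""").fetchall()
```

The buggy query reports the same number in both columns for each sensor, which is the symptom described in the question, while the fixed query separates the 5's from the 4's.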

MySQL Select Results Excluding Outliers Using AVG and STD Conditions

I'm trying to write a query that excludes values beyond 6 standard deviations from the mean of the result set. I expect this can be done elegantly with a subquery, but I'm getting nowhere and in every similar case I've read the aim seems to be just a little different. My result set seems to get limited to a single row, I'm guessing due to calling the aggregate functions. Conceptually, this is what I'm after:
SELECT t.Result FROM
(SELECT Result, AVG(Result) avgr, STD(Result) stdr
FROM myTable WHERE myField=myCondition limit=75) as t
WHERE t.Result BETWEEN (t.avgr-6*t.stdr) AND (t.avgr+6*t.stdr)
I can get it to work by replacing each use of the STD or AVG value (i.e. t.avgr) with its own select statement, as in:
(SELECT AVG(Result) FROM myTable WHERE myField=myCondition limit=75)
However this seems waay more messy than I expect it needs to be (I've a few conditions). At first I thought specifying a HAVING clause was necessary, but as I learn more it doesn't seem to be quite what I'm after. Am I close? Is there some snazzy way to access the value of aggregate functions for use in conditions (without needing to return the aggregate values)?
Yes, your subquery is an aggregate query with no GROUP BY clause, therefore its result is a single row. When you select from that, you cannot get more than one row. Moreover, it is a MySQL extension that you can include the Result field in the subquery's selection list at all, as it is neither a grouping column nor an aggregate function of the groups (so what does it even mean in that context unless, possibly, all the relevant column values are the same?).
You should be able to do something like this to compute the average and standard deviation once, together, instead of per-result:
SELECT t.Result FROM
myTable AS t
CROSS JOIN (
SELECT AVG(Result) avgr, STD(Result) stdr
FROM myTable
WHERE myField = myCondition
) AS stats
WHERE
t.myField = myCondition
AND t.Result BETWEEN (stats.avgr-6*stats.stdr) AND (stats.avgr+6*stats.stdr)
LIMIT 75
Note that you will want to be careful that the statistics are computed over the same set of rows that you are selecting from; hence the duplication of the myField = myCondition predicate, and also the move of the LIMIT clause to the outer query only.
You can add more statistics to the aggregate subquery, provided that they are all computed over the same set of rows, or you can join additional statistics computed over different rows via a separate subquery. Do ensure that all your statistics subqueries return exactly one row each, else you will get duplicate (or no) results.
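The logic of that query (compute the mean and standard deviation once over the filtered rows, then keep only the rows inside the band, then apply the limit) can be sketched in plain Python. This is an illustration of the logic only, not a replacement for the SQL; exclude_outliers and its parameters are hypothetical names, and pstdev is used because MySQL's STD() is the population standard deviation.

```python
from statistics import fmean, pstdev

def exclude_outliers(results, k=6.0, limit=None):
    """Keep values within k population standard deviations of the mean."""
    avgr = fmean(results)      # computed once, like the stats subquery
    stdr = pstdev(results)     # population standard deviation
    low, high = avgr - k * stdr, avgr + k * stdr
    kept = [r for r in results if low <= r <= high]
    # The limit is applied last, mirroring LIMIT on the outer query.
    return kept if limit is None else kept[:limit]

# 99 ordinary readings and one extreme value (made-up sample data).
readings = [10] * 99 + [1000]
kept = exclude_outliers(readings)
first_75 = exclude_outliers(readings, limit=75)
```

In this sample the extreme value lies more than 6 standard deviations above the mean and is dropped, while the 99 ordinary readings are kept.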
I created a UDF that doesn't calculate exactly the way you asked (it discards a percent of the results from the top and bottom, instead of using std), but it might be useful for you (or someone else) anyway. It matches the Excel function referenced here: https://support.office.com/en-us/article/trimmean-function-d90c9878-a119-4746-88fa-63d988f511d3
https://github.com/StirlingMarketingGroup/mysql-trimmean
Usage
`trimmean` ( `NumberColumn`, double `Percent` [, integer `Decimals` = 4 ] )
`NumberColumn`
The column of values to trim and average.
`Percent`
The fractional number of data points to exclude from the calculation. For example, if percent = 0.2, 4 points are trimmed from a data set of 20 points (20 x 0.2): 2 from the top and 2 from the bottom of the set.
`Decimals`
Optionally, the number of decimal places to output. Default is 4.
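For readers who want to see what a trimmed mean does before installing anything, here is a minimal Python sketch based only on the parameter description above. It is not the UDF's code; the function below is my own illustration, and the rule of dropping an equal number of points from each end (rounding the total down to an even count) is taken from the Excel TRIMMEAN behaviour, which I am assuming the UDF follows.

```python
def trimmean(values, percent, decimals=4):
    """Mean of the values after dropping a fraction from the top and bottom."""
    if not 0 <= percent < 1:
        raise ValueError("percent must be in the range [0, 1)")
    data = sorted(values)
    # Total points to exclude, split evenly between the two ends.
    per_side = int(len(data) * percent) // 2
    kept = data[per_side:len(data) - per_side]
    return round(sum(kept) / len(kept), decimals)

# 20 points with percent = 0.2: 2 dropped from the top, 2 from the bottom.
plain = trimmean(range(1, 21), 0.2)
with_outlier = trimmean(list(range(1, 20)) + [1000], 0.2)
```

In the second call the extreme value 1000 falls in the trimmed top portion, so it has no effect on the result.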