Why do mathematical operators interact with strings in SQL - MySQL

Say I use the sql query:
select * from table t where t.name = "adam" and t.age > 10;
and in the column "age" there are more than just integer values. There are also values such as "old", "young", etc.
The result of this query is all rows for adam with an age greater than 10, but also all rows for adam whose age is one of the string values "old", "young", etc.
Why is this?

The reason for this behavior is MySQL's type conversion, which occurs implicitly when you apply operators to data/columns of different types.
To the example you posted: it's generally considered bad practice to save data of different types/meanings in the same field.
I'd set the type of your 'age' column to 'int' and make it nullable. Then add another column named about_age, which could be an ENUM with the values old and young. Whenever the 'age' column is NULL, your application can check the about_age column. Only this way is it completely clear what your data means.

Related

Mixing quoted and unquoted values in IN() condition - MySQL quirk or general issue?

The MySQL manual contains the following interesting note about mixing quoted and unquoted values in an IN condition:
You should never mix quoted and unquoted values in an IN() list because the comparison rules for quoted values (such as strings) and unquoted values (such as numbers) differ. Mixing types may therefore lead to inconsistent results.
However, it doesn't really explain why this is a problem. It has examples, but it doesn't show either the data being queried or the results, so they only serve as illustrations without giving any explanation about the issue.
I have two questions:
1. Why does this cause problems in MySQL? Ideally, provide an example where the results are wrong/inconsistent/unintuitive, to demonstrate.
2. Is this a MySQL-specific quirk or does this apply to other database systems? In particular, I am interested in whether this issue affects SQL Server, but would ideally like the question answered in the general case.
It depends what you consider "non-intuitive". This returns false:
'00' in ('0', '01')
However, this returns true:
'00' in (0, '01')
I think the next few lines give an unintuitive example without mixing:
mysql> SELECT 'a' IN (0), 0 IN ('b');
-> 1, 1
You can extend that:
SELECT 'a' IN (0, 1, '2'), 'a' IN ('0', '1', '2');
-> 1, 0
SELECT 0 IN (0.0, 'b'), 0 IN ('0.0', 'b');
-> 1, 1
There is also this other question:
In MySQL, why does the following query return '----', '0', '000', 'AK3462', 'AL11111', 'C131521', 'TEST', etc.?
select varCharColumn from myTable where varCharColumn in (-1, '');
I get none of these results when I do:
select varCharColumn from myTable where varCharColumn in (-1);
select varCharColumn from myTable where varCharColumn in ('');
Most likely everything is cast to float, according to this link:
[...] In all other cases, the arguments are compared as floating-point (real) numbers. For example, a comparison of string and numeric operands takes place as a comparison of floating-point numbers.
Strings are cast to 0.0 unless they start with digits. Also according to the same link, there can be problems with floating-point accuracy, and queries may not use an index because the type does not match (everything must be cast to float, so no index usage, I guess).
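To make the conversion rule quoted above concrete, here is a minimal Python sketch of it. It is only a model, not MySQL's code: the helper names `to_double` and `loose_eq` are made up for illustration, and string-to-string comparison is simplified to exact equality (collation is ignored).

```python
import re

# Leading numeric prefix: optional sign, digits, optional fraction and exponent.
_NUMERIC_PREFIX = re.compile(r"\s*[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def to_double(value):
    """Model of reading a value as a floating-point number (hypothetical helper)."""
    if isinstance(value, (int, float)):
        return float(value)
    match = _NUMERIC_PREFIX.match(value)
    # No numeric prefix at all (e.g. 'a', '', '----') is treated as 0.0.
    return float(match.group(0)) if match else 0.0


def loose_eq(left, right):
    """Two strings compare as strings; any string/number mix compares as doubles."""
    if isinstance(left, str) and isinstance(right, str):
        return left == right  # simplification: no collation rules
    return to_double(left) == to_double(right)
```

With these helpers, `loose_eq('a', 0)` and `loose_eq(0, 'b')` are both true because a non-numeric string becomes 0.0, while `loose_eq('00', '0')` is false because two strings are compared as strings.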
I think you might get something similar but not the same with every DBMS because you have to cast things to compare them. It might not be the exact same issue in SQL Server, because the data type precedence is not the same, but you should compare data of the same data type.
According to this link, the data type precedence for SQL Server is:
user-defined data types (highest)
sql_variant
xml
datetimeoffset
datetime2
datetime
smalldatetime
date
time
float
real
decimal
money
smallmoney
bigint
int
smallint
tinyint
bit
ntext
text
image
timestamp
uniqueidentifier
nvarchar (including nvarchar(max) )
nchar
varchar (including varchar(max) )
char
varbinary (including varbinary(max) )
binary (lowest)
In SQL Server, comparing an int with a string would cast the string to int (not float).
Some simple tests suggest that the comparison between data types is done correctly, despite what is written in the MySQL manual.
SELECT 0 IN ('0','00',0,00); -> TRUE
SELECT 0 IN ('0','01',1,01); -> TRUE
SELECT 0 IN ('1','00',1,10); -> TRUE
SELECT 0 IN ('11','10',0,10); -> TRUE
SELECT 0 IN ('1','01',1,00); -> TRUE
SELECT '0' IN ('1','01',1,00); -> TRUE
SELECT '0' IN ('0','00',0,00); -> TRUE
SELECT '0' IN ('0','01',1,01); -> TRUE
SELECT '0' IN ('1','00',1,10); -> FALSE
SELECT '0' IN ('11','10',0,10); -> TRUE
SELECT '1' IN ('11','10',1,10); -> TRUE
SELECT '15.32' IN ('11','10',1,15.32); -> TRUE
SELECT 13.12 IN ('11','10',1,13.12); -> TRUE
SELECT 00 IN ('11','00',1,13.12); -> TRUE
SELECT '00' IN ('11',00,1,13.12); -> TRUE
SELECT '00.0' IN ('11',00.0,1,13.12); -> TRUE
SELECT '00.00' IN ('11',0,1,13.12); -> TRUE
SELECT '00.01' IN ('11',0.01,1,13.12); -> TRUE
The above results can be seen in this SQLFiddle
But the above tests are not even close to covering all of MySQL's data types.
In addition, we should consider in which cases we would actually use the IN() operator.
The MySQL manual says that mixed data types sometimes produce surprising results, but is it actually necessary to have different data types inside IN()?
In short, no. The values inside the parentheses will be checked against a table column that has a specific data type.
For example, doesn't comparing a column of type TEXT against IN ('Hello','World',13) seem odd? One could object that a column of type TEXT may contain numerical values. Fine, then just write the above as IN ('Hello','World','13'), since we are talking about a TEXT column.
If we do not know the data type, or if the data type is somehow dynamic and could sometimes change, then we should convert that field to the data type that we expect the majority of results to have.
1. Why does this cause problems in MySQL?
The example below should be able to show you the inconsistency about using IN across quoted (x='1a') and unquoted types (x=1). Note for the same value of x = 1, the same IN expression yields 0 in Query 1, but yields 1 in Query 2.
SELECT
x, x IN ('1b','a1')
FROM
(
select '1a' as x
union all select 1
) q1;
SELECT
x, x IN ('1b','a1')
FROM
(
select 1 as x
) q1;
Results:
Query 1:
'1a': 0
1: 0
Query 2:
1: 1
So far I have not observed any inconsistency when I only alter the list inside IN. But the pattern I observed is:
expr IN (...array of values)
For expr with string, against string values: compare as string
For expr without string, against string values: compare as number
For expr with string, against numeric values: compare as number
For expr without string, against numeric values: compare as number
2. Is this a MySQL-specific quirk or does this apply to other database systems?
It varies case by case. For MSSQL the answer is no, because comparing a string with a number gives an error message like:
Conversion failed when converting the varchar value '1a' to data type int.
1. Why does this cause problems in MySQL?
The engine needs to know how to make the comparisons.
If you compare an integer column, the column's integer value is compared with the items in the IN list. If the IN list items are strings, the comparison differs.
https://dev.mysql.com/doc/refman/8.0/en/type-conversion.html
2. Is this a MySQL-specific quirk or does this apply to other database systems?
It is not MySQL-specific. For performance reasons (indexing) it is always better to avoid casting.
Why does this cause problems in MySQL?
It's not a bug, it's a feature. 😬
Basically it's about how the database handles the field comparison. In particular, MySQL automatically converts the string value to a numeric value when comparing numeric values with string values. Since MySQL is written in C++, somewhere in the code base the string value is presumably cast to double before the field comparison.
There is nothing special about the IN clause, I think. In the MySQL source code, I saw comments similar to this one:
`WHERE a IN (b, c)` can also be rewritten as `WHERE a = b OR a = c`
Which makes sense and IN is (probably) treated the same way in code base. So based on this, if we have let's say something like this:
... WHERE '04.2' IN ('0', 4.2);
This means '04.2' = '0' OR '04.2' = 4.2, which returns true because, in C/C++-style pseudocode:
"04.2" = "0" // string value comparison -> false
cast_as_double("04.2") = 4.2 // double value comparison -> true
The same applies for other cases, which resolve as true, e.g. 42 IN ('0042', 0), '3.00' IN (3, '1'), 0 IN (3, '0.00') etc.
Is this a MySQL-specific quirk or does this apply to other database systems?
This seems to be the case with other databases as well. If you like, you can test them online
MySQL: https://www.db-fiddle.com
PostgreSQL: https://www.db-fiddle.com
MS SQL Server 2017: http://sqlfiddle.com/#!18/ff6b8/12807
Whilst there have been a lot of answers and comments that provide examples of 'unintuitive' behaviour, most of these examples seem to be explained by the standard casting rules. In other words, the results were entirely consistent with what would be returned from SELECT A = B; for the given A and B.
"Because casting" doesn't seem like a particularly satisfying explanation for the paragraph I quoted in the question. That paragraph comes after a number of paragraphs explaining how type conversion affects the IN() statement, so it seems somewhat repetitive and redundant if that is all it's referring to.
My interpretation of the quoted paragraph is that it is an explicit statement that a IN(b, c) may give different results to a = b OR a = c in situations where b and c are quoted differently.
I was therefore looking to find an example where the result couldn't be explained by the usual casting rules.
I think the reason that we haven't seen a good example yet is because most answers focussed on comparing numbers, in string and non-string representations. However, by basing the test around string values instead, I have managed to construct a non-intuitive example that is not explained by simple type conversion rules and which is not equivalent to the individual comparisons ORed together; the comparison between 'test' and 23 gives different results depending on what other values are in the IN() list:
SELECT 'test' IN('fish'); --> 0
SELECT 'test' IN(23); --> 0
SELECT 'test' IN('fish', 23); --> 1 !!!
I have yet to come up with a good explanation about what is happening here - is there some rule being followed, or is it just a MySQL quirk? I also haven't got an answer to the second question, as that somewhat depends on the reason for the behaviour (e.g. if it is defined by the standard or is an artefact of an obvious optimisation, vs. just being a MySQL-specific quirk) but I guess this could be figured out by running the above test on other RDBMSs.
Any comments to help flesh this out (or answers that cover the missing elements) will be appreciated - I will update this answer with any further details that I manage to deduce and don't plan on accepting any answer (including my own) until I understand what's going on a little bit better.
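As a hedged way to reason about the three results above, the Python sketch below contrasts two candidate models of IN(). Both are assumptions made for illustration, not a confirmed description of the server, and the function names are hypothetical. The first treats `a IN (b, c)` as `a = b OR a = c`, with each pair choosing its own comparison type. The second picks a single comparison type from all the arguments together, so one unquoted number in the list makes every comparison numeric.

```python
import re

# Leading numeric prefix: optional sign, digits, optional fraction and exponent.
_NUMERIC_PREFIX = re.compile(r"\s*[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def to_double(value):
    """Read a value as a float; strings without a numeric prefix become 0.0."""
    if isinstance(value, (int, float)):
        return float(value)
    match = _NUMERIC_PREFIX.match(value)
    return float(match.group(0)) if match else 0.0


def in_pairwise(expr, values):
    """Model 1 (assumption): each pair picks string or numeric comparison on its own."""
    def eq(a, b):
        if isinstance(a, str) and isinstance(b, str):
            return a == b
        return to_double(a) == to_double(b)
    return any(eq(expr, v) for v in values)


def in_aggregated(expr, values):
    """Model 2 (assumption): one comparison type is chosen from all arguments at once."""
    if all(isinstance(a, str) for a in [expr, *values]):
        return expr in values  # everything quoted: plain string comparison
    target = to_double(expr)   # otherwise every argument is compared as a double
    return any(to_double(v) == target for v in values)
```

Under the second model, `in_aggregated('test', ['fish'])` and `in_aggregated('test', [23])` are false but `in_aggregated('test', ['fish', 23])` is true, because 'test' and 'fish' both become 0.0 once the list contains a number. The first model returns false for all three. Neither model has been checked against every result reported elsewhere in this thread.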

Should I go with ENUM or TINYINT

My table contains a field 'priority'. Now I have following priorities to consider 'low', 'medium', 'high'.
What I am confused with is that:
Should I create a ENUM type field for priority values ?
Should I create a TINYINT type field and store values as 1, 2, 3 ?
Please note I would be required search and sort data based on this field.
Also, there will be indexing on this field.
You should use ENUM only if you are sure that no priorities will be added in the future, because in that case you would have to alter the table. But ENUM does ensure consistent data: no other values can be inserted.
You should go with ENUM. As you said, you have to search and sort; storing the values as TINYINT would require additional processing, such as converting 1, 2, 3 back to 'low', 'medium', 'high' for display.
ENUM is ideal for such situations.
ENUM is a non-standard MySQL extension. You should avoid it, especially if you can achieve the same results in a standard way. So it's better to go with TINYINT.

enum or char(1) in MySQL

Sometimes I am not sure whether to use enum or char(1) in MySQL. For instance, I store statuses of posts. Normally, I only need Active or Passive values in the status field. I have two options:
// CHAR
status char(1);
// ENUM (but too limited)
status enum('A', 'P');
What if I want to add one more status type (e.g. Hidden) in the future? With a small amount of data it won't be an issue, but with a very large amount of data I think altering the ENUM type will be a problem.
So what's your advice, also considering MySQL performance? Which way should I go?
Neither. You'd typically use tinyint with a lookup table
char(1) will be slightly slower because comparing uses collation
confusion: As you extend to more than A and P
using a letter limits you as you add more types. See last point.
every system I've seen has more than one client, e.g. reporting. A and P have to be resolved to Active and Passive in each client's code
extendibility: to add one more type ("S" for "Suspended") you can add one row to a lookup table, or change a lot of code and constraints. And your client code too
maintenance: logic is in 3 places: database constraint, database code and client code. With a lookup and foreign key, it can be in one place
Enum is not portable
On the plus side of using a single letter or Enum
Note: there is a related DBA.SE MySQL question about Enums. The recommendation is to use a lookup table there too.
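A minimal sketch of the lookup-table approach, written in Python against an in-memory SQLite database purely so that it can be run without a server. The table and column names (`post_status`, `post`, `status_id`) are assumptions for illustration; in MySQL the key column would be the TINYINT mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE post_status (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE post (
        id        INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        status_id INTEGER NOT NULL REFERENCES post_status(id)
    );
    INSERT INTO post_status (id, name) VALUES (1, 'Active'), (2, 'Passive');
    INSERT INTO post (title, status_id) VALUES ('first', 1), ('second', 2);
""")

# Adding a new status is one row in the lookup table, not a schema change.
conn.execute("INSERT INTO post_status (id, name) VALUES (3, 'Suspended')")
conn.execute("UPDATE post SET status_id = 3 WHERE title = 'second'")

# Clients resolve the label with a join instead of hard-coding 'A' and 'P'.
rows = conn.execute("""
    SELECT p.title, s.name
    FROM post AS p
    JOIN post_status AS s ON s.id = p.status_id
    ORDER BY p.id
""").fetchall()

# The foreign key rejects a status that is not in the lookup table.
try:
    conn.execute("INSERT INTO post (title, status_id) VALUES ('third', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The point of the design is that the list of valid statuses and their labels lives in one place, the lookup table, and the foreign key keeps every row consistent with it.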
You can use
status enum('Active', 'Passive');
It will not save a string in the row; it will only save a number that references the enum member in the table structure, so the size is the same but it's more readable than char(1) or your enum.
Editing enum is not a problem no matter how big your data is
I would use a binary SET field for this, but without labelling the options specifically within the database. All the "labelling" would be done within your code, but it does provide some very flexible options.
For example, you could create a SET containing eight "options" such as;
`column_name` SET('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h') NOT NULL DEFAULT ''
Within your application, you can then define the 'a' as denoting "Active" or "Passive", the 'b' can denote "Hidden", and the rest can be left undefined until you need them.
You can then use all sorts of useful binary operations on the field. A SET is stored as a bitmask in which 'a' has the value 1, 'b' the value 2, 'c' the value 4 and so on, so you could for instance extract all those which are "Hidden" by running;
WHERE `column_name` & 2
And all those which are "Active" AND "Hidden" by running;
WHERE `column_name` & 1 AND `column_name` & 2
You can even use the LIKE and FIND_IN_SET operators to do even more useful queries.
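As a rough illustration of why bitwise tests work on a SET column, the following Python sketch models the stored value as a bitmask in which the first member has bit value 1, the second 2, and so on. The names `SET_MEMBERS`, `encode` and `has_member` are hypothetical and exist only for this example.

```python
# Members in declaration order, as in SET('a', 'b', ..., 'h').
SET_MEMBERS = ["a", "b", "c", "d", "e", "f", "g", "h"]

# The n-th member (counting from zero) owns bit 2**n.
BIT = {name: 1 << position for position, name in enumerate(SET_MEMBERS)}


def encode(members):
    """Combine member names into the integer a SET value corresponds to."""
    mask = 0
    for name in members:
        mask |= BIT[name]
    return mask


def has_member(mask, name):
    """True when the bit belonging to `name` is set in `mask`."""
    return bool(mask & BIT[name])


active_and_hidden = encode(["a", "b"])  # bits 1 and 2 together
only_active = encode(["a"])             # bit 1 only
```

Here `has_member(active_and_hidden, "b")` is true and `has_member(only_active, "b")` is false, which is the same test a bitwise AND against the member's bit value performs in a WHERE clause.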
Read the MySQL documentation for further information;
http://dev.mysql.com/doc/refman/5.1/en/set.html
Hope it helps!
Dave
It's hard to tell without knowing the semantics of your statuses, but to me "hidden" doesn't seem like an alternative to "active" or "passive"; i.e. you might want to have both "active hidden" and "passive hidden". A single status column would degenerate with each new non-exclusive "status", so it would be better to implement your schema with boolean flags: one for the active/passive distinction, and one for the hidden/visible distinction. Queries become more readable when your condition is "WHERE NOT hidden" or "WHERE active", instead of "WHERE status = 'A'".

Which is the best way to define the value of data being added to my database?

For example, I have a column shopping_cart.status, and for each record this column should contain one of three values: "incomplete", "complete" or "shipped". My question is, should it be my application that makes sure that these are the only values used, or do I need to build this into the domain of this attribute on the database side?
Use enums; that's exactly what they are meant for.
An ENUM is a string object with a value chosen from a list of
permitted values that are enumerated explicitly in the column
specification at table creation time.
An enumeration value must be a quoted string literal; it may not be an
expression, even one that evaluates to a string value. For example,
you can create a table with an ENUM column like this:
CREATE TABLE shoppingcards (
shoppingcardstatus ENUM('incomplete', 'complete', 'shipped')
);
see: http://dev.mysql.com/doc/refman/5.0/en/enum.html

MySQL Tri-state field

I need to create a good/neutral/bad field. Which one would be the more understandable/correct way?
A binary field with null (1=good, null=neutral, 0=bad)
An int (1=good, 2=neutral, 3=bad)
An enum (good, neutral, bad)
Any other
It's only an informative field and I will not need to search by it.
NULL values should be reserved for either:
unknown values; or
not-applicable values;
neither of which is the case here.
I would simply store a CHAR value myself, one of the set {'G','N','B'}. That's probably the easiest solution and takes up little space while still providing mnemonic value (easily converting 'G' to 'Good' for example).
If you're less concerned about space, then you could even store them as varchar(7) or equivalent and store the actual values {'Good','Neutral','Bad'} so that no translation at all would be needed in your select statements (assuming those are the actual values you will be printing).
In MySQL you ought to be using an enum type. You can pick any names you like without worrying about space, because MySQL stores the data as a short integer. See 10.4.4. The ENUM Type in the documentation.