Why do people on SO prefer CASE WHEN to other alternatives? - mysql

I have noticed that on SO a lot of people seem to prefer CASE ... WHEN to other alternatives.
For example all of the answers in this question use CASE ... WHEN whereas I would have used a simple IF. IF is quite a bit less to type and is prevalent in all programming languages so it seems kind of weird to me that not a single answer uses it. (I would also expect that IF is a bit faster though I did not measure it).
Even more interesting are the answers to this question. 2 out of 3 answers (among them the accepted answer) suggest using CASE ... WHEN when from my point of view COALESCE is the better solution (after all COALESCE was created for exactly the problem the OP has). (Also, in this case I am almost certain that COALESCE would be faster.)
So, my question is, is there any benefit to CASE ... WHEN (that offsets the additional typing) that I am missing or is it a case of "To a man with a hammer, everything looks like a nail"?

One reason, a good one actually, is that a CASE WHEN expression is ANSI compliant while IF is not. Were someone to face porting a MySQL query to another database the IF calls in MySQL would probably all have to be rewritten.
MySQL, like most databases, extended ANSI by introducing the IF() function. Perhaps IF, or something similar to it, will become part of the standard some day.

CASE WHEN is in the SQL standard. IF is not. As SQL databases do have vastly different dialects, it is not the worst idea to stick to code that will work on most databases for the following reasons:
If you build the habit of using code that is specific to one database, you will have troubles when working on another.
If you use code that is specific to one database, you cannot test your query with other databases by simply copy pasting them. You can also not migrate your application to other databases without changing your SQL queries.

CASE WHEN is the ANSI standard expression for conditional expressions. IF() is a function specific to MySQL.
In general, I prefer ANSI standard functionality when available -- although there are occasional exceptions.
Specifically about IF() as a function. It is easily confused with IF as a statement in MySQL. Using it as a function seems like unnecessary confusion (admittedly, there are other databases where CASE can be confused with a CASE statement in the scripting language, but that is not an issue in MySQL).
In addition, IF() is pretty close to control flow, which makes it different from most other functions anyway.

Related

Should SQL functions be uppercase or lowercase?

I am new in SQL and I was wondering what is the right way to write the functions. I know the norm for statements like SELECT is uppercase, but what is the norm for functions? I've seen people write them with lowercase and others in uppercase.
Thanks for the help.
There's no norm on that, there are standards, but those can change from company to company.
SQL code is not case sensitive.
Usually you should write SQL code (and SQL reserved code) in UPPERCASE and fields and other things in lowercase. But it is not necessary.
It depends on the function a bit, but there is no requirement. There is really no norm but some vendors have specific standards / grammar, which in their eyes aids with readability (/useability) of the code.
Usually 'built-in' functions (non vendor specific) are shown in uppercase.
/* On MySQL */
SELECT CURRENT_TIMESTAMP;
-> '2001-12-15 23:50:26'
Vendor specific functions are usually shown in lowercase.
A good read, on functions in general, can be found here.
SQL keywords are routinely uppercase, but there is no, one "correct" way. Lowercase works just as well. (N.B. Many think that using uppercase keywords in SQL improves readability, although not to say that using lowercase is difficult to read.) For the most part, however, no one will be averted to something like:
SELECT * FROM `users` WHERE id="3"
Although, if you prefer, this will work as well:
select * from `users` where id='3'
Here's a list of them. Note that they are written in uppercase, but it is not required:
http://developer.mimer.com/validator/sql-reserved-words.tml
Here's another good resource that I keep in my "interesting articles to read over periodically." It elaborates a bit on some somewhat interesting instances when case should be taken into consideration:
http://dev.mysql.com/doc/refman/5.0/en/identifier-case-sensitivity.html

SQL parameterization: How does this work behind the scenes?

SQL parameterization is a hot topic nowadays, and for a good reason, but does it really do anything besides escaping decently?
I could imagine a parameterization engine simply making sure the data is decently escaped before inserting it into the query string, but is that really all it does? It would make more sense to do something differently in the connection, e.g. like this:
> Sent data. Formatting: length + space + payload
< Received data
-----
> 69 SELECT * FROM `users` WHERE `username` LIKE ? AND `creation_date` > ?
< Ok. Send parameter 1.
> 4 joe%
< Ok. Send parameter 2.
> 1 0
< Ok. Query result: [...]
This way would simply eliminate the issue of SQL injections, so you wouldn't have to avoid them through escaping. The only other way I can think of how parameterization might work, is by escaping the parameters:
// $params would usually be an argument, not in the code like this
$params = ['joe%', 0];
// Escape the values
foreach ($params as $key=>$value)
$params[$key] = mysql_real_escape_string($value);
// Foreach questionmark in the $query_string (another argument of the function),
// replace it with the escaped value.
$n = 0;
while ($pos = strpos($query_string, "?") !== false && $n < count($params)) {
// If it's numeric, don't use quotes around it.
$param = is_numeric($params[$n]) ? $params[$n] : "'" . $params[$n] . "'";
// Update the query string with the replaced question mark
$query_string = substr($query_string, 0, $pos) //or $pos-1? It's pseudocode...
. $param
. substr($query_string, $pos + 1);
$n++;
If the latter is the case, I'm not going to switch my sites to parameterization just yet. It has no advantage that I can see, it's just another strong vs weak variable typing discussion. Strong typing may catch more errors in compiletime, but it doesn't really make anything possible that would be hard to do otherwise - same with this parameterization. (Please correct me if I'm wrong!)
Update:
I knew this would depend on the SQL server (and also on the client, but I assume the client uses the best possible techniques), but mostly I had MySQL in mind. Answers concerning other databases are (and were) also welcome though.
As far as I understand the answers, parameterization does indeed do more than simply escaping the data. It is really sent to the server in a parameterized way, so with variables separated and not as a single query string.
This also enables the server to store and reuse the query with different parameters, which provides better performance.
Did I get everything? One thing I'm still curious about is whether MySQL has these features, and if query reusage is automatically done (or if not, how this can be done).
Also, please comment when anyone reads this update. I'm not sure if it bumps the question or something...
Thanks!
I'm sure that the way that your command and parameters are handled will vary depending on the particular database engine and client library.
However, speaking from experience with SQL Server, I can tell you that parameters are preserved when sending commands using ADO.NET. They are not folded into the statement. For example, if you use SQL Profiler, you'll see a remote procedure call like:
exec sp_executesql N'INSERT INTO Test (Col1) VALUES (#p0)',N'#p0 nvarchar(4000)',#p0=N'p1'
Keep in mind that there are other benefits to parameterization besides preventing SQL injection. For example, the query engine has a better chance of reusing query plans for parameterized queries because the statement is always the same (just the parameter values change).
In response to update:
Query parameterization is so common I would expect MySQL (and really any database engine) to handle it similarly.
Based on the MySQL protocol documentation, it looks like prepared statements are handled using COM_PREPARE and COM_EXECUTE packets, which do support separate parameters in binary format. It's not clear if all parameterized statements will be prepared, but it does look like unprepared statements are handled by COM_QUERY which has no mention of parameter support.
When in doubt: test. If you really want to know what's sent over the wire, use a network protocol analyzer like Wireshark and look at the packets.
Regardless of how it's handled internally and any optimizations it may or may not currently provide for a given engine, there's very little (nothing?) to gain from not using parameters.
Parameterized query are passed to SQL implementation as parameterized query, the parameters are never concatenated to the query itself unless an implementation decided to fallback to concatenating themselves. Parameterized query avoids the need for escaping, and improves performance since the query is generic and it is more likely that a compiled form of the query is already cached by the database server.
The straight answer is "it's implemented whatever way it's implemented in the particular implementation in question". There's dozens of databases, dozens of access layers and in some cases more than one way for the same access layer to deal with the same code.
So, there isn't a single correct answer here.
One example would be that if you use Npgsql with a query that isn't a prepared statement, then it pretty much just escapes things correctly (though escaping in Postgresql has some edge cases that people who know about escaping miss, and Npgsql catches them all, so still a gain). With a prepared statement, it sends parameters as prepared-statment parameters. So one case allows for greater query-plan reuse than another.
The SQLServer driver for the same framework (ADO.NET) passes queries through as calls to sp_executesql, which allows for query-plan re-use.
As well as that, the matter of escaping is still worth considering for a few reasons:
It's the same code each time. If you're escaping yourself, then either you're doing so through the same piece of code each time (so it's not like there's any downside to using someone else's same piece of code), or you're risking a slip-up each time.
They're also better at not escaping. There's no point going through every character in the string representation of a number looking for ' characters, for example. But does not escaping count as a needless risk, or a reasonable micro-optimisation.
Well, "reasonable micro-optimisation" in itself means one of two things. Either it requires no mental effort to write or to read for correctness afterwards (in which case you might as well), or it's hit frequently enough that tiny savings will add up, and it's easily done.
(Relatedly, it also makes more sense to write a highly optimised escaper - the sort of string replacement involved is the sort of case where the most common approach of replacing isn't as fast as some other approaches in some languages at least, but the optimisation only makes sense if the method will be called a very large number of times).
If you've a library that includes type checking the parameter (either in basing the format used on the type, or by validation, both of which are common with such code), then it's easy to do and since these libraries aim at mass use, it's a reasonable micro-opt.
If you're thinking each time about whether parameter number 7 of an 8-parameter call could possibly contain a ' character, then it's not.
They're also easier to translate to other systems if you want. To again look at the two examples I gave above, apart from the classes created, you can use pretty much identical code with System.Data.SqlClient as with Npgsql, though SQL-Server and Postgresql have different escaping rules. They also have an entirely different format for binary strings, date-times and a few other datatypes they have in common.
Also, I can't really agree with calling this a "hot topic". It's had a well-established consensus for well over a decade at the very least.

Why are MySQL queries almost always written in Capital

I've seen most coders use Capital letters while writing MySQL queries, something like
"SELECT * FROM `table` WHERE `id` = 1 ORDER BY `id` DESC"
I've tried writing the queries in small caps and it still works.
So is there any particular reason for not using small caps or is it just a matter of choice?
It's just a matter of readability. Keeping keywords in upper case and table/column names lower case means it's easier to separate the two when scan reading the statement .: better readability.
Most SQL implementations are case-insensitive, so you could write your statement in late 90s LeEt CoDeR StYLe, if you felt so inclined, and it would still work.
The case make no difference of the SQL engine. It is just a convention followed, just like coding conventions use in any of the programming languages
You have to have a system - there are already a few questions on the site that deal with conventions and approaches. Try:
SQL formatting standards
This is for readability sake.
Sometimes we have to write queries like that because our project (official) recommends.
But everything for readability and uniformity in a project.

Why would a language NOT use Short-circuit evaluation?

Why would a language NOT use Short-circuit evaluation? Are there any benefits of not using it?
I see that it could lead to some performances issues... is that true? Why?
Related question : Benefits of using short-circuit evaluation
Reasons NOT to use short-circuit evaluation:
Because it will behave differently and produce different results if your functions, property Gets or operator methods have side-effects. And this may conflict with: A) Language Standards, B) previous versions of your language, or C) the default assumptions of your languages typical users. These are the reasons that VB has for not short-circuiting.
Because you may want the compiler to have the freedom to reorder and prune expressions, operators and sub-expressions as it sees fit, rather than in the order that the user typed them in. These are the reasons that SQL has for not short-circuiting (or at least not in the way that most developers coming to SQL think it would). Thus SQL (and some other languages) may short-circuit, but only if it decides to and not necessarily in the order that you implicitly specified.
I am assuming here that you are asking about "automatic, implicit order-specific short-circuiting", which is what most developers expect from C,C++,C#,Java, etc. Both VB and SQL have ways to explicitly force order-specific short-circuiting. However, usually when people ask this question it's a "Do What I Meant" question; that is, they mean "why doesn't it Do What I Want?", as in, automatically short-circuit in the order that I wrote it.
One benefit I can think of is that some operations might have side-effects that you might expect to happen.
Example:
if (true || someBooleanFunctionWithSideEffect()) {
...
}
But that's typically frowned upon.
Ada does not do it by default. In order to force short-circuit evaluation, you have to use and then or or else instead of and or or.
The issue is that there are some circumstances where it actually slows things down. If the second condition is quick to calculate and the first condition is almost always true for "and" or false for "or", then the extra check-branch instruction is kind of a waste. However, I understand that with modern processors with branch predictors, this isn't so much the case. Another issue is that the compiler may happen to know that the second half is cheaper or likely to fail, and may want to reorder the check accordingly (which it couldn't do if short-circuit behavior is defined).
I've heard objections that it can lead to unexpected behavior of the code in the case where the second test has side effects. IMHO it is only "unexpected" if you don't know your language very well, but some will argue this.
In case you are interested in what actual language designers have to say about this issue, here's an excerpt from the Ada 83 (original language) Rationale:
The operands of a boolean expression
such as A and B can be evaluated in
any order. Depending on the complexity
of the term B, it may be more
efficient (on some but not all
machines) to evaluate B only when the
term A has the value TRUE. This
however is an optimization decision
taken by the compiler and it would be
incorrect to assume that this
optimization is always done. In other
situations we may want to express a
conjunction of conditions where each
condition should be evaluated (has
meaning) only if the previous
condition is satisfied. Both of these
things may be done with short-circuit
control forms ...
In Algol 60 one can achieve the effect
of short-circuit evaluation only by
use of conditional expressions, since
complete evaluation is performed
otherwise. This often leads to
constructs that are tedious to follow...
Several languages do not define how
boolean conditions are to be
evaluated. As a consequence programs
based on short-circuit evaluation will
not be portable. This clearly
illustrates the need to separate
boolean operators from short-circuit
control forms.
Look at my example at On SQL Server boolean operator short-circuit which shows why a certain access path in SQL is more efficient if boolean short circuit is not used. My blog example it shows how actually relying on boolean short-circuit can break your code if you assume short-circuit in SQL, but if you read the reasoning why is SQL evaluating the right hand side first, you'll see that is correct and this result in a much improved access path.
Bill has alluded to a valid reason not to use short-circuiting but to spell it in more detail: highly parallel architectures sometimes have problem with branching control paths.
Take NVIDIA’s CUDA architecture for example. The graphics chips use an SIMT architecture which means that the same code is executed on many parallel threads. However, this only works if all threads take the same conditional branch every time. If different threads take different code paths, evaluation is serialized – which means that the advantage of parallelization is lost, because some of the threads have to wait while others execute the alternative code branch.
Short-circuiting actually involves branching the code so short-circuit operations may be harmful on SIMT architectures like CUDA.
– But like Bill said, that’s a hardware consideration. As far as languages go, I’d answer your question with a resounding no: preventing short-circuiting does not make sense.
I'd say 99 times out of 100 I would prefer the short-circuiting operators for performance.
But there are two big reasons I've found where I won't use them.
(By the way, my examples are in C where && and || are short-circuiting and & and | are not.)
1.) When you want to call two or more functions in an if statement regardless of the value returned by the first.
if (isABC() || isXYZ()) // short-circuiting logical operator
//do stuff;
In that case isXYZ() is only called if isABC() returns false. But you may want isXYZ() to be called no matter what.
So instead you do this:
if (isABC() | isXYZ()) // non-short-circuiting bitwise operator
//do stuff;
2.) When you're performing boolean math with integers.
myNumber = i && 8; // short-circuiting logical operator
is not necessarily the same as:
myNumber = i & 8; // non-short-circuiting bitwise operator
In this situation you can actually get different results because the short-circuiting operator won't necessarily evaluate the entire expression. And that makes it basically useless for boolean math. So in this case I'd use the non-short-circuiting (bitwise) operators instead.
Like I was hinting at, these two scenarios really are rare for me. But you can see there are real programming reasons for both types of operators. And luckily most of the popular languages today have both. Even VB.NET has the AndAlso and OrElse short-circuiting operators. If a language today doesn't have both I'd say it's behind the times and really limits the programmer.
If you wanted the right hand side to be evaluated:
if( x < 13 | ++y > 10 )
printf("do something\n");
Perhaps you wanted y to be incremented whether or not x < 13. A good argument against doing this, however, is that creating conditions without side effects is usually better programming practice.
As a stretch:
If you wanted a language to be super secure (at the cost of awesomeness), you would remove short circuit eval. When something 'secure' takes a variable amount of time to happen, a Timing Attack could be used to mess with it. Short circuit eval results in things taking different times to execute, hence poking the hole for the attack. In this case, not even allowing short circuit eval would hopefully help write more secure algorithms (wrt timing attacks anyway).
The Ada programming language supported both boolean operators that did not short circuit (AND, OR), to allow a compiler to optimize and possibly parallelize the constructs, and operators with explicit request for short circuit (AND THEN, OR ELSE) when that's what the programmer desires. The downside to such a dual-pronged approach is to make the language a bit more complex (1000 design decisions taken in the same "let's do both!" vein will make a programming language a LOT more complex overall;-).
Not that I think this is what's going on in any language now, but it would be rather interesting to feed both sides of an operation to different threads. Most operands could be pre-determined not to interfere with each other, so they would be good candidates for handing off to different CPUs.
This kins of thing matters on highly parallel CPUs that tend to evaluate multiple branches and choose one.
Hey, it's a bit of a stretch but you asked "Why would a language"... not "Why does a language".
The language Lustre does not use short-circuit evaluation. In if-then-elses, both then and else branches are evaluated at each tick, and one is considered the result of the conditional depending on the evaluation of the condition.
The reason is that this language, and other synchronous dataflow languages, have a concise syntax to speak of the past. Each branch needs to be computed so that the past of each is available if it becomes necessary in future cycles. The language is supposed to be functional, so that wouldn't matter, but you may call C functions from it (and perhaps notice they are called more often than you thought).
In Lustre, writing the equivalent of
if (y <> 0) then 100/y else 100
is a typical beginner mistake. The division by zero is not avoided, because the expression 100/y is evaluated even on cycles when y=0.
Because short-circuiting can change the behavior of an application IE:
if(!SomeMethodThatChangesState() || !SomeOtherMethodThatChangesState())
I'd say it's valid for readability issues; if someone takes advantage of short circuit evaluation in a not fully obvious way, it can be hard for a maintainer to look at the same code and understand the logic.
If memory serves, erlang provides two constructs, standard and/or, then andalso/orelse . This clarifies intend that 'yes, I know this is short circuiting, and you should too', where as at other points the intent needs to be derived from code.
As an example, say a maintainer comes across these lines:
if(user.inDatabase() || user.insertInDatabase())
user.DoCoolStuff();
It takes a few seconds to recognize that the intent is "if the user isn't in the Database, insert him/her/it; if that works do cool stuff".
As others have pointed out, this is really only relevant when doing things with side effects.
I don't know about any performance issues, but one possible argumentation to avoid it (or at least excessive use of it) is that it may confuse other developers.
There are already great responses about the side-effect issue, but I didn't see anything about the performance aspect of the question.
If you do not allow short-circuit evaluation, the performance issue is that both sides must be evaluated even though it will not change the outcome. This is usually a non-issue, but may become relevant under one of these two circumstances:
The code is in an inner loop that is called very frequently
There is a high cost associated with evaluating the expressions (perhaps IO or an expensive computation)
The short-circuit evaluation automatically provides conditional evaluation of a part of the expression.
The main advantage is that it simplifies the expression.
The performance could be improved but you could also observe a penalty for very simple expressions.
Another consequence is that side effects of the evaluation of the expression could be affected.
In general, relying on side-effect is not a good practice, but in some specific context, it could be the preferred solution.
VB6 doesn't use short-circuit evaluation, I don't know if newer versions do, but I doubt it. I believe this is just because older versions didn't either, and because most of the people who used VB6 wouldn't expect that to happen, and it would lead to confusion.
This is just one of the things that made it extremely hard for me to get out of being a noob VB programmer who wrote spaghetti code, and get on with my journey to be a real programmer.
Many answers have talked about side-effects. Here's a Python example without side-effects in which (in my opinion) short-circuiting improves readability.
for i in range(len(myarray)):
if myarray[i]>5 or (i>0 and myarray[i-1]>5):
print "At index",i,"either arr[i] or arr[i-1] is big"
The short-circuit ensures we don't try to access myarray[-1], which would raise an exception since Python arrays start at 0. The code could of course be written without short-circuits, e.g.
for i in range(len(myarray)):
if myarray[i]<=5: continue
if i==0: continue
if myarray[i-1]<=5: continue
print "At index",i,...
but I think the short-circuit version is more readable.

How is this MySQL query vulnerable to SQL injection?

In a comment on a previous question, someone said that the following sql statement opens me up to sql injection:
select
ss.*,
se.name as engine,
ss.last_run_at + interval ss.refresh_frequency day as next_run_at,
se.logo_name
from
searches ss join search_engines se on ss.engine_id = se.id
where
ss.user_id='.$user_id.'
group by ss.id
order by ss.project_id, ss.domain, ss.keywords
Assuming that the $userid variable is properly escaped, how does this make me vulnerable, and what can I do to fix it?
Every SQL interface library worth using has some kind of support for binding parameters. Don't try to be clever, just use it.
You may really, really think/hope you've escaped stuff properly, but it's just not worth the time you don't.
Also, several databases support prepared statement caching, so doing it right can also bring you efficiency gains.
Easier, safer, faster.
Assuming it is properly escaped, it doesn't make you vulnerable. The thing is that escaping properly is harder than it looks at first sight, and you condemn yourself to escape properly every time you do a query like that. If possible, avoid all that trouble and use prepared statements (or binded parameters or parameterized queries). The idea is to allow the data access library to escape values properly.
For example, in PHP, using mysqli:
$db_connection = new mysqli("localhost", "user", "pass", "db");
$statement = $db_connection->prepare("SELECT thing FROM stuff WHERE id = ?");
$statement->bind_param("i", $user_id); //$user_id is an integer which goes
//in place of ?
$statement->execute();
If $user_id is escaped, then you should not be vulnerable to SQL Injection.
In this case, I would also ensure that the $user_id is numeric or an integer (depending on the exact type required). You should always limit the data to the most restrictive type you can.
If it is properly escaped and validated, then you don't have a problem.
The problem arises when it is not properly escaped or validated. This could occur by sloppy coding or an oversight.
The problem is not with particular instances, but with the pattern. This pattern makes SQL injection possible, while the other pattern makes it impossible.
All answers are good and right, but I feel I need to add that the prepare/execute paradigm is not the only solution, either. You should have a database abstraction layer, rather than using the library functions directly and such a layer is a good place to explicitly escape string parameters, whether you let prepare do it, or you do it yourself.
I think 'Properly Escaped' here is the keyword. In your last question, I'm making the assumption that your code is copy pasted from your production code, and since you asked question about three tables join, I also make the assumption that you didn't do proper escaping, hence my remark on SQL Injection attack.
To answer your question, as so many people here has described, IF the variable has been 'Properly Escaped', then you have no problem. But why trouble yourself by doing that? As some people have pointed out, sometimes Properly Escaping is not a straightforward thing to do. There are patterns and library in PHP that makes SQL Injection impossible, why don't we just use that? (I also deliberately make assumption that your code is in fact PHP). Vinko Vrsalovic answer may give you ideas on how to approach this problem.
That statement as such isn't really a problem, its "safe", however I don't know how you are doing this (one level up on the API stack). If $user_id is getting inserted into the statement using string operations (like as if you are letting Php automatically fill out the statement) then its dangerous.
If its getting filled in using a binding API, then your ready to go.