I have constructed a query and I'm wondering if it would work on any database besides MySQL. I have never actually used another database so I'm not great with the differences.
UPDATE `locks` AS `l1`
CROSS JOIN (SELECT SUM(`value`) AS `sum` FROM `locks`
WHERE `key` IN ("key3","key2")) AS `l2`
SET `l1`.`value` = `l1`.`value` + 1
WHERE `l1`.`key` = "key1" AND (`l2`.`sum` < 1);
Here are the specific features I'm relying on (as I can think of them):
Update queries.
Joins in update queries.
Aggregate functions in non-explicitly-grouped queries.
WHERE...IN condition.
I'm sure people will be curious exactly what this does, and this may also include database features that might not be ubiquitous. This is an implementation of mutual exclusion using a database, intended for a web application. In my case I needed it because certain user actions cause tables to be dropped and recreated with different columns, and I want to avoid errors if other parts of the application try to insert data. The implementation, therefore, is specialized to solve the readers-writers problem.
This query assumes there exists a table locks with two fields: key (varchar) and value (int). It further assumes that the table contains a row such that key="key1". Then it tries to increment the value for "key1". It only does so if for every key in the list ("key2","key3"), the associated value is 0 (the WHERE condition for l2 is an approximation that assumes value is never negative). Therefore this query only "obtains a lock" if certain conditions are met, presumably in an atomic fashion. Then, the application checks if it received a lock by the return value of the query which presumably states how many rows were affected. If and only if no rows were affected, the application did not receive a lock.
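For reference, the table this assumes could be created like so (the column length and seed rows are just illustrative):

CREATE TABLE `locks` (
  `key` VARCHAR(32) NOT NULL PRIMARY KEY,
  `value` INT NOT NULL DEFAULT 0
);

INSERT INTO `locks` (`key`, `value`)
VALUES ('key1', 0), ('key2', 0), ('key3', 0);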
So, here are the additional conditions not discernible from the query itself:
Assumes that in a multi-threaded environment, a copy of this query will never be interleaved with another copy.
Processing the query must return whether any values were affected.
As a secondary request, I would appreciate any resources on "standard SQL." I've heard about it but never been able to find any kind of definition, and I feel like I'm missing a lot of things when the MySQL documentation says "this feature is an extension of standard SQL."
Based on the responses, this query should work better across all systems:
UPDATE locks AS l1
CROSS JOIN (SELECT SUM(val) AS others FROM locks
WHERE keyname IN ('key3','key2')) AS l2
SET l1.val = l1.val + 1
WHERE l1.keyname = 'key1' AND (l2.others < 1);
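For completeness, releasing the lock would presumably be the symmetric decrement (an untested sketch, with a guard against going negative):

UPDATE locks
SET val = val - 1
WHERE keyname = 'key1' AND val > 0;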
Upvotes for everyone because of the good answers. The marked answer seeks to directly answer my question, even if just for one other DBMS, and even though there may be better solutions to my particular problem (or even the problem of cross-platform SQL in general).
This exact syntax would only work in MySQL.
It's an ugly workaround for this construct:
UPDATE locks
SET value = 1
WHERE key = 'key1'
AND NOT EXISTS
(
SELECT NULL
FROM locks li
WHERE li.key IN ('key2', 'key3')
AND li.value > 0
)
which works in all systems except MySQL, because the latter does not allow subqueries on the target table in UPDATE or DELETE statements.
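If you do need the NOT EXISTS form on MySQL itself, a common trick is to hide the target table behind a derived table, which forces materialization and so sidesteps error 1093 (a sketch; on 5.7+ the optimizer may merge the derived table back, in which case adding something like a LIMIT inside it forces materialization again):

UPDATE locks
SET `value` = 1
WHERE `key` = 'key1'
AND NOT EXISTS
(
    SELECT NULL
    FROM (SELECT `key`, `value` FROM locks) AS li  -- derived table: no longer the update target
    WHERE li.`key` IN ('key2', 'key3')
    AND li.`value` > 0
);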
For PostgreSQL
1) Update queries.
Can't imagine an RDBMS that has no UPDATE. (?)
2) Joins in update queries.
In PostgreSQL you would include additional tables with FROM from_list (see the sketch after this list).
3) Aggregate functions in non-grouped queries.
Not possible in PostgreSQL; use subqueries, CTEs, or window functions for that.
But your query is grouped: the GROUP BY clause is just not spelled out. That works in PostgreSQL, too.
The presence of HAVING turns a query into a grouped query even if
there is no GROUP BY clause. This is the same as what happens when the
query contains aggregate functions but no GROUP BY clause.
(Quote from the manual).
4) WHERE...IN condition
Works in any RDBMS I know of.
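To make 2) and 3) concrete, the posted query might be ported to PostgreSQL like this (a sketch, untested against the poster's schema): the derived table moves into the FROM list, and the implicitly grouped subquery is legal as-is.

UPDATE locks AS l1
SET    val = l1.val + 1
FROM  (SELECT sum(val) AS others
       FROM   locks
       WHERE  keyname IN ('key2', 'key3')) AS l2
WHERE  l1.keyname = 'key1'
AND    l2.others < 1;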
"Additional conditions": Assumes that in a multi-threaded environment, a copy of this query will never be interleaved with another copy.
PostgreSQL's multiversion concurrency model (MVCC) is superior to MySQL's for handling concurrency. Then again, most RDBMSs are superior to MySQL in this respect.
Processing the query must return whether any values were affected.
Postgres does that; almost every RDBMS does.
Furthermore, this query wouldn't run in PostgreSQL because:
no identifiers with backticks (that's MySQL slang).
values need to be single-quoted, not double-quoted.
See the list of reserved words in Postgres and the SQL standards, as well as a combined list for various RDBMSs.
This will only work in MySQL, simply because you use the backtick (`) delimiter, which is MySQL-specific.
If you replace the delimiter with a more "standard" one, then it will probably work in all modern DBMSs (Postgres, SQL Server, Oracle). But I would never write one general query for all of them; I would rather write a specific query for each DBMS used (or potentially used), to exploit its specific dialect for the best performance and query readability.
As for "any resources on standard SQL": take a look at http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
Here is the GROUP_CONCAT() tutorial on GeeksforGeeks.
In "Queries 2", the output is in ascending order, but there is no ORDER BY clause.
(Picture of "Queries 2" omitted.)
Could anyone tell me why?
Any help would be really appreciated!
This is one of those oddballs where there is likely an implicit sort happening behind the scenes as MySQL optimizes the DISTINCT execution.
You can test this yourself pretty easily:
CREATE TABLE t1 (c1 VARCHAR(50));
INSERT INTO t1 VALUES ('zebra'),('giraffe'),('cattle'),('fox'),('octopus'),('yak');
SELECT GROUP_CONCAT(c1) FROM t1;
SELECT GROUP_CONCAT(DISTINCT c1) FROM t1;
GROUP_CONCAT(c1)
zebra,giraffe,cattle,fox,octopus,yak
GROUP_CONCAT(DISTINCT c1)
cattle,fox,giraffe,octopus,yak,zebra
It's not uncommon to find sorted results where no ORDER BY was specified. Window function output is a good example of this.
You can imagine if you were tasked, as a human, to only pick distinct items from a list. You would likely first sort the list and then pick out duplicates, right? And when you hand the list back to the person that requested this from you, you wouldn't scramble the data back up to be unsorted, I would assume. Why do the extra work? What you are seeing here is a byproduct of the optimized execution path chosen by the mysql server.
The key takeaway is "byproduct". If I specifically wanted the output of GROUP_CONCAT to be sorted, I would specify exactly what I want and not rely on this implicit sorting behavior. We can't guess what the execution path will be. An RDBMS makes a lot of decisions when SQL is submitted in order to optimize execution, and depending on data size and the other steps it needs to take in the SQL, this behavior may appear in one SQL statement and not another. Likewise, it may work one day and not another.
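For example, GROUP_CONCAT accepts its own ORDER BY (and SEPARATOR) inside the call, so against the test table above you can request the order explicitly:

SELECT GROUP_CONCAT(DISTINCT c1 ORDER BY c1) FROM t1;
-- cattle,fox,giraffe,octopus,yak,zebra  (now guaranteed, not a byproduct)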
TL;DR Never omit an ORDER BY clause from a query if you rely on the order for something.
Will DISTINCT automatically sort the result in MySQL?
No. NO! Be careful!
SQL is all about sets of rows. Without ORDER BY clauses, SQL queries return the rows of their result sets in an "unpredictable" order. "Unpredictable" is like random, but worse. If the order is truly random, you have a chance to catch any ordering problem when you're testing. Unpredictable means the server returns rows in any convenient order. This means everything works as you expect until some day in the future when it doesn't, without warning. (MySQL might start using some kind of parallel algorithm in the future.)
Now it is true that DISTINCT result sets from modestly sized tables are often generated using a sorting/deduplicating algorithm in the server. But that is an implementation detail. MySQL and other table servers are complex enough that relying on implementation details is not wise. The good news: if you include an ORDER BY clause matching the order that methodology generates, performance is usually not changed.
SQL is declarative, not procedural. We specify what we want, not how to get it. It's probably the only declarative language most of us ever see, so it's easy to make the mistake of thinking it is procedural.
When writing SQL queries with several WHERE conditions (I only work with MySQL and SQLite), I often wonder whether I should reorder the conditions to put the "best ones" first (those that will filter out the most rows) and other "cosmetic" conditions later (those that barely change the output). In other words, will I really help the optimizer run faster by reordering conditions (especially when indexes are in play), or could this be another case of premature optimization? Optimizers are usually smarter than me.
For example:
select address.* from address inner join
user on address.user = user.id
where address.zip is not null and address.country == user.country
If we know that address.zip is usually not null, that check will be true, say, 90% of the time; if the written order were respected, there would be a lot of pointless checks that could be avoided by placing the country check first.
Should I take care of that? In other words, does the order of WHERE conditions matter or not?
The MySQL optimizer is well documented, and you can find many interesting considerations in the official documents: http://dev.mysql.com/doc/refman/5.7/en/where-optimizations.html
One very simple fact should be taken into account: SQL is not a procedural language but a declarative one. This means the order in which the parts are written is not important; what matters is only which elements have been declared. This is evident in MySQL's optimization documentation, where the focus is solely on the components of a query and how the optimizer transforms them into internal components.
The order is mostly irrelevant.
In MySQL, with WHERE ... AND ...,
The Optimizer will first look for which part can use an index. If one can and one can't, the optimizer will use the index; the order becomes irrelevant
If both sides of the AND can use an index, MySQL will usually pick the 'better' one. (Sometimes it goofs.) Again, the order is ignored.
If neither side can use an index, it is evaluated left to right. But... fetching rows is the bulk of the effort in performing the query, so if one side of the AND is a little slower than the other, you probably won't notice. (Sure, if one side does SLEEP(3) you will notice.)
There's another issue in your example query (aside from the syntax error): The Optimizer will make a conscious decision of which table to start with.
If it decides to start with user, address needs INDEX(user, country) in either order.
If it decides to start with address, user needs (id, country) in either order.
It is unclear whether the Optimizer will bother with the NOT NULL test, even if that column is indexed.
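For the example query, those composite indexes would look something like this (the index names are mine):

-- if the Optimizer starts with user, this covers the probe into address
CREATE INDEX idx_address_user_country ON address (user, country);

-- if it starts with address, this covers the probe into user
CREATE INDEX idx_user_id_country ON user (id, country);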
Bottom line: Spend your time focusing on optimal indexes.
The answer is definitely maybe.
Mysterious are the ways of the optimizer.
Here is a demonstration based on exceptions caused by division by zero.
create table t (i int);
insert into t (i) values (0);
The following query succeeds for Oracle, SQL Server, Postgres and Teradata (we'll skip the version information for now):
select 1 from t where i < 1 or 1/i < 1;
The following query fails for SQL Server and Postgres but succeeds for Oracle and Teradata
select 1 from t where 1/i < 1 or i < 1;
However, the following query does fail for Oracle and Teradata:
select 1 from t where 1/i < 1 or i/1 < 1;
What do we learn?
That some optimizers seem to respect the order of the predicates (at least in some manner) and some seem to reorder the predicates by their estimated cost (e.g. 1/i < 1 is costlier than i < 1, but not costlier than i/1 < 1).
For those that respect the order of the predicates, we can probably improve performance by putting the lightweight predicates first for OR operators and the frequently false predicates first for AND operators.
That being said, since databases do not guarantee to preserve the order of the predicates, even if some of them currently seem to do so, you definitely can't count on it.
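If a particular evaluation order actually matters for correctness, the usual portable idiom is to encode it explicitly with CASE, whose branches are (generally) evaluated in order, rather than to rely on predicate placement. A sketch against the same table:

select 1 from t
where i < 1
   or (case when i = 0 then null else 1/i end) < 1;  -- the division is now guarded explicitly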
MySQL 5.7.11
This query returns immediately:
select 1 from t where i < 1 or sleep(3);
This query returns after 3 seconds:
select 1 from t where sleep(3) or i < 1;
I've visited one interesting job interview recently. There I was asked a question about optimizing a query with a WHERE..IN clause containing long list of scalars (thousands of values, that is). This question is NOT about subqueries in the IN clause, but about simple list of scalars.
I answered right away, that this can be optimized using an INNER JOIN with another table (possibly temporary one), which will contain only those scalars. My answer was accepted and there was a note from the reviewer, that "no database engine currently can optimize long WHERE..IN conditions to be performant enough". I nodded.
But when I walked out, I started to have some doubts. The condition seemed rather trivial and widely used for modern RDBMS not to be able to optimize it. So, I started some digging.
PostgreSQL:
It seems that PostgreSQL parses scalar IN() constructions into a ScalarArrayOpExpr structure, which is sorted. This structure is later used during an index scan to locate matching rows. EXPLAIN ANALYZE for such queries shows only one loop; no joins are done. So I would expect such a query to be even faster than an INNER JOIN. I tried some queries on my existing database and my tests supported that position. But I didn't care much about test purity, and Postgres was running under Vagrant, so I might be wrong.
MSSQL Server:
MSSQL Server builds a hash structure from the list of constant expressions and then does a hash join with the source table. Even though no sorting seems to be done, that should be a match performance-wise, I think. I didn't do any tests since I don't have any experience with this RDBMS.
MySQL Server:
The 13th of these slides says that before 5.0 this problem indeed existed in MySQL in some cases. But other than that, I didn't find any other problem related to bad IN() treatment. Unfortunately, I didn't find any proof of the opposite either. If you did, please kick me.
SQLite:
The documentation page hints at some problems, but I tend to believe the things described there are really at a conceptual level. No other information was found.
So, I'm starting to think I misunderstood my interviewer or misused Google ;) Or maybe it's because we didn't set any conditions and our talk became a little vague (we didn't specify any concrete RDBMS or other constraints; it was just an abstract talk).
It looks like the days when databases rewrote IN() as a set of OR predicates (which, by the way, can sometimes cause problems with NULL values in lists) are long gone. Or are they?
Of course, in cases where the list of scalars is longer than the database protocol packet allows, an INNER JOIN might be the only solution available.
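For concreteness, the join alternative I gave in the interview would look something like this (all names are made up, and temporary-table syntax varies by RDBMS):

CREATE TEMPORARY TABLE wanted_keys (id INT PRIMARY KEY);
INSERT INTO wanted_keys (id) VALUES (1), (2), (3);  -- ... thousands more

SELECT t.*
FROM big_table t
JOIN wanted_keys w ON w.id = t.id;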
I think that in some cases query parsing time alone (if the query was not prepared) can kill performance.
Also, databases could be unable to prepare an IN(?) query, which would lead to reparsing it again and again (which may kill performance). Actually, I never tried, but I think that even in such cases query parsing and planning are not huge compared to query execution.
But other than that, I do not see other problems. Well, other than the problem of just HAVING this problem: if you have queries that contain thousands of IDs inside, something is wrong with your architecture.
Do you?
Your answer is only correct if you build an index (preferably a primary key index) on the list, unless the list is really small.
Any description of optimization is definitely database-specific. However, MySQL is quite specific about how it optimizes IN:
Returns 1 if expr is equal to any of the values in the IN list, else
returns 0. If all values are constants, they are evaluated according
to the type of expr and sorted. The search for the item then is done
using a binary search. This means IN is very quick if the IN value
list consists entirely of constants.
This would definitely be a case where using IN would be faster than using another table -- and probably faster than another table using a primary key index.
I think that SQL Server replaces the IN with a list of ORs. These would then be implemented as sequential comparisons. Note that sequential comparisons can be faster than a binary search, if some elements are much more common than others and those appear first in the list.
I think it is bad application design. The values used with the IN operator are most probably not hardcoded but dynamic. In such cases we should always use prepared statements, the only reliable mechanism to prevent SQL injection.
This will, in each case, result in dynamically formatting the prepared statement (as the number of placeholders is dynamic too), and it will also result in excessive hard parsing (as many unique queries as there are distinct counts of IN values: IN (?), IN (?,?), ...).
I would either load these values into a table and use a join, as you mentioned (unless the loading is too much overhead), or use an Oracle pipelined function, IN foo(params), where the params argument can be a complex structure (an array) coming from memory (PL/SQL, Java, etc.).
If the number of values is larger, I would consider using EXISTS (SELECT ... FROM mytable m WHERE m.key = x.key) or EXISTS (SELECT ... FROM foo(params)) instead of IN. In such cases EXISTS provides better performance than IN.
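Spelled out, that EXISTS variant would be something like the following, where mytable holds the staged values (names are illustrative):

SELECT x.*
FROM big_table x
WHERE EXISTS (SELECT 1 FROM mytable m WHERE m.key = x.key);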
I have only used SQL rarely until recently when I began using it daily. I notice that if no "order by" clause is used:
When selecting part of a table the rows returned appear to be in the same order as they appear if I select the whole table
The order of rows returned by a select from a join seems to be determined by the leftmost member of the join.
Is this behaviour a standard thing one can count on in the most common databases (MySQL, Oracle, PostgreSQL, SQLite, SQL Server)? (I don't really even know whether one can truly count on it in SQLite.) How strictly is it honored, if so (e.g., if one uses GROUP BY, would the individual groups each have that ordering)?
If no ORDER BY clause is included in the query, the returned order of rows is undefined.
Whilst some RDBMSes will return rows in specific orders in some situations even when an ORDER BY clause is omitted, such behaviour should never be relied upon.
Section 20.2 <direct select statement: multiple rows>, subsection "General Rules" of
the SQL-92 specification:
4) If an <order by clause> is not specified, then the ordering of
the rows of Q is implementation-dependent.
If you want order, include an ORDER BY. If you don't include an ORDER BY, you're telling SQL Server:
I don't care what order you return the rows, just return the rows
Since you don't care, SQL Server is going to decide how to return the rows in whatever manner it deems most efficient right now (or according to the last time the plan for this specific query was cached). Therefore you should not rely on the behavior you observe. It can change from one run of the query to the next, with data changes, statistics changes, index changes, service packs, cumulative updates, upgrades, etc. etc. etc.
For PostgreSQL, if you omit the ORDER BY clause you could run the exact same query 100 times while the database is not being modified, and get one run in the middle in a different order than the others. In fact, each run could be in a different order.
One reason this could happen is that if the chosen plan involves a sequential scan of a table's heap, and there is already a seqscan of that table's heap in progress, your query will start its scan at whatever point the other scan has already reached, to reduce the need for disk access.
As other answers have pointed out, if you want the data in a certain order, specify that order. PostgreSQL will take the requested order into consideration in choosing a plan, and may use an index that provides data in that order, if that works out to be cheaper than getting the rows some other way and then sorting them.
GROUP BY provides no guarantee of order; PostgreSQL might sort the data to do the grouping, or it might use a hash table and return the rows in order of the number generated by the hashing algorithm (i.e., pretty random). And that might change from one run to the next.
It never ceased to amaze me when I was a DBA that this feature of SQL was so often thought of as quirky. Consider a simple program that runs against a text file and produces some output. If the program never changes, and the data never changes, you'd expect the output to never change.
As for this:
If no ORDER BY clause is included in the query, the returned order of rows is undefined.
Not strictly true - on every RDBMS I've ever worked on (Oracle, Informix, SQL Server, DB2 to name a few) a DISTINCT clause also has the same effect as an ORDER BY as finding unique values involves a sort by definition.
EDIT (6/2/14): The original demonstration was posted as screenshots (omitted here): create a simple table; for DISTINCT and for ORDER BY, both the plan and the cost are the same, since it is ostensibly the same operation to be performed; and, not surprisingly, the effect is thus the same.
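In outline (the original plans were posted as images), the demonstration was along these lines:

CREATE TABLE demo (val INT);
INSERT INTO demo VALUES (3);
INSERT INTO demo VALUES (1);
INSERT INTO demo VALUES (2);
INSERT INTO demo VALUES (1);

SELECT DISTINCT val FROM demo;      -- often comes back sorted: 1, 2, 3
SELECT val FROM demo ORDER BY val;  -- guaranteed sorted: 1, 1, 2, 3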
A friend of mine and I recently were having a debate about this, so I was wondering if anyone here might actually know the answer.
Generally, to emulate MySQL's LIMIT start_index, result_count syntax and the LIMIT result_count OFFSET start_index functionality of MySQL and PostgreSQL in Oracle, we use:
SELECT P.* FROM
( SELECT COL1...COLN, ROW_NUMBER() OVER (ORDER BY ID_COLUMN) AS RN FROM MY_TABLE ) P
WHERE P.RN BETWEEN START_INDEX AND END_INDEX;
In the absence of an explicit LIMIT clause, this alternative has to be used.
(If there is a better way, please let me know)
One of us argued that this means that Oracle actually fetches END_INDEX records and then returns only those records whose RN is above START_INDEX. This means that when looking for records 123,432-123,442, Oracle would retrieve 123,431 unnecessary records. It was then argued that it follows, by implication, that the two open-source DBs mentioned (MySQL & PostgreSQL) have a means of shortcutting this.
The counter-argument is that DBMSs are optimized to handle subqueries, meaning that the syntax does not necessarily imply the behavior. Further, the LIMIT syntaxes are likely mere syntactic sugar, really wrappers around what has to be stated explicitly in Oracle.
Is there any who can determine which of these is the correct interpretation? Perhaps both are correct in some way?
It is correct that Oracle processes END_INDEX rows and discards rows 1 to START_INDEX-1, but it only fetches rows START_INDEX to END_INDEX in the cursor.
I don't know how LIMIT is implemented in the other DBMSs, but I can't imagine how they could do otherwise: how would they know they are fetching the 123,432nd row of the result set without first finding and discarding the previous 123,431 rows?
In practice, if you find yourself applying a LIMIT clause (or Oracle equivalent) with a START_INDEX of more than a few hundreds, you really need to rethink your requirements.
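Worth adding: the SQL standard now defines its own pagination clause (OFFSET ... FETCH), which Oracle supports from 12c onward, SQL Server from 2012, and PostgreSQL as well, so the ROW_NUMBER() wrapper is no longer the only option:

SELECT col1, col2
FROM my_table
ORDER BY id_column
OFFSET 123431 ROWS FETCH NEXT 11 ROWS ONLY;  -- rows 123,432 through 123,442

Internally, of course, the engine still has to produce and discard the first 123,431 rows, which is precisely the point above.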