I would like to know the best practice, performance-wise, when a SELECT query passed to a stored procedure can receive any combination of a large number (20+) of parameters in its WHERE clause.
Let's say I have a query that should return a list of people and their addresses (possibly more than one address per person). The user wants to search by any possible combination of fields from the person/address tables. The search could be on one field, all 20, or anything in between.
The way I usually handle this is by creating one cursor like this (for simplicity I am listing only two parameters, a varchar and an int):
create procedure dynasp (
    in in_name varchar(40),
    in in_age int
    ..... rest of parameters here...
)
declare cursor cs for
    select .... from person join address ....
    where
        (in_age = 0 or in_age = person_age) and
        (in_name is null or rtrim(in_name) = '' or in_name = person.name)
        and ...
I believe that since the value of an input variable is constant, the query should not have to evaluate those conditions on each row, or does it?
The other option I use is a dynamic cursor built from a string inside the SP. That way the WHERE clause contains only the fields that are not empty, but I believe it means the SQL has to be constructed and recompiled on every call to the SP.
My question is: which of the two methods above is recommended as a best practice, and is there a better way than either of them?
Thank you
The question of performance basically hinges on one simple thing: does your table have any indexes that you intend to use to improve the performance of the query?
If indexes aren't an issue, then your approach is fine. Well, let me add: assuming a cursor is necessary for the additional processing that you are doing. If you can just return the result set and do set-based processing, that is superior to using cursors.
If indexes are an issue, then a long, complex WHERE clause with a bunch of constant expressions might confuse the MySQL optimizer. The documentation on using indexes for WHERE clauses is here. MySQL definitely removes constant expressions. However, in a very complex expression, I'm not sure how well this interacts with choosing the right index. (I am assuming you are using MySQL based on the syntax.)
For this latter case, a dynamic cursor would be beneficial, because it would encourage MySQL to choose an execution plan that uses indexes.
So, if you are not using indexing (or partitioning), then your current approach is fine. If you are, look at the execution plan for your queries. If they use the appropriate indexes, then your current approach is fine. If they are not, consider dynamic cursors.
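For the dynamic option, here is a minimal sketch of what building and preparing such a statement inside a MySQL stored procedure could look like. The parameter names come from the question; the procedure name, columns, and join condition are assumptions for illustration:

create procedure dynasp_dynamic (in in_name varchar(40), in in_age int)
begin
    -- start from an always-true condition so every added predicate can begin with AND
    set @sql = 'select p.name, p.person_age, a.street
                from person p join address a on a.person_id = p.id
                where 1 = 1';
    if in_age > 0 then
        set @sql = concat(@sql, ' and p.person_age = ', in_age);
    end if;
    if in_name is not null and rtrim(in_name) <> '' then
        -- quote() escapes the value, guarding the string concatenation against SQL injection
        set @sql = concat(@sql, ' and p.name = ', quote(in_name));
    end if;
    prepare stmt from @sql;
    execute stmt;
    deallocate prepare stmt;
end

Because only the predicates that were actually supplied end up in the statement, the optimizer sees a simple WHERE clause and can pick an index for it; the price is that the statement is parsed and planned on every call.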
It depends on the size of the tables you're searching. If you have 20+ criteria in the WHERE clause, all containing OR conditions, the query optimizer will not be able to choose a good index to use and will likely just scan the entire table(s). For small tables this won't matter, but for very large tables it will be slow.
The other alternative, constructing a dynamic query, will incur some overhead in parsing and choosing a query plan, but the resulting query will likely execute more efficiently. (Make sure you're protecting against SQL injection vulnerabilities.)
So the best practice is to benchmark both and see what's best in your situation.
I have a view (say 'v') that is the combination of 10 tables using several joins and complex calculations. The view has around 10 thousand rows.
And then I select one row from it, as in WHERE id = 23456.
Another possible way is to use a larger query in which I can cut the dataset down to 1% before the complex calculations start.
Question: Are SQL views optimized in some form?
MySQL views are just syntactic sugar. There is no special optimization. Think of a view as being textually merged into the query and then optimized. That is, you could get the same optimizations (or not) by manually writing the equivalent SELECT.
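To illustrate the merging (a hedged sketch; the view, tables, and columns here are made up, not taken from the question):

-- a view over a join
create view v as
select p.id, p.name, a.city
from person p
join address a on a.person_id = p.id;

-- this query against the view ...
select * from v where id = 23456;

-- ... is optimized as if you had written the merged query yourself:
select p.id, p.name, a.city
from person p
join address a on a.person_id = p.id
where p.id = 23456;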
If you would like to discuss the particular query further, please provide SHOW CREATE TABLE/VIEW and EXPLAIN SELECT .... It may be that you are missing a useful 'composite' index.
I have a view which queries 2 tables that don't change often (they are updated once or twice a day and have at most 2000 and 1000 rows, respectively).
Which algorithm should perform better, MERGE or TEMPTABLE?
I'm wondering: will MySQL cache the query result, making TEMPTABLE the best choice in my case?
Reading https://dev.mysql.com/doc/refman/5.7/en/view-algorithms.html, I understood that, basically, the MERGE algorithm injects the view's code into the query that calls it and then runs it, while the TEMPTABLE algorithm runs the view first, stores its result in a temporary table, and then uses that. But there is no mention of caching.
I know I have the option to implement Materialized Views myself (http://www.fromdual.com/mysql-materialized-views). Can MySQL automatically cache the TEMPTABLE result and use it instead?
Generally speaking the MERGE algorithm is preferred as it allows your view to utilize table indexes, and doesn't introduce a delay in creating temporary tables (as TEMPTABLE does).
In fact this is what the MySQL optimizer does by default: when a view's algorithm is UNDEFINED (as it is by default), MySQL will use MERGE if it can; otherwise it'll use TEMPTABLE.
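For reference, a minimal sketch of specifying the algorithm explicitly (the view and table names are assumptions, not from the question):

-- MERGE lets a query against the view use the base tables' indexes
create algorithm = merge view v_orders as
select o.id, o.total, c.name
from orders o
join customers c on c.id = o.customer_id;

-- a view containing a non-mergeable construct (here GROUP BY / SUM)
-- falls back to a temporary table even if MERGE is requested
create algorithm = merge view v_totals as
select customer_id, sum(total) as total_spent
from orders
group by customer_id;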
One thing to note (which has caused me a lot of pain) is that MySQL will not use the MERGE algorithm if your view contains any of the following constructs:
Constructs that prevent merging are the same for derived tables and view references:
Aggregate functions (SUM(), MIN(), MAX(), COUNT(), and so forth)
DISTINCT
GROUP BY
HAVING
LIMIT
UNION or UNION ALL
Subqueries in the select list
Assignments to user variables
References only to literal values (in this case, there is no underlying table)
When any of those constructs is present, TEMPTABLE will be used, which can cause performance issues without any obvious reason why. In that case it's best to use a stored procedure or a subquery instead of a view.
Thanks, MySQL.
Which algorithm? It depends on the particular query and schema. Usually the Optimizer picks the better approach, and you should not specify.
But... sometimes the Optimizer picks a really bad approach. At that point, the only real solution is not to use views. That is, some views cannot be optimized as well as the equivalent SELECT.
If you want to discuss a particular case, please provide the SHOW CREATE VIEW and SHOW CREATE TABLEs, plus a SELECT calling the view. And construct the equivalent SELECT. Also include EXPLAIN for both SELECTs.
I was at an interesting job interview recently. There I was asked about optimizing a query with a WHERE..IN clause containing a long list of scalars (thousands of values, that is). This question is NOT about subqueries in the IN clause, but about a simple list of scalars.
I answered right away that this can be optimized using an INNER JOIN with another table (possibly a temporary one) that contains only those scalars. My answer was accepted, with a note from the reviewer that "no database engine currently can optimize long WHERE..IN conditions to be performant enough". I nodded.
But when I walked out, I started to have some doubts. The condition seemed rather trivial and too widely used for modern RDBMSs not to be able to optimize it. So, I started digging.
PostgreSQL:
It seems that PostgreSQL parses scalar IN() constructions into a ScalarArrayOpExpr structure, which is sorted. This structure is later used during an index scan to locate matching rows. EXPLAIN ANALYZE for such queries shows only one loop; no joins are done. So I would expect such a query to be even faster than an INNER JOIN. I tried some queries on my existing database and my tests supported that position, but I didn't control for test purity, and Postgres was running under Vagrant, so I might be wrong.
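For reference, a hedged sketch of the kind of comparison I ran (the items table and its indexed id column are assumptions):

-- the long constant list ...
explain analyze
select * from items where id in (1, 2, 3 /* ... thousands of constants ... */);

-- ... versus a join against a table holding the same ids
create temporary table wanted_ids (id int primary key);
-- insert the ids here ...
explain analyze
select i.* from items i join wanted_ids w on w.id = i.id;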
MSSQL Server:
MSSQL Server builds a hash structure from the list of constant expressions and then does a hash join with the source table. Even though no sorting seems to be done, I think this matches the join approach performance-wise. I didn't do any tests since I don't have any experience with this RDBMS.
MySQL Server:
The 13th of these slides says that before 5.0 this problem indeed existed in MySQL in some cases. But other than that, I didn't find any other problems related to poor IN() treatment. Unfortunately, I didn't find any proof of the opposite either. If you did, please kick me.
SQLite:
The documentation page hints at some problems, but I tend to believe the things described there are really at the conceptual level. I found no other information.
So, I'm starting to think I misunderstood my interviewer or misused Google ;) Or maybe it's because we didn't set any conditions and our talk became a little vague (we didn't specify any concrete RDBMS or other constraints; it was just an abstract talk).
It looks like the days when databases rewrote IN() as a set of OR conditions (which, by the way, can sometimes cause problems with NULL values in the list) are long gone. Or are they?
Of course, in cases where the list of scalars is longer than the allowed database protocol packet, an INNER JOIN might be the only solution available.
I think in some cases the query parsing time alone (if the query was not prepared) can kill performance.
Also, a database could be unable to prepare an IN(?) query, which would lead to re-parsing it again and again (and that may kill performance). Actually, I never tried, but I think that even in such cases query parsing and planning is not huge compared to query execution.
But other than that, I do not see other problems. Well, other than the problem of just HAVING this problem: if you have queries that contain thousands of IDs inside, something is wrong with your architecture.
Do you?
Your answer is only correct if you build an index (preferably a primary key index) on the list, unless the list is really small.
Any description of optimization is definitely database-specific. However, MySQL is quite specific about how it optimizes IN:
Returns 1 if expr is equal to any of the values in the IN list, else returns 0. If all values are constants, they are evaluated according to the type of expr and sorted. The search for the item then is done using a binary search. This means IN is very quick if the IN value list consists entirely of constants.
This would definitely be a case where using IN would be faster than using another table -- and probably faster than another table using a primary key index.
I think that SQL Server replaces the IN with a list of ORs. These would then be implemented as sequential comparisons. Note that sequential comparisons can be faster than a binary search, if some elements are much more common than others and those appear first in the list.
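For clarity, the rewrite being described is simply this (illustrative fragments only):

-- the IN list ...
WHERE x IN (1, 2, 3)

-- ... treated as a chain of ORs, compared one after another
WHERE x = 1 OR x = 2 OR x = 3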
I think it is bad application design. The values used with the IN operator are most probably not hardcoded but dynamic. In such a case we should always use prepared statements, the only reliable mechanism to prevent SQL injection.
In any case, it will result in dynamically formatting the prepared statement (as the number of placeholders is dynamic too), and it will also result in excessive hard parsing (as many unique queries as there are distinct numbers of IN values: IN (?), IN (?,?), ...).
I would either load these values into a table and use a join, as you mentioned (unless the loading is too much overhead), or use an Oracle pipelined function, IN foo(params), where the params argument can be a complex structure (an array) coming from memory (PL/SQL, Java, etc.).
If the number of values is large, I would consider using EXISTS (select 1 from mytable m where m.key = x.key) or EXISTS (select x from foo(params)) instead of IN. In such cases EXISTS provides better performance than IN.
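A minimal sketch of the EXISTS rewrite against a table of loaded values (the id_list table and the key columns are assumptions for illustration):

select x.*
from   mytable x
where  exists (select 1
               from   id_list m   -- the scalar values loaded into a (temporary) table
               where  m.key = x.key);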
I'd like to use prepared statements with MySQL on my Go server, but I'm not sure how to make them work with an unknown number of parameters. One endpoint allows users to send an array of IDs, and Go will SELECT the objects from the database matching the given IDs. This array could contain anywhere from 1 to 20 IDs, so how would I construct a prepared statement to handle that? All the examples I've seen require you to know exactly how many query parameters there are.
The only option I can think of (a very unappealing one) is to prepare 20 different SELECT statements and use the one that matches the number of IDs the user submits, but this seems like a terrible hack. Would I even see the performance benefits of prepared statements at that point?
I'm pretty stuck here, so any help would be appreciated!
No RDBMS I'm aware of is able to bind an unknown number of parameters; it is never possible to match an array against an unknown number of parameter placeholders. That means there is no smart way to bind an array to a query such as:
SELECT xxx FROM xxx WHERE xxx in (?,...,?)
This is not a limitation of the client driver, this is simply not supported by database servers.
There are various workarounds.
You can create the query with 20 ? placeholders, bind the values you have, and fill the remaining placeholders with NULL. This works because of the particular semantics of comparison operators involving NULL values: a condition like "field = ?" always evaluates to false when the parameter is bound to NULL, even for rows that would otherwise match. Supposing you have 5 values in your array, the database server has to deal with 5 provided values plus 15 NULLs, and it is usually smart enough to just ignore the NULL values.
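A minimal sketch of that padding approach, assuming a table objects with an integer id column (both names are made up for illustration):

-- prepared once, with a fixed set of 20 placeholders
select id, name
from   objects
where  id in (?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
              ?, ?, ?, ?, ?, ?, ?, ?, ?, ?);
-- binding 5 real ids plus 15 NULLs returns the same rows as IN (id1, ..., id5),
-- because "id = NULL" is never true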
An alternative solution is to prepare all the queries (each one with a different number of parameters). It is only interesting if the maximum number of parameters is limited. It works well on databases for which prepared statements really matter (such as Oracle).
As far as MySQL is concerned, the gain from using a prepared statement is quite limited. Keep in mind that prepared statements are only maintained per session; they are not shared across sessions. If you have a lot of sessions, they take memory. On the other hand, parsing statements with MySQL does not involve much overhead (contrary to some other database systems). Generally, generating plenty of prepared statements to cover a single query is not worth it.
Note that some MySQL drivers offer a prepared statement interface while internally not using the prepared statement capability of the MySQL protocol (again, because it is often not worth it).
There are also some other solutions (like relying on a temporary table), but they are only interesting if the number of parameters is significant.
I have constructed a query and I'm wondering if it would work on any database besides MySQL. I have never actually used another database so I'm not great with the differences.
UPDATE `locks` AS `l1`
CROSS JOIN (SELECT SUM(`value`) AS `sum` FROM `locks`
WHERE `key` IN ("key3","key2")) AS `l2`
SET `l1`.`value` = `l1`.`value` + 1
WHERE `l1`.`key` = "key1" AND (`l2`.`sum` < 1);
Here are the specific features I'm relying on (as I can think of them):
Update queries.
Joins in update queries.
Aggregate functions in non-explicitly-grouped queries.
WHERE...IN condition.
I'm sure people will be curious exactly what this does, and this may also include database features that might not be ubiquitous. This is an implementation of mutual exclusion using a database, intended for a web application. In my case I needed it because certain user actions cause tables to be dropped and recreated with different columns, and I want to avoid errors if other parts of the application try to insert data. The implementation, therefore, is specialized to solve the readers-writers problem.
This query assumes there exists a table locks with two fields: key (varchar) and value (int). It further assumes that the table contains a row such that key="key1". Then it tries to increment the value for "key1". It only does so if for every key in the list ("key2","key3"), the associated value is 0 (the WHERE condition for l2 is an approximation that assumes value is never negative). Therefore this query only "obtains a lock" if certain conditions are met, presumably in an atomic fashion. Then, the application checks if it received a lock by the return value of the query which presumably states how many rows were affected. If and only if no rows were affected, the application did not receive a lock.
So, here are the additional conditions not discernible from the query itself:
Assumes that in a multi-threaded environment, a copy of this query will never be interleaved with another copy.
Processing the query must return whether any values were affected.
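For the second condition, a hedged sketch of how the affected-row count can be read back in MySQL; most client APIs also expose it directly as the statement's "rows affected" result:

-- run in the same session, immediately after the UPDATE
select row_count();   -- 1 = the lock was obtained, 0 = it was not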
As a secondary request, I would appreciate any resources on "standard SQL." I've heard about it but never been able to find any kind of definition, and I feel like I'm missing a lot of things when the MySQL documentation says "this feature is an extension of standard SQL."
Based on the responses, this query should work better across all systems:
UPDATE locks AS l1
CROSS JOIN (SELECT SUM(val) AS others FROM locks
WHERE keyname IN ('key3','key2')) AS l2
SET l1.val = l1.val + 1
WHERE l1.keyname = 'key1' AND (l2.others < 1);
Upvotes for everyone because of the good answers. The marked answer seeks to directly answer my question, even if just for one other DBMS, and even though there may be better solutions to my particular problem (or even the problem of cross-platform SQL in general).
This exact syntax would only work in MySQL.
It's an ugly workaround for this construct:
UPDATE locks
SET    value = value + 1
WHERE  key = 'key1'
  AND  NOT EXISTS
       (
       SELECT NULL
       FROM   locks li
       WHERE  li.key IN ('key2', 'key3')
         AND  li.value > 0
       )
which works in all systems except MySQL, because the latter does not allow subqueries on the target table in UPDATE or DELETE statements.
For PostgreSQL
1) Update queries.
Can't imagine an RDBMS that has no UPDATE. (?)
2) Joins in update queries.
In PostgreSQL you would include additional tables with FROM from_list.
3) Aggregate functions in non-grouped queries.
Not possible in PostgreSQL. Use subqueries, CTEs, or window functions for that.
But your query is grouped. The GROUP BY clause is just not spelled out. That works in PostgreSQL, too.
The presence of HAVING turns a query into a grouped query even if there is no GROUP BY clause. This is the same as what happens when the query contains aggregate functions but no GROUP BY clause.
(Quote from the manual).
4) WHERE...IN condition
Works in any RDBMS I know of.
"Additional conditions": Assumes that in a multi-threaded environment, a copy of this query will never be interleaved with another copy.
PostgreSQL's multiversion model MVCC (Multiversion Concurrency Control) is superior to MySQL for handling concurrency. Then again, most RDBMS are superior to MySQL in this respect.
Processing the query must return whether any values were affected.
Postgres does that; almost every RDBMS does.
Furthermore, this query wouldn't run in PostgreSQL because:
no identifiers with backticks (that's MySQL slang).
values need to be single-quoted, not double-quoted.
See the list of reserved words in Postgres and SQL standards.
A combined list for various RDBMS.
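Putting those points together, a hedged sketch of what the statement might look like in PostgreSQL syntax, using the keyname/val column names from the asker's revised query (verify the details against your actual schema):

UPDATE locks AS l1
SET    val = l1.val + 1
FROM  (SELECT SUM(val) AS others
       FROM   locks
       WHERE  keyname IN ('key3', 'key2')) AS l2
WHERE  l1.keyname = 'key1'
AND    l2.others < 1;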
This will only work in MySQL, if only because you use the backtick (`) identifier delimiter, which is MySQL-specific.
If you replace the delimiter with a more "standard" one, then it will probably work in all modern DBMSs (Postgres, SQL Server, Oracle). But I would never write one general query for all of them; I would rather write a specific query for each DBMS in use (or potentially in use), taking advantage of its specific dialect to get the best performance and query readability.
What about "As a secondary request, I would appreciate any resources on "standard SQL."" --- get a look at http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt