I have 2 tables:
author, with 3 million rows.
book, with 20 thousand rows.
So I have benchmarked this query with a join:
SELECT BENCHMARK(100000000, 'SELECT book.title, author.name
FROM `book` , `author` WHERE book.id = author.book_id ')
And this is the result:
Query took 0.7438 sec
ONLY 0.7438 seconds for 100 million queries with a join???
Am I making a mistake, or is this the right result?
Your result smells wrong. I've just checked the documentation and run some benchmarks of my own: you're not actually benchmarking anything.
BENCHMARK() is for testing scalar expressions, not query runtimes; the query isn't actually being executed. In my own testing, the duration was not related to the complexity of the query at all, only to the number of iterations requested.
Take a look at http://dev.mysql.com/doc/refman/5.0/en/information-functions.html#function_benchmark
A few quotes from the doc:
"BENCHMARK() is intended for measuring the runtime performance of scalar expressions,"
"Only scalar expressions can be used. Although the expression can be a subquery, it must return a single column and at most a single row. For example, BENCHMARK(10, (SELECT * FROM t)) will fail if the table t has more than one column or more than one row."
You're not actually measuring anything, except at the absolute most the query planner's time.
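For contrast, this is the kind of thing BENCHMARK() is actually meant for: timing a scalar expression. (MD5() is just an arbitrary scalar function here, not something from your schema.)
SELECT BENCHMARK(1000000, MD5('test'));
-- returns 0; the useful number is the elapsed time the client reports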
If you want to run benchmarks, it's probably worth doing it from application code (and possibly with a no-cache directive, depending on how write-heavy your production environment will be). Doing it from application code will also factor in the time to hydrate the data, plus the cost of sending the data across the wire, etc.
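A minimal sketch of your query with the query cache bypassed (SQL_NO_CACHE is a MySQL 5.x hint; the join condition is taken from your original query), to be timed from application code over many runs:
SELECT SQL_NO_CACHE book.title, author.name
FROM book
JOIN author ON book.id = author.book_id;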
Related
I have this query:
SELECT *,(SELECT count(id) FROM riverLikes
WHERE riverLikes.river_id = River.id) as likeCounts
FROM River
WHERE user_id IN (1,2,3)
LIMIT 10
My question is: does my subquery run only 10 times (once for each row that is fetched), or does it run for every row in the "River" table?
My "River" table has lots of records and I'd like the best performance when fetching the rivers.
Thanks.
In general, calculated data (either subqueries or functions) is calculated for the rows that matter: rows that are returned, or rows for which the outcome of the calculation is relevant to further filtering or grouping.
In addition, the query optimizer may do all kinds of magic, and it is unlikely that it will run the subquery many times as such. It can be transformed in such a way that all relevant information is fetched at once.
And even if it didn't do that, it all takes place within the same operation in the database's SQL engine, so executing this subselect 10 times is way, way faster than executing it as a separate SELECT 10 times: the SQL engine only has to parse and prepare it once, and it doesn't suffer from round-trip times.
A simple select like that could easily take 30 milliseconds or so when executed from PHP, so quick math would suggest that it'd take 300ms extra to have this subselect in a 10-row query, but that's not the case, because the lion's share of those 30ms is overhead of communication between PHP and the database.
Because of the reasons mentioned above, this subselect is possibly way faster than a join, and it's a common misconception that a join is (almost) always faster.
So, to get back to your example, the subquery won't be executed for all rows in River; it will only be executed, probably in optimized form, for the 10 rows belonging to users 1, 2 and 3.
In most production-ready RDBMSs, the subquery will be run only for the rows included in the result set, i.e. only 10 times in your case. I think this is true for MySQL too.
EDIT:
To be sure, run
EXPLAIN <your query>
and view the execution plan of your query.
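For the query above, that simply means prefixing it with EXPLAIN; the select_type column of the output shows how MySQL classifies the subquery:
EXPLAIN SELECT *, (SELECT COUNT(id) FROM riverLikes
                   WHERE riverLikes.river_id = River.id) AS likeCounts
FROM River
WHERE user_id IN (1, 2, 3)
LIMIT 10;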
The subquery in the SELECT clause runs once per row returned; in your sample, 10 times.
Suppose I have a long query string, e.g.:
SELECT id from users where collegeid='1' or collegeid='2' . . . collegeid='1000'
Will it affect the speed or output in any way?
SELECT m.id,m.message,m.postby,m.tstamp,m.type,m.category,u.name,u.img
from messages m
join users u on m.postby=u.uid
where m.cid = '1' or m.cid = '2' . . .
or m.cid = '1000'
I would prefer to use IN in this case, as it would be better. To check the performance difference, look at the execution plan of each query; that will give you an idea of what you gain by using one over the other.
Something like this:
SELECT id from users where collegeid IN ('1','2','3'....,'1000')
According to the MySQL documentation:
If all values are constants, they are evaluated according to the type
of expr and sorted. The search for the item then is done using a
binary search. This means IN is very quick if the IN value list
consists entirely of constants.
The number of values in the IN list is only limited by the
max_allowed_packet value.
You may also check IN vs OR in the SQL WHERE Clause and MySQL OR vs IN performance.
The answer given by Ergec is very useful:
SELECT * FROM item WHERE id = 1 OR id = 2 ... id = 10000
This query took 0.1239 seconds
SELECT * FROM item WHERE id IN (1,2,3,...10000)
This query took 0.0433 seconds
IN is 3 times faster than OR
will it affect the speed or output in any way?
So the answer is: yes, the performance will be affected.
Obviously, there is no direct correlation between the length of a query string and its processing time (a very short query can be tremendously complex and vice versa). For your specific example, it depends on how the query is processed, which you can check by looking at the query execution plan (the syntax depends on your DBMS; something like EXPLAIN PLAN).
If the DBMS has to perform a full table scan, performance will only be affected slightly, since the DBMS has to visit all pages that make up the table anyhow. If there is an index on collegeid, performance will likely suffer more the more entries you put into your disjunction, since there will be several (though very fast) index lookups. At some point there will be a full index scan instead of individual lookups, at which point performance will not degrade significantly anymore.
However, the details depend on your DBMS and its execution planner.
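A minimal sketch of how you could check this in MySQL (the index name is hypothetical; the table and column come from the question):
-- create an index so the disjunction can use index lookups
CREATE INDEX idx_users_collegeid ON users (collegeid);
-- the type and rows columns of the plan show lookups vs. a scan
EXPLAIN SELECT id FROM users WHERE collegeid IN ('1', '2', '3');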
I'm not sure you are facing what I did.
Actually, the string length is not the problem; how many values are in the IN() list is what matters.
I've tested how many elements can be listed in IN(). Up to about 10,000 elements can be processed without a performance loss; the values in IN() have to be stored somewhere and searched during query evaluation, and beyond roughly 10k values this starts to get slower.
So if you have on the order of 100k values, split them into 10 groups and run the query 10 times, or save them in a temporary table and JOIN against it.
A long query also uses more CPU to parse, so IN() is better than column = 1 OR ...
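A rough sketch of the temporary-table approach mentioned above (table and column names are made up for illustration):
-- load the big value list once
CREATE TEMPORARY TABLE wanted (val INT PRIMARY KEY);
INSERT INTO wanted VALUES (1), (2), (3);  -- ... bulk-insert the full list here
-- JOIN instead of a huge IN() list
SELECT u.id
FROM users u
JOIN wanted w ON u.collegeid = w.val;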
Suppose a nested query looks like this:
SELECT *
FROM Table1
WHERE Table1.val IN
(
    SELECT Table2.val
    FROM Table2
    WHERE Table2.val > 3
)
So my question is: how do actual query processors evaluate this?
Do they first evaluate the innermost query, store the result temporarily, and then use it in the upper-level query? [If the result of the subquery is large, temporary storage may not be enough.]
Or do they evaluate the inner query once for each row of the outer query? [That requires too many evaluations of the inner query.]
Am I missing something, and which way is it actually implemented?
It depends. You're allowed to reference a value from the outer query inside the inner one. If you do this, you have what is referred to as a correlated subquery or correlated derived table; in that case, the engine has to recompute the subquery for every candidate row in the parent query. If you don't do this, you have what is referred to as an inline view or inline derived table, and most database engines are smart enough to compute it only once.
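To make the distinction concrete, here is a sketch of both shapes (the group_id column is invented for the correlated case):
-- non-correlated: the inner query never mentions t1, so it can run once
SELECT * FROM Table1 t1
WHERE t1.val IN (SELECT t2.val FROM Table2 t2 WHERE t2.val > 3);
-- correlated: the inner query references t1, so it is conceptually
-- re-evaluated for each row of Table1
SELECT * FROM Table1 t1
WHERE t1.val IN (SELECT t2.val FROM Table2 t2 WHERE t2.group_id = t1.group_id);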
This might help you
There is plenty of memory available to run fairly large queries and there is little chance of the query's complexity creating a timeout. Nested queries have the advantage of being run almost exclusively in physical memory for the sake of speed, unless the amount of memory needed to perform the query exceeds the physical memory "footprint" allocated to SQL Server.
Also, because of a nested query's structure, the SQL Server query optimizer will attempt to find the most optimal way to derive the results from the query. (i.e. the optimizer will convert it into micro-ops and reorganize to run it efficiently) [Source]
Due to this re-ordering, it is hard to say which query is run first. After optimization, however, most probably some form of the inner query will be run (you can't really do much without that result most of the time).
SQL queries don't run like stack calls, the way you seem to think; a query is a single instruction that is comprehended and translated for the machine as a whole.
You can use EXPLAIN to get information about the execution plan like this:
EXPLAIN SELECT *
FROM Table1
WHERE Table1.val IN
(
    SELECT Table2.val
    FROM Table2
    WHERE Table2.val > 3
)
For example, for your query the subquery shows up with select_type DEPENDENT SUBQUERY, and as the documentation states:
DEPENDENT SUBQUERY evaluation differs from UNCACHEABLE SUBQUERY evaluation. For DEPENDENT SUBQUERY, the subquery is re-evaluated only once for each set of different values of the variables from its outer context. For UNCACHEABLE SUBQUERY, the subquery is re-evaluated for each row of the outer context.
So your first guess was the correct one.
For more information about EXPLAIN see:
http://dev.mysql.com/doc/refman/5.0/en/explain.html
So I have a table that has a little over 5 million rows. When I use SQL_CALC_FOUND_ROWS, the query just hangs forever. When I take it out, the query executes within a second with LIMIT 25. My question is: for pagination purposes, is there an alternative for getting the total number of rows?
SQL_CALC_FOUND_ROWS forces MySQL to scan for ALL matching rows, even if they'd never get fetched. Internally it amounts to the same query being executed without the LIMIT clause.
If the filtering you're doing via WHERE isn't too crazy, you could calculate and cache various counts to save the full-scan load imposed by SQL_CALC_FOUND_ROWS. Basically, run a "SELECT COUNT(*) FROM ... WHERE ..." for the most common WHERE clauses.
Otherwise, you could go Google-style and just spit out some page numbers that occasionally have no relation whatsoever with reality (You know, you see "Goooooooooooogle", get to page 3, and suddenly run out of results).
Detailed talk about implementing Google-style pagination using MySQL
You should choose between COUNT(*) and SQL_CALC_FOUND_ROWS depending on the situation. If your query's search criteria use columns that are in an index, use COUNT(*): MySQL will then "read" from the indexes only, without touching the actual data in the table, while the SQL_CALC_FOUND_ROWS method loads rows from disk, which can be expensive and time-consuming on massive tables.
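As a sketch, the two patterns side by side (the items table and its category column are hypothetical; FOUND_ROWS() is the companion function that retrieves the calculated total):
-- pattern 1: separate count, which can often be satisfied from an index
SELECT COUNT(*) FROM items WHERE category = 3;
SELECT * FROM items WHERE category = 3 LIMIT 0, 25;
-- pattern 2: one pass, but MySQL must scan ALL matching rows
SELECT SQL_CALC_FOUND_ROWS * FROM items WHERE category = 3 LIMIT 0, 25;
SELECT FOUND_ROWS();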
More information on this topic in this article on mysqlperformanceblog.
NOTE: the original question is moot but scan to the bottom for something relevant.
I have a query I want to optimize that looks something like this:
select cols from tbl where col = "some run time value" limit 1;
I want to know what keys are being used but whatever I pass to explain, it is able to optimize the where clause to nothing ("Impossible WHERE noticed...") because I fed it a constant.
Is there a way to tell MySQL not to do constant optimizations in EXPLAIN?
Am I missing something?
Is there a better way to get the info I need?
Edit: EXPLAIN seems to be giving me the query plan that will result from constant values. As the query is part of a stored procedure (and IIRC query plans in sprocs are generated before they are called), this does me no good, because the values are not constant. What I want is to find out what query plan the optimizer will generate when it doesn't know what the actual value will be.
Am I missing something?
Edit 2: Asking around elsewhere, it seems that MySQL always regenerates query plans unless you go out of your way to make it reuse them, even in stored procedures. From this it would seem that my question is moot.
However, that doesn't make what I really wanted to know moot: how do you optimize a query that contains values that are constant within any specific query, but where I, the programmer, don't know in advance what value will be used? For example, say my client-side code is generating a query with a number in its WHERE clause. Sometimes the number will result in an impossible WHERE clause, other times it won't. How can I use EXPLAIN to examine how well optimized the query is?
The best approach I'm seeing right off the bat would be to run EXPLAIN on it for the full matrix of exist/non-exist cases. Really, that isn't a very good solution, as it would be both hard and error-prone to do by hand.
You are getting "Impossible WHERE noticed" because the value you specified is not in the column, not just because it is a constant. You could either 1) use a value that exists in the column or 2) just say col = col:
explain select cols from tbl where col = col;
For example, say my client-side code is generating a query with a number in its WHERE clause.
Sometimes the number will result in an impossible WHERE clause, other times it won't.
How can I use EXPLAIN to examine how well optimized the query is?
MySQL builds different query plans for different values of bound parameters.
In this article you can read a list of when the MySQL optimizer does what:
Action                                       When
Query parse                                  PREPARE
Negation elimination                         PREPARE
Subquery re-writes                           PREPARE
Nested JOIN simplification                   First EXECUTE
OUTER->INNER JOIN conversions                First EXECUTE
Partition pruning                            Every EXECUTE
COUNT/MIN/MAX elimination                    Every EXECUTE
Constant subexpression removal               Every EXECUTE
Equality propagation                         Every EXECUTE
Constant table detection                     Every EXECUTE
ref access analysis                          Every EXECUTE
range/index_merge analysis and optimization  Every EXECUTE
Join optimization                            Every EXECUTE
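The PREPARE/EXECUTE stages above map directly onto MySQL's prepared-statement syntax; a minimal sketch (the statement text reuses the earlier example):
PREPARE stmt FROM 'SELECT cols FROM tbl WHERE col = ?';  -- query parse, subquery rewrites
SET @v := 'some run time value';
EXECUTE stmt USING @v;     -- ref/range analysis and join optimization happen here
DEALLOCATE PREPARE stmt;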
There is one more thing missing from this list: MySQL can rebuild a query plan on every JOIN iteration, the so-called range checking for each record.
If you have a composite index on a table:
CREATE INDEX ix_table2_col1_col2 ON table2 (col1, col2)
and a query like this:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col1 = t1.value1
AND t2.col2 BETWEEN t1.value2_lowerbound AND t1.value2_upperbound
, MySQL will NOT use an index RANGE access from (t1.value1, t1.value2_lowerbound) to (t1.value1, t1.value2_upperbound). Instead, it will use an index REF access on (t1.value1) and just filter out the wrong values.
But if you rewrite the query like this:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col1 <= t1.value1
AND t2.col1 >= t1.value1
AND t2.col2 BETWEEN t1.value2_lowerbound AND t1.value2_upperbound
, then MySQL will recheck index RANGE access for each record from table1, and decide whether to use RANGE access on the fly.
You can read about it in these articles in my blog:
Selecting timestamps for a time zone - how to use coarse filtering to filter out timestamps without a timezone
Emulating SKIP SCAN - how to emulate SKIP SCAN access method in MySQL
Analytic functions: optimizing LAG, LEAD, FIRST_VALUE, LAST_VALUE - how to emulate Oracle's analytic functions in MySQL
Advanced row sampling - how to select N records from each group in MySQL
All these things employ RANGE CHECKING FOR EACH RECORD
Returning to your question: there is no way to tell which plan MySQL will use for any given constant, since there is no plan before the constant is given.
Unfortunately, there is no way to force MySQL to use one query plan for every value of a bound parameter.
You can control the JOIN order and the indexes being chosen by using the STRAIGHT_JOIN and FORCE INDEX clauses, but they will not force a certain access path on an index or forbid the IMPOSSIBLE WHERE.
On the other hand, for all JOINs, MySQL employs only nested loops. That means that if you build the right JOIN order or choose the right indexes, MySQL will probably benefit from all IMPOSSIBLE WHEREs.
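For illustration, here is how those two hints could be attached to the earlier join (reusing the table1/table2/ix_table2_col1_col2 names from above; a sketch, not a recommendation):
SELECT STRAIGHT_JOIN t1.*, t2.*                    -- forces t1 to be the driving table
FROM table1 t1
JOIN table2 t2 FORCE INDEX (ix_table2_col1_col2)   -- restricts the index choice
ON t2.col1 = t1.value1
AND t2.col2 BETWEEN t1.value2_lowerbound AND t1.value2_upperbound;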
How do you optimize a query with values that are constant only to the query but where I, the programmer, don't know in advance what value will be used?
By using indexes on the specific columns (or even on a combination of columns, if you always query the given columns together). If you have indexes, the query planner will potentially use them.
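A minimal sketch, reusing the tbl/col names from the question (the second column, col2, is invented to illustrate the combined case):
CREATE INDEX ix_tbl_col ON tbl (col);
-- composite index, useful if you always filter on both columns together
CREATE INDEX ix_tbl_col_col2 ON tbl (col, col2);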
Regarding "impossible" values: the query planner can conclude that a given value is not in the table from several sources:
if there is an index on the particular column, it can observe that the particular value is large or smaller than any value in the index (min/max values take constant time to extract from indexes)
if you are passing in the wrong type (if you are asking for a numeric column to be equal with a text)
PS. In general, creation of the query plan is not expensive and it is better to re-create than to re-use them, since the conditions might have changed since the query plan was generated and a better query plan might exists.