If you write a query like this:
SELECT * FROM `posts` WHERE `views` > 200 OR `views` > 100
Would MySQL analyze that query and realize that it's actually equivalent to this?
SELECT * FROM `posts` WHERE `views` > 100
In other words, would MySQL optimize the query such that it skips any unnecessary WHERE checks?
I'm asking because I'm working on a piece of code that, for now, generates queries with redundant WHERE clauses. I'm wondering if I should optimize those queries before I send them to MySQL, or if that's unnecessary because MySQL would do it anyway.
Yes. MySQL does optimize queries before running them. In fact, what runs has no obvious relationship to the SQL statement itself -- it is a directed acyclic graph.
In the process, MySQL determines which indexes and join algorithms to use for the query, sorts the lists of constants in IN lists, and much more.
The optimizer also does some simplifications of the query. I'm not sure if those simplifications extend to inequalities. However, there is little overhead in making the comparison twice.
EXPLAIN SELECT ... (followed by SHOW WARNINGS) shows how the query was rewritten -- but it still has the OR.
The "Optimizer trace" says the same thing. However, when it gets into discussing the "cost", it gets smart and merges the two comparisons. (This is the case at least as far back as 5.6.)
In many cases, OR should be avoided like covid.
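If you would rather remove the redundancy in the code that generates the query, a minimal sketch in Python could look like this. The function names `collapse_gt_or` and `to_where_clause`, and the representation of each test as a (column, threshold) pair, are assumptions made for illustration; they are not part of MySQL or of any library.

```python
from typing import Dict, Iterable, List, Tuple


def collapse_gt_or(conditions: Iterable[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Collapse OR'ed `column > threshold` tests on the same column.

    `views > 200 OR views > 100` is true exactly when `views > 100`,
    so for each column only the smallest threshold has to be kept.
    """
    lowest: Dict[str, float] = {}
    for column, threshold in conditions:
        if column not in lowest or threshold < lowest[column]:
            lowest[column] = threshold
    # Dicts preserve insertion order, so columns keep their first-seen order.
    return list(lowest.items())


def to_where_clause(conditions: Iterable[Tuple[str, float]]) -> str:
    """Render the collapsed tests as an OR'ed WHERE fragment.

    Assumption: column names and thresholds are trusted values produced by
    your own generator. Real code should pass thresholds as bound parameters.
    """
    parts = [f"`{column}` > {threshold}" for column, threshold in collapse_gt_or(conditions)]
    return " OR ".join(parts)


print(to_where_clause([("views", 200), ("views", 100)]))
```

This only handles the one pattern from the question (several "greater than" tests joined by OR). Given the answer above, it is mostly a readability improvement rather than a performance one.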
I have been learning about query optimization and improving query performance, but in general, when we write a query, how can we know whether it is a good one?
I know we can look at the execution time, but that does not give a clear indication without a good amount of data, and when we write a new query we usually don't have much data to test with.
I have learned about the performance of individual clauses and commands. But is there anything we can use to check the performance of a whole query? By performance I don't mean execution time; I mean whether a query is "ok" or not, independent of the data.
After all, we cannot create as much data as there would be in the live database.
General performance of a query can be checked using the EXPLAIN command in MySQL. See https://dev.mysql.com/doc/refman/5.7/en/using-explain.html
It shows you how the MySQL engine plans to execute the query and lets you do some basic sanity checks, e.g. whether the engine will use keys and indexes to execute the query, how MySQL will execute the joins (e.g. whether foreign-key indexes are missing), and more.
You can find some general tips about how to use EXPLAIN for optimizing queries here (along with some nice samples): http://www.sitepoint.com/using-explain-to-write-better-mysql-queries/
As mentioned above, the right query is always data-dependent. Up to a point, you can use the methods below to check performance:
You can use EXPLAIN to understand the query execution plan, which may help you correct some things. For more info, refer to the documentation: Optimizing Queries with EXPLAIN.
You can use Query Analyzer. Refer to MySQL Query Analyzer.
I like to throw my cookbook at Newbies because they often do not understand how important INDEXes are, or don't know some of the subtleties.
When experimenting with multiple choices of query/schema, I like to use
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
That counts low-level actions, such as "read next record". It essentially eliminates caching issues, disk speed, etc., and is very reproducible. Often there is a counter in that output (or multiple counters) that matches the number of rows in the table (sometimes +/-1) -- that tells me there are table scan(s). This is usually not as good as if some INDEX were being used. If the query has a LIMIT, that value may show up in some Handler.
A really bad query, such as a CROSS JOIN, would show a value of N*M, where N and M are the row counts for the two tables.
I used the Handler technique to 'prove' that virtually all published "get me a random row" techniques require a table scan. Then I could experiment with small tables and Handlers to come up with a list of faster random routines.
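The "counter matches the row count" check can be automated. The sketch below assumes you have already read the output of SHOW SESSION STATUS LIKE 'Handler%' into a dict of integers and that you know the table's row count; the function name `likely_table_scans` and the `slack` parameter are hypothetical.

```python
from typing import Dict, List


def likely_table_scans(handlers: Dict[str, int], table_rows: int, slack: int = 1) -> List[str]:
    """Return the Handler_read* counters whose value is within `slack`
    of the table's row count, which hints at a full table scan."""
    return [
        name
        for name, value in handlers.items()
        if name.startswith("Handler_read") and abs(value - table_rows) <= slack
    ]


# Illustrative numbers only, not measured output.
sample = {
    "Handler_read_first": 1,
    "Handler_read_key": 1,
    "Handler_read_rnd_next": 1001,
    "Handler_write": 0,
}
print(likely_table_scans(sample, table_rows=1000))
```

On very small tables almost every counter is "close" to the row count, so the check is only meaningful once the table has a few hundred rows or more.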
Another tip when timing... Turn off the Query_cache (or use SELECT SQL_NO_CACHE).
The two queries below do the same thing: they show all the ids of table 1 which are present in table 2. What puzzles me is that the simple select is way faster than the JOIN. I would have expected the JOIN to be a bit slower, but not by that much... 5 seconds vs. 0.2.
Can anyone elaborate on this ?
SELECT table1.id FROM
table1,table2 WHERE
table1.id=table2.id
Duration/Fetch 0.295/0.028 (MySql Workbench 5.2.47)
SELECT table1.id
FROM table1
INNER JOIN table2
ON table1.id=table2.id
Duration/Fetch 5.035/0.027 (MySql Workbench 5.2.47)
Q: Can anyone elaborate on this?
A: Before we go the "a bug in MySQL" route that #a_horse_with_no_name seems impatient to race down, we'd really need to ensure that this is repeatable behavior, and isn't just a quirk.
And to do that, we'd really need to see the elapsed time result from more than one run of the query.
If the query cache is enabled on the server, we want to run the queries with the SQL_NO_CACHE hint added (SELECT SQL_NO_CACHE table1.id ...) so we know we aren't retrieving cached results.
I'd repeat the execution of each query at least three times, and throw out the result from the first run, and average the other runs. (The purpose of this is to eliminate the impact of the table data not being in the cache, either InnoDB buffer, or the filesystem cache.)
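The "discard the first run, average the rest" rule is simple to write down. In this sketch the function name `warm_average` is hypothetical, and the timings are assumed to be elapsed seconds that you collected yourself, in run order.

```python
from statistics import mean
from typing import Sequence


def warm_average(timings: Sequence[float]) -> float:
    """Average the elapsed times of all runs except the first.

    The first run is thrown out because it may include the cost of
    loading table data into the buffer pool or filesystem cache.
    """
    if len(timings) < 3:
        raise ValueError("run the query at least three times")
    return mean(timings[1:])


# Illustrative numbers only: a cold first run followed by two warm runs.
print(warm_average([5.0, 0.25, 0.75]))
```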
Also, run an EXPLAIN SELECT ... for each query. And compare the access plans.
If either of these tables is MyISAM storage engine, note that MyISAM tables are subject to locking by DML operations; while an INSERT, UPDATE or DELETE operation is run on the table, the SELECT statements will be blocked from accessing the table. (But five seconds seems a bit much for that, unless these are really large tables, or really inefficient DML statements).
With InnoDB, the SELECT queries won't be blocked by DML operations.
Elapsed time is also going to depend on what else is going on on the system.
The total elapsed time is also going to include more than just the time spent in the MySQL server. Temporarily turning on the MySQL general_log would allow you to capture the statements that are actually being processed by the server.
This looks like something that could be further optimized by the database engine if indeed you are running both queries under the exact same context.
SQL is declarative. By successfully declaring what you want, the engine has free rein to restructure the "How" of your request to bring back the fastest result.
The earliest versions of SQL didn't even have the keyword JOIN. There was only the comma.
There are many coding constructs in SQL that imperatively force a single inferior methodology over another, and they should be avoided. JOIN shouldn't be avoided. Something sounds amiss. JOIN is the core element of SQL. It would be a shame to always have to use commas.
There are a zillion factors that go into the performance of a JOIN, all based on your environment, schema, and data. Chances are that your table1 and table2 represent a fringe case that may have gotten past the optimization algorithms.
The SQL_NO_CACHE worked, the new results are:
Duration/Fetch 5.065 / 0.027 for the select where and
Duration/Fetch 5.050 / 0.027 for the join
I would have thought that the "select where" would be faster, but the join was actually a tad swifter. The difference is negligible, though.
I would like to thank everyone for their response.
From the mk-archiver help, we can see there is an option to optimize "seek-then-scan". Any idea how they do this?
What I'm really looking for is, if I do have a table with one PKey, and queries
SELECT col1,col2 FROM tbl LIMIT 1,10;
SELECT col1,col2 FROM tbl LIMIT 11,20; ...
SELECT col1,col2 FROM tbl LIMIT m,n;
Any way to do this in an optimized way, given m and n are very large values and each select query is initiated in parallel from multiple machines? (will address host/network choking later)
How do others tackle the situation if the table doesn't have a PKey?
*Using MySQL
The default ascending-index optimization causes mk-archiver to
optimize repeated SELECT queries so they seek into the index where the
previous query ended, then scan along it, rather than scanning from
the beginning of the table every time. This is enabled by default
because it is generally a good strategy for repeated accesses.
I believe they are playing directly with the index structures, not relying on SQL, which is an advantage of having access to the MySQL source code. It should be possible to have such an option using SQL, per connection, but with multiple users connecting through intermediate (web) servers it would be more complicated, if possible at all.
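For what it's worth, the seek-then-scan idea from the quoted documentation can also be written in plain SQL by remembering the last primary-key value of each chunk and filtering on it, instead of using a growing offset. The sketch below is an assumption-laden illustration, not mk-archiver's actual implementation: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere, a hypothetical primary-key column named `id`, and a hypothetical helper `scan_in_chunks`. With a MySQL driver the placeholder would typically be `%s` rather than `?`.

```python
import sqlite3
from typing import Iterator, List, Tuple


def scan_in_chunks(conn: sqlite3.Connection, chunk_size: int) -> Iterator[List[Tuple]]:
    """Yield the rows of `tbl` in primary-key order, `chunk_size` rows at a time."""
    last_id = None
    while True:
        if last_id is None:
            rows = conn.execute(
                "SELECT id, col1, col2 FROM tbl ORDER BY id LIMIT ?",
                (chunk_size,),
            ).fetchall()
        else:
            # Seek into the primary-key index where the previous chunk ended,
            # then scan forward, instead of skipping rows with an offset.
            rows = conn.execute(
                "SELECT id, col1, col2 FROM tbl WHERE id > ? ORDER BY id LIMIT ?",
                (last_id, chunk_size),
            ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]


# Small in-memory table with 25 rows, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT)")
conn.executemany(
    "INSERT INTO tbl (id, col1, col2) VALUES (?, ?, ?)",
    [(i, f"a{i}", f"b{i}") for i in range(1, 26)],
)
chunks = list(scan_in_chunks(conn, 10))
print([len(chunk) for chunk in chunks])
```

For parallel readers on several machines, one option is to give each worker its own disjoint range of `id` values and let it page through that range the same way. Without a primary key or another unique, indexed column there is nothing to seek on, and this approach does not apply.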
I have two tables TABLE A and TABLE B.
TABLE A contains 1 million (1,000,000) records and 4 fields, while TABLE B contains 60,000 records and 3 fields.
I am running a query which joins these two tables and uses a WHERE clause to find specific products, like WHERE product like '%Bags%' and product like 'Bags%' etc.
When I run the query directly in phpMyAdmin, it returns records in around 1 or 2 seconds. But when it is used on the website, it sometimes takes 9 or 10 seconds according to the MySQL 'slow query' log. My website's response was very slow at times, and upon investigation I found out via the slow query log that MySQL was the cause.
The slow query log consists of all SQL statements that took more than long_query_time seconds to execute and required at least min_examined_row_limit rows to be examined.
So according to that log "query_time" for above query was 13 seconds while in some cases they even had "query_time" exceeding 50 seconds.
Both my tables are using PRIMARY keys as well as INDEXES. So I want to know how can I optimize them more or is there any way I can optimize MySQL settings in general?
This slowness of the website doesn't happen all the time but only sometimes (maybe once a week), and it lasts for around 1 or 2 minutes. The site gets a decent amount of traffic and there are many other queries too; the one I posted above was just one example.
Thanks
For all things MySQL and performance related, check out http://www.mysqlperformanceblog.com/
Check your queries with EXPLAIN, see here and here for info on how to use EXPLAIN as query diagnostic tool.
It's not enough to just have indexes. Are you indexing the fields used in the WHERE clause, as well as the fields in ORDER BY, GROUP BY, and HAVING clauses and in JOINs? If you have grouped several fields in a single index, that index can only be used when the query filters on the leading (leftmost) fields of the index; a query that searches only a later field will not hit it. If you group fields in an index, make sure the index will actually be used by your query (EXPLAIN is your friend).
That said, it could be many other things as well: poorly configured MySQL server, poorly tuned server, bad schema. But your queries and your indexes are good place to start your investigation.
Here is a nice summary of performance best practices from Jay Pipes of MySQL.
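A like '%Bags%' query cannot be optimized using ordinary indexes, because the pattern starts with a wildcard; a prefix pattern such as like 'Bags%' can use an index on the column.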
The only way to improve performance here is to use fulltext indexes or get sphinx to search.
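As a rough rule of thumb, a LIKE pattern can benefit from an ordinary index only if it begins with literal characters rather than a wildcard. A small helper that applies this rule to the patterns your application generates might look like the following. The name `like_can_use_index` is hypothetical, and the check is only a heuristic about the pattern; whether MySQL actually uses an index still has to be confirmed with EXPLAIN.

```python
def like_can_use_index(pattern: str) -> bool:
    """Return True if the LIKE pattern starts with a literal character.

    Patterns that begin with `%` or `_` give the index nothing to seek on,
    so the server has to examine every row.
    """
    return bool(pattern) and pattern[0] not in ("%", "_")


for pattern in ("Bags%", "%Bags%"):
    print(pattern, like_can_use_index(pattern))
```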
It's because other queries are running at the same time when a page of your website is loaded. If, for example, your website runs 8-10 queries on each page load, it will take longer than running a single query in phpMyAdmin. And if the slowness lasts for 1-1.5 minutes, the problem may not be the query at all but the speed of the server.
You can also use a MATCH() AGAINST() statement to optimize this type of search query.
Otherwise, you are already using PRIMARY KEYs, INDEXES and JOINs, so there is no need to worry about other things.
Just check it out.
Thanks.
There are many ways to optimize Databases and queries. My method is the following.
Look at the DB Schema and see if it makes sense
Most often, databases have bad designs and are not normalized. This can greatly affect the speed of your database. As a general rule, learn the three normal forms and apply them at all times. Deliberately departing from them is called denormalization: it breaks some rules to make the database faster.
What I suggest is to stick to the 3rd normal form unless you are a DBA (which means you know the higher normal forms and denormalization techniques, and know what you're doing). Such changes are usually made at a later time, not during design.
Only query what you really need
Filter as much as possible
Your Where Clause is the most important part for optimization.
Select only the fields you need
Never use "Select *" -- Specify only the fields you need; it will be faster and will use less bandwidth.
Be careful with joins
Joins are expensive in terms of time. Make sure that you use all the keys that relate the two tables together and don't join to unused tables -- always try to join on indexed fields. The join type is important as well (INNER, OUTER,... ).
Optimize queries and stored procedures (Most Run First)
Queries are very fast. Generally, you can retrieve many records in less than a second, even with joins, sorting and calculations. As a rule of thumb, if your query is longer than a second, you can probably optimize it.
Start with the Queries that are most often used as well as the Queries that take the most time to execute.
Add, remove or modify indexes
If your query does full table scans, indexes and proper filtering can solve what is normally a very time-consuming process. All primary keys need indexes because they make joins faster (MySQL creates an index for every primary key automatically). This also means that all tables need a primary key. You can also add indexes on fields you often use for filtering in the Where Clauses.
You especially want to use Indexes on Integers, Booleans, and Numbers. On the other hand, you probably don't want to use indexes on Blobs, VarChars and Long Strings.
Be careful with adding indexes because they need to be maintained by the database. If you do many updates on that field, maintaining indexes might take more time than it saves.
In the Internet world, read-only tables are very common. When a table is read-only, you can add indexes with less negative impact because indexes don't need to be maintained (or only rarely need maintenance).
Move Queries to Stored Procedures (SP)
Stored Procedures are usually better and faster than queries for the following reasons:
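Stored Procedures are precompiled in some database systems (ad hoc SQL code is not), which can make them faster than SQL code; in MySQL the benefit is smaller, because procedures are parsed and cached per connection.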
SPs don't use as much bandwidth because you can do many queries in one SP. SPs also stay on the server until the final results are returned.
Stored Procedures are run on the server, which is typically faster.
Calculations in code (VB, Java, C++, ...) are not as fast as SP in most cases.
It keeps your DB access code separate from your presentation layer, which makes it easier to maintain (3 tiers model).
Remove unneeded Views
Views are a special type of Query -- they are not tables. They are logical and not physical so every time you run select * from MyView, you run the query that makes the view and your query on the view.
If you always need the same information, views could be good.
If you have to filter the View, it's like running a query on a query -- it's slower.
Tune DB settings
You can tune the DB in many ways. Update statistics used by the optimizer, run optimization options, make the DB read-only, etc... That takes a broader knowledge of the DB you work with and is mostly done by the DBA.
Using Query Analysers
In many Databases, there is a tool for running and optimizing queries. SQL Server has a tool called the Query Analyser, which is very useful for optimizing. You can write queries, execute them and, more importantly, see the execution plan. You use the execution plan to understand what SQL Server does with your query.
The other day I found the FOUND_ROWS() function in MySQL and its corresponding SQL_CALC_FOUND_ROWS option. The latter looks especially useful (instead of running a second query to get the row count).
I'm wondering what speed impact there is from adding SQL_CALC_FOUND_ROWS to a query.
I'm guessing it will be much faster than running a second query to count the rows, but will the difference be large? Also, I have found that limiting a query makes it much faster (for example, when you get the first 10 rows of 1000). Will adding SQL_CALC_FOUND_ROWS to a query with a small limit cause the query to run much slower?
I know I can test this, but I'm wondering about general practices here.
When I was at the MySQL Conference in 2008, part of one session was dedicated to exactly this - benchmarks between SQL_CALC_FOUND_ROWS and doing a separate SELECT.
I believe the result was that there was no benefit to SQL_CALC_FOUND_ROWS - it wasn't faster, in fact it may have been slower. There was also a 3rd way.
Additionally, you don't always need this information, so I would go the extra query route.
I'll try to find the slides...
Edit: Hrm, google tells me that I actually liveblogged from that session: http://beerpla.net/2008/04/16/mysql-conference-liveblogging-mysql-performance-under-a-microscope-the-tobias-and-jay-show-wednesday-200pm/. Google wins when memory fails.
To calculate SQL_CALC_FOUND_ROWS, the query will be executed as if no LIMIT was set, but the result set sent to the client will obey the LIMIT.
Update: for COUNT(*) operations which would be using only the index, SQL_CALC_FOUND_ROWS is slower (reference).
I assume it would be slightly faster for queries where you need to know the number of rows, but would incur an overhead for queries where you don't.
The best advice I could give is to try it out on your development server and benchmark the difference. Every setup is different.
I would advise using as few proprietary SQL extensions as possible when developing an application (or actually not using SQL queries at all). Doing a separate query is portable, and I don't think MySQL could do better at getting the actual information than re-querying. Btw., as the page mentions, the command also has some drawbacks when used in replicated environments.
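The portable "separate query" route mentioned in these answers can be sketched as follows. As an assumption for the sake of a runnable example, it uses Python's built-in sqlite3 module instead of MySQL, a hypothetical `posts` table with `id` and `views` columns, and a hypothetical helper named `fetch_page`; the two SQL statements themselves are standard and should carry over to MySQL with the driver's own placeholder syntax.

```python
import sqlite3
from typing import List, Tuple


def fetch_page(
    conn: sqlite3.Connection, min_views: int, limit: int, offset: int
) -> Tuple[List[Tuple], int]:
    """Return one page of matching rows plus the total number of matches."""
    # Query 1: only the rows for the requested page.
    rows = conn.execute(
        "SELECT id, views FROM posts WHERE views > ? ORDER BY id LIMIT ? OFFSET ?",
        (min_views, limit, offset),
    ).fetchall()
    # Query 2: a separate COUNT(*) with the same filter, no LIMIT.
    (total,) = conn.execute(
        "SELECT COUNT(*) FROM posts WHERE views > ?",
        (min_views,),
    ).fetchone()
    return rows, total


# Small in-memory table for demonstration: 50 posts with views 10, 20, ..., 500.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, views INTEGER)")
conn.executemany(
    "INSERT INTO posts (id, views) VALUES (?, ?)",
    [(i, i * 10) for i in range(1, 51)],
)
rows, total = fetch_page(conn, min_views=100, limit=10, offset=0)
print(len(rows), total)
```

Because the count is a separate statement, you can skip it on requests that don't need the total, which is the flexibility the answers above point out.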