MySQL: where exists VS where id in [performance] - mysql

This question also exists here: Poor whereHas performance in Laravel
... but it has no answer.
I ran into the same situation as the author of that question:
replays table has 4M rows
players table has 40M rows
This query uses where exists and it takes a lot of time (70s) to finish:
select * from `replays`
where exists (
select * from `players`
where `replays`.`id` = `players`.`replay_id`
and `battletag_name` = 'test')
order by `id` asc
limit 100;
but when it's changed to use where id in instead of where exists - it's much faster (0.4s):
select * from `replays`
where id in (
select replay_id from `players`
where `battletag_name` = 'test')
order by `id` asc
limit 100;
MySQL (InnoDB) is being used.
I would like to understand why there is such a big difference in performance between where exists and where id in. Is it because of how MySQL works? I expected the "exists" variant to be faster because MySQL would just check whether relevant rows exist, but I was wrong (I probably don't understand how "exists" works in this case).

You should show the execution plans.
To optimize the exists, you want an index on players(replay_id, battletag_name). An index on replays(id) should also help -- but if id is a primary key there is already an index.
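As a sketch of that suggestion, the index could be created and the plan re-checked as follows. The index name idx_players_replay_battletag is made up for illustration, and building an index on a 40M-row table may take a while:

```sql
-- Composite index suggested above: both columns the EXISTS probe needs
-- are in the index (the index name is hypothetical).
CREATE INDEX idx_players_replay_battletag
    ON players (replay_id, battletag_name);

-- Re-check the plan for the slow query afterwards.
EXPLAIN
SELECT * FROM replays
WHERE EXISTS (
    SELECT * FROM players
    WHERE replays.id = players.replay_id
      AND battletag_name = 'test')
ORDER BY id ASC
LIMIT 100;
```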

Gordon has a good answer. The fact is that performance depends on a lot of different factors including database design/schema and volume of data.
As a rough guide, the exists sub-query executes once for every row in replays, whereas the in sub-query executes once to produce its result set, and that result set is then searched for every row in replays.
So with exists, the better the indexing/access path, the faster it will run. Without relevant indexes it will just read through rows until it finds a match, and it does this for every single row in replays. For rows with no match it would end up reading the entire players table each time. Even rows with matches could read through a significant number of players rows before finding one.
With in, the smaller the result set from the sub-query, the faster it will run. For rows without a match it only needs to check the small set of sub-query rows to reach that answer. That said, you don't get the benefit of indexes (if it works this way), so for a large sub-query result set it has to read every row of that result before deciding there is no match.
That said, database optimisers are pretty clever and don't always evaluate queries exactly the way you write them, which is why checking execution plans and testing yourself is important for finding the best approach. It's not unusual to expect a certain execution path only to find that the optimiser has chosen a different one based on how it expects the data to look.
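To see which strategy the optimiser actually picked, both variants from the question can be run through EXPLAIN. This is only a sketch: the output depends on the MySQL version and on the indexes present, so the comments describe what may appear rather than what will.

```sql
-- Plan for the EXISTS variant. On older versions the players row is often
-- reported as a DEPENDENT SUBQUERY, i.e. evaluated per outer row.
EXPLAIN
SELECT * FROM replays
WHERE EXISTS (
    SELECT * FROM players
    WHERE replays.id = players.replay_id
      AND battletag_name = 'test')
ORDER BY id ASC
LIMIT 100;

-- Plan for the IN variant. Here the optimiser may use a semi-join or
-- materialise the sub-query result once.
EXPLAIN
SELECT * FROM replays
WHERE id IN (
    SELECT replay_id FROM players
    WHERE battletag_name = 'test')
ORDER BY id ASC
LIMIT 100;
```

Comparing the select_type, key and rows columns of the two outputs usually shows where the time goes.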

Related

Optimize query through the order of columns in index

I have a table that holds a domain and an id.
The query is:
select distinct domain
from user
where id = '1'
With the index idx_domain_id the query is faster than with idx_id_domain.
If the order of execution is
(FROM clause, WHERE clause, GROUP BY clause, HAVING clause, SELECT
clause, ORDER BY clause)
then the query should be faster when the index leads with the WHERE column rather than the SELECT column.
From 15:00 to 17:00, this link shows the same query I am working on:
https://serversforhackers.com/laravel-perf/mysql-indexing-three
The table has 4.6 million rows.
Time using idx_domain_id
Time after changing the order
This is your query:
select distinct first_name
from user
where id = '1';
You are observing that an index on user(first_name, id) is faster than one on user(id, first_name).
Why might this be the case? First, this could simply be an artifact of how you are doing the timing. If your table is really small (i.e. the data fits on a single data page), then indexes are generally not very useful for improving performance.
Second, if you are only running the queries once, then the first time you run the query, you might have a "cold cache". The second time, the data is already stored in memory, so it runs faster.
Other issues can come up as well. You don't specify what the timings are. Small differences can be due to noise and might be meaningless.
You don't provide enough information to give a more definitive explanation. That would include:
Repeated timings run on cold caches.
Size information on the table and the number of matching rows.
Layout information, particularly the type of id.
Explain plans for the two queries.
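A rough sketch of how that information could be collected, assuming the table and index names from the question (user, idx_domain_id, idx_id_domain):

```sql
-- Table definition, including the type of id and the existing indexes.
SHOW CREATE TABLE user;

-- Approximate size information for the table.
SHOW TABLE STATUS LIKE 'user';

-- Number of rows matching the filter.
SELECT COUNT(*) FROM user WHERE id = '1';

-- Plans with each index forced, so the two can be compared directly.
EXPLAIN SELECT DISTINCT domain FROM user FORCE INDEX (idx_domain_id) WHERE id = '1';
EXPLAIN SELECT DISTINCT domain FROM user FORCE INDEX (idx_id_domain) WHERE id = '1';
```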
select distinct domain
from user
where id = '1'
Since id is the PRIMARY KEY, there is at most one row involved. Hence, the keyword DISTINCT is useless.
And the most useful index is what you already have, PRIMARY KEY(id). It will drill down the BTree to find id='1' and deliver the value of domain that is sitting right there.
On the other hand, consider
select distinct domain
from user
where something_else = '1'
Now, the obvious index is INDEX(something_else, domain). This is optimal for the WHERE clause, and it is "covering" (meaning that all the columns needed by the query exist in the index). Swapping the columns in the index will be slower. Meanwhile, since there could be multiple rows, DISTINCT means something. However, it is not the logical thing to use.
Concerning your title question (order of columns): The = columns in the WHERE clause should come first. (More details in the link below.)
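A minimal sketch of that ordering, using the placeholder column something_else from above and a made-up index name:

```sql
-- Equality column first, then the selected column, so the index is covering.
-- (idx_something_else_domain is a hypothetical name.)
CREATE INDEX idx_something_else_domain
    ON user (something_else, domain);

EXPLAIN
SELECT DISTINCT domain
FROM user
WHERE something_else = '1';
```

If the index is used as a covering index, the Extra column of the plan should mention Using index.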
DISTINCT means to gather all the rows, then de-duplicate them. If every matching row has the same domain (or any one value will do), why go to that much effort when this gives the same answer:
select domain
from user
where something_else = '1'
LIMIT 1
This hits only one row, not all the 1s.
Read my Indexing Cookbook.
(And, yes, Gordon has a lot of good points.)

Optimizing mysql query with the proper index

I have a table of 15.1 million records. I'm running the following query on it to process the records for duplicate checking.
select id, name, state, external_id
from companies
where dup_checked=0
order by name
limit 500;
When I use explain extended on the query it tells me it's using the index_companies_on_name index which is just an index on the company name. I'm assuming this is due to the ordering. I tried creating other indexes based on the name and dup_checked fields hoping it would use this one as it may be faster, but it still uses the index_companies_on_name index.
Initially it was fast enough, but now we're down to 3.3 million records left to check and this query is taking up to 90 seconds to execute. I'm not quite sure what else to do to make this run faster. Is a different index the answer or something else I'm not thinking of? Thanks.
Generally the trick here is to create an index that filters first, reducing the number of rows ("Cardinality"), and has the ordering applied secondarily:
CREATE INDEX `index_companies_on_dup_checked_name`
ON `companies` (`dup_checked`,`name`)
That should give you the scope you need.
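Whether the optimiser picks up the new index can be checked with EXPLAIN. This is a sketch rather than a guaranteed outcome: ideally the key column names index_companies_on_dup_checked_name and the Extra column no longer shows Using filesort. The second statement forces the index, which can help when comparing timings if the optimiser still prefers index_companies_on_name.

```sql
EXPLAIN
SELECT id, name, state, external_id
FROM companies
WHERE dup_checked = 0
ORDER BY name
LIMIT 500;

-- For comparison only: force the composite index.
SELECT id, name, state, external_id
FROM companies FORCE INDEX (index_companies_on_dup_checked_name)
WHERE dup_checked = 0
ORDER BY name
LIMIT 500;
```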

Check if MySQL Table is empty: COUNT(*) is zero vs. LIMIT(0,1) has a result?

This is a simple question about efficiency specifically related to the MySQL implementation. I want to just check if a table is empty (and if it is empty, populate it with the default data). Would it be best to use a statement like SELECT COUNT(*) FROM `table` and then compare to 0, or would it be better to do a statement like SELECT `id` FROM `table` LIMIT 0,1 then check if any results were returned (the result set has next)?
Although I need this for a project I am working on, I am also interested in how MySQL works with those two statements and whether the reason people seem to suggest using COUNT(*) is because the result is cached or whether it actually goes through every row and adds to a count as it would intuitively seem to me.
You should definitely go with the second query rather than the first.
When using COUNT(*), MySQL scans at least an index and counts the records. Even if you wrap the call in LEAST() (SELECT LEAST(COUNT(*), 1) FROM `table`;) or an IF(), MySQL will fully evaluate COUNT() before evaluating further. I don't believe MySQL caches the COUNT(*) result when InnoDB is being used.
Your second query results in only one row being read; furthermore, an index is used (assuming id is part of one). Look at the documentation of your driver to find out how to check whether any rows have been returned.
By the way, the id field may be omitted from the query (MySQL will use an arbitrary index):
SELECT 1 FROM `table` LIMIT 1;
However, I think the simplest and most performant solution is the following (as indicated in Gordon's answer):
SELECT EXISTS (SELECT 1 FROM `table`);
EXISTS returns 1 if the subquery returns any rows, otherwise 0. Because of this semantic MySQL can optimize the execution properly.
Any fields listed in the subquery are ignored, thus 1 or * is commonly written.
See the MySQL Manual for more info on the EXISTS keyword and its use.
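For the use case in the question (populate defaults only when the table is empty), the check and the insert might be combined as in this sketch. The table app_settings and its columns are invented purely for illustration:

```sql
-- Hypothetical table used only for this illustration.
CREATE TABLE IF NOT EXISTS app_settings (
    name VARCHAR(64) PRIMARY KEY,
    val  VARCHAR(255) NOT NULL
);

-- Returns 1 if the table has at least one row, otherwise 0.
SELECT EXISTS (SELECT 1 FROM app_settings) AS has_rows;

-- Insert a default row only when the table is currently empty.
INSERT INTO app_settings (name, val)
SELECT 'theme', 'default'
FROM DUAL
WHERE NOT EXISTS (SELECT 1 FROM app_settings);
```

Running the INSERT a second time adds nothing, because the table is no longer empty.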
It is better to do the second method or just exists. Specifically, something like:
if exists (select id from `table`)
should be the fastest way to do what you want. You don't need the limit; the SQL engine takes care of that for you.
By the way, never put identifiers (table and column names) in single quotes.

Best way to count rows from mysql database

After facing a slow loading time issue with a MySQL query, I'm now looking for the best way to count the number of rows. I stupidly used the mysql_num_rows() function for this and have now realized it is the worst way to do it.
I was actually building pagination in PHP.
I have found several ways to count the number of rows, but I'm looking for the fastest one.
The table type is MyISAM.
So the question now is:
Which is the best and fastest way to count?
1. `SELECT count(*) FROM table_name`
2. `SELECT TABLE_ROWS
FROM INFORMATION_SCHEMA.TABLES WHERE table_schema = 'database_name'
AND table_name LIKE 'table_name'`
3. `SHOW TABLE STATUS LIKE 'table_name'`
4. `SELECT FOUND_ROWS()`
If there are other, better ways to do this, please let me know as well.
If possible, please explain why it is best and fastest, so I can understand and choose the method based on my requirements.
Thanks.
Quoting the MySQL Reference Manual on COUNT
COUNT(*) is optimized to return very quickly if the SELECT retrieves
from one table, no other columns are retrieved, and there is no WHERE
clause. For example:
mysql> SELECT COUNT(*) FROM student;
This optimization applies only to
MyISAM tables only, because an exact row count is stored for this
storage engine and can be accessed very quickly. For transactional
storage engines such as InnoDB, storing an exact row count is more
problematic because multiple transactions may be occurring, each of
which may affect the count.
Also read this question
MySQL - Complexity of: SELECT COUNT(*) FROM MyTable;
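Since the optimization quoted above depends on the storage engine, it may be worth confirming which engine a table actually uses. A small sketch, where 'table_name' is the placeholder from the question and the current default database is assumed:

```sql
SELECT TABLE_NAME, ENGINE
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'table_name';
```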
I would start by using SELECT count(*) FROM table_name because it is the most portable, the easiest to understand, and because it is likely that the DBMS developers optimise common idiomatic queries of this sort.
Only if that wasn't fast enough would I benchmark the approaches you list to find if any were significantly faster.
It's slightly faster to count a constant:
select count('x') from table;
When the parser hits count(*) it has to go figure out what all the columns of the table are that are represented by the * and get ready to accept them inside the count().
Using a constant bypasses this (albeit slight) column checking overhead.
As an aside, although not faster, one cute option is:
select sum(1) from table;
I've looked around quite a bit for this recently. It seems that there are a few here that I'd never seen before.
Special needs: this database has about 6 million records and is getting crushed by multi-insert queries all the time. Getting a true count is difficult, to say the least.
SELECT TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES WHERE table_schema = 'admin_worldDomination' AND table_name LIKE 'master'
Showing rows 0 - 0 ( 1 total, Query took 0.0189 sec)
This is decent: very fast but inaccurate. It showed results ranging from 4 million to almost 8 million rows.
SELECT count( * ) AS counter FROM `master`
No time displayed, took 8 seconds real time. Will get much worse as the table grows. This has been killing my site previous to today.
SHOW TABLE STATUS LIKE 'master'
Seems to be as fast as the first, no time displayed though. Offers lots of other table information, not much of it is worth anything though (avg record length maybe).
SELECT FOUND_ROWS() FROM 'master'
Showing rows 0 - 29 ( 4,824,232 total, Query took 0.0004 sec)
This is good, but an average. Closer spread than others (4-5 million) so I'll probably end up taking a sample from a few of these queries and averaging.
EDIT: This was really slow when run as a query from PHP, so I ended up going with the first. The query runs 30 times quickly and I take an average, in under 1 second ... it still ranges between 5.3 and 5.5 million.
One idea I had, to throw this out there, is to try to find a way to estimate the row count. Since it's just to give your user an idea of the number of pages, maybe you don't need to be exact and could even say Page 1 of ~4837 or Page 1 of about 4800 or something.
I couldn't quickly find an estimate count function, but you could try getting the table size and dividing by a determined/constant avg row size. I don't know if or why getting the table size from TABLE STATUS would be faster than getting the rows from TABLE STATUS.
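One way the estimate idea could look, as a sketch only: it reads the row estimate from INFORMATION_SCHEMA and derives an approximate page count. The page size of 30 rows is an assumption, and for InnoDB tables TABLE_ROWS is only an approximation.

```sql
SELECT TABLE_ROWS,
       CEIL(TABLE_ROWS / 30) AS approx_pages  -- 30 rows per page is assumed
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'master';
```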

Which is a less expensive query count(id) or order by id

I'd like to know which of the following would execute faster in a MySQL database. The table would have 200 - 1000 entries.
SELECT id
from TABLE
order by id desc
limit 1
or
SELECT count(id)
from TABLE
The story is that the table is cached, so this query is executed every time before cache retrieval to determine whether the cached data is invalid, by comparing against the previous value.
So if there is an even less expensive query, please let me know. Thanks.
If you
start from 1
never have any gaps
use the MyISAM engine
and id is not nullable
then the 2nd could run [ever so marginally] faster due to not having to visit table data at all (the count is stored in metadata).
Otherwise,
if the table has NO index on ID (causing a SCAN), the 2nd one is faster.
Barring both of the above,
the first one is faster.
And if you actually meant to ask SELECT .. LIMIT 1 vs SELECT MAX(id).. then the answer is actually that they are the same for MySQL and most sane DBMS, whether or not there is an index.
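For reference, the two forms being compared would look like this, with cached_table as a hypothetical stand-in for TABLE in the question:

```sql
-- Highest id via ordering.
SELECT id FROM cached_table ORDER BY id DESC LIMIT 1;

-- Highest id via aggregate.
SELECT MAX(id) FROM cached_table;
```

One difference to keep in mind: on an empty table the first form returns no rows, while the second returns a single row containing NULL.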
I think the first query will run faster, as it only has to produce one row; with 200-1000 rows it may not matter much in this case.
As already pointed out in the comments, your table is so small that it really doesn't matter which solution you choose. For this reason select count(id) should be used, as it expresses the intent and doesn't need any further processing.
Now select count(id) comes with an alternative, select count(*). These two are not synonyms. select count(*) counts the number of rows and uses a cached value if possible, whereas select count(id) counts the number of non-null values in the column id. If the id column is declared not null, then the cached row count may be used.
The selection between count(*) and count(id) depends once again on your intent. In the general case, count(*) describes the intent better.
Then there is the possibility of count(1), which is actually a synonym of count(*) in MySQL, but the interpretation may vary if you end up using a different RDBMS.
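The difference between the three forms can be seen on a tiny example. The table demo_counts is invented for this illustration and deliberately allows NULL in id:

```sql
-- Hypothetical table with a nullable id column.
CREATE TABLE demo_counts (id INT NULL);
INSERT INTO demo_counts (id) VALUES (1), (2), (NULL);

SELECT COUNT(*)  AS all_rows,      -- counts every row: 3
       COUNT(id) AS non_null_ids,  -- skips the NULL: 2
       COUNT(1)  AS ones           -- same as COUNT(*): 3
FROM demo_counts;
```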
The performance of each type of count also varies depending on whether you are using MyISAM or InnoDB. The row counts are cached on the former but not on the latter, if I've understood correctly.
In the end, you should rely on query plans and running tests and measuring their performance rather than these general ramblings.