mysql "in" clause quite slow(seems to hang) - mysql

I'm using MySQL, and a simple query like this seems to hang forever with the CPU at 100%:
select login_name, server, recharge
from role_info
where login_name in (select login_name
from role_info
group by login_name
having count(login_name) > 1)
The table role_info is small, with only 33535 rows.
Below is the output of EXPLAIN:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | PRIMARY | role_info | ALL | NULL | NULL | NULL | NULL | 33535 | Using where |
| 2 | DEPENDENT SUBQUERY | role_info | index | NULL | a | 302 | NULL | 1 | Using index |
SHOW PROCESSLIST reports the query stuck in "Sending data":
Query | 1135 | Sending data | select login_name, server, recharge from role_info where login_name in (select login_name from ...
I can switch to a join instead of "in" and it works fine, but I'm still curious why this simple query acts so abnormally.

I wonder how many people stumble over this bug again and again. It's the long-standing MySQL bug #32665, with MySQL 6 as the target version for a fix. It means that MySQL turns your uncorrelated subquery into a correlated one, executed once per row of the outer result set. Just in case, this blog post has some ideas on how to work around the limitation when you can't rewrite the query to use JOINs.

In some versions of MySQL, IN with a subquery is quite inefficient: the subquery gets executed for each row being processed. In other words, the results are not cached. I think this is fixed in the most recent versions.
You already know how to fix it: use a join. In general, I gravitate toward JOIN or EXISTS in MySQL instead of IN with a subquery.
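A minimal sketch of the JOIN rewrite for the question's query (the derived-table alias dup is mine):
select r.login_name, r.server, r.recharge
from role_info r
join (select login_name
      from role_info
      group by login_name
      having count(login_name) > 1) dup on dup.login_name = r.login_name;
-- the duplicate-finding subquery is materialized once instead of being re-run per outer row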

Related

Addition of GROUP BY to a simple query makes it 1000× slower

I am using the test DB from https://github.com/datacharmer/test_db. It has a moderate size of 160 MB. To run queries I use MySQL Workbench.
The following query runs in 0.015s:
SELECT *
FROM employees INNER JOIN salaries ON employees.emp_no = salaries.emp_no
A similar query with GROUP BY added runs for 15.0s:
SELECT AVG(salary), gender
FROM employees INNER JOIN salaries ON employees.emp_no = salaries.emp_no
GROUP BY gender
I checked the execution plan for both queries and found that in both cases the query cost is similar, about 600k. I should add that the employees table has 300K rows and the salaries table about 3M rows.
Can anyone suggest a reason why the difference in execution time is so huge? I need this explanation to understand better how SQL works.
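Aside: one way to see where a cost figure like that ~600k comes from is EXPLAIN FORMAT=JSON (available since MySQL 5.6.5; the query_cost field appears in 5.7+):
EXPLAIN FORMAT=JSON
SELECT AVG(salary), gender
FROM employees INNER JOIN salaries ON employees.emp_no = salaries.emp_no
GROUP BY gender;
-- look for "query_cost" under "cost_info" in the JSON output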
Problem solution: As the comments and answers pointed out, the problem was that I had not noticed my IDE limiting the first query's result to 1000 rows; that's how I got 0.015s. In reality, the join takes 10.0s in my case. With an index on gender created (indexes for employees.emp_no and salaries.emp_no already exist in this DB), the join plus GROUP BY also takes 10.0s; without the gender index, the second query takes 18.0s.
The EXPLAIN for the first query shows that it does a table-scan (type=ALL) of 300K rows from employees, and for each one, does a partial primary key (type=ref) lookup to 1 row (estimated) in salaries.
mysql> explain SELECT * FROM employees
INNER JOIN salaries ON employees.emp_no = salaries.emp_no;
+----+-------------+-----------+------+---------------+---------+---------+----------------------------+--------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------+---------------+---------+---------+----------------------------+--------+-------+
| 1 | SIMPLE | employees | ALL | PRIMARY | NULL | NULL | NULL | 299113 | NULL |
| 1 | SIMPLE | salaries | ref | PRIMARY | PRIMARY | 4 | employees.employees.emp_no | 1 | NULL |
+----+-------------+-----------+------+---------------+---------+---------+----------------------------+--------+-------+
The EXPLAIN for the second query (actually a sensible query to compute AVG() as you mentioned in your comment) shows something additional:
mysql> EXPLAIN SELECT employees.gender, AVG(salary) FROM employees
INNER JOIN salaries ON employees.emp_no = salaries.emp_no
GROUP BY employees.gender;
+----+-------------+-----------+------+---------------+---------+---------+----------------------------+--------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------+---------------+---------+---------+----------------------------+--------+---------------------------------+
| 1 | SIMPLE | employees | ALL | PRIMARY | NULL | NULL | NULL | 299113 | Using temporary; Using filesort |
| 1 | SIMPLE | salaries | ref | PRIMARY | PRIMARY | 4 | employees.employees.emp_no | 1 | NULL |
+----+-------------+-----------+------+---------------+---------+---------+----------------------------+--------+---------------------------------+
See the Using temporary; Using filesort in the Extra field? That means that the query has to build a temp table to accumulate the AVG() results per group. It has to use a temp table because MySQL can't know that it will scan all the rows for each gender together, so it must assume it will need to maintain running totals independently as it scans rows. Tracking two gender totals (in this case) doesn't seem like a big problem, but suppose the grouping column were postal code or something with many more distinct values?
Creating a temp table is a pretty expensive operation. It means writing data, not only reading it as the first query does.
If we could make an index that orders by gender, then MySQL's optimizer would know it can scan all those rows with the same gender together. So it can calculate the running total of one gender at a time, then once it's done scanning one gender, calculate the AVG(salary) and then be guaranteed no further rows for that gender will be scanned. Therefore it can skip building a temp table.
This index helps:
mysql> alter table employees add index (gender, emp_no);
Now the EXPLAIN of the same query shows that it will do an index scan (type=index) which visits the same number of entries, but it'll scan in a more helpful order for the calculation of the aggregate AVG().
Same query, but no Using temporary note:
mysql> EXPLAIN SELECT employees.gender, AVG(salary) FROM employees
INNER JOIN salaries ON employees.emp_no = salaries.emp_no
GROUP BY employees.gender;
+----+-------------+-----------+-------+----------------+---------+---------+----------------------------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+----------------+---------+---------+----------------------------+--------+-------------+
| 1 | SIMPLE | employees | index | PRIMARY,gender | gender | 5 | NULL | 299113 | Using index |
| 1 | SIMPLE | salaries | ref | PRIMARY | PRIMARY | 4 | employees.employees.emp_no | 1 | NULL |
+----+-------------+-----------+-------+----------------+---------+---------+----------------------------+--------+-------------+
And executing this query is a lot faster:
+--------+-------------+
| gender | AVG(salary) |
+--------+-------------+
| M | 63838.1769 |
| F | 63769.6032 |
+--------+-------------+
2 rows in set (1.06 sec)
The addition of the GROUP BY clause could easily explain the big performance drop that you are seeing.
From the documentation :
The most general way to satisfy a GROUP BY clause is to scan the whole table and create a new temporary table where all rows from each group are consecutive, and then use this temporary table to discover groups and apply aggregate functions (if any).
The additional cost incurred by the grouping process can be very expensive. Also, grouping happens even if no aggregate function is used.
If you don’t need an aggregate function, don’t group. If you do, ensure that you have a single index that references all grouped columns, as suggested by the documentation :
In some cases, MySQL is able to do much better than that and avoid creation of temporary tables by using index access.
PS: please note that "SELECT * ... GROUP BY"-style statements are rejected by default as of MySQL 5.7.5 (unless you disable the ONLY_FULL_GROUP_BY SQL mode).
There is another reason, in addition to what GMB points out: you are probably looking at the timing of the first query up to when it returns its first row. I doubt it is returning all the rows in 0.015 seconds.
The second query with the GROUP BY needs to process all the data to derive the results.
If you added an ORDER BY (which requires processing all the data) to the first query, then you would see a similar performance drop.
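For instance, a sketch of that test; any ORDER BY that the join order can't satisfy will do:
SELECT *
FROM employees INNER JOIN salaries ON employees.emp_no = salaries.emp_no
ORDER BY salaries.salary;
-- the whole join result must be sorted before the first row can be returned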

Why does this MySQL query perform poorly (DEPENDENT SUBQUERY)?

explain select id, nome from bea_clientes where id in (
select group_concat(distinct(bea_clientes_id)) as list
from bea_agenda
where bea_clientes_id>0
and bea_agente_id in(300006,300007,300008,300009,300010,300011,300012,300013,300014,300018,300019,300020,300021,300022)
)
When I run the above (without the EXPLAIN), MySQL simply goes busy, using DEPENDENT SUBQUERY, which makes this slow as hell. The question is why the optimizer evaluates the subquery once for each id in bea_clientes. I even wrapped the IN argument in GROUP_CONCAT, believing that would be the same as passing the result in as a plain "string", to avoid the repeated scanning.
I thought this was no longer supposed to be a problem in MySQL 5.5+?
Testing in MariaDB does the same.
Is this a known bug? I know I can rewrite this as a join, but it's still terrible.
EXPLAIN output (phpMyAdmin 4.4.14 / MySQL 5.6.26):
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|----|--------------------|--------------|-------|-------------------------------|---------------|---------|------|-------|------------------------------------|
| 1 | PRIMARY | bea_clientes | ALL | NULL | NULL | NULL | NULL | 30432 | Using where |
| 2 | DEPENDENT SUBQUERY | bea_agenda | range | bea_clientes_id,bea_agente_id | bea_agente_id | 5 | NULL | 2352 | Using index condition; Using where |
Obviously hard to test without the data, but try something like the query below.
Subqueries are just not good in MySQL (though it's my preferred engine).
I'd also recommend indexing the relevant columns, which will improve performance for both queries.
For clarity, I'd also advise writing the queries out expanded.
select t2.id, t2.nome
from (select group_concat(distinct bea_clientes_id) as list
      from bea_agenda
      where bea_clientes_id > 0
      and bea_agente_id in (300006,300007,300008,300009,300010,300011,300012,300013,300014,300018,300019,300020,300021,300022)
     ) as t1
join bea_clientes as t2
on find_in_set(t2.id, t1.list)
-- note: t1.list is a comma-separated string, so the join condition must use
-- find_in_set() rather than =, and the selected columns come from t2
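For reference, the plain join rewrite the question alludes to ("I know I can rewrite this as a join") would be along these lines, with DISTINCT playing the role of the semi-join:
select distinct c.id, c.nome
from bea_clientes as c
join bea_agenda as a on a.bea_clientes_id = c.id
where a.bea_agente_id in (300006,300007,300008,300009,300010,300011,300012,300013,300014,300018,300019,300020,300021,300022);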

Optimizing MySQL select distinct order by limit safely

I have a problematic query that I know how to write faster, but technically the SQL is invalid and there is no guarantee it will keep working correctly in the future.
The original, slow query looks like this:
SELECT sql_no_cache DISTINCT r.field_1 value
FROM table_middle m
JOIN table_right r on r.id = m.id
WHERE ((r.field_1) IS NOT NULL)
AND (m.kind IN ('partial'))
ORDER BY r.field_1
LIMIT 26
This takes about 37 seconds. Explain output:
+----+-------------+-------+--------+-----------------------+---------------+---------+---------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | rows | Extra |
+----+-------------+-------+--------+-----------------------+---------------+---------+---------+-----------------------------------------------------------+
| 1 | SIMPLE | r | range | PRIMARY,index_field_1 | index_field_1 | 9 | 1544595 | Using where; Using index; Using temporary; Using filesort |
| 1 | SIMPLE | m | eq_ref | PRIMARY,index_kind | PRIMARY | 4 | 1 | Using where; Distinct |
+----+-------------+-------+--------+-----------------------+---------------+---------+---------+-----------------------------------------------------------+
The faster version looks like this; the ORDER BY clause is pushed down into a subquery that performs the join, and the outer query applies DISTINCT and the LIMIT on top:
SELECT sql_no_cache DISTINCT value
FROM (
SELECT r.field_1 value
FROM table_middle m
JOIN table_right r ON r.id = m.id
WHERE ((r.field_1) IS NOT NULL)
AND (m.kind IN ('partial'))
ORDER BY r.field_1
) t
LIMIT 26
This takes about 2.7 seconds. Explain output:
+----+-------------+------------+--------+-----------------------+------------+---------+---------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | rows | Extra |
+----+-------------+------------+--------+-----------------------+------------+---------+---------+-----------------------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | 1346348 | Using temporary |
| 2 | DERIVED | m | ref | PRIMARY,index_kind | index_kind | 99 | 1539558 | Using where; Using index; Using temporary; Using filesort |
| 2 | DERIVED | r | eq_ref | PRIMARY,index_field_1 | PRIMARY | 4 | 1 | Using where |
+----+-------------+------------+--------+-----------------------+------------+---------+---------+-----------------------------------------------------------+
There are three million rows in table_right and table_middle, and all mentioned columns are individually indexed. The query should be read as having an arbitrary where clause - it's dynamically generated. The query can't be rewritten in any way that prevents the where clause being easily replaced, and similarly the indexes can't be changed - MySQL doesn't support enough indexes for the number of potential filter field combinations.
Has anyone seen this problem before - specifically, select / distinct / order by / limit performing very poorly - and is there another way to write the same query with good performance that doesn't rely on unspecified implementation behaviour?
(AFAIK MariaDB, for example, ignores order by in a subquery because it should not logically affect the set-theoretic semantics of the query.)
For the more incredulous
Here's how you can create a version of the database for yourself! This should output an SQL script you can run with the mysql command-line client:
#!/usr/bin/env ruby
puts "create database testy;"
puts "use testy;"
puts "create table table_right(id int(11) not null primary key, field_0 int(11), field_1 int(11), field_2 int(11));"
puts "create table table_middle(id int(11) not null primary key, field_0 int(11), field_1 int(11), field_2 int(11));"
puts "begin;"
3_000_000.times do |x|
puts "insert into table_middle values (#{x},#{x*10},#{x*100},#{x*1000});"
puts "insert into table_right values (#{x},#{x*10},#{x*100},#{x*1000});"
end
puts "commit;"
Indexes aren't important for reproducing the effect. The script above is untested; it's an approximation of a pry session I had when reproducing the problem manually.
Replace the m.kind in ('partial') with m.field_1 > 0 or something similar that's trivially true. Observe the large difference in performance between the two techniques, and how the sorting semantics are preserved (tested using MySQL 5.5). The unreliability of those semantics is, of course, precisely the reason I'm asking the question.
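For example, against the generated schema (which has field_1 but no kind column), the slow query becomes:
SELECT sql_no_cache DISTINCT r.field_1 value
FROM table_middle m
JOIN table_right r on r.id = m.id
WHERE ((r.field_1) IS NOT NULL)
AND (m.field_1 > 0)
ORDER BY r.field_1
LIMIT 26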
Please provide SHOW CREATE TABLE. In the absence of that, I will guess that these composite indexes are missing and may be useful (see the DDL sketch after the list):
m: (kind, id)
r: (field_1, id)
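A hedged sketch of that DDL, against the real tables from the question (the index names are mine):
alter table table_middle add index idx_kind_id (kind, id);
alter table table_right add index idx_field1_id (field_1, id);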
You can turn off MariaDB's ignoring of the subquery's ORDER BY.

MySQL query randomly hangs on 'sending data' status?

I am using a MySQL 5.5.25 server, and InnoDB for my databases.
Quite often the server's CPU sits at 100% for roughly a minute due to a mysqld process. Using SHOW PROCESSLIST:
Command | Time | State | Info
Query | 100 | Sending data | SELECT a.prefix, a...
Query | 107 | Sending data | SELECT a.prefix, a...
Query | 50 | Sending data | SELECT a.prefix, a...
The problematic query is:
SELECT a.prefix, a.test_id, a.method_id, b.test_id
FROM a
LEFT JOIN b ON b.test_id = a.test_id
AND user_id = ?
AND year = ?
All these columns are indexed, so that shouldn't be the problem. Also, when I run the query in phpMyAdmin (with a sufficient LIMIT), it takes 0.05 seconds to complete. Frustratingly, I find it impossible to reproduce the problem myself: even executing the query twice simultaneously and spamming it only gives me spikes to 40% CPU.
Prefixing the query with EXPLAIN results in:
Id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | a | ALL | NULL | NULL | NULL | NULL | 1169 |
1 | SIMPLE | b | ref | user_id,year,test_id | test_id | 4 | db.a.test_id | 58 |
Unless I'm missing something right in front of me, I am looking for ways to debug this kind of problem. I already log all queries with their execution times, values, etc. But since I cannot reproduce the problem, I am stuck. What should I do to figure out what it is about?
Wow, thanks to Masood Alam's comment I found and solved the problem.
Using SHOW ENGINE INNODB STATUS turns out to be very helpful in debugging.
It turns out that for some values of user_id the database handles things differently, hence the unpredictable behaviour and my inability to reproduce the problem. Through the status command I was able to copy and paste the exact query that was running at that moment.
These are the results of EXPLAIN for that particular user_id value:
Id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | a | ALL | NULL | NULL | NULL | NULL | 841 |
1 | SIMPLE | b | index_merge | user_id,year,test_id | user_id,year | 4,4 | NULL | 13 | Using intersect(user_id,year); Using where
A follow-up question would now be how this behaviour can be explained. However, fixing the problem for me was just a matter of changing the query to:
SELECT a.prefix, a.test_id, a.method_id, b.test_id
FROM a
LEFT JOIN b ON b.test_id = a.test_id
WHERE
b.id IS NULL
OR user_id = ?
AND year = ?
Of which EXPLAIN results in:
Id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | a | ALL | NULL | NULL | NULL | NULL | 671 |
1 | SIMPLE | b | ref | test_id | test_id | 4 | db.a.test_id | 49 | Using where
Now I know that MySQL can take different approaches to executing the same query given different input values. Why? I still don't know.
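One lead for that follow-up question: the slow plan used an index-merge intersection, and MySQL's optimizer_switch can disable just that strategy. A hedged, session-scoped sketch:
SET SESSION optimizer_switch = 'index_merge_intersection=off';
-- re-run EXPLAIN afterwards to confirm the plan changed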
Sending data is an indication of a long-running query no matter which engine you use, and as such is a very bad sign.
In your case no index is used on table a in either plan, and you are lucky that this table only has a couple hundred records. No index, i.e. type ALL (a full table scan), is something to avoid completely. Remember, things are different once you have thousands or even millions of records in your table.
Create an index modeled on the query in question. Use EXPLAIN again to see whether the engine uses this index. If not, you may use a hint like USE INDEX (myindex) or FORCE INDEX (myindex). The result should be stunning.
If not, your index might not be as good as you think. Reiterate the whole process.
Remember, though, if you change the query you work with this index might no longer be appropriate, so again you have to reiterate.
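A sketch of that process against the question's tables; the index name and column order are illustrative guesses, not a verified fix:
-- composite index modeled on the join and filter columns of the query
alter table b add index idx_b_test_user_year (test_id, user_id, year);
-- re-run EXPLAIN; if the optimizer still ignores the index, hint it:
select a.prefix, a.test_id, a.method_id, b.test_id
from a
left join b force index (idx_b_test_user_year)
on b.test_id = a.test_id and b.user_id = ? and b.year = ?;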

mysql slow complex query with order by

The query below is very slow even without the ORDER BY, and I can't figure out why. I'm guessing it's the WHERE on date_affidavit_filed, but how can I make it fast with that ORDER BY as well? Perhaps a subselect on the job ids that match the WHERE, passed into the rest of the query, but I still need to order by the server name like this. Any suggestions?
explain select sql_no_cache court_county, job.id as jid, job_status,
DATE_FORMAT(job.datetime_served, '%m/%d/%Y') as dserved ,
CONCAT(server.namefirst, ' ', server.namelast) as servername, client_name,
DATE_FORMAT(job.datetime_received, '%m/%d/%Y') as dtrec ,
DATE_FORMAT(job.datetime_give2server, '%m/%d/%Y') as dtg2s,
DATE_FORMAT(date_kase_filed, '%m/%d/%Y') as dkf,
DATE_FORMAT(job.date_sent_to_court, '%m/%d/%Y') as dtstc ,
TO_DAYS(datetime_served )-TO_DAYS(date_kase_filed) as totaldays from job
left join kase on kase.id=job.kase_id
left join server on job.server_id=server.id
left join client on kase.client_id=client.id
left join LUcourt on LUcourt.id=kase.court_id
where date_affidavit_filed is not null and date_affidavit_filed !='' order by servername;
+----+-------------+---------+--------+----------------------+---------+---------+-----------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+--------+----------------------+---------+---------+-----------------------+--------+----------------------------------------------+
| 1 | SIMPLE | job | ALL | date_affidavit_filed | NULL | NULL | NULL | 365212 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | kase | eq_ref | PRIMARY | PRIMARY | 4 | pserve.job.kase_id | 1 | |
| 1 | SIMPLE | server | eq_ref | PRIMARY | PRIMARY | 4 | pserve.job.server_id | 1 | |
| 1 | SIMPLE | client | eq_ref | PRIMARY | PRIMARY | 4 | pserve.kase.client_id | 1 | |
| 1 | SIMPLE | LUcourt | eq_ref | PRIMARY | PRIMARY | 4 | pserve.kase.court_id | 1 | |
+----+-------------+---------+--------+----------------------+---------+---------+-----------------------+--------+----------------------------------------------+
Check that you have indexes on the following columns: job.kase_id and job.server_id.
Also, you are ordering by a calculated field, which is not optimal. Perhaps order by an indexed field instead.
If you need to preserve that exact sort, you might want to add a field in the DB for that value. And populate it with appropriate values or set up a trigger on the DB to populate it for you automatically.
This can speed up the order by:
CREATE INDEX namefull ON server (namefirst,namelast);
if you then use ORDER BY server.namefirst, server.namelast instead of ORDER BY servername, which should produce the same output.
You can also create indexes on each table on any field you are left joining, that can improve the performance of your query too.
When you write,
where date_affidavit_filed is not null and date_affidavit_filed !=''
you are practically selecting most of the rows, or at least so many that it is not worthwhile to go through the index. The query planner sees that there is an index involving date_affidavit_filed but decides not to use it; since the WHERE clause only involves date_affidavit_filed, we know it's not a key issue, it must be a cardinality issue.
| 1 | SIMPLE | job | ALL | date_affidavit_filed | NULL | NULL | NULL | 365212 | Using where; Using temporary; Using filesort |
You can try optimizing this by creating an index on
date_affidavit_filed, kase_id, server_id
in that order. How many rows are returned by the query?
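The corresponding DDL would be something like this (the index name is illustrative):
alter table job add index idx_affidavit_kase_server (date_affidavit_filed, kase_id, server_id);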
You are really selecting everything that isn't empty, and that means nearly everything.
I don't know how many rows of data you have, but it's a lot to go through.
Try narrowing your query to a date range or specific client.
If you really need everything, don't output it one row at a time. Instead, build up one big string, with all formatting, in the software you use for output, and emit it in one go once you've finished looping through the results.
You could also use paging: just add limit 0,30 on page 1, limit 30,30 on page 2, and so on, and let the end user walk through the pages.
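A sketch of that paging pattern on the question's query, trimmed to a couple of columns (offset = (page - 1) * page size):
select job.id, concat(server.namefirst, ' ', server.namelast) as servername
from job
left join server on job.server_id = server.id
where date_affidavit_filed is not null and date_affidavit_filed != ''
order by servername
limit 30 offset 30; -- page 2: rows 31-60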