I have two database tables, Companies (InnoDB engine) and Company_financial_figures (MyISAM engine). Companies has about 300 000 records; Company_financial_figures has about 600 000 records and a flag column that is used in the LEFT JOIN condition.
The idea of the query is to select the actual balance for every company (there are companies with no balance data, but they must be selected anyway, so I have to use a LEFT JOIN). It seems to me that it should read about 300k matching records from the Company_financial_figures table, not do a full table scan over all 600k records, yet the performance of this query is very slow.
The query is something like this:
SELECT DISTINCT comp.id, comp.name, comp.surname, cff.balance
FROM companies comp
LEFT JOIN company_financial_figures cff
    ON (cff.company_id = comp.id AND cff.actual = 1)
+----+-------------+-----------------+-------+---------------+----------+---------+------+--------+-------------+
| id | select_type | table           | type  | possible_keys | key      | key_len | ref  | rows   | Extra       |
+----+-------------+-----------------+-------+---------------+----------+---------+------+--------+-------------+
|  1 | SIMPLE      | companies       | index | NULL          | comp_i_i | 2       | NULL | 346908 | Using index |
|  1 | SIMPLE      | company_finan.. | ALL   | NULL          | NULL     | NULL    | NULL | 610364 |             |
+----+-------------+-----------------+-------+---------------+----------+---------+------+--------+-------------+
I have an index on the company_id column, but it doesn't help.
Any suggestions?
Why do you need the DISTINCT keyword? DISTINCT always slows down your query, because every record has to be considered explicitly for the DISTINCT comparison. I don't know how MySQL handles this in detail, but in Oracle you can get horrible execution plans if one of your DISTINCT fields is nullable.
But you shouldn't need it; your records are probably distinct anyway, because you select comp.id (I'm guessing?). Wouldn't it be a lot faster if you removed DISTINCT?
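A minimal sketch of both suggestions combined, assuming a composite index covering the two join conditions (the index name here is made up; the columns are taken from the question):

ALTER TABLE company_financial_figures
    ADD INDEX idx_company_actual (company_id, actual);

-- Without DISTINCT: each company appears once, as long as at most
-- one row per company_id has actual = 1
SELECT comp.id, comp.name, comp.surname, cff.balance
FROM companies comp
LEFT JOIN company_financial_figures cff
    ON (cff.company_id = comp.id AND cff.actual = 1);

With such an index the join side should become a ref lookup instead of the full scan of 610k rows shown in the EXPLAIN above.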
I am using the test DB from https://github.com/datacharmer/test_db. It has a moderate size of 160 MB. To run queries I use MySQL Workbench.
The following query runs in 0.015 s:
SELECT *
FROM employees INNER JOIN salaries ON employees.emp_no = salaries.emp_no
A similar query with GROUP BY added runs for 15.0 s:
SELECT AVG(salary), gender
FROM employees INNER JOIN salaries ON employees.emp_no = salaries.emp_no
GROUP BY gender
I checked the execution plan for both queries and found that in both cases the query cost is similar, about 600k. I should add that the employees table has 300K rows and the salaries table about 3M rows.
Can anyone suggest a reason why the difference in execution time is so huge? I need this explanation to better understand how SQL works.
Problem solution: As I found out thanks to the comments and answers, the problem was that I had not noticed that for the first query my IDE was limiting the result to 1000 rows; that's how I got 0.015 s. In reality, the join alone takes 10.0 s in my case. If an index on gender is created (indexes on employees.emp_no and salaries.emp_no already exist in this DB), the join plus GROUP BY also takes 10.0 s; without the gender index the second query takes 18.0 s.
The EXPLAIN for the first query shows that it does a table-scan (type=ALL) of 300K rows from employees, and for each one, does a partial primary key (type=ref) lookup to 1 row (estimated) in salaries.
mysql> explain SELECT * FROM employees
INNER JOIN salaries ON employees.emp_no = salaries.emp_no;
+----+-------------+-----------+------+---------------+---------+---------+----------------------------+--------+-------+
| id | select_type | table     | type | possible_keys | key     | key_len | ref                        | rows   | Extra |
+----+-------------+-----------+------+---------------+---------+---------+----------------------------+--------+-------+
|  1 | SIMPLE      | employees | ALL  | PRIMARY       | NULL    | NULL    | NULL                       | 299113 | NULL  |
|  1 | SIMPLE      | salaries  | ref  | PRIMARY       | PRIMARY | 4       | employees.employees.emp_no |      1 | NULL  |
+----+-------------+-----------+------+---------------+---------+---------+----------------------------+--------+-------+
The EXPLAIN for the second query (actually a sensible query to compute AVG() as you mentioned in your comment) shows something additional:
mysql> EXPLAIN SELECT employees.gender, AVG(salary) FROM employees
INNER JOIN salaries ON employees.emp_no = salaries.emp_no
GROUP BY employees.gender;
+----+-------------+-----------+------+---------------+---------+---------+----------------------------+--------+---------------------------------+
| id | select_type | table     | type | possible_keys | key     | key_len | ref                        | rows   | Extra                           |
+----+-------------+-----------+------+---------------+---------+---------+----------------------------+--------+---------------------------------+
|  1 | SIMPLE      | employees | ALL  | PRIMARY       | NULL    | NULL    | NULL                       | 299113 | Using temporary; Using filesort |
|  1 | SIMPLE      | salaries  | ref  | PRIMARY       | PRIMARY | 4       | employees.employees.emp_no |      1 | NULL                            |
+----+-------------+-----------+------+---------------+---------+---------+----------------------------+--------+---------------------------------+
See the Using temporary; Using filesort in the Extra field? That means the query has to build a temp table to accumulate the AVG() results per group. It has to use a temp table because MySQL can't know that it will scan all the rows for each gender together, so it must assume it will need to maintain running totals for every group independently as it scans rows. That doesn't seem like a big problem for tracking two (in this case) gender totals, but suppose it were grouped by postal code or something like that?
Creating a temp table is a pretty expensive operation. It means writing data, not only reading it as the first query does.
If we could make an index that orders by gender, then MySQL's optimizer would know it can scan all those rows with the same gender together. So it can calculate the running total of one gender at a time, then once it's done scanning one gender, calculate the AVG(salary) and then be guaranteed no further rows for that gender will be scanned. Therefore it can skip building a temp table.
This index helps:
mysql> alter table employees add index (gender, emp_no);
Now the EXPLAIN of the same query shows that it will do an index scan (type=index) which visits the same number of entries, but it'll scan in a more helpful order for the calculation of the aggregate AVG().
Same query, but no Using temporary note:
mysql> EXPLAIN SELECT employees.gender, AVG(salary) FROM employees
INNER JOIN salaries ON employees.emp_no = salaries.emp_no
GROUP BY employees.gender;
+----+-------------+-----------+-------+----------------+---------+---------+----------------------------+--------+-------------+
| id | select_type | table     | type  | possible_keys  | key     | key_len | ref                        | rows   | Extra       |
+----+-------------+-----------+-------+----------------+---------+---------+----------------------------+--------+-------------+
|  1 | SIMPLE      | employees | index | PRIMARY,gender | gender  | 5       | NULL                       | 299113 | Using index |
|  1 | SIMPLE      | salaries  | ref   | PRIMARY        | PRIMARY | 4       | employees.employees.emp_no |      1 | NULL        |
+----+-------------+-----------+-------+----------------+---------+---------+----------------------------+--------+-------------+
And executing this query is a lot faster:
+--------+-------------+
| gender | AVG(salary) |
+--------+-------------+
| M      |  63838.1769 |
| F      |  63769.6032 |
+--------+-------------+
2 rows in set (1.06 sec)
The addition of the GROUP BY clause could easily explain the big performance drop that you are seeing.
From the documentation:
The most general way to satisfy a GROUP BY clause is to scan the whole table and create a new temporary table where all rows from each group are consecutive, and then use this temporary table to discover groups and apply aggregate functions (if any).
The additional cost incurred by the grouping process can be very expensive. Also, grouping happens even if no aggregate function is used.
If you don't need an aggregate function, don't group. If you do, ensure that you have a single index that references all grouped columns, as suggested by the documentation:
In some cases, MySQL is able to do much better than that and avoid creation of temporary tables by using index access.
PS: please note that "SELECT * ... GROUP BY"-like statements (selecting columns that are neither grouped nor aggregated) are rejected by default since MySQL 5.7.5, unless you disable the ONLY_FULL_GROUP_BY SQL mode.
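To check which modes are active, or to relax that mode for the current session only, something along these lines works (a sketch; fixing the query is usually preferable to disabling the mode):

SELECT @@sql_mode;
SET SESSION sql_mode = REPLACE(@@sql_mode, 'ONLY_FULL_GROUP_BY', '');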
There is another reason, in addition to what GMB points out: you are probably looking at the timing of the first query only up to the point where it returns the first rows. I doubt it is returning all the rows in 0.015 seconds.
The second query with the GROUP BY needs to process all the data to derive the results.
If you added an ORDER BY (which requires processing all the data) to the first query, then you would see a similar performance drop.
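For example, a sort over the full join result (an illustrative query, not from the original post) forces all rows to be processed before the first one is returned:

SELECT *
FROM employees INNER JOIN salaries ON employees.emp_no = salaries.emp_no
ORDER BY salaries.salary;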
explain select id, nome from bea_clientes where id in (
select group_concat(distinct(bea_clientes_id)) as list
from bea_agenda
where bea_clientes_id>0
and bea_agente_id in(300006,300007,300008,300009,300010,300011,300012,300013,300014,300018,300019,300020,300021,300022)
)
When I try to run the above (without the EXPLAIN), MySQL simply goes busy, using a DEPENDENT SUBQUERY, which makes this slow as hell. The question is why the optimizer evaluates the subquery once for each id in bea_clientes. I even put the IN argument in a GROUP_CONCAT, believing the result would be treated as a plain string, to avoid the repeated scanning.
I thought this was no longer a problem for MySQL server 5.5+?
Testing on MariaDB does the same thing.
Is this a known bug? I know I can rewrite this as a join, but still, this is terrible.
EXPLAIN output (phpMyAdmin 4.4.14 / MySQL 5.6.26):
| id | select_type        | table        | type  | possible_keys                 | key           | key_len | ref  | rows  | Extra                              |
|----|--------------------|--------------|-------|-------------------------------|---------------|---------|------|-------|------------------------------------|
| 1  | PRIMARY            | bea_clientes | ALL   | NULL                          | NULL          | NULL    | NULL | 30432 | Using where                        |
| 2  | DEPENDENT SUBQUERY | bea_agenda   | range | bea_clientes_id,bea_agente_id | bea_agente_id | 5       | NULL | 2352  | Using index condition; Using where |
Obviously hard to test without the data, but try something like the query below.
Subqueries are just not good in MySQL (though it's my preferred engine).
I would also recommend indexing the relevant columns, which will improve performance for both queries; a sketch follows after the rewritten query.
For clarity, I would also advise writing queries out expanded, like this:
select t2.id, t2.nome
from (
    -- one row per matching client id; the original GROUP_CONCAT collapsed
    -- them into a single comma-separated string, and comparing that string
    -- against a single id means at most one row could ever match
    select distinct bea_clientes_id
    from bea_agenda
    where bea_clientes_id > 0
      and bea_agente_id in (300006,300007,300008,300009,300010,300011,300012,
                            300013,300014,300018,300019,300020,300021,300022)
) as t1
join bea_clientes as t2
  on t1.bea_clientes_id = t2.id
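For the indexing suggestion, something like this should help (the index name is mine; the columns come from the WHERE clause above):

ALTER TABLE bea_agenda
    ADD INDEX idx_agente_cliente (bea_agente_id, bea_clientes_id);

With bea_agente_id leading, the IN list can be resolved from the index, and bea_clientes_id is read from the index as well, so the derived table never has to touch the base rows.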
I'm using MySQL, and a simple query like this seems to hang forever with the CPU at 100%.
select login_name, server, recharge
from role_info
where login_name in (select login_name
from role_info
group by login_name
having count(login_name) > 1)
The table role_info is small, with only 33535 rows.
Below is the output of EXPLAIN:
| id | select_type        | table     | type  | possible_keys | key  | key_len | ref  | rows  | Extra       |
| 1  | PRIMARY            | role_info | ALL   | NULL          | NULL | NULL    | NULL | 33535 | Using where |
| 2  | DEPENDENT SUBQUERY | role_info | index | NULL          | a    | 302     | NULL | 1     | Using index |
show processlist reports that the query keeps Sending data:
Query | 1135 | Sending data | select login_name, server, recharge from role_info where login_name in (select login_name from ...
I can switch to a join instead of IN and it works fine, but I am still curious why this simple query acts so abnormally.
I wonder how many people stumble over this bug again and again. It's the long-standing MySQL bug #32665, where MySQL 6 is the target version. It means that MySQL will turn your uncorrelated subquery into a correlated one (executed once per row of the outer result set). Just in case you can't rewrite the query to use JOINs, this blog post has some ideas on how to work around the limitation.
In some versions of MySQL, IN with a subquery is quite inefficient: the subquery gets executed for each row being processed. In other words, the results are not cached. I think this is fixed in the most recent versions.
You know how to fix it, by using a join. In general, I gravitate to joins or EXISTS in MySQL instead of using IN with a subquery.
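As a concrete sketch of that rewrite for the query above (table and column names are from the question):

select r.login_name, r.server, r.recharge
from role_info r
join (select login_name
      from role_info
      group by login_name
      having count(login_name) > 1) dup
  on dup.login_name = r.login_name;

The derived table is materialized once, so the per-row re-execution implied by the DEPENDENT SUBQUERY plan goes away.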
Here is the SQL query in question:
select * from company1
left join company2 on company2.model
LIKE CONCAT(company1.model,'%')
where company1.manufacturer = company2.manufacturer
company1 contains 2000 rows while company2 contains 9000 rows.
The query takes around 25 seconds to complete.
I have company1.model and company2.model indexed.
Any idea how I can speed this up? Thanks!
+----+-------------+-----------+------+---------------+------+---------+------+------+--------------------------------+
| id | select_type | table     | type | possible_keys | key  | key_len | ref  | rows | Extra                          |
+----+-------------+-----------+------+---------------+------+---------+------+------+--------------------------------+
|  1 | SIMPLE      | company1  | ALL  | NULL          | NULL | NULL    | NULL | 2853 |                                |
|  1 | SIMPLE      | company2  | ALL  | NULL          | NULL | NULL    | NULL | 8986 | Using where; Using join buffer |
+----+-------------+-----------+------+---------------+------+---------+------+------+--------------------------------+
This query is not written the same way as yours, but maybe it is what you want? I am quite sure it will give you the same result:
select
*
from
company1 inner join company2
on company1.manufacturer = company2.manufacturer
where
company2.model LIKE CONCAT(company1.model,'%')
EDIT: I also removed your LEFT JOIN and put an INNER JOIN. If the join doesn't succeed, company2.model is NULL, and NULL LIKE 'Something%' can never be true, so those rows would be filtered out anyway.
One way to speed this up is to remove the LIKE CONCAT() from the join condition.
MySQL is not able to use an index for a substring-based search like that, because the pattern is built from another row's column rather than a constant prefix, so your query results in a full table scan.
Your EXPLAIN shows that you have no indexes that can be used.
Appropriate indexes on both tables would help. Either a single index on (manufacturer) or a composite (manufacturer, model):
ALTER TABLE company1
    ADD INDEX manufacturer_model_IDX  -- this is just a name (of your choice)
    (manufacturer, model) ;           -- for the index
ALTER TABLE company2
    ADD INDEX manufacturer_model_IDX
    (manufacturer, model) ;
I have four tables that I am trying to join and output the result to a new table. My code looks like this:
create table tbl
select a.dte, a.permno, (ret - rf) f0_xs_ret, (xs_ret - (betav*xs_mkt)) f0_resid, mkt_cap last_year_mkt_cap, betav beta_value
from a inner join b using (dte)
inner join c on (year(a.dte) = c.yr and a.permno = c.permno)
inner join d on (a.permno = d.permno and year(a.dte)-1 = year(d.dte));
All of the tables have multiple indexes. For table a, (dte, permno) identifies a unique record; for table b, dte identifies a unique record; for table c, (yr, permno) identifies a unique record; and for table d, (dte, permno) identifies a unique record. The EXPLAIN for the SELECT part of the query is:
+----+-------------+-------+--------+-------------------+---------+---------+----------------------------------+--------+-------------------+
| id | select_type | table | type   | possible_keys     | key     | key_len | ref                              | rows   | Extra             |
+----+-------------+-------+--------+-------------------+---------+---------+----------------------------------+--------+-------------------+
|  1 | SIMPLE      | d     | ALL    | idx1              | NULL    | NULL    | NULL                             | 264129 |                   |
|  1 | SIMPLE      | c     | ref    | idx2              | idx2    | 4       | achernya.d.permno                |     16 |                   |
|  1 | SIMPLE      | b     | ALL    | PRIMARY,idx2      | NULL    | NULL    | NULL                             |  12336 | Using join buffer |
|  1 | SIMPLE      | a     | eq_ref | PRIMARY,idx1,idx2 | PRIMARY | 7       | achernya.b.dte,achernya.d.permno |      1 | Using where       |
+----+-------------+-------+--------+-------------------+---------+---------+----------------------------------+--------+-------------------+
Why does MySQL have to read so many rows to process this? If I am reading this correctly, it has to read 264129 * 16 * 12336 rows, which should take a good month.
Could someone please explain what's going on here?
MySQL has to read all those rows because you're using functions in your join conditions. An index on dte will not help resolve YEAR(dte) in a query. If you want to make this fast, put the year in its own column to use in joins and move the index to that column, even if that means some denormalization.
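A minimal sketch of that denormalization for table a from the question (the dte_year column and index name are hypothetical):

ALTER TABLE a ADD COLUMN dte_year SMALLINT;
UPDATE a SET dte_year = YEAR(dte);
ALTER TABLE a ADD INDEX idx_year_permno (dte_year, permno);

The join condition year(a.dte) = c.yr can then be written as a.dte_year = c.yr, which the new index can resolve. The column has to be kept in sync with dte on writes (on MySQL 5.7+ a stored generated column could do that automatically).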
As for the other columns in your indexes that you don't apply functions to: they may not be used if the index wouldn't provide much benefit, or if they aren't the leftmost column in the index and your join condition doesn't use the leftmost prefix of that index.
Sometimes MySQL does not use an index, even if one is available. One circumstance under which this occurs is when the optimizer estimates that using the index would require MySQL to access a very large percentage of the rows in the table. (In this case, a table scan is likely to be much faster because it requires fewer seeks.)
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html