Addition of GROUP BY to a simple query makes it 1000× slower - mysql

I am using the test DB from https://github.com/datacharmer/test_db. It has a moderate size of 160 MB. To run queries I use MySQL Workbench.
The following query runs in 0.015s:
SELECT *
FROM employees INNER JOIN salaries ON employees.emp_no = salaries.emp_no
A similar query with GROUP BY added runs for 15.0s:
SELECT AVG(salary), gender
FROM employees INNER JOIN salaries ON employees.emp_no = salaries.emp_no
GROUP BY gender
I checked the execution plan for both queries and found that in both cases the query cost is similar, about 600k. I should add that the employees table has 300K rows and the salaries table about 3M rows.
Can anyone suggest a reason why the difference in execution time is so huge? I need this explanation to better understand how SQL works.
Problem solution: As the comments and answers revealed, the problem was that I had not noticed that, for the first query, my IDE was limiting the result to 1000 rows. That's how I got 0.015s. In reality, the join alone takes 10.0s in my case. If an index on gender is created (indexes for employees.emp_no and salaries.emp_no already exist in this DB), the join plus GROUP BY also takes 10.0s. Without the index on gender, the second query takes 18.0s.
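For anyone who wants to reproduce the timing without the client-side row limit, a minimal sketch is to wrap the join in an aggregate so the server must produce every joined row (the timing will still differ somewhat from SELECT *, since fewer columns are read, but the 1000-row illusion disappears):
-- Force the server to consume all joined rows, bypassing any client limit;
-- the COUNT(*) value itself is irrelevant here:
SELECT COUNT(*)
FROM (
    SELECT employees.emp_no
    FROM employees
    INNER JOIN salaries ON employees.emp_no = salaries.emp_no
) AS joined;  -- the derived-table alias is required by MySQL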

The EXPLAIN for the first query shows that it does a table-scan (type=ALL) of 300K rows from employees, and for each one, does a partial primary key (type=ref) lookup to 1 row (estimated) in salaries.
mysql> explain SELECT * FROM employees
INNER JOIN salaries ON employees.emp_no = salaries.emp_no;
+----+-------------+-----------+------+---------------+---------+---------+----------------------------+--------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------+---------------+---------+---------+----------------------------+--------+-------+
| 1 | SIMPLE | employees | ALL | PRIMARY | NULL | NULL | NULL | 299113 | NULL |
| 1 | SIMPLE | salaries | ref | PRIMARY | PRIMARY | 4 | employees.employees.emp_no | 1 | NULL |
+----+-------------+-----------+------+---------------+---------+---------+----------------------------+--------+-------+
The EXPLAIN for the second query (actually a sensible query to compute AVG() as you mentioned in your comment) shows something additional:
mysql> EXPLAIN SELECT employees.gender, AVG(salary) FROM employees
INNER JOIN salaries ON employees.emp_no = salaries.emp_no
GROUP BY employees.gender;
+----+-------------+-----------+------+---------------+---------+---------+----------------------------+--------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------+---------------+---------+---------+----------------------------+--------+---------------------------------+
| 1 | SIMPLE | employees | ALL | PRIMARY | NULL | NULL | NULL | 299113 | Using temporary; Using filesort |
| 1 | SIMPLE | salaries | ref | PRIMARY | PRIMARY | 4 | employees.employees.emp_no | 1 | NULL |
+----+-------------+-----------+------+---------------+---------+---------+----------------------------+--------+---------------------------------+
See the Using temporary; Using filesort in the Extra field? It means the query has to build a temp table to accumulate the AVG() results per group. It has to use a temp table because MySQL can't know that it will scan all the rows for each gender together, so it must assume it needs to maintain a running total for each group independently as it scans rows. That doesn't seem like a big problem when there are only two (in this case) gender totals to track, but suppose the grouping column were postal code or something with thousands of distinct values?
Creating a temp table is a pretty expensive operation. It means writing data, not only reading it as the first query does.
If we could make an index that orders rows by gender, then MySQL's optimizer would know it can scan all the rows with the same gender together. So it can calculate the running total for one gender at a time and, once it has finished scanning one gender, calculate the AVG(salary), guaranteed that no further rows for that gender will be scanned. Therefore it can skip building a temp table.
This index helps:
mysql> alter table employees add index (gender, emp_no);
Now the EXPLAIN of the same query shows that it will do an index scan (type=index), which visits the same number of entries but scans them in an order that is more helpful for calculating the aggregate AVG().
Same query, but no Using temporary note:
mysql> EXPLAIN SELECT employees.gender, AVG(salary) FROM employees
INNER JOIN salaries ON employees.emp_no = salaries.emp_no
GROUP BY employees.gender;
+----+-------------+-----------+-------+----------------+---------+---------+----------------------------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+----------------+---------+---------+----------------------------+--------+-------------+
| 1 | SIMPLE | employees | index | PRIMARY,gender | gender | 5 | NULL | 299113 | Using index |
| 1 | SIMPLE | salaries | ref | PRIMARY | PRIMARY | 4 | employees.employees.emp_no | 1 | NULL |
+----+-------------+-----------+-------+----------------+---------+---------+----------------------------+--------+-------------+
And executing this query is a lot faster:
+--------+-------------+
| gender | AVG(salary) |
+--------+-------------+
| M | 63838.1769 |
| F | 63769.6032 |
+--------+-------------+
2 rows in set (1.06 sec)

The addition of the GROUP BY clause could easily explain the big performance drop that you are seeing.
From the documentation:
The most general way to satisfy a GROUP BY clause is to scan the whole table and create a new temporary table where all rows from each group are consecutive, and then use this temporary table to discover groups and apply aggregate functions (if any).
The additional cost incurred by the grouping process can be very expensive. Also, grouping happens even if no aggregate function is used.
If you don’t need an aggregate function, don’t group. If you do, ensure that you have a single index that references all the grouped columns, as suggested by the documentation:
In some cases, MySQL is able to do much better than that and avoid creation of temporary tables by using index access.
PS: please note that "SELECT * ... GROUP BY"-like statements are rejected by default as of MySQL 5.7.5 (unless you remove ONLY_FULL_GROUP_BY from the sql_mode).
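A quick way to check whether that mode is active on your server (a generic MySQL statement, not specific to this database):
-- Look for ONLY_FULL_GROUP_BY in the returned list of modes:
SELECT @@sql_mode;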

There is another reason, in addition to what GMB points out. Basically, you are probably looking at the time the first query takes to return its first row. I doubt it is returning all the rows in 0.015 seconds.
The second query with the GROUP BY needs to process all the data to derive the results.
If you added an ORDER BY (which requires processing all the data) to the first query, then you would see a similar performance drop.
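For illustration, a minimal sketch of that experiment (the sort column is an arbitrary choice):
-- The ORDER BY forces the server to read and sort all ~3M joined rows
-- before returning the first one, so the timing becomes comparable to
-- the GROUP BY query:
SELECT *
FROM employees INNER JOIN salaries ON employees.emp_no = salaries.emp_no
ORDER BY salaries.salary DESC;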

Related

Self join on a huge table with conditions is taking a lot of time, optimize query

I have a master table with session details.
I want to find, for each session, all combinations of every product in that session with every other product in the same session.
create table combinations as
select
a.main_id,
a.sub_id as sub_id_x,
b.sub_id as sub_id_y,
count(*) as count1,
a.dates as rundate
from
master_table a
left join
master_table b
on a.session_id = b.session_id
and a.visit_number = b.visit_number
and a.main_id = b.main_id
and a.sub_id != b.sub_id
where
a.sub_id is not null
and b.sub_id is not null
group by
a.main_id,
a.sub_id,
b.sub_id,
rundate;
I ran EXPLAIN on the query:
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
| 1 | SIMPLE | a | NULL | ALL | NULL | NULL | NULL | NULL | 298148 | 90.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | b | NULL | ALL | NULL | NULL | NULL | NULL | 298148 | 0.08 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
The main issue is that my master table contains 80 million rows. This query takes more than 24 hours to execute.
All the columns are indexed and I am doing a self join.
Would creating a copy of the table first ('master_table_2') and then joining against it make my query faster?
Is there any way to optimize the query time?
As your table contains a lot of rows, the join query will take a long time if it is not optimized properly and the WHERE clause is not used effectively. An optimized query, however, can save you time and effort. The following link has a good explanation of how to optimize join queries:
Optimization of Join Queries
#Marcus Adams has already provided a similar answer here
Another option is to select from each table individually and combine the results in application code. This is only applicable under specific conditions; you will have to compare both approaches (the join query versus combining in code) and measure the performance. I have gotten better performance with this method once.
Suppose a join query looks like the following:
SELECT A.a1, B.b1, A.a2
FROM A
INNER JOIN B
ON A.a3=B.b3
WHERE B.b3=C;
What I am trying to say is query individually from A and B satisfying the necessary conditions and then try to get your desired result from the code end.
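A minimal sketch of that idea, using the abstract tables above (the column lists and the constant C are placeholders, not tested code):
-- Step 1: run each side as its own query, applying the shared condition:
SELECT a1, a2, a3 FROM A WHERE a3 = 'C';
SELECT b1, b3 FROM B WHERE b3 = 'C';
-- Step 2: combine the two result sets in application code,
-- matching rows where a3 = b3.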
N.B.: This is an unorthodox approach and cannot be taken for granted to work in all cases.
Hope it helps.

Why is this MySQL query poor performance (DEPENDENT_SUBQUERY)

explain select id, nome from bea_clientes where id in (
select group_concat(distinct(bea_clientes_id)) as list
from bea_agenda
where bea_clientes_id>0
and bea_agente_id in(300006,300007,300008,300009,300010,300011,300012,300013,300014,300018,300019,300020,300021,300022)
)
When I try to run the above (without the EXPLAIN), MySQL simply grinds away, using a DEPENDENT SUBQUERY, which makes this slow as hell. The thing is, why does the optimizer evaluate the subquery for each id in bea_clientes? I even put the IN argument inside a GROUP_CONCAT, believing that would be the same as passing the result as a plain string, to avoid the repeated scanning.
I thought this wouldn't be a problem for MySQL server 5.5+?
Testing on MariaDB does the same.
Is this a known bug? I know I can rewrite this as a join, but still this is terrible.
EXPLAIN output (generated by phpMyAdmin 4.4.14 / MySQL 5.6.26):
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|----|--------------------|--------------|-------|-------------------------------|---------------|---------|------|-------|------------------------------------|
| 1 | PRIMARY | bea_clientes | ALL | NULL | NULL | NULL | NULL | 30432 | Using where |
| 2 | DEPENDENT SUBQUERY | bea_agenda | range | bea_clientes_id,bea_agente_id | bea_agente_id | 5 | NULL | 2352 | Using index condition; Using where |
Obviously hard to test without the data, but try something like the query below.
Subqueries are just not good in MySQL (though it's my preferred engine).
I would also recommend indexing the relevant columns, which will improve performance for both queries.
For clarity, I'd also advise writing queries expanded over multiple lines.
-- Note: the original attempt compared a GROUP_CONCAT string against an id,
-- which can never match more than the first value; selecting the distinct
-- ids in a derived table and joining on them is the working equivalent.
select t2.id, t2.nome
from (
    select distinct bea_clientes_id
    from bea_agenda
    where bea_clientes_id > 0
      and bea_agente_id in (300006,300007,300008,300009,300010,300011,300012,300013,300014,300018,300019,300020,300021,300022)
) as t1
join bea_clientes as t2
    on t1.bea_clientes_id = t2.id;

Optimize mysql inner join query to use indexes

I have a quite simple MySQL query that outputs products from a category, ordered by price descending.
SELECT
p.id_product, p.price
FROM product p
INNER JOIN product_category pc
ON (pc.id_product = p.id_product AND pc.id_category=1234)
GROUP BY pc.id_product
ORDER BY p.price DESC
Since I have a lot of products in the "product" table, and even more product-category relations in the "product_category" table, this query takes forever.
I have following indexes / primary keys defined:
table "product" - primary key (id_product)
table "product_category" - primary key (id_product, id_category), index(id_product), index(id_category)
But when I EXPLAIN this query I get:
+----+-------------+-------+--------+--------------------+------------+---------+------------------------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+--------------------+------------+---------+------------------------+-------+----------------------------------------------+
| 1 | SIMPLE | pc | index | PRIMARY,id_product | id_product | 4 | NULL | 73779 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | p | eq_ref | PRIMARY | PRIMARY | 4 | mydb.pc.id_product | 1 | |
+----+-------------+-------+--------+--------------------+------------+---------+------------------------+-------+----------------------------------------------+
so... Using temporary; Using filesort - I think this is why everything runs so slowly.
Since this query is executed many times by closed-source software, I can't change it. But I want to optimize the tables / indexes so that this query runs faster. Any help appreciated.
You have a constant condition on id_category and a join condition on id_product.
I believe that if you create an index on (id_category, id_product), in that order, MySQL will be able to find the relevant index entries for category 1234 and use them to locate the relevant product ids to fetch from both tables.
Unfortunately I can't test this at the moment - I may try later.
If you can give it a try you will find out very quickly...
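If you want to try it, the statement would look like this (the index name is arbitrary):
ALTER TABLE product_category
    ADD INDEX idx_category_product (id_category, id_product);
With this index, MySQL can dive straight to the id_category = 1234 entries and read the matching product ids from the index itself, instead of scanning the whole id_product index and filtering.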

MySQL InnoDB indexes slowing down sorts

I am using MySQL 5.6 on FreeBSD and have recently switched from MyISAM tables to InnoDB to gain the advantages of foreign key constraints and transactions.
After the switch, I discovered that a query on a table with 100,000 rows that previously took .003 seconds now takes 3.6 seconds. The query looks like this:
SELECT *
FROM USERS u
JOIN MIGHT_FLOCK mf ON (u.USER_ID = mf.USER_ID)
WHERE u.STATUS = 'ACTIVE' AND u.ACCESS_ID >= 8 ORDER BY mf.STREAK DESC LIMIT 0,100
I noticed that if I removed the ORDER BY clause, the execution time dropped back down to .003 seconds, so the problem is obviously in the sorting.
I then discovered that if I added back the ORDER BY but removed the indexes on the columns referred to in the query (STATUS and ACCESS_ID), the query execution time returned to the normal .003 seconds.
Then I discovered that if I added back the indexes on the STATUS and ACCESS_ID columns, but used IGNORE INDEX (STATUS,ACCESS_ID), the query would still execute in the normal .003 seconds.
Is there something about InnoDB and sorting results when referencing an indexed column in a WHERE clause that I don't understand?
Or am I doing something wrong?
EXPLAIN for the slow query returns the following results:
+----+-------------+-------+--------+--------------------------+---------+---------+---------------------+-------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+--------------------------+---------+---------+---------------------+-------+---------------------------------------------------------------------+
| 1 | SIMPLE | u | ref | PRIMARY,STATUS,ACCESS_ID | STATUS | 2 | const | 53902 | Using index condition; Using where; Using temporary; Using filesort |
| 1 | SIMPLE | mf | eq_ref | PRIMARY | PRIMARY | 4 | PRO_MIGHT.u.USER_ID | 1 | NULL |
+----+-------------+-------+--------+--------------------------+---------+---------+---------------------+-------+---------------------------------------------------------------------+
EXPLAIN for the fast query returns the following results:
+----+-------------+-------+--------+---------------+---------+---------+----------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+----------------------+------+-------------+
| 1 | SIMPLE | mf | index | PRIMARY | STREAK | 2 | NULL | 100 | NULL |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | PRO_MIGHT.mf.USER_ID | 1 | Using where |
+----+-------------+-------+--------+---------------+---------+---------+----------------------+------+-------------+
Any help would be greatly appreciated.
In the slow case, MySQL is making an assumption that the index on STATUS will greatly limit the number of users it has to sort through. MySQL is wrong. Presumably most of your users are ACTIVE. MySQL is picking up 50k user rows, checking their ACCESS_ID, joining to MIGHT_FLOCK, sorting the results and taking the first 100 (out of 50k).
In the fast case, you have told MySQL it can't use either index on USERS. MySQL is using its next-best index, it is taking the first 100 rows from MIGHT_FLOCK using the STREAK index (which is already sorted), then joining to USERS and picking up the user rows, then checking that your users are ACTIVE and have an ACCESS_ID at or above 8. This is much faster because only 100 rows are read from disk (x2 for the two tables).
I would recommend:
drop the index on STATUS unless you frequently need to retrieve INACTIVE users (not ACTIVE users). This index is not helping you.
Read this question to understand why your sorts are so slow. You can probably tune InnoDB for better sort performance to prevent these kind of problems.
If you have very few users with ACCESS_ID at or above 8 you should see a dramatic improvement already. If not you might have to use STRAIGHT_JOIN in your select clause.
Example below:
SELECT *
FROM MIGHT_FLOCK mf
STRAIGHT_JOIN USERS u ON (u.USER_ID = mf.USER_ID)
WHERE u.STATUS = 'ACTIVE' AND u.ACCESS_ID >= 8 ORDER BY mf.STREAK DESC LIMIT 0,100
STRAIGHT_JOIN forces MySQL to access the MIGHT_FLOCK table before the USERS table based on the order in which you specify those two tables in the query.
To answer the question "Why did the behaviour change" you should start by understanding the statistics that MySQL keeps on each index: http://dev.mysql.com/doc/refman/5.6/en/myisam-index-statistics.html. If statistics are not up to date or if InnoDB is not providing sufficient information to MySQL, the query optimiser can (and does) make stupid decisions about how to join tables.
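If stale statistics are the suspect, refreshing them is cheap and worth trying first (a generic MySQL maintenance statement, not specific to this schema):
-- Recompute index statistics so the optimizer's row estimates improve:
ANALYZE TABLE USERS, MIGHT_FLOCK;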

Fullscan on other table LEFT JOIN

I have 2 database tables, Companies (InnoDB engine) and Company_financial_figures (MyISAM engine). The Companies table has about 300,000 records; Company_financial_figures has 600,000 records. Company_financial_figures also has a flag that is used in the LEFT JOIN.
The idea of the query is to select the current balance for every company (there are cases where a company has no balance data, but it must still be selected, so I have to use a LEFT JOIN). It seems to me that it should select about 300k records from the Company_financial_figures table, not do a full table scan of 600k records. The performance of this query is very slow.
Query is something like this:
SELECT DISTINCT comp.id, comp.name, comp.surname, cff.balance
FROM companies comp
LEFT JOIN company_financial_figures cff
    ON (cff.company_id = comp.id AND cff.actual = 1)
+----+-------------+-------------------+-------+---------------+----------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------+-------+---------------+----------+---------+------+--------+-------------+
| 1 | SIMPLE | companies | index | NULL | comp_i_i | 2 | NULL | 346908 | Using index |
| 1 | SIMPLE | company_finan.. | ALL | NULL | NULL | NULL | NULL | 610364 | |
+----+-------------+-------------------+-------+---------------+----------+---------+------+--------+-------------+
I have an index on the company_id column, but it doesn't help.
Any suggestions?
Why do you need the DISTINCT keyword? DISTINCT always slows down your query, because every record has to be considered explicitly in the DISTINCT comparison. I don't know how MySQL handles this in detail, but in Oracle you can get horrible execution plans if one of your DISTINCT fields is nullable.
But you shouldn't need it: your records are probably distinct anyway, because you select comp.id (the primary key, I'm guessing?). Wouldn't it be a lot faster if you removed DISTINCT?
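A minimal sketch of the same query without DISTINCT (this assumes each company has at most one company_financial_figures row with actual = 1, so dropping DISTINCT cannot introduce duplicates):
SELECT comp.id, comp.name, comp.surname, cff.balance
FROM companies comp
LEFT JOIN company_financial_figures cff
    ON (cff.company_id = comp.id AND cff.actual = 1);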