MySql JOIN performance go down with group by - mysql

I have these tables:
table "f" (26000 record)
+------------------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------+------------------+------+-----+---------+-------+
| idFascicolo | int(11) | NO | PRI | | |
| oggetto | varchar | NO |index| | |
+------------------+------------------+------+-----+---------+-------+
table "r" (22000 record)
+------------------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------+------------------+------+-----+---------+-------+
| idRichiedente | int(11) | NO | PRI | | |
| name | varchar | NO |index| | |
+------------------+------------------+------+-----+---------+-------+
table "fr" (32000 record)
+------------------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------+------------------+------+-----+---------+-------+
| id | int(11) | NO | PRI | | |
| idFascicolo | int(11) | NO |index| | FK |
| idRichiedente | int(11) | NO |index| | FK |
+------------------+------------------+------+-----+---------+-------+
this is my select:
SELECT
f.idFascicolo,
f.oggetto,
r.richiedente
FROM fr
JOIN f ON (f.idFascicolo=fr.idFascicolo)
JOIN r ON (r.idRichiedente=fr.idRichiedente)
WHERE r.name LIKE '%string%'
in the result, I would like to see only 1 row per f.idFascicolo (I should have "Rossi Mario" and "Rossi Marco" for the same f.idFascicolo) , the my new select is:
SELECT
f.idFascicolo,
f.oggetto,
r.richiedente
FROM fr
JOIN f ON (f.idFascicolo=fr.idFascicolo)
JOIN r ON (r.idRichiedente=fr.idRichiedente)
WHERE r.name LIKE '%string%'
GROUP BY f.idFascicolo
here, the performance read from PhpMyAdmin:
0.0057 seconds: .. WHERE r.name LIKE '%string%'
0.0527 seconds: .. WHERE r.name LIKE '%string%' GROUP BY f.idFascicolo
0.0036 seconds: .. WHERE r.name LIKE 'string%' GROUP BY f.idFascicolo
I don't understand if the problem of the slow query is GROUP BY or LIKE '%string%'(i need '%string%' .. I can't find an equivalent solution with fulltext index and MATCH .. AGAINST)
This is the explain:
+------+-------------+-------+------+-------------------------+---------------+---------+----------------------+-----------+---------------------------------------------+
| id | select type | table | type | possible keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+-------------------------+---------------+---------+----------------------+-----------+---------------------------------------------+
| 1 | simple | r | ALL | PRIMARY | NULL | NULL | NULL | 20925 |Using where; Using temporary; Using filesort |
+------+-------------+-------+------+-------------------------+---------------+---------+----------------------+-----------+---------------------------------------------+
| 1 | simple | fr | ref |idFascicolo,idRichiedente| idRichiedente | 4 | db.r.idRichiedente | 1 | |
+------+-------------+-------+------+-------------------------+---------------+---------+----------------------+-----------+---------------------------------------------+
| 1 | simple | f |eq_ref|PRIMARY | PRIMARY | 4 | db.fr.idFascicolo | 1 | |
+------+-------------+-------+------+-------------------------+---------------+---------+----------------------+-----------+---------------------------------------------+

You have two potential performance issues. First is the GROUP BY. This requires sorting the data, so it has to read all the data and do a lot of work.
The second is the LIKE. There is a fundamental difference between:
WHERE r.name LIKE '%string%'
and
WHERE r.name LIKE 'string%'
The second can use an index on r(name), because the like pattern does not start with a pattern.
I am not sure what your actual question is. I don't recommend doing using GROUP BY the way you are using it -- because you have unaggregated columns in the SELECT.

Related

Slow query and use of indexes in MySQL

I have the following query:
SELECT final_query.chr
, final_query.start
, final_query.end
, co.chr
, co.start
, co.end
, final_query.count
FROM (SELECT ed.chr
, ed.start
, ed.end
, case when e.bin1=ed.bin then e.bin2 else e.bin1 end AS target
, count
FROM (SELECT * FROM coordinates
WHERE chr="chr1" AND (start between 3960000 AND 4000000 OR end between 3960000 AND 4000000)
) ed
JOIN counts e ON (e.bin1 = ed.bin OR e.bin2=ed.bin)
SORT BY count LIMIT 1,20)
AS final_query
JOIN coordinates co ON final_query.target=co.bin;
and the output of EXPLAINED is:
+------+-------------+-------------+--------+---------------+---------+---------+-------+----------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------------+--------+---------------+---------+---------+-------+----------+------------------------------------+
| 1 | SIMPLE | e | ALL | bin1,bin2 | NULL | NULL | NULL | 30763816 | Using filesort |
| 1 | SIMPLE | coordinates | ref | PRIMARY,chr | chr | 22 | const | 4929 | Using index condition; Using where |
| 1 | SIMPLE | co | eq_ref | PRIMARY | PRIMARY | 22 | func | 1 | Using where |
+------+-------------+-------------+--------+---------------+---------+---------+-------+----------+------------------------------------+
What I am doing is to perform the following query of table coordinates, which has field chr indexed. So, in the subquery shown below, I filter those rows that match my conditions.
... (SELECT * FROM coordinates
WHERE chr="chr1" AND (start between 3960000 AND 4000000 OR end between 3960000 AND 4000000)
) ...
This table outputs field bin, also indexed. This field bin links with bin1 and bin2 both from table counts and indexed as well. So, here, what I want is to get all those rows in table counts having coordinates.bin in fields bin1 and bin2. Why in this step no index is used?
Besides of it, I would like to add an ORDER BY in my query, just before the LIMIT statement. But it slows too much my query. I don't know why, because it have to sort a maximum of 4000 rows...
How can I optimize my query?
My tables, from the DESCRIBE statement:
Table counts
+-------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| bin1 | varchar(20) | NO | MUL | NULL | |
| bin2 | varchar(20) | NO | MUL | NULL | |
| count | float(6,2) | NO | | NULL | |
+-------+-------------+------+-----+---------+----------------+
Table coordinates
+-------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| bin | varchar(20) | NO | PRI | NULL | |
| chr | varchar(20) | NO | MUL | NULL | |
| start | int(11) | NO | | NULL | |
| end | int(11) | NO | | NULL | |
+-------+-------------+------+-----+---------+-------+

How can I optimize this mysql query to find maximum simultaneous calls?

I'm trying to calculate maximum simultaneous calls. My query, which I believe to be accurate, takes way too long given ~250,000 rows. The cdrs table looks like this:
+---------------+-----------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+-----------------------+------+-----+---------+----------------+
| id | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| CallType | varchar(32) | NO | | NULL | |
| StartTime | datetime | NO | MUL | NULL | |
| StopTime | datetime | NO | | NULL | |
| CallDuration | float(10,5) | NO | | NULL | |
| BillDuration | mediumint(8) unsigned | NO | | NULL | |
| CallMinimum | tinyint(3) unsigned | NO | | NULL | |
| CallIncrement | tinyint(3) unsigned | NO | | NULL | |
| BasePrice | float(12,9) | NO | | NULL | |
| CallPrice | float(12,9) | NO | | NULL | |
| TransactionId | varchar(20) | NO | | NULL | |
| CustomerIP | varchar(15) | NO | | NULL | |
| ANI | varchar(20) | NO | | NULL | |
| ANIState | varchar(10) | NO | | NULL | |
| DNIS | varchar(20) | NO | | NULL | |
| LRN | varchar(20) | NO | | NULL | |
| DNISState | varchar(10) | NO | | NULL | |
| DNISLATA | varchar(10) | NO | | NULL | |
| DNISOCN | varchar(10) | NO | | NULL | |
| OrigTier | varchar(10) | NO | | NULL | |
| TermRateDeck | varchar(20) | NO | | NULL | |
+---------------+-----------------------+------+-----+---------+----------------+
I have the following indexes:
+-------+------------+-----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------+------------+-----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| cdrs | 0 | PRIMARY | 1 | id | A | 269622 | NULL | NULL | | BTREE | | |
| cdrs | 1 | id | 1 | id | A | 269622 | NULL | NULL | | BTREE | | |
| cdrs | 1 | call_time_index | 1 | StartTime | A | 269622 | NULL | NULL | | BTREE | | |
| cdrs | 1 | call_time_index | 2 | StopTime | A | 269622 | NULL | NULL | | BTREE | | |
+-------+------------+-----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
The query I am running is this:
SELECT MAX(cnt) AS max_channels FROM
(SELECT cl1.StartTime, COUNT(*) AS cnt
FROM cdrs cl1
INNER JOIN cdrs cl2
ON cl1.StartTime
BETWEEN cl2.StartTime AND cl2.StopTime
GROUP BY cl1.id)
AS counts;
It seems like I might have to chunk this data for each day and store the results in a separate table like simultaneous_calls.
I'm sure you want to know not only the maximum simultaneous calls, but when that happened.
I would create a table containing the timestamp of every individual minute
CREATE TABLE times (ts DATETIME UNSIGNED AUTO_INCREMENT PRIMARY KEY);
INSERT INTO times (ts) VALUES ('2014-05-14 00:00:00');
. . . until 1440 rows, one for each minute . . .
Then join that to the calls.
SELECT ts, COUNT(*) AS count FROM times
JOIN cdrs ON times.ts BETWEEN cdrs.starttime AND cdrs.stoptime
GROUP BY ts ORDER BY count DESC LIMIT 1;
Here's the result in my test (MySQL 5.6.17 on a Linux VM running on a Macbook Pro):
+---------------------+----------+
| ts | count(*) |
+---------------------+----------+
| 2014-05-14 10:59:00 | 1001 |
+---------------------+----------+
1 row in set (1 min 3.90 sec)
This achieves several goals:
Reduces the number of rows examined by two orders of magnitude.
Reduces the execution time from 3 hours+ to about 1 minute.
Also returns the actual timestamp when the highest count was found.
Here's the EXPLAIN for my query:
explain select ts, count(*) from times join cdrs on times.ts between cdrs.starttime and cdrs.stoptime group by ts order by count(*) desc limit 1;
+----+-------------+-------+-------+---------------+---------+---------+------+--------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+--------+------------------------------------------------+
| 1 | SIMPLE | times | index | PRIMARY | PRIMARY | 5 | NULL | 1440 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | cdrs | ALL | starttime | NULL | NULL | NULL | 260727 | Range checked for each record (index map: 0x4) |
+----+-------------+-------+-------+---------------+---------+---------+------+--------+------------------------------------------------+
Notice the figures in the rows column, and compare to the EXPLAIN of your original query. You can estimate the total number of rows examined by multiplying these together (but that gets more complicated if your query is anything other than SIMPLE).
The inline view isn't strictly necessary. (You're right about a lot of time to run the EXPLAIN on the query with the inline view, the EXPLAIN will materialize the inline view (i.e. run the inline view query and populate the derived table), and then give an EXPLAIN on the outer query.
Note that this query will return an equivalent result:
SELECT COUNT(*) AS max_channels
FROM cdrs cl1
JOIN cdrs cl2
ON cl1.StartTime BETWEEN cl2.StartTime AND cl2.StopTime
GROUP BY cl1.id
ORDER BY max_channels DESC
LIMIT 1
Though it still has to do all the work, and probably doesn't perform any better; the EXPLAIN should run a lot faster. (We expect to see "Using temporary; Using filesort" in the Extra column.)
The number of rows in the resultset is going to be the number of rows in the table (~250,000 rows), and those are going to need to be sorted, so that's going to be some time there. The bigger issue (my gut is telling me) is that join operation.
I'm wondering if the EXPLAIN (or performance) would be any different if you swapped the cl1 and cl2 in the predicate, i.e.
ON cl2.StartTime BETWEEN cl1.StartTime AND cl1.StopTime
I'm thinking that, just because I'd be tempted to try a correlated subquery. That's ~250,000 executions, and that's not likely going to be any faster...
SELECT ( SELECT COUNT(*)
FROM cdrs cl2
WHERE cl2.StartTime BETWEEN cl1.StartTime AND cl1.StopTime
) AS max_channels
, cl1.StartTime
FROM cdrs cl1
ORDER BY max_channels DESC
LIMIT 11
You could run an EXPLAIN on that, we're still going to see a "Using temporary; Using filesort", and it will also show the "dependent subquery"...
Obviously, adding a predicate on the cl1 table to cut down the number of rows to be returned (for example, checking only the past 15 days); that should speed things up, but it doesn't get you the answer you want.
WHERE cl1.StartTime > NOW() - INTERVAL 15 DAY
(None of my musings here are sure-fire answers to your question, or solutions to the performance issue; they're just musings.)

search query very slow how to optimize?

Ive got a search query that uses some joins to search in different correlating tables. But recently I've added about 3000 contacts to the contactpersonen table. And it got really slow.
Tables are these:
debiteuren : 1445 entries
contactpersonen: 3711 entries
debiteuren_toegang: 3008 entries
SELECT
contactpersonen.id,
contactpersonen.voornaam,
contactpersonen.achternaam,
debiteuren.bedrijfsnaam,
debiteuren.id as debid
FROM
debiteuren
LEFT JOIN
contactpersonen ON contactpersonen.bedrijf = debiteuren.id
LEFT JOIN
debiteuren_toegang ON debiteuren_toegang.bedrijf = debiteuren.id
WHERE
(contactpersonen.voornaam LIKE '%henk%'
OR contactpersonen.achternaam LIKE '%henk%'
OR debiteuren.id LIKE '%henk%'
OR debiteuren.bedrijfsnaam LIKE '%henk%'
OR contactpersonen.id LIKE '%henk%')
AND debiteuren_toegang.website = 'web1'
LIMIT 10
When I remove the part that searches trough contactpersonen.voornaam LIKE '%henk%' OR contactpersonen.achternaam LIKE '%henk%' The query is really fast again.
Ive added an index in phpmyadmin on voornaam and achternaam, but that didnt help anything.
Any ideas on how to make this quicker? I don't think this is a lot of rows right? Queries last for even 5 seconds at times.
Thanks!
FULL QUERY EXPLAIN:
+----+-------------+--------------------+--------+---------------+---------+---------+---------------------------------------+------+-------------+--+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | |
+----+-------------+--------------------+--------+---------------+---------+---------+---------------------------------------+------+-------------+--+
| 1 | SIMPLE | debiteuren_toegang | ALL | NULL | NULL | NULL | NULL | 3008 | Using where | |
| 1 | SIMPLE | debiteuren | eq_ref | PRIMARY | PRIMARY | 4 | deb12311_1.debiteuren_toegang.bedrijf | 1 | | |
| 1 | SIMPLE | contactpersonen | ALL | NULL | NULL | NULL | NULL | 4169 | Using where | |
+----+-------------+--------------------+--------+---------------+---------+---------+---------------------------------------+------+-------------+--+
PARTIAL QUERY EXPLAIN:
+----+-------------+--------------------+--------+---------------+---------+---------+---------------------------------------+------+-------------+--+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | |
+----+-------------+--------------------+--------+---------------+---------+---------+---------------------------------------+------+-------------+--+
| 1 | SIMPLE | debiteuren_toegang | ALL | NULL | NULL | NULL | NULL | 3008 | Using where | |
| 1 | SIMPLE | debiteuren | eq_ref | PRIMARY | PRIMARY | 4 | deb12311_1.debiteuren_toegang.bedrijf | 1 | Using where | |
| 1 | SIMPLE | contactpersonen | ALL | NULL | NULL | NULL | NULL | 4098 | | |
+----+-------------+--------------------+--------+---------------+---------+---------+---------------------------------------+------+-------------+--+
Try this,
` SELECT
contactpersonen.id,
contactpersonen.voornaam,
contactpersonen.achternaam,
debiteuren.bedrijfsnaam,
debiteuren.id as debid
FROM
debiteuren
LEFT JOIN
contactpersonen ON contactpersonen.bedrijf = debiteuren.id
LEFT JOIN
debiteuren_toegang ON debiteuren_toegang.bedrijf = debiteuren.id
WHERE
debiteuren_toegang.website = 'web1'
AND
instr ( concat(contactpersonen.voornaam, contactpersonen.achternaam, debiteuren.id, debiteuren.bedrijfsnaam,contactpersonen.id) , 'henk'
)>0
LIMIT 10`

NATURAL JOIN vs WHERE IN Clauses

Recently, I dealt with retrieving a large amount of data which consists of thousands of records from a MySQL database. Since it was my first time to handle such large data set, I didn't think about the efficiency of the SQL statement. And the problem comes.
Here are the tables of the database
(It is just a simple database model of a curriculum system):
course:
+-----------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+---------------------+------+-----+---------+----------------+
| course_id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| name | varchar(20) | NO | | NULL | |
| lecturer | varchar(20) | NO | | NULL | |
| credit | float | NO | | NULL | |
| week_from | tinyint(3) unsigned | NO | | NULL | |
| week_to | tinyint(3) unsigned | NO | | NULL | |
+-----------+---------------------+------+-----+---------+----------------+
select:
+-----------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+------------------+------+-----+---------+----------------+
| select_id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| card_no | int(10) unsigned | NO | | NULL | |
| course_id | int(10) unsigned | NO | | NULL | |
| term | varchar(7) | NO | | NULL | |
+-----------+------------------+------+-----+---------+----------------+
When I want to retrieve all the courses that a student has selected (with his card number),
the SQL statement is
SELECT course_id, name, lecturer, credit, week_from, week_to
FROM `course` WHERE course_id IN (
SELECT course_id FROM `select` WHERE card_no=<student's card number>
);
But, it was extremely slow and it didn't return anything for a long time.
So I changed WHERE IN clauses into NATURAL JOIN. Here is the SQL,
SELECT course_id, name, lecturer, credit, week_from, week_to
FROM `select` NATURAL JOIN `course`
WHERE card_no=<student's card number>;
It returns immediately and works fine!
So my question is:
What's the difference between NATURAL JOIN and WHERE IN Clauses?
What makes them perform differently?
(Is that maybe because I doesn't set up any INDEX?)
When shall we use NATURAL JOIN or WHERE IN?
Theoretically the two queries are equivalent. I think it's just poor implementation of the MySQL query optimizer that causes JOIN to be more efficient than WHERE IN. So I always use JOIN.
Have you looked at the output of EXPLAIN for the two queries? Here's what I got for a WHERE IN:
+----+--------------------+-------------------+----------------+-------------------+---------+---------+------------+---------+--------------------------+
| 1 | PRIMARY | t_users | ALL | NULL | NULL | NULL | NULL | 2458304 | Using where |
| 2 | DEPENDENT SUBQUERY | t_user_attributes | index_subquery | PRIMARY,attribute | PRIMARY | 13 | func,const | 7 | Using index; Using where |
+----+--------------------+-------------------+----------------+-------------------+---------+---------+------------+---------+--------------------------+
It's apparently performing the subquery, then going through every row in the main table testing whether it's in -- it doesn't use the index. For the JOIN I get:
+----+-------------+-------------------+--------+---------------------+-----------+---------+---------------------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------+--------+---------------------+-----------+---------+---------------------------------------+------+-------------+
| 1 | SIMPLE | t_user_attributes | ref | PRIMARY,attribute | attribute | 1 | const | 15 | Using where |
| 1 | SIMPLE | t_users | eq_ref | username,username_2 | username | 12 | bbodb_test.t_user_attributes.username | 1 | |
+----+-------------+-------------------+--------+---------------------+-----------+---------+---------------------------------------+------+-------------+
Now it uses the index.
Try this:
SELECT course_id, name, lecturer, credit, week_from, week_to
FROM `course` c
WHERE c.course_id IN (
SELECT s.course_id
FROM `select` s
WHERE card_no=<student's card number>
AND c.course_id = s.course_id
);
Notice the addition of the AND clause in the sub-query. This is called a co-related sub-query because it relates the two course_ids, just as the NATURAL JOIN does.
I think Barmar's index explanation is on the mark.

MySQL simple join using temporary table

I've got (what I thought) was a very simple query on a MySQL database, but using explain shows that the query is using a temporary table. I've tried changing the order of the select and the join but to no avail. I've reduced the tables to their simplest (to see if it was an issue with the complexity of my tables, but I still have the problem). I've tried it with two basic tables, one with a "name" field, and the other with a foreign key reference back to that table:
a:
+-------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| name | varchar(128) | NO | MUL | NULL | |
+-------+--------------+------+-----+---------+----------------+
b:
+-------+---------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| a_id | int(11) | NO | MUL | NULL | |
+-------+---------+------+-----+---------+----------------+
And this is my query:
SELECT a.id, a.name FROM a JOIN b ON a.id = b.a_id ORDER BY a.name;
Which I thought was very simple... just a list of all records in a that have records in b, ordered by name. Alas explain says this:
+----+-------------+-------+--------+---------------+---------+---------+--------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+--------+------+----------------------------------------------+
| 1 | SIMPLE | b | index | a_id | a_id | 4 | NULL | 2 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | a | eq_ref | PRIMARY | PRIMARY | 4 | b.a_id | 1 | |
+----+-------------+-------+--------+---------------+---------+---------+--------+------+----------------------------------------------+
It looks like it should be using the key on the b table but for some reason it isn't. I get the feeling I'm missing something basic (or my RDBMS knowledge needs brushing up somewhat). Anyone any ideas why it's using a temporary table for such a simple query?