optimization of mysql query - mysql

I have a table with two columns (varchar from, varchar to). This table represents connections between nodes (from node, to node). I want to get all nodes connected from or to a node that I specify, plus the nodes connected from or to those nodes. The query below gives me the correct results, but I'm looking for a neater solution.
//currently used query; the specified node is "node1"
SELECT tonode as node
FROM conn
WHERE
fromnode
IN
(SELECT tonode as node FROM conn WHERE fromnode="node1"
UNION
SELECT fromnode as node FROM conn WHERE tonode="node1")
UNION
SELECT fromnode as node
FROM conn
WHERE
tonode
IN
(SELECT tonode as node FROM conn WHERE fromnode="node1"
UNION
SELECT fromnode as node FROM conn WHERE tonode="node1")
//create table for conn table
CREATE TABLE `conn` (
`fromnode` varchar(70) NOT NULL,
`tonode` varchar(70) NOT NULL,
PRIMARY KEY (`fromnode`,`tonode`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC;
INSERT INTO `conn` (`fromnode`,`tonode`) VALUES
('node1','node2'),
('node1','node3'),
('node3','node2'),
('node4','node1'),
('node4','node2'),
('node4','node5'),
('node5','node6'),
('node4','node3');
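
To have something to check any rewrite against, the expected result for the sample rows can be computed outside the database. The following is a minimal sketch in Python rather than SQL. It treats each row as an undirected edge and mirrors what the UNION query above returns: the far end of every edge that touches a direct neighbour of the start node. The function name `second_hop_nodes` is hypothetical.

```python
from collections import defaultdict

# Sample rows from the INSERT statement above, as (fromnode, tonode) pairs.
EDGES = [
    ("node1", "node2"), ("node1", "node3"), ("node3", "node2"),
    ("node4", "node1"), ("node4", "node2"), ("node4", "node5"),
    ("node5", "node6"), ("node4", "node3"),
]


def second_hop_nodes(edges, origin):
    """Return every node that shares an edge with a direct neighbour of origin."""
    adjacent = defaultdict(set)
    for src, dst in edges:
        # Direction is ignored: each edge is recorded from both ends.
        adjacent[src].add(dst)
        adjacent[dst].add(src)

    result = set()
    for neighbour in adjacent.get(origin, ()):
        result |= adjacent[neighbour]
    return result


if __name__ == "__main__":
    print(sorted(second_hop_nodes(EDGES, "node1")))
```

Note that the start node itself is part of the result whenever it has at least one neighbour, because it is adjacent to its own neighbours.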

My optimized version:
SET @origin = "node1";
SELECT DISTINCT
IF(c1.fromnode = @origin,
IF(c1.tonode = c2.tonode,
IF(c2.fromnode = @origin, c2.tonode, c2.fromnode),
IF(c2.tonode = @origin, c2.fromnode, c2.tonode)
),
IF(c1.fromnode = c2.tonode,
IF(c2.fromnode = @origin, c2.tonode, c2.fromnode),
IF(c2.tonode = @origin, c2.fromnode, c2.tonode)
)
) AS node
FROM conn AS c1
LEFT JOIN conn AS c2 ON (c1.fromnode = c2.fromnode OR c1.tonode = c2.fromnode OR c1.fromnode = c2.tonode OR c1.tonode = c2.tonode)
WHERE c1.fromnode = @origin OR c1.tonode = @origin;
the DESCRIBE output of your old query:
+----+--------------------+------------+--------+---------------+---------+---------+------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+--------+---------------+---------+---------+------------+------+--------------------------+
| 1 | PRIMARY | conn | index | NULL | PRIMARY | 424 | NULL | 8 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | conn | eq_ref | PRIMARY | PRIMARY | 424 | const,func | 1 | Using where; Using index |
| 3 | DEPENDENT UNION | conn | eq_ref | PRIMARY | PRIMARY | 424 | func,const | 1 | Using where; Using index |
| NULL | UNION RESULT | <union2,3> | ALL | NULL | NULL | NULL | NULL | NULL | |
| 4 | UNION | conn | index | NULL | PRIMARY | 424 | NULL | 8 | Using where; Using index |
| 5 | DEPENDENT SUBQUERY | conn | eq_ref | PRIMARY | PRIMARY | 424 | const,func | 1 | Using where; Using index |
| 6 | DEPENDENT UNION | conn | eq_ref | PRIMARY | PRIMARY | 424 | func,const | 1 | Using where; Using index |
| NULL | UNION RESULT | <union5,6> | ALL | NULL | NULL | NULL | NULL | NULL | |
| NULL | UNION RESULT | <union1,4> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------------+------------+--------+---------------+---------+---------+------------+------+--------------------------+
the DESCRIBE output of my query:
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------------------------------------+
| 1 | SIMPLE | c1 | index | PRIMARY | PRIMARY | 424 | NULL | 8 | Using where; Using index; Using temporary |
| 1 | SIMPLE | c2 | index | PRIMARY | PRIMARY | 424 | NULL | 8 | Using index |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------------------------------------+

If I understood you correctly (about going only 2 levels deep), you can do something like this. Note that CONNECT BY is Oracle syntax; MySQL does not support it:
SELECT level,fromnode , tonode
FROM conn1
WHERE level < 3
CONNECT BY PRIOR tonode = fromnode
START WITH fromnode like '%';
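
A depth-limited walk like the one above can also be written as a recursive common table expression, which MySQL supports from version 8.0. The sketch below is an assumption-laden illustration, not the answerer's method: it runs the statement through Python's sqlite3 module (SQLite standing in for MySQL), follows edges in both directions, and stops after two levels. The names `walk`, `depth`, and `nodes_within_two_levels` are hypothetical. In MySQL the seed column may additionally need a CAST to the full VARCHAR width, since the column type is taken from the non-recursive part.

```python
import sqlite3

# Sample rows from the question, as (fromnode, tonode) pairs.
ROWS = [
    ("node1", "node2"), ("node1", "node3"), ("node3", "node2"),
    ("node4", "node1"), ("node4", "node2"), ("node4", "node5"),
    ("node5", "node6"), ("node4", "node3"),
]

# UNION (not UNION ALL) drops duplicate (node, depth) pairs, and the
# depth test guarantees the recursion ends after two levels.
TWO_LEVEL_SQL = """
WITH RECURSIVE walk(node, depth) AS (
    SELECT ?, 0
    UNION
    SELECT CASE WHEN c.fromnode = w.node THEN c.tonode ELSE c.fromnode END,
           w.depth + 1
    FROM walk AS w
    JOIN conn AS c ON w.node IN (c.fromnode, c.tonode)
    WHERE w.depth < 2
)
SELECT DISTINCT node FROM walk WHERE depth > 0 ORDER BY node
"""


def nodes_within_two_levels(origin):
    """Return the nodes found at level 1 or level 2 from origin, sorted."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE conn (fromnode TEXT NOT NULL, tonode TEXT NOT NULL, "
        "PRIMARY KEY (fromnode, tonode))"
    )
    db.executemany("INSERT INTO conn VALUES (?, ?)", ROWS)
    found = [row[0] for row in db.execute(TWO_LEVEL_SQL, (origin,))]
    db.close()
    return found


if __name__ == "__main__":
    print(nodes_within_two_levels("node1"))
```

Raising the bound in `w.depth < 2` extends the walk to more levels without adding further joins.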

With these "from" and "to" relationships being bidirectional (you are needing to traverse both directions), there's just no easy statement to do that in MySQL. To get all of the node values in a single result set returned in a single column, the closest I can come to avoiding a UNION operation is:
SELECT CASE
WHEN t.i = 1 THEN t.dnode
WHEN t.i = 2 AND t.dnode = c.fromnode THEN c.tonode
WHEN t.i = 2 AND t.dnode = c.tonode THEN c.fromnode
ELSE NULL
END AS node
FROM ( SELECT d.i
, m.root
, CASE WHEN m.root = n.fromnode THEN n.tonode ELSE n.fromnode END AS dnode
FROM (SELECT 'node1' AS root) m
CROSS
JOIN (SELECT 1 AS i UNION ALL SELECT 2) d
LEFT
JOIN conn n ON m.root IN (n.fromnode,n.tonode)
) t
LEFT
JOIN conn c
ON t.i = 2 AND t.dnode IN (c.fromnode,c.tonode)
GROUP BY node
ORDER BY node
I don't know if I'm even going to be able to unpack that, but I'll try.
To avoid having to specify the root node 'node1' multiple times, I use a subquery to return it.
(SELECT 'node1' AS root) m
Because we are going "two levels deep", I need two sets of nodes, so I create a Cartesian product to double the number of rows I've got, and I'm going to label them 1 for the first level, and 2 for the second level.
CROSS
JOIN (SELECT 1 AS i UNION ALL SELECT 2) d
With that, I'm now ready to join to the conn table, and I want any rows that have either a fromnode or tonode value that matches the root node.
LEFT
JOIN conn n ON m.root IN (n.fromnode,n.tonode)
With that resultset, I want to "flip" the fromnode and tonode on some of those rows so that we basically always have the "root" node on one side. I do this with a CASE expression that tests which side matches the root:
CASE WHEN m.root = n.fromnode THEN n.tonode ELSE n.fromnode END AS dnode
So now I wrap that resultset as an inline view aliased t. That subquery can be run separately, to see that we're returning what we expect:
SELECT d.i
, m.root
, CASE WHEN m.root = n.fromnode THEN n.tonode ELSE n.fromnode END AS dnode
FROM (SELECT 'node1' AS root) m
CROSS
JOIN (SELECT 1 AS i UNION ALL SELECT 2) d
LEFT
JOIN conn n ON m.root IN (n.fromnode,n.tonode)
We do need to return that "level" value (the d.i we generated earlier), because we need it in the next step, when we join to the conn table again to traverse the next level; I only need to join those rows where we are going to look at the second level.
LEFT
JOIN conn c
ON t.i = 2 AND t.dnode IN (c.fromnode,c.tonode)
And again, I don't care which side node is on, at this point, I just need to do the match.
At this point, you could run the whole query, and pull t.*, c.* to see what we've got, but I'm going to skip that part and go right to the "magic".
At this point, we can use a CASE expression to "pick out" the node value we want from that mess.
If the level value (unfortunately labeled i) is a 1, then we're looking at the first level, so we just need to get the "nodeX" value that was on the "other" side of the "root". That's available from the t source as expression aliased as dnode.
Otherwise, we're going to look at the rows for the "second" level, i = 2. (In this particular case, the test on i = 2 could be omitted, but it's included here for completeness, and just in case we are going to extend this approach to get three (gasp!) or more (oh my!) levels.)
Here, we need to know which side (from or to) matched the first level, and we just pull the other side. If t.dnode matches one side, we pull the other side.
Finally, we use a GROUP BY to collapse the duplicates.
Since we don't care what level these were from, we omit returning t.i, which would give us the level.
SUMMARY
I don't think this is any more straightforward than your query. And I have no idea how performance would compare. But it's nice to have other statements to compare performance against.
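
One way to compare the two statements is to run both against the sample rows and look at the node sets they return. The sketch below does that with Python's sqlite3 module; SQLite is only a stand-in for MySQL here, the string literal 'node1' is replaced by a named parameter, and the helper `run_query` is hypothetical. It says nothing about relative performance.

```python
import sqlite3

ROWS = [
    ("node1", "node2"), ("node1", "node3"), ("node3", "node2"),
    ("node4", "node1"), ("node4", "node2"), ("node4", "node5"),
    ("node5", "node6"), ("node4", "node3"),
]

# The query from the question.
UNION_SQL = """
SELECT tonode AS node FROM conn
WHERE fromnode IN (SELECT tonode FROM conn WHERE fromnode = :root
                   UNION
                   SELECT fromnode FROM conn WHERE tonode = :root)
UNION
SELECT fromnode AS node FROM conn
WHERE tonode IN (SELECT tonode FROM conn WHERE fromnode = :root
                 UNION
                 SELECT fromnode FROM conn WHERE tonode = :root)
"""

# The CASE-based query from this answer.
CASE_SQL = """
SELECT CASE
         WHEN t.i = 1 THEN t.dnode
         WHEN t.i = 2 AND t.dnode = c.fromnode THEN c.tonode
         WHEN t.i = 2 AND t.dnode = c.tonode THEN c.fromnode
         ELSE NULL
       END AS node
FROM ( SELECT d.i, m.root,
              CASE WHEN m.root = n.fromnode THEN n.tonode ELSE n.fromnode END AS dnode
       FROM (SELECT :root AS root) m
       CROSS JOIN (SELECT 1 AS i UNION ALL SELECT 2) d
       LEFT JOIN conn n ON m.root IN (n.fromnode, n.tonode)
     ) t
LEFT JOIN conn c ON t.i = 2 AND t.dnode IN (c.fromnode, c.tonode)
GROUP BY node
ORDER BY node
"""


def run_query(sql, origin):
    """Load the sample rows into a fresh in-memory table and run one query."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE conn (fromnode TEXT NOT NULL, tonode TEXT NOT NULL, "
        "PRIMARY KEY (fromnode, tonode))"
    )
    db.executemany("INSERT INTO conn VALUES (?, ?)", ROWS)
    rows = db.execute(sql, {"root": origin}).fetchall()
    db.close()
    # Skip the NULL row the LEFT JOINs produce for a node with no edges.
    return sorted(row[0] for row in rows if row[0] is not None)


if __name__ == "__main__":
    for origin in ("node1", "node6"):
        print(origin, run_query(UNION_SQL, origin), run_query(CASE_SQL, origin))
```

For 'node1' both statements return node1 through node5. For 'node6' they differ: the CASE-based query always includes the first-level node (node5), while the UNION query returns a first-level node only when it is also connected to another first-level node.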

Related

Which PDO SQL Query is faster in the long run and heavy data?

I have a table with over a million records. When I pull data from it,
I want to check whether the requested data exists. Which approach is more efficient and faster than the other?
$Query = '
SELECT n.id
FROM names n
INNER JOIN ages a ON n.id = a.aid
INNER JOIN regions r ON n.id = r.rid
WHERE id = :id
';
$stmt->prepare($Query);
$stmt->execute(['id' => $id]);
if ($stmt->rowCount() == 1) {
$row = $stmt->fetch();
......................
} else {
exit();
}
or
$EXISTS = 'SELECT EXISTS (
SELECT n.fname, n.lname, a.age, r.region
FROM names n
INNER JOIN ages a ON n.id = a.aid
INNER JOIN regions r ON n.id = r.rid
WHERE id = :id
LIMIT 1
)
';
$stmt->prepare($EXISTS);
$stmt->execute(['id' => $id]);
if ($stmt->fetchColumn() == 1) {
$stmt->prepare($Query);
$stmt->execute(['id' => $id]);
$row = $stmt->fetch();
......................
} else {
exit();
}
keeping in mind that id is PRIMARY (INT) and aid, rid are INDEXED (INT)
The two methods you show are almost certainly equivalent, with almost no measurable difference in performance.
SELECT n.id
FROM names n
INNER JOIN ages a ON n.id = a.aid
INNER JOIN regions r ON n.id = r.rid
WHERE id = :id
I assume names.id is the primary key of that table. A primary key lookup is very fast.
Then it will do a secondary key lookup to the other two tables, and it will be an index-only access because there's no reference to other columns of those tables.
You should learn how to use EXPLAIN to analyze MySQL's optimization plan. This is a skill you should practice any time you want to improve the performance of an SQL query.
See https://dev.mysql.com/doc/refman/5.7/en/using-explain.html
mysql> explain SELECT n.id
-> FROM names n
-> INNER JOIN ages a ON n.id = a.aid
-> INNER JOIN regions r ON n.id = r.rid
-> WHERE id = 1;
+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+
| 1 | SIMPLE | n | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | Using index |
| 1 | SIMPLE | a | NULL | ref | aid | aid | 5 | const | 1 | 100.00 | Using index |
| 1 | SIMPLE | r | NULL | ref | rid | rid | 5 | const | 1 | 100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+
We see that each table access is using an index (I'm assuming indexes though you did not provide your SHOW CREATE TABLE in your question).
Compare to the second solution with SELECT EXISTS(...)
mysql> explain SELECT EXISTS (
-> SELECT n.id
-> FROM names n
-> INNER JOIN ages a ON n.id = a.aid
-> INNER JOIN regions r ON n.id = r.rid
-> WHERE id = 1
-> LIMIT 1);
+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+----------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+----------------+
| 1 | PRIMARY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
| 2 | SUBQUERY | n | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | Using index |
| 2 | SUBQUERY | a | NULL | ref | aid | aid | 5 | const | 1 | 100.00 | Using index |
| 2 | SUBQUERY | r | NULL | ref | rid | rid | 5 | const | 1 | 100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+----------------+
The subquery looks identical to the first query optimization plan; it still uses indexes in the same way. But it's relegated to a subquery. Probably not a big difference, but it's one more thing.
The only advantage is that the SELECT EXISTS... query is guaranteed to return just one row with a true/false value. The first query might return a result set with zero, one, or many rows, depending how many matched the JOINs in the query. The difference is not a performance difference (unless it returns so many rows that it takes time to transfer the result set to the client, or uses a lot of memory to hold the result set in the client), but just a matter of convenience for the way you code it.
Don't normalize ages; it is just a waste of space and time. age (assuming it is 'years') can fit in a 1-byte TINYINT UNSIGNED (range: 0..255) and avoid the JOIN lookup. Note that aid seems to be a 4-byte INT, which can hold billions of different values -- do you have billions of different ages?
Perhaps changing regions is worth it also.
In the first query, the two JOINs do nothing but verify that there are rows in ages and regions. That is probably a waste.
EXISTS stops when one row is found, so the LIMIT 1 is unnecessary.
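
The two patterns discussed in this thread can be put side by side in a small runnable form. This is only a sketch under stated assumptions: it uses Python's sqlite3 module instead of PHP/PDO and MySQL, the table layouts are guessed from the question, the sample row is made up, and the function names `fetch_person` and `person_exists` are hypothetical.

```python
import sqlite3

FETCH_SQL = """
SELECT n.fname, n.lname, a.age, r.region
FROM names n
INNER JOIN ages a ON n.id = a.aid
INNER JOIN regions r ON n.id = r.rid
WHERE n.id = ?
LIMIT 1
"""

EXISTS_SQL = """
SELECT EXISTS (
    SELECT 1
    FROM names n
    INNER JOIN ages a ON n.id = a.aid
    INNER JOIN regions r ON n.id = r.rid
    WHERE n.id = ?
)
"""


def make_db():
    """Build an in-memory database with an assumed schema and one made-up row."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE names   (id INTEGER PRIMARY KEY, fname TEXT, lname TEXT);
        CREATE TABLE ages    (aid INTEGER, age INTEGER);
        CREATE TABLE regions (rid INTEGER, region TEXT);
        CREATE INDEX idx_aid ON ages (aid);
        CREATE INDEX idx_rid ON regions (rid);
        INSERT INTO names   VALUES (1, 'Ada', 'Lovelace');
        INSERT INTO ages    VALUES (1, 36);
        INSERT INTO regions VALUES (1, 'London');
    """)
    return db


def fetch_person(db, person_id):
    """One round trip: a missing id simply yields None instead of a row."""
    return db.execute(FETCH_SQL, (person_id,)).fetchone()


def person_exists(db, person_id):
    """Existence check only: EXISTS always returns exactly one row, 0 or 1."""
    return db.execute(EXISTS_SQL, (person_id,)).fetchone()[0] == 1


if __name__ == "__main__":
    db = make_db()
    print(fetch_person(db, 1), fetch_person(db, 2))
    print(person_exists(db, 1), person_exists(db, 2))
```

When the row is needed anyway, fetching it and testing for an empty result saves the second query that the EXISTS-then-fetch variant issues.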

MySQL Entity Framework Wraps query into sub-select for Order By

We support both MSSQL and MySQL with Entity Framework 6 in an MVC 5 application. The problem I am having is that, with the MySQL connector and LINQ, queries that have an INNER JOIN and an ORDER BY are wrapped in a sub-select, with the ORDER BY applied on the outside. This causes a substantial performance impact. This does not happen when using the MSSQL connector. Here is an example:
SELECT
`Project3`.*
FROM
(SELECT
`Extent1`.*,
`Extent2`.`Name_First`
FROM
`ResultRecord` AS `Extent1`
LEFT OUTER JOIN `ResultInputEntity` AS `Extent2` ON `Extent1`.`Id` = `Extent2`.`Id`
WHERE
`Extent1`.`DateCreated` <= '4/4/2016 6:29:59 PM'
AND `Extent1`.`DateCreated` >= '12/31/2015 6:30:00 PM'
AND 0000 = `Extent1`.`CustomerId`
AND (`Extent1`.`InUseById` IS NULL OR 0000 = `Extent1`.`InUseById` OR `Extent1`.`LockExpiration` < '4/4/2016 6:29:59 PM')
AND `Extent1`.`DivisionId` IN (0000)
AND `Extent1`.`IsDeleted` != 1
AND EXISTS( SELECT
1 AS `C1`
FROM
`ResultInputEntityIdentification` AS `Extent3`
WHERE
`Extent1`.`Id` = `Extent3`.`InputEntity_Id`
AND 0 = `Extent3`.`Type`
AND '0000' = `Extent3`.`Number`
AND NOT (`Extent3`.`Number` IS NULL)
OR LENGTH(`Extent3`.`Number`) = 0)
AND EXISTS( SELECT
1 AS `C1`
FROM
`ResultRecordAssignment` AS `Extent4`
WHERE
1 = `Extent4`.`AssignmentType`
AND `Extent4`.`AssignmentId` = 0000
OR 2 = `Extent4`.`AssignmentType`
AND `Extent4`.`AssignmentId` = 0000
AND `Extent4`.`ResultRecordId` = `Extent1`.`Id`)) AS `Project3`
ORDER BY `Project3`.`DateCreated` ASC , `Project3`.`Name_First` ASC , `Project3`.`Id` ASC
LIMIT 0 , 25
This query simply times out when run against a few million rows. This is the EXPLAIN for the above query:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | extra |
| 1 | PRIMARY | Extent1 | ref | IX_ResultRecord_CustomerId,IX_ResultRecord_DateCreated,IX_ResultRecord_IsDeleted,IX_ResultRecord_InUseById,IX_ResultRecord_LockExpiration,IX_ResultRecord_DivisionId | IX_ResultRecord_CustomerId | 4 | const | 1 | Using where; Using temporary; Using filesort |
| 1 | PRIMARY | Extent2 | ref | PRIMARY | PRIMARY | 8 | Extent1.Id | 1 | |
| 4 | DEPENDENT SUBQUERY | Extent4 | ref | IX_RA_AT,IX_RA_A_ID,IX_RA_RR_ID | IX_RA_A_ID | 5 | const | 1 | Using where |
| 3 | DEPENDENT SUBQUERY | Extent3 | ALL | IX_InputEntity_Id,IX_InputEntityIdentification_Type,IX_InputEntityIdentification_Number | | | | 14341877 | Using where |
Now, as it would get generated in MSSQL, or we simply get rid of the sub select to ORDER BY, the improvement is dramatic!
SELECT
`Extent1`.*,
`Extent2`.`Name_First`
FROM
`ResultRecord` AS `Extent1`
LEFT OUTER JOIN `ResultInputEntity` AS `Extent2` ON `Extent1`.`Id` = `Extent2`.`Id`
WHERE
`Extent1`.`DateCreated` <= '4/4/2016 6:29:59 PM'
AND `Extent1`.`DateCreated` >= '12/31/2015 6:30:00 PM'
AND 0000 = `Extent1`.`CustomerId`
AND (`Extent1`.`InUseById` IS NULL
OR 0000 = `Extent1`.`InUseById`
OR `Extent1`.`LockExpiration` < '4/4/2016 6:29:59 PM')
AND `Extent1`.`DivisionId` IN (0000)
AND `Extent1`.`IsDeleted` != 1
AND EXISTS( SELECT
1 AS `C1`
FROM
`ResultInputEntityIdentification` AS `Extent3`
WHERE
`Extent1`.`Id` = `Extent3`.`InputEntity_Id`
AND 9 = `Extent3`.`Type`
AND '0000' = `Extent3`.`Number`
AND NOT (`Extent3`.`Number` IS NULL)
OR LENGTH(`Extent3`.`Number`) = 0)
AND EXISTS( SELECT
1 AS `C1`
FROM
`ResultRecordAssignment` AS `Extent4`
WHERE
1 = `Extent4`.`AssignmentType`
AND `Extent4`.`AssignmentId` = 0000
OR 2 = `Extent4`.`AssignmentType`
AND `Extent4`.`AssignmentId` = 0000
AND `Extent4`.`ResultRecordId` = `Extent1`.`Id`)
ORDER BY `Extent1`.`DateCreated` ASC , `Extent2`.`Name_First` ASC , `Extent1`.`Id` ASC
LIMIT 0 , 25
This query now runs in 0.10 seconds! And the explain plan is now this:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | extra |
| 1 | PRIMARY | <subquery2> | ALL | distinct_key | | | | 1 | Using temporary; Using filesort |
| 1 | PRIMARY | Extent1 | ref | PRIMARY,IX_ResultRecord_CustomerId,IX_ResultRecord_DateCreated,IX_ResultRecord_IsDeleted,IX_ResultRecord_InUseById,IX_ResultRecord_LockExpiration,IX_ResultRecord_DivisionId | PRIMARY | 8 | Extent3.InputEntity_Id | 1 | Using where |
| 1 | PRIMARY | Extent4 | ref | IX_RA_AT,IX_RA_A_ID,IX_RA_RR_ID | IX_RA_RR_ID | 8 | Extent3.InputEntity_Id | 1 | Using where; Start temporary; End temporary |
| 1 | PRIMARY | Extent2 | ref | PRIMARY | PRIMARY | 8 | Extent3.InputEntity_Id | 1 | |
| 2 | MATERIALIZED | Extent3 | ref | IX_InputEntity_Id,IX_InputEntityIdentification_Type,IX_InputEntityIdentification_Number | IX_InputEntityIdentification_Type | 4 | const | 1 | Using where |
I have hit this issue many times across the system, and it seems clear that the MySQL EF 6 connector always wraps queries in a sub-select to apply the ORDER BY, but only when there is a join in the query. This is causing major performance issues. Some answers I have seen suggest modifying the connector source code, but that can be tedious. Has anyone had this same issue, found a workaround, or already modified the connector? Any other suggestions are welcome, apart from moving to SQL Server and leaving MySQL behind, which is not an option.
Did you look at the SQL that SQL Server generates? Is the query itself different, or only the performance?
I ask because it is [usually] not the provider that decides the structure of the query (e.g. ordering a subquery). The provider just translates the structure of the query into the syntax of the DBMS. So in your case the problem could be the DBMS optimizer.
In issues similar to yours I used a different approach, based on mapping a query to entities, i.e. using ObjectContext.ExecuteStoreQuery.
It turns out that, to work around this with the MySQL driver, your entire lambda must be written in one go, meaning in ONE Where(..) predicate. This way the driver knows that it is all one result set. If you instead build an initial IQueryable and then keep appending Where clauses that access child tables, the driver believes there are multiple result sets and therefore wraps the entire query in a sub-select in order to sort and limit it.

Updating millions of records on inner joined subquery - optimization techniques

I'm looking for some advice on how I might better optimize this query.
For each _piece_detail record that:
- Contains at least one matching _scan record on (zip, zip_4, zip_delivery_point, serial_number)
- Belongs to a company from mailing_groups (through a chain of relationships)
- Has either:
  - first_scan_date_time that is greater than the MIN(scan_date_time) of the related _scan records
  - latest_scan_date_time that is less than the MAX(scan_date_time) of the related _scan records
I will need to:
- Set _piece_detail.first_scan_date_time to MIN(_scan.scan_date_time)
- Set _piece_detail.latest_scan_date_time to MAX(_scan.scan_date_time)
Since I'm dealing with millions upon millions of records, I am trying to reduce the number of records that I actually have to search through. Here are some facts about the data:
- The _piece_details table is partitioned by job_id, so it seems to make the most sense to run through these checks in the order of _piece_detail.job_id, _piece_detail.piece_id.
- The scan records table contains over 100,000,000 records right now and is partitioned by (zip, zip_4, zip_delivery_point, serial_number, scan_date_time), which is the same key that is used to match a _scan with a _piece_detail (aside from scan_date_time).
- Only about 40% of the _piece_detail records belong to a mailing_group, but we don't know which ones these are until we run through the full relationship of joins.
- Only about 30% of the _scan records belong to a _piece_detail with a mailing_group.
- There are typically between 0 and 4 _scan records per _piece_detail.
Now, I am having a hell of a time finding a way to execute this in a decent way. I had originally started with something like this:
UPDATE _piece_detail
INNER JOIN (
SELECT _piece_detail.job_id, _piece_detail.piece_id, MIN(_scan.scan_date_time) as first_scan_date_time, MAX(_scan.scan_date_time) as latest_scan_date_time
FROM _piece_detail
INNER JOIN _container_quantity
ON _piece_detail.cqt_database_id = _container_quantity.cqt_database_id
AND _piece_detail.job_id = _container_quantity.job_id
INNER JOIN _container_summary
ON _container_quantity.container_id = _container_summary.container_id
AND _container_summary.job_id = _container_quantity.job_id
INNER JOIN _mail_piece_unit
ON _container_quantity.mpu_id = _mail_piece_unit.mpu_id
AND _container_quantity.job_id = _mail_piece_unit.job_id
INNER JOIN _header
ON _header.job_id = _piece_detail.job_id
INNER JOIN mailing_groups
ON _mail_piece_unit.mpu_company = mailing_groups.mpu_company
INNER JOIN _scan
ON _scan.zip = _piece_detail.zip
AND _scan.zip_4 = _piece_detail.zip_4
AND _scan.zip_delivery_point = _piece_detail.zip_delivery_point
AND _scan.serial_number = _piece_detail.serial_number
GROUP BY _piece_detail.job_id, _piece_detail.piece_id, _scan.zip, _scan.zip_4, _scan.zip_delivery_point, _scan.serial_number
) as t1 ON _piece_detail.job_id = t1.job_id AND _piece_detail.piece_id = t1.piece_id
SET _piece_detail.first_scan_date_time = t1.first_scan_date_time, _piece_detail.latest_scan_date_time = t1.latest_scan_date_time
WHERE _piece_detail.first_scan_date_time < t1.first_scan_date_time
OR _piece_detail.latest_scan_date_time > t1.latest_scan_date_time;
I thought that this may have been trying to load too much into memory at once and might not be using the indexes properly.
Then I thought that I might be able to avoid doing that huge joined subquery and add two leftjoin subqueries to get the min/max like so:
UPDATE _piece_detail
INNER JOIN _container_quantity
ON _piece_detail.cqt_database_id = _container_quantity.cqt_database_id
AND _piece_detail.job_id = _container_quantity.job_id
INNER JOIN _container_summary
ON _container_quantity.container_id = _container_summary.container_id
AND _container_summary.job_id = _container_quantity.job_id
INNER JOIN _mail_piece_unit
ON _container_quantity.mpu_id = _mail_piece_unit.mpu_id
AND _container_quantity.job_id = _mail_piece_unit.job_id
INNER JOIN _header
ON _header.job_id = _piece_detail.job_id
INNER JOIN mailing_groups
ON _mail_piece_unit.mpu_company = mailing_groups.mpu_company
LEFT JOIN _scan fs ON (fs.zip, fs.zip_4, fs.zip_delivery_point, fs.serial_number) = (
SELECT zip, zip_4, zip_delivery_point, serial_number
FROM _scan
WHERE zip = _piece_detail.zip
AND zip_4 = _piece_detail.zip_4
AND zip_delivery_point = _piece_detail.zip_delivery_point
AND serial_number = _piece_detail.serial_number
ORDER BY scan_date_time ASC
LIMIT 1
)
LEFT JOIN _scan ls ON (ls.zip, ls.zip_4, ls.zip_delivery_point, ls.serial_number) = (
SELECT zip, zip_4, zip_delivery_point, serial_number
FROM _scan
WHERE zip = _piece_detail.zip
AND zip_4 = _piece_detail.zip_4
AND zip_delivery_point = _piece_detail.zip_delivery_point
AND serial_number = _piece_detail.serial_number
ORDER BY scan_date_time DESC
LIMIT 1
)
SET _piece_detail.first_scan_date_time = fs.scan_date_time, _piece_detail.latest_scan_date_time = ls.scan_date_time
WHERE _piece_detail.first_scan_date_time < fs.scan_date_time
OR _piece_detail.latest_scan_date_time > ls.scan_date_time
These are the explains when I convert them to SELECT statements:
+----+-------------+---------------------+--------+----------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+--------+----------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+--------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 844161 | NULL |
| 1 | PRIMARY | _piece_detail | eq_ref | PRIMARY,first_scan_date_time,latest_scan_date_time | PRIMARY | 18 | t1.job_id,t1.piece_id | 1 | Using where |
| 2 | DERIVED | _header | index | PRIMARY | date_prepared | 3 | NULL | 87 | Using index; Using temporary; Using filesort |
| 2 | DERIVED | _piece_detail | ref | PRIMARY,cqt_database_id,zip | PRIMARY | 10 | odms._header.job_id | 9703 | NULL |
| 2 | DERIVED | _container_quantity | eq_ref | unique,mpu_id,job_id,job_id_container_quantity | unique | 14 | odms._header.job_id,odms._piece_detail.cqt_database_id | 1 | NULL |
| 2 | DERIVED | _mail_piece_unit | eq_ref | PRIMARY,company,job_id_mail_piece_unit | PRIMARY | 14 | odms._container_quantity.mpu_id,odms._header.job_id | 1 | Using where |
| 2 | DERIVED | mailing_groups | eq_ref | PRIMARY | PRIMARY | 27 | odms._mail_piece_unit.mpu_company | 1 | Using index |
| 2 | DERIVED | _container_summary | eq_ref | unique,container_id,job_id_container_summary | unique | 14 | odms._header.job_id,odms._container_quantity.container_id | 1 | Using index |
| 2 | DERIVED | _scan | ref | PRIMARY | PRIMARY | 28 | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number | 1 | Using index |
+----+-------------+---------------------+--------+----------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+--------+----------------------------------------------+
+----+--------------------+---------------------+--------+--------------------------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+-----------+-----------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------------------+--------+--------------------------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+-----------+-----------------------------------------------------------------+
| 1 | PRIMARY | _header | index | PRIMARY | date_prepared | 3 | NULL | 87 | Using index |
| 1 | PRIMARY | _piece_detail | ref | PRIMARY,cqt_database_id,first_scan_date_time,latest_scan_date_time | PRIMARY | 10 | odms._header.job_id | 9703 | NULL |
| 1 | PRIMARY | _container_quantity | eq_ref | unique,mpu_id,job_id,job_id_container_quantity | unique | 14 | odms._header.job_id,odms._piece_detail.cqt_database_id | 1 | NULL |
| 1 | PRIMARY | _mail_piece_unit | eq_ref | PRIMARY,company,job_id_mail_piece_unit | PRIMARY | 14 | odms._container_quantity.mpu_id,odms._header.job_id | 1 | Using where |
| 1 | PRIMARY | mailing_groups | eq_ref | PRIMARY | PRIMARY | 27 | odms._mail_piece_unit.mpu_company | 1 | Using index |
| 1 | PRIMARY | _container_summary | eq_ref | unique,container_id,job_id_container_summary | unique | 14 | odms._header.job_id,odms._container_quantity.container_id | 1 | Using index |
| 1 | PRIMARY | fs | index | NULL | updated | 1 | NULL | 102462928 | Using where; Using index; Using join buffer (Block Nested Loop) |
| 1 | PRIMARY | ls | index | NULL | updated | 1 | NULL | 102462928 | Using where; Using index; Using join buffer (Block Nested Loop) |
| 3 | DEPENDENT SUBQUERY | _scan | ref | PRIMARY | PRIMARY | 28 | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number | 1 | Using where; Using index; Using filesort |
| 2 | DEPENDENT SUBQUERY | _scan | ref | PRIMARY | PRIMARY | 28 | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number | 1 | Using where; Using index; Using filesort |
+----+--------------------+---------------------+--------+--------------------------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+-----------+-----------------------------------------------------------------+
Now, looking at the explains generated by each, I really can't tell which is giving me the best bang for my buck. The first one shows fewer total rows when multiplying the rows column, but the second appears to execute a bit quicker.
Is there anything that I could do to achieve the same results while increasing performance through modifying the query structure?
Disable update of index while doing the bulk updates
ALTER TABLE _piece_detail DISABLE KEYS;
UPDATE ....;
ALTER TABLE _piece_detail ENABLE KEYS;
Refer to the mysql docs : http://dev.mysql.com/doc/refman/5.0/en/alter-table.html
EDIT:
After looking at the MySQL docs I pointed to, I see that they specify this for MyISAM tables, and it is not clear for other table types. Further solutions here: How to disable index in innodb
There is something I was taught and still strictly follow today: create as many temporary tables as you want, and avoid derived tables, especially in UPDATE/DELETE/INSERT statements, because:
- you can't predict the index on derived tables
- the derived tables might not be held in memory if the result set is big
- the table (MyISAM) or rows (InnoDB) may be locked for a longer time while the derived query runs. I prefer a temp table that joins to the parent table on its primary key.
And most importantly, it makes your code look neat and readable.
My approach would be:
CREATE TEMPORARY TABLE xxx (...);
INSERT INTO xxx SELECT q FROM y INNER JOIN z ...;
UPDATE _piece_detail INNER JOIN xxx ON (...) SET ...;
Always reduce your downtime!!
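
The staging idea above can be tried on a toy data set. What follows is a minimal sketch, assuming a much-simplified schema: a single serial_number column stands in for the four-column match key, the mailing_groups join chain is left out, and SQLite (via Python's sqlite3 module) stands in for MySQL. The table name `scan_bounds` and the function names are hypothetical. The comparison follows the written requirement in the question: update when the stored first scan is later than the earliest scan, or the stored latest scan is earlier than the newest scan.

```python
import sqlite3


def refresh_scan_bounds(db):
    """Stage MIN/MAX per key in an indexed temp table, then update from it."""
    db.executescript("""
        DROP TABLE IF EXISTS scan_bounds;
        CREATE TEMP TABLE scan_bounds (
            serial_number TEXT PRIMARY KEY,   -- gives the join a usable index
            min_scan TEXT NOT NULL,
            max_scan TEXT NOT NULL
        );

        INSERT INTO scan_bounds
        SELECT serial_number, MIN(scan_date_time), MAX(scan_date_time)
        FROM scan
        GROUP BY serial_number;

        UPDATE piece_detail
        SET first_scan_date_time =
                (SELECT b.min_scan FROM scan_bounds b
                 WHERE b.serial_number = piece_detail.serial_number),
            latest_scan_date_time =
                (SELECT b.max_scan FROM scan_bounds b
                 WHERE b.serial_number = piece_detail.serial_number)
        WHERE EXISTS (
            SELECT 1 FROM scan_bounds b
            WHERE b.serial_number = piece_detail.serial_number
              AND (piece_detail.first_scan_date_time > b.min_scan
                   OR piece_detail.latest_scan_date_time < b.max_scan)
        );
    """)


def demo():
    """Build made-up rows, run the refresh, and return the resulting pieces."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE piece_detail (
            piece_id INTEGER PRIMARY KEY,
            serial_number TEXT,
            first_scan_date_time TEXT,
            latest_scan_date_time TEXT
        );
        CREATE TABLE scan (serial_number TEXT, scan_date_time TEXT);
        INSERT INTO piece_detail VALUES
            (1, 'A', '2016-01-05', '2016-01-05'),   -- stale on both ends
            (2, 'B', '2016-02-01', '2016-02-09'),   -- already up to date
            (3, 'C', '2016-03-01', '2016-03-01');   -- no scans at all
        INSERT INTO scan VALUES
            ('A', '2016-01-03'), ('A', '2016-01-08'),
            ('B', '2016-02-01'), ('B', '2016-02-09');
    """)
    refresh_scan_bounds(db)
    rows = db.execute(
        "SELECT piece_id, first_scan_date_time, latest_scan_date_time "
        "FROM piece_detail ORDER BY piece_id"
    ).fetchall()
    db.close()
    return rows


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Only piece 1 is rewritten; pieces 2 and 3 are left untouched, which keeps the number of modified rows (and the locks held on them) small. In MySQL the final step would more likely be written as the multi-table UPDATE ... INNER JOIN form shown above.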
Why aren't you using sub-queries for each join? Including the inner joins?
INNER JOIN (SELECT field1, field2, field3 from _container_quantity order by 1,2,3) AS _container_quantity
ON _piece_detail.cqt_database_id = _container_quantity.cqt_database_id
AND _piece_detail.job_id = _container_quantity.job_id
INNER JOIN (SELECT field1, field2, field3 from _container_summary order by 1,2,3) AS _container_summary
ON _container_quantity.container_id = _container_summary.container_id
AND _container_summary.job_id = _container_quantity.job_id
You're definitely pulling a lot into memory by not limiting your selects on those inner joins. By using the order by 1,2,3 at the end of each sub-query you create an index on each sub-query. Your only index is on headers and you aren't joining on _headers....
A couple suggestions to optimize this query. Either create the indexes you need on each table, or use the Sub-query join clauses to create manually the indexes you need on the fly.
Also remember that when you do a left join on a "temporary" table full of aggregates you are just asking for performance trouble.
Contains at least one matching _scan record on (zip, zip_4,
zip_delivery_point, serial_number)
Umm...this is your first point in what you want to do, but none of these fields are indexed?
From your EXPLAIN results it seems that the subquery goes through all the rows twice. How about keeping the MIN/MAX from the first one and using just one left join instead of two?

How to check another row if value exists?

This is driving me nuts. I have dumped the IMDb database using IMDbPY. I'm trying to find US movies that have the actor data available, by the first letter of the movie title.
Below is an example of a query that fetches the movies without actor data. This runs pretty quickly:
SELECT DISTINCT title.id,title.title,title.production_year
FROM title
INNER JOIN movie_info ON
(movie_info.movie_id = title.id
AND
movie_info.info_type_id = 8
AND
movie_info.info = 'USA')
WHERE title LIKE 'a%'
AND title.kind_id = 1
LIMIT 75
The cast data is stored in a separate table called cast_info, which contains about 22 million records. The nr_order column contains the order of credits for actors in a movie. For example, Tom Hanks would be 1 in Forrest Gump. There are typically dozens of rows for each movie_id.
So to check to see if the actor data is available, there should be at least one row that isn't null for that particular movie_id. If all the values in nr_order for a movie_id are null, it does NOT contain the data I need.
To attempt to grab this information I used the query below:
SELECT DISTINCT title.id,title.title,title.production_year
FROM title
INNER JOIN movie_info ON
(movie_info.movie_id = title.id
AND
movie_info.info_type_id = 8
AND
movie_info.info = 'USA')
INNER JOIN cast_info ON
(cast_info.movie_id = title.id
AND
cast_info.nr_order = 1)
WHERE title LIKE 'a%'
AND title.kind_id = 1
LIMIT 75
For some reason the query becomes very slow. The first query takes 0.3-0.7 seconds and the second takes about 6-10 seconds. I added an index on cast_info.nr_order but it didn't help.
The EXPLAIN output:
+----+-------------+-----------+-------+--------------------------------------------------+-------------------+---------+--------------+-------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+--------------------------------------------------+-------------------+---------+--------------+-------+-----------------------------+
| 1 | SIMPLE | title | range | PRIMARY,title_idx_title,fk_kind_type_id_4 | title_idx_title | 257 | NULL | 132801| Using where; Using temporary|
| 1 | SIMPLE | movie_info| ref | ovie_info_idx_mid,info_type_id movie_info_idx_mid| movie_info_idx_mid| 4 | imdb.title.id| 4 | Using where; Distinct |
| 1 | SIMPLE | table1 | ref | cast_info_idx_mid,nr_order | cast_info_idx_mid | 4 | imdb.title.id| 12 | Using where; Distinct |
+----+-------------+-----------+-------+--------------------------------------------------+-------------------+---------+--------------+-------+-----------------------------+
Any ideas would be very helpful!
EDIT: EXPLAIN from 1st query
+----+-------------+-----------+-------+--------------------------------------------------+-------------------+---------+--------------+-------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+--------------------------------------------------+-------------------+---------+--------------+-------+-----------------------------+
| 1 | SIMPLE | title | range | PRIMARY,title_idx_title,fk_kind_type_id_4 | title_idx_title | 257 | NULL | 132801| Using where; Using temporary|
| 1 | SIMPLE | movie_info| ref | ovie_info_idx_mid,info_type_id movie_info_idx_mid| movie_info_idx_mid| 4 | imdb.title.id| 4 | Using where; Distinct |
+----+-------------+-----------+-------+--------------------------------------------------+-------------------+---------+--------------+-------+-----------------------------+
Since you're only concerned with whether there is or is not cast information available, you could try using EXISTS instead:
SELECT DISTINCT title.id,title.title,title.production_year
FROM title
INNER JOIN movie_info ON
(movie_info.movie_id = title.id
AND
movie_info.info_type_id = 8
AND
movie_info.info = 'USA')
WHERE title LIKE 'a%'
AND title.kind_id = 1
AND EXISTS(SELECT 1 FROM cast_info WHERE cast_info.movie_id = title.id AND cast_info.nr_order IS NOT NULL)
LIMIT 75
I'm not sure of the exact explanation for the behavior you're seeing, but the DISTINCT could be doing something funny with lots of rows on the join, or at least lots of rows in the joined product (note the Distinct being applied to the cast_info table in the EXPLAIN).
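To illustrate the difference in row counts, here is a small sketch using an in-memory SQLite database from Python (not MySQL, and with made-up sample rows). It only shows the semantics: a join against cast_info yields one row per matching cast row, which DISTINCT then has to collapse, while EXISTS yields at most one row per title.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE title (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE cast_info (movie_id INTEGER, nr_order INTEGER);
    INSERT INTO title VALUES (1, 'alpha'), (2, 'avalon'), (3, 'atlas');
    -- Movie 1 has three credited actors, movie 2 has only NULL nr_order
    -- values, and movie 3 has no cast rows at all.
    INSERT INTO cast_info VALUES (1, 1), (1, 2), (1, 3), (2, NULL), (2, NULL);
""")

# Join: one output row per matching cast_info row.
joined = conn.execute("""
    SELECT t.id FROM title t
    JOIN cast_info c ON c.movie_id = t.id AND c.nr_order IS NOT NULL
    ORDER BY t.id
""").fetchall()

# EXISTS: a title is returned once as soon as any matching row is found.
semi = conn.execute("""
    SELECT t.id FROM title t
    WHERE EXISTS (SELECT 1 FROM cast_info c
                  WHERE c.movie_id = t.id AND c.nr_order IS NOT NULL)
    ORDER BY t.id
""").fetchall()

print(joined)  # [(1,), (1,), (1,)]
print(semi)    # [(1,)]
```

With dozens of cast rows per movie, the joined product grows accordingly before DISTINCT removes the duplicates.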

Three Queries Faster than One -- What's Wrong with my Joins?

I've got a JPA ManyToMany relationship set up, which gives me three important tables: my Ticket table, my Join table, and my Inventory table. They're InnoDB tables on MySQL 5.1. The relevant bits are:
Ticket:
+--------+----------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------+----------+------+-----+---------+----------------+
| ID | int(11) | NO | PRI | NULL | auto_increment |
| Status | longtext | YES | | NULL | |
+--------+----------+------+-----+---------+----------------+
JoinTable:
+-------------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+-------+
| InventoryID | int(11) | NO | PRI | NULL | | Foreign Key - Inventory
| TicketID | int(11) | NO | PRI | NULL | | Foreign Key - Ticket
+-------------+---------+------+-----+---------+-------+
Inventory:
+--------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+----------------+
| ID | int(11) | NO | PRI | NULL | auto_increment |
| TStampString | varchar(32) | NO | MUL | NULL | |
+--------------+--------------+------+-----+---------+----------------+
TStampStrings are of the form "yyyy.mm.dd HH:MM:SS Z" (for example, '2010.03.19 22:27:57 GMT'). Right now all of the Tickets created directly correspond to some specific hour TStampString, so that SELECT COUNT(*) FROM Ticket; is the same as SELECT COUNT(DISTINCT(SUBSTRING(TStampString, 1, 13))) FROM Inventory;
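To make the prefix lengths concrete, here is a small sketch that uses SQLite's substr from Python as a stand-in for MySQL's SUBSTRING (both count positions from 1). The first 13 characters give the hour, and the first 16 give the minute:

```python
import sqlite3

ts = "2010.03.19 22:27:57 GMT"
conn = sqlite3.connect(":memory:")

# substr(s, 1, n) returns the first n characters, like SUBSTRING(s, 1, n).
hour_key, minute_key = conn.execute(
    "SELECT substr(?, 1, 13), substr(?, 1, 16)", (ts, ts)
).fetchone()

print(hour_key)    # 2010.03.19 22
print(minute_key)  # 2010.03.19 22:27
```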
What I'd like to do is regroup certain Tickets based on the minute granularity of a TStampString: (SUBSTRING(TStampString, 1, 16)). So I'm profiling and testing the SELECT of an INSERT INTO ... SELECT statement:
EXPLAIN SELECT SUBSTRING(i.TStampString, 1, 16) FROM Ticket t JOIN JoinTable j
ON t.ID = j.TicketID JOIN Inventory i ON j.InventoryID = i.ID WHERE t.Status
= 'Regroup' GROUP BY SUBSTRING(i.TStampString, 1, 16);
+--+------+---+--------+-------------+-----+-----+----------+-------+-----------+
|id| type |tbl| type | psbl_keys | key | len | ref | rows | Extra |
+--+------+---+--------+-------------+-----+-----+----------+-------+-----------+
|1 | SMPL | t | ALL | PRI | NULL| NULL| NULL | 35569 | where |
| | | | | | | | | | +temporary|
| | | | | | | | | | +filesort |
|1 | SMPL | j | ref | PRI,FK1,FK2 | FK2 | 4 | t.ID | 378 | index |
|1 | SMPL | i | eq_ref | PRI | PRI | 4 | j.Invent | 1 | |
| | | | | | | | oryID | | |
+--+------+---+--------+-------------+-----+-----+----------+-------+-----------+
What this implies to me is that for each row in Ticket, MySQL first does the joins then later decides that the row is invalid due to the WHERE clause. Certainly the runtime is abominable (I gave up after 30 minutes). Note that it goes no faster with t.Status = 'Regroup' moved to the first JOIN clause and no WHERE clause.
But what's interesting is that if I run this query manually in three steps, doing what I thought the optimizer would do, each step returns almost immediately:
--Step 1: Select relevant Tickets (results dumped to file)
SELECT ID FROM Ticket WHERE Status = 'Regroup';
--Step 2: Get relevant Inventory entries
SELECT InventoryID FROM JoinTable WHERE TicketID IN (step 1s file);
--Step 3: Select what I wanted all along
SELECT SUBSTRING(TStampString, 1, 16) FROM Inventory WHERE ID IN (step 2s file)
GROUP BY SUBSTRING(TStampString, 1, 16);
On my particular tables, the first query gives 154 results, the second creates 206,598 lines, and the third query returns 9198 rows. All of them combined take ~2 minutes to run, with the last query having the only significant runtime.
Dumping the intermediate results to a file is cumbersome, and more importantly I'd like to know how to write my original query such that it runs reasonably. So how do I structure this three-table-join such that it runs as fast as I know is possible?
UPDATE: I've added a prefix index on Status(16), which changes my EXPLAIN profile rows to 153, 378, and 1 respectively (since the first row has a key to use). The JOIN version of my query now takes ~6 minutes, which is tolerable but still considerably slower than the manual version. I'd still like to know why the join is performing sorely suboptimally, but it may be that one can't create independent subqueries in buggy MySQL 5.1. If enough time passes I'll accept Add Index as the solution to my problem, although it's not exactly the answer to my question.
In the end I did end up manually recreating every step of the join on disk. Tens of thousands of files each with a thousand queries was still significantly faster than anything I could get my version of MySQL to do. But since that process would be horribly specific and unhelpful for the layman, I'm accepting ypercube's answer of Add (Partial) Indexes.
What you can do to speed up the query:
Add an index on Status. Even if you don't change the type to VARCHAR, you can still add a partial index:
ALTER TABLE Ticket
ADD INDEX status_idx
(Status(16)) ;
I assume that the Primary key of the Join table is (InventoryID, TicketID). You can add another index on (TicketID, InventoryID) as well. This may not benefit this particular query but it will be helpful in other queries you'll have.
The answer on why this happens is that the optimizer does not always choose the best plan. You can try this variation of your query and see how the EXPLAIN plan differs and if there is any efficiency gain:
SELECT SUBSTRING(i.TStampString, 1, 16)
FROM
( SELECT DISTINCT j.InventoryID -- the DISTINCT is optional; try with and without it
FROM Ticket t
JOIN JoinTable j
ON t.ID = j.TicketID
WHERE t.Status = 'Regroup'
) AS tmp
JOIN Inventory i
ON tmp.InventoryID = i.ID
GROUP BY SUBSTRING(i.TStampString, 1, 16) ;
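As a sanity check of what this variation computes, here is a minimal sketch on an in-memory SQLite database from Python, with a few made-up rows. It says nothing about MySQL 5.1's plan or timing; it only shows that the derived table first narrows JoinTable down to the distinct InventoryIDs of 'Regroup' tickets, and that the outer query then groups by minute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Ticket (ID INTEGER PRIMARY KEY, Status TEXT);
    CREATE TABLE JoinTable (InventoryID INTEGER, TicketID INTEGER,
                            PRIMARY KEY (InventoryID, TicketID));
    CREATE TABLE Inventory (ID INTEGER PRIMARY KEY, TStampString TEXT);
    -- Sample data (invented for this sketch).
    INSERT INTO Ticket VALUES (1, 'Regroup'), (2, 'Regroup'), (3, 'Done');
    INSERT INTO Inventory VALUES
        (10, '2010.03.19 22:27:57 GMT'),
        (11, '2010.03.19 22:27:59 GMT'),
        (12, '2010.03.19 22:28:01 GMT'),
        (13, '2010.03.19 23:00:00 GMT');
    INSERT INTO JoinTable VALUES (10, 1), (10, 2), (11, 1), (12, 2), (13, 3);
""")

query = """
    SELECT substr(i.TStampString, 1, 16) AS minute_key
    FROM (SELECT DISTINCT j.InventoryID
          FROM Ticket t JOIN JoinTable j ON t.ID = j.TicketID
          WHERE t.Status = 'Regroup') AS tmp
    JOIN Inventory i ON tmp.InventoryID = i.ID
    GROUP BY minute_key
    ORDER BY minute_key
"""
minutes = [row[0] for row in conn.execute(query)]
print(minutes)  # ['2010.03.19 22:27', '2010.03.19 22:28']
```

Inventory row 13 is excluded because its only ticket is not in 'Regroup' status, and row 10 is counted once even though it belongs to two such tickets.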
Try giving the first SUBSTRING clause an alias and using it in the GROUP BY.
SELECT SUBSTRING(i.TStampString, 1, 16) as blaa FROM Ticket t JOIN JoinTable j
ON t.ID = j.TicketID JOIN Inventory i ON j.InventoryID = i.ID WHERE t.Status
= 'Regroup' GROUP BY blaa;
Also, you can avoid the join altogether, since you don't need it:
SELECT distinct(SUBSTRING(i.TStampString, 1,16)) from Inventory i where i.ID in
( select j.InventoryID from JoinTable j where j.TicketID in
(select t.ID from Ticket t where t.Status = 'Regroup'));
Would that work?
By the way, you do have an index on the Status field?