Optimize MySQL Sub-Query - mysql

Is there a way this query can be optimized? It looks redundant:
SELECT
SUM((SELECT
IFNULL(SUM(trx.totalAmount), 0)
FROM trx
WHERE
FIND_IN_SET (trx.clientOrderId, "B6A8DB9568,6E7705B487,59C4D4234D,1D9CD4EF96,4C373E8CDE,E818BEE48F,6610555669,ECF388E288,32FD93075C,B03417425B,18FD77061A,1C39E4BD04,C92B970E55,0920F06DFA,EEFB4AAADA,FC2D9FF9AD") > 0
AND trx.txnType IN ('REFUND', 'VOID')
)) as refunds,
SUM((SELECT
IFNULL(SUM(trx.totalAmount), 0)
FROM trx
WHERE
FIND_IN_SET (trx.clientOrderId, "B6A8DB9568,6E7705B487,59C4D4234D,1D9CD4EF96,4C373E8CDE,E818BEE48F,6610555669,ECF388E288,32FD93075C,B03417425B,18FD77061A,1C39E4BD04,C92B970E55,0920F06DFA,EEFB4AAADA,FC2D9FF9AD") > 0
AND trx.txnType = 'SALE'
AND trx.billingCycleNumber != 1
)) AS lifetimeRevenue
Please note that this is just part of the query; the original has about 10 more terms like these, so I really need to know whether it can be optimized.
Thanks, guys.

The problem with using subqueries like that is that each subquery has to scan the full table. Also using FIND_IN_SET() the way you are using it forces a full table-scan even if you have indexes. So you are doing 12 full table-scans.
Here's a solution that does not use subqueries at all. It scans the table for the matching clientOrderId values once, to get a superset of all the rows that match any of the txnType values you need.
Then each sum of totalAmount is conditional: if the row's txnType is one of the wanted types, the row contributes its totalAmount; otherwise it contributes zero. Zero adds nothing to the sum, so it's as if you had skipped the rows with non-matching txnType.
SELECT
SUM(IF(trx.txnType IN ('REFUND', 'VOID'), trx.totalAmount, 0)) AS refunds,
SUM(IF(trx.txnType = 'SALE' AND trx.billingCycleNumber != 1, trx.totalAmount, 0)) AS lifetimeRevenue
FROM trx
WHERE trx.clientOrderId IN (
'B6A8DB9568', '6E7705B487', '59C4D4234D', '1D9CD4EF96',
'4C373E8CDE', 'E818BEE48F', '6610555669', 'ECF388E288',
'32FD93075C', 'B03417425B', '18FD77061A', '1C39E4BD04',
'C92B970E55', '0920F06DFA', 'EEFB4AAADA', 'FC2D9FF9AD')
AND trx.txnType IN ('REFUND', 'VOID', 'SALE');
You should have an index on (clientOrderId) for this query. Since you have two IN() predicates, the WHERE clause will only use the index for the first column in the index anyway.
Don't use a FIND_IN_SET() expression, because it won't use an index for the WHERE clause.
You said there are 10 more terms in the query. So I anticipate that there are some different types of expressions in those terms. I'm not going to answer any "but what if the next terms look like something different...". I have shown you the method of unraveling the subquery into one single-pass query. Applying it to other terms in your query is up to you.
Here's a demo I tested:
create table trx (
clientOrderId char(10),
txnType enum('REFUND','VOID','SALE'),
totalAmount numeric(9,2),
billingCycleNumber int default 0,
key (clientOrderId)
);
+---------------+---------+-------------+--------------------+
| clientOrderId | txnType | totalAmount | billingCycleNumber |
+---------------+---------+-------------+--------------------+
| B6A8DB9568 | REFUND | 42.00 | 0 |
| 59C4D4234D | SALE | 84.00 | 0 |
+---------------+---------+-------------+--------------------+
Here's the EXPLAIN for your query:
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+----------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+----------------+
| 1 | PRIMARY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
| 3 | SUBQUERY | trx | NULL | ALL | NULL | NULL | NULL | NULL | 2 | 50.00 | Using where |
| 2 | SUBQUERY | trx | NULL | ALL | NULL | NULL | NULL | NULL | 2 | 50.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+----------------+
Notice there is one subquery for each term, and each one does "type=ALL" (a full table scan) as its table access.
Here's the EXPLAIN for my query:
+----+-------------+-------+------------+-------+---------------+---------------+---------+------+------+----------+------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+---------------+---------+------+------+----------+------------------------------------+
| 1 | SIMPLE | trx | NULL | range | clientOrderId | clientOrderId | 11 | NULL | 16 | 50.00 | Using index condition; Using where |
+----+-------------+-------+------------+-------+---------------+---------------+---------+------+------+----------+------------------------------------+
One simple table access, using an index.
The result from both your query and my query given the example data I tried:
+---------+-----------------+
| refunds | lifetimeRevenue |
+---------+-----------------+
| 42.00 | 84.00 |
+---------+-----------------+
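The equivalence between the per-term subqueries and the single-pass conditional sums can also be checked outside MySQL. Below is a minimal sketch using Python's built-in sqlite3 module. It is an assumption-laden stand-in, not the MySQL demo itself: SQLite has no FIND_IN_SET() or IF(), so the sketch uses IN (...) and CASE WHEN, and every row after the first two is made-up test data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trx ("
    " clientOrderId TEXT, txnType TEXT,"
    " totalAmount REAL, billingCycleNumber INTEGER DEFAULT 0)"
)
conn.execute("CREATE INDEX idx_client ON trx (clientOrderId)")

# The first two rows come from the demo above; the rest are made-up test data.
conn.executemany(
    "INSERT INTO trx VALUES (?, ?, ?, ?)",
    [
        ("B6A8DB9568", "REFUND", 42.00, 0),
        ("59C4D4234D", "SALE", 84.00, 0),
        ("59C4D4234D", "SALE", 10.00, 1),  # first billing cycle: not lifetime revenue
        ("6E7705B487", "VOID", 5.00, 2),
        ("ZZZZZZZZZZ", "SALE", 99.00, 0),  # order id outside the list
    ],
)

order_ids = ("B6A8DB9568", "6E7705B487", "59C4D4234D")
marks = ", ".join("?" for _ in order_ids)

# One scalar subquery per term, as in the question (IN replaces FIND_IN_SET).
subquery_sql = f"""
    SELECT
      (SELECT IFNULL(SUM(totalAmount), 0) FROM trx
        WHERE clientOrderId IN ({marks})
          AND txnType IN ('REFUND', 'VOID')) AS refunds,
      (SELECT IFNULL(SUM(totalAmount), 0) FROM trx
        WHERE clientOrderId IN ({marks})
          AND txnType = 'SALE' AND billingCycleNumber != 1) AS lifetimeRevenue
"""
subquery_result = conn.execute(subquery_sql, order_ids * 2).fetchone()

# One pass over the matching rows, with a conditional sum per term.
single_pass_sql = f"""
    SELECT
      SUM(CASE WHEN txnType IN ('REFUND', 'VOID') THEN totalAmount ELSE 0 END),
      SUM(CASE WHEN txnType = 'SALE' AND billingCycleNumber != 1
               THEN totalAmount ELSE 0 END)
    FROM trx
    WHERE clientOrderId IN ({marks})
      AND txnType IN ('REFUND', 'VOID', 'SALE')
"""
single_pass_result = conn.execute(single_pass_sql, order_ids).fetchone()

print(subquery_result, single_pass_result)
```

Each extra term from the original query would become one more conditional SUM in the single-pass statement, provided its txnType is also listed in the outer WHERE clause.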

Related

About MySQL's Leftmost Prefix Matching Optimization

I now have a table like this:
> DESC userInfo;
+--------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------+---------------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| name | char(32) | NO | MUL | NULL | |
| age | tinyint(3) unsigned | NO | | NULL | |
| gender | tinyint(1) | NO | | 1 | |
+--------+---------------------+------+-----+---------+----------------+
I made (name, age) a joint unique index:
> SHOW INDEX FROM userInfo;
+----------+------------+--------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+--------------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------+------------+--------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+--------------------+
| userInfo | 0 | PRIMARY | 1 | id | A | 0 | NULL | NULL | | BTREE | | |
| userInfo | 0 | joint_unique_index | 1 | name | A | 0 | NULL | NULL | | BTREE | | 联合唯一索引 |
| userInfo | 0 | joint_unique_index | 2 | age | A | 0 | NULL | NULL | | BTREE | | 联合唯一索引 |
+----------+------------+--------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+--------------------+
3 rows in set (0.00 sec)
Now, when I use the following query statement, its type is All:
> DESC SELECT * FROM userInfo WHERE age = 18;
+----+-------------+----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | SIMPLE | userInfo | NULL | ALL | NULL | NULL | NULL | NULL | 1 | 100.00 | Using where |
+----+-------------+----------+------------+------+---------------+------+---------+------+------+----------+-------------+
I can understand this behavior, because according to the leftmost prefix matching feature, age will not be used as an index column when querying.
But when I use the following statement to query, its type is Index:
> DESC SELECT name, age FROM userInfo WHERE age = 18;
+----+-------------+----------+------------+-------+---------------+--------------------+---------+------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+-------+---------------+--------------------+---------+------+------+----------+--------------------------+
| 1 | SIMPLE | userInfo | NULL | index | NULL | joint_unique_index | 132 | NULL | 1 | 100.00 | Using where; Using index |
+----+-------------+----------+------------+-------+---------------+--------------------+---------+------+------+----------+--------------------------+
1 row in set, 1 warning (0.00 sec)
I can't understand how this result is produced. According to Example 1, the age as the query condition does not satisfy the leftmost prefix matching feature, but from the results, its type is actually Index! Is this an optimization in MySQL?
When I try to make sure I use indexed columns as query conditions, their type is always ref, as shown below:
> DESC SELECT * FROM userInfo WHERE name = "Jack";
+----+-------------+----------+------------+------+--------------------+--------------------+---------+-------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+------+--------------------+--------------------+---------+-------+------+----------+-------+
| 1 | SIMPLE | userInfo | NULL | ref | joint_unique_index | joint_unique_index | 128 | const | 1 | 100.00 | NULL |
+----+-------------+----------+------------+------+--------------------+--------------------+---------+-------+------+----------+-------+
1 row in set, 1 warning (0.00 sec)
> DESC SELECT name, age FROM userInfo WHERE name = "Jack";
+----+-------------+----------+------------+------+--------------------+--------------------+---------+-------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+------+--------------------+--------------------+---------+-------+------+----------+-------------+
| 1 | SIMPLE | userInfo | NULL | ref | joint_unique_index | joint_unique_index | 128 | const | 1 | 100.00 | Using index |
+----+-------------+----------+------------+------+--------------------+--------------------+---------+-------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
Please tell me why when I use age as a query, the first result is ALL, but the second result is INDEX. Is this the result of MySQL optimization?
In other words, when SELECT * is used, the index is not used, but when SELECT joint_col1, joint_col2 is used with a condition on joint_col2, the index is used (the type is index). Why does this difference occur?
Simplifying a bit, an index (name, age) is basically the same as if you had another table (name, age, id) with a copy of those values. The primary key is (for InnoDB) included for technical reasons - MySQL uses it to find the full row in the original table.
So you can basically think of it as if you have 2 tables: (id, name, age, gender) and (name, age, id), both with the same amount of rows. And both have the ability to jump to/skip specific rows if you provide the leftmost columns.
If you do
SELECT * FROM userInfo WHERE age = 18;
MySQL has to read, as you expected, every row of the table, as there is no way to find rows with age = 18 faster - just as you concluded, there is no index with age as the leftmost column.
If you do
SELECT name, age FROM userInfo WHERE age = 18;
the situation doesn't change a lot: MySQL will also have to read every row, and still cannot use the index on (name, age) to limit the number of rows it has to read.
But MySQL can use a trick: since you only need the columns name and age, it can read all rows from the index-"table" and still have all information it needs, as the index is a covering index (it covers all required columns).
Why would MySQL do that? Because it has to read less data in total than when reading the complete table: the index stores the information you want in fewer bytes (as it doesn't include gender). Reading less data to get all the information you need is better/faster than reading more data to get the same information. So MySQL will do just that.
But to emphasize it: your query still has to read all rows, it is still basically a full table scan ("ALL") - just on a "table" (the index) with fewer columns, to save some bytes. While you won't notice a difference with one tinyint column, if your table has many columns or large columns, it's actually a relevant speedup.
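The "index as a second table" picture can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the rows are hypothetical, the index is a plain sorted list of (name, age, id) tuples, and the names find_by_name and find_by_age are invented for illustration. It does not reflect how InnoDB actually stores a B-tree.

```python
import bisect

# Hypothetical rows standing in for userInfo: (id, name, age, gender).
table = [
    (1, "Jack", 18, 1),
    (2, "Anna", 25, 0),
    (3, "Mike", 18, 1),
    (4, "Zoe", 31, 0),
]

# The (name, age) index modelled as a sorted list of (name, age, id) entries.
index = sorted((name, age, row_id) for row_id, name, age, _gender in table)

def find_by_name(name):
    """Leftmost column given: jump to the first match, stop at the first miss."""
    examined = 0
    matches = []
    start = bisect.bisect_left(index, (name,))
    for entry in index[start:]:
        examined += 1
        if entry[0] != name:
            break
        matches.append(entry)
    return matches, examined

def find_by_age(age):
    """Leftmost column missing: every index entry has to be examined."""
    matches = [entry for entry in index if entry[1] == age]
    return matches, len(index)

print(find_by_name("Jack"))  # examines only the entries around "Jack"
print(find_by_age(18))       # examines the whole index, but never the table
```

In this model find_by_age never touches table at all, which is the covering-index effect: the scan is still complete, but it runs over the narrower structure.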
The "leftmost" rule applies to the WHERE clause versus the INDEX.
INDEX(name, age) is useful for WHERE name = '...' or WHERE name = '...' AND ((anything else)) because name is leftmost in the index.
What you have is WHERE age = ... ((and nothing else)), so you need INDEX(age) (or INDEX(age, ...)).
In particular, SELECT name, age FROM userInfo WHERE age = 18;:
INDEX(age) -- good
INDEX(age, name) -- better because it is "covering".
The order of columns in the WHERE does not matter; the order in the INDEX does matter.

NOT IN subquery versus ON != Operation

I have two tables called ny_clean (3454602 entries) and pickup_0_ids_temp_table (2739268 entries). Both have an id CHAR(11) column that is the primary key, with a BTREE index on top of it (MySQL 5.7).
The id values in pickup_0_ids_temp_table are a subset of those in ny_clean, and I want to get ny_clean without the rows whose id appears in pickup_0_ids_temp_table.
Option 1:
EXPLAIN
SELECT *
FROM pickup_0_ids_temp_table as t
JOIN ny_clean as n
ON n.id != t.id;
+----+-------------+----------+------------+-------+---------------+-------------------+---------+------+---------+----------+-----------------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+-------+---------------+-------------------+---------+------+---------+----------+-----------------------------------------------------------------+
| 1 | SIMPLE | t | NULL | index | NULL | PRIMARY | 11 | NULL | 2734512 | 100.00 | Using index |
| 1 | SIMPLE | ny_clean | NULL | index | NULL | btree_pk_ny_clean | 11 | NULL | 3445904 | 90.00 | Using where; Using index; Using join buffer (Block Nested Loop) |
+----+-------------+----------+------------+-------+---------------+-------------------+---------+------+---------+----------+-----------------------------------------------------------------+
Option 2:
EXPLAIN
SELECT *
FROM ny_clean as n
WHERE n.id NOT IN (
SELECT id
FROM pickup_0_ids_temp_table);
+----+--------------------+-------------------------+------------+-----------------+------------------------+---------+---------+------+---------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-------------------------+------------+-----------------+------------------------+---------+---------+------+---------+----------+-------------+
| 1 | PRIMARY | n | NULL | ALL | NULL | NULL | NULL | NULL | 3445904 | 100.00 | Using where |
| 2 | DEPENDENT SUBQUERY | pickup_0_ids_temp_table | NULL | unique_subquery | PRIMARY,btree_pickup_0 | PRIMARY | 11 | func | 1 | 100.00 | Using index |
+----+--------------------+-------------------------+------------+-----------------+------------------------+---------+---------+------+---------+----------+-------------+
I then use one of the options inside this larger query
EXPLAIN
INSERT INTO y
SELECT id, pickup_longitude, pickup_latitude
FROM x
JOIN
(OPTION 1 OR 2) as z
ON z.id = x.id;
When I used Option 1 inside the larger query, it ran for two days without finishing. Option 2, on the other hand, did the job in less than 30 minutes.
My Question: Why is that?
Following the MySQL documentation (https://dev.mysql.com/doc/refman/5.7/en/subquery-materialization.html), I would suspect that it is due to materialization of the subquery, but how would I check this?
And am I interpreting the EXPLAIN output wrong? Judging from it, I would expect Option 1 to be faster, since it uses an index on both tables.
Or does it have to do with the larger query?
Thanks in advance
Your Option 1 doesn't do what you think it does.
If you have two tables
n.id t.id
1 1
2 2
3 3
ON n.id != t.id;
You get:
1,2
1,3
2,1
2,3
3,1
3,2
That is almost a cartesian product. So 3.4 million x 2.7 million ≈ 9.18 trillion rows.
Then you try to JOIN against that result, and because the materialized table doesn't have an index, it will take a very long time.
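The difference between the two options is easy to reproduce on a tiny scale. The sketch below uses Python's sqlite3 module with two hypothetical three-row and two-row tables named n and t. The join semantics are the same as in MySQL, although the execution plans are not comparable. The last lines compute the pair count for the real table sizes, assuming (as stated in the question) that the ids are unique and t is a subset of n.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE n (id INTEGER PRIMARY KEY);
    CREATE TABLE t (id INTEGER PRIMARY KEY);
    INSERT INTO n VALUES (1), (2), (3);
    INSERT INTO t VALUES (1), (2);
""")

# Option 1: every (n, t) pair whose ids differ.
join_rows = conn.execute(
    "SELECT n.id, t.id FROM t JOIN n ON n.id != t.id ORDER BY n.id, t.id"
).fetchall()

# Option 2: rows of n whose id does not appear in t.
anti_rows = conn.execute(
    "SELECT id FROM n WHERE id NOT IN (SELECT id FROM t) ORDER BY id"
).fetchall()

print(join_rows)  # pairs, not the rows of n that are missing from t
print(anti_rows)  # the rows of n that are missing from t

# Pair count for the sizes in the question: all pairs minus the equal-id pairs.
n_rows, t_rows = 3454602, 2739268
estimated_pairs = n_rows * t_rows - t_rows
print(f"{estimated_pairs:.2e}")
```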

Strange behavior of RAND in MariaDB: Single RAND delivers more than 1 result

When running the following query:
SELECT productid
FROM product
WHERE productid=ROUND(RAND()*(SELECT MAX(productid) FROM product));
The result should be 0 or 1 results (0 due to data gaps, 1 if a record is found), however it results in multiple results a good number of times (very easy to reproduce, 90% of queries have more than 1 result).
Sample output:
+-----------+
| productid |
+-----------+
| 11701 |
| 20602 |
| 22029 |
| 24994 |
+-----------+
(Number of records in DB is about 30k).
Running a single SELECT RAND() always results in a single result.
Explain:
explain SELECT productid FROM product WHERE productid=ROUND(RAND()*(SELECT MAX(productid) FROM product));
+----+-------------+---------+------------+-------+---------------+--------------+---------+------+-------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------+------------+-------+---------------+--------------+---------+------+-------+----------+------------------------------+
| 1 | PRIMARY | product | NULL | index | NULL | idx_prod_url | 2003 | NULL | 31197 | 10.00 | Using where; Using index |
| 2 | SUBQUERY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+---------+------------+-------+---------------+--------------+---------+------+-------+----------+------------------------------+
Who can explain this behavior?
Follow up:
Following Martin's remark, I rewrote the query as:
SELECT productid FROM product
WHERE productid=(SELECT ROUND(RAND()*(SELECT MAX(productid) FROM product)));
Explain:
explain SELECT productid FROM product WHERE productid=(SELECT ROUND(RAND()*(SELECT MAX(productid) FROM product)));
+----+----------------------+---------+------------+-------+---------------+--------------+---------+------+-------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+----------------------+---------+------------+-------+---------------+--------------+---------+------+-------+----------+------------------------------+
| 1 | PRIMARY | product | NULL | index | NULL | idx_prod_url | 2003 | NULL | 31197 | 100.00 | Using where; Using index |
| 2 | UNCACHEABLE SUBQUERY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
| 3 | SUBQUERY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
+----+----------------------+---------+------------+-------+---------------+--------------+---------+------+-------+----------+------------------------------+
However, despite the changed plan, the behavior stays the same.
Follow up 2:
Using an INNER JOIN, the behavior disappears:
SELECT a.productid FROM product a
INNER JOIN (SELECT ROUND(RAND()*(SELECT MAX(productid))) as productid
FROM product) b ON a.productid=b.productid;
Explain:
explain SELECT a.productid FROM product a INNER JOIN (SELECT ROUND(RAND()*(SELECT MAX(productid))) as productid FROM product) b ON a.productid=b.productid;
+----+--------------------+------------+------------+--------+---------------+--------------+---------+-------+-------+----------+----------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+------------+------------+--------+---------------+--------------+---------+-------+-------+----------+----------------+
| 1 | PRIMARY | <derived2> | NULL | system | NULL | NULL | NULL | NULL | 1 | 100.00 | NULL |
| 1 | PRIMARY | a | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | Using index |
| 2 | DERIVED | product | NULL | index | NULL | idx_prod_url | 2003 | NULL | 31197 | 100.00 | Using index |
| 3 | DEPENDENT SUBQUERY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
+----+--------------------+------------+------------+--------+---------------+--------------+---------+-------+-------+----------+----------------+
Try this:
SELECT productid
FROM product
ORDER BY rand() LIMIT 1;
See MySQL Select Random Records for other random select options.
Explanation:
When the query in the original post executes, Maria says, "I need to look at data from the table, product, so let me start with the first row." Then, "Let me check the WHERE clause to see if I should use this row." At that point, Maria calculates a random number and multiplies it by the largest productid, rounding it to an integer. She checks to see if that integer equals the productid from the first row. If it does not, then she forgets about that row. If it does, then she selects the productid from that row and holds onto it. Either way, she then moves on to the second row. She calculates a (new) random number, multiplies it by the largest productid, and checks whether it equals the productid from the second row. If it does not, she moves on. If it does, she adds that productid to her list of selected values. She does the same thing on the third row, etc. The final output is then a list of zero to MAX(productid) productid values, each one independently selected with a probability of 1/MAX(productid).
When the query from Follow up 2 executes, the left side of the join is just a table. For the right side of the join, Maria says, "I need to look at data from the table, product. Since there is no WHERE clause, I will look at the whole table. Oh my, I have an aggregate function, MAX(productid), in the SELECT clause. That gives me a single value from the whole table." She calculates a single random number and multiplies it by that value, creating a table, b, consisting of a single column, productid, with one row. That table is then inner joined with the table, product, to find every row with a matching productid. That is at most one row (none if the random value falls in a gap in the ids), and the productid from that row is selected.
Note that if all you are looking for is the productid and you do not need any other data from the selected row, then the right side of the join is all you should need.
SELECT ROUND(RAND()*(SELECT MAX(productid))) as productid from product;
Note also that ROUND is probably not the right function for what you are trying to do. This might be what you really want:
SELECT FLOOR(1+RAND()*(SELECT MAX(productid))) as productid from product;
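The per-row evaluation described above can be simulated without a database. The following is a rough Python sketch, not MariaDB's actual executor: the table is a hypothetical gap-free range of 1000 ids, the function names are invented, and Python's round() stands in for ROUND() (the two differ only in how exact .5 ties are handled).

```python
import random

rng = random.Random(0)        # fixed seed so the run is repeatable
product_ids = range(1, 1001)  # hypothetical table: ids 1..1000, no gaps
max_id = max(product_ids)

def per_row_matches():
    """RAND() re-evaluated for every row, as in the original WHERE clause."""
    return [pid for pid in product_ids
            if pid == round(rng.random() * max_id)]

def single_draw_matches():
    """RAND() evaluated once, then compared against every row."""
    target = round(rng.random() * max_id)
    return [pid for pid in product_ids if pid == target]

per_row_counts = [len(per_row_matches()) for _ in range(50)]
single_counts = [len(single_draw_matches()) for _ in range(50)]

print("per-row draw, result sizes:", sorted(set(per_row_counts)))
print("single draw, result sizes: ", sorted(set(single_counts)))
```

With a fresh random number per row, the result size varies from run to run; with a single draw it can never exceed one row.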

MySQL query optimisation help

I'm hoping you can help me get on the right track to start optimising my queries. I've never thought too much about optimisation before, but I have a few queries similar to the one below and want to start concentrating on improving their efficiency. An example of a query which I badly need to optimise is as follows:
SELECT COUNT(*) AS `records_found`
FROM (`records_owners` AS `ro`, `records` AS `r`)
WHERE r.reg_no = ro.contact_no
AND `contacted_email` <> "0000-00-00"
AND `contacted_post` <> "0000-00-00"
AND `ro`.`import_date` BETWEEN "2010-01-01" AND "2010-07-11" AND `r`.`pa_date_of_birth` > "2010-01-01" AND EXISTS ( SELECT `number` FROM `roles` WHERE `roles`.`number` = r.`reg_no` )
Running EXPLAIN on the above produces the following:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+--------+---------------+---------+---------+---------------------------------------+-------+-------------+
| 1 | PRIMARY | r | ALL | NULL | NULL | NULL | NULL | 21533 | Using where |
| 1 | PRIMARY | ro | eq_ref | PRIMARY | PRIMARY | 4 | r.reg_no | 1 | Using where |
| 2 | DEPENDENT SUBQUERY | roles | ALL | NULL | NULL | NULL | NULL | 189 | Using where |
As you can see, you have a dependent subquery, which is one of the worst things performance-wise in MySQL. See here for tips:
http://dev.mysql.com/doc/refman/5.0/en/select-optimization.html
http://dev.mysql.com/doc/refman/5.0/en/in-subquery-optimization.html

What is the significance of the order of statements in mysql explain output?

This is the MySQL EXPLAIN plan for one of the queries I am looking into.
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------+
| 1 | SIMPLE | table2 | index | NULL | PRIMARY | 4 | NULL | 6 | |
| 1 | SIMPLE | table3 | ALL | NULL | NULL | NULL | NULL | 23 | |
| 1 | SIMPLE | table1 | ALL | NULL | NULL | NULL | NULL | 8 | |
| 1 | SIMPLE | table5 | index | NULL | PRIMARY | 4 | NULL | 1 | |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------+
4 rows in set (0 sec)
What is the significance of the order of statements in this output ?
Does it mean that table5 is read before all others ?
The tables are listed in the output in the order that MySQL would read them while processing the query. You can read more about the Explain plan output here.
Additionally, the output tells me:
The optimizer treated the query as a single SELECT that joins four tables (every row has id 1). Being a "SIMPLE" select type, the query does not use UNION or subqueries.
Two of the tables are read through an index (type index, which is a full scan of the index), in both cases the primary key (based on the key column). The other two are read with a full table scan (type ALL), without using any index.