Mysql improve query speed involving multiple tables - mysql

I have the following query
SELECT a.id, b.id from table1 AS a, table2 AS b
WHERE a.table2_id IS NULL
AND a.plane = SUBSTRING(b.imb, 1, 20)
AND (a.stat LIKE "f%" OR a.stat LIKE "F%")
Here is the output of EXPLAIN
+----+-------------+-------+------+-------------------------------------------------------------------------------------------+------------------------------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+-------------------------------------------------------------------------------------------+------------------------------+---------+------+----------+-------------+
| 1 | SIMPLE | b | ALL | NULL | NULL | NULL | NULL | 28578039 | |
| 1 | SIMPLE | a | ref | index_on_plane,index_on_table2_id_id,mysql_confirmstat_on_stat | index_on_plane | 258 | func| 2 | Using where |
+----+-------------+-------+------+-------------------------------------------------------------------------------------------+------------------------------+---------+------+----------+-------------+
The query takes around 80 minutes to execute.
The indexes on table1 are as follows
+--------------+------------+--------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------------+------------+--------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| table1 | 0 | PRIMARY | 1 | id | A | 50319117 | NULL | NULL | | BTREE | | |
| table1 | 1 | index_on_post | 1 | post | A | 7188445 | NULL | NULL | YES | BTREE | | |
| table1 | 1 | index_on_plane | 1 | plane | A | 25159558 | NULL | NULL | YES | BTREE | | |
| table1 | 1 | index_on_table2_id | 1 | table2_id | A | 25159558 | NULL | NULL | YES | BTREE | | |
| table1 | 1 | index_on_stat | 1 | stat | A | 187 | NULL | NULL | YES | BTREE | | |
+--------------+------------+--------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
and table2 indexes are.
+-------+------------+---------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------+------------+---------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| table2 | 0 | PRIMARY | 1 | id | A | 28578039 | NULL | NULL | | BTREE | | |
| table2 | 1 | index_on_post | 1 | post | A | 28578039 | NULL | NULL | YES | BTREE | | |
| table2 | 1 | index_on_ver | 1 | ver | A | 1371 | NULL | NULL | YES | BTREE | | |
| table2 | 1 | index_on_imb | 1 | imb | A | 28578039 | NULL | NULL | YES | BTREE | | |
+-------+------------+---------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
How can the execution time of this query be improved?
Here is the updated explain
EXPLAIN SELECT STRAIGHT_JOIN a.id, b.id from table1 AS a JOIN table2 AS b
ON a.plane=substring(b.imb,1,20)
WHERE a.table2_id IS NULL
AND (a.stat LIKE "f%" OR a.stat LIKE "F%");
+----+-------------+-------+------+-------------------------------------------------------------------------------------------+-------------------------------+---------+-------+----------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+-------------------------------------------------------------------------------------------+-------------------------------+---------+-------+----------+--------------------------------+
| 1 | SIMPLE | a | ref | index_on_plane,index_on_table2_id,index_on_stat | index_on_table2_id | 5 | const | 500543 | Using where |
| 1 | SIMPLE | b | ALL | NULL | NULL | NULL | NULL | 28578039 | Using where; Using join buffer |
+----+-------------+-------+------+-------------------------------------------------------------------------------------------+-------------------------------+---------+-------+----------+--------------------------------+
SQL fiddle link http://www.sqlfiddle.com/#!2/362a6/4

Your schema dooms your query to slowness, in at least three ways. You are going to need to modify your schema to get anything like decent performance out of this. I can see three ways to fix your schema.
First way (probably very easy to fix):
a.stat LIKE "f%" OR a.stat LIKE "F%"
This OR operation adds work and may noticeably lengthen the runtime of your query. But if you set the collation of your stat column to something case-insensitive, a single pattern matches both cases, and you can change this to
a.stat LIKE "f%"
You already have an index on this column.
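A rough sketch of checking and changing the collation follows. The column type VARCHAR(32) and the utf8_general_ci collation are assumptions for illustration, not taken from the question; MODIFY needs the column's real definition restated in full.

```sql
-- Show the current collation of table1.stat (see the Collation column).
SHOW FULL COLUMNS FROM table1 LIKE 'stat';

-- Assumed definition: replace VARCHAR(32) with the column's real type and length.
-- A collation ending in _ci is case-insensitive, so LIKE 'f%' also matches 'F...'.
ALTER TABLE table1
  MODIFY stat VARCHAR(32) CHARACTER SET utf8 COLLATE utf8_general_ci NULL;
```

If the first statement already reports a _ci collation, the second LIKE is redundant and no ALTER is needed. On a table of this size the ALTER may take a long time, so try it on a copy first.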
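Second way (maybe not so hard to fix). This clause can get in the way of efficient index use. (MySQL can use an index for IS NULL, as the ref access on index_on_table2_id in your updated EXPLAIN shows, so measure before and after making this change.)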
WHERE a.table2_id IS NULL
Can you change the definition of table2_id to NOT NULL and provide a default value (perhaps zero) to indicate missing data? If so, you'll be in good shape because you'll be able to change this to a search predicate that uses an index.
WHERE a.table2_id = 0
Third way (probably hard). The presence of the function in this clause defeats the use of an index in joining.
WHERE ... a.plane = SUBSTRING(b.imb, 1, 20)
You need to make an extra column (yeah, yeah, in Oracle it could be a function index, but who has that kind of money?) called b.plane or something with that substring stored in it.
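One possible way to add and index such a column is sketched here. The VARCHAR(20) type and the index name index_on_t2_plane are assumptions; the character set and collation of the new column should match those of table1.plane so the join can use the index.

```sql
-- Assumption: VARCHAR(20) is wide enough, since the join uses the first 20 characters of imb.
ALTER TABLE table2 ADD COLUMN plane VARCHAR(20) NULL;

-- One-time backfill of the existing rows.
UPDATE table2 SET plane = SUBSTRING(imb, 1, 20);

-- Index the new column so the join can look rows up by value.
ALTER TABLE table2 ADD INDEX index_on_t2_plane (plane);
```

New and updated rows must keep plane in sync with imb, either in the application or with a trigger. On MySQL 5.7 or later, a stored generated column, declared as plane VARCHAR(20) AS (SUBSTRING(imb, 1, 20)) STORED, should do that automatically.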
If you do all this stuff and refactor your query just a bit, here's what it will look like:
SELECT a.id AS aid,
b.id AS bid
FROM table1 AS a
JOIN table2 AS b ON a.plane = b.plane /* the new column */
WHERE a.stat LIKE 'f%'
AND a.table2_id = 0
Finally, you can probably tweak this performance up a bit by creating the following compound indexes as covering indexes for the query. Look up covering indexes if you're not sure what that means.
table1 (table2_id, stat, plane, id)
table2 (plane, id) /* plane is your new column */
There's a tradeoff in covering indexes: they slow down insertion and update operations, but speed up queries. Only you have enough information to make that tradeoff wisely.
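As a sketch, the two compound indexes could be created like this. The index names are made up for illustration, and the second statement assumes the new table2.plane column already exists.

```sql
-- Covering index for the table1 side: filter columns first, then the join column and id.
ALTER TABLE table1 ADD INDEX idx_t1_cover (table2_id, stat, plane, id);

-- Covering index for the table2 side: join column first, then the selected id.
ALTER TABLE table2 ADD INDEX idx_t2_cover (plane, id);
```

If the indexes cover the query, EXPLAIN should show Using index in the Extra column for both tables. On InnoDB, secondary indexes already carry the primary key, so the trailing id is redundant there but harmless.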

The column on which the join is performed must be indexed, and the MySQL optimiser must actually use that index; this minimises the number of rows examined (the join size).
Try this one
SELECT STRAIGHT_JOIN a.id, b.id from table1 AS a JOIN table2 AS b ON a.plane=substring(b.imb,1,20)
WHERE a.table2_id IS NULL and (a.stat LIKE "f%" OR a.stat LIKE "F%")
Check the execution plan first. If it still does not use the index_on_imb index, create a composite index on table2.imb and table2.id, with table2.imb as the leading column.

A derived table may improve performance in this case, depending on the indexes index_on_table2_id and index_on_stat.
SELECT a.id, b.id from table1 AS a, table2 AS b
WHERE a.table2_id IS NULL
AND a.plane = SUBSTRING(b.imb, 1, 20)
AND (a.stat LIKE "f%" OR a.stat LIKE "F%")
It may be rewritten as follows. The derived table forces MySQL to evaluate the table1 filters first, checking about 500543 rows, as the last EXPLAIN reported.
SELECT a.id, b.id
FROM (SELECT id, plane FROM table1 WHERE table2_id IS NULL AND (stat LIKE "f%" OR stat LIKE "F%")) a
INNER JOIN table2 b
ON a.plane = SUBSTRING(b.imb, 1, 20)

Aside from my comment about ID columns, it appears you are trying to back-fill a join on "plane" instead of on the ID columns. If I am correct, you want all records from table2 where there is no match in table1.
select
a.id,
b.id
from
table2 b
left join table1 a
on b.id = a.table2_id
AND substr( b.imb, 1, 20 ) = a.plane
AND ( a.stat LIKE "f%"
OR a.stat LIKE "F%")
where
a.table2_id is null
Also, to help the index join, I would have covering indexes so the engine does not have to go back to the raw data to get qualifying records.
table1 -- index ( plane, stat, table2_id, id )
table2 -- index ( imb, id )
But again, please clarify whether the join between the tables is based on a key or not. Since table1 has a column named table2_id, I am GUESSING it relates to table2.id.
A left join basically says: for each record in the left-side table (in my example table2), join to the right-side table (table1) on whatever criteria/conditions apply, here using the key ID column as the primary basis, then the plane and status settings.
So, even though I'm joining the two tables on table2_id, any row that DOES find a match is excluded; only rows that do NOT find a match are included.
Finally, since you are hiding the true basis of the tables, you are leaving those helping to guess. Even if the data were "personal", you are not showing any data, only asking how to get it. A better mental image of what you are trying to get is more useful than bogus table/column names with limited context.

Related

Find the number total rows analyzed in a MySQL JOIN query

I have a MySQL query which JOINs 12 tables. When I EXPLAIN the query, it shows 394699 rows for one table and 185368 rows for another. All other tables have 1-3 rows. The query returns only 472 rows in total, yet it takes more than 1 minute.
Is there any way to check how many rows were examined to produce the result, so that I can find which table costs the most time?
I am giving the query structure below. As the table structure is too large, I am not able to provide it here.
SELECT h.nid,h.attached_nid,h.created, s.field_species_value as species, g.field_gender_value as gender, u.field_unique_id_value as unqid, n.title, dob.field_adult_healthy_weight_value as birth_date, dcolor.field_dog_primary_color_value as dogcolor, ccolor.field_primary_color_value as catcolor, sdcolor.field_dog_secondary_color_value as sdogcolor, sccolor.field_secondary_color_value as scatcolor, dpattern.field_dog_pattern_value as dogpattern, cpattern.field_cat_pattern_value as catpattern
FROM table1 h
JOIN table2 n ON n.nid = h.nid
JOIN table3 s ON n.nid = s.entity_id
JOIN table4 u ON n.nid = u.entity_id
LEFT JOIN table5 g ON n.nid = g.entity_id
LEFT JOIN table6 dob ON n.nid = dob.entity_id
LEFT JOIN table7 AS dcolor ON n.nid = dcolor.entity_id
LEFT JOIN table8 AS ccolor ON n.nid = ccolor.entity_id
LEFT JOIN table9 AS sdcolor ON n.nid = sdcolor.entity_id
LEFT JOIN table10 AS sccolor ON n.nid = sccolor.entity_id
LEFT JOIN table11 AS dpattern ON n.nid = dpattern.entity_id
LEFT JOIN table12 AS cpattern ON n.nid = cpattern.entity_id
WHERE h.title = '4208'
AND ((h.created BETWEEN 1483257600 AND 1485935999))
AND h.uid!=1
AND h.uid IN(
SELECT etid
FROM `table`
WHERE gid=464
AND entity_type='user')
AND h.attached_nid>0
ORDER BY CAST(h.created as UNSIGNED) DESC;
Below is the EXPLAIN result which I get
+------+--------------+---------------+--------+----------------------+---------------------+---------+----------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------+---------------+--------+----------------------+---------------------+---------+----------------------+--------+----------------------------------------------+
| 1 | PRIMARY | s | index | entity_id | field_species_value | 772 | NULL | 394699 | Using index; Using temporary; Using filesort |
| 1 | PRIMARY | u | ref | entity_id | entity_id | 4 | pantheon.s.entity_id | 1 | |
| 1 | PRIMARY | n | eq_ref | PRIMARY | PRIMARY | 4 | pantheon.s.entity_id | 1 | |
| 1 | PRIMARY | g | ref | entity_id | entity_id | 4 | pantheon.s.entity_id | 1 | |
| 1 | PRIMARY | dob | ref | entity_id | entity_id | 4 | pantheon.s.entity_id | 1 | |
| 1 | PRIMARY | dcolor | ref | entity_id | entity_id | 4 | pantheon.s.entity_id | 1 | |
| 1 | PRIMARY | ccolor | ref | entity_id | entity_id | 4 | pantheon.s.entity_id | 1 | |
| 1 | PRIMARY | sdcolor | ref | entity_id | entity_id | 4 | pantheon.s.entity_id | 1 | |
| 1 | PRIMARY | sccolor | ref | entity_id | entity_id | 4 | pantheon.s.entity_id | 1 | |
| 1 | PRIMARY | dpattern | ref | entity_id | entity_id | 4 | pantheon.s.entity_id | 1 | |
| 1 | PRIMARY | cpattern | ref | entity_id | entity_id | 4 | pantheon.s.entity_id | 1 | |
| 1 | PRIMARY | h | ref | attached_nid,nid,uid | nid | 5 | pantheon.s.entity_id | 3 | Using index condition; Using where |
| 1 | PRIMARY | <subquery2> | eq_ref | distinct_key | distinct_key | 4 | func | 1 | Using where |
| 2 | MATERIALIZED | og_membership | ref | entity,gid | gid | 4 | const | 185368 | Using where |
+------+--------------+---------------+--------+----------------------+---------------------+---------+----------------------+--------+----------------------------------------------+
You can find the ROWS_EXAMINED by using the Performance Schema.
Here is a link to the performance schema quick start guide.
https://dev.mysql.com/doc/refman/5.5/en/performance-schema-quick-start.html
This is the query I run in PHP applications, to find out what queries I need to optimize. You should be able to adapt it pretty easily.
The query returns the stats for the statement that was run just before it. So in my apps, I run it after every query, store the results, and at the end of the PHP script I output the stats for every query that ran during the request.
SELECT `EVENT_ID`, TRUNCATE(`TIMER_WAIT`/1000000000000,6) as Duration,
`SQL_TEXT`, `DIGEST_TEXT`, `NO_INDEX_USED`, `NO_GOOD_INDEX_USED`, `ROWS_AFFECTED`, `ROWS_SENT`, `ROWS_EXAMINED`
FROM `performance_schema`.`events_statements_history`
WHERE
`CURRENT_SCHEMA` = '{$database}' AND `EVENT_NAME` LIKE 'statement/sql/%'
AND `THREAD_ID` = (SELECT `THREAD_ID` FROM `performance_schema`.`threads` WHERE `performance_schema`.`threads`.`PROCESSLIST_ID` = CONNECTION_ID())
ORDER BY `EVENT_ID` DESC LIMIT 1;
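The statement event tables queried above exist in MySQL 5.6 and later, and depending on the version the events_statements_history consumer may be switched off by default. A minimal sketch of checking and enabling it, assuming the Performance Schema itself is enabled and the account may update its setup tables:

```sql
-- List the statement consumers and whether each one is collecting.
SELECT NAME, ENABLED
FROM performance_schema.setup_consumers
WHERE NAME LIKE 'events_statements%';

-- Enable the per-thread statement history for the running server.
-- This change is not persisted across a server restart.
UPDATE performance_schema.setup_consumers
SET ENABLED = 'YES'
WHERE NAME = 'events_statements_history';
```

If the stats query returns no rows, a disabled consumer is one of the first things worth ruling out.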
To decrease the number of rows accessed from og_membership, try adding an index containing the gid, entity_type, and etid fields. Including gid and entity_type should make the lookup more performant and including etid will make the index a covering index.
After adding the index, run EXPLAIN again to look at the results. Based on the new explain plan, either keep the index, remove the index, and/or add an additional index. Keep doing this until you get results you are satisfied with.
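The suggested index might be created and checked like this. The index name idx_gid_type_etid is made up for illustration, and the table name og_membership is taken from the EXPLAIN output.

```sql
-- Equality columns first (gid, entity_type), then etid so the index covers the subquery.
ALTER TABLE og_membership
  ADD INDEX idx_gid_type_etid (gid, entity_type, etid);

-- Re-check the plan of the subquery on its own.
EXPLAIN
SELECT etid
FROM og_membership
WHERE gid = 464
  AND entity_type = 'user';
```

If the new index is chosen and covers the subquery, the Extra column should include Using index and the rows estimate should fall well below 185368.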
You will certainly want to try to eliminate any mention of Using temporary or Using filesort. Using temporary means a temporary table is being used to hold an intermediate result, probably because of its sheer size. Using filesort means the ordering isn't satisfied by an index and is done by sorting the matching rows.
A detailed explanation of EXPLAIN output can be found at https://dev.mysql.com/doc/refman/5.7/en/explain-output.html.
Key-Value (EAV) schema sucks.
Indexes:
table1: INDEX(title, created)
table1: INDEX(uid, title, created)
table: INDEX(gid, entity_type, etid)
table* -- Is `entity_id` already an index? Can it be the PRIMARY KEY?
Does nid need to be NULL instead of NOT NULL?
If those don't do enough, try turning the IN ( SELECT ... ) into a JOIN ( SELECT ... ) on the matching column (h.uid = etid in this query).
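The IN-to-JOIN rewrite might look roughly like this. Only the columns of h are shown and the other eleven joins are left out to keep the sketch short; the alias m is made up for illustration.

```sql
SELECT h.nid, h.attached_nid, h.created
FROM table1 AS h
JOIN (
    -- DISTINCT keeps the join from multiplying rows of h if etid repeats.
    SELECT DISTINCT etid
    FROM `table`
    WHERE gid = 464
      AND entity_type = 'user'
) AS m ON m.etid = h.uid
WHERE h.title = '4208'
  AND h.created BETWEEN 1483257600 AND 1485935999
  AND h.uid != 1
  AND h.attached_nid > 0
ORDER BY CAST(h.created AS UNSIGNED) DESC;
```

Whether this beats the materialized subquery depends on the server version and the available indexes, so compare the EXPLAIN output of both forms.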
If you still need help, please provide SHOW CREATE TABLE and EXPLAIN SELECT ...

Index absent from `possible_keys`… but only on some environments

I have a software stack which I run myself, but also deploy to customer premises.
There is a particular query which runs very well in my environment, but runs terribly in the customer's environment.
I have confirmed using EXPLAIN that my environment's query planner sees that there is a great index available (and uses it). Whereas the same query in the customer's environment does not offer that index under possible_keys.
Here's the full query, somewhat anonymized:
SELECT t0.*,
t1.*,
t2.*,
t3.value
FROM table0 t0
LEFT OUTER JOIN table1 t1
ON t0.id = t1.table0_id
LEFT OUTER JOIN table2 t2
ON t1.id = t2.table1_id
AND t2.deleted = 0
LEFT OUTER JOIN table3 t3
ON t0.id = t3.table0_id
AND t3.type = 'whatever'
WHERE t0.business IN ('workcorp')
AND '2016-11-01 00:00:00' <= t0.date
AND t0.date < '2016-12-01 00:00:00'
ORDER BY t0.date DESC
The stage where our environments differ is on JOINing to table3. So theoretically you can ignore a large amount of the query and think of it like this:
SELECT t0.*,
t3.value
FROM table0 t0
LEFT OUTER JOIN table3 t3
ON t0.id = t3.table0_id
AND t3.type = 'whatever'
Both of our environments' query plans agree on how to JOIN to t1 and to t2. But they differ in their plan for how to JOIN to t3.
My environment correctly identifies two possible indexes for JOINing to t3, and correctly identifies that table0_id is the best choice for this query:
+----+-------------+-------+------+--------------------------+-----------+---------+------+-------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------+--------------------------+-----------+---------+------+-------+----------+-------------+
| 1 | SIMPLE | t3 | ref | table0_id,type_and_value | table0_id | 108 | func | 2 | 100.00 | Using where |
+----+-------------+-------+------+--------------------------+-----------+---------+------+-------+----------+-------------+
The customer's environment does not consider the index table0_id to be an option, and falls back to type_and_value (which is a really bad choice):
+----+-------------+-------+------+----------------+----------------+---------+-------+----------------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------+----------------+----------------+---------+-------+----------------+----------+-------------+
| 1 | SIMPLE | t3 | ref | type_and_value | type_and_value | 257 | const | (far too many) | 100.00 | Using where |
+----+-------------+-------+------+----------------+----------------+---------+-------+----------------+----------+-------------+
What happens if we FORCE INDEX?
EXPLAIN EXTENDED SELECT t0.*,
t1.*,
t2.*,
t3.value
FROM table0 t0
LEFT OUTER JOIN table1 t1
ON t0.id = t1.table0_id
LEFT OUTER JOIN table2 t2
ON t1.id = t2.table1_id
AND t2.deleted = 0
LEFT OUTER JOIN table3 t3 FORCE INDEX (table0_id)
ON t0.id = t3.table0_id
AND t3.type = 'whatever'
WHERE t0.business IN ('workcorp')
AND '2016-11-01 00:00:00' <= t0.date
AND t0.date < '2016-12-01 00:00:00'
ORDER BY t0.date DESC
On my environment, I got:
+----+-------------+-------+------+---------------+-----------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+-----------+---------+------+------+-------------+
| 1 | SIMPLE | t3 | ref | table0_id | table0_id | 108 | func | 2 | Using where |
+----+-------------+-------+------+---------------+-----------+---------+------+------+-------------+
Compared to my original query plan (which proposed two possible_keys): this narrowed down the choice to just one.
But the customer got a different result:
+----+-------------+-------+------+---------------+------+---------+-------+---------+----------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------+---------------+------+---------+-------+---------+----------+----------------------------------------------------+
| 1 | SIMPLE | t3 | ALL | NULL | NULL | NULL | NULL | (loads) | 100.00 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------+---------------+------+---------+-------+---------+----------+----------------------------------------------------+
Adding the FORCE INDEX narrows down the customer's possible_keys from one bad choice, to no choices.
So why is it that the customer's environment does not have the same indexes available in possible_keys? Naturally I was given to suspect "maybe they don't have that index". So we did a SHOW INDEXES FROM table3. Here's my environment (for comparison):
+--------+------------+-----------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------+------------+-----------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| table3 | 0 | PRIMARY | 1 | id | A | 16696 | NULL | NULL | | BTREE | | |
| table3 | 1 | table0_id | 1 | table0_id | A | 16696 | NULL | NULL | | BTREE | | |
| table3 | 1 | type_and_value | 1 | type | A | 14 | NULL | NULL | | BTREE | | |
| table3 | 1 | type_and_value | 2 | value | A | 8348 | NULL | NULL | | BTREE | | |
+--------+------------+-----------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
Their environment had the same index, table0_id available:
+--------+------------+-----------------+--------------+-----------------+-----------+-------------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------+------------+-----------------+--------------+-----------------+-----------+-------------------+----------+--------+------+------------+---------+---------------+
| table3 | 1 | table0_id | 1 | table0_id | A | (same as PRIMARY) | NULL | NULL | | BTREE | | |
+--------+------------+-----------------+--------------+-----------------+-----------+-------------------+----------+--------+------+------------+---------+---------------+
I was also careful to ask "is this a slave? is the master the same?": they assured me that all instances had this index, as required.
So I thought "maybe the index is broken in some way?" And proposed that they do the very simplest query relying upon that index:
EXPLAIN EXTENDED SELECT *
FROM table3
WHERE table0_id = 'whatever'
In this case: their environment behaved the same as mine (and correctly), proposing the use of the index table0_id:
+----+-------------+--------+------+---------------+-----------+---------+-------+------+----------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------+---------------+-----------+---------+-------+------+----------+-----------------------+
| 1 | SIMPLE | table3 | ref | table0_id | table0_id | 108 | const | 1 | 100.00 | Using index condition |
+----+-------------+--------+------+---------------+-----------+---------+-------+------+----------+-----------------------+
So they definitely have that index. And their query planner can recognise that it is eligible for use (for some queries at least).
What's going on here? Why is table0_id ineligible for use on certain queries, but only on the customer's environment? Could it be that the index is broken in some way? Or that the query planner is making a mistake?
Are there any other tests I can do to figure out why it's not using the index for this query?
Turns out it was the charsets (and/or the collations).
I used this query to reveal the charset in each environment:
SELECT table_name, column_name, character_set_name FROM information_schema.`COLUMNS`
WHERE table_schema = "my_cool_database"
AND table_name IN ("t0", "t3")
ORDER BY 1 DESC, 2
And for bonus points I checked the character collation in each environment:
SHOW FULL COLUMNS FROM t0;
SHOW FULL COLUMNS FROM t3;
In my environment: all columns in both tables had utf8 charset and utf8_unicode_ci collation.
In the customer's environment: t0 matched my environment exactly, yet t3 was a unique snowflake; its columns had latin1 charset and latin1_swedish_ci collation.
So, what we were seeing is that the index that existed on t3.table0_id (a latin1 column) could not be used to JOIN to a utf8 table. Hence the index worked fine for:
SELECT *
FROM table3
WHERE table0_id = 'whatever'
Yet the index could not be used for:
SELECT
t0.id,
t3.value
FROM t0
LEFT OUTER JOIN t3
ON t0.id = t3.table0_id
Similar symptoms are described on the Percona blog, John Watson's blog and Baron Schwartz's blog.
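One way to bring the odd table in line with the rest is sketched below. It assumes the data in t3 really is latin1-encoded and that rebuilding the table is acceptable; the answer does not say how the mismatch was actually resolved.

```sql
-- Convert every character column of t3 to the charset and collation used by t0.
-- This rebuilds the table, so take a backup and test on a copy first.
ALTER TABLE t3 CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;

-- Confirm the result: the Collation column should now read utf8_unicode_ci.
SHOW FULL COLUMNS FROM t3;
```

After the conversion, re-running the original EXPLAIN should list table0_id under possible_keys again.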

Why does the output of EXPLAIN change after each SHOW index?

I was trying to improve performance on some queries through indexes using EXPLAIN, and I noticed that each time I ran SHOW index FROM TableB; the value in the rows column of the EXPLAIN output for a query changed.
Ex:
mysql> EXPLAIN Select A.id
From TableA A
Inner join TableB B
On A.address = B.address And A.code = B.code
Group by A.id
Having count(distinct B.id) = 1;
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------------+-------+----------------------------------------------+
| 1 | SIMPLE | B | index | test_index | PRIMARY | 518 | NULL | 10561 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | A | eq_ref | PRIMARY | PRIMARY | 514 | db.B.address,db.B.code | 1 | |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------------+-------+----------------------------------------------+
2 rows in set (0.00 sec)
mysql> show index from TableB;
+-----------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-----------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| TableB | 0 | PRIMARY | 1 | id | A | 7 | NULL | NULL | | BTREE | |
| TableB | 0 | PRIMARY | 2 | address | A | 21 | NULL | NULL | | BTREE | |
| TableB | 0 | PRIMARY | 3 | code | A | 10402 | NULL | NULL | | BTREE | |
| TableB | 1 | test_index | 1 | address | A | 1 | NULL | NULL | | BTREE | |
| TableB | 1 | test_index | 2 | code | A | 10402 | NULL | NULL | | BTREE | |
| TableB | 1 | test_index | 3 | id | A | 10402 | NULL | NULL | | BTREE | |
+-----------+------------+--------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
6 rows in set (0.03 sec)
and...
mysql> EXPLAIN Select A.id
From TableA A
Inner join TableB B
On A.address = B.address And A.code = B.code Group by A.id
Having count(distinct B.id) = 1;
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------------+-------+----------------------------------------------+
| 1 | SIMPLE | B | index | test_index | PRIMARY | 518 | NULL | 9800 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | A | eq_ref | PRIMARY | PRIMARY | 514 | db.B.address,db.B.code | 1 | |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------------+-------+----------------------------------------------+
2 rows in set (0.00 sec)
Why does this happen?
The rows column should be taken as a rough estimate only. It's not a precise number.
It's based on statistical estimates of how many rows will be examined during a query. The actual number of rows cannot be known until you actually execute the query.
The statistics are based on samples read from the table periodically. These samples are re-read occasionally, for example after you run ANALYZE TABLE or certain INFORMATION_SCHEMA queries, or certain SHOW statements.
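Assuming the tables are InnoDB, the re-sampling can be triggered or observed with something like the following. Whether innodb_stats_on_metadata is ON by default depends on the server version, and changing it needs administrative privileges.

```sql
-- Re-sample the index statistics for the table on demand.
ANALYZE TABLE TableB;

-- See whether metadata statements such as SHOW INDEX also refresh the statistics.
SHOW VARIABLES LIKE 'innodb_stats_on_metadata';

-- Optional: stop metadata statements from re-sampling, so estimates stay put between ANALYZE runs.
SET GLOBAL innodb_stats_on_metadata = OFF;
```

With the variable OFF, repeated SHOW INDEX calls should no longer shift the rows estimate in EXPLAIN.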
I don't find 20% variation in statistics to be a big deal. In many situations, think of the cost graph as an upturned parabola: you only need to know which side of the minimum point you are on. In complex queries, where the Optimizer is likely to goof, it needs a lot more than simple stats, such as the Histograms of MariaDB 10.0 / 10.1. (I don't have enough experience with them to say whether they make much headway.)
Your particular query is probably going to be performed in only one way, regardless of the statistics. An example of a complicated query would be a JOIN with WHERE clauses filtering each table. The optimizer has to decide which table to start with. Another case is a single table with a WHERE and ORDER BY and they cannot both be handled by a single index -- should it use an index to filter, but then have to sort? or should it use an index for ORDER BY, but then have to filter on the fly?

Mysql intersection query performance

I am quite new to MySQL. I have 2 identical MySQL tables which have 50K rows (70 columns) each. Those tables are updated every day by a datafeed. I need to execute some nested queries such as intersections, subtractions, etc.
One of the queries I try to use is as below.
But it doesn't work properly. Either it takes 5 min. to 10 min. (through terminal) or it does not respond back.
SELECT *
FROM table1
WHERE table1.sku IN (SELECT t1.sku
FROM ((SELECT DISTINCT sku
FROM table2)
UNION ALL
(SELECT DISTINCT sku
FROM table1)) AS t1
GROUP BY sku
HAVING Count(*) >= 2)
How can I make it work faster/properly? How should I configure the tables/columns (index, primary key etc.) Or do I need to make any tuning on the mysql server?
I tried several things. I created indexes on the 'sku' columns, which are varchar(75). My database server runs on a 1-core (Digital Ocean) server with 512MB memory.
--- query with 'EXPLAIN'
+----+--------------------+-----------------------+-------+---------------+---------+---------+------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-----------------------+-------+---------------+---------+---------+------+-------+---------------------------------+
| 1 | PRIMARY | table1 | ALL | NULL | NULL | NULL | NULL | 30260 | Using where |
| 2 | DEPENDENT SUBQUERY | <derived3> | ALL | NULL | NULL | NULL | NULL | 65677 | Using temporary; Using filesort |
| 3 | DERIVED | table2 | range | NULL | sku_idx | 227 | NULL | 31016 | Using index for group-by |
| 4 | UNION | table1 | range | NULL | sku | 227 | NULL | 30261 | Using index for group-by |
| NULL | UNION RESULT | <union3,4> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------------+-----------------------+-------+---------------+---------+---------+------+-------+---------------------------------+
If I understand this particular query correctly, you are trying to display all the records from table1 which have a corresponding sku in table2.
That can be achieved by a much simpler query:
SELECT *
FROM table1
WHERE table1.sku IN (SELECT DISTINCT table2.sku FROM table2 )
GROUP BY table1.sku
Or, with joins:
SELECT table1.*
FROM table1
INNER JOIN table2 ON table1.sku = table2.sku
GROUP BY table1.sku
This should return almost instantly if you have indexes on table1.sku and table2.sku.
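One caveat about the two queries above: because of GROUP BY table1.sku they return one row per sku, whereas the original query returns every table1 row whose sku also appears in table2. On servers running with ONLY_FULL_GROUP_BY enabled, SELECT * combined with that GROUP BY is also typically rejected. A minimal sketch that keeps every matching row, assuming that is the intended result, uses EXISTS instead (the aliases t1 and t2 are only for illustration):

```sql
-- Every row of table1 whose sku also occurs in table2.
-- There is no GROUP BY, so rows of table1 sharing a sku are all returned.
SELECT t1.*
FROM table1 AS t1
WHERE EXISTS (
    SELECT 1
    FROM table2 AS t2
    WHERE t2.sku = t1.sku   -- an index on table2.sku lets this be a lookup
);
```

The EXPLAIN output in the question already lists an index named sku_idx on table2 and one named sku on table1, so no extra index should be needed for this form.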

MySQL Speeding up left outer join / check for null queries

The object of my query is to get all rows from table a where gender = f and username does not exist in table b where campid = xxxx. Here is the query I am using with success:
SELECT `id`
FROM pool
LEFT JOIN sent
ON pool.username = sent.username
AND sent.campid = 'YA1LGfh9'
WHERE sent.username IS NULL
AND pool.gender = 'f'
The problem is that the query takes over 9 minutes to complete. The pool table contains over 10 million rows, and the sent table is eventually going to grow even larger than that. I have created indexes on many of the columns, including username and gender. However, MySQL refuses to use any of my indexes for this query, even when I try FORCE INDEX. Here are my indexes on pool and the EXPLAIN output for my query:
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| pool | 0 | PRIMARY | 1 | id | A | 9326880 | NULL | NULL | | BTREE | |
| pool | 1 | username | 1 | username | A | 9326880 | NULL | NULL | | BTREE | |
| pool | 1 | source | 1 | source | A | 6 | NULL | NULL | | BTREE | |
| pool | 1 | gender | 1 | gender | A | 9 | NULL | NULL | | BTREE | |
| pool | 1 | location | 1 | location | A | 59030 | NULL | NULL | | BTREE | |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
6 rows in set (0.00 sec)
mysql> explain SELECT `id` FROM pool FORCE INDEX (username) LEFT JOIN sent ON pool.username = sent.username AND sent.campid = 'YA1LGfh9' WHERE sent.username IS NULL AND pool.gender = 'f';
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------------------+
| 1 | SIMPLE | pool | ALL | NULL | NULL | NULL | NULL | 9326881 | Using where |
| 1 | SIMPLE | sent | ALL | NULL | NULL | NULL | NULL | 351 | Using where; Not exists |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------------------+
2 rows in set (0.00 sec)
Also, here are my indexes for the sent table:
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| sent | 0 | PRIMARY | 1 | primary_key | A | 351 | NULL | NULL | | BTREE | |
| sent | 1 | username | 1 | username | A | 351 | NULL | NULL | | BTREE | |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
2 rows in set (0.00 sec)
You can see that no indexes are being used, so my query takes far too long. If anyone has a solution that involves reworking the query, please show me an example of how to do it using my data structure, so that there is no confusion about how to implement and test it. Thank you.
First, your original query had everything placed correctly, including the campid condition. If you use a LEFT JOIN from pool to sent and then move a required equality such as the campid test into the WHERE clause, as previously suggested, you effectively convert the join into an INNER JOIN, which requires a matching row on both sides. Leave it as you had it.
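To illustrate the point, here is a sketch of the variant to avoid, with the campid test moved from ON into WHERE. The aliases p and s are only for illustration.

```sql
-- Variant to AVOID: campid is tested in WHERE instead of ON.
-- Unmatched pool rows have s.campid = NULL, so they fail the campid test.
-- Matched rows have a non-NULL s.username, so they fail the IS NULL test.
-- The result is therefore always empty.
SELECT p.id
FROM pool AS p
LEFT JOIN sent AS s
       ON p.username = s.username
WHERE s.campid = 'YA1LGfh9'
  AND s.username IS NULL
  AND p.gender = 'f';
```

Keeping the campid condition in the ON clause restricts which sent rows can match, without discarding the pool rows that have no match.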
You already have an index on user name on the sent table, but I would do the following.
Build an index on the "sent" table on (campid, username) as a composite (i.e. multi-column) index. This way the left join can be resolved from the index for BOTH columns.
On your "pool" table, try a composite index on the three columns (gender, username, id).
By doing this, you avoid having to go through all the actual data pages that hold your 10+ million records. Since the index already contains the columns needed for the comparison, MySQL doesn't have to fetch the actual record and look at its columns; it can use the values in the index directly.
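As a sketch, the two composite indexes could be created as follows. The index names are made up for illustration; the column names are taken from the question.

```sql
-- Composite index for the join: campid first (equality on a constant),
-- then username (equality against pool.username).
CREATE INDEX idx_sent_campid_username
    ON sent (campid, username);

-- Index for the pool side: gender filters the rows, username feeds
-- the join, and id is the only column selected, so the query can be
-- answered from the index alone.
CREATE INDEX idx_pool_gender_username_id
    ON pool (gender, username, id);
```

If the tables use InnoDB (the question does not say), secondary indexes already carry the primary key, so listing id explicitly would be redundant there but harmless. On a 10-million-row table, building the index may itself take a while.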
Also, for grins, I added the keyword "STRAIGHT_JOIN", which tells MySQL to join the tables in the order I list them instead of choosing its own order. MANY times, I've found this to significantly improve query performance. Only rarely have I been told that it did NOT help.
SELECT STRAIGHT_JOIN
p.id
FROM
pool p
LEFT JOIN sent s
ON s.campid = 'YA1LGfh9'
AND p.username = s.username
WHERE
p.gender = 'f'
AND s.username IS NULL
All that said, consider how many records you will still be returning out of the 10+ million. If the pool has 10+ million rows and the single camp has only 5,000 entries in sent, you will still be returning almost the entire set of rows that pass the gender filter.
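Before fetching the rows, it may be worth checking both the plan and the size of the result. The sketch below assumes the composite indexes on sent (campid, username) and pool (gender, username, id) have been created. NOT EXISTS is used for the count simply as another way to write the same anti-join.

```sql
-- Check whether the optimizer now picks up the composite indexes
-- (look at the "key" and "Extra" columns of the output).
EXPLAIN
SELECT STRAIGHT_JOIN p.id
FROM pool p
LEFT JOIN sent s
       ON s.campid = 'YA1LGfh9'
      AND p.username = s.username
WHERE p.gender = 'f'
  AND s.username IS NULL;

-- Count how many rows the query would return before pulling them all.
SELECT COUNT(*)
FROM pool p
WHERE p.gender = 'f'
  AND NOT EXISTS (
      SELECT 1
      FROM sent s
      WHERE s.campid = 'YA1LGfh9'
        AND s.username = p.username
  );
```

If the count turns out to be in the millions, the time spent transferring the result may dominate whatever the join itself costs.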