Efficient implementation of MINUS in MySQL

I have two tables, both with a column called key, and I would like to do something like SELECT key FROM table1 MINUS SELECT key FROM table2 in MySQL (5.6.19). (table1 contains about 1.5 million rows, table2 about 100,000.) The things I tried are:
SELECT key FROM table1 WHERE key NOT IN (SELECT key FROM table2);
SELECT a.key FROM table1 a LEFT JOIN table2 b USING (key) WHERE b.key IS NULL;
SELECT a.key FROM table1 a LEFT JOIN table2 b ON a.key=b.key WHERE b.key IS NULL;
But all of them are unbelievably inefficient! (After waiting for a result from the first query I stopped it and started all queries in parallel overnight. The first one took an insane 12 hours 51 min, the second and third 7 hours 32 min and 7 hours 53 min, respectively.)
How can this be done efficiently? Is it only a problem of MySQL or of all SQL implementations in general? (In case it matters: key is of type char(32); table1 contains many more columns (also a lot of strings); table2 has only some integer columns apart from key.) Thank you very much in advance for any hint!
@Steve Rukuts: The EXPLAIN SELECT a.key FROM table1 a LEFT JOIN table2 b ON a.key=b.key WHERE b.key IS NULL; gives:
+----+-------------+-------+-------+---------------+------+---------+------+---------+-----------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+------+---------+------+---------+-----------------------------------------------------------------+
| 1 | SIMPLE | a | index | NULL | Key | 97 | NULL | 1372811 | Using index |
| 1 | SIMPLE | b | index | NULL | Key | 33 | NULL | 101580 | Using where; Using index; Using join buffer (Block Nested Loop) |
+----+-------------+-------+-------+---------------+------+---------+------+---------+-----------------------------------------------------------------+
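The three queries tried above are all forms of an anti-join. As a minimal sketch (using Python's sqlite3 as a stand-in for MySQL, with a hypothetical column k in place of key and made-up rows), the following checks that NOT IN, LEFT JOIN ... IS NULL and NOT EXISTS return the same keys, and that NOT IN stops doing so once table2 contains a NULL. It says nothing about how fast MySQL runs any of them.

```python
import sqlite3

# SQLite is only a stand-in for MySQL here: this checks result
# equivalence, not execution speed. Table contents are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (k CHAR(32) NOT NULL);  -- k stands in for `key`
    CREATE TABLE table2 (k CHAR(32));           -- nullable on purpose
    INSERT INTO table1 VALUES ('a'), ('b'), ('c'), ('d');
    INSERT INTO table2 VALUES ('b'), ('d');
""")

QUERIES = {
    "not_in": "SELECT k FROM table1 WHERE k NOT IN (SELECT k FROM table2) ORDER BY k",
    "left_join": ("SELECT a.k FROM table1 a LEFT JOIN table2 b ON a.k = b.k "
                  "WHERE b.k IS NULL ORDER BY a.k"),
    "not_exists": ("SELECT a.k FROM table1 a WHERE NOT EXISTS "
                   "(SELECT 1 FROM table2 b WHERE b.k = a.k) ORDER BY a.k"),
}

def run_all():
    """Run every anti-join variant and return {name: list of keys}."""
    return {name: [row[0] for row in conn.execute(sql)]
            for name, sql in QUERIES.items()}

before = run_all()

# A single NULL in table2.k makes every NOT IN comparison unknown,
# so the NOT IN form returns no rows at all.
conn.execute("INSERT INTO table2 VALUES (NULL)")
after = run_all()

print(before)
print(after)
```

So if table2.key can be NULL, the NOT IN form is not interchangeable with the other two. Separately, in the EXPLAIN output above key_len is 97 for table1 and 33 for table2 although both columns are char(32); that would be consistent with the two columns using different character sets, which may be worth checking (an inference from the numbers, not something the post states).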

Related

A join here seems unnecessary

TableA
------
id
Name
other_fields
TableB
------
A_id (foreign key to TableA.id)
other_fields
Select entries from TableB which reference entries in TableA with some specific property (e.g. Name = "Alice")
This can be easily done with a join:
SELECT TableB.*
FROM TableA INNER JOIN TableB on TableA.id = TableB.A_id
WHERE TableA.Name = "Alice"
Being used to procedural programming, the join seems overkill and unnecessary as we don't actually need any information from TableA other than the id of Alice.
So -- assuming that Alice is unique -- is there a way to do this (pseudocode):
variable alice_id = get id of Alice from TableA
SELECT *
FROM TableB
WHERE A_id = alice_id
If yes, should it be used in favor of the classical JOIN method? Is it faster? (in principle, of course)
You're asking if you can do this:
SELECT * FROM TableB WHERE A_id = (SELECT id FROM TableA WHERE Name = 'Alice');
It's a perfectly legitimate query, but MySQL will perform much better doing the join because the subquery is treated as a second separate query. Using the MySQL EXPLAIN command (just put it in front of your SELECT query) will show the indexes, temporary tables, and other resources that are used for a query. It should give you an idea when one query is faster or more efficient than another.
For your workload and indexes, you should try both queries' execution plan and runtime. In either case you would benefit from having an index on name.
I believe that both queries are going to end up with similar plans. Let's check that out.
Create the tables
create table tablea (id int primary key, nm varchar(50));
create index idx_tablea_nm on tablea(nm);
create table tableb(a_id int, anotherfield varchar(100),
key idx_tableb_id(a_id),
constraint fk_tableb_tablea_id foreign key (a_id) references tablea (id));
Let's do an EXPLAIN on the first one:
explain select tableb.* from tablea inner join tableb on tablea.id = tableb.a_id where tablea.nm = 'Alice';
+----+-------------+--------+------+-----------------------+---------------+---------+-------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+-----------------------+---------------+---------+-------------------+------+--------------------------+
| 1 | SIMPLE | tablea | ref | PRIMARY,idx_tablea_nm | idx_tablea_nm | 53 | const | 1 | Using where; Using index |
| 1 | SIMPLE | tableb | ref | idx_tableb_id | idx_tableb_id | 5 | tablea.id | 1 | Using where |
+----+-------------+--------+------+-----------------------+---------------+---------+-------------------+------+--------------------------+
Let's do EXPLAIN on the second one:
explain select * from tableb where a_id = (select id from tablea where nm = 'Alice');
+----+-------------+--------+------+---------------+---------------+---------+-------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+---------------+---------+-------+------+--------------------------+
| 1 | PRIMARY | tableb | ref | idx_tableb_id | idx_tableb_id | 5 | const | 1 | Using where |
| 2 | SUBQUERY | tablea | ref | idx_tablea_nm | idx_tablea_nm | 53 | | 1 | Using where; Using index |
+----+-------------+--------+------+---------------+---------------+---------+-------+------+--------------------------+
I don't have much data in those tables, and with little data you will see identical performance. As the workload changes, the execution plan may change.
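To check that the approaches in this thread agree on results (not on speed), here is a small sketch using Python's sqlite3 as a stand-in for MySQL, with made-up rows. It runs the join, the scalar subquery, and a two-step version of the question's pseudocode in which the id is fetched first and then passed as a parameter.

```python
import sqlite3

# SQLite stands in for MySQL; the rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tablea (id INT PRIMARY KEY, nm VARCHAR(50));
    CREATE INDEX idx_tablea_nm ON tablea (nm);
    CREATE TABLE tableb (a_id INT REFERENCES tablea (id),
                         anotherfield VARCHAR(100));
    INSERT INTO tablea VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO tableb VALUES (1, 'x'), (1, 'y'), (2, 'z');
""")

# 1) The classical join.
join_rows = conn.execute("""
    SELECT tableb.* FROM tablea
    INNER JOIN tableb ON tablea.id = tableb.a_id
    WHERE tablea.nm = 'Alice'
    ORDER BY anotherfield
""").fetchall()

# 2) The scalar subquery.
subquery_rows = conn.execute("""
    SELECT * FROM tableb
    WHERE a_id = (SELECT id FROM tablea WHERE nm = 'Alice')
    ORDER BY anotherfield
""").fetchall()

# 3) The question's pseudocode: look the id up first, then filter by it.
alice_id = conn.execute(
    "SELECT id FROM tablea WHERE nm = ?", ("Alice",)
).fetchone()[0]
two_step_rows = conn.execute(
    "SELECT * FROM tableb WHERE a_id = ? ORDER BY anotherfield", (alice_id,)
).fetchall()

print(join_rows, subquery_rows, two_step_rows)
```

Inside MySQL itself the two-step form can be written with a user variable, e.g. SET @alice_id = (SELECT id FROM tablea WHERE nm = 'Alice'); followed by a SELECT that uses @alice_id. Note that the = (subquery) form relies on Alice being unique: MySQL raises an error if the subquery returns more than one row, whereas the join simply returns the matches for every Alice.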

what is the fastest way to join several tables matching specific column values in MySQL

I have 3 tables that look like this:
CREATE TABLE big_table_1 (
id INT(11),
col1 TINYINT(1),
col2 TINYINT(1),
col3 TINYINT(1),
PRIMARY KEY (`id`)
)
And so on for big_table_2 and big_table_3. The col1, col2, col3 values are either 0, 1 or null.
I'm looking for ids whose col1 value equals 1 in each table. I join them as follows, using the simplest method I can think of:
SELECT t1.id
FROM big_table_1 AS t1
INNER JOIN big_table_2 AS t2 ON t2.id = t1.id
INNER JOIN big_table_3 AS t3 ON t3.id = t1.id
WHERE t1.col1 = 1
AND t2.col1 = 1
AND t3.col1 = 1;
With 10 million rows per table, the query takes about 40 seconds to execute on my machine:
407231 rows in set (37.19 sec)
Explain results:
+----+-------------+-------+--------+---------------+---------+---------+--------------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+--------------+----------+-------------+
| 1 | SIMPLE | t3 | ALL | PRIMARY | NULL | NULL | NULL | 10999387 | Using where |
| 1 | SIMPLE | t1 | eq_ref | PRIMARY | PRIMARY | 4 | testDB.t3.id | 1 | Using where |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY | PRIMARY | 4 | testDB.t3.id | 1 | Using where |
+----+-------------+-------+--------+---------------+---------+---------+--------------+----------+-------------+
If I declare an index on col1, the result is slightly slower:
407231 rows in set (40.84 sec)
I have also tried the following query:
SELECT t1.id
FROM (SELECT distinct ta1.id FROM big_table_1 ta1 WHERE ta1.col1=1) as t1
WHERE EXISTS (SELECT ta2.id FROM big_table_2 ta2 WHERE ta2.col1=1 AND ta2.id = t1.id)
AND EXISTS (SELECT ta3.id FROM big_table_3 ta3 WHERE ta3.col1=1 AND ta3.id = t1.id);
But it's slower:
407231 rows in set (44.01 sec) [with index on col1]
407231 rows in set (1 min 36.52 sec) [without index on col1]
Is the aforementioned simple method basically the fastest way to do this in MySQL? Would it be necessary to shard the table onto multiple servers in order to get the result faster?
Addendum: EXPLAIN results for Andrew's code as requested (I trimmed the tables down to 1 million rows only, and the index is on id and col1):
+----+-------------+-------------+-------+---------------+---------+---------+------+---------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+---------------+---------+---------+------+---------+--------------------------------+
| 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 332814 | |
| 1 | PRIMARY | <derived4> | ALL | NULL | NULL | NULL | NULL | 333237 | Using where; Using join buffer |
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 333505 | Using where; Using join buffer |
| 4 | DERIVED | big_table_3 | index | NULL | PRIMARY | 5 | NULL | 1000932 | Using where; Using index |
| 3 | DERIVED | big_table_2 | index | NULL | PRIMARY | 5 | NULL | 1000507 | Using where; Using index |
| 2 | DERIVED | big_table_1 | index | NULL | PRIMARY | 5 | NULL | 1000932 | Using where; Using index |
+----+-------------+-------------+-------+---------------+---------+---------+------+---------+--------------------------------+
INNER JOIN (same as JOIN) lets the optimizer pick whether to use the table to its left or the table to its right. The simplified SELECT you presented could start with any of the three tables.
The optimizer likes to start with the table with the WHERE clause. Your simplified example implies that each table is equally good IF there is an INDEX starting with col1. (See retraction below.)
The second and subsequent tables need a different rule for indexing. In your simplified example, col1 is used for filtering and id is used for JOINing. INDEX(col1, id) and INDEX(id, col1) work equally well for getting to the second table.
I keep saying "your simplified example" because as soon as you change anything, most of the advice in these answers is up for grabs.
(The retraction) When you have a column with "low cardinality" such as your col%, with only 0,1,NULL possibilities, INDEX(col1) is essentially useless since it is faster to blindly scan the table rather than use the index.
On the other hand, INDEX(col1, ...) may be useful, as mentioned for the second table.
However neither is useful for the first table. If you have such an INDEX, it will be ignored.
Then comes "covering". Again, your example is unrealistically simplistic because there are essentially no fields touched other than id and col1. A "covering" index includes all the fields of a table that are touched in the query. A covering index is virtually always smaller than the data, so it takes less effort to run through a covering index, hence faster.
(Retract the retraction) INDEX(col1, id), in that order, is a useful covering index for the first table.
Imagine how my discussion had gone if you had not mentioned that col1 had only 3 values. Quite different.
And we have not gotten to ORDER BY, IN(...), BETWEEN...AND..., engine differences, tricks with the PRIMARY KEY, LEFT JOIN, etc.
More insight into building indexes from Selects.
ANALYZE TABLE should not be necessary.
For kicks, try it with a covering index (a composite of id, col1).
So there is one index: make it a composite primary key. No other indexes.
Then run ANALYZE TABLE xxx (3 times in total, once per table).
Then fire it off, hoping the MySQL CBO (cost-based optimizer) isn't too dense to figure it out.
A second idea is to see the results without a WHERE clause: move all the conditions into the JOIN ... ON clauses.
Have you tried this:
SELECT t1.id
FROM
(SELECT id from big_table_1 where col1 = 1) AS t1
INNER JOIN (SELECT id from big_table_2 where col1 = 1) AS t2 ON t2.id = t1.id
INNER JOIN (SELECT id from big_table_3 where col1 = 1) AS t3 ON t3.id = t1.id
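Before timing the variants discussed in this thread, it can help to confirm that they really return the same ids. The sketch below uses Python's sqlite3 as a stand-in for MySQL with six made-up rows per table (including NULLs in col1) and the composite (col1, id) index one answer suggests; the index names are hypothetical. It only checks equivalence and is no guide to MySQL timings.

```python
import sqlite3

# SQLite stands in for MySQL; six made-up rows per table, NULLs included.
conn = sqlite3.connect(":memory:")
rows = {
    "big_table_1": [(1, 1), (2, 1), (3, 0), (4, 1), (5, None), (6, 1)],
    "big_table_2": [(1, 1), (2, 0), (3, 1), (4, 1), (5, 1), (6, 1)],
    "big_table_3": [(1, 1), (2, 1), (3, 1), (4, None), (5, 1), (6, 1)],
}
for name, data in rows.items():
    conn.execute(f"CREATE TABLE {name} (id INT PRIMARY KEY, col1 TINYINT)")
    # Composite index in (col1, id) order; the index name is hypothetical.
    conn.execute(f"CREATE INDEX idx_{name}_col1_id ON {name} (col1, id)")
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?)", data)

PLAIN_JOIN = """
    SELECT t1.id FROM big_table_1 AS t1
    INNER JOIN big_table_2 AS t2 ON t2.id = t1.id
    INNER JOIN big_table_3 AS t3 ON t3.id = t1.id
    WHERE t1.col1 = 1 AND t2.col1 = 1 AND t3.col1 = 1
    ORDER BY t1.id
"""
DERIVED_JOIN = """
    SELECT t1.id FROM
    (SELECT id FROM big_table_1 WHERE col1 = 1) AS t1
    INNER JOIN (SELECT id FROM big_table_2 WHERE col1 = 1) AS t2 ON t2.id = t1.id
    INNER JOIN (SELECT id FROM big_table_3 WHERE col1 = 1) AS t3 ON t3.id = t1.id
    ORDER BY t1.id
"""
EXISTS_FORM = """
    SELECT t1.id
    FROM (SELECT DISTINCT ta1.id FROM big_table_1 ta1 WHERE ta1.col1 = 1) AS t1
    WHERE EXISTS (SELECT ta2.id FROM big_table_2 ta2
                  WHERE ta2.col1 = 1 AND ta2.id = t1.id)
      AND EXISTS (SELECT ta3.id FROM big_table_3 ta3
                  WHERE ta3.col1 = 1 AND ta3.id = t1.id)
    ORDER BY t1.id
"""

def ids(sql):
    """Run a query and return the id column as a list."""
    return [row[0] for row in conn.execute(sql)]

plain, derived, exists = ids(PLAIN_JOIN), ids(DERIVED_JOIN), ids(EXISTS_FORM)
print(plain, derived, exists)
```

Since all three forms agree, the choice between them comes down to the plan MySQL picks, which is what EXPLAIN on the real tables would show.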

MySQL performance with GROUP BY and JOIN

After spending a lot of time with variants to this question I'm wondering if someone can help me optimize this query or indexes.
I have three temp tables ref1, ref2, ref3 all defined as below, with ref1 and ref2 each having about 6000 rows and ref3 only 3 rows:
CREATE TEMPORARY TABLE ref1 (
id INT NOT NULL AUTO_INCREMENT,
val INT,
PRIMARY KEY (id)
)
ENGINE = MEMORY;
The slow query is against a table like so, with about 1M rows:
CREATE TABLE t1 (
d DATETIME NOT NULL,
id1 INT NOT NULL,
id2 INT NOT NULL,
id3 INT NOT NULL,
x INT NULL,
PRIMARY KEY (id1, d, id2, id3)
)
ENGINE = INNODB;
The query in question:
SELECT id1, SUM(x)
FROM t1
INNER JOIN ref1 ON ref1.id = t1.id1
INNER JOIN ref2 ON ref2.id = t1.id2
INNER JOIN ref3 ON ref3.id = t1.id3
WHERE d BETWEEN '2011-03-01' AND '2011-04-01'
GROUP BY id1;
The temp tables are used to filter the result set down to just the items a user is looking for.
EXPLAIN
+----+-------------+-------+--------+---------------+---------+---------+------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+------------------+------+---------------------------------+
| 1 | SIMPLE | ref1 | ALL | PRIMARY | NULL | NULL | NULL | 6000 | Using temporary; Using filesort |
| 1 | SIMPLE | t1 | ref | PRIMARY | PRIMARY | 4 | med31new.ref1.id | 38 | Using where |
| 1 | SIMPLE | ref3 | ALL | PRIMARY | NULL | NULL | NULL | 3 | Using where; Using join buffer |
| 1 | SIMPLE | ref2 | eq_ref | PRIMARY | PRIMARY | 4 | med31new.t1.id2 | 1 | |
+----+-------------+-------+--------+---------------+---------+---------+------------------+------+---------------------------------+
(on a different system with ~5M rows EXPLAIN shows t1 first in the list, with "Using where; Using index; Using temporary; Using filesort")
Is there something obvious I'm missing that would prevent the temporary table from being used?
First, filesort does not mean a file is written to disk to perform the sort; it's the name of the quicksort algorithm in MySQL, check what-does-using-filesort-mean-in-mysql.
So the problematic keyword in your EXPLAIN is Using temporary, not Using filesort. For that you can play with tmp_table_size and max_heap_table_size (put the same value on both) to allow more in-memory work and keep the temporary table from being written to disk; check this link on the subject with remarks about documentation mistakes.
Then you could try a different indexing policy and see the results, but do not try to avoid filesort.
Last thing, not related: you compute SUM(x) but x can take NULL values. SUM skips NULLs, so the sum for a group is NULL only when every x in that group is NULL; SUM(COALESCE(x, 0)) is maybe better if you do not want a NULL sum in that case.
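How SUM treats NULL is easy to check on a toy table. A minimal sketch, using Python's sqlite3 as a stand-in for MySQL and made-up rows: group 1 has one NULL among its x values, group 2 has only NULLs.

```python
import sqlite3

# SQLite stands in for MySQL; SUM and COALESCE are standard SQL aggregates.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id1 INT NOT NULL, x INT NULL);
    INSERT INTO t1 VALUES (1, 10), (1, NULL), (2, NULL), (2, NULL);
""")

# Columns: id1, plain SUM, COALESCE inside the SUM, COALESCE around the SUM.
totals = conn.execute("""
    SELECT id1, SUM(x), SUM(COALESCE(x, 0)), COALESCE(SUM(x), 0)
    FROM t1
    GROUP BY id1
    ORDER BY id1
""").fetchall()

for row in totals:
    print(row)
```

Group 1 sums to 10 in all three columns because the NULL is skipped; only the all-NULL group 2 yields a NULL plain SUM, which either COALESCE placement turns into 0.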
Add an index on JUST the DATE. Since that is the criteria of the first table, and the others are just joins, it will be optimized against the DATE first... the joins are secondary.
Isn't this:
SELECT id1, SUM(x)
FROM t1
INNER JOIN ref1 ON ref1.id = t1.id1
INNER JOIN ref2 ON ref2.id = t1.id2
INNER JOIN ref3 ON ref3.id = t1.id3
WHERE d BETWEEN '2011-03-01' AND '2011-04-01'
GROUP BY id1;
exactly equivalent to:
select id1, SUM(x)
FROM t1
WHERE d BETWEEN '2011-03-01' AND '2011-04-01'
group by id1;
What are the extra tables being used for? I think the temp table mentioned in another answer is referring to MySQL creating a temp table during query execution. If you're hoping to create a sub-query (or table) that will minimize number of operations required in a join, that might speed up the query, but I don't see joined data being selected.
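On the question of whether the joins can simply be dropped: an inner join to a table of allowed ids acts as a filter even when no column is selected from it, which matches the asker's statement that the temp tables filter the result set. A small sketch with Python's sqlite3 as a stand-in for MySQL and made-up rows (d is stored as text here for simplicity):

```python
import sqlite3

# SQLite stands in for MySQL; rows and ref contents are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ref1 (id INTEGER PRIMARY KEY, val INT);
    CREATE TABLE ref2 (id INTEGER PRIMARY KEY, val INT);
    CREATE TABLE ref3 (id INTEGER PRIMARY KEY, val INT);
    CREATE TABLE t1 (d TEXT NOT NULL, id1 INT NOT NULL, id2 INT NOT NULL,
                     id3 INT NOT NULL, x INT NULL,
                     PRIMARY KEY (id1, d, id2, id3));
    INSERT INTO ref1 (id) VALUES (1);
    INSERT INTO ref2 (id) VALUES (10);
    INSERT INTO ref3 (id) VALUES (100);
    INSERT INTO t1 VALUES
        ('2011-03-05', 1, 10, 100, 5),
        ('2011-03-06', 1, 10, 200, 7),
        ('2011-03-07', 2, 10, 100, 9),
        ('2011-03-08', 1, 20, 100, 11);
""")

DATE_FILTER = "d BETWEEN '2011-03-01' AND '2011-04-01'"

with_joins = conn.execute(f"""
    SELECT id1, SUM(x) FROM t1
    INNER JOIN ref1 ON ref1.id = t1.id1
    INNER JOIN ref2 ON ref2.id = t1.id2
    INNER JOIN ref3 ON ref3.id = t1.id3
    WHERE {DATE_FILTER}
    GROUP BY id1 ORDER BY id1
""").fetchall()

without_joins = conn.execute(f"""
    SELECT id1, SUM(x) FROM t1
    WHERE {DATE_FILTER}
    GROUP BY id1 ORDER BY id1
""").fetchall()

# Same filter written as IN subqueries; equivalent to the joins here
# because each ref table has a unique id.
with_in = conn.execute(f"""
    SELECT id1, SUM(x) FROM t1
    WHERE {DATE_FILTER}
      AND id1 IN (SELECT id FROM ref1)
      AND id2 IN (SELECT id FROM ref2)
      AND id3 IN (SELECT id FROM ref3)
    GROUP BY id1 ORDER BY id1
""").fetchall()

print(with_joins, without_joins, with_in)
```

The two queries are therefore only equivalent when every id in t1 appears in the corresponding ref table.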

mysql query with OR optimization

Can the following query be optimized? What indexes can be created?
SELECT column_a
FROM Table_b
JOIN Table_a
WHERE Table_B.ID_b = Table_A.ID_a
OR Table_B.ID_b = Table_A.ID_b;
Your query should actually be:
SELECT column_a
FROM Table_b
JOIN Table_a ON Table_B.ID_b IN (Table_A.ID_a, Table_A.ID_b)
If you don't provide ON criteria with the JOIN, MySQL accepts this as being a CROSS JOIN -- the result is a cartesian product (that's bad, unless that's really what you want). If I knew which table that column_a came from, I might suggest a different approach to the query...
Index the following:
Table_B.ID_b
Table_A.ID_a
Table_A.ID_b
The two columns in TABLE_A could be a covering index, rather than separate ones.
If the ID_x fields are keys (primary or unique), this should already be pretty good. (I.e., if they're not, you should make sure that all fields affected by the WHERE part are indexed.)
Consider posting an EXPLAIN of the query:
EXPLAIN SELECT column_a FROM Table_b JOIN Table_a
WHERE Table_B.ID_b = Table_A.ID_a OR Table_B.ID_b = Table_A.ID_b;
From comments:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | Table_b | index | INDEX_ON_ID_b | INDEX_ON_ID_b | 3 | NULL | 1507 | Using index; Using temporary |
| 1 | SIMPLE | Table_a | ALL |ID_a,ID_b,ID_a_column_a, ID_b_column_a_index | NULL | NULL | NULL | 29252089 | Range checked for each record (index map: 0x306) |
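One rewrite that is sometimes tried for an OR in a join condition is to split it into two equality joins combined with UNION ALL, so that each branch can use its own index. This is a swapped-in technique that none of the answers above proposes, and whether it helps in MySQL depends on the indexes and version. The sketch below (Python's sqlite3 as a stand-in, made-up rows, NOT NULL ids assumed) only checks that the OR form, the IN form and the UNION ALL form return the same rows.

```python
import sqlite3

# SQLite stands in for MySQL; ids are assumed NOT NULL, rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table_b (ID_b INT NOT NULL);
    CREATE TABLE Table_a (ID_a INT NOT NULL, ID_b INT NOT NULL, column_a TEXT);
    INSERT INTO Table_b VALUES (1), (2), (3);
    INSERT INTO Table_a VALUES
        (1, 9, 'a1'),   -- matches via ID_a
        (8, 2, 'a2'),   -- matches via ID_b
        (3, 3, 'a3'),   -- matches via both columns, must appear once
        (7, 7, 'a4'),   -- matches nothing
        (1, 2, 'a5');   -- matches two different Table_b rows
""")

OR_JOIN = """
    SELECT a.column_a FROM Table_b b
    JOIN Table_a a ON b.ID_b = a.ID_a OR b.ID_b = a.ID_b
"""
IN_JOIN = """
    SELECT a.column_a FROM Table_b b
    JOIN Table_a a ON b.ID_b IN (a.ID_a, a.ID_b)
"""
# The second branch skips rows where both ids are equal, because the
# first branch has already produced that pairing.
UNION_ALL = """
    SELECT a.column_a FROM Table_b b
    JOIN Table_a a ON b.ID_b = a.ID_a
    UNION ALL
    SELECT a.column_a FROM Table_b b
    JOIN Table_a a ON b.ID_b = a.ID_b
    WHERE a.ID_a <> a.ID_b
"""

def values(sql):
    """Run a query and return its single column, sorted for comparison."""
    return sorted(row[0] for row in conn.execute(sql))

or_rows, in_rows, union_rows = values(OR_JOIN), values(IN_JOIN), values(UNION_ALL)
print(or_rows, in_rows, union_rows)
```

If the id columns can be NULL, the a.ID_a <> a.ID_b guard would need a NULL-safe comparison (in MySQL, NOT (a.ID_a <=> a.ID_b)).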

MySQL: Inner join vs Where [duplicate]

This question already has answers here:
Explicit vs implicit SQL joins
(12 answers)
Closed 9 years ago.
Is there a difference in performance (in mysql) between
Select * from Table1 T1
Inner Join Table2 T2 On T1.ID = T2.ID
And
Select * from Table1 T1, Table2 T2
Where T1.ID = T2.ID
?
As pulled from the accepted answer in question 44917:
Performance wise, they are exactly the same (at least in SQL Server) but be aware that they are deprecating the implicit outer join syntax.
In MySQL the results are the same.
I would personally stick with joining tables explicitly... that is the "socially acceptable" way of doing it.
They are the same. This can be seen by running the EXPLAIN command:
mysql> explain Select * from Table1 T1
-> Inner Join Table2 T2 On T1.ID = T2.ID;
+----+-------------+-------+-------+---------------+---------+---------+------+------+---------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+---------------------------------------------+
| 1 | SIMPLE | T1 | index | PRIMARY | PRIMARY | 4 | NULL | 4 | Using index |
| 1 | SIMPLE | T2 | index | PRIMARY | PRIMARY | 4 | NULL | 4 | Using where; Using index; Using join buffer |
+----+-------------+-------+-------+---------------+---------+---------+------+------+---------------------------------------------+
2 rows in set (0.00 sec)
mysql> explain Select * from Table1 T1, Table2 T2
-> Where T1.ID = T2.ID;
+----+-------------+-------+-------+---------------+---------+---------+------+------+---------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+---------------------------------------------+
| 1 | SIMPLE | T1 | index | PRIMARY | PRIMARY | 4 | NULL | 4 | Using index |
| 1 | SIMPLE | T2 | index | PRIMARY | PRIMARY | 4 | NULL | 4 | Using where; Using index; Using join buffer |
+----+-------------+-------+-------+---------------+---------+---------+------+------+---------------------------------------------+
2 rows in set (0.00 sec)
Well, one late answer from me, as I am analyzing the performance of an older application which uses comma-based joins instead of the INNER JOIN clause.
So here are two tables which have a join (both have more than 1 lakh, i.e. 100,000, records). When executing a query which has a comma-based join, it takes a lot longer than the INNER JOIN case.
When I analyzed the EXPLAIN output, I found that the query with the comma join was using the join buffer, whereas the query with the INNER JOIN clause had 'Using where'.
Also, these queries are significantly different, as shown in the rows column of the EXPLAIN output.
These are my queries and their respective explain results.
explain select count(*) FROM mbst a , his_moneypv2 b
WHERE b.yymm IN ('200802','200811','201001','201002','201003')
AND a.tel3 != ''
AND a.mb_no = b.mb_no
AND b.true_grade_class IN (3,6)
OR b.grade_class IN (4,7);
+----+-------------+-------+-------------+----------------------------------------------------------------+--------------------------------------+---------+------+--------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------------+----------------------------------------------------------------+--------------------------------------+---------+------+--------+---------------------------------------------------------------------+
| 1 | SIMPLE | b | index_merge | PRIMARY,mb_no,yymm,yymm_2,idx_true_grade_class,idx_grade_class | idx_true_grade_class,idx_grade_class | 5,9 | NULL | 16924 | Using sort_union(idx_true_grade_class,idx_grade_class); Using where |
| 1 | SIMPLE | a | ALL | PRIMARY | NULL | NULL | NULL | 134472 | Using where; Using join buffer |
+----+-------------+-------+-------------+----------------------------------------------------------------+--------------------------------------+---------+------+--------+---------------------------------------------------------------------+
versus
explain select count(*) FROM mbst a inner join his_moneypv2 b
on a.mb_no = b.mb_no
WHERE b.yymm IN ('200802','200811','201001','201002','201003')
AND a.tel3 != ''
AND b.true_grade_class IN (3,6)
OR b.grade_class IN (4,7);
+----+-------------+-------+-------------+----------------------------------------------------------------+--------------------------------------+---------+--------------------+-------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------------+----------------------------------------------------------------+--------------------------------------+---------+--------------------+-------+---------------------------------------------------------------------+
| 1 | SIMPLE | b | index_merge | PRIMARY,mb_no,yymm,yymm_2,idx_true_grade_class,idx_grade_class | idx_true_grade_class,idx_grade_class | 5,9 | NULL | 16924 | Using sort_union(idx_true_grade_class,idx_grade_class); Using where |
| 1 | SIMPLE | a | eq_ref | PRIMARY | PRIMARY | 62 | shaklee_my.b.mb_no | 1 | Using where |
+----+-------------+-------+-------------+----------------------------------------------------------------+--------------------------------------+---------+--------------------+-------+---------------------------------------------------------------------+
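One possible explanation for the two different plans in this answer (an inference from the query text, not something the answer states) is operator precedence: AND binds more tightly than OR, so in the comma version the join condition a.mb_no = b.mb_no sits inside the AND group and the trailing OR b.grade_class IN (4,7) can match row pairs that are not joined at all. In the INNER JOIN version the ON condition always applies. A small sketch with Python's sqlite3 as a stand-in for MySQL, a shortened yymm list and made-up rows:

```python
import sqlite3

# SQLite stands in for MySQL; column names follow the answer, rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mbst (mb_no TEXT PRIMARY KEY, tel3 TEXT);
    CREATE TABLE his_moneypv2 (mb_no TEXT, yymm TEXT,
                               true_grade_class INT, grade_class INT);
    INSERT INTO mbst VALUES ('m1', '555'), ('m2', '556');
    INSERT INTO his_moneypv2 VALUES
        ('m1', '200802', 3, 0),   -- joins to m1 and passes the AND filters
        ('m9', '199901', 0, 4);   -- joins to nobody, but grade_class is 4
""")

FILTERS = ("b.yymm IN ('200802', '200811') AND a.tel3 != '' "
           "AND b.true_grade_class IN (3, 6)")
OR_PART = "b.grade_class IN (4, 7)"

def count(sql):
    """Run a COUNT(*) query and return the number."""
    return conn.execute(sql).fetchone()[0]

# Without any OR, the comma join and INNER JOIN are the same query.
comma_plain = count(
    "SELECT COUNT(*) FROM mbst a, his_moneypv2 b WHERE a.mb_no = b.mb_no")
join_plain = count(
    "SELECT COUNT(*) FROM mbst a INNER JOIN his_moneypv2 b ON a.mb_no = b.mb_no")

# With the unparenthesized OR, the join condition is skipped on the OR branch.
comma_or = count(
    f"SELECT COUNT(*) FROM mbst a, his_moneypv2 b "
    f"WHERE {FILTERS} AND a.mb_no = b.mb_no OR {OR_PART}")
join_or = count(
    f"SELECT COUNT(*) FROM mbst a INNER JOIN his_moneypv2 b "
    f"ON a.mb_no = b.mb_no WHERE {FILTERS} OR {OR_PART}")

# Parenthesizing the filters makes the comma form match the INNER JOIN form.
comma_parenthesized = count(
    f"SELECT COUNT(*) FROM mbst a, his_moneypv2 b "
    f"WHERE a.mb_no = b.mb_no AND ({FILTERS} OR {OR_PART})")

print(comma_plain, join_plain, comma_or, join_or, comma_parenthesized)
```

On this toy data the unparenthesized comma form counts the unjoined ('m9') row once per row of mbst, so the two queries in the answer may not be the same query, which would also account for the join buffer and the much larger rows estimate.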
Actually they are virtually the same. JOIN / ON is the newer ANSI syntax; WHERE is the older ANSI syntax. Both are recognized by query engines.
The comma in a FROM clause is a CROSS JOIN. We can imagine that the SQL server has a select query execution procedure which looks something like this:
1. iterate through every table
2. find rows that meet the join predicate and put them into a result table.
3. from the result table, get only those rows that meet the WHERE condition.
If it really looks like that, then using a CROSS JOIN on a table that has a few thousand rows could allocate a lot of memory, since every row is combined with every other before the WHERE condition is examined. Your SQL server could be quite busy then.
I would think so, because the first example explicitly tells MySQL which columns to join and how to join them, whereas in the second one MySQL has to try to figure out where you want to join.
The second query is just another notation for an inner join, so if there is a difference in performance it's only because one query can be parsed faster than the other, and that difference, if it exists, will be so tiny that you won't notice it.
For more information you could take a look at this question (and use the search on SO next time before asking a question that has already been answered).
The first query is easier to understand for MySQL so it is likely that the execution plan will be better and that the query will run faster.
The second query, without the WHERE clause, is a cross join. If MySQL is able to understand the WHERE clause well enough, it will do its best to avoid cross joining all the rows, but nothing guarantees that.
In a case as simple as yours, the performance will be strictly identical.
Performance wise, the first query will always be better or identical to the second one. And from my point of view it is also a lot easier to understand when rereading.