I have the following query:
SELECT DISTINCT table_a.id FROM table_a
INNER JOIN table_b ON table_a.id = table_b.profile_id
WHERE table_b.role_id IN (1,2,3,4,5,6)
I have an index on table_b:
CREATE INDEX test_index ON table_b (role_id, profile_id)
But EXPLAIN gives me 'Using temporary' for table_b:
Using where; Using index; Using temporary
What indexes should I create to overcome this?
Update:
Explain output
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | table_b | range | table_b_e1c74d82,test_index | test_index | 5 | NULL | 12860 | Using where; Using index; Using temporary |
| 1 | SIMPLE | table_a | eq_ref | PRIMARY | PRIMARY | 4 | table_b.profile_id | 1 | Using index |
The table_b_e1c74d82 index is a (profile_id, production_id, role_id) index.
I think you need the following indexes for this:
role_id in table_b
profile_id in table_b
id in table_a
MySQL will run its optimiser on the query and will then pick the best one for the job.
Using temporary just means that to resolve the query, MySQL needs to create a temporary table to hold the result. This typically happens when the query contains GROUP BY and ORDER BY clauses that list columns differently, or, as here, DISTINCT combined with a join.
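A minimal runnable sketch of the situation, using Python's sqlite3 instead of MySQL (so the planner output is only an analogy -- sqlite has no 'Using temporary' note -- and the data below is invented):

```python
import sqlite3

# Invented toy data; sqlite stands in for MySQL here, so EXPLAIN QUERY
# PLAN is only an analogy for MySQL's EXPLAIN output.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE table_a (id INTEGER PRIMARY KEY);
CREATE TABLE table_b (profile_id INTEGER, role_id INTEGER);
CREATE INDEX test_index ON table_b (role_id, profile_id);
""")
con.executemany("INSERT INTO table_a VALUES (?)", [(i,) for i in range(10)])
con.executemany("INSERT INTO table_b VALUES (?, ?)",
                [(i % 10, i % 7) for i in range(100)])

# The query from the question.
query = """
SELECT DISTINCT table_a.id
FROM table_a
INNER JOIN table_b ON table_a.id = table_b.profile_id
WHERE table_b.role_id IN (1, 2, 3, 4, 5, 6)
"""
for row in con.execute("EXPLAIN QUERY PLAN " + query):
    print(row)
```

In MySQL itself, the temporary table here most likely comes from the DISTINCT, not from a missing index; the (role_id, profile_id) index already covers the query, which is why Extra also shows 'Using index'.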
Related
TableA
------
id
Name
other_fields
TableB
------
A_id (foreign key to TableA.id)
other_fields
Select entries from TableB which reference entries in TableA with some specific property (e.g. Name = "Alice")
This can be easily done with a join:
SELECT TableB.*
FROM TableA INNER JOIN TableB on TableA.id = TableB.A_id
WHERE TableA.Name = "Alice"
Coming from procedural programming, I find the join overkill and unnecessary, as we don't actually need any information from TableA other than Alice's id.
So -- assuming that Alice is unique -- is there a way to do this (pseudocode):
variable alice_id = get id of Alice from TableA
SELECT *
FROM TableB
WHERE A_id = alice_id
If yes, should it be used in favor of the classical JOIN method? Is it faster? (in principle, of course)
You're asking if you can do this:
SELECT * FROM TableB WHERE A_id = (SELECT id FROM TableA WHERE Name = 'Alice');
It's a perfectly legitimate query, but MySQL will often perform better with the join because the subquery is treated as a second, separate query. Using MySQL's EXPLAIN command (just put it in front of your SELECT query) will show the indexes, temporary tables, and other resources used by a query. It should give you an idea of when one query is faster or more efficient than another.
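You can check the equivalence of the two forms on toy data. This sketch uses Python's sqlite3 rather than MySQL (planner behavior differs between engines, but the result sets must match either way), with invented rows:

```python
import sqlite3

# Invented rows mirroring the TableA/TableB schema from the question.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE TableA (id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE TableB (A_id INTEGER, other_fields TEXT);
INSERT INTO TableA VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO TableB VALUES (1, 'x'), (1, 'y'), (2, 'z');
""")

# Join form.
join_rows = con.execute("""
SELECT TableB.* FROM TableA
INNER JOIN TableB ON TableA.id = TableB.A_id
WHERE TableA.Name = 'Alice'
""").fetchall()

# Scalar-subquery form (assumes Name is unique, as the question states).
sub_rows = con.execute("""
SELECT * FROM TableB
WHERE A_id = (SELECT id FROM TableA WHERE Name = 'Alice')
""").fetchall()

print(join_rows, sub_rows)
```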
For your workload and indexes, compare both queries' execution plans and runtimes. In either case you would benefit from having an index on Name.
I believe that both queries are going to end up with similar plans. Let's check that out.
Create the tables
create table tablea (id int primary key, nm varchar(50));
create index idx_tablea_nm on tablea(nm);
create table tableb(a_id int, anotherfield varchar(100),
key idx_tableb_id(a_id),
constraint fk_tableb_tablea_id foreign key (a_id) references tablea (id));
Let's do an EXPLAIN on the first one:
explain select tableb.* from tablea inner join tableb on tablea.id = tableb.a_id where tablea.nm = 'Alice';
+----+-------------+--------+------+-----------------------+---------------+---------+-------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+-----------------------+---------------+---------+-------------------+------+--------------------------+
| 1 | SIMPLE | tablea | ref | PRIMARY,idx_tablea_nm | idx_tablea_nm | 53 | const | 1 | Using where; Using index |
| 1 | SIMPLE | tableb | ref | idx_tableb_id | idx_tableb_id | 5 | tablea.id | 1 | Using where |
+----+-------------+--------+------+-----------------------+---------------+---------+-------------------+------+--------------------------+
Let's do EXPLAIN on the second one:
explain select * from tableb where a_id = (select id from tablea where nm = 'Alice');
+----+-------------+--------+------+---------------+---------------+---------+-------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+---------------+---------+-------+------+--------------------------+
| 1 | PRIMARY | tableb | ref | idx_tableb_id | idx_tableb_id | 5 | const | 1 | Using where |
| 2 | SUBQUERY | tablea | ref | idx_tablea_nm | idx_tablea_nm | 53 | | 1 | Using where; Using index |
+----+-------------+--------+------+---------------+---------------+---------+-------+------+--------------------------+
I don't have much data in those tables, and with little data you will see identical performance. As the workload changes, the execution plan may change.
I have 3 tables that look like this:
CREATE TABLE big_table_1 (
id INT(11),
col1 TINYINT(1),
col2 TINYINT(1),
col3 TINYINT(1),
PRIMARY KEY (`id`)
)
And so on for big_table_2 and big_table_3. The col1, col2, col3 values are either 0, 1 or null.
I'm looking for id's whose col1 value equals 1 in each table. I join them as follows, using the simplest method I can think of:
SELECT t1.id
FROM big_table_1 AS t1
INNER JOIN big_table_2 AS t2 ON t2.id = t1.id
INNER JOIN big_table_3 AS t3 ON t3.id = t1.id
WHERE t1.col1 = 1
AND t2.col1 = 1
AND t3.col1 = 1;
With 10 million rows per table, the query takes about 40 seconds to execute on my machine:
407231 rows in set (37.19 sec)
Explain results:
+----+-------------+-------+--------+---------------+---------+---------+--------------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+--------------+----------+-------------+
| 1 | SIMPLE | t3 | ALL | PRIMARY | NULL | NULL | NULL | 10999387 | Using where |
| 1 | SIMPLE | t1 | eq_ref | PRIMARY | PRIMARY | 4 | testDB.t3.id | 1 | Using where |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY | PRIMARY | 4 | testDB.t3.id | 1 | Using where |
+----+-------------+-------+--------+---------------+---------+---------+--------------+----------+-------------+
If I declare an index on col1, the result is slightly slower:
407231 rows in set (40.84 sec)
I have also tried the following query:
SELECT t1.id
FROM (SELECT distinct ta1.id FROM big_table_1 ta1 WHERE ta1.col1=1) as t1
WHERE EXISTS (SELECT ta2.id FROM big_table_2 ta2 WHERE ta2.col1=1 AND ta2.id = t1.id)
AND EXISTS (SELECT ta3.id FROM big_table_3 ta3 WHERE ta3.col1=1 AND ta3.id = t1.id);
But it's slower:
407231 rows in set (44.01 sec) [with index on col1]
407231 rows in set (1 min 36.52 sec) [without index on col1]
Is the aforementioned simple method basically the fastest way to do this in MySQL? Would it be necessary to shard the table onto multiple servers in order to get the result faster?
Addendum: EXPLAIN results for Andrew's code as requested (I trimmed the tables down to 1 million rows only, and the index is on id and col1):
+----+-------------+-------------+-------+---------------+---------+---------+------+---------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+---------------+---------+---------+------+---------+--------------------------------+
| 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 332814 | |
| 1 | PRIMARY | <derived4> | ALL | NULL | NULL | NULL | NULL | 333237 | Using where; Using join buffer |
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 333505 | Using where; Using join buffer |
| 4 | DERIVED | big_table_3 | index | NULL | PRIMARY | 5 | NULL | 1000932 | Using where; Using index |
| 3 | DERIVED | big_table_2 | index | NULL | PRIMARY | 5 | NULL | 1000507 | Using where; Using index |
| 2 | DERIVED | big_table_1 | index | NULL | PRIMARY | 5 | NULL | 1000932 | Using where; Using index |
+----+-------------+-------------+-------+---------------+---------+---------+------+---------+--------------------------------+
INNER JOIN (same as JOIN) lets the optimizer pick whether to use the table to its left or the table to its right. The simplified SELECT you presented could start with any of the three tables.
The optimizer likes to start with the table with the WHERE clause. Your simplified example implies that each table is equally good IF there is an INDEX starting with col1. (See retraction below.)
The second and subsequent tables need a different rule for indexing. In your simplified example, col1 is used for filtering and id is used for JOINing. INDEX(col1, id) and INDEX(id, col1) work equally well for getting to the second table.
I keep saying "your simplified example" because as soon as you change anything, most of the advice in these answers is up for grabs.
(The retraction) When you have a column with "low cardinality" such as your col%, with only 0, 1, and NULL as possible values, INDEX(col1) is essentially useless, since it is faster to blindly scan the table than to use the index.
On the other hand, INDEX(col1, ...) may be useful, as mentioned for the second table.
However neither is useful for the first table. If you have such an INDEX, it will be ignored.
Then comes "covering". Again, your example is unrealistically simplistic because there are essentially no fields touched other than id and col1. A "covering" index includes all the fields of a table that are touched in the query. A covering index is virtually always smaller than the data, so it takes less effort to run through a covering index, hence faster.
(Retract the retraction) INDEX(col1, id), in that order is a useful covering index for the first table.
Imagine how my discussion had gone if you had not mentioned that col1 had only 3 values. Quite different.
And we have not gotten to ORDER BY, IN(...), BETWEEN...AND..., engine differences, tricks with the PRIMARY KEY, LEFT JOIN, etc.
More insight into building indexes from Selects.
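The covering-index point above can be illustrated with a small runnable sketch. It uses Python's sqlite3 rather than MySQL (sqlite reports "USING COVERING INDEX" where MySQL's EXPLAIN shows "Using index" in Extra), and the table and data are invented:

```python
import sqlite3

# Invented table and data; id is the rowid, so INDEX(col1, id) lets the
# query below be answered from the index alone ("covering").
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE big_table_1 (id INTEGER PRIMARY KEY, col1 INTEGER);
CREATE INDEX idx_col1_id ON big_table_1 (col1, id);
""")
con.executemany("INSERT INTO big_table_1 VALUES (?, ?)",
                [(i, i % 2) for i in range(1000)])

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM big_table_1 WHERE col1 = 1"
).fetchall()
detail = " ".join(row[3] for row in plan)
print(detail)
```

The plan text names the covering index instead of the base table, which is exactly the effect described above: the engine never has to touch the table rows.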
ANALYZE TABLE should not be necessary.
For kicks, try it with a covering index (a composite of id, col1).
So: one index, and make it the primary composite key. No other indexes.
Then run ANALYZE TABLE xxx (3 times total, once per table).
Then fire it off, hoping the MySQL cost-based optimizer isn't too dense to figure it out.
A second idea is to see the results without a WHERE clause: move it all into the JOIN ... ON clauses.
Have you tried this:
SELECT t1.id
FROM
(SELECT id from big_table_1 where col1 = 1) AS t1
INNER JOIN (SELECT id from big_table_2 where col1 = 1) AS t2 ON t2.id = t1.id
INNER JOIN (SELECT id from big_table_3 where col1 = 1) AS t3 ON t3.id = t1.id
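As a sanity check that this rewrite is equivalent, here is a small sketch using Python's sqlite3 with invented data (whether the derived-table form is actually faster in MySQL depends on version and indexes, and has to be measured):

```python
import sqlite3

# Invented data: in table k, col1 = 1 for every (k+2)-th id, so the
# three-way intersection is exactly the multiples of 12.
con = sqlite3.connect(":memory:")
for k, t in enumerate(("big_table_1", "big_table_2", "big_table_3")):
    con.execute(f"CREATE TABLE {t} (id INTEGER PRIMARY KEY, col1 INTEGER)")
    con.executemany(f"INSERT INTO {t} VALUES (?, ?)",
                    [(i, 1 if i % (k + 2) == 0 else 0) for i in range(200)])

# Original plain three-way join.
plain = con.execute("""
SELECT t1.id
FROM big_table_1 AS t1
INNER JOIN big_table_2 AS t2 ON t2.id = t1.id
INNER JOIN big_table_3 AS t3 ON t3.id = t1.id
WHERE t1.col1 = 1 AND t2.col1 = 1 AND t3.col1 = 1
""").fetchall()

# Derived-table rewrite from the answer above.
derived = con.execute("""
SELECT t1.id
FROM (SELECT id FROM big_table_1 WHERE col1 = 1) AS t1
INNER JOIN (SELECT id FROM big_table_2 WHERE col1 = 1) AS t2 ON t2.id = t1.id
INNER JOIN (SELECT id FROM big_table_3 WHERE col1 = 1) AS t3 ON t3.id = t1.id
""").fetchall()

print(len(plain), len(derived))
```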
I have the following MySQL query, which takes about 3 minutes to run. It does have 2 subqueries, but the tables have very few rows. When doing an EXPLAIN, it looks like "Using temporary" might be the culprit. Apparently the database is creating a temporary table for all three queries, as noted by the "Using temporary" designation below.
What confused me is that the MySQL documentation says that Using temporary is generally caused by GROUP BY and ORDER BY, neither of which I'm using. Do the subqueries cause an implicit GROUP BY or ORDER BY? Are the subqueries causing a temporary table to be necessary regardless of GROUP BY or ORDER BY? Any recommendations on how to restructure this query so MySQL can handle it more efficiently? Any other tuning ideas for the MySQL settings?
mysql> explain
SELECT DISTINCT COMPANY_ID, COMPANY_NAME
FROM COMPANY
WHERE ID IN (SELECT DISTINCT ID FROM CAMPAIGN WHERE CAMPAIGN_ID IN (SELECT
DISTINCT CAMPAIGN_ID FROM AD
WHERE ID=10 AND (AD_STATUS='R' OR AD_STATUS='T'))
AND (STATUS_CODE='L' OR STATUS_CODE='A' OR STATUS_CODE='C'));
+----+--------------------+----------+------+---------------+------+---------+------+------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+----------+------+---------------+------+---------+------+------+------------------------------+
| 1 | PRIMARY | COMPANY | ALL | NULL | NULL | NULL | NULL | 1207 | Using where; Using temporary |
| 2 | DEPENDENT SUBQUERY | CAMPAIGN | ALL | NULL | NULL | NULL | NULL | 880 | Using where; Using temporary |
| 3 | DEPENDENT SUBQUERY | AD | ALL | NULL | NULL | NULL | NULL | 264 | Using where; Using temporary |
+----+--------------------+----------+------+---------------+------+---------+------+------+------------------------------+
thanks!
Phil
I don't know the structure of your schema, but I would try the following:
CREATE INDEX i_company_id ON company(id); -- should it be a Primary Key?..
CREATE INDEX i_campaign_id ON campaign(id); -- same, PK here?
CREATE INDEX i_ad_id ON ad(id); -- the same question applies
ANALYZE TABLE company, campaign, ad;
And your query can be simplified like this:
SELECT DISTINCT c.company_id, c.company_name
FROM company c
JOIN campaign cg ON c.id = cg.id
JOIN ad ON cg.campaign_id = ad.campaign_id
WHERE ad.id = 10
AND ad.ad_status IN ('R', 'T')
AND cg.status_code IN ('L', 'A', 'C');
The DISTINCT clauses in the subqueries are slowing things down significantly for you; the final one is sufficient.
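To convince yourself the flattened join is equivalent, here is a quick check with Python's sqlite3 on invented toy data (the column placement mirrors the question -- status_code on campaign, ad_status on ad -- adjust if your schema differs):

```python
import sqlite3

# Invented toy rows; only the shape of the schema follows the question.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE company  (id INTEGER PRIMARY KEY, company_id INTEGER, company_name TEXT);
CREATE TABLE campaign (id INTEGER, campaign_id INTEGER, status_code TEXT);
CREATE TABLE ad       (id INTEGER, campaign_id INTEGER, ad_status TEXT);
INSERT INTO company  VALUES (1, 100, 'Acme'), (2, 200, 'Globex');
INSERT INTO campaign VALUES (1, 11, 'L'), (2, 11, 'X'), (2, 12, 'A');
INSERT INTO ad       VALUES (10, 11, 'R'), (10, 12, 'Z'), (99, 12, 'T');
""")

# Original nested-IN form.
nested = con.execute("""
SELECT DISTINCT company_id, company_name FROM company
WHERE id IN (SELECT DISTINCT id FROM campaign
             WHERE campaign_id IN (SELECT DISTINCT campaign_id FROM ad
                                   WHERE id = 10 AND ad_status IN ('R','T'))
             AND status_code IN ('L','A','C'))
""").fetchall()

# Flattened join form.
flat = con.execute("""
SELECT DISTINCT c.company_id, c.company_name
FROM company c
JOIN campaign cg ON c.id = cg.id
JOIN ad ON cg.campaign_id = ad.campaign_id
WHERE ad.id = 10 AND ad.ad_status IN ('R','T')
  AND cg.status_code IN ('L','A','C')
""").fetchall()

print(nested, flat)
```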
Can the following query be optimized? What indexes can be created?
SELECT column_a
FROM Table_b
JOIN Table_a
WHERE Table_B.ID_b = Table_A.ID_a
OR Table_B.ID_b = Table_A.ID_b;
Your query should actually be:
SELECT column_a
FROM Table_b
JOIN Table_a ON Table_B.ID_b IN (Table_A.ID_a, Table_A.ID_b)
If you don't provide ON criteria with the JOIN, MySQL treats it as a CROSS JOIN -- the result is a Cartesian product (that's bad, unless that's really what you want). If I knew which table column_a came from, I might suggest a different approach to the query...
Index the following:
Table_B.ID_b
Table_A.ID_a
Table_A.ID_b
The two columns in TABLE_A could be combined into one composite index, rather than indexed separately.
If the ID_x fields are keys (primary or unique), this should already be pretty good. (I.e., if they're not, you should make sure that all fields affected by the WHERE part are indexed.)
Consider posting an EXPLAIN of the query:
EXPLAIN SELECT column_a FROM Table_b JOIN Table_a
WHERE Table_B.ID_b = Table_A.ID_a OR Table_B.ID_b = Table_A.ID_b;
From comments:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | Table_b | index | INDEX_ON_ID_b | INDEX_ON_ID_b | 3 | NULL | 1507 | Using index; Using temporary |
| 1 | SIMPLE | Table_a | ALL |ID_a,ID_b,ID_a_column_a, ID_b_column_a_index | NULL | NULL | NULL | 29252089 | Range checked for each record (index map: 0x306) |
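Given the "Range checked for each record" on Table_a, one rewrite worth trying is splitting the OR into a UNION of two single-condition joins, each of which can use a plain index. This sketch uses Python's sqlite3 with invented data and checks only that the two forms return the same rows (note that UNION also deduplicates; if duplicate rows matter to you, UNION ALL changes the semantics):

```python
import sqlite3

# Invented toy rows mirroring the Table_A / Table_B shape.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Table_A (ID_a INTEGER, ID_b INTEGER, column_a TEXT);
CREATE TABLE Table_B (ID_b INTEGER);
INSERT INTO Table_A VALUES (1, 5, 'x'), (2, 6, 'y'), (3, 3, 'z');
INSERT INTO Table_B VALUES (1), (6), (3);
""")

# Original OR-in-the-join form.
or_form = con.execute("""
SELECT column_a FROM Table_B
JOIN Table_A ON Table_B.ID_b = Table_A.ID_a
             OR Table_B.ID_b = Table_A.ID_b
""").fetchall()

# UNION rewrite: two joins, each on a single indexable condition.
union_form = con.execute("""
SELECT column_a FROM Table_B JOIN Table_A ON Table_B.ID_b = Table_A.ID_a
UNION
SELECT column_a FROM Table_B JOIN Table_A ON Table_B.ID_b = Table_A.ID_b
""").fetchall()

print(or_form, union_form)
```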
This question already has answers here:
Explicit vs implicit SQL joins
Is there a difference in performance (in mysql) between
Select * from Table1 T1
Inner Join Table2 T2 On T1.ID = T2.ID
And
Select * from Table1 T1, Table2 T2
Where T1.ID = T2.ID
?
As pulled from the accepted answer in question 44917:
Performance wise, they are exactly the same (at least in SQL Server), but be aware that they are deprecating the implicit outer join syntax.
In MySQL the results are the same.
I would personally stick with joining tables explicitly... that is the "socially accepted" way of doing it.
They are the same. This can be seen by running the EXPLAIN command:
mysql> explain Select * from Table1 T1
-> Inner Join Table2 T2 On T1.ID = T2.ID;
+----+-------------+-------+-------+---------------+---------+---------+------+------+---------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+---------------------------------------------+
| 1 | SIMPLE | T1 | index | PRIMARY | PRIMARY | 4 | NULL | 4 | Using index |
| 1 | SIMPLE | T2 | index | PRIMARY | PRIMARY | 4 | NULL | 4 | Using where; Using index; Using join buffer |
+----+-------------+-------+-------+---------------+---------+---------+------+------+---------------------------------------------+
2 rows in set (0.00 sec)
mysql> explain Select * from Table1 T1, Table2 T2
-> Where T1.ID = T2.ID;
+----+-------------+-------+-------+---------------+---------+---------+------+------+---------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+---------------------------------------------+
| 1 | SIMPLE | T1 | index | PRIMARY | PRIMARY | 4 | NULL | 4 | Using index |
| 1 | SIMPLE | T2 | index | PRIMARY | PRIMARY | 4 | NULL | 4 | Using where; Using index; Using join buffer |
+----+-------------+-------+-------+---------------+---------+---------+------+------+---------------------------------------------+
2 rows in set (0.00 sec)
Well, a late answer from me, as I've been analyzing the performance of an older application which uses the comma-based join instead of the INNER JOIN clause.
So here are two tables that are joined (both have more than 100,000 records). When executing the query with the comma-based join, it takes a lot longer than the INNER JOIN version.
When I analyzed the EXPLAIN output, I found that the comma-join query was using the join buffer, whereas the INNER JOIN query had 'Using where'.
The queries also differ significantly, as shown in the rows column of the EXPLAIN output.
These are my queries and their respective explain results.
explain select count(*) FROM mbst a , his_moneypv2 b
WHERE b.yymm IN ('200802','200811','201001','201002','201003')
AND a.tel3 != ''
AND a.mb_no = b.mb_no
AND b.true_grade_class IN (3,6)
OR b.grade_class IN (4,7);
+----+-------------+-------+-------------+----------------------------------------------------------------+--------------------------------------+---------+------+--------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------------+----------------------------------------------------------------+--------------------------------------+---------+------+--------+---------------------------------------------------------------------+
| 1 | SIMPLE | b | index_merge | PRIMARY,mb_no,yymm,yymm_2,idx_true_grade_class,idx_grade_class | idx_true_grade_class,idx_grade_class | 5,9 | NULL | 16924 | Using sort_union(idx_true_grade_class,idx_grade_class); Using where |
| 1 | SIMPLE | a | ALL | PRIMARY | NULL | NULL | NULL | 134472 | Using where; Using join buffer |
+----+-------------+-------+-------------+----------------------------------------------------------------+--------------------------------------+---------+------+--------+---------------------------------------------------------------------+
v/s
explain select count(*) FROM mbst a inner join his_moneypv2 b
on a.mb_no = b.mb_no
WHERE b.yymm IN ('200802','200811','201001','201002','201003')
AND a.tel3 != ''
AND b.true_grade_class IN (3,6)
OR b.grade_class IN (4,7);
+----+-------------+-------+-------------+----------------------------------------------------------------+--------------------------------------+---------+--------------------+-------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------------+----------------------------------------------------------------+--------------------------------------+---------+--------------------+-------+---------------------------------------------------------------------+
| 1 | SIMPLE | b | index_merge | PRIMARY,mb_no,yymm,yymm_2,idx_true_grade_class,idx_grade_class | idx_true_grade_class,idx_grade_class | 5,9 | NULL | 16924 | Using sort_union(idx_true_grade_class,idx_grade_class); Using where |
| 1 | SIMPLE | a | eq_ref | PRIMARY | PRIMARY | 62 | shaklee_my.b.mb_no | 1 | Using where |
+----+-------------+-------+-------------+----------------------------------------------------------------+--------------------------------------+---------+--------------------+-------+---------------------------------------------------------------------+
Actually, they are virtually the same. The JOIN / ON form is the newer ANSI syntax, the WHERE form is the older ANSI syntax. Both are recognized by query engines.
The comma in a FROM clause is a CROSS JOIN. We can imagine that the SQL server has a SELECT-query execution procedure which looks something like this:
1. iterate through every table
2. find rows that meet join predicate and put it into result table.
3. from the result table, get only those rows that meets the where condition.
If it really works like that, then using a CROSS JOIN on tables with a few thousand rows each could allocate a lot of memory, since every row is combined with every other row before the WHERE condition is examined. Your SQL server could be quite busy then.
I would think so, because the first example explicitly tells MySQL which columns to join and how to join them, whereas in the second MySQL has to try to figure out where you want to join.
The second query is just another notation for an inner join, so if there is a difference in performance, it's only because one query can be parsed faster than the other -- and that difference, if it exists, will be so tiny that you won't notice it.
For more information you could take a look at this question (and use the search on SO next time before asking a question that has already been answered).
The first query is easier to understand for MySQL so it is likely that the execution plan will be better and that the query will run faster.
The second query, without the where clause, is a cross join. If MySQL is able to understand the where clause well enough, it will do its best to avoid cross-joining all the rows, but nothing guarantees that.
In a case as simple as yours, the performance will be strictly identical.
Performance wise, the first query will always be better or identical to the second one. And from my point of view it is also a lot easier to understand when rereading.