I have two queries, the first one (inner join) is super fast, and the second one (left join) is super slow. How do I make the second query fast?
EXPLAIN SELECT saved.email FROM saved INNER JOIN finished ON finished.email = saved.email;
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE finished index NULL email 258 NULL 32168 Using index
1 SIMPLE saved ref email email 383 func 1 Using where; Using index
EXPLAIN SELECT saved.email FROM saved LEFT JOIN finished ON finished.email = saved.email;
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE saved index NULL email 383 NULL 40971 Using index
1 SIMPLE finished index NULL email 258 NULL 32168 Using index
Edit: I have added table info for both tables below.
CREATE TABLE `saved` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`slug` varchar(255) DEFAULT NULL,
`email` varchar(127) NOT NULL,
[omitted fields include varchar, text, longtext, int],
PRIMARY KEY (`id`),
KEY `slug` (`slug`),
KEY `email` (`email`)
) ENGINE=MyISAM AUTO_INCREMENT=56329 DEFAULT CHARSET=utf8;
CREATE TABLE `finished` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`slug` varchar(255) DEFAULT NULL,
`submitted` int(11) DEFAULT NULL,
`status` int(1) DEFAULT '0',
`name` varchar(255) DEFAULT NULL,
`email` varchar(255) DEFAULT NULL,
[omitted fields include varchar, text, longtext, int],
PRIMARY KEY (`id`),
KEY `assigned_user_id` (`assigned_user_id`),
KEY `event_id` (`event_id`),
KEY `slug` (`slug`),
KEY `email` (`email`),
KEY `city_id` (`city_id`),
KEY `status` (`status`),
KEY `recommend` (`recommend`),
KEY `pending_user_id` (`pending_user_id`),
KEY `submitted` (`submitted`)
) ENGINE=MyISAM AUTO_INCREMENT=33063 DEFAULT CHARSET=latin1;
With INNER JOIN, MySQL generally will start with the table with the smallest number of rows. In this case, it starts with table finished and does a look up for the corresponding record in saved using the index on saved.email.
For a LEFT JOIN, (excluding some optimizations) MySQL generally joins the records in order (starting with the left most table). In this case, MySQL starts with the table saved, then attempts to find each corresponding record in finished. Since there is no usable index on finished.email, it must do a full scan for each look up.
Edit
Now that you've posted your schema, I can see that MySQL is ignoring the index on finished.email because the join crosses from the utf8 character set to latin1. You've not posted the character sets and collations for each column, so I'm going by the default character set of each table. The collations must be compatible in order for MySQL to use the index.
MySQL can coerce (upgrade) a latin1 collation, which is very limited, up to a utf8 collation such as utf8_unicode_ci. So the first query can use the index on saved.email by upgrading the latin1 value to utf8. The opposite is not true: the second query can't use the index on finished.email, since MySQL can't downgrade a utf8 collation to latin1.
The solution is to change both email columns to a compatible collation, perhaps most easily by giving them identical character sets and collations.
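A sketch of that change (assuming the columns follow the table defaults shown above, so finished.email is latin1; utf8_general_ci here is just utf8's default collation, pick whichever collation you actually use, and back up before altering):
ALTER TABLE finished
MODIFY `email` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL;
With both columns in utf8, the lookup into finished.email can use its index again.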
The LEFT JOIN query is slower than the INNER JOIN query because it's doing more work.
From the EXPLAIN output, it looks like MySQL is doing nested loop join. (There's nothing wrong with nested loops; I think that's the only join operation that MySQL uses in version 5.5 and earlier.)
For the INNER JOIN query, MySQL is using an efficient "ref" (index lookup) operation to locate the matching rows.
But for the LEFT JOIN query, it looks like MySQL is doing a full scan of the index to find the matching rows. So, with the nested loops join operation, MySQL is doing a full index scan for each row from the other table. That's on the order of tens of thousands of scans, and each of those scans is inspecting tens of thousands of rows.
Using the estimated row counts from the EXPLAIN output, that's going to require (40971*32168=) 1,317,955,128 string comparisons.
The INNER JOIN query avoids a lot of that work, so it's a lot faster. (It's avoiding all those string comparisons by using an index lookup operation.)
-- LEFT JOIN
id select table type key key_len ref rows Extra
-- ------ -------- ----- ----- ------- ---- ----- ------------------------
1 SIMPLE saved index email 383 NULL 40971 Using index
1 SIMPLE finished index email 258 NULL 32168 Using index
-- INNER JOIN
id select table type key key_len ref rows Extra
-- ------ -------- ----- ----- ------- ---- ----- ------------------------
1 SIMPLE finished index email 258 NULL 32168 Using index
1 SIMPLE saved ref email 383 func 1 Using where; Using index
^^^^^ ^^^^ ^^^^^ ^^^^^^^^^^^^
NOTE: Markus Adams spied the difference in character set in the email columns of the CREATE TABLE statements that were added to your question.
I believe that it's the difference in character set that's preventing MySQL from using an index for your query.
Q2: How do I make the LEFT JOIN query faster?
A: I don't believe it's going to be possible to get that specific query to run faster without a schema change, such as changing the character set of the two email columns to match.
The only effect of the "outer join" to the finished table appears to be to produce "duplicate" rows whenever more than one matching row is found. I'm not understanding why the outer join is needed. Why not just get rid of it altogether, and just do:
SELECT saved.email FROM saved
I'm afraid more info will probably be needed.
However, inner joins eliminate any row that has a null foreign key (no match, if you will). This means that there are fewer rows to scan and associate.
For a left join, however, any non-match needs to be given a blank row, so all of the rows are scanned regardless; nothing can be eliminated.
This makes the data set larger and requires more resources to process. Also, when you write your SELECT, don't do SELECT *; instead, explicitly state which columns you want.
The data types of saved.email and finished.email differ in two respects. First, they have different lengths. Second, finished.email can be NULL. So, your LEFT JOIN operation can't exploit the index on finished.email.
Can you change the definition of finished.email to this, so it matches the field you're joining it with?
`email` varchar(127) NOT NULL
If you do you'll probably get a speedup.
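A sketch of that change (assumption: rows where email is NULL must be dealt with before adding NOT NULL; here they're mapped to empty strings, which may or may not suit your data):
UPDATE finished SET email = '' WHERE email IS NULL;
ALTER TABLE finished MODIFY `email` varchar(127) NOT NULL;
(Per the other answers, you may also want to add CHARACTER SET utf8 to the MODIFY so the column matches saved.email's character set.)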
Related
I am joining with a table and noticed that if the field I join on has a varchar size that's too large, then MySQL doesn't use the index for that field in the join, resulting in a significantly longer query time. I've put the EXPLAINs and table definition below. This is MySQL 5.7. Any ideas why this is happening?
Table definition:
CREATE TABLE `LotRecordsRaw` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`lotNumber` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`scrapingJobId` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `lotNumber_UNIQUE` (`lotNumber`),
KEY `idx_Lot_lotNumber` (`lotNumber`)
) ENGINE=InnoDB AUTO_INCREMENT=14551 DEFAULT CHARSET=latin1;
Explains:
explain
(
select lotRecord.*
from LotRecordsRaw lotRecord
left join (
select lotNumber, max(scrapingJobId) as id
from LotRecordsRaw
group by lotNumber
) latestJob on latestJob.lotNumber = lotRecord.lotNumber
)
produces an EXPLAIN showing that the derived table is not using the index on "lotNumber" (the original post showed this as a screenshot, omitted here). In that example, the "lotNumber" field was a varchar(255). If I change it to a smaller size, e.g. varchar(45), then the EXPLAIN shows the index being used (screenshot likewise omitted), and the query runs orders of magnitude faster (2 seconds instead of 100 seconds). What's going on here?
Hooray! You found an optimization reason for not blindly using 255 in VARCHAR.
Please try 191 and 192 -- I want to know if that is the cutoff.
Meanwhile, I have some other comments:
A UNIQUE is a KEY. That is, idx_Lot_lotNumber is redundant and may as well be removed.
The Optimizer can (and probably would) use INDEX(lotNumber, scrapingJobId) as a much faster way to find those MAXes.
Unfortunately, there is no way to specify "make a unique index on lotNumber, but also have that other column in the index."
Wait! With lotNumber being unique, there is only one row per lotNumber. That means MAX and GROUP BY are totally unnecessary!
It seems like lotNumber could be promoted to PRIMARY KEY (and completely get rid of id).
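(If 191 really is the cutoff, a plausible explanation is InnoDB's 767-byte index-prefix limit: 191 utf8mb4 characters is 764 bytes, while 192 is 768, just over the line.) A sketch of the cleanup suggested above; this assumes lotNumber contains no NULLs and can truly serve as the primary key, so test on a copy first:
ALTER TABLE LotRecordsRaw
DROP INDEX idx_Lot_lotNumber; -- redundant: lotNumber_UNIQUE already covers it
ALTER TABLE LotRecordsRaw
DROP COLUMN id, -- dropping the column drops the old PRIMARY KEY with it
ADD PRIMARY KEY (lotNumber), -- requires lotNumber to be free of NULLs
DROP INDEX lotNumber_UNIQUE; -- superseded by the new primary key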
I'm running into a strange issue and I'm not sure how to google the scenario properly. I have 2 tables supplied below:
CREATE TABLE revenue_centers (
revc_id int(11) NOT NULL AUTO_INCREMENT,
revc_name varchar(50) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (revc_id)
) ENGINE=InnoDB AUTO_INCREMENT=51 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE cpt (
cpt_id int(11) NOT NULL AUTO_INCREMENT,
cpt_revc_id int(11) NOT NULL,
PRIMARY KEY (cpt_id),
KEY cpt_revc_id (cpt_revc_id)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
There are more columns but the ones above are all that should be needed. These tables are duplicated across separate databases for several customers. Below is one of the queries I have that is causing me the issue:
select * from cpt left join revenue_centers on cpt_revc_id=revc_id order by cpt_id;
With EXPLAIN, on tables that have at least 4 rows in revenue_centers, we get the following:
id select_type table type possible_keys key key_len ref rows
1 SIMPLE cpt index NULL PRIMARY 4 NULL 89664
1 SIMPLE revenue_centers eq_ref PRIMARY PRIMARY 4 _8.cpt.cpt_revc_id 1
The query is quick and there is no lag. For customers that don't use revenue_centers, we get the following:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE cpt ALL NULL NULL NULL NULL 22725 Using temporary; Using filesort
1 SIMPLE revenue_centers ALL PRIMARY NULL NULL NULL 1 Using where; Using join buffer (Block Nested Loop)
I understand that, because a full scan of revenue_centers is likely cheaper than going through an index of 1 row, MySQL chooses to go with the join buffer. I do not understand why it chooses to do the full scan on the cpt table.
If I use the FORCE INDEX hint on the join, it turns into an eq_ref type join and the cpt query no longer uses a temporary table or the filesort. If possible, I would like to avoid adding this into our code. I've attempted to set max_seeks_for_key as low as 1 to see if I could force the optimizer to use the primary key without the index hint but it refuses to change.
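For reference, the hinted form looks roughly like this (assuming the hint targets the primary key of revenue_centers):
select * from cpt
left join revenue_centers force index (PRIMARY) on cpt_revc_id = revc_id
order by cpt_id;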
Aside from adding the FORCE INDEX to our codebase, is there any way I can improve this query or at least prevent the use of a temporary table?
I have a fairly simple query that is taking about 14 seconds to complete and I would like to speed it up. I think I have the correct indexes in place, but I'm not sure...
Here is the query
SELECT *
FROM opportunities
WHERE cid = 7785
AND STATUS != 4
AND otype != 200
AND links > 0
AND ontopic != 'F'
ORDER BY links DESC
LIMIT 0, 100;
Here is the table schema
CREATE TABLE `opportunities` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`cid` int(11) NOT NULL,
`url` varchar(900) CHARACTER SET utf8 NOT NULL,
`status` tinyint(4) NOT NULL,
`links` int(11) NOT NULL,
`otype` int(11) NOT NULL,
`reserved` tinyint(4) NOT NULL,
`ontopic` varchar(3) CHARACTER SET utf8 NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `cid` (`cid`,`url`),
KEY `cid1` (`cid`),
KEY `url` (`url`),
KEY `otype` (`otype`),
KEY `reserved` (`reserved`),
KEY `ontopic` (`ontopic`),
KEY `status` (`status`),
KEY `links` (`links`),
KEY `ontopic_links` (`ontopic`,`links`),
KEY `cid_status_otype_links_ontopic` (`cid`,`status`,`otype`,`links`,`ontopic`)
) ENGINE=InnoDB AUTO_INCREMENT=13022832 DEFAULT CHARSET=latin1
Here is the result of the EXPLAIN command
id: 1
select_type: Simple
table: opportunities
partitions: null
type: range
possible_keys: cid,cid1,otype,ontopic,status,links,ontopic_links,cid_status_otype_links_ontopic
key: links
key_len: 4
ref: null
rows: 1531552
filtered: 0.33
Extra: Using index condition; Using where
Thoughts / Questions
Am I reading it correctly that it is using the "links" key to do the query? Why wouldn't it use a more complete index, like the cid_status_otype_links_ontopic which covers all the conditions of my query?
Thanks in advance!
As requested
There are 30,961 results that match the query when you remove the LIMIT 0,100. Interestingly, the "count()" command returns almost instantaneously.
It's a funny thing about inequality comparisons: they count as range conditions.
That is, equality matches one value, but anything other than equality (!=, >, <, IN, BETWEEN) matches multiple values.
Because a range condition matches multiple values, only the columns of an index up to and including the first column used in a range condition can be used. You'd think that your index cid_status_otype_links_ontopic has all the columns mentioned in the conditions of your query, but only the first two will be used: the first because you have an equality comparison for cid, the second because the next column is used in an inequality comparison, and that's where the index usage stops.*
Evidence: if you can force that index to be used, you should see the key_len field of the EXPLAIN result show only 5, which is the size of cid (4 bytes) + status (1 byte).
The MySQL optimizer apparently has predicted that it would be more beneficial to use your links index, because that allows it to access the rows in index order, which is the same as the sort order you requested with your ORDER BY.
Evidence: you don't see "Using filesort" in your EXPLAIN notes.
Is that really better than using one of the other indexes? Maybe, maybe not. The optimizer's predictions aren't always perfect.
You can use an index hint to override the optimizer's choice:
SELECT * FROM opportunities USE INDEX (cid_status_otype_links_ontopic) WHERE ...
Try that out, do the EXPLAIN of that query and compare it to your other EXPLAIN. Then execute both queries and see which is reliably faster.
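Spelled out with the original predicates, that's:
SELECT *
FROM opportunities USE INDEX (cid_status_otype_links_ontopic)
WHERE cid = 7785
AND status != 4
AND otype != 200
AND links > 0
AND ontopic != 'F'
ORDER BY links DESC
LIMIT 0, 100;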
(* Actually, I have to add a footnote about the index column usage. MySQL 5.6 and later can do a little bit better than just the two columns, when you see the note "Using Index Condition" in the EXPLAIN. But it's not quite the same. You can read more about that here: https://dev.mysql.com/doc/refman/5.6/en/index-condition-pushdown-optimization.html)
What you have must plow through all of the rows, using your 5-column index, then sort the results and deliver 100 rows.
The only index likely to be useful is INDEX(cid, links). This is because cid is the only column being tested with =, then having links might be useful for the ORDER BY and LIMIT. There is still the risk that the != tests will require filtering a lot of rows.
Are status and otype multi-valued? If either has only 2 values, then turning the != into = and adding it to the index would be beneficial.
Do you really need all the columns (SELECT *)? If not, and if you don't need any big columns (url), then you could go with a 'covering' index.
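A sketch of the suggested two-column index (the name cid_links is arbitrary):
ALTER TABLE opportunities ADD INDEX cid_links (cid, links);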
More on writing indexes.
I'm trying to populate some data for a table. The query is being run on a table that contains ~50 million records. The query I'm currently using is below. It counts the number of rows that match the template id and are BETWEEN two unix timestamps:
SELECT COUNT(*) as count FROM `s_log`
WHERE `time_sent` BETWEEN '1346904000' AND '1346993271'
AND `template` = '1'
While the query above does work, performance is rather slow while looping through each template, which at times can number in the hundreds. The timestamps are stored as int and are properly indexed. Just to test things out, I tried running the query below, omitting the time_sent restriction:
SELECT COUNT(*) as count FROM `s_log`
WHERE `template` = '1'
As expected, it runs very fast, but is obviously not restricting count results inside the correct time frame. How can I obtain a count for a specific template AND restrict that count BETWEEN two unix timestamps?
EXPLAIN:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | s_log | ref | time_sent,template | template | 4 | const | 71925 | Using where
SHOW CREATE TABLE s_log:
CREATE TABLE `s_log` (
`id` int(255) NOT NULL AUTO_INCREMENT,
`email` varchar(255) NOT NULL,
`time_sent` int(25) NOT NULL,
`template` int(55) NOT NULL,
`key` varchar(255) NOT NULL,
`node_id` int(55) NOT NULL,
`status` varchar(55) NOT NULL,
PRIMARY KEY (`id`),
KEY `email` (`email`),
KEY `time_sent` (`time_sent`),
KEY `template` (`template`),
KEY `node_id` (`node_id`),
KEY `key` (`key`),
KEY `status` (`status`),
KEY `timestamp` (`timestamp`)
) ENGINE=MyISAM AUTO_INCREMENT=2078966 DEFAULT CHARSET=latin1
The best index you can have in this case is a composite one on (template, time_sent):
CREATE INDEX template_time_sent ON s_log (template, time_sent)
PS: Also, as long as all the columns in your query are integers, DON'T enclose their values in quotes (in some cases it could lead to issues, at least with older MySQL versions).
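With that index in place, the original query (quotes dropped, per the PS) would read:
SELECT COUNT(*) AS count FROM `s_log`
WHERE `template` = 1
AND `time_sent` BETWEEN 1346904000 AND 1346993271;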
First, you have to create an index that has both of your columns together (not separately). Also check your table type; I think it would work well if your table is InnoDB.
And lastly, use your WHERE clause in this fashion:
WHERE `template` = '1' AND `time_sent` BETWEEN '1346904000' AND '1346993271'
What this does is first check if template is 1; if it is, it checks the second condition, else it skips the row. This will definitely give you a performance edge.
If you have to call the query for each template, it might be faster to get all the information with one query call by using GROUP BY:
SELECT template, COUNT(*) as count FROM `s_log`
WHERE `time_sent` BETWEEN 1346904000 AND 1346993271
GROUP BY template;
It's just a guess that this would be faster, and you would also have to redesign your code a bit.
You could also try using InnoDB instead of MyISAM. InnoDB uses a clustered index, which may perform better on large tables. From the MySQL documentation:
Accessing a row through the clustered index is fast because the row data is on the same page where the index search leads. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record. (For example, MyISAM uses one file for data rows and another for index records.)
There are some questions on Stack Overflow which discuss the performance differences between InnoDB and MyISAM:
Should I use MyISAM or InnoDB Tables for my MySQL Database?
Migrating from MyISAM to InnoDB
MyISAM versus InnoDB
I am having some issues with a group query with MySQL.
Question
Is there a reason why a query won't use a 10 character partial index on a varchar(255) field to optimize a group by?
Details
My setup:
CREATE TABLE `sessions` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) DEFAULT NULL,
`ref_source` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`guid` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`initial_path` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`referrer_host` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`campaign` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_sessions_on_user_id` (`user_id`),
KEY `index_sessions_on_referrer_host` (`referrer_host`(10)),
KEY `index_sessions_on_initial_path` (`initial_path`(10)),
KEY `index_sessions_on_campaign` (`campaign`(10))
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
A number of columns and indexes are not shown here since they don't really impact the issue.
What I want to do is run a query to see all of the referring hosts and the number of sessions coming from each. I don't have a huge table, but it is big enough that full table scans aren't fun. The query I want to run is:
SELECT COUNT(*) AS count_all, referrer_host AS referrer_host FROM `sessions` GROUP BY referrer_host;
The explain gives:
+----+-------------+----------+------+---------------+------+---------+------+--------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+------+---------------+------+---------+------+--------+---------------------------------+
| 1 | SIMPLE | sessions | ALL | NULL | NULL | NULL | NULL | 303049 | Using temporary; Using filesort |
+----+-------------+----------+------+---------------+------+---------+------+--------+---------------------------------+
I have a partial index on referrer_host, but it isn't using it. Even if I try to USE INDEX or FORCE INDEX it doesn't help. The explain is the same, as is the performance.
If I add a full index on referrer_host, instead of a 10 character partial index, everything works much better, if not instantly (350ms vs. 10 seconds).
I have tested partial indexes that are bigger than the longest entry in the field to no avail as well. The full index is the only thing that seems to work.
With the full index, the query can scan the entire index and return the number of records pointed to for each unique key; the table isn't touched.
With the partial index, the engine doesn't know the full value of referrer_host until it looks at the record, so it has to scan the whole table!
If most of the values for referrer_host are less than 10 chars, then in theory the optimizer could use the index and then only check rows that have more than 10 chars. But because this is not a clustered index, it would have to make many non-sequential disk reads to find those records. It could end up being even slower, because a table scan will at least be a sequential read. Instead of making assumptions, the optimizer just does a scan.
Try this query:
EXPLAIN SELECT COUNT(referrer_host) AS count_all, referrer_host FROM `sessions` GROUP BY referrer_host;
Now the count will be zero for the group where referrer_host is null (COUNT of a column skips nulls), but I'm uncertain if there's another way around this.
You're grouping on referrer_host for all the rows in the table. As your index doesn't include the full referrer_host (it contains only the first 10 chars!), it's going to scan the whole table.
I'll bet that this is faster, though less detailed:
SELECT COUNT(*) AS count_all, substring(referrer_host,1,10) AS referrer_host FROM `sessions` GROUP BY substring(referrer_host,1,10);
If you need the full referrer, index it.
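A sketch of swapping the partial index for a full one (reusing the index name from the schema above):
ALTER TABLE sessions
DROP INDEX index_sessions_on_referrer_host,
ADD INDEX index_sessions_on_referrer_host (referrer_host);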