Indexing table columns - MySQL

I am developing an application in which I build query strings in the program and pass them to a stored procedure that includes four prepared statements. After the variables are substituted, the statements are as follows:
DROP TABLE IF EXISTS tbl_correlatedData;
CREATE TABLE tbl_correlatedData
SELECT t0.*,t1.counttimeStamplocalIp,t2.countlocalPort,t3.countlocalGeo,t4.countisp,t5.countforeignip,t6.countforeignPort,t7.countforeignGeo,t8.countinfection
FROM tbl_union_threats t0
LEFT JOIN tbl_tsli t1
USING (timeStamp,localIp) LEFT JOIN tbl_tslilp t2 USING (timeStamp,localIp,localPort)
LEFT JOIN tbl_tslilplg t3
USING (timeStamp,localIp,localPort,localGeo)
LEFT JOIN tbl_tslilplgisp t4
USING (timeStamp,localIp,localPort,localGeo,isp)
LEFT JOIN tbl_tslilplgispfi t5
USING (timeStamp,localIp,localPort,localGeo,isp,foreignip)
LEFT JOIN tbl_tslilplgispfifp t6
USING (timeStamp,localIp,localPort,localGeo,isp,foreignip,foreignPort)
LEFT JOIN tbl_tslilplgispfifpfg t7
USING (timeStamp,localIp,localPort,localGeo,isp,foreignip,foreignPort,foreignGeo)
LEFT JOIN tbl_tslilplgispfifpfginf t8 USING (timeStamp,localIp,localPort,localGeo,isp,foreignip,foreignPort,foreignGeo,infection)
GROUP BY timeStamp,localIp;
ALTER TABLE tbl_correlatedData
MODIFY timeStamp VARCHAR(200) NOT NULL,
MODIFY localIp VARCHAR(200) NOT NULL,
MODIFY localPort VARCHAR(200) NOT NULL,
MODIFY localGeo VARCHAR(200) NOT NULL,
MODIFY isp VARCHAR(200) NOT NULL,
MODIFY foreignip VARCHAR(200) NOT NULL,
MODIFY foreignPort VARCHAR(200) NOT NULL,
MODIFY foreignGeo VARCHAR(200) NOT NULL,
MODIFY infection VARCHAR(200) NOT NULL;
CREATE INDEX id_index ON tbl_correlatedData (timeStamp,localIp,localPort,localGeo,isp,foreignIp,foreignPort,foreignGeo,infection);
But when the process gets to the indexing query, it fails with the error:
Incorrect key file for table 'tbl_correlateddata'; try to repair it
FYI:
I am trying this out on Windows Vista 32-bit with 19 GB of free space on the drive, using the XAMPP server, and the table that gets created shows a size of 25 MB in phpMyAdmin.
EDIT:
When I try to repair it using REPAIR TABLE tbl_correlateddata, it gives the following:
Table                                 | Op     | Msg_type | Msg_text
--------------------------------------+--------+----------+-------------------------------------------------------
db_threatanalysis.tbl_correlateddata  | repair | Error    | Table 'db_threatanalysis.tbl_correlateddata' doesn...
db_threatanalysis.tbl_correlateddata  | repair | status   | Operation failed
Thank you very much in advance for the help :)

You are trying to create a compound (multi-column) index, which is just too long for InnoDB.
You have 9 columns, each VARCHAR(200), so the total index width is 1800 characters. According to the MySQL documentation, an InnoDB key is limited to 3072 bytes, so with a single-byte character set you should be OK. But there is no guarantee that ALTER TABLE ... MODIFY ... managed to reduce every column width to 200 or less, so if even one column remained at something like 4000 characters, you will get this error.
Solution:
Reduce the number of fields in your compound index.
Analyze the queries that will run against this generated table, and only create the indexes that are really necessary. I would imagine most of them will be one-column indexes.
Also, it is rather strange that you need VARCHAR(200) to store something as simple as a timestamp, IP, port, etc. You can probably squeeze each of those into 10 bytes or less and call it a day.
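A minimal sketch of that advice (the column choices below are purely illustrative, not taken from the question's actual workload):
-- Narrow the columns that can obviously be narrowed ...
ALTER TABLE tbl_correlatedData
    MODIFY localPort   VARCHAR(10) NOT NULL,
    MODIFY foreignPort VARCHAR(10) NOT NULL;
-- ... and create small single-column indexes only where the later
-- queries actually filter or join.
CREATE INDEX idx_timeStamp ON tbl_correlatedData (timeStamp);
CREATE INDEX idx_localIp   ON tbl_correlatedData (localIp);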

Your key size might be too long. I tried something similar on my local MySQL install. Since I don't have your tables, I could not run the CREATE TABLE statement. As my database is set up for Unicode, my key size was over 4000 bytes. MySQL InnoDB can only create indexes with a key size of up to 3072 bytes.
My code looked like follows:
CREATE TABLE tbl_correlatedData
(
`timeStamp` VARCHAR(200) NOT NULL,
localIp VARCHAR(200) NOT NULL,
localPort VARCHAR(200) NOT NULL,
localGeo VARCHAR(200) NOT NULL,
isp VARCHAR(200) NOT NULL,
foreignip VARCHAR(200) NOT NULL,
foreignPort VARCHAR(200) NOT NULL,
foreignGeo VARCHAR(200) NOT NULL,
infection VARCHAR(200) NOT NULL
);
CREATE INDEX id_index ON tbl_correlatedData
(timeStamp,
localIp,
localPort,
localGeo,
isp,
foreignIp,
foreignPort,
foreignGeo,
infection
);
This resulted in the error:
Error Code: 1071. Specified key was too long; max key length is 3072 bytes
Please read about size limitations here: http://dev.mysql.com/doc/refman/5.5/en/innodb-restrictions.html. I suspect you have this problem.
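If the wide VARCHAR columns have to stay, one possible workaround (not part of the answer above, and the prefix lengths here are purely illustrative) is to index only a prefix of each column so the total key length stays under the limit:
-- Prefix lengths are guesses; pick ones long enough to keep the index selective.
CREATE INDEX id_index ON tbl_correlatedData
(timeStamp(30), localIp(45), localPort(10), foreignip(45), foreignPort(10));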

Index key prefixes can be up to 767 bytes for an InnoDB table, whereas the limit is approximately 1000 bytes for a MyISAM table.
The total index length for MySQL InnoDB is 3072 bytes.
First check the length of your index, and if possible reduce the column sizes, e.g. to VARCHAR(100) for all of them.
If it suits your requirements, create separate single-column indexes instead.
See the links:
http://dev.mysql.com/doc/refman/5.0/en//create-index.html
http://bugs.mysql.com/bug.php?id=6604

Related

Why does MySQL not use index if varchar size is too high?

I am joining with a table and noticed that if the field I join on has a varchar size that's too high, then MySQL doesn't use the index for that field in the join, resulting in a significantly longer query time. I've put the EXPLAIN output and table definition below. This is MySQL 5.7. Any ideas why this is happening?
Table definition:
CREATE TABLE `LotRecordsRaw` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`lotNumber` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`scrapingJobId` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `lotNumber_UNIQUE` (`lotNumber`),
KEY `idx_Lot_lotNumber` (`lotNumber`)
) ENGINE=InnoDB AUTO_INCREMENT=14551 DEFAULT CHARSET=latin1;
Explains:
explain
(
select lotRecord.*
from LotRecordsRaw lotRecord
left join (
select lotNumber, max(scrapingJobId) as id
from LotRecordsRaw
group by lotNumber
) latestJob on latestJob.lotNumber = lotRecord.lotNumber
)
produces:
The screenshot above shows that the derived table is not using the index on "lotNumber". In that example, the "lotNumber" field was a varchar(255). If I change it to be a smaller size, e.g. varchar(45), then the explain query produces this:
The query then runs orders of magnitude faster (2 seconds instead of 100 sec). What's going on here?
Hooray! You found an optimization reason for not blindly using 255 in VARCHAR.
Please try 191 and 192 -- I want to know if that is the cutoff.
Meanwhile, I have some other comments:
A UNIQUE is a KEY. That is, idx_Lot_lotNumber is redundant and may as well be removed.
The Optimizer can (and probably would) use INDEX(lotNumber, scrapingJobId) as a much faster way to find those MAXes.
Unfortunately, there is no way to specify "make a unique index on lotNumber, but also have that other column in the index".
Wait! With lotNumber being unique, there is only one row per lotNumber. That means MAX and GROUP BY are totally unnecessary!
It seems like lotNumber could be promoted to PRIMARY KEY (and completely get rid of id).
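A minimal sketch of those points (assuming lotNumber really is unique, and that a composite index is only wanted if a "latest job" lookup is still needed elsewhere):
-- The UNIQUE constraint already acts as an index, so this one is redundant:
ALTER TABLE LotRecordsRaw DROP INDEX idx_Lot_lotNumber;
-- If MAX(scrapingJobId) per lotNumber is still queried somewhere, a composite
-- index lets it be read straight from the index:
ALTER TABLE LotRecordsRaw ADD INDEX idx_lot_job (lotNumber, scrapingJobId);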

Updating a table with longblobs is too slow

I have a table filled with data (about 20,000 records). I am trying to update it with data from another table, but I hit a timeout (30 seconds).
At first I tried a naive solution:
UPDATE TableWhithBlobs a
JOIN AnotherTable b on a.AnotherTableId = b.Id
SET a.SomeText= b.Description;
This script runs for much longer than 30 seconds, so I tried to get rid of the join:
UPDATE TableWhithBlobs a
SET a.SomeText = (select b.Description from AnotherTable b where a.AnotherTableId = b.Id);
but this one is still very slow. Is there any way to make it fast?
Edit:
A bit of explanation about what I'm doing. Previously I had two tables, called TableWhithBlobs and AnotherTable in my script. TableWhithBlobs stored a link to AnotherTable, but this link was not a real foreign key, just a guid from AnotherTable, and there is a unique key constraint on this guid in TableWhithBlobs. I decided to fix this: remove the old field from TableWhithBlobs and add a normal foreign key to it (using the primary id from AnotherTable). The script from the question just fills this new field with the correct data. After that, I delete the old guid reference and add the new foreign key constraint. Everything works fine with a small amount of data in TableWhithBlobs, but on the QA database with 20,000 rows it is extremely slow.
Update
SHOW CREATE TABLE TableWhithBlobs;
CREATE TABLE `TableWhithBlobs` (
`Id` bigint(20) NOT NULL AUTO_INCREMENT,
`AnotherTableId` char(36) CHARACTER SET ascii NOT NULL,
`ChunkNumber` bigint(20) NOT NULL,
`Content` longblob NOT NULL,
`SomeText` bigint(20) NOT NULL,
PRIMARY KEY (`Id`),
UNIQUE KEY `AnotherTableId` (`AnotherTableId`,`ChunkNumber`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1
SHOW CREATE TABLE AnotherTable ;
CREATE TABLE `AnotherTable` (
`Description` bigint(20) NOT NULL AUTO_INCREMENT,
`Id` char(36) CHARACTER SET ascii NOT NULL,
`Length` bigint(20) NOT NULL,
`ContentDigest` char(68) CHARACTER SET ascii NOT NULL,
`ContentAndMetadataDigest` char(68) CHARACTER SET ascii NOT NULL,
`Status` smallint(6) NOT NULL,
`ChunkStartNumber` bigint(20) NOT NULL DEFAULT '0',
`IsTestData` bit(1) NOT NULL DEFAULT b'0',
PRIMARY KEY (`Description`),
UNIQUE KEY `Id` (`Id`),
UNIQUE KEY `ContentAndMetadataDigest` (`ContentAndMetadataDigest`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1
PS. Column names may look weird because i want to hide the actual production scheme names.
innodb_buffer_pool_size is 134217728, RAM is 4Gb
Result of
explain UPDATE TableWhithBlobs a JOIN AnotherTable b on a.AnotherTableId =
b.Id SET a.SomeText= b.Description;
Version: mysql Ver 14.14 Distrib 5.7.21-20, for debian-linux-gnu (x86_64) using 6.3
Some thoughts, none of which jump out as "the answer":
Increase innodb_buffer_pool_size to 1500M, assuming this does not lead to swapping.
Step back and look at "why" the BIGINT needs to be copied over so often. And whether "all" rows need updating.
Put the LONGBLOB into another table in parallel with the current one. That will add a JOIN for the cases when you need to fetch the blob, but may keep it out of the way for the current query. (I would not expect the blob to be "in the way", but apparently it is.) A rough sketch follows this list.
What is in the blob? In some situations, it is better to have the blob in a file. A prime example is an image for a web site -- it could be accessed via http's <img...>.
Increase the timeout -- but this just "sweeps the problem under the rug" and probably leads to 30+ second delays in other things that are waiting for it. I don't recognize 30 seconds as a timeout amount. Look through SHOW VARIABLES LIKE '%out'; Try increasing any that are 30.
Do the update piecemeal -- but would this have other implications? (Anyway, Luuk should carry this option forward.)
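Here is that "LONGBLOB into another table" idea as a sketch (the side-table name is made up for illustration):
-- Keep the heavy blob in a 1:1 side table so the UPDATE on TableWhithBlobs
-- no longer has to shovel blob data around; fetching the blob then needs a JOIN.
CREATE TABLE TableWithBlobContent (
    Id      BIGINT   NOT NULL,   -- same Id as TableWhithBlobs
    Content LONGBLOB NOT NULL,
    PRIMARY KEY (Id)
);
INSERT INTO TableWithBlobContent (Id, Content)
SELECT Id, Content FROM TableWhithBlobs;
ALTER TABLE TableWhithBlobs DROP COLUMN Content;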
What about doing smaller updates?
UPDATE TableWhithBlobs a
JOIN AnotherTable b on a.AnotherTableId = b.Id
SET a.SomeText= b.Description
WHERE a.SomeText <> b.Description;
or even:
-- MySQL does not allow LIMIT in a multi-table UPDATE, so batch on the
-- primary key instead and run it with successive Id ranges:
UPDATE TableWhithBlobs a
JOIN AnotherTable b on a.AnotherTableId = b.Id
SET a.SomeText= b.Description
WHERE a.SomeText <> b.Description
  AND a.Id BETWEEN 1 AND 100;
Your timeout problem should be solved 😉, but I do not know how many batches you will have to run before you finally get 0 rows affected...

Duplicating records in a mysql table intermittently causes statements to hang and not return

We are trying to duplicate existing records in a table: make 10 records out of one. The original table contains 75,000 records, and once the statements are done it will contain about 750,000 (10 times as many). The statements sometimes finish after 10 minutes, but many times they never return; hours later we will receive a timeout. This happens about 1 out of 3 times. We are using a test database that nobody else is working on, so there is no concurrent access to the table. I don't see any way to optimise the SQL, since the EXPLAIN PLAN looks fine to me.
The database is mysql 5.5 hosted on AWS RDS db.m3.x-large. The CPU load goes up to 50% during the statements.
Question: What could cause this intermittent behaviour? How do I resolve it?
This is the SQL to create a temporary table, make roughly 9 new records per existing record of ct_revenue_detail in the temporary table, and then copy the data from the temporary table back to ct_revenue_detail:
---------------------------------------------------------------------------------------------------------
-- CREATE TEMPORARY TABLE AND COPY ROLL-UP RECORDS INTO TABLE
---------------------------------------------------------------------------------------------------------
CREATE TEMPORARY TABLE ct_revenue_detail_tmp
SELECT r.month,
r.period,
a.participant_eid,
r.employee_name,
r.employee_cc,
r.assignments_cc,
r.lob_name,
r.amount,
r.gp_run_rate,
r.unique_id,
r.product_code,
r.smart_product_name,
r.product_name,
r.assignment_type,
r.commission_pcent,
r.registered_name,
r.segment,
'Y' as account_allocation,
r.role_code,
r.product_eligibility,
r.revenue_core,
r.revenue_ict,
r.primary_account_manager_id,
r.primary_account_manager_name
FROM ct_revenue_detail r
JOIN ct_account_allocation_revenue a
ON a.period = r.period AND a.unique_id = r.unique_id
WHERE a.period = 3 AND lower(a.rollup_revenue) = 'y';
This is the second query. It copies the records from the temporary table back to the ct_revenue_detail table:
INSERT INTO ct_revenue_detail(month,
period,
participant_eid,
employee_name,
employee_cc,
assignments_cc,
lob_name,
amount,
gp_run_rate,
unique_id,
product_code,
smart_product_name,
product_name,
assignment_type,
commission_pcent,
registered_name,
segment,
account_allocation,
role_code,
product_eligibility,
revenue_core,
revenue_ict,
primary_account_manager_id,
primary_account_manager_name)
SELECT month,
period,
participant_eid,
employee_name,
employee_cc,
assignments_cc,
lob_name,
amount,
gp_run_rate,
unique_id,
product_code,
smart_product_name,
product_name,
assignment_type,
commission_pcent,
registered_name,
segment,
account_allocation,
role_code,
product_eligibility,
revenue_core,
revenue_ict,
primary_account_manager_id,
primary_account_manager_name
FROM ct_revenue_detail_tmp;
This is the EXPLAIN PLAN of the SELECT:
+----+-------------+-------+------+------------------------+--------------+---------+------------------------------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+------------------------+--------------+---------+------------------------------------+-------+-------------+
| 1 | SIMPLE | a | ref | ct_period,ct_unique_id | ct_period | 4 | const | 38828 | Using where |
| 1 | SIMPLE | r | ref | ct_period,ct_unique_id | ct_unique_id | 5 | optusbusiness_20160802.a.unique_id | 133 | Using where |
+----+-------------+-------+------+------------------------+--------------+---------+------------------------------------+-------+-------------+
This is the definition of ct_revenue_detail:
ct_revenue_detail | CREATE TABLE `ct_revenue_detail` (
`participant_eid` varchar(255) DEFAULT NULL,
`lob_name` varchar(255) DEFAULT NULL,
`amount` decimal(32,16) DEFAULT NULL,
`employee_name` varchar(255) DEFAULT NULL,
`period` int(11) NOT NULL DEFAULT '0',
`pk_id` int(11) NOT NULL AUTO_INCREMENT,
`gp_run_rate` decimal(32,16) DEFAULT NULL,
`month` int(11) DEFAULT NULL,
`assignments_cc` int(11) DEFAULT NULL,
`employee_cc` int(11) DEFAULT NULL,
`unique_id` int(11) DEFAULT NULL,
`product_code` varchar(50) DEFAULT NULL,
`smart_product_name` varchar(255) DEFAULT NULL,
`product_name` varchar(255) DEFAULT NULL,
`assignment_type` varchar(100) DEFAULT NULL,
`commission_pcent` decimal(32,16) DEFAULT NULL,
`registered_name` varchar(255) DEFAULT NULL,
`segment` varchar(100) DEFAULT NULL,
`account_allocation` varchar(25) DEFAULT NULL,
`role_code` varchar(25) DEFAULT NULL,
`product_eligibility` varchar(25) DEFAULT NULL,
`rollup` varchar(10) DEFAULT NULL,
`revised_amount` decimal(32,16) DEFAULT NULL,
`original_amount` decimal(32,16) DEFAULT NULL,
`comment` varchar(255) DEFAULT NULL,
`amount_revised_flag` varchar(255) DEFAULT NULL,
`exclude_segment` varchar(10) DEFAULT NULL,
`revenue_type` varchar(50) DEFAULT NULL,
`revenue_core` decimal(32,16) DEFAULT NULL,
`revenue_ict` decimal(32,16) DEFAULT NULL,
`primary_account_manager_id` varchar(100) DEFAULT NULL,
`primary_account_manager_name` varchar(100) DEFAULT NULL,
PRIMARY KEY (`pk_id`,`period`),
KEY `ct_participant_eid` (`participant_eid`),
KEY `ct_period` (`period`),
KEY `ct_employee_name` (`employee_name`),
KEY `ct_month` (`month`),
KEY `ct_segment` (`segment`),
KEY `ct_unique_id` (`unique_id`)
) ENGINE=InnoDB AUTO_INCREMENT=15338782 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY HASH (period)
PARTITIONS 120 */ |
Edit 29.9: The intermittent behaviour was caused by the omission of a DELETE statement: the original table was not cleared before the records were duplicated again. The first time all is fine: we started with 75,000 records and ended up with 750,000.
Because the delete statement was missed, the next run already had 750,000 records and would make 7.5M records out of them. That would still work, but the subsequent run, trying to turn 7.5M into 75M records, would fail. Hence the 1-in-3 failures.
When we then ran the scripts manually we would of course clear the table properly, and all would go well. The reason we didn't see this beforehand is that our application does not output anything while running the SQL.
The real delay would be with your second query inserting from the temporary table back into the original tables. There are several issues here.
Sheer amount of data
Looking at your table, there are several varchar(255) columns; a conservative estimate would put the average length of your rows at 2 KB. That's roughly 1.5 GB being copied from one table to another, and being moved to different partitions! Partitioning makes reads more efficient, but for inserting the engine has to figure out which partition the data should go to, so it's actually writing to lots of different files instead of sequentially to one file. For spinning disks, this is slow.
Rebuilding the indexes
One of the biggest costs of inserts is rebuilding the indexes. In your case you have many of them.
KEY `ct_participant_eid` (`participant_eid`),
KEY `ct_period` (`period`),
KEY `ct_employee_name` (`employee_name`),
KEY `ct_month` (`month`),
KEY `ct_segment` (`segment`),
KEY `ct_unique_id` (`unique_id`)
And some of these indexes, like the one on employee_name, are on varchar(255) columns. That means pretty hefty indexes.
Solution part 1 - Normalize
Your database isn't normalized. Here is a classic example:
primary_account_manager_id varchar(100) DEFAULT NULL,
primary_account_manager_name varchar(100) DEFAULT NULL,
You should really have a table called account_manager, and these two fields should be in it. primary_account_manager_id should probably be an integer field. Only the id should be in your ct_revenue_detail table.
Similarly you really shouldn't have employee_name, registered_name etc in this table. They should be in separate tables and they should be linked to ct_revenue_detail by foreign keys.
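A minimal sketch of that normalization (the new table's definition is illustrative, not prescribed by the answer):
CREATE TABLE account_manager (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name VARCHAR(100) NOT NULL,
    PRIMARY KEY (id)
);
-- ct_revenue_detail then stores only the numeric id instead of two strings:
ALTER TABLE ct_revenue_detail
    ADD COLUMN account_manager_id INT UNSIGNED DEFAULT NULL;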
Solution part 2 - Rethink indexes.
Do you need so many? MySQL generally uses only one index per table per query anyway, so some of these indexes are probably never used. Is this one really needed:
KEY `ct_unique_id` (`unique_id`)
You already have a primary key; why do you even need another unique column?
Indexes for the SELECT: For
SELECT ...
FROM ct_revenue_detail r
JOIN ct_account_allocation_revenue a
ON a.period = r.period AND a.unique_id = r.unique_id
WHERE a.period = 3 AND lower(a.rollup_revenue) = 'y';
a needs INDEX(period, rollup_revenue), in either order. However, you also need to declare rollup_revenue with a ..._ci collation and avoid wrapping the column in a function. That is, change lower(a.rollup_revenue) = 'y' to a.rollup_revenue = 'y'.
r needs INDEX(period, unique_id), in either order. But, as e4c5 mentioned, if unique_id is really "unique" in this table, then take advantage of that.
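A sketch of those two composite indexes (the index names are made up):
ALTER TABLE ct_account_allocation_revenue
    ADD INDEX idx_period_rollup (period, rollup_revenue);
ALTER TABLE ct_revenue_detail
    ADD INDEX idx_period_unique (period, unique_id);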
Bulkiness is a problem when shoveling data around.
decimal(32,16) takes 16 bytes and gives you precision and range that are probably unnecessary. Consider FLOAT (4 bytes, ~7 significant digits, adequate range) or DOUBLE (8 bytes, ~16 significant digits, adequate range).
month int(11) takes 4 bytes. If that is just a value 1..12, then use TINYINT UNSIGNED (1 byte).
DEFAULT NULL -- I suspect most columns will never be NULL; if so, say NOT NULL for them.
amount_revised_flag varchar(255) -- if that is really a "flag", such as "yes"/"no", then use an ENUM and save lots of space.
It is uncommon to have both an id and a name in the same table (see primary_account_manager*); that is usually relegated to a "normalization table".
"Normalize" (already mentioned by #e4c5).
HASH partitioning
Hash partitioning is virtually useless. Unless you can justify it (preferably with a benchmark), I recommend removing partitioning. More discussion.
Adding or removing partitioning usually involves changing the indexes. Please show us the main queries so we can help you build suitable indexes (especially composite indexes) for the queries.
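If the decision is to drop partitioning, the statement itself is simple (a sketch; benchmark first and rebuild indexes as needed afterwards):
ALTER TABLE ct_revenue_detail REMOVE PARTITIONING;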

Why is LEFT JOIN slower than INNER JOIN?

I have two queries, the first one (inner join) is super fast, and the second one (left join) is super slow. How do I make the second query fast?
EXPLAIN SELECT saved.email FROM saved INNER JOIN finished ON finished.email = saved.email;
id  select_type  table     type   possible_keys  key    key_len  ref   rows   Extra
1   SIMPLE       finished  index  NULL           email  258      NULL  32168  Using index
1   SIMPLE       saved     ref    email          email  383      func  1      Using where; Using index
EXPLAIN SELECT saved.email FROM saved LEFT JOIN finished ON finished.email = saved.email;
id  select_type  table     type   possible_keys  key    key_len  ref   rows   Extra
1   SIMPLE       saved     index  NULL           email  383      NULL  40971  Using index
1   SIMPLE       finished  index  NULL           email  258      NULL  32168  Using index
Edit: I have added table info for both tables down below.
CREATE TABLE `saved` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`slug` varchar(255) DEFAULT NULL,
`email` varchar(127) NOT NULL,
[omitted fields include varchar, text, longtext, int],
PRIMARY KEY (`id`),
KEY `slug` (`slug`),
KEY `email` (`email`)
) ENGINE=MyISAM AUTO_INCREMENT=56329 DEFAULT CHARSET=utf8;
CREATE TABLE `finished` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`slug` varchar(255) DEFAULT NULL,
`submitted` int(11) DEFAULT NULL,
`status` int(1) DEFAULT '0',
`name` varchar(255) DEFAULT NULL,
`email` varchar(255) DEFAULT NULL,
[omitted fields include varchar, text, longtext, int],
PRIMARY KEY (`id`),
KEY `assigned_user_id` (`assigned_user_id`),
KEY `event_id` (`event_id`),
KEY `slug` (`slug`),
KEY `email` (`email`),
KEY `city_id` (`city_id`),
KEY `status` (`status`),
KEY `recommend` (`recommend`),
KEY `pending_user_id` (`pending_user_id`),
KEY `submitted` (`submitted`)
) ENGINE=MyISAM AUTO_INCREMENT=33063 DEFAULT CHARSET=latin1;
With INNER JOIN, MySQL generally will start with the table with the smallest number of rows. In this case, it starts with table finished and does a look up for the corresponding record in saved using the index on saved.email.
For a LEFT JOIN, (excluding some optimizations) MySQL generally joins the records in order (starting with the left most table). In this case, MySQL starts with the table saved, then attempts to find each corresponding record in finished. Since there is no usable index on finished.email, it must do a full scan for each look up.
Edit
Now that you posted your schema, I can see that MySQL is ignoring the index (finished.email) when going from utf8 to latin1 character set. You've not posted the character sets and collations for each column, so I'm going by the default character set for the table. The collations must be compatible in order for MySQL to use the index.
MySQL can coerce (upgrade) a latin1 collation, which is very limited, up to a utf8 collation such as unicode_ci (so the first query can use the index on saved.email by upgrading latin1 collation to utf8), but the opposite is not true (the second query can't use the index on finished.email since it can't downgrade a utf8 collation down to latin1).
The solution is to change both email columns to a compatible collation, perhaps most easily by making them identical character sets and collations.
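A sketch of that change, assuming the application is fine with utf8 on both columns (pick whatever collation saved.email actually uses):
-- Give finished.email the same character set/collation as saved.email so the
-- index on finished.email becomes usable in the join.
ALTER TABLE finished
    MODIFY email VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL;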
The LEFT JOIN query is slower than the INNER JOIN query because it's doing more work.
From the EXPLAIN output, it looks like MySQL is doing nested loop join. (There's nothing wrong with nested loops; I think that's the only join operation that MySQL uses in version 5.5 and earlier.)
For the INNER JOIN query, MySQL is using an efficient "ref" (index lookup) operation to locate the matching rows.
But for the LEFT JOIN query, it looks like MySQL is doing a full scan of the index to find the matching rows. So, with the nested loops join operation, MySQL is doing a full index scan for each row from the other table. That's on the order of tens of thousands of scans, and each of those scans inspects tens of thousands of rows.
Using the estimated row counts from the EXPLAIN output, that's going to require (40971*32168=) 1,317,955,128 string comparisons.
The INNER JOIN query avoids a lot of that work, so it's a lot faster. (It avoids all those string comparisons by using an index lookup operation.)
-- LEFT JOIN
id select table type key key_len ref rows Extra
-- ------ -------- ----- ----- ------- ---- ----- ------------------------
1 SIMPLE saved index email 383 NULL 40971 Using index
1 SIMPLE finished index email 258 NULL 32168 Using index
-- INNER JOIN
id select table type key key_len ref rows Extra
-- ------ -------- ----- ----- ------- ---- ----- ------------------------
1 SIMPLE finished index email 258 NULL 32168 Using index
1 SIMPLE saved ref email 383 func 1 Using where; Using index
^^^^^ ^^^^ ^^^^^ ^^^^^^^^^^^^
NOTE: Markus Adams spied the difference in character set between the email columns in the CREATE TABLE statements that were added to your question.
I believe that it's the difference in the characterset that's preventing MySQL from using an index for your query.
Q2: How do I make the LEFT JOIN query faster?
A: I don't believe it's going to be possible to get that specific query to run faster, without a schema change, such as changing the characterset of the two email columns to match.
The only effect the "outer join" to the finished table appears to have is to produce "duplicate" rows whenever more than one matching row is found. I'm not understanding why the outer join is needed. Why not just get rid of it altogether and do:
SELECT saved.email FROM saved
I'm afraid more info will probably be needed.
However, inner joins eliminate any item that has a null foreign key (no match, if you will). This means there are fewer rows to scan and associate.
For a left join however, any non-match needs to be given a blank row, so all of the rows are scanned regardless -- nothing can be eliminated.
This makes the data set larger and requires more resources to process. Also, when you write your select, don't do select * -- instead, explicitly state which columns you want.
The data types of saved.email and finished.email differ in two respects. First, they have different lengths. Second, finished.email can be NULL. So, your LEFT JOIN operation can't exploit the index on finished.email.
Can you change the definition of finished.email to this, so it matches the field you're joining it with?
`email` varchar(127) NOT NULL
If you do you'll probably get a speedup.

Running SQL queries with JOINs on large datasets

I'm new to using MySQL.
I'm trying to run an inner join query between a table of 80,000 records (this is table B) and a 40 GB data set with approx. 600 million records (this is table A).
Is MySQL suitable for running this sort of query?
What sort of time should I expect it to take?
The code I tried is below. However, it failed because my DB connection timed out at 60,000 secs.
set net_read_timeout = 36000;
INSERT
INTO C
SELECT A.id, A.link_id, link_ref, network,
date_1, time_per,
veh_cls, data_source, N, av_jt
from A
inner join B
on A.link_id = B.link_id;
I'm starting to look into ways of cutting the 40 GB table down into a temp table, to try and make the query more manageable. But I keep getting
Error Code: 1206. The total number of locks exceeds the lock table size 646.953 sec
Am I on the right track?
cheers!
My code for splitting the database is:
LOCK TABLES TFM_830_car WRITE, tfm READ;
INSERT
INTO D
SELECT A.id, A.link_id, A.time_per, A.av_jt
from A
where A.time_per = 34 and A.veh_cls = 1;
UNLOCK TABLES;
Perhaps my table indices are incorrect; all I have is a simple primary key:
CREATE Table A
(
id int unsigned Not Null auto_increment,
link_id varchar(255) not Null,
link_ref int not Null,
network int not Null,
date_1 varchar(255) not Null,
#date_2 time default Null,
time_per int not null,
veh_cls int not null,
data_source int not null,
N int not null,
av_jt int not null,
sum_squ_jt int not null,
Primary Key (id)
);
Drop table if exists B;
CREATE Table B
(
id int unsigned Not Null auto_increment,
TOID varchar(255) not Null,
link_id varchar(255) not Null,
ABnode varchar(255) not Null,
#date_2 time not Null,
Primary Key (id)
);
In terms of the schema, it is just these two tables (A and B) loaded into one database.
I believe that answer has already been given in this post: The total number of locks exceeds the lock table size
i.e. use a table lock to avoid InnoDB's default row-by-row locking mode.
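Applied to the insert from the question, that advice would look roughly like this (a sketch, assuming table C already exists):
LOCK TABLES C WRITE, A READ, B READ;
INSERT INTO C
SELECT A.id, A.link_id, link_ref, network, date_1, time_per,
       veh_cls, data_source, N, av_jt
FROM A
INNER JOIN B ON A.link_id = B.link_id;
UNLOCK TABLES;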
Thanks for your help.
Indexing seems to have solved the problem. I managed to reduce the query time from 700 secs to approx. 0.2 secs per record by indexing on:
A.link_id
i.e. from
from A
inner join B
on A.link_id = B.link_id;
I found this really useful post, very helpful for a newbie like myself:
http://hackmysql.com/case4
The code used to create the index was:
CREATE INDEX linkid_index ON A(link_id);
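If the optimizer ever drives the join from the other side, the same kind of index on B's join column may help too (a guess, not something the answer above measured):
CREATE INDEX linkid_index_b ON B(link_id);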