Optimize speed of MySQL JOIN query - mysql

I have two tables with a one-to-many relationship: T1 with 1.6 million rows and T2 with 4.6 million rows.
The CREATE statement of T1 is:
CREATE TABLE `T1` (
`field_1` text,
`field_2` text,
`field_3` decimal(10,6) DEFAULT NULL,
`field_4` decimal(10,6) DEFAULT NULL,
`field_5` text,
`field_6` text,
`field_7` text,
`field_8` double DEFAULT NULL,
`field_9` text,
`field_10` text,
`field_11` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The CREATE statement of T2 is:
CREATE TABLE `T2` (
`field_1` int(11) DEFAULT NULL,
`field_2` text,
`field_3` text,
`field_4` text,
`field_5` text,
`field_6` text,
`field_7` text,
`field_8` text,
`field_9` text,
`field_10` text,
`field_11` text,
`field_12` text,
`field_13` text
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I haven't set any indexes or any particular constraints yet, but T1.field_1 should be my ideal key, and it can be joined with the T2.field_2 field.
If I decide to make a JOIN like:
SELECT * FROM T1
JOIN T2
ON T1.field_1=T2.field_2
WHERE T1.=2130100;
The execution time is really high.
This is the EXPLAIN:
So I'm just trying to understand what some possible improvements could be:
Add some index
Change the type of the input fields?
Maybe add a primary key?

In your WHERE condition you missed the column name; I assume the column is named Your_col.
Starting from MySQL 5.0.3, a VARCHAR can be up to 65,535 bytes, so you could try using VARCHAR instead of TEXT where possible.
For indexing there is a limitation on the size of the index: the max key length is 767 bytes (assuming 3 bytes for each utf8 character, so about 250 utf8 characters).
The columns that are candidates for indexing must respect this limit.
If this is possible, then you could add an index on table T2, column field_2, and on table T1 a composite index on the columns (Your_col, field_1).
These are the columns involved in the WHERE and ON clauses:
SELECT * FROM T1
JOIN T2
ON T1.field_1=T2.field_2
WHERE T1.Your_col=2130100;
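A minimal sketch of those index additions, assuming Your_col is an INT and field_1/field_2 have already been converted to an indexable VARCHAR as suggested above (the index names are illustrative):
ALTER TABLE T1 ADD INDEX idx_t1_col_field1 (Your_col, field_1);
ALTER TABLE T2 ADD INDEX idx_t2_field2 (field_2);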

Since you are using latin1, switch t1.field_1 and t2.field_2 to VARCHAR of no more than 767. Use the shortest value that is not likely to be exceeded. Do likewise for all the other TEXT columns. (If you need >767, stick with TEXT.)
Then add two indexes:
T1: INDEX(??) -- whatever column you are using in the `WHERE`
T2: INDEX(field_2)
If the column in T1 is an INT, then 2130100 is OK. But if it is TEXT (or soon to be VARCHAR(..)), then quote it: '2130100'. That should prevent a surprising and unnecessary table scan of T1.
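A sketch of those changes, assuming 64 characters is enough for the join values (adjust the length to your data):
ALTER TABLE T1 MODIFY `field_1` VARCHAR(64);
ALTER TABLE T2 MODIFY `field_2` VARCHAR(64), ADD INDEX (`field_2`);
-- plus an INDEX on T1 for whichever column appears in the WHERE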

Related

How to optimize an UPDATE and JOIN query on practically identical tables?

I am trying to update one table based on another in the most efficient way.
Here is the table DDL of what I am trying to update
Table1
CREATE TABLE `customersPrimary` (
`id` int NOT NULL AUTO_INCREMENT,
`groupID` int NOT NULL,
`IDInGroup` int NOT NULL,
`name` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`address` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `groupID-IDInGroup` (`groupID`,`IDInGroup`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Table2
CREATE TABLE `customersSecondary` (
`groupID` int NOT NULL,
`IDInGroup` int NOT NULL,
`name` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`address` varchar(200) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
PRIMARY KEY (`groupID`,`IDInGroup`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Both tables are practically identical, but the customersSecondary table is by design a staging table for the other. The big difference is the primary keys: table 1 has an auto-incrementing primary key, table 2 has a composite primary key.
In both tables the combination of groupID and IDInGroup are unique.
Here is the query I want to optimize
UPDATE customersPrimary
INNER JOIN customersSecondary ON
(customersPrimary.groupID = customersSecondary.groupID
AND customersPrimary.IDInGroup = customersSecondary.IDInGroup)
SET
customersPrimary.name = customersSecondary.name,
customersPrimary.address = customersSecondary.address
This query works but scans EVERY row in customersSecondary.
Adding
WHERE customersPrimary.groupID = (groupID)
cuts it down significantly, to the number of rows with that groupID in customersSecondary. But this is often still far larger than the number of rows actually being updated, since a single groupID can cover many rows. I think the WHERE needs improvement.
I can control table structure and add indexes. I will have to keep both tables.
Any suggestions would be helpful.
Your existing query requires a full table scan because you are saying update everything on the left based on the value on the right. Presumably the optimiser is choosing customersSecondary because it has fewer rows, or at least it thinks it has.
Is the full table scan causing you problems? Locking? Too slow? How long does it take? How frequently are the tables synced? How many records are there in each table? What is the rate of change in each of the tables?
You could add separate indices on name and address but that will take a good chunk of space. The better option is going to be to add an indexed updatedAt column and use that to track which records have been changed.
ALTER TABLE `customersPrimary`
ADD COLUMN `updatedAt` DATETIME NOT NULL DEFAULT '2000-01-01 00:00:00',
ADD INDEX `idx_customer_primary_updated` (`updatedAt`);
ALTER TABLE `customersSecondary`
ADD COLUMN `updatedAt` DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
ADD INDEX `idx_customer_secondary_updated` (`updatedAt`);
And then you can add updatedAt to your join criteria and the WHERE clause -
UPDATE customersPrimary cp
INNER JOIN customersSecondary cs
ON cp.groupID = cs.groupID
AND cp.IDInGroup = cs.IDInGroup
AND cp.updatedAt < cs.updatedAt
SET
cp.name = cs.name,
cp.address = cs.address,
cp.updatedAt = cs.updatedAt
WHERE cs.updatedAt > :last_query_run_time;
For :last_query_run_time you could use the last run time if you are storing it. Otherwise, if you know you are running the query every hour you could use NOW() - INTERVAL 65 MINUTE. Notice I have used more than one hour to make sure records aren't missed if there is a slight delay for some reason. Another option would be to use SELECT MAX(updatedAt) FROM customersPrimary -
UPDATE customersPrimary cp
INNER JOIN (SELECT MAX(updatedAt) maxUpdatedAt FROM customersPrimary) t
INNER JOIN customersSecondary cs
ON cp.groupID = cs.groupID
AND cp.IDInGroup = cs.IDInGroup
AND cp.updatedAt < cs.updatedAt
SET
cp.name = cs.name,
cp.address = cs.address,
cp.updatedAt = cs.updatedAt
WHERE cs.updatedAt > t.maxUpdatedAt;
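If you are not storing the last run time, the hourly-cron variant mentioned above simply changes the final line of the first query to:
WHERE cs.updatedAt > NOW() - INTERVAL 65 MINUTE;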
Plan A:
Something like this would first find the "new" rows, then add only those:
UPDATE primary
JOIN ( SELECT ...
       FROM secondary
       LEFT JOIN primary ON ...
       WHERE primary... IS NULL ) AS new_rows
ON ...
SET ...
Might secondary have changes? If so, a variant of that would work.
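Since rows missing from primary cannot be UPDATEd, one way to read Plan A as runnable SQL is an INSERT of the staging-only rows; a sketch, assuming only name and address need copying:
INSERT INTO customersPrimary (groupID, IDInGroup, name, address)
SELECT s.groupID, s.IDInGroup, s.name, s.address
FROM customersSecondary s
LEFT JOIN customersPrimary p
  ON p.groupID = s.groupID AND p.IDInGroup = s.IDInGroup
WHERE p.groupID IS NULL;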
Plan B:
Better yet is to TRUNCATE TABLE secondary after it is folded into primary.
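A sketch of that fold-then-truncate workflow, relying on the UNIQUE key on (groupID, IDInGroup) to route each staged row to either an insert or an update:
INSERT INTO customersPrimary (groupID, IDInGroup, name, address)
SELECT groupID, IDInGroup, name, address
FROM customersSecondary
ON DUPLICATE KEY UPDATE name = VALUES(name), address = VALUES(address);
TRUNCATE TABLE customersSecondary;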

Why does MySQL not use index if varchar size is too high?

I am joining with a table and noticed that if the field I join on has a varchar size that's too high, then MySQL doesn't use the index for that field in the join, resulting in a significantly longer query time. I've put the explains and table definition below. The server is MySQL 5.7. Any ideas why this is happening?
Table definition:
CREATE TABLE `LotRecordsRaw` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`lotNumber` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`scrapingJobId` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `lotNumber_UNIQUE` (`lotNumber`),
KEY `idx_Lot_lotNumber` (`lotNumber`)
) ENGINE=InnoDB AUTO_INCREMENT=14551 DEFAULT CHARSET=latin1;
Explains:
explain
(
select lotRecord.*
from LotRecordsRaw lotRecord
left join (
select lotNumber, max(scrapingJobId) as id
from LotRecordsRaw
group by lotNumber
) latestJob on latestJob.lotNumber = lotRecord.lotNumber
)
produces:
The screenshot above shows that the derived table is not using the index on "lotNumber". In that example, the "lotNumber" field was a varchar(255). If I change it to be a smaller size, e.g. varchar(45), then the explain query produces this:
The query then runs orders of magnitude faster (2 seconds instead of 100 sec). What's going on here?
Hooray! You found an optimization reason for not blindly using 255 in VARCHAR.
Please try 191 and 192 -- I want to know if that is the cutoff.
Meanwhile, I have some other comments:
A UNIQUE is a KEY. That is, idx_Lot_lotNumber is redundant and may as well be removed.
The Optimizer can (and probably would) use INDEX(lotNumber, scrapingJobId) as a much faster way to find those MAXes.
Unfortunately, there is no way to specify "make a unique index on lotNumber, but also have that other column in the index".
Wait! With lotNumber being unique, there is only one row per lotNumber. That means MAX and GROUP BY are totally unnecessary!
It seems like lotNumber could be promoted to PRIMARY KEY (and completely get rid of id).
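A rough sketch of that restructuring (test it on a copy first; it assumes nothing references id, lotNumber has no NULLs, and 191 characters keeps the utf8mb4 key under the 767-byte limit):
ALTER TABLE LotRecordsRaw DROP INDEX idx_Lot_lotNumber;   -- redundant with the UNIQUE key
ALTER TABLE LotRecordsRaw DROP COLUMN id;                 -- removes the old PRIMARY KEY with it
ALTER TABLE LotRecordsRaw
  MODIFY lotNumber VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  DROP INDEX lotNumber_UNIQUE,
  ADD PRIMARY KEY (lotNumber);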

Optimize MYSQL Select query in large table

Given the table:
CREATE TABLE `sample` (
`id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
`vendorid` VARCHAR(45) NOT NULL,
`year` INT(10) NOT NULL,
`title` TEXT NOT NULL,
`description` TEXT NOT NULL,
PRIMARY KEY (`id`) USING BTREE
)
Table size: over 7 million rows. No field is unique except id.
Simple query:
SELECT * FROM sample WHERE title='milk'
Takes over 45s-60s to complete.
I tried to put a unique index on title and description but got error 1170.
How could I optimize it? Would be very grateful for suggestions.
TEXT columns need prefix indexes -- it's not possible to index their entire contents; they can be too large. And, if the column values aren't unique, don't use UNIQUE indexes; they won't work.
Try this:
ALTER TABLE sample ADD INDEX title_prefix (title(64));
Pro tip: for columns you need to use in WHERE clauses, do your best to use VARCHAR(n) where n is less than 768. Avoid TEXT and other BLOB types unless you absolutely need them; they can make for inefficient operation of your database server.
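To pick a sensible prefix length, you can measure how selective a candidate prefix is; the closer the ratio is to 1.0, the better the prefix distinguishes rows (64 here is just a starting guess to tune):
SELECT COUNT(DISTINCT LEFT(title, 64)) / COUNT(*) AS selectivity FROM sample;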

Why primary key has no good effect on select?

This is my table t1; it has one million rows.
CREATE TABLE `t1` (
`a` varchar(10) NOT NULL,
`b` varchar(10) DEFAULT NULL,
`c` varchar(10) DEFAULT NULL,
`d` varchar(10) DEFAULT NULL,
`e` varchar(10) DEFAULT NULL,
`f` varchar(10) DEFAULT NULL,
`g` varchar(10) DEFAULT NULL,
`h` varchar(10) DEFAULT NULL,
PRIMARY KEY (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Result:
mysql> select * from t1 where a=10000000;
Empty set (1.42 sec)
mysql> select * from t1 where b=10000000;
Empty set (1.41 sec)
Why is a select on the primary key only as fast as one on a normal field?
Try select * from t1 where a='10000000';.
You're probably forcing MySQL to convert all of those strings to numbers (because integers have a higher type precedence than varchar), in which case an index on the strings is useless.
Actually, apparently, I was slightly wrong: by my reading of the conversion documentation, I believe that in MySQL both sides of the comparison end up being converted to float, since I can't see any bullet point there, other than "In all other cases, the arguments are compared as floating-point (real) numbers.", that would match a string on one side and an integer on the other.
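You can confirm this with EXPLAIN; the numeric literal should produce a full table scan, while the quoted literal should be resolved through the primary key (exact plans may vary by version):
EXPLAIN SELECT * FROM t1 WHERE a = 10000000;   -- typically type: ALL (full scan)
EXPLAIN SELECT * FROM t1 WHERE a = '10000000'; -- typically type: const, using PRIMARY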
Data is stored in blocks in almost all databases; reading a block is an elementary unit of IO.
Indexes help the system zero in on the data blocks that hold the data we are trying to read, avoiding a read of all the data blocks. In a very small table with one or very few data blocks, using an index can actually be an overhead and might be skipped altogether; even when used, the index would rarely provide any performance benefit. Try the same experiment on a rather large table.
PS: Indexes and keys (primary keys) are not interchangeable concepts. The former is physical and the latter is logical.

MySQL doesn't match varchar keys ending with a number

I've defined a table like
CREATE TABLE `mytable` (
`identifier` varchar(45) NOT NULL,
`f1` char(1) NOT NULL,
KEY `identifier` (`identifier`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
and then added primary key and index as
ALTER TABLE `mytable` ADD PRIMARY KEY ( `identifier` )
ALTER TABLE `mytable` ADD INDEX ( `identifier` )
My table's identifier field is populated with values like these (about 800,000 records):
USER01-TESTXXY-CAD-10172
USER01-TESTXXY-CAD-1020
USER01-TESTXXY-CAD-10245
USER02-TEST-003-SUBA
USER02-TEST-002-SUBB
I've discovered that queries where the identifier ends with a number aren't matched:
SELECT *
FROM identifier
WHERE identifier = 'USER01-TESTXXY-CAD-10245';
but queries for an identifier which ends with letters match successfully:
SELECT *
FROM identifier
WHERE identifier = 'USER02-TEST-003-SUBA';
My queries are exact; I don't need to compare with LIKE because my users provide exact strings. Besides, varchar(45) is more than enough space for my identifiers.
What did I do wrong? What could be the reason or the solution?
I think you have made a typo in both queries: your table name should be mytable instead of identifier.
SELECT *
FROM mytable
WHERE identifier = 'USER01-TESTXXY-CAD-10245';
Is it a typo? Is there another table in your database that is called identifier?
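If you want to check whether such a table exists, one quick way is:
SHOW TABLES LIKE 'identifier';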