Optimize MySQL subquery containing self-join to reduce CPU usage

I understand that a MySQL query involving a self-joined table can lead to slow queries and/or CPU spikes, but I have been struggling to come up with ways to improve it.
CREATE TABLE `tool` (
`tool_id` char(32) NOT NULL,
`provider` varchar(36) NOT NULL,
PRIMARY KEY (`tool_id`)
);
CREATE TABLE `edata` (
`e_data_id` char(32) NOT NULL,
`tool_id` char(32) DEFAULT NULL,
`ref_e_data_id` char(32) DEFAULT NULL,
PRIMARY KEY (`e_data_id`),
KEY `e_ref_e_data__06a0c1a7_fk` (`ref_e_data_id`),
KEY `edata_tool_id_61d6bb9b` (`tool_id`),
CONSTRAINT `e_tool_id_61d6bb9b` FOREIGN KEY (`tool_id`) REFERENCES `tool` (`tool_id`)
);
Here is the query in question:
mutdata
LEFT JOIN (SELECT e1.edata_id as m_id, a1.provider as m_cp from edata e1 INNER JOIN tool a1 on e1.tool_id=a1.tool_id WHERE a1.deleted=0) as mapping
on mutdata.ref_e_data_id=mapping.m_id or mutdata.e_data_id=map.m_id
In short, the subquery is first built as a lookup table (like a dictionary or map), then mutdata uses that lookup table to determine the corresponding provider. (This query is part of an even larger query.) Is there a way to optimize this part?

These indexes may help:
mutdata: INDEX(ref_e_data_id, e_data_id)
map: INDEX(m_id)
e1: INDEX(tool_id, edata_id)
a1: INDEX(deleted, tool_id, provider)
Try not to use the construct JOIN ( SELECT ... ); instead try to bump that up a level.
Do you really need LEFT in either place?
OR is terrible for performance. Sometimes it is practical to use two SELECT connected by UNION DISTINCT as a workaround. That way, each SELECT may be able to take advantage of a different index.
Where is map in the query?
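As a rough sketch of two of the suggestions above combined (flattening the JOIN ( SELECT ... ) and replacing OR with UNION DISTINCT), the lookup might be written as follows. Several things are assumptions here: that mutdata has the e_data_id and ref_e_data_id columns the ON clause implies, that the edata column is named e_data_id as in the CREATE TABLE (the query itself says edata_id), and that tool has the deleted column the query filters on. It uses inner joins, so mutdata rows with no provider are dropped; it only matches the original if LEFT turns out to be unnecessary.

```sql
-- Branch 1: provider reached through the referenced row.
SELECT m.e_data_id, a1.provider AS m_cp
FROM mutdata AS m
JOIN edata AS e1 ON e1.e_data_id = m.ref_e_data_id
JOIN tool  AS a1 ON a1.tool_id = e1.tool_id
WHERE a1.deleted = 0

UNION DISTINCT

-- Branch 2: provider reached through the row's own id.
SELECT m.e_data_id, a1.provider AS m_cp
FROM mutdata AS m
JOIN edata AS e1 ON e1.e_data_id = m.e_data_id
JOIN tool  AS a1 ON a1.tool_id = e1.tool_id
WHERE a1.deleted = 0;
```

Each branch has a single equality join condition, so the optimizer is free to pick a different index for each. Note that UNION DISTINCT removes duplicates over the selected columns (e_data_id, m_cp), which may or may not be what the larger query expects.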

Related

Simple SQL query lasts forever

I am using MySQL Workbench and MySQL Server on an Ubuntu 18 machine with 16 GB RAM.
I have a schema named ips with two tables, table1 and table2.
Both tables have two fields, ip and description, and both are of type string. I have a lot of records: table1 has 779938 records and table2 has 136657 records.
I need a join query to count the ips in table2 whose description starts with str1 and contains neither str2 nor str3, and whose description in table1 does not start with str1 and contains either str2 or str3.
This is my query:
SELECT COUNT(`table2`.`ip`)
FROM `ips`.`table2`, `ips`.`table1`
WHERE `table2`.`ip` = `table1`.`ip`
AND (LOWER(`table1`.`description`) NOT LIKE 'str1%'
AND (LOWER(`table1`.`description`) LIKE '%-str2-%'
OR LOWER(`table1`.`description`) LIKE '%-str3-%'
)
)
AND (LOWER(`table2`.`description`) LIKE 'str1%'
AND LOWER(`table2`.`description`) NOT LIKE '%-str2-%'
AND LOWER(`table2`.`description`) NOT LIKE '%-str3-%'
);
However, the query never finishes. The duration column shows ? and I never get a result. Can you please help?
EDIT:
Here are the SHOW CREATE TABLE and EXPLAIN outputs:
1) SHOW CREATE TABLE ips.table2;
CREATE TABLE `table2` (
`ip` varchar(500) DEFAULT NULL,
`description` varchar(500) DEFAULT NULL,
`type` varchar(500) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
2) SHOW CREATE TABLE ips.table1;
CREATE TABLE `table1` (
`ip` varchar(500) DEFAULT NULL,
`description` varchar(500) DEFAULT NULL,
`type` varchar(500) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
3) EXPLAIN <query>
# id, select_type, table, partitions, type, possible_keys, key, key_len, ref, rows, filtered, Extra
1, SIMPLE, table2, , ALL, , , , , 136109, 100.00, Using where
1, SIMPLE, table1, , ALL, , , , , 786072, 10.00, Using where; Using join buffer (Block Nested Loop)
EDIT 2:
The data in the ip field are strings in this format: str.str.str.str
The description field is in this format: str1-str2-str3-str4
The previous answer regarding indexing might optimise the query, and it might be correct. But I am sorry that I have to accept the answer I actually used to solve the problem. Thanks to @Raymond Nijland for being the first to point out the indexing issue, which reminded me of primary keys.
The source of the problem is that neither table in the query had a primary key. A primary key must be on a column that is unique and not null. In my case the ip field was ready to serve as the primary key. Since I use MySQL Workbench, I right-clicked each table, clicked Alter Table, then checked the primary key box for the appropriate field.
That solved my problem.
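For readers not using Workbench, the same change can probably be expressed in plain SQL along these lines. This is a sketch that assumes, as stated above, that ip has no NULLs and no duplicates in either table; if that does not hold, the statements are expected to fail rather than fix the data.

```sql
-- Assumes every ip value is non-NULL and unique within its table.
ALTER TABLE ips.table1
  MODIFY ip VARCHAR(500) NOT NULL,
  ADD PRIMARY KEY (ip);

ALTER TABLE ips.table2
  MODIFY ip VARCHAR(500) NOT NULL,
  ADD PRIMARY KEY (ip);
```

With a primary key on ip in both tables, the join condition table2.ip = table1.ip can be resolved by key lookup instead of the Block Nested Loop over two full scans shown in the EXPLAIN output.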
You are getting the ALL operator in the execution plan because the SQL planner is not using any index. It's performing a Full Table Scan on both tables.
A Full Table Scan can be optimal when you are selecting more than 5% of the rows. In your case this could be good if your string prefix "str1" had a single letter. If it has more than one character, then the use of an index could greatly improve the performance.
Now, the comparison you are performing is not a simple one. You are not comparing the value of a column, but the result of an expression: LOWER(table1.description). Therefore you need to create virtual columns and index them if you want this query to be fast. This is available in MySQL 5.7 and newer:
-- lower_desc is sized to match description varchar(500) so values are not cut short
alter table table1 add lower_desc varchar(500)
generated always as (LOWER(description)) virtual;
create index ix1 on table1 (lower_desc);
alter table table2 add lower_desc varchar(500)
generated always as (LOWER(description)) virtual;
create index ix2 on table2 (lower_desc);
These indexes will make your queries faster when the prefix has two or more characters. Get the execution plan again. The ALL operators should no longer be there (index operators should show up in their place).
Incidentally, I think you missed a join in the query. I think it should look like this (I added the third line):
SELECT COUNT(`table2`.`ip`)
FROM `ips`.`table2`
JOIN `ips`.`table1` on `ips`.`table1`.ip = `ips`.`table2`.ip
WHERE `table2`.`ip` = `table1`.`ip`
AND (LOWER(`table1`.`description`) NOT LIKE 'str1%'
AND (LOWER(`table1`.`description`) LIKE '%-str2-%'
OR LOWER(`table1`.`description`) LIKE '%-str3-%'
)
)
AND (LOWER(`table2`.`description`) LIKE 'str1%'
AND LOWER(`table2`.`description`) NOT LIKE '%-str2-%'
AND LOWER(`table2`.`description`) NOT LIKE '%-str3-%'
);
Also, to optimize the join performance you'll need one (or both) of the indexes shown below:
create index ix3 on table1 (ip);
create index ix4 on table2 (ip);

Mysql index is not being taken while field is mentioned in join on clause

explain select * from users u join wallet w on w.userId=u.uuid where w.userId='8319611142598331610'; -- index is taken
explain select * from users u join wallet w on w.userId=u.uuid where w.currencyId=8; -- index is not taken
As can be seen above, the index userIdIdx is used in the former case, but not in the latter.
Following are the schema of the two tables -
CREATE TABLE `users` (
`uuid` varchar(600) DEFAULT NULL,
KEY `uuidIdx` (`uuid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `wallet` (
`Id` int(11) NOT NULL AUTO_INCREMENT,
`userId` varchar(200) NOT NULL DEFAULT '',
`currencyId` int(11) NOT NULL,
PRIMARY KEY (`Id`),
KEY `userIdIdx` (`userId`),
KEY `currencyIdIdx` (`currencyId`)
) ENGINE=InnoDB AUTO_INCREMENT=279668 DEFAULT CHARSET=latin1;
How do I force MySQL to consider the userIdIdx or uuidIdx index?
There are two methods for improving this.
Method 1:
Adding a multiple column index wallet(userId, currencyId) looks to be better for both queries.
see demo https://www.db-fiddle.com/f/aesNYevEzwopmXrnQJRPoS/0
Method 2
Rewrite the query.
This works with the current table structure.
Query
SELECT
*
FROM (
SELECT
wallet.userId
FROM
wallet
WHERE
wallet.currencyId = 8
) AS wallet
INNER JOIN
users
ON
wallet.userId = users.uuid
see demo https://www.db-fiddle.com/f/aesNYevEzwopmXrnQJRPoS/3
P.S. I also advise you to add Id int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY to the users table when you use InnoDB as the table engine.
This post of mine explains why: https://dba.stackexchange.com/a/48184/27070
Both queries are doing the best they can with what you gave them.
select *
from users u
join wallet w ON w.userId=u.uuid
where w.userId='8319611142598331610';
select *
from users u
join wallet w ON w.userId=u.uuid
where w.currencyId=8;
If there is only one row in a table (such as users), the Optimizer takes a different path. That seems to be what happened with the first query.
Otherwise, both queries would start with wallet since there is filtering going on. Each of the secondary keys in wallet is handy for one of the queries. Even better would be
INDEX(userId, currencyId, id) -- for first query
INDEX(currencyId, userId, id) -- for second query
The first column is used in the WHERE; the other two columns make the index "covering" so that it does not need to bounce between the index and the data.
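Written out as DDL, the two suggested indexes might look like the sketch below. The index names are made up for illustration. The trailing id column is left out on the assumption that wallet stays InnoDB, where every secondary index already carries the primary key (Id) implicitly.

```sql
ALTER TABLE wallet
  ADD INDEX userId_currencyId_idx (userId, currencyId),  -- for the first query
  ADD INDEX currencyId_userId_idx (currencyId, userId);  -- for the second query
```

Once these exist, the single-column userIdIdx and currencyIdIdx are prefixes of the new indexes and could likely be dropped, though that is worth confirming with EXPLAIN first.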
(Geez, those tables have awfully few columns.)
After filtering in w, it moves on to u and uses INDEX(uuid). Since that is the only column in the table, (no name??), it can be "Using index", that is "covering".
And the only reason for reaching into u is to verify that there exists a user with the value matching w.userId. Since you probably always have that, why JOIN to users at all in the query??

Why this MySQL IN takes so much longer than WHERE OR?

I have two tables, identities and events.
identities has only two columns, identity1 and identity2 and both have a HASH INDEX.
events has ~50 columns and the column _p has a HASH INDEX.
CREATE TABLE `identities` (
`identity1` varchar(255) NOT NULL DEFAULT '',
`identity2` varchar(255) DEFAULT NULL,
UNIQUE KEY `uniques` (`identity1`,`identity2`),
KEY `index2` (`identity2`) USING HASH,
KEY `index1` (`identity1`) USING HASH
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
-
CREATE TABLE `events` (
`rowid` int(11) NOT NULL AUTO_INCREMENT,
`_p` varchar(255) NOT NULL,
`_t` int(10) NOT NULL,
`_n` varchar(255) DEFAULT '',
`returning` varchar(255) DEFAULT NULL,
`referrer` varchar(255) DEFAULT NULL,
`url` varchar(255) DEFAULT NULL,
[...]
`fcc_already_sells_online` varchar(255) DEFAULT NULL,
UNIQUE KEY `_p` (`_p`,`_t`,`_n`),
KEY `rowid` (`rowid`),
KEY `IDX_P` (`_p`) USING HASH
) ENGINE=InnoDB AUTO_INCREMENT=5231165 DEFAULT CHARSET=utf8;
So, why does this query:
SELECT SQL_NO_CACHE * FROM events WHERE _p IN (SELECT identity2 FROM identities WHERE identity1 = 'user#example.com') ORDER BY _t
take ~40 seconds, while this one:
SELECT SQL_NO_CACHE * FROM events WHERE _p = 'user#example.com' OR _p = 'user2#example.com' OR _p = 'user3#example.com' OR _p = 'user4#example.com' ORDER BY _t
take only 20 ms when they are basically the same?
Edit:
This inner query takes 3.3 ms:
SELECT SQL_NO_CACHE identity2 FROM identities WHERE identity1 = 'user#example.com'
The cause:
MySQL treats the conditions IN <static values list> and IN <sub-query> as different things. The documentation states that the second is equivalent to an = ANY() query, which cannot use the index even if that index exists. MySQL is just not ingenious enough to do it. By contrast, the first is treated as a simple range scan when the index is there, meaning that MySQL can easily use the index.
Possible ways to resolve:
As I see it, there are workarounds, and you've already mentioned one of them. The options are:
Using JOIN. If there is a field to join by, this is most likely the best way to solve the problem. Since version 5.6, MySQL already tries to apply this optimization itself where possible, but that does not work in complex cases or where there is no dependent sub-query (basically, when MySQL cannot "track" that reference). Looking at your case, this isn't an option, and this is actually what is not happening for your sub-query.
Querying the sub-resource in the application and forming the static list. Although common practice is to avoid multiple queries because of connection, network and query-planning overhead, this is a case where it can actually work. In your case, even with something like 200 ms of overhead for all of the above, it is still worth querying the sub-resource independently and substituting the static list into the next query in the application.
this is already asked
It's easier to manage the IN operator because it is only a construct that defines the OR operator over multiple conditions using the = operator on the same value. If you use the OR operator, the optimizer may not consider that you're always using the = operator on the same value.
Because your query is calling this inner query for each row in the events table.
In the second case the identities table is not used.
You should use a join instead.
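A join version of the slow query might look like the sketch below. It relies on the UNIQUE KEY on (identity1, identity2) shown in the schema, which means a given identity1 cannot list the same identity2 twice, so the join should not return duplicate event rows compared with the IN form. SQL_NO_CACHE is left out here because it is not needed for the rewrite itself.

```sql
SELECT e.*
FROM identities AS i
JOIN events AS e ON e._p = i.identity2
WHERE i.identity1 = 'user#example.com'
ORDER BY e._t;
```

Running EXPLAIN on this form should show whether identities is read first through its identity1 index and events is then probed by _p, which is the access pattern the fast OR query gets.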

MySQL - multiple column index

I'm learning about MySQL indexes and found that an index should be applied to any column named in the WHERE clause of a SELECT query.
Then I found Multiple Column Index vs Multiple Indexes.
First question: I was wondering what a multiple-column index is. I found the code below in Joomla; is this a multiple-column index?
CREATE TABLE `extensions` (
`extension_id` INT(11) NOT NULL AUTO_INCREMENT,
`name` VARCHAR(100) NOT NULL,
`type` VARCHAR(20) NOT NULL,
`element` VARCHAR(100) NOT NULL,
`folder` VARCHAR(100) NOT NULL,
`client_id` TINYINT(3) NOT NULL,
... ...
PRIMARY KEY (`extension_id`),
-- is the code below a multiple-column index?
INDEX `element_clientid` (`element`, `client_id`),
INDEX `element_folder_clientid` (`element`, `folder`, `client_id`),
INDEX `extension` (`type`, `element`, `folder`, `client_id`)
)
Second question: am I correct in thinking that one multiple-column index is used per SELECT?
SELECT column_x FROM extensions WHERE element=x AND client_id=y; -- index: element_clientid
SELECT ex.col_a, tb.col_b
FROM extensions ex
LEFT JOIN table2 tb
ON (ex.ext_id = tb.ext_id)
WHERE ex.element=x AND ex.folder=y AND ex.client_id=z; -- index: element_folder_clientid
General rule of thumb for indexes is to slap one onto any field used in a WHERE or JOIN clause.
That being said, there are some optimizations you can do. If you KNOW that a certain combination of fields are the only one that will ever be used in WHERE on a particular table, then you can create a single multi-field key on just those fields, e.g.
INDEX (field1, field2, field5)
v.s.
INDEX (field1),
INDEX (field2),
INDEX (field5)
A multi-field index can be more efficient in many cases than having to scan multiple indexes. The downside is that a multi-field index is only usable when the WHERE clause filters on its leading (leftmost) fields; a query that filters only on field2 or field5 cannot use INDEX (field1, field2, field5) for lookups.
With your sample queries, since element and client_id are in all three indexes, you might be better off splitting them off into their own dedicated index. If these are changeable fields, it's better to keep them in their own dedicated index, e.g. if you ever have to change client_id in bulk, the DB has to update 3 different indexes, vs. updating just one dedicated one.
But it all comes down to benchmarking - test your particular setup with various index setups and see which performs best. Rules of thumbs are handy, but don't work 100% of the time.
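One low-cost way to run that kind of test is to compare EXPLAIN output for different WHERE clauses against the extensions table from the question. The literal values below are placeholders, not real Joomla data, and the comments describe what would normally be expected rather than guaranteed plans.

```sql
-- Both leading columns of element_clientid are constrained,
-- so that index is a candidate for a direct lookup.
EXPLAIN SELECT extension_id FROM extensions
WHERE element = 'com_example' AND client_id = 1;

-- Only element is constrained; it is the first column of
-- element_clientid and element_folder_clientid, so either can be used.
EXPLAIN SELECT extension_id FROM extensions
WHERE element = 'com_example';

-- No index in the table starts with client_id, so none of them can
-- serve as a search key here (the optimizer may still scan one).
EXPLAIN SELECT extension_id FROM extensions
WHERE client_id = 1;
```

Comparing the key and rows columns across the three plans shows which indexes actually earn their keep for a given workload.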

Optimizing MySQL Query, takes almost 20 seconds!

I'm running the following query on a MacBook Pro (2.53 GHz, 4 GB of RAM):
SELECT
c.id AS id,
c.name AS name,
c.parent_id AS parent_id,
s.domain AS domain_name,
s.domain_id AS domain_id,
NULL AS stats
FROM
stats s
LEFT JOIN stats_id_category sic ON s.id = sic.stats_id
LEFT JOIN categories c ON c.id = sic.category_id
GROUP BY
c.name
It takes about 17 seconds to complete.
EXPLAIN (screenshot): http://img7.imageshack.us/img7/1364/picture1va.png
The tables:
Information:
Number of rows: 147397
Data size: 20.3MB
Index size: 1.4MB
Table:
CREATE TABLE `stats` (
`id` int(11) unsigned NOT NULL auto_increment,
`time` int(11) NOT NULL,
`domain` varchar(40) NOT NULL,
`ip` varchar(20) NOT NULL,
`user_agent` varchar(255) NOT NULL,
`domain_id` int(11) NOT NULL,
`date` timestamp NOT NULL default CURRENT_TIMESTAMP,
`referrer` varchar(400) default NULL,
KEY `id` (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=147398 DEFAULT CHARSET=utf8
Information second table:
Number of rows: 1285093
Data size: 11MB
Index size: 17.5MB
Second table:
CREATE TABLE `stats_id_category` (
`stats_id` int(11) NOT NULL,
`category_id` int(11) NOT NULL,
KEY `stats_id` (`stats_id`,`category_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
Information third table:
Number of rows: 161
Data size: 3.9KB
Index size: 8KB
Third table:
CREATE TABLE `categories` (
`id` int(11) NOT NULL auto_increment,
`parent_id` int(11) default NULL,
`name` varchar(40) NOT NULL,
`questions_category_id` int(11) NOT NULL default '0',
`rank` int(2) NOT NULL default '0',
PRIMARY KEY (`id`),
KEY `id` (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=205 DEFAULT CHARSET=latin1
Hopefully someone can help me speed this up.
I see several WTF's in your query:
You use two LEFT OUTER JOINs but then you group by the c.name column which might have no matches. So perhaps you don't really need an outer join? If that's the case, you should use an inner join, because outer joins are often slower.
You are grouping by c.name but this gives ambiguous results for every other column in your select-list. I.e. there might be multiple values in these columns in each grouping by c.name. You're lucky you're using MySQL, because this query would simply give an error in any other RDBMS.
This is a performance issue because the GROUP BY is likely causing the "using temporary; using filesort" you see in the EXPLAIN. This is a notorious performance-killer, and it's probably the single biggest reason this query is taking 17 seconds. Since it's not clear why you're using GROUP BY at all (using no aggregate functions, and violating the Single-Value Rule), it seems like you need to rethink this.
You are grouping by c.name which doesn't have a UNIQUE constraint on it. You could in theory have multiple categories with the same name, and these would be lumped together in a group. I wonder why you don't group by c.id if you want one group per category.
SELECT NULL AS stats: I don't understand why you need this. It's kind of like creating a variable that you never use. It shouldn't harm performance, but it's just another WTF that makes me think you haven't thought this query through very well.
You say in a comment you're looking for number of visitors per category. But your query doesn't have any aggregate functions like SUM() or COUNT(). And your select-list includes s.domain and s.domain_id which would be different for every visitor, right? So what value do you expect to be in the result set if you only have one row per category? This isn't really a performance issue either, it just means your query results don't tell you anything useful.
Your stats_id_category table has an index over its two columns, but no primary key. So you can easily get duplicate rows, and this means your count of visitors may be inaccurate. You need to drop that redundant index and use a primary key instead. I'd order category_id first in that primary key, so the join can take advantage of the index.
ALTER TABLE stats_id_category DROP KEY stats_id,
ADD PRIMARY KEY (category_id, stats_id);
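Since the table currently allows duplicate rows, adding a primary key will fail if any exist. A quick check like the following, run before the ALTER TABLE, would list any (stats_id, category_id) pairs that appear more than once; this is a sketch and assumes the table has only the two columns shown.

```sql
SELECT stats_id, category_id, COUNT(*) AS copies
FROM stats_id_category
GROUP BY stats_id, category_id
HAVING COUNT(*) > 1;
```

An empty result means the primary key can be added as written; otherwise the duplicates need to be removed first.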
Now you can eliminate one of your joins, if all you need to count is the number of visitors:
SELECT c.id, c.name, c.parent_id, COUNT(*) AS num_visitors
FROM categories c
INNER JOIN stats_id_category sic ON (sic.category_id = c.id)
GROUP BY c.id;
Now the query doesn't need to read the stats table at all, or even the stats_id_category table. It can get its count simply by reading the index of the stats_id_category table, which should eliminate a lot of work.
You are missing the third table in the information provided (categories).
Also, it seems odd that you are doing a LEFT JOIN and then using the right table (which might be all NULLS) in the GROUP BY. You will end up grouping all of the non-matching rows together as a result, is that what you intended?
Finally, can you provide an EXPLAIN for the SELECT?
Harrison is right; we need the other table. I would start by adding an index on category_id to stats_id_category, though.
I agree with Bill. Point 2 is very important. The query doesn't even make logical sense. Also, the simple fact that there is no WHERE clause means that you have to pull back every row in the stats table, which seems to be around 140000. It then has to sort all that data so that it can perform the GROUP BY. This is because sorting [ O(n log n) ] and then finding duplicates [ O(n) ] is much faster than just finding duplicates without sorting the data set [ O(n^2)?? ].