How can one break up an UPDATE into subqueries? - mysql

There exists a table:
CREATE TABLE person
(
id INT(10) PRIMARY KEY AUTO_INCREMENT,
nameFirst VARCHAR(255) DEFAULT '?',
nameSecond VARCHAR(255) DEFAULT '',
fatherNameFirst VARCHAR(255) DEFAULT NULL
);
Note: There are actually other columns there, 18 in total, but they are not being used here.
The goal is to set up father's first name from using second name of the child. It can be predicted from the second name (patronymic) in russian language, but not always correctly. So i plan to do what can be done automatically and some will do by hand later.
So the UPDATE done as follows:
UPDATE person AS child
LEFT JOIN
(SELECT DISTINCT nameFirst FROM person) AS parent
ON CONCAT(parent.nameFirst,'овна')=child.nameSecond OR CONCAT(parent.nameFirst,'ович')=child.nameSecond
SET child.fatherNameFirst=parent.nameFirst;
Eventually it will need to run on a table that has >2m entries, for now i have tried with the sample data of 400k. The problem is that after about an hour of my computer using one if its cores at 100% the query has not yet finished.
So i was thinking if i can break it up into subqueries, so these can be set to run one after another, but they should each take 5-10 minutes. This way if i need to do something, i can terminate currently running one and not lose a day of CPU time.
I have attempted to add: WHERE child.id<1000 but either it was still way too long or had little impact (perhaps i misunderstand how MariaDB opens up this update).
In case sample data will actually help somebody understand it better:
select id, nameFirst, nameSecond from person limit 10;
+----+--------------------+----------------------------+
| id | nameFirst | nameSecond |
+----+--------------------+----------------------------+
| 1 | Туликович | |
| 2 | Август | Михайлович |
| 3 | Август | Христианович |
| 4 | Александр | Александрович |
| 5 | Александр | Христьянович |
| 6 | Альберт | Викторович |
| 7 | Альбрехт | Александрович |
| 8 | Амалия | Андреевна |
| 9 | Амалия | Ивановна |
| 10 | Ангелина | Андреевна |
+----+--------------------+----------------------------+
fatherNameFirst is empty at this time.

You could break this down alphabetically by adding where nameFirst like 'A%' to the update query - and then run the query multiple times.

Given this sample data:
CREATE TABLE person
(
id INT(10) PRIMARY KEY AUTO_INCREMENT,
nameFirst VARCHAR(255) DEFAULT '?',
nameSecond VARCHAR(255) DEFAULT '',
fatherNameFirst VARCHAR(255) DEFAULT NULL
) DEFAULT CHARSET=utf8;
INSERT INTO person
(`id`, `nameFirst`, `nameSecond`)
VALUES
(1, 'Туликович', NULL),
(2, 'Август', 'Михайлович'),
(3, 'Август', 'Христианович'),
(4, 'Александр', 'Александрович'),
(5, 'Александр', 'Христьянович'),
(6, 'Альберт', 'Викторович'),
(7, 'Альбрехт', 'Александрович'),
(8, 'Амалия', 'Андреевна'),
(9, 'Амалия', 'Ивановна'),
(10, 'Ангелина', 'Андреевна')
;
with your query you get this EXPLAIN output:
+----+-------------+------------+------+---------------+------+---------+------+------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+---------------+------+---------+------+------+----------------------------------------------------+
| 1 | PRIMARY | child | ALL | NULL | NULL | NULL | NULL | 10 | NULL |
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 10 | Using where; Using join buffer (Block Nested Loop) |
| 2 | DERIVED | person | ALL | NULL | NULL | NULL | NULL | 10 | Using temporary |
+----+-------------+------------+------+---------------+------+---------+------+------+----------------------------------------------------+
That's probably the worst you can get.
Let's see if we can rewrite this. First, there's absolutely no need for this subquery and the DISTINCT.
mysql > explain UPDATE person AS child
-> LEFT JOIN person parent
-> ON CONCAT(parent.nameFirst,'овна')=child.nameSecond OR CONCAT(parent.nameFirst,'ович')=child.nameSecond
-> SET child.fatherNameFirst=parent.nameFirst;
+----+-------------+--------+------+---------------+------+---------+------+------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+------+---------+------+------+----------------------------------------------------+
| 1 | SIMPLE | child | ALL | NULL | NULL | NULL | NULL | 10 | NULL |
| 1 | SIMPLE | parent | ALL | NULL | NULL | NULL | NULL | 10 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+--------+------+---------------+------+---------+------+------+----------------------------------------------------+
This eliminates the Using temporary. That's good.
With an index on nameFirst, we can speed this up further.
CREATE INDEX idx_person_nameFirst ON person(nameFirst);
Then explaining again:
+----+-------------+--------+-------+---------------+----------------------+---------+------+------+-----------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+----------------------+---------+------+------+-----------------------------------------------------------------+
| 1 | SIMPLE | child | ALL | NULL | NULL | NULL | NULL | 10 | NULL |
| 1 | SIMPLE | parent | index | NULL | idx_person_nameFirst | 768 | NULL | 10 | Using where; Using index; Using join buffer (Block Nested Loop) |
+----+-------------+--------+-------+---------------+----------------------+---------+------+------+-----------------------------------------------------------------+
Not yet perfect, but it's using the index. This should speed things up a lot.
From here on, it gets hard to optimize further. You can experiment a bit by adjusting the join buffer size, but I recommend you do this in a session only.
SET SESSION join_buffer_size = <whatever value>;
Every thread connecting to your server uses its own join buffer. That's why you should only test it in a session. When you have very much connections on your server, memory consumption could get out of hand.

Related

MySQL LEFT JOIN or WHERE IN SUBQUERY

I need a piece of advice, building an app now and I need to run some queries on rather large tables, possibly at a very frequent rate, so I'm trying to get the best approach performance wise.
I have the following 2 tables:
Albums:
+---------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| eventid | int(11) | NO | MUL | NULL | |
| album | varchar(200) | NO | | NULL | |
| filename | varchar(200) | NO | | NULL | |
| obstacle_time | time | NO | | NULL | |
+---------------+--------------+------+-----+---------+----------------+
and keywords:
+-------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| eventid | int(11) | NO | MUL | NULL | |
| filename | varchar(200) | NO | | NULL | |
| bibnumbers | varchar(200) | NO | | NULL | |
| gender | varchar(20) | YES | | NULL | |
| top_style | varchar(20) | YES | | NULL | |
| pants_style | varchar(20) | YES | | NULL | |
| other | varchar(20) | YES | | NULL | |
| cap | varchar(200) | NO | | NULL | |
| tshirt | varchar(200) | NO | | NULL | |
| pants | varchar(200) | NO | | NULL | |
+-------------+--------------+------+-----+---------+----------------+
Both table have a unique_index declared which is a constraint of the eventid+filename column.
Both table contains information about some images, but the albums table is available instantly (as soon as I have the images), while the keywords table usually becomes available several days later after a manual tagging of the images is completed
Now I will have people searching for all kind of things once the tagging is enabled, but since the results can be HUGE (up to 10.000 or more) I'm only showing them in small chunks so the browser doesn't get killed with trying to load a huge amount of images, because of this my server will be hit with loads of query requests (every time the visitor scrolls to the bottom of the page, an ajax query will return the next chunk of images).
Now my question is, which of the following queries is better performance wise:
SELECT `albums`.`filename`,`basket`.`id`,`albums`.`id`,`obstacle_time`
FROM `albums`
LEFT JOIN `basket`
ON `basket`.`eventid` = `albums`.`eventid`
AND `basket`.`fileid` = `albums`.`id`
AND `basket`.`visitor_id` = 1
LEFT JOIN `keywords`
ON `keywords`.`eventid` = `albums`.`eventid`
AND `albums`.`filename` = `keywords`.`filename`
WHERE
`albums_2015`.`eventid` = 1
AND `album` LIKE '%string%'
AND `obstacle_time` >= '08:00:00'
AND `obstacle_time` <= '14:11:10'
AND `gender` = 1
AND `top_style` REGEXP '[[:<:]]0[[:>:]]|[[:<:]]1[[:>:]]'
AND `cap` = '2'
AND `tshirt` = '1'
AND `pants` = '3'
ORDER BY `obstacle_time`
LIMIT X, 10
OR using an IN CLAUSE inside WHERE like:
SELECT `albums`.`filename`,`basket`.`id`,`albums`.`id`,`obstacle_time`
FROM `albums`
LEFT JOIN `basket`
ON `basket`.`eventid` = `albums`.`eventid`
AND `basket`.`fileid` = `albums`.`id`
AND `basket`.`visitor_id` = 1
WHERE
`albums_2015`.`eventid` = 1
AND `album` LIKE '%string%'
AND `obstacle_time` >= '08:00:00'
AND `obstacle_time` <= '14:11:10'
AND `filename` IN (
SELECT `filename`
FROM `keywrods`
WHERE
`eventid` = 1
AND `gender` = 1
AND `top_style` REGEXP '[[:<:]]0[[:>:]]|[[:<:]]1[[:>:]]'
AND `cap` = '2'
AND `tshirt` = '1'
AND `pants` = '3'
)
ORDER BY `obstacle_time`
LIMIT X, 10
I had looked to similar questions but wasn't able to figure it out which is the best course of action.
My understanding so far is that:
Using LEFT JOINtakes advantages of INDEXING, BUT!!! if I use it I will get a full join of the tables even when I only need a significantly smaller result set, so it's almost a wast to join thousands of rows just to then filter out most of them.
Using IN and subquery isn't indexed??? I'm not 100% sure about this, I'm using MySQL 5.6 and to the best of my understanding since 5.6 even subqueries get automatically indexed my MySQL. I think this method has benefits when there result is significantly filtered, not sure if there will be any benefit if the subquery will return all the possible filenames.
As footnote questions:
Should I consider returning the whole result to the client on the first query and use client side (HTML) techniques to load the images gradually rather than re-querying the server each time?
Should I consider merging the 2 tables into 1, how much of a performance impact will that have? (can be tricky due to various reasons, which have no place in the question)
Thanks.
EDIT 1
Explain for JOIN query:
+----+-------------+---------------+--------+---------------+--------------+---------+----------------------------------------+------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+--------+---------------+--------------+---------+----------------------------------------+------+----------------------------------------------------+
| 1 | SIMPLE | albums_2015 | ref | unique_index | unique_index | 4 | const | 6475 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | basket | ALL | NULL | NULL | NULL | NULL | 2 | Using where; Using join buffer (Block Nested Loop) |
| 1 | SIMPLE | keywords_2015 | eq_ref | unique_index | unique_index | 206 | const,mybibnumber.albums_2015.filename | 1 | Using index |
+----+-------------+---------------+--------+---------------+--------------+---------+----------------------------------------+------+----------------------------------------------------+
Using WHERE IN:
+----+-------------+---------------+--------+---------------+--------------+---------+----------------------------------------+------+----------------------------------------------------+--+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | |
+----+-------------+---------------+--------+---------------+--------------+---------+----------------------------------------+------+----------------------------------------------------+--+
| 1 | SIMPLE | albums_2015 | ref | unique_index | unique_index | 4 | const | 6475 | Using where; Using temporary; Using filesort | |
| 1 | SIMPLE | keywords_2015 | eq_ref | unique_index | unique_index | 206 | const,mybibnumber.albums_2015.filename | 1 | Using where | |
| 1 | SIMPLE | basket | ALL | NULL | NULL | NULL | NULL | 2 | Using where; Using join buffer (Block Nested Loop) | |
+----+-------------+---------------+--------+---------------+--------------+---------+----------------------------------------+------+----------------------------------------------------+--+
EDIT 2
I wasn't able to set up a SQL Fiddler (keep getting error of something went wrong), so I have created a test database on one of my servers.
Address: http://188.165.217.185/phpmyadmin/, user: temp_test, pass: test_temp
I'm still building the whole thing and I don't have all the values filled in yet, like top_style, pants_style, etc, so a more appropriate query for the test scenario will be:
WHERE IN:
SELECT `albums_2015`.`filename`,
`albums_2015`.`id`,
`obstacle_time`
FROM `albums_2015`
WHERE `albums_2015`.`eventid` = 1
AND `album` LIKE '%'
AND `obstacle_time` >= '08:00:00'
AND `obstacle_time` <= '14:11:10'
AND `filename` IN (SELECT `filename`
FROM `keywords_2015`
WHERE eventid = 1
AND
`bibnumbers` REGEXP '[[:<:]]113[[:>:]]|[[:<:]]106[[:>:]]')
ORDER BY `obstacle_time`
LIMIT 0, 10
LEFT JOIN
SELECT `albums_2015`.`filename`,`albums_2015`.`id`,`obstacle_time`
FROM `albums_2015`
LEFT JOIN `keywords_2015`
ON `keywords_2015`.`eventid` = `albums_2015`.`eventid`
AND `albums_2015`.`filename` = `keywords_2015`.`filename`
WHERE
`albums_2015`.`eventid` = 1
AND `album` LIKE '%'
AND `obstacle_time` >= '08:00:00'
AND `obstacle_time` <= '14:11:10'
AND `bibnumbers` REGEXP '[[:<:]]113[[:>:]]|[[:<:]]106[[:>:]]'
ORDER BY `obstacle_time`
LIMIT 0, 10
More a bunch of tips :
Join using index are the best if you have to deal with multi table query,
Don't mind adding some index to speed up your query (index take space, but on INT field it's nothing and you gain way more than you lose).
In case of big table, caching the data in the distant table is usually a good idea.
An insert Trigger on TAG_table that cache the displayed part in the distant table (like the tag name for the overview of albums) can help you keeping your join query at a descent frequency.
Be careful with REGEX, it's something that hurt badly the perf. Adding a new table to split data is a better idea (and use indexing which is native optimisation)
For every field in a WHERE clause of a big and frequent query you should have an index on it. If you can't put one, then your DB model is f**cked-up and need to be changed.

How to optimize mysql query with COUNT(*) and GROUP BY

I have the original table of 4 columns, described as follows:
+----------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+-------------+------+-----+---------+-------+
| FieldID | varchar(10) | NO | MUL | NULL | |
| PaperID | varchar(10) | NO | | NULL | |
| RefID | varchar(10) | NO | | NULL | |
| FieldID2 | varchar(10) | NO | MUL | NULL | |
+----------+-------------+------+-----+---------+-------+
I want to run a query with COUNT(*) and GROUP BY :
select FieldID, FieldID2, count(*) from nFPRF75_1 GROUP BY FieldID, FieldID2
I've created indexes on both column FieldID and column FieldID2, however, they seem to be ineffective. I have also tried OPTIMIZE table_name and created redundant indexes on these two columns (as is indicated by other optimization questions), unfortunately it didn't work out either.
Here is what I get from EXPLAIN:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------+---------------+------+---------+------+----------+---------------------------------+
| 1 | SIMPLE | nFPRF75_1 | ALL | NULL | NULL | NULL | NULL | 90412507 | Using temporary; Using filesort |
I wonder if there's anyway that I can use indexes in this query, or any other way to optimize it. Now it's of very low efficiency since there's lots of lines.
Thanks a lot for the help!
You should create a multi-column index of (FieldID, FieldID2).
Create an index of FieldID, FieldID2 if you are grouping by them. That must improve the speed.
Also, I recommend you change count(*) to count('myIntColumn') which improve the speed too.

Comparing performance of two queries in MySQL

I am trying to optimize a query on a mysql table I've created. I expect that there will be many many rows in the table. Looking at this question the accepted answer and the top voted answer suggests two different approaches.
I wrote these two queries and want to know which one is more performant.
SELECT uv.*
FROM UserVisit uv INNER JOIN
(SELECT ID,MAX(visitDate) visitDate
FROM UserVisit GROUP BY ID) last
ON (uv.ID = last.ID AND uv.visitDate = last.visitDate);
Running this with EXPLAIN yields:
+----+-------------+------------+--------+---------------+---------+---------+--------------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------+---------+---------+--------------------------------+------+-------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 2 | |
| 1 | PRIMARY | uv | eq_ref | PRIMARY | PRIMARY | 11 | last.playscanID,last.visitDate | 1 | |
| 2 | DERIVED | UserVisit | index | NULL | PRIMARY | 11 | NULL | 4 | Using index |
+----+-------------+------------+--------+---------------+---------+---------+--------------------------------+------+-------------+
3 rows in set (0.01 sec)
And the other query:
SELECT lastVisits.*
FROM ( SELECT * FROM UserVisit ORDER BY visitDate DESC ) lastVisits
GROUP BY lastVisits.ID
Running that with EXPLAIN yields:
+----+-------------+------------+------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+---------------+------+---------+------+------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 4 | Using temporary; Using filesort |
| 2 | DERIVED | UserVisit | ALL | NULL | NULL | NULL | NULL | 4 | Using filesort |
+----+-------------+------------+------+---------------+------+---------+------+------+---------------------------------+
2 rows in set (0.00 sec)
I am uncertain how to interpret the result of the two EXPLAINs.
Which of these queries can I expect to be faster and why?
EDIT:
This is the way UserVisit table looks:
+----------------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------+---------------------+------+-----+---------+-------+
| ID | bigint(20) unsigned | NO | PRI | NULL | |
| visitDate | date | NO | PRI | NULL | |
| visitTime | time | NO | | NULL | |
| analysisResult | decimal(3,2) | NO | | NULL | |
+----------------+---------------------+------+-----+---------+-------+
4 rows in set (0.00 sec)
Firstly, you might want to read the manual on EXPLAIN. It's a dense read, but it should provide most of the information you want.
Secondly, as Strawberry says, the second query works by accident. The behaviour may change in future versions, and your query would not return an error, just different data. That's nearly always a bad thing.
Finally, the EXPLAIN suggests that version 1 will be faster. In EXTRA, it's saying it's using an index, which is much faster than filesort. Without a schema, it's hard to be sure, but I think you will also benefit from a compound key on ID and visitdate.

MySQL simple join using temporary table

I've got (what I thought) was a very simple query on a MySQL database, but using explain shows that the query is using a temporary table. I've tried changing the order of the select and the join but to no avail. I've reduced the tables to their simplest (to see if it was an issue with the complexity of my tables, but I still have the problem). I've tried it with two basic tables, one with a "name" field, and the other with a foreign key reference back to that table:
a:
+-------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| name | varchar(128) | NO | MUL | NULL | |
+-------+--------------+------+-----+---------+----------------+
b:
+-------+---------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| a_id | int(11) | NO | MUL | NULL | |
+-------+---------+------+-----+---------+----------------+
And this is my query:
SELECT a.id, a.name FROM a JOIN b ON a.id = b.a_id ORDER BY a.name;
Which I thought was very simple... just a list of all records in a that have records in b, ordered by name. Alas explain says this:
+----+-------------+-------+--------+---------------+---------+---------+--------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+--------+------+----------------------------------------------+
| 1 | SIMPLE | b | index | a_id | a_id | 4 | NULL | 2 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | a | eq_ref | PRIMARY | PRIMARY | 4 | b.a_id | 1 | |
+----+-------------+-------+--------+---------------+---------+---------+--------+------+----------------------------------------------+
It looks like it should be using the key on the b table but for some reason it isn't. I get the feeling I'm missing something basic (or my RDBMS knowledge needs brushing up somewhat). Anyone any ideas why it's using a temporary table for such a simple query?

MySQL query optimisation help

hoping you can help me on the right track to start optimising my queries. I've never thought too much about optimisation before, but I have a few queries similar to the one below and want to start concentrating on improving their efficiency. An example of a query which I badly need to optimise is as follows:
SELECT COUNT(*) AS `records_found`
FROM (`records_owners` AS `ro`, `records` AS `r`)
WHERE r.reg_no = ro.contact_no
AND `contacted_email` <> "0000-00-00"
AND `contacted_post` <> "0000-00-00"
AND `ro`.`import_date` BETWEEN "2010-01-01" AND "2010-07-11" AND `r`.`pa_date_of_birth` > "2010-01-01" AND EXISTS ( SELECT `number` FROM `roles` WHERE `roles`.`number` = r.`reg_no` )
Running EXPLAIN on the above produces the following:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+--------+---------------+---------+---------+---------------------------------------+-------+-------------+
| 1 | PRIMARY | r | ALL | NULL | NULL | NULL | NULL | 21533 | Using where |
| 1 | PRIMARY | ro | eq_ref | PRIMARY | PRIMARY | 4 | r.reg_no | 1 | Using where |
| 2 | DEPENDENT SUBQUERY | roles | ALL | NULL | NULL | NULL | NULL | 189 | Using where |
As you can see, you have a dependent subquery, which is one of the worst thing performance-wise in MySQL. See here for tips:
http://dev.mysql.com/doc/refman/5.0/en/select-optimization.html
http://dev.mysql.com/doc/refman/5.0/en/in-subquery-optimization.html