I am quite new to MySQL. I have two identical MySQL tables with 50K rows (70 columns) each. The tables are updated every day by a data feed. I need to run some nested queries such as intersections, subtractions, etc.
One of the queries I am trying to use is below.
It doesn't work properly: it either takes 5 to 10 minutes (run from the terminal) or never responds.
SELECT *
FROM table1
WHERE table1.sku IN (SELECT t1.sku
FROM ((SELECT DISTINCT sku
FROM table2)
UNION ALL
(SELECT DISTINCT sku
FROM table1)) AS t1
GROUP BY sku
HAVING Count(*) >= 2)
How can I make it work faster/properly? How should I configure the tables/columns (index, primary key, etc.)? Or do I need to tune the MySQL server?
I tried several things. I created indexes on the 'sku' columns, which are varchar(75). My database server runs on a single-core (Digital Ocean) server with 512MB of memory.
--- query with 'EXPLAIN'
+----+--------------------+-----------------------+-------+---------------+---------+---------+------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-----------------------+-------+---------------+---------+---------+------+-------+---------------------------------+
| 1 | PRIMARY | table1 | ALL | NULL | NULL | NULL | NULL | 30260 | Using where |
| 2 | DEPENDENT SUBQUERY | <derived3> | ALL | NULL | NULL | NULL | NULL | 65677 | Using temporary; Using filesort |
| 3 | DERIVED | table2 | range | NULL | sku_idx | 227 | NULL | 31016 | Using index for group-by |
| 4 | UNION | table1 | range | NULL | sku | 227 | NULL | 30261 | Using index for group-by |
| NULL | UNION RESULT | <union3,4> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------------+-----------------------+-------+---------------+---------+---------+------+-------+---------------------------------+
If I understand this particular query correctly, you are trying to display all the records from table1 which have a corresponding sku in table2.
That can be achieved by a much simpler query:
SELECT *
FROM table1
WHERE table1.sku IN (SELECT DISTINCT table2.sku FROM table2 )
GROUP BY table1.sku
Or, with joins:
SELECT table1.*
FROM table1
INNER JOIN table2 ON table1.sku = table2.sku
GROUP BY table1.sku
This should work in an instant if you have indexes on table1.sku and table2.sku
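For reference, here is a sketch of the same idea written with EXISTS instead of GROUP BY. It assumes every row of table1 with a matching sku should be returned exactly once. The index names are hypothetical, and since the question says sku is already indexed, the CREATE INDEX statements may be unnecessary:

```sql
-- Hypothetical index names; skip these if sku is already indexed on both tables.
CREATE INDEX idx_table1_sku ON table1 (sku);
CREATE INDEX idx_table2_sku ON table2 (sku);

-- EXISTS only checks for at least one match, so duplicate skus in table2
-- cannot multiply the rows returned from table1.
SELECT t1.*
FROM table1 AS t1
WHERE EXISTS (SELECT 1
              FROM table2 AS t2
              WHERE t2.sku = t1.sku);
```

Because no GROUP BY is involved, this form should also run under the ONLY_FULL_GROUP_BY SQL mode.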
mysql version is 5.5.40-0+wheezy1-log
I have this query:
SELECT cycle_id, sum(fst_field) + sum(snd_field) AS tot_sum
FROM mytable WHERE parent_id IN (
SELECT id FROM mytable WHERE cycle_id = 2662
)
I have these indexes:
parent_id
parent_id, cycle_id, fst_field, snd_field
If I execute the command
EXPLAIN EXTENDED SELECT cycle_id, sum(fst_field) + sum(snd_field) AS tot_sum
FROM mytable WHERE parent_id IN (
SELECT id FROM mytable WHERE cycle_id = 2662
)
This is the result:
+----+--------------------+-----------+-----------------+----------------------+---------+---------+------+--------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-----------+-----------------+----------------------+---------+---------+------+--------+----------+-------------+
| 1 | PRIMARY | mytable | ALL | NULL | NULL | NULL | NULL | 185971 | 100.00 | Using where |
| 2 | DEPENDENT SUBQUERY | mytable | unique_subquery | PRIMARY,cycle_id_idx | PRIMARY | 4 | func | 1 | 100.00 | Using where |
+----+--------------------+-----------+-----------------+----------------------+---------+---------+------+--------+----------+-------------+
It does not use any index. I tried to add other composite indexes (I tried several), without success.
I don't remember if 5.5 still had very crude handling of IN ( SELECT ... ). If so, that would probably explain the problem: the DEPENDENT SUBQUERY in your EXPLAIN suggests the subquery is re-evaluated for each of the 185971 outer rows.
Consider upgrading to 5.6 or 5.7 or 8.0.
Convert the query to use a JOIN.
INDEX(cycle_id) is needed.
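A possible JOIN form of the query, as a sketch only. The aliases p and c are made up for illustration. The bare cycle_id column from the question is left out here because it is not grouped, and selecting it alongside SUM() is rejected under ONLY_FULL_GROUP_BY:

```sql
-- p = parent rows picked by cycle_id; c = rows whose parent_id points at them.
-- id is the PRIMARY KEY (per the EXPLAIN), so the join cannot duplicate c rows.
SELECT SUM(c.fst_field) + SUM(c.snd_field) AS tot_sum
FROM mytable AS p
JOIN mytable AS c ON c.parent_id = p.id
WHERE p.cycle_id = 2662;
```

With this shape, the optimizer can start from the index on cycle_id and then look up matching rows through the index that begins with parent_id.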
I have a MySQL query which has a JOIN of 12 tables. When I explain the query, it shows 394699 rows for one table and 185368 rows for another table. All other tables have 1-3 rows. The query returns only 472 rows in total, but it takes more than 1 minute.
Is there any way to check how many rows were examined to produce this result, so that I can find which table costs the most time?
I am giving the query structure below. As the table structure is too large, I am not able to provide it here.
SELECT h.nid,h.attached_nid,h.created, s.field_species_value as species, g.field_gender_value as gender, u.field_unique_id_value as unqid, n.title, dob.field_adult_healthy_weight_value as birth_date, dcolor.field_dog_primary_color_value as dogcolor, ccolor.field_primary_color_value as catcolor, sdcolor.field_dog_secondary_color_value as sdogcolor, sccolor.field_secondary_color_value as scatcolor, dpattern.field_dog_pattern_value as dogpattern, cpattern.field_cat_pattern_value as catpattern
FROM table1 h
JOIN table2 n ON n.nid = h.nid
JOIN table3 s ON n.nid = s.entity_id
JOIN table4 u ON n.nid = u.entity_id
LEFT JOIN table5 g ON n.nid = g.entity_id
LEFT JOIN table6 dob ON n.nid = dob.entity_id
LEFT JOIN table7 AS dcolor ON n.nid = dcolor.entity_id
LEFT JOIN table8 AS ccolor ON n.nid = ccolor.entity_id
LEFT JOIN table9 AS sdcolor ON n.nid = sdcolor.entity_id
LEFT JOIN table10 AS sccolor ON n.nid = sccolor.entity_id
LEFT JOIN table11 AS dpattern ON n.nid = dpattern.entity_id
LEFT JOIN table12 AS cpattern ON n.nid = cpattern.entity_id
WHERE h.title = '4208'
AND ((h.created BETWEEN 1483257600 AND 1485935999))
AND h.uid!=1
AND h.uid IN(
SELECT etid
FROM `table`
WHERE gid=464
AND entity_type='user')
AND h.attached_nid>0
ORDER BY CAST(h.created as UNSIGNED) DESC;
Below is the EXPLAIN result which I get
+------+--------------+---------------+--------+----------------------+---------------------+---------+----------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------+---------------+--------+----------------------+---------------------+---------+----------------------+--------+----------------------------------------------+
| 1 | PRIMARY | s | index | entity_id | field_species_value | 772 | NULL | 394699 | Using index; Using temporary; Using filesort |
| 1 | PRIMARY | u | ref | entity_id | entity_id | 4 | pantheon.s.entity_id | 1 | |
| 1 | PRIMARY | n | eq_ref | PRIMARY | PRIMARY | 4 | pantheon.s.entity_id | 1 | |
| 1 | PRIMARY | g | ref | entity_id | entity_id | 4 | pantheon.s.entity_id | 1 | |
| 1 | PRIMARY | dob | ref | entity_id | entity_id | 4 | pantheon.s.entity_id | 1 | |
| 1 | PRIMARY | dcolor | ref | entity_id | entity_id | 4 | pantheon.s.entity_id | 1 | |
| 1 | PRIMARY | ccolor | ref | entity_id | entity_id | 4 | pantheon.s.entity_id | 1 | |
| 1 | PRIMARY | sdcolor | ref | entity_id | entity_id | 4 | pantheon.s.entity_id | 1 | |
| 1 | PRIMARY | sccolor | ref | entity_id | entity_id | 4 | pantheon.s.entity_id | 1 | |
| 1 | PRIMARY | dpattern | ref | entity_id | entity_id | 4 | pantheon.s.entity_id | 1 | |
| 1 | PRIMARY | cpattern | ref | entity_id | entity_id | 4 | pantheon.s.entity_id | 1 | |
| 1 | PRIMARY | h | ref | attached_nid,nid,uid | nid | 5 | pantheon.s.entity_id | 3 | Using index condition; Using where |
| 1 | PRIMARY | <subquery2> | eq_ref | distinct_key | distinct_key | 4 | func | 1 | Using where |
| 2 | MATERIALIZED | og_membership | ref | entity,gid | gid | 4 | const | 185368 | Using where |
+------+--------------+---------------+--------+----------------------+---------------------+---------+----------------------+--------+----------------------------------------------+
You can find the ROWS_EXAMINED by using the Performance Schema.
Here is a link to the performance schema quick start guide.
https://dev.mysql.com/doc/refman/5.5/en/performance-schema-quick-start.html
This is the query I run in PHP applications, to find out what queries I need to optimize. You should be able to adapt it pretty easily.
The query finds the stats for the query that was run just before it. So in my apps, I run this query after every query, store the results, and at the end of the PHP script output the stats for every query that ran during the request.
SELECT `EVENT_ID`, TRUNCATE(`TIMER_WAIT`/1000000000000,6) as Duration,
`SQL_TEXT`, `DIGEST_TEXT`, `NO_INDEX_USED`, `NO_GOOD_INDEX_USED`, `ROWS_AFFECTED`, `ROWS_SENT`, `ROWS_EXAMINED`
FROM `performance_schema`.`events_statements_history`
WHERE
`CURRENT_SCHEMA` = '{$database}' AND `EVENT_NAME` LIKE 'statement/sql/%'
AND `THREAD_ID` = (SELECT `THREAD_ID` FROM `performance_schema`.`threads` WHERE `performance_schema`.`threads`.`PROCESSLIST_ID` = CONNECTION_ID())
ORDER BY `EVENT_ID` DESC LIMIT 1;
To decrease the number of rows accessed from og_membership, try adding an index containing the gid, entity_type, and etid fields. Including gid and entity_type should make the lookup more performant and including etid will make the index a covering index.
After adding the index, run EXPLAIN again to look at the results. Based on the new explain plan, either keep the index, remove the index, and/or add an additional index. Keep doing this until you get results you are satisfied with.
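As a sketch of that loop for the og_membership lookup (the index name is hypothetical, and the subquery is copied from the question with the real table name from the EXPLAIN output):

```sql
-- Hypothetical index name; column order follows the suggestion above.
ALTER TABLE og_membership
  ADD INDEX idx_gid_type_etid (gid, entity_type, etid);

-- Re-check the plan for the subquery on its own before re-running the full query.
EXPLAIN
SELECT etid
FROM og_membership
WHERE gid = 464
  AND entity_type = 'user';
```

If the new index is picked up, the key column should show it and Extra should include Using index.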
For sure, you will want to try to eliminate any mentions of Using temporary or Using filesort. Using temporary means a temporary table is being used to run this query, probably because of the sheer size of your intermediate result. Using filesort means the ordering isn't being satisfied by an index and is instead done by sorting the matching rows.
A detailed explanation of EXPLAIN can be found at https://dev.mysql.com/doc/refman/5.7/en/explain-output.html.
Key-Value (EAV) schema sucks.
Indexes:
table1: INDEX(title, created)
table1: INDEX(uid, title, created)
table: INDEX(gid, entity_type, etid)
table* -- Is `entity_id` already an index? Can it be the PRIMARY KEY?
Does nid need to be NULL instead of NOT NULL?
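A sketch of how the two table1 indexes above could be created. The index names are hypothetical, and table1 is the anonymized table aliased as h in the question:

```sql
-- Both indexes can be added in a single ALTER TABLE statement.
ALTER TABLE table1
  ADD INDEX idx_title_created (title, created),
  ADD INDEX idx_uid_title_created (uid, title, created);
```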
If those don't do enough, try:
And turn the IN ( SELECT ... ) into a JOIN ( SELECT ... ) USING(hid)
If you still need help, please provide SHOW CREATE TABLE and EXPLAIN SELECT ...
This select query takes about 20 seconds to complete.
select Count(*)
from products as bad_rows
inner join (
select pid, MAX(last_updated_date) as maxdate
from products
group by pid
having count(*) > 1
) as good_rows on good_rows.pid= bad_rows.pid
and good_rows.maxdate <> bad_rows.last_updated_date
where bad_rows.available = 0
The delete, on the other hand, is still running after 30 minutes!
delete bad_rows
from products as bad_rows
inner join (
select pid, MAX(last_updated_date) as maxdate
from products
group by pid
having count(*) > 1
) as good_rows on good_rows.pid= bad_rows.pid
and good_rows.maxdate <> bad_rows.last_updated_date
where bad_rows.available = 0
Why?
Table Schema is as follows:
Explain for the select is as follows:
+----+-------------+------------+------+---------------+------+---------+------+-------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+---------------+------+---------+------+-------+--------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 6253 | |
| 1 | PRIMARY | bad_rows | ALL | NULL | NULL | NULL | NULL | 34603 | Using where; Using join buffer |
| 2 | DERIVED | products | ALL | NULL | NULL | NULL | NULL | 34603 | Using temporary; Using filesort|
+----+-------------+------------+------+---------------+------+---------+------+-------+--------------------------------+
OK, so I googled how to read the EXPLAIN results, which hinted that my query could be slow because there was no index on pid. It didn't actually say that; it was just a hunch I got from reading about the EXPLAIN output.
So I added an index on pid and voila. The delete was over in 1 minute!!
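For anyone repeating this, a sketch of the index creation. The index name is hypothetical. The fix described above only indexes pid; including last_updated_date as a second column is an assumption that may also help the GROUP BY pid / MAX(last_updated_date) subquery:

```sql
-- Hypothetical index name. pid alone matches the fix described above;
-- the second column is an optional extra, not part of the original fix.
ALTER TABLE products
  ADD INDEX idx_pid_updated (pid, last_updated_date);
```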
I have the following query
SELECT a.id, b.id from table1 AS a, table2 AS b
WHERE a.table2_id IS NULL
AND a.plane = SUBSTRING(b.imb, 1, 20)
AND (a.stat LIKE "f%" OR a.stat LIKE "F%")
Here is the output of EXPLAIN
+----+-------------+-------+------+-------------------------------------------------------------------------------------------+------------------------------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+-------------------------------------------------------------------------------------------+------------------------------+---------+------+----------+-------------+
| 1 | SIMPLE | b | ALL | NULL | NULL | NULL | NULL | 28578039 | |
| 1 | SIMPLE | a | ref | index_on_plane,index_on_table2_id_id,mysql_confirmstat_on_stat | index_on_plane | 258 | func| 2 | Using where |
+----+-------------+-------+------+-------------------------------------------------------------------------------------------+------------------------------+---------+------+----------+-------------+
The query takes around 80 minutes to execute.
The indexes on table1 are as follows
+--------------+------------+--------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------------+------------+--------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| table1 | 0 | PRIMARY | 1 | id | A | 50319117 | NULL | NULL | | BTREE | | |
| table1 | 1 | index_on_post | 1 | post | A | 7188445 | NULL | NULL | YES | BTREE | | |
| table1 | 1 | index_on_plane | 1 | plane | A | 25159558 | NULL | NULL | YES | BTREE | | |
| table1 | 1 | index_on_table2_id | 1 | table2_id | A | 25159558 | NULL | NULL | YES | BTREE | | |
| table1 | 1 | index_on_stat | 1 | stat | A | 187 | NULL | NULL | YES | BTREE | | |
+--------------+------------+--------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
The indexes on table2 are as follows
+-------+------------+---------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------+------------+---------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| table2 | 0 | PRIMARY | 1 | id | A | 28578039 | NULL | NULL | | BTREE | | |
| table2 | 1 | index_on_post | 1 | post | A | 28578039 | NULL | NULL | YES | BTREE | | |
| table2 | 1 | index_on_ver | 1 | ver | A | 1371 | NULL | NULL | YES | BTREE | | |
| table2 | 1 | index_on_imb | 1 | imb | A | 28578039 | NULL | NULL | YES | BTREE | | |
+-------+------------+---------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
How can the execution time of this query be improved?
Here is the updated explain
EXPLAIN SELECT STRAIGHT_JOIN a.id, b.id from table1 AS a JOIN table2 AS b
ON a.plane=substring(b.imb,1,20)
WHERE a.table2_id IS NULL
AND (a.stat LIKE "f%" OR a.stat LIKE "F%");
+----+-------------+-------+------+-------------------------------------------------------------------------------------------+-------------------------------+---------+-------+----------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+-------------------------------------------------------------------------------------------+-------------------------------+---------+-------+----------+--------------------------------+
| 1 | SIMPLE | a | ref | index_on_plane,index_on_table2_id,index_on_stat | index_on_table2_id | 5 | const | 500543 | Using where |
| 1 | SIMPLE | b | ALL | NULL | NULL | NULL | NULL | 28578039 | Using where; Using join buffer |
+----+-------------+-------+------+-------------------------------------------------------------------------------------------+-------------------------------+---------+-------+----------+--------------------------------+
SQL fiddle link http://www.sqlfiddle.com/#!2/362a6/4
Your schema dooms your query to slowness, in at least three ways. You are going to need to modify your schema to get anything like decent performance out of this. I can see three ways to fix your schema.
First way (probably very easy to fix):
a.stat LIKE "f%" OR a.stat LIKE "F%"
This OR operation likely doubles the runtime of your query. But if you set the collation of your stat column to something case-insensitive you can change this to
a.stat LIKE "f%"
You already have an index on this column.
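Second way (maybe not so hard to fix). This clause can get in the way of index use. MySQL is able to use an index for an IS NULL test (the updated EXPLAIN in the question does so via index_on_table2_id), so the gain from this change may be small.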
WHERE a.table2_id IS NULL
Can you change the definition of table2_id to NOT NULL and provide a default value (perhaps zero) to indicate missing data? If so, you'll be in good shape because you'll be able to change this to a search predicate that uses an index.
WHERE a.table2_id = 0
Third way (probably hard). The presence of the function in this clause defeats the use of an index in joining.
WHERE ... a.plane = SUBSTRING(b.imb, 1, 20)
You need to make an extra column (yeah, yeah, in Oracle it could be a function index, but who has that kind of money?) called b.plane or something with that substring stored in it.
If you do all this stuff and refactor your query just a bit, here's what it will look like:
SELECT a.id AS aid,
b.id AS bid
FROM table1 AS a
JOIN table2 AS b ON a.plane = b.plane /* the new column */
WHERE a.stat LIKE 'f%'
AND a.table2_id = 0
Finally, you can probably tweak this performance up a bit by creating the following compound indexes as covering indexes for the query. Look up covering indexes if you're not sure what that means.
table1 (table2_id, stat, plane, id)
table2 (plane, id) /* plane is your new column */
There's a tradeoff in covering indexes: they slow down insertion and update operations, but speed up queries. Only you have enough information to make that tradeoff wisely.
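The schema changes above could look roughly like this. It is a sketch under assumptions: the column type VARCHAR(20) and all index names are made up, and the new column should use the same character set and collation as table1.plane so the join can use the index:

```sql
-- Assumption: 20 characters is enough, since the join uses SUBSTRING(imb, 1, 20).
ALTER TABLE table2 ADD COLUMN plane VARCHAR(20);

-- One-off back-fill; on a table of roughly 28 million rows this may take a while.
UPDATE table2 SET plane = SUBSTRING(imb, 1, 20);

-- Compound (covering) indexes listed above; index names are hypothetical.
ALTER TABLE table2 ADD INDEX idx_plane_id (plane, id);
ALTER TABLE table1 ADD INDEX idx_t2id_stat_plane_id (table2_id, stat, plane, id);
```

The new column has to be kept in sync whenever imb is inserted or updated, for example in application code or a trigger. On MySQL 5.7 and later, a generated column is an alternative that the server maintains itself.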
The column on which the join is performed must be indexed, and the MySQL optimiser should use that index for better performance. This minimises the number of rows examined (the join size).
Try this one:
SELECT STRAIGHT_JOIN a.id, b.id from table1 AS a JOIN table2 AS b ON a.plane=substring(b.imb,1,20)
WHERE a.table2_id IS NULL and (a.stat LIKE "f%" OR a.stat LIKE "F%")
Check the execution plan first. If it is still not using the index_on_imb index, create a composite index on table2.imb and table2.id, with table2.imb as the leading column.
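A sketch of that composite index and the follow-up check (the index name is hypothetical). One caveat, as far as I know: because the join applies SUBSTRING() to b.imb, MySQL cannot look rows up by value in this index and can at best scan it as a covering index:

```sql
-- Hypothetical index name; imb is the leading column, as suggested above.
ALTER TABLE table2 ADD INDEX idx_imb_id (imb, id);

-- Re-run EXPLAIN to see whether the plan for table2 (alias b) changed.
EXPLAIN
SELECT STRAIGHT_JOIN a.id, b.id
FROM table1 AS a
JOIN table2 AS b ON a.plane = SUBSTRING(b.imb, 1, 20)
WHERE a.table2_id IS NULL
  AND (a.stat LIKE 'f%' OR a.stat LIKE 'F%');
```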
A derived table may improve performance in this case, depending on the indexes index_on_table2_id and index_on_stat.
SELECT a.id, b.id from table1 AS a, table2 AS b
WHERE a.table2_id IS NULL
AND a.plane = SUBSTRING(b.imb, 1, 20)
AND (a.stat LIKE "f%" OR a.stat LIKE "F%")
It may be rewritten as follows.
The derived table forces MySQL to check the 500543 rows that the last EXPLAIN reported:
SELECT a.id, b.id
FROM (SELECT id, plane FROM table1 WHERE (table2_id IS NULL) AND (stat LIKE "f%" OR stat LIKE "F%")) a
INNER JOIN table2 b
ON a.plane = SUBSTRING(b.imb, 1, 20)
Aside from my comment about ID columns, it appears you are trying to back-fill a join on "plane" instead of on the ID columns. If I am correct, you want all records from table2 where there is no match in table1:
select
a.id,
b.id
from
table2 b
left join table1 a
on b.id = a.table2_id
AND substr( b.imb, 1, 20 ) = a.plane
AND ( a.stat LIKE "f%"
OR a.stat LIKE "F%")
where
a.table2_id is null
Also, to help the index join, I would have covering indexes so the engine does not have to go back to the raw data to get qualifying records.
table1 -- index ( plane, stat, table2_id, id )
table2 -- index ( imb, id )
But again, please clarify the basis of the join between the tables, and whether or not it is based on a key. Since table1 in your sample has a column table2_id, I am GUESSING it relates to table2.id.
A left join basically says: for each record in the left-side table (table2 in my example), join to the right-side table (table1) on whatever criteria/conditions you give, here using the key ID column as the primary basis, then the plane and status settings.
So, even though I'm doing a join between the two tables on the table2_id, if it DOES find a match, it will be excluded... Only when it does NOT find a match will it be included.
Finally, since you are hiding the true basis of the tables, you are leaving those helping you to guess. Even if it were "personal" data, you are not showing any data, only asking how to get it. A better mental image of what you are looking for beats bogus table/column names with limited context.
The following query operates on two tables: dev_Profile and dev_User.
SELECT
dev_Profile.ID AS pid,
Name AS username,
st1.online
FROM
dev_Profile
LEFT JOIN (
SELECT
dev_User.ID,
lastActivityTime /* DATETIME */
FROM
dev_User)
AS st1 ON st1.ID = dev_Profile.UserID;
There are about 11K rows in each table and this query takes close to 6 seconds to complete. I don't have a lot of experience with databases yet. I thought creating an index for dev_Profile.UserID would do the trick, since dev_Profile.ID already has an index (it's the PK) and dev_Profile.UserID didn't have an index, but this didn't help at all.
EDIT: The EXPLAIN output for this query:
+----+-------------+-------------+------+---------------+------+---------+------+-------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+---------------+------+---------+------+-------+-------+
| 1 | PRIMARY | dev_Profile | ALL | NULL | NULL | NULL | NULL | 11521 | |
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 11191 | |
| 2 | DERIVED | dev_User | ALL | NULL | NULL | NULL | NULL | 11440 | |
+----+-------------+-------------+------+---------------+------+---------+------+-------+-------+
Any suggestions?
Why the nested select? That might be confusing the optimizer. Try eliminating it:
SELECT
dev_Profile.ID AS pid,
Name AS username,
st1.online
FROM
dev_Profile
LEFT JOIN dev_User st1 ON st1.ID = dev_Profile.UserID;