So I have two tables in MySQL, articles and articles_rubrics, both with ~20,000 rows.
articles has multiple columns, but its article_id is indexed.
articles_rubrics has only two columns, article_id and rubric_id; both are indexed separately, and on top of that there is a composite index over the two.
My issue is that when I select data from these tables with a join, the table order is extremely important, which is a problem for me, and I don't understand the reason for it:
SELECT article_id,rubric_id FROM articles
LEFT JOIN articles_rubrics USING(article_id)
WHERE rubric_id=1
ORDER BY article_id DESC
LIMIT 10;
and explain says (for articles_rubrics) this:
time: 0.312 s
key_len: 1
ref: const
rows: 7352
extra: Using where; Using temporary; Using filesort
But when I switch the order of it:
SELECT article_id,rubric_id FROM articles_rubrics
LEFT JOIN articles USING(article_id)
WHERE rubric_id=1
ORDER BY article_id DESC
LIMIT 10;
and explain says (for articles_rubrics) this:
time: 0.001 s
key_len: 9
ref: NULL
rows: 28
extra: Using where; Using index
So with the same two tables, merely switching the join order makes the query run ~300 times slower or faster. How is that even possible?
PS: I've heavily simplified my real-world problem for this example, but I stumbled upon this because my
SELECT * FROM articles [LEFT JOIN for 5 other tables]
was taking 1.5 s, and when I added yet another join to the mix, the execution time changed to about 0.006 s.
Show index:
show index from articles;
Table Non_unique Key_name Seq_in_index Column_name Collation Cardinality Sub_part Packed Null Index_type Comment Index_comment
articles 0 PRIMARY 1 article_id A 20043 NULL NULL BTREE
articles 1 article_url_title 1 article_url_title A 10021 NULL NULL BTREE
articles 1 FULLTEXT 1 article_title NULL 1 NULL NULL FULLTEXT
articles 1 FULLTEXT 2 article_content NULL 1 NULL NULL FULLTEXT
show index from articles_rubrics;
Table Non_unique Key_name Seq_in_index Column_name Collation Cardinality Sub_part Packed Null Index_type
articles_rubrics 0 PRIMARY 1 article_id A NULL NULL NULL BTREE
articles_rubrics 0 PRIMARY 2 rubric_id A 20814 NULL NULL BTREE
articles_rubrics 1 rubric_id 1 rubric_id A 17 NULL NULL BTREE
articles_rubrics 1 article_id 1 article_id A 20814 NULL NULL BTREE
SELECT article_id,rubric_id
FROM articles
LEFT JOIN articles_rubrics USING(article_id)
WHERE rubric_id=1 <<<<<<<<<<<<<<<<<<<<<<<<<<< problem here
ORDER BY article_id DESC
LIMIT 10;
By insisting that every row returned from this query has rubric_id = 1, you have eliminated any row where there is no match between the two tables, and therefore there is NO POINT in using a LEFT JOIN:
SELECT a.article_id, ar.rubric_id
FROM articles AS a
INNER JOIN articles_rubrics AS ar ON a.article_id = ar.article_id
WHERE ar.rubric_id = 1
ORDER BY a.article_id DESC
LIMIT 10;
You need to use the table name or a table alias in EVERY column reference.
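The LEFT JOIN vs. INNER JOIN equivalence above can be seen in a tiny sandbox. This is a minimal sketch in Python with SQLite (hypothetical data; SQLite's planner differs from MySQL's, but the NULL logic is the same): a WHERE filter on the right table's column discards exactly the unmatched rows the LEFT JOIN preserved, so the result is identical to an INNER JOIN.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE articles (article_id INTEGER PRIMARY KEY);
    CREATE TABLE articles_rubrics (article_id INTEGER, rubric_id INTEGER,
                                   PRIMARY KEY (article_id, rubric_id));
    INSERT INTO articles VALUES (1), (2), (3);
    INSERT INTO articles_rubrics VALUES (1, 1), (2, 2);  -- article 3 has no rubric
""")

# LEFT JOIN, but the WHERE on the right table's column rejects the
# NULL that an unmatched article 3 would carry in rubric_id.
left_filtered = con.execute("""
    SELECT article_id FROM articles a
    LEFT JOIN articles_rubrics ar USING (article_id)
    WHERE ar.rubric_id = 1
""").fetchall()

# Plain INNER JOIN with the same filter.
inner = con.execute("""
    SELECT article_id FROM articles a
    JOIN articles_rubrics ar USING (article_id)
    WHERE ar.rubric_id = 1
""").fetchall()

print(left_filtered == inner)  # True — same rows either way
```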
Join operations on a database are expensive. It can be better to use simple nested SELECTs instead: make a list to store the data, then use the items in that list for the next queries.
These two queries run equally fast; the only difference from the originals is that both select article_id from articles_rubrics.
-- SELECT article_id,rubric_id FROM articles -- would be slow here
SELECT ar.article_id,ar.rubric_id FROM articles
JOIN articles_rubrics ar USING(article_id)
WHERE rubric_id=1
ORDER BY article_id DESC
LIMIT 10;
SELECT ar.article_id,ar.rubric_id FROM articles_rubrics ar
JOIN articles USING(article_id)
WHERE rubric_id=1
ORDER BY article_id DESC
LIMIT 10;
If I force the SQL server to take the columns from the articles_rubrics table, it correctly decides that the articles table isn't actually needed for the lookup. The server, however, won't do that automatically, even though article_id is used as a key.
I still don't fully understand why this happens (or how the optimization algorithm actually works), because in both cases WHERE rubric_id = 1 filters the articles_rubrics table, in both cases the selected columns are already there, and in both cases the join against articles is still executed for the existence check.
However, for some reason, in the first example the server decides to load all the articles first and only then check each one for the rubric_id.
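The "Using index" difference observed above can be reproduced in miniature. A sketch in Python with SQLite (the table and index names are made up; SQLite reports the analogous condition as "COVERING INDEX" where MySQL's EXPLAIN says "Using index"): when every selected and filtered column lives in one index, the engine can answer from the index alone without touching table rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE articles_rubrics (article_id INTEGER, rubric_id INTEGER);
    CREATE INDEX idx_rubric_article ON articles_rubrics (rubric_id, article_id);
""")

# rubric_id (the filter) and article_id (the output and sort key) are both
# in the composite index, so no table lookup is needed at all.
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT article_id FROM articles_rubrics
    WHERE rubric_id = 1
    ORDER BY article_id DESC
""").fetchall()
for row in plan:
    print(row[-1])  # mentions a COVERING INDEX search
```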
Related
The following query takes MySQL almost 7 times longer to execute than the same logic implemented as two separate queries that avoid the OR in the WHERE clause. I prefer using a single query, as I can sort and group everything.
Here is the problematic query:
EXPLAIN SELECT *
FROM `posts`
LEFT JOIN `teams_users`
ON (teams_users.team_id=posts.team_id
AND teams_users.user_id='7135')
WHERE (teams_users.status='1'
OR posts.user_id='7135');
Result:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE posts ALL user_id NULL NULL NULL 169642
1 SIMPLE teams_users eq_ref PRIMARY PRIMARY 8 posts.team_id,const 1 Using where
Now if I run the following two queries instead, the aggregate execution time is, as said, 7 times shorter:
EXPLAIN SELECT *
FROM `posts`
LEFT JOIN `teams_users`
ON (teams_users.team_id=posts.team_id
AND teams_users.user_id='7135')
WHERE (teams_users.status='1');
Result:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE teams_users ref PRIMARY,status status 1 const 5822 Using where
1 SIMPLE posts ref team_id team_id 5 teams_users.team_id 9 Using where
and:
EXPLAIN SELECT *
FROM `posts`
LEFT JOIN `teams_users`
ON (teams_users.team_id=posts.team_id
AND teams_users.user_id='7135')
WHERE (posts.user_id='7135');
Result:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE posts ref user_id user_id 4 const 142
1 SIMPLE teams_users eq_ref PRIMARY PRIMARY 8 posts.team_id,const 1
Obviously the number of scanned rows is much lower for the two separate queries.
Why is the initial query slow?
Thanks.
Yes, OR is frequently a performance-killer. A common work-around is to do UNION. For your example:
SELECT *
FROM `posts`
LEFT JOIN `teams_users`
ON (teams_users.team_id=posts.team_id
AND teams_users.user_id='7135')
WHERE (teams_users.status='1')
UNION DISTINCT
SELECT *
FROM `posts`
LEFT JOIN `teams_users`
ON (teams_users.team_id=posts.team_id
AND teams_users.user_id='7135')
WHERE (posts.user_id='7135');
If you are sure there are not dups, change to the faster UNION ALL.
If you are not fishing for missing team_users rows, use JOIN instead of LEFT JOIN.
If you need ORDER BY, add some parens:
( SELECT ... )
UNION ...
( SELECT ... )
ORDER BY ...
Otherwise, the ORDER BY would apply only to the second SELECT. (If you also need pagination, see my blog.)
Please note that you might also need LIMIT in certain circumstances.
The queries without the OR clause are both sargable. That is, they both can be satisfied using indexes.
The query with the OR would be sargable if the MySQL query planner contained logic to rewrite it as the UNION ALL of two queries. But the MySQL query planner doesn't (yet) have that kind of logic.
So, it does table scans to get the result set. Those are often very slow.
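A minimal sketch of that UNION rewrite in Python with SQLite (toy data, simplified to selecting post_id only; the real query would keep the LEFT JOIN in both branches so the teams_users columns survive): each branch is independently indexable, and UNION removes the duplicates a row matching both conditions would otherwise contribute.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE posts (post_id INTEGER PRIMARY KEY, team_id INTEGER, user_id INTEGER);
    CREATE TABLE teams_users (team_id INTEGER, user_id INTEGER, status INTEGER,
                              PRIMARY KEY (team_id, user_id));
    INSERT INTO posts VALUES (1, 10, 7135), (2, 20, 42), (3, 30, 99);
    INSERT INTO teams_users VALUES (20, 7135, 1), (30, 7135, 0);
""")

# Original shape: OR across columns of two different tables.
or_query = """
    SELECT p.post_id FROM posts p
    LEFT JOIN teams_users tu
      ON tu.team_id = p.team_id AND tu.user_id = 7135
    WHERE tu.status = 1 OR p.user_id = 7135
"""

# Rewrite: one sargable branch per OR arm, combined with UNION.
union_query = """
    SELECT p.post_id FROM posts p
    JOIN teams_users tu ON tu.team_id = p.team_id AND tu.user_id = 7135
    WHERE tu.status = 1
    UNION
    SELECT post_id FROM posts WHERE user_id = 7135
"""

a = sorted(con.execute(or_query).fetchall())
b = sorted(con.execute(union_query).fetchall())
print(a == b)  # True — same result set
```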
Our server company advised we switch to Percona when setting up our new DB servers, so we're currently on Percona Server (GPL), Release 82.0 (version 5.6.36-82.0), Revision 58e846a, and there's one behavior I'm trying to wrap my head around that we definitely weren't experiencing before with stock MySQL, so I thought I'd reach out.
This is a query we perform fairly regularly to pull an article from our DB:
SELECT * FROM table_a a, table_b b
WHERE a.id = b.id AND a.status_field = 'open' AND b.filter_field = 'no_filter' AND b.view_field = 'article'
ORDER BY a.unixtimestamp DESC LIMIT 1
This used to complete very quickly, but under Percona the combination of the WHERE conditions on table_b and the ordering on table_a makes the whole query take ~3 s. I don't fully understand this behaviour.
If I alter it to:
SELECT * FROM table_a a, table_b b
WHERE a.id = b.id AND a.status_field = 'open' AND b.filter_field = 'no_filter' AND b.view_field = 'article'
ORDER BY b.unixtimestamp DESC LIMIT 1
Then it completes very quickly (< 0.05s)
Is this sort of behavior expected with Percona?
I just wanted to know before changing any db structure to compensate.
Edit:
For the explain, I simplified the query and it still has the same issue (id = entry_id):
Slow Query (1.5122389793):
SELECT * FROM table_a a, table_b b
WHERE a.id = b.id AND b.special_filter = 'no_filter'
ORDER BY a.id DESC LIMIT 1
Slow Query explain:
1 SIMPLE table_b ref PRIMARY,entry_id,special_filter special_filter 26 const 130733 Using where; Using temporary; Using filesort
1 SIMPLE table_a eq_ref PRIMARY PRIMARY 4 db_name.table_b.entry_id 1 Using index
Fast Query (0.0006549358):
SELECT * FROM table_a a, table_b b
WHERE a.id = b.id AND b.special_filter = 'no_filter'
ORDER BY b.id DESC LIMIT 1
Fast Query explain:
1 SIMPLE table_b ref PRIMARY,entry_id,special_filter special_filter 26 const 130733 Using where
1 SIMPLE table_a eq_ref PRIMARY PRIMARY 4 db_name.table_b.entry_id 1 Using index
I've tried to omit as much info from the tables as possible for security reasons, but if I'm grossly missing something I can add it back in.
table_a:
Relevant Keys:
table_a 0 PRIMARY 1 entry_id A 321147 BTREE
Create table:
table_a CREATE TABLE table_a ( entry_id int(10) unsigned NOT NULL AUTO_INCREMENT) ENGINE=InnoDB AUTO_INCREMENT=356198 DEFAULT CHARSET=utf8 DELAY_KEY_WRITE=1
table_b:
Relevant Keys:
table_b 0 PRIMARY 1 entry_id A 261467 BTREE
table_b 1 entry_id 1 entry_id A 261467 BTREE
table_b 1 special_filter 1 special_filter A 14 8 BTREE
Create Table:
table_b CREATE TABLE table_b ( entry_id int(10) unsigned NOT NULL DEFAULT '0', special_filter text NOT NULL, ) ENGINE=InnoDB DEFAULT CHARSET=utf8 DELAY_KEY_WRITE=1
Both tables have ~ 350k rows
It seems to be a MySQL optimizer issue where it chooses not to join on the primary key under some conditions. Closely related to, or the same as, this issue: https://dba.stackexchange.com/questions/53274/mysql-innodb-issue-not-using-indexes-correctly-percona-5-6
Explicitly writing the query with STRAIGHT_JOIN and the tables in a specific order solves the issue. Writing USE INDEX (PRIMARY) after the JOIN keyword is easier, though, and doesn't rely on the table order.
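SQLite has no STRAIGHT_JOIN, but its CROSS JOIN keyword is documented as an order-forcing hint, so the idea can be sketched from Python (table names borrowed from the question, columns made up): the left operand of CROSS JOIN always becomes the outer loop of the join.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table_a (id INTEGER PRIMARY KEY, status_field TEXT);
    CREATE TABLE table_b (id INTEGER PRIMARY KEY, special_filter TEXT);
""")

# CROSS JOIN in SQLite prevents the planner from reordering the tables,
# loosely analogous to MySQL's STRAIGHT_JOIN: table_b is forced to be
# the outer table here.
forced = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM table_b CROSS JOIN table_a ON table_a.id = table_b.id
""").fetchall()
print(forced[0][-1])  # the outer loop is over table_b
```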
I've got a composite key table CUSTOMER_PRODUCT_XREF
__________________________________________________________________
|CUSTOMER_ID (PK NN VARCHAR(191)) | PRODUCT_ID(PK NN VARCHAR(191))|
-------------------------------------------------------------------
In my batch program I need to select 500 updated customers, get each customer's purchased PRODUCT_IDs as a comma-separated list, and update our SOLR index. In my query I select 500 customers and do a left join to CUSTOMER_PRODUCT_XREF:
SELECT
customer.*, group_concat(xref.PRODUCT_ID separator ', ')
FROM
CUSTOMER customer
LEFT JOIN CUSTOMER_PRODUCT_XREF xref ON customer.CUSTOMER_ID=xref.CUSTOMER_ID
group by customer.CUSTOMER_ID
LIMIT 500;
EDIT: EXPLAIN QUERY
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE customer ALL PRIMARY NULL NULL NULL 74236 Using where; Using temporary; Using filesort
1 SIMPLE xref index NULL PRIMARY 1532 NULL 121627 Using where; Using index; Using join buffer (Block Nested Loop)
I got a lost-connection exception after the above query had been running for 20 minutes.
I tried the following (correlated subquery) instead, and it took 1.7 seconds to get the result, which is still slow.
SELECT
customer.*, (SELECT group_concat(PRODUCT_ID separator ', ')
FROM CUSTOMER_PRODUCT_XREF xref
WHERE customer.CUSTOMER_ID=xref.CUSTOMER_ID
GROUP BY customer.CUSTOMER_ID)
FROM
CUSTOMER customer
LIMIT 500;
EDIT: EXPLAIN QUERY produces
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY customer ALL NULL NULL NULL NULL 74236 NULL
2 DEPENDENT SUBQUERY xref index NULL PRIMARY 1532 NULL 121627 Using where; Using index; Using temporary; Using filesort
Question
CUSTOMER_PRODUCT_XREF already has both columns in the PRIMARY KEY and set NOT NULL, so why is my query still very slow? I thought having a primary key on a column was enough to build an index for it. Do I need further indexing?
DATABASE INFO:
All the IDs in my database are VARCHAR(191) because the IDs can contain letters.
I'm using the utf8mb4_unicode_ci collation.
I'm using SET group_concat_max_len := @@max_allowed_packet so GROUP_CONCAT can return the full list of product IDs for each customer. I prefer using GROUP_CONCAT in one main query so that I don't have to run separate queries to get the products for each customer.
Your original version of the query does the join first and then sorts all the resulting data, which is probably pretty big given how large the fields are.
You can "fix" that version by selecting the 500 customers first and then doing the join:
SELECT c.*, group_concat(xref.PRODUCT_ID separator ', ')
FROM (select *
      from CUSTOMER
      order by CUSTOMER_ID
      limit 500
     ) c LEFT JOIN
     CUSTOMER_PRODUCT_XREF xref
     ON c.CUSTOMER_ID = xref.CUSTOMER_ID
group by c.CUSTOMER_ID;
An alternative that might or might not have a big impact would be to do the aggregation by customer in a subquery and join to that, as in:
SELECT c.*, xref.products
FROM (select *
      from CUSTOMER
      order by CUSTOMER_ID
      limit 500
     ) c LEFT JOIN
     (select CUSTOMER_ID, group_concat(PRODUCT_ID separator ', ') as products
      from CUSTOMER_PRODUCT_XREF
      group by CUSTOMER_ID
     ) xref
     ON c.CUSTOMER_ID = xref.CUSTOMER_ID;
What you have discovered is that the MySQL optimizer does not recognize this situation (where the LIMIT has a big impact on performance). Some other database engines do a better job of optimizing this case.
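The limit-first pattern can be sandboxed in Python with SQLite (tiny hypothetical data, LIMIT 2 standing in for LIMIT 500; note SQLite's group_concat takes the separator as a second argument rather than MySQL's SEPARATOR keyword):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE CUSTOMER (CUSTOMER_ID TEXT PRIMARY KEY, NAME TEXT);
    CREATE TABLE CUSTOMER_PRODUCT_XREF (CUSTOMER_ID TEXT, PRODUCT_ID TEXT,
                                        PRIMARY KEY (CUSTOMER_ID, PRODUCT_ID));
    INSERT INTO CUSTOMER VALUES ('c1','A'), ('c2','B'), ('c3','C');
    INSERT INTO CUSTOMER_PRODUCT_XREF VALUES
        ('c1','p1'), ('c1','p2'), ('c2','p3');
""")

# Restrict to the first 2 customers in a derived table, THEN join and
# aggregate — the join only ever sees the limited set.
rows = con.execute("""
    SELECT c.CUSTOMER_ID, group_concat(x.PRODUCT_ID, ', ') AS products
    FROM (SELECT * FROM CUSTOMER ORDER BY CUSTOMER_ID LIMIT 2) AS c
    LEFT JOIN CUSTOMER_PRODUCT_XREF x ON c.CUSTOMER_ID = x.CUSTOMER_ID
    GROUP BY c.CUSTOMER_ID
    ORDER BY c.CUSTOMER_ID
""").fetchall()
print(rows)  # only c1 and c2, each with its product list
```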
Alright, the speed of the queries in my question shot up when I created an index on just CUSTOMER_ID in the CUSTOMER_PRODUCT_XREF table.
So I've got two indexes now
PRIMARY_KEY_INDEX on PRODUCT_ID and CUSTOMER_ID
CUSTOMER_ID_INDEX on CUSTOMER_ID
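That a separate CUSTOMER_ID index helped suggests the composite primary key did not lead with CUSTOMER_ID. A sketch of the column-order effect in Python with SQLite, assuming a (PRODUCT_ID, CUSTOMER_ID) key order: a B-tree index can only be seeked by its leading column(s), so a lookup by the second column alone degenerates to a scan until a dedicated index exists.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE CUSTOMER_PRODUCT_XREF (
        CUSTOMER_ID TEXT NOT NULL,
        PRODUCT_ID  TEXT NOT NULL,
        PRIMARY KEY (PRODUCT_ID, CUSTOMER_ID)  -- assumed key order: PRODUCT_ID first
    )
""")

def plan(sql):
    # Concatenate the detail column of every plan row.
    return " ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT PRODUCT_ID FROM CUSTOMER_PRODUCT_XREF WHERE CUSTOMER_ID = 'c1'"
before = plan(q)   # scan: the PK index leads with PRODUCT_ID, so it cannot seek

con.execute("CREATE INDEX idx_customer ON CUSTOMER_PRODUCT_XREF (CUSTOMER_ID)")
after = plan(q)    # search via the dedicated CUSTOMER_ID index

print(before)
print(after)
```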
I hold a set of nodes in one MySQL table (table1) and a table of edges in another (table2). Nodes have primary keys, and edges use those as "foreign keys":
**table1**
id label
1 node1
2 node2
3 node3
**table2**
FK_first FK_sec rel
1 3 guardian
2 1 guardian
1 3 times
I know the DB design is not perfect, but it's simple...
Now I want the number of 'rel' rows for every node, so I run a query like:
SELECT
label,
COUNT( rel ) as freq
FROM
`table1`
LEFT JOIN table2 ON (id=FK_first OR id=FK_second)
GROUP BY label
ORDER BY freq DESC
I have about 1000 nodes and 2000 edges. With ON (id=FK_first) alone the query is fast (<1 sec), but with ON (id=FK_first OR id=FK_second) it needs about 6 sec, which is very slow.
I would appreciate some comments to speed this up a bit :-)
LEFT JOIN table2 ON (id=FK_first OR id=FK_second) ~6 sec
LEFT JOIN table2 ON (id=FK_first) ~0.16 sec
LEFT JOIN table2 ON (id=FK_second) ~0.16 sec
LEFT JOIN table2 ON id IN (FK_first,FK_second) ~6 sec
EXPLAIN 1:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE table1 ALL NULL NULL NULL NULL 2571 Using temporary; Using filesort
1 SIMPLE table2 ALL FK_first,FK_second,FK_first_2 NULL NULL NULL 3858
EXPLAIN 2:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE table1 index NULL PRIMARY 2 NULL 2571 Using index; Using temporary; Using filesort
1 SIMPLE table2 ref FK_first,FK_first_2 FK_first_2 4 table1.id 1
Try doing two joins and moving the "OR" into the COUNT() function:
For every row, this joins table2 once on FK_first, then again on FK_second (if it is not already joined to that row via FK_first). Then in the COUNT, we count only the rows where at least one of the two joins produced a non-null rel column:
SELECT
    label,
    COUNT( COALESCE(table2A.rel, table2B.rel) ) as freq  -- non-null when either join matched
FROM
    `table1`
LEFT JOIN
    table2 as table2A
    ON id=table2A.FK_first
LEFT JOIN
    table2 as table2B
    ON id=table2B.FK_second
    AND table2A.FK_first != table2B.FK_first
GROUP BY label
ORDER BY freq DESC
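One caveat: when a node matches rows through both FK_first and FK_second, two chained LEFT JOINs multiply those matches instead of adding them, which can skew the counts. An alternative sketch in Python with SQLite (data copied from the question) unpivots the edge table with UNION ALL, so each edge contributes exactly one countable row per endpoint, and each branch remains indexable on its own FK column:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE table2 (FK_first INTEGER, FK_sec INTEGER, rel TEXT);
    INSERT INTO table1 VALUES (1,'node1'), (2,'node2'), (3,'node3');
    INSERT INTO table2 VALUES (1,3,'guardian'), (2,1,'guardian'), (1,3,'times');
""")

# Unpivot edges into (node id, rel) pairs — one row per edge endpoint —
# then LEFT JOIN once and count. No OR, no double-join cross products.
rows = con.execute("""
    SELECT label, COUNT(e.rel) AS freq
    FROM table1
    LEFT JOIN (SELECT FK_first AS nid, rel FROM table2
               UNION ALL
               SELECT FK_sec, rel FROM table2) AS e ON id = e.nid
    GROUP BY label
    ORDER BY freq DESC
""").fetchall()
print(rows)  # [('node1', 3), ('node3', 2), ('node2', 1)]
```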
I have a many-to-many query that I'd like to optimize.
What indexes should I create for it?
SELECT (SELECT COUNT(post_id)
FROM posts
WHERE post_status = 1) as total,
p.*,
GROUP_CONCAT(t.tag_name) tagged
FROM tags_relation tr
JOIN posts p ON p.post_id = tr.rel_post_id
JOIN tags t ON t.tag_id = tr.rel_tag_id
WHERE p.post_status=1
GROUP BY p.post_id
EXPLAIN
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY p ALL PRIMARY NULL NULL NULL 5 Using where; Using filesort
You can take a look at the query execution plan using the EXPLAIN statement. It will show you whether a full table scan is happening or whether an index could be used to retrieve the data. From that point on you can optimize further.
Edit
Based on your query execution plan, as a first optimization step make sure your tables have primary keys defined, and add indexes on the post_status and tag_name columns.
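The before/after effect of that advice is easy to see with EXPLAIN. A minimal sketch in Python with SQLite (hypothetical posts table; SQLite's EXPLAIN QUERY PLAN plays the role of MySQL's EXPLAIN): the plan flips from a full table scan to an index search once post_status is indexed.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE posts (post_id INTEGER PRIMARY KEY, post_status INTEGER)")

def plan(sql):
    # Concatenate the detail column of every plan row.
    return " ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT post_id FROM posts WHERE post_status = 1"
before = plan(q)   # full table scan: no usable index yet

con.execute("CREATE INDEX idx_post_status ON posts (post_status)")
after = plan(q)    # index search on post_status

print(before)
print(after)
```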