Why performance difference in MySQL for where vs join clause - mysql

There is a huge execution time gap between these three queries. I don't know the underlying workings of the MySQL query engine.
Query1:
select t2.* from (
select id1, id2
from table t2
where condition
) t3, table t2
where
t2.id1 = t3.id1 AND
t2.id2 = t3.id2
takes ~180 seconds to run.
Query2:
select t2.* from (
select id1, id2
from table t2
where condition
) t3 inner join table t2
on
t2.id1 = t3.id1 AND
t2.id2 = t3.id2
takes ~180 seconds to run.
result of explain.
id|select_type|table |partitions|type|possible_keys|key |key_len|ref |rows |filtered|Extra |
--|-----------|----------|----------|----|-------------|-----------|-------|----------------------------------------------------------------------------------------|-----|--------|--------------------------------------------------|
1|PRIMARY |t2 | |ALL | | | | |95619| 19|Using where |
1|PRIMARY |t2 | |ALL | | | | |95619| 1|Using where; Using join buffer (Block Nested Loop)|
1|PRIMARY |<derived3>| |ref |<auto_key0> |<auto_key0>|216 |t2.id1,t2.id2 | 10| 100|Using index |
3|DERIVED |t3 | |ALL | | | | |95619| 100|Using temporary; Using filesort |
Query3:
select t2.* from (
select id1, id2
from table t2
where condition
) t3 left join table t2
on
t2.id1 = t3.id1 AND
t2.id2 = t3.id2
takes ~1.6 seconds to run.
result of explain.
id|select_type|table |partitions|type|possible_keys|key |key_len|ref |rows |filtered|Extra |
--|-----------|----------|----------|----|-------------|-----------|-------|----------------------------------------------------------------------------------------|-----|--------|--------------------------------------------------|
1|PRIMARY |t2 | |ALL | | | | |95619| 19|Using where |
1|PRIMARY |<derived3>| |ref |<auto_key0> |<auto_key0>|216 |t2.id1,t2.id2 | 10| 100|Using index |
1|PRIMARY |t2 | |ALL | | | | |95619| 100|Using where; Using join buffer (Block Nested Loop)|
3|DERIVED |t3 | |ALL | | | | |95619| 100|Using temporary; Using filesort |
UPDATE: I have edited the question with additional query.
UPDATE2 Added Explain result for query.
EDIT : Some stats about table.
The subquery returns 83 rows. Table t2 has ~97k rows.

I think what's going on is this:
The first two queries are completely equivalent -- MySQL doesn't distinguish between INNER JOIN and cross product when the conditions relating the two tables are the same; it doesn't matter whether the conditions are in ON or WHERE. So the relevant difference is just between inner join and outer join.
When performing any kind of join, MySQL has to decide which table to treat as the primary table and which is the dependent table. It will scan the primary table and then find the matching rows in the dependent table to produce the result set.
An outer join forces a particular ordering to this; in a LEFT JOIN, the left table is always the primary, because the result has to include at least one row for each row in that table. So after generating an intermediate table for t3, it just has to scan those 83 rows and find the corresponding rows in t2 that match the joining condition. The columns being matched are presumably indexed, so this is very fast.
But with an inner join, it can go either way. The query optimizer will try to estimate which table is smaller, and use that as the primary that it scans. But when it's the result of a subquery, it doesn't know how many rows it will return. So it's apparently choosing to use t2 as the primary, rather than the intermediate t3 table. This means it's scanning 97K rows, testing each of them to find the matching rows in t3.
A smart optimizer would notice that the subquery is simply filtering the same table, so it must return fewer rows. Few people would claim that MySQL has one of the better query planners. I suspect it's choosing to use the regular table rather than the intermediate table because it can confine the scan to the index.
It's actually surprising that it only takes 180 seconds, rather than 1800 seconds (30 minutes), which is closer to the ratio between 97k and 83.
I'm not really an expert on interpreting EXPLAIN results, but I think you can see the difference on line 2; in the fast case it's table = <derived3>, rows = 10, in the slow case it's table = t2, rows = 95619.
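The effect of the join order described above can be sketched with a toy cost model. This is only an illustration, not MySQL's actual algorithm: it assumes the probed side can be looked up by key (the answer's "presumably indexed" case), and the data, sizes, and function names are made up to match the 83 and ~97k row counts from the question.

```python
def build_index(rows):
    """Hypothetical hash index: maps each (id1, id2) key to its rows."""
    index = {}
    for row in rows:
        index.setdefault(row, []).append(row)
    return index

def indexed_nested_loop(outer_rows, inner_index):
    """Scan the driving (outer) rows and probe the index of the other side."""
    probes = 0
    matches = []
    for key in outer_rows:
        probes += 1  # one index lookup per driving row
        for inner_row in inner_index.get(key, ()):
            matches.append((key, inner_row))
    return matches, probes

# Made-up data: ~97k (id1, id2) pairs, 83 of which pass the subquery filter.
big = [(i, i % 7) for i in range(97_000)]
small = big[:83]  # stand-in for the derived table t3

# LEFT JOIN order: drive from the 83-row derived table, probe the big table.
fast_matches, fast_probes = indexed_nested_loop(small, build_index(big))
# Order of the slow plan: drive from the big table, probe the derived table.
slow_matches, slow_probes = indexed_nested_loop(big, build_index(small))

print(fast_probes, slow_probes, len(fast_matches) == len(slow_matches))
```

Both orders return the same 83 matches, but the second one performs a lookup for every row of the big table. In this model the work differs by a factor of 97000 / 83, roughly 1170.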

Related

where like and order by on different tables/columns

For reference, in the following examples big_table has millions of rows and small_table has hundreds.
Here is the basic query I'm trying to run:
SELECT b.id
FROM big_table b
LEFT JOIN small_table s
ON b.small_id=s.id
WHERE s.name like 'something%'
ORDER BY b.name
LIMIT 10, 10;
This is slow, and I can understand why both indexes can't be used.
My initial idea was to split the query into parts.
This is fast:
SELECT id FROM small_table WHERE name like 'something%';
This is also fast:
SELECT id FROM big_table WHERE small_id IN (1, 2) ORDER BY name LIMIT 10, 10;
But, put together, it becomes slow:
SELECT id FROM big_table
WHERE small_id
IN (
SELECT id
FROM small_table WHERE name like 'something%'
)
ORDER BY name
LIMIT 10, 10;
Unless the subquery is re-evaluated for every row, it shouldn't be slower than executing both queries separately, right?
I'm looking for any help optimizing the initial query and understanding why the second one doesn't work.
EXPLAIN result for the last query :
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
| 1 | PRIMARY | small_table | range | PRIMARY, ix_small_name | ix_small_name | 768 | NULL | 1 | Using where; Using index; Using temporary; Using filesort |
| 1 | PRIMARY | big_table | ref | ix_join_foreign_key | ix_join_foreign_key | 9 | small_table.id | 11870 | |
Temporary solution:
SELECT id FROM big_table ignore index(ix_join_foreign_key)
WHERE small_id
IN (
SELECT id
FROM small_table ignore index(PRIMARY)
WHERE name like 'something%'
)
ORDER BY name
LIMIT 10, 10;
(result & explain is the same with an EXISTS instead of IN)
EXPLAIN output becomes:
| 1 | PRIMARY | big_table | index | NULL | ix_big_name | 768 | NULL | 20 | |
| 1 | PRIMARY | <subquery2> | eq_ref | distinct_key | distinct_key | 8 | func | 1 | |
| 2 | MATERIALIZED | small_table | range | ix_small_name | ix_small_name | 768 | NULL | 1 | Using where; Using index |
If anyone has a better solution, I'm still interested.
The problem that you are facing is that you have conditions on the small table but are trying to avoid a sort in the large table. In MySQL, I think you need to do at least a full table scan.
One step is to write the query using exists, as others have mentioned:
SELECT b.id
FROM big_table b
WHERE EXISTS (SELECT 1
FROM small_table s
WHERE s.name LIKE 'something%' AND s.id = b.small_id
)
ORDER BY b.name;
The question is: Can you trick MySQL into doing the ORDER BY using an index? One possibility is to use the appropriate index. In this case, the appropriate index is: big_table(name, small_id, id) and small_table(id, name). The ordering of the keys in the index is important. Because the first is a covering index, MySQL might read through the index in order by name, choosing the appropriate ids.
You are looking for an EXISTS or IN query. As MySQL is known to be weak on IN I'd try EXISTS in spite of liking IN better for its simplicity.
select id
from big_table b
where exists
(
select *
from small_table s
where s.id = b.small_id
and s.name like 'something%'
)
order by name
limit 10, 10;
It would be helpful to have a good index on big_table. It should first contain the small_id to find the match, then the name for the sorting. With InnoDB, the primary key is implicitly appended to every secondary index, so assuming id is the primary key it does not need to be listed (otherwise it should also be added to the index). That way you'd have an index containing all fields needed from big_table (this is called a covering index) in the desired order, so all data can be read from the index alone and the table itself doesn't have to be accessed.
create index idx_big_quick on big_table(small_id, name);
you can try this:
SELECT b.id
FROM big_table b
JOIN small_table s
ON b.small_id = s.id
WHERE s.name like 'something%'
ORDER BY b.name;
or
SELECT b.id FROM big_table b
WHERE EXISTS(SELECT 1 FROM small_table s
WHERE s.name LIKE 'something%' AND s.id = b.small_id)
ORDER BY b.name;
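NOTE: You don't seem to need LEFT JOIN. The WHERE condition on s.name discards every row that has no match in small_table, so the result is the same as with an inner join. A left outer join will almost always result in a full table scan of big_table.
PS: Make sure you have an index on big_table.small_id.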
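The point about LEFT JOIN can be checked on a tiny data set. The sketch below uses Python's sqlite3 module as a stand-in for MySQL, with made-up rows; it only shows that the two join types return the same rows once the WHERE clause filters on a column of small_table, and says nothing about MySQL's execution plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE small_table (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE big_table (id INTEGER PRIMARY KEY, small_id INTEGER, name TEXT);
    INSERT INTO small_table VALUES (1, 'something A'), (2, 'other');
    -- Row 12 has no match in small_table at all.
    INSERT INTO big_table VALUES
        (10, 1, 'x'), (11, 2, 'y'), (12, NULL, 'z'), (13, 1, 'w');
""")

query = """
    SELECT b.id
    FROM big_table b
    {join} small_table s ON b.small_id = s.id
    WHERE s.name LIKE 'something%'
    ORDER BY b.name
"""
left_rows = conn.execute(query.format(join="LEFT JOIN")).fetchall()
inner_rows = conn.execute(query.format(join="JOIN")).fetchall()

# The NULL-extended row produced by the LEFT JOIN fails the LIKE test,
# so both queries return the same ids.
print(left_rows == inner_rows)
```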
Plan A
SELECT b.id
FROM big_table b
JOIN small_table s ON b.small_id=s.id
WHERE s.name like 'something%'
ORDER BY b.name
LIMIT 10, 10;
(Note removal of LEFT.)
You need
small_table: INDEX(name, id)
big_table: INDEX(small_id), or, for 'covering': INDEX(small_id, name, id)
It will use the s index to find 'something%' and walk through. But it must find all such rows, and JOIN to b to find all such rows there. Only then can it do the ORDER BY, OFFSET, and LIMIT. There will be a filesort (which may happen in RAM).
The column order in the indexes is important.
Plan B
The other suggestion may work well; it depends on various things.
SELECT b.id
FROM big_table b
WHERE EXISTS
( SELECT *
FROM small_table s
WHERE s.name LIKE 'something%'
AND s.id = b.small_id
)
ORDER BY b.name
LIMIT 10, 10;
That needs these:
big_table: INDEX(name), or for 'covering', INDEX(name, small_id, id)
small_table: INDEX(id, name), which is 'covering'
(Caveat: If you are doing something other than SELECT b.id, my comments about covering may be wrong.)
Which is faster (A or B)? Cannot predict without understanding the frequency of 'something%' and how 'many' the many-to-1 mapping is.
Settings
If these tables are InnoDB, then be sure that innodb_buffer_pool_size is set to about 70% of available RAM.
Pagination
Your use of OFFSET implies that you are 'paging' through the data? OFFSET is an inefficient way to do it. See my blog on such, but note that only Plan B will work with it.
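One common alternative to OFFSET is to remember the last row of the previous page and continue from there. The following is a minimal sketch of that idea, not the code from the blog mentioned above. It uses Python's sqlite3 module as a stand-in for MySQL and assumes a hypothetical index on (name, id), with id added to the sort order as a tie-breaker for repeated names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO big_table VALUES (?, ?)",
    # Made-up data; names repeat, so id is needed to break ties.
    [(i, f"name{i // 2:03d}") for i in range(1, 41)],
)
conn.execute("CREATE INDEX ix_big_name ON big_table (name, id)")

def page_after(last_name, last_id, page_size=10):
    """Return the next page that follows the row (last_name, last_id)."""
    return conn.execute(
        """SELECT id, name FROM big_table
           WHERE name > ? OR (name = ? AND id > ?)
           ORDER BY name, id
           LIMIT ?""",
        (last_name, last_name, last_id, page_size),
    ).fetchall()

first_page = page_after("", 0)
last_id, last_name = first_page[-1]
second_page = page_after(last_name, last_id)

# Same rows as OFFSET-based paging, without skipping over earlier rows.
offset_page = conn.execute(
    "SELECT id, name FROM big_table ORDER BY name, id LIMIT 10 OFFSET 10"
).fetchall()
print(second_page == offset_page)
```

With OFFSET the server still has to produce and discard all the skipped rows, whereas the range condition lets an index on the sort columns start at the right position.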

Rows column in Query Plan confusing

I have a MySQL query:
SELECT TE.company_id,
SUM(TE.debit- TE.credit) As summation
FROM Transactions T JOIN Transaction_E TE2
ON (T.parent_id = TE2.transaction_id)
JOIN Transaction_E TE
ON (TE.transaction_id = T.id AND TE.company_id IS NOT NULL)
JOIN Accounts A
ON (TE2.account_id=A.id AND A.deactivated_timestamp=0)
WHERE (TE.company_id IN (1,2))
AND A.user_id=2341 GROUP BY TE.company_id;
When I explain the query, the plan for it is like (in summary):
| Select type | table | type | rows |
-------------------------------------
| SIMPLE | A | ref | 2 |
| SIMPLE | TE2 | ref | 17 |
| SIMPLE | T | ref | 1 |
| SIMPLE | TE | ref | 1 |
But if I do a count(*) on the same query (instead of SUM(..) ), then it shows that there are ~40k rows for a particular company_id. What I don't understand is why the query plan shows so few rows being scanned while there is at least 40k rows being processed. What does the rows column in the query plan represent? Does it not represent the number of rows that get processed in that table? In that case it should be at most 2*17*1*1 = 34 rows?
The rows column is the optimizer's estimate of how many rows it expects to examine in each table for each lookup driven by the preceding tables; it is not a count of the rows actually processed.
It is a tool for judging how the optimizer 'sees' your query, and for helping it along when query performance is poor or can be improved.
The plan may also be built from an outdated snapshot of the table statistics, so it should not be taken at face value, especially where cardinality is concerned.
Well, first let's get rid of the computational bug:
SELECT TE.company_id,
SUM(TE.debit - TE.credit) AS summation
FROM Transaction_E TE
WHERE TE.company_id IN (1,2)
AND EXISTS
( SELECT 1
FROM Transactions T
JOIN Transaction_E TE2 ON T.parent_id = TE2.transaction_id
JOIN Accounts A ON TE2.account_id = A.id
AND A.deactivated_timestamp = 0
WHERE T.id = TE.transaction_id
AND A.user_id = 2341
)
GROUP BY TE.company_id;
Your query is probably summing up the same company multiple times before doing the GROUP BY. My variant avoids that inflation of the aggregate.
I got rid of TE.company_id IS NOT NULL because it was redundant.
See what the EXPLAIN says about this, then let's discuss your question about EXPLAIN further.
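The inflation mentioned above can be reproduced on a few rows. The sketch below uses Python's sqlite3 module as a stand-in for MySQL, with made-up data and the Accounts table left out for brevity. It compares a plain join, where each entry is repeated once per matching parent entry, with a semi-join written with EXISTS, where each entry is counted once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Transactions (id INTEGER, parent_id INTEGER);
    CREATE TABLE Transaction_E (transaction_id INTEGER, account_id INTEGER,
                                company_id INTEGER, debit INTEGER, credit INTEGER);
    -- Transaction 2 is a child of transaction 1.
    INSERT INTO Transactions VALUES (1, NULL), (2, 1);
    -- The parent has two entry rows; the child has one entry worth 100.
    INSERT INTO Transaction_E VALUES
        (1, 7, NULL, 0, 0), (1, 7, NULL, 0, 0), (2, 9, 1, 100, 0);
""")

# Plain join: the child entry is repeated once per parent entry row.
joined = conn.execute("""
    SELECT TE.company_id, SUM(TE.debit - TE.credit)
    FROM Transactions T
    JOIN Transaction_E TE2 ON T.parent_id = TE2.transaction_id
    JOIN Transaction_E TE ON TE.transaction_id = T.id
    WHERE TE.company_id IN (1, 2)
    GROUP BY TE.company_id
""").fetchall()

# Semi-join: the parent rows only decide whether the entry qualifies.
semi_joined = conn.execute("""
    SELECT TE.company_id, SUM(TE.debit - TE.credit)
    FROM Transaction_E TE
    WHERE TE.company_id IN (1, 2)
      AND EXISTS (SELECT 1
                  FROM Transactions T
                  JOIN Transaction_E TE2 ON T.parent_id = TE2.transaction_id
                  WHERE T.id = TE.transaction_id)
    GROUP BY TE.company_id
""").fetchall()

print(joined, semi_joined)
```

Here the joined version reports 200 for company 1 because the single 100 entry meets two parent rows, while the EXISTS version reports 100.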

How to resolve SQL id chain?

I have a MySQL DB with a table like this:
| id | redirect |
+-------+----------+
| 1 | NULL |
| 2 | 3 |
| 3 | NULL |
| 4 | 5 |
| 5 | 6 |
| 6 | 8 |
| 7 | NULL |
| 8 | NULL |
+-------+----------+
I need to create a query that resolves the redirects recursively,
so that I get these results:
1 1
2 3
3 3
4 8
5 8
6 8
7 7
8 8
Thanks
One approach is to get each "level" with a separate query.
To get the first level, we can test for a NULL in redirect_id column to identify a "terminating" node.
To get the second level, we can use a JOIN operation to match rows that have a redirect_id that match the id from a "terminating" row (identified previously).
The third level follows the same pattern, adding another JOIN operation to return rows that redirect to a level two row.
And so on.
For example:
SELECT t1.id AS start_id
, t1.id AS terminate_id
FROM mytable t1
WHERE t1.redirect_id IS NULL
UNION ALL
SELECT t2.id
, t1.id
FROM mytable t1
JOIN mytable t2 ON t2.redirect_id = t1.id
WHERE t1.redirect_id IS NULL
UNION ALL
SELECT t3.id
, t1.id
FROM mytable t1
JOIN mytable t2 ON t2.redirect_id = t1.id
JOIN mytable t3 ON t3.redirect_id = t2.id
WHERE t1.redirect_id IS NULL
UNION ALL
SELECT t4.id
, t1.id
FROM mytable t1
JOIN mytable t2 ON t2.redirect_id = t1.id
JOIN mytable t3 ON t3.redirect_id = t2.id
JOIN mytable t4 ON t4.redirect_id = t3.id
WHERE t1.redirect_id IS NULL
The limitation of this single-query UNION ALL approach is that it would need to be extended to a finite maximum number of levels. (This approach is not truly "recursive".)
If we needed a truly recursive approach, we could run each query separately, just adding an extra "level" for each run, following the same pattern. We'd know that we'd exhausted all possible paths when the result of a query returns no rows.
I've demonstrated the use of the UNION ALL operator to combine the results into a single set, using a single query. (Add an ORDER BY clause at the end of the statement if the order of the rows is important. It would also be easy to add a literal "level" column to the resultset, e.g. 1 AS level for the first SELECT, 2 AS level for the second, etc., to identify how far a node is from the termination.)
MySQL doesn't support an Oracle-style CONNECT BY syntax (in Oracle, we COULD write a single query that would traverse this set and return the specified rows, to an arbitrary number of levels).
In MySQL versions before 8.0, a truly "recursive" approach requires multiple queries. (Note that MySQL can support "recursion" in stored procedure calls, if the server is configured to allow it.) MySQL 8.0 and later also accept recursive common table expressions (WITH RECURSIVE), which can do this in a single statement.
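For servers that accept recursive common table expressions (MySQL 8.0 and later), the level-by-level pattern above can be written as one statement. The sketch below is an assumption about how that could look, not part of the original answer. It runs the SQL through Python's sqlite3 module as a stand-in, and it keeps the mytable and redirect_id names used in the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, redirect_id INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, None), (2, 3), (3, None), (4, 5), (5, 6), (6, 8), (7, None), (8, None)],
)

resolved = conn.execute("""
    WITH RECURSIVE chain (start_id, terminate_id) AS (
        -- Level 1: terminating rows resolve to themselves.
        SELECT id, id FROM mytable WHERE redirect_id IS NULL
        UNION ALL
        -- Next level: rows that redirect to an already resolved row.
        SELECT t.id, c.terminate_id
        FROM mytable t
        JOIN chain c ON t.redirect_id = c.start_id
    )
    SELECT start_id, terminate_id FROM chain ORDER BY start_id
""").fetchall()

for start_id, terminate_id in resolved:
    print(start_id, terminate_id)
```

Like the UNION ALL version, this starts from the terminating rows and works backwards, so a row caught in a redirect cycle never reaches a terminating row and is simply absent from the result.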

how to optimize this sql in mysql

I have a SQL query like this:
select t1.id,t1.type from collect t1
where t1.type='1' and t1.status='1'
and t1.total>(t1.totalComplate+t1.totalIng)
and id not in(
select tid from collect_log t2
where t2.tid=t1.id and t2.bid='1146')
limit 1;
It works, but its performance does not seem very good, and if I add an ORDER BY clause:
select t1.id,t1.type from collect t1
where t1.type='1' and t1.status='1'
and t1.total>(t1.totalComplate+t1.totalIng)
and id not in(
select tid from collect_log t2
where t2.tid=t1.id and t2.bid='1146')
order by t1.id asc
limit 1;
it becomes even worse.
How can I optimize this?
Here is the EXPLAIN output:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+------+---------------+-----+---------+-----------------+------+-----------------------------+
| 1 | PRIMARY | t1 | ref | i2,i1,i3 | i1 | 4 | const,const | 99 | Using where; Using filesort |
| 2 | DEPENDENT SUBQUERY | t2 | ref | i5 | i5 | 65 | img.t1.id,const | 2 | Using where; Using index |
1) If it's not already done, define an index on the collect.id column :
CREATE INDEX idx_collect_id ON collect (id);
Or possibly a unique index if you can (if id is never the same for any two rows):
CREATE UNIQUE INDEX idx_collect_id ON collect (id);
Maybe you need an index on collect_log.tid or collect_log.bid, too. Or even on both columns, like so:
CREATE INDEX idx_collect_log_tidbid ON collect_log (tid, bid);
Make it UNIQUE if it makes sense, that is, if no two rows have the same values for the (tid, bid) pair in the table. For instance, if these queries give the same result, it might be possible:
SELECT count(*) FROM collect_log;
SELECT count(DISTINCT tid, bid) FROM collect_log;
But don't make it UNIQUE if you're unsure what it means.
2) Verify the types of the columns collect.type, collect.status and collect_log.bid. In your query, you are comparing them with strings, but maybe they are defined as INT (or SMALLINT or TINYINT...)? In that case I advise you to drop the quotes around the numbers so that the literals match the column types. A type mismatch forces an implicit conversion, which can be slower and in some cases prevents an index from being used.
select t1.id,t1.type from collect t1
where t1.type=1 and t1.status=1
and t1.total>(t1.totalComplate+t1.totalIng)
and id not in(
select tid from collect_log t2
where t2.tid=t1.id and t2.bid=1146)
order by t1.id asc
limit 1;
3) If that still doesn't help, just add EXPLAIN in front of your query, and you'll get the execution plan. Paste the results here and we can help you make some sense out of it. Actually, I would advise you to do this step before creating any new index.
I'd try to get rid of the IN subquery using a LEFT JOIN first. The bid condition goes into the ON clause, and the rows that did find a match are then filtered out with IS NULL.
Something like this (untested):
select t1.id,t1.type
from collect t1
LEFT JOIN collect_log t2 ON t1.id = t2.tid
and t2.bid = '1146'
where t1.type='1'
and t1.status='1'
and t1.total>(t1.totalComplate+t1.totalIng)
and t2.tid IS NULL
order by t1.id asc
limit 1;
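Where the bid test is placed matters when NOT IN is turned into a LEFT JOIN. The sketch below uses Python's sqlite3 module as a stand-in for MySQL, with made-up rows and a reduced collect table. It compares the original NOT IN query with an anti-join (condition in ON, then IS NULL), and with a variant that tests NOT t2.bid = '1146' in the WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE collect (id INTEGER PRIMARY KEY, type TEXT, status TEXT);
    CREATE TABLE collect_log (tid INTEGER, bid TEXT);
    INSERT INTO collect VALUES (1, '1', '1'), (2, '1', '1'), (3, '1', '1');
    -- id 1 was logged by bid 1146 (must be excluded),
    -- id 2 only by another bid (keep), id 3 has no log rows (keep).
    INSERT INTO collect_log VALUES (1, '1146'), (1, '7'), (2, '7');
""")

not_in = conn.execute("""
    SELECT t1.id FROM collect t1
    WHERE t1.id NOT IN (SELECT tid FROM collect_log t2
                        WHERE t2.tid = t1.id AND t2.bid = '1146')
    ORDER BY t1.id
""").fetchall()

# Anti-join: match only the unwanted log rows, keep rows without a match.
anti_join = conn.execute("""
    SELECT t1.id FROM collect t1
    LEFT JOIN collect_log t2 ON t2.tid = t1.id AND t2.bid = '1146'
    WHERE t2.tid IS NULL
    ORDER BY t1.id
""").fetchall()

# Filtering on bid in WHERE instead keeps id 1 (via its other log row)
# and drops id 3 (NULL never satisfies the comparison).
where_filter = conn.execute("""
    SELECT t1.id FROM collect t1
    LEFT JOIN collect_log t2 ON t2.tid = t1.id
    WHERE NOT t2.bid = '1146'
    ORDER BY t1.id
""").fetchall()

print(not_in, anti_join, where_filter)
```

Only the anti-join form returns the same ids as the NOT IN query on this data.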

Duplicates in Database, Help Edit My Query to Filter Them Out?

I have just finished my latest task of creating an RSS Feed using PHP to fetch data from a database.
I've only just noticed that a lot (if not all) of these items have duplicates and I was trying to work out how to only fetch one of each.
I had a thought that in my PHP loop I could only print out every second row to only have one of each set of duplicates but in some cases there are 3 or 4 of each article so somehow it must be achieved by the query.
Query:
SELECT *
FROM uk_newsreach_article t1
INNER JOIN uk_newsreach_article_photo t2
ON t1.id = t2.newsArticleID
INNER JOIN uk_newsreach_photo t3
ON t2.newsPhotoID = t3.id
ORDER BY t1.publishDate DESC;
Table Structures:
uk_newsreach_article
--------------------
id | headline | extract | text | publishDate | ...
uk_newsreach_article_photo
--------------------------
id | newsArticleID | newsPhotoID
uk_newsreach_photo
------------------
id | htmlAlt | URL | height | width | ...
For some reason or another there are lots of duplicates, and the only thing truly unique within each set of data is uk_newsreach_article_photo.id. Within a set of duplicates, uk_newsreach_article_photo.newsArticleID and uk_newsreach_article_photo.newsPhotoID are identical, and all I need is one row from each set, e.g.
Sample Data
id | newsArticleID | newsPhotoID
--------------------------------
2 | 800482746 | 7044521
10 | 800482746 | 7044521
19 | 800482746 | 7044521
29 | 800482746 | 7044521
39 | 800482746 | 7044521
53 | 800482746 | 7044521
67 | 800482746 | 7044521
I tried sticking a DISTINCT into the query along with specifying the actual columns I wanted but this didn't work.
As you have noticed, the DISTINCT operator will return every id. You could use a GROUP BY instead.
You will have to make a decision about which id you want to retain. In the example, I have used MIN, but any aggregate function would do.
SQL Statement
SELECT MIN(t2.id), t2.newsArticleID, t2.newsPhotoID
FROM uk_newsreach_article t1
INNER JOIN uk_newsreach_article_photo t2
ON t1.id = t2.newsArticleID
INNER JOIN uk_newsreach_photo t3
ON t2.newsPhotoID = t3.id
GROUP BY t2.newsArticleID, t2.newsPhotoID
ORDER BY t1.publishDate DESC;
Disclaimer
Now while this would be an easy solution to your immediate problem, if you decide that duplicates should not happen, you really should consider redesigning your tables to prevent duplicates getting into your tables in the first place.
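Both points can be tried on the sample data from the question. The sketch below uses Python's sqlite3 module as a stand-in for MySQL: it first keeps one id per (newsArticleID, newsPhotoID) pair with GROUP BY and MIN, then removes the extra copies and adds a unique index so that further duplicates are rejected. The second article row and the index name ux_article_photo are made up for illustration. MySQL may reject a DELETE whose subquery reads the table being deleted from, so that statement might need to be wrapped in a derived table there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE uk_newsreach_article_photo
                (id INTEGER PRIMARY KEY, newsArticleID INTEGER, newsPhotoID INTEGER)""")
conn.executemany(
    "INSERT INTO uk_newsreach_article_photo VALUES (?, ?, ?)",
    [(2, 800482746, 7044521), (10, 800482746, 7044521), (19, 800482746, 7044521),
     (3, 800482747, 7044522)],  # the last row is a made-up second article
)

# One row per article/photo pair, keeping the smallest id of each set.
deduped = conn.execute("""
    SELECT MIN(id), newsArticleID, newsPhotoID
    FROM uk_newsreach_article_photo
    GROUP BY newsArticleID, newsPhotoID
    ORDER BY newsArticleID
""").fetchall()

# Remove the extra copies, then let a unique index reject future duplicates.
conn.execute("""DELETE FROM uk_newsreach_article_photo
                WHERE id NOT IN (SELECT MIN(id)
                                 FROM uk_newsreach_article_photo
                                 GROUP BY newsArticleID, newsPhotoID)""")
conn.execute("""CREATE UNIQUE INDEX ux_article_photo
                ON uk_newsreach_article_photo (newsArticleID, newsPhotoID)""")
try:
    conn.execute(
        "INSERT INTO uk_newsreach_article_photo VALUES (99, 800482746, 7044521)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(deduped, rejected)
```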
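Grouping by all of your selected columns except the unique uk_newsreach_article_photo.id collapses each set of duplicates into a single row, like this (adding HAVING COUNT(*) > 1 would instead return only the rows that are duplicated):
SELECT t1.id, t1.headline, t1.extract, t1.text, t1.publishDate,
t2.newsArticleID, t2.newsPhotoID,
t3.id, t3.htmlAlt, t3.URL, t3.height, t3.width
FROM uk_newsreach_article t1
INNER JOIN uk_newsreach_article_photo t2
ON t1.id = t2.newsArticleID
INNER JOIN uk_newsreach_photo t3
ON t2.newsPhotoID = t3.id
GROUP BY t1.id, t1.headline, t1.extract, t1.text, t1.publishDate,
t2.newsArticleID, t2.newsPhotoID,
t3.id, t3.htmlAlt, t3.URL, t3.height, t3.width
ORDER BY t1.publishDate DESC;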