MySQL Using IN() SubQuery Creates Much Longer Execution Time

What is the difference between the following? The first query takes 0.00 seconds to execute, the second takes 0.00 seconds, and the third takes 0.71 seconds. For some reason, when I put the two queries together as in Query 3, execution takes much longer. Table traffic has an index on shortcode and ip; table redirect has an index on campaign and pubid.
Is there another type of index that could speed this scenario up?
Query 1: (0.00 Execution)
SELECT * FROM traffic
WHERE shortcode IN ('CODE1')
GROUP BY ip
Query 2: (0.00 Execution)
SELECT shortcode FROM redirect WHERE campaign = '385' AND pubid = '35'
Query 3: (0.71 Execution)
SELECT * FROM traffic
WHERE shortcode IN
(SELECT shortcode FROM redirect WHERE campaign = '385' AND pubid = '35')
GROUP BY ip

In older versions of MySQL, the IN ( SELECT ... ) construct was very poorly optimized. For every row in traffic it would re-execute the subquery. What version of MySQL are you using? The simple and efficient solution was to turn it into a JOIN.
SELECT t.*
FROM traffic AS t
JOIN redirect AS r USING (shortcode)
WHERE r.campaign = '385'
  AND r.pubid = '35'
GROUP BY t.ip
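You also need INDEX(campaign, pubid, shortcode) on redirect.
There is a "bug" in either query: you are asking for all columns but grouping only by ip. If the rest of the columns are not really dependent on ip, the values returned for them come from an arbitrary row in each group, and the query is rejected outright when ONLY_FULL_GROUP_BY is enabled.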
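As a rough sketch of the two points above, the index could be created as shown here, and the grouping could be made deterministic by selecting only the grouped column plus aggregates. The index name and the choice of COUNT(*) as the aggregate are assumptions for illustration; the question does not say which columns of traffic are actually needed.

```sql
-- Hypothetical index name. The column order lets MySQL filter on
-- campaign and pubid and then read shortcode from the index alone.
ALTER TABLE redirect
    ADD INDEX idx_campaign_pubid_shortcode (campaign, pubid, shortcode);

-- One row per ip, selecting only the grouped column and an aggregate,
-- so the result does not depend on an arbitrary row per group.
-- Unlike IN, the JOIN counts a traffic row once per matching redirect
-- row, so this assumes shortcode is unique in redirect for the given
-- campaign and pubid.
SELECT t.ip, COUNT(*) AS hits
FROM traffic AS t
JOIN redirect AS r USING (shortcode)
WHERE r.campaign = '385'
  AND r.pubid = '35'
GROUP BY t.ip;
```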

Related

Why is querying all columns much slower than querying only the id column in MySQL with a high LIMIT offset?

1. select * from inv_inventory_change limit 1000000,10
2. select id from inv_inventory_change limit 1000000,10
The first statement takes about 1.6s and the second about 0.37s,
so the difference between the two is about 1.27s.
I understand that MySQL uses a covering index when only an indexed column is queried, which is why 'select id' is faster.
However, when I run the id-list statement below, it takes only about 0.2s, which is much shorter than that difference (1.27s). This confuses me.
select * from inv_inventory_change c where c.id in (1013712,1013713,1013714,1013715,1013716,1013717,1013718,1013719,1013720,1013721);
My key question is: why is the time difference so much bigger than the time taken by the 'where id in' statement?
The inv_inventory_change table has 2321211 records.
Adding 'order by id asc' to the statements above does not change the timings.

The rule is very simple: the id-only query can be served without reading the row data from disk or from the memory cache.
select id from inv_inventory_change limit 1000000,10
This can be served directly from the index (a B-tree or a variant of it) without reading the data pages that hold the full rows.
select * from inv_inventory_change limit 1000000,10
This query requires two steps to fetch the records. First, it looks up the index, which is quick; next, it has to read the data pages for those records, which means disk I/O, caching, and so on. Note that LIMIT by itself does not sort anything: without an ORDER BY, rows come back in whatever order the access path produces, which for InnoDB is usually primary-key order. The cost comes from reading and discarding the first 1,000,000 rows before the 10 requested rows can be returned.
select * from inv_inventory_change c where c.id in (1013712,1013713,1013714,1013715,1013716,1013717,1013718,1013719,1013720,1013721);
This query is served using a range scan on the index. It can find the entry corresponding to 1013712 in O(log N) time, so it should be able to serve the query quickly.
You should also look at the number of records you're reading. The query with limit 1000000,10 requires a lot of disk I/O because of the large number of entries it has to step over, whereas the third example reads only a handful of pages.
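Two common ways to avoid paying for the discarded rows are sketched below. Both assume that id is the primary key of inv_inventory_change, which the question implies but does not state; the alias page and the starting value 1013711 are illustrative only.

```sql
-- Deferred join: page through the narrow id index first, then fetch
-- full rows for only the 10 ids that survive the LIMIT.
SELECT c.*
FROM inv_inventory_change AS c
JOIN (
    SELECT id
    FROM inv_inventory_change
    ORDER BY id
    LIMIT 1000000, 10
) AS page ON page.id = c.id
ORDER BY c.id;

-- Keyset ("seek") pagination: remember the last id of the previous
-- page and continue from there instead of counting past an offset.
SELECT *
FROM inv_inventory_change
WHERE id > 1013711      -- assumed last id seen on the previous page
ORDER BY id
LIMIT 10;
```

The deferred join keeps the offset semantics, while the keyset form only works when the application can carry the last seen id from one page to the next.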

Why does MariaDB execute all subqueries in the SELECT list before ORDER BY, even when they are not necessary?

We switched our database from MySQL 8 to MariaDB 10 a week ago and now we have massive performance problems. We figured out why: we use subqueries in the SELECT list together with ORDER BY quite often. Here is an example:
SELECT id, (SELECT id2 FROM table2 INNER JOIN [...] WHERE column.foreignkey = table.id) queryResult
FROM table
WHERE status = 5
ORDER BY column
LIMIT 10
Imagine there are 1,000,000 rows in table that match status = 5.
What happens in MySQL 8: ORDER BY and LIMIT are executed first, and the subquery runs afterwards (10 rows affected).
What happens in MariaDB 10: the subquery is executed first (1,000,000 rows affected), and ORDER BY and LIMIT are applied afterwards.
Both queries return 10 rows, but under MariaDB 10 this is incredibly slow. Why is this happening? Is there an option in MariaDB that we should activate to avoid it? I know from MySQL 8 that SELECT subqueries are executed before sorting when they are referenced in ORDER BY; otherwise they are executed once the result set has been determined.
Info: if we do this, everything is fine:
SELECT *, (SELECT id2 FROM table2 INNER JOIN [...] WHERE column.foreignkey = outerTable.id)
FROM (
SELECT id
FROM table
WHERE status = 5
ORDER BY column
LIMIT 10
) outerTable
Thank you so much for any help.

This is because a table is by nature an unsorted bunch of rows:
A "table" (and subquery in the FROM clause too) is - according to the SQL standard - an unordered set of rows. Rows in a table (or in a subquery in the FROM clause) do not come in any specific order. That's why the optimizer can ignore the ORDER BY clause that you have specified. In fact, the SQL standard does not even allow the ORDER BY clause to appear in this subquery (we allow it, because ORDER BY ... LIMIT ... changes the result, the set of rows, not only their order).
(Source: MariaDB manual.)
So the optimizer removes and ignores the ORDER BY.
You have already found a way to circumvent this: using LIMIT together with ORDER BY in the subquery.

After searching and searching, I finally found a solution that makes the MariaDB 10 database work the way I knew it from MySQL 8.
For those who have similar problems: set this each time you connect to the server and everything works as in MySQL 8:
SET optimizer_use_condition_selectivity = 1
Long version: the problem I described at the top was suddenly solved, and the subquery was executed as it had been under MySQL 8. I did exactly nothing!
But soon there were new problems: we have a statistics page, which was incredibly slow. I noticed that an index was missing and added it. I executed the query and it worked: without the index, 100,000 rows were affected to find the results; after adding it, 38. Well done.
Then strange things started to happen: I executed the query again and the database didn't use the index. So I executed it again and again. This was the result:
1st query execution (I did it with ANALYZE): 100,000 rows affected
2nd query execution: 38 rows affected
3rd query execution: 38 rows affected
4th query execution: 100,000 rows affected
5th query execution: 100,000 rows affected
It was completely random, even in our SaaS solution! So I started to investigate how the optimizer determines an execution plan. I found this: optimizer_use_condition_selectivity
The default for a MariaDB 10.4 server is 4, which means that histograms are used to calculate the result set. I watched a few videos about it and realized that this does not work in our case (although we stick to database normalization). Mode 1 works well:
use selectivity of index backed range conditions to calculate the cardinality of a partial join if the last joined table is accessed by full table scan or an index scan
I hope this helps others who are despairing over this as I did.
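A minimal sketch of how the setting described above can be inspected and applied. The GLOBAL variant is an addition that the answer does not mention; it assumes you have the privileges to change server-wide variables.

```sql
-- Check the current value (the answer reports 4 as the 10.4 default).
SELECT @@optimizer_use_condition_selectivity;

-- Per-connection setting, as described in the answer.
SET SESSION optimizer_use_condition_selectivity = 1;

-- Server-wide alternative, so that new connections pick it up without
-- issuing SET each time. The value is lost on restart unless it is
-- also placed in the server configuration file.
SET GLOBAL optimizer_use_condition_selectivity = 1;
```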

At 5.6, MariaDB and MySQL went off in different directions for the Optimizer. MariaDB focused a lot on subqueries, though perhaps to the detriment of this particular query.
Do you have INDEX(status, column)? It would help most variants of this query.

Yes, the subquery has to be evaluated for every row before the order by. The subquery only seems to need id, so you can phrase this as:
SELECT id,
(SELECT id2 FROM table2 INNER JOIN [...] WHERE column.foreignkey = t.id) as queryResult
FROM (SELECT t.*
FROM table t
WHERE status = 5
ORDER BY column
LIMIT 10
) t
This evaluates the subquery only after the rows have been selected from the table.

MySQL (version 5.5): Why is `JOIN` faster than the `IN` clause?

[Summary of the question: two SQL statements produce the same results, but at different speeds. One statement uses JOIN, the other uses IN. JOIN is faster than IN.]
I tried two kinds of SELECT statement on two tables, named booking_record and inclusions. The table inclusions has a many-to-one relation with the table booking_record.
(Table definitions not included for simplicity.)
First statement: (using IN clause)
SELECT
    id,
    agent,
    source
FROM
    booking_record
WHERE
    id IN
    ( SELECT DISTINCT
          foreign_key_booking_record
      FROM
          inclusions
      WHERE
          foreign_key_bill IS NULL
          AND
          invoice_closure <> FALSE
    )
Second statement: (using JOIN)
SELECT
    id,
    agent,
    source
FROM
    booking_record
JOIN
    ( SELECT DISTINCT
          foreign_key_booking_record
      FROM
          inclusions
      WHERE
          foreign_key_bill IS NULL
          AND
          invoice_closure <> FALSE
    ) inclusions
ON
    id = foreign_key_booking_record
With 300,000+ rows in the booking_record table and 6,100,000+ rows in the inclusions table, the second statement delivered 127 rows in just 0.08 seconds, but the first statement took nearly 21 minutes for the same records.
Why is JOIN so much faster than the IN clause?

This behavior is well-documented. See here.
The short answer is that until MySQL version 5.6.6, MySQL did a poor job of optimizing these types of queries. The subquery would be run once for every row in the outer query. That is a lot of overhead, running the same query over and over. You could improve this by using good indexing and removing the DISTINCT from the IN subquery.
This is one of the reasons that I prefer EXISTS to IN if you care about performance.
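For illustration, the first statement from the question could be phrased with EXISTS roughly as follows. The aliases br and i, and the index name and column order, are assumptions; the table definitions were not included in the question, so the index is only a guess at what "good indexing" might look like here.

```sql
-- EXISTS form of the IN query: the correlated lookup can stop at the
-- first matching inclusions row for each booking_record row.
SELECT br.id, br.agent, br.source
FROM booking_record AS br
WHERE EXISTS (
    SELECT 1
    FROM inclusions AS i
    WHERE i.foreign_key_booking_record = br.id
      AND i.foreign_key_bill IS NULL
      AND i.invoice_closure <> FALSE
);

-- Hypothetical supporting index, so each correlated lookup is an
-- index probe rather than a scan of the 6,100,000-row table.
ALTER TABLE inclusions
    ADD INDEX idx_incl_booking_bill
        (foreign_key_booking_record, foreign_key_bill, invoice_closure);
```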

EXPLAIN should give you some clues (see the MySQL EXPLAIN syntax documentation).
I suspect that the IN version is constructing a list which is then scanned for each item (IN is generally considered a very inefficient construct; I only use it if I have a short list of items to enter manually).
The JOIN is more likely constructing a temp table for the results, making it more like normal JOINs between tables.
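To check this suspicion, both statements from the question can be prefixed with EXPLAIN, as sketched below. The column to compare is select_type: on MySQL 5.5 the IN form is typically reported as DEPENDENT SUBQUERY (re-evaluated per outer row), while the subquery in the JOIN form is typically reported as DERIVED (materialized once). The exact output depends on the server version and the indexes present.

```sql
-- Plan for the IN version.
EXPLAIN
SELECT id, agent, source
FROM booking_record
WHERE id IN (
    SELECT DISTINCT foreign_key_booking_record
    FROM inclusions
    WHERE foreign_key_bill IS NULL
      AND invoice_closure <> FALSE
);

-- Plan for the JOIN version.
EXPLAIN
SELECT id, agent, source
FROM booking_record
JOIN (
    SELECT DISTINCT foreign_key_booking_record
    FROM inclusions
    WHERE foreign_key_bill IS NULL
      AND invoice_closure <> FALSE
) AS inclusions
ON id = foreign_key_booking_record;
```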

You should explore this by using EXPLAIN, as said by Ollie.
But in advance, note that the second command has one more filter: id = foreign_key_booking_record.
Check if this has the same performance:
SELECT
    id,
    agent,
    source
FROM
    booking_record
WHERE
    id IN
    ( SELECT DISTINCT
          foreign_key_booking_record
      FROM
          inclusions
      WHERE
          booking_record.id = foreign_key_booking_record -- new filter
          AND
          foreign_key_bill IS NULL
          AND
          invoice_closure <> FALSE
    )

MySQL query caching of inner query

I have a large query with many nested SELECT statements. A simplified version might look like this:
SELECT * FROM tableA WHERE x IN(
SELECT * FROM tableB WHERE x IN(
SELECT * FROM tableC WHERE user_id = y
)
)
Crucially, the innermost statement starts off by looking at the user_id and selecting a list of id numbers to use in the rest of the query.
The problem I'm having is that even if two users have the same data in tableC, the rest of the query doesn't seem to be cached.
For example if SELECT * FROM tableC WHERE user_id = 1 returns (1,2,3,4,5)
and SELECT * FROM tableC WHERE user_id = 2 also returns (1,2,3,4,5)
If I run the full query with user_id = 1, the execution time is about 0.007 seconds. If I re-run the query, I get a reduced execution time of 0.002 seconds. If I change the user_id to 2 and run the query, the execution time goes back to 0.007 seconds the first time it is run. Is it possible for MySQL to cache the results of the individual parts of a query?

It seems that you are using MySQL. When you run the query 'SELECT * FROM tableC WHERE user_id = 1' for the first time, you get the result '1,2,3,4,5' and the query goes into the query cache. That is why the execution time of the second run is lower than that of the first. In this case the cached result is associated with your first query.
When you run the second query, the server doesn't know anything about it, because the query cache is keyed on the exact text of the statement. So it runs the query and returns the result (in your case the results are identical). The next time you run the second query, you will get it from the query cache and it will be significantly faster. Either way, the server stores two separate entries in the query cache.
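To see whether the query cache is involved at all, its configuration and counters can be inspected as sketched below. This assumes a server version that still has the query cache; it was removed in MySQL 8.0.

```sql
-- Is the query cache enabled, and how large is it?
SHOW VARIABLES LIKE 'query_cache%';

-- Counters such as Qcache_hits and Qcache_inserts. Run the full query
-- twice with the same user_id and compare the counters before and
-- after to confirm that the second run was served from the cache.
SHOW STATUS LIKE 'Qcache%';
```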

Counting Records COUNT or SUM or ROW?

This table contains server monitoring records. Whenever a server fails to respond to a ping, a new record is inserted, so one server can fail multiple times. I want to count how many times server 3 has failed.
This is the table where failure_id is Primary Key.
failure_id  server_id  protocol  added_date
----------  ---------  --------  -------------------
1           1          HTTP      2013-02-04 15:50:42
2           3          HTTP      2013-02-04 16:35:20
Using (*) to count the rows
SELECT
COUNT(*) AS `total`
FROM
`failures` `f`
WHERE CAST(`f`.`server_id` AS CHAR) = 3;
Using server_id to count the rows
SELECT
COUNT(`f`.`server_id`) AS `total`
FROM
`failures` `f`
WHERE CAST(`f`.`server_id` AS CHAR) = 3;
Using SUM to count the rows
SELECT
IFNULL(SUM(1), 0) AS `total`
FROM
`failures` `f`
WHERE CAST(`f`.`server_id` AS CHAR) = 3;
All the above queries return the correct output. But my database will be very large in the future. Which method is best to use based on performance? Thanks in advance...

I'd say none of the above, provided you have control over the app that's inserting the records. If you don't have a table for your servers, create one; otherwise, add a field called current_failure_count (or something similar) to that table. Then, when you insert a failure record, also update the server table and set current_failure_count = current_failure_count + 1 for that server. That way you only have to read one record in the server table (indexed by server_id, I'd assume) and you're set. No, this does not follow the normalization rules, but you are seeking speed, and this is the best way to get it if you can control the client software.
If you cannot control the client software, perhaps you can put a trigger on inserts into the failures table that increments the current_failure_count value in the servers table. That should work as well.
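A minimal sketch of the trigger variant, assuming a servers table keyed by an integer server_id. The table layout and the trigger name are hypothetical; only failures, server_id, and current_failure_count come from the discussion above.

```sql
-- Hypothetical counter table: one row per server.
CREATE TABLE IF NOT EXISTS servers (
    server_id INT NOT NULL PRIMARY KEY,
    current_failure_count INT NOT NULL DEFAULT 0
);

-- Keep the counter in step with every insert into failures.
-- ON DUPLICATE KEY UPDATE also covers a server with no row yet.
CREATE TRIGGER failures_after_insert
AFTER INSERT ON failures
FOR EACH ROW
    INSERT INTO servers (server_id, current_failure_count)
    VALUES (NEW.server_id, 1)
    ON DUPLICATE KEY UPDATE
        current_failure_count = current_failure_count + 1;

-- Reading the count is then a single primary-key lookup.
SELECT current_failure_count
FROM servers
WHERE server_id = 3;
```

The trade-off is the one named above: the counter duplicates information that is already in failures, so deletes or bulk loads that bypass the trigger would leave it out of date.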

Well, the second is definitely more efficient than the first.
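I recommend you create a view for the server, which keeps the counting query short. Note that MySQL views are not materialized, so the speed-up has to come from an index on server_id rather than from the view itself.
CREATE VIEW server3 AS
SELECT f.server_id
FROM failures AS f
WHERE CAST(f.server_id AS CHAR) = 3;
Then simply run a count on that view as if it were a table!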

Like others, it's not clear to me why you're casting the server_id value. That is going to cost you more performance than any other issue.
If you can eliminate that cast so that you're searching WHERE server_id = (value) and you create an index on server_id then either of the first two queries you suggested will be able to perform index-only retrieval and will provide optimal performance.
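As a sketch of that suggestion (the index name is an assumption), the index can be added and the plan checked with EXPLAIN. When the count is answered from the index alone, the Extra column of the EXPLAIN output normally shows "Using index".

```sql
-- Hypothetical index name.
ALTER TABLE failures
    ADD INDEX idx_failures_server_id (server_id);

-- Without the CAST, the comparison can use the index directly.
EXPLAIN
SELECT COUNT(*) AS total
FROM failures
WHERE server_id = 3;
```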

SELECT COUNT(*) AS `total`
FROM failures f
WHERE f.server_id = 3;
COUNT(*) will always be better than the arithmetic calculation, although adding an index on server_id will speed this up even more.
The second-best solution is:
SELECT IFNULL(SUM(1), 0) AS `total`
FROM failures `f`
WHERE f.server_id = 3;
This method is used by the SQL engines of many tools, such as MicroStrategy.
Hope this helps. :)
